OCR and FileHold – Transforming Digital Documents Part 2

Wednesday, March 20, 2019

Optical Character Recognition, or OCR, is a powerful and newly affordable tool for converting scanned images into fully searchable text. In the last article, we introduced you to OCR and how it is used by FileHold. In this article, we will take a deeper dive into Server-Side OCR and its capacity to transform your document processes.

What is Server-Side OCR?

FileHold’s Server-Side OCR is a service that automatically runs any scanned PDF or TIF image in the Library through an optical character recognition process. The recognized text is added to the document as a new layer and then updated in the Search Index. The module allows you to add archives of scanned paper documents to FileHold and make them completely searchable in a single step. Document size does not matter: single-page reports and thousands of pages of testimony alike can be made into searchable text.

What is the Search Index?

When you add a document to FileHold, its contents are automatically added to the Search Index. This allows users to search the document contents for key words or phrases with great speed and accuracy. The Search Index also applies the user’s role to searches – the document’s location and use type (schema) determine who is allowed to see, and therefore search, the document.

Metadata is also added to the Search Index. Documents are tagged with metadata when they are added to FileHold. This serves two major functions. First, it allows you to tag documents with common information based on their use. For instance, a Purchase Order number could be associated with a Sales Order, Internal Requisition, Invoice, and Bill of Lading – each document would benefit from having the PO number as a metadata field to show their common connection. Second, metadata gives a searchable value to documents in the repository that have no searchable content of their own – like a video or music file. Along with file properties, users can use metadata as direct search terms, or to filter general searches and refine results – as in, show all documents with PO # 12345 added two weeks ago. The Library contents, metadata fields, and file properties are all searched instantly using the Search Index, with results presented in order of relevance.
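To make the idea concrete, here is a minimal sketch of an index that stores document text and metadata together, so a search term can be combined with a metadata filter. It is an illustration only and does not represent how FileHold's Search Index is built; the class name `ToySearchIndex`, the `po_number` field, and the sample documents are all hypothetical.

```python
from collections import defaultdict
import re


class ToySearchIndex:
    """Illustrative only: a tiny inverted index over document text plus metadata."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of documents containing it
        self.metadata = {}                # document id -> metadata fields

    def add(self, doc_id, text, metadata):
        self.metadata[doc_id] = dict(metadata)
        # Index metadata values alongside the body text so both are searchable.
        combined = " ".join([text, *map(str, metadata.values())])
        for word in re.findall(r"[a-z0-9]+", combined.lower()):
            self.postings[word].add(doc_id)

    def search(self, term, **filters):
        # Look up the term, then keep only documents whose metadata matches.
        hits = self.postings.get(term.lower(), set())
        return sorted(
            doc_id for doc_id in hits
            if all(self.metadata[doc_id].get(k) == v for k, v in filters.items())
        )


index = ToySearchIndex()
index.add("invoice-1", "Invoice for office chairs", {"po_number": "12345"})
index.add("bol-1", "Bill of lading for office chairs", {"po_number": "12345"})
# A video has no text content, so only its metadata makes it findable.
index.add("video-1", "", {"po_number": "99999", "title": "Warehouse walkthrough"})

print(index.search("chairs", po_number="12345"))  # ['bol-1', 'invoice-1']
print(index.search("warehouse"))                  # ['video-1']
```

Note how the video, which has no text of its own, is still found through its metadata, while the PO number narrows a general search to related documents.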

Scanned documents without a text layer are essentially images, so there is nothing inherent to add to the Search Index: only the metadata. It would require a lot of work to manually go through each scanned document and add all relevant values to the metadata to be captured by the Search Index. Sometimes, the documents being scanned are simply archives, documents that might be needed someday but not today – so this granular filing is not worth the effort.

Here, Server-Side OCR becomes a powerful way to make scanned documents more useful. Instead of relying on metadata to capture details, the Search Index captures the recognized text of each scanned document and makes it fully searchable automatically. This means older documents retained for archival or regulatory requirements can now have their contents as searchable as any full-text document with minimal effort. Instead of merely replacing paper storage with digital documents, FileHold’s Server-Side OCR module makes these into viable records for your organization’s use.

Does Server-Side OCR alter my documents?

FileHold never modifies your documents – they are always preserved in the original state and version as part of your document audit trail. Since Server-Side OCR is applied to a document already in the FileHold Library, it creates a new version of the original with the extracted text layer. This text is added to the Search Index, and the new version is carried forward as the primary document, with the original “flat” scanned document retained in the background. Either can be shared as needed. This can be a great asset for legal organizations, where there are distinct advantages to sharing non-OCR versions of documents externally, while retaining the fully searchable document internally.

What about poor-quality scans, will OCR still work?

Sometimes the scan quality of your documents is out of your control. Low-resolution or poor-quality scans can give some false results in the Search Index – see Part 1 of this series, where we discuss image quality. First, note that this will likely be an issue only with older documents – modern scanners produce images of much higher quality than the recommended 300 DPI without unwieldy file sizes. Second, FileHold’s search supports stemming, fuzzy, synonym, and phonic matching, as well as wildcards, to expand your search results to phonetic matches and misspelled words, which helps mitigate errors from low-quality scans. These settings can be fine-tuned to suit your organization’s document processes. Third, metadata notes can be added to documents where the quality is too low for reliable OCR, allowing users to create searchable notes when they find errors in the OCR and ensuring the documents can be found by other users.
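As a rough illustration of why fuzzy matching helps with imperfect OCR, the sketch below uses Python's standard `difflib` module to find indexed words that closely resemble a search term. It is not FileHold's search engine; the helper name `fuzzy_find`, the 0.8 similarity cutoff, and the sample word list are assumptions chosen for the example.

```python
import difflib


def fuzzy_find(query, indexed_words, cutoff=0.8):
    """Return indexed words whose similarity to the query meets the cutoff."""
    return difflib.get_close_matches(query.lower(), indexed_words, n=5, cutoff=cutoff)


# Hypothetical words extracted from a poor scan: the OCR misread the "i" in
# "invoice" as the digit "1".
ocr_words = ["invo1ce", "purchase", "testimony"]

print(fuzzy_find("invoice", ocr_words))  # ['invo1ce']
```

An exact-match search for "invoice" would miss this document entirely, while a similarity threshold still finds it. Lowering the cutoff finds more damaged words at the cost of more false matches, which is the kind of trade-off that tuning the search settings addresses.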

What are some use cases for Server-Side OCR?

Any time you have scanned documents, Server-Side OCR can help you index and use your digital archives. For instance:

  • Digitize storerooms of paper documents with minimal interaction: bring the new scans into FileHold, and let Watched Folder, Auto-Filing, and Server-Side OCR do the work for you. Not only do you now have a digital copy of your paper records that can be backed up in case of emergency, but you can also recover the floor space previously occupied by paper.
  • Process already-scanned archives into searchable records with almost no effort. These don’t even need to be from your organization – you can add emailed catalogues and other sales materials into FileHold and let Server-Side OCR help find the information you’ve been looking for.
  • Large volumes of documents, from legal discovery to annual reports, often arrive unindexed and must be read through thoroughly to find slivers of information. With a high-speed or multi-function workstation document scanner, you can quickly turn these pages of noise into highly searchable documents and find what you need in an instant.

In Part 3, we will explore zonal OCR, which provides detailed processing before your document arrives in FileHold. To learn more about Server-Side OCR or to review your document processes, contact your sales consultant at sales@filehold.com today!


Chris Oliver

Chris Oliver brings his twenty years of experience in management in the entertainment industry to FileHold Systems as the sales consultant for the Eastern United States and Canada. To learn more about how FileHold DMS can work for you, contact him at coliver@filehold.com.