OCR and FileHold – Transforming digital documents part 1

Optical Character Recognition, or OCR, has emerged in the last decade as a powerful and newly affordable tool that lets computers convert scanned images into fully usable text. You have likely heard the term before and may be wondering how it can be used in your business processes – this series of articles is here to help. This first article introduces how FileHold can use OCR to enhance the utility of your digital documents organization-wide.

Why is OCR useful?

As organizations move to digital solutions, reducing or eliminating paper becomes more important. However diligent or ambitious paperless goals are, paper still appears: mail brings in invoices and account statements; shipping departments print paper waybills and customs documents for drivers; legal documents arrive in paper tomes to reduce transparency. Paper is inescapable. The solution for the last two decades has been to run these documents through a scanner and recycle the now-redundant paper. These new digital documents may reduce the paper in the office, but they are still very limited. Because they are scans, they contain no machine-readable text; they are essentially low-resolution photos.

This is where OCR comes in: it brings scanned text back to life as searchable values. The OCR engine “crawls” the document looking for recognizable characters. FileHold can then use the characters it finds in a variety of ways to enhance your digital document experience.

Is OCR always accurate?

OCR can almost seem magical in its ability to reach into scanned documents and find information – but it has limitations:

  • Poor-quality scans. The standard threshold for OCR readability is a scan resolution of 300 DPI; anything below this can be challenging for an OCR engine to read accurately. This is often out of your control: if you are sent a low-quality scan, OCR may be powerless. As they say: garbage in, garbage out.
  • Skewed scans. If a document is scanned off-center or at an angle, OCR software may not read the text correctly. Some OCR software will re-center and de-skew the document before reading it to improve accuracy.
  • Damaged documents. If a document is in bad shape prior to scanning – crumpled, torn, and so on – OCR may not be able to read the information through the damage.
  • Handwriting. Standard OCR works on printed text and will not work with handwriting. On a manually filled-out form, OCR will fail. The same applies to a handwritten note or comment on a scanned document – the information will not be processed by OCR.

Each of these problems may challenge the software, even though the human eye can spot the problem right away. OCR is not infallible; it cannot read every scan every time, and any product that claims otherwise should be approached cautiously. FileHold offers OCR as part of your total document solution, deployed alongside other processes that ensure document quality and support best practices for accurate OCR.
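
As a rough illustration of the limitations listed above, the following sketch flags scans that are likely to give an OCR engine trouble before they are processed. It is not part of FileHold: the `ScanInfo` fields, the `ocr_risk_flags` function, and the 2-degree skew tolerance are assumptions made for this example. Only the 300 DPI figure comes from the list above.

```python
from dataclasses import dataclass

MIN_DPI = 300            # readability threshold cited in the article
MAX_SKEW_DEGREES = 2.0   # assumed tolerance, chosen only for illustration


@dataclass
class ScanInfo:
    """Hypothetical description of a scanned page."""
    dpi: int
    skew_degrees: float
    has_handwriting: bool = False
    is_damaged: bool = False


def ocr_risk_flags(scan: ScanInfo) -> list[str]:
    """Return the reasons a scan may OCR poorly (empty list means no known risk)."""
    flags = []
    if scan.dpi < MIN_DPI:
        flags.append("low resolution")
    if abs(scan.skew_degrees) > MAX_SKEW_DEGREES:
        flags.append("skewed")
    if scan.is_damaged:
        flags.append("damaged")
    if scan.has_handwriting:
        flags.append("handwriting")
    return flags


# A 200 DPI scan tilted by 5 degrees is flagged twice; a clean scan is not flagged.
print(ocr_risk_flags(ScanInfo(dpi=200, skew_degrees=5.0)))
print(ocr_risk_flags(ScanInfo(dpi=300, skew_degrees=0.5)))
```

A check like this does not make OCR more accurate; it only tells a person which scans deserve a closer look before their text is trusted.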

Why is accuracy important with document processing?

In FileHold, OCR is used for document indexing. This typically means either tagging documents with information, also called metadata, or adding text from OCR to the search index. With metadata, accuracy is absolutely vital for repository health. Inaccurate tags lead to false or missing search results and misfiled documents, which is disastrous. There are two truisms of metadata tagging:

1. Metadata must be accurate to be valuable.

2. Accuracy decreases as the labor intensity of data entry increases.

If each document requires dozens of tags, all of which must be entered manually, the likelihood that at least one tag is incorrect increases with each manual entry. Therefore, the best practice for metadata is to minimize the number of tags required for each document, and then streamline the entry of those tags.
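
A little arithmetic makes the second truism concrete. The sketch below assumes that each manually entered tag is correct with the same probability and that errors are independent of one another. Both are simplifying assumptions, and the 99% per-tag accuracy is an illustrative figure rather than a measured one.

```python
def chance_all_tags_correct(per_tag_accuracy: float, tag_count: int) -> float:
    """Probability that every tag on a document is right, assuming independent errors."""
    return per_tag_accuracy ** tag_count


# Compare a lightly tagged document with heavily tagged ones.
for tags in (3, 12, 36):
    chance = chance_all_tags_correct(0.99, tags)
    print(f"{tags:>2} tags: {chance:.1%} of documents fully correct")
```

Under these assumptions, even very careful data entry leaves roughly one document in nine with an error at a dozen tags, and closer to three in ten at three dozen tags, which is why fewer tags per document pays off.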

Can OCR be used to tag documents?

If metadata needs to be accurate, and if OCR cannot be assured of accuracy, does it have any application to tagging a document? Absolutely! The key is not to rely on OCR as infallible, but to let it pre-populate information and then confirm its accuracy with a visual audit before entry. FileHold makes sure your documents get this auditing step with very simple tools to ensure metadata accuracy.
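
The pre-populate-then-audit idea can be sketched in a few lines. This is a generic illustration, not FileHold's implementation: the `TagSuggestion` class, the `ready_to_file` function, and the sample field names and values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TagSuggestion:
    """A metadata value proposed by OCR, awaiting a human check (hypothetical)."""
    field_name: str
    ocr_value: str
    confirmed_value: Optional[str] = None

    def confirm(self, corrected: Optional[str] = None) -> None:
        """Accept the OCR value as-is, or replace it with the reviewer's correction."""
        self.confirmed_value = corrected if corrected is not None else self.ocr_value


def ready_to_file(suggestions: list[TagSuggestion]) -> bool:
    """Allow filing only once every suggested tag has been audited."""
    return all(s.confirmed_value is not None for s in suggestions)


suggestions = [
    TagSuggestion("invoice_number", "INV-1O42"),  # OCR misread a zero as the letter O
    TagSuggestion("vendor", "Acme Supply"),
]
print(ready_to_file(suggestions))   # nothing has been audited yet

suggestions[0].confirm("INV-1042")  # reviewer corrects the misread
suggestions[1].confirm()            # reviewer accepts the OCR value unchanged
print(ready_to_file(suggestions))   # every tag now has a confirmed value
```

The design point is that the OCR value and the confirmed value are stored separately, so an unaudited guess can never be mistaken for verified metadata.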

What kind of OCR services are available with FileHold?

FileHold offers OCR in three modes:

  1. Server-Side OCR to extract text from scanned documents in the repository and add that to the search index.
  2. Click to Tag to add onscreen values as metadata tags.
  3. Zonal OCR for complete document text extraction and metadata tagging through our partner software, SmartSoft Capture.

What are the differences between the OCR features of FileHold?

Here is a simple chart to show the major differences between the modes of OCR deployment in FileHold:

Feature                                 Click to Tag   Zonal OCR   Server-Side OCR
--------------------------------------------------------------------------------
Automatic or manual                     Manual         Manual      Automatic
Can be used for metadata                Yes            Yes         No
Can make document searchable            No             Yes         Yes
OCR results can be audited              Yes            Yes         No
Can find values independently           No             Yes         Yes
Requires partner software               No             Yes         No
Standard feature                        Yes            No*         No**
Works before or after added to system   Before         Before      After
Works on multiple documents at a time   No             No          Yes

* A complimentary license of SmartSoft Capture, FileHold’s partner and provider of zonal OCR document processing, is offered with all new installs of FileHold Express and FileHold Enterprise. This can be installed on multiple workstations, but only one instance can be running at a time. Additional licenses can be purchased as needed.

** Server-Side OCR is an optional module, and can be activated at a one-time cost, after which the organization would own the module in perpetuity.

Which model of OCR is best for my organization?

We will explore this further in the next article. In brief, it all depends on which document process you are looking to optimize. If you are looking for a way to pull multiple fields of metadata from a scanned standardized document or form, then Zonal OCR would be best. If you need to occasionally grab a field of information from a document, then Click to Tag would work well. If you need to make a large volume of documents searchable, Server-Side OCR would be an excellent choice.
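
The rule of thumb above can be written as a small decision helper. This is only a sketch of the guidance in this article: the `suggest_ocr_mode` function, its parameters, and the order in which the conditions are checked are assumptions, and a real choice would weigh more factors than these three.

```python
def suggest_ocr_mode(fields_needed: int, standardized_form: bool, bulk_searchable: bool) -> str:
    """Map a document process to the OCR mode the article suggests for it."""
    if bulk_searchable:
        # Large volumes of scans that simply need to become searchable.
        return "Server-Side OCR"
    if standardized_form and fields_needed > 1:
        # Several metadata fields pulled from a predictable layout.
        return "Zonal OCR"
    # Occasionally grabbing a single value from a document.
    return "Click to Tag"


print(suggest_ocr_mode(fields_needed=8, standardized_form=True, bulk_searchable=False))
print(suggest_ocr_mode(fields_needed=1, standardized_form=False, bulk_searchable=False))
print(suggest_ocr_mode(fields_needed=0, standardized_form=False, bulk_searchable=True))
```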

Is OCR expensive?

No, and considering the return on investment, it’s one of the most cost-effective outlays your organization can make. When we match the OCR deployment to best practices, the value becomes clear:

  • Zonal OCR scans standardized forms into fully searchable documents with exact categorization and information tagging, while reducing misfiling caused by incorrect metadata. Zonal OCR creates a path to a true paperless method of processing forms with accuracy.
  • Server-Side OCR takes volumes of scanned documents and makes them fully searchable. Whether dealing with newly arrived documents or eliminating stacks of paper records, Server-Side OCR will quickly pay for itself as a way to make paper records fully searchable and free up office space.
  • Click to Tag provides a quick way for users to ensure that the metadata values added to the system match what is on the document – even when the document is an image. Best of all, this is a standard feature of FileHold, so it adds value at no additional cost.

In Part 2, we will explore Server-Side OCR in full. If you have any questions about OCR and how it can transform your document processing, contact us at [email protected].

Chris Oliver

Chris Oliver brings his twenty years of experience in management in the entertainment industry to FileHold Systems as the Client Training and Retention Advocate. To learn more about how FileHold DMS can work for you, contact him at [email protected].