The FileHold server side OCR feature provides optical character recognition (OCR) for PDF and TIFF documents in the FileHold library so that they can be indexed and searched. The OCR mechanism runs on the FileHold server and uses a queue to process the documents. Once the mechanism finishes OCR’ing a document, the document is checked in as a new version containing a text layer, which allows it to be indexed and searched within the document management system.
The criteria for adding a document to the OCR processing queue are:
- The document must be in the “Electronic Document” format. Electronic records and offline documents are not processed.
- Only PDF and TIF/TIFF type documents are processed. TIFF images are converted to searchable PDF documents.
- Only the latest version of a document can be processed, because a new version is created once the document has been OCR’d. The owner of the original document remains the owner of the new OCR’d version.
The quality of the resulting text layer depends on the quality of the document being OCR'd. To ensure an accurate text layer, documents should be of reasonably high quality. Poor quality scans are difficult to OCR, so these documents should be quality checked. The OCR engine cannot detect whether an image is rotated, so ensure that your documents can be read left to right and that the text is oriented horizontally.
Server side OCR is an optional feature that is controlled in the FileHold license. To purchase the server side OCR feature, contact email@example.com.
CAUTION: If a document goes through the server side OCR process, a new version of the document is generated. This new version is not associated with any workflows that occurred on the previous version and therefore loses its review and approval statuses. The newly generated version needs to go through the workflow process again if those statuses need to be maintained between versions.
The frequency and time frame in which OCR’ing occurs can be changed by modifying the scheduled task “FH OCR documents” in the Task Scheduler. If needed, the following items can also be configured when OCR'ing documents. Unless noted otherwise, they are set in the web.config file located in C:\Program Files\FileHold Systems\Application Server\LibraryManager.
- The entry is called <add key="OcrCommandTimeoutSec" value="270" />. This is the maximum amount of time in which the server side OCR task runs. The OCR process continues if this value is exceeded. This value does not usually need to be changed unless there are a large number of documents in the queue and more than one document needs to be processed per execution.
- The maximum number of documents that can be processed in the set amount of time can be configured in the same web.config file under the entry <add key="OcrMaxDocuments" value="10" />.
- For larger size documents (over 10 MB), the WebServiceCallTimeoutSec setting in the web.config file should be set to 3600. This forces the Library Manager to wait for a longer response time from other services in order to process the documents without timeouts.
- To restrict OCR by file type, add the following line to the DocumentRepository web.config: <add key="OcrExcludeExtension" value="tif" />. The value can be “tif” or “pdf”.
- To restrict OCR by maximum page size, add the following line to the DocumentRepository web.config: <add key="OcrMaxResolution" value="1920x1080" />. The value must follow the pattern [width]x[height], with dimensions in pixels. When this option is configured, the system checks, before attempting OCR, whether the size of any page in the document exceeds the provided maximum.
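As an illustration, the LibraryManager entries described above might sit together in the <appSettings> section as sketched below. The values 270 and 10 are the ones shown above, and 3600 is the value suggested for documents over 10 MB. Placing WebServiceCallTimeoutSec in the LibraryManager file as an <add> key is an assumption based on the description, so compare against your installed web.config before editing.

```xml
<!-- Sketch of the OCR-related entries in the LibraryManager web.config. -->
<!-- All other existing entries are omitted here and should be left in place. -->
<appSettings>
  <!-- Maximum amount of time, in seconds, in which the server side OCR task runs -->
  <add key="OcrCommandTimeoutSec" value="270" />
  <!-- Maximum number of documents processed in the set amount of time -->
  <add key="OcrMaxDocuments" value="10" />
  <!-- Assumed placement and format: longer wait for other services when documents exceed 10 MB -->
  <add key="WebServiceCallTimeoutSec" value="3600" />
</appSettings>
```

Back up the web.config file before making changes so that the previous values can be restored if needed.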
The languages included with the OCR engine are German, English, French, and Spanish.
The default configuration is:
- DPI resolution is 300.
- Language is English
The language configuration for OCR can be modified by a setting in the web.config file located on the server under C:\Program Files\FileHold Systems\Application Server\DocumentRepository. Under <appSettings>, add the following parameter:
<add key="OcrLang" value="language_code" />
Language code values for the included languages are: german or deu, english or eng, french or fra, and spanish or spa.
The dots per inch (DPI) setting for the OCR engine controls how the engine renders the page internally on the server for processing. It cannot be used to improve the resolution of the document. If the document is scanned at 150 dpi, it is upscaled to 300 dpi by default. However, the upscaled document simply contains a larger rendition of any problems caused by the original scan resolution. Recognition does not generally improve at a resolution above 300 dpi, so by default a document with a higher resolution is downscaled to 300 dpi. This downscaling preserves sufficient detail for the OCR process. Recognition is faster when the DPI setting is lower, but a lower setting can also reduce accuracy. Depending on the original quality of the documents, it may be possible to get reasonable recognition quality at 200 or 150 dpi. It is recommended that documents be tested to find a balance between performance and recognition quality. The DPI is set with the following parameter:
<add key="OcrDpiResolution" value="dpi" />
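As a sketch, the DocumentRepository settings described in this section could be combined as follows. The values are illustrative only: "fra" is one of the language codes listed above, 200 is one of the lower DPI values mentioned as worth testing, and the file type and page size entries repeat the examples given earlier. The surrounding <appSettings> element already exists in the file, and the comments here are added for explanation.

```xml
<!-- Sketch of the OCR-related entries in the DocumentRepository web.config. -->
<!-- All other existing entries are omitted here and should be left in place. -->
<appSettings>
  <!-- OCR language: "french" or "fra" (the default is English) -->
  <add key="OcrLang" value="fra" />
  <!-- Internal rendering resolution in dpi (the default is 300) -->
  <add key="OcrDpiResolution" value="200" />
  <!-- Restrict OCR by file type: "tif" or "pdf" -->
  <add key="OcrExcludeExtension" value="tif" />
  <!-- Restrict OCR by maximum page size, as [width]x[height] in pixels -->
  <add key="OcrMaxResolution" value="1920x1080" />
</appSettings>
```

Only the entries that differ from the defaults need to be added. Test a sample of documents after changing the DPI value to confirm that recognition quality remains acceptable.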
Important: The server side OCR component does not work properly for non-Latin languages when the Arial Unicode MS font is not installed on the application server. If this font is missing, install it in the [Drive]:\Windows\Fonts directory.
Enabling Server Side OCR
Server side OCR can be a time-consuming mechanism; therefore, documents are added to a queue to be processed. All new documents and new versions, whether added manually or through an automatic import mechanism (such as watched folders or managed imports), are automatically added to the queue. Existing repository documents can be added to the queue manually.
You can enforce the priority for newly added documents or versions so that they will take a higher priority in the queue via a setting. They will be processed before any existing documents in the queue. If the setting is not enforced, documents are taken from the queue in the order they are added without taking priority into account.
For the "Add existing documents to OCR queue" option, the configuration setting "OcrTotalOfExistingDocuments" is used for the OCR queue. FileHold first processes any newer documents or versions, then looks at the rest of the queue. If the queue is large because it also contains a large number of existing documents, system performance can be affected. The OcrTotalOfExistingDocuments setting can help reduce these effects. The default value is 1,000,000, but it can be adjusted in the web.config file located in C:\Program Files\FileHold Systems\Application Server\LibraryManager. With a larger number of documents in the queue, the following is recommended:
- Perform the operation outside of working hours.
- Extend the WebServiceCallTimeoutSec setting for WebClient to prevent a timeout on the client side. This is optional: even if the client times out, the operation continues on the server side.
- Extend the LongSqlCommandTimeoutSec setting for LibraryManager.
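The LibraryManager settings named above might look like the sketch below. Only the setting names and the default of 1,000,000 come from this section. The <add key="..." value="..." /> form is assumed from the other OCR entries, and the value 600 is purely hypothetical, so check the existing entries in your web.config for the actual format and current values.

```xml
<!-- Sketch of the queue-related entries in the LibraryManager web.config. -->
<!-- Assumption: both settings use the same <add> form as the other OCR entries. -->
<appSettings>
  <!-- Default value, written without thousands separators -->
  <add key="OcrTotalOfExistingDocuments" value="1000000" />
  <!-- Hypothetical extended timeout in seconds; pick a value suited to your system -->
  <add key="LongSqlCommandTimeoutSec" value="600" />
</appSettings>
```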
To enable server side OCR
- In Web Client, go to Administration Panel > System Configuration > Settings > General.
- In the Server Side OCR area, select the Enable Server Side OCR check box.
- To enforce the priority for newly added documents or versions so that they take a higher priority in the queue, select the Enforce a higher priority for newly added or checked in documents check box. If the setting is not enforced, documents are taken from the queue in the order they are added without taking priority into account.
- Click Update.
To add existing documents in the repository to the queue
- Go to Administration Panel > System Configuration > Settings > General.
- Click Add existing documents to OCR queue.
- At the message prompt, click OK to continue with the process. This adds existing PDF and TIFF documents in the repository to the queue for processing. Only the latest version of each document is processed. The documents are added to the queue with a low priority and do not affect the position of documents already in the queue.
OCR Queue Status and Reprocessing Documents
The OCR Queue Status page shows the current status of the OCR engine and any warnings or errors for documents that cannot be processed.
In the General area, the following information is displayed: the status of the OCR engine (enabled/disabled), if the higher priority of newly added documents or versions is enabled, the number of pending documents, and the number of processing errors as well as the list of errors.
When an error or warning occurs while the server performs the OCR, the document is removed from the queue and added to the List of Errors. The List of Errors shows the type (warning or error), FileHold ID, date that the error occurred, and the error details. OCR errors can occur when:
- The document is checked out.
- The document is under an active workflow.
- The document is encrypted, password protected, or corrupted.
- The document does not have any valid text that can be recognized.
- A newer version of a document has been checked in.
- The file has an invalid extension.
Information about the error is displayed in the Details column. Errors caused by a checked out document or an active workflow can be resolved by re-adding the documents to the queue.
Once the OCR mechanism completes, the OCR’d document is checked in as a new version with the same owner as the previous version. This new version is then processed by the full text search engine so that it becomes searchable.
To view the OCR status and reprocess documents
- In the Web Client, go to Administration Panel > System Management > OCR Queue Status.
- In the General area, the following is displayed:
- OCR functionality status – Shows if the server side OCR engine is enabled or disabled. This is enabled by a system administrator.
- Higher priority for newly added or checked in documents – Shows if the priority for newly added documents or versions is enabled. If enabled, these documents take a higher priority in the queue. If the setting is not enabled, documents are taken from the queue in the order they are added without taking priority into account.
- Number of pending documents – The number of documents that are waiting to be processed by the OCR engine.
- Number of errors while processing – The number of documents that cannot be OCR’d.
- Review the list of warnings and errors, which shows the documents that triggered an issue. The list of errors displays:
- Type – Whether the issue is a warning or an error. Warnings are displayed for non-permanent or non-technical errors, such as when a document is under a workflow or is checked out. Documents with warnings can be re-added to the OCR queue.
- FileHold ID of the document.
- Date and time the OCR error occurred.
- Details of the problem. Warnings occur if the document is checked out or is under an active workflow. Errors occur when the document is encrypted, password protected, or corrupted, or when the document does not have any valid text that can be recognized.
- To restrict the list to the date or dates when the errors occurred, enter dates in the From and To fields and click Apply.
- To reprocess a document with a warning, select the check box next to the warning and click Re-add document(s) to OCR Queue. The documents are re-added to the OCR queue for processing.
- To clear any errors from the view, select the check box next to the error and click Clear error(s). The errors are removed from the list.