Server side OCR (Optical Character Recognition)
The FileHold server-side OCR feature provides OCR (optical character recognition) for PDF and TIFF documents in the FileHold library so that they can be indexed and searched. The OCR mechanism runs on the FileHold server, which uses a queue to process the documents. Once a document has been OCR’d, it is checked in as a new version containing a text layer, which allows the document to be indexed and searched within the document management system. Server-side OCR is an optional feature that is controlled in the FileHold license called “OCR Module”. To purchase the server-side OCR feature, contact [email protected].
The criteria for adding a document to the OCR processing queue are:
- The document must be in the “Electronic Document” format. Electronic records and offline documents are not processed.
- Only PDF and TIF/TIFF type documents are processed. TIFF images are converted to searchable PDF documents.
- Only the latest version of the documents can be processed. This is because a new version is created once the document has been OCR’d. The owner of the original document remains the owner for the new OCR’d version.
The resulting text layer is dependent upon the quality of the document being OCR'd. Therefore, to ensure accuracy of the resulting text layer, the quality of the documents should be reasonably high. Poorer quality scans are difficult to OCR so quality checks on these documents should be done. The OCR engine cannot detect if an image is rotated so ensure that your documents can be read left to right and the text is oriented horizontally.
Documents undergoing server-side OCR can also be compressed to reduce file size. This is controlled by the license setting “Image/PDF compression”. A file can only be compressed or OCR’d once. The original version of the document can be removed by enabling the setting “OcrRemoveOriginalDocuments” in the Library Manager web.config file. If this option is enabled, the pre-OCR/pre-compressed version is soft deleted and a new version is checked in. System administrators can decide which techniques to enable or disable to achieve the required level of document optimization. See Image/PDF compression options for settings.
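As a rough illustration, the setting named above might be added to the Library Manager web.config as shown here. The key name comes from the text; its placement under <appSettings> and the boolean value "true" are assumptions based on how the other settings in this article are written, so confirm them against your installation before use.

```xml
<!-- Library Manager web.config (sketch only). -->
<appSettings>
  <!-- Assumed boolean value: soft delete the pre-OCR/pre-compressed
       version once the new version has been checked in. -->
  <add key="OcrRemoveOriginalDocuments" value="true" />
</appSettings>
```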
If a document goes through the server side OCR process, a new version of the document is generated. This new version is not associated with any workflows that occurred on the previous version and therefore loses its review and approval statuses. The newly generated version needs to go through the workflow process again if those statuses need to be maintained between versions.
Enabling server side OCR
Server side OCR can be time consuming; therefore, documents are added to a queue to be processed. All new documents and new versions, whether added manually or through an automatic import mechanism (such as watched folders or managed imports), are automatically added to the queue. Existing repository documents can be added to the queue manually.
You can enforce the priority for newly added documents or versions so that they take a higher priority in the queue via a setting. They are processed before any existing documents in the queue. If the setting is not enforced, documents are taken from the queue in the order they are added without taking priority into account.
For the "Add existing documents to OCR queue" option, the configuration setting "OcrTotalOfExistingDocuments" is used for the OCR queue. FileHold first processes any newer documents or versions, then looks at the queue. If the queue is large because it also contains a large number of existing documents, system performance can be affected. The "OcrTotalOfExistingDocuments" setting can help reduce these effects. The default value is 1,000,000, but it can be adjusted in the web.config file located in C:\Program Files\FileHold Systems\Application Server\LibraryManager. With a large number of documents in the queue, the following is recommended:
- Perform the operation outside working hours.
- Extend the WebServiceCallTimeoutSec setting for WebClient to prevent a timeout on the client side. This is not strictly necessary: even if the client times out, the operation continues on the server side.
- Extend the LongSqlCommandTimeoutSec setting for LibraryManager.
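The recommendations above could translate into web.config entries along the following lines. The key names are taken from this article; every value is illustrative only. The idea that a lower OcrTotalOfExistingDocuments value reduces the load, the plain-number format without thousands separators, the unit of seconds (inferred from the "Sec" suffix), and the use of an <appSettings> section in the WebClient web.config are all assumptions to verify before use.

```xml
<!-- Library Manager web.config:
     C:\Program Files\FileHold Systems\Application Server\LibraryManager -->
<appSettings>
  <!-- Default is 1,000,000; a lower, purely illustrative value is shown. -->
  <add key="OcrTotalOfExistingDocuments" value="100000" />
  <!-- Illustrative extended timeout, assumed to be in seconds. -->
  <add key="LongSqlCommandTimeoutSec" value="3600" />
</appSettings>

<!-- WebClient web.config (file location not given in this article). -->
<appSettings>
  <!-- Optional: the operation continues on the server even if the client times out. -->
  <add key="WebServiceCallTimeoutSec" value="3600" />
</appSettings>
```

When editing a live web.config, merge these entries into the existing <appSettings> section rather than adding a second section.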
To enable server side OCR
- In Web Client, go to Administration Panel > System Configuration > Settings > General.
- In the Server Side OCR area, select the Enable Server Side OCR check box.
- To enforce the priority for newly added documents or versions so that they take a higher priority in the queue, select the Enforce a higher priority for newly added or checked in documents check box. If the setting is not enforced, documents are taken from the queue in the order they are added without taking priority into account.
- Click Update.
To add existing documents in the repository to the queue
- Go to Administration Panel > System Configuration > Settings > General.
- Click Add existing documents to OCR queue.
- At the message prompt, click OK to continue with the process. This adds existing PDF and TIFF documents in the repository to the queue for processing. Only the latest version of each document is processed. The documents are added to the queue with a low priority and do not affect the position of documents already in the queue.
Server side OCR engine configuration
Except where otherwise stated, the following parameters can be added to the web.config file in C:\Program Files\FileHold Systems\Application Server\DocumentRepository, under the <appSettings> section.
Configuration Name | Description |
---|---|
FH OCR documents scheduled task | The frequency and time frame in which OCR occurs can be changed by modifying the scheduled task “FH OCR documents” in the Task Scheduler. |
OcrMaxDocuments | The maximum number of documents that can be processed in the set amount of time. For example: <add key="OcrMaxDocuments" value="10" />. Use the web.config file in C:\Program Files\FileHold Systems\Application Server\LibraryManager, under the <appSettings> section. |
OcrCommandTimeoutSec | The maximum amount of time in which the server side OCR task runs. The OCR process continues if the timeout value is exceeded. This value does not usually need to be changed unless there are a large number of documents in the queue and more than one document needs to be processed per execution. For example: <add key="OcrCommandTimeoutSec" value="270" />. Use the web.config file in C:\Program Files\FileHold Systems\Application Server\LibraryManager, under the <appSettings> section. |
WebServiceCallTimeoutSec | For larger documents (over 10 MB), the WebServiceCallTimeoutSec setting in the web.config file should be set to 3600. This forces the Library Manager to wait longer for responses from other services so that the documents can be processed without timeouts. Use the web.config file in C:\Program Files\FileHold Systems\Application Server\LibraryManager, under the <appSettings> section. |
OcrLang | The OCR engine includes the following languages by default, with these language code values: german or deu, english or eng, french or fra, and spanish or spa. Languages can be combined with the ‘+’ sign, e.g. eng+fra (FileHold 16.2). For other language support, contact [email protected]. For example: <add key="OcrLang" value="eng+fra" /> |
OcrDpiResolution | The dots per inch (DPI) setting for the OCR engine controls how the engine renders the page internally on the server for processing. It cannot be used to improve the resolution of the document. If the document is scanned at 150 dpi, it is up-scaled to 300 dpi by default. However, this up-scaled document simply contains a larger rendition of any problems due to the original scanned resolution. OCR recognition does not generally improve at a resolution above 300 dpi. By default, a document with a higher resolution is down-scaled to 300 dpi. This down-scaling preserves sufficient detail in the document for the OCR process. The OCR recognition process is faster when the DPI setting is lower; however, this can also reduce the accuracy of the recognition. Depending on the original quality of the documents, it may be possible to get reasonable recognition quality at 200 or 150 dpi. It is recommended that documents be tested to find a balance of performance and recognition quality. For example: <add key="OcrDpiResolution" value="250" /> |
OcrImageProcessingFlags (not applicable in FileHold 16.2 or higher versions) | Flags that can be set for image processing. For example: <add key="OcrImageProcessingFlags" value="OCR_Image_Autorotate" /> |
OcrRegionMode (not applicable in FileHold 16.2 or higher versions) | A region mode specifier. Useful for increasing OCR accuracy and speed where the type of input is known ahead of time, for example a single line or a single paragraph / block of text. For example: <add key="OcrRegionMode" value="OCR_Auto" /> |
OcrWhiteList | A list of characters to accept as recognizable symbols. All others are ignored. For example: <add key="OcrWhiteList" value="é" /> |
OcrBlackList (not applicable in FileHold 16.2 or higher versions) | A list of characters to deny as acceptable symbols for recognition. All others are considered suitable candidates for OCR identification. For example: <add key="OcrBlackList" value="\|" /> |
OcrExcludeExtension | Excludes files with a specific extension from being added to the OCR queue. For example: <add key="OcrExcludeExtension" value="tiff" /> |
OcrMaxResolution | The maximum allowed page resolution (px) of files processed by OCR. The resolution should be provided in the format <width>x<height>, e.g. 1920x1080. Documents that contain at least one page whose dimensions are greater than this value are ignored during the OCR process. If this option is not provided, no limit is applied. For example: <add key="OcrMaxResolution" value="1920x1080" /> |
OcrMode (FileHold 16.2) | A high-level option that determines whether documents are processed quickly or with better accuracy. By default, the OCR engine runs in FavorSpeed mode. For example: <add key="OcrMode" value="FavorAccuracy" /> |
OcrPageRange (FileHold 16.2) | Sets the page range to be processed, for example, "1;4;5" to process pages 1, 4 and 5, or "1-5;10" to process pages 1 to 5 and page 10. Set this parameter to "*" to process all pages of the current document. This setting is applied to all documents. By default, all pages in documents are processed. For example: <add key="OcrPageRange" value="1-2" /> |
OcrThreads (FileHold 16.2) | Specifies the number of threads to use while processing a document. By default, the value is 0 and the OCR engine automatically maximizes performance based on the number of available cores, an estimate of the memory required to run the task, and the amount of memory available to allocate. Setting a specific low value effectively throttles the resources that OCR uses and makes those resources available for other tasks. For example: <add key="OcrThreads" value="2" /> |
OcrTimeout (FileHold 16.2) | Specifies the maximum time (in seconds) allowed for the whole OCR process of a single document before it is automatically interrupted. By default, the timeout is set to 600 seconds (10 minutes). For example: <add key="OcrTimeout" value="1200" /> |
OcrAutoRotate (FileHold 16.3) | Enables or disables rotation of pages according to the prevailing orientation of the text on each page. Rotation is verified and improved for each page separately. For example: <add key="OcrAutoRotate" value="true" /> |
OcrDeskew (FileHold 16.3) | Enables or disables compensation for small rotations of the document that occur during the scanning process. For example: <add key="OcrDeskew" value="true" /> |
OcrRemoveBlankPages (FileHold 16.3) | Enables or disables removal of detected blank pages in multi-page TIFF images. For example: <add key="OcrRemoveBlankPages" value="true" /> |
OcrDehole (FileHold 16.3) | Enables or disables removal of punch holes (visible as large black dots after the scanning process) located on the edges of document pages in multi-page TIFF files. For example: <add key="OcrDehole" value="true" /> |
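To show how several of these settings might sit together, here is a minimal sketch of the two <appSettings> fragments. All keys and values are the examples already given in the table; the particular combination is illustrative rather than a recommended configuration, and the entries should be merged into the existing <appSettings> section of each file rather than replacing it.

```xml
<!-- Document Repository web.config:
     C:\Program Files\FileHold Systems\Application Server\DocumentRepository -->
<appSettings>
  <!-- Recognize English and French text (combined codes need FileHold 16.2). -->
  <add key="OcrLang" value="eng+fra" />
  <!-- Render pages at 250 dpi; lower values are faster but may be less accurate. -->
  <add key="OcrDpiResolution" value="250" />
  <!-- FileHold 16.2 settings. -->
  <add key="OcrMode" value="FavorAccuracy" />
  <add key="OcrThreads" value="2" />
  <add key="OcrTimeout" value="1200" />
  <!-- FileHold 16.3 settings. -->
  <add key="OcrAutoRotate" value="true" />
  <add key="OcrDeskew" value="true" />
</appSettings>

<!-- Library Manager web.config:
     C:\Program Files\FileHold Systems\Application Server\LibraryManager -->
<appSettings>
  <add key="OcrMaxDocuments" value="10" />
  <add key="OcrCommandTimeoutSec" value="270" />
  <!-- Suggested in the table for documents over 10 MB. -->
  <add key="WebServiceCallTimeoutSec" value="3600" />
</appSettings>
```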
Image/PDF compression options
Configuration Name | Description |
---|---|
PdfCompression_Enabled | Enables compression. By default, this is enabled. <add key="PdfCompression_Enabled" value="true" /> |
PdfCompression_Images | Enables compression of the image-based pages. Image-based pages are considered to contain nothing but one fully visible image covering the whole page area. By default, this is enabled. When PdfCompression_Images is disabled, the following settings are not applied: PdfCompression_ImageDpi, PdfCompression_ImageSmoothing, PdfCompression_ImageQuality. <add key="PdfCompression_Images" value="true" /> |
PdfCompression_ImageDpi | Resolution (dpi) of the image-based pages. By default, this is set to 100. <add key="PdfCompression_ImageDpi" value="100" /> |
PdfCompression_ImageSmoothing | Applies an enhanced smoothing technique during image compression. It improves the contrast of the image by reducing noise. The resulting image is less pixelated, but its file size can increase. By default, this is set to false. <add key="PdfCompression_ImageSmoothing" value="false" /> |
PdfCompression_ImageQuality | Level of quality used to compress JPEG images. Allowed values range from 0 (worst quality) to 100 (best quality). By default, this is set to 60. <add key="PdfCompression_ImageQuality" value="60" /> |
PdfCompression_RemoveAnnotations | Removes all annotations from the whole document. By default, this is set to false. <add key="PdfCompression_RemoveAnnotations" value="false" /> |
PdfCompression_RemoveLinks | Removes all hyperlinks from the whole document. By default, this is set to false. <add key="PdfCompression_RemoveLinks" value="false" /> |
PdfCompression_RemoveMetadata | Removes document-level and page-level metadata. By default, this is set to false. <add key="PdfCompression_RemoveMetadata" value="false" /> |
PdfCompression_RemoveThumbnails | Removes page thumbnail images. By default, this is set to false. <add key="PdfCompression_RemoveThumbnails" value="false" /> |
PdfCompression_RemoveBookmarks | Removes all bookmark items from the document. By default, this is set to false. <add key="PdfCompression_RemoveBookmarks" value="false" /> |
PdfCompression_RemoveEmbeddedFiles | Removes all embedded (attached) files from the document. By default, this is set to false. <add key="PdfCompression_RemoveEmbeddedFiles" value="false" /> |
PdfCompression_RemoveFormFields | Removes all form fields from the document. By default, this is set to false. <add key="PdfCompression_RemoveFormFields" value="false" /> |
PdfCompression_RemoveJavascript | Removes all JavaScript objects from the document. By default, this is set to false. <add key="PdfCompression_RemoveJavascript" value="false" /> |
PdfCompression_PackFonts | Optimizes fonts stored in the document. By default, this is set to false. <add key="PdfCompression_PackFonts" value="false" /> |
PdfCompression_Linearize | Optimizes the document for Fast Web View mode. By default, this is set to false. <add key="PdfCompression_Linearize" value="false" /> |
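A compression profile built from these options might look like the sketch below. The key names and allowed ranges come from the table; the non-default values are illustrative only. It is assumed here that these keys go in the <appSettings> section of the same Document Repository web.config used for the OCR engine settings, which this article does not state explicitly for each key.

```xml
<!-- Assumed location: Document Repository web.config, <appSettings> section. -->
<appSettings>
  <add key="PdfCompression_Enabled" value="true" />
  <!-- Must stay enabled for the three image settings below to apply. -->
  <add key="PdfCompression_Images" value="true" />
  <!-- Default is 100; 150 is an illustrative higher resolution. -->
  <add key="PdfCompression_ImageDpi" value="150" />
  <add key="PdfCompression_ImageSmoothing" value="false" />
  <!-- Default is 60; allowed range is 0 (worst) to 100 (best). -->
  <add key="PdfCompression_ImageQuality" value="75" />
  <!-- Illustrative: drop thumbnails and optimize for Fast Web View. -->
  <add key="PdfCompression_RemoveThumbnails" value="true" />
  <add key="PdfCompression_Linearize" value="true" />
</appSettings>
```

Options that remove content (annotations, links, form fields, embedded files, and so on) default to false and are left out of this sketch, so they keep their defaults.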
The server side OCR component does not work properly for non-Latin languages if the Arial Unicode MS font is not installed on the application server. If this font is missing, install it in the [Drive]:\Windows\Fonts directory. More information about the font and licensing: http://www.microsoft.com/typography/Fonts/font.aspx?FMID=1081, http://www.fonts.com/font/monotype/arial-unicode
OCR queue status and reprocessing documents
The OCR Queue Status page shows the current status of the OCR engine and any warnings or errors for documents that cannot be processed.
In the General area, the following information is displayed: the status of the OCR engine (enabled/disabled), compression status (enabled/disabled), if the higher priority of newly added documents or versions is enabled, the number of pending documents, and the number of processing errors as well as the list of errors.
When an error or warning occurs while the server performs the OCR, the document is removed from the queue and added to the List of Errors. The List of Errors shows the type (warning or error), FileHold ID, date that the error occurred, and the error details. OCR errors can occur when:
- The document is checked out.
- The document is under an active workflow.
- The document is encrypted, password protected, or corrupted.
- The document does not have any valid text that can be recognized.
- A newer version of a document has been checked in.
- The file has an invalid extension.
- Compression results in a larger file. Negative compression ratios are sometimes displayed in the logs. In this case, the compression attempt is abandoned if the setting “PdfCompression_AbandonNegativeCompression” is enabled in the Document Repository web.config file. The default is true. If enabled, the OCR and compression queue report displays the error message “Compression resulted in <x>% larger file, so compression attempt abandoned.”
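For reference, the negative-compression setting mentioned in the last item could be written as follows. The key name, file, and default value come from the text above; placing it under <appSettings> is an assumption based on the other settings in this article.

```xml
<!-- Document Repository web.config (sketch). -->
<appSettings>
  <!-- Default is true: abandon compression when the result would be larger
       than the original file. -->
  <add key="PdfCompression_AbandonNegativeCompression" value="true" />
</appSettings>
```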
The information about the error is displayed in the Details column. If an error occurs for checked out or active workflow documents, these can be repaired by re-adding the documents to the queue.
Once the OCR mechanism completes, the OCR’d document is checked in as a new version. The OCR’d PDF is checked in with the same owner as the previous version. This new version is then processed by the full text search engine so it becomes searchable.
To view the OCR status and reprocess documents
- In the Web Client, go to Administration Panel > System Management > OCR and compression queue.
- In the General area, the following is displayed:
  - OCR functionality status – Shows if the server side OCR engine is enabled or disabled. This is enabled by a system administrator.
  - Compression functionality status – The status can be "Enabled", "Disabled" or "Not licenced".
  - Higher priority for newly added or checked in documents – Shows if the priority for newly added documents or versions is enabled. If enabled, these documents take a higher priority in the queue. If the setting is not enabled, documents are taken from the queue in the order they are added without taking priority into account.
  - Number of pending documents – The number of documents that are waiting to be processed by the OCR engine.
  - Number of errors while processing – The number of documents that cannot be OCR’d.
- To review the warnings and errors, see the list of documents that triggered an issue, displayed below the General area. The list of errors displays:
  - Type – Whether the issue is a warning or an error. Warnings are displayed for non-permanent or non-technical errors, such as when a document is under a workflow or checked out. Documents with warnings can be re-added to the OCR queue.
  - FileHold ID of the document.
  - Date and time the OCR error occurred.
  - Details of the problem. Warnings occur if the document is checked out or under an active workflow. Errors occur when the document is encrypted, password protected, or corrupted, when compression results in a larger file, or when the document does not have any valid text that can be recognized.
- To restrict the list to the dates when the errors occurred, enter dates in the From and To fields and click Apply.
- To reprocess a document with a warning, select the check box next to the warning and click Re-add document(s) to OCR Queue. The documents are re-added to the OCR queue for processing.
- To clear any errors from the view, select the check box next to the error and click Clear error(s). The errors are removed from the list.