Scanning Essentials – Part 2: Planning, Equipment, & Software (Part of the Going Paperless series)


The modern office wants to be paperless, but Going Paperless does not just mean getting rid of paper. FileHold has helped hundreds of organizations become more efficient with a searchable, controlled repository of documents and records and move away from physical media. The FileHold team understands the challenges customers experience Going Paperless: scanning archives, dealing with newly arriving paper documents, eliminating physical sign-offs and approvals, and ensuring records are kept for compliance. The Going Paperless series features best practice advice to help you, and your organization, get started with some simple ways to reduce cost, increase efficiency, and modernize your team.

Other articles in the series:

Scanning Essentials - Part 1: Assessing your Needs

There are multiple things to consider with scanning. Evaluating resources, both what is available and what is needed, is essential to a successful scanning project. This article will look at the tools (scanners and software) and the talent (the people doing the scanning) and suggest some best practices for a successful project.

Physical Scanning Resources

Scanners

As hardware changes regularly, this will be a general review of the categories of scanner technology, with consideration of your organizational resources. For instance, an office may benefit from a multi-function work center that provides printing, email, and faxing as well as scanning, whereas a remote worker might use a flatbed desktop scanner.

| Type of Scanner | Description | Typical Use | Volume | Velocity | Variety | Veracity | Value |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Portable/Handheld | Small scanners, often USB powered, designed for scanning on the go. | Out-of-the-office; shop floor; odd-sized documents | Very low | Very slow | High* | Moderate | Low Cost |
| Flatbed | Desktop scanners where each page of the document is laid down, scanned, and then replaced with the next page. The scanner has an optical device that moves to capture the image. | Books; fine art; delicate documents; non-standard paper or materials (e.g., receipts, check stubs) | Very low | Very slow | High* | High | Low Cost |
| Sheetfed flatbed scanner | Uses the same technology as a flatbed scanner, except it draws the paper past a stationary optical device. Often incorporated into printers and multi-function work centers. Documents must be standard-size and unbound for the feeder. | Standard letter- or legal-width (8 1/2") paper: office documents, loose leaf paper, archives. | High | Fast | Low* | Moderate | Moderate Cost |
| Drum Scanner | Often associated with high-speed scanning, this sheet feed brings the documents through a photomultiplier tube quickly and accurately. Documents must be similar in size. | Same as sheetfed, but with large volume scanning | Very High | Very fast | Low* | High | High Cost |

* Regarding variety: scanning can be done as discrete sheets or in batches of paper. Sheetfed and drum scanners are designed to batch scan: whatever is in the tray is processed at once, and the default is typically to output a single batch file, although software can make these discrete documents. However, the nature of a feed-based scanner is to aim for uniformity in document size. Portable and flatbed scanners, on the other hand, process one page at a time by design, and these can be any size. Therefore, these scanners have more flexibility.

From this, we can see there is no “one-size-fits-all” piece of hardware. Document & scanning needs must be considered carefully to ensure there is a positive match between technological utility and process needs. For instance, a front-end receptionist might benefit from a sheetfed scanner to process incoming invoices, a flatbed scanner for non-standard documents like bills of lading, or a multi-function work center that does both.

Finally, make sure the expenditure for hardware matches your needs. Always consider rentals of equipment for short-term scanning projects: a high-speed scanner or a specialized flatbed scanner for books might save a tremendous amount of labor.

Labor

Scanning is more than the optical capture of the document – it also includes preparing the documents and dealing with the now-scanned paper. Document preparation is physical work: staples and other bindings need to be removed so that an errant staple does not scrape across and damage the scanner, butterfly clips need to be removed, and cerlox bindings detached. Scale is essential here: none of these tasks is particularly difficult as a one-off, but each becomes more intensive (and tedious) when repeated thousands of times. The technology available will also determine the amount of prep; flatbed scanners take one page at a time, so bindings do not have to be removed. Once scanned, does the paper need to be saved? Invoices and memos can be recycled, but books might need to be reshelved. Consider the after-product of scanning in your preparation.

Are you adding scanning to someone’s existing responsibilities? Scanning is designed to make operations more efficient in the long term, but it does involve a little more front-end effort. The benefits of this effort – not needing to refile documents, having their contents be searchable, not requiring off-site storage, etc. – are most often felt later, so the person doing the scanning takes on more work personally even though the organization as a whole does less. Consider an invoice arriving at the front-line receptionist’s desk: in a paper-based process, all that needs to happen on arrival is to place the invoice in a folder or a mailbox and send it on its way. Paper is infamously inefficient, for that single document needs to move physically from person to person through its cycle of review and approval. Adding the scanning to the receptionist’s workload may be seen as a burden and cause resistance.

Can the process be made easier? Scanning can be streamlined. If the volume is high, a sheetfed scanner and daily batch processing of invoices might be the most efficient, least-invasive process; if the volume is low, ad-hoc scanning at the desk instead of going to a copy room and work center might be best. We encourage you to consider these factors before finalizing your physical scanning process.

We also suggest communicating the value of the paperless office to your team. Help them understand how a few moments of additional work now eliminate the need to search for, retrieve, and refile the document later. Projects always excel when team members are on board, and seeing how an individual’s efforts help the larger picture can smooth many concerns!

Finally, for large projects, hiring scanning professionals (and their equipment) can be the most cost-effective route, as they are familiar with specific document processing and organizational needs. If you choose to go this route, consider the electronic resources needed to process and store the documents, as these will have an impact on the scanning professional’s operation.

Electronic Scanning Resources

Scanning Software

This is the software that interfaces with the workstation computer to capture the image of the scan. Often packaged with the driver, the scanning software allows the user to configure the quality of the scanned image. For a DMS like FileHold, the image produced by the scanner – as configured by the scanner software – can be stored like any other document. Images can be as large or as compressed as your needs demand. However, some guidelines are useful for basic scanning. Here are some basic terms and considerations.

Scan Resolution

Expressed in DPI (dots per inch), the scan resolution controls how much detail is captured with the scan. The higher the DPI, the more image detail is captured, but at a slower scanning speed and with a larger output file size. Each organization will need to assess its source documents to determine the ideal resolution settings. Here is a rough guideline for black-and-white scanning:

| DPI | Quality | Use |
| --- | --- | --- |
| 72 | Very low | Thumbnail image |
| 150 | Low | Website |
| 300 | Moderate | Text |
| 600 | Moderate | Text & Images |
| 1200 | High | Publication & printing |

Image resolution will affect the ability of optical character recognition (OCR) to accurately find information. 300 DPI is the standard recognized resolution for success with OCR in black and white scans.
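To get a feel for how resolution drives file size, here is a minimal sketch that estimates the raw, uncompressed size of one scanned page. The function name and the letter-size page are assumptions chosen for illustration; real scanner output will differ because most formats apply compression.

```python
def uncompressed_scan_bytes(width_in: float, height_in: float, dpi: int,
                            bits_per_pixel: int = 1) -> int:
    """Estimate the raw (uncompressed) size of one scanned page in bytes.

    bits_per_pixel: 1 for black-and-white, 8 for greyscale, 24 for color.
    """
    pixels_wide = round(width_in * dpi)
    pixels_high = round(height_in * dpi)
    total_bits = pixels_wide * pixels_high * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Letter-size page (8.5 x 11 inches) scanned in black-and-white.
for dpi in (72, 150, 300, 600, 1200):
    size = uncompressed_scan_bytes(8.5, 11, dpi)
    print(f"{dpi:>5} DPI: {size / 1_000_000:.2f} MB")
```

Doubling the DPI quadruples the pixel count, which is why a 600 DPI page is roughly four times the raw size of a 300 DPI page. Passing `bits_per_pixel=8` or `bits_per_pixel=24` shows the additional cost of greyscale or color at the same DPI.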

Color

Most modern scanners will process documents in black-and-white, greyscale, or color. Each has its place in digitization, with black-and-white leaning toward text and color toward artwork or images. Color scanning typically produces the largest output files and black-and-white the smallest (when scanning at the same DPI). However, the color scanning process often needs a much higher resolution. Some color scanning software does not scan at the DPI requested; instead, it scans at a high DPI and then compresses the document to appear as a lower DPI. To the eye, these look acceptable: here is a screenshot of a low-quality scan of a check stub at screen resolution:

Figure 1 - Screen Resolution appearance of a compressed color scan

However, if we zoom into the image, we can see artifacts from the compression, in this case from 1200 to 300 DPI.

Figure 2 - Zoomed Image of a compressed color scan

These artifacts, or “ghosts”, make the image look pixelated and unfocused. These documents are not suitable for printing, as the ghosts are intensified as noise in the printed output. The compression can also cause other processes, such as OCR, to misinterpret the image and produce inaccurate results. Therefore, when color scanning, either turn the DPI up to avoid image compression or configure the scanner software to scan at the requested DPI instead of scanning high and compressing.

Before we move on, ask what value a color scan would have here. Do we need it to be blue because the checks are blue? Is that essential to the validity of the information on the check? By scanning in black-and-white, we can avoid the compression issues or scanner settings altogether to ensure the documents being imported maintain their value at the lowest useful resolution.

File Output

Scanner software can output scanned documents in different formats. The two most common are .pdf and .tiff, although other formats, like .jpg or .gif, are also popular. Compression is typically applied to formats like .jpg, so be careful to not introduce artifacts through the formatting.

Miscellaneous Settings

The scanning software might allow users to select other variables, such as contrast or brightness adjustment, sharpness, image clean-up, de-skewing, and even low-end OCR. These change from software to software, and organizations should explore what options are available and fine-tune them before undertaking large projects.

Processing Software

Processing software sits between the completed scan and storage. In some cases, it can talk directly to the scanner via TWAIN, a protocol that communicates between software and imaging devices. Most modern scanners and scanning software use TWAIN, so there should be little issue with processing software communicating with scanning software.

Zonal OCR

Unlike general OCR, which only looks for text, zonal OCR applies logic to find information on a document. For instance, where it finds the text “Invoice No:”, it might look at the characters that follow and assume that is the invoice number, which is then added to that field for verification. As documents are searched, or “crawled”, the zonal OCR software remembers where it found that value, making the processing of the next document of that type more efficient.
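The label-then-value idea described above can be sketched in a few lines. This is a simplified, text-only illustration and not how SmartSoft or any particular product works: real zonal OCR also uses the position of text on the page. The pattern, function name, and sample page are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical label pattern: the text "Invoice No:" followed by the value.
INVOICE_NO_PATTERN = re.compile(
    r"Invoice\s+No\.?\s*:\s*([A-Za-z0-9-]+)", re.IGNORECASE
)


def find_invoice_number(ocr_text: str) -> Optional[str]:
    """Return the token that follows an "Invoice No:" label, or None if absent."""
    match = INVOICE_NO_PATTERN.search(ocr_text)
    return match.group(1) if match else None


# Made-up OCR output for one page.
sample_page = """ACME Supplies Ltd.
Invoice No: INV-20417
Date: 2021-03-15
Total: 1,250.00"""

print(find_invoice_number(sample_page))
```

The extracted value would then be placed in the invoice number field for a person to verify before export.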

FileHold offers a license of SmartSoft Capture with all new deployments to assist organizations with their zonal OCR processing of documents. Capture is just one of several SmartSoft products, each of which takes a different approach to meeting your scanning needs.

| SmartSoft Product | Description | Volume* | Velocity | Variety | Veracity | Value | Ideal for: |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Capture | Basic form scanning. Includes barcode separation. Uses Tesseract OCR engine. Can be used without OCR for images with a text layer. | Low | Low | Low (only one document type at a time) | High | High | Small organizations with occasional scanning needs (e.g., less than 1000 pages/month). |
| Capture Plus | All features of Capture, plus the Nuance OCR engine for greater and more accurate data capture. | Low | Low | Low (only one document type at a time) | Very High | High | Small organizations with occasional scanning needs where documents are of low quality, such as digital archives. |
| Invoice | Includes all elements of Capture. Allows for line-item scanning and processing for greater data capture. | Low | Low | Low (only one document type at a time) | Very High | High | Dedicated low-volume scanning of invoices, such as an accounting firm. |
| Pro | Online service, paid-per-click instead of installed. Offers much greater processing speed and indexing. | Moderate | High | High (can be trained to recognize different types) | Very High | High | Organizations with more than 1000 pages of documents per month requiring zonal OCR processing. |

* A word about Volume – it is very difficult to give hard numbers on what constitutes high and low volume. As a general rule of thumb, 1000 documents/month is the guideline for a higher-end use, but that does not consider the organization’s resources for labor, workstations, document length, complexity, or all the other factors that might make this an easy target for one organization and under-powered for another. Capture is provided as a starting point, and FileHold recommends a trial of the software before its use in large-volume applications.

SmartSoft’s products include a step of manual verification of the metadata captured. OCR, while amazing, is not infallible and can produce single-case errors. For instance, a low-quality scan or a color scan full of artifacts might cause an “rn” to be read as an “m”, or a “cl” as a “d”. The manual verification step ensures the zonal OCR has not made an error and exports the correct information to FileHold. Some organizations may feel this step is not necessary; however, we disagree. Metadata must be accurate to retain value, and relying on AI processes to not make mistakes can result in documents being “lost” when the index is not working correctly. The manual step of verification is minimal and ensures content being captured is accurate.
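As an illustration of why these errors matter, the sketch below takes an OCR-read value, swaps the confusable pairs mentioned above, and checks the variants against a list of known values. It is an assumption-laden helper for a reviewer, not a feature of SmartSoft or FileHold: the function names and vendor names are invented, and the swap is applied to every occurrence at once for simplicity.

```python
from typing import Iterable, List, Set

# Character sequences OCR may confuse (the two pairs from the text above).
CONFUSIONS = [("rn", "m"), ("cl", "d")]


def candidate_readings(value: str) -> Set[str]:
    """Return the value plus variants with one confusable pair swapped throughout."""
    candidates = {value}
    for a, b in CONFUSIONS:
        candidates.add(value.replace(a, b))
        candidates.add(value.replace(b, a))
    return candidates


def suggest_matches(value: str, known_values: Iterable[str]) -> List[str]:
    """Return the known values that the OCR output could plausibly be."""
    return sorted(candidate_readings(value) & set(known_values))


# Hypothetical vendor list, e.g. from an external database lookup.
known_vendors = ["Modern Supply", "Clark Freight"]

# The OCR engine read "rn" as "m".
print(suggest_matches("Modem Supply", known_vendors))
```

A suggestion like this only narrows the options; a person still confirms the value during the verification step.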

Each of these SmartSoft products integrates with FileHold through the Manage Import process. This associates the tagged zonal-OCR information with the document metadata, and auto-filing rules or database lookups can be applied to ensure the document goes to the right place with the right information. Again, this is why the verification step SmartSoft requires is so crucial: if the information is correct, you can be assured that the FileHold processing will be reliable, repeatable, and reproducible.

Finally, online software is offering more options for scanning processing at greater speeds. By using the dedicated processing power of the cloud, these services can apply OCR and zonal recognition to documents far faster than a workstation can. The cost of these online services may be offset by the reduction in operational processing cost.

FileHold

Additional software may not be necessary for your scanned documents. For instance, scans of long documents can be added directly to FileHold just like a standard document, where the metadata can be added manually or found in an external database. This is often faster than the processing time needed for zonal OCR to find content. Where a schema has minimal metadata, or document metadata cannot be found on the document, adding it directly to FileHold and bypassing processing software may be the most efficient path.

FileHold can also offer some processing services for scanned documents:

  • Upon intake, documents can have an OCR layer applied to index the text automatically in the search index.
  • WebCap can be used with your scanner to directly intake documents.
  • Click to Tag lets you visually grab information and OCR it for metadata indexing.

Document type should be considered when selecting an intake route. For example, invoices are short, metadata-rich documents so zonal OCR is an efficient way to batch-process multiple documents at a time and export them as discrete elements into FileHold. For HR documentation, where the only metadata needed might be on page one, and minimal at that, Click to Tag and Server-side OCR will be very efficient to process the document and have its content fully searchable.

Other Software

Other software is available for document processing that may offer additional functionality, but these products can be very expensive. Each organization will need to weigh their benefits to see whether the ROI justifies the cost. FileHold offers a variety of tools to help bring these documents into the repository as efficiently as possible.

Closing Thoughts

The process of scanning – document preparation, image capture, software processing, and import to a document repository like FileHold – can vary widely depending on your organizational needs, resources, and obligations. We hope this blog has given some good general advice on these practical elements. Our next article will look at some bigger picture questions on document scanning and putting all the pieces together. To get our team engaged with helping your organization with its scanning, contact your sales representative or [email protected].