File compression for a smaller, faster repository

Disk space is cheap, so why compress? That may be true, but it is not always true, and even when it is, why waste it? If you think about storage on a global scale, there is a significant environmental impact: more electricity, more cooling, more data centers. Why contribute to that?

What if disk space is not so cheap or not so insignificant? Companies are increasingly converting storage from CAPEX to OPEX as they move to storage subscriptions in the cloud. For some reason cloud storage costs more. Of course, it usually comes with third party management, linked backup, georedundancy, encryption, a reliable lifecycle and other tidbits you ignored when the drive was sitting in the server in the back corner of the office and you mostly kept your fingers crossed.

What if disk space is not just about one file? How many users have a copy of the file on their local workstation? In their browser download cache? Did the file get emailed to anyone in an attachment? Did it get emailed to EVERYONE!?! Suddenly a 1-megabyte file might actually consume dozens or hundreds of megabytes in your organization (don’t forget the sync’d copies on your Exchange server).

What if it is not just about disk space? You need to upload the file to get it into the repository. You may need to download it to view it. And don’t forget all those emails taking up network bandwidth inside and outside of your organization. You can tell users to “send a link” as much as you like, but you know they want to attach. Bandwidth is not the only cost. Your network is likely a constrained resource compared to CPU speed, so everything takes longer when files are bigger.

There are a couple of options we see for compressing files in our everyday computer use. One is to simply tell the OS to compress a folder. This will use less storage, but it does not actually make the file smaller, so network and external storage still suffer. You could zip the file, but then effort is required to unzip it before it can be used. Another downside is that these technologies apply the same general-purpose compression technique to every file, so they do not take advantage of the special cases that exist in image and PDF content.

In FileHold version 16.2 we added an option for all OCR users to compress their PDF and image documents to PDF after the OCR process completes. We extended compression as a standalone option starting with 16.3.

Compression works in the same way as OCR on the server by queuing documents as they are added to FileHold or queuing all existing documents to get things started. It will find TIFF or PDF documents that it has not previously compressed and compress them using a variety of techniques either on their own or after OCR has completed.
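The selection step described above can be sketched roughly as follows. This is only an illustration: the `Document` record, its `compressed` flag, and the `select_for_compression` helper are hypothetical names made up for this example and are not part of the FileHold API.

```python
from collections import deque
from dataclasses import dataclass


# Hypothetical record; the real FileHold document model is not shown here.
@dataclass
class Document:
    name: str
    compressed: bool = False


# File extensions treated as compressible in this sketch (an assumption).
COMPRESSIBLE = (".tif", ".tiff", ".pdf")


def select_for_compression(documents):
    """Return a FIFO queue of TIFF/PDF documents not yet compressed."""
    return deque(
        doc for doc in documents
        if doc.name.lower().endswith(COMPRESSIBLE) and not doc.compressed
    )


docs = [
    Document("invoice.pdf"),
    Document("scan.TIFF"),
    Document("notes.docx"),
    Document("archive.pdf", compressed=True),
]
queue = select_for_compression(docs)
print([doc.name for doc in queue])
```

Only the two uncompressed TIFF and PDF files end up in the queue; the Word document and the already compressed PDF are skipped.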

We use a few key techniques to compress the files.

  • Image compression or recompression and scaling
  • Color detection and normalization
  • Mixed raster content (MRC) compression
  • PDF normalization and unnecessary content removal

Images are often scanned at excessively high resolutions, such as over 300 DPI. Even images scanned at 300 DPI may not need that amount of data once the OCR process is complete, so the resolution can be scaled down. Additionally, if the image was compressed by the scanning software, it may not have been done in a way that is optimal for long term storage. Bitonal images are compressed with JBIG2 and JPEG is used for full color.
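To see why scaling the resolution matters, note that the pixel count grows with the square of the DPI. The sketch below computes the uncompressed size of a scanned page; the US Letter page size and 24-bit color depth are assumptions picked for the example, and `raw_size_bytes` is just an illustrative helper.

```python
def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a page scanned at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed example: US Letter page (8.5 x 11 inches) in 24-bit color.
at_600 = raw_size_bytes(8.5, 11, 600, 24)
at_300 = raw_size_bytes(8.5, 11, 300, 24)

print(f"600 DPI: {at_600:,} bytes")
print(f"300 DPI: {at_300:,} bytes")
print(f"ratio:   {at_600 / at_300}")
```

Halving the resolution cuts the raw pixel data to a quarter before any compression is applied.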

We frequently see black and white images scanned using an unnecessary color space designed for color photographs. They may have a bit depth of 8 or up to 24 (read: a bigger file than necessary). This case is automatically detected and the optimal color scheme is selected.
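One simple way to picture this kind of detection is to check whether every pixel is close to gray, and then whether every gray pixel is close to pure black or pure white. The sketch below is a toy version under those assumptions; the `minimal_bit_depth` name and the tolerance value are invented for this example and say nothing about how the FileHold engine actually decides.

```python
def minimal_bit_depth(pixels, tolerance=8):
    """Suggest the smallest bit depth (1, 8, or 24) that fits the pixels.

    pixels: iterable of (r, g, b) tuples with channel values 0-255.
    tolerance: allowed deviation, an arbitrary value chosen for this sketch.
    """
    is_bitonal = True
    for r, g, b in pixels:
        # A pixel counts as gray when its three channels are nearly equal.
        if max(r, g, b) - min(r, g, b) > tolerance:
            return 24  # real color is present, so full color is needed
        level = (r + g + b) // 3
        # A gray pixel counts as bitonal when it is near black or near white.
        if tolerance < level < 255 - tolerance:
            is_bitonal = False
    return 1 if is_bitonal else 8


text_page = [(0, 0, 0), (255, 255, 255), (3, 2, 4), (250, 252, 251)]
gray_photo = [(128, 128, 128), (0, 0, 0), (200, 201, 199)]
color_photo = [(200, 30, 30), (255, 255, 255)]

print(minimal_bit_depth(text_page))
print(minimal_bit_depth(gray_photo))
print(minimal_bit_depth(color_photo))
```

A production detector would need to tolerate scanner noise and small colored specks, for example by allowing a small fraction of outlier pixels.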

MRC compression restructures the PDF file so that each kind of content is stored in the way that suits it best. It does this by segmenting the page into layers. The binary layer contains text and uniform graphic elements, which can be substantially compressed with JBIG2. The background layer from the original image is largely retained and compressed with JPEG. You can think of the foreground layer as the chromatic channel for the binary layer, and it can also be compressed with JPEG. Altogether, the resulting file can be up to 8 to 10 times smaller than with JPEG on its own. It also has the side effect of improving the image quality over a simple JPEG compression.
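The layer idea can be illustrated with a toy segmentation. The sketch below splits a tiny grid of RGB pixels into a binary mask, a foreground layer, and a background layer using a plain luminance threshold. The function names and the threshold are assumptions made for this example. A real MRC encoder uses far more sophisticated segmentation, fills the holes in each layer with nearby colors, and then compresses each layer, none of which is shown here.

```python
WHITE = (255, 255, 255)


def split_mrc_layers(pixels, threshold=96):
    """Split rows of (r, g, b) pixels into mask, foreground, and background.

    threshold: luma below which a pixel counts as text; an arbitrary
    value chosen for this sketch.
    """
    mask, foreground, background = [], [], []
    for row in pixels:
        mask_row, fg_row, bg_row = [], [], []
        for r, g, b in row:
            # Integer approximation of Rec. 601 luma.
            luma = (299 * r + 587 * g + 114 * b) // 1000
            is_text = luma < threshold
            mask_row.append(1 if is_text else 0)
            # Foreground keeps the color of text pixels only.
            fg_row.append((r, g, b) if is_text else WHITE)
            # Background keeps everything else; text pixels are blanked.
            bg_row.append(WHITE if is_text else (r, g, b))
        mask.append(mask_row)
        foreground.append(fg_row)
        background.append(bg_row)
    return mask, foreground, background


def recombine(mask, foreground, background):
    """Rebuild the page: take foreground where the mask is set, else background."""
    return [
        [fg if m else bg for m, fg, bg in zip(m_row, fg_row, bg_row)]
        for m_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


# A 2x2 "page": cream paper with two dark text pixels.
page = [
    [(250, 245, 230), (20, 20, 60)],
    [(10, 10, 10), (240, 235, 220)],
]
mask, foreground, background = split_mrc_layers(page)
print(mask)
print(recombine(mask, foreground, background) == page)
```

The mask holds only one bit per pixel, which is why a bitonal codec suits it, while the foreground and background layers contain smooth color that tolerates lossy compression well.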

Finally, the engine can be configured to clean up the PDF itself: removing unneeded points on paths in images originating from CAD systems, trimming unnecessary decimals in floating point numbers, converting smooth lines to curves, and removing annotations, JavaScript, links, metadata, thumbnails, bookmarks, embedded files, form fields, unneeded font information, and fast web view data.
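As a small illustration of the "unnecessary decimals" item, the sketch below rounds the decimal numbers in a text fragment that resembles PDF path operators (`m` for move-to, `l` for line-to). The `trim_decimals` helper, the sample fragment, and the choice of two decimal places are assumptions for this example; it is a simplified stand-in, not the engine's actual normalization.

```python
import re

# Matches decimal numbers such as 12.500000 or -0.333333.
_DECIMAL = re.compile(r"-?\d+\.\d+")


def trim_decimals(content, places=2):
    """Round decimal numbers in a text fragment to a fixed number of places."""
    def shorten(match):
        value = round(float(match.group()), places)
        text = f"{value:.{places}f}"
        if "." in text:
            # Drop trailing zeros, then a dangling decimal point.
            text = text.rstrip("0").rstrip(".")
        # Avoid emitting "-0" when a small negative number rounds to zero.
        return "0" if text == "-0" else text

    return _DECIMAL.sub(shorten, content)


sample = "10.000000 20.123456 m 30.500000 -0.001000 l"
print(trim_decimals(sample))
```

Here the 43-character fragment shrinks to 19 characters, and the same saving repeats for every coordinate in a drawing with thousands of path points.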

Practically then, how much will my documents be compressed? Good question. The level of compression will vary according to many of the factors above. The best way to know is to give it a try. Current customers can request a trial for compression and everyone else can ask their account manager for help. Just send them some sample documents and they can give you a report from the compression engine as the details are logged with the normal document usage data.

If you want a taste without all the work, I have tried a few tests with scanned and generated documents. For the latter case I used a couple of reports with dozens to hundreds of pages. Pages mostly included text, but there were some images in places like title pages and footers. As expected, the results were minimal, but there was some improvement. You can see details of the results in the document usage log.

Image
FileHold document compression report

For a more interesting result, I scanned a tourist brochure I picked up at the Great Wall of China. It had a combination of images, text and a textured background. I scanned the image at 300 dpi and processed the TIFF output three ways: color uncompressed, color compressed using 50% JPEG and gray scale with LZW compression. I captured a screen shot at WQHD resolution from the FileHold viewer set to full screen width to get a practical view of what an end user might see and set the contents side-by-side below. In all cases, the original is on the left and the compressed file is on the right. The default compression parameters that ship with FileHold were used.

Image
Compression colour comparison

The images and background have lost detail and there is some color shift in the text, which has effectively been sharpened and made more readable. The original file is 23.71 MB and the compressed file is 185.25 KB, which makes it roughly 130 times smaller (a reduction of about 99%).

Image
File compression with colour

As you can see, the 50% JPEG compression has produced a less readable original. The compressed version has similar changes to the first image, but now the text is somewhat easier to read than the original. We started with a file of 1.03 MB and ended with a compressed version of 195.1 KB, which is only about 5 times smaller (a reduction of roughly 81%) and a larger file than the first result. The JPEG compression in the original seems to have made it harder to compress further.

Image
File compression grey scale

Finally, the gray scale version. Apart from the missing color, the visual results are very similar to the first case. The original was 6.3 MB and the compressed version was 171.92 KB, roughly 37 times smaller (a reduction of about 97%). That is the smallest file of the three, but only by a small margin over the full color result.
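For anyone who wants to check the arithmetic on the three tests, the sketch below computes the size ratio and the percent reduction from the file sizes quoted above. It assumes 1 MB = 1024 KB; if the sizes were reported in decimal units the figures shift slightly. The `compression_stats` helper is just an illustrative name.

```python
KB_PER_MB = 1024  # assumption: the sizes above are reported in binary units


def compression_stats(original_mb, compressed_kb):
    """Return (size ratio, percent reduction) for one before/after pair."""
    original_kb = original_mb * KB_PER_MB
    ratio = original_kb / compressed_kb
    reduction = (1 - compressed_kb / original_kb) * 100
    return ratio, reduction


# (original size in MB, compressed size in KB) from the three tests.
cases = {
    "color uncompressed": (23.71, 185.25),
    "color 50% JPEG": (1.03, 195.1),
    "gray scale LZW": (6.3, 171.92),
}

for label, (original_mb, compressed_kb) in cases.items():
    ratio, reduction = compression_stats(original_mb, compressed_kb)
    print(f"{label}: {ratio:.1f}x smaller, {reduction:.1f}% reduction")
```

Note that a percent reduction can never exceed 100%, so the ratio ("N times smaller") is usually the clearer way to compare results like these.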
 

 

Russ Beinder

Russ Beinder is the Chief Technology Officer at FileHold. He is an entrepreneur, a seasoned business analyst, computer technologist and a certified Project Management Professional (PMP). For over 30 years he has used computer technology to help organizations solve business problems.