From the Mailbag: Extraction Rules Tips and Tricks
Document Management software that can extract information from forms or email and automatically read file properties for metadata.
Extraction rules are an underused tool in FileHold. They simplify metadata entry by automatically extracting file properties according to a predefined mapping. The minimum role required to configure extraction rules is Library Administration, but the rules apply to all documents added from the FileHold Desktop Application (FDA) or with Automated Document Importation (ADI).
Extract form information automatically
I am using PDF form extraction rules. I have a form that is already in use, and I have modified a copy of it to add to FileHold as a second form, but the old extraction rules are being applied. How can I have the modified form accepted as a new form by the extraction rule?
Every PDF form has a globally unique identifier (GUID), which is generated when the form is created. Users cannot see or access this GUID, but it lets software distinguish one form from another. The GUID is generated only at creation: taking an existing form, editing it, and saving it under a new name does not create a new GUID. To FileHold, the two forms are the same because their GUIDs match.
The solution is either to create a new form or to trick your PDF editor into generating a new GUID. Since creating a form from scratch is rarely practical, the usual approach is to merge or assemble the form using your PDF editing software; the output PDF will have a new GUID. Do not use a print function or anything else that might “flatten” the PDF: flattening keeps the text but loses the fields that are passed into the file’s properties for the extraction rule to capture. You can then add your new form to the extraction rule and have the fields extracted as metadata.
Note: FileHold’s Document Assembly acts as a printer when it generates a new PDF from multiple compatible documents, flattening the images. It does not create a new GUID.
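The matching behavior described above can be modeled in a few lines of Python. This is only an illustrative sketch, not FileHold's implementation: the `PdfForm` class, its `save_as` and `reassemble` methods, the `find_rule` helper, and the file names are all hypothetical, and the GUID is simulated with a random UUID.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PdfForm:
    """Toy stand-in for a PDF form; the GUID is assigned once, at creation."""
    name: str
    fields: tuple
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def save_as(self, new_name, new_fields=None):
        # Editing and saving under a new name keeps the original GUID.
        return PdfForm(new_name, new_fields or self.fields, self.guid)

    def reassemble(self, new_name):
        # Merging or assembling produces a new file with a new GUID
        # while keeping the form fields (unlike printing or flattening).
        return PdfForm(new_name, self.fields)


def find_rule(rules, form):
    """Look up an extraction rule by GUID only; the file name is ignored."""
    return rules.get(form.guid)


original = PdfForm("claim_v1.pdf", ("Name", "Date"))
rules = {original.guid: "Claim form rule"}

edited = original.save_as("claim_v2.pdf", ("Name", "Date", "Amount"))
rebuilt = edited.reassemble("claim_v2_rebuilt.pdf")

print(find_rule(rules, edited))   # the old rule still matches the edited copy
print(find_rule(rules, rebuilt))  # no match yet, so a new rule can be added
```

Because the lookup ignores the file name, renaming alone never helps; only a new GUID makes the modified form look new to the rule.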
Automate file property extraction in document management
We have an extraction rule for one type of file, but we need it to go to a different schema: how do you apply an extraction rule to one file type but many schemas?
In a nutshell, you can’t, but there is a great workaround using common metadata fields. A common metadata field is any field used by more than one schema. You might recall seeing these listed in the drop-down menu for your advanced searches.
You can create a generic schema that contains all the metadata fields you want to extract from your document’s file properties, and point the extraction rule at it. Once the document is in the inbox, you can change its schema to the preferred one, which shares all (or just some) of its metadata fields with the generic schema. The extraction rule has then grabbed the correct information, and the document ends up with the preferred schema.
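The effect of switching from the generic schema to the preferred one can be pictured as keeping only the values whose field exists in both schemas. The sketch below is a conceptual model, not FileHold code; the schema names, field names, and the `change_schema` function are hypothetical.

```python
def change_schema(metadata, target_fields):
    """Keep only the extracted values whose field also exists in the target schema."""
    return {name: value for name, value in metadata.items() if name in target_fields}


# Hypothetical schemas: the first three fields are common to both.
generic_schema = {"Title", "Author", "Created", "Keywords"}
invoice_schema = {"Title", "Author", "Created", "Invoice Number"}

# Values the extraction rule filled in under the generic schema.
extracted = {
    "Title": "Q3 report",
    "Author": "J. Smith",
    "Created": "2021-03-04",
    "Keywords": "finance",
}

carried = change_schema(extracted, invoice_schema)
still_empty = invoice_schema - carried.keys()

print(carried)      # values for the common fields survive the schema change
print(still_empty)  # fields that were never extracted still need a value
```

In this model, the more fields the preferred schema has in common with the generic one, the less is left to enter by hand after the change.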
Extracting email headers in document management software
We use an alternative to Outlook for our emails. How do I change the email extraction rule so it fits this email extension?
The Email extraction rule handles .msg files (the default for Outlook emails) out of the box. If you want to use another format, you can build a File Properties Rule that covers the alternative extension but maps to an email-centric schema. A typical Email Extraction Rule maps .msg files to an Email schema.
A File Properties Rule can map .eml files (the default for Gmail) to the same Email schema.
By creating a custom File Properties rule, you can perform all the same actions as the email extraction rule.
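To see what kind of header information an .eml file carries, and what a header-to-field mapping looks like, here is a small sketch using Python's standard `email` module. FileHold rules are configured in the interface rather than in code, so this illustrates the idea only; the schema field names in `HEADER_TO_FIELD`, the `headers_to_metadata` function, and the sample message are assumptions made for the example.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical mapping from email headers to fields of an Email schema.
HEADER_TO_FIELD = {
    "From": "Sender",
    "To": "Recipients",
    "Subject": "Subject",
    "Date": "Sent Date",
}


def headers_to_metadata(raw_eml):
    """Read the headers of an .eml message and map them to schema fields."""
    msg = message_from_string(raw_eml)
    metadata = {}
    for header, field_name in HEADER_TO_FIELD.items():
        value = msg[header]
        if value is not None:  # skip headers the message does not have
            metadata[field_name] = value
    if "Sent Date" in metadata:
        # Normalize the RFC 5322 date string to an ISO calendar date.
        sent = parsedate_to_datetime(metadata["Sent Date"])
        metadata["Sent Date"] = sent.date().isoformat()
    return metadata


SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Signed contract\n"
    "Date: Mon, 05 Apr 2021 10:15:00 +0000\n"
    "\n"
    "Body text.\n"
)

print(headers_to_metadata(SAMPLE))
```

The mapping table plays the role of the rule: changing the file type changes where the values come from, while the target schema fields stay the same.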
Automate file property rules
Can extraction rules work with other functions, like auto-filing, auto-tagging, or Events?
Yes, and together they make a very powerful suite of tools. Let’s say I have some digital archives of Excel files I’m looking to bring into a controlled environment like FileHold. The creation date of the original files needs to be preserved, so I can create an extraction rule for these files and map them to a schema built for the task. First, an auto-filing rule can use the creation date to file the document into a folder by year; second, an auto-tagging rule can define further vital metadata for the documents when they arrive in their auto-filed folder; third, an Event can send the document to the Archive, or convert it to a record, using that extracted creation date. These are just some of the tools that can work with existing content to minimize user interaction with documents or records and make processing more efficient.
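The chain described above (extract the creation date, file by year, tag by folder, then trigger an event) can be sketched as three small functions. This is a conceptual model only; the folder layout, the tag values, the seven-year retention period, and all function names are hypothetical and do not come from FileHold.

```python
from datetime import date


def auto_file(created):
    """Auto-filing: choose a folder from the extracted creation date."""
    return f"Excel Archive/{created.year}"


def auto_tag(folder):
    """Auto-tagging: add metadata based on the folder the document landed in."""
    return {"Source": "Legacy Excel import", "Year Folder": folder.split("/")[-1]}


def is_due_for_archive(created, today, retention_years=7):
    """Event: true once the document is at least `retention_years` old."""
    anniversary_pending = (today.month, today.day) < (created.month, created.day)
    years_elapsed = today.year - created.year - anniversary_pending
    return years_elapsed >= retention_years


created = date(2015, 6, 30)          # value supplied by the extraction rule
folder = auto_file(created)
tags = auto_tag(folder)

print(folder)
print(tags)
print(is_due_for_archive(created, date(2022, 6, 29)))  # one day short of 7 years
print(is_due_for_archive(created, date(2022, 6, 30)))  # exactly 7 years old
```

Every step reads the same extracted creation date, which is why preserving it at import time matters.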
To learn more about Extraction Rules, please have a look at our Knowledge Base, or send questions to [email protected].
Chris Oliver brings his twenty years of experience in management in the entertainment industry to FileHold Systems as the Client Training and Retention Advocate. To learn more about how FileHold DMS can work for you, contact him at [email protected].