Automatic extraction of XML node values from Microsoft Word content controls

FileHold recommends using PDF forms instead of Word forms as Microsoft no longer supports the Content Control Toolkit.

You can create an “XML Node Extraction Rule” for a Microsoft Word document (e-Form) that has content controls. After the document has been properly configured, the values in the content controls can be extracted into the metadata fields when the e-Form is added to FileHold.

Using Microsoft Word 2007 or higher, you can create forms using the content controls available in Microsoft Word developer mode.

Once an e-Form is created in Microsoft Word, use the Word 2007 Content Control Toolkit to map the controls on the form to the custom XML parts. The free toolkit was made by Microsoft and was originally published here: http://dbe.codeplex.com/

The Content Control Toolkit has been archived by Microsoft. To install, follow these instructions: https://social.msdn.microsoft.com/Forums/en-US/bb33d060-49a6-407d-a003-6609727b8be8/codeplex-archive-files-how-to-use-them

Extraction rules work only when documents are added through the FileHold Desktop Application (FDA), Watched folders, or Automatic Document Importation.

Once a Word form has its XML nodes mapped and given a unique namespace in the toolkit, you can then create the extraction rule in FileHold. You use the Microsoft Word e-Form that was mapped as the template.

After the extraction rule is added in FileHold, the form can be made available in FileHold as a template for download. Users get a copy of the form, fill it out, save it as a new file, and add it to FileHold. When the form is added to FileHold, the mapped fields on the form are automatically extracted to the metadata fields.

Extraction rules can be used in conjunction with Import Jobs (Automatic Document Importation). The extraction rules are automatically applied when an import job processes documents on the server.

WARNING: XML Node Extraction should be configured by someone who is familiar with using the Developer Tools in Microsoft Word, writing XML, and the Content Control Toolkit. If you require assistance with setting this feature up, please contact [email protected] for a quote.

Watch the Using XML Node Extraction Rules with E-Forms training video.

The following are the steps to creating an XML Node extraction rule:

Image
How to set up Microsoft Word extraction rules

Step 1: Create an e-form in Microsoft Word using developer tools

In the first step, you will need to create an e-Form using the Developer ribbon in Microsoft Word. Use the content controls in the e-Form fields as this is the information that will get extracted into the metadata fields of the document schema.

The following is an example of an e-Form created in Microsoft Word. You can see where the content controls are in the Invoice on the right side that say "Click here to enter text". On the Invoice on the left side, values have been entered into the content controls such as invoice number, date, total and so on. These are the values that will be extracted into the metadata fields.

Be sure the document is saved as a .docx file.

Image
Microsoft Word form example

This help article is not going to explain how to create e-Forms using Microsoft Word. For more information on creating content controls in Microsoft Word, see the Microsoft Word online help.

Step 2: Use Content Control Toolkit to map "XML nodes" to e-form content controls

As previously mentioned, the Microsoft Word e-Form requires some additional configuration before the values from the content controls can be extracted. After the e-Form is created, the second step is to use the Word 2007 Content Control Toolkit to map the content controls in the e-Form to the custom XML nodes created in the toolkit. The free toolkit has been archived by Microsoft. It is a stand-alone, lightweight tool that opens any Word Open XML document (e.g., .docx) and lists all of the content controls inside it.

In the toolkit, you write XML that contains the “XML nodes” to be mapped to the content controls on the e-Form, and you assign that XML a unique namespace. The XML nodes define which content control values will be extracted from the e-Form to the metadata fields. The unique namespace is required in order to create the unique extraction rule in the document management software.

After you create the XML nodes, you drag and drop them onto the content controls to "bind" them together. Once they are "bound", you save the document and use it to create the extraction rule in the document management software.
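
As an illustration of what such XML might look like, the following Python sketch defines a small custom XML part and checks that it is well-formed, similar in spirit to a syntax check. Only the root format <form xmlns="http://youruniquenamespace"> comes from this article; the node names invoice_date and total and the helper describe_custom_xml are hypothetical and chosen for the invoice example.

```python
import xml.etree.ElementTree as ET

# Hypothetical custom XML part for an invoice e-Form.
# Replace the namespace with your own unique value.
CUSTOM_XML = """<form xmlns="http://youruniquenamespace">
  <invoice_number />
  <invoice_date />
  <total />
</form>"""


def describe_custom_xml(xml_text):
    """Return (namespace, [node names]) for a custom XML part.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_text)
    # ElementTree represents namespaced tags as "{namespace}localname".
    if not root.tag.startswith("{"):
        raise ValueError("root element has no namespace")
    namespace = root.tag[1:].split("}", 1)[0]
    nodes = [child.tag.split("}", 1)[-1] for child in root]
    return namespace, nodes


if __name__ == "__main__":
    ns, nodes = describe_custom_xml(CUSTOM_XML)
    print("Namespace:", ns)
    print("Nodes:", ", ".join(nodes))
```

The text of CUSTOM_XML could then be pasted into the Edit View tab of the toolkit. The script itself does not modify any Word document.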

To map the XML nodes to the content controls

  1. Download the Content Control Toolkit.
  2. In the toolkit, open the Microsoft Word e-Form you created in Step 1. The toolkit lists all the content controls in the e-Form.
Image
Content control toolkit mapping
  3. Create an XML file that contains a unique namespace and the XML nodes that you want to bind to the content controls. The namespace must be unique and written in the following format:

    <form xmlns="http://youruniquenamespace">

    You can do this in the Content Control Toolkit > Custom XML Parts > Edit View tab or in another application such as Notepad and copy it over into the Edit View tab. For more information, see the Help in the Content Control Toolkit.

    In the example below, the XML was written in the Content Control Toolkit > Edit View tab:
Image
Content control toolkit mapping
  4. Validate the XML code using the Check Syntax button (Checkmark button). Once your XML code has been created and is valid, you can bind the content controls to the XML.
  5. Click on the Bind View tab.
  6. Bind the Custom XML Parts to the Content Controls by dragging and dropping each XML node onto its content control. Drag and drop slowly to ensure that the items are "bound". The example below shows how to bind the XML nodes to the content controls in the Content Control Toolkit by dragging and dropping.

WARNING: This step in the process can be a bit "finicky". This behavior comes from the Content Control Toolkit, a third-party product that FileHold cannot change.

Image
Mapping content controls to xml parts
  7. Save and Close the e-Form after the XML nodes have been bound to the content controls.
  8. To ensure that the form has been mapped correctly, open the form again in the Content Control Toolkit.
  9. In the Namespace area, click the down arrow to ensure there is only one namespace in the list. If there are additional namespaces, delete them.
  10. Review the bound content controls and ensure the correct XML node has been mapped to each one.
  11. Save and Close the e-Form once you are sure everything is correct.
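
The verification in the last few steps can also be approximated outside the toolkit. The following Python sketch opens a saved .docx file as a ZIP archive, collects the root namespaces of its custom XML parts, and reports the XPath that each content control in the document body is bound to. It assumes the usual Open XML layout (custom XML parts stored as customXml/item<N>.xml, the body in word/document.xml, and bindings recorded in w:dataBinding elements). The helper name inspect_eform is hypothetical and is not part of FileHold or the toolkit.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def inspect_eform(docx):
    """Return (sorted namespaces, [(control title, bound XPath), ...]).

    `docx` is a path or a binary file-like object for a .docx file.
    Unbound controls are reported with an XPath of None.
    """
    namespaces = set()
    bindings = []
    with zipfile.ZipFile(docx) as zf:
        for name in zf.namelist():
            # Data parts are item<N>.xml; itemProps<N>.xml only describes them.
            if not (name.startswith("customXml/item") and name.endswith(".xml")):
                continue
            if name.startswith("customXml/itemProps"):
                continue
            root = ET.fromstring(zf.read(name))
            if root.tag.startswith("{"):
                namespaces.add(root.tag[1:].split("}", 1)[0])

        body = ET.fromstring(zf.read("word/document.xml"))
        # Each content control is a w:sdt element; its properties live in w:sdtPr.
        for sdt in body.iter(f"{{{W_NS}}}sdt"):
            alias = sdt.find(f"{{{W_NS}}}sdtPr/{{{W_NS}}}alias")
            binding = sdt.find(f"{{{W_NS}}}sdtPr/{{{W_NS}}}dataBinding")
            title = alias.get(f"{{{W_NS}}}val") if alias is not None else None
            xpath = binding.get(f"{{{W_NS}}}xpath") if binding is not None else None
            bindings.append((title, xpath))
    return sorted(namespaces), bindings
```

For example, calling inspect_eform("invoice_form.docx") (a hypothetical file name) and finding more than one namespace, or a control whose XPath is None, would suggest repeating the checks above in the toolkit. The sketch only reads the document body, so controls in headers or footers are not reported.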

Step 3: Create XML node extraction rule in FileHold

The next step is to create the XML Node Extraction Rule in the document management software. When creating the rule, you select the mapped Microsoft Word e-Form document as the template to create the rule from. The unique namespace that was given to the document in the Content Control Toolkit allows the extraction rule to recognize that the values in that document can be extracted. You can create XML Node extraction rules for as many documents as you like, as long as the namespace for each document is unique.
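
Because each rule depends on its namespace being unique, it can help to check a set of forms for accidental reuse before creating rules. The following Python sketch takes the custom XML text of several forms and reports any namespace used by more than one of them. The form names, namespaces, and the helper find_shared_namespaces are hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def find_shared_namespaces(forms):
    """Return {namespace: [form names]} for namespaces used more than once.

    `forms` maps a form name to the text of its custom XML part.
    """
    by_namespace = defaultdict(list)
    for form_name, xml_text in forms.items():
        root = ET.fromstring(xml_text)
        # Tags look like "{namespace}localname"; no brace means no namespace.
        if root.tag.startswith("{"):
            namespace = root.tag[1:].split("}", 1)[0]
        else:
            namespace = ""
        by_namespace[namespace].append(form_name)
    return {ns: names for ns, names in by_namespace.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical forms: two of them reuse the same namespace by mistake.
    forms = {
        "invoice.docx": '<form xmlns="http://example.com/forms/invoice"/>',
        "order.docx": '<form xmlns="http://example.com/forms/invoice"/>',
        "leave.docx": '<form xmlns="http://example.com/forms/leave"/>',
    }
    for ns, names in find_shared_namespaces(forms).items():
        print(f"{ns} is shared by: {', '.join(names)}")
```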

In the example, a specific schema called “Pet Store Supply – Invoice” was created to contain the metadata fields that will be extracted from the e-Form. When creating the XML Node extraction rule, you map the metadata field names in the schema to the “XML nodes” created in the Content Control Toolkit. Notice that the unique namespace is displayed in the Select XML Node window.

To create the XML Node Extraction rule

  1. Do one of the following:
  • In the FDA, log in as a library administrator and go to Tools > Extraction Rules.
  • In the Web Client, go to Administration Panel > Library configuration > Extraction Rules.
  2. Click Add XML Nodes Rule.
  3. In the Select Template File window, select the e-Form you configured in Step 2 using the Content Control Toolkit.
  4. In the XML Nodes Rule window, enter a name for the rule.
  5. Enter a description for the rule (optional).
  6. To enable the rule, ensure the Rule is Enabled check box is selected.
  7. In the Document Schema list, select the schema that is to be used for this rule. You may need to create a new schema for the document type.
  8. Map the Source field to the Destination Metadata field. Click ... to select the XML Node from the list. Ensure that the unique namespace is selected in the Select XML Node window. For example, map the <invoice_number /> XML node to the Invoice Number metadata field.
Image
XML nodes extraction rule
  9. When all the fields are mapped, click OK.
  10. The Extraction Rule will appear in the list of extraction rules.
Image
Extraction rules list
  11. **Log off and log back into FileHold.** Do not skip this step.

Step 4: Add the form to the FileHold library

A Library Administrator or someone with sufficient permissions can add the mapped Microsoft Word e-Form to the document management system. When the form is added to FileHold, the rule automatically recognizes the e-Form (due to the unique namespace) and the metadata field values are extracted from the form. The form can be set to read-only so that users can only download it.

Step 5: Download and fill out the e-form

Users can download the e-Form and fill out the information. When the filled-out e-Form is added to FileHold, the rule automatically recognizes the e-Form and extracts the values in the content controls into the metadata fields. In the example below, the e-Form has been filled out and the contents of the content controls on the e-Form have been extracted into the corresponding metadata fields.

Image
Metadata extraction from XML nodes in Word form
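
To preview which values a filled-out form carries before adding it to FileHold, the custom XML part can be read directly. The following Python sketch is only an illustration and does not describe how FileHold performs extraction internally. It assumes that Word has written the bound content control values into the custom XML part stored as customXml/item<N>.xml. The helpers read_form_values and to_metadata and the FIELD_MAP dictionary are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical mapping of XML node names to metadata field names,
# mirroring the Source and Destination columns of the extraction rule.
FIELD_MAP = {"invoice_number": "Invoice Number"}


def read_form_values(docx, namespace):
    """Return {node name: text} from the custom XML part in `namespace`.

    Returns an empty dict when no custom XML part uses that namespace.
    """
    with zipfile.ZipFile(docx) as zf:
        for name in zf.namelist():
            if not (name.startswith("customXml/item") and name.endswith(".xml")):
                continue
            if name.startswith("customXml/itemProps"):
                continue
            root = ET.fromstring(zf.read(name))
            if not root.tag.startswith("{" + namespace + "}"):
                continue
            values = {}
            for node in root:
                local_name = node.tag.split("}", 1)[-1]
                values[local_name] = (node.text or "").strip()
            return values
    return {}


def to_metadata(values, field_map):
    """Rename node values to metadata field names, dropping unmapped nodes."""
    return {field_map[k]: v for k, v in values.items() if k in field_map}
```

For example, to_metadata(read_form_values("filled_invoice.docx", "http://youruniquenamespace"), FIELD_MAP) would return the invoice number under the Invoice Number field name, where filled_invoice.docx is a hypothetical file name.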

 

NOTE: There is a fee if you require assistance with setting this feature up. Please contact [email protected] for a quote.