Excluding/Including File Types to be Indexed

Excluding File Types from Full-text Search

Any type of file can be added to FileHold, and normally FileHold indexes the contents of every file it is given in the full-text search (FTS) engine. This is desirable in most cases, but some files increase the size and complexity of the FTS index without adding any significant value. An unnecessarily large or complex index could affect system performance. Examples of files that probably should not have their contents indexed include binary files, database files, and other similar files.

A Windows server administrator can define file types whose contents will be excluded from the FTS index. These documents will still have their metadata and properties indexed; only the contents will be excluded from the index.

The settings for file exclusions are maintained in the web.config file, which is typically found in the following location:

C:\Program Files\FileHold Systems\Application Server\FullTextSearch

Modify the following section in the web.config file:

<add key="ExcludedFilesList" value="" />

This is an XML file and it can be edited with any text editor. An XML editor may provide better results as it will help ensure the correct XML syntax is maintained.
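If no XML editor is at hand, a short script can confirm that the file is still well-formed after an edit and show the value currently in effect. The following is a minimal sketch, not a FileHold tool: the helper name find_setting and the sample document (including its appSettings wrapper) are assumptions made for illustration. Typographic quotes pasted around an attribute value are a common cause of the parse error it reports.

```python
import xml.etree.ElementTree as ET


def find_setting(config_text, key):
    """Return the value of the <add> element with the given key, or None.

    Raises xml.etree.ElementTree.ParseError if config_text is not
    well-formed XML (for example, curly quotes around an attribute value).
    """
    root = ET.fromstring(config_text)
    # Search the whole tree so the sketch does not depend on where the
    # <add> element sits in the real web.config.
    for element in root.iter("add"):
        if element.get("key") == key:
            return element.get("value")
    return None


# Hypothetical stand-in for the relevant part of web.config.
SAMPLE = """<configuration>
  <appSettings>
    <add key="ExcludedFilesList" value="*.ZIP;*.RAR;" />
  </appSettings>
</configuration>"""

try:
    print(find_setting(SAMPLE, "ExcludedFilesList"))
except ET.ParseError as error:
    print(f"web.config is not well-formed: {error}")
```

To check the real file, read it with ET.parse(path).getroot() instead of ET.fromstring, and work on a copy rather than the live file.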

The exclusion list also applies to files inside compound files, such as a zip archive. For example, if you exclude *.XLSX, then an XLSX file inside a zip archive is not indexed. Each entry consists of an asterisk (*), a period (.), and the extension, and each entry is followed by a semicolon (;). For example, “*.ZIP;*.RAR;*.MDB;*.XLSX;” would exclude ZIP, RAR, MDB, and XLSX files from being indexed:

<add key="ExcludedFilesList" value="*.ZIP;*.RAR;*.MDB;*.XLSX;" />
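Because a missing period or semicolon is easy to overlook, the value can be generated rather than typed. The sketch below assumes only the entry format described above (asterisk, period, extension, semicolon). The function name build_file_list is hypothetical, and the code leaves letter case untouched because the article does not say whether matching is case sensitive.

```python
def build_file_list(extensions):
    """Build a list value such as '*.ZIP;*.RAR;' from a list of extensions."""
    entries = []
    for ext in extensions:
        # Accept 'ZIP', '.ZIP' or '*.ZIP' and reduce each to the bare extension.
        cleaned = ext.strip().lstrip("*").lstrip(".")
        if not cleaned:
            continue  # skip blank input rather than emit a bare '*.;'
        entries.append(f"*.{cleaned};")
    return "".join(entries)


value = build_file_list(["ZIP", ".RAR", "*.MDB", "XLSX"])
print(f'<add key="ExcludedFilesList" value="{value}" />')
```

The printed line can be pasted over the existing ExcludedFilesList entry in web.config.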

The change to the configuration takes effect in the application server as soon as the web.config file is saved. However, it only affects new documents added to the index. If a file with an excluded type already exists in the system, the index can be manually rebuilt to exclude the file.

TIP: There may be circumstances where it is desirable to exclude a single file, but index all other similar files. Since the exclusion rules apply to all files of the same type, you could create an exclusion type called ".donotindex". Then, whenever you have a file that you do not want included in the index, you can append ".donotindex" to its file name before adding it to FileHold.

Limiting Full Text Indexing to Specific File Formats

The full-text search index can be limited to certain file types in the system. The contents of files with any other extension are then not indexed. This can prevent a potentially very large full-text search index that adds little search value and causes search time-outs.

A Windows server administrator can define file types whose contents will be included in the FTS index. Excluded documents will still have their metadata and properties indexed; only the contents are excluded from the index.

The settings for file inclusions are maintained in the web.config file. This file will typically be found at the following location: C:\Program Files\FileHold Systems\Application Server\FullTextSearch

Modify the following section in the web.config file:

<add key="IncludedFilesList" value="" />

This is an XML file and it can be edited with any text editor. An XML editor may provide better results as it will help ensure the correct XML syntax is maintained.

The file extensions to be indexed exclusively are listed in the following entry. Each entry consists of an asterisk (*), a period (.), and the extension, and each entry is followed by a semicolon (;). If the value is empty, all file types (except those in the ExcludedFilesList entry) are indexed. For example, “*.MSG;*.DOCX;*.PDF;” would index only MSG, DOCX, and PDF documents:

<add key="IncludedFilesList" value="*.MSG;*.DOCX;*.PDF;" />

For compound files, such as a zip archive, all files inside the archive are assumed to be included in the index unless specified otherwise. To index only certain file types inside a zip archive, the format is “*.ZIP>*.DOC; *.ZIP>*.XLS; *.ZIP>*.PDF;”. This would index only DOC, XLS, and PDF files that are inside a zip archive.

For MSG or EML files, to index the email itself and only certain types of attachments, the format is “*.MSG;*.MSG>*.MSG;*.MSG>*.DOCX;*.MSG>*.PDF;”. This would index only the emails and MSG, DOCX, and PDF attachments.
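The ">" notation is easier to review when each entry is split into its container and the file type allowed inside it. The following sketch only illustrates how to read the notation described above. It is not FileHold's own parser, and the function name parse_included_list is hypothetical.

```python
def parse_included_list(value):
    """Split an IncludedFilesList value into (container, pattern) pairs.

    container is None for a top-level entry such as '*.PDF'; for an entry
    such as '*.ZIP>*.DOC' it is '*.ZIP' and the pattern is '*.DOC'.
    """
    rules = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # ignore the empty piece after the final semicolon
        if ">" in entry:
            container, pattern = (part.strip() for part in entry.split(">", 1))
        else:
            container, pattern = None, entry
        rules.append((container, pattern))
    return rules


for container, pattern in parse_included_list(
    "*.MSG;*.MSG>*.MSG;*.MSG>*.DOCX;*.MSG>*.PDF;"
):
    scope = f"inside {container}" if container else "top level"
    print(pattern, scope)
```

Listing the pairs this way makes it easy to spot a missing top-level entry, such as an attachment rule for *.MSG without *.MSG itself.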