How FileHold searches for documents
FileHold stores information necessary for searching in two locations: the system's SQL databases and full text indexes. The search tool allows the user to search both locations transparently in a single search. Generally, searching the SQL databases is faster than searching full text, but searching full text is much more flexible.
In addition to the full text content of document files, the full text index contains the current metadata values and the current document name for each document version. Since permission-related information is stored in the SQL databases, searching the full text index is always followed by a permissions check in the database.
Regardless of how or where users search, they can only retrieve documents they have permission to access. For example, for a user to see a document in their search results, they must be a member of the cabinet, folder, and schema that the document belongs to. Visibility of a document may also be affected by the hidden until approved feature.
Some of the options described on this page can only be changed by customers on self-hosted systems. FileHold Cloud customers should contact support@filehold.com if changes on the application server are required.
Search processing overview
The document management system conducts searches for documents as follows:
- The "Contains FTS" operator searches metadata, full text file content, and file properties. It searches whole words only, but you can search for partial words using wildcards in addition to creating very complex search expressions.
- The "Contains DB" operator searches only metadata and version properties. Unlike the full text index, it has no concept of words, but partial data can be searched using wildcards.
- Searching specific metadata fields or version properties typically produces more accurate results than a full text search as this information tends to be curated as documents are added.
- You can often get faster search results if you do not use the CONTAINS in FTS operator, since the search then consists of a single SQL query rather than a full text search followed by an SQL query.
- When the user invokes search functionality by right-clicking on a cabinet, drawer, folder group, or folder, the search will default to including only documents in that portion of the library.
- When your full text index search is constrained by a "folder equals" or "document schema equals" condition, performance is often better than with a general full text search because a list of possible documents is provided to the full text engine along with the other search criteria.
- By default, the contents of the library archive are not included in search results. Select the Include in Archive check box in the Advanced Search options to expand the search to include the Library Archive.
- If using a simple search or the Boolean search, the search engine ignores all document fields such as those created by file properties.
- Documents that have been soft or hard deleted from the system are not included in search results unless they were deleted by converting them to an offline document. In this case, the metadata for the document will be searchable.
- By default, only the latest version of a document is searched. The document usage history and document version history are not included in the search scope. To expand the search to include all document versions, select the Include All Document Versions check box in the advanced search options.
- Search results always come from the contents of folders. My FileHold results, search results, virtual folders, and document tray contents are not searched as these are only temporary links to the documents in library or library archive folders.
- The system stores all changes to metadata field values but only the current version is searched by default. To search using old metadata field values, select the Search Using Historical Metadata Fields check box in the advanced search options. Metadata fields that have been deleted from the system configuration are no longer available for searching.
When you create a search with multiple criteria, any search criteria using the CONTAINS in FTS operator are executed first. The results from full text index searches are then combined with the results from any other search criteria to produce the final result set.
If the CONTAINS in FTS search criteria are very broad, the search may take a long time even if other search criteria would narrow it. The full text index does not know about the document's location in the library or archive, its schema, or whether or not it is the current document version. So, regardless of settings for the final result set, all possible documents are returned in the first phase. If the full text search criteria are too broad or the system does not have sufficient performance, the search may time out and produce no results.
Regardless of how many CONTAINS in FTS search conditions are present, they are all combined into one for execution. By default, this first phase of a search using a CONTAINS in FTS operator is not limited in the number of documents that can be returned. CONTAINS in FTS searches that can return very large numbers of documents can have a negative impact on overall system performance. This default can be adjusted.
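The two-phase processing described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not FileHold's implementation: the `Doc` class, the in-memory `index`, and the function names are all hypothetical, and a real system would run the second phase as an SQL query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical, simplified document record for illustration only.
    doc_id: int
    folder: str
    text: str


def full_text_phase(index, word):
    """Phase 1: return the ids of every document whose text contains the word.

    The index knows nothing about folders or permissions, so every match
    in the whole system is returned as an intermediate result.
    """
    return {doc_id for doc_id, words in index.items() if word in words}


def database_phase(docs, candidate_ids, folder, allowed_folders):
    """Phase 2: apply the remaining criteria and the permission check."""
    return sorted(
        d.doc_id
        for d in docs
        if d.doc_id in candidate_ids
        and d.folder == folder
        and d.folder in allowed_folders
    )


docs = [
    Doc(1, "Invoices", "payment received for invoice"),
    Doc(2, "Contracts", "payment terms and conditions"),
    Doc(3, "Invoices", "shipping notice"),
]
# Toy stand-in for the full text index: document id -> set of words.
index = {d.doc_id: set(d.text.split()) for d in docs}

candidates = full_text_phase(index, "payment")
results = database_phase(docs, candidates, "Invoices", {"Invoices"})
print(candidates, results)
```

Here the first phase returns two candidates even though the folder condition leaves only one document in the final result. This is why a broad full text condition can be costly regardless of the other criteria.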
Full text searching is all about words
When using a full text condition in your search criteria, you will be searching for what the full text index thinks is a word. The definition of a word is controlled by a number of factors.
- By default the longest word is 32 characters.
- Spaces separate words, but the definition of spaces includes most punctuation characters by default.
- Words are made of letters, numbers, and underscore characters by default.
- You can search portions of words using a technique called truncation. This allows for wildcard characters to be inserted before, after, or inside words to represent any normal word characters.
- Stemming can be used to automatically include grammatical variations of a word in search results.
- Fuzzy searching is a special type of wildcard that allows you to more easily select for misspelled words.
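As a rough illustration of the default word rules above, the following sketch splits text into words made of letters, numbers, and underscores, treating everything else as a separator. It is an approximation only. The names are hypothetical and hyphens are simply treated as separators here, whereas the real engine handles them according to its hyphen options. How the engine treats words longer than the 32-character default is not covered on this page, so the sketch only reports them.

```python
import re

MAX_WORD_LENGTH = 32  # default longest word, as stated in the list above

# Letters, digits, and underscore form words; any other character separates them.
WORD_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def split_into_words(text):
    """Return the words found in text under the simplified default rules."""
    return WORD_PATTERN.findall(text)


def overlong_words(words):
    """Return the words that exceed the default maximum word length."""
    return [w for w in words if len(w) > MAX_WORD_LENGTH]


words = split_into_words("Invoice_2024, due: net-30 (final).")
print(words)
```

With these rules, "Invoice_2024" stays a single word because the underscore counts as a letter, while "net-30" is split in two.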
Limiting intermediate full text search results
Searching very common words in a full text index can produce an explosion of intermediate results. If the number of results exceeds what the system can resolve in a reasonable period of time, the search will time out. Adding wildcards to search criteria can make it even easier to get an excessive number of results. This extra processing cost may be enough to impact the performance of other users. For example, an anonymous portal user could create a search that would return 100000 intermediate results even though they do not have access to any of them. Both application server and SQL server processing would still be impacted.
Different user groups can be prevented from creating expensive searches by disabling their ability to create an adhoc search. This would force them to use only searches that have been curated by an administrator through a saved search.
However, it may be desirable to leave adhoc searches open to users that may create an expensive search by accident. For this case, administrators can limit the total number of intermediate results the system will allow to be processed. If the user's search would cause the limit to be exceeded, they would get a warning dialog and either have to create a more refined search or ask the administrator to allow more intermediate results.
The limit is set by an entry under <appSettings> in the web config file in C:\Program Files\FileHold Systems\Application Server\FullTextSearch:
<add key="LimitNumberOfEntriesToReturn" value="0" />
The default value is 0 and this means the intermediate results are not limited.
If a full text search exceeds the limit then a message is displayed:
“Your search with the CONTAINS in FTS operator would consume more server resources than allowed by your system administrator. Consider narrowing your search conditions to reduce the possible results. Your search would have returned {x} documents and your system administrator has set a limit of {x} documents.”
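The rule behind this setting can be sketched as a simple check. This is a minimal illustration under stated assumptions, not the product's code: the function name is hypothetical, and only the meaning of the value 0 and the "exceeds the limit" condition come from the description above.

```python
def exceeds_intermediate_limit(result_count, limit):
    """Return True if a full text search should be rejected.

    A limit of 0 means the intermediate results are not limited.
    """
    return limit > 0 and result_count > limit


# Default configuration (value="0"): even a very broad search is allowed.
print(exceeds_intermediate_limit(100000, 0))
# With an example limit of 5000, the same search would be rejected.
print(exceeds_intermediate_limit(100000, 5000))
```

The value 5000 is only an example. A suitable limit depends on the capacity of the application server and SQL server.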
Noise words (Ignored words) in full text searching
A noise word is a word such as "the" or "if" that is so common that it is not useful in searches. To save time, noise words are not indexed and are ignored in index searches. All single letters are ignored, along with the words listed in the table below.
There may be circumstances where a modified noise word list is required. The noise word list is contained in a text file. FileHold stores the default noise word list file on the application server.
C:\Program Files\FileHold Systems\Application Server\FullTextSearch\dtSearch\noise.dat
When a full text index is first initialized, the noise word list will be copied from this file to the index. It can only be changed by rebuilding the index.
Letter | Noise Words |
A | a, about, after, all, also, an, and, another, any, are, as, at |
B | be, because, been, before, being, between, both, but, by |
C | came, can, come, could |
D | did, do |
E | each, even |
F | for, from, further, furthermore |
G | get, got |
H | had, has, have, he, her, here, hi, him, himself, how, however |
I | i, if, in, indeed, into, is, it, its |
J | just |
L | like |
M | made, many, me, might, more, moreover, most, much, must, my |
N | never, not, now |
O | of, on, only, or, other, our, out, over |
S | said, same, see, she, should, since, some, still, such |
T | take, than, that, the, their, them, then, there, therefore, these, they, this, those, through, thus, to, too |
U | under, up |
V | very |
W | was, way, we, well, were, what, when, where, which, while, who, will, with, would |
Y | you, your |
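The effect of the noise word list can be illustrated with a short sketch. It uses only a small subset of the words from the table, the function name is hypothetical, and the case-insensitive comparison is an assumption made for the example rather than documented behavior.

```python
# A small subset of the default noise words from the table above.
NOISE_WORDS = {"a", "about", "and", "for", "from", "in", "is", "of", "the", "to", "with"}


def indexable_words(words):
    """Drop single letters and noise words, keeping everything else."""
    kept = []
    for word in words:
        if len(word) == 1 and word.isalpha():
            continue  # all single letters are ignored
        if word.lower() in NOISE_WORDS:
            continue  # assumption: the comparison ignores case
        kept.append(word)
    return kept


kept_words = indexable_words(["The", "invoice", "is", "from", "vendor", "X"])
print(kept_words)
```

Because noise words never reach the index, a search for a phrase made up entirely of such words cannot match anything until the list is changed and the index is rebuilt.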
The full text alphabet
For the separation of characters into words for adding to the full text index, it is important to have an alphabet suitable for the documents that will be indexed.
By default, the English alphabet, Arabic numerals, and the underscore character are considered letters in the full text alphabet and only they will be able to make up words. Most punctuation characters are the same as a space and are used to separate words. There are hyphen characters that get treated according to the options for hyphens. Finally, there are a number of characters that are treated as though they do not exist, such as most control characters.
There may be circumstances where a modified alphabet is required. The alphabet is contained in a text file with a section for each category of characters. The same character cannot appear in more than one section of the file. For example, the underscore character could be moved from the [Letters] section of the file and placed in the [Hyphens] or [Spaces] section.
FileHold stores the default alphabet file on the application server.
C:\Program Files\FileHold Systems\Application Server\FullTextSearch\dtSearch\default.abc
Where needed, characters can be inserted using their hexadecimal equivalents in the form \XX.
When a full text index is first initialized, the alphabet will be copied from this file to the index. It can only be changed by rebuilding the index.
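To work out the \XX form of a character before editing the alphabet file, a small helper such as the following can be used. This is a sketch under assumptions: the function name is hypothetical, it assumes exactly two hexadecimal digits (so only characters with codes up to FF), and it writes the digits in uppercase, which this page does not specify.

```python
def to_hex_escape(char):
    """Return the \\XX form of a single character, e.g. '_' -> '\\5F'."""
    code = ord(char)
    if code > 0xFF:
        # Assumption: the two-digit form only covers single-byte characters.
        raise ValueError("character does not fit the two-digit \\XX form")
    return "\\{:02X}".format(code)


underscore_escape = to_hex_escape("_")
print(underscore_escape)
```

Back up default.abc before making changes, and remember that an edited alphabet only takes effect for an index when that index is rebuilt.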