Power searching with dynamic fields and RAW mode

Full text search is a powerful concept, but it can be tricky to get exactly the results you want without manually sifting through every document that matched a simple search criterion to find the one or few results you really wanted. The good news is that there are lots of ways to narrow your search, including the magic of dynamic fields.

All documents indexed in the FileHold full text index have fields that you can call out when specifying search criteria. Some are added specifically by FileHold, such as the document name, the current value of each individual metadata field, and all metadata values combined in one big field. Other fields are added automatically by the full text search engine: for example, the subject of an email, the camera in a JPEG, or a header tag in an HTML file.

If you use the RAW mode for full text searching, you have several options to search for the information in these fields. If you want to find the emails with “URGENT” in the subject, you could specify a search criterion value like Subject::URGENT or Subject contains URGENT. Note that capitalization is not normally important in these queries; I am just using it for clarity.

[Image: FileHold advanced search]

However, the information you need is often not in a nicely prepared field, even though it does exist in a roughly structured format in the document you are looking for. Application forms, purchase orders and engineering documents are all examples of documents like this.

There are also documents like scientific papers that are both complex and regular by convention. For example, they typically start with an "abstract" section which is followed by a "keywords" section then an "introduction" section. If I am interested in finding scientific papers that have certain keywords, I could use this regular structure to generate a dynamic field in something called a segmented search.

[Image: Search in scientific paper]

In our first example Subject contains URGENT, we saw the form <field> contains <value>. A segmented search will allow us to create a field on-the-fly using the regular structure of the document. In this case, we know that the keyword we are looking for will be between the heading “KEYWORDS” and the heading “INTRODUCTION”. If we are looking for documents with the keyword “sort”, we could form our query like (KEYWORDS to INTRODUCTION) contains SORT.

[Image: FileHold advanced raw search]
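To make the idea of a dynamic field concrete, here is a minimal Python sketch of what the segmented query (KEYWORDS to INTRODUCTION) contains SORT is asking for. It is only a conceptual illustration, not how the search engine works internally: the function name `segment_contains` and the sample text are hypothetical, and a real full text index does not scan documents with regular expressions.

```python
import re

def segment_contains(text, start, end, word):
    """Return True if `word` occurs between the `start` and `end` markers."""
    # The "dynamic field" is everything between the two marker words.
    segment_re = re.compile(
        r"\b" + re.escape(start) + r"\b(.*?)\b" + re.escape(end) + r"\b",
        re.IGNORECASE | re.DOTALL,
    )
    word_re = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    # A document may contain several such segments; a hit in any one counts.
    return any(word_re.search(m.group(1)) for m in segment_re.finditer(text))

# Hypothetical sample document following the usual paper layout.
paper = """Abstract
We compare several approaches to ordering records.
Keywords
sort, merge, complexity
Introduction
Many systems sort data before searching it.
"""

print(segment_contains(paper, "KEYWORDS", "INTRODUCTION", "SORT"))     # True
print(segment_contains(paper, "KEYWORDS", "INTRODUCTION", "records"))  # False
```

The second call returns False because "records" appears only in the abstract, outside the text bounded by the two headings.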

We just created a dynamic field containing all the text between the words “keywords” and “introduction”. Our target words to find do not have to be as simple as in this example. Perhaps we want the keywords to contain sort and algorithm and variations on those words. I could update my criteria to (KEYWORDS to INTRODUCTION) contains (SORT~ and ALGORITHM~).

[Image: FileHold advanced raw search]

I have used the Boolean operator “and” to require both “sort” and “algorithm” to be in the keywords, and added the stemming modifier (~) in case the keywords are “sorted” or “algorithms”.

Much more complex expressions are possible, as is combining multiple dynamic fields in the same query. You can use multiple words on either side of the TO connector. If your field delimiter words include "to", you must enclose it in quotation marks. Finding a regular document format can be harder for files created using optical character recognition when the layout of the document runs both horizontally and vertically, as with multiple columns. Our knowledge base has much more information on searching if you would like to learn more.
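If you generate RAW queries from a script, a few small helpers can keep the syntax consistent. The sketch below simply assembles the query strings shown in this article. All of the function names are hypothetical, the "SHIP TO" and "BILL TO" delimiters are an invented purchase order example, and the code assumes that the quoting rule above means wrapping the whole delimiter phrase in quotation marks.

```python
def quote_if_needed(delimiter):
    """Quote a delimiter phrase that contains the word "to" (assumed rule)."""
    words = delimiter.lower().split()
    return f'"{delimiter}"' if "to" in words else delimiter

def stem(word):
    """Append the stemming modifier so variations of the word also match."""
    return word + "~"

def all_of(*terms):
    """Require every term, joined with the Boolean operator "and"."""
    return "(" + " and ".join(terms) + ")"

def segment(start, end, request):
    """Build a segmented search: (start to end) contains request."""
    return f"({quote_if_needed(start)} to {quote_if_needed(end)}) contains {request}"

query = segment("KEYWORDS", "INTRODUCTION", all_of(stem("SORT"), stem("ALGORITHM")))
print(query)
# (KEYWORDS to INTRODUCTION) contains (SORT~ and ALGORITHM~)

print(segment("SHIP TO", "BILL TO", "VANCOUVER"))
# ("SHIP TO" to "BILL TO") contains VANCOUVER
```

The resulting string is what you would paste into the RAW mode search criterion; the helpers do not talk to FileHold or the search engine themselves.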

The heavy lifting for full text search in FileHold has been provided by dtSearch® for more than 15 years. Their website provides a number of podcasts and podcast digest articles on how to realize the greatest potential from full text search. Take a break with dtSearch and find out how fuzzy search can help you with your OCR'd documents, or what is behind the mysterious "relevancy" value we display in your search results.

 

Russ Beinder

Russ Beinder is the Chief Technology Officer at FileHold. He is an entrepreneur, a seasoned business analyst, computer technologist and a certified Project Management Professional (PMP). For over 30 years he has used computer technology to help organizations solve business problems.