Text Analytics for Enterprise Search
The Essential Components for High Performance Systems

This article is part of the Best Practices White Paper Enterprise Search [May 2008]
Page 1 of 2 next >>


Automatic derivation of meta-information broadens search technique capabilities and can improve results by up to 400%.

When ZyLAB was founded in 1983, full-text retrieval was a new technology whose application and relevance in the marketplace were untested. Although its basic algorithms originated in the 1960s, full-text search was still not broadly viewed as a trustworthy enhancement to traditional key-field searching on meta-information (i.e., information about the information in a file). Acceptance of these new tools was slow to materialize and came about only after heavy market evangelism by early adopters who envisioned their future potential.

But as we know, perceptions change fast in technology. By the late 1990s, the increasing capacity of computers and further sophistication of search algorithms enabled Internet search engines to realize the powerful potential of full-text search. Full-text retrieval had become the de facto standard for search, and, perhaps as a result, a lot of people no longer felt there was a need to add and search additional metadata.

Beyond the Google Standard
Now, an entire generation of tech-savvy computer users exists whose expectations and perceptions of full-text search functionality and performance are almost completely shaped by the "Google effect." In most instances, this type of approach works fine if users only need to find the most appropriate website for answering general questions. Users type in full-text keywords and expect to see the most relevant document or website appear at the top of a result list. Page-link and similar popularity-based algorithms work very well in this context.

But problems arise when users view this searching model as the default approach to finding any kind of information. People who have become conditioned to viewing search through the prism of Google-type approaches often are not interested in, or even aware of, other search techniques. However, much information that is vital to them may never come to light through these basic search techniques alone. If, for example, a user’s search is related to fraud and security investigations, (business) intelligence, or legal or patent issues, other searching techniques are needed that support different sets of issues and requirements, such as the following:

Focusing on optimized relevance. The first requirement of broader search applications is that not only does the best document need to be found, but all potentially relevant documents need to be located and sorted in a logical order, based on the investigator’s strategic needs. "Popularity-based" results generated by Internet search engines cannot support these criteria. Consider all the criminal elements that have vested interests in keeping themselves and their activities anonymous. Many of these people understand how basic search engines work and how to minimize their exposure to these search mechanisms so that they don’t appear at the top of results lists.

Handling massive data collections. Another issue impacting effective strategic searching is how to conduct extensive searches among extremely large data collections. For example, if email collections need to be investigated, these repositories are no longer gigabytes in size; rather, they can be a terabyte or more. When handling this volume of data, plain full-text search simply cannot effectively support finding, analyzing, reviewing and organizing all potentially relevant documents.

Finding information based on words not located in the document. In this context, consider investigators who may have some piece of information concerning an investigation but don’t necessarily know the other details they may be looking for. Who is associated with a suspect? What organizations are involved? What aliases are associated with bank accounts, addresses, phone records or financial transactions? Traditional precision-focused, full-text approaches are not going to help users find hidden or obscure information in these contexts. The searching framework must take into account additional information, which can be obtained by using text analytics to extract meta-information from the original document to provide other insights.

Defining relevancy. When defining a search’s relevance, all factors that could be in play during a specific search instance must be accounted for (in the context of overall goals). Using the investigative example again, consider possible involved parties and what "relevance" would mean to their actual search:

  • Investigators want to comb documents to find key facts or associations (the "smoking gun");
  • Lawyers need to find privileged or responsive documents;
  • Patent lawyers need to search for related patents or prior art;
  • Business intelligence professionals want to find trends and analyses; and
  • Historians need to find and analyze precedents and peer-reviewed data.

All of these instances require not only sophisticated search capabilities but also different context-specific functionalities for sorting, organizing, categorizing, classifying, grouping and otherwise structuring data based on additional meta-information, including document key fields, document properties and other context-specific meta-information. Utilizing this additional information will require a whole spectrum of additional search techniques, such as clustering, visualization, advanced (semantic) relevance ranking, automatic document grouping and categorization.
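To make the idea of derived meta-information concrete, here is a minimal sketch in Python. It is not ZyLAB's method or any product's implementation: the sample documents, the two regular expressions and the function names are all invented for illustration, and real text analytics relies on far richer linguistic and statistical extraction. The sketch extracts a few values (email addresses and account numbers) from each document, indexes documents by those values, and then uses the index to surface associations that a keyword query alone would miss.

```python
import re
from collections import defaultdict

# Hypothetical sample documents; all names, addresses and numbers are invented.
DOCS = {
    "memo-1": "Contact J. Smith at jsmith@example.com about account 4471.",
    "memo-2": "Wire sent from account 4471; confirm with rjones@example.org.",
    "memo-3": "Lunch menu for Friday.",
}

# Deliberately simple patterns (an assumption for this sketch only).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "account": re.compile(r"\baccount (\d+)\b"),
}


def extract_metadata(text):
    """Return {field: sorted list of distinct values} found in the text."""
    found = {}
    for field, pattern in PATTERNS.items():
        values = sorted(set(pattern.findall(text)))
        if values:
            found[field] = values
    return found


def build_index(docs):
    """Map each (field, value) pair to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for field, values in extract_metadata(text).items():
            for value in values:
                index[(field, value)].add(doc_id)
    return index


def related_values(index, docs, field, value):
    """List other extracted values that co-occur with (field, value)."""
    related = set()
    for doc_id in index.get((field, value), ()):
        for other_field, values in extract_metadata(docs[doc_id]).items():
            related.update((other_field, v) for v in values)
    related.discard((field, value))
    return sorted(related)


index = build_index(DOCS)
print(sorted(index[("account", "4471")]))
print(related_values(index, DOCS, "account", "4471"))
```

In this toy collection, a keyword search for "jsmith" would never return memo-2, because the word does not occur there. Following the shared account number through the extracted metadata links the two memos and both email addresses, which is the kind of association the investigative scenarios above depend on.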

New Expectations for Search Performance
The insights mentioned above have been confirmed by a range of scientific research. For instance, studies presented during the TREC 2007 legal conference (http://trec-legal.umiacs.umd.edu/) concluded that traditional keyword and Boolean searches (such as those found in Internet search engines) found only 20% of all the relevant information present in a collection. Again, for many common uses, finding the best 20% of documents is usually enough, but if need dictates that all potentially relevant documents must be found, 20% isn’t going to get the job done.

(This result is in line with the findings of the seminal Blair and Maron study in the ’80s. There, highly qualified lawyers and paralegals thought they had found 75% of the relevant documents in a specific case, but in reality they had found only 20%. One could conclude that the performance of Boolean keyword searching has not improved in 30 years.)
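The 20% figure is a measure of recall: the share of all relevant documents in a collection that a search actually returns. The short Python sketch below shows how recall differs from precision, using an invented collection whose numbers were chosen only to reproduce a 20% recall; they are not taken from the TREC or Blair and Maron data.

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that the search returned."""
    if not relevant:
        raise ValueError("recall is undefined when nothing is relevant")
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of returned documents that are actually relevant."""
    if not retrieved:
        raise ValueError("precision is undefined when nothing is retrieved")
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical collection: documents 1 through 10 are relevant to the case.
relevant_docs = set(range(1, 11))

# Hypothetical Boolean query result: two relevant hits, two irrelevant ones.
retrieved_docs = {1, 2, 101, 102}

print(f"recall:    {recall(retrieved_docs, relevant_docs):.0%}")
print(f"precision: {precision(retrieved_docs, relevant_docs):.0%}")
```

Here half of the returned documents are relevant, so the result list looks reasonable to the searcher, yet eight of the ten relevant documents were never found. That gap between how good a result list looks and how much it misses is consistent with the overestimate described above.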
