'Findability': The Key to Enterprise Search

The enterprise search business has evolved significantly over the last few years as mainstay vendors and some new entrants jockey for market share and the opportunity to enhance the findability of information locked within numerous company data silos. The market itself, however, has grown slowly despite increased interest in search and related capabilities such as classification and navigability of structured data. As a result, many vendors in the space are increasingly distancing themselves from the pure-play search label, presenting themselves instead as unstructured information management tools, "business-process fusion suites" and content management applications built around search.

Nevertheless, it is evident in 2005 that search is ubiquitous in enterprise applications. The recent attention paid to Web search engines, as giants vie for Web-based advertising dollars, has only heightened interest in search. And the essence remains findability. Regardless of how a search application is wrapped, if it does not allow one to locate relevant content relatively easily, it will likely not deliver any significant return on investment, even over the long term.

The Basics: Precision and Recall

To assist with the task of retrieving information, search has been primarily a keyword-based endeavor: searchers attempt, as best they can, to retrieve relevant documents by matching the keywords in a given query against the same words in the documents. Without some means of deriving the correct senses for words both in the query and at indexing time, lexical polysemy and homonymy result in a significant number of misinterpretations and thus blur the quality of retrieval results. We are now seeing more promotion of "smart query modification," which allows for variations in spelling (through sophisticated pattern search) as well as for the use of synonyms of query terms. Most of this development is aimed at achieving the best results while optimizing index size, providing a scalable indexing and querying paradigm, and promoting flexibility through SDKs that allow customization of search and retrieval.
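
To make the idea concrete, here is a minimal sketch of what such query modification might look like, assuming a small hand-built synonym table and simple fuzzy matching against an index vocabulary; commercial engines use far richer knowledgebases and pattern-matching machinery.

```python
from difflib import get_close_matches

# Hypothetical, hand-built synonym table; a real engine would draw on
# domain-specific knowledgebases rather than a hard-coded dictionary.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "purchase": ["buy", "acquire"],
}

# Terms known to the index, used for spelling-variation matching.
VOCABULARY = ["automobile", "vehicle", "purchase", "buy", "acquire", "car", "carburetor"]

def expand_query(terms):
    """Expand each query term with spelling variants and synonyms."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        # Fuzzy matching catches spelling variations ("automoblie" -> "automobile").
        expanded.update(get_close_matches(term, VOCABULARY, n=3, cutoff=0.8))
        # Synonym lookup broadens the query without changing its intent.
        expanded.update(SYNONYMS.get(term, []))
    return expanded

print(expand_query(["automoblie", "purchase"]))
# e.g. {'automoblie', 'automobile', 'purchase', 'buy', 'acquire'}
```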

But information management paradigms are changing. Unit costs of disk space, RAM and processing power are decreasing. The world where recall was of the utmost importance (because it paid to retrieve more than was necessary to ensure finding all relevant information) worked for the smaller document collections of decades past, but cannot possibly apply to the vast quantities of information we must now be able to handle. Precision is becoming ever more important in a context where the simple increase in collection size translates into an increase in the number of relevant documents for a given query. The technological approach must account for this and include metrics to quantify gains in precision, while sacrificing as little as possible of the completeness of result sets.

To better grasp the balancing act between precision and recall, some definitions are in order. It is critical to understand the key role that recall and precision play in qualifying how "good" a given retrieval system is. They are not trivial to measure in a practical sense since—amongst other considerations—enumerating all existing relevant documents in a real document collection is not necessarily a simple task. Precision and recall can be described formulaically; see the examples at the bottom of this page.

It is important for enterprise decision makers opting for a search solution to put questions to vendors regarding algorithmic approaches for computing relevancy of search results and tunability of the ranking mechanisms. Ultimately it is necessary for them to put vendors through test scenarios to validate the quality of search results. Examining what approaches a search company has taken to minimize false positives and false negatives can also help in determining the likely quality of the search tool.

(False positives are hits on documents that include one or many of the keywords, but with the wrong meaning. The result is lower precision, as these false hits inflate the total number of documents returned relative to the number of relevant documents returned. False negatives are misses on documents that do contain the conceptual information expressed in the query. Recall is adversely affected in this case, as key documents are potentially not returned.)
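
The precision and recall definitions given at the bottom of this page are straightforward to compute once the sets of returned and relevant documents are known. A short sketch, assuming documents are identified by simple IDs and that the set of relevant documents has already been estimated:

```python
def precision_recall(returned_ids, relevant_ids):
    """Compute precision and recall for one query.

    returned_ids: documents the IR system returned.
    relevant_ids: all documents judged relevant (in practice an estimate).
    """
    returned = set(returned_ids)
    relevant = set(relevant_ids)
    true_positives = returned & relevant
    precision = len(true_positives) / len(returned) if returned else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall

# 10 documents returned, 7 of them relevant, out of 20 relevant documents overall.
returned = range(1, 11)
relevant = range(4, 24)
print(precision_recall(returned, relevant))  # (0.7, 0.35)
```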

Critical Information Nuggets

Ultimately, what a company's users are most interested in is retrieving documents that best reflect the meaning of their queries. Searching for "genome of the common fruit fly" or "PET scanners" should yield results consistent with the meanings of these expressions. From a formal perspective, this touches upon information theory. From Shannon's Mathematical Theory of Communication (1949), we can infer that documents have an associated entropic value related to the language they are written in. Shannon's measure of entropy came to be taken as a measure of the information contained in a message, as opposed to the portion of the message that is strictly determined (hence predictable) by its inherent structure.

In information theory, entropy is conceptually the actual amount of information in a piece of data. As an example, entirely random byte data has a computable entropy of about 8 bits per byte, since you never know what the next character will be. A long string of "A"s has a computable entropy of 0, since you know that the next character will always be an "A." The entropy of a general-language English corpus tends to be about 1.5 bits per character, once the predictability imposed by the language's structure is taken into account. (We may also say that two things must share some information when they are linked, and that things are isolated if they share no information with others.)
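
These figures can be approximated directly from symbol frequencies. The sketch below estimates entropy per symbol from a frequency count; such a simple unigram estimate reproduces the 8-bits-per-byte and zero-entropy cases exactly, but for English text it comes out around 4 bits per character, and the lower figure of roughly 1.5 bits only emerges from models that also exploit the predictability between neighboring characters.

```python
import math
import os
from collections import Counter

def entropy_per_symbol(data):
    """Shannon entropy in bits per symbol, estimated from symbol frequencies."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy_per_symbol(os.urandom(100_000)))  # ~8.0 bits per byte (random data)
print(entropy_per_symbol("A" * 100_000))        # 0.0 (fully predictable)
print(entropy_per_symbol("the quick brown fox jumps over the lazy dog"))
# roughly 4 bits per character for a frequency-only estimate of English
```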

In other words, entropy helps to measure how "unexpected" the information contained in a message is. This also tells us that if we examine two expressions occurring in a document, the more unexpected one (which can be interpreted as the expression or term that occurs less frequently) likely communicates more information. This can be illustrated with stop-words such as "the," which occur extremely frequently in most English documents but communicate nothing of interest about the meaning of those documents.

Thus, we anticipate that information theory is of great interest in attempting to ascertain how expressions contribute information to the meaning of a document. But more interestingly: which are the more important information nuggets in a document? Getting at such critical information nuggets is essential to ensuring that relevant documents will be retrieved or presented in browsable classifications.
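
One common way to operationalize this intuition, not prescribed here but widely used, is to weight a term by its frequency within a document against its rarity across the collection, a TF-IDF-style score. A minimal sketch, with a tiny hypothetical collection:

```python
import math
from collections import Counter

documents = [
    "the genome of the common fruit fly was sequenced",
    "the fruit market of the town opened early",
    "the PET scanner produced a detailed brain image",
]

def idf(term, docs):
    """Rarer terms are more 'unexpected' and so carry more information."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df) if df else 0.0

def nugget_scores(doc, docs):
    """Score terms in one document by term frequency times rarity."""
    tf = Counter(doc.split())
    return sorted(((t, tf[t] * idf(t, docs)) for t in tf), key=lambda x: -x[1])

print(nugget_scores(documents[0], documents))
# 'genome', 'fly' and 'sequenced' rank near the top; 'the' scores 0.
```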

Keys to Ensuring High Findability of Data

Relevant Synonyms

Accounting for synonyms and closely related terms in searches is a very good way to ensure that no important documents are missed. This should be done in a domain- or topic-specific manner, which yields very high-quality results. As an example, consider technical synonyms in a field such as synthetic plastics, which should be accounted for regardless of which specific term was used in a query.

Word Classes and Entities

An enterprise search application that includes the means to identify specific word classes, such as abbreviations and acronyms, and distinguish these across domains or topics can give a significant boost to an information retrieval system's ability to yield relevant results. Taking this a step further by ensuring that specific entity types are equally identifiable will yield even more promising results. Some such types are easier to recognize (such as e-mail and IP addresses) while other types such as named entities (people, place and organization names) may be a bit more difficult. Many vendors offer strong capabilities on this front, which is well worth verifying.
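
The "easier" word classes mentioned above can indeed be captured with simple patterns; a rough sketch follows for e-mail addresses, IP addresses and capitalized acronyms (named entities such as people and organizations generally require gazetteers or statistical models and are not shown):

```python
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
ACRONYM = re.compile(r"\b[A-Z]{2,6}\b")  # crude: runs of 2-6 capital letters

text = "Contact jdoe@example.com from host 10.0.0.12 about the PET and MRI scanners."

print(EMAIL.findall(text))    # ['jdoe@example.com']
print(IPV4.findall(text))     # ['10.0.0.12']
print(ACRONYM.findall(text))  # ['PET', 'MRI']
```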

The Frequency of Meanings

When dealing with a polysemous term, i.e., one with several meanings, the term will often have a statistically dominant meaning. The oft-cited example of "bank" is a case in point. The term has several meanings, but one very predominant one, related to "financial institution." The impact of this dominant meaning is that any search on an alternate meaning, such as the "bank of a river," that includes the term "bank" will likely yield highly ranked documents about financial institutions, even if we were indeed looking for data to do with river banks. We end up once again with higher recall but poorer precision, so mechanisms that help account for frequency of meaning can be very useful.
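
One simple mechanism for accounting for meaning is to score each candidate sense of a polysemous term against the other words in the query, in the spirit of Lesk-style disambiguation. A sketch, assuming hand-built sense signatures for "bank" (a production system would derive these from a semantic network or a sense-tagged corpus):

```python
# Hypothetical sense signatures for the polysemous term "bank".
SENSES = {
    "bank/financial": {"loan", "deposit", "account", "interest", "branch"},
    "bank/river": {"river", "shore", "water", "erosion", "flood"},
}

def disambiguate(context_words):
    """Pick the sense whose signature overlaps most with the query context."""
    context = set(context_words)
    return max(SENSES, key=lambda sense: len(SENSES[sense] & context))

print(disambiguate(["bank", "of", "the", "river"]))    # bank/river
print(disambiguate(["open", "a", "bank", "account"]))  # bank/financial
```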

Tracking Information Nuggets

Analysis of thesaural and semantic sources, as well as publicly available corpora, indicates that the ratio of multiword expressions to single words in general language is roughly 0.8 (this typically applies to Western European and North American languages, where compounding is not that marked). Multiword expressions can be idioms such as "to kick the bucket" or "bright-eyed and bushy-tailed," or collocations such as "white wine," "warm greeting" or "chemical vapor deposition."

When looking at topic- or industry-specific corpora or content (such as scientific publications), the ratio of multiword expressions to single terms is closer to four, i.e., four times as many multiword expressions as single words are representative of a given topic or industry. This is linked to the relative importance of expressions versus single words in describing a topic, particularly in an industrial or technical field. It is also reflected in information theory, as expressions are expected to contribute more to the meaning of a document; this effect is therefore expected to be more pronounced in technical or industry-specific documents.
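
Multiword expressions of this kind can be surfaced automatically from a corpus. A real system would typically use a statistical association measure such as pointwise mutual information or a log-likelihood test; the sketch below uses plain bigram frequency with stop-word filtering, which is enough to pull recurring expressions out of a toy corpus:

```python
from collections import Counter

corpus = ("chemical vapor deposition improves thin film quality . "
          "the reactor chamber supports chemical vapor deposition of thin films .").split()

STOPWORDS = {"the", "of", "a", "and", "."}

# Count adjacent word pairs, skipping pairs that contain stop-words.
bigrams = Counter(
    (w1, w2) for w1, w2 in zip(corpus, corpus[1:])
    if w1 not in STOPWORDS and w2 not in STOPWORDS
)

# Recurring pairs are candidate multiword expressions.
print([pair for pair, count in bigrams.items() if count > 1])
# candidates: ('chemical', 'vapor') and ('vapor', 'deposition')
```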

The Bigger Picture

In addition to the above, the use of formal knowledge structures such as taxonomies, thesauri and ontologies can also vastly improve findability of content, particularly in cases where direct search approaches do not apply so well. Examples are cases where the user is not exactly certain what he is looking for, or when the need arises to browse topically across a classification of documents in order to learn about a subject. As an example, imagine having to set up a query (in fact, it would be many queries) to run against a document collection containing information regarding the research activities of pharmaceutical companies. How could one query the system in order to answer the question: "Which companies are working on what kind of drugs?" It would be nearly impossible. However, cross-referencing two classifications dealing with pharmaceutical companies and diseases would instantaneously yield, through a two-dimensional table view, the distribution of diseases being worked on by specific pharmaceutical companies.
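
A sketch of such cross-referencing, assuming two hypothetical classification outputs that map document IDs to a company label and to a disease label respectively; pivoting the two yields the two-dimensional view described above:

```python
from collections import defaultdict

# Hypothetical classification output: document ID -> label from each taxonomy.
company_index = {"doc1": "PharmaCo A", "doc2": "PharmaCo A", "doc3": "PharmaCo B"}
disease_index = {"doc1": "diabetes", "doc2": "hypertension", "doc3": "diabetes"}

# Cross-reference the two classifications into a two-dimensional table.
table = defaultdict(lambda: defaultdict(int))
for doc, company in company_index.items():
    disease = disease_index.get(doc)
    if disease:
        table[company][disease] += 1

for company, row in table.items():
    print(company, dict(row))
# PharmaCo A {'diabetes': 1, 'hypertension': 1}
# PharmaCo B {'diabetes': 1}
```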

In summary, companies looking to add search to an existing suite of enterprise applications or to enhance existing search capabilities should ultimately be thinking about findability of their data. In so doing, they should probe vendors and integrators on keys related to linguistic and algorithmic approaches used to help with precision, recall, related terms, word classes, frequency of meanings, handling critical expressions and handling categorization and classification. Vendors should be using knowledgebases, knowledge structures and algorithms to help with these keys. If these points are well addressed by a vendor's offering, it is well worth moving on to the next steps of evaluating other factors related to configuration, ease-of-use and pricing. The road to ensuring return on investment for search lies along the path of findability.


Precision = (number of relevant documents returned, as assessed by the user) / (all documents returned by the IR system)

(The desired behavior is that precision tends toward 1, as this implies no time is wasted looking at non-pertinent documents.)

Recall = (number of relevant documents returned, as assessed by the user) / (all existing relevant documents, which in practice must be estimated)

(The desired behavior is that recall tends toward 1, as this implies the user has not missed any relevant information.)


Alkis Papadopoullos, director of linguistic technologies at Convera, directs the evolution of Convera's language analysis, taxonomy development and discovery products—key components in Convera's RetrievalWare 8 information discovery and analysis platform. He has a master's degree in physics, speaks five languages fluently and has worked in computational linguistics software development for 10 years. Convera is a leading provider of mission-critical enterprise search and categorization solutions. Convera's RetrievalWare solutions provide highly scalable, fast, accurate and secure search across 200 forms of information, in 45 languages. More than 800 customers in 33 countries rely on Convera's search solutions to power a broad range of mission-critical applications. For more information, please visit Convera.
