
Search and retrieval lay the foundation for knowledge discovery

This article appears in the issue May 1998 [Volume 7, Issue 6]

"When I was a child in Philadelphia, my father told me that I didn't need to memorize the contents of the Encyclopedia Britannica; I just needed to know how to find what is in it." --Richard Saul Wurman, "Information Anxiety," 1989

The Web represents the quintessential application for text retrieval--a virtual, limitless library with no inherent intelligent means of effectively cataloging its contents. The unique combination of open, wide-scale information accessibility and content-based retrieval tools has made us realize that "finding what is in it" goes beyond retrieving words and phrases. That realization is behind the evolution of many aspects of the EDMS market to knowledge management. The strategic goal of this information age is not merely to harness information but to leverage the knowledge that it represents. That encompasses not only the information stored within each document, but also the unstated higher level of knowledge that comes from the relationships and trends among documents, and the experiences, perspectives and instinct of the users.

Content-retrieval tools

Consider the functionality provided by simple search engines on the Internet and within intranet environments. While the hyperlinks that embody the Web are one means of information navigation, they are dependent on author-based creation and therefore rely on the subjectivity of another user. Text retrieval is a powerful means to supplement the retrieval capabilities of the hyperlinks, enabling users to quickly and intuitively locate any and all Web pages/documents that display relevance to their interest. But as we all know, when such powerful technology is coupled with the volumes of text contained in the Web, the result is overwhelming. Query results such as "3,652 documents found" offer little insight to the user whose goal is to home in on the most powerful and relevant sites related to his/her query.

That paradox stems from the initial focus on processing power in Web-based search tools. Many users believe that the value of a query engine is its ability to find as many documents as possible, from as many sites as possible, in the shortest time. Focus needs to be placed not only on the processing power of the search engine (recall), but also on its ability to intelligently discern the essence of the query, and to provide accuracy and relevance in the query result set (precision).
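The two terms in parentheses have standard definitions in information retrieval: recall is the share of all relevant documents that the engine returns, and precision is the share of returned documents that are relevant. A minimal sketch of the arithmetic follows; the function name and the document ids are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: the engine returns four documents, two of which matter,
# and misses one relevant document (d7) entirely.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(f"precision={p:.2f} recall={r:.2f}")
```

An engine that returns everything scores perfect recall and very poor precision, which is the "3,652 documents found" problem in numeric form.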

Evaluation of search engines should begin with an inward focus. Users' evaluations must include both a benchmark of the engine's processing power and its approach to searching. If the researchers have a keen understanding of the subject matter, are familiar with the subject's lexicon, and are looking for precise results, finite-based query tools are suitable. Those tools are based exclusively on exact word-based retrieval. Those documents that contain the words specified in the query are identified and retrieved.

Rudimentary searching

Even within that rudimentary research model, evaluation must consider the underlying search methodology of the product. The most commonly used methodology is the inverted word index of the document's content coupled with a Boolean word query facility. That approach is predicated upon exact word matches; there can be no variance (e.g., cultural differences in spelling, misspellings) between the query term and the term(s) within the documents themselves. The Boolean-based query facility can allow you to expand your query--or increase your recall--by joining terms and ideas using the "or" operator, or narrow your query results (i.e., control the precision of your query result) via support for the "and" operator. That basic search strategy is the core of many search products from ZyLab (www.zylab.com), Fulcrum (www.fulcrum.com), Verity (www.verity.com), Excalibur (www.excalib.com), Dataware (www.dataware.com) and InText (www.intext.com). (Many of those products offer functionality far beyond that through the application or integration of additional query/search enhancing tools.)
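To make the mechanism concrete, here is a minimal sketch of an inverted word index with "or" and "and" queries. It is a toy under stated assumptions (whitespace tokenization, lowercase matching, hypothetical function names and sample documents) and does not represent any of the products named above.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search_or(index, *terms):
    """Union of the terms' document sets: broadens the result (recall)."""
    return set().union(*(index.get(t, set()) for t in terms))

def search_and(index, *terms):
    """Intersection of the terms' document sets: narrows the result (precision)."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical three-document collection.
docs = {
    1: "color printing guide",
    2: "colour printing in europe",
    3: "guide to europe",
}
index = build_index(docs)
print(search_or(index, "color", "colour"))     # both spellings must be listed
print(search_and(index, "printing", "guide"))  # only documents with both words
print(search_and(index, "color", "europe"))    # empty: "colour" is not "color"
```

The last query shows the exact-match limitation: document 2 is about color printing in Europe, yet it is missed because of a spelling variant.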

Patterns

There are alternative approaches to the basic search methodology, namely pattern recognition and n-grams. Pattern recognition, the search methodology offered within Excalibur's RetrievalWare, supports word approximations. For example, a search for the word "color" would also retrieve "colour." The engines produce both a document lexicon and a numerical representation of the query terms that represent the patterns within the text. Thus, similar patterns can be retrieved as "approximations" of the word. That approach is helpful when querying in an environment where spelling differences exist, where users and/or authors make spelling mistakes, and/or when input to the document repository is provided by a less than perfect mechanism (e.g., OCR).
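The effect of word approximation can be imitated with a generic string-similarity measure. The sketch below uses Python's standard difflib module as a stand-in; it is not Excalibur's pattern recognition technology, and the function name, lexicon and 0.8 cutoff are assumptions made for illustration.

```python
import difflib

def approximate_terms(query, lexicon, cutoff=0.8):
    """Return lexicon words whose similarity ratio to the query meets the cutoff."""
    return difflib.get_close_matches(query, lexicon, n=5, cutoff=cutoff)

# Hypothetical document lexicon containing a British spelling.
lexicon = ["colour", "collar", "printing", "retrieval"]
print(approximate_terms("color", lexicon))
```

The matched lexicon words could then be fed to an exact-match index as additional "or" terms, so that spelling variants and OCR errors no longer hide relevant documents. Lowering the cutoff admits more variants at the cost of more false matches.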

Use of pattern recognition technology in conjunction with semantic processing/full-text search capabilities makes it possible to search images, video or any other digital data types. RetrievalWare employs adaptive pattern recognition processing (APRP), a technology that identifies unique clusters or signatures in digital information. That visual retrieval engine can be found on the Yahoo (www.yahoo.com) site to search photographs.

N-gram indexes, which are closely related to suffix arrays, are built much like an inverted word index, but the granularity of the index is finer. For example, in the OpenText search engine, a uni-gram system is built, in which each character within the body of text is indexed. In such an environment, it becomes feasible to search for word stems (e.g., "cardio" to find all instances of text related to the heart), with the same speed and efficacy associated with full-word queries. That approach to searching supports sophisticated manipulation of text strings and trends across the bodies of documents. For example, a query can be posed to identify the longest string of text that occurs in more than one document that contains the phrase "In God We Trust."
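A character-level index of that kind can be sketched as a suffix array: every character position is indexed, the positions are sorted by the text that follows them, and a stem is located by binary search. This is a simplified illustration, not OpenText's implementation; the function names and sample text are hypothetical, and the construction shown is deliberately naive.

```python
def build_suffix_array(text):
    """Sorted start positions of every suffix (simple, not optimized)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, suffixes, pattern):
    """Return sorted offsets where pattern occurs, located by binary search."""
    lo, hi = 0, len(suffixes)
    while lo < hi:  # find the first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        start = suffixes[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Matching suffixes sit next to each other in sorted order.
    while lo < len(suffixes) and text.startswith(pattern, suffixes[lo]):
        hits.append(suffixes[lo])
        lo += 1
    return sorted(hits)

text = "cardiology cardiogram pericarditis"
suffixes = build_suffix_array(text)
print(find_all(text, suffixes, "cardio"))  # stem at the start of a word
print(find_all(text, suffixes, "card"))    # also found inside "pericarditis"
```

Because every position is indexed, a stem or any other fragment is found as quickly as a whole word, which a word-level inverted index cannot do without scanning its lexicon.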

Investigative research

A more complex search model is investigative research, an approach that is more interactive and dynamic. The user has a conceptual idea of the subject matter of interest, but is not intimately familiar with the subject vernacular or the availability of relevant documentation. The research process must do more than merely accelerate the search of the document collection; it must also help the user interrogate the document collection in an iterative manner. Add-on tools and advanced search methodologies such as synonym files, thesauri, proximity searching, topic trees, heuristic association, semantic networks and clusters help the user broaden the search and provide intelligent insight into the manifestation of ideas and concepts found within words and phrases. Retrieval becomes more conceptual. The trick is to determine the level of intelligence required--the degree to which the intelligence is provided out-of-the-box vs. through input from users. The various approaches available appear similar at a cursory glance. It is the processing logic that accounts for the significant differences in the results of a single query executed in two different search environments.
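Two of the simpler tools in that list, synonym files and proximity searching, can be sketched in a few lines. The synonym table, function names and sample sentence below are hypothetical, and real products apply far richer processing logic.

```python
# Hypothetical user-maintained synonym file.
SYNONYMS = {
    "heart": {"cardiac", "coronary"},
    "doctor": {"physician"},
}

def expand_query(terms, synonyms=SYNONYMS):
    """Broaden a query by adding every listed synonym of each term."""
    expanded = set(terms)
    for term in terms:
        expanded |= synonyms.get(term, set())
    return expanded

def within_proximity(text, term_a, term_b, max_gap):
    """True if the two terms occur no more than max_gap words apart."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

text = "the heart surgery team reviewed cardiac imaging results"
print(sorted(expand_query(["heart"])))
print(within_proximity(text, "heart", "surgery", max_gap=1))
print(within_proximity(text, "heart", "imaging", max_gap=3))
```

Expansion raises recall by admitting vocabulary the user did not know to ask for, while the proximity test restores some precision by requiring that the terms appear together.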

Serendipity

Lastly, there is the serendipitous search environment. The user has no direct interest in either the amount of information available or the relevance of information to one specific concept. Instead, it is the linkage of concepts and documents that leads to user discovery. The technology to support that (albeit in an elementary way) has existed for many years. However, it was not until the advent of the Web that there was a suitable environment in which to make it practical and valuable. It is the world of Web surfing. While many consider that environment the exclusive domain of the hyperlinks of the intranet, search tools such as heuristic association and document clustering should also be considered.

Heuristic associations are dynamically determined relationships between words and concepts that link one document to another. For example, if searching on "AIDS research," the engine would automatically spawn subsequent searches for related concepts such as AZT, pneumonia and drug addiction, among other topics. Document clustering, on the other hand, employs several approaches to determine word relationships to concepts before a query is issued, and uses that knowledge to create nodes, or clusters of subject content found within a collection of documents. The documents are subsequently categorized or linked to the concept nodes, based on the degree to which they exhibit insight into that subject area. User queries are then matched for their relevancy to the subject nodes. Those documents that are linked to that subject node are retrieved as "relevant" to the query, even if they do not specifically contain any of the expressed query terms.
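One simple way to approximate a heuristic association is to count which words co-occur with the query term across the collection and offer the most frequent ones as follow-up searches. The sketch below assumes that co-occurrence counting is an acceptable stand-in for the vendors' proprietary logic; the function name and the tiny collection are hypothetical.

```python
from collections import Counter

def related_terms(docs, query_term, top_n=3, stopwords=frozenset()):
    """Suggest the words that most often share a document with query_term."""
    counts = Counter()
    for text in docs:
        words = set(text.lower().split())
        if query_term in words:
            counts.update(words - {query_term} - stopwords)
    return [term for term, _ in counts.most_common(top_n)]

# Hypothetical collection echoing the example in the text.
docs = [
    "aids research azt trial",
    "aids azt dosage study",
    "aids pneumonia azt",
    "pneumonia aids ward",
    "football scores",
]
print(related_terms(docs, "aids", top_n=2))
```

Each suggested term can then seed a new search, which is how a single query fans out into the linked discovery path described above.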

In a world where a simple query may result in "3,452 documents found," relevancy ranking is an essential yet often overlooked feature of many search engines. A search engine that employs relevancy ranking techniques returns an intelligently ranked list of documents in the order of perceived value to the query posed. Recall of the system is encouraged, but not at the price of lost precision. Approaches to relevancy ranking range from simple query term tallies suitable for finite research, to compound approaches using term weighting, fuzzy logic, omni-term skewing and word/document density, applicable to the demands of investigative and serendipitous research. It is up to the user to determine the sophistication of the approach taken by the search engine when evaluating relevancy.
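As an illustration of term weighting, the sketch below ranks documents with TF-IDF, a widely used scheme that rewards terms that are frequent within a document and rare across the collection. It is one possible approach rather than the method of any engine discussed here, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def rank(docs, query_terms):
    """Rank document ids by the summed TF-IDF weight of the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(t for words in tokenized.values() for t in set(words))
    scores = {}
    for doc_id, words in tokenized.items():
        tf = Counter(words)
        score = sum(
            (tf[t] / len(words)) * math.log(n_docs / doc_freq[t])
            for t in query_terms
            if tf[t]
        )
        if score > 0:  # a term found in every document carries no weight
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "a": "search engine ranking search results",
    "b": "search for recipes and more recipes online",
    "c": "gardening tips",
}
print(rank(docs, ["search", "ranking"]))
```

A plain term tally would treat "search" and "ranking" alike; the weighting favors document "a" because it uses the rarer term and repeats the common one.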

Relevancy ranking often goes hand-in-hand with a powerful search front end known as query-by-example, which supports the idea, "I am not sure what I am looking for, but if I see it I will recognize it." Query-by-example allows you to submit a document as a query. In essence, you are instructing the system to find more documents similar in focus and content to the one submitted. Here again, however, you must look at the sophistication of the underlying process by which the "seed" document is analyzed and translated into a query. Approaches that do no more than decompose the document into a series of Boolean "or" statements for execution in an inverted word index will not provide much clarity or precision and may only exasperate the researcher.
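A step up from the Boolean "or" decomposition is to treat the seed document as a vector of word counts and rank the collection by cosine similarity to it. The sketch below assumes that simple vector-space approach; it is an illustration with hypothetical names and sample documents, not a description of any vendor's process.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def more_like_this(seed_text, docs, top_n=2):
    """Return ids of the documents most similar to the seed document."""
    seed = Counter(seed_text.lower().split())
    scored = [
        (cosine(seed, Counter(text.lower().split())), doc_id)
        for doc_id, text in docs.items()
    ]
    best = sorted(scored, reverse=True)[:top_n]
    return [doc_id for score, doc_id in best if score > 0]

docs = {
    "x": "inverted index for text retrieval systems",
    "y": "retrieval of lost luggage",
    "z": "baking bread at home",
}
print(more_like_this("text retrieval with inverted index", docs))
```

Unlike an "or" query, which would treat "x" and "y" as equal hits on the word "retrieval," the similarity score places the document that shares most of the seed's vocabulary first.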

Seeing to the knowledge within

We do not want to be miners of information, but rather discoverers of knowledge. The "it" referred to in "find what is in it" is not simply the information, but the knowledge captured and implied within the documents, among the documents and within our experiences, queries and perspectives. That has caused many traditional vendors such as Fulcrum, Verity and Excalibur to make dramatic changes to their products, and new vendors such as Semio (www.semio.com), CompassWare (www.compassware.com) and InXight (www.inxight.com) to develop and release a new breed of product. Those products are using features such as automatic document abstracting, semantic analysis and agent technology to propagate searches across multiple environments, combine them into a single set of findings, and produce a new sub-set of information to propagate another search. That new product set is at the forefront of solutions for the emerging market for knowledge management. Knowledge is specifically defined as the processing of information combined with experience and perspective--separating the relevant from the irrelevant, adding a new level of insight. Individual documents give way to clusters of knowledge. Trends across documents stored in myriad and diverse collections are identified and brought to the user's attention, in a manner far faster than human processing (skimming/reading) alone could ever accomplish. Those queries result in a self-creating, virtual document that synthesizes relevant content found across the sub-collection of documents identified through an initial query and all the subsequent searches that it initiated.

