Automating perception Part 2
By Tony McKinley
There are two sides to information retrieval (IR): categorization and classification. Think of the former as the way the database orders data, and the latter as the way you enter a query in a search engine. One is fixed (the database), and one is dynamic (the query). All IR vendors address the spectrum of challenges between rigid categorization and fluid classification.
Last month we reviewed three vendors; this month we look at three more: Autonomy, Inxight and Verity. Inxight comes from high-end Palo Alto Research Center roots, with 20 years of science fueling its semantic analytics. Autonomy exploded onto the market during the dot-com boom, and its statistical approach to document understanding sets it apart. Verity has been here since there has been a “here” in the information retrieval industry, and its products reflect lessons learned in enterprise implementations.
Information retrieval is an area of computer programming that is a lot easier to use than spreadsheets and databases, but it is usually underestimated and taken for granted as simple “search.” IR is the machine analog of human reading and remembering, and it is much easier for the everyday user than programming. That invisibility of the programming arises from the way IR software takes a query in any format and nets vast schools of facts in response to our curiosity. Search on the Web is so easy that it leads us to underestimate the effort it takes to find information.
“Autonomy is in the business of getting you out of the task of looking for data. No dictionaries, no thesauri are necessary; our algorithms are at the core,” says Ron Kolb, the director of technology strategy for the company.
Kolb says companies in the sector “are trying to address the problem that has always been with us, the need to retrieve information on demand. Better still, our goal is to proactively present critical information.” What makes Autonomy different, he maintains, is “a fundamentally unique technology,” a reference to the mathematical theories of probability that Autonomy uses to organize and correlate unstructured data.
Kolb explains, “Bayesian techniques provide pattern isolation and help us build relationships to other patterns.” He uses this example: “The room was dark, very dark, so dark I didn’t notice the dinosaur in the corner.” Your favorite search engine, he says, would categorize those words as “dark,” and you would miss the “dinosaur.”
That math theory is Autonomy’s approach to categorization. Autonomy’s correlative to other vendors’ classification process is also mathematical: “Shannon’s algorithms identify the most important patterns; by identifying the most unique elements, you find the dinosaur,” Kolb says, adding, “Another benefit of mathematical analysis is that we can recognize any language.”
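Kolb's dinosaur example can be illustrated with Shannon's self-information, which scores a word as more informative the less probable it is. The sketch below is only an illustration of that general idea, not Autonomy's actual algorithm; the background word counts are hypothetical stand-ins for a large reference corpus, and the add-one smoothing is an assumption made so that unseen words get a nonzero probability.

```python
import math

# Hypothetical background word counts standing in for a large reference corpus.
BACKGROUND = {
    "the": 500, "in": 300, "was": 200, "i": 150, "so": 60, "very": 40,
    "didn't": 30, "room": 20, "dark": 15, "notice": 8, "corner": 5,
}

def surprisal(sentence, background):
    """Map each word to its self-information in bits, -log2 p(word)."""
    total = sum(background.values())
    vocab = len(background) + 1  # one extra slot reserved for unseen words
    scores = {}
    for raw in sentence.split():
        word = raw.strip(".,").lower()
        if not word:
            continue
        # Add-one smoothing keeps unseen words from getting probability zero.
        p = (background.get(word, 0) + 1) / (total + vocab)
        scores[word] = -math.log2(p)
    return scores

sentence = ("The room was dark, very dark, so dark "
            "I didn't notice the dinosaur in the corner.")
scores = surprisal(sentence, BACKGROUND)
most_informative = max(scores, key=scores.get)
print(most_informative)  # prints "dinosaur"
```

With these toy counts, the word the background has never seen carries the most information, while the repeated “dark” scores lower and “the” scores lowest.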
Autonomy applies a unique set of categorizers to any collection of information and derives a unique taxonomy for each collection. That is different from traditional categorization, which relies on predefined taxonomies such as thesauri and dictionaries.
Inxight approaches the challenge of organizing unstructured data from the other end of the spectrum: it treats the data as language and applies semantic analysis rather than mathematical analysis.
“We’re the ‘Uncola’ of search,” says Dave Spenhoff, marketing VP. “We provide another way of solving the problem. People don’t think in keywords; people think in concepts and sentences. We provide rich, natural language. We go beyond search by understanding verbs, nouns, adjectives, tenses. By understanding language, we can identify names, people, places, companies.”
“Search is 20th century technology,” Spenhoff says. “This is the end of search as we know it ... We provide richer semantic understanding of context, so the user doesn’t have to figure out everything. Consider the phrase ‘the price of tea in China.’ This statement is filterable by concept, so we can provide concept-linking, and find other documents with similar concepts. We bring together many approaches, concept plus entity plus keywords, to provide similarity linking.”
Spenhoff continues, “We provide deep drill-down, taking enterprise IR to the next level. We are supporting experts in our customers.” Like the other vendors, Spenhoff sees the value proposition of the technology extending beyond the universal search features of typical document management systems and intranet engines.
“We augment today’s search systems,” he says. “We provide a better layer of IR, and we can apply this to a constant flow of new content. We can figure out the subjects of new documents and who needs to get them.”
“Corporations make huge investments in information-based research, such as the investments in the pharmaceutical industry in new drug discovery,” Spenhoff says. “Rather than lab work, 25% to 50% of the time is based on info-based activity. While pharma companies have universal access to search technologies already embedded in the enterprise, we augment that, we provide a better layer of IR than enterprise search solutions. Our layer delivers what search would miss: those serendipitous discoveries, the synchronicity effect. For example, what other effects does a drug have, effects it was not designed for? What other applications might arise? We help find those insights.”
Verity, under a former CEO in the 1990s, actually aimed to make its products synonymous with the entire market for search. Today Verity is widely deployed in many guises, from ubiquitous desktop products like Adobe Acrobat to dominant enterprise systems like Documentum.
“Search has evolved,” says Prabhakar Raghavan, Verity CTO and 15-year veteran, “and what we are finding to be extremely popular is parametric search.”
“Parametric search combines full text and typical RDBMS fields like date, size, language,” Raghavan explains, “and we can provide click-speed interactions across millions of documents.” That functionality has been described as classification by other vendors, because it is built on a robust categorization structure to allow the user to dynamically find and select slices of data.
“We deliver fluid navigation through relational taxonomies. We can use the concept of taxonomy as navigation paradigm,” Raghavan says. “We can provide a relational taxonomy. For example, if we are searching for technology companies in Sunnyvale, we can group them by geography, size, industry, revenues, any number of taxonomies. We can present the results of search queries, in three or five taxonomies at once, and the user can drill down through documents in each category.”
The benefit, according to Raghavan, is that “the results draw users on; we provide a guided navigation” through unstructured data.
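Raghavan's Sunnyvale example can be sketched in a few lines: a full-text match combined with field filters, followed by a count of the hits under each field so the user can see where to drill down. This is a minimal illustration of the general idea rather than Verity's implementation; the company records, the field names and the functions `parametric_search` and `facet_counts` are all hypothetical.

```python
from collections import Counter

# Hypothetical records: free text plus database-style fields.
DOCS = [
    {"text": "Acme Chips is a technology company making semiconductors",
     "city": "Sunnyvale", "industry": "hardware", "size": "large"},
    {"text": "Beta Soft, a technology startup building search tools",
     "city": "Sunnyvale", "industry": "software", "size": "small"},
    {"text": "Gamma Foods runs organic grocery stores",
     "city": "Sunnyvale", "industry": "retail", "size": "small"},
    {"text": "Delta Networks is a technology firm selling routers",
     "city": "San Jose", "industry": "hardware", "size": "large"},
]

def parametric_search(docs, keyword=None, **filters):
    """Combine a full-text keyword match with exact-match field filters."""
    hits = []
    for doc in docs:
        if keyword and keyword.lower() not in doc["text"].lower():
            continue
        if all(doc.get(field) == value for field, value in filters.items()):
            hits.append(doc)
    return hits

def facet_counts(hits, fields):
    """Count hits per value of each field: one navigable grouping per field."""
    return {f: Counter(d[f] for d in hits if f in d) for f in fields}

hits = parametric_search(DOCS, keyword="technology", city="Sunnyvale")
facets = facet_counts(hits, ["industry", "size"])
print(len(hits), dict(facets["industry"]))

# Drilling down is the same search with one more filter applied.
software = parametric_search(DOCS, keyword="technology",
                             city="Sunnyvale", industry="software")
```

Each click in a guided-navigation interface corresponds to adding one filter and recomputing the counts over the narrower result set.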
As an example of how the technology applies to real-world requirements, Raghavan says, “Résumés are written without a document type definition (DTD); there is no coding to distinguish one job from another or to identify specific qualifications. Résumés represent unstructured data that must be entered into a spreadsheet-like or database-style format to be truly useful. We need to both extract the structure of documents and exploit the structure. This is the next step, to categorize documents now, and to perform entity extraction next.”
In the example of résumés, the technology allows free-form text to be translated to tables. Raghavan explains, “We can lay out a table of data, where they worked, and when, reducing unstructured data to structure.” That approach normalizes common categories of data within résumés, no matter how differently they are laid out in each document.
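The reduction of free-form résumé text to a table can be sketched with a handful of patterns that map different layouts onto the same columns. This is only a toy illustration under strong assumptions, not the vendor's extraction method: the two layouts, the sample résumé and the function `extract_jobs` are hypothetical, and real résumés vary far more than two regular expressions can cover.

```python
import re

# Two hypothetical layouts that express the same facts.
PATTERNS = [
    # e.g. "Engineer at Acme Corp, 1998-2002"
    re.compile(r"^(?P<title>.+?) at (?P<employer>.+?), "
               r"(?P<start>\d{4})\s*-\s*(?P<end>\d{4})$"),
    # e.g. "2002 to 2004: Analyst, Example Inc"
    re.compile(r"^(?P<start>\d{4}) to (?P<end>\d{4}): "
               r"(?P<title>.+?), (?P<employer>.+)$"),
]

def extract_jobs(resume_text):
    """Return one row per recognized job line; other lines are skipped."""
    rows = []
    for line in resume_text.splitlines():
        line = line.strip()
        for pattern in PATTERNS:
            m = pattern.match(line)
            if m:
                rows.append({
                    "employer": m["employer"],
                    "title": m["title"],
                    "start": int(m["start"]),
                    "end": int(m["end"]),
                })
                break  # first matching layout wins
    return rows

resume = """Jane Doe
Engineer at Acme Corp, 1998-2002
2002 to 2004: Analyst, Example Inc
Hobbies: chess"""

for row in extract_jobs(resume):
    print(row["employer"], row["title"], row["start"], row["end"])
```

Both job lines end up in the same four columns even though they were written in different orders, which is the normalization the paragraph above describes.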
All of the vendors in this article offer features like visualization and taxonomy tools that we haven't discussed here. When it comes time to decide on a product, you must consider the internal logic of your own collection to decide whether a traditional, semantic approach (dictionary, thesaurus, authority table) or a dynamic approach (statistical, probabilistic, mathematical) to categorization and classification works best for you.
Information retrieval approaches
Autonomy: mathematical algorithms
Inxight: linguistic analysis
Verity: parametric search
Tony McKinley is with Input Solutions, e-mail email@example.com, and the author of “From Paper to Web” (imagebiz.com).