SpeechTechMag.com: Automating perception part 3

By Tony McKinley

Earlier articles in this series have addressed various vendors' approaches to categorization and classification technologies in support of information retrieval products. In this article, we examine three more vendors who apply those technologies.

Vivisimo, a spin-off of Carnegie Mellon University's computer science development environment, has been in business three years, developing solutions to overlay the search process and deliver valuable information from within very large, dynamic collections.

Nstein Technologies offers a suite of products aimed at the e-publishing market, with the specific mission of enabling its customers to maximize the value of their information assets by automating the workflows necessary to effectively present and deliver those assets to information consumers.

Text Analysis is an early stage company that offers a product called VisualText, which, as the name suggests, provides a development environment for creating advanced text processing applications.

Vivisimo

Raul Valdes-Perez, president of Vivisimo, describes his corporate mission as "organized information from anywhere, any time, in any language." Using the example of Stanford University's Highwire Press, which publishes many science journals including "New England Journal of Medicine," "Journal of the AMA" and "Science," Valdes-Perez explains, "We create and show organized information without the endless cost and complexity of taxonomy building and maintenance. The Vivisimo Clustering Engine automatically clusters the top 500 search results into a spontaneous taxonomy for easy thematic browsing."

A simple search for "coronavirus" in Medline, one of the largest scientific databases in the world, produced 3,587 results. Many Web searchers are familiar with that kind of outcome: a huge hit list that is almost useless unless the relevant hits fall within the top few pages. That is where the efficiency boost of the Vivisimo Clustering Engine kicks in. By applying clustering to the hit list, the engine presents the user with an organized view of the hit documents, dynamically separated into folders under specific topics. That allows searchers to immediately focus on the hits in which they are specifically interested. As Valdes-Perez explains, "Within a few hits, the user has a comprehensive view of all the world's knowledge on coronavirus."

Valdes-Perez continues, "You can also get an outline of your query results from the folder topics. We've had students write essays directly from this. For example, imagine if you did a search on 'American Civil War.' The topics of the folders would generate an outline of all the relevant events and trends in the entire war ... When it's all put together, you see all of the results in context, rather than mixed in a long list of hits."

Describing the technology further, Valdes-Perez says, "We perform clustering on scientific articles by parsing the title of each document, the abstract which is provided by the author, and applying controlled vocabularies. And clustering is based on overall similarities among documents ...

"We use both statistical and linguistic methods to determine folders and contents. Statistical methods count terms, words and phrases. Linguistic methods employ stop words, morphology such as stemming, abbreviations and acronyms, and syntax. We employ a knowledge of synonyms, and what kinds of phrases appear in a body of knowledge, so that phrases are equivalent to words in our dynamic clustering process." In that way, folder labels are dynamically generated, and clustered hit lists are presented in context, in folders in which all the documents are clustered around similar characteristics

Nstein

Randall Marcinko, president and COO of Nstein, explains, "We are in the creation-of-metadata business. We focus on e-publishing, and we deliver our value proposition through our clients. First, we deliver value to our clients who are secondary data publishers through automation of indexing, which delivers a direct cost savings to them." Enterprise users can learn lessons from e-publishers, because the solutions they need have similar automated indexing requirements, and the e-publishing market has addressed them on a scale as vast and challenging as any enterprise is likely to face.

"Second," Marcinko continues, "we help business journals be more creative in the way they make information accessible. We provide better navigation, so that less effort by search users will yield more pertinent results. This means e-publishers will sell more content, by letting consumers know it is there and available. Third, we offer excellent professional taxonomies." One example is Nstein's recent announcement of licensing the BIOSIS (biosis.org) taxonomy of the life sciences, which has been meticulously developed for almost 80 years.

"We don't replace subject matter experts," Marcinko explains, "but we automate the routine work to such an extent that frees SMEs to use their real talent, adding precision to the results."

Charles Alexander, Nstein VP of sales & marketing, says, "Our value is that we deliver cost savings. Our customers are e-publishers who buy and sell content, and enterprise users who support knowledge workers." Enterprise KM projects face the same tasks, perhaps not on the same scale, as e-publishers. Alexander says, "We save time in indexing. At the end of the day, we help our customers organize their information, which helps their users retrieve information."

Marcinko adds, "We help on both sides, organization and retrieval, which are the two parts of this business. Navigation is strong, which is a simplified form of retrieval."

Referring to Nstein's automated features, Marcinko says, "We organize information by Concepts, Categories and Entities, which enables very powerful data mining ... We can apply either a predefined taxonomy, like BIOSIS, or deliver categorization on the fly. In every case, we aim to help the user find the information they are looking for on the first go."

"Most enterprises have document management systems, like Documentum or Interwoven," Alexander says. "These systems are a niche opportunity for us, to add significant value in indexing. Doc man systems are excellent file cabinets. We are the filing system within the cabinet."

Text Analysis

Text Analysis has recently released VisualText 1.7, which is described as an integrated development environment for building information extraction systems, natural language processing systems and text analyzers.

Says Maureen McHenry, VP of Text Analysis, "VisualText is the first and only development environment for text analysis programming. The next step in search is the ability to parse information. More deep and precise extraction is going to be required. Categorizers put things into buckets, but you still have to read them. There will be a requirement for another layer."

A technical architect for a major integrator says that a key advantage of VisualText is that it enables the specialization of a knowledge solution to a customer's domain. He reports that the software significantly reduced development time on his current project, and that with it, he could specialize the analysis to match the customer's terminology, culture, style of documentation and domain knowledge.

Tony McKinley is with Input Solutions (inputsolutions.com), e-mail tony@inputsolutions.com, and is the author of "From Paper to Web" (imagebiz.com).