By Judith Lamont, KMWorld senior writer
Taxonomies provide an effective way to organize and access unstructured information, but project planners should factor in resources to maintain them.
Taxonomies classify information into logical categories that allow users to readily browse through content. They are often used in tandem with search and retrieval tools (keyword- or concept-based) to help locate target information. However, unlike search technology alone, taxonomies reveal the overall structure of a knowledgebase, in a hierarchy that is visible to the user. The user navigates through sub-categories to narrow the search, a process that helps avoid false hits that are outside the area of interest. When used with search and retrieval tools, taxonomies aid in efficiency by limiting the volume of material that must be searched.
In some software solutions, the model has a single underlying taxonomy in which a document appears only once, and separate classification schemes that may present the document more than once. For example, the same document may show up in a hierarchy organized by industry or by geography. Some in the industry describe the two as a canonical taxonomy and a presentation taxonomy respectively. In other solutions, the model entails one taxonomy that allows a document to appear in multiple places within it.
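The first model described above can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: the document IDs, category paths and metadata values are invented. Each document is stored once under a canonical category, and presentation views are derived from metadata, so the same document may surface under several headings.

```python
from collections import defaultdict

# Canonical taxonomy (hypothetical data): each document is filed exactly once.
canonical = {
    "doc-1": "reports/2002/q1",
    "doc-2": "reports/2002/q1",
    "doc-3": "reports/2002/q2",
}

# Hypothetical metadata used to build presentation views.
metadata = {
    "doc-1": {"industry": ["healthcare"], "geography": ["us", "canada"]},
    "doc-2": {"industry": ["news", "healthcare"], "geography": ["us"]},
    "doc-3": {"industry": ["news"], "geography": ["uk"]},
}

def presentation_view(facet):
    """Group document IDs by one metadata facet.

    A document may appear under several headings in the resulting view,
    even though it has a single home in the canonical taxonomy.
    """
    view = defaultdict(list)
    for doc_id in sorted(metadata):
        for value in metadata[doc_id].get(facet, []):
            view[value].append(doc_id)
    return dict(view)

by_industry = presentation_view("industry")
by_geography = presentation_view("geography")
print(by_industry)
print(by_geography)
```

Because the views are computed from metadata rather than stored, a presentation scheme can be added or dropped without moving any document.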
Maintaining a taxonomy consists of two primary activities: incorporating new documents into the existing structure and changing the structure to accommodate new information that cannot fit into the existing one. Those processes are usually carried out through a combination of automation and human intervention. Classification techniques include keywords, statistical analyses that look for patterns of words, and use of a semantic network or ontology that analyzes words for their meaning in context. Analytical capabilities can help determine when a new category is needed, and how the documents would be redistributed. According to Laura Ramos, director at the Giga Information Group, maintenance is the most expensive part of a taxonomy project, yet is often overlooked in the planning process.
Although a careful analysis of content and needs is an important step in developing a taxonomy, trying to perfect it beforehand may be counterproductive. The American Hospital Association (aha.org) developed a basic taxonomy to organize information contained in its own Web site and those of its nearly 20 affiliated organizations. Originally, each of the member Web sites had its own server, content manager and software. Few of the sites had a search capability, and none of them had a taxonomy, so finding information was difficult.
AHA has brought all those communities into a portal, hospitalconnect.com, that uses a single content manager, taxonomy and search engine. Using Verity, healthcare professionals in different communities can now search across all of the sites in one step. The taxonomy allows users to narrow their searches to specific categories; they can also choose to search a single site.
“One of our biggest challenges was to establish a taxonomy that was meaningful to a diverse group of communities that includes hospital CEOs, nurse executives, healthcare engineers and others,” says Herman Baumann, AHA’s executive director for business development and e-commerce. “We developed a relatively simple taxonomy to begin with, knowing that it will evolve over time.”
Baumann describes the taxonomy as a work in progress, and is comfortable with the change management process. “Nothing is cast in stone,” he says, “and as we gain more experience, we will refine our categories.” Those decisions will be aided by Verity’s analytical capabilities, which AHA uses to provide quarterly reports on activity within each category in the taxonomy. When such analysis showed that searchers often misspelled HIPAA (Health Insurance Portability and Accountability Act) as “HIPPA,” for example, AHA added it as a keyword to retrieve the appropriate documents even if the search term was misspelled. Although categorization is done manually by subject matter experts, AHA plans to use Verity’s automated capabilities for that function in the future.
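The HIPAA fix amounts to mapping a known misspelling onto the correct keyword before lookup. The sketch below is a hypothetical illustration in Python and does not show how Verity stores keywords; the alias table, index contents and document names are all invented.

```python
# Hypothetical alias table: maps known misspellings to the canonical keyword.
KEYWORD_ALIASES = {
    "hippa": "hipaa",  # misspelling found through search-log analysis
}

# Hypothetical keyword index: canonical keyword -> document names.
INDEX = {
    "hipaa": ["privacy-rule-overview", "compliance-checklist"],
}

def search(term):
    """Normalize the search term, resolve any alias, then look up documents."""
    key = term.strip().lower()
    key = KEYWORD_ALIASES.get(key, key)
    return INDEX.get(key, [])

print(search("HIPPA"))
print(search("HIPAA"))
```

Both calls return the same documents, so the searcher never has to learn that the original query was misspelled.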
Auto-classification or machine learning generates the rules for categories by having the system read exemplary documents. Verity also uses concept extraction to create categories. However, having the option for human intervention is critical, says Prabhakar Raghavan, Verity’s CTO. “When categorization is completely automated, linkages may change because business rules are modified or document usage changes,” he notes, “so it is important to have a review process.” Verity can also import existing taxonomies and populate them.
Go with a standard taxonomy
The news industry faces a more dynamic information environment than most. USA Today’s Web site has more than 200,000 pages that are updated 24/7. The site originally was maintained manually. When USA Today built its new XML-based editorial system, it also wanted to automatically process the large volume of incoming news content for its Web site, archives and PDA delivery. USAToday.com chose the Applied Semantics News Series solution for categorization, summarization and metatagging. The automatic categorization module now tags incoming content received from reporters or newswire services with the International Press Telecommunications Council subject codes. Typical subjects include arts, crime, health, and politics. Applied Semantics’ summarization feature automatically creates lead-in text for each story, and the metatagging module adds conceptual keyword tags. The content changes rapidly but the taxonomy is relatively stable, and rules for classification are consistently applied.
“Users are becoming much more educated about taxonomies,” says Eva Ho, director of marketing at Applied Semantics. “Industries that have captured and stored information in document management systems now want more advanced search techniques.” Certain industries, such as news and pharmaceuticals, have developed standard taxonomies, but for many companies that want to organize their internal documents, it makes more sense to build their own and “tune” it to their content.
Applied Semantics’ News Series is based on an ontology, or network of word meanings, that encompasses over a million words, half a million concepts and tens of millions of relationships. An ontology enables “disambiguation” based on word meanings--for example, to distinguish the meanings of “java” as coffee, an island or a programming language. One advantage of using an ontology for categorization is that the system does not need to be “trained” with representative documents, as systems based on statistical techniques do. Furthermore, having an existing knowledgebase enables accurate categorization for very short documents as well as longer ones.
Follow the life cycle
The NewsEdge line of products and services offered by Dialog provides specialized business news and uses its own taxonomy that organizes information by geography, industry and type of business activity. Information is collected from thousands of sources, organized and sent electronically to subscribers. The subscribers in turn use the information on Web sites and corporate intranets. After performing the categorization process manually for more than a decade, the NewsEdge services began using the Stratify Discovery System from Stratify.
“Our key issues are keeping the taxonomy current and presenting information the way the user wants to see it,” says Steve Samler, Dialog’s architect of content enhancement. The Discovery System categorizes stories automatically, with ongoing review by editors who make decisions about both content and presentation format.
“Part of our value-added in filtering stories is the judgment of our subject matter experts,” explains Samler. “For example, they know that a merger between two very large pharmaceutical companies is more important than a merger between two minor players, and they would select that story.”
The categorization process is now five to seven times more efficient than previously, and information gets pushed out to clients much more quickly. In addition, customers are more confident that all of the relevant stories have been included. The Discovery System uses multiple techniques (statistical, keywords, source and Boolean classifiers) in parallel to categorize documents.
“It is important that all four techniques be used together on every document to provide good categorization and searching,” says Ramana Venkata, CTO of Stratify. Stratify enables companies to manage the total taxonomy life cycle, from creating and defining it to testing and refining it.
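How the four techniques are combined inside the Discovery System is not described here. As a rough illustration only, the Python sketch below runs four toy classifiers on every document and keeps any category that at least two of them agree on. The voting rule, the keyword lists, the source table and the term profiles are all assumptions made for the example.

```python
from collections import Counter

# Each classifier takes a document (a dict with "text" and "source") and
# returns a category name or None. All rules below are invented examples.

def keyword_classifier(doc):
    return "pharmaceuticals" if "drug" in doc["text"].lower() else None

def source_classifier(doc):
    return {"pharma-wire": "pharmaceuticals"}.get(doc["source"])

def boolean_classifier(doc):
    text = doc["text"].lower()
    return "mergers" if "merger" in text and "acquire" in text else None

def statistical_classifier(doc):
    # Stand-in for a trained model: picks the category whose term profile
    # overlaps the document's words the most.
    profiles = {
        "pharmaceuticals": {"drug", "trial", "fda"},
        "mergers": {"merger", "deal", "acquire"},
    }
    words = set(doc["text"].lower().split())
    scores = {cat: len(words & terms) for cat, terms in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

CLASSIFIERS = [
    keyword_classifier,
    source_classifier,
    boolean_classifier,
    statistical_classifier,
]

def categorize(doc, min_votes=2):
    """Run every classifier on the document and keep categories with enough votes."""
    votes = Counter(classifier(doc) for classifier in CLASSIFIERS)
    votes.pop(None, None)  # discard abstentions
    return sorted(cat for cat, n in votes.items() if n >= min_votes)

doc = {
    "source": "pharma-wire",
    "text": "Drug maker announces merger deal after FDA trial results",
}
print(categorize(doc))
```

Requiring agreement between techniques is one simple way to trade recall for precision; a production system would more likely weight each classifier by its measured accuracy.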
“To ensure the accuracy of classification models,” Venkata notes, “administrators should be able to test and modify categories in real time without affecting the production environment.” For example, the software performs ongoing analyses to revise document clusters, and highlights documents that do not fit the categorization scheme. Looking at what does not fit into the taxonomy helps reveal what needs to be adjusted.
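One simple way to surface documents that do not fit is to score each document against every category and flag those whose best score falls below a threshold. The Python sketch below uses word overlap (Jaccard similarity) as the score. The similarity measure, threshold, category terms and sample documents are assumptions for illustration, not Stratify's method.

```python
def jaccard(a, b):
    """Share of words two sets have in common (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical category definitions expressed as term sets.
CATEGORY_TERMS = {
    "mergers": {"merger", "acquisition", "deal", "shareholders"},
    "earnings": {"earnings", "quarter", "revenue", "profit"},
}

def best_fit(text):
    """Return the best-matching category and its similarity score."""
    words = set(text.lower().split())
    scores = {cat: jaccard(words, terms) for cat, terms in CATEGORY_TERMS.items()}
    cat = max(scores, key=scores.get)
    return cat, scores[cat]

def misfits(docs, threshold=0.1):
    """List documents whose best category score is below the threshold."""
    return [doc_id for doc_id, text in docs.items() if best_fit(text)[1] < threshold]

docs = {
    "a": "merger deal approved by shareholders",
    "b": "quarter revenue and profit rise",
    "c": "new stadium opens downtown",
}
print(misfits(docs))
```

A growing pile of misfits on one subject is a signal that the taxonomy may need a new category.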
An innovative approach just announced by Convera combines searching with classification to produce “dynamic classification” in real time. When a document enters Convera’s RetrievalWare system, it is indexed and tagged with keywords, concepts and entities, and then placed in a taxonomy. But rather than classifying documents ahead of time for searching, dynamic classification accomplishes both in one pass when the user carries out a search. When the user enters a search query, the system presents a hierarchy of classification folders created as a result of the query. Dynamic classification, therefore, combines the benefits of a targeted search with the discovery options implicit in browsing. That approach is an extension of Convera’s model in which the foundation taxonomy underlies a variety of presentation classifications.
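The idea of building classification folders from the query results, rather than ahead of time, can be sketched as follows. This Python example is a simplification under stated assumptions: the documents and concept tags are invented, matching is plain whole-word lookup, and the folders are flat rather than hierarchical. It does not reflect RetrievalWare's actual interfaces.

```python
from collections import defaultdict

# Hypothetical index: documents were tagged with concepts when they entered the system.
DOCUMENTS = [
    {"id": "d1", "text": "hospital merger announced in Chicago",
     "concepts": ["healthcare", "mergers"]},
    {"id": "d2", "text": "bank merger faces regulatory review",
     "concepts": ["finance", "mergers"]},
    {"id": "d3", "text": "hospital opens new cardiac wing",
     "concepts": ["healthcare"]},
]

def dynamic_classification(query):
    """Search and classify in one pass.

    Folders are created only from the documents that match this query,
    so the classification reflects the query rather than the whole corpus.
    """
    terms = query.lower().split()
    folders = defaultdict(list)
    for doc in DOCUMENTS:
        words = doc["text"].lower().split()
        if all(term in words for term in terms):
            for concept in doc["concepts"]:
                folders[concept].append(doc["id"])
    return dict(folders)

print(dynamic_classification("merger"))
print(dynamic_classification("hospital"))
```

The two queries yield different folder sets from the same corpus, which is what lets the user browse within the results of a targeted search.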
“Categorization allows indexing at the corporate level,” says Claude Vogel, Convera’s chief scientist, “but classification better serves the needs of individuals or communities.” Users still have the option of establishing permanent classifications, or of converting the newly created dynamic classifications into persistent ones. That capability will be available with the release of RetrievalWare 8.0, scheduled for late April.
At present, RetrievalWare updates the system by indexing new documents against the taxonomy and routing them to whatever classifications have been activated. If a new document cannot fit into the taxonomy, the category can be adjusted by “latching,” which expands it to accommodate the new material. Other options include finding a synonym that relates the new document to an existing category, or creating a new category.
“Maintenance is about shuffling the documents, testing new classifications and finding out if they are still valid,” continues Vogel, “and this can be done automatically.” RetrievalWare allows an administrator to see how all the documents would flow into new classifications, a process that takes about five seconds for 50,000 documents.
Giga analyst Ramos points to analytical tools as key to good maintenance of taxonomies. “These analytics help companies know when it’s time to make changes,” she says. “Tools and techniques that show which categories are used a lot, or only rarely, and how closely documents fit a particular category, are critical to keeping categories updated and meaningful.”
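A usage report of the kind Ramos describes can be as simple as counting activity per category and flagging the extremes. The Python sketch below assumes a hypothetical activity log and arbitrary thresholds; a real deployment would tune both to its own traffic.

```python
from collections import Counter

# Hypothetical activity log: one entry per category a user browsed or searched within.
activity_log = [
    "finance", "finance", "quality", "finance",
    "hipaa", "quality", "finance", "workforce",
]

# Hypothetical list of every category in the taxonomy, including unused ones.
ALL_CATEGORIES = ["finance", "quality", "hipaa", "workforce", "engineering"]

def usage_report(log, categories, low=1, high=4):
    """Label each category as heavy, rare or ok based on its activity count."""
    counts = Counter(log)
    report = {}
    for cat in categories:
        n = counts[cat]  # Counter returns 0 for categories with no activity
        if n >= high:
            report[cat] = "heavy"
        elif n <= low:
            report[cat] = "rare"
        else:
            report[cat] = "ok"
    return report

print(usage_report(activity_log, ALL_CATEGORIES))
```

Heavily used categories are candidates for splitting into subcategories, while rarely used ones may be merged or retired.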
Ad hoc categorization from Thunderstone
A would-be purchaser who goes to eBay begins with the entire content of the Web site as the search universe, and then selects from a category or types in a search term. Categories include products, geographical regions, themes or stores. When the results are returned, the shopper can sort by price, date or other dimensions. This high-end, high-performance application is typical of those implemented by Thunderstone, using its Texis suite of relational database tools.
“Our product is unique in being able to store and search text documents of unlimited size within standard database tables,” says Doran Howitt, marketing VP at Thunderstone. The company describes this achievement as its core capability. The Texis suite is a set of software tools that can be used to develop a variety of applications, including portals, content management and e-commerce. It is geared toward applications with very large data sets and complex searching requirements. The search capability can also be used against external documents such as Web pages.
The Texis Categorizer assigns documents to the categories in a taxonomy and automatically attaches subject codes and other metadata after being trained on sample documents. Authorized users can create new categories. In addition, users can enter notes back into the database, add new keywords or assign categories to a document, which Thunderstone calls ad hoc categorization or categorization on the fly.
“Users may know what category to assign to a document better than any software,” Howitt says. “But usually that knowledge goes to waste.” Howitt believes that more customers should capture feedback from users after they view a document and, after appropriate review, put it back into the database and make it searchable.
He also notes the importance of a close and interactive relationship between categorization and search. For example, users may benefit from seeing a count of how many search hits are in each category, as on eBay. Texis is optimized to do operations like that efficiently, including keeping on-the-fly changes indexed in real time.
Judith Lamont is a research analyst with Zentek Corp., e-mail firstname.lastname@example.org.