Presenting the case for automated classification

Many corporations lack the basic searching and browsing infrastructure that is widely available on the Web; that has led to many an uncomfortable conversation between Web-savvy CEOs and their CIOs. Yet before we jump to the conclusion that those Web tools are just what we need inside the firewall, we should consider this: Many researchers in the field of information retrieval (IR) consider the "best of the Web" to be woefully inadequate.

Mehran Sahami, an IR researcher at Stanford, puts it this way:

"The recent explosion of online information ... has given rise to a number of query-based search engines and manually constructed topical hierarchies. However, these tools are quickly becoming inadequate as query results grow incomprehensibly large, and manual classification in topic hierarchies creates an immense bottleneck."

A first-hand look at the problem

Although our primary target may be corporate information, the Web provides an ideal proving ground for information retrieval technology. Or maybe we should call it a "disproving ground," because if we look closely, we can see clearly the problems that Sahami cites.

Using AltaVista (www.altavista.com) for an experiment, let’s say we want to do some research on the planet Pluto. A fairly intelligent query might be "Pluto planet orbit." (According to Alan Marwick, a research manager at IBM’s Thomas J. Watson Research Center, the estimated average query length in full-text retrieval systems is 1.8 to 2.2 words, so we’re already providing above-average input with those three words.) The result: 234,935 hits. Limiting hits by using the more complex required-word syntax of "+Pluto +planet +orbit," we are rewarded with an oh-so-manageable 38,682 hits. And in that more selective case, we would miss any document that didn’t include all three words.
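To make the trade-off concrete, here is a toy sketch of the two matching modes. It assumes a made-up three-document corpus and hypothetical helper names (tokenize, match_any, match_all); it says nothing about how AltaVista actually indexes or ranks pages.

```python
# Toy illustration only: a made-up corpus and two simple matching modes.
# Nothing here reflects how AltaVista actually indexes or ranks pages.

def tokenize(text):
    """Lowercase a string and split it into a set of words."""
    return set(text.lower().split())

def match_any(query, docs):
    """Return ids of documents containing at least one query word."""
    terms = tokenize(query)
    return [doc_id for doc_id, text in docs.items() if terms & tokenize(text)]

def match_all(query, docs):
    """Return ids of documents containing every query word ('+word' style)."""
    terms = tokenize(query)
    return [doc_id for doc_id, text in docs.items() if terms <= tokenize(text)]

docs = {
    "a": "pluto is a small planet with an eccentric orbit",
    "b": "pluto the cartoon dog",
    "c": "the path pluto traces around the sun takes 248 years",
}

print(match_any("Pluto planet orbit", docs))  # ['a', 'b', 'c']
print(match_all("Pluto planet orbit", docs))  # ['a']
```

The loose query drags in the cartoon dog; the strict query drops it, but it also drops document "c," which is about Pluto's orbit without ever using the words "planet" or "orbit."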

It gets worse. Although results are returned to us in relevance order, the whole concept of relevance gets a little crazy with result sets that huge.

As Sahami mentions, the alternative to full-text searching is category browsing through manually constructed topical hierarchies. In the world of the Web, Yahoo (www.yahoo.com) is a prime example. The operative phrase here is "manually constructed": Yahoo’s hierarchy and cross-linking structure are manually maintained. Because of the need for manual intervention, it is impractical for Yahoo to represent more than a small fraction of Web content. Likewise, manual intervention means that cross-linking is only practical when applied sparingly.

Let’s try another experiment, again looking for information on Pluto but this time using Yahoo browsing. We must first determine that "science" is the correct starting point. The next step is pretty clear - "astronomy." But at this point, shouldn’t we see "planets"? Not according to Yahoo: To get to planets, you must travel via the "solar system" category. So finally we reach "science > astronomy > solar system > planets > Pluto." Navigation in our experimental case was only slightly confusing, but the planet Pluto is as concrete a topic as you will find; browsing for less concrete subject matter in Yahoo can be a frustrating experience.

So, here we are at Pluto. And what do we find when we arrive? Eight measly links! If we assume that AltaVista found all the Web pages relevant to Pluto, and that this example is typical, our experiment would indicate that the Yahoo topic hierarchy represents just 0.02% of the Web (8 links against 38,682 hits). While the links that Yahoo does provide are probably very relevant - perhaps a better starting place than the 38,682 hits returned by AltaVista - the "manually constructed" nature of today’s Yahoo is clearly limiting its reach to a tiny fraction of Web content.
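The coverage figure is simple division, shown here as a quick check. The variable names are illustrative; the two counts are the ones reported in the experiment above.

```python
# Coverage estimate from the experiment: Yahoo links vs. AltaVista hits.
yahoo_links = 8
altavista_required_hits = 38_682  # hits for "+Pluto +planet +orbit"

coverage_pct = yahoo_links / altavista_required_hits * 100
print(f"{coverage_pct:.2f}%")  # 0.02%
```

Measured against the looser 234,935-hit query, the share would be smaller still.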

Eliminating manual classification

IR researchers like Sahami are pursuing a technology breakthrough that may make the concept of a corporate Yahoo much more viable: automated classification. With an automated classification tool in place, documents and other knowledge assets are automatically classified within the topical hierarchy, eliminating the manual classification process. As a result, the corporation can enjoy the benefits of browsing a far more comprehensive topic hierarchy - closer to 100% than 0.02%.

Automated classification solutions fall into three categories:

  • rules-based - The simplest and least sophisticated option, available today. This approach classifies documents based on business rules. The approach suffers from two limiting factors: the complexity of manually maintaining separate business rules for each classification, and the difficulty in precisely classifying documents based on business rules. As a result, this method is only viable for coarse classification into a relatively small topical hierarchy.
  • text analysis-based - More sophisticated, emerging into the marketplace now. In this approach, document clustering provides the foundation for automated classification. In simple terms, clustering works by considering documents as "bags of words," and using vector mathematics to determine the relative similarity between documents. If the human user creates a topic hierarchy and provides sample documents for each classification, then we can use document similarity algorithms to determine the classification(s) that a new document best fits. This approach offers a reduction in human effort, since it is easier to provide sample documents than to create business rules, and an increase in precision, because text analysis generally does a better job of classification than business rules.
  • self-structuring - The most advanced approach; at least a year from commercial realization. In this scenario, the classification system needs no human input to do its job, but instead analyzes the corporation’s knowledge assets, automatically determines an appropriate topical hierarchy, and then automatically classifies the assets. Lest that sound a little too good, it is unlikely that such a system will operate without any human intervention. Instead, the system might "propose" a topical hierarchy, which would then be "tuned" by the human user. Still, such a system would be a great advance on earlier systems because the need for human involvement with each and every classification is eliminated. As a result, large and rich topical hierarchies become viable; Dave Newbold at Lotus Development (www.lotus.com) reports lab testing of automatically generated hierarchies with over 1 million topic nodes.
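The text analysis-based approach can be sketched in a few lines. This is a minimal illustration, not any vendor's method: it assumes raw word counts as the "bag of words," cosine similarity as the vector math, and a per-category profile built by summing the sample documents. The function names and sample texts are hypothetical, and a real system would likely add term weighting, stop-word removal, and stemming.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a document as word counts, ignoring word order."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_profiles(samples):
    """Sum the word counts of each category's sample documents."""
    profiles = {}
    for category, texts in samples.items():
        total = Counter()
        for text in texts:
            total += bag_of_words(text)
        profiles[category] = total
    return profiles

def classify(text, profiles):
    """Return the best-matching category and the score for every category."""
    doc = bag_of_words(text)
    scores = {cat: cosine(doc, profile) for cat, profile in profiles.items()}
    return max(scores, key=scores.get), scores

# Hypothetical sample documents supplied by the human user.
samples = {
    "astronomy": ["pluto planet orbit sun", "telescope images of distant planet"],
    "finance": ["quarterly revenue and profit report", "stock market revenue forecast"],
}

profiles = build_profiles(samples)
best, scores = classify("new orbit data for the planet pluto", profiles)
print(best)  # astronomy
```

Note where the human effort goes: the user supplies a handful of sample documents per category rather than writing and maintaining a rule for each one.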

Not a panacea but important

Automated classification is not a total solution to the problem of information overload and certainly isn’t "knowledge management in a box." Significant problems remain in the areas of knowledge UIs and in better understanding the human process of "finding out about." But the automated classification solutions emerging in the market are certainly important and deserve careful attention from any corporation working in knowledge management.

Jack Ivers is chief technologist at GlobalServe (www.globalservecorp.com), 800-586-3667 x202, e-mail jack_ivers@globalservecorp.com

Selected companies with auto classification tools

Company    Product                      Web

IBM        Intelligent Miner for Text   www.software.ibm.com/data/iminer/fortext
Autonomy   Knowledge Server             www.autonomy.com/knowledge/ks_categ.htm
Cartia     Theme Scape                  www.cartia.com
Verity     Knowledge Organizer          www.verity.com/products/ko/index.html
