By Susan Feldman
Categorization: the basic cognitive process of arranging into classes or categories—WordNet lexical database, Cognitive Science Lab, Princeton University
The human brain is a wonderful information processor. We take in innumerable details with every glance, sound or touch. Yet we are seldom overwhelmed with the magnitude of the information we are processing. One reason that we are able to cope with so much input is that we categorize it all. We look for what is new, what is different, what has changed. Then we try to match the new information to the categories that already exist in our minds. We need only find the category that a new thing is most similar to, and then drop it into that convenient "bucket." We use the changed information to alter and improve existing categories. The rules associated with each category tell us how to react to it (Plant => vine with groups of 3 leaves => it might be poison ivy, so don't touch it).
Since categorization helps us navigate the real world, it makes sense to enlist it in navigating cyberspace as well. We need all the help we can get in this invisible world. Cyberspace can't send us the images, tastes, sounds, touches or smells that would help us to understand its contents. At least not yet.
Reasons for categorizing
Information systems, like the brain, also need order if users are to make sense of their contents. Categorization helps users navigate or browse through collections, Web sites or search results. Grouping a large number of discrete items into understandable categories lets users quickly eliminate what is irrelevant or uninteresting, and pay attention only to what matters most.
As collections of information have grown, it has become imperative to figure out how to improve information finding. And that is why you see the ferment of activity today that surrounds taxonomy building, categorization and faceted navigation. Classification and categorization projects, however, come with some significant costs attached to them. Therefore, it is important to understand why you need to categorize before you undertake a major project. Organizations may want to provide limited or extensive categorization in order to:
1. Provide a browsing as well as a searching feature on the intranet or Web site. Browsing and searching are two different kinds of information seeking tasks, and both need to be accommodated in a good information access system.
2. Improve the accuracy of search. Search engines need good clues about the topics in a document and their relative importance. By adding subject tags, as well as names of major people, places and things, to the document metadata, you provide additional clues to the search engine about what the major topics in a document might be. That improves relevance ranking.
3. Remove "false drops" in search results by "disambiguating" terms. Since many words have more than one meaning, a search engine can use categorization to decide which meaning is the right one: If the topic is "politics," an article on rhododendrons won't be retrieved when you are looking for "Bush."
4. Group results lists by topic. A list of documents categorized by topic or by names of people is much more useful than a simple list of documents sorted by relevance, date or title. Users can drill down quickly to what interests them.
5. Improve navigation, particularly in e-commerce and product sites, by guiding users to the product they need and eliminating dead ends.
You can also use categories to:
- give an overview of a complex topic;
- help users differentiate among very similar topics, such as types of cancer on the National Cancer Institute Web site;
- filter out spam from e-mail;
- monitor messages for regulatory compliance;
- visualize a collection for better exploration and interaction;
- locate similar or related items, even if they contain different words.
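Reasons 3 and 4 in the numbered list above (disambiguating terms and grouping results by topic) can be illustrated with a small sketch. It is only an illustration: the sample documents, the category tags and the function names `search` and `group_by_category` are invented for the example, and the "search engine" is a bare substring match rather than anything a real product would use.

```python
from collections import defaultdict

# Hypothetical documents that already carry a category tag in their metadata.
DOCS = [
    {"title": "Bush addresses Congress", "category": "politics",
     "text": "President Bush spoke to Congress today."},
    {"title": "Pruning a rhododendron bush", "category": "gardening",
     "text": "Cut the bush back after it flowers."},
    {"title": "Senate vote delayed", "category": "politics",
     "text": "The Senate postponed its vote."},
]


def search(query, docs, category=None):
    """Return docs whose text contains the query, optionally limited to one category."""
    q = query.lower()
    hits = [d for d in docs if q in d["text"].lower()]
    if category is not None:
        # The category tag removes "false drops" that merely share the word.
        hits = [d for d in hits if d["category"] == category]
    return hits


def group_by_category(docs):
    """Group document titles under their category, for a drill-down results list."""
    grouped = defaultdict(list)
    for d in docs:
        grouped[d["category"]].append(d["title"])
    return dict(grouped)


if __name__ == "__main__":
    print(search("bush", DOCS, category="politics"))
    print(group_by_category(search("bush", DOCS)))
```

Without the category filter, a search for "bush" returns both the political story and the gardening article. With the filter, only the political story remains, and the grouped view lets a user pick the sense that matters.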
Types of categorization
Until recently, categorization was a manual process. Subject experts spent their days reading and classifying documents, books, pictures or videos. But manual categorization is slow, and indexers are expensive. Most organizations want instant access to information today, and that makes automatic classification a hot topic. Note, however, that most automatic systems still require a human at some point in the process: in establishing the categories, developing taxonomies, choosing appropriate training sets of documents or reviewing results in order to monitor the system.
Manual categorization is slow and labor-intensive, but it ensures that the user will never come across a "stupid computer error." It is most appropriate for collections that add only a few items at a time. Manual categorization or classification systems rely on a pre-established set of categories, often in the form of a taxonomy or ontology. Each category is usually defined by a set of rules that describe the characteristics of items in that group. Those rules are necessary because without them, each indexer would create his or her own set of categories, and indexing consistency would suffer. Once the rules have been established, it makes sense to try to automate the process. Computers should be able to learn to follow rules if they are clearly defined.
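To make the idea of clearly defined rules concrete, here is a minimal sketch of a rule-driven categorizer. The rule format (a set of terms of which at least one must appear, plus a set of terms that must not appear) and the names `RULES`, `tokenize` and `categorize` are assumptions made for this illustration; they do not come from any particular taxonomy tool.

```python
import re

# Hypothetical rules: a document joins a category when it contains at least
# one "any" term and none of the "none" terms.
RULES = {
    "politics": {"any": {"election", "senate", "president"}, "none": {"garden"}},
    "gardening": {"any": {"rhododendron", "pruning", "garden"}, "none": set()},
}


def tokenize(text):
    """Lower-case the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def categorize(text, rules=RULES):
    """Return every category whose rule the text satisfies, in rule order."""
    words = tokenize(text)
    matched = []
    for category, rule in rules.items():
        has_required = bool(words & rule["any"])
        has_excluded = bool(words & rule["none"])
        if has_required and not has_excluded:
            matched.append(category)
    return matched


if __name__ == "__main__":
    print(categorize("The president met the Senate"))
    print(categorize("The president toured the garden"))
```

Because the rules are written down explicitly, a person can read them and see why an item landed in a category, which is the consistency that a shared rule set is meant to give human indexers as well.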
Automatic categorization is extremely fast, once it has been implemented. It is indispensable for handling real-time indexing of news feeds, for clustering result sets on the fly or for classifying large numbers of new e-mail messages. Implementation, depending on the software, can take hours, days, weeks or months. Many systems require an underlying taxonomy or ontology plus rules to create the categories before the system takes over. Some automatic systems don't require a taxonomy, and in others, a taxonomy is optional. Most automatic systems today can import a taxonomy in order to tailor the results to the contents of the collection.
Hybrid systems categorize automatically, but include humans in the process, either to write rules or to select training sets for machine learning-based systems. People are also used to approve or override categories that are assigned by the system, and to monitor the accuracy of the process. Because "topic drift" as well as new terminology may arise from time to time, some oversight of any system is probably a good idea.
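One way to picture the approve-or-override step is a simple review queue. The sketch below assumes that the automatic categorizer attaches a confidence score to each assignment, which not every system does; the cutoff value and the names `REVIEW_THRESHOLD`, `route` and `final_category` are hypothetical.

```python
REVIEW_THRESHOLD = 0.8  # assumed cutoff; a real deployment would tune this


def route(assignments, threshold=REVIEW_THRESHOLD):
    """Split system assignments into auto-accepted items and items for human review."""
    accepted, needs_review = [], []
    for item in assignments:
        if item["confidence"] >= threshold:
            accepted.append(item)
        else:
            needs_review.append(item)
    return accepted, needs_review


def final_category(item, human_category=None):
    """Use the reviewer's category when one is given, otherwise keep the system's."""
    return human_category if human_category is not None else item["category"]


if __name__ == "__main__":
    # Hypothetical output from an automatic categorizer.
    assignments = [
        {"doc": "Senate vote delayed", "category": "politics", "confidence": 0.95},
        {"doc": "Bush fire season", "category": "politics", "confidence": 0.40},
    ]
    accepted, needs_review = route(assignments)
    print([a["doc"] for a in accepted])
    for item in needs_review:
        # A reviewer overrides the uncertain assignment.
        print(item["doc"], "->", final_category(item, human_category="environment"))
```

Logging the overrides over time also gives the people monitoring the system a rough signal of topic drift: a rising override rate for one category suggests its rules or training set need attention.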
Technologies for automatic or hybrid categorization
Some of the technologies now used in categorization applications include:
- Clustering, which groups similar items together based on a statistical measurement of similarity. Search engines use those same measurements to establish the similarity of a document to a query. The trick in a good clustering engine is to first choose the right number of clusters into which to separate the documents in a collection, and second, to decide how to label each cluster automatically so that the labels differentiate among the clusters clearly and make sense to the searcher. And that is no mean trick. Vivisimo is an example of a clustering engine. Clustering engines require the smallest amount of preparation and pre-processing time, and can be up and running in hours. They do not show relationships among topics, as a taxonomy-based system would.
- Rule writing is generally a manual process, although some taxonomy tools can aid in developing rules that are clear and don't conflict with each other. Rules usually look like a kind of rich Boolean query. The process is time-consuming, but the result usually makes sense to a person. Changing the rules i