

Taxonomies, Categorization and Organizational Agility

Leveraging unstructured information is a chronic challenge for companies competing in today’s economy. Product development, sales and marketing, and executive planning and decision-making all depend upon information that resides within corporate documents. However, unstructured data (the 85% of corporate content that doesn’t fit neatly into relational database tables) is very difficult to manage. Critical information is usually scattered among documents, emails, and Web pages. This information resides in applications that use different management systems, ranging from the file system to high-end content management systems. It remains a notorious fact of corporate life that these systems cannot easily exchange information, nor do they give users an easy way to explore and navigate documents from multiple sources. As a result, decision makers remain largely unable to leverage unstructured data to gain valuable business insights.

Companies need a platform to establish a shared vocabulary across disparate sources of unstructured information. In today’s economy, where execution is key to competitive advantage, the ability to exploit critical insights based upon the flow of unstructured data is essential to an organization’s ability to act quickly and efficiently. To succeed, organizations must be able to de-position a competitive threat sooner rather than later, as well as identify and develop new revenue opportunities. If a company cannot provide a transparent view of its unstructured data, employees will not be able to consistently locate or share documents, which significantly hinders their ability to act effectively.

Software to organize and manage unstructured information is available today to address these business issues. The software must automatically create and extend a taxonomy of business topics, classify the business information accurately, and provide human oversight to easily review and modify the taxonomy and ongoing document classifications.

Enterprisewide Taxonomies

An enterprisewide taxonomy provides a shared vocabulary for companies to classify and organize documents. A taxonomy consists of a hierarchically organized set of topics that companies use to share information, and allows users to easily locate pertinent documents. Categorization software uses taxonomies to consistently classify new documents into appropriate topics that users can browse or query. However, creating a taxonomy that accurately reflects corporate business practices and adds value directly to business processes is a major challenge for many companies.
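As a rough illustration of the structure described above, a taxonomy can be modeled as a tree of topics, each holding the documents classified into it. The Python sketch below is an assumption made for illustration only: the class name Topic, its methods, and the sample topics and document ids are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """One node in a hierarchical taxonomy (hypothetical structure)."""
    name: str
    children: dict = field(default_factory=dict)   # child name -> Topic
    documents: list = field(default_factory=list)  # ids of classified documents

    def add_path(self, path):
        """Create (or reuse) the chain of subtopics named in `path`; return the last one."""
        node = self
        for name in path:
            node = node.children.setdefault(name, Topic(name))
        return node

    def find(self, path):
        """Return the topic at `path`, or None if it does not exist."""
        node = self
        for name in path:
            node = node.children.get(name)
            if node is None:
                return None
        return node

    def all_documents(self):
        """Documents in this topic and in every topic beneath it."""
        docs = list(self.documents)
        for child in self.children.values():
            docs.extend(child.all_documents())
        return docs


# Sample topics and document ids are invented for the example.
root = Topic("Enterprise")
root.add_path(["Exploration", "Seismic Surveys"]).documents.append("doc-001")
root.add_path(["Exploration", "Drilling"]).documents.append("doc-002")
root.add_path(["Marketing"]).documents.append("doc-003")

# Browsing a broad topic returns everything classified beneath it.
exploration_docs = root.find(["Exploration"]).all_documents()
```

Browsing a general topic gathers the documents of all its subtopics, which is what lets users start broad and narrow down.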

The task of creating a taxonomy can be daunting. Whereas Web sites are often organized using at most a few hundred topics, enterprise taxonomies often contain 5,000 to 15,000 nodes, as we have seen with our customers in the oil and gas, telecommunication, and news aggregation industries. Developing such taxonomies manually is an extremely labor-intensive effort. Stratify has encountered situations where companies developing taxonomies have averaged three person-days per topic for several thousand topics. In addition, in the absence of a neutral analysis of a document corpus, manual taxonomy development can easily run afoul of internal political agendas.

There are several ways in which taxonomy and categorization software can simplify the process of taxonomy development:

First, the software can provide an environment that enables users to manually create and manage taxonomies using a range of editing capabilities;

Second, it can leverage previous taxonomy investments, as well as industry standard taxonomies, whether implemented on the file system, as a Web sitemap, or in XML, by directly importing them;

Third, it can automatically create topics (linguistically known as concept clusters), which an information manager can modify as needed;

Fourth, it can automatically create the hierarchical taxonomy relationships between topics.

Each of these approaches has its merits, but for companies with complex document collections requiring in-depth taxonomies for their organization, creating a taxonomy automatically can save significant time and resources.
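To make the second approach (importing an existing taxonomy) concrete, here is a minimal Python sketch that reads a taxonomy from XML and lists each topic with its full path. The XML layout, with nested topic elements carrying a name attribute, is a hypothetical format chosen for the example; real industry taxonomies use their own schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: nested <topic name="..."> elements.
SAMPLE_XML = """
<taxonomy>
  <topic name="Sports">
    <topic name="Basketball"/>
    <topic name="Baseball"/>
  </topic>
  <topic name="Finance"/>
</taxonomy>
"""


def topic_paths(element, prefix=()):
    """Yield one tuple per topic, giving its full path from the root."""
    for child in element.findall("topic"):
        path = prefix + (child.get("name"),)
        yield path
        yield from topic_paths(child, path)


imported = list(topic_paths(ET.fromstring(SAMPLE_XML.strip())))
for path in imported:
    print(" / ".join(path))
```

The same traversal idea would apply to a directory tree or a Web sitemap, with the folder or page hierarchy standing in for the nested elements.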

There are several software approaches for creating topics. Keywords can be used to define topics, while statistical algorithms can be used to look for broad statistical similarities among documents and form groups (or clusters) of documents that share many significant words or phrases. Each cluster then becomes a sample set of representative documents for a given topic, defining it through example. In Stratify’s experience, the statistical approach yields more malleable results, since it can better differentiate between general topics and their more specific, granular subtopics, for example distinguishing “sports” from “basketball” and “baseball.” Although there are various statistical algorithms that can be used, the end result is the creation of topics with associated training documents.
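The statistical idea of grouping documents that share many words can be sketched in a few lines of Python. This is a deliberately simple stand-in (bag-of-words vectors, cosine similarity, and a greedy single-pass grouping), not the algorithm any vendor uses; the function names, the threshold value, and the sample documents are all assumptions. It forms flat clusters only and does not attempt to separate general topics from their subtopics.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts, lowercased and split on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single pass: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc ids]}
    for doc_id, text in docs.items():
        vec = vectorize(text)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(doc_id)
            best["centroid"].update(vec)  # summed counts act as the centroid
        else:
            clusters.append({"centroid": Counter(vec), "members": [doc_id]})
    return [c["members"] for c in clusters]


# Invented sample documents.
docs = {
    "d1": "basketball playoffs team wins game",
    "d2": "baseball team wins game in extra innings",
    "d3": "quarterly earnings revenue growth report",
    "d4": "revenue growth beats earnings forecast",
}
topics = cluster(docs)
```

Each resulting group is a candidate topic whose members serve as its training documents, which an information manager could then name, split, or merge.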

Categorization Methodologies

The real strength of a taxonomy is its ability to organize unstructured information and provide a vocabulary for business users to locate and share information in their daily tasks. Categorization software classifies documents into a taxonomy based on classification models. There are several algorithms that can be used for this task. Among them are:

Statistical classifiers, which provide accurate classification for broad and abstract concepts based on robust training sets;

Keyword classifiers, which excel at defining granular topics within broad statistically generated topics;

Source classifiers, which leverage pre-existing definitions from sources; for example, documents from a Web site or news service that are already categorized;

Boolean classifiers, which can accommodate complex classification rules, as well as many legacy information systems.
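The keyword, source, and Boolean classifiers listed above are simple enough to sketch directly. The Python functions below are illustrative assumptions only: the function names, the rule format (all_of, any_of, none_of), and the sample document and source label are hypothetical, and whole-word matching on whitespace-separated tokens is a simplification.

```python
def keyword_classifier(text, keywords):
    """Match if any keyword occurs as a word in the text."""
    words = set(text.lower().split())
    return any(k in words for k in keywords)


def source_classifier(metadata, source_topics):
    """Reuse the topic a source already assigned; None if the source is unknown."""
    return source_topics.get(metadata.get("source"))


def boolean_classifier(text, all_of=(), any_of=(), none_of=()):
    """Match a rule of the form: ALL of these AND ANY of these AND NONE of these."""
    words = set(text.lower().split())
    return (all(t in words for t in all_of)
            and (not any_of or any(t in words for t in any_of))
            and not any(t in words for t in none_of))


# Invented sample document and source mapping.
doc = "Offshore drilling permit approved for the new gas field"

is_drilling = keyword_classifier(doc, {"drilling", "rig"})
inherited = source_classifier({"source": "newswire/energy"},
                              {"newswire/energy": "Energy News"})
is_gas_permit = boolean_classifier(doc, all_of=("permit",),
                                   any_of=("gas", "oil"),
                                   none_of=("pipeline",))
```

A statistical classifier differs from all three in that its rules are learned from training documents rather than written by hand.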

Each of these classifiers provides optimum results for different use cases, all of which occur in a normal document corpus and taxonomy. Often it makes sense to create rules between classifiers as to when and how they should operate. In many situations the optimal use of keyword and Boolean classifiers is to provide more granular topic distinctions under broad-based, statistically derived topics. In this case, it can be necessary not only to create a hierarchical relationship between topics, but also to create an explicit rule regarding how and when a classifier should operate based upon an earlier classification or condition being satisfied. Stratify’s research indicates that a parallel classification architecture utilizing multiple classifiers provides the best results.
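The rule described above, in which a keyword classifier operates only after a broad statistical classification has succeeded, might look like the following Python sketch. The statistical step here is a toy term-overlap score, used as a stand-in for a real statistical classifier; the training examples, keyword rules, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical training examples for two broad topics.
TRAINING = {
    "Sports": ["team wins game season", "players coach game score"],
    "Finance": ["revenue earnings growth market", "shares market investors earnings"],
}
# Keyword rules that apply only beneath a given broad topic.
SUBTOPIC_KEYWORDS = {
    "Sports": {"Basketball": {"basketball", "dunk"},
               "Baseball": {"baseball", "innings"}},
}

# "Train" by summing term counts per topic.
PROFILES = {topic: Counter(" ".join(examples).split())
            for topic, examples in TRAINING.items()}


def broad_topic(text):
    """Statistical step: pick the topic whose term profile overlaps the text most."""
    words = Counter(text.lower().split())
    scores = {t: sum(words[w] * p[w] for w in words) for t, p in PROFILES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def classify(text):
    """Run the keyword step only once a broad topic has been assigned."""
    topic = broad_topic(text)
    if topic is None:
        return []
    path = [topic]
    words = set(text.lower().split())
    for subtopic, keywords in SUBTOPIC_KEYWORDS.get(topic, {}).items():
        if words & keywords:
            path.append(subtopic)
            break
    return path


baseball_path = classify("The team wins the game in extra innings")
finance_path = classify("Investors cheer strong earnings growth")
unmatched = classify("Weather forecast for tomorrow")
```

Because the keyword rules are keyed by parent topic, a word like “innings” only triggers the Baseball subtopic for documents already recognized as sports content.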

Taxonomies can help companies organize disparate intranets, as well as informational resources within enterprise portals. Likewise, taxonomies are critical tools for new product development and competitive intelligence, providing the real-time access to organized information required to respond to changing market conditions. There are many ways users leverage taxonomy and categorization software. Examples are:

1. Related documents. Users can easily access topics and information related to a given document. They can also discover new relationships between topics not necessarily apparent from the content of a single document.

2. Real-time classification. The software can classify documents or textual fragments in real time. This provides users with recommended classifications, insights into the recommended classifications, and the ability to manually revise them within an integrated publishing workflow process.

3. Personal agents. Agents can be easily constructed to notify users of new documents as they are classified into particular topics, as well as changes to related documents or topics. These notices can be delivered in any medium, such as email or wireless devices.

4. Categorized search. Taxonomies can be transparently integrated with search engines to deliver categorized search, which organizes search results by topic and provides direct taxonomy navigation.
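As an illustration of categorized search, the Python sketch below groups matching documents by the topic they were classified into instead of returning one flat list. The in-memory index, the document titles, and the simple word-match search are assumptions made for the example; a real deployment would sit on top of a search engine.

```python
from collections import defaultdict

# Hypothetical index: document id -> (title, topic path assigned by the classifier).
INDEX = {
    "d1": ("Gas field permit approved", ("Energy", "Exploration")),
    "d2": ("Gas prices rise in Europe", ("Energy", "Markets")),
    "d3": ("New permit rules for drilling", ("Energy", "Exploration")),
    "d4": ("Quarterly earnings report", ("Finance",)),
}


def categorized_search(query):
    """Return matching titles grouped by topic path rather than as a flat list."""
    terms = set(query.lower().split())
    grouped = defaultdict(list)
    for doc_id, (title, topic) in INDEX.items():
        if terms & set(title.lower().split()):
            grouped[" / ".join(topic)].append(title)
    return dict(grouped)


results = categorized_search("gas permit")
for topic, titles in results.items():
    print(topic)
    for title in titles:
        print("   ", title)
```

Grouping by topic lets a user see at a glance that the query spans both exploration and market documents, then navigate into either branch.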

Although difficult, creating a taxonomy is just the first, albeit crucial, step to leveraging informational resources in support of organizational agility. The ongoing challenge that an enterprise must overcome is how to keep the taxonomy accurate and up-to-date. Because taxonomies exist within dynamic organizational and market environments, they must constantly change to accurately reflect the state of informational resources as well as organizational imperatives.

Taxonomies in Dynamic Environments

Every business encounters situations of rapid, structural changes to content domains (resulting from an acquisition or a new CEO initiative), rather than an incremental evolution based upon normal business activities. In such situations, subject matter experts (SMEs) are hard pressed to respond quickly and accurately to realign a company’s informational resources in support of newly defined business objectives. To successfully respond to such situations, SMEs require automated support from taxonomy and categorization software.

The need to refine and enhance taxonomies and their associated classification models occurs in several areas, including:

  • Adding new topics to capture changing relationships between informational resources being classified, reflecting new subject domains that the taxonomy must accommodate;
  • Optimizing the taxonomy structure to more accurately reflect both the informational content as well as organizational requirements;
  • Deleting and/or aggregating topics that are no longer of value;
  • Increasing topic coherency and optimizing statistical training sets to maintain or enhance classification accuracy as content changes over time.
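The add, delete, and aggregate operations in the list above can be sketched as small functions over a taxonomy stored as a mapping from topic path to document ids. This Python sketch is an assumption for illustration: the flat mapping, the function names, and the sample topics are hypothetical, and it does not handle deleting a topic that still has subtopics.

```python
def add_topic(taxonomy, path):
    """Add an empty topic; existing topics are left untouched."""
    taxonomy.setdefault(path, set())


def merge_topics(taxonomy, sources, target):
    """Aggregate several topics into one, moving their documents to `target`."""
    merged = taxonomy.setdefault(target, set())
    for path in sources:
        if path != target:
            merged |= taxonomy.pop(path, set())


def delete_topic(taxonomy, path, fallback):
    """Delete a topic, reassigning its documents to `fallback` so none are lost."""
    orphans = taxonomy.pop(path, set())
    taxonomy.setdefault(fallback, set()).update(orphans)


# Invented sample taxonomy: topic path -> ids of classified documents.
taxonomy = {
    ("Products",): set(),
    ("Products", "Pagers"): {"d1"},
    ("Products", "Fax"): {"d2"},
    ("Products", "Mobile"): {"d3"},
}

add_topic(taxonomy, ("Products", "Broadband"))
merge_topics(taxonomy,
             [("Products", "Pagers"), ("Products", "Fax")],
             ("Products", "Legacy Devices"))
delete_topic(taxonomy, ("Products", "Mobile"), ("Products",))
```

Reassigning documents on delete and merge keeps every document reachable, so restructuring the taxonomy never silently drops content.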

Automated taxonomy software enables SMEs to use different taxonomies and document corpora to enhance a taxonomy. SMEs can compare the results of an automatic taxonomy revision with the original taxonomy, analyze how and why the results were achieved, and quickly decide whether to accept the results or amend them in any way.
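Comparing a revised taxonomy with the original amounts to a structured diff. The Python sketch below reports which topics were added or removed and which documents changed topic. It assumes, for simplicity, that each document lives in exactly one topic; the function name and the sample data are hypothetical.

```python
def compare_taxonomies(original, revised):
    """Summarize topic-level and document-level differences between two versions."""
    added = sorted(set(revised) - set(original))
    removed = sorted(set(original) - set(revised))
    # Where each document lived before the revision (one topic per document assumed).
    old_home = {doc: topic for topic, docs in original.items() for doc in docs}
    moved = {}
    for topic, docs in revised.items():
        for doc in docs:
            if doc in old_home and old_home[doc] != topic:
                moved[doc] = (old_home[doc], topic)
    return {"added": added, "removed": removed, "moved": moved}


# Invented sample data: topic name -> ids of classified documents.
original = {"Sports": {"d1", "d2"}, "Finance": {"d3"}}
revised = {"Sports": {"d1"}, "Sports/Baseball": {"d2"}, "Finance": {"d3"}}

report = compare_taxonomies(original, revised)
```

A report of this kind gives a subject matter expert a short list of changes to accept or amend, rather than two complete taxonomies to read side by side.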

In many ways taxonomy and categorization software acknowledges and respects SMEs’ practical understanding of content domains. This is usually the result of intensive training and ongoing familiarity and interaction with the document corpus. It is through this interaction that SMEs come to gain practical familiarity with the domain, enabling them to “intuitively” know where to classify a document, as well as to know when a new topic is needed.

However, it becomes increasingly difficult to maintain this level of practical knowledge in the face of a high volume of documents, and next to impossible when confronted with entire new corpora that must be immediately integrated into the taxonomy. Likewise, no single expert can maintain an in-depth grasp of a taxonomy that numbers in the thousands of topics. In such situations SMEs need help to leverage their expertise to efficiently manage the taxonomy.

Automated taxonomy building software can operate on any topic, subtree, or complete taxonomy to optimize training sets as well as provide recommended taxonomy enhancements or revisions. These results empower SMEs to utilize their expertise to validate the results, and tune them as appropriate based upon knowledge inaccessible to the software. For instance, taxonomy managers at a major oil and gas company organize much of their information based on business processes, which are not explicitly mentioned within documents themselves. To leverage this capability, SMEs must be able to optimize the taxonomy while respecting its established, high-level organization.

With the support of taxonomy and categorization software that automatically refines taxonomies, organizations can quickly realign their informational resources to support new business objectives. Entirely new domains can be quickly added to the taxonomy, together with classification models. This provides users with immediate access to newly relevant topics and their associated documents. Personal agents can be devised that will immediately provide data on these new topics. These changes to the taxonomy can also be easily integrated into existing business applications. With these, and other capabilities, users can immediately begin to draw on new informational resources to achieve new business objectives, whether it be the swift integration of a newly acquired company or the pursuit of a new marketing or product initiative.

Summary

Competitive companies are increasingly using taxonomies to organize their documents to improve critical business functions. New advances in information theory have made it possible to generate and maintain taxonomies at a much lower cost. Best practices require automated taxonomy-building software with robust, workflow-based administration tools, accurate classification through parallel classifiers, and management tools to easily refine taxonomies even for the most dynamic corporations.


Stratify, Inc.: Discover More™

Stratify is the emerging leader in unstructured data management software. The Stratify Discovery System is a complete enterprise software platform that helps companies harness today’s vast corporate information overload by automating the process of organizing, classifying and presenting the business-critical, unstructured information that is usually found in documents, presentations and Web pages. Named as one of The Red Herring 100 for 2001, Stratify is headquartered in Mountain View, CA. For more information, please visit Stratify.
