

The Importance of Hierarchy Building in Managing Unstructured Data

Since the advent of the relational database, enterprises have been able to manage structured data effectively through the thousands of applications that enhance company operations. Unstructured data—the 85% of corporate content that doesn’t fit neatly into the rows and columns of a database—is far more difficult to manage, as it is scattered throughout documents, emails, and web pages. Applications remain largely unable to exploit unstructured data, depriving decision-makers of valuable business insights and wasting countless hours in fruitless searches.

Companies need a platform for managing unstructured data that will enable their employees to find and use textual information as easily as they can now call up last quarter’s revenues. Integrating such functionality within content management systems, enterprise portals, intranets, search engines and even business applications helps companies use their intellectual capital more effectively. Businesses can recoup their investments in these technologies by creating a consistent and customized organizational framework and automating the process of classifying unstructured data into this framework.

Importance of Hierarchies

Unstructured data, by definition, is not created within a pre-determined framework, such as a database schema. To organize unstructured content and ease its interpretation, an external framework must be applied retroactively. Classification, whether manual or automatic, is the process of applying the new framework to the data.

The role of technology in managing structured and unstructured data differs in one critical way. In both cases, technology needs to manage both the data and the frameworks that organize it. For structured data, the key job of technology is to store and retrieve the data safely and securely. Database schemas provide the necessary framework to achieve these goals. For unstructured data, the primary technological challenge is to extract and specify an organizational framework based on the patterns and relationships implicit within textual data.

Such a framework most often takes the form of a topic hierarchy, an organized set of topics relevant to the unstructured content. Defining such topic hierarchies is the essential first step in managing unstructured data. But it is a challenging step, for three main reasons. First, these hierarchies often need to be quite elaborate to be useful. Sorting hundreds of thousands of documents into a handful of topics doesn’t much help someone searching for the one key document among them. While useful hierarchies often consist of several hundred topics, we know of enterprises that have developed topic hierarchies with thousands of nodes. Second, these hierarchies must be closely tailored to the needs of the intended user. In the business world, this means that each enterprise must define a hierarchy customized for its activities and needs. Pre-defined hierarchies can serve as a useful starting point, especially if they are already customized for a given industry, but an enterprise must usually customize them further—the quality of the hierarchy largely determines the quality of the information delivered to end users. Third, defining a topic hierarchy manually is a complicated and highly subjective exercise, one which is difficult to coordinate across locations and departments within an enterprise.

For all these reasons, any technology vendor that aims to help companies manage unstructured information should help them build and maintain topic hierarchies. Unfortunately, the technology available from the 1990s has not addressed this issue for customers. A content syndication company told me of their struggles with a categorization solution produced by another vendor. It took eight of their employees working full time for four months to produce a hierarchy of 400 topics in the largely manual process required by that solution. Building large hierarchies manually can clearly become cost-prohibitive very quickly.
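The arithmetic behind that anecdote is worth making explicit. The sketch below uses only the figures quoted above (eight people, four months, 400 topics). The 2,000-topic extrapolation is an illustrative assumption of linear scaling, not a figure reported by the customer or the vendor.

```python
# Figures quoted in the anecdote above.
EMPLOYEES = 8
MONTHS = 4
TOPICS_BUILT = 400

person_months = EMPLOYEES * MONTHS  # total effort spent
topics_per_person_month = TOPICS_BUILT / person_months


def estimated_person_months(topic_count: int) -> float:
    """Effort for a hierarchy of `topic_count` topics.

    Assumption (not from the source): effort scales linearly with the
    number of topics. Real projects may scale better or worse.
    """
    return topic_count / topics_per_person_month


print(person_months)                  # 32
print(topics_per_person_month)        # 12.5
print(estimated_person_months(2000))  # 160.0
```

Under that linear assumption, a hierarchy of a few thousand topics would consume well over a hundred person-months of manual work.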

Automatic Hierarchy Generation

Building a hierarchy involves both defining a set of topics and organizing them into a hierarchical tree. Defining topics requires communicating their meaning to whatever person or technology will classify documents into them. Simple text descriptions allow human editors to understand what a given topic is about, but automatic classification systems require more specialized topic definitions. Keyword-based systems define topics with sets of keyword queries. Statistical classification systems use sets of example documents to define topics. Because building keyword queries or collecting sets of example documents can be labor intensive, any automatic classification system should help its administrators define topics and assemble them into a hierarchy.

Several techniques can generate hierarchies automatically, but each follows a similar pattern. First, a set of representative documents is collected. Using these documents, the automatic technique in question produces a list of significant topics and arranges them in relation to each other. Human administrators then adjust the resulting hierarchy manually.
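The difference between the two styles of topic definition mentioned above, keyword queries and example documents, can be sketched in a few lines. Everything here is hypothetical: the class names, the word-overlap measure and the 0.3 threshold are assumptions made for illustration, and a real statistical classifier would use a far richer model than word overlap.

```python
from dataclasses import dataclass


def words(text: str) -> set[str]:
    """Lowercase a text and split it into a set of words."""
    return set(text.lower().split())


@dataclass
class KeywordTopic:
    """Topic defined by a keyword query: match if any keyword appears."""
    name: str
    keywords: set[str]

    def matches(self, document: str) -> bool:
        return bool(self.keywords & words(document))


@dataclass
class ExampleTopic:
    """Topic defined by (non-empty) example documents instead of keywords."""
    name: str
    examples: list[str]
    min_overlap: float = 0.3  # hypothetical threshold, chosen for the demo

    def matches(self, document: str) -> bool:
        doc_words = words(document)
        # Jaccard overlap with the closest example document.
        best = max(
            len(doc_words & words(ex)) / len(doc_words | words(ex))
            for ex in self.examples
        )
        return best >= self.min_overlap


by_keyword = KeywordTopic("Football", {"football", "touchdown"})
by_example = ExampleTopic("Football", ["the quarterback threw a touchdown pass"])

doc = "the quarterback threw a long pass"
print(by_keyword.matches(doc))  # False: no keyword appears in the document
print(by_example.matches(doc))  # True: the document resembles the example
```

The keyword definition misses a document that never uses one of its query terms, while the example-based definition catches it by resemblance. Either way, someone has to write the queries or collect the examples, which is the labor the paragraph above refers to.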

Each of these steps is important. The more accurately the sample documents represent those that the classification system will handle in practice, the better the resulting hierarchy will be. While the quality of the approach used to generate the hierarchy is obviously important, no automatically generated hierarchy will be perfect, and human administrators must have the ability to revise the suggested hierarchy. In practice, this means a rich set of tools must support the hierarchy editing process, helping administrators to both diagnose problems and solve them.

Keyword identification—One approach to automatic topic identification is to look for significant words or phrases among the input documents. Each word or phrase becomes a topic in the resulting hierarchy. One advantage of this approach is that the words or phrases can be used as the titles of the topics. On the other hand, it often has trouble distinguishing distinctive words, such as "football," from less distinctive words like "play." It also struggles to identify broader topics not easily described by single words or phrases. For example, identifying “sports” as a topic is very difficult for this approach, as documents about sports almost never contain the word “sports.” This approach also struggles to organize topics into a hierarchy, as it is hard to define the relationships between single words or phrases non-arbitrarily.
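One way to picture how a system might separate distinctive words from common ones is the sketch below. It assumes a TF-IDF style score (total frequency multiplied by the log of inverse document frequency). The article does not say how any particular product measures significance, so treat the scoring formula, the function name and the sample documents as assumptions.

```python
import math
from collections import Counter


def significant_terms(documents: list[str], top_n: int = 3) -> list[str]:
    """Rank words by total frequency times inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    term_freq = Counter(w for doc in tokenized for w in doc)
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    n_docs = len(tokenized)
    scores = {
        w: term_freq[w] * math.log(n_docs / doc_freq[w]) for w in term_freq
    }
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


docs = [
    "football fans play football",
    "children play chess",
    "actors play roles",
    "football coaches play favorites",
]
print(significant_terms(docs, top_n=1))  # ['football']
```

Because "play" occurs in every sample document, its inverse document frequency is zero and it drops to the bottom of the ranking, while "football" rises to the top. Note that nothing in this score can propose a topic such as "sports" when the word itself never appears, which is the limitation described above.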

Clustering—Instead of identifying individual words or phrases, clustering looks for broad statistical similarities among documents and forms groups, or clusters, of documents that share many significant words or phrases. Each cluster then becomes a sample set of representative documents for a given topic, defining it through example. Clustering is an effective tool for organizing a large number of documents, especially when each document is about a single topic.
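The notion of "statistical similarity" can be made concrete with a small sketch. It assumes the common bag-of-words model with cosine similarity. The article does not specify which measure a given clustering product uses, so this is one plausible stand-in rather than a description of any vendor's method.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


docs = {
    "d1": "quarterback throws touchdown pass",
    "d2": "touchdown pass wins game",
    "d3": "central bank raises interest rates",
}
vectors = {name: Counter(text.split()) for name, text in docs.items()}

print(cosine(vectors["d1"], vectors["d2"]))  # 0.5
print(cosine(vectors["d1"], vectors["d3"]))  # 0.0
```

The first two documents would land in the same cluster because they share vocabulary, even though neither contains the word "football" or "sports". The cluster defines the topic by example rather than by name.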

Clustering methods can produce a topic hierarchy. The clustering algorithm can build a hierarchy simply by observing and making explicit the hierarchical relationships between larger and smaller clusters. This process can proceed in two different ways. A top-down approach divides the sample set of documents into smaller and smaller groups until the lowest-level topics are defined, while a bottom-up approach aggregates small clusters into larger and larger groups until the highest-level topics are defined.
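The bottom-up variant is the easier of the two to sketch. The code below is a minimal illustration that assumes Jaccard word overlap as the similarity measure and single linkage (the closest pair of members) as the merge rule. Both choices, and the sample documents, are assumptions made for the example, not details taken from any product.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Shared words divided by all words in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bottom_up(docs: dict[str, str]):
    """Merge the two most similar clusters until one tree remains."""
    words = {name: set(text.split()) for name, text in docs.items()}
    # Each cluster is (tree, member names); a leaf's tree is its name.
    clusters = [(name, [name]) for name in docs]

    def similarity(c1, c2) -> float:
        # Single linkage: similarity of the closest pair of members.
        return max(jaccard(words[a], words[b]) for a in c1[1] for b in c2[1])

    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: similarity(clusters[p[0]], clusters[p[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


docs = {
    "a": "touchdown quarterback pass",
    "b": "touchdown quarterback pass field",
    "c": "interest rates bank",
    "d": "interest rates bond",
}
print(bottom_up(docs))  # (('a', 'b'), ('c', 'd'))
```

The nested tuples are the hierarchy: each merge becomes a parent node over the two clusters it joined. A top-down method would produce the same kind of tree by starting from the full set and splitting it repeatedly.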

The two approaches to clustering documents have different strengths and weaknesses. The top-down approach is better at producing crisp top-level distinctions, but it can miss important low-level topics. A mistake in allocating documents at a high level in the topic hierarchy can split up documents that might otherwise form a very coherent lower-level group. Conversely, the bottom-up approach to clustering is better at defining all the significant low-level topics, but it can produce obscure high-level topics.

No matter which approach is used, clustering software typically adjusts the initial hierarchy by moving documents to different clusters to increase overall coherence. The Stratify Hierarchy Builder makes special efforts to produce coherent topics. If a sample document does not fit neatly into any existing topic and there are not enough similar documents to justify creating a new topic, it is designated as an outlier and not placed in any cluster. Conversely, if a given document has content related to two different topics, the hierarchy builder will place the document in both clusters.
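The two behaviors just described, leaving outliers unassigned and placing a document in more than one cluster, can be sketched with a simple threshold rule. This is not the Stratify Hierarchy Builder's actual algorithm, which the article does not detail. The overlap measure, the 0.25 threshold and the function name are all assumptions, and the sketch omits the check for whether enough similar outliers exist to justify a new topic.

```python
def assign(doc_words: set[str], topics: dict[str, set[str]],
           threshold: float = 0.25) -> list[str]:
    """Return every topic whose word overlap with the document meets the
    threshold. An empty list marks the document as an outlier."""
    if not doc_words:
        return []
    matches = []
    for name, topic_words in topics.items():
        # Fraction of the document's words that belong to the topic.
        overlap = len(doc_words & topic_words) / len(doc_words)
        if overlap >= threshold:
            matches.append(name)
    return matches


topics = {
    "sports": {"game", "team", "score", "coach"},
    "finance": {"bank", "rates", "bond", "profit"},
}

print(assign({"team", "coach", "wins", "game"}, topics))       # ['sports']
print(assign({"team", "profit", "bank", "game"}, topics))      # ['sports', 'finance']
print(assign({"recipe", "garlic", "pasta", "sauce"}, topics))  # []
```

Allowing zero or several assignments keeps each cluster coherent: a document is never forced into a topic it only weakly resembles.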

Human Control

No automatic hierarchy-building technology is perfect, and there is no current substitute for the judgment of a person familiar with the information needs of an enterprise. This means that human administrators need to be able to review and alter an automatically generated topic hierarchy. In this way, the speed of machine computation can combine with human judgment to produce a customized topic hierarchy quickly.

At Stratify, we chose to pursue a bottom-up approach to clustering precisely because it is more amenable to human review. Bottom-up clustering does a better job of capturing all significant groupings of documents. As a result, tuning a hierarchy generated by bottom-up clustering most often means adjusting the position of an already well-defined topic in a hierarchy. This is a far easier task than constructing a missing topic from documents scattered among other clusters, as top-down clustering approaches require.
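"Adjusting the position of an already well-defined topic" amounts to re-parenting a node in a tree. The sketch below shows that operation on a hypothetical `TopicTree` class. The class, its methods and the sample topics are invented for illustration and do not describe Stratify's editing tools.

```python
from typing import Optional


class TopicTree:
    """Topic hierarchy stored as a child -> parent mapping."""

    def __init__(self, root: str):
        self.root = root
        self.parent: dict[str, Optional[str]] = {root: None}

    def add(self, topic: str, parent: str) -> None:
        if parent not in self.parent:
            raise KeyError(f"unknown parent topic: {parent}")
        self.parent[topic] = parent

    def ancestors(self, topic: str) -> list[str]:
        """Ancestors of `topic`, nearest first."""
        chain = []
        node = self.parent[topic]
        while node is not None:
            chain.append(node)
            node = self.parent[node]
        return chain

    def move(self, topic: str, new_parent: str) -> None:
        """Re-parent a topic; its whole subtree moves with it."""
        if topic == self.root:
            raise ValueError("cannot move the root topic")
        if new_parent == topic or topic in self.ancestors(new_parent):
            raise ValueError("cannot move a topic beneath itself")
        self.parent[topic] = new_parent


tree = TopicTree("All")
tree.add("Sports", "All")
tree.add("Finance", "All")
tree.add("Stadium Bonds", "Sports")  # plausible machine placement

tree.move("Stadium Bonds", "Finance")  # the administrator's correction
print(tree.ancestors("Stadium Bonds"))  # ['Finance', 'All']
```

The move is a single pointer change because the topic already exists as a coherent cluster. Rebuilding a topic that was never formed would instead require finding its documents across many other clusters.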

Meaningful human review requires a set of software tools with at least three capabilities. First, the tools must allow administrators to manipulate both the topics within the hierarchy and the individual documents within topic clusters. Second, the tools should provide insight into why the clustering software acted as it did, so that it can be adjusted accordingly. Third, the tools should allow administrators to control how many topics are automatically generated. The right number of topics is a question of goals, not fact. In examining a wide variety of document collections and both human- and machine-generated hierarchies, Stratify researchers have found that a document corpus can always be meaningfully subdivided further, until only two or three closely related documents remain in each cluster. The needs of a business, not the structure innate in its documents, must therefore determine the right number of topics.
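The third capability, controlling how many topics are generated, can be pictured as choosing where to cut a cluster tree. The sketch below assumes the tree is a binary merge tree written as nested tuples with document names at the leaves. That representation, the function names and the rule of always splitting the largest group are assumptions made for the example.

```python
def leaves(tree) -> list[str]:
    """All document names under a node of the merge tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def cut(tree, n_topics: int) -> list[list[str]]:
    """Split a binary merge tree into `n_topics` groups of documents
    (fewer if the tree runs out of nodes to split)."""
    groups = [tree]
    while len(groups) < n_topics:
        # Only internal nodes (tuples) can be split further.
        splittable = [g for g in groups if not isinstance(g, str)]
        if not splittable:
            break
        biggest = max(splittable, key=lambda g: len(leaves(g)))
        groups.remove(biggest)
        groups.extend(biggest)  # replace the node with its children
    return [leaves(g) for g in groups]


tree = ((("a", "b"), "c"), ("d", "e"))

print(cut(tree, 2))  # [['a', 'b', 'c'], ['d', 'e']]
print(cut(tree, 3))  # [['d', 'e'], ['a', 'b'], ['c']]
```

The same tree yields two broad topics or three narrower ones depending only on the number the administrator asks for. Nothing in the documents themselves says which cut is correct.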

Effective management of unstructured data—and the effective deployment of content management systems, intranets, and search engines—requires a well-designed topic hierarchy. Such a hierarchy enables documents and data feeds to be classified in real time to allow employees to find the information they need to do their jobs. But the sheer variety of the information needs of a large company means that building a functioning topic hierarchy for an automatic classification system is often a labor-intensive and expensive process. Any unstructured data management solution must therefore largely automate the hierarchy building process. Neither clustering nor keyword identification can fully automate this process, so human editors must be able to refine an automatically generated hierarchy through a set of software tools. Solutions that automate repetitive tasks and allow effective human review can minimize the work of editors. In the end, the quality of its topic hierarchy determines how successfully a company can manage unstructured data.

Stratify (formerly PurpleYogi) is the emerging leader in unstructured data management software. The Stratify Discovery System is a complete enterprise software platform that helps companies automate the process of organizing, classifying and presenting the business-critical, unstructured information that is usually found in documents, presentations and Web pages. By structuring previously difficult-to-organize information, Stratify software technology increases the value of existing corporate applications such as enterprise search engines, news aggregation services, customer relationship management (CRM), sales force automation tools (SFA), content management software, document management systems and corporate portals in various industries.


Founded in September 1999, Stratify is a privately held company that has received funding from Mobius Venture Capital (formerly Softbank Venture Capital), In-Q-Tel, Intel Capital, H&Q Asia Pacific/At India and Skyblaze Ventures LLC. Named as one of The Red Herring 100 for 2001, Stratify is headquartered in Mountain View, CA. For more information about Stratify, please visit Stratify.

Stratify, Inc.—Discover More™
