Text mining's next step
In a move that should help jump-start the emerging text mining industry, IBM (www.ibm.com) announced in August the release of an impressive set of infrastructure components known collectively as the Unstructured Information Management Architecture, or UIMA ("you-EE-ma").
UIMA is not (as some media outlets reported) a search engine. Rather, it is a framework for plugging in and extending third-party text analysis tools as part of more sophisticated applications. UIMA provides a set of common services to those packages, and more importantly, enables them to interoperate concurrently on the same content items.
Technically, UIMA (www.research.ibm.com/uima) consists of a software development kit (SDK), documentation, and a reference implementation. IBM announced that the core architecture--which is being advanced in cooperation with the Defense Advanced Research Projects Agency (DARPA, www.darpa.mil) and other research institutions--would be made available as open source later this year.
Using UIMA's Java APIs, other software companies or enterprises can develop specific, interoperable text analytics packages. Explains Jay Henderson, director of product marketing at text analytics vendor and IBM partner ClearForest (www.clearforest.com), "UIMA allows us to focus on product development and let the community focus on infrastructure."
ClearForest is not alone. IBM has made several of its tools UIMA-compliant, including its OmniFind search engine and other analytics packages. Other vendors--including iPhrase (www.iphrase.com), Cognos (www.cognos.com), Endeca (www.endeca.com), and Inxight (www.inxight.com)--are also making their products UIMA-compliant. In fact, UIMA represents a kind of coming-of-age for text mining, a growing industry that currently consists of a collection of often little-known tools. Those tools attempt, in different ways, to glean the same kind of intelligence from unstructured information that data mining packages yield from structured data.
UIMA implicitly recognizes that most of those tools work only for very limited use cases, and that comprehensive solutions typically require a mix of methodologies, from linguistic and semantic approaches to statistical and spatial analysis. DARPA's involvement makes sense: the U.S. intelligence community has fiddled extensively with all of these approaches, but has struggled at times to combine different solutions.
UIMA offers a pipeline-based processing engine that allows solution developers to string together different tools to analyze a given content item and generate one or more sets of results. For example, one product could perform linguistic analysis of a French-language document, while another tool simultaneously extracts concepts from an English-language translation.
Of course, setting up complex pipelines could require licensing multiple products and involve extensive customization. IBM clearly sees an important services play here, well beyond licensing OmniFind software. Enterprises interested in comprehensive text mining solutions should set budgets and expectations accordingly.
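For readers who want a concrete picture of the pipeline idea, here is a minimal Java sketch. It is not the UIMA SDK: the names `ContentItem`, `Analyzer`, `TokenCounter`, `AcronymExtractor` and `Pipeline` are all hypothetical, invented for illustration, and the two "tools" are deliberately trivial stand-ins for commercial analysis packages. The sketch also runs its stages one after another, whereas the architecture described above allows tools to work on the same item concurrently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PipelineSketch {

    /** A content item plus the annotations that analysis tools attach to it. */
    static final class ContentItem {
        private final String text;
        private final Map<String, List<String>> annotations = new LinkedHashMap<>();

        ContentItem(String text) { this.text = text; }

        String text() { return text; }

        void annotate(String type, String value) {
            annotations.computeIfAbsent(type, k -> new ArrayList<>()).add(value);
        }

        List<String> annotations(String type) {
            return annotations.getOrDefault(type, List.of());
        }
    }

    /** Hypothetical plug-in contract standing in for a third-party tool. */
    interface Analyzer {
        void analyze(ContentItem item);
    }

    /** Trivial "statistical" tool: counts whitespace-separated tokens. */
    static final class TokenCounter implements Analyzer {
        public void analyze(ContentItem item) {
            String trimmed = item.text().trim();
            int count = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            item.annotate("tokenCount", Integer.toString(count));
        }
    }

    /** Trivial "concept" tool: records all-uppercase words as acronyms. */
    static final class AcronymExtractor implements Analyzer {
        public void analyze(ContentItem item) {
            for (String word : item.text().split("\\W+")) {
                boolean allLetters = word.chars().allMatch(Character::isLetter);
                if (word.length() > 1 && allLetters && word.equals(word.toUpperCase())) {
                    item.annotate("acronym", word);
                }
            }
        }
    }

    /** Runs each stage in order over the same content item. */
    static final class Pipeline {
        private final List<Analyzer> stages;

        Pipeline(List<Analyzer> stages) { this.stages = List.copyOf(stages); }

        ContentItem process(String text) {
            ContentItem item = new ContentItem(text);
            for (Analyzer stage : stages) {
                stage.analyze(item);
            }
            return item;
        }
    }

    public static void main(String[] args) {
        Pipeline pipeline = new Pipeline(List.of(new TokenCounter(), new AcronymExtractor()));
        ContentItem item = pipeline.process("IBM announced UIMA with DARPA");
        System.out.println(item.annotations("tokenCount")); // prints [5]
        System.out.println(item.annotations("acronym"));    // prints [IBM, UIMA, DARPA]
    }
}
```

The point of the sketch is the shared `ContentItem`: because every tool reads from and writes to the same object, a developer can add, remove or reorder tools without changing any of them, which is roughly the kind of interoperability the framework is meant to provide.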
Tony Byrne is founder of CMS Watch (www.cmswatch.com), which publishes vendor-neutral technology comparison guides.