What You Know...and What You Don't
A Brief Foray Into Text Analytics As We Know It
Scads of words have been written about "enterprise search," "knowledge management," "information access," etc. In fact, I am responsible for a scad or two myself.
And, of course, it makes sense: When 90% of the information your company possesses is in the form of unstructured text files and email, plus more-or-less formal formats (contracts, PowerPoints, legal documents and marketing material, etc.), it’s painfully obvious that tools to access that content will emerge as key components of the knowledge-worker toolset.
But what hasn’t been covered quite as well are the text-mining and analytic tools that find content, and the many relationships between content objects. Those tools are not yet part of the average knowledge worker’s daily regimen.
The way it’s often been put is this: Search is useful when you know basically what it is you’re looking for. A specific email... a contract for a specific deal... the document I half-started last week but didn’t get around to finishing (that’s me!). You pretty much know what’s included in the object, so it’s possible to do a text search. Or you know about when it arrived, so you can do a date search. Or you remember who sent it to you, so you can do an author query. Those are simple examples, but you get the idea.
But text analytics does something else. Text analytics is used when you don’t know what you’re looking for. Or maybe better said, when you don’t even know there’s anything there in the first place.
I know... it sounds counter-intuitive. But I remember when then-Secretary of Defense Donald Rumsfeld was widely ridiculed for a statement that basically sums up knowledge management. While he might have been deserving of ridicule for many things, this statement was not one of them.
"... There are known knowns; these are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don’t know we don’t know."
Laugh all you want, but it was actually quite a brilliant observation. How much of the time do your knowledge workers perform under a cloak of "un-knowing what they don’t know they know?" In other words, how much information is available in the vast bags of words that make up your content repositories, if only you could connect the dots between seemingly unrelated elements to create an entirely new, relevant element of knowledge?
Getting to the Heart of Text
That’s what text analytics is all about. Or maybe you prefer "text mining" as the term of art. I don’t really care what it’s called. But it’s a powerful tool for everyone from marketing executives to customer relationship managers to CIA operatives. (Oops, maybe I wasn’t supposed to talk about that last one.)
I wanted to learn more about text analytics (and share it with you today), so I went straight to the horse’s mouth, as it were. Turns out a good buddy of mine and the magazine’s, Dr. Johannes (Jan) Scholtes, president and CEO of ZyLAB North America, is also the current "extraordinary chair on text mining of the department of knowledge engineering of the University of Maastricht, the Netherlands." I had no idea what that meant when I first heard of it, but I now have a little context. According to the website, "The core tasks of the chair of text mining will be education and research. Scholtes explained that ‘the chair will focus on the teaching of text mining methodologies for language-dependent feature selection and feature extraction, so that documents can be equipped with several additional entities, attributes, facts, events, feelings and relationships that can be thoroughly searched, visualized, analyzed and filtered by using advanced user interfaces.’"
Oooooookay. I guess this is another case where I can truly say "I don’t know what I don’t know." So I had a little conversation with Jan to see if we could shed some light on the business applicability of text analytics.
He did his best to explain it to me in small words: "Text mining refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining encompasses several computer science disciplines with a strong orientation toward artificial intelligence in general, including (but not limited to) pattern recognition, neural networks, natural language processing, information retrieval and machine learning," he began.
"An important difference with search is that search requires a user to know what he or she is looking for, while text mining attempts to discover information in a pattern that is not known beforehand," he explained.
"Text mining is particularly interesting in areas where users have to discover new information. This is the case, for example, in criminal investigations, legal discovery and due-diligence investigations. Such investigations require 100% recall," Jan said, "meaning users cannot afford to miss any relevant information. In contrast, a user searching the Internet for background information using a standard search engine simply requires any information (as opposed to all information) as long as it is reliable. In a due diligence, for example, a lawyer certainly wants to find all possible liabilities and is not interested in finding only the obvious ones."
The trade-off between recall (how much of the relevant material you actually find) and precision (how much of what you find is actually relevant) has been a frequent topic in these KMWorld White Papers. Nowhere is it more relevant than in the area of text analytics. "Increasing recall almost certainly will decrease precision," said Jan, "implying that users have to browse large collections of documents that may or may not be relevant. Standard approaches use language technology to increase precision, but when text collections are not in a single language, or are not domain-specific or contain variable size and type documents, existing methods either fail or are so sophisticated that the user does not comprehend what is happening and loses control. A different approach is to combine standard relevance-ranking with adaptive filtering and interactive visualization that is based on features (i.e. metadata elements) that have been extracted earlier."
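For readers who like to see the arithmetic, here is a minimal Python sketch of the trade-off Jan describes. It is only an illustration, not anything taken from ZyLAB or from Jan's research: the function name `precision_recall_at_k`, the document IDs and the set of "relevant" documents are all made up for the example. The idea is simply that the further down a ranked result list a reviewer reads, the more relevant documents get caught (recall rises), but the more irrelevant ones have to be waded through as well (precision falls).

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Return (precision, recall) for a reviewer who reads only the top k results.

    precision = relevant documents read / documents read
    recall    = relevant documents read / all relevant documents
    """
    retrieved = ranked_ids[:k]
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical data: ten documents in the order a search engine ranked them,
# and the four that a due-diligence lawyer would actually need to see.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d2", "d5", "d10"}

for k in (2, 5, 10):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"top {k:2d}: precision={p:.2f}  recall={r:.2f}")
```

In this toy case, reading only the top two results gives perfect precision but misses half of the relevant documents. Reaching 100% recall means reading all ten, six of which are noise. That is the browsing burden Jan is talking about.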