What to Expect When You’re Expecting Text Analytics: A Checklist
These days, most organizations realize there is plenty of hidden value in the unstructured data they handle every day. But there is still a fair amount of unfamiliarity about what text analytics actually involves, and the technology is often integrated into an existing workflow without looking at the entire picture.
We have come up with a checklist of five items that can help you frame the conversation for a successful text analytics project:
1. CONTENT. The first investigation should be about the nature of the documents we will work with, since content comes in all shapes and sizes. Typical questions are about two sides of the same coin: format and language. Are these documents plain text, or do they require some parsing or conversion? Maybe we’ll deal with PDF files containing scanned pages instead of digital text, which will require OCR technology, and OCR does not guarantee optimal results. As for the language: Is it clear, well-written text (like a news article), or is it full of abbreviations, informal terminology and emojis? Is it full of technical or legal jargon? Does it contain tables and images (as with an engineering research paper)? We should know everything about the content that will be analyzed, for three fundamental reasons: it gives us an understanding of the work that needs to be done, it provides us with a reality check about our project, and it ultimately defines the success of the final application.
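To make the format side of this concrete, here is a minimal sketch of a triage step that routes each document to the right extraction path before any analysis runs. The function name, file extensions and labels are illustrative assumptions, not a real product API:

```python
from pathlib import Path

def triage(path: str) -> str:
    """Decide how a document must be prepared before text analysis.

    Purely illustrative: real pipelines inspect the file contents
    (e.g., whether a PDF has an embedded text layer), not just the name.
    """
    suffix = Path(path).suffix.lower()
    if suffix in {".txt", ".md"}:
        return "plain-text"      # ready for analysis as-is
    if suffix in {".html", ".docx"}:
        return "needs-parsing"   # strip markup / convert first
    if suffix == ".pdf":
        # A scanned PDF has no digital text and needs OCR,
        # which does not guarantee optimal results.
        return "needs-ocr-check"
    return "unknown"

print(triage("report.pdf"))  # needs-ocr-check
```

Even a toy router like this forces the question the checklist item asks: what shapes and sizes of content will actually arrive, and what work does each shape require?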
2. FUNCTIONALITY. Extracting information from content is an activity performed through a series of tools; text analytics is not one process. Therefore, it’s very important to understand what the user experience we have in mind will require technically. Sometimes it’s a simple matter of classification, but in other cases multiple services should be blended (e.g., extracting entities along with the sentiment values expressed about them in a document). A few more examples: If our need is to abstract concepts from a text, then a semantic network might be required; if we want to recommend similar documents to our readers, then we can leverage every piece of intelligence that can be gathered; Linked Data (that is, connecting elements of our documents to an external source) isn’t possible without a knowledge base that exposes this kind of information. Every problem requires a different approach.
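The “blending multiple services” idea can be sketched in a few lines. Everything below is a deliberately toy stand-in (capitalized words for entities, a tiny word list for sentiment); it only illustrates how two separate analyses combine into one result, not how a real engine works:

```python
import re

# Toy sentiment lexicon; a real service would be far richer.
POSITIVE = {"excellent", "reliable"}
NEGATIVE = {"slow", "faulty"}

def extract_entities(text: str) -> list[str]:
    # Toy rule: capitalized words stand in for a real NER service.
    return re.findall(r"\b[A-Z][a-z]+\b", text)

def sentiment(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def analyze(text: str) -> dict:
    # Blend the two services into a single enriched record.
    return {"entities": extract_entities(text), "sentiment": sentiment(text)}

print(analyze("Acme shipped a reliable product"))
# {'entities': ['Acme'], 'sentiment': 'positive'}
```

The point is architectural: each capability is its own service, and the application decides which ones to compose for a given use case.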
3. QUALITY. Setting adequate expectations in terms of quality is essential to framing any text analytics initiative. The first reference should be human performance for similar tasks. A case in point is an experiment we did a couple of years ago, where operators manually tagged opinions expressed in documents in the form of positive, negative or neutral sentiment. While the operators’ professionalism could not be doubted, accuracy nonetheless hit a wall at 92%, short of the 100% one might have naturally expected. Furthermore, operators often disagreed with one another about what the proper tags should be, and varied in their choices as a function of time of day. In short, humans are not perfect, and likewise, machines don’t necessarily provide perfect results every time either. On the other hand, they do offer a productivity multiplier, accelerating certain tasks tenfold or more. Given the range of highly sensitive text analytics applications in production today (Homeland Security, Insurance, Banking, to name a few), even the most demanding users find this option compelling.
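The quality ceiling described above is easy to demonstrate with a small measurement. The tags below are made-up sample data (not the experiment’s actual results); they show how to compute each annotator’s accuracy against a reference, and how often the annotators agree with each other:

```python
# Sample sentiment tags for five documents (illustrative data only).
reference  = ["pos", "neg", "neu", "pos", "neg"]
annotator1 = ["pos", "neg", "neu", "neg", "neg"]
annotator2 = ["pos", "neu", "neu", "pos", "neg"]

def accuracy(pred, gold):
    """Fraction of items on which two tag sequences agree."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

print(accuracy(annotator1, reference))   # 0.8
print(accuracy(annotator2, reference))   # 0.8
print(accuracy(annotator1, annotator2))  # 0.6 -- humans disagree too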
4. PRECISION/RECALL. In the field of Information Retrieval, “Precision” measures how many of the returned results are correct, while “Recall” measures how many of the correct results are returned. We spontaneously look for both, but often there is a trade-off between the two. Understanding whether your situation implies a preference for either can act as a guide to making suitable design decisions.
As an example, banks’ anti-money laundering taskforces typically need to try to identify every potential criminal activity, a powerful incentive to focus on high Recall, even if it implies manually rooting out some false positives.
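The two metrics fall out directly from set arithmetic. Using a toy screening run (the transaction IDs below are invented sample data), where `flagged` is what the system returned and `suspicious` is the ground truth:

```python
# Illustrative anti-money-laundering screening run (sample data only).
flagged    = {"tx1", "tx2", "tx3", "tx4"}  # what the system returned
suspicious = {"tx2", "tx3", "tx5"}         # what was actually suspicious

true_positives = flagged & suspicious           # {"tx2", "tx3"}
precision = len(true_positives) / len(flagged)      # 2/4 = 0.50
recall    = len(true_positives) / len(suspicious)   # 2/3 ~= 0.67

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tuning this system for high Recall (catching tx5 as well) would typically mean flagging more transactions, adding false positives and lowering Precision; that is the trade-off the AML team accepts deliberately.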
5. INTEGRATION. When text analytics is to become part of an existing operation, we want to carefully consider the two elements that sit before and after the actual analysis step: where the documents are (acquisition), and where the results of the analysis will be applied (application). Here’s a list of sample questions for the acquisition side of the problem: Are the documents in a database? In a shared drive on the company’s network? Will the data mainly come from Twitter and Instagram feeds? Is our content going to come with metatags that can help us improve quality? Regarding the application aspect, experience shows that it’s almost impossible to visualize our final solution without thinking about system integration. Are there security concerns? Should every communication be encrypted? What are the scalability requirements? Is this data extraction and tagging activity going to enrich documents? Or end up directly on a public website? In a database? This has happened to me more than once: The second I started drawing a possible architecture on a whiteboard, everyone could suddenly see problems that nobody (me included) had noticed before.
These steps and this brief list of questions also serve another, indirect purpose: They force us to think our solution through from start to finish, and to form a clear idea of where we’re going the very moment our journey starts.
Expert System is a leading provider of cognitive computing and text analytics software based on the proprietary semantic technology of Cogito. The products and solutions based on Cogito’s advanced analysis engine and complete semantic network exceed the limits of conventional keywords, and offer a complete set of features including: semantic search, text analytics, development and management of taxonomies and ontologies, automatic categorization, extraction of data and metadata, and natural language processing.