Even within a single functional area such as fraud, text analytics applications are targeted to specific industries. The SAS Enterprise Fraud and Financial Crimes solutions have different specializations for insurance, supply chain, government and financial services. "These are highly specialized areas," says Fiona McNeill, head of global product marketing for SAS Text Analytics. "The rules are based on structured and unstructured data, and the goal is to measure the statistical probability of fraud and present that information." While fraud is fairly commonplace in some areas, such as Medicare, it also turns up in unexpected places. "The SAS software detected fraud at a Midwest milk producer and manufacturer in its bottle return program," McNeill says.
Big data and text analytics
Text analytics predates so-called "big data" but has been given a boost by the availability of that technology. "What's new and different is the ability to apply analytics to large bodies of text arriving at high speed," says David Menninger, head of strategy and business development at Greenplum, which is now owned by EMC. The Greenplum platform combines its analytic database with Greenplum HD, which is based on Apache Hadoop. "In the past, text analytics was applied to historical information, but now that large amounts of text can be processed in a more dynamic mode, organizations can react in near real time," he adds.
The speed of collecting and analyzing large amounts of data makes the detection of trends much more feasible. "Information from Tweets and other social media can be collected, put through a sentiment analysis process, and then sent to a database where it can be tracked over time," Menninger says. "In fact the more critical question to be answered than whether a given Tweet is positive or negative is how sentiment is changing over time as a result of reactions to a product change or impact of a marketing campaign." The flexibility and low cost (relative to high-performance computers) of analysis using Hadoop make it a good partner for text analytics, where data volume continues to increase.
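The collect, score, store and track sequence Menninger describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Greenplum or Hadoop implementation: the word lists, the sample messages and the function names (score, daily_sentiment, trend) are hypothetical, a toy word-count scorer stands in for a real sentiment model, and an in-memory SQLite table stands in for the analytic database.

```python
import sqlite3
from datetime import date

# Hypothetical toy lexicon; a production system would use a trained sentiment model.
POSITIVE = {"love", "great", "good", "fast"}
NEGATIVE = {"hate", "broken", "bad", "slow"}


def score(text):
    """Return (number of positive words) minus (number of negative words)."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def daily_sentiment(messages):
    """Score each (day, text) pair, store it in SQLite, and return
    a list of (day, average score) tuples ordered by day."""
    conn = sqlite3.connect(":memory:")  # stand-in for the tracking database
    conn.execute("CREATE TABLE sentiment (day TEXT, score INTEGER)")
    conn.executemany(
        "INSERT INTO sentiment VALUES (?, ?)",
        [(day.isoformat(), score(text)) for day, text in messages],
    )
    rows = conn.execute(
        "SELECT day, AVG(score) FROM sentiment GROUP BY day ORDER BY day"
    ).fetchall()
    conn.close()
    return rows


def trend(rows):
    """Change in average sentiment between the first and last day."""
    return rows[-1][1] - rows[0][1]


# Made-up messages for illustration only.
SAMPLE = [
    (date(2012, 6, 1), "I love the new release, great work!"),
    (date(2012, 6, 1), "Checkout is slow."),
    (date(2012, 6, 2), "Still slow and now the app is broken."),
    (date(2012, 6, 2), "Bad update, I hate it."),
]

if __name__ == "__main__":
    rows = daily_sentiment(SAMPLE)
    for day, avg in rows:
        print(day, avg)
    print("change:", trend(rows))
```

The sketch reports the change in the daily average instead of the score of any single message, which mirrors the point that the direction of sentiment over time matters more than whether one Tweet is positive or negative.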
Organizations often sense that they can get more mileage out of their information but don't know exactly how. "We are approached by business users who don't know what their customers are thinking, what products they should be building or whether their documents are stored in compliance with regulations," says Annie Weinberger, VP of Promote Solutions, a division of Autonomy (an HP company) that focuses on multichannel customer interactions and digital marketing. "Sometimes they are approaching information from a perspective of search and keywords, but that view assumes that they always know what they're looking for."
The infrastructure for Autonomy's platform is based on its Intelligent Data Operating Layer (IDOL) technology, which aggregates indexed content from any of 1,000 formats and stores it in a proprietary repository designed for analyzing and retrieving information. "When customers want to inject meaning into a channel or explore an issue such as a lower-than-desired ranking in their market, we help them leverage text analytics to achieve understanding of the data in a contextual way," explains Weinberger. "For example, when a company wants to get ahead of damaging complaints, we can show clusters of information that reveal major concepts that explain why people are calling."
Search, discovery and analysis are all part of the same need users have to interact with data, according to Grant Ingersoll, chief scientist at Lucid Imagination. "Search with humans in the loop is a critical aspect of understanding information, but text analysis, which can be carried out on an ongoing basis in the background, can produce unexpected results," he says. "In the pharmaceutical industry, for example, such analyses can draw from many sources and provide insights about drug development that might not have occurred to the user."
Lucid Imagination provides search, discovery and analytics software based on Apache Lucene and Apache Solr technology. In May, the company introduced LucidWorks Big Data, a cloud-based development stack incorporating multiple open source projects including Lucene/Solr and Hadoop for search and analytics of both structured and unstructured content.
Although text analytics in general and its inner workings in particular are still something of a mystery to business users, the gap is narrowing between that group and the IT department. "We see some businesspeople who are as technical as some engineers," Ingersoll says, "particularly in the younger generations coming through who have grown up with computers as part of their lives."