In general, text analysis refers to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text analysis differs from traditional search in that, whereas search requires a user to know what he or she is looking for, text analysis attempts to discover information in a pattern that is not known beforehand (through the use of advanced techniques such as pattern recognition, natural language processing, machine learning and so on). By focusing on patterns and characteristics, text analysis can produce better search results and deeper data analysis, thereby providing quick retrieval of information that otherwise would remain hidden.
Text analysis is particularly interesting in areas where users must discover new information, such as in criminal investigations, legal discovery and when performing due-diligence investigations. Such investigations require 100% recall; i.e., users cannot afford to miss any relevant information. In contrast, a user who uses a standard search engine to search the Internet for background information simply requires any information as long as it is reliable. During due diligence, a lawyer certainly wants to find all possible liabilities and is not interested in finding only the obvious ones.
Challenges Facing Text Analysis
Due to the global reach of many investigations, a lot of interest also exists with text analysis in multi-language collections. Multi-language text analysis is much more complex than it appears because, in addition to differences in character sets and words, text analysis makes intensive use of statistics as well as the linguistic properties (such as conjugation, grammar, tenses or meanings) of a language. A number of multi-language issues will be addressed later in this article.
But perhaps the biggest challenge with text analysis is that increasing recall can compromise precision, meaning that users end up having to browse large collections of documents to verify their relevance. Standard approaches to countering decreasing precision rely on language-based technology, but when textcollections are not in one language, are not domain-specific and/or contain documents of variable sizes and types, these approaches often fail or are too sophisticated for users to comprehend what processes are actually taking place, thereby diminishing their control.
Furthermore, according to Moore’s Law, computer processor and storage capacities double every 18 months, which, in the modern context, also means that the amount of information stored will double during this timeframe as well. The continual, exponential growth of information means most people and organizations are always battling with the specter of information overload.
Although effective and thorough information retrieval is a real challenge, the development of new computing techniques to help control this mountain of information is advancing quickly as well. Text analysis is at the forefront of these new techniques, but it needs to be used correctly and understood according to the particular context in which it’s applied. For example, in an international environment, a suitable text analysis solution may consist of a combination of standard relevance-ranking with adaptive filtering and interactive visualization, which is based on utilizing features (i.e. metadata elements) that have been extracted earlier.
Control of Unstructured Information
More than 90% of all information is unstructured, and the absolute amount of stored unstructured information increases daily. Searching within this information, or performing analysis using database or data mining techniques, is not possible, as these techniques work only on structured information. The situation is further complicated by the diversity of stored information: scanned documents, email and multimedia files (speech, video and photos).
Text analysis neutralizes these concerns through the use of various mathematical, statistical, linguistic and pattern-recognition techniques that allow automatic analysis of unstructured information as well as the extraction of high quality and relevant data. ("High quality" here refers to the combination of relevance [i.e. finding a needle in a haystack] and the acquiring of new and interesting insights.) With text analysis, instead of searching for words, we can search for linguistic word patterns, which enables a much higher level of search.
Text analysis is often mentioned in the same sentence as information visualization, in large part because visualization is one of the viable technical tools for information analysis after unstructured information has been structured.
A common visualization approach is a "treemap," in which an archive is presented as a colored grid (see figure left). The components of the grid are color-coded and sized based on their interrelationships and content volume. This structure allows you to get a quick visual representation of areas with the most entities. A value can also be allocated to a certain type of entity, such as the size of an email or a file.
These types of visualization techniques are ideal for allowing an easy insight into large email collections. Alongside the structure that text analysis techniques can deliver, use can also be derived from the available attributes such as "sender," "recipient," "subject," "date," etc.
Text Analysis on Non-English Documents
As mentioned earlier, many language dependencies need to be addressed when text- analysis technology is applied to non-English languages.