
The E-Discovery Arms Race: Semantics... Semantics... Semantics

A confluence of e-discovery software, case law and information technology integration has helped shape today’s e-discovery market. As we look toward the future, we ask how to continue optimizing e-discovery to intelligently reduce data volumes, efficiently decrease collection quantity and streamline the review process, while also delivering the highest possible document accuracy, reliability and repeatability. Finding electronically stored information (ESI) is becoming more challenging. To complicate matters further, attorneys are expanding discovery motions to include new ESI repositories, such as SharePoint and other collaborative tools. So, how can the problem be addressed to balance the opposing constraints of ESI volume, e-discovery expense, and the precision and accuracy of relevant documents? The answer is simple, but it will be challenging to implement. The future e-discovery arms race is in the development of advanced, intelligent analytics capabilities. In other words, it is all about the semantics.

1.  Intelligent culling—The first advance in reducing non-relevant ESI collection (or ESI culling) was simple file identification. Software delivered the capability to identify file types quickly and easily in order to exclude operating system files (e.g., CABs) and application executables (e.g., Word, Numbers and Keynote), which are resident on all computers and do not contain relevant ESI. Intelligent culling reduces collection volume by 50% to 60% compared with traditional brute-force forensic collection.
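
Below is a minimal, illustrative sketch of file-type culling in Python, assuming a simple extension-based exclusion list; production tools typically match files against known-file hash databases rather than extensions alone, and no particular vendor’s implementation is implied.

  # Illustrative sketch only: cull a collection by excluding well-known system
  # and application file types before review. The extension list is hypothetical.
  import os

  EXCLUDED_EXTENSIONS = {".cab", ".dll", ".exe", ".sys", ".msi"}

  def cull(root_dir):
      """Yield file paths under root_dir that survive file-type culling."""
      for dirpath, _dirnames, filenames in os.walk(root_dir):
          for name in filenames:
              _, ext = os.path.splitext(name)
              if ext.lower() not in EXCLUDED_EXTENSIONS:
                  yield os.path.join(dirpath, name)

  # Example usage (hypothetical path):
  # surviving = list(cull("/evidence/custodian01"))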

2.  Keyword search—The second advance was Boolean keyword search, which has been a powerful e-discovery tool. Keyword search has become more sophisticated with the addition of spelling/misspelling variants and root-word variations, e.g., “talk” and “talking.” However, keyword search requires a priori knowledge of what one is looking for, which limits its success. As valuable as it is, keyword search often returns many non-relevant documents (false positives) or excludes too many relevant documents (false negatives).
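
The mechanics can be illustrated with a toy Boolean query that expands root-word variants; this is a regex-based sketch, not how production platforms execute searches (they run against indexed engines), and the function and terms are purely illustrative.

  # Toy sketch of Boolean keyword search with root-word expansion.
  import re

  def matches(document, all_of=(), any_of=()):
      """Return True if the document satisfies a simple Boolean query.

      Each query term is treated as a word root, so "talk" also matches
      "talks", "talked" and "talking".
      """
      def hit(root):
          return re.search(r"\b" + re.escape(root) + r"\w*", document, re.IGNORECASE)

      return all(hit(t) for t in all_of) and (not any_of or any(hit(t) for t in any_of))

  # Example: documents mentioning "merger" AND either "talk" or "negotiate" variants.
  # matches(text, all_of=["merger"], any_of=["talk", "negotiat"])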

3.  Synonymy and polysemy challenges—The complication lies within our language usage. We have synonymy effects, in which two or more words in the same language share the same meaning, e.g., “student” and “pupil.” To further complicate matters, the polysemy effect is that many individual words have more than one meaning. The impact of polysemy on search complexity is staggering. For computer systems, polysemy is a major obstacle in attempting to deal with human language because the most frequently used terms have several common meanings. For example, the word fire can mean: a combustion activity; to terminate employment; to launch; or to excite (as in “fire up”). Among the 200 most-polysemous terms in English, the typical verb has more than twelve common meanings, or senses, and the typical noun has more than eight. The multiple meanings of common terms can turn a simple keyword search into thousands of potential results.
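
The scale of the problem is easy to see with WordNet, a public lexical database, accessed here through the nltk Python package (an assumption about tooling; sense counts vary by WordNet version):

  import nltk
  from nltk.corpus import wordnet as wn

  nltk.download("wordnet", quiet=True)  # one-time corpus download

  # Polysemy: "fire" carries many distinct noun and verb senses.
  for synset in wn.synsets("fire"):
      print(synset.pos(), synset.name(), "-", synset.definition())

  # Synonymy: "student" and "pupil" share a sense (synset), so a
  # concept-aware search can treat them as the same idea.
  shared = set(wn.synsets("student")) & set(wn.synsets("pupil"))
  print("shared senses for student/pupil:", shared)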

4.  Intelligent analysis—What are the next steps in intelligent search and identification analytics technologies? Conceptual search. The idea is to develop the ability to search on an idea and retrieve responses that are relevant to the concept, ranked by potential relevance. There are nascent concept search capabilities in the market today, which are a first step. However, the effects of synonymy and polysemy have slowed the development of concept search engines. Research in the following areas of concept search holds promise to increase search relevance and accuracy and to reduce false positives:

  • Word sense disambiguation (WSD). WSD technologies help derive the actual meanings of words, and their underlying concepts, rather than simply matching character strings as keyword search technologies do. Research has progressed steadily to the point where WSD systems achieve sufficiently high levels of accuracy on a variety of word types and ambiguities;
  • Latent semantic analysis (LSA). LSA is a natural language processing technique that uses vectorial semantics (documents and queries are represented as vectors in a term-document matrix) to analyze relationships between a set of documents and the terms they contain, and how those terms are correlated. LSA then constructs a set of concepts related to the documents and the terms therein. In other words, LSA searches documents for themes within the language usage and extracts the concepts that are common to the documents (a minimal sketch appears after this list); and
  • Local co-occurrence statistics. Local co-occurrence statistics is a technique that counts the number of times pairs of terms appear together (co-occur) within a given window, where a window is a predetermined span of terms or sentences within a document or set of documents (also sketched after this list).
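
As a rough illustration of LSA, and only a sketch (it assumes the scikit-learn Python library; the techniques above are not tied to any particular tooling), TF-IDF vectors are reduced with a truncated SVD so that documents sharing a theme land near each other in the latent space:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.decomposition import TruncatedSVD
  from sklearn.metrics.pairwise import cosine_similarity

  # Tiny hypothetical corpus: two employment memos and two campfire notes.
  docs = [
      "Management decided to terminate the employee after the merger.",
      "The memo said they would fire the employee before the merger closed.",
      "The campfire burned all night at the lake.",
      "We lit a fire at the lake after dark.",
  ]

  tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
  lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

  # Cosine similarity in the two-dimensional latent space; in this toy corpus
  # the two memos should score closer to each other than to the campfire notes.
  print(cosine_similarity(lsa))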

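Local co-occurrence counting can be read in more than one way; the sketch below takes a simple interpretation, counting term pairs inside a hypothetical fixed-size token window, which hints at which sense of a polysemous word a document is using (e.g., “fire” near “employee” versus “fire” near “smoke”):

  # Simple illustration: count term pairs that co-occur within a token window.
  from collections import Counter
  from itertools import combinations
  import re

  def cooccurrence_counts(text, window=5):
      """Count how often pairs of terms co-occur within `window` tokens."""
      tokens = re.findall(r"[a-z']+", text.lower())
      counts = Counter()
      for i in range(len(tokens)):
          span = set(tokens[i:i + window])
          for pair in combinations(sorted(span), 2):
              counts[pair] += 1
      return counts

  # Example usage (document_text is hypothetical):
  # print(cooccurrence_counts(document_text).most_common(10))
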
None of the above techniques by itself is likely to be a complete solution to the e-discovery concept search challenge. However, these methods, combined and intelligently integrated within an overarching concept search engine, will be a step in the right direction. As focus increases on conceptual search technologies to increase ESI accuracy while reducing review expense, the winning products will likely have the best analytical technologies.


For additional information, please visit emc.com/sourceonecity or kazeon.com/discover.
