

E-Discovery: Searching for the Narrative

At its core, e-discovery is about the search for electronic evidence for use in the legal or regulatory process. I was recently with a U.S. circuit court judge who was very surprised to learn that a company cannot simply run a “Google” type search across all its systems to locate potentially relevant electronically stored information (ESI) for production. This leading jurist’s reaction reflects a common misperception about search that plays out in e-discovery.

Legal search is materially different from using an Internet search engine such as Google, Bing, or Yahoo. The difference between the two comes down to this: What are you hoping to retrieve with your search?

When using an Internet search engine, or even a corporate file share, the objective is usually to retrieve a specific result to recall or reference prior events or decisions, or to use as a guide for future action. Google might return 1.2 million results, but for the most part, we only care about the first page (top 10 results) and we rarely go past the third page (top 30 results). We refer to this as ranked search.

Legal search, on the other hand, has a completely different objective: to locate ALL relevant documents (“recall”) while retrieving only relevant documents (“precision”). In theory, a perfect legal search would have 100% recall, i.e., it would retrieve the complete set of documents we want, and 100% precision, i.e., it would retrieve only the documents we want. A perfect legal search does not exist in the real world.
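To make the two measures concrete, here is a minimal sketch of how recall and precision could be computed for a single search. The function name and the document IDs are hypothetical and chosen only for illustration; in practice the set of truly relevant documents is not known in advance and has to be estimated, for example by reviewing a sample.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    retrieved: IDs of the documents the search returned
    relevant:  IDs of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant  # relevant documents the search found

    # Precision: what share of the retrieved documents is relevant?
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the relevant documents was retrieved?
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents exist; the search returns 5,
# of which 2 are relevant (true positives) and 3 are not (false positives).
relevant = {1, 2, 3, 4}
retrieved = {1, 2, 5, 6, 7}
precision, recall = precision_recall(retrieved, relevant)
```

In this toy case the search finds half of what it should (recall) while most of what it returns is noise (precision below one half), which is the trade-off the rest of this article is concerned with.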

The Problem of Too Many False Positives

When we preserve and collect ESI for use in the legal process, we almost always end up with substantially more irrelevant documents (false positives) than relevant documents (true positives). When Google returns 1.2 million results in a ranked search, the vast majority of them do not bear any relevance to the information you were seeking. These false positive results in a ranked search have no real impact, as they sit in the background, sight unseen, and disappear when the browser window is closed. But for legal search, we are regularly collecting hundreds of thousands, if not millions, of files for review to determine responsiveness, and moreover, to find patterns that will uncover the narrative of the case. With the continued explosion of data in the workplace, we have moved beyond the capability (or cost effectiveness) of human effort to reasonably review every document.

For many years, lawyers have used search terms in an attempt to pull true positives out of the sea of false positives. In “meet and confer” meetings, attorneys exchange lists of keywords and search terms, and then expect that all the documents within the scope of discovery will magically be found and exchanged. Yet study after study demonstrates that humans are horrible at coming up with search terms that achieve any reasonable degree of recall or precision. In large part, this is because of the extreme diversity of language in our everyday communications. Our writing is filled with synonyms, polysemes, acronyms, jargon, and even nicknames, all of which amplify this problem. Human languages and the individuals who use them are complex, and meaning can be intentionally or unintentionally masked.

Fortunately, we now have technology that can help us solve these problems.

Technology Assisted Review and Semantic Search

An e-discovery practitioner with access to, and experience with, advanced technology that uses semantic search, like Brainspace Discovery 5, has a reasonable chance of cutting through the myriad false positives to identify true positives and, moreover, to locate key documents that tell a story that may not be apparent on the surface.

Semantic search is an umbrella term covering several different, specialized methods. These methods utilize computer algorithms to bring together similar documents based on the text contained in the documents and their metadata. When used appropriately, each method can form an integral part of a practitioner’s toolkit.

New tools make use of several forms of semantic search, combined with graphic visualizations and transparent feedback, to provide an e-discovery practitioner with a multi-dimensional view of their dataset.

Clustering.

Today’s advanced software automatically organizes the dataset into clusters of documents it thinks are logically related based on common language. The program then displays the results in an interactive visualization in the form of a wheel. For demonstrative purposes, Figure 1 on page 20 of the February White Paper (or download PDF of article) shows a wheel of all English language pages contained on the Wikipedia.com website. Groups of articles on similar subjects are clustered together with greater cohesion as you visually move toward the periphery of the wheel.

Thus, rather than relying on a human-generated standard taxonomy, such as the Dewey Decimal System or the facets of an online retail site, software takes the most commonly used terms in each set, and displays them in a way that is functionally self-explanatory.
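The general idea of grouping documents by shared language and then labeling each group with its most common terms can be sketched in a few lines. This is only an illustration under simple assumptions (word-overlap similarity and a greedy single pass); it is not the algorithm used by Brainspace or any other product, and the function names, threshold, and sample documents are all hypothetical.

```python
from collections import Counter


def tokens(text):
    """Lowercase a document and return its set of distinct words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word overlap between two token sets, from 0.0 (none) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.2):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose seed document is similar
    enough; otherwise it starts a new cluster of its own.
    """
    clusters = []  # list of (seed_tokens, [doc_ids])
    for doc_id, text in docs.items():
        toks = tokens(text)
        for seed, members in clusters:
            if jaccard(toks, seed) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((toks, [doc_id]))
    return [members for _, members in clusters]


def label(members, docs, top_n=2):
    """Label a cluster with the terms that appear in the most member documents."""
    counts = Counter()
    for doc_id in members:
        counts.update(tokens(docs[doc_id]))
    # Sort by frequency, then alphabetically, so ties are resolved predictably.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


# Hypothetical mini-dataset standing in for a collection of articles.
docs = {
    "a1": "film director cast crew",
    "a2": "film cast crew producer",
    "a3": "album band guitar tour",
    "a4": "band album singer tour",
}
groups = cluster(docs)
labels = [label(members, docs) for members in groups]
```

Here the film articles and the music articles fall into separate groups, and each group is labeled from its own vocabulary rather than from a taxonomy someone had to design beforehand.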

By itself, clustering can be enormously valuable in understanding the structure of the dataset. For example, there would be no functional method for me to use a conventional search engine to find all Wikipedia articles related to music, film and television. However, I can now quickly identify sets of materials that are most likely to contain articles related to my goal, and even further narrow the topics, such as Wikipedia entries discussing movie cast and key crew.

Concept search and term expansion.

Concept search is the ability to find documents that are conceptually linked, but do not share the same words and phrases. For example, using a basic search for “thanksgiving” will return documents that contain that word. But a concept search for “thanksgiving” will return documents that include other related terms, including turkey, stuffing, dressing, gravy, mashed potatoes, cranberries, pumpkin pie, football, pigskin, pilgrims, etc.

What most concept search technology lacks is transparency. Brainspace Discovery shows me not only the words and phrases contained in the document results, but also the terms and phrases it inferred from my original query. This is known as term or phrase expansion. I have the ability to provide direct feedback to the machine by excluding terms that I think it incorrectly associated with my intended results, and by weighting the importance or value of the other terms.
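The expansion-and-feedback loop described above can be illustrated with a deliberately simple sketch. It assumes that "related" terms are simply those that appear in the same documents as the query term, which is far cruder than what a commercial concept search engine does. Every name here (`expand_term`, `concept_search`, the `excluded` parameter, the sample documents) is hypothetical, and term weighting is left out for brevity.

```python
from collections import Counter

# Small, hypothetical stopword list so that filler words are not treated as concepts.
STOPWORDS = {"the", "a", "and", "with", "for", "on", "of", "to", "is"}


def tokenize(text):
    """Lowercase a document and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def expand_term(term, docs, top_n=3):
    """Return the top_n terms that most often share a document with `term`."""
    counts = Counter()
    for text in docs.values():
        toks = set(tokenize(text))
        if term in toks:
            counts.update(toks - {term})
    # Sort by frequency, then alphabetically, so ties are resolved predictably.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:top_n]]


def concept_search(term, docs, excluded=()):
    """Search for `term` plus its expansion, minus any terms the reviewer excluded."""
    expansion = set(expand_term(term, docs)) - set(excluded)
    query = {term} | expansion
    return sorted(
        doc_id for doc_id, text in docs.items() if query & set(tokenize(text))
    )


# Hypothetical mini-dataset.
docs = {
    "d1": "thanksgiving turkey and stuffing with gravy",
    "d2": "thanksgiving turkey and pumpkin pie",
    "d3": "roast turkey with cranberries",
    "d4": "quarterly budget review for the board",
}

expansion = expand_term("thanksgiving", docs)       # terms shown to the reviewer
hits = concept_search("thanksgiving", docs)         # includes d3, which never says "thanksgiving"
narrowed = concept_search("thanksgiving", docs, excluded=("turkey",))
```

Because the expansion list is visible, the reviewer can see why document d3 was returned (it shares the inferred term "turkey") and can exclude that term if it is pulling in the wrong material.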

It is important to note that while a user can exclude an unexpected and undesired return in the expansion list, a process must be in place to examine WHY an unexpected result appeared. Many times, it is the unexpected result that may reveal material facts hidden by idiosyncratic or coded language.

Classification.

While many semantic search tools rely predominantly on clustering, a few have advanced to allow a user to interact with the dataset and the algorithm and not only refine a search, but create a new subset of the data with its own cluster wheel. This functionality is called “focus” as it allows the user to focus in on particular portions of the dataset and explore a more defined universe.

Suppose, for example, that instead of the larger topic of music, film and television, I wanted to find all articles on Wikipedia on the topic of love. I could run a concept search for the term “love” and then create a “focus” on the resulting subset. Now I have a new wheel that covers only the concept of “love,” divided at a high level into music, film, and TV about “love.” See Figure 2 on page 21 of the February White Paper (or download PDF of article).

Combining the concept search, term/phrase expansion and classification, I can then explore the different aspects of “love” and further focus my search. For example, if I run a concept search for “fictional romance” within the “love” classification focus, I can retrieve Wikipedia articles that include the terms and phrases protagonist, major characters, secondary character, roguish, young heroine, eponymous protagonist, sympathetic character, anti-hero, course of the stories, ne’er-do-well, and many others.

I can thus start with a large concept and work down the various paths, ignoring groups of documents that I know are not relevant, exploring groups that I don’t understand, and using the tag function to select groups that I know contain relevant data and that should be looked at on a document-by-document basis.

Litigators tell stories to juries, and those stories require documentary evidence to prove or disprove facts in contention. Just as individuals may disappear into the darkness of a large city, and with them their stories, documents disappear into large datasets, and with them the stories contained therein. E-discovery practitioners now have extraordinary toolsets to help solve the complex issues related to legal search, cutting through the morass of false positives to find the hidden narrative.


Brainspace Corporation is a pioneer and recognized leader in helping enterprise clients derive meaning, gain insights and identify human connections in unstructured data. Our unique solutions utilize our patented Brainspace platform and are leading the industry in text analytics, accelerating institutional learning and reinventing how organizations exchange knowledge and expertise. Our customers include the Fortune 500, leading consulting firms, legal service providers, and government agencies.
