February 2, 2009
By Dr. Johannes C. Scholtes, Chief Strategy Officer, ZyLAB North America LLC

The Difference Between Legal Search and Web Search
What You Should Know About Search Tools for E-Discovery

When in-house legal professionals need advanced search capabilities for e-discovery and other legal activities, they often default to in-house variants of common Web search tools. However, Web search tools are not optimized for e-discovery work, in large part because fundamental differences exist between the capabilities of Web search engines and the search functionality and approaches needed to support the strategic requirements of legal, law enforcement and intelligence applications.

One of the most compelling differences is that typical Web search engines are optimized to find only the most relevant documents; they are not optimized to find all relevant documents. Consider that with Web search engines, most companies and organizations place a premium on being found as close to the top of a search list as possible. Experienced users have become quite savvy in using search engine optimization techniques to achieve high rankings. This level of sophistication works in both directions, though. People involved in criminal activities (such as fraud) don’t want to be in the top 10 of a search engine result list, so they use advanced techniques to hide their documented activities and avoid appearing in any search list.

As a result, those searching in legal or law enforcement environments need to find all potentially relevant documents. Moreover, these investigators require different tool functionalities to quickly and efficiently navigate and review relevant document sets. The combination of these two requirements encompasses the practical difference between common Web search tools and legal search tools tailored for discovery-type activities.
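
The gap between "most relevant" and "all relevant" is commonly expressed as precision versus recall. The following is a minimal sketch of that distinction, not something taken from any particular product; the document IDs and both result lists are hypothetical and exist only to illustrate the arithmetic.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection: 8 documents are truly relevant to the matter.
relevant_docs = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

# A "top results" style search returns a few highly ranked hits.
web_style = ["d1", "d2", "d3"]
# A discovery-style search casts a wider net and accepts some false hits.
legal_style = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "x1", "x2"]

print(precision_recall(web_style, relevant_docs))    # (1.0, 0.375)
print(precision_recall(legal_style, relevant_docs))  # (0.8, 1.0)
```

In this toy case the Web-style result is perfectly precise yet misses five of the eight relevant documents, which is the failure that matters most in discovery.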

An additional technical consideration is that, although Web search engines use many optimizations to continually perform real-time indexing of the Web, these optimizations come at a price: documents in non-standard formats will not be found, long documents will require a lot of time to review, and the processing of complex queries will be very slow (if even possible). Hit highlighting and hit navigation are often not available or operate too slowly. Moreover, with Web search engines, after documents are found, tagging them is not possible, nor can they be exported in a format required by regulators or courts.

Recognizing Different Search Capabilities
The strict e-discovery obligations and deadlines spelled out in the Federal Rules of Civil Procedure (FRCP) have highlighted the need for powerful in-house search technology, particularly in light of the current credit crisis. Meeting these requirements is becoming increasingly difficult given that the data repositories through which organizations now have to search for relevant and non-privileged documents are immense and ever-growing (i.e., they contain terabytes rather than "just" gigabytes of information). Given this context, consider the typical progression of e-discovery activities for most organizations: an organization receives a legal hold letter from a regulator or a third party; relevant custodians are established; and the organization’s email and electronic files are handed off to a legal service bureau. The bureau then finds all documents that need to be transferred to a third party in a specific, legally acceptable format. In some cases, these documents are in native file formats; in other cases, they are TIFF prints from electronic and paper files. The cost of such data processing services can be hundreds of thousands of dollars for a typical collection of, for instance, 250GB.

One way to reduce these costs is to implement an appropriate in-house search engine that can make a pre-selection of relevant documents and then create the document set that needs to be reviewed. However, in many cases, rather than using an e-discovery-appropriate search tool, organizations implement Web search technology or Web search appliances to perform full-text searches on large email or electronic file collections throughout corporate networks. These organizations soon realize that the technical constraints of Web search technologies compromise their ability to meet set deadlines and address the requirements of regulators and courts, all of which can lead to higher costs and possible fines. Unfortunately, the limitations of Web search technologies are often not discovered until it’s too late.

Understanding Search in E-Discovery
Searching is not only important for finding potentially relevant documents; it is also very important for supporting early case assessment activities. You must be able to quickly perform thorough and complex searches through your document repository, especially when you consider that searchers are under severe time constraints and/or are expensive investigators or (external) counsel.

ZyLAB has seen the most client "pain" when in-house legal teams and third parties confer to define the relevant search queries. As parties negotiate which documents need to be disclosed, lawyers establish what they consider the best Boolean, proximity and quorum operators needed to find specific data, and these operators are often combined and nested in hierarchical structures separated by brackets. Typical queries contain hundreds of words, and to catch spelling variations (e.g., from typos or optical character recognition [OCR] errors), a good search tool must be able to utilize wildcards (placeholders for beginning, middle and end of words) and fuzzy search (including support for first character changes).
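
To make the operators above concrete, here is a small sketch of wildcard, fuzzy and quorum matching over a handful of documents. It is an illustration under simple assumptions, not how any specific e-discovery product works: the helper names (`wildcard`, `fuzzy`, `quorum`, `search`) and the sample documents are hypothetical, fuzzy matching is approximated with Levenshtein edit distance, and everything is scanned in memory rather than read from an index.

```python
import fnmatch

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def wildcard(pattern):
    """Match any word against a pattern such as 'acc*' or '*shore'."""
    return lambda words: any(fnmatch.fnmatchcase(w, pattern) for w in words)

def fuzzy(term, max_edits=1):
    """Match words within max_edits of term, including first-character changes."""
    return lambda words: any(edit_distance(w, term) <= max_edits for w in words)

def quorum(minimum, *matchers):
    """Match when at least `minimum` of the given matchers hit."""
    return lambda words: sum(m(words) for m in matchers) >= minimum

def search(docs, matcher):
    """Return the sorted IDs of all documents the matcher accepts."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if matcher(text.lower().split()))

# Hypothetical documents; "b" simulates OCR damage in two words.
docs = {
    "a": "Wire the payment to the offshore account",
    "b": "vayment sent to ofshore holding",
    "c": "Lunch menu for Friday",
}

query = quorum(2, fuzzy("payment"), fuzzy("offshore"), wildcard("acc*"))
print(search(docs, query))  # ['a', 'b']
```

Document "b" is found only because the fuzzy operator tolerates a changed first character ("vayment") and a dropped letter ("ofshore"); an exact-match search would miss it entirely.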

Web search technologies are either unable to execute such queries or are too slow when attempting them. In these cases, executing a negotiated Boolean query can take several days to finish, if it doesn’t crash the system, so the query must be cut into smaller queries, with all spelling variations specified, which leads to an even more complicated search framework.

In addition, if a regulator or judge wants to verify that you have delivered all potentially relevant data, running additional fuzzy or wildcard searches might be required to find other documents. Cases are trending in this direction, and you need to make sure your in-house system can support it. You must be able to tag relevant documents or set them aside for deferred or external review, and you need to be able to show how you searched and what the results were.

Furthermore, your search engine needs to produce exactly the same results every time it is run on the same data collection. Web search engines or engines based on certain high-dimensional statistical relevance ranking technology tend to produce different results over time. Cases relying on these kinds of searches are compromised in court.
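
One way to think about repeatable, auditable search is to record each query together with a fingerprint of its results, so that a later run over the same collection can be compared against the first. The sketch below assumes a deliberately simple exact-term search; the function name `run_logged_search`, the record layout and the sample documents are all hypothetical.

```python
import hashlib
import json

def run_logged_search(docs, term):
    """Run a simple term search and return an audit record for it."""
    term = term.lower()
    # Sort the hits so the same data always yields the same ordered result.
    hits = sorted(doc_id for doc_id, text in docs.items()
                  if term in text.lower().split())
    record = {"query": term, "hits": hits}
    # Fingerprint the query and its results so a later run can be compared.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["fingerprint"] = hashlib.sha256(payload).hexdigest()
    return record

# Hypothetical collection.
docs = {
    "m1": "Please wire the funds today",
    "m2": "Funds received",
    "m3": "Agenda attached",
}

first = run_logged_search(docs, "funds")
# Same data loaded in a different order, same query in a different case.
second = run_logged_search(dict(reversed(list(docs.items()))), "Funds")

print(first["hits"])                                  # ['m1', 'm2']
print(first["fingerprint"] == second["fingerprint"])  # True
```

If the fingerprints of two runs over an unchanged collection ever differ, the engine is not producing repeatable results, and the stored records show exactly which query and which hits diverged.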

Understanding Full-Text Indexing Processes
Most search engines use a "tokenizer" to enhance the searchability of data by removing punctuation and noise words, identifying words and determining character-set mappings (for foreign languages). This type of capability enhances your ability to perform the necessary full-text indexing of all relevant data. Of course, Web appliances can index for you, but their reporting and auditing functions may not match the standards required by regulators and the courts. With a Web search engine, you may not know exactly what data is in your index, and more specifically, what data is not in your index.
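
As a rough illustration of what a tokenizer does, and of why it helps to know what is not in the index, the following sketch lowercases text, strips punctuation, drops noise words and keeps a list of documents that could not be indexed. The noise-word list, the use of Unicode NFKC normalization as a stand-in for character-set mapping, and the convention that unreadable documents arrive as `None` are all assumptions made for this example.

```python
import string
import unicodedata

# Illustrative noise-word list only; real lists are longer and language-specific.
NOISE_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def tokenize(text):
    """Split text into lowercase words without punctuation or noise words."""
    # Normalize Unicode so equivalent characters compare consistently.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace ASCII punctuation with spaces, then split on whitespace.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    words = text.translate(table).split()
    return [w for w in words if w not in NOISE_WORDS]

def build_index(docs):
    """Build an inverted index and report which documents were skipped."""
    index, skipped = {}, []
    for doc_id, text in docs.items():
        if text is None:  # assumed marker for an unreadable or unsupported file
            skipped.append(doc_id)
            continue
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index, sorted(skipped)

index, skipped = build_index({
    "m1": "The hold letter, dated 2 Feb., went to Legal.",
    "m2": None,
})
print(sorted(index))  # ['2', 'dated', 'feb', 'hold', 'legal', 'letter', 'went']
print(skipped)        # ['m2']
```

The `skipped` list is the important part for audit purposes: it states explicitly which documents never made it into the index and therefore can never appear in any search result.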
