Bloomberg Law
March 31, 2015, 1:30 PM UTC

Searching for Meaning, Not Words in eDiscovery

Thomas Barnett
Paul Hastings LLP

Editor’s Note: This article is written by a special counsel at Paul Hastings, who focuses on eDiscovery and data science.

By Thomas I. Barnett, Special Counsel, eDiscovery and Data Science, Paul Hastings LLP

Imagine taking the deposition of a key witness in a highly contentious lawsuit: No lawyer would ever create a list of words and ask the witness, “Please tell me every time in the last five years you have used the following words.”

Unfortunately, this is a fair approximation of what happens when lawyers use key word searches to identify relevant information in large data sets. Even more advanced approaches, like predictive coding, are actually nothing more than key word searching on steroids: comparing all of the words in a document, based on their frequency and proximity, with all of the other documents in the set. The technology used to search discovery data has been around for decades, and its effectiveness is rarely questioned outside of a few forward-thinking judicial opinions.

In recent years, however, there have been unprecedented advances in a number of areas of computer science that open the door to better ways of finding information—not just matching key words. The legal profession needs to take advantage of these options to get beyond the confines of conventional approaches.

Some examples of the technology used extensively in other industries include parallel processing (linking computers together to drastically increase computing power and speed), as well as advanced machine learning and statistical engineering to derive knowledge from data far more effectively. It’s time for the legal profession to take advantage of these powerful tools to attack the ever-expanding oceans of data, find critical information faster and help clients control costs.

Consider the current state of affairs: Most of the technology for classifying documents that is currently available in the legal industry relies primarily on the text of the documents—either matching individual or small combinations of words with key word searching, or using all of the words in a document as in predictive coding.
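To make the two text-only approaches concrete, here is a minimal Python sketch. It is a toy illustration, not how any particular eDiscovery product works: the function names and the two sample sentences are invented for this example, and cosine similarity over raw word counts stands in for the whole-document word comparison described above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_hit(text, keywords):
    """Key word search: true if any search term appears as a word in the text."""
    tokens = set(tokenize(text))
    return any(k.lower() in tokens for k in keywords)


def cosine_similarity(text_a, text_b):
    """Whole-document comparison based only on how often each word occurs."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two hypothetical messages that say much the same thing in different words.
doc_a = "Please shred the invoices before the audit."
doc_b = "Get rid of the paperwork before the accountants arrive."

found_a = keyword_hit(doc_a, ["shred"])
found_b = keyword_hit(doc_b, ["shred"])
score = cosine_similarity(doc_a, doc_b)
print(found_a, found_b, round(score, 2))
```

The search term finds the first message and misses the second, and the similarity score depends entirely on which words the two messages happen to share. Neither calculation knows who wrote the messages, when, or what they refer to.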

A skilled lawyer seeking information about a case considers far more than simply the particular words that were used in a communication. The background, foundation and context for the information are essential. This is so basic to how we communicate, question and learn about the world that most people don’t even stop to think about it, but it is not how the vast majority of current technology used in eDiscovery works.

Using key words, or any of their variants such as “concept” search, clustering, and predictive coding, is based on what works well for computers, not on how people actually think, learn and communicate.

While computers are far superior in speed and accuracy when it comes to identifying patterns, the fastest, most powerful computers can’t even come close to a human’s ability to grasp the context of events, determine and rank the importance of statements or actions, and interpret nuance in language. These are the very things that allow us to understand and make judgments every day and to arrive at what we believe to be the actual meaning of events.

What traditional eDiscovery approaches fail to do is incorporate and correlate various types of data that can be easily extracted and brought into the process. While these types of data are well known and commonly used in sorting and organizing data, they have not been put together and analyzed collectively using advanced machine learning techniques and statistical engineering approaches with the aim of understanding the context of communications and events. In addition to the text of a document, other information can be accessed and analyzed in combination to better understand and derive knowledge from the data:

  • Metadata (a mainstay of eDiscovery) is information that accompanies our documents (e.g., emails, word processing documents) but is not part of what we think of as the content of the document. Such information includes the date and time the document was created, accessed, or modified, or the time an email was sent, received, or opened, and who sent it and received it. For example, metadata such as the senders and recipients of emails are commonly used to organize emails into groups for review, but that information is not typically combined with other types of data to determine meaning. Further, the more complex approaches like predictive coding and concept searching generally rely on the text of the documents alone.


  • Entity data (also referred to as extracted metadata) is information contained in the text of documents that can be identified and classified into pre-defined categories such as names of persons, organizations, locations, quantities, monetary values, and so on. In other words, instead of looking at the text of a document solely as a string of characters that may or may not match another string of characters, this approach incorporates knowledge about elements of the document into the analysis. What do the words in the document actually refer to or mean?

For example, suppose you have an email from one person to another with text that matches a set of search terms. That by itself may not provide enough useful information. But what if you were able to search for documents based on whether they discussed specific events, at a specific time, involving specific people, even if you don’t know or can’t guess the actual words that were used in the communication? That is possible by combining metadata and entity data along with text in a single process. This is just one example of what is possible by incorporating and blending technology and processes used in areas outside of the legal profession and the eDiscovery industry.
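The kind of combined search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Document` record, the `find_documents` function, the email addresses and the sample messages are all hypothetical, and the entity data is assumed to have been extracted already by some separate entity-recognition step that is not shown.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Document:
    """Hypothetical record holding text, metadata and extracted entity data."""
    doc_id: str
    sender: str                      # metadata
    recipients: list[str]            # metadata
    sent: datetime                   # metadata
    text: str                        # document content
    # Entity data, e.g. {"ORG": {"Acme Corp"}}; assumed to be pre-extracted.
    entities: dict[str, set[str]] = field(default_factory=dict)


def find_documents(docs, people, start, end, entity_type, entity_values):
    """Return IDs of documents that involve all the given people, fall inside
    the date window, and mention at least one of the given entities."""
    wanted_people = {p.lower() for p in people}
    wanted_values = {v.lower() for v in entity_values}
    hits = []
    for d in docs:
        participants = {d.sender.lower(), *(r.lower() for r in d.recipients)}
        in_window = start <= d.sent <= end
        found = {e.lower() for e in d.entities.get(entity_type, set())}
        if in_window and wanted_people <= participants and wanted_values & found:
            hits.append(d.doc_id)
    return hits


# Invented sample data for illustration only.
docs = [
    Document("D1", "alice@example.com", ["bob@example.com"],
             datetime(2014, 3, 10), "The Acme paperwork is ready to sign.",
             {"ORG": {"Acme Corp"}}),
    Document("D2", "alice@example.com", ["carol@example.com"],
             datetime(2014, 3, 11), "Acme Corp terms attached.",
             {"ORG": {"Acme Corp"}}),
    Document("D3", "alice@example.com", ["bob@example.com"],
             datetime(2013, 1, 5), "Lunch with the Acme Corp team?",
             {"ORG": {"Acme Corp"}}),
]

hits = find_documents(
    docs,
    people=["alice@example.com", "bob@example.com"],
    start=datetime(2014, 3, 1),
    end=datetime(2014, 3, 31),
    entity_type="ORG",
    entity_values=["Acme Corp"],
)
print(hits)
```

The query never specifies which words the message must contain. It asks who was involved, when, and what the message refers to, so only the message between the two named people within the date range that mentions the organization is returned.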

We as a profession need to get beyond the confines of conventional approaches. We rely on these approaches either out of habit and familiarity, or because these are the only offerings being plied by providers that have invested in their development. In order to survive, we need to take advantage of the best available techniques and technology. Failing to do so will leave us drowning in the ever-increasing oceans of data.
