eDiscovery and the Internet of Things

Editor’s Note: The author of this post leads product teams to develop legal technology.

By Mark Kerzner, Chief Product Architect, LexInnova

The “Internet of Things” (IoT) means that there are, and will be, many more sources of information available for eDiscovery than ever before, and that the volume of this information is growing at an accelerating rate.

This can work both to a litigant’s advantage and to its disadvantage, so the questions of requesting, preserving, reviewing and producing this information should be addressed soon. We suggest a balanced, gradual approach to implementing all stages of eDiscovery for the IoT.

Introduction

Every week there are a few new write-ups on eDiscovery and the IoT. However, the great majority miss an important point about what Big Data is. They treat it as “just a lot of data, more than a lawyer can read,” and proceed to describe the standard text analytics techniques, often promoting their particular brand of text analytics.

Keyword searches and statistical analysis of text have been around since the late nineties. What makes Big Data special is its new technical challenges, often abbreviated as the three Vs: Volume, Velocity, Variety.

The June 2015 McKinsey report, “The Internet of Things: Mapping the Value beyond the Hype” pegs the total IoT economic potential at $11 trillion in the coming decade, while the Cisco “Internet of Everything” puts it at $19 trillion.

Thus, there is little doubt that the IoT revolution is bound to happen. There is also little doubt that it will present lawyers and technology experts with significant challenges, as well as opportunities. The IoT is primarily a technology problem, as opposed to a societal or political one, so the means for addressing it will also be technological.

Luckily, technology giants such as Google, Facebook, Yahoo and Twitter have faced these technology challenges for a long time, and have developed and made available a number of solutions. These solutions are largely open source, resulting in robust, dynamic software with a low total cost of ownership (TCO). The first and foremost of these solutions is Hadoop, which provides scalable storage and computational capability on clusters of commodity servers.

Along with Hadoop, there are also Flume and Sqoop for data collection; Pig and Hive for processing; HBase, Cassandra and other NoSQL solutions for quick responses to queries against a large dataset; Spark and Storm for real-time processing; and many more.

What you need to know about Big Data technology

Anyone who wants to understand the challenges of Big Data will do well to become familiar with Hadoop and its ecosystem. Otherwise, that understanding will remain vague. A number of resources on the internet can help you learn Hadoop, including my free book, Hadoop Illuminated.

Implementation tips (and reference architecture) for eDiscovery and IoT

I have put together a reference architecture, represented by the diagram below. It is based on my extensive experience of teaching and practicing Big Data, as well as on the feedback from my Hadoop-based solution for eDiscovery.

[Diagram: LexInnova reference architecture]

In brief, you use a data collection tool that deposits the data in HDFS, the storage component of Hadoop. The tool could be Flume for batch collections or Storm/Kafka for more real-time components. Also consider Apache NiFi, a product of 8 years of NSA development; it solves many IoT load-balancing challenges and provides an HTML5-based interface for configuring data streams in real time.

Don’t make the mistake of putting everything into a NoSQL solution like HBase or Cassandra from the start. This will make your solution inflexible: fast for the queries you planned for, but extremely slow for the ones you did not anticipate. Later, once you have seen the typical queries, you will be able to add a NoSQL component. Do it by replicating the data, not replacing it.
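To see why a key-ordered store is fast for some queries and slow for others, here is a minimal single-process sketch in Python. It is not HBase or Cassandra code; the row-key layout ("device_id#timestamp"), the sample records and the function names are all hypothetical, chosen only to illustrate the point.

```python
from bisect import bisect_left

# Hypothetical records: (row_key, payload). The "device_id#timestamp" key
# layout is an illustrative assumption, not a design from the article.
ROWS = sorted([
    ("thermostat-01#2015-06-01T10:00", {"temp_f": 71, "owner": "alice"}),
    ("thermostat-01#2015-06-01T11:00", {"temp_f": 74, "owner": "alice"}),
    ("tracker-07#2015-06-01T10:30", {"steps": 1200, "owner": "bob"}),
    ("tracker-07#2015-06-02T09:15", {"steps": 300, "owner": "bob"}),
])
KEYS = [key for key, _ in ROWS]


def scan_by_prefix(prefix):
    """Anticipated query: binary-search to the prefix, then read adjacent rows."""
    start = bisect_left(KEYS, prefix)
    result = []
    for key, payload in ROWS[start:]:
        if not key.startswith(prefix):
            break  # rows are sorted, so nothing further can match
        result.append((key, payload))
    return result


def scan_by_owner(owner):
    """Unanticipated query: owner is not in the key, so every row is examined."""
    return [(key, payload) for key, payload in ROWS if payload["owner"] == owner]


if __name__ == "__main__":
    print(scan_by_prefix("tracker-07#"))
    print(scan_by_owner("alice"))
```

The prefix scan touches only the rows it returns, while the owner query has to read the whole table. On four rows the difference is invisible; on billions of rows it is the difference between an instant answer and a full scan, which is why it pays to learn the typical queries before committing to a key design.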

For processing, use Hadoop or Spark. This ensures that you can process very large amounts of data (think terabytes) within the required time frame (think hours for Hadoop, minutes for Spark).
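The processing model behind Hadoop MapReduce can be sketched in a few lines of plain Python. This is a single-process illustration only, with no Hadoop or Spark API involved; the log format ("device_id,timestamp,value") and all names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical IoT log lines in "device_id,timestamp,value" form.
LINES = [
    "thermostat-01,2015-06-01T10:00,71",
    "tracker-07,2015-06-01T10:30,1200",
    "thermostat-01,2015-06-01T11:00,74",
]


def map_phase(line):
    """Map: turn one input line into a (key, value) pair."""
    device_id, _timestamp, _value = line.split(",")
    return (device_id, 1)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def count_events_per_device(lines):
    return reduce_phase(shuffle(map(map_phase, lines)))


if __name__ == "__main__":
    print(count_events_per_device(LINES))
```

Because each line is mapped independently and each key is reduced independently, a cluster can spread both steps across many machines, which is what lets the same logic scale from three lines to terabytes.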

For analytics, use the most advanced tools, such as GATE, which stands for “General Architecture for Text Engineering”. We, for example, are applying our work at DARPA to extract meaning, not just keywords or statistics, from documents.

Review, likewise, should be built on Big Data NoSQL tools, which can deliver fast performance regardless of the volume of data or the number of reviewers. You can consult my book “HBase Design Patterns,” which works out many of the issues important for eDiscovery.

Finally, and this has been said many times, security should not be an afterthought, but should be a part of your development from the beginning. “Hadoop Security” can be said to be the first and only definitive book on the subject right now.