Document Review and Predictive Coding: Video Talks – Part Seven

March 23, 2016

This is the seventh and last of seven informal video talks on document review and predictive coding. The first video explained why this is important to the future of the Law. The second talked about step one, ESI Communications. The third about step two, Multimodal Search Review. The fourth about step three, Random Baseline. The fifth about steps four, five and six, the predictive coding steps that iterate during rounds of machine training. The sixth about ZEN Quality Assurance Tests, where ZEN stands for Zero Error Numerics.

This last video talks about step eight in the work flow, Phased Productions. This is also sometimes known as second pass review, where you identify confidential and privileged documents and take appropriate action by logging, redacting or affixing special legends on the face of TIFF or PDF productions. You are probably already well acquainted with this step. Again, it is fairly easy and straightforward. Here is my short, five-minute wrap up.



For details on phased productions see Electronic Discovery Best Practices. For information on all eight steps see Predictive Coding 3.0. More information on document review and predictive coding can be found in the fifty-six articles published here.

Document Review and Predictive Coding: Video Talks – Part Four

March 11, 2016

This is the fourth of seven informal video talks on document review and predictive coding. The first video explained why this is important to the future of the Law. The second talked about ESI Communications. The third about Multimodal Search Review. This video talks about the third step of the e-Discovery Team’s eight-step work flow, shown above, Random Baseline Sample.

Although this text intro is overly long, the video itself is short, under eight minutes, as there is really not that much to this step. You simply take a random sample at or near the beginning of the project. Again, this step can be used in any document review project, not just ones with predictive coding. You do this to get some sense of the prevalence of relevant documents in the data collection. That just means the sample will give you an idea as to the total number of relevant documents. You do not take the sample to set up a secret control set, a practice that has been thoroughly discredited by our Team and others. See Predictive Coding 3.0.

If you understand sampling statistics you know that sampling like this produces a range, not an exact number. If your sample size is small, then the range will be very wide. If you want to cut that range, known in statistics as the confidence interval, in half, you have to quadruple your sample size. This is a general rule of thumb that I explained in tedious mathematical detail several years ago in Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022. Our Team likes to use a fairly large sample size of about 1,533 documents, which creates a confidence interval of plus or minus 2.5%, subject to a confidence level of 95% (meaning the true value will lie within that range 95 times out of 100). More information on sample size is summarized in the graph below. Id.
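For readers who want to check the arithmetic themselves, here is a small Python sketch of the standard sample size formula for a proportion. This is my own illustration, not any vendor’s tool. The textbook formula at the worst-case prevalence of 50% gives 1,537 for a ±2.5% interval at 95% confidence; figures like the 1,533 quoted above are in the same ballpark, with small differences coming from rounding of the z value or finite population corrections applied by some calculators.

```python
import math

def sample_size(interval, z=1.96):
    """Sample size for estimating a proportion.

    interval: the +/- margin, e.g. 0.025 for plus or minus 2.5%.
    z: critical value; 1.96 corresponds to a 95% confidence level.
    Uses the worst-case prevalence p = 0.5, which maximizes p * (1 - p)
    and so gives the largest (safest) sample size.
    """
    p = 0.5
    return math.ceil(z * z * p * (1 - p) / interval ** 2)

print(sample_size(0.025))   # ±2.5% at 95% confidence -> 1537
print(sample_size(0.0125))  # halving the interval roughly quadruples n -> 6147
```

The second call illustrates the rule of thumb in the text: cutting the confidence interval in half (from ±2.5% to ±1.25%) roughly quadruples the required sample size.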


The picture below this paragraph illustrates a data cloud where the yellow dots are the sampled documents from the grey dot total, and the hard to see red dots are the relevant documents found in that sample. Although this illustration is from a real project we had, it shows a dataset that is unusual in legal search because the prevalence here was high, between 22.5% and 27.5%. In most data collections searched in the law today, where the custodian data has not been filtered by keywords, the prevalence is far less than that, typically less than 5%, maybe even less than 0.5%. The low prevalence increases the range size, the uncertainties, and requires a binomial calculation adjustment to determine the statistically valid confidence interval, and thus the true document range.
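One common way to make that binomial adjustment is the Wilson score interval, which, unlike the plain normal approximation, stays sensible at the low prevalence rates typical of unfiltered legal collections. The sketch below is my own illustration of that technique, with made-up example numbers; it is not a reconstruction of any particular vendor’s calculator.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score confidence interval for a sample proportion.

    hits: relevant documents found in the sample.
    n: sample size.
    z: critical value; 1.96 corresponds to a 95% confidence level.
    Returns (low, high) bounds on the true prevalence.
    """
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical low-prevalence example: 8 relevant docs in a 1,533-doc sample.
lo, hi = wilson_interval(8, 1533)
print(f"prevalence between {lo:.2%} and {hi:.2%}")
```

Note how asymmetric the result is at low prevalence: the interval stretches much further above the sample proportion than below it, which is exactly why the simple “point projection ± 2.5%” story breaks down in typical legal collections.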


For example, in a typical legal project with a prevalence of a few percent, it would be common to see a range between 20,000 and 60,000 relevant documents in a 1,000,000 document collection. Still, even with this very large range, we find it useful to at least have some idea of the number of documents you are looking for. That is what the Baseline Step can provide to you, nothing more nor less.

If you are unsure of how to do sampling for prevalence estimates, your vendor can probably help. Just do not let them tell you that it is one exact number. That is simply a point projection near the middle of a range. The one number point projection is just the top of the typical probability bell curve shown above, which illustrates a 95% confidence level distribution. The top is just one possibility, albeit slightly more likely than either end point. The true value could be anywhere in the blue range.

To repeat, the Step Three prevalence baseline number is always a range, never just one number. Going back to the relatively high prevalence example, the below bell curve shows a point projection of 25% prevalence, with a range of 22.5% to 27.5%, creating a range of between 225,000 and 275,000 relevant documents. This is shown below.
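The translation from a prevalence range to a document-count range is simple multiplication, as this short sketch shows using the example numbers above (25% point projection, ±2.5% interval, 1,000,000 documents). The function name is my own for illustration.

```python
def relevant_doc_range(point_projection, interval, collection_size):
    """Translate a prevalence point projection +/- interval into a
    range of relevant document counts for the whole collection."""
    lo = (point_projection - interval) * collection_size
    hi = (point_projection + interval) * collection_size
    return round(lo), round(hi)

# 25% prevalence +/- 2.5% over a 1,000,000-document collection.
print(relevant_doc_range(0.25, 0.025, 1_000_000))  # (225000, 275000)
```

Remember that even this 50,000-document spread only holds with 95% confidence; in the remaining 5% of cases, the true count falls outside it entirely.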


The important point that many vendors and other “experts” often forget to mention is that you can never know exactly where within that range the true value may lie. Plus, there is always a small possibility, 5% when using a sample size based on a 95% confidence level, that the true value may fall outside of that range. The collection may, for example, contain only 200,000 relevant documents. This means that even with a high prevalence project with datasets that approach the Normal Distribution of 50% (here meaning half of the documents are relevant), you can never know that there are exactly 250,000 documents, just because it is the mid-point or point projection. You can only know that there are between 225,000 and 275,000 relevant documents, and even that range may be wrong 5% of the time. Those uncertainties are inherent limitations of random sampling.

Shame on the vendors who still perpetuate that myth of certainty. Lawyers can handle the truth. We are used to dealing with uncertainties. All trial lawyers talk in terms of probable results at trial, and risks of loss, and often calculate a case’s settlement value based on such risk estimates. Do not insult our intelligence with a simplification of statistics that is plain wrong. Reliance on such erroneous point projections alone can lead to incorrect estimates as to the level of recall that we have attained in a project. We do not need to know the math, but we do need to know the truth.

The short video that follows will briefly explain the Random Baseline step, but does not go into the technical details of the math or statistics, such as the use of the binomial calculator for low prevalence. I have previously written extensively on this subject. See for instance:

Byte and Switch

If you prefer to learn stuff like this by watching cute animated robots, then you might like: Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search. But be careful, their view is version 1.0 as to control sets.

Thanks again to William Webber and other scientists in this field who helped me out over the years to understand the Bayesian nature of statistics (and reality).

For details on all eight steps, including this third step, see Predictive Coding 3.0. More information on document review and predictive coding can be found in the fifty-six articles published here.




Document Review and Predictive Coding: Video Talks – Part One

March 1, 2016

This is the first of seven informal video talks on document review and predictive coding. These short videos share my thoughts on the e-Discovery Team’s eight-step work flow for document review, shown above. I explain predictive coding and the Team’s Hybrid Multimodal Method. This first video addresses the big picture, why it is critical to our system of justice for the legal profession to keep up with technology, including especially active machine learning (predictive coding).

The flood of data now all too often hides the truth and frustrates justice. Cases tend to be decided on shadows, smoke and mirrors, because the key documents cannot be found. The needles of truth hide in vast haystacks in the clouds. Justice demands the truth, the full truth, not some bastardized twitter version.

The use of AI in legal search can change that. It can empower lawyers to find the needles and decide cases on what really happened, and do so quickly and inexpensively. It can usher in a new age of greater justice for all, blind to wealth and power. The stability of society demands nothing less.

The videos after this introduction are more technical. They delve into details of the work flow and show that it is easier than you might think. After all, only two of the eight steps (four and six) are unique to document reviews that use predictive coding. The others are found in any large scale review project, or should be.

For a more systematic explanation of the methods and eight-steps see Predictive Coding 3.0. Still more information on predictive coding and electronic document review can be found in the fifty-six articles published here on the topic since 2011.



Why the ‘Google Car’ Has No Place in Legal Search

February 24, 2016

Hybrid Multimodal is the preferred legal search method of the e-Discovery Team. The Hybrid part of our method means that our Computer Assisted Review, our CAR, uses active machine learning (predictive coding), but still has a human driver. They work together. Our review method is thus like Tesla’s Model S with full autopilot capabilities. It is designed to be driven by both Man and Machine. Our CAR is unlike the Google car, which can only be driven by a machine. When it comes to legal document review, we oppose fully autonomous driving. In our view there is no place for a Google car in legal search.

Google cars have no steering wheel, no brakes, no gas pedal, no way for a human to drive it at all. It is fully autonomous. A human driver cannot take over, even if they wanted to. In Google’s view, allowing humans to take over makes driverless cars less safe. Google thinks passengers could try to assert themselves in ways that could lead to a crash, so it is safer to be autonomous.

We have no opinion about the driverless automobile debate, and only like the analogy up to a point. Our opinion is limited to computer assisted review CARs that search for relevant evidence in lawsuits. For purposes of Law, we want our CARs to be like a Tesla. You can let the car drive and go hands free, if and when you want to. The Tesla AI will then drive the car for you. But you can still drive the car yourself. The second you grab the wheel, the Tesla senses that and turns the Autopilot off. Full control is instantly passed back to you. It is your car, and you are the driver, but you can ask your car to help you drive, when, in your judgment, that is appropriate. For instance, it has excellent fully autonomous parallel parking features, and you can even summon it to come pick you up from a nearby parking lot, a truly cool valet service. It is also good in slow commuter traffic and on highways, much like cruise control.

When it comes to law, and legal review, we want an attorney’s hands on, or at least near the wheel at all times. Our Hybrid Multimodal approach includes an autopilot mode using active machine learning, but our attorneys are always responsible. They may allow the programmed AI to take over in some situations, and go hands free, much like autonomous parallel parking or highway driving, but they always control the journey.

Defining the Terms

The e-Discovery Team’s Hybrid Multimodal method of document review is based on a flexible blend of human and machine skills, where a lawyer may often delegate, but always retains control. Before we explore this further, a quick definition of terms is in order. Multimodal means that we use all kinds of search methods, and not just one type. For example, we do not just use active machine learning, a/k/a Predictive Coding, to find relevant documents. We do not just use keyword search, or concept search. We use every kind of search we can. This is shown in the search pyramid below, which does not purport to be complete, but catches the main types of document search used today. Using our car analogy, this means that when a human drives, they have a stick shift, and can run in many gears, use many search engines. They can also let go of the wheel, when they want to, and use AI-enhanced search.

We call this a Hybrid method because of the manner in which we use one particular kind of search, predictive coding. To us predictive coding means active machine learning. See, e.g., Legal Search Science. It is a Man-Machine process, a hybrid process, where we work together with our machine, our robot, whom we call Mr. EDR. In other words, we use the artificial intelligence generated by active machine learning, but we keep lawyers in the loop. We stay involved, hands on or near the wheel.

Augmentation, Not Automation

The e-Discovery Team’s Hybrid approach enhances what lawyers do in document review. It improves our ability to make relevance assessments of complex legal issues. The hybrid approach thus leads to augmentation, where lawyers can do more, faster and better. It does not lead to automation, where lawyers are replaced by machines.

The Hybrid Multimodal approach is designed to improve a lawyer’s ability to find evidence. It is not designed to fully automate the tasks. It is not designed to replace lawyers with robots. Still, since one lawyer with our methods can now do the work of hundreds, some lawyers will inevitably be out of a job. They will be replaced by other, more tech-savvy lawyers who can work with the robots, who can control them and be empowered by them at the same time. This development in turn creates new jobs for the experts who design and care for the robots, and for lawyers who find new ways to use them.

We think that empowering lawyers, and keeping them in the loop, hands near the wheel, is a good thing. We believe that lawyers bring an instinct and a moral sense that is way beyond the grasp of all automation. Moreover, at least today, lawyers know the law, and robots do not. The active machine learning process – predictive coding – begins with a blank slate. Our robots only know what we teach them about relevance. This may change soon, but we are not there yet. Another advantage that we currently have, again one that may someday be replaced, is legal analysis. Humans are capable of legal reasoning, at least after years of schooling and years of legal practice. Right now no machine in the world is even close. But again, we concede this may someday be automated, but we suspect this is at least ten years away.

The one thing we do not think can ever be automated is the human moral sense of right and wrong, our ethics, our empathy, our humor, our instinct for justice, and our capacity for creativity and imagination, for molding novel remedies to attain fair results in new fact scenarios. This means that, at the present time at least, only lawyers have an instinct for the probative value of documents and their ability to persuade. Even if legal knowledge and legal analysis are some day programmed into a machine, we contend that the unique human qualities of ethics, fairness, empathy, humor, imagination, creativity, flexibility, etc., will always keep trained lawyers in the loop. When it comes to questions of law and justice, humans will always be needed to train and supervise the machines. Not everyone agrees with us.

There is a struggle going on about this right now, one that is largely under the radar. The clash became apparent to the e-Discovery Team during our venture into the world of science and academia at TREC 2015. Some argue that lawyers should be replaced, not enhanced. They favor fully automated methods for a variety of reasons, including cost, a point with which we agree, but also including the alleged inherent unreliability and dishonesty of humans, especially lawyers, a point with which we strenuously disagree. Some scientists and technologists do not appreciate the unique capabilities that humans bring to legal search. More than that, some even think that lawyers should not be trusted to find evidence, especially documents that could hurt their client’s case. They doubt our ability to be honest in an adversarial system of justice. They see the cold hard logic of machines as the best answer to human subjectivity and deceitfulness. They see machines as the impartial counter-point to human fallibility. They would rather trust a machine than a lawyer. They see fully automated processes as a way to overcome the base elements of man. We do not. This is an important Roboethics issue that has ramifications far beyond legal search.

Although we have faced our fair share of dishonest lawyers, we still contend they are the rare exception, not the rule. Lawyers can be trusted to do the right thing. The few bad actors can be policed. The existence of a few unethical lawyers should not dictate the processes used for legal search. That is the tail wagging the dog. It makes no sense and, frankly, is insulting. Just because there are a few bad drivers on the road, does not mean that everyone should be forced into a Google car. Plus, please remember the obvious, these same bad actors could also program their robots to do evil for them. Asimov’s laws are a fiction. Not only that, think of the hacking exposure. No. Turning it all over to supposedly infallible and honest machines is not the answer. A hybrid relationship with Man in control is the answer. Trust, but verify.

The e-Discovery Team members have been searching for evidence, both good and bad, all of our careers. We do not put our thumb on the scale of justice. Neither do the vast majority of attorneys. We do, however, routinely look for ways to show bad evidence in a good light; that is what lawyers are supposed to do. Making silk purses out of sow’s ears is Trial Law 101. But we never hide the ears. We argue the law, and application of the law to the facts. We also argue what the facts may be, what a document may mean for instance, but we do not hide facts that should be disclosed. We do not destroy or alter evidence. Explaining is fine, but hiding is not.

Many laypersons outside of the law do not understand the clear line. The same misunderstanding applies to some novice lawyers too, especially the ones that have only heard of trials. Hiding and destroying evidence are things that criminals do, not lawyers. If we catch opposing counsel hiding the ball, we respond accordingly. We do not give up and look for ways to turn our system of justice over to cold machines.


We should not take away everyone’s license just because a few cannot drive straight. A Computer Assisted Review guided solely by AI has no place in the law. AI guidance is fine, we encourage that, that is what Hybrid means, but the CARs should always have a steering wheel and brakes. Lawyers should always participate. It is total delegation to AI that we oppose, fully automated search. Legal robots can and should be our friends, but they should never be our masters.

Having said that, we do concede that the balance between Man and Machine is slowly shifting. The e-Discovery Team is gradually placing more and more reliance on the Machine. We learned many lessons on that in our participation in the TREC experiments in 2015. The fully automated methods that the academic teams used did surprisingly well, at least in relatively simple searches requiring limited legal analysis. We expect to put greater and greater reliance on AI in years to come as the software improves, but we will always keep our hands near the wheel.

We believe in a collaborative Man-Machine process, but insist that Man, here Lawyers, be the leaders. The buck must stop with the attorney of record, not a robot, even a superior AI like our Mr. EDR. Man must be responsible. Artificial intelligence can enhance our own intelligence, but should never replace it. Back to the AI car analogy, we can and should let the robot drive from time to time; it is, for instance, great at parallel parking, but we should never discard the steering wheel. Law is not a logic machine, nor should it be. It is an exercise in ethics, in fairness, justice and empathy. We should never forget the priority of the human spirit. We should never put too much faith in inhuman automation.

For more on these issues, the hybrid multimodal method, competition with fully automated methods, and much more, please see the e-Discovery Team’s final report of its participation in the 2015 TREC, Total Recall Track, found on NIST’s website. It was just published last week. At 116 pages, it should help you to fall asleep for many nights, but hopefully, not while you are driving like the bozos in the hands-free driving video below.

