TAR Course: 1st Class

June 17, 2017

First Class: Background and History of Predictive Coding

Welcome to the first (and longest) of seventeen classes where we review the background and history of predictive coding, including some of the patents in this area. This first class is somewhat difficult, but worry not, most of the classes are easier. Also, you do not need to understand the patents discussed here, just the general ideas behind the evolution of predictive coding methods.

First Generation Predictive Coding, Version 1.0

To understand the current state of the art of any field,  you need to understand what came before. This is especially true in the legal technology field. Moreover, if you do not know history, you are doomed to repeat the mistakes of the past.

The first generation Predictive Coding, version 1.0, entered the market in 2009. That is when the first document review software was released that used active machine learning, a type of specialized artificial intelligence. This new feature allowed the software to predict the relevance or irrelevance of a large collection of documents by manual review of only a portion of the documents.

The active machine learning made it possible to predict the relevance, or not, of all documents in a collection based upon the review and coding of only some of the documents. The software learned from the manual coding of the small set of documents to predict the coding of all of them. The new feature also ranked all documents in the collection according to predicted probable relevance. It sorted all of the documents into binary categories with weighted values by using complex multidimensional mathematics and statistical analysis. This is typically illustrated by the diagram below. We will not go into the black box math in this course, only how to use these powerful new capabilities. But see: Jason R. Baron and Jesse B. Freeman, Quick Peek at the Math Behind the Black Box of Predictive Coding (2013).
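To make the general idea concrete, here is a minimal sketch of the ranking concept: train a text classifier on a handful of human-coded documents, then rank the whole collection by predicted probability of relevance. It uses scikit-learn's logistic regression purely as a stand-in for the proprietary black box math; the document texts and labels are hypothetical.

```python
# A minimal sketch of the core idea (not any vendor's actual algorithm):
# learn from a few human-coded documents, then rank the whole collection
# by predicted probability of relevance. Texts and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

coded_texts = [
    "draft merger agreement discussed with outside counsel",  # coded relevant
    "cafeteria lunch menu for friday",                         # coded irrelevant
    "due diligence checklist for the merger target",           # coded relevant
    "parking garage will be closed this weekend",              # coded irrelevant
]
coded_labels = [1, 0, 1, 0]

collection = [
    "board minutes approving the merger negotiations",
    "holiday party invitation",
    "memo on valuation of the acquisition target",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(coded_texts)
X_collection = vectorizer.transform(collection)

model = LogisticRegression()
model.fit(X_train, coded_labels)

# Rank every document by its predicted probability of relevance.
scores = model.predict_proba(X_collection)[:, 1]
for score, doc in sorted(zip(scores, collection), reverse=True):
    print(round(score, 3), doc)
```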

The methods for use of predictive coding software have always been built into the software. The first version 1.0 software required the review to begin with a Subject Matter Expert (SME), usually a senior-level lawyer in charge of the case, reviewing a random selection of several thousand documents. The random documents reviewed included a secret set of documents, not identified to the SME or anyone else, called a control set.

The secret control set supposedly allowed you to objectively monitor your progress in Recall and Precision of the relevant documents in the total set. It also supposedly prevented lawyers from gaming the system. As you will see in this class, we think the use of control sets was a big mistake. The control was an illusion, much like the still image below (not an animation) by our favorite optical illusionist, Akiyoshi Kitaoka, a Professor of Psychology in Kyoto.
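For reference, the two measures the control set was supposed to track, recall and precision, are simple ratios. Here is a minimal sketch with made-up counts, just to fix the definitions before we go further:

```python
# Recall and precision, the two measures the secret control set was supposed
# to track. Definitions only; the counts below are made up for illustration.
def recall(true_positives, false_negatives):
    # Share of all truly relevant documents that the review actually found.
    return true_positives / (true_positives + false_negatives)

def precision(true_positives, false_positives):
    # Share of documents marked relevant that really are relevant.
    return true_positives / (true_positives + false_positives)

# Example: the review finds 800 of 1,000 truly relevant documents,
# while also marking 400 irrelevant documents as relevant.
print(recall(800, 200))     # 0.8
print(precision(800, 400))  # about 0.67
```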

Version 1.0 software divided the review project into two distinct stages. The first stage was for training the software on relevance so that it could predict the relevance of all of the documents. The second was for actual review of the documents that the software had predicted would be relevant. You would do your training, then stop and do all of your review.

Second Generation Predictive Coding, Version 2.0

The next generation, version 2.0 methodology, continued to use secret control sets, but combined the two stages of review into one. It was no longer train then review; instead, the training continued throughout the review project. This continuous training improvement was popularized by Maura Grossman and Gordon Cormack, who called this method continuous active learning, or CAL for short. They later trademarked CAL, and so here we will just call it continuous training, or CT for short. Under the CT method, which again was built into the software, the training continued throughout the document review. There was not one stage to train, then another to review the predicted relevant documents. The training and review continued together.

The main problem with version 2.0 predictive coding is that the use of the secret control set continued. Please note that Grossman and Cormack’s own method of review, which they call CAL, has never used control sets.

Third and Fourth Generation Predictive Coding, Versions 3.0 and 4.0

The next method of Predictive Coding, version 3.0, again combined the two stages into one, the CT technique, but eliminated the use of secret control sets. Random sampling itself remained. It is the third step in the eight-step process of both versions 3.0 and 4.0 that will be explained in the TAR Course, but the secret set of random documents, the control set, was eliminated.

The Problem With Control Sets

Although the use of a control set is basic to all scientific research and statistical analysis, it does not work in legal search. The EDRM, which apparently still promotes the use of a methodology with control sets, explains that the control set:

… is a random sample of documents drawn from the entire collection of documents, usually prior to starting Assisted Review training rounds. … The control set is coded by domain experts for responsiveness and key issues. … [T]he coded control set is now considered the human-selected ground truth set and used as a benchmark for further statistical measurements we may want to calculate later in the project. As a result, there is only one active control set in Assisted Review for any given project. … [C]ontrol set documents are never provided to the analytics engine as example documents. Because of this approach, we are able to see how the analytics engine categorizes the control set documents based on its learning, and calculate how well the engine is performing at the end of a particular round. The control set, regardless of size or type, will always be evaluated at the end of every round—a pop quiz for Assisted Review. This gives the Assisted Review team a great deal of flexibility in training the engine, while still using statistics to report on the efficacy of the Assisted Review process.

Control Sets: Introducing Precision, Recall, and F1 into Relativity Assisted Review (a kCura white paper adopted by EDRM).

The original white paper written by David Grossman, entitled Measuring and Validating the Effectiveness of Relativity Assisted Review, is cited by EDRM as support for their position on the validity and necessity of control sets. In fact, the paper does not support this proposition. The author of this Relativity White Paper, David Grossman, is a Ph.D. now serving as the associate director of the Georgetown Information Retrieval Laboratory, a faculty affiliate at Georgetown University, and an adjunct professor at IIT in Chicago. He is a leading expert in text retrieval and has no connections with Relativity except to write this one small paper. I spoke with David Grossman on October 30, 2015. He confirmed that the validity, or not, of control sets in legal search was not the subject of his investigation. His paper does not address this issue. In fact, he has no opinion on the validity of control sets in the context of legal search. Even though control sets were mentioned, it was never his intent to measure their effectiveness per se.

David Grossman was unaware of the controversies in legal search when he made that passing reference, including the controversy over the effectiveness of using control sets. He was unaware of my view, and that of many others in the field of legal search, that the ground truth at the beginning of a search project is more like quicksand. Although David has never done a legal search project, he has done many other types of real-world searches. He volunteered that he has frequently had the same quicksand type of experience, where the understanding of relevance evolves as the search progresses.

The main problem with the use of the control set in legal search is that the SMEs, what EDRM here refers to as the domain experts, never know the full truth of document responsiveness at the beginning of a project. That understanding evolves over time; it changes as particular documents are examined. The control set fails and creates false results because the “human-selected ground truth set” used “as a benchmark for further statistical measurements” is never correct, especially at the beginning of a large review project. Only at the end of a project are we in a position to determine a “ground truth” and “benchmark” for statistical measurements.

This problem was recognized by another information retrieval expert, William Webber, PhD. William does have experience with legal search and has been kind enough to help me through technical issues involving sampling many times. Here is how Dr. Webber puts it in his blog Confidence intervals on recall and eRecall:

Using the control set for the final estimate is also open to the objection that the control set coding decisions, having been made before the subject-matter expert (SME) was familiar with the collection and the case, may be unreliable.

Having done many reviews, where Losey has frequently served as the SME, we are much more emphatic than William. We do not couch our opinion with “may be unreliable.” To us there is no question that at least some of the SME control set decisions at the start of a review are almost certainly unreliable.

Another reason control sets fail in legal search is the very low prevalence typical of the ESI collections searched. We only see high prevalence when the document collection has been keyword filtered. The original collections are always low prevalence, usually less than 5%, and often less than 1%. About the highest prevalence collection we have ever searched was the Oracle collection in the EDI search contest, and it had obviously been heavily filtered by a variety of methods. That is not a best practice because the filtering often removes relevant documents from the collection, making it impossible for predictive coding to ever find them. See eg, William Webber’s analysis of the Biomet case, where this kind of keyword filtering was used before predictive coding began. What is the maximum recall in re Biomet?, Evaluating e-Discovery (4/24/13).

The control set approach cannot work in legal search because the size of the random sample, much less the portion of the sample allocated to the control set, is never even close to large enough to include a representative document from each type of relevant document in the corpus, much less the outliers. So even if the benchmark were not on such shifting ground, and it is, it would still fail because it is incomplete. The result is likely to be overtraining toward the document types that happened to hit in the control set, which is exactly what the control set is supposed to prevent. This kind of overfitting can and does happen even without exact knowledge of the documents in the control set. That is an additional problem, separate and apart from relevance shift. It is a problem solved by the multimodal search aspects of predictive coding in versions 3.0 and 4.0 taught here.
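A little arithmetic shows how easily a random control set can miss an entire type of rare but relevant document. This is a rough illustration with hypothetical numbers, not a claim about any particular project:

```python
# How likely is a random control set to contain zero examples of a rare but
# important type of relevant document? Hypothetical numbers for illustration.
def chance_type_absent(type_share_of_collection, control_set_size):
    # Probability that none of the sampled documents are of this type.
    return (1.0 - type_share_of_collection) ** control_set_size

# A document type making up 0.1% of the collection, control set of 1,000 docs:
print(chance_type_absent(0.001, 1000))    # ~0.37, better than a 1-in-3 chance of zero examples
# A rarer type, making up 0.01% of the collection:
print(chance_type_absent(0.0001, 1000))   # ~0.90, almost certainly absent from the control set
```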

Again, William Webber has addressed this issue in his typical understated manner. He points out in Why training and review (partly) break control sets the futility of using control sets to measure effectiveness because the sets are incomplete:

Direct measures of process effectiveness on the control set will fail to take account of the relevant and irrelevant documents already found through human assessment.

A naïve solution to this problem is to exclude the already-reviewed documents from the collection; to use the control set to estimate effectiveness only on the remaining documents (the remnant); and then to combine estimated remnant effectiveness with what has been found by manual means. This approach, however, is incorrect: as documents are non-randomly removed from the collection, the control set ceases to be randomly representative of the remnant. In particular, if training (through active learning) or review is prioritized towards easily-found relevant documents, then easily-found relevant documents will become rare in the remnant; the control set will overstate effectiveness on the remnant, and hence will overstate the recall of the TAR process overall. …

In particular, practitioners should be wary about the use of control sets to certify the completeness of a production—besides the sequential testing bias inherent in repeated testing against the one control set, and the fact that control set relevance judgments are made in the relative ignorance of the beginning of the TAR process. A separate certification sample should be preferred for making final assessments of production completeness.

Control sets are a good idea in general, and the basis of most scientific research, but they simply do not work in legal search. They were built into the version 1.0 and 2.0 software by engineers and scientists who had little understanding of legal search. They apparently had, and some still have, no real grasp at all of how relevance is refined and evolves during the course of any large document review, nor of the typical low prevalence of relevance. The normal distribution in probability statistics is just never found in legal search.

The whole theory behind the secret control set myth in legal search is that the initial relevance coding of these documents was correct, immutable and complete; that it should be used to objectively judge the rest of the coding in the project. That is not true. In point of fact, many documents determined to be relevant or irrelevant at the beginning of a project may be considered the reverse by the end. The target shifts. The understanding of relevance evolves. That is not because of bad luck or a weak SME (a subject we will discuss later in the TAR Course), but because of the natural progression of the understanding of the probative value of various types of documents over the course of a review.

Not only that, many types of relevant documents are never even included in the control set because they did not happen to be included in the random sample. The natural rarity of relevant evidence in unfiltered document collections, aka low prevalence, makes this more likely than not.

All experienced lawyers know how relevance shifts during a case. But the scientists and engineers who designed the first generation software did not know this, and anyway, it contravened their dogma of the necessity of control sets. They could not bend their minds to the reality of indeterminate, rare legal relevance. In legal search the target is always moving and always small. Also, the data itself can often change as new documents are added to the collection. In other areas of information retrieval, the target is solid granite, simple Newtonian, and big, or at least bigger than just a few percent. Outside of legal search it may make sense to talk of an immutable ground truth. In legal search the ground truth of relevance is discovered. It emerges as part of the process, often including surprise court rulings and amended causes of action. It is in flux. The truth is rare. The truth is relative.

The parallels of legal search with quantum mechanics are interesting. The documents have to be observed before they will manifest with certainty as either relevant or irrelevant. Uncertainty is inherent to information retrieval in legal search. Get used to it. That is reality on many levels, including the law.

The control set based procedures were not only over-complicated, they were inherently defective. They were based on an illusion of certainty, an illusion of a ground truth benchmark magically found at the beginning of a project before document review even began. There were supposedly SME wizards capable of such prodigious feats. I have been an SME in many, many topics of legal relevance since I started practicing law in 1980. I can assure you that SMEs are human, all too human. There is no magic wizard behind the curtain.

Moreover, the understanding of any good SME naturally evolves over time as previously unknown, unseen documents are unearthed and analyzed. Legal understanding is not static. The theory of a case is not static. All experienced trial lawyers know this. The case you start out with is never the one you end up with. You never really know if Schrodinger’s cat is alive or dead. You get used to that after a while. Certainty comes from the final rulings of the last court of appeals.

The use of magical control sets doomed many a predictive coding project to failure. Project team leaders thought they had high recall, because the secret control set said they did, yet they still missed key documents. They still had poor recall and poor precision, or at least far less than their control set analysis led them to believe. See: Webber, The bias of sequential testing in predictive coding, June 25, 2013 (“a control sample used to guide the producing party’s process cannot also be used to provide a statistically valid estimate of that process’s result.”) I still hear stories from reviewers who find precision of less than 50% using Predictive Coding 1.0 and 2.0 methods, sometimes far less. Our goal is to use Predictive Coding 4.0 methods to increase precision to the 80% or higher level. This allows for a reduction of cost without sacrifice of recall.

Many attorneys who worked with predictive coding software versions 1.0 or 2.0, even where they did not see their projects overtly crash and burn, as when missed smoking gun documents later turned up, or where reviewers saw embarrassingly low precision, were nonetheless suspicious of the results. Even if not suspicious, they were discouraged by the complexity and the arcane control set process from ever trying predictive coding again. As attorney and search expert J. William (Bill) Speros likes to say, they could smell the junk science in the air. They were right. I do not blame them for rejecting predictive coding 1.0 and 2.0. I did too, eventually. But unlike many, I followed the Hacker Way and created my own method, called version 3.0, and then, in late 2016, version 4.0. We will explain the changes made from version 3.0 to 4.0 later in the course.

The control set fiction put an unnecessarily heavy burden upon SMEs. They were supposed to review thousands of random documents at the beginning of a project, sometimes tens of thousands, and successfully classify them, not only for relevance, but sometimes also for a host of sub-issues. Some gamely tried, and went along with the pretense of omnipotence. After all, the documents in the control set were kept secret, so no one would ever know if any particular document they coded was correct or not. But most SMEs simply refused to spend days and days coding random documents. They refused to play the pretend wizard game. They correctly intuited that they had better things to do with their time, plus many clients did not want to spend over $500 per hour to have their senior trial lawyers reading random emails, most of which would be irrelevant.

I have heard many complaints from lawyers that predictive coding was too complicated and did not work for them. These complaints were justified. The control set and two-step review process were the culprits, not the active machine learning process. The control set has done great harm to the legal profession. As one of the few writers in e-discovery free from vendor influence, much less control, I am here to blow the whistle, to put an end to the vendor hype. No more secret control sets. Let us simplify and get real. Lawyers who have tried predictive coding before and given up, come back and try Predictive Coding 4.0.

Recap of the Evolution of Predictive Coding Methods

Version 1.0 was the first type of software with strong active machine learning algorithms. This early version is still being manufactured and sold by many vendors today. It has a two-step process of train and then review. It also uses secret control sets to guide training. This usually requires an SME to review a certain total number of ranked documents, as guided by the control set recall calculations.

Version 2.0 of Predictive Coding eliminated the two-step process, and made the training continuous. For that reason version 2.0 is also called continuous  training, CT. It did not, however,  reject the random sample step and its control set nonsense.

Predictive Coding 3.0 was a major change. It built on the continuous training improvements in 2.0, but also eliminated the secret control set and mandatory initial review of a random sample. This and other process improvements in Predictive Coding 3.0 significantly reduced the burden on busy SMEs, and significantly improved the recall estimates. This in turn improved the overall quality of the reviews.

Predictive Coding 4.0 is the latest method and the one taught in this course. It includes some variations in the ideal work flow, and refinements on the continuous active training to facilitate double-loop feedback. We call this Intelligently Spaced Training (IST); it is all part of our Hybrid Multimodal IST method. All of this will be explained in detail in this course. In Predictive Coding 3.0 and 4.0 the secret control set basis of recall calculation is replaced with a prevalence-based random sample guide, elusion-based quality control samples, and other QC techniques. These can now be done with contract lawyers and only minimal involvement by the SME. See Zero Error Numerics. This will all be explained in the TAR Course. The final elusion-type recall calculation is done at the end of the project, when final relevance has been determined. See: EI-Recall. Moreover, in the 3.0 and 4.0 process the sample documents are not secret. They are known and adjusted as the definitions of relevance change over time to better control your recall range estimates. That is a major improvement.
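To make the elusion idea concrete, here is a simplified sketch of an end-of-project, elusion-style recall estimate. It illustrates the general approach only, with hypothetical counts, and produces a single point estimate rather than the recall ranges that the EI-Recall method described in this course provides:

```python
# A simplified, end-of-project, elusion-style recall estimate. This is an
# illustration of the general approach only (the course's EI-Recall method
# produces recall ranges, not a single point). All counts are hypothetical.
def estimated_recall(relevant_found, discard_pile_size,
                     sample_size, relevant_in_sample):
    # Elusion: rate of relevant documents observed in a random sample drawn
    # from the discard pile (documents classified as irrelevant).
    elusion_rate = relevant_in_sample / sample_size
    # Projected number of relevant documents missed across the discard pile.
    projected_missed = elusion_rate * discard_pile_size
    return relevant_found / (relevant_found + projected_missed)

# Example: 9,000 relevant documents found and reviewed; a 1,500-document
# sample of the 500,000-document discard pile turns up 3 relevant documents.
print(estimated_recall(9000, 500000, 1500, 3))  # 0.9
```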

The method of predictive coding taught here has been purged of vendor hype and bad science and proven effective many times. The secret control set has never worked, and it is high time it be expressly abandoned. Here are the main reasons why: (1) relevance is never static, it changes over the course of the review; (2) the random selection size was typically too small for statistically meaningful calculations; (3) the random selection was typically too small in low prevalence collections (the vast majority in legal search) for complete training selections; and (4) it supposedly required a senior SME’s personal attention for days of document review work, a mission impossible for most e-discovery teams.

Here is Ralph Losey talking about control sets in June 2017. He is expressing his frustration about vendors still delaying upgrades to their software to eliminate the control set hooey. Are they afraid of losing business in the eastern plains of the smoky mountains?

Every day that vendors keep phony control set procedures is another day that lawyers are misled on recall calculations based on them; another day lawyers are frustrated by wasting their time on overly large random samples; another day everyone has a false sense of protection from the very few unethical lawyers out there, and the very many not fully competent lawyers; and another day clients pay too much for document review. The e-Discovery Team calls on all vendors to stop using control sets and to phase them out of their software.

The First Patents

When predictive coding first entered the legal marketplace in 2009, the legal methodology used by lawyers for predictive coding was dictated by the software manufacturers, mainly the engineers who designed the software. See eg. Leading End-to-End eDiscovery Platform Combines Unique Predictive Coding Technology with Random Sampling to Revolutionize Document Review (2009 Press Release). Recommind was an early leader, which is one reason I selected them for the Da Silva Moore v. Publicis Groupe case back in 2011. On April 26, 2011, Recommind was granted a patent for predictive coding: Patent No. 7,933,859, entitled Systems and methods for predictive coding. The search algorithms in the patent used Probabilistic Latent Semantic Analysis, an already well-established statistical analysis technique for data analysis. (Recommind obtained two more patents with the same name in 2013: Patent No. 8,489,538 on July 16, 2013; and Patent No. 8,554,716 on October 8, 2013.)

As the titles of all of these patents indicate, the methods of use of the text analytics technology in the software were key to the patent claims. As is typical for patents, many different method variables were described to try to obtain as wide a protection as possible. The core method was shown in Figure Four of the 2011 patent.

This essentially describes the method that I now refer to as Predictive Coding Version 1.0. It is the work flow I had in mind when I first designed procedures for the Da Silva Moore case. In spite of the Recommind patent, this basic method was followed by all vendors who added predictive coding features to their software in 2011, 2012 and thereafter. It is still going on today. Many of the other vendors also received patents for their predictive coding technology and methods, or applications are pending. See eg. Equivio, patent applied for on June 15, 2011 and granted on September 10, 2013, patent number 8,533,194; Kroll Ontrack, application 20120278266, April 28, 2011.

To my knowledge there has been no litigation between vendors. My guess is they all fear invalidation on the basis of lack of innovation and prior art.

The engineers, statisticians and scientists who designed the first predictive coding software are the people who dictated to lawyers how the software should be used in document review. None of the vendors seemed to have consulted practicing lawyers in creating these version 1.0 methods. I know I was not involved.

Ralph Losey

Losey in 2011 when first arguing against the methods of version 1.0

I also remember getting into many arguments with these technical experts from several companies back in 2011. That was when the predictive coding 1.0 methods hardwired into their software were first explained to me. I objected right away to the secret control set. I wanted total control of my search and review projects. I resented the secrecy aspects. There were enough black boxes in the new technology already. I was also very dubious of the statistical projections. In my arguments with them, sometimes heated, I found that they had little real grasp of how legal search was actually conducted or of the practice of law. My arguments were of no avail. And to be honest, I had a lot to learn. I was not confident of my positions, nor knowledgeable enough of statistics. All I knew for sure was that I resented their trying to control my well-established, pre-predictive coding search methods. Who were they to dictate how I should practice law, what procedures I should follow? These scientists did not understand legal relevance, nor how it changes over time during the course of any large-scale review. They did not understand the whole notion of the probative value of evidence and the function of e-discovery as trial preparation. They did not understand weighted relevance, and the 7±2 rule of judge and jury persuasion. I gave up trying, and just had the software modified to suit my needs. They would at least agree to do that to placate me.

Part of the reason I gave up trying back in 2011 is that I ran into a familiar prejudice from this expert group. It was a prejudice against lawyers common to most academics and engineers. As a high-tech lawyer since 1980 I have faced this prejudice from non-lawyer techies my whole career. They assume we are all just a bunch of weasels, not to be trusted, and with little or no knowledge of technology and search. They have no idea at all about legal ethics or professionalism, nor of our experience with the search for evidence. They fail to understand the central role of lawyers in e-discovery, and how our whole legal system, not just discovery, is based on the honesty and integrity of lawyers. We need good software from them, not methods to use the software, but they knew better. It was frustrating, believe me. So I gave up on the control set arguments and moved on. Until today.

In the arrogance of the first designers of predictive coding, an arrogance born of advanced degrees in entirely different fields, these information scientists and engineers presumed they knew enough to tell all lawyers how to use predictive coding software. They were blind to their own ignorance. The serious flaws inherent in Predictive Coding Version 1.0 are the result.

Predictive Coding Version 2.0 Adopts Continuous Training

The first major advance in predictive coding methodology was to eliminate the dual task phases present in Predictive Coding 1.0. The first phase of the two-fold version 1.0 procedure was to use active learning to train the classifier. This would take several rounds of training and eventually the software would seem to understand what you were looking for. Your concept of relevance would be learned by the machine. Then the second phase would begin. In phase two you actually reviewed the documents that met the ranking criteria. In other words, you would use predictive coding in phase one to cull out the probable irrelevant documents, and then you would be done with predictive coding. (In some applications you might continue to use predictive coding for reviewer batch assignment purposes only, but not for training.) Phase two was all about review to confirm the predicted classification, usually relevance. In phase two you would just review, and not also train.

In my two ENRON experiments in 2012 I did not follow this two-step procedure. I just kept on training until I could not find any more relevant documents. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. (Part One); Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents. (Part Two); Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron (in PDF form and the blog introducing this 82-page narrative, with second blog regarding an update); Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (a five-part written and video series comparing two different kinds of predictive coding search methods).

I did not think much about it at the time, but by continuing to train I made what was, to me, a perfectly reasonable departure from the version 1.0 method. I was using what is now promoted as the new and improved Predictive Coding 2.0. In this 2.0 version you combine training and review. The training is continuous. The first round of document training might be called the seed set, if you wish, but it is nothing particularly special. All rounds of training are important and the training should continue as the review proceeds, unless there are some logistical reasons not to. After all, training and review are both part of the same review software, or should be. It just makes good common sense to do that, if your software allows you to. If you review a document, then you might as well at least have the option to include it in the training. There is no logical reason for a cut-off point in the review process where training stops. I really just came up with that notion in Da Silva for simplicity’s sake.

In predictive coding 2.0 you do Continuous Training, or CT for short. It just makes much more sense to keep training as long as you can, if your software allows you to do that.

There are now several vendors that promote the capacity of continuous training and have it built into their review software, including Kroll.  Apparently many vendors still use the old dual task, stop training approach of version 1.0. And, most vendors still use, or at least give lip service to, the previously sacrosanct random secret control set features of version 1.0 and 2.0.

John Tredennick

The well-known Denver law technology sage, John Tredennick, CEO of Catalyst, often writes about predictive coding methods. Here is just one of the many good explanations John has made about continuous training (he calls it CAL), this one from his article with the catchy name “A TAR is Born: Continuous Active Learning Brings Increased Savings While Solving Real-World Review Problems” (note these diagrams are his, not mine, and he here calls predictive coding TAR):

How Does CAL Work?

CAL turns out to be much easier to understand and implement than the more complicated protocols associated with traditional TAR reviews.

[Diagram: Catalyst TAR 1.0 workflow]

A TAR 1.0 review is typically built around the following steps:

1. A subject matter expert (SME), often a senior lawyer, reviews and tags a sample of randomly selected documents to use as a “control set” for training.
2. The SME then begins a training process using Simple Passive Learning or Simple Active Learning. In either case, the SME reviews documents and tags them relevant or non-relevant.
3. The TAR engine uses these judgments to build a classification/ranking algorithm that will find other relevant documents. It tests the algorithm against the control set to gauge its accuracy.
4. Depending on the testing results, the SME may be asked to do more training to help improve the classification/ranking algorithm.
5. This training and testing process continues until the classifier is “stable.” That means its search algorithm is no longer getting better at identifying relevant documents in the control set.

Even though training is iterative, the process is finite. Once the TAR engine has learned what it can about the control set, that’s it. You turn it loose to rank the larger document population (which can take hours to complete) and then divide the documents into categories to review or not. There is no opportunity to feed reviewer judgments back to the TAR engine to make it smarter.

TAR 2.0: Continuous Active Learning

In contrast, the CAL protocol merges training with review in a continuous process. Start by finding as many good documents as you can through keyword search, interviews, or any other means at your disposal. Then let your TAR 2.0 engine rank the documents and get the review team going.

[Diagram: Catalyst TAR 2.0 continuous active learning workflow]

As the review progresses, judgments from the review team are submitted back to the TAR 2.0 engine as seeds for further training. Each time reviewers ask for a new batch of documents, they are presented based on the latest ranking. To the extent the ranking has improved through the additional review judgments, reviewers receive better documents than they otherwise would have.
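The continuous cycle John describes can be sketched as a simple loop: rank, review a batch, feed the judgments back, and rank again. This is an illustration of the general CAL/CT idea only, not Catalyst's or any other vendor's engine; the train_and_rank and human_review callables are hypothetical stand-ins supplied by the caller.

```python
# A bare-bones sketch of the continuous ranking-and-feedback cycle described
# above. It illustrates the general CAL/CT idea, not any vendor's engine.
# The train_and_rank and human_review callables are hypothetical stand-ins.
def continuous_review(collection, seed_judgments, batch_size, max_batches,
                      train_and_rank, human_review):
    judgments = dict(seed_judgments)  # doc_id -> True (relevant) / False
    for _ in range(max_batches):
        # Re-rank the collection using everything coded so far.
        ranking = train_and_rank(collection, judgments)
        # Reviewers take the next batch of top-ranked, unreviewed documents.
        batch = [doc for doc in ranking if doc not in judgments][:batch_size]
        if not batch:
            break  # nothing left to review; the "stop decision" in brief
        for doc in batch:
            judgments[doc] = human_review(doc)
        # Every new batch of judgments feeds the next round of training.
    return judgments
```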

John has explained to us that his software has never had a control set, and it allows you to control the timing of continuous training, so in this sense his Catalyst software is already fully Predictive Coding 3.0 and 4.0 compliant. Even if your software has control set features, you can probably still disable them. That is what I do with the Kroll software that I typically use (see eg MrEDR.com). I am talking about a method of use here, not a specific algorithm, nor a patentable invention. So unless the software you use forces you to do a two-step process, or makes you use a control set, you can use these version 3.0 and 4.0 methods with it. Still, some modifications of the software would be advantageous to streamline and simplify the whole process that is inherent in Predictive Coding 3.0 and 4.0. For this reason I call on all software vendors to eliminate the secret control set and the dual-step process now.

Version 3.0 Patents Reject the Use of Control and Seed Sets

The main problem for us with the 1.0 work-flow methodology for Predictive Coding was not the two-fold nature of train then review, which is what 2.0 addressed, but its dependence on the creation of a secret control set and seed set at the beginning of a project. That is the box labeled 430 in Figure Four of the Recommind patent. It is shown in Tredennick’s Version 1.0 diagram on the left as control set and seed set. The need for a random secret control set and seed set became an article of faith, one based on black letter statistics rules. Lawyers just accepted it without question as part of version 1.0 predictive coding. It is also one reason that the two-fold method of train then review, instead of CAL 2.0, is taking so long for some vendors to abandon.

Based on my experience and experiments with predictive coding methods since 2011, the random control set and seed set are both unnecessary. The secret control set is especially suspect. It does not work in real-world legal review projects, or worse, provides statistical misinformation as to recall. As mentioned, that is primarily because in the real world of legal practice relevance is a continually evolving concept. It is never the same at the beginning of a project, when the control set is created, as at the end. The engineers who designed version 1.0 simply did not understand that. They were not lawyers and did not appreciate the flexibility of relevance. They did not know about concept drift. They did not understand the inherent vagaries and changing nature of the search target in a large document review project. They also did not understand how human SMEs are, how they often disagree with themselves on the classification of the same document, even without concept drift. As mentioned, they were also blinded by their own arrogance, tinged with antipathy against lawyers.

They did understand statistics. I am not saying their math was wrong. But they did not understand evidence, did not understand relevance, did not understand relevance drift (or, as I prefer to call it, relevance evolution), and did not understand efficient legal practice. Many I have talked to did not have any real understanding of how lawyers worked at all, much less document review. Most were just scientists or statisticians. They meant well, but they did harm nonetheless. These scientists did not have any legal training. If there were any lawyers on the version 1.0 software development team, they were not heard, or had never really practiced law. (As a customer, I know I was brushed off.) Things have gotten much better in this regard since 2008 and 2009, but still, many vendors have not gotten the message. They still manufacture version 1.0 type predictive coding software.

Jeremy Pickens, Ph.D., Catalyst’s in-house information scientist, seems to agree with my assessment of control sets. See Pickens, An Exploratory Analysis of Control Sets for Measuring E-Discovery Progress, DESI VI 2015, where he reports on his investigation of the effectiveness of control sets to measure recall and precision. Jeremy used the Grossman and Cormack TAR Evaluation Toolkit for his data and gold standards. Here is his conclusion:

A popular approach in measuring e-discovery progress involves the creation of a control set, holding out randomly selected documents from training and using the quality of the classification on that set as an indication of progress on or quality of the whole. In this paper we do an exploratory data analysis of this approach and visually examine the strength of this correlation. We found that the maximum-F1 control set approach does not necessarily always correlate well with overall task progress, calling into question the use of such approaches. Larger control sets performed better, but the human judgment effort to create these sets have a significant impact on the total cost of the process as a whole.

A secret control set is not a part of the Predictive Coding 4.0 method. As will be explained in this course, we still have random selection reviews for prevalence and quality control purposes – Steps Three and Seven – but the documents are not secret and they are typically used for training (although they do not have to be). Moreover, after version 3.0 we eliminated any kind of special first round of training, the seed set, random based or otherwise. The first time machine training begins is simply the first round. Sometimes it is big, sometimes it is not. It all depends on our technical and legal analysis of the data presented and the circumstances of the project, including the disputed issues of fact in the lawsuit or other legal investigation. That is the kind of thing that lawyers do every day. No magic required, not even high intelligence; only background and experience as a practicing lawyer.

The seed set is dead. So too is the control set. Other statistical methods must be used to calculate recall ranges and other numeric parameters, beyond the ineffective control set method. Other methods beyond just statistics must be used to evaluate the quality and success of a review project. See eg. EI-Recall and Zero Error Numerics (which includes statistics, but is not limited to it).

Grossman and Cormack Patents

We do not claim any patents or other intellectual property rights to Predictive Coding 4.0, aside from copyrights to Losey’s writings, and certain trade secrets that we use, but have not published or disclosed outside of our circle of trust. But our friends Gordon Cormack and Maura Grossman, who are both now professors, do claim patent rights to their methods. The methods are apparently embodied in software somewhere, even though the software is not sold. In fact, we have never seen it, nor, as far as I know, has anyone else, except perhaps their students. Their patents are all entitled Systems and methods for classifying electronic information using advanced active learning techniques: December 31, 2013, 8,620,842, Cormack; April 29, 2014, 8,713,023, Grossman and Cormack; and September 16, 2014, 8,838,606, Grossman and Cormack.

The Grossman and Cormack patents and patent applications are interesting for a number of reasons.  For instance, they all contain the following paragraph in the Background section explaining why their invention is needed. As you can see it criticizes all of the existing version 1.0 software on the market at the time of their applications (2013) (emphasis added):

Generally, these e-discovery tools require significant setup and maintenance by their respective vendors, as well as large infrastructure and interconnection across many different computer systems in different locations. Additionally, they have a relatively high learning curve with complex interfaces, and rely on multi-phased approaches to active learning. The operational complexity of these tools inhibits their acceptance in legal matters, as it is difficult to demonstrate that they have been applied correctly, and that the decisions of how to create the seed set and when to halt training have been appropriate. These issues have prompted adversaries and courts to demand onerous levels of validation, including the disclosure of otherwise non-relevant seed documents and the manual review of large control sets and post-hoc document samples. Moreover, despite their complexity, many such tools either fail to achieve acceptable levels of performance (i.e., with respect to precision and recall) or fail to deliver the performance levels that their vendors claim to achieve, particularly when the set of potentially relevant documents to be found constitutes a small fraction of a large collection.

They then indicate that their invention overcomes these problems and is thus a significant improvement over prior art. In Figure Eleven of their patent (shown below) they describe one such improvement, “an exemplary method 1100 for eliminating the use of seed sets in an active learning system in accordance with certain embodiments.”

[Figure Eleven from the Grossman & Cormack patent]

These are basically the same kind of complaints that I have made here against Predictive Coding 1.0 and 2.0. I understand the criticisms regarding complex interfaces, that rely on multi-phased approaches to active learning. If the software forces use of control set and seed set nonsense, then it is an overly complex interface. (It is not overly complex if it allows other types of search, such as keyword, similarity or concept, for this degree of complexity is necessary for a multimodal approach.) I also understand their criticism of the multi-phased approaches to active learning, which was fixed in 2.0 by the use of continuous training, instead of train and then review.

The Grossman & Cormack criticism about low prevalence document collections, which is the rule, not the exception in legal search, is also correct. It is another reason the control set approach cannot work in legal search. The number of relevant documents to be found constitutes a small fraction of a large collection and so the control set random sample is very unlikely to be representative, much less complete. That is an additional problem separate and apart from relevance shift.

About the only complaint the Grossman & Cormack patent makes that I do not understand is the gripe about large infrastructure and interconnection across many different computer systems in different locations. For Kroll software at least, and also Catalyst, that is the vendor’s problem, not the attorney’s. All the user does is sign on to a secure cloud server.

Notice that there is no seed set or control set in the Grossman & Cormack patent diagram, as there is in the old Recommind patent. Much of the rest of the patent, insofar as I am able to understand the arcane patent language used, consists of applications of continuous training techniques that have been tested and explained in their writings, including many additional variables and techniques not mentioned in their articles. See eg. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014. Their patent includes the continuous training methods, of course, but also eliminates the use of seed sets. I assumed this also means the elimination of the use of control sets, a fact that Professor Cormack later confirmed. Their CAL methods do not use secret control sets. Their software patents are thus like our own 3.0 and 4.0 innovations, although they do not use IST.

Go on to Class Two.

Or pause to do this suggested “homework” assignment for further study and analysis.

SUPPLEMENTAL READING: Review all of the patents cited, especially the Grossman and Cormack patents (almost identical language was used in both, as you will see, so you only need to look at one). Just read the sections in the patents that are understandable and skip the arcane jargon. Also, we suggest you read all of the articles cited in this course. That is a standing homework assignment in all classes. Again, some of it may be too technical. Just skip through or skim those sections. Also see if you can access Losey’s LTN editorial, Vendor CEOs: Stop Being Empty Suits & Embrace the Hacker Way. We suggest you also check out HackerLaw.org.

EXERCISES: What does TTR stand for in door number one of the above graphic? Take a guess. I did not use the acronym in this class, but if you have understood this material, you should be able to guess what it means. In later classes we will add more challenging exercises at the end of the class, but this first class is hard enough, so we  will let it go with that.

Students are invited to leave a public comment below. Insights that might help other students are especially welcome. Let’s collaborate!

_

e-Discovery Team LLC COPYRIGHT 2017

ALL RIGHTS RESERVED

_

 


Another TAR Course Update and a Mea Culpa for the Negative Consequences of ‘Da Silva Moore’

June 4, 2017

We lengthened the TAR Course again by adding a video focusing on the three iterated steps in the eight-step workflow of predictive coding. Those are steps four, five and six: Training Select, AI Document Ranking, and Multimodal Review. Here is the new video introducing these steps. It is divided into two parts.

This video was added to the thirteenth class of the TAR Course. It has sixteen classes altogether, which we continue to update and announce on this blog. There were also multiple revisions to the text in this class.

Unintended Negative Consequences of Da Silva Moore

Predictive coding methods have come a long way since Judge Peck first approved predictive coding in our Da Silva Moore case. The method Brett Anders and I used back then, including disclosure of irrelevant documents in the seed set, was primarily derived from the vendor whose software we used, Recommind, and from Judge Peck himself. We had a good intellectual understanding, but it was the first use for all of us, except the vendor. I had never done a predictive coding review before, nor, for that matter, had Judge Peck. As far as I know Judge Peck still has not ever actually used predictive coding software to do document review, although you would be hard pressed to find anyone else in the world with a better intellectual grasp of the issues.

I call the methods we used in Da Silva Moore Predictive Coding 1.0. See: Predictive Coding 3.0 (October 2015) (explaining the history of predictive coding methods). Now, more than five years later, my team is on version 4.0. That is what we teach in the TAR Course. What surprises me is that the rest of the profession is still stuck in our first method, our first ideas of how to best use the awesome power of active machine learning.

This failure to move on past the Predictive Coding 1.0 methods of Da Silva Moore is, I suspect, one of the major reasons that predictive coding has never really caught on. In fact, the most successful document review software developers since 2012 have ignored predictive coding altogether.

Mea Culpa

Looking back now at the 1.0 methods we used in Da Silva I cannot help but cringe. It is truly unfortunate that the rest of the legal profession still uses these methods. The free TAR Course is my attempt to make amends, to help the profession move on from the old methods. Mea Culpa.

In my presentation in Manhattan last month I humorously quipped that my claim to fame, Da Silva Moore, was also my claim to shame. We never intended for the methods in Da Silva Moore to be the last word. It was the first word, writ large, to be sure, but in pencil, not stone. It was like a billboard that was supposed to change, but never did. Who knew what we did back in 2012 would have such unintended negative consequences?

In Da Silva Moore we all considered the method of usage of machine learning that we came up with as something of an experiment. That is what happens when you are the first at anything. We assumed that the methods we came up with would quickly mature and evolve in other cases. They certainly did for us. Yet, the profession has mostly been silent about methods since the first version 1.0 was explained. (I could not take part in these early explanations by other “experts” as the case was ongoing and I was necessarily silenced from all public comment about it.) From what I have been told by a variety of sources, many, perhaps even most, attorneys and vendors are using the same methods that we used back in 2012. No wonder predictive coding has not caught on like it should. Again, sorry about that.

Why the Silence?

Still, it is hardly all my fault. I have been shouting about methods ever since 2012, even if I was muzzled from talking about Da Silva Moore. Why is no one else talking about the evolution of predictive coding methods? Why is mine the only TAR Course?

There is some discussion of methods going on, to be sure, but most of it is rehashed, or so high-level and intellectual as to be superficial and worthless. The discussions and analysis do not really go into the nitty-gritty of what to do. Why are we not talking about the subtleties of the “Stop decision”? About the ins and outs of document training selection? About the respective merits of CAL versus IST? I would welcome dialogue on this with other practicing attorneys or vendor consultants. Instead, all I hear is silence and old issues.

The biggest topic still seems to be the old one of whether to filter documents with keywords before beginning machine training. That is a big, no duh, don’t do it, unless lack of money or some other circumstance forces you to, or unless the filtering is incidental and minor, just to cull out the obviously irrelevant. See eg: Stephanie Serhan, Calling an End to Culling: Predictive Coding and the New Federal Rules of Civil Procedure, 23 Rich. J.L. & Tech. 5 (2016). Referring to the 2015 Rule Amendments, Serhan, a law student, concludes:

Considering these amendments, predictive coding should be applied at the outset on the entire universe of documents in a case. The reason is that it is far more accurate, and is not more costly or time-consuming, especially when the parties collaborate at the outset.

Also see eg, William Webber’s analysis of the Biomet case, where this kind of keyword filtering was used before predictive coding began. What is the maximum recall in re Biomet?, Evaluating e-Discovery (4/24/13). Webber, an information scientist, showed back in 2013 that when keyword filtering was used in the Biomet case, it filtered out over 40% of the relevant documents. This doomed the second-filter predictive coding review to a maximum possible recall of 60%, even if it was perfect, meaning it would otherwise have attained 100% recall, which (almost) never happens. I have never seen a cogent rebuttal of this analysis, aside from proportionality and cost arguments.
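The arithmetic behind that ceiling is simple: overall recall is capped by the recall of the first-stage keyword filter, no matter how good the second-stage predictive coding pass is. A quick illustration with hypothetical numbers:

```python
# The arithmetic behind the recall ceiling (illustrative numbers only).
filter_recall = 0.60        # keyword culling kept only 60% of relevant docs
second_stage_recall = 0.85  # even a strong predictive coding pass afterwards
overall_recall = filter_recall * second_stage_recall
print(overall_recall)       # about 0.51; and 0.60 is the hard ceiling, even at 100%
```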

There was discussion for a while on another important, yet sort of no-brainer issue: whether to keep on machine training or not, the continuous approach being what Grossman and Cormack called Continuous Active Learning (CAL). We did not do that in Da Silva Moore, because we were using Predictive Coding 1.0 as explained by our vendor. We have known better than that now for years. In fact, later in 2012, during my two public ENRON document review experiments with predictive coding, I did not follow the two-step procedure of version 1.0. Instead, I just kept on training until I could not find any more relevant documents. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. (Part One); Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents. (Part Two); Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron (in PDF form and the blog introducing this 82-page narrative, with second blog regarding an update); Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (a five-part written and video series comparing two different kinds of predictive coding search methods).

Of course you keep training. I have never heard any viable argument to the contrary. Train then review, which was the protocol in Da Silva Moore, was the wrong way to do it. Clear and simple. The right way to do machine training is to keep training until you are done with the review. This is the main thing that separates Predictive Coding 1.0 from 2.0. See: Predictive Coding 3.0 (October 2015). I switched to version 2.0 right after Da Silva Moore in late 2012 and started using continuous training on my own initiative. It seemed obvious once I had some experience under my belt. Still, I do credit Maura Grossman and Gordon Cormack with the terminology and the scientific proof of the effectiveness of CAL, a term which they have now trademarked for some reason. They have made important contributions to methods and are tireless educators of the profession. But where are the other voices? Where are the lawyers?

The Grossman and Cormack efforts are scientific and professorial. To me this is just work. This is what I do as a lawyer to make a living. This is what I do to help other lawyers find the key documents they need in a case. So I necessarily focus on the details of how to actually do active machine learning. I focus on the methods, the work-flow. Aside from the Professors Cormack and Grossman, and myself, almost no one else is talking about predictive coding methods. Lawyers mostly just do what the vendors recommend, like I did back in Da Silva Moore days. Yet almost all of the vendors are stagnant. (The new KrolLDiscovery and Catalyst are two exceptions, and even the former still has some promised software revisions to make.)

From what I have seen of the secret sauce that leaks out in predictive coding software demos of most vendors, they are stuck in the old version 1.0 methods. They know nothing, for instance, of the nuances of double-loop learning taught in the TAR Course. The vendors are instead still using the archaic methods that I thought were good back in 2012. I call these methods Predictive Coding 1.0 and 2.0. See: Predictive Coding 3.0 (October 2015).

Beyond the question of continuous training, most of those methods still use nonsensical random control sets that ignore concept drift, a fact of life in every large review project. Id. Moreover, the statistical analysis for recall used in 1.0 and 2.0 does not survive close scrutiny. Most vendors routinely ignore the impact of confidence intervals on the recall range, and the impact of low prevalence data sets. They do not even mention the binomial calculations designed to deal with low prevalence. Id. See also: ZeroErrorNumerics.com.
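To illustrate why the range matters, here is a minimal sketch of one standard binomial interval, the exact Clopper-Pearson interval, applied to a hypothetical low prevalence sample. This is not any vendor’s recall calculation.

```python
# A minimal sketch of an exact (Clopper-Pearson) binomial confidence interval,
# to illustrate how wide the range gets in low prevalence collections. The
# sample numbers are hypothetical; this is not any vendor's recall formula.

from scipy.stats import beta

def clopper_pearson(hits, sample_size, confidence=0.95):
    """Exact binomial confidence interval for an observed proportion."""
    alpha = 1 - confidence
    lower = beta.ppf(alpha / 2, hits, sample_size - hits + 1) if hits > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, hits + 1, sample_size - hits) if hits < sample_size else 1.0
    return lower, upper

# Hypothetical example: a 1,500-document random sample turns up only 15
# relevant documents, a 1% point estimate of prevalence.
low, high = clopper_pearson(hits=15, sample_size=1500)
print(f"95% interval: {low:.2%} to {high:.2%}")   # roughly 0.56% to 1.64%
# Projected onto a one-million-document collection, the implied number of
# relevant documents ranges from about 5,600 to about 16,400, so a recall
# estimate built on the 1% point estimate alone can be off by nearly a
# factor of three.
```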

Conclusion

The e-Discovery Team will keep on writing and teaching, satisfied that at least some of the other leaders in the field are doing essentially the same thing. You know who you are. We hope that someday others will experiment with the newer methods. The purpose of the TAR Course is to provide the information and knowledge needed to try these methods. If you have tried predictive coding before, and did not like it, we hear you. We agree. I would not like it either if I still had to use the antiquated methods of Da Silva Moore.

We try to make amends for the unintended consequences of Da Silva Moore by offering this TAR Course. Predictive coding really is breakthrough technology, but only if used correctly. Come back and give it another try, but this time use the latest methods of Predictive Coding 4.0.

Machine learning is based on science, but the actual operation is an art and craft. So few writers in the industry seem to understand that. Perhaps that is because they are not hands-on. They do not step-in. (Stepping-In is discussed in Davenport and Kirby, Only Humans Need Apply, and by Dean Gonsowski, A Clear View or a Short Distance? AI and the Legal Industry, and A Changing World: Ralph Losey on “Stepping In” for e-Discovery. Also see: Losey, Lawyers’ Job Security in a Near Future World of AI, Part Two.) Even most vendor experts have never actually done a document review project of their own. And the software engineers, well, forget about it. They know very little about the law (and what they think they know is often wrong) and very little about what really goes on in a document review project.

Knowledge of the best methods for machine learning, for AI, does not come from thinking and analysis. It comes from doing, from practice, from trial and error. This is something all lawyers understand because most difficult tasks in the profession are like that.

The legal profession needs to stop taking legal advice from vendors on how to do AI-enhanced document review. Vendors are not supposed to be giving legal advice anyway. They should stick to what they do best, creating software, and leave it to lawyers to determine how to best use the tools they make.

My message to lawyers is to get on board the TAR train. Even though Da Silva Moore blew the train whistle long ago, the train is still in the station. The tracks ahead are clear of all legal obstacles. The hype and easy money phase has passed. The AI review train is about to get moving in earnest. Try out predictive coding, but by all means use the latest methods. Take the TAR Course on Predictive Coding 4.0 and insist that your vendor adjust their software so you can do it that way.


PERSPECTIVES ON PREDICTIVE CODING And Other Advanced Search Methods for the Legal Practitioner

December 9, 2016
[Book cover image]

Click the Cover to Order from ABA

Editors: Jason R. Baron, Ralph C. Losey, Michael Berman

Foreword: Judge Andrew Peck

About the Book (an excerpt from Jason Baron’s Introduction)

Each of the three editors of this volume graduated law school in 1980, which has meant that we have been firsthand witnesses to the transformation of legal practice and especially discovery practice during the past few decades. There was a time when discovery meant searching only through boxes containing paper files, where the big case simply meant searching through more boxes in the client’s warehouse.

Discovery did not yet need an “e” as a prefix, and manual searches for relevant documents sufficed. Judge Andrew J. Peck notes this, as well, in his Foreword to this volume. Fast forward to the present, and how the world of lawyering has changed. The present “inflationary” period of information exploding has been built on copying machines and personal computers in the 1970s, e-mail beginning widespread use in the late 1980s, and the opening of the desktop to the Internet and especially the World Wide Web in the 1990s. The pace of change has only continued to accelerate since the turn of the century, with the emergence of social media and mobile devices in the last decade transforming what it means to conduct business. As this book goes to print, we are on the cusp of the Internet of Things, with smart devices proliferating and generating new data streams and new forms of evidence to search.

Today, every lawyer conducting “discovery” in civil litigation needs to confront the fact that—no matter how large or small the case may be—it is insufficient to simply define the search task as being limited to finding relevant documents in traditional paper files. The legal profession lives and breathes in a world of “electronically stored information” (ESI), a term of art introduced into legal practice by virtue of the 2006 amendments to the Federal Rules of Civil Procedure.

But what constitutes our doing a “reasonable” job in finding relevant evidence in a world exploding in data? The initial approach lawyers took (and still take) to confronting large volumes of ESI is to rely on keyword searching, supplemented by manual searches, to cull out relevant and privileged material before a production is made to opposing counsel. Although these “time-tested” approaches have their defenders, simple reliance on manual and keyword searching increasingly is seen as inadequate to the task at hand, both on grounds of accuracy and efficiency, as compared with more advanced search techniques.

The editors of this book are readily willing to stipulate in advance that they have a strong bias in favor of advancing the cause of computer-assisted review and educating the profession on how more advanced search techniques work. In one way or another, they have spent the better part of the last 15 years engaged in initiating and participating in research projects, and academic conferences, joining think tanks, communicating through online media platforms, writing law reviews, authoring e-discovery books, and teaching e-discovery in law and graduate schools, in evangelizing on the topic of how lawyers may conduct “better” searches of electronic evidence using smarter methods than manual and keyword searching. Along the way, we have been fortunate to encounter a number of brilliant lawyers and scholars at the cutting edge of e-discovery and information science, many of whom we are grateful to for their contributions to this volume.

This book is an attempt to catch lightning in a bottle; namely, to provide a set of perspectives on predictive coding and other advanced search techniques, as they are used today by lawyers in pursuit of e-discovery, in investigations, and in other legal contexts, such as information governance. We are painfully aware that the shelf-life of publications such as the present work is not long. Nevertheless, we trust that a cross-section of related—and sometimes differing—perspectives on how today’s advanced search methods at the cutting-edge of legal practice will prove illuminating to a greater legal audience.  …

The book is meant to appeal both to practitioners who are seeking knowledge of what predictive coding and other advanced search methods are all about, as well as to those members of the legal community who are “inside the bubble” of e-discovery already and wish to be exposed to the latest, cutting-edge techniques. We would like to imagine that the book may also be read by lawyers who do not consider themselves litigators or e-discovery practitioners, but who wish to apply a knowledge of smart analytics in other legal contexts.

The reader should be aware that given the relative novelty of predictive coding and other advanced search methods, there have been and will continue to be disagreements over what constitutes “best practices” in the space, and the editors of course have their own preferences and biases. However, the book attempts to be inclusive of a range of views, not always necessarily our own. …

As this book goes to print, there appear to be voices in the profession questioning whether predictive coding has been oversold or overhyped, and pointing to resistance in some quarters to wholesale embrace of the types of algorithmics and analytics on display throughout this volume. Notwithstanding these critics, the editors of this volume remain serene in their certainty that the chapters in this book represent the future of e-discovery and the legal profession as it will come to be practiced into the foreseeable future, by a larger and larger contingent of lawyers. Of course, for some, the prospect of needing to be technically competent in advanced search techniques may lead to considerations of early retirement. For others, the idea that lawyers may benefit from embracing predictive coding and other advanced technologies is exhilarating. We hope this book inspires the latter feelings on the part of the reader.

________________

TABLE OF CONTENTS

FOREWORD: JUDGE ANDREW PECK

INTRODUCTION: Jason R. Baron

SEARCHING FOR ESI: SOME PRELIMINARY PERSPECTIVES

Chapter 1: The Road to Predictive Coding: Limitations on the Defensibility of Manual and Keyword Searching. Tracy D. Drynan and Jason R. Baron.

Chapter 2: The Emerging Acceptance of Technology-Assisted Review in Civil Litigation. Alicia L. Shelton and Michael D. Berman.

PRACTITIONER PERSPECTIVES

Chapter 3: A Tour of Technology-Assisted Review. Maura R. Grossman and Gordon V. Cormack.

Chapter 4: The Mechanics of a Predictive Coding Workflow. Vincent M. Catanzaro, Samantha Green, and Sandra Rampersaud.

Chapter 5: Reflections on the Cormack and Grossman SIGIR Study: The Folly of Using Random Search for Machine Training. Ralph C. Losey.

Chapter 6: TAR for the Small and Medium Case. William F. Hamilton.

Chapter 7: Reality Bites: Why TAR’s Promises Have Yet to Be Fulfilled. William P. Butterfield and Jeannine M. Kenney.

Chapter 8: Predictive Coding from the Defense Perspective: Issues and Challenges. Ronni D. Solomon, Rose J. Hunter-Jones, Jennifer A. Mencken, and Edward T. Logan.

Chapter 9: Safeguarding the Seed Set: Why Seed Set Documents May Be Entitled to Work–Product Protection. The Hon. John M. Facciola and Philip J. Favro.

Chapter 10: Experts on Computer-Assisted Review: Why Federal Rule of Evidence 702 Should Apply to Their Use. The Hon. David J. Waxse and Brenda Yoakum-Kriz.

Chapter 11: License to Cull: Two-Filter Document Culling Method That Uses Predictive Coding and Other Search Tools. Ralph C. Losey.

INFORMATION RETRIEVAL PERSPECTIVES; E-Discovery Standards

Chapter 12: Defining and Estimating Effectiveness in Document Review. David D. Lewis.

Chapter 13: Metrics in Predictive Coding. William Webber and Douglas W. Oard.

Chapter 14: On the Place of Measurement in E-Discovery. Bruce Hedin, Dan Brassil, and Amanda Jones.

Chapter 15: A Modest Proposal for Preventing e-Discovery Standards from Being a Burden to Practitioners, Clients, the Courts, or Common Sense. Gilbert S. Keteltas, Karin S. Jenson, and James A. Sherer.

ANALYTICS AND THE LAW

Chapter 16: Algorithms at the Gate: Leveraging Predictive Analytics in Mergers, Acquisitions, and Divestitures. Jeffrey C. Sharer and Robert D. Keeling.

Chapter 17: The Larger Picture: Moving Beyond Predictive Coding for Document Productions to Predictive Analytics for Information Governance. Sandra Serkes.

Chapter 18: Predictive Analytics for Information Governance in a Law Firm: Mitigating Risks and Optimizing Efficiency. Leigh Isaacs.

Chapter 19: Finding the Signal in the Noise: Information Governance, Analytics, and the Future of Legal Practice. Bennett B. Borden and Jason R. Baron.

Chapter 20: Preparing for the Near Future: Deep Learning and Law. Kathryn Hume.

Appendix: The Grossman-Cormack Glossary of Technology-Assisted Review. Maura R. Grossman and Gordon V. Cormack.


PERSPECTIVES ON PREDICTIVE CODING

January 7, 2017
[Book cover image]

Click the Cover to Order from ABA

My second book, published in late 2016, is a reference text on document review and predictive coding for which I served as an editor and contributed two of the twenty chapters: PERSPECTIVES ON PREDICTIVE CODING And Other Advanced Search Methods for the Legal Practitioner. The Foreword is by Judge Andrew Peck. My co-editors are Jason R. Baron and Michael Berman. The book can be purchased online directly from the publisher, the ABA. You can also call ABA Customer Service at 800-285-2221, Monday through Friday between 9:00 AM and 6:00 PM ET. ABA members get a big discount. It should also be available on Amazon in June 2017, but in the meantime, the ABA has the exclusive.

In Perspectives I share editor duties with Jason R. Baron and Michael Berman, something I have never done before. I usually just drone on and on by myself, but this time I have help from top experts on predictive coding. This lengthy, comprehensive book has many contributing authors. Perspectives on Predictive Coding is the best reference book available on this subject and, as an added bonus, it is big enough to stop any size door.


