TAR Course: 1st Class

First Class: Background and History of Predictive Coding

Welcome to the first (and longest) of seventeen classes in the training program. In this opening class we review the background and history of predictive coding, including some of the patents in this area. This first class is somewhat difficult, but worry not, most of the classes are easier. Also, you do not need to understand the patents discussed here, just the general ideas behind the evolution of predictive coding methods.

Version 1.0 – The Early Days of Predictive Coding and Axcelerate Software

The first generation of Predictive Coding software for lawyers was called Axcelerate by Recommind. It was released in 2009 as the first commercial program to use active machine learning for legal search. See Leading End-to-End eDiscovery Platform Combines Unique Predictive Coding Technology with Random Sampling to Revolutionize Document Review (2009 Press Release). From the earliest versions, the software was able to predict the relevance or irrelevance of a large collection of documents after lawyers manually reviewed only a portion of them. It sorted all documents into binary categories with weighted values. At the time this was revolutionary for lawyers. Below is a diagram from 2009 explaining what it did.

In this version 1.0 of predictive coding methods lawyers began the review with a Subject Matter Expert (SME) who reviewed a random selection of documents, including a secret control set. The process was divided into two stages: training the software and actual review of documents predicted to be relevant. This is the software and initial process we followed in the landmark case, Da Silva Moore, in 2011. On April 26, 2011, Recommind was granted a patent for predictive coding: Patent No. 7,933,859, entitled Systems and methods for predictive coding. The search algorithms in the patent used Probabilistic Latent Semantic Analysis, an already well-established statistical analysis technique for data analysis. (Recommind obtained two more patents with the same name in 2013: Patent No. 8,489,538 on July 16, 2013; and Patent No. 8,554,716 on October 8, 2013.) In 2016, OpenText corporation acquired Recommind, and has since totally updated and improved the Axcelerate Software.

As the titles of all of these patents indicate, the methods of use of the text analytics technology in the software were key to the patent claims. As is typical for patents, many different method variables were described to try to obtain as wide a protection as possible. The core method was shown in Figure Four of the 2011 patent.


This essentially describes the method that I now refer to as Predictive Coding Version 1.0. It is the work flow I had in mind when I first designed procedures for the Da Silva Moore case. In spite of the Recommind patent, this basic method was followed by all vendors who added predictive coding features to their software in 2011, 2012 and thereafter. It is still going on today. Many of the other vendors also received patents for their predictive coding technology and methods, or applications are pending. See, e.g., Equivio, patent applied for on June 15, 2011 and granted on September 10, 2013, patent number 8,533,194; Kroll Ontrack, application 20120278266; Cormack, December 31, 2013, 8,620,842; Grossman and Cormack, April 29, 2014, 8,713,023; Grossman and Cormack, September 16, 2014, 8,838,606.

Losey in 2011 when he first started arguing against the methods of version 1.0

I also remember getting into many arguments with the technical experts from several companies back in 2011, including especially Kroll Ontrack. That was when the predictive coding 1.0 methods hardwired into their software, as well as Recommind’s, were first explained to me. I objected right away to the secret control set. I wanted total control of my search and review projects. I resented the secrecy aspects. There were enough black boxes in the new technology already. I was also very dubious of the statistical projections. In my arguments with them, sometimes heated, I found that they had little real grasp of how legal search was actually conducted or the practice of law. My arguments were of no avail. And to be honest, I had a lot to learn. I was not confident of my positions, nor knowledgeable enough of statistics. All I knew for sure is that I resented their trying to control my well-established, pre-predictive coding search methods. Who were they to dictate how I should practice law, what procedures I should follow? These scientists did not understand legal relevance, nor how it changes over time during the course of any large-scale review. They did not understand the whole notion of the probative value of evidence and the function of e-discovery as trial preparation. They did not understand weighted relevance, and the 7±2 rule of judge and jury persuasion. I gave up trying, and just had the software modified to suit my needs.

Part of the reason I gave up trying back in 2011 is that I ran into a familiar prejudice from this expert group. It was a prejudice against lawyers common to most academics and engineers. As a high-tech lawyer since 1980 I have faced this prejudice from non-lawyer techies my whole career. Many assume we were all just a bunch of lawyer weasels, not to be trusted, and with little or no knowledge of technology and search. They have no idea at all about legal ethics or professionalism, nor of our experience with the search for evidence. They fail to understand the central role of lawyers in e-discovery, and how our whole legal system, not just discovery, is based on the honesty and integrity of lawyers. We need good software from them, not methods to use the software, but they knew better. It was frustrating, believe me. So I gave up on the control set arguments and moved on. Gordon Cormack and Maura Grossman were not part of this stubborn group. They never included control sets in their methods.

In the arrogance of the first designers of predictive coding, an arrogance born of advanced degrees in entirely different fields, these information scientists and engineers presumed they knew enough to tell all lawyers how to use predictive coding software. They were blind to their own ignorance. The serious flaws inherent in Predictive Coding Version 1.0 are the result. The patents of Gordon Cormack and Maura Grossman were in part a reaction to these flaws and mistakes, as is shown in the body of their patent.

Still, even with poor methods, Predictive Coding software in 2009 disrupted electronic discovery by providing a tool that learned from a small set of coded documents how to predict the relevance of all of them. The new feature ranked all documents in the collection according to predicted probable relevance. It sorted all of the documents into binary categories with weighted values by using complex multidimensional mathematics and statistical analysis. We will not go into the black box math in this course, only how to use these powerful new capabilities. But see: Jason R. Baron and Jesse B. Freeman, Quick Peek at the Math Behind the Black Box of Predictive Coding (2013).

The methods for use of predictive coding software have always been built into the software. The first version 1.0 software required a user to begin the review with a SME, usually a senior-level lawyer in charge of the case, to review a random selection of several thousand documents. The random documents they reviewed included a secret set of documents not identified to the SME, or anyone else, called a control set. The secret control set supposedly allowed you to objectively monitor your progress in Recall and Precision of the relevant documents from the total set. It also supposedly prevented lawyers from gaming the system. As you will see, many of us think that the use of control sets in Predictive Coding for e-discovery was a big mistake. The control was an illusion and made the whole process unnecessarily complicated and time-consuming.
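To make the control set idea concrete, here is a minimal sketch in Python of how recall and precision could be estimated from a coded control set. It is only an illustration, not any vendor's actual implementation. The function name, the ten-document control set and the coding values are all hypothetical.

```python
def control_set_metrics(human_codes, predicted_codes):
    """Estimate recall and precision from a coded control set.

    human_codes and predicted_codes are parallel lists of booleans
    (True = relevant). Both names are hypothetical, for illustration only.
    """
    pairs = list(zip(human_codes, predicted_codes))
    true_pos = sum(1 for h, p in pairs if h and p)
    false_pos = sum(1 for h, p in pairs if not h and p)
    false_neg = sum(1 for h, p in pairs if h and not p)
    # Recall: share of the SME-coded relevant documents the software also found.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else None
    # Precision: share of the predicted-relevant documents the SME coded relevant.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else None
    return recall, precision


# Hypothetical control set of ten documents: SME coding vs. software prediction.
human = [True, True, True, False, False, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False, False, False]

recall, precision = control_set_metrics(human, predicted)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Note that both numbers depend entirely on the SME's coding of the control set being treated as correct, which is the assumption this class goes on to question.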

Eventually I came to understand that the experts who promoted a two-step process (train with what they called a seed set and a control set, and then review using a secret control set to calculate recall) did not understand a lot about document review in legal cases. For one thing, they did not understand the low prevalence of relevant documents in the data typically searched in legal cases. They also did not understand the impact of concept drift, where the SME concept of relevance will develop and improve over the course of a review. They also did not grasp the need for transparency in legal proceedings. There were enough black boxes in the new technology already.

My arguments with the software vendors were of no avail at that time. One vendor, Kroll Ontrack, was forced by me, as a major customer, to modify a version of their software so that it could be used as an ongoing one-step process, without control sets, at least for my projects. Eventually, they saw it worked and gave me access to playgrounds they set up for me to experiment with new methods using the Enron database. I am grateful to them, and my law firm at the time, Jackson Lewis. I wrote and spoke publicly about this continuously. Other experts looked at the peculiarities of legal search and agreed with my analysis and attack of secret control sets and a two-step process. Chief among them was Professor Gordon Cormack, who was well taught on the law by working very closely on review projects with attorney Maura Grossman. He also provided her with his own predictive coding software, which has never been offered for sale, but has been patented. The two of them, Grossman and Cormack, were, and still are, the dominant serious scholars and also tireless promoters of active machine learning for document review.

Second Generation Predictive Coding, Version 2.0

The original versions of most software available to attorneys, including the initial market leader Recommind, had two distinct stages: training the software to predict relevance, using control and seed sets (initial training set), and then actually reviewing the documents that were predicted to be relevant. We used it at first in the landmark Da Silva Moore case with Judge Peck, but quickly abandoned this method in later cases in favor of a single ongoing process that combined training and review. That is the dividing line for version 2.0 of predictive coding.

In two of my ongoing Enron experiments using Kroll Ontrack software in 2012, I did not follow this two-step procedure. I just kept on training until I could not find any more relevant documents. See A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (Part One); Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents (Part Two); Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron (in PDF form and the blog introducing this 82-page narrative, with second blog regarding an update); Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (a five-part written and video series comparing two different kinds of predictive coding search methods).

In my 2012 published Enron experiments, I was using what was later promoted by others as Predictive Coding 2.0. In this 2.0 version you combine training and review. The training is continuous. The first round of document training might be called the seed set, if you wish, but it is nothing particularly special. All rounds of training are important and the training should continue as the review proceeds, unless there are some logistical reasons not to. After all, training and review are both part of the same review software, or should be. It just makes good common sense to do that, if your software will allow it. If you review a document, then you might as well at least have the option to include it in the training. There is no logical reason for a cut-off point in the review process where training stops. I really just came up with that notion in Da Silva because it was built into the Recommind software.
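The combined train-and-review loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the method built into any vendor's software and not Cormack's patented system. The simple token-weight classifier, the function names, the batch size and the eight sample documents are all hypothetical. The stop rule (quit after a full batch with nothing relevant) is a crude stand-in for "keep training until you cannot find any more relevant documents."

```python
import math
from collections import Counter


def train(labeled):
    """Fit per-token log-odds weights from (tokens, is_relevant) pairs."""
    rel, irr = Counter(), Counter()
    for tokens, is_relevant in labeled:
        (rel if is_relevant else irr).update(set(tokens))
    n_rel = sum(1 for _, r in labeled if r)
    n_irr = len(labeled) - n_rel
    # Add-one smoothing keeps every weight finite.
    return {
        t: math.log((rel[t] + 1) / (n_rel + 2)) - math.log((irr[t] + 1) / (n_irr + 2))
        for t in set(rel) | set(irr)
    }


def score(weights, tokens):
    """Higher score = predicted more likely relevant."""
    return sum(weights.get(t, 0.0) for t in set(tokens))


def continuous_active_learning(docs, review, seed_ids, batch_size=2, max_rounds=50):
    """docs: {doc_id: tokens}. review: callable standing in for the human reviewer."""
    labeled = {d: review(d) for d in seed_ids}
    for _ in range(max_rounds):
        unreviewed = [d for d in docs if d not in labeled]
        if not unreviewed:
            break
        # Retrain on everything coded so far; every reviewed document is training.
        weights = train([(docs[d], r) for d, r in labeled.items()])
        unreviewed.sort(key=lambda d: score(weights, docs[d]), reverse=True)
        found = 0
        for d in unreviewed[:batch_size]:  # review the top-ranked documents next
            labeled[d] = review(d)
            found += labeled[d]
        if found == 0:  # assumed stop rule: a whole batch with nothing relevant
            break
    return labeled


# Hypothetical eight-document collection; documents 1, 3, 5 and 7 are relevant.
docs = {
    1: ["merger", "price", "secret"], 2: ["lunch", "menu"],
    3: ["merger", "deal", "price"], 4: ["golf", "weekend"],
    5: ["secret", "deal", "merger"], 6: ["lunch", "weekend"],
    7: ["price", "deal"], 8: ["menu", "golf"],
}
truth = {1, 3, 5, 7}
labeled = continuous_active_learning(docs, lambda d: d in truth, seed_ids=[1, 2])
found = sorted(d for d, r in labeled.items() if r)
print("relevant documents found:", found)
```

There is no separate training phase and no control set here. The first two coded documents are a seed set only in the sense that they come first, and the ranking is rebuilt after every batch of review.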

In predictive coding 2.0 you do Continuous Training, aka Continuous Active Learning (depends on the perspective, human or machine) or CAL for short. It just makes much more sense to keep training as long as you can, if your software allows you to do that. In addition to Cormack and Grossman, one vendor who quickly developed predictive coding software after Recommind, Catalyst, also did not use control sets. It was then owned by John Tredennick, a one-time trial attorney, tech expert, turned e-discovery entrepreneur. He often wrote then, and continues to do so now, about predictive coding methods. He was one of the first to follow Cormack and Grossman’s many writings on the subject and their description of Continuous Active Learning, or CAL. The Grossman and Cormack team were at the time, and still are, the leading writers on predictive coding from a serious, scientific perspective, and a popular view too. Unlike me, they still travel the country to promote their methods of use (they still don’t offer Cormack’s software for sale or use).

John Tredennick quickly picked up on the Grossman and Cormack terms, specifically CAL, and began promoting his software as Predictive Coding 2.0. He combined, as I had been doing since 2012, the two stages of review into one. It was no longer train, then review, but instead, the training continued throughout the review project. This continuous training improvement was popularized by Maura Grossman and Gordon Cormack, who first called that method continuous active learning. Under the CAL method, which again started to be built into the software by various vendors, the training continued throughout the document review. There was not one stage to train, then another to review the predicted relevant documents. The training and review continued together seamlessly. Here is a diagram that John Tredennick created to explain the then new process.

The main problem with version 2.0 of some vendors’ predictive coding software, not including Tredennick’s or Cormack’s, was that the use of a random secret control set continued. Further, many promoted the use of randomly selected documents as an important part of the CAL process. Most vendors who moved from version one to two, and adopted some kind of continuous learning instead of a two-step process, still used random selection of documents for an initial control set and for part of the ongoing training. I did many experiments to test out the effectiveness of random selection for training (I never believed in control sets), and found some value to random as part of CAL, but only limited. It became just a small part of the multimodal tool set, the “hybrid multimodal” predictive coding 4.0 method now taught in my TAR course.

Predictive Coding 3.0 and the Abandonment of Secret Control Sets

Although I never used or believed in secret control sets, and neither did Cormack, Grossman nor Tredennick, most of the other vendors and writers in the field held onto the use of control sets for many years. Some still use them. The elimination of the use of control sets in CAL is my bright line between Predictive Coding versions two and three. Version Three software does not use control sets.

Although the use of a control set is basic to all scientific research and statistical analysis, it does not work in legal search. Control sets fail in legal search for multiple reasons, including the fact, which seems to elude most black and white minded engineering types, that the SME’s understanding of relevance evolves over time. The SMEs never know the full truth of document responsiveness at the beginning of a project as the two-step control set process assumes. Control set coding decisions, made before the SME is familiar with the collection and the case, may be unreliable. Control sets can also lead to overtraining and overfitting of the document types, resulting in poor recall and precision.

Control sets are a good idea in general, and the basis of most scientific research, but simply do not work in legal search. They were built into the version 1.0 and 2.0 software by engineers and scientists who had little understanding of legal search. The secret control set does not work in real-world legal review projects. In fact, it provides statistical misinformation as to recall. That is primarily because in the real world of legal practice relevance is a continually evolving concept. It is almost never the same at the beginning of a project, when the control set is created, as at the end. The engineers who designed versions 1.0 and 2.0 simply did not understand that. They were not lawyers and did not appreciate the flexibility of relevance. They did not know about concept drift. They did not understand the inherent vagaries and changing nature of the search target in a large document review project. They also did not understand how human SMEs are, how they often disagree with themselves on the classification of the same document even without concept drift. As mentioned, they were also blinded by their own arrogance, tinged with antipathy against lawyers.

They did understand statistics. I am not saying their math was wrong. But they did not understand evidence, did not understand relevance, did not understand relevance drift (or, as I prefer to call it, relevance evolution), and did not understand efficient legal practice. Many I have talked to did not have any real understanding of how lawyers worked at all, much less document review. Most were scientists or statisticians. They meant well, but they did harm nonetheless. These scientists did not have any legal training. If there were any experienced lawyers on the version 1.0 and 2.0 software development teams, they were not heard, or had never really practiced law. (As a customer, I know I was brushed off.) Things have gotten much better in this regard since 2011, but still, many vendors have not gotten the message. They still manufacture version 1.0 and 2.0 type predictive coding software.

The theory behind the proponents’ use of secret control sets is that the initial relevance coding of these documents is correct, immutable and complete; that it should be used to objectively judge the rest of the coding in the project. In practice many documents determined to be relevant or irrelevant at the beginning of a project may be considered the reverse by the end. The target shifts. The understanding of relevance evolves. That is not because of bad luck or a weak SME (a subject discussed later in the Course), but because of the natural progression of the understanding of the probative value of various types of documents over the course of a review.

The ground truth at the beginning of a search project is quicksand. The understanding of relevance almost always evolves as the search progresses. The main problem with the use of the control set in legal search is that the SMEs never know the full truth of document responsiveness at the beginning of a project. This is something that evolves over time in all but the simplest projects. The understanding of relevance changes over time; it changes as particular documents are examined. The control set fails and creates false results because “the human-selected ground truth set and used as a benchmark for further statistical measurements” is never correct, especially at the beginning of a large review project. Only at the end of a project are we in a position to determine a “ground truth” and “benchmark” for statistical measurements.

This problem was recognized not only by Professor Cormack, but other experts who became involved in legal search. One of the most important for me was information retrieval expert, William Webber, PhD. William has been kind enough to help me through technical issues involving sampling many times. William gained experience with legal search by working with his professor, Douglas Oard, PhD. Professor Oard in turn had learned about the unique issues in legal search through working with an attorney with significant expertise in legal search, Jason R. Baron, who also taught me a great deal about the subject. Anyway, here is how Dr. Webber puts it in his blog Confidence intervals on recall and eRecall:

“Using the control set for the final estimate is also open to the objection that the control set coding decisions, having been made before the subject-matter expert (SME) was familiar with the collection and the case, may be unreliable.”

Dr. William Webber

Jeremy Pickens, Ph.D., who was Catalyst’s in-house information scientist, agrees with this assessment of control sets. See Pickens, An Exploratory Analysis of Control Sets for Measuring E-Discovery Progress, DESI VI 2015, where he reports on his investigation of the effectiveness of control sets to measure recall and precision. Jeremy used the Grossman and Cormack TAR Evaluation Toolkit for his data and gold standards. Here is his conclusion:

A popular approach in measuring e-discovery progress involves the creation of a control set, holding out randomly selected documents from training and using the quality of the classification on that set as an indication of progress on or quality of the whole. In this paper we do an exploratory data analysis of this approach and visually examine the strength of this correlation. We found that the maximum-F1 control set approach does not necessarily always correlate well with overall task progress, calling into question the use of such approaches. Larger control sets performed better, but the human judgment effort to create these sets have a significant impact on the total cost of the process as a whole.

Jeremy Pickens

In legal search the target is almost always moving and small. Also, the data itself can often change as new documents are added to the collection. In other areas of information retrieval, the target is solid granite, simple Newtonian, and big, or at least bigger than just a few percent. In other words, the prevalence of targets is higher. Outside of legal search it may make sense to talk of an immutable ground truth. But with legal search it does not. In legal search the ground truth of relevance is discovered. It emerges as part of the process, often including surprise court rulings and amended causes of action. It is a moving target. With legal search the truth is rare, the truth is relative.

The control set based procedures of versions one and two of predictive coding software were over-complicated and inherently defective. They were based on an illusion of certainty, an illusion of a ground truth benchmark magically found at the beginning of a project before document review even began. There were supposedly SME wizards capable of such prodigious feats. I have been an SME in many, many topics of legal relevance since I started practicing law in 1980. I can assure you that SMEs are human, all too human. There is no magic wizard behind the curtain.

Moreover, the understanding of any good SME naturally evolves over time as previously unknown, unseen documents are unearthed and analyzed. Legal understanding is not static. The theory of a case is not static. Experienced trial lawyers know this. The case you start out with is never the one you end up with. You never really know if Schrodinger’s cat is alive or dead. You get used to that after a while. Certainty comes from the final rulings of the last court of appeals.

The use of magical control sets doomed many a predictive coding project to failure. Project team leaders thought they had high recall, because the secret control set said they did, yet they still missed key documents. They still had poor recall and poor precision, or at least far less than their control set analysis led them to believe. See: Webber, The bias of sequential testing in predictive coding, June 25, 2013, (“a control sample used to guide the producing party’s process cannot also be used to provide a statistically valid estimate of that process’s result.”) I still hear stories from reviewers where they find precision of less than 50% using Predictive Coding 1.0 and 2.0 methods, sometimes far less. Our goal is to use predictive coding 4.0 methods to increase precision to the 80% or higher level. This allows for the reduction of cost without sacrifice of recall.

Here is my argument against control sets and rant against vendors still using this method in a video made in 2017.

The use of control sets, a practice that still continues today for some, despite my best efforts to stop it, has created many problems. Attorneys who worked with predictive coding software versions 1.0 or 2.0 have seen their projects overtly crash and burn, as when missed smoking gun documents later turn up, or where reviewers see embarrassingly low precision. Many lawyers were suspicious of the results at first. Even if not suspicious, they were discouraged by the complexity and arcane control set process from ever trying predictive coding again. As attorney and search expert J. William (Bill) Speros likes to say, they could smell the junk science in the air. They were right. I do not blame them for rejecting predictive coding 1.0 and 2.0. I did too, eventually. But unlike many, I followed the Hacker Way and created my own method, called version 3.0, and then in late 2016, version 4.0. We will explain the changes made from version 3.0 to 4.0 later in the course.

As legal understanding is not static, experienced trial lawyers know that the case they start out with is not the one they end up with. Many predictive coding projects fail due to the use of magical control sets, resulting in low recall and precision. Webber notes that using the same control sample to guide the process and evaluate results is not statistically valid. Despite this, some attorneys may not have seen their projects crash and burn, yet still reject predictive coding 1.0 and 2.0 due to complexity and the smell of ‘junk science’. To ensure better precision and cost reduction, predictive coding 4.0 methods were developed in 2016.

The control set fiction put an unnecessarily heavy burden upon SMEs. They were supposed to review thousands of random documents at the beginning of a project, sometimes tens of thousands, and successfully classify them, not only for relevance, but sometimes also for a host of sub-issues. Some gamely tried, and went along with the pretense of omnipotence. After all, the documents in the control set were kept secret, so no one would ever know if any particular document they coded was correct or not. But most SMEs simply refused to spend days and days coding random documents. They refused to play the pretend wizard game.

Every day that vendors keep phony control set procedures in place is another day that lawyers are misled on recall calculations based on them; another day lawyers are frustrated by wasting their time on overly large random samples; another day everyone has a false sense of protection from the very few unethical lawyers out there, and incompetent lawyers; and another day clients pay too much for document review. I continue to call upon all vendors to stop using control sets and phase them out of their software.

The best SMEs correctly intuited that they had better things to do with their time, plus many clients did not want to spend over $1,000 per hour to have their senior trial lawyers reading random emails in a control set, most of which would be irrelevant. I have heard many complaints from lawyers that predictive coding is too complicated and did not work for them. These complaints were justified. The control set and two-step review process were the culprits, not the active machine learning process. The control set has done great harm to the legal profession. As one of the few writers in e-discovery free from vendor influence, much less control, I have been free to blow the whistle, to put an end to the vendor hype. No more secret control sets. Let us simplify and get real. Lawyers who have tried predictive coding before and given up, come back and try Predictive Coding 4.0.

Another reason control sets fail in legal search is, as mentioned, the very low prevalence typical of the ESI collections searched. We only see high prevalence when the document collection is keyword filtered. The prevalence of original collections is always low, usually less than 5%, and often less than 1%. About the highest prevalence collection ever searched in open testing of competing search methods was the Oracle collection in the EDI search experiment. There, although not announced, it was obvious to most participants that it had been heavily filtered by a variety of methods. That is not a best practice because the filtering often removes the relevant documents from the collection, making it impossible for predictive coding to ever find them. More on that later in the course. Grossman and Cormack have also written extensively on the issue of keyword filtering.

The control set approach also rarely works well in legal search because of the small size of the random sample usually taken by most vendors. The sample is rarely large enough to include a representative document from each type of relevant document in the corpus, much less the outliers. It is not even close. For that reason, even if the relevance benchmark were not always evolving during the review (concept drift), the control set would still fail because it is incomplete. The result is likely to be overtraining on the document types that happened to hit in the control set, which is exactly what the control set is supposed to prevent. This kind of overfitting can and does happen for a variety of reasons, not just due to improper training based on an incomplete control set. That is an additional problem separate and apart from relevance shift.

Not only that, many types of relevant documents are never included in the control set because they did not happen to be in the random sample. The natural rarity of relevant evidence in unfiltered document collections, aka low prevalence, makes the control set useless and worse, leads to over training in the CAL process of the few relevant documents caught in the random control set.
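A little arithmetic shows why a random control set so often misses whole types of relevant documents when prevalence is low. The sketch below is only a rough illustration. The sample size, the 1% prevalence and the 0.05% share for a rare document type are hypothetical numbers, and the calculation assumes simple random sampling with independent draws from a large collection.

```python
def expected_relevant(sample_size, prevalence):
    """Expected number of relevant documents in a simple random sample."""
    return sample_size * prevalence


def prob_type_missed(sample_size, type_share):
    """Chance that the sample contains no document of a given type.

    Assumes independent draws from a large collection, so each draw
    misses the type with probability (1 - type_share).
    """
    return (1.0 - type_share) ** sample_size


sample_size = 1500        # hypothetical control set size
prevalence = 0.01         # hypothetical: 1% of the collection is relevant
rare_type_share = 0.0005  # hypothetical: one relevant type is 0.05% of the collection

expected = expected_relevant(sample_size, prevalence)
missed = prob_type_missed(sample_size, rare_type_share)
print(f"expected relevant documents in the control set: {expected:.0f}")
print(f"chance the rare type is missing entirely: {missed:.0%}")
```

Under these assumed numbers the control set would hold only about fifteen relevant documents, and there is close to an even chance that the rare type does not appear in it at all.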

These are problems mitigated, if not solved, by the hybrid multimodal search aspects of predictive coding in version 4.0 that I teach.

Lawyers still using version 2.0 methods should be wary about the use of control sets to certify the completeness of a production. Separate samples should be preferred for making final assessments of production completeness. Yes, there is still a place for random selection of documents, but only for quality control sampling, not for initial training. As is explained in my Predictive Coding course, a separate random sample should be used at the beginning and end of a project, but only to assess production completeness, not as a first-guess control set. Random sampling is critical for project evaluation, but it is a terribly inefficient guide for training. In version 3.0, random sampling remained, but was used in an entirely different way. It was the third step in the eight-step  process of both my versions 3.0 and 4.0, but the use of a secret set of random documents, the control set, was eliminated. That represents the dividing line for me between versions two and three.

Predictive Coding 4.0 – Still State of the Art in 2023

The next method of Predictive Coding, version 4.0, builds on the prior methods. It still combines the two stages into one, the Continuous Training technique, and still does not use secret control sets, but now it also uses a variety of other methods in addition to predictive coding. This is what I call a multimodal method and a man-machine hybrid approach. There are several flavors of this available today, but this course will explain the techniques I have developed and tested since 2012. I further refined these techniques in 2016 to include a new variation of CAL.

We do not claim any patents or other intellectual property rights to Predictive Coding 4.0, aside from copyrights to my writings, and certain trade secrets. But Gordon Cormack and Maura Grossman, who are both now professors (and also now married), do claim patent rights to their methods. The methods are apparently embodied in software somewhere, even though the software is not sold. In fact, we have never seen it, nor, as far as I know, has anyone else, except their students. Their patents are all entitled Full-Text Systems and methods for classifying electronic information using advanced active learning technique: December 31, 2013, 8,620,842, Cormack; April 29, 2014, 8,713,023, Grossman and Cormack; and, September 16, 2014, 8,838,606, Grossman and Cormack.

The Grossman and Cormack patents and patent applications are interesting for a number of reasons. For instance, they all contain the following paragraph in the Background section explaining why their invention is needed. As you can see, it criticizes all of the existing version 1.0 software on the market at the time of their applications (2013) (emphasis added):

Generally, these e-discovery tools require significant setup and maintenance by their respective vendors, as well as large infrastructure and interconnection across many different computer systems in different locations. Additionally, they have a relatively high learning curve with complex interfaces, and rely on multi-phased approaches to active learning. The operational complexity of these tools inhibits their acceptance in legal matters, as it is difficult to demonstrate that they have been applied correctly, and that the decisions of how to create the seed set and when to halt training have been appropriate. These issues have prompted adversaries and courts to demand onerous levels of validation, including the disclosure of otherwise non-relevant seed documents and the manual review of large control sets and post-hoc document samples. Moreover, despite their complexity, many such tools either fail to achieve acceptable levels of performance (i.e., with respect to precision and recall) or fail to deliver the performance levels that their vendors claim to achieve, particularly when the set of potentially relevant documents to be found constitutes a small fraction of a large collection.

They then indicate that their invention overcomes these problems and is thus a significant improvement over prior art. In Figure Eleven of their patent (shown below) they describe one such improvement, “an exemplary method 1100 for eliminating the use of seed sets in an active learning system in accordance with certain embodiments.”


These are basically the same kinds of complaints that I have made here against the older versions of Predictive Coding. I understand the criticisms regarding complex interfaces that rely on multi-phased approaches to active learning. If the software forces the use of control set and seed set nonsense, then it has an overly complex interface. (It is not overly complex if it allows other types of search, such as keyword, similarity or concept, for this degree of complexity is necessary for a multimodal approach.) I also understand their criticism of the multi-phased approaches to active learning, which was fixed in 2.0 by the use of continuous training, instead of train and then review.

The Grossman & Cormack criticism about low prevalence document collections, which are the rule, not the exception, in legal search, is also correct. It is another reason the control set approach cannot work in legal search. The number of relevant documents to be found constitutes a small fraction of a large collection, and so the control set random sample is very unlikely to be representative, much less complete. That is an additional problem separate and apart from relevance shift.

Notice that, unlike the old Recommind patent, the Grossman & Cormack patent diagram has no control set. Much of the rest of the patent, insofar as I am able to understand the arcane patent language used, consists of applications of continuous training techniques that have been tested and explained in their writings, including many additional variables and techniques not mentioned in their articles. See, e.g., Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014. Their patent includes the continuous training methods, of course, but also eliminates the use of seed sets. I assumed this also meant the elimination of the use of control sets, a fact that Professor Cormack later confirmed. Their CAL methods do not use secret control sets. Their software patents are thus like our own 3.0 and 4.0 innovations, although they do not use IST, as will be explained in the course.
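For readers who think in code, the general shape of a continuous training loop with no control set can be sketched in a few lines of Python. This is a toy illustration only. It is not the patented Grossman and Cormack method, nor any vendor's algorithm. The word-count scoring, the stopping rule (quit when a batch turns up nothing relevant) and all names and documents are assumptions made up for the example.

```python
from collections import Counter


def train(reviewed):
    """Toy 'model': +1 per term for each relevant document containing it,
    -1 per term for each irrelevant document containing it."""
    weights = Counter()
    for words, relevant in reviewed:
        for word in set(words):
            weights[word] += 1 if relevant else -1
    return weights


def score(words, weights):
    """Rank score for one document under the toy model."""
    return sum(weights[word] for word in set(words))


def continuous_review(docs, ask_reviewer, first_batch, batch_size=2):
    """Review, retrain and re-rank in one continuous loop.
    docs: dict of doc id -> list of words.
    ask_reviewer: callable standing in for the human coding decision."""
    reviewed = {}
    batch = list(first_batch)
    while batch:
        for doc_id in batch:
            reviewed[doc_id] = ask_reviewer(doc_id)
        # Crude, hypothetical stopping rule: stop when a batch has no hits.
        if not any(reviewed[d] for d in batch):
            break
        weights = train([(docs[d], rel) for d, rel in reviewed.items()])
        unreviewed = [d for d in docs if d not in reviewed]
        unreviewed.sort(key=lambda d: score(docs[d], weights), reverse=True)
        batch = unreviewed[:batch_size]
    return reviewed


# Made-up collection: documents 1, 3 and 4 are the relevant ones.
docs = {
    1: ["merger", "price", "board"],
    2: ["lunch", "menu"],
    3: ["merger", "vote"],
    4: ["price", "board", "vote"],
    5: ["parking", "menu"],
    6: ["lunch", "parking"],
    7: ["menu", "holiday"],
    8: ["holiday", "parking"],
}
truly_relevant = {1, 3, 4}

coded = continuous_review(docs, lambda d: d in truly_relevant, first_batch=[1, 2])
found = {d for d, rel in coded.items() if rel}
print(sorted(found), len(coded))
```

Every coding decision feeds the next ranking, so there is no separate training stage and no hidden set of documents held back to steer it. In this toy run all three relevant documents surface after six of the eight documents are reviewed.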

Recap of the Evolution of Predictive Coding Methods

Version 1.0 type software introduced strong active machine learning algorithms. This early version is still being manufactured and sold by some vendors today. It has a two-step process of train and then review. It also uses secret control sets to guide training. This usually requires an SME to review a certain total of ranked documents as guided by the control set recall calculations. Beware and avoid this software.

Version 2.0 of Predictive Coding eliminated the two-step process, and made the training continuous. For that reason version 2.0 is also called CAL, continuous active learning. It did not, however, reject the random sample step and its control set nonsense. Avoid that type of software too.

Predictive Coding 3.0 was a major change. Control sets are excluded. It built on the continuous training improvements in 2.0, but also eliminated the secret control set with mandatory initial review of a random sample. This and other process improvements in Predictive Coding 3.0 significantly reduced the burden on busy SMEs, and significantly improved the recall estimates. This in turn improved the overall quality of the reviews.

Predictive Coding 4.0 is the latest method and the one taught in my TAR course. It includes some variations in the ideal work flow, and refinements on the continuous active training to facilitate double-loop feedback. We call this Intelligently Spaced Training (IST), instead of CAL. It is an essential part of our Hybrid Multimodal IST method. All of this will be explained in detail in this course. In Predictive Coding 3.0 and 4.0 the secret control set basis of recall calculation is replaced with prevalence based random sample guides, elusion based quality control samples and other QC techniques. These can now be done with contract lawyers and only minimal involvement by SMEs. See Zero Error Numerics. This will all be explained in the Course. The final elusion type recall calculation is done at the end of the project, when final relevance has been determined. See: EI-Recall.


Ralph in 2022, hoping he’s finally won his long battle against secret control sets

The method of predictive coding taught in this predictive coding training course is Predictive Coding 4.0. It includes refinements on the continuous active training, called Intelligently Spaced Training (IST), and our Hybrid Multimodal IST methods. Predictive Coding 4.0 replaces the secret control set basis of recall calculation with prevalence based random sample guides, elusion based quality control samples and other QC techniques. These can be done with contract lawyers and minimal SME involvement. Final elusion type recall calculations are done at the end of the project, when final relevance is determined. Additionally, in version 4.0 of Predictive Coding, sample documents taken at the start for prevalence calculations are known and adjusted as relevance changes.
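The arithmetic behind an elusion type recall estimate can be sketched in Python as follows. This is a simple point estimate under assumed numbers, not the full EI-Recall method referenced in this course, and the function and parameter names are hypothetical.

```python
def elusion_recall(found_relevant, discard_size, sample_size, sample_relevant):
    """Point estimate of recall from a random sample of the discard pile.

    found_relevant:  relevant documents already located by the review
    discard_size:    documents left unreviewed and presumed irrelevant
    sample_size:     documents randomly sampled from the discard pile
    sample_relevant: sampled documents that turned out to be relevant
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    elusion_rate = sample_relevant / sample_size
    projected_missed = elusion_rate * discard_size
    return found_relevant / (found_relevant + projected_missed)


# Assumed, illustrative project numbers.
recall = elusion_recall(
    found_relevant=9000,
    discard_size=90000,
    sample_size=1500,
    sample_relevant=15,
)
print(round(recall, 3))
```

With these made-up numbers the sample suggests an elusion rate of 1%, or about 900 relevant documents projected to remain in the discard pile, which gives an estimated recall of roughly 91%. A real calculation should also account for the sampling margin of error, which this sketch leaves out.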

A secret control set is not a part of the Predictive Coding 4.0 method. We still have random selection reviews for prevalence and quality control purposes – Steps Three and Seven – but the documents are not secret and they are used for training. Moreover, 4.0 eliminates first round training seed sets, random based or otherwise. The first time the machine training begins is simply the first round. Sometimes the first training set of documents is large in number and type, sometimes it is not. There are no rules for the first training set. These topics will be further addressed in this course.

Go on to Class Two.

Or pause to do this suggested “homework” assignment for further study and analysis.

SUPPLEMENTAL READING: Review all of the patents cited, especially the Grossman and Cormack patents (almost identical language was used in both, as you will see, so you only need to look at one). Just read the sections in the patents that are understandable and skip the arcane jargon. I also suggest you read all of the articles cited in this course. That is a standing homework assignment in all classes. Again, some of it may be too technical. Just skip through or skim those sections. Also see if you can access Losey’s LTN editorial, Vendor CEOs: Stop Being Empty Suits & Embrace the Hacker Way.

EXERCISES: What kind of AI is Predictive Coding and how does it differ from predictive word generation type AI and programs like ChatGPT?  

Students are invited to leave a public comment below. Insights that might help other students are especially welcome. Let’s collaborate!


Ralph Losey COPYRIGHT 2018, 2023


15 Responses to TAR Course: 1st Class

  1. […] This is another new video for the e-Discovery Team’s TAR Course. It is included in the new First Class that we just added to the […]

  2. Paul Park says:

    On Control Sets. So the analogy that I would use is the Pentium bug. The reason it didn’t get caught was a recurrent view on a random sample that could not hit the hole. The manifold (3D representation of the vector space) had multiple holes in it and the control set methodology was not able to see it. I think document review has the opportunity to build a first or second minima technique to identify more of the areas but given the model is still pending.

    • Ralph Losey says:

      Thank you for your interesting comment.

      The problem of missing key documents in a random sample is a common one. That is one of the problems with using a control set to judge recall. We address this issue of completeness and finding the black swans, by using a variety of different search methods throughout the project, not just vector space type document ranking. That is one reason we use a MULTIMODAL METHOD, to make it less likely to miss the rare document. Moreover, we never use mere chance as a method of locating relevant ESI. The folly of that method is well known and has been discussed on this blog many times over the years. As far as I know, no one uses that method any more. Now the primary enemy to progress is the use of the control set.

  3. […] three videos in this blog on the Hacker Way are also included in the First Class of the TAR […]

  4. […] The two videos in this blog on the Hacker Way are also included in the Welcome page of the TAR Course. Other minor improvements were made this week to the Welcome and the First Class. […]

  5. […] At this point in my career, I am an e-discovery specialist. I do not attempt to stay current in other substantive areas of the law. It is hard enough to stay current with e-discovery, both case law and new technology. Assuming a project has good communications (and I help out with that), there is no reason for me to know much more than the basics about a case. Also, as discussed in the Ninth Class of the TARcourse.com, we have multiple built-in safeguards for quality control. They catch and help correct mistakes and inconsistencies in relevance judgment. Such mistakes are inevitably in any complex project. The understanding of relevance naturally evolves as more ESI is reviewed. That is the main reason the first methods of predictive coding often worked poorly. They used large, random secret control sets that incorrectly assumed that relevance was fixed. We have fixed and stopped using control sets long ago. TARcourse.com – First Class: Background and History of Predictive Coding. […]

  6. […] TAR Course has a new class, the Seventeenth Class: Another “Player’s View” of the Workflow. […]

  7. […] to go Hybrid too. Be sure to use the most powerful search tool of all,  predictive coding. See TAR Course for detailed instruction on Hybrid Multimodal. The robots will eat your keywords for […]

  8. […] have shared how I use predictive coding with continuous training in my TARcourse.com online instruction program. The eight-step workflow is shown […]

  9. […] The six-step approach described here uses the costs incurred at the front end of the project to predict the total expense. The costs are controlled by use of best practices, such as contract review lawyers, but primarily by limiting the number of documents reviewed. Although it is somewhat easier to follow this approach using predictive coding and document ranking, it can still be done without that search feature. You can try this approach using any review software. It works well in small or medium sized projects with fairly simple issues. For large complex projects we still recommend using the eight-step predictive coding approach as taught in the TarCourse.com. […]

  10. […] Testing and refining keywords is legal work because relevance is determined by legal analysis, not computer nor technical analysis. IT is notorious for sometimes exceeding their bounds and thinking they know best. When it comes to the Law, to the requirements of an adequate search, to relevance, client IT is out of their depth. That is your role as lawyer in a well functioning e-Discovery Team. Not sure, take a refresher of the TAR Course. […]

  11. […] Hybrid Multimodal Predictive Coding 4.0. The basic search method is explained in the open-sourced TAR Course, but the Course does not detail how the method can be used in this kind of […]

  12. Timothy Plamondon says:

    TTR = Train, Test, Rank.

  13. […] is especially true with legal technology. This blog post also mirrors an updated version of the first class of my 2017 TAR Training Course. I am in the process of refreshing all classes. The other seventeen classes in the training program […]

  14. […] it summarizes. (Note: I have a grandchild in second grade.) I love it and hope you do too. It might even cause you to read my grown-up version. By the way, the images here are all by OpenAI’s Dall-E […]
