Lawyers’ Job Security in a Near Future World of AI, the Law’s “Reasonable Man Myth” and “Bagley Two” – Part One

January 15, 2017

Does the inevitable triumph of AI robots over human reason and logic mean that the legal profession is doomed? Will Watson be the next generation’s lawyer of choice? I do not think so and have written many articles on why, including two last year: Scientific Proof of Law’s Overreliance On Reason: The “Reasonable Man” is Dead and the Holistic Lawyer is Born; and The Law’s “Reasonable Man,” Judge Haight, Love, Truth, Justice, “Go Fish” and Why the Legal Profession Is Not Doomed to be Replaced by Robots. In the Reasonable Man article I discussed how reasonability is the basis of the law, but that it is not objective. It depends on many subjective factors, on psychology. In the Scientific Proof article I continued the argument:

The Law’s Reasonable Man is a fiction. He or she does not exist. Never has, never will. All humans, including us lawyers, are much more complex than that. We need to recognize this. We need to replace the Law’s reliance on reason alone with a more realistic multidimensional holistic approach.

Scientific Proof Article

To help make my argument in the Scientific Proof article I relied on the analysis of Thomas H. Davenport and Julia Kirby in Only Humans Need Apply: Winners and Losers in the Age of Smart Machines (Harper 2016) and on the scientific work of Dan Ariely, a Professor of Psychology and Behavioral Economics at Duke University.

I cite to Only Humans Need Apply: Winners and Losers in the Age of Smart Machines to support my thesis:

Although most lawyers in the profession do not know it yet, the non-reasoning aspects of the Law are its most important parts. The reasoning aspects of legal work can be augmented. That is certain. So will other aspects, like reading comprehension. But the other aspects of our work, the aspects that require more than mere reason, are what makes the Law a human profession. These job functions will survive the surge of AI.

If you want to remain a winner in future Law, grow these aspects. Only losers will hold fast to reason. Letting go of the grip of the Reasonable Man, by which many lawyers are now strangled, will make you a better lawyer and, at the same time, improve your job security.

Also see Dean Gonsowski, A Clear View or a Short Distance? AI and the Legal Industry; and, Gonsowski, A Changing World: Ralph Losey on “Stepping In” for e-Discovery, (Relativity Blog).

Professor Ariely has found from many experiments that We’re All Predictably Irrational. In my article, Scientific Proof, I point my readers to his many easily accessible video talks on the subject. I consider the implication of Professor Ariely’s research on the law:

Our legal house needs a new and better foundation than reason. We must follow the physicists of a century ago. We must transcend Newtonian causality and embrace the more complex, more profound truth that science has revealed. The Reasonable Man is a myth that has outlived its usefulness. We need to accept the evidence, and move on. We need to develop new theories and propositions of law that conform to the new facts at hand. Reason is just one part of who we are. There is much more to us than that: emotion, empathy, creativity, aesthetics, intuition, love, strength, courage, imagination, determination – to name just a few of our many qualities. These things are what make us uniquely human; they are what separate us from AI. Logic and reason may end up being the least of our abilities, although they are still qualities that I personally cherish. …

Since human reason is now known to be so unreliable, and is only a contributing factor to our decisions, on what should we base our legal jurisprudence? I believe that the Reasonable Man, now that he is known to be an impossible dream, should be replaced by the Whole Man. Our jurisprudence should be based on the reality that we are not robots, not mere thinking machines. We have many other faculties and capabilities beyond just logic and reason. We are more than math. We are living beings. Reason is just one of our many abilities.

So I propose a new, holistic model for the law. It would still include reason, but add our other faculties. It would incorporate our total self, all human attributes. We would include more than logic and reason to judge whether behavior is acceptable or not, to consider whether a resolution of a dispute is fair or not. Equity would regain equal importance.

A new schemata for a holistic jurisprudence would thus include not just human logic, but also human emotions, our feelings of fairness, our intuitions of what is right and just, and multiple environmental and perceptual factors. I suggest a new model start simple and use a four-fold structure like this, and please note I keep Reason on top, as I still strongly believe in its importance to the Law.


My Scientific Proof article included a call to action, the response to which has been positive:

The legal profession needs to take action now to reduce our over-reliance on the Myth of the Reasonable Man. We should put the foundations of our legal system on something else, something more solid, more real than that. We need to put our house in order before it starts collapsing around us. That is the reasonable thing to do, but for that very reason we will not start to do it until we have better motivation than that. You cannot get people to act on reason alone, even lawyers. So let us engage the other more powerful motivators, including the emotions of fear and greed. For if we do not evolve our work to focus on far more than reason, then we will surely be replaced.


AI can think better and faster, and ultimately at a far lower cost. But can AI reassure a client? Can it tell what a client really wants and needs? Can AI think out of the box to come up with new, creative solutions? Can AI sense what is fair? Beyond application of the rules, can it attain the wisdom of justice? Does it know when rules should be bent and how far? Does it know, like any experienced judge knows, when rules should be broken entirely to attain a just result? Doubtful.

I went on to make some specific suggestions, just to start the dialogue, and then closed with the following:

We must move away from over-reliance on reason alone. Our enlightened self-interest in continued employment in the rapidly advancing world of AI demands this. So too does our quest to improve our system of justice, to keep it current with the rapid changes in society.

Where we must still rely on reason, we should at the same time realize its limitations. We should look for new technology based methods to impose more checks and balances on reason than we already have. We should create new systems that will detect and correct the inevitable errors in reason that all humans make – lawyers, judges and witnesses alike. Bias and prejudice must be overcome in all areas of life, but especially in the justice system.

Computers, especially AI, should be able to help with this and also make the whole process more efficient. We need to start focusing on this, to make it a priority. It demands more than talk and thinking. It demands action. We cannot just think our way out of a prison of thought. We need to use all of our faculties, especially our imagination, creativity, intuition, empathy and good faith.

Reasonable Man Article

To help make my argument in the earlier blog, The Law’s “Reasonable Man,” Judge Haight, Love, Truth, Justice, “Go Fish” and Why the Legal Profession Is Not Doomed to be Replaced by Robots, I quoted extensively from an Order Denying Defendant’s Motion for Protective Order. The order arose out of a routine employment discrimination case. Bagley v. Yale, Civil Action No. 3:13-CV-1890 (CSH) (Doc. 108) (order dated April 27, 2015). The Order examined the “reasonability” of ESI accessibility under Rule 26(b)(2)(B) and the “reasonable” efforts requirements under Rule 26(b). I used language of that Bagley Order to help support my argument that there is far more to The Law than mere reason and logic. I also argued that this is a very good thing, for otherwise lawyers could easily be replaced by robots.

Another e-discovery order was entered in Bagley on December 22, 2016. Ruling On Plaintiff’s Motion To Compel. Bagley v. Yale, Civil Action No. 3:13-CV-1890 (CSH). Bagley Two again provokes me to write on this key topic. This second order, like the first, was written by Senior District Judge Charles S. Haight, Jr. The eighty-six-year-old Judge Haight is becoming one of my favorite legal scholars because of his excellent analysis and his witty, fairly transparent writing style. This double Yale graduate has a way with words, especially when issuing rulings adverse to his alma mater. He is also one of the few judges of whom I have been unable to locate an online photo, so use your imagination, which, by the way, is another powerful tool that separates us from AI-juiced robots.

I pointed out in the Reasonable Man article, and it bears repetition, that I am no enemy of reason and rationality. It is a powerful tool in legal practice, but it is hardly our only tool. It is one of many. The “Reasonable Man” is one of the most important ideas of Law, symbolized by the balance scales, but it is not the only idea. In fact, it is not even the most important idea for the Law. That honor goes to Justice. Lady Justice holding the scales of reason is the symbol of the Law, not the scales alone. She is usually depicted with a blindfold on, symbolizing the impartiality of justice, not dependent on the social status or position of the litigants.

My view is that lawyer reasoning should continue in all future law, but should be augmented by artificial intelligence. With machines helping to rid us of hidden biases in all human reason, and making that part of our evaluation easier and more accurate, we are free to put more emphasis on our other lawyer skills, on the other factors that go into our evaluation of the case. These include our empathy, intuition, emotional intelligence, feelings, humor, perception (including lie detection), imagination, inventiveness and sense of fairness and justice. Reason is only one of many human capacities involved in legal decision making.

In the Reasonable Man article I analyzed the first Bagley Order to help prove that point:

Bagley shows that the dividing line between what is reasonable and thus acceptable efforts, and what is not, can often be difficult to determine. It depends on a careful evaluation of the facts, to be sure, but this evaluation in turn depends on many subjective factors, including whether one side or another was trying to cooperate. These factors include all kinds of prevailing social norms, not just cooperativeness. It also includes personal values, prejudices, education, intelligence, and even how the mind itself works, the hidden psychological influences. They all influence a judge’s evaluation in any particular case as to which side of the acceptable behavior line a particular course of conduct falls.

In close questions the subjectivity inherent in determinations of reasonability is obvious. This is especially true for the attorneys involved, the ones paid to be independent analysts and objective advisors. People can, and often do, disagree on what is reasonable and what is not. They disagree on what is negligent and what is not. On what is acceptable and what is not.

All trial lawyers know that certain tricks of argument and appeals to emotion can have a profound effect on a judge’s resolution of these supposedly reason-based disagreements. They can have an even more profound effect on a jury’s decision. (That is the primary reason that there are so many rules on what can and cannot be said to a jury.)

In spite of practical knowledge by the experienced, the myth continues in our profession that reasonability exists in some sort of objective, platonic plane of ideas, above all subjective influences. The just decision can be reached by deep, impartial reasoning. It is an article of faith in the legal profession, even though experienced trial lawyers and judges know that it is total nonsense, or nearly so. They know full well the importance of psychology and social norms. They know the impact of cognitive biases of all kinds, including, for example, hindsight bias. See Roitblat, The Schlemiel and the Schlimazel and the Psychology of Reasonableness (Jan. 10, 2014, LTN) (link is to republication by a vendor without attribution) (“tendency to see events that have already occurred as being more predictable than they were before they actually took place”); also see Rimkus v Cammarata, 688 F. Supp. 2d 598 (S.D. Tex. 2010) (J. Rosenthal) (“It can be difficult to draw bright-line distinctions between acceptable and unacceptable conduct in preserving information and in conducting discovery, either prospectively or with the benefit (and distortion) of hindsight.” emphasis added); Pension Committee of the University of Montreal Pension Plan, et al. v. Banc of America Securities, LLC, et al., 685 F. Supp. 2d 456 (S.D.N.Y. Jan. 15, 2010 as amended May 28, 2010) at pgs. 463-464 (J. Scheindlin) (“That is a judgment call that must be made by a court reviewing the conduct through the backward lens known as hindsight.” emphasis added).

In my conclusion to the Reasonable Man article I summarized my thoughts and tried to kick off further discussion of this topic:

The myth of objectivity and the “Reasonable Man” in the law should be exposed. Many naive people still put all of their faith in legal rules and the operation of objective, unemotional logic. The system does not really work that way. Outsiders trying to automate the law are misguided. The Law is far more than logic and reason. It is more than the facts, the surrounding circumstances. It is more than evidence. It is about people and by people. It is about emotion and empathy too. It is about fairness and equity. Its prime directive is justice, not reason.

That is the key reason why AI cannot automate law, nor legal decision making. Judge Charles (“Terry”) Haight could be augmented and enhanced by smart machines, by AI, but never replaced. The role of AI in the Law is to improve our reasoning, minimize our schlemiel biases. But the robots will never replace lawyers and judges. In spite of the myth of the Reasonable Man, there is far more to law than reason and facts. I for one am glad about that. If it were otherwise the legal profession would be doomed to be replaced by robots.

Bagley Two

Now let us see how Judge Haight once again helps prove the Reasonable Man points by his opinion in Bagley Two. Ruling On Plaintiff’s Motion To Compel (December 22, 2016), Bagley v. Yale, Civil Action No. 3:13-CV-1890 (CSH). In this opinion the reasonability of defendant Yale’s preservation efforts was considered in the context of a motion to compel discovery. His order again reveals the complexity and inherent subjectivity of all human reason. It shows that there are always multiple factors at work in any judge’s decision beyond just thought and reason, including an instinct born out of long experience for fairness and justice. Once again I will rely primarily on Judge Haight’s own words. I do so because I like the way he writes and because you need to read his original words to appreciate what I am talking about. But first, let me set the stage.

Yale sent written preservation notices to sixty-five different people, which I know from thousands of matters is a very large number of custodians to put on hold in a single-plaintiff discrimination case. But Yale did so in stages, starting on March 1, 2013 and ending on August 7, 2014. Eight different times over this period they kept adding people to their hold list. The notices were sent by Jonathan Clune, a senior associate general counsel of Yale University. The plaintiff argued that they were too late in adding some of the custodians and otherwise attacked the reasonability of Yale’s efforts.

The plaintiff was not yet seeking sanctions for the suspected unreasonable efforts; they were seeking discovery from Yale as to the details of these efforts. Specifically they sought production of: (1) the actual litigation hold notices; (2) the completed document preservation computer survey forms that were required to be returned to the Office of General Counsel by each Litigation Hold Recipient; and, (3) an affidavit detailing the retention and production for all non-ESI documents collected from each of the Litigation Hold Recipients.

Yale opposed this discovery, claiming that any further information as to its preservation efforts was protected from discovery under the attorney-client privilege and attorney work product protection. Yale also argued that even if the privileges did not apply here, the discovery should still be denied because to obtain such information a party must first provide convincing proof that spoliation in fact occurred. Yale asserted that the plaintiff failed to provide sufficient proof, or even any proof, that spoliation had in fact occurred.

Here is the start of Judge Haight’s evaluation of the respective positions:

Mr. Clune’s litigation hold notices stressed that a recipient’s failure to preserve pertinent documents could “lead to legal sanctions” against Yale. Clune was concerned about a possible sanction against Yale for spoliation of evidence. While Clune’s notices did not use the term, “spoliation” is a cardinal litigation vice, known by that name to trial lawyers and judges, perhaps unfamiliar to academics unable to claim either of those distinctions. Clune’s notices made manifest his concern that a trial court might sanction Yale for spoliation of evidence relevant to the University SOM’s decision not to reappoint Bagley to its faculty.

Note the jab at academics. By the way, in my experience his observation is correct about the cluelessness of most law professors when it comes to e-discovery. But why does Judge Haight take the time here to point that out? This case did not involve the Law School. It involved the business school professors and staff (as you would expect). It is important to know that Judge Haight is a double Yale graduate, both undergraduate and law school. He graduated from Yale Law in 1955. He was even a member of Yale’s infamous Skull and Bones society. (What does 322 really mean? Eulogia?) Perhaps there are some underlying emotions here? Judge Haight does seem to enjoy poking Yale, but he may do that in all his cases with Yale out of an eccentric kind of good humor, like a friendly shoulder punch. But I doubt it.

To be continued … 

Document Review and Predictive Coding: Video Talks – Part Four

March 11, 2016

This is the fourth of seven informal video talks on document review and predictive coding. The first video explained why this is important to the future of the Law. The second talked about ESI Communications. The third about Multimodal Search Review. This video talks about the third step of the e-Discovery Team’s eight-step work flow, shown above, Random Baseline Sample.

Although this text intro is overly long, the video itself is short, under eight minutes, as there is really not that much to this step. You simply take a random sample at or near the beginning of the project. Again, this step can be used in any document review project, not just ones with predictive coding. You do this to get some sense of the prevalence of relevant documents in the data collection. That just means the sample will give you an idea as to the total number of relevant documents. You do not take the sample to set up a secret control set, a practice that has been thoroughly discredited by our Team and others. See Predictive Coding 3.0.

If you understand sampling statistics you know that sampling like this produces a range, not an exact number. If your sample size is small, then the range will be very wide. If you want to cut that range, which is known in statistics as a confidence interval, in half, you have to quadruple your sample size. This is a general rule of thumb that I explained in tedious mathematical detail several years ago in Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022. Our Team likes to use a fairly large sample size of about 1,533 documents that creates a confidence interval of plus or minus 2.5%, subject to a confidence level of 95% (meaning the true value will lie within that range 95 times out of 100). More information on sample size is summarized in the graph below. Id.
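The rule of thumb above can be checked with a few lines of arithmetic. What follows is only a minimal sketch, not the Team's own calculator: it assumes the simple normal approximation, the worst-case prevalence of 50%, and no correction for collection size. The function name `margin_of_error` is made up for this illustration.

```python
import math

Z_95 = 1.96  # z-score that corresponds to a 95% confidence level


def margin_of_error(sample_size, prevalence=0.5, z=Z_95):
    """Half-width of the confidence interval (normal approximation).

    Assumption for this sketch: worst-case prevalence of 50% unless
    another value is passed in, and no finite-population correction.
    """
    return z * math.sqrt(prevalence * (1 - prevalence) / sample_size)


# Quadrupling the sample size cuts the interval in half each time.
for n in (383, 1533, 6132):
    print(f"sample of {n:>5}: plus or minus {margin_of_error(n) * 100:.2f}%")
```

With these assumptions a sample of 1,533 comes out at roughly plus or minus 2.5%, and four times that sample at roughly plus or minus 1.25%. The cost of precision grows quickly, which is why the sample size is a judgment call and not a formula.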


The picture below this paragraph illustrates a data cloud where the yellow dots are the sampled documents from the grey dot total, and the hard to see red dots are the relevant documents found in that sample. Although this illustration is from a real project we had, it shows a dataset that is unusual in legal search because the prevalence here was high, between 22.5% and 27.5%. In most data collections searched in the law today, where the custodian data has not been filtered by keywords, the prevalence is far less than that, typically less than 5%, maybe even less than 0.5%. The low prevalence increases the range size, the uncertainties, and requires a binomial calculation adjustment to determine the statistically valid confidence interval, and thus the true document range.


For example, in a typical legal project with a few percent prevalence range, it would be common to see a range between 20,000 and 60,000 relevant documents in a 1,000,000 collection. Still, even with this very large range, we find it useful to at least have some idea of the number of documents we are looking for. That is what the Baseline Step can provide to you, nothing more nor less.
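To give a feel for what a binomial adjustment does at low prevalence, here is a hedged sketch using the Wilson score interval. That is only one of several binomial-based intervals, and it is my assumption for illustration; it is not necessarily the calculator our Team or any vendor uses. The sample counts are hypothetical and are not taken from any real project.

```python
import math


def wilson_interval(hits, sample_size, z=1.96):
    """Wilson score interval for a proportion (95% level when z=1.96)."""
    p_hat = hits / sample_size
    denom = 1 + z ** 2 / sample_size
    center = (p_hat + z ** 2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / sample_size + z ** 2 / (4 * sample_size ** 2)
    ) / denom
    return center - half, center + half


# Hypothetical baseline sample: 31 relevant documents found in 1,533 sampled.
collection_size = 1_000_000
hits, sample_size = 31, 1533

low, high = wilson_interval(hits, sample_size)
print(f"point projection: {hits / sample_size * collection_size:,.0f} documents")
print(f"range: {low * collection_size:,.0f} to {high * collection_size:,.0f} documents")
```

Note that the range is not symmetric around the point projection. At low prevalence the upper end sits farther from the projection than the lower end, which a simple plus or minus figure hides.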

If you are unsure of how to do sampling for prevalence estimates, your vendor can probably help. Just do not let them tell you that it is one exact number. That is simply a point projection near the middle of a range. The one number point projection is just the top of the typical probability bell curve shown above, which illustrates a 95% confidence level distribution. The top is just one possibility, albeit slightly more likely than either end point. The true value could be anywhere in the blue range.

To repeat, the Step Three prevalence baseline number is always a range, never just one number. Going back to the relatively high prevalence example, the below bell curve shows a point projection of 25% prevalence, with a range of 22.5% to 27.5%, creating a range of relevant documents of between 225,000 and 275,000. This is shown below.


The important point that many vendors and other “experts” often forget to mention is that you can never know exactly where within that range the true value may lie. Plus, there is always a small possibility, 5% when using a sample size based on a 95% confidence level, that the true value may fall outside of that range. It may, for example, only have 200,000 relevant documents. This means that even with a high prevalence project with datasets that approach the Normal Distribution of 50% (here meaning half of the documents are relevant), you can never know that there are exactly 250,000 documents, just because it is the mid-point or point projection. You can only know that there are between 225,000 and 275,000 relevant documents, and even that range may be wrong 5% of the time. Those uncertainties are inherent limitations to random sampling.
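The claim that the range itself will miss the true value some of the time can be tested by simulation. The following is a rough sketch under stated assumptions: a true prevalence of 25%, samples of 1,533, and an interval computed from each sample with the normal approximation. The function names, the number of trials and the seed are all arbitrary choices for this illustration.

```python
import math
import random


def interval_covers_truth(true_prevalence, sample_size, rng, z=1.96):
    """Draw one random sample and report whether its interval holds the truth."""
    hits = sum(rng.random() < true_prevalence for _ in range(sample_size))
    p_hat = hits / sample_size
    half = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat - half <= true_prevalence <= p_hat + half


def coverage_rate(true_prevalence=0.25, sample_size=1533, trials=2000, seed=42):
    """Fraction of simulated samples whose interval contains the true prevalence."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    covered = sum(
        interval_covers_truth(true_prevalence, sample_size, rng)
        for _ in range(trials)
    )
    return covered / trials


print(f"intervals that contained the true prevalence: {coverage_rate():.1%}")
```

The rate should land near 95%, not at 100%. In other words, about one baseline sample in twenty will produce a range that does not contain the true number at all, and nothing in that one sample tells you it is the unlucky one.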

Shame on the vendors who still perpetuate that myth of certainty. Lawyers can handle the truth. We are used to dealing with uncertainties. All trial lawyers talk in terms of probable results at trial, and risks of loss, and often calculate a case’s settlement value based on such risk estimates. Do not insult our intelligence by a simplification of statistics that is plain wrong. Reliance on such erroneous point projections alone can lead to incorrect estimates as to the level of recall that we have attained in a project. We do not need to know the math, but we do need to know the truth.

The short video that follows will briefly explain the Random Baseline step, but does not go into the technical details of the math or statistics, such as the use of the binomial calculator for low prevalence. I have previously written extensively on this subject. See for instance:

Byte and Switch

If you prefer to learn stuff like this by watching cute animated robots, then you might like: Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search. But be careful, their view is version 1.0 as to control sets.

Thanks again to William Webber and other scientists in this field who helped me out over the years to understand the Bayesian nature of statistics (and reality).

For details on all eight steps, including this third step, see Predictive Coding 3.0. More information on document review and predictive coding can be found in the fifty-six articles published here.




Predictive Coding 3.0

October 11, 2015


Before describing the new version of Predictive Coding methodology shown in the chart animation, version 3.0, this blog will review and describe the prior versions predominantly used in the e-discovery world, including the main patents involved. The more recent U.S. patents of Maura Grossman and Gordon Cormack will also be reviewed. Their work seems fairly close to Predictive Coding 3.0, although we have no affiliation whatsoever, except for the fact that I am one of the many admirers of their research and writings.

Overview of the Three Generations of Predictive Coding Software

First generation Predictive Coding, version 1.0, entered the market in 2009. It used active machine learning with a methodology requirement, built into the software, that you begin the review with an SME coding a random selection of several thousand documents. The random documents included a secret set of documents not identified to the user, any user, called a control set. The secret control set supposedly allowed you to objectively monitor your progress in Recall and Precision of the relevant documents from the total set. It also supposedly prevented lawyers from gaming the system. Version 1.0 software also had two distinct stages, one for training and another for review. The next generation of version 2.0 methodology combined the two stages into one, where training continued throughout the review. The method of Predictive Coding 3.0 again combines the two stages into one, but also eliminates the secret control set. Random sampling itself remains; it is the third step in the eight-step version 3.0 process. But the secret set of random documents, the control set, is eliminated.

Although the use of a control set is basic to all scientific research and statistical analysis, it does not work in legal search. The EDRM, which apparently still promotes the use of a methodology with control sets, explains that the control set:

… is a random sample of documents drawn from the entire collection of documents, usually prior to starting Assisted Review training rounds. … The control set is coded by domain experts for responsiveness and key issues. … [T]he coded control set is now considered the human-selected ground truth set and used as a benchmark for further statistical measurements we may want to calculate later in the project. As a result, there is only one active control set in Assisted Review for any given project. … [C]ontrol set documents are never provided to the analytics engine as example documents. Because of this approach, we are able to see how the analytics engine categorizes the control set documents based on its learning, and calculate how well the engine is performing at the end of a particular round. The control set, regardless of size or type, will always be evaluated at the end of every round—a pop quiz for Assisted Review. This gives the Assisted Review team a great deal of flexibility in training the engine, while still using statistics to report on the efficacy of the Assisted Review process.

Control Sets: Introducing Precision, Recall, and F1 into Relativity Assisted Review (a kCura white paper adopted by EDRM).

The original white paper written by David Grossman, entitled Measuring and Validating the Effectiveness of Relativity Assisted Review, is cited by EDRM as support for their position on the validity and necessity of control sets. In fact, the paper does not support this proposition. The author of this Relativity White Paper, David Grossman, is a Ph.D. now serving as the associate director of the Georgetown Information Retrieval Laboratory, a faculty affiliate at Georgetown University, and an adjunct professor at IIT in Chicago. He is a leading expert in text retrieval and has no connections with Relativity except to write this one small paper. I spoke with David Grossman on October 30, 2015. He confirmed that the validity, or not, of control sets in legal search was not the subject of his investigation. His paper does not address this issue. In fact, he has no opinion of the validity of control sets in the context of legal search.

David’s one study for kCura was limited to the narrow questions of: (1) whether statistical sampling creates representative samples, and (2) whether the retrieval of relevant documents improved during two rounds of predictive coding type training. The first question was very basic and the answer was, of course, yes, sampling works. The issue of control sets was not considered. Even though control sets were mentioned, it was never his intent to measure their effectiveness per se.

The second issue was also very basic, and his answer again was, of course, yes, training works. Still, he carefully qualified that answer and concluded only that he observed “improved effectiveness with almost each new round that was tried in our testing.” Measuring and Validating the Effectiveness of Relativity Assisted Review at pg 5. In my conversations with David he also confirmed that he did not design any of the Relativity software nor any of its methods. He was also unaware of the controversies in legal search, including the effectiveness of using control sets, and my view that the “ground truth” at the beginning of a search project was more like quicksand. Although David Grossman has never done a legal search project, he has done many other types of real-world searches. He volunteered that he has frequently had that same quicksand type of experience where the understanding of relevance evolves as the search progresses.

The problem with the use of the control set in legal search is that the SMEs, what EDRM here refers to as the domain experts, never know the full truth of document responsiveness at the beginning of a project. This is something that evolves over time. The understanding of relevance changes over time, changes as particular documents are examined. The control set fails and creates false results because “the human-selected ground truth set and used as a benchmark for further statistical measurements” is never correct, especially at the beginning of a large review project. Only at the end of a project are we in a position to determine a “ground truth” and “benchmark” for statistical measurements.

This problem was recognized by another information retrieval expert, William Webber, PhD. William does have experience with legal search and has been kind enough to help me through technical issues involving sampling many times on this blog. Here is how Dr. Webber puts it in his blog Confidence intervals on recall and eRecall:

Using the control set for the final estimate is also open to the objection that the control set coding decisions, having been made before the subject-matter expert (SME) was familiar with the collection and the case, may be unreliable.

Having done many reviews myself, and served as the SME on most of them, I am much more emphatic than William and do not couch my opinion with “may be unreliable.” To me there is no question that at least some of the SME control set decisions at the start of a review are almost certainly unreliable.

Another reason control sets fail in legal search is the very low prevalence typical of the ESI collections searched. We only see high prevalence when the document collection was keyword filtered. Prevalence in the original collections is always low, usually less than 5%, and often less than 1%. About the highest prevalence collection I have ever searched was the Oracle collection in the EDI search contest, and it had obviously been heavily filtered by a variety of methods. That is not a best practice because the filtering often removes relevant documents from the collection, making it impossible for predictive coding to ever find them. See eg, William Webber’s analysis of the Biomet case where this kind of keyword filtering was used before predictive coding began. What is the maximum recall in re Biomet?, Evaluating e-Discovery (4/24/13). Webber shows that in Biomet this method first filtered out over 40% of the relevant documents. This doomed the second filter predictive coding review to a maximum possible recall of 60%, even if it was perfect, meaning it would otherwise have attained 100% recall, which (almost) never happens. The Biomet case thus very clearly shows the dangers of over-reliance on keyword filtering.
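The Biomet arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the 60% figure is the ceiling implied by Webber's finding quoted above, and the 80% second-stage recall is an assumed number chosen to show how quickly the end result drops below that ceiling.

```python
def max_end_to_end_recall(filter_recall: float, review_recall: float = 1.0) -> float:
    """Recall of a two-stage process: a keyword filter followed by predictive coding.

    filter_recall: share of relevant documents that survive the keyword filter.
    review_recall: recall of the second stage on what survived (1.0 means perfect).
    """
    for value in (filter_recall, review_recall):
        if not 0.0 <= value <= 1.0:
            raise ValueError("recall values must be between 0 and 1")
    # Documents removed by the filter can never be found later,
    # so the two recall figures multiply.
    return filter_recall * review_recall


# Biomet per Webber: over 40% of relevant documents filtered out, so at most 60% survive.
ceiling = max_end_to_end_recall(0.60)
# Hypothetical second stage that finds 80% of what survived the filter.
realistic = max_end_to_end_recall(0.60, 0.80)
print(f"ceiling: {ceiling:.0%}, with an 80% second stage: {realistic:.0%}")
```

Because the two stages multiply, even a very good predictive coding review cannot recover what the keyword filter threw away: with the assumed 80% second stage, overall recall falls to 48%.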

The control set approach cannot work in legal search because the size of the random sample, much less the portion of the sample allocated to the control set, is never even close to large enough to include a representative document from each type of relevant documents in the corpus, much less the outliers. So even if the benchmark were not on such shifting grounds, it would still fail because it is incomplete. The result is likely to be overtraining of the document types to those that happened to hit in the control set, which is exactly what the control set is supposed to prevent. This kind of overfitting can and does happen even without exact knowledge of the documents in the control set. That is an additional problem separate and apart from relevance shift. It is a problem solved by the multimodal search aspects of predictive coding 3.0.
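A rough back-of-the-envelope calculation shows why a random sample struggles at low prevalence. The sketch below is illustrative only: the sample size of 1,500, the 1% prevalence, and the rare document type making up 1 in 10,000 documents are all assumed numbers, and the formula treats the draws as independent, which is a fair approximation when the collection is much larger than the sample.

```python
def expected_relevant_in_sample(sample_size: int, prevalence: float) -> float:
    """Average number of relevant documents a random sample will contain."""
    return sample_size * prevalence


def chance_type_is_missing(sample_size: int, type_share: float) -> float:
    """Probability that a random sample holds no document of a given type.

    type_share is that type's share of the whole collection. Draws are
    treated as independent (collection far larger than the sample).
    """
    return (1.0 - type_share) ** sample_size


sample_size = 1500        # hypothetical control set sample
prevalence = 0.01         # 1% of the collection is relevant
rare_type_share = 0.0001  # a relevant document type seen in 1 of 10,000 documents

expected = expected_relevant_in_sample(sample_size, prevalence)
missing = chance_type_is_missing(sample_size, rare_type_share)
print(f"expected relevant documents in sample: {expected:.0f}")
print(f"chance the rare type is absent from the sample: {missing:.0%}")
```

Under these assumptions the sample contains only about 15 relevant documents, and the rare type is more likely than not to be missing from it entirely.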

Again William Webber has addressed this issue in his typical understated manner. He points out in Why training and review (partly) break control sets the futility of using control sets to measure effectiveness because the sets are incomplete:

Direct measures of process effectiveness on the control set will fail to take account of the relevant and irrelevant documents already found through human assessment.

A naïve solution to this problem is to exclude the already-reviewed documents from the collection; to use the control set to estimate effectiveness only on the remaining documents (the remnant); and then to combine estimated remnant effectiveness with what has been found by manual means. This approach, however, is incorrect: as documents are non-randomly removed from the collection, the control set ceases to be randomly representative of the remnant. In particular, if training (through active learning) or review is prioritized towards easily-found relevant documents, then easily-found relevant documents will become rare in the remnant; the control set will overstate effectiveness on the remnant, and hence will overstate the recall of the TAR process overall. …

In particular, practitioners should be wary about the use of control sets to certify the completeness of a production—besides the sequential testing bias inherent in repeated testing against the one control set, and the fact that control set relevance judgments are made in the relative ignorance of the beginning of the TAR process. A separate certification sample should be preferred for making final assessments of production completeness.

Control sets are a good idea in general, and the basis of most scientific research, but they simply do not work in legal search. They were built into the version 1.0 software by engineers and scientists who had little understanding of legal search. They apparently had, and some still have, no real grasp at all as to how relevance is refined and evolves during the course of any large document review, nor of the typical low prevalence of relevance. The normal distribution in probability statistics is just never found in legal search. The whole theory behind the secret control set myth in legal search is that the initial relevance coding of these documents was correct, immutable and complete; that it should be used to objectively judge the rest of the coding in the project. That is not true. In point of fact, many documents determined to be relevant or irrelevant at the beginning of a project may be considered the reverse by the end. Many more types of relevant documents are never even included in the control set. That is not because of bad luck or a weak SME, but because of the natural progression of the understanding of the probative value of various types of documents over the course of a review. It is also because of the natural rarity of relevant evidence in unfiltered document collections.

All experienced lawyers know how relevance shifts during a case. But the scientists and engineers who designed the first generation software did not know this, and anyway, it contravened their dogma of the necessity of control sets. They could not bend their minds to the reality of indeterminate, rare legal relevance. In legal search the target is always moving and always small. Also, the data itself can often change as new documents are added to the collection. In other areas of information retrieval, the target is solid granite, simple Newtonian, and big, or at least bigger than just a few percent. Outside of legal search it may make sense to talk of an immutable ground truth. In legal search the ground truth is discovered. It emerges as part of the process, often including surprise court rulings and amended causes of action. It is in flux. The truth is rare. The truth is relative.

The parallels of legal search with quantum mechanics are obvious. The documents have to be observed before they will manifest certainly as either relevant or irrelevant. Uncertainty is inherent to information retrieval in legal search. Get used to it. That is reality on many levels, including the law.

The control set based procedures were not only over-complicated, they were inherently defective. They were based on an illusion of certainty, an illusion of a ground truth benchmark magically found at the beginning of a project before document review even began. There were supposedly SME wizards capable of such prodigious feats. I have been an SME in many, many topics of legal relevance in my 38-plus years of legal practice. SMEs are human, all too human. There is no magic wizard behind the curtain. Moreover, the understanding of any good SME naturally evolves over time as previously unknown, unseen documents are unearthed and analyzed. Legal understanding is not static. The theory of a case is not static. All experienced trial lawyers know this. The case you start out with is never the one you end up with. You never really know if Schrodinger’s cat is alive or dead. You get used to that after a while. Certainty comes from the final rulings of the last court of appeals.

The use of magical control sets doomed many a predictive coding project to failure. Project team leaders thought they had high recall, because the secret control set said they did, yet they still missed key documents. They still had poor recall and poor precision, or at least far less than their control set analysis led them to believe. See: Webber, The bias of sequential testing in predictive coding, June 25, 2013, (“a control sample used to guide the producing party’s process cannot also be used to provide a statistically valid estimate of that process’s result.”) I still hear stories from reviewers where they find precision of less than 50% using Predictive Coding 1.0 methods, sometimes far less. That always seems shocking to me, unbelievable, as I have never had a predictive coding project (where I have, of course, always used these 3.0 methods) with less than 80% precision, and many times reviewers find 95% plus precision.

Many attorneys who worked with predictive coding software version 1.0, where they did not see their projects overtly crash and burn, as when missed smoking gun documents later turn up, or where reviewers see embarrassingly low precision, were nonetheless suspicious of the results. Even if not suspicious, they were discouraged by the complexity and arcane control set process from ever trying predictive coding again. As attorney and search expert J. William (Bill) Speros likes to say, they could smell the junk science in the air. They were right. I do not blame them for rejecting predictive coding 1.0. I did. But unlike many, I created my own method, here called version 3.0.

At first I could not understand why so many of my search expert friends did not enjoy the same level of success that I did, or Maura Grossman did, or a few others like us in the industry. In fact, I heard more complaints about predictive coding than praise. I have finally understood (yes, I admit to being fairly slow on this realization) that they were following the version 1.0 predictive coding methods of the vendors they used. That explained their failures, their frustrations. I never followed the 1.0 procedures. Maura Grossman never even used any of the vendor software. The many frustrated with predictive coding 1.0 were also told by some vendors to leave behind their other search skills and tools, and just use predictive coding type searches. I have always rejected this too, and instead used a multimodal approach.

The control set fiction also put an unnecessarily heavy burden upon SMEs. They were supposed to review thousands of random documents at the beginning of a project, sometimes tens of thousands, and successfully classify them, not only for relevance, but sometimes also for a host of sub-issues. Some gamely tried, and went along with the pretense of omnipotence. After all, the documents in the control set were kept secret so no one would ever know if any particular document they coded was correct or not. But most SMEs simply refused to spend days and days coding random documents. They refused to play the pretend wizard game. They correctly intuited that they had better things to do with their time, plus many clients did not want to spend over $500 per hour to have their senior trial lawyers reading random emails, most of which would be irrelevant. So the SMEs would delegate this tedious task to other, less experienced attorneys, ones who were even less qualified to play God.

I have heard many complaints from lawyers that predictive coding is too complicated and did not work for them. These complaints were justified. The control set and two-step review process were the culprits, not the active machine learning process. The control set has done great harm to the legal profession. As one of the few writers in e-discovery free from vendor influence, much less control (you will never see any ads here), I am here to blow the whistle, to put an end to the vendor hype. No more secret control sets. Let us simplify and get real. Lawyers who have tried predictive coding before and given up, come back and try Predictive Coding 3.0. It has been purged of vendor hype and bad science and proven effective many times.

Vendors – Do you want to increase your business and predictive coding users? Then make sure your software will work with Predictive Coding 3.0 and make sure your experts understand 3.0 methods. Mr. EDR already allows for use of version 3.0, and the Kroll Ontrack experts now know how to use these methods with him. But even Mr. EDR, my current favorite software, needs to be improved and purged of his needless control set complexities. Predictive Coding 3.0 is much simpler, and more accurate, than any prior method.

Users – If your vendor is version 3.0 compliant, then come back and give predictive coding another try. I am sure you will be pleasantly surprised this time.

Version 3.0 is CAL Based and Control Free 

Version 1.0 type software, which is still being manufactured by many vendors today, has a built-in two-step process as mentioned earlier. It requires you to train documents, and then after training, review a certain total of ranked documents, as guided by your control set recall calculations. Version 2.0 of Predictive Coding eliminated the two-step process, and made the training continuous. For that reason version 2.0 is also called continuous active learning or CAL. It did not, however, explicitly reject the random sample step and its control set nonsense.

Predictive Coding 3.0 builds on the CAL improvements in 2.0, but also eliminates the secret control set and mandatory initial review of a random sample for this set. This and other process improvements in Predictive Coding 3.0 significantly reduce the burden on busy SMEs, significantly improve the recall estimates, and thus improve the quality of the reviews.

In Predictive Coding 3.0 the secret control set basis of recall calculation is replaced with a prevalence based random sample guide, and elusion based quality control samples. These can be done with contract lawyers and only minimal involvement by the SME. See Zero Error Numerics. The final elusion type recall calculation is done at the end of the project, when final relevance has been determined. See: EI-Recall. Moreover, in the 3.0 process the sample documents are not secret. They are known and adjusted as the definitions of relevance change over time to better control your recall range estimates. That is a major improvement.
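The general shape of an elusion type recall calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of the EI-Recall method referenced above: the function names are hypothetical, the Wilson score interval is one common choice of interval that I have swapped in for simplicity, and the project numbers are invented.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95% confidence)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def elusion_recall_range(found_relevant: int, null_set_size: int,
                         sample_size: int, sample_relevant: int) -> tuple[float, float, float]:
    """Return (low, point, high) recall estimates from an elusion sample.

    found_relevant: relevant documents located by the end of the review.
    null_set_size: documents left unreviewed and presumed irrelevant.
    sample_size / sample_relevant: random sample drawn from the null set and
    the number of relevant documents (false negatives) found in it.
    """
    if found_relevant <= 0:
        raise ValueError("found_relevant must be positive")
    low_rate, high_rate = wilson_interval(sample_relevant, sample_size)
    point_rate = sample_relevant / sample_size

    def recall(elusion_rate: float) -> float:
        missed = elusion_rate * null_set_size  # projected false negatives
        return found_relevant / (found_relevant + missed)

    # A high elusion rate means more missed documents and therefore lower recall.
    return recall(high_rate), recall(point_rate), recall(low_rate)


# Hypothetical project: 9,000 relevant found, 90,000 left in the null set,
# 1,500 sampled from the null set at the end of the review, 5 relevant among them.
low, point, high = elusion_recall_range(9000, 90000, 1500, 5)
print(f"recall range: {low:.1%} to {high:.1%} (point estimate {point:.1%})")
```

The point is that the sample is drawn at the end, from the documents left behind, and judged under the final understanding of relevance. The output is a range and not a single number, which keeps the uncertainty visible.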

The secret control set never worked, and it is high time it be expressly abandoned, because: (1) relevance is never static, it changes over the course of the review; (2) the random selection size was typically too small for statistically meaningful calculations; (3) the random selection was typically too small in low prevalence collections (the vast majority in legal search) for complete training selections; and (4) it supposedly required a senior SME’s personal attention for days of document review work, a mission impossible for most e-discovery teams.

Predictive Coding 1.0 and the First Patents

When predictive coding first entered the legal marketplace in 2009 the legal methodology used by lawyers for predictive coding was dictated by the software manufacturers, mainly the engineers who designed the software. See eg Leading End-to-End eDiscovery Platform Combines Unique Predictive Coding Technology with Random Sampling to Revolutionize Document Review (2009 Press Release). Recommind was an early leader, which is one reason I selected them for the Da Silva Moore v. Publicis Groupe case back in 2011. On April 26, 2011, Recommind was granted a patent for predictive coding: Patent No. 7,933,859, entitled Systems and methods for predictive coding. The search algorithms in the patent used Probabilistic Latent Semantic Analysis, an already well-established statistical analysis technique for data analysis. (Recommind obtained two more patents with the same name in 2013: Patent No. 8,489,538 on July 16, 2013; and Patent No. 8,554,716 on October 8, 2013.)

As the titles of all of these patents indicate, the methods of use of the text analytics technology in the software were key to the patent claims. As is typical for patents, many different method variables were described to try to obtain as wide a protection as possible. The core method was shown in Figure Four of the 2011 patent.


This essentially describes the method that I now refer to as Predictive Coding Version 1.0. It is the work flow I had in mind when I first designed procedures for the Da Silva Moore case. In spite of the Recommind patent, this basic method was followed by all vendors who added predictive coding features to their software in 2011, 2012 and thereafter. It is still going on today. Many of the other vendors also received patents for their predictive coding technology and methods, or applications are pending. See eg. Equivio, patent applied for on June 15, 2011 and granted on September 10, 2013, patent number 8,533,194; Kroll Ontrack, application 20120278266, April 28, 2011.

To my knowledge there has been no litigation between vendors. My guess is they all fear invalidation on the basis of lack of innovation and prior art.

The engineers, statisticians and scientists who designed the first predictive coding software are the people who dictated to lawyers how the software should be used in document review. None of the vendors seems to have consulted practicing lawyers in creating these version 1.0 methods. I know I was not involved.

Ralph Losey

Losey in 2011 when first arguing against the methods of version 1.0

I also remember getting into many arguments with these technical experts from several companies back in 2011. That was when the predictive coding 1.0 methods hardwired into their software were first explained to me. I objected right away to the secret control set. I wanted total control of my search and review projects. I resented the secrecy aspects. There were enough black boxes in the new technology already. I was also very dubious of the statistical projections. In my arguments with them, sometimes heated, I found that they had little real grasp of how legal search was actually conducted or the practice of law. My arguments were of no avail. And to be honest, I had a lot to learn. I was not confident of my positions, nor knowledgeable enough of statistics. All I knew for sure is that I resented their trying to control my well-established, pre-predictive coding search methods. Who were they to dictate how I should practice law, what procedures I should follow? These scientists did not understand legal relevance, nor how it changes over time during the course of any large-scale review. They did not understand the whole notion of the probative value of evidence and the function of e-discovery as trial preparation. They did not understand weighted relevance, and the 7±2 rule of judge and jury persuasion. I gave up trying, and just had the software modified to suit my needs. They would at least agree to do that to placate me.

Part of the reason I gave up trying back in 2011 is that I ran into a familiar prejudice from this expert group. It was a prejudice against lawyers common to most academics and engineers. As a high-tech lawyer since 1980 I have faced this prejudice from non-lawyer techies my whole career. They assume we are all just a bunch of weasels, not to be trusted, and with little or no knowledge of technology and search. They have no idea at all about legal ethics or professionalism, nor of our experience with the search for evidence. They fail to understand the central role of lawyers in e-discovery, and how our whole legal system, not just discovery, is based on the honesty and integrity of lawyers. We need good software from them, not methods to use the software, but they knew better. It was frustrating, believe me. So I gave up on the control set arguments and moved on. Until today.

In the arrogance of the first designers of predictive coding, an arrogance born of advanced degrees in entirely different fields, these information scientists and engineers presumed they knew enough to tell all lawyers how to use predictive coding software. They were blind to their own ignorance. The serious flaws inherent in Predictive Coding Version 1.0 are the result.

Predictive Coding Version 2.0 Adopts CAL

The first major advance in predictive coding methodology was to eliminate the dual task phases present in Predictive Coding 1.0. The first phase of the two-fold version 1.0 procedure was to use active learning to train the classifier. This would take several rounds of training and eventually the software would seem to understand what you were looking for. Your concept of relevance would be learned by the machine. Then the second phase would begin. In phase two you actually reviewed the documents that met the ranking criteria. In other words, you would use predictive coding in phase one to cull out the probable irrelevant documents, and then you would be done with predictive coding. (In some applications you might continue to use predictive coding for reviewer batch assignment purposes only, but not for training.) Phase two was all about review to confirm the prediction of classification, usually relevance. In phase two you would just review, and not also train.

In my two ENRON experiments in 2012 I did not follow this two-step procedure. I just kept on training until I could not find any more relevant documents. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. (Part One); Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents (Part Two); Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron (in PDF form and the blog introducing this 82-page narrative, with second blog regarding an update); Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (a five-part written and video series comparing two different kinds of predictive coding search methods).

I did not think much about it at the time, but by continuing to train I used a, to me, perfectly reasonable departure from the version 1.0 method. I was using what is now promoted as the new and improved Predictive Coding 2.0. In this 2.0 version you combine training and review. The training is continuous. The first round of document training might be called the seed set, if you wish, but it is nothing particularly special. All rounds of training are important and the training should continue as the review proceeds, unless there are some logistical reasons not to. After all, training and review are both part of the same review software, or should be. It just makes good common sense to do that, if your software allows you to. If you review a document, then you might as well at least have the option to include it in the training. There is no logical reason for a cut-off point in the review process where training stops. I really just came up with that notion in Da Silva for simplicity’s sake.

In predictive coding 2.0 you do Continuous Active Learning, or CAL for short, a term which was, I think, first coined by Gordon Cormack and Maura Grossman. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014. It just makes much more sense to keep training as long as you can, if your software allows you to do that.

There are now several vendors that promote the capacity of continuous training and have it built into their review software, including Kroll Ontrack. The vendor most vocal about it, however, and the one who promotes the term Predictive Coding 2.0, is Catalyst. Apparently many vendors still use the old dual task, stop training approach of version 1.0. And, most vendors still use, or at least give lip service to, the previously sacrosanct random secret control set features of version 1.0.

John Tredennick

The well-known Denver law technology sage, John Tredennick, CEO of Catalyst, writes about 2.0 continuously. Here is just one of many good explanations John has made about CAL, this one from his article with the catchy name “A TAR is Born: Continuous Active Learning Brings Increased Savings While Solving Real-World Review Problems” (note these diagrams are his, not mine, and he here calls predictive coding TAR):

How Does CAL Work?

CAL turns out to be much easier to understand and implement than the more complicated protocols associated with traditional TAR reviews.


A TAR 1.0 review is typically built around the following steps:

1. A subject matter expert (SME), often a senior lawyer, reviews and tags a sample of randomly selected documents to use as a “control set” for training.
2. The SME then begins a training process using Simple Passive Learning or Simple Active Learning. In either case, the SME reviews documents and tags them relevant or non-relevant.
3. The TAR engine uses these judgments to build a classification/ranking algorithm that will find other relevant documents. It tests the algorithm against the control set to gauge its accuracy.
4. Depending on the testing results, the SME may be asked to do more training to help improve the classification/ranking algorithm.
5. This training and testing process continues until the classifier is “stable.” That means its search algorithm is no longer getting better at identifying relevant documents in the control set.

Even though training is iterative, the process is finite. Once the TAR engine has learned what it can about the control set, that’s it. You turn it loose to rank the larger document population (which can take hours to complete) and then divide the documents into categories to review or not. There is no opportunity to feed reviewer judgments back to the TAR engine to make it smarter.

TAR 2.0: Continuous Active Learning

In contrast, the CAL protocol merges training with review in a continuous process. Start by finding as many good documents as you can through keyword search, interviews, or any other means at your disposal. Then let your TAR 2.0 engine rank the documents and get the review team going.


As the review progresses, judgments from the review team are submitted back to the TAR 2.0 engine as seeds for further training. Each time reviewers ask for a new batch of documents, they are presented based on the latest ranking. To the extent the ranking has improved through the additional review judgments, reviewers receive better documents than they otherwise would have.
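The continuous loop described above can be sketched as a toy Python program. Everything in it is an assumption made for illustration: the word-count scorer stands in for whatever ranking engine a real TAR product uses, the `reviewer` function stands in for human judgment, and the six short documents are invented. A real project would also stop when batches stop producing relevant documents, which this toy version omits by simply reviewing everything.

```python
from collections import Counter


def train(labeled):
    """Toy model: each word gains weight when seen in relevant documents, loses it otherwise."""
    weights = Counter()
    for doc, relevant in labeled:
        for word in set(doc.lower().split()):
            weights[word] += 1 if relevant else -1
    return weights


def score(doc, weights):
    """Rank score of a document: sum of the weights of its distinct words."""
    return sum(weights[word] for word in set(doc.lower().split()))


def cal_review(collection, seeds, reviewer, batch_size=2):
    """Continuous active learning: review the top-ranked batch, retrain, repeat."""
    labeled = [(doc, reviewer(doc)) for doc in seeds]
    remaining = [doc for doc in collection if doc not in seeds]
    review_order = []
    while remaining:
        weights = train(labeled)  # retrain on every judgment made so far
        remaining.sort(key=lambda doc: score(doc, weights), reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for doc in batch:
            labeled.append((doc, reviewer(doc)))  # feed the judgment back in
            review_order.append(doc)
    return review_order, labeled


collection = [
    "lunch menu for friday",
    "revenue forecast restated",
    "parking garage closed",
    "restated revenue hidden from auditors",
    "auditors asked about forecast",
    "holiday party friday",
]


def reviewer(doc):
    # Hypothetical stand-in for a human relevance call.
    return "revenue" in doc or "auditors" in doc


order, judgments = cal_review(collection, ["revenue forecast restated"], reviewer)
for doc in order:
    print("relevant  " if reviewer(doc) else "irrelevant", doc)
```

In this toy run a single seed document is enough to pull the two other relevant documents into the first batch, ahead of the lunch menus and parking notices. There is no separate training phase and no control set; every review judgment is also a training judgment.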

After this blog was first published John contacted me and said his software never had a control set, and so in this sense his Catalyst software is already fully Predictive Coding 3.0 compliant. Even if your software has control set features, you can probably still disable them. That is what I do with the Kroll Ontrack software that I typically use (see eg I am talking about a method of use here, not a specific algorithm, nor patentable invention. So unless the software you use forces you to do a two-step process, or makes you use a control set, you can use these version 3.0 methods with it. Still, some modifications of the software would be advantageous to streamline and simplify the whole process that is inherent in Predictive Coding 3.0. For this reason I call on all software vendors to eliminate the secret control set now and the dual step process.

Version 3.0 Rejects the Use of Control and Seed Sets

The main problem for me with the 1.0 work-flow methodology for Predictive Coding was not the two-fold nature of train then review, which is what 2.0 addressed, but its dependence on creation of a secret control set and seed set at the beginning of a project. That is the box labeled 430 in Figure Four to the Recommind patent. It is shown in Tredennick’s Version 1.0 diagram on the left as control set and seed set. The need for a random secret control set and seed set became an article of faith based on black letter statistics rules. Lawyers just accepted it without question as part of version 1.0 predictive coding. It is also one reason that the two-fold method of train then review, instead of CAL 2.0, is taking so long for some vendors to abandon.

Based on my experience and experiments with predictive coding methods since 2011, the random control set and seed set are both unnecessary. The secret control set is especially suspect. It does not work in real-world legal review projects, or worse, provides statistical misinformation as to recall. As mentioned, that is primarily because in the real world of legal practice relevance is a continually evolving concept. It is never the same at the beginning of a project, when the control set is created, as at the end. The engineers who designed version 1.0 simply did not understand that. They were not lawyers and did not appreciate the flexibility of relevance. They did not know about concept drift. They did not understand the inherent vagaries and changing nature of the search target in a large document review project. They also did not understand how human SMEs were, how they often disagree with themselves on the classification of the same document even without concept drift. As mentioned, they were also blinded by their own arrogance, tinged with antipathy against lawyers.

They did understand statistics. I am not saying their math was wrong. But they did not understand evidence, did not understand relevance, did not understand relevance drift (or, as I prefer to call it, relevance evolution), and did not understand efficient legal practice. Many I have talked to did not have any real understanding of how lawyers worked at all, much less document review. Most were just scientists or statisticians. They meant well, but they did harm nonetheless. These scientists did not have any legal training. If there were any lawyers on the version 1.0 software development team, they were not heard, or had never really practiced law. (As a customer, I know I was brushed off.) Things have gotten much better in this regard since 2008 and 2009, but still, many vendors have not gotten the message. They still manufacture version 1.0 type predictive coding software.

Jeremy Pickens, Ph.D., Catalyst’s in-house information scientist, seems to agree with my assessment of control sets. See Pickens, An Exploratory Analysis of Control Sets for Measuring E-Discovery Progress, DESI VI 2015, where he reports on his investigation of the effectiveness of control sets to measure recall and precision. Jeremy used the Grossman and Cormack TAR Evaluation Toolkit for his data and gold standards. Here is his conclusion:

A popular approach in measuring e-discovery progress involves the creation of a control set, holding out randomly selected documents from training and using the quality of the classification on that set as an indication of progress on or quality of the whole. In this paper we do an exploratory data analysis of this approach and visually examine the strength of this correlation. We found that the maximum-F1 control set approach does not necessarily always correlate well with overall task progress, calling into question the use of such approaches. Larger control sets performed better, but the human judgment effort to create these sets have a significant impact on the total cost of the process as a whole.

A secret control set is not a part of the Predictive Coding 3.0 method. As will be explained, I still have random selection reviews for prevalence and quality control purposes – Steps Three and Seven – but the documents are not secret and they are typically used for training (although they do not have to be). Moreover, version 3.0 eliminates any kind of special first round of training seed set, random based or otherwise. The first time the machine training begins is simply the first round. Sometimes it is big, sometimes it is not. It all depends on my technical and legal analysis of the data presented or circumstances of the project. It also all depends on my legal analysis and the disputed issues of fact in the lawsuit or other legal investigation. That is the kind of thing that lawyers do every day. No magic required, not even high intelligence; only background and experience as a practicing lawyer are required.

The seed set is dead. So too is the control set. Other statistical methods must be used to calculate recall ranges and other numeric parameters beyond the ineffective control set method. Other methods beyond just statistics must be used to evaluate the quality and success of a review project. See, e.g., EI-Recall and Zero Error Numerics (which includes statistics, but is not limited to it).

A full description of the eight-step model used to describe Predictive Coding 3.0 will follow, step by step, in part two of this article.

Grossman and Cormack Patents

I do not claim any patents or other intellectual property rights to Predictive Coding 3.0, aside from copyrights to my writings, and certain trade secrets that I use, but have not published or disclosed outside of my circle of trust. Hopefully my 3.0 method does not infringe any existing patent claims. In the course of writing this article I happened to notice, for the first time, that my 3.0 method appears to have several features in common with some of the descriptions of predictive coding workflow in the predictive coding patents of Gordon Cormack and Maura Grossman. Their patents are all entitled Full-Text Systems and methods for classifying electronic information using advanced active learning technique: December 31, 2013, 8,620,842, Cormack; April 29, 2014, 8,713,023, Grossman and Cormack; and September 16, 2014, 8,838,606, Grossman and Cormack.

The slight similarities are not too surprising. My development of the Predictive Coding 3.0 method was based in part on their research and publications. It was also based on my studies of the publications of others, the prior art, as well as my own research and experiments with a variety of predictive coding methods. Finally, like Maura Grossman's methods, the 3.0 methods grew out of experience with real-world legal predictive coding projects, in my case since 2011. All seem like obvious methods to me.

The Grossman and Cormack patents and patent applications are interesting for a number of reasons. I suggest you read them. For instance, they all contain the following paragraph in the Background section explaining why their invention is needed. As you can see it criticizes all of the existing version 1.0 software on the market at the time of their applications (2013) (emphasis added):

Generally, these e-discovery tools require significant setup and maintenance by their respective vendors, as well as large infrastructure and interconnection across many different computer systems in different locations. Additionally, they have a relatively high learning curve with complex interfaces, and rely on multi-phased approaches to active learning. The operational complexity of these tools inhibits their acceptance in legal matters, as it is difficult to demonstrate that they have been applied correctly, and that the decisions of how to create the seed set and when to halt training have been appropriate. These issues have prompted adversaries and courts to demand onerous levels of validation, including the disclosure of otherwise non-relevant seed documents and the manual review of large control sets and post-hoc document samples. Moreover, despite their complexity, many such tools either fail to achieve acceptable levels of performance (i.e., with respect to precision and recall) or fail to deliver the performance levels that their vendors claim to achieve, particularly when the set of potentially relevant documents to be found constitutes a small fraction of a large collection.

They then indicate that their invention overcomes these problems and is thus a significant improvement over prior art. In Figure Eleven of their patent (shown below) they describe one such improvement, “an exemplary method 1100 for eliminating the use of seed sets in an active learning system in accordance with certain embodiments.”


These are basically the same kinds of complaints that I have made here against Predictive Coding 1.0 and 2.0. I understand the criticism of complex interfaces that rely on multi-phased approaches to active learning. If the software forces the use of control set and seed set nonsense, then it is an overly complex interface. (It is not overly complex if it allows other types of search, such as keyword, similarity or concept, for this degree of complexity is necessary for a multimodal approach.) I also understand their criticism of the multi-phased approaches to active learning, which was fixed in 2.0 and CAL.

The Grossman & Cormack criticism about low prevalence document collections, which is the rule, not the exception in legal search, is also correct. It is another reason the control set approach cannot work in legal search. The number of relevant documents to be found constitutes a small fraction of a large collection and so the control set random sample is very unlikely to be representative, much less complete. That is an additional problem separate and apart from relevance shift.
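The sampling side of this low prevalence point can be sketched with a little arithmetic. The example below is only an illustration under assumed numbers. The control set size, the prevalence rate and the count of documents found are all hypothetical, and the Wilson score interval is one common choice of binomial interval that I have swapped in for illustration; it is not taken from the Grossman & Cormack patents or from any particular software.

```python
import math
from typing import Tuple


def expected_relevant_in_sample(sample_size: int, prevalence: float) -> float:
    """Expected number of relevant documents in a simple random sample."""
    return sample_size * prevalence


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials == 0:
        return 0.0, 1.0  # no information at all
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical project: 1,500-document control set, 1% prevalence.
control_set_size = 1500
prevalence = 0.01
relevant_in_sample = round(expected_relevant_in_sample(control_set_size, prevalence))

# Suppose the classifier retrieves 12 of those relevant control documents.
found = 12
low, high = wilson_interval(found, relevant_in_sample)
print(f"Relevant documents expected in the control set: {relevant_in_sample}")
print(f"Recall point estimate: {found / relevant_in_sample:.0%}")
print(f"Approximate 95% recall range: {low:.0%} to {high:.0%}")
```

With so few relevant documents in the sample, the recall range is very wide, and a handful of documents cannot stand in for every kind of relevant document in the collection. Enlarging the control set narrows the range, but only by adding review cost.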

About the only complaint the Grossman & Cormack patent makes that I do not understand is the gripe about large infrastructure and interconnection across many different computer systems in different locations. For Kroll Ontrack software at least, that is the vendor’s problem, not the attorneys’. All the user does is sign on to a secure cloud server.

Notice that there is no seed set or control set in the Grossman & Cormack patent diagram as you see in the old Recommind patent. Much of the rest of the patent, insofar as I am able to understand the arcane patent language used, consists of applications of CAL techniques that have been tested and explained in their writings, including many additional variables and techniques not mentioned in their articles. See, e.g., Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014. Their patent includes CAL methods, of course, but also eliminates the use of seed sets. I presume this means they also eliminate control sets, at least in some of their methods. If true, then in that sense their patents are like my own 3.0 innovation.

To be continued and concluded with a lengthy description of the Predictive Coding version 3.0 eight-step method.
