New Video to Introduce the TAR Course

April 23, 2017

We are continuing to upgrade the e-Discovery Team’s free TAR Course. The latest improvements include the addition of “homework assignments” to the first ten classes. These are challenging and add to the depth of the instruction. The homework includes both supplemental reading suggestions and exercises. We will add homework assignments to the last six classes soon. We also made a few minor revisions and additions to the written materials, but nothing substantial. Periodically we will add some more video content to the TAR Course. We started this weekend by adding a video to the first class:

Here is the list of all sixteen classes in the TAR Course.

  1. First Class: Introduction 
  2. Second Class: TREC Total Recall Track
  3. Third Class: Introduction to the Nine Insights Concerning the Use of Predictive Coding in Legal Document Review
  4. Fourth Class: 1st of the Nine Insights – Active Machine Learning
  5. Fifth Class: Balanced Hybrid and Intelligently Spaced Training
  6. Sixth Class: Concept and Similarity Searches
  7. Seventh Class: Keyword and Linear Review
  8. Eighth Class: GIGO, QC, SME, Method, Software
  9. Ninth Class: Introduction to the Eight-Step Work Flow
  10. Tenth Class: Step One – ESI Communications
  11. Eleventh Class: Step Two – Multimodal ECA
  12. Twelfth Class: Step Three – Random Prevalence
  13. Thirteenth Class: Steps Four, Five and Six – Iterate
  14. Fourteenth Class: Step Seven – ZEN Quality Assurance Tests
  15. Fifteenth Class: Step Eight – Phased Production
  16. Sixteenth Class: Conclusion

Certification is not offered, though it may be someday. We have created a test based on our TREC experiments that we may eventually roll out.


Five Reasons You Should Read the ‘Practical Law’ Article by Maura Grossman and Gordon Cormack called “Continuous Active Learning for TAR”

April 11, 2016

There is a new article by Gordon Cormack and Maura Grossman that stands out as one of their best and most accessible. It is called Continuous Active Learning for TAR (Practical Law, April/May 2016). The purpose of this blog is to get you to read the full article by enticing you with some of the information and knowledge it contains. But before we go into the five reasons, we will examine the purpose of the article, which aligns with our own, and touch on the differences between their trademarked TAR CAL method and our CAR Hybrid Multimodal method. Both of our methods use continuous, active learning, the acronym for which, CAL, they now claim as a trademark. Since they clearly did invent the acronym, we for one will stop using CAL as a generic term.

The Legal Profession’s Remarkably Slow Adoption of Predictive Coding

The article begins with the undeniable point that the legal profession has been remarkably slow to adopt TAR. In their words:

Adoption of TAR has been remarkably slow, considering the amount of attention these offerings have received since the publication of the first federal opinion approving TAR use (see Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182 (S.D.N.Y. 2012)).

I remember getting that landmark ruling in our Da Silva Moore case, a ruling that pissed off plaintiffs’ counsel, because, despite what you may have heard to the contrary, they were strenuously opposed to predictive coding. Like most other lawyers at the time who were advocating for advanced legal search technologies, I thought Da Silva would open the flood gates, that it would encourage attorneys to begin using the then-new technology in droves. In fact, all it did was encourage the Bench, but not the Bar. Judge Peck’s more recent ruling on the topic contains a good summary of the law. Rio Tinto PLC v. Vale S.A., 306 F.R.D. 125 (S.D.N.Y. 2015). There was a flood of judicial rulings approving predictive coding all around the country, and lately, around the world. See, e.g., Pyrrho Investments v MWB Property [2016] EWHC 256 (Ch) (2/26/16).

The rulings were followed in private arbitration too. For instance, I used the Da Silva Moore ruling a few weeks after it was published to obtain what was apparently the first ruling by an AAA arbitrator approving use of predictive coding. The opposition to our use of cost-saving technology in that arbitration case was again fierce, and again included personal attacks, but the arguments for use in arbitration are very compelling. Discovery in arbitration is, after all, supposed to be constrained and expedited.

After the Da Silva Moore opinion, Maura Grossman and I upped our speaking schedule (she far more than me), and so did several tech-minded judges, including Judge Peck (although never at the same events as me, until the cloud of false allegations created by a bitter plaintiff’s counsel in Da Silva Moore could be dispelled). At Legal Tech for the next few years predictive coding was all anybody wanted to talk about. Then IG, Information Governance, took over as the popular tech-child of the day. In 2015 we had only a few predictive coding panels at Legal Tech, but they were well attended.

Grossman and Cormack speculate that the cause of the remarkably slow adoption is:

The complex vocabulary and rituals that have come to be associated with TAR, including statistical control sets, stabilization, F1 measure, overturns, and elusion, have dissuaded many practitioners from embracing TAR. However, none of these terms, or the processes with which they are associated, are essential to TAR.

We agree. The vendors killed what could have been their golden goose with all this control set nonsense, their engineers’ love of complexity, and their misunderstanding of legal search. I have ranted about this before. See Predictive Coding 3.0. I will not go into that again here, except to say the statistical control set nonsense that had large sampling requirements was particularly toxic. It was not only hard and expensive to do, it led to mistaken evaluations of the success or failure of projects because it ignored the reality of the evolving understanding of relevance, so-called concept drift. Another wrong turn involved the nonsense of using only random selection to find training documents, a practice that Grossman and I opposed vigorously. See Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One, Part Two, Part Three, and Part Four. Grossman and Cormack correctly criticize these old vendor-driven approaches in Continuous Active Learning for TAR. They call them SAL and SPL protocols (a couple of acronyms that no one wants to trademark!).

Bottom line, the tide is changing. Over the last several years the few private attorneys who specialize in legal search, but are not employed by a vendor, have developed simpler methods. Maura and I are just the main ones writing and speaking about it, but there are many others who agree. Many have found that it is counter-productive to use control sets, random input, non-continuous training with its illogical focus on the seed set, and misleading recall point projections.

We do so in defiance of the vendor establishment and other self-proclaimed pundits in this area who benefited from such over-complexity. Maura and Gordon, of course, have their own software (Gordon’s creation), and so never needed any vendors to begin with. Not having a world-renowned information scientist like Professor Cormack as my life partner, I had no choice but to rely on vendors for their software. (Not that I’m complaining, mind you. I’m married to a mental health counselor, and it does not get any better than that!)

After a few years I ultimately settled on one vendor, Kroll Ontrack, but I continue to try hard to influence all vendors. It is a slow process. Even Kroll Ontrack’s software, which I call Mr. EDR, still has control set functions built in. Thanks to my persistence, it is easy to turn off these settings and do things my way, with no secret control sets and false recall calculations. Hopefully soon that will be the default setting. Their eyes have been opened. Hopefully all of the other major vendors will soon follow suit.

All of the Kroll Ontrack experts in predictive coding are now, literally, a part of my Team. They are now fully trained and believers in the simplified methods, methods very similar to those of Grossman and Cormack, albeit, as I will next explain, slightly more complicated. We proved how well these methods worked at TREC 2015 when the Kroll Ontrack experts and I did 30 review projects together in 45 days. See e-Discovery Team at TREC 2015 Total Recall Track, Final Report (116 pg. PDF), and  (web page with short summary). Also see – Mr. EDR with background information on the Team’s participation in the TREC 2015 Total Recall Track.

We Agree to Disagree with Grossman and Cormack on One Issue, Yet We Still Like Their Article

We are fans of Maura Grossman and Gordon Cormack’s work, but not sycophants. We are close, but not the same; colleagues, but not followers. For those reasons we think our recommendation that you read this article means more than a typical endorsement. We can be critical of their writings, but, truth is, we liked their new article, although we continue to dislike the name TAR (not important, but we prefer CAR). Also, and this is of some importance, my whole team continues to disagree with what we consider the somewhat over-simplified approach they take to finding training documents, namely reliance on the highest-ranking documents alone.

Despite what some may think, the high-ranking approach does eventually find a full diversity of relevant documents. All good predictive coding software today uses some type of logistic regression based algorithm that is capable of building out probable relevance in that way. That is one of the things we learned by rubbing shoulders with text retrieval scientists from around the world when participating in the TREC 2015 Total Recall Track that Grossman and Cormack helped administer. This regression type of classification system works well to avoid the danger of over-training on a particular relevancy type. Grossman and Cormack have proven that to our satisfaction before (so have our own experiments), and they again make a convincing case for this approach in this article.

Still, we disagree with their approach of using only high-ranking documents for training, but we do so on the grounds of efficiency and speed, not effectiveness. The e-Discovery Team continues to advocate a Hybrid Multimodal approach to active machine learning. We use what I like to call a four-cylinder type of CAR search engine, instead of one cylinder, like they do:

  1. High-ranking documents;
  2. Mid-level, uncertain documents;
  3. A touch, a small touch, of random documents; and,
  4. Human ingenuity found documents, using all types of search techniques (multimodal) that seem appropriate to the search expert in charge, including keyword, linear, similarity (including chains and families), and concept searches (including passive machine learning, clustering-type search).

See Predictive Coding 3.0, where the method is described as an eight-step work flow (Step Six – Hybrid Active Training).
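As a rough sketch of how those four cylinders might be combined into one training batch, here is a toy illustration of my own. The function, parameters, and document names are all invented for this sketch; they are not any vendor's actual API.

```python
# A minimal sketch of four-cylinder Hybrid Multimodal training-batch selection:
# high-ranking documents, uncertain mid-ranked documents, a small touch of
# random documents, and documents found by human multimodal searches.
import random

def select_training_batch(ranked, human_found, n_top=5, n_mid=3, n_rand=2, seed=42):
    """ranked: list of (doc_id, probability) sorted descending by probability.
    human_found: doc_ids located by keyword, similarity, or concept searches."""
    rng = random.Random(seed)
    top = [d for d, p in ranked[:n_top]]                     # cylinder 1: high-ranking
    mid = sorted(ranked, key=lambda dp: abs(dp[1] - 0.5))    # cylinder 2: most uncertain
    mid = [d for d, p in mid[:n_mid]]
    rest = [d for d, p in ranked if d not in top and d not in mid]
    rand = rng.sample(rest, min(n_rand, len(rest)))          # cylinder 3: a touch of random
    # cylinder 4: human-found documents; dict.fromkeys removes any duplicates
    return list(dict.fromkeys(top + mid + rand + list(human_found)))

ranked = [(f"doc{i}", 1 - i / 20) for i in range(20)]  # toy descending scores
batch = select_training_batch(ranked, human_found=["doc17"])
print(batch)
```

The batch then goes to the reviewer for coding, and the cycle repeats with refreshed rankings.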

The latest Grossman and Cormack version of CAL (their trademark) uses only the highest-ranking documents for active training. Still, in spite of this difference, we liked their article and recommend you read it.

The truth is, we also emphasize the high-probable-relevant documents for training. The difference between us is that we use the three other methods as well. On that point we agree to disagree. To be clear, we are not talking about continuous training or not; we agree on that. We are not talking about active training or passive; we agree on that. We are not talking about what they call SAL or SPL protocols (read their article for details); we agree with them that these protocols are ineffective relics invented by misguided vendors. We are only talking about a difference in methods to find documents to use to train the classifier. Even that is not a major disagreement, as we agree with Grossman and Cormack that high-ranking documents usually make the best trainers, just not in the first seed set. There are also points in a search, depending on the project, where the other methods can help you get to the relevant documents in a fast, efficient manner. The primary difference between us is that we do not limit ourselves to that one retrieval method, as Grossman and Cormack do in their trademarked CAL methodology.

Cormack and Grossman emphasize simplicity, ease of use, and reliance on the software algorithms as another way to try to overcome the Bar’s continued resistance to TAR. The e-Discovery Team has the same goal, but we do not think it is necessary to go quite that far for simplicity’s sake. The other methods we use, the other three cylinders, are not that difficult and have many advantages. See e-Discovery Team at TREC 2015 Total Recall Track, Final Report (116 pg. PDF and web page with short summary). Put another way, we like the ability of fully automatic driving from time to time, but we want to keep an attorney’s learned hand at or near the wheel at all times. See Why the ‘Google Car’ Has No Place in Legal Search.

Accessibility with Integrity: The First Reason We Recommend the Article

Professor Gordon Cormack

Here’s the first reason we like Grossman and Cormack’s article, Continuous Active Learning for TAR: you do not have to be one of Professor Cormack’s PhD students to understand it. Yes, it is accessible, not overly technical, and yet it still has scientific integrity; it still has new information, accurate information, and useful knowledge.

It is not easy to do both. I know because I try to make all of my technical writings that way, including the 57 articles I have written on TAR, which I prefer to call Predictive Coding, or CAR. I have not always succeeded in striking the right balance, to be sure. Some of my articles may be too technical, and perhaps some suffer from breezy information overload and knowledge deficiency. Hopefully none are plain wrong, but my views have changed over the years. So have my methods. If you compare my latest work flow (below) with earlier ones, you will see some of the evolution, including the new emphasis over the past few years on continuous training.


The Cormacks and I are both trying hard to get the word out to the Bar as to the benefits of using active machine learning in legal document review.  (We all agree on that term, active machine learning, and all agree that passive machine learning is not an acceptable substitute.) It is not easy to write on this subject in an accurate, yet still accessible and interesting manner. There is a constant danger that making a subject more accessible and simple will lead to inaccuracies and misunderstandings. Maura and Gordon’s latest article meets this challenge.

Take, for example, the article’s first description of their continuous active training search method using highest-ranking documents:

At the outset, CAL resembles a web search engine, presenting first the documents that are most likely to be of interest, followed by those that are somewhat less likely to be of interest. Unlike a typical search engine, however, CAL repeatedly refines its understanding about which of the remaining documents are most likely to be of interest, based on the user’s feedback regarding the documents already presented. CAL continues to present documents, learning from user feedback, until none of the documents presented are of interest.

That is a good way to start an article. The comparison with a Google search, with continued refinement based on user feedback, is well thought out: simple, yet accurate. It represents a description honed by literally hundreds of presentations on the topic by Maura Grossman. No one has talked more on this topic than she has, and I for one intend to start using this analogy.
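The loop the quote describes can be sketched in a few dozen lines. This is my own toy illustration, not Grossman and Cormack's actual implementation: the corpus, the stand-in "reviewer" oracle, and the scikit-learn classifier choice are all assumptions made for the sketch.

```python
# A toy continuous active learning loop: rank the unreviewed documents,
# present the top-ranked batch, learn from the reviewer's feedback, and stop
# once a presented batch contains nothing of interest.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = (["fraud payment scheme %d" % i for i in range(30)] +   # relevant-ish
        ["lunch schedule memo %d" % i for i in range(70)])     # irrelevant
is_relevant = lambda i: i < 30                                 # stand-in for the human reviewer

X = TfidfVectorizer().fit_transform(docs)
reviewed = {0: True, 30: False}                                # tiny seed set
while True:
    clf = LogisticRegression().fit(X[list(reviewed)], [reviewed[i] for i in reviewed])
    unreviewed = [i for i in range(len(docs)) if i not in reviewed]
    if not unreviewed:
        break
    scores = clf.predict_proba(X[unreviewed])[:, 1]            # probability of relevance
    batch = [i for _, i in sorted(zip(scores, unreviewed), reverse=True)[:10]]
    feedback = {i: is_relevant(i) for i in batch}              # reviewer codes the batch
    reviewed.update(feedback)
    if not any(feedback.values()):                             # nothing of interest: stop
        break

found = sum(1 for i, rel in reviewed.items() if rel and is_relevant(i))
print(found)
```

On this toy data the loop keeps serving relevant documents until they run out, then halts after one all-irrelevant batch.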

Rare Description of Algorithm Types – Our Second Reason to Recommend the Article

Another reason our Team liked Continuous Active Learning for TAR is the rare description of search algorithm types that it includes. Here we see the masterful touch of one of the world’s leading academics on text retrieval, Gordon Cormack. First, the article makes clear the distinction between effective analytic algorithms that truly rank documents using active machine learning, and a few other popular programs now out there that use passive learning techniques and call it advanced analytics.

The supervised machine-learning algorithms used for TAR should not be confused with unsupervised machine-learning algorithms used for clustering, near-duplicate detection, and latent semantic indexing, which receive no input from the user and do not rank or classify documents.

These other, older, unsupervised search methods are what I call concept search. It is not predictive coding. It is not advanced analytics, no matter what some vendors may tell you. It is yesterday’s technology – helpful, but far from state-of-the-art. We still use concept search as part of multimodal, just like any other search tool, but our primary reliance for properly ranking documents is placed on active machine learning.

The Cormack-Grossman article goes further than pointing out this important distinction; it also explains the various types of bona fide active machine learning algorithms. Again, some are better than others. First, Professor Cormack explains the types that have been found to be effective by extensive research over the past ten years or so.

Supervised machine-learning algorithms that have been shown to be effective for TAR include:

–  Support vector machines. This algorithm uses geometry to represent each document as a point in space, and deduces a boundary that best separates relevant from not relevant documents.

– Logistic regression. This algorithm estimates the probability of a document’s relevance based on the content and other attributes of the document.

Conversely, Cormack explains:

Popular, but generally less effective, supervised machine-learning algorithms include:

– Nearest neighbor. This algorithm classifies a new document by finding the most similar training document and assuming that the correct coding for the new document is the same as its nearest neighbor.

– Naïve Bayes (Bayesian classifier). This algorithm estimates the probability of a document’s relevance based on the relative frequency of the words or other features it contains.

Ask your vendor which algorithms its software includes. Prepare yourself for double-talk.


If you try out your vendor’s software and the Grossman-Cormack CAL method does not work for you, and even the e-Discovery Team’s slightly more diverse Hybrid Multimodal method does not work, then your software may be to blame. As Grossman and Cormack put it (where the phrase “TAR tool” means software):

[I]t will yield the best possible results only if the TAR tool incorporates a state-of-the-art learning algorithm.

That means software that uses a type of support vector machine and/or logistic regression.
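As a rough illustration, the four algorithm types the article names map onto common scikit-learn classifiers. This mapping is my own assumption for the sketch; vendors' actual implementations differ, and the toy documents and labels are invented.

```python
# The four supervised learning algorithm types named above, instantiated with
# scikit-learn: SVM and logistic regression on the effective side, nearest
# neighbor and naive Bayes on the less effective side.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

train_docs = ["fraud scheme wire transfer", "fraud payment kickback",
              "team lunch friday", "office supplies order"]
labels = [1, 1, 0, 0]                                # 1 = relevant, 0 = not relevant

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)
models = {
    "svm": LinearSVC(),                              # geometric separating boundary
    "logreg": LogisticRegression(),                  # probability of relevance
    "knn": KNeighborsClassifier(n_neighbors=1),      # copies its nearest training document
    "nbayes": MultinomialNB(),                       # relative word frequencies
}
test_doc = vec.transform(["suspicious wire fraud"])
for name, m in models.items():
    m.fit(X, labels)
    print(name, m.predict(test_doc)[0])
```

All four will classify the test document as relevant here; the differences the article describes only show up at realistic scale.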

Teaching by Example – Our Third Reason to Recommend the Article

The article uses a long example involving a search of Jeb Bush email to show you how their CAL method works. This is an effective way to teach. We think they did a good job with this. Rather than spoil the read with quotes and further explanation, we urge you to check out the article and see for yourself. Yes, it is an oversimplification – after all, this is a short article – but it is a good one, and it is still accurate.

Quality Control Suggestions – Our Fourth Reason to Recommend the Article

Another reason we like the article is the quality control suggestions it includes. They essentially speak of using other search methods, which is exactly what we do in Hybrid Multimodal. Here are their words:

To increase counsel’s confidence in the quality of the review, they might:

Review an additional 100, 1,000, or even more documents.

Experiment with additional search terms, such as “Steve Jobs,” “iBook,” or “Mac,” and examine the most-likely relevant documents containing those terms.

Invite the requesting party to suggest other keywords for counsel to apply.

Review a sample of randomly selected documents to see if any other documents of interest are identified.

We like this because it shows that the differences are small between the e-Discovery Team’s Hybrid Multimodal method (hey, maybe I should claim trademark rights to Hybrid Multimodal, but then again, no vendors are using my phrase to sell their products) using continuous active training, and the Grossman-Cormack trademarked CAL method. We also note that their section on Measures of Success essentially mirrors our own thoughts on metric analysis and ei-Recall. See Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search – Part One, Part Two and Part Three.

Article Comes With an Online “Do it Yourself” CAL Trial Kit – Our Fifth Reason to Recommend the Article

We are big believers in learning by doing. That is especially true for legal tasks that seem complicated in the abstract. I can write articles and give presentations that explain AI-enhanced review. You may get an intellectual understanding of predictive coding from these, but you still will not know how to do it. On the other hand, if we have a chance to show someone an entire project, to have them shadow us, then they will really learn how it is done. It is like teaching a young lawyer how to try a case. For a price, we will be happy to do so (assuming conflicts clear).

Maura and Gordon seem to agree with us on that learn-by-doing point and have created an online tool that anyone can use to try out their method. It allows for a search of the Jeb Bush email, the same set of 290,099 emails that we used in ten of the thirty topics at TREC 2015. In their words:

There is no better way to learn CAL than to use it. Counsel may use the online model CAL system to see how quickly and easily CAL can learn what is of interest to them in the Jeb Bush email dataset. As an alternative to throwing up their hands over seed sets, control sets, F1 measures, stabilization, and overturns, counsel should consider using their preferred TAR tool in CAL mode on their next matter.

You can try out their method with their online tool, or in a real project using your vendor’s tool. By the way, we did the latter as part of our TREC 2015 experiments, and the Kroll Ontrack software worked about the same as theirs, even when we used their one-cylinder, high-ranking-only CAL (their trademark) method.

Here is where you can find their CAL testing tool. Those of you who are still skeptical can see for yourself how it works. You can follow the example given in the article about searching for documents relevant to Apple products, to verify their description of how that works. For even more fun, you can dream up your own searches.

President George W. Bush. Photo by Eric Draper, White House.

Perhaps, if you try hard enough, you can find some example searches where their high-ranking-only method, which is built into the test software, does not work well. For example, try finding all emails that pertain to, or in any way mention, the then-President, George W. Bush. Try entering George Bush in the demo test and see for yourself what happens.

It becomes a search for George + Bush in the same document, and then goes from there based on your coding of the highest-ranked documents presented as either relevant or non-relevant. You will see that you quickly end up in a TAR pit. The word Bush is in every email (I think), so you are served up with every email where George is mentioned, and believe me, there are many Georges, even if there is only one President George Bush. Here is a screen shot of the first document presented after entering George Bush. I called it relevant.


These kinds of problem searches do not discredit TAR, or even the Grossman-Cormack one-cylinder search method. If this happened to you in a real search project, you could always use our Hybrid Multimodal™ method for the seed set (first training), or start over with a different keyword or keywords to begin the process. You could, for instance, search for President Bush, or President within five words of George, or “George Bush.” There are many ways, some faster and more effective than others.

Even using the single-method approach, if you decide to use the keywords “President + Bush”, the search will go quicker than with “George + Bush.” Even just using the term “President” works better than George + Bush, but it still seems like a TAR pit, and not a speeding CAR. It will probably get you to the same destination, high recall, but the journey is slightly longer and, at first, more tedious. This high recall result was verified at TREC 2015 by our Team, and by a number of universities that participated in the fully automatic half of the Total Recall Track, including Gordon’s own team. This was all done without any manual review by the fully automatic participants because there was instant feedback of relevant or irrelevant based on a prejudged gold standard. See e-Discovery Team at TREC 2015 Total Recall Track, Final Report (116 pg. PDF and web page with short summary). With this instant feedback protocol, all of the teams attained high recall and good precision. Amazing but true.

You can criticize this TREC experiment protocol, which we did in our report, as unrealistic to legal practice because:

(1) there is no SME who works like that (and there never will be, until legal knowledge itself is learned by an AI); and,

(2) the searches presented as tasks were unrealistically over-simplistic. Id.

But you cannot fairly say that CAL (their trademark) does not work. The glass is most certainly not half empty. Moreover, the elixir in this glass is delicious and fun, especially when you use our Hybrid Multimodal™ method. See Why I Love Predictive Coding: Making document review fun with Mr. EDR and Predictive Coding 3.0.


Active machine learning (predictive coding) using support vector or logistic regression algorithms, and a method that employs continuous active training, using either one cylinder (their CAL) or four (our Hybrid Multimodal), really works, and is not that hard to use. Try it out and see for yourself. Also, read the Grossman Cormack article; it only takes about 30 minutes. Continuous Active Learning for TAR (Practical Law, April/May 2016). Feel free to leave any comments below. I dare say you can even ask questions of Grossman or Cormack here. They are avid readers and will likely respond quickly.

Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Three

December 8, 2013

This is part three of a three-part blog, so please read Part One and Part Two first.

The Losey Study on Inconsistencies Suggests a Promising Future for Active Machine Learning

The data from my Enron review experiment shows that a relatively high level of consistency in relevance determinations is possible. The comparatively high overlap results achieved in this study suggest that the problem of inconsistent human relevance determinations can be overcome. All it takes is hybrid multimodal search methods, good software with features that facilitate consistent coding, good SME(s), and systematic quality control efforts, including compliance with the less-is-more rule.

I am not saying good results cannot be achieved with multiple reviewers too. I am just saying it is more difficult that way. It is hard enough to be of one mind on something as tricky as document relevance decisions with just one reviewer. It is even more challenging to attain that level of attunement with many reviewers.

The results of my study are especially promising for reviews using active machine learning processes. Consistency of coding training documents is very important to avoid GIGO errors. That is because of the cascading effects of sensitivity to initial conditions that are inherent in machine learning. As mentioned, good software can smooth out inconsistency errors somewhat, but if the Jaccard index is too low, the artificial intelligence will be impacted, perhaps severely so. You will not find the right documents, not because there is anything wrong with the software, or anything wrong with your conception of relevance, but because you did not provide coherent instructions. You instead sent mixed messages that did not track your right conceptions. (But see the research reports of John Tredennick, CEO of Catalyst, whose chief scientist Jeremy Pickens, is investigating the ability of their software to attain good rankings in spite of inconsistent machine training.)
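For readers unfamiliar with the metric, the Jaccard index mentioned above is simply the size of the intersection of two reviews' relevant sets divided by the size of their union. A minimal sketch, with invented example sets:

```python
# Jaccard index as a measure of coding consistency between two reviews:
# the overlap of the relevant sets divided by their union.
def jaccard(review_a, review_b):
    a, b = set(review_a), set(review_b)
    return len(a & b) / len(a | b)

first_pass  = {"doc1", "doc2", "doc3", "doc4"}   # invented example sets
second_pass = {"doc2", "doc3", "doc4", "doc5"}
print(round(jaccard(first_pass, second_pass), 2))  # 3 shared of 5 total -> 0.6
```

A value of 1.0 means the two reviews marked exactly the same documents relevant; lower values mean more inconsistency for the classifier to absorb.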

The same thing can happen, of course, if your conceptions of relevance are wrong to begin with, as when you fail to use bona fide, objective SMEs to do the training. Even if the trainers’ message is consistent, it may be the consistently wrong message. The trainers do not understand what the real target is and do not know what it looks like, so of course they cannot find it.

The inexperienced reviewers lack the broad knowledge of the subject matter and the evidence required to prove the case, and they lack the necessary deep understanding to have a correct conception of relevance. In situations like that, despite all of the quality control efforts for consistency, you will still be consistently wrong in your training. (Again, but see the research of Catalyst, where what they admit are very preliminary test results seem to suggest that their software can fulfill the alchemist’s dream of turning lead into gold – of taking intentionally wrong input for training and still getting better results than manual review, and even some predictive coding. Tredennick, J., Subject Matter Experts: What Role Should They Play in TAR 2.0 Training? (November 17, 2013). I will continue to monitor their research with interest, as data must trump theories, but for now I remain skeptical. I am at a loss to understand how the fundamental principle of GIGO could be overcome. Does anyone else who has read the Catalyst reports have any insights or comments on their analysis?)

One information scientist I spoke with on the principle of GIGO and machine training, William Webber, explained that it might not matter too much if your trainer makes some mistakes, or even quite a few mistakes, if the documents they mistakenly mark as relevant nevertheless happen to contain similar vocabulary as the relevant documents. In that case the errors might not hurt the model of “a relevant vocabulary” too much. The errors will dilute the relevance model somewhat, but there may still be sufficient weight on the “relevant terms” for the overall ranking to work.

William further explained that training errors would seriously hurt the classification system in three situations (which he admits are a bit speculative). First, errors would be fatal where there is a specialized vocabulary that identifies relevant documents and the trainer is not aware of this language. In that case key language would never make it into the relevance model, and the software classification system could not predict that these documents were relevant. Second, errors would be fatal if the trainers have a systematically wrong idea of relevance (rather than just being inattentive or misreading borderline cases). In that case the model will be systematically biased (but this is presumably the easiest case to QC, assuming you have an SME available to do so). Third, errors would be fatal if the trainers flip too many relevant documents into the irrelevant class, so that the software classifier concludes that the “relevant vocabulary” is not really that strong an indicator of relevance after all. That is a situation where there is too much wrong information, where the training is too diluted by errors to work.

Consistency Between Reviews Even Without Horizontal Quality Control Efforts

In my Enron experiment with two separate reviews I intentionally used only internal, or vertical, quality control procedures. That is one reason the comparatively low 27% relevance inconsistency rate is so encouraging. There may have been some inconsistencies in coding within each project, but not in the coding of the same document, because the methods and software I used (Kroll Ontrack's Inview) made such errors easy to detect and correct. I made efforts to keep my document coding consistent within the confines of each project, but no efforts were made to make the coding consistent between the two review projects. In other words, in the second review I made no attempt to compare the decisions made in the first review nine months earlier. In fact, just the opposite was true. I avoided horizontal quality control procedures on purpose in the second project to protect the integrity of my experiment comparing the two types of search methods used. That comparison, after all, was the purpose of my experiment, not reviewer consistency.

I tried to eliminate carryover of any kind from one project to the next, even simple carryover like consulting notes or re-reading my first review report. I am confident that if I had employed quality controls between projects, the Jaccard index would have been even higher, because the single-reviewer error rate would have been reduced.

Another artificial reason the error rate between the two reviews might have been so high is that I used a different, inferior methodology in the second review. Again, that was inherent in the experiment to compare methods. But the second method, a monomodal review method that I called a modified Borg approach, was foreign to me, and one that I found quite boring. Further, the Borg method was not conducive to consistent document review because it involved skimming a high number of irrelevant documents. I read 12,000 Enron documents in the Borg review and only 2,500 in the first, multimodal review. Using my normal methods in the first review I found 597 relevant documents in the 2,500 documents read, a prevalence rate of about 24%. In the Borg review I found 376 relevant documents in the 12,000 documents read, a prevalence of only 3.1%. That kind of low-prevalence review is, I suspect, more likely to lead to careless errors.
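The prevalence arithmetic behind those percentages is simple but worth making explicit. A minimal sketch using the counts reported above (the function name is my own, purely illustrative):

```python
# Prevalence = relevant documents found / total documents read.
def prevalence(relevant_found: int, total_read: int) -> float:
    return relevant_found / total_read

# Counts reported in the article for the two Enron reviews.
multimodal = prevalence(597, 2500)    # first review, hybrid multimodal method
borg = prevalence(376, 12000)         # second review, monomodal "Borg" method

print(f"Multimodal prevalence: {multimodal:.1%}")  # → 23.9%
print(f"Borg prevalence:       {borg:.1%}")        # → 3.1%
```

The roughly eight-fold drop in prevalence is what makes the second review format so much more tedious, and arguably more error-prone.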

I am confident that if I had employed my same preferred hybrid multimodal methods in both reviews, that the consistency rate would have been even higher, even without additional quality control efforts. If I had done both, consistent methods and horizontal quality controls, the best results would have been attained.

In addition to improving consistency rates for a single reviewer, quality controls should also reduce inconsistencies among multiple reviewers, at least insofar as the SME's expertise can be transmitted to them. That in turn depends in no small part on whether the Grossman and Cormack theory of review error causation is true, that inconsistencies are due to mere human error, carelessness and the like, as opposed to prior theories that relevance is always inherently subjective. If the subjective relevance theories are true, then everyone will have no choice but to use just one SME, who had better be well tuned to the judge. But, as mentioned, I do not believe that relevance is inherently subjective, so I do think multiple reviewers can be used, so long as multiple safeguards and quality controls are in place. It will just be more difficult that way, and probably take longer.

How much more difficult, and how much longer, depends in part on the degree of subjectivity involved in the particular search project. I do not see the choice of competing theories as being all or nothing. Grossman and Cormack in their study concluded that only five percent of the relevance calls they made were subjective. It may well be higher than that on average, but there is no way it is all subjective. I think it varies according to the case and the issues. The more subjectivity involved in a project, the more that strong, consistent SME input is needed for machine training to work successfully.

Crowd Sourcing Does Not Apply to Most Predictive Coding Work

Some think that most relevance determinations are just subjective, so SMEs are not really needed. They think that contract review lawyers will work just as well; after all, they are usually intelligent generalists. They think that more is better, and they do not like the results of the studies I have discussed in this article, especially my own success as a Less is More, Army of One type predictive coder. They hang their theories on crowd sourcing and the wisdom of the crowd.

Crowd sourcing does work with some things, but not document review, and certainly not predictive coding. We are not looking for lost dogs here, where crowd sourcing does work. We are looking for evidence in what are often very complex questions. These questions, especially in large cases where predictive coding is common, are usually subject to many arcane rules and principles of which the crowd has no knowledge, or worse, has wrong knowledge. Multiple wrongs do not make a right.

Here is a key point to remember on the crowd sourcing issue: the judge makes the final decisions on relevance, not the jury. Crowd sourcing might help you to predict the final outcome of a jury trial; juries are, after all, like small crowds with no particular expertise, just instructions from the judge. Crowd sourcing will not, however, help you to predict how a judge will rule on legal issues. Study of the judge's prior rulings is a much better guide (perhaps along with, as some contend, what the judge had for breakfast). The non-skilled reviewers, the crowd, have little or nothing to offer in predicting an expert ruling. To put this mathematically, no matter how many zeros you add together, the total sum is always still zero.

Bottom line: you cannot crowd-source highly specialized skills. When it comes to specialized knowledge, the many are not always smarter than the few.

We all know this on a common sense level. Think about it: would you want a crowd of nurses to perform surgery on you, or would you insist on one skilled doctor? Of course you would want an SME surgeon to operate on you, not a crowd, a doctor who specializes in the kind of surgery you need and who has done it many times before. You cannot crowd-source specialized skills.

The current facile fascination with crowd sourcing is trendy to be sure, but misplaced when it comes to most of the predictive coding work I see. Some documents, often critical ones, are too tricky, too subtle, for all but an experienced expert to recognize their probative value. Even documents that are potentially critical to the outcome of a case can be missed by non-experts. Most researchers critiquing the SME theory of predictive coding do not seem to understand this. I think that is because most are not legal experts, not experienced trial attorneys. They fail to appreciate the complexity and subtle nuances of the law in general, and evidence in particular.

They also fail to apprehend the enormous differences in skill levels and knowledge between attorneys. The law, like society, is so complex now that lawyers are becoming almost as specialized as doctors. We can only know a few fields of law. Thus, for example, just as you would not want a podiatrist to perform surgery on your eye, you would not want a criminal lawyer to handle your breach of contract suit.

To provide another example, if it were an area of law in which I have no knowledge, such as immigration law, I could read a hot document and not even know it. I might even think it was irrelevant. I would lack the knowledge and frame of reference to grasp its significance. The kind of quick training that passes muster in most contract lawyer reviews would not make much of a difference. That is because of complexity, and because the best documents are often the unexpected ones, the ones that only an expert would realize are important when they see one.

In the course of my 35 years of document review I have seen many inexperienced lawyers fail to recognize or misunderstand key documents on numerous occasions, myself included in the early days, and, to be honest, sometimes even now (especially when I am not the first-level SME, but just a surrogate). That is why partners supervise and train young lawyers, day in and day out, for years. Although contract review lawyers may well have the search skills, be power users with great software skills, and otherwise be very smart and competent people, they lack the all-important specialized subject matter expertise. As mentioned before, other experiments have shown that subject matter expertise is the most important of the three skill sets needed for a good legal searcher. That is why you should not use contract lawyers to do machine training, at least in most projects. You should use SMEs. At the very least you should use an SME for quality control.


I will, however, concede that there may be some review projects where an SME is not needed at all, where multiple reviewers would work just fine; a divorce case, for instance, where all of the reviewers might have an equally keen insight into sexy emails, or sexting, and no SME is needed. Alas, I never see cases like that, but I concede they are possible. It could also work with simplistic topics and unrealistic hypotheticals. That may explain some of the seemingly contrary research results from Catalyst that rely on TREC data, not real-world, complex litigation data.

Conclusions Regarding Inconsistent Reviews

The data from the experiments on inconsistent reviews suggest that when only one human reviewer is involved, a reviewer who is also an experienced SME, the overall consistency rates in review are much higher than when multiple non-SME reviewers are involved (contract reviewers in the Roitblat, Kershaw and Oot study) (77% v. 16%), or even when multiple SMEs are involved (retired intelligence officers in the Voorhees study) (77% v. 45% with two SMEs and 30% with three SMEs). These comparisons are shown visually in this graph.


These results also suggest that with one SME reviewer the classification of irrelevant documents is nearly uniform (99%), and that the inconsistencies primarily lie in relevant categorizations (77% Jaccard) of borderline relevant documents. (A caveat should be made that this observation is based on unfiltered data, and not a keyword collection or data otherwise distorted with artificially high prevalence rates.)

The overall Agreement rate of 98%+ of all relevancy determinations, including irrelevant classifications where almost all classifications are easy and obvious, suggests that the very low Jaccard index rates measured in previous studies of 16% to 45% were more likely caused by human error, not document relevance ambiguity or genuine disagreement on the scope of relevance. A secondary explanation for the low scores is lack of significant subject matter expertise, such that the reviewers were not capable of recognizing a clearly relevant document when they saw one. Half of the TREC reviews were done by volunteer law students where such mistakes could easily happen. As I understand the analysis of Grossman and Cormack, they would consider this to be mere error, as opposed to a difference of opinion.

Even if you only consider the determinations of relevancy, and exclude determinations of irrelevancy, the 77% Jaccard index for one reviewer is still significantly greater than the prior 16% to 45% consistency rates. The data on inconsistencies from my experiment thus generally support the conclusions of Cormack and Grossman that most inconsistencies in document classifications are due to human error, not the presence of borderline documents or the inherent ambiguity of all relevancy determinations. Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012). Of the 3,274 different documents that I read in both projects during my experiment, only 63 were seen to be borderline, grey area types, which is less than 2%. The rest, 3,211 documents, were consistently coded. This is shown in the graph below.
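The gap between the Jaccard index (which looks only at relevance calls) and the overall agreement rate (which also counts the easy, nearly uniform irrelevance calls) can be made concrete with a small sketch. The document counts below are invented for illustration, not taken from the experiment:

```python
def jaccard(rel_a: set, rel_b: set) -> float:
    """Jaccard index over relevance calls: |A ∩ B| / |A ∪ B|."""
    return len(rel_a & rel_b) / len(rel_a | rel_b)

# Invented illustration: 100 documents coded by two reviewers.
all_docs = set(range(100))
rel_a = set(range(10))            # reviewer A marks docs 0-9 relevant
rel_b = set(range(8)) | {10, 11}  # reviewer B marks docs 0-7, 10, 11 relevant

# Documents on which the two codings differ (symmetric difference).
disagreements = rel_a ^ rel_b
agreement = (len(all_docs) - len(disagreements)) / len(all_docs)

print(f"Jaccard (relevant calls only): {jaccard(rel_a, rel_b):.0%}")  # → 67%
print(f"Overall agreement (all calls): {agreement:.0%}")              # → 96%
```

Because the bulk of any unfiltered collection is irrelevant and is coded consistently, overall agreement can approach 100% even while the relevance-only Jaccard score sits far lower; this is why the two metrics tell such different stories about the same pair of reviews.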


There were almost certainly more grey area relevant documents than 63 in the 3,274 documents reviewed. But they did not come to my attention in the post hoc analysis because my determinations in both projects were consistent in review of the other borderline documents. Still, the findings support the conclusions of Grossman and Cormack that less than 5% of documents in a typical unfiltered predictive coding review project are of a borderline grey area type. In fact, the data from my study supports the conclusion that only 2% of the total documents subject to relevance were grey area types, that 98% of the judgment calls were not subjective. I think this is a fair assessment for the unfiltered Enron data that I was studying, and the relatively simple relevance issue (involuntary employment termination) involved.

The percentage of grey area documents where the relevance determinations are subjective and arguable may well be higher than 5%. More experiments are needed and nothing is proven by only a few tests. Still, my estimate, based on general experience and the Enron tests, is that when you consider only relevant documents, subjective calls could run, on average, as high as 20%. (When considering all judgments, relevant and irrelevant, subjective calls are under 5%.) Certainly subjectivity is a minority cause of inconsistent relevance determinations.

The data does not support the conclusion that relevance adjudications are inherently subjective, or mere idiosyncratic decisions. I am therefore confident that our legal traditions rest on solid relevance ground, not quicksand.

But I also understand that this solid ground in turn depends on competence, legal expertise, and a clear objective understanding of the rules of law and equity, not to mention the rules of reason and common sense. That is what legal training is all about. It always seems to come back to that, does it not?

Disclosure of Irrelevant Training Documents

These observations, especially the high consistency of review of irrelevance classifications (99%), support the strict limitation of disclosure of irrelevant documents as part of a cooperative litigation discovery process. Instead, only documents that a reviewer knows are of a grey area type or likely to be subject to debate should be disclosed. Even then the disclosure need not include the actual documents, but rather a summary and dialogue on the issues raised.


During my experimental review projects of the Enron documents, much like my reviews in real-world legal practice that I cannot speak of, I was personally aware of the ambiguous type grey area documents when originally classifying these documents. They were obvious because it was difficult to decide if they were within the border of relevance, or not. I was not sure how a judge would rule on the issue. The ambiguity would trigger an internal debate where a close question decision would ultimately be made. It could also trigger quality control efforts, such as consultations with other SMEs about those documents, although that did not happen in my Enron review experiment. In practice it does happen.

Even when limiting disclosure of irrelevant documents to those that are known to be borderline, disclosure of the actual documents themselves may often be unnecessary. Instead, a summary of the documents with explanation of the rationale as to the ultimate determination of irrelevance may suffice. The disclosure of a description of the borderline documents will at least begin a relevancy dialogue with the requesting party. Only if the abstract debate fails to reach agreement should disclosure of the actual documents be required. Even then it could be done in camera to a neutral third-party, such as a judge or special master. Alternatively, disclosure could be made with additional confidentiality restrictions, such as redactions, pending a ruling by the court.


Some relevance determinations certainly do include an element of subjectivity, of flexibility, and the law is used to that. But not all. Only a small minority. Some relevance determinations are more opinion than fact. But not all. Only a small minority. Some relevance determinations are more art than science. But not all. Only a small minority. Therefore, consistent and reliable relevance determinations by trained legal experts are possible, especially when good hybrid multimodal methods are used, along with good quality controls. (Good software is also important, and, as I have said many times before, some software on the market today is far better than others.)

The fact that it is possible to attain consistent coding is good news for legal search in general and especially good news for predictive coding, with its inherent sensitivity to initial conditions and cascading effects. It means that it is possible to attain the kind of consistent training needed for active machine learning to work accurately and efficiently, even in complex real-world litigation.

The findings of the studies reviewed in this article also support the use of SMEs with in-depth knowledge of the legal subject, and the use of as few SMEs to do the review as possible – Less Is More. These studies also strongly suggest that the greatest consistency in document review comes from the use of only one SME. By the way, despite the byline in Monica Bay's article, EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov., 2013), that "Phase I of the study shows that older lawyers still have e-discovery chops and you don't want to turn EDD over to robots," the age of the lawyers is irrelevant. The best predictive coding trainers do not have to be old, they just have to be SMEs and have good search skills. In fact, not all SMEs are old, although many may be. It is the expertise and skills that matter, not age per se.

The findings and conclusions of the studies reviewed in this article also reinforce the need for strong quality control measures in large reviews where multiple reviewers must be used, such as second-pass reviews, or reviews led by traditionalists. This is especially true when the reviewers are relatively low-paid, non-SMEs. Quality controls detecting inconsistencies in coding and other possible human errors should be a part of all state-of-the-art software, and all legal search and review methodologies.

Finally, it is important to remember that good project management skills are important to the success of any project, including legal search. That is true even if you are talking about an Army of One, which is my thing. Skilled project management is even more important when hundreds of reviewers are involved. The effectiveness of any large-scale document review, including its quality controls, always depends on the project management.
