Please read Part One of this article before reading this second segment.
Contingency Table Background
A review of some of the basic concepts and terminology used in this article may be helpful before going further. It is also important to remember that ei-Recall is a method for measuring recall, not attaining recall. There is a fundamental difference. Many of my other articles have discussed search and review methods to achieve recall, but this one does not. See, e.g.:
- Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One, Part Two, Part Three, and Part Four.
- Predictive Coding and the Proportionality Doctrine: a Marriage Made in Big Data, 26 Regent U. Law Review 1 (2013-2014).
- Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three.
- Three-Cylinder Multimodal Approach To Predictive Coding.
This article is focused on the very different topic of measuring recall as one method among many to assure quality in large-scale document reviews.
Everyone should know that in legal search analysis False Negatives are documents that were falsely predicted to be irrelevant, that are in fact relevant. They are mistakes. Conversely, documents predicted irrelevant, that are in fact irrelevant, are called True Negatives. Documents predicted relevant that are in fact relevant are called True Positives. Documents predicted relevant that are in fact irrelevant are called False Positives.
These terms and formulas derived therefrom are set forth in the Contingency Table, a/k/a Confusion Matrix, a tool widely used in information science. Recall using these terms is the total number of relevant documents found, the True Positives (TP), divided by that same number, plus the total number of relevant documents not found, the False Negatives (FN). Recall is the percentage of total target documents found in any search.
| | Truly Non-Relevant | Truly Relevant |
|---|---|---|
| Coded Non-Relevant | True Negatives (“TN”) | False Negatives (“FN”) |
| Coded Relevant | False Positives (“FP”) | True Positives (“TP”) |
- The standard formula for Recall using contingency table values is: R = TP / (TP+FN).
- The standard formula for Prevalence is: P = (TP + FN) / (TP + TN + FP + FN)
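As a minimal sketch of the two formulas above, the following Python computes Recall and Prevalence from the four contingency table cells. The class name and the document counts are hypothetical, chosen only for illustration; they are not taken from any actual review.

```python
from dataclasses import dataclass


@dataclass
class ContingencyTable:
    """Four cells of the contingency table (all counts are hypothetical)."""
    tp: int  # coded relevant, truly relevant
    fp: int  # coded relevant, truly non-relevant
    tn: int  # coded non-relevant, truly non-relevant
    fn: int  # coded non-relevant, truly relevant

    def recall(self) -> float:
        # R = TP / (TP + FN)
        return self.tp / (self.tp + self.fn)

    def prevalence(self) -> float:
        # P = (TP + FN) / (TP + TN + FP + FN)
        return (self.tp + self.fn) / (self.tp + self.tn + self.fp + self.fn)


# Illustrative numbers only: a 100,000 document collection.
table = ContingencyTable(tp=750, fp=150, tn=98_850, fn=250)
print(f"Recall: {table.recall():.1%}")          # 75.0%
print(f"Prevalence: {table.prevalence():.1%}")  # 1.0%
```

In a real project the FN cell is never known exactly. It has to be estimated by sampling, which is where the interval ranges discussed in the rest of this article come in.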
General Background on Recall Formulas
Before I get into the examples and math for ei-Recall, I want to provide more general background. In addition, I suggest that you re-read my short description of an elusion test at the end of Part Three of Visualizing Data in a Predictive Coding Project. It provides a brief description of the other quality control applications of the elusion test for False Negatives. If you have not already done so, you should also read my entire article, In Legal Search Exact Recall Can Never Be Known.
I also suggest that you read John Tredennick’s excellent article: Measuring Recall in E-Discovery Review: A Tougher Problem Than You Might Realize, especially Part Two of that article. I give a big Amen to John’s tough problem insights.
For the more technical and mathematically minded, I suggest you read the works of William Webber, including his key paper on this topic, Approximate Recall Confidence Intervals (January 2013, Volume 31, Issue 1, pages 2:1–33) (free version in arXiv), and his many less formal and easier to understand blog posts on the topic: Why confidence intervals in e-discovery validation? (12/9/12); Why training and review (partly) break control sets, (10/20/14); Why 95% +/- 2% makes little sense for e-discovery certification, (5/25/13); Stratified sampling in e-discovery evaluation, (4/18/13); What is the maximum recall in re Biomet?, (4/24/13). Special attention should be given to Webber’s recent article on Roitblat’s eRecall, Confidence intervals on recall and eRecall (1/4/15), where it is tested and found deficient on several grounds.
My idea for a recall calculation that includes binomial confidence intervals, like most ideas, is not truly original. It is, as our friend Voltaire puts it, a judicious imitation. For instance, I am told that my proposal to use comparative binomial calculations to determine approximate confidence interval ranges follows somewhat the work of an obscure Dutch medical statistician, P. A. R. Koopman, in the 1980s. See: Koopman, Confidence intervals for the ratio of two binomial proportions, Biometrics 40: 513–517 (1984). Also see: Webber, William, Approximate Recall Confidence Intervals, ACM Transactions on Information Systems, Vol. V, No. N, Article A (October 2012); Duolao Wang, Confidence intervals for the ratio of two binomial proportions by Koopman’s method, Stata Technical Bulletin, 10-58, 2001.
As mentioned, the recall method I propose here is also similar to the eRecall method promoted by Herb Roitblat, except that mine avoids its fundamental defect. I include binomial intervals in the calculations to provide an elusion recall range, and his method does not. See Roitblat, Measurement in eDiscovery (2013). Herb’s method relies solely on point projections and disregards the ranges of both the Prevalence and False Negative projections. That is why no statistician will accept Roitblat’s eRecall, whereas ei-Recall has already been reviewed without objection by two of the leading authorities in the field, William Webber and Gordon Cormack.
ei-Recall is also a superior method because it is based on a specific number of relevant documents found at the end of the project, the True Positives (TP). That is not an estimated number. It is not a projection based on sampling where a confidence interval range and more uncertainty are necessarily created. True Positives in ei-Recall is the number of relevant documents in a legal document production (or privilege log). It is an exact number verified by multiple reviews and other quality control efforts set forth in steps six, seven and eight in Electronic Discovery Best Practices (EDBP), and then produced in step nine (or logged).
In a predictive coding review the True Positives as defined by ei-Recall are the documents predicted relevant, then confirmed to be relevant in second pass reviews, etc., and produced or logged. (Again see: Step 8 of the EDBP, which I call Protections.) The production is presumed to be a 100% precise production, or at least as close as is humanly possible, and to contain no False Positives. For that reason ei-Recall may not be appropriate in all projects. Still, it could also work, if need be, by estimating the True Positives. The fact that ei-Recall includes interval ranges in and of itself makes it superior to, and more accurate than, any other ratio method.
In the usual application of ei-Recall, only the number of relevant documents missed, the False Negatives, is estimated. The actual number of relevant documents found (TP) is divided by the sum of the projected range of False Negatives from the samples of the null set of each stratum, both high (FNh) and low (FNl), and the number of relevant documents found (TP). This method is summarized by the following formulas:
Formula for the lowest end of the recall range from the null set sample: Rl = TP / (TP+FNh).
Formula for the highest end of the recall range from the null set sample: Rh = TP / (TP+FNl).
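The two formulas above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the article calls only for a binomial confidence interval, and the exact (Clopper-Pearson) interval used here is my choice of one common variant. The function names and the example numbers are hypothetical.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def clopper_pearson(k: int, n: int, confidence: float = 0.95):
    """Exact binomial interval for k hits in n draws (assumed interval type)."""
    alpha = 1.0 - confidence

    def solve(cdf_at, target):
        # cdf_at(p) decreases as p grows, so bisect for cdf_at(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def ei_recall(tp: int, null_set_size: int, sample_size: int, sample_fn: int,
              confidence: float = 0.95):
    """Return (Rl, Rh) from one random sample of the null set."""
    low_rate, high_rate = clopper_pearson(sample_fn, sample_size, confidence)
    fn_low = low_rate * null_set_size    # FNl: fewest projected False Negatives
    fn_high = high_rate * null_set_size  # FNh: most projected False Negatives
    return tp / (tp + fn_high), tp / (tp + fn_low)


# Hypothetical project: 9,000 confirmed True Positives, a null set of 90,000
# documents, and 5 False Negatives found in a null set sample of 1,534.
r_low, r_high = ei_recall(tp=9_000, null_set_size=90_000, sample_size=1_534, sample_fn=5)
print(f"ei-Recall range: {r_low:.1%} to {r_high:.1%}")
```

Note that the high end of the False Negative range produces the low end of the recall range, and vice versa, exactly as in the Rl and Rh formulas.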
This is very different from the approach used by Herb Roitblat for eRecall. Herb’s approach is to sample the entire collection to calculate a point projection of the probable total number of relevant documents in the collection, which I will here call P. He then takes a second random sample of the null set to calculate the point projection of the probable total False Negatives contained in the null set (FN). Roitblat’s approach only uses point projections and ignores the interval ranges inherent in each sample. My approach uses one sample and includes its confidence interval range. Also, as mentioned, my approach uses a validated number of True Positives found at the end of a review project, and not a projection of the probable total number of relevant documents found (P). Although Herb never uses a formula per se in his paper, Measurement in eDiscovery, to describe his approach, if we use the above described definitions the formula for eRecall would seem to be: eR = P / (P + FN). (Note there are other speculations as to what Roitblat really intends here, as discussed in the comments to Webber’s blog on eRecall. One thing we know for sure is that although he may change the details of his approach, it never includes a recall range, just a spot projection.)
My approach of making two recall calculations, one for the low end, and another for the high end, is well worth the slight additional time to create a range. Overall the effort and cost of ei-Recall is significantly less than eRecall because only one sample is used in my method, not two. My method significantly improves the reliability of recall estimates and overcomes the defects inherent in ignoring confidence intervals found in eRecall and other methods such as the Basic Ratio Method and Global Method. See, e.g., Grossman & Cormack, Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ Federal Courts Law Review, Vol. 7, Issue 1 (2014) at 306-310.
The use of range values avoids the trap of using a point projection that may be very inaccurate. The point projections of eRecall may be way off from the true value, as was explained in detail in In Legal Search Exact Recall Can Never Be Known. Moreover, ei-Recall fits in well with the overall work flow of my current two-pass, CAL-based (continuous active learning), hybrid, multimodal search and review method.
Recall Calculation Methods Must Include Range
A fuller explanation of Herb Roitblat’s eRecall proposal, and other similar point projection based proposals, should help clarify the larger policy issues at play in the proposed alternative ei-Recall approach.
Again, I cannot accept Herb Roitblat’s approach to using an Elusion sample to calculate recall because he uses the point projection of prevalence and elusion only, and does not factor in the recall interval ranges. My reason for opposing this simplification was set out in detail in In Legal Search Exact Recall Can Never Be Known. It is scientifically and mathematically wrong to use point projections and not include ranges.
I note that industry leader John Tredennick also disagrees with Herb’s approach. See his recent article: Measuring Recall in E-Discovery Review: A Tougher Problem Than You Might Realize, Part Two. After explaining Herb’s eRecall John says this:
Does this work? Not so far as I can see. The formula relies on the initial point estimate for richness and then a point estimate for elusion.
I agree with John Tredennick in this criticism of Herb’s method. So too does Bill Dimm, who has a PhD in Physics and is the founder and CEO of Hot Neuron. Bill summarizes Herb’s eRecall method in his article, eRecall: No Free Lunch. He uses an example to show that eRecall does not work at all in low prevalence situations. Of course, all sampling is challenged by extremely low prevalence, even ei-Recall, but at least my interval approach does not hide the limitations of such recall estimates. There is no free lunch. Recall estimates are just one quality control effort among many.
Maura Grossman and Gordon Cormack also challenge the validity of Herb’s method. They refer to Roitblat’s eRecall as a specious argument. Grossman and Cormack make the same judgment about several other approaches that compare the ratios of point projections and show how they all suffer from a basic mathematical statistical error, which they call the Ratio Method Fallacy. Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ supra at 308-309.
In Grossman & Cormack’s Guest Blog: Talking Turkey (e-Discovery Team, 2014), they explained an experiment that they did and reported on in the Comments article, in which they repeatedly used Roitblat’s eRecall, the direct method, and other methods to estimate recall. They used a review known to have achieved 75% recall and 83% precision, from a collection with 1% prevalence. Their results showed that in this review “eRecall provides an estimate that is no better than chance.” That means eRecall was a complete failure as a quality assurance measure.
Although my proposed range method is a comparative Ratio Method, it avoids the fallacy of other methods criticized by Grossman and Cormack. It does so because it includes binomial probability ranges in the recall calculations and eschews the errors of point projection reliance. It is true that the range of recall estimates using ei-Recall may be still uncomfortably large in some low yield projects, but at least it will be real and honest, and, unlike eRecall, it is better than nothing.
No Legal Economic Arguments Justify the Errors of Simplified Point Projections
The oversimplified point projection ratio approach can lead to a false belief of certainty for those who do not understand probability ranges inherent in random samples. We presume that Herb Roitblat understands the probability range issues, but he chooses to simplify anyway on the basis of what appears to me to be essentially legal-economic arguments, namely proportionality cost-savings, and the inherent vagaries of legal relevance. Roitblat, The Pendulum Swings: Practical Measurement in eDiscovery.
I disagree strongly with Roitblat’s logic. As one scholar in private correspondence pointed out, Herb appears to fall victim to the classic fallacy of the converse. Herb asserts that “if the point estimate is X, there is a 50% probability that the true value is greater than X.” What *is* true (for an unbiased estimate) is that “if the true value is X, there is a 50% probability that the estimate is greater than X.” Assuming the latter implies the former is the classic fallacy of the converse. Think about it. It is a very good point. For a more obvious example of the fallacy of the converse consider this: “Most accidents occur within 25 miles from home; therefore, you are safest when you are far from home.”
Although I disagree with Herb Roitblat’s logic, I do basically agree with many of his non-statistical arguments and observations on document review, including, for instance, the following:
Depending on the prevalence of responsive documents and the desired margin-of-error, the effort needed to measure the accuracy of predictive coding can be more than the effort needed to conduct predictive coding.
Until a few years ago, there was basically no effort expended to measure the efficacy of eDiscovery. As computer-assisted review and other technologies became more widespread, an interest in measurement grew, in large part to convince a skeptical audience that these technologies actually worked. Now, I fear, the pendulum has swung too far in the other direction and it seems that measurement has taken over the agenda.
There is sometimes a feeling that our measurement should be as precise as possible. But when the measure is more precise than the underlying thing we are measuring, that precision gives a false sense of security. Sure, I can measure the length of a road using a yardstick and I can report that length to within a fraction of an inch, but it is dubious whether the measured distance is accurate to within even a half of a yard.
Although I agree with many of the points of Herb’s legal economic analysis in his article, The Pendulum Swings: Practical Measurement in eDiscovery, I disagree with the conclusion. The quality of the search software, and legal search skills of attorney-users of this software, have both improved significantly in the past few years. It is now possible for relatively high recall levels to be attained, even including ranges, and even without incurring extraordinary efforts and costs as Herb and others suggest. (As a side note, please notice that I am not opining on a specific minimum recall number. That is not helpful because it depends on too many variable factors unique to particular search projects. However, I would point out that in the TREC Legal Track studies in 2008 and 2009 the participants, expert searchers all, attained verified recall levels of only 20% to 70%. See The Legal Implications of What Science Says About Recall. All I am saying is that in my experience our recall efforts have improved and are continually improving as our software and skills improve.)
Further, although relevance and responsiveness can sometimes be vague and elusive as Roitblat points out, and human judgments can be wrong and inconsistent, there are quality control process steps that can be taken to significantly mitigate these problems, including the often overlooked better dialogues with the requesting party. Legal search is not an arbitrary exercise such that it is a complete waste of time to try to accurately measure recall.
I disagree with Herb’s suggestion to the contrary based on his evaluation of legal relevance judgments. He reaches this conclusion based on the very interesting study he did with Anne Kershaw and Patrick Oot on a large-scale document review that Verizon did nearly a decade ago. Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review. In that review Verizon employed 225 contract reviewers and a Twentieth Century linear review method wherein low paid contract lawyers sat in isolated cubicles and read one document after another. The study showed, as Herb summarizes it, that the reviewers agree with one another on relevance calls only about 50% of the time. Measurement in eDiscovery at pg. 6. He takes that finding as support for his contention that consistent legal review is impossible and so there is no need to bother with finer points of recall intervals.
I disagree. My experience as an attorney making judgments on the relevancy of documents since 1980 tells me otherwise. It is absurd, even insulting, to call legal judgment a mere matter of coin flipping. Yes, there are well-known issues with consistency in legal review judgments in large-scale reviews, but this just makes the process more challenging, more difficult, not impossible.
Although consistent review may be impossible if large teams of contract lawyers do linear review in isolation using yesterday’s technology, that does not mean consistent legal judgments are impossible. It just means the large team linear review process is deeply flawed. That is why the industry has moved away from the approaches used by the Verizon team review nearly ten years ago. We are now using predictive coding, small teams of SMEs and contract lawyers, and many new innovative quality control procedures, including soon, I hope, ei-Recall. The large team linear review approach of a decade ago, and other quality factors, were the primary causes of the inconsistencies seen in the Verizon approach, not the inherent impossibility of determining legal relevance.
Good Recall Results Are Possible Without Heroic Efforts
But You Do Need Good Software and Good Methods
Even with the consistency and human error challenges inherent in all legal review, and even with the ranges of error inherent in any valid recall calculation, it is, I insist, still possible to attain relatively high recall ranges in most projects. (Again, note that I will not commit to a specific general minimum range.) I am seeing better recall ranges attained in more and more of my projects and I am certainly not a mythical TAR-whisperer, as Grossman and Cormack somewhat tongue in cheek described lawyers who may have extraordinary predictive coding search skills. Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ at pg. 298. Any experienced lawyer with technology aptitude can attain impressive results in large-scale document reviews. They just need to use hybrid, multimodal, CAL-type, quality controlled, search and review methods. They also need to use proven, high quality, bona fide predictive coding software. I am able to teach this in practice with bright, motivated, hard-working, technology savvy lawyers.
Legal search is a new legal skill to be sure, just like countless others in e-discovery and other legal fields. I happen to find the search and review challenges more interesting than the large enterprise preservation problems, but they are both equally difficult and complex. TAR-whispering is probably an easier skill to learn than many others required today in the law. (It is certainly easier than becoming a dog whisperer like Cesar Millan. I know. I’ve tried and failed many times.)
Think of the many arcane choice of law issues U.S. lawyers have faced for over a century in our 50-state, plus federal law system. Those intellectual problems are more difficult than predictive coding. Think of the tax code, securities, M&A, government regulations, class actions. It is all hard. All difficult. But it can all be learned. Like everything else in the law, large-scale document review just requires a little aptitude, hard work and lots of legal practice. It is no different from any other challenge lawyers face. It just happens to require more software skills, sampling, basic math, and AI intuition than any other legal field.
On the other point of bona fide predictive coding software, while I will not name names, as far as I am concerned the only bona fide software on the market today uses active machine learning algorithms. It does not depend instead on some kind of passive learning process (although passive learning processes can also be quite effective, they are not predictive coding algorithms and, in my experience, do not provide as powerful a search tool). I am sorry to say that some legal review software on the market today falsely claims to have predictive coding features when, in fact, it does not. It is only passive learning, more like concept search than AI-enhanced search. With software like that, or even with good software where the lawyers use poor search and review methods, or do not really know what they are searching for (poor relevance scope), the efforts required to attain high recall ranges may indeed be very extensive and thus cost prohibitive, as Herb Roitblat argues. If your tools and/or methods are poor, it takes much longer to reach your goals.
One final point regarding Herb’s argument: I do not think sampling really needs to be as cost prohibitive as he and others suggest. As noted before in In Legal Search Exact Recall Can Never Be Known, one good SME and skilled contract review attorney can carefully review a sample of 1,534 documents for between $1,000 and $2,000. In large review projects that is hardly a cost prohibitive barrier. There is no need to be thinking in terms of small 385 document sample sizes, which create a huge margin of error of 5%. This is what Herb Roitblat and others do when suggesting that all sampling is ineffective anyway, so just ignore intervals and ranges. Any large project can afford a full sample of 1,534 documents to cut the interval in half to a 2.5% margin of error. Many can afford much larger samples to narrow the interval range even further, especially if the tools and methods used allow them to attain their recall range goals in a fast and effective manner.
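The sample sizes quoted above can be sanity-checked with the standard normal approximation for the margin of error of a proportion. This is only a sketch under assumptions not stated in the text: a 95% confidence level, the worst-case 50% prevalence, and no finite population correction. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(sample_size: int, confidence: float = 0.95) -> float:
    """Worst-case (p = 0.5) margin of error under the normal approximation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return z * math.sqrt(0.25 / sample_size)


for n in (385, 1534):
    print(f"n = {n}: +/- {margin_of_error(n):.1%}")
# n = 385: +/- 5.0%
# n = 1534: +/- 2.5%
```

Because the margin shrinks with the square root of the sample size, halving it takes roughly four times as many documents, which is why 1,534 is about four times 385.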
John Tredennick, who, like me, is an attorney, also disagrees with Herb’s legal-economic analysis in favor of eRecall, but John proposes a solution involving larger sample sizes, wherein the increased cost burden would be shifted onto the requesting party. Measuring Recall in E-Discovery Review: A Tougher Problem Than You Might Realize, Part Two. I do not disagree with John’s assertions in his article, and cost shifting may be appropriate in some cases. It is not, however, my intention to address the cost-shifting arguments here, or the other good points made in John’s article. Instead, my focus in the remaining Part Three of this blog series will be to provide a series of examples of ei-Recall in action. For me, and I suspect for many of you, seeing a method in action is the best way to understand it.
Summary of the Five Reasons ei-Recall is the new Gold Standard
Before moving on to the examples, I want to summarize what we have covered so far and go over the five main reasons ei-Recall is superior to all other recall methods. First, and most important, is the fact that ei-Recall calculates a recall range, and not just one number. As shown by In Legal Search Exact Recall Can Never Be Known, recall statements must include confidence interval range values to be meaningful. Recall should not be based on point projections alone. Therefore any recall calculation method must calculate both a high and low value. The ei-Recall method I offer here is designed for the correct high-low interval range calculations. That, in itself, makes it a significant improvement over all point projection recall methods.
The second advantage of ei-Recall is that it only uses one random sample, not two or more. This avoids the compounding of variables, uncertainties, and outlier events inherent in any system that uses multiple chance events, multiple random samples. The costs are also controlled better in a one sample method like this, especially since the one sample is of reasonable size. This contrasts with the Direct Method, which also uses one sample, but the sample has to be insanely large. That is not only very costly, but also introduces a probability of more human error in inconsistent relevancy adjudications.
The timing of the one sample in ei-Recall is another of its advantages. It is taken at the end of the project when the relevance scope has been fully articulated.
Another key advantage of ei-Recall is that the True Positives used for the calculation are not estimated, and are not projected by random samples. They are documents confirmed to be relevant by multiple quality control measures, including multiple reviews of these documents by humans, or computer, and often both.
Finally, ei-Recall has the advantage of simplicity, and ease of use. It can be carried out by any attorney who knows fractions. The only higher math required, the calculation of binomial confidence intervals, can be done by easily available online calculators. You do not need to hire a statistician to make the recall range calculations using ei-Recall.
To be continued.