EDITORS NOTE: This is a guest blog by Gordon V. Cormack, Professor, University of Waterloo, and Maura R. Grossman, Of Counsel, Wachtell, Lipton, Rosen & Katz. The views expressed herein are solely those of the authors and should not be attributed to Maura Grossman’s law firm or its clients.
This guest blog constitutes the first public response by Professor Cormack and Maura Grossman, J.D., Ph.D., to articles published by one vendor, and others, that criticize their work. In the Editor’s opinion the criticisms are replete with misinformation and thus unfair. For background on the Cormack and Grossman study in question, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, and the Editor’s views on this important research see: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One and Part Two and Part Three. After remaining silent for some time in the face of constant vendor potshots, Professor Cormack and Dr. Grossman feel that a response is now necessary. They choose to speak at this time in this blog because, in their words:
We would have preferred to address criticism of our work in scientifically recognized venues, such as academic conferences and peer-reviewed journals. Others, however, have chosen to spread disinformation and to engage in disparagement through social media, direct mailings, and professional meetings. We have been asked by a number of people for comment and felt it necessary to respond in this medium.
Guest Blog: TALKING TURKEY
OrcaTec, the eDiscovery software company started by Herbert L. Roitblat, attributes to us the following words at the top of its home page: “Not surprisingly, costs of predictive coding, even with the use of relatively experienced counsel for machine-learning tasks, are likely to be substantially lower than the costs of human review.” These words are not ours. We neither wrote nor spoke them, although OrcaTec attributes them to our 2011 article in the Richmond Journal of Law and Technology (“JOLT article”).
[Ed. Note: The words were removed shortly after blog was published.]
A series of five OrcaTec blog posts (1, 2, 3, 4, 5) impugning our 2014 articles in SIGIR and Federal Courts Law Review (“2014 FCLR article”) likewise misstates our words, our methods, our motives, and our conclusions. At the same time, the blog posts offer Roitblat’s testimonials—but no scientific evidence—regarding the superiority of his, and OrcaTec’s, approach.
As noted in Wikipedia, “a straw man is a common type of argument and is an informal fallacy based on the misrepresentation of an opponent’s argument. To be successful, a straw man argument requires that the audience be ignorant or uninformed of the original argument.” First and foremost, we urge readers to avoid falling prey to Roitblat’s straw man by familiarizing themselves with our articles and what they actually say, rather than relying on his representations as to what they say. We stand by what we have written.
Second, we see no reason why readers should accept Roitblat’s untested assertions, absent validation through the scientific method and peer review. For example, Roitblat claims, without providing any scientific support, that:
- “Although in their investigation, they do find that their implementation of random sampling yields notably poorer Recall than the other methods they employ, that difference does not extend to other implementations. In particular, it does not extend to OrcaPredict”;
- “The random sampling training regimen used by OrcaTec, for example, achieves higher levels of Recall with less training than Cormack and Grossman’s best learning algorithm”;
- “Good estimates of Recall can be obtained by evaluating a few hundred documents rather than the many thousands that could be needed for traditional measures of Recall.”
These claims are testable hypotheses, the formulation of which is the first step in distinguishing science from pseudo-science; but Roitblat declines to take the essential step of putting his hypotheses to the test in controlled studies.
Overall, Roitblat’s OrcaTec blog posts represent a classic example of truthiness. In the following paragraphs, we outline some of the misstatements and fallacious arguments that might leave the reader with the mistaken impression that Roitblat’s conclusions have merit.
With Us or Against Us?
Our 2011 JOLT article concluded:

Overall, the myth that exhaustive manual review is the most effective—and therefore, the most defensible—approach to document review is strongly refuted. Technology-assisted review can (and does) yield more accurate results than exhaustive manual review, with much lower effort. Of course, not all technology-assisted reviews (and not all manual reviews) are created equal. The particular processes found to be superior in this study are both interactive, employing a combination of computer and human input. While these processes require the review of orders of magnitude fewer documents than exhaustive manual review, neither entails the naïve application of technology absent human judgment. Future work may address which technology-assisted review process(es) will improve most on manual review, not whether technology-assisted review can improve on manual review (emphasis added; original emphasis in bold).
The particular processes shown to be superior, based on analysis of the results of the Interactive Task of the TREC 2009 Legal Track, were an active learning method employed by the University of Waterloo, and a rule-based method employed by H5. Despite the fact that OrcaTec chose not to participate in TREC, and their method—which employs neither active learning nor a rule base—is not one of those shown by our study to be superior, OrcaTec was quick to promote TREC and our JOLT article as scientific evidence for the effectiveness of their method.
In his OrcaTec blog posts following the publication of our SIGIR and 2014 FCLR articles, however, Roitblat espouses a different view. In Daubert, Rule 26(g) and the eDiscovery Turkey, he states that the TREC 2009 data used in the JOLT and SIGIR studies “cannot be seen as independent in any sense, in that the TREC legal track was overseen by Grossman and Cormack.” Notwithstanding his argumentum ad hominem, the coordinators of the TREC 2009 Legal Track included neither of us. Cormack was a TREC 2009 participant, who directed the Waterloo effort, while Grossman was a “Topic Authority,” who neither knew Cormack at the time, nor had any role in assessing the Waterloo effort. It was not until 2010 that Cormack and Grossman became Legal Track coordinators.
Roitblat’s change of perspective perhaps owes to the fact that our SIGIR article is critical of random training for technology-assisted review (“TAR”), and our 2014 FCLR article is critical of “eRecall,” both methods advanced by Roitblat and employed by OrcaTec. But nothing about TREC 2009 or our JOLT study has changed in the intervening years, and the OrcaTec site continues—even at the time of this writing—to (mis)quote our work as evidence of OrcaTec’s effectiveness, despite Roitblat’s insistence that OrcaTec bears no resemblance to anything we have tested or found to be effective. The continuous active learning (“CAL”) system we tested in our SIGIR study, however, does resemble the Waterloo system shown to be more effective than manual review in our JOLT study. If OrcaTec bears no resemblance to the CAL system—or indeed, to any of the others we have tested—on what basis has OrcaTec cited TREC 2009 and our JOLT study in support of the proposition that their TAR tool works?
Apples v. Oranges
Contrary to the aphorism, “you can’t compare apples to oranges,” you certainly can, provided that you use a common measure like weight in pounds, price in dollars per pound, or food energy in Calories. Roitblat, in comparing his unpublished results to our peer-reviewed results, compares the shininess of an apple in gloss units with the sweetness of an orange in percent sucrose equivalent. The graph above, reproduced from the first of the five Roitblat blogs, shows three dots placed by Roitblat over four “gain curves” from our SIGIR article. Roitblat states (emphasis added):
The x-axis shows the number of training documents that were reviewed. The y-axis shows the level of Recall obtained.
This may be true for Roitblat’s dots, but for our gain curves, on which his dots are superimposed, the x-axis shows the total number of documents reviewed, including both the training and review efforts combined. Dots on a graph reflecting one measure, placed on top of curves reflecting a different measure, convey no more information than paintball splats.
For OrcaTec’s method, the number of training documents is tiny compared to the number of documents identified for subsequent review. Small wonder the dots are so far to the left. For a valid comparison, Roitblat would have to move his dots way to the right to account for the documents subject to subsequent review, which he has disregarded. Roitblat does not disclose the number of documents identified for review in the matters reflected by his three dots. We do know, however, that in the Global Aerospace case, OrcaTec was reported to achieve 81% recall with 5,000 training documents, consistent with the placement of Roitblat’s green dot. We also know that roughly 173,000 documents were identified for second-pass review. Therefore, in an apples-to-apples comparison with CAL, a dot properly representing Global Aerospace would be at the same height as the green dot, but 173,000 places farther to the right—far beyond the right edge of Roitblat’s graph.
Of course, even if one were to compare using a common measure, there would be little point, due to the number of uncontrolled differences between the situations from which the dots and gain curves were derived. Only a valid, controlled comparison can convey any information about the relative effectiveness of the two approaches.
In The Science of Comparing Learning Protocols—Blog Post II on the Cormack & Grossman Article, Roitblat seeks to discredit our SIGIR study so as to exempt OrcaTec from its findings. He misrepresents the context of our words in the highlighted quote below, claiming that they pertain to the “gold standard” we used for evaluation:
Here I want to focus on how the true set, the so-called “gold standard” was derived for [four of the eight] matters [Cormack and Grossman] present. They say that for the “true” responsiveness values “for the legal-matter-derived tasks, we used the coding rendered by the first-pass reviewer in the course of the review. Documents that were never seen by the first-pass reviewer (because they were never identified as potentially responsive) were deemed to be coded as non-responsive” (emphasis added).
As may be seen from our SIGIR article at page 155, the words quoted above do not refer to the gold standard at all, but to a deliberately imperfect “training standard” used to simulate human review. Our gold standard used a statistical sampling technique for the entire collection known as the Horvitz-Thompson estimator, a technique that has gained widespread acceptance in the scientific community since its publication, in 1952, in the Journal of the American Statistical Association.
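For readers unfamiliar with the Horvitz-Thompson estimator, the sketch below illustrates the core idea: each sampled document is weighted by the inverse of its inclusion probability, which yields an unbiased estimate even when different strata are sampled at very different rates. This is a toy illustration only; the collection sizes and inclusion probabilities are invented for the example and are not those of our study.

```python
import random

random.seed(7)

# Toy collection: 100,000 docs, 2,000 truly relevant. The "truth" is known
# here only so the estimate can be checked; all sizes are invented.
N = 100_000
relevant = set(range(2_000))
# A review "retrieves" 9,000 docs: 1,500 relevant + 7,500 non-relevant,
# so the true recall is 1,500 / 2,000 = 0.75.
retrieved = set(range(1_500)) | set(range(2_000, 9_500))

# Unequal-probability design: sample the retrieved docs at a higher rate
# than the much larger unretrieved remainder.
pi = {d: (0.10 if d in retrieved else 0.01) for d in range(N)}
sample = [d for d in range(N) if random.random() < pi[d]]  # Poisson sampling

# Horvitz-Thompson estimates: weight each sampled doc by 1 / (its inclusion
# probability). The estimator is unbiased despite the unequal sampling rates.
est_relevant = sum(1 / pi[d] for d in sample if d in relevant)
est_relevant_retrieved = sum(1 / pi[d] for d in sample
                             if d in relevant and d in retrieved)
est_recall = est_relevant_retrieved / est_relevant
```

With these invented sampling rates, the estimate of the 2,000 relevant documents, and of the 0.75 recall, lands close to the truth; the estimator’s variance depends on the inclusion probability assigned to each stratum.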
Apparently, to bolster his claims, Roitblat also provides a column of numbers titled “Precision,” on the right side of the table reproduced below.
We have no idea where these numbers came from—since we did not report precision in our SIGIR article—but if these numbers are intended to reflect the precision achieved by the CAL process at 90% recall, they are simply wrong. The correct numbers may be derived from the information provided in Table 1 (at page 155) and Figure 1 (at page 157) of our SIGIR article.
While we make no claim that our study is without limitations (see Section 7.5 at page 161 of our SIGIR article), Roitblat’s special pleading regarding the real or imagined limitations of our study provides no support for his claim that random training (using the OrcaTec tool in particular) achieves superior results to active learning. If Roitblat believes that a different study would show a contrary result to ours, he should conduct such a study, and submit the results for peer review.
Although we have been described by Roitblat as “CAR vendors” with a “vested interest in making their algorithm appear better than others,” we have made freely available our TAR Evaluation Toolkit, which contains the apparatus we used to conduct our SIGIR study, including the support vector machine (“SVM”) learning algorithm, the simulation tools, and four of the eight datasets. Researchers are invited to reproduce our results—indeed, we hope, to improve on them—by exploring other learning algorithms, protocols, datasets, and review tasks. In fact, in our SIGIR article at page 161, we wrote:
There is no reason to presume that the CAL results described here represent the best that can be achieved. Any number of feature engineering methods, learning algorithms, training protocols, and search strategies might yield substantive improvements in the future.
Roitblat could easily use our toolkit to test his claims, but he has declined to do so, and has declined to make the OrcaTec tool available for this purpose. We encourage other service providers to use the toolkit to evaluate their TAR tools, and we encourage their clients to insist that they do, or to conduct or commission their own tests. The question of whether Vendor X’s tool outperforms the free software we have made available is a hypothesis that may be tested, not only for OrcaTec, but for every vendor.
Since SIGIR, we have expanded our study to include the 103 topics of the RCV1-v2 dataset, with prevalences ranging from 0.0006% (5 relevant documents in 804,414) to 47.4% (381,000 relevant documents in 804,414). We used the SVMlight tool and word-based tf-idf tokenization strategy that the RCV1-v2 authors found to be most effective. We used the topic descriptions, provided with the dataset, as keyword “seed queries.” We used the independent relevance assessments, also provided with the dataset, as both the training and gold standards. The results—on 103 topics—tell the same story as our SIGIR paper, and will appear—once peer reviewed—in a forthcoming publication.
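For readers unfamiliar with word-based tf-idf weighting, a minimal sketch follows. It is illustrative only; the exact weighting and normalization used for RCV1-v2 differ in detail, and the three sample “documents” are invented for the example.

```python
from collections import Counter
from math import log

# Three invented mini-documents for illustration.
docs = [
    "technology assisted review of documents",
    "manual review of documents",
    "machine learning for document review",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many docs does each word appear?
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc):
    """Weight each term by (term frequency) * log(N / document frequency):
    frequent-in-this-doc but rare-across-docs terms score highest."""
    tf = Counter(doc)
    return {t: tf[t] * log(N / df[t]) for t in tf}

vectors = [tfidf(d) for d in tokenized]
```

Note that “review” appears in all three documents, so its idf is log(1) = 0 and it carries no weight, while a word unique to one document, like “technology,” gets the full log(N) idf.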
We were dumbfounded by Roitblat’s characterization of our 2014 FCLR article:
Schieneman and Gricks argue that one should measure the outcome of eDiscovery efforts to assess their reasonableness, and Grossman and Cormack argue that such measurement is unnecessary under certain conditions.
What we actually wrote was:

[Schieneman and Gricks’] exclusive focus on a particular statistical test, applied to a single phase of a review effort, does not provide adequate assurance of a reasonable production, and may be unduly burdensome. Validation should consider all available evidence concerning the effectiveness of the end-to-end review process, including prior scientific evaluation of the TAR method, its proper application by qualified individuals, and proportionate post hoc sampling for confirmation purposes (emphasis added).
Roitblat doubles down on his straw man, asserting that we eschew all measurement and insisting that our metaphor of cooking a turkey is inconsistent with the position he falsely attributes to us. We have never said—nor do we believe—that measurement is unnecessary for TAR. In addition to pointing out the necessity of ensuring that the method is sound and is properly applied by qualified individuals, we state (at page 312 of our 2014 FCLR article) that it is necessary to ensure “that readily observable evidence—both statistical and non-statistical—is consistent with the proper functioning of the method.”
The turkey-cooking metaphor appears at pages 301-302 of our 2014 FCLR article:
When cooking a turkey, one can be reasonably certain that it is done, and hence free from salmonella, when it reaches a temperature of at least 165 degrees throughout. One can be reasonably sure it has reached a temperature of at least 165 degrees throughout by cooking it for a specific amount of time, depending on the oven temperature, the weight of the turkey, and whether the turkey is initially frozen, refrigerated, or at room temperature. Alternatively, when one believes that the turkey is ready for consumption, one may probe the turkey with a thermometer at various places. Both of these approaches have been validated by biological, medical, and epidemiological evidence. Cooking a turkey requires adherence, by a competent cook, to a recipe that is known to work, while observing that tools like the oven, timer, and thermometer appear to behave properly, and that the appearance, aroma, and texture of the turkey turn out as expected. The totality of the evidence—vetting the method in advance, competently and diligently applying the method, and monitoring observable phenomena following the application of the method—supports the reasonable conclusion that dinner is ready.
Roitblat reproduces our story, and then argues that it is inconsistent with his mischaracterization of our position:
They argue that we do not need to measure the temperature of the turkey in order to cook it properly, that we can be reasonably sure if we roast a turkey of a specific weight and starting temperature for a specific time at a specific oven temperature. This example is actually contrary to their position. Instead of one measure, using a meat thermometer to assess directly the final temperature of the meat, their example calls on four measures: roasting time, oven temperature, turkey weight, and the bird’s starting temperature to guess at how it will turn out. . . . To be consistent with their argument, they would have to claim that we would not have to measure anything, provided that we had a scientific study of our oven and a qualified chef to oversee the cooking process.
In our story, the turkey chef would need to ensure—through measurement and other observations—that the turkey was properly cooked, in order to avoid the risk of food poisoning. The weight of most turkeys sold in the U.S. is readily observable on the USDA label because it has been measured by the packer, and it is reasonable to trust that information. At the same time, a competent chef could reasonably be expected to notice if the label information were preposterous; for example, six pounds for a full-sized turkey. If the label were missing, nothing we have ever said would even remotely suggest that the chef should refrain from weighing the turkey with a kitchen scale—assuming one were available—or even a bathroom scale, if the alternative were for everyone to go hungry. Similarly, if the turkey were taken from a functioning refrigerator, and were free of ice, a competent chef would know the starting temperature with a margin of error that is inconsequential to the cooking time. Any functioning oven has a thermostat that measures and regulates its temperature. It is hard to imagine our chef having no ready access to some sort of timepiece with which to measure cooking time. Moreover, many birds come with a built-in gizmo that measures the turkey’s temperature and pops up when the temperature is somewhat more than 165 degrees. It does not display the temperature at all, let alone with a margin of error and confidence level, but it can still provide reassurance that the turkey is done. We have never suggested that the chef should refrain from using the gizmo, but if it pops up after one hour, or the turkey has been cooking for seven hours and it still has not popped up, they should not ignore the other evidence. And, if the gizmo is missing when the turkey is unwrapped, our chef can still cook dinner without running out to buy a laboratory thermometer.
The bottom line is that there are many sources of evidence—statistical and otherwise—that can tell us whether a TAR process has been reasonable.
Your Mileage May Vary
Roitblat would have us believe that science has no role to play in determining which TAR methods work, and which do not. In his fourth blog post, Daubert, Rule 26(g) and the eDiscovery Turkey, he argues that there are too many “[s]ources of variability in the eDiscovery process”; that every matter and every collection is different, and that “[t]he system’s performance in a ‘scientific study’ provides no information about any of these sources of variability. . . .” The same argument could be made about crash testing or EPA fuel economy ratings, since every accident, every car, every road, and every driver is also different.
The EPA’s infamous disclaimer, “your mileage may vary,” captures the fact that it is impossible to predict with certainty the fuel consumption of a given trip. But it would be very difficult indeed to find a trip for which a Toyota Prius consumed more fuel than a Hummer H1. And it would be a very good bet that, for your next trip, you would need less gas if you chose the Prius.
Manufacturers generally do not like controlled comparisons, because there are so few winners and so many also-rans. So it is with automobiles, and so it is with eDiscovery software. On the other hand, controlled comparisons help consumers and the courts to determine which TAR tools are reliable.
We have identified more than 100 instances—using different data collections with different prevalences, different learning algorithms, and different feature engineering methods—in which controlled comparison demonstrates that continuous active learning outperforms simple passive learning, and none in which simple passive learning prevails. Neither Roitblat, nor anyone else that we are aware of, has yet identified an instance in which OrcaTec prevails, in a controlled comparison, over the CAL implementation in our toolkit.
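The shape of such a controlled comparison can be sketched in a few dozen lines. The following toy simulation is illustrative only: a synthetic collection and a simple naive-Bayes-style scorer stand in for real data and the SVM in our toolkit, and every number in it is invented. It pits a CAL protocol against simple passive learning (random training) under the same total review budget.

```python
import random
from collections import Counter
from math import log

random.seed(3)

# Synthetic collection: 2,000 docs of 20 tokens each; 100 relevant (5% prevalence).
# Relevant docs draw most of their tokens from a small "topic" vocabulary.
TOPIC, OTHER = list(range(10)), list(range(10, 50))

def make_doc(rel):
    topic_rate = 0.7 if rel else 0.05
    return [random.choice(TOPIC) if random.random() < topic_rate
            else random.choice(OTHER) for _ in range(20)]

labels = [True] * 100 + [False] * 1_900
docs = [make_doc(r) for r in labels]
BUDGET = 150    # total documents a human may review, training included
SEED_DOC = 0    # one known-relevant seed, as if found by a keyword query

def make_scorer(reviewed):
    """Naive-Bayes-style log-likelihood-ratio scorer built from reviewed docs,
    with add-one smoothing over the 50-token vocabulary."""
    pos, neg = Counter(), Counter()
    for i in reviewed:
        (pos if labels[i] else neg).update(docs[i])
    p_tot, n_tot = sum(pos.values()) + 50, sum(neg.values()) + 50
    return lambda doc: sum(log((pos[t] + 1) / p_tot) - log((neg[t] + 1) / n_tot)
                           for t in doc)

def top_ranked(reviewed, k):
    """The k highest-scoring not-yet-reviewed docs under the current model."""
    seen = set(reviewed)
    score = make_scorer(reviewed)
    rest = [i for i in range(len(docs)) if i not in seen]
    return sorted(rest, key=lambda i: score(docs[i]), reverse=True)[:k]

def cal_recall(batch=10):
    """Continuous active learning: review top-scored docs, retrain, repeat."""
    reviewed = [SEED_DOC]
    while len(reviewed) < BUDGET:
        reviewed += top_ranked(reviewed, min(batch, BUDGET - len(reviewed)))
    return sum(labels[i] for i in reviewed) / 100

def spl_recall(train=75):
    """Simple passive learning: train once on random docs, review top-ranked rest."""
    reviewed = random.sample(range(len(docs)), train)
    reviewed += top_ranked(reviewed, BUDGET - train)
    return sum(labels[i] for i in reviewed) / 100

cal, spl = cal_recall(), spl_recall()
```

Under this budget, CAL spends nearly every review on a document the current model believes is relevant, while SPL spends most of its random training reviews on non-relevant documents at 5% prevalence; that structural difference, not the particular scorer, drives the gap.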
In his fifth blog post, Daubert, Rule 26(g) and the eDiscovery Turkey: Tasting the eDiscovery Turkey, Part 2, Roitblat first claims that “[g]ood estimates of Recall can be obtained by evaluating a few hundred documents rather than the many thousands that could be needed for traditional measures of Recall,” but later admits that eRecall is a biased estimate of recall, “like a clock that runs a little fast or slow.” Roitblat further admits, “eRecall has a larger confidence interval than directly measured Recall because it involves the ratio of two random samples.” Roitblat then wonders “why [we] think that it is necessary to assume that the two measures [eRecall and the “direct method” of estimating recall] have the same confidence interval [(i.e., margin of error)].”
Our assumption came from representations made by Roitblat in Measurement in eDiscovery—A Technical White Paper:
Rather than exhaustively assessing a large random sample of thousands of documents [as required by the direct method], with the attendant variability of using multiple reviewers, we can obtain similar results by taking advantage of the fact that we have identified putatively responsive and putatively non-responsive documents. We use that information and the constraints inherent in the contingency table to evaluate the effectiveness of our process. Estimating Recall from Elusion can be called eRecall (emphasis added).
Our “mistake” was in taking Roitblat’s use of “similar results” to imply that an estimate of recall using eRecall would have a similar accuracy, margin of error, and confidence level to one obtained by the direct method; that is, unbiased, with a margin of error of ±5%, and a confidence level of 95%.
eRecall misses this mark by a long shot. If you set the confidence level to 95%, the margin of error achieved by eRecall is vastly larger than ±5%. Alternatively, if you set the margin of error to ±5%, the confidence level is vastly inferior to 95%, as illustrated below.
Table 2 at page 309 of our 2014 FCLR article (reproduced below) shows the result of repeatedly using eRecall, the direct method, and other methods to estimate recall for a review known to have achieved 75% recall and 83% precision, from a collection with 1% prevalence.
To achieve a margin of error of ±5%, at the 95% confidence level, the estimate must fall between 70% and 80% (±5% of the true value) at least 95% of the time. From the fourth column of the table one can see that the direct method falls within this range 97.5% of the time, exceeding the standard for 95% confidence. eRecall, on the other hand, falls within this range a mere 8.9% of the time. If the recall estimate had been drawn at random from a hat containing all estimates from 0% to 100%, the result would have fallen within the required range 10% of the time—more often than eRecall. Therefore, for this review, eRecall provides an estimate that is no better than chance.
How large does the margin of error need to be for eRecall to achieve a 95% confidence level? The fifth and sixth columns of the table show that one would need to enlarge the target range to include all values between 0% and 100%, for eRecall to be able to hit the target 95% of the time. In other words, eRecall provides no information whatsoever about the true recall of this review, at the 95% confidence level. On the other hand, one could narrow the target range to include only the values between 70.6% and 79.2%, and the direct method would still hit it 95% of the time, consistent with a margin of error slightly better than ±5%, at the 95% confidence level.
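The qualitative finding is easy to reproduce in simulation. The sketch below is an illustration with assumed sample sizes—38,500 documents for the direct method, two samples of 300 for eRecall—not the exact design from our article. It repeatedly estimates recall for a review with 75% true recall, 83% precision, and 1% prevalence, and counts how often each method lands within ±5% of the truth.

```python
import random

random.seed(11)

# Assumed scenario: 1,000,000-doc collection, 1% prevalence, reviewed with
# 75% recall and 83% precision. Sample sizes below are illustrative.
N = 1_000_000
R = 10_000                      # relevant docs (1% prevalence)
TP = 7_500                      # relevant docs retrieved (75% true recall)
retrieved = round(TP / 0.83)    # retrieved-set size implied by 83% precision
null_size = N - retrieved       # the unreviewed "null set"
FN = R - TP                     # relevant docs remaining in the null set

def binomial(n, p):
    """Sample a binomial count (normal approximation when the mean is large)."""
    mu = n * p
    if mu >= 30:
        sigma = (n * p * (1 - p)) ** 0.5
        return max(0, min(n, round(random.gauss(mu, sigma))))
    return sum(random.random() < p for _ in range(n))

def direct_recall(n=38_500):
    """Direct method: sample the collection (~385 relevant docs expected) and
    estimate the fraction of sampled relevant docs that were retrieved."""
    rel = binomial(n, R / N)
    rel_retrieved = binomial(rel, TP / R)
    return rel_retrieved / rel if rel else 0.0

def e_recall(n=300):
    """eRecall: estimate elusion and prevalence from two small samples, then
    infer recall from their ratio."""
    elusion = binomial(n, FN / null_size) / n
    prevalence = binomial(n, R / N) / n
    if prevalence == 0:
        return 0.0                        # estimate undefined; score as a miss
    est = 1 - (elusion * null_size) / (prevalence * N)
    return min(max(est, 0.0), 1.0)

def hit_rate(estimator, trials=2_000, true_recall=0.75, margin=0.05):
    """Fraction of trials in which the estimate falls within +/- margin."""
    hits = sum(abs(estimator() - true_recall) <= margin for _ in range(trials))
    return hits / trials

direct_hits = hit_rate(direct_recall)
e_hits = hit_rate(e_recall)
```

With these assumptions, the direct method hits the ±5% window the overwhelming majority of the time, while eRecall rarely does: at 1% prevalence, a 300-document elusion sample expects fewer than one relevant document, so the ratio of two tiny counts is dominated by noise.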
In short, the direct method provides a valid—albeit burdensome—estimate of recall, and eRecall does not.
Roitblat repeatedly puts words in our mouths to attack positions we do not hold in order to advance his position that one should employ OrcaTec’s software and accept—without any scientific evidence—an unsound estimate of its effectiveness. Ironically, one of the positions that Roitblat falsely attributes to us is that one should not measure anything. Yet, we have spent the better part of the last five years doing quantitative research—measuring—TAR methods.
We are convinced that sound quantitative evaluation is essential to inform the choice of tools and methods for TAR, to inform the determination of what is reasonable and proportionate, and to drive improvements in the state of the art. We hope that our studies so far—and our approach, as embodied in our TAR Evaluation Toolkit—will inspire others, as we have been inspired, to seek even more effective and more efficient approaches to TAR, and better methods to validate those approaches through scientific inquiry.
Our next steps will be to expand the range of datasets, learning algorithms, and protocols we investigate, as well as to investigate the impact of human factors, stopping criteria, and measures of success. We hope that information retrieval researchers, service providers, and consumers will join us in our quest, by using our toolkit, by allowing us to evaluate their efforts using our toolkit, or by conducting scientific studies of their own.