I hear a lot about how different software will find all relevant documents. That would be 100% recall. I also hear demands from requesting parties to find and produce all relevant documents. In the context of large, disorganized banks of electronic data, such as email collections, these claims and demands are not only contrary to the rules of law, embedded as they are in reasonability, but also unrealistic and contrary to the latest scientific research. In my Bottom Line Driven Proportional Review article I showed how this kind of demand for all relevant ESI is not permitted under the rules and the Doctrine of Proportionality in big data cases (and most cases these days are big data cases). I explained, as many have done before me, that the rules do not require production of all relevant documents if the burden to do so is disproportional. I also shared my method for keeping the costs of review proportional to the value and importance of the case and the production request. But cost aside, how practical is it to expect to find all relevant ESI? I examined this question at length in my Secrets of Search series, volumes one, two and three. Still, people find it hard to accept, especially in view of the unregulated clamor of the marketplace.
So as my last gift to readers before Legal Tech 2012 starts tomorrow, the ultimate event of marketplace claims and competing exaggerations, I present you with a hard dose of reality: more findings on legal search from the world of science. This time I direct you to an important article, Evaluation of Information Retrieval for E-Discovery, Artificial Intelligence and Law, 18(4):347-386 (2011). It was written by leaders of the TREC Legal Track and established giants in the field of legal search: Douglas W. Oard, Jason R. Baron, Bruce Hedin, David D. Lewis, and Stephen Tomlinson. They analyzed the now fully published test results of the 2008 experiments, and carefully examined the interactive task, Topic 103, as the best test of competing legal search technologies. This task made use of a subject matter expert, the topic authority (TA), and an appeals process for quality control on relevance determinations. Four teams of experts participated in the test, two academic and two commercial. A well-known e-discovery vendor won the test (scientists hate it when I put it that way). They won because they attained better precision and recall scores than the three other participants.
Now we come to the punch line: the winning vendor attained a recall rate of only 62%. That’s right, they missed 38% of the relevant documents. And they were the winner. Think about it. The other three participants in the scientific experiment attained recall rates of less than 20%! That’s right, they missed over 80% of the relevant documents. Now what do you think of a requesting party who demands that you produce all of the relevant email?
If you find my summary of the experiments hard to believe, then read the report for yourself. Here is the excerpt on which I rely, at page 24 of Evaluation of Information Retrieval for E-Discovery:
On the basis of the adjudicated sample assessments, we estimated that there are 786,862 documents (11.4% of the collection) relevant to Topic 103 in the test collection (as the topic was defined by the TA). All four teams attained quite high precision; point estimates ranged from 0.71 to 0.81. One team (notably the one that made the most use of TA time) attained relatively high recall (0.62), while the other three (all making significantly less use of TA time) obtained recall values below 0.20.
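For readers unused to these measures, recall and precision reduce to simple ratios over a review's hits and misses. Here is a minimal sketch, with hypothetical document counts chosen only to land near the winning team's reported scores (the counts below are illustrative, not drawn from the TREC collection):

```python
def recall(true_positives, false_negatives):
    """Fraction of all truly relevant documents that the review found."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives, false_positives):
    """Fraction of the documents produced that are truly relevant."""
    return true_positives / (true_positives + false_positives)

# Hypothetical review: 62 of 100 relevant documents found (38 missed),
# with 18 irrelevant documents mixed into the production.
print(recall(62, 38))     # 0.62  -- the winning team's recall
print(precision(62, 18))  # 0.775 -- within the reported 0.71 to 0.81 range
```

In other words, a 0.62 recall means 38% of the relevant documents were never found, no matter how clean the production otherwise looks.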
The team of information scientists, and their lawyer guide, Jason R. Baron, next report on the 2009 TREC experiments, specifically the one they found most representative, the interactive task, again with subject matter consultations and appeals. This time there were eleven teams participating in the experiment, three academic and eight commercial. That’s right, eight e-discovery vendors were in the game this time. How did they do? They did a little better, but not much. Only five of the submitted runs attained recall of 70% or better.
The post-adjudication results for the 2009 topics showed some encouraging signs. Of the 24 submitted runs (aggregating across all seven topics), 6 (distributed across 5 topics) attained an F1 score (point estimate) of 0.7 or greater. In terms of recall, of the 24 submitted runs, 5 (distributed across 4 topics) attained a recall score of 0.7 or greater; of these 5 runs, 4 (distributed across 3 topics) simultaneously attained a precision score of 0.7 or greater.
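The F1 score mentioned in this passage is the harmonic mean of precision and recall, a standard information retrieval measure that rewards balance: a run cannot clear an F1 of 0.7 through high precision alone. A short sketch (the sample values below are illustrative, not taken from the TREC runs):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; punishes imbalance."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.7, 0.7))   # 0.7   -- a balanced run clears the 0.7 bar
print(f1(0.95, 0.2))  # ~0.33 -- high precision cannot rescue low recall
```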
Id. at pgs. 24-25. If you follow the article’s direction and see the Overview of the TREC 2009 Legal Track, by B. Hedin, S. Tomlinson, J. Baron, and D. Oard, you can find more details of the 2009 test results. After you wade through the wonderfully dense language that information scientists love to use to convey information, you find section 2.3.5 Final Results. There you are pointed to a table of numbers: Table 6: Post-adjudication estimates of recall, precision, and F1.
What does this table tell us? The best anyone did was an 86.5% recall on one of the seven tasks. Look at the third column from the left for the recall rates attained. The lowest was 9%. Digging deeper, the analysts found that the teams with the highest scores appealed the most, and those with the lowest scores not at all. Consultation with the topic authority also helped improve scores. But the bottom line, for purposes of my point today, is that the average recall rate was only 41% (993/24), and even the best recall attained on one search, by one team of experts, was only 86.5%. Demands for recall in the 80s for every project are thus unrealistic.
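I read the "(993/24)" figure as the sum of the twenty-four runs' recall percentages divided by the number of runs; that reading is my assumption, and the per-run values from Table 6 are not reproduced here. On that assumption the arithmetic is this simple:

```python
# Assumed reading of the 41% figure: the 24 runs' recall percentages
# (from Table 6 of the TREC 2009 overview) sum to 993; dividing by
# the 24 runs gives the average recall across all submissions.
total_recall_percent = 993
num_runs = 24
print(total_recall_percent / num_runs)  # 41.375, i.e. about 41%
```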
The scientific research proves, once again, that it is unreasonable to ask for recall better than 70%; in fact, a reasonable target could be substantially lower. Law demands reasonable efforts, not perfection. The best recall results attained in scientific experiments, with the best software and top experts at the helm, set far too high a standard for reasonable efforts. Reasonability should look more like the average results attained by trained lawyers making good faith efforts, not the results attained by information scientists and specialists using the best software money can buy. So, unless the software improves, the 41% average of the experts in 2009 might be sufficient. Even a standard like that should be used with caution, and the efforts meter should always be tempered by costs. Proportionality of efforts, if they are in good faith and reasonable, should always trump any quality control efforts. See Bottom Line Driven Proportional Review.
In fairness to my vendor friends, the latest reports from TREC are dated. That was then, 2008 and 2009; this is now, 2012. The test scores showed substantial progress from 2008 to 2009. In my experience, predictive coding type search software has significantly improved in the last year or so. I have also heard unsubstantiated reports of much higher recall rates attained in the 2011 TREC Legal Track tests, but I take all of these claims with a big grain of salt. Until Dr. Oard and his information scientist crew (which, by the way, includes two lawyers, Jason Baron and Maura Grossman) publish results, obtuse as their publications are, I will remain skeptical. Right now the latest published scientific data (2009) shows that if you can find an estimated 41% of the relevant documents in a large collection of ESI, then you are doing just as well as the experts. That should be good enough to meet the reasonable efforts required under the law.
Be skeptical of any claims or demands for better results than that, at least for now. Stop chasing, or being chased by, unreasonable demands for high recall rates. The only way to attain rates of 70% or higher today is by document dumps, where precision plummets as you produce irrelevant documents, or perhaps by budget-busting, near-endless iterations of search and seed-set training.
Even then, your expensive pursuit is quixotic from the point of view of science, where the fuzziness measurement issue remains unresolved. Furthermore, and most importantly, in today’s world of big data, where everyone has 100,000 emails, it is wasteful in the extreme to try to find all relevant documents. If you are still trying to find them all, and not just the few super-relevant smoking guns, you have not understood that in today’s age relevant is irrelevant, nor that the ultimate goal of discovery is to prepare for trial, where the 7±2 rule of persuasion reigns supreme.