This is a continuation of last week’s blog, An Elusive Dialogue on Legal Search: Part One where the Search Quadrant is Explained. The quadrant and random sampling are not as elusive as Peeta Mellark in The Hunger Games, but almost. Indeed, as most of us lawyers did not major in math or information science, these new techniques can be hard to grasp. Still, to survive the vicious games often played these days in litigation, we need to find a way. If we do, we can not only survive, we can win, even if we are from District 12 and the whole world is watching our every motion.
The emphasis in the second part of this essay is on quality controls and how such efforts, like search itself, must be multimodal and hybrid. We must use a variety of quality assurance methods – we must be multimodal. To use the Hunger Games analogy, we must use both bow and rope, and camouflage too. And we must employ both our skilled human legal intelligence and our computer intelligence – we must be hybrid; Man and machine, working together in perfect harmony, but with Man in charge. That is the only way to survive the Hunger Games of litigation in the 21st Century. The only way the odds will be ever in your favor.
Recall and Elusion
But enough fun with Hunger Games, Search Quadrant terminology, nothingness, and math; back to Herb Roitblat’s long comment on my earlier blog, Day Nine of a Predictive Coding Narrative.
Recall and Precision are the two most commonly used measures, but they are not the only ones. The right measure to use is determined by the question that you are trying to answer and by the ease of asking that question.
Recall and Elusion are both designed to answer the question of how complete we were at retrieving all of the responsive documents. Recall explicitly asks “of all of the responsive documents in the collection, what proportion (percentage) did we retrieve?” Elusion explicitly asks “What proportion (percentage) of the rejected documents were truly responsive?” As Recall goes up, we find more of the responsive documents, and Elusion necessarily goes down; there are fewer responsive documents left to find in the reject pile. For a given prevalence or richness, as the YY count goes up (raising Recall), the YN count has to go down (lowering Elusion). As the conversation around Ralph’s report of his efforts shows, it is often a challenge to measure recall.
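Herb’s tradeoff between Recall and Elusion can be sketched in a few lines of Python. The counts below are hypothetical, chosen by me purely for illustration; the point is only that, for a fixed number of responsive documents, every responsive document retrieved (YY) is one fewer left eluding in the reject pile (YN).

```python
# YY = responsive documents retrieved (true positives)
# YN = responsive documents missed (false negatives)
# NN = non-responsive documents correctly rejected (true negatives)

def recall(yy, yn):
    """Of all responsive documents, what fraction did we retrieve?"""
    return yy / (yy + yn)

def elusion(yn, nn):
    """Of all rejected documents, what fraction were truly responsive?"""
    return yn / (yn + nn)

# Hypothetical collection: 1,000 responsive docs among 100,000 total.
total, responsive = 100_000, 1_000
nn = total - responsive            # assume no false positives, for simplicity
for yy in (600, 800, 900):         # retrieving more responsive docs...
    yn = responsive - yy           # ...leaves fewer in the reject pile
    print(f"Recall {recall(yy, yn):.0%}  Elusion {elusion(yn, nn):.4%}")
```

As Recall climbs from 60% to 90% in this toy example, Elusion falls in lockstep, which is exactly the relationship Herb describes.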
The last line of Herb’s comment refers to prior comments made on my same Day Nine Narrative blog by two other information scientists, William Webber and Gordon Cormack. I am flattered that they all seem to read my blog, and make so many comments, although I suspect they may be master game-makers of sorts, like we saw in Hunger Games.
The earlier comments of Webber and Cormack pertained to point projection of yield and the lower and upper intervals derived from random samples, all things I was discussing in Day Nine. Gordon’s comments focused on the high end of possible interval error, and said you cannot know anything for sure about recall unless you assume the worst-case scenario, the high end of the confidence interval. This is true mathematically and scientifically, I suppose (to be honest, I do not really know if it is true or not, but I learned long ago not to argue science with a scientist, and they do not seem to be quibbling amongst themselves, yet). But it certainly is not true legally, where reasonability and acceptable doubt (a kind of level of confidence), such as a preponderance of the evidence, are always the standard, not perfection and certainty. It is not true in manufacturing quality controls either.
But back to Herb’s comment, where he picks up on their math points and elaborates concerning the Elusion test that I used for quality control.
Measuring recall requires you to know or estimate the total number of responsive documents. In the situation that Ralph describes, responsive documents were quite rare, estimated at around 0.13% prevalence. One method that Ralph used was to relate the number of documents his process retrieved with his estimated prevalence. He would take as his estimate of Recall, the proportion of the estimated number of responsive documents in the collection as determined by an initial random sample.
Unfortunately, there is considerable variability around that prevalence estimate. I’ll return to that in a minute. He also used Elusion when he examined the frequency of responsive documents among those rejected by his process. As I argued above, Elusion and Recall are closely related, so knowing one tells us a lot about the other.
One way to use Elusion is as an accept-on-zero quality assurance test. You specify the maximum acceptable level of Elusion, as perhaps some reasonable proportion of prevalence. Then you feed that value into a simple formula to calculate the sample size you need (published in my article in the Sedona Conference Journal, 2007). If none of the documents in that sample comes up responsive, then you can say with a specified level of confidence that responsive documents did not occur in the reject set at a higher rate than was specified. As Gordon noted, the absence of a responsive document does not prove the absence of responsive documents in the collection.
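Herb’s published formula is not reproduced in his comment, but the standard zero-acceptance sample-size calculation works like this (a sketch, with the 1% maximum-elusion target below chosen by me as an example, not taken from the narrative):

```python
import math

def accept_on_zero_sample_size(max_elusion, confidence=0.95):
    """Sample size n such that, if a random sample of n reject-pile
    documents contains zero responsive documents, we can say with the
    given confidence that the true elusion rate is below max_elusion.
    Standard zero-acceptance formula: n >= ln(1 - confidence) / ln(1 - p)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_elusion))

# A sample of ~299 docs with zero hits bounds elusion below 1% at 95% confidence.
print(accept_on_zero_sample_size(0.01))
```

Note how the required sample grows as the acceptable elusion rate shrinks; bounding very rare events takes very large samples, a theme that returns below.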
The Sedona Conference Journal article Herb referenced is called Search & Information Retrieval Science. Also, please recall that my narrative states, without using that exact language, that my accept-on-zero quality assurance test pertained to Highly Relevant documents, not merely relevant documents. I decided in advance that if my random sample of excluded documents included any Highly Relevant documents, then I would consider the test a failure and initiate another round of predictive coding. My standard for merely relevant documents was secondary and more malleable, depending on the probative value and uniqueness of any such false negatives. False negatives are what Herb calls YN, and what we also now know is called D in the Search Quadrant, with totals shown again below.
Back to Herb, who, by the way, looks a bit like President Snow, don’t you think? He is now going to start talking about Recall, which, as we now know, is A/G, a measure of accuracy that I did not directly make or claim.
If you want to directly calculate the recall rate after your process, then you need to draw a large enough random sample of documents to get a statistically useful sample of responsive documents. Recall is the proportion of responsive documents that have been identified by the process. The 95% confidence range around an estimate is determined by the size of the sample set. For example, you need about 400 responsive documents to know that you have measured recall with a 95% confidence level and a 5% confidence interval. If only 1% of the documents are responsive, then you need to work pretty hard to find the required number of responsive documents. The difficulty of doing consistent review only adds to the problem. You can avoid that problem by using Elusion to indirectly estimate Recall.
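Herb’s figure of about 400 responsive documents comes from the usual confidence-interval arithmetic for proportions. A short sketch (the 1% prevalence figure is Herb’s example; the Wald approximation is my simplifying assumption):

```python
import math

def half_width(n, p=0.5, z=1.96):
    """95% confidence interval half-width (Wald approximation)
    for a proportion estimated from a sample of n items."""
    return z * math.sqrt(p * (1 - p) / n)

# ~400 responsive documents give roughly a +/-5% interval on recall.
print(f"{half_width(400):.1%}")

# At 1% prevalence, a random sample must be ~100x larger
# to net 400 responsive documents:
print(400 / 0.01)   # ~40,000 documents to review just to measure recall
```

That last number is why Herb says you have to "work pretty hard" to measure recall directly in low prevalence collections.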
The Fuzzy Lens Problem Again
The reference to the difficulty of doing consistent review refers to the well-documented inconsistency of classification among human reviewers. That is what I called, in Secrets of Search, Part One, the fuzzy lens problem that makes recall such an ambiguous measure in legal search. It is ambiguous because, when large data sets are involved, the value for G (total relevant) is dependent upon human reviewers. The inconsistency studies show that the gold standard of measurement by human review is actually just dull lead.
Let me explain again in shorthand, and please feel free to refer to the Secrets of Search trilogy and the original studies for the full story. Roitblat’s own well-known study of a large-scale document review showed that human reviewers agreed with each other, on average, only 28% of the time. Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010. An earlier study by one of the leading information scientists in the world, Ellen M. Voorhees, found a 40% agreement rate between human reviewers. Variations in relevance judgments and the measurement of retrieval effectiveness, 36:5 Information Processing & Management 697, 701 (2000). Voorhees concluded that with 40% agreement rates it was not possible to measure recall any higher than 65%. Information scientist William Webber calculated that with a 28% agreement rate recall cannot be reliably measured above 44%. Herb Roitblat and I dialogued about this issue before in Reply to an Information Scientist’s Critique of My “Secrets of Search” Article.
I prepared the graphics below to illustrate this problem of measurement and the futility of recall calculations when the measurements are made by inconsistent reviewers.
Until we can crack the inconsistent reviewer problem, we can only measure recall vaguely, as we see on the left, or at best the center, and can only make educated guesses as to the reality on the right. The existence of the error has been proven, but as Maura Grossman and Gordon Cormack point out, there is a dispute as to the cause of the error. In one analysis that they did of TREC results they concluded that the inconsistencies were caused by human error, not a difference of opinion on what was relevant or not. Inconsistent Responsiveness Determination in Document Review: Difference of Opinion or Human Error? But, regardless of the cause, the error remains.
Back to Herb’s Comment.
One way to assess what Ralph did is to compare the prevalence of responsive documents in the set before doing predictive coding with their prevalence after using predictive coding to remove as many of the responsive documents as possible. Is there a difference? An ideal process will have removed all of the responsive documents, so there will be none left to find in the reject pile.
That question of whether there is a difference leads me to my second point. When we use a sample to estimate a value, the size of the sample dictates the size of the confidence interval. We can say with 95% confidence that the true score lies within the range specified by the confidence interval, but not all values are equally likely. A casual reader might be led to believe that there is complete uncertainty about scores within the range, but values very near to the observed score are much more likely than values near the end of the confidence interval. The most likely value, in fact, is the center of that range, the value we estimated in the first place. The likelihood of scores within the confidence interval corresponds to a bell-shaped curve.
This is a critical point. It means that the point projections, a/k/a, the spot projections, can be reliably used. It means that even though you must always qualify any findings that are based upon random sampling by stating the applicable confidence interval, the possible range of error, you may still reliably use the observed score of the sample in most data sets, if a large enough sample size is used to create low confidence interval ranges. Back to Herb’s Comment.
Moreover, we have two proportions to compare, which affects how we use the confidence interval. We have the proportion of responsive documents before doing predictive coding. The confidence interval around that score depends on the sample size (1507) from which it was estimated. We have the proportion of responsive documents after predictive coding. The confidence interval around that score depends on its sample size (1065). Assuming that these are independent random samples, we can combine the confidence intervals (consult a basic statistics book for a two-sample z or t test, or http://facstaff.unca.edu/dohse/Online/Stat185e/Unit3/St3_7_TestTwoP_L.htm), and determine whether these two proportions are different from one another (0.133% vs. 0.095%). When we do this test, even with the improved confidence interval, we find that the two scores are not significantly different at the 95% confidence level. (Try it for yourself here: http://www.mccallum-layton.co.uk/stats/ZTestTwoTailSampleValues.aspx.) In other words, the predictive coding done here did not significantly reduce the number of responsive documents remaining in the collection. The initial proportion 2/1507 was not significantly higher than 1/1065. The number of responsive documents we are dealing with in our estimates is so small, however, that a failure to find a significant difference is hardly surprising.
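For readers who prefer code to the linked calculators, the pooled two-proportion z-test Herb describes can be run directly on his numbers (2 hits in 1,507 before, 1 hit in 1,065 after). This is a sketch of the standard textbook test, not a reproduction of the calculators’ internals:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test statistic for H0: p1 == p2."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                        # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))  # pooled standard error
    return (p1 - p2) / se

z = two_proportion_z(2, 1507, 1, 1065)   # before vs. after predictive coding
print(f"z = {z:.2f}")                    # well below the 1.96 needed at 95%
```

The statistic comes out around 0.3, nowhere near the 1.96 threshold, which is Herb’s point: with counts this small, no significant difference can be detected.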
Herb’s paragraph above appears to me to have assumed that my final quality control test was a test for Recall, and uses the upper limit, the worst-case scenario, as the defining measurement. Again, as I said in the narrative and in replies to other comments, I was testing for Elusion, not Recall. Further, the Elusion test (D/F) here was for Highly Relevant documents, not relevant ones, and none were found: 0%. None were found in the first random sample at the beginning of the project, and none were found in the second random sample at the end. The yields referred to by Herb are for relevant documents, not Highly Relevant. The value of D, False Negatives, in the Elusion test was thus zero. As we have discussed, when the numerator in a fraction is zero, the result of the division is also always zero, which, in an Elusion test, is exactly what you are looking for. You are looking for nothing and happy to find it.
The final sentence in Herb’s last paragraph is key to understanding his comment: The number of responsive documents we are dealing with in our estimates is so small, however, that a failure to find a significant difference is hardly surprising. It points to the inherent difficulty of using random sampling measurements of recall in low yield document sets where the prevalence is low. But there is still some usefulness for random sampling in these situations as the conclusion of his Comment shows.
Still, there is other information that we can glean from this result. The difference in the two proportions is approximately 28%. Predictive coding reduced by 28% the number of responsive documents unidentified in the collection. Recall, therefore, is also estimated to be 28%. Further, we can use the information we have to compute the precision of this process as approximately 22%. We can use the total number of documents in the collection, prevalence estimates, and elusion to estimate the entire 2 x 2 decision matrix.
For eDiscovery to be considered successful we do not have to guarantee that there are no unidentified responsive documents, only that we have done a reasonable job searching for them. The observed proportions do have some confidence interval around them, but they remain as our best estimate of the true percentage of responsive documents both before predictive coding and after. We can use this information and a little basic algebra to estimate Precision and Recall without the huge burden of measuring Recall directly.
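Herb’s “little basic algebra” for Recall can be sketched as follows. This is my hedged reconstruction using the rounded prevalence figures from his comment; the simplifying assumption that nearly the whole collection ends up in the reject pile (so the reject-pile rate stands in for Elusion) is mine:

```python
# Responsive rate before predictive coding (0.133%) and the
# responsive rate remaining in the reject pile afterward (0.095%).
prevalence_before = 0.00133
elusion_after     = 0.00095

# If almost everything lands in the reject pile, the fraction of
# responsive documents removed by the process approximates Recall.
recall_estimate = 1 - elusion_after / prevalence_before
print(f"Estimated recall ~ {recall_estimate:.1%}")   # ~28.6%
```

The drop in prevalence, about 28%, is the same figure Herb reports, which shows how Elusion plus prevalence can stand in for a direct, and far more expensive, Recall measurement.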
These are great points by Herb Roitblat regarding reasonability. They show how lawyer-like he has become after working with our kind for so many years, rather than with professor types, like my brother, as he did in the first half of his career. Herb now well understands the difference between law and science and what it means for legal search.
Law is not a Science, and Neither Is Legal Search
To understand the numbers and the need for reasonable efforts that accept high margins of error, we must understand the futility of increasing sample sizes to try to cure the upper limit of confidence. William Webber in his Comment of August 6, 2012 at 10:28 pm said that “it is, unfortunately, very difficult to place a reassuring upper bound on a very rare event using random sampling.” (emphasis added) Dr. Webber goes on to explain that to attain even a 50% confidence interval would require a final quality control sample of 100,000 documents. Remember, there were only 699,082 documents to begin with, so that is obviously no solution at all. It is about as reassuring as the Hunger Games slogan, may the odds be ever in your favor, when we all know that all but one of the 24 tributes must die.
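For the curious, the arithmetic behind Dr. Webber’s caution is the standard upper bound for a sample that turns up zero positives (at 95% confidence this is the familiar “rule of three,” roughly 3/n). A sketch, with the sample sizes below chosen by me for illustration:

```python
import math

def upper_bound_zero_positives(n, confidence=0.95):
    """Exact binomial upper bound on the true rate when a random
    sample of n documents contains zero positives:
    p_upper = 1 - (1 - confidence)**(1/n)."""
    return 1 - (1 - confidence) ** (1 / n)

for n in (1_065, 10_000, 100_000):
    print(f"n={n:>7}: 95% upper bound {upper_bound_zero_positives(n):.4%}")
```

Even a clean sample of 1,065 documents only bounds the rate below about 0.28%, which is more than double the 0.13% prevalence in the narrative; pushing the bound down to truly reassuring levels takes samples in the tens or hundreds of thousands, just as Webber says.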
Aside from the practical cost and time issues, the fuzzy lens problem of poor human judgments also makes the quest for reassuring bounds of error a fool’s errand. The perfection is illusory. It cannot be attained, or more correctly put, if you do attain high recall in a large data set, you will never be able to prove it. Do not be fooled by the slogans and the flashy, facile analysis.
Fortunately, the law has long recognized the frailty of all human endeavors. The law necessarily has different standards for acceptable error and risks than does math and science. The less-than-divine standards also apply to manufacturing quality control where small sample sizes have long been employed for acceptable risks. There too, like in a legal search for relevance, the prevalence of defective items sampled for is typically very low.
Math and science demand perfection. But the law does not. We demand reasonability and good faith, not perfection. Some scientists may think that we are settling, but it is more like practical realism, and it is certainly far better than unreasonable efforts and bad faith. Unlike science and math, the law is used to uncertainties. Lawyers and judges are comfortable with that. For example, we are reassured enough to allow civil judgments when a judge or jury decides that it is more likely than not that the defendant is at fault, a 51% standard of proof. Law and justice demand reasonable efforts, not perfection.
I know Herb Roitblat agrees with me because this is the fundamental thesis of the fine paper he wrote with two lawyers, Patrick Oot and Anne Kershaw, entitled Mandating Reasonableness in a Reasonable Inquiry. At pages 557-558 they sum up, saying (footnote omitted):
We do not suggest limiting the court system’s ability to discover truth. We simply anticipate that judges will deploy more reasonable and efficient standards to determine whether a litigant met his Rule 26(g) reasonable inquiry obligations. Indeed, both the Victor Stanley and William A. Gross Construction decisions provide a primer for the multi-factor analysis that litigants should invoke to determine the reasonableness of a selected search and review process to meet the reasonable inquiry standard of Rule 26(f): 1. Explain how what was done was sufficient; 2. Show that it was reasonable and why; 3. Set forth the qualifications of the persons selected to design the search; 4. Carefully craft the appropriate keywords with input from the ESI’s custodians as to the words and abbreviations they use; and 5. Use quality control tests on the methodology to assure accuracy in retrieval and the elimination of false positives.
As to the fifth criterion, which we are discussing here, of quality control tests, Roitblat, Oot and Kershaw assert in their article at page 551 that: “A litigant should sample at least 400 results of both responsive and non-responsive data.” This is the approximate sample size when using a 95% confidence level and a 5% confidence interval. (Note that in my sampling I used less than a 3% confidence interval, with a much larger sample size of 1,065 documents.) To support the assertion that a sample size of 400 documents is reasonable, the authors in footnote 77 refer to an email they have on file from Maura Grossman regarding legal search of data sets in excess of 100,000 documents, which concluded with the statement:
Therefore, it seemed to me that, for the average matter with a large amount of ESI, and one which did not warrant hiring a statistician for a more careful analysis, a sample size of 400 to 600 documents should give you a reasonable view into your data collection, assuming the sample is truly randomly drawn.
Personally, I think a larger sample size than 400-600 documents is needed for quality control tests in large cases. The efficacy of this small calculated sample size using a 5% confidence interval assumes a prevalence of 50%; in other words, that half of the documents sampled are relevant. This is an obvious fiction in all legal search, just as it is in all sampling for defective manufactured goods. That is why I sampled 1,065 documents, using a 3% confidence interval. Still, in smaller cases, it may be very appropriate to sample just 400-600 documents using a 5% interval. It all depends, as I will elaborate further in the conclusion.
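The sample sizes discussed here all fall out of one classic formula, shown below as a sketch. Note how the 50% prevalence assumption is baked in, and how halving the interval quadruples the sample, the “2=4” rule of thumb:

```python
import math

def sample_size(interval, z=1.96, p=0.5):
    """Classic sample-size formula n = z^2 * p(1-p) / E^2, using the
    worst-case (and, for legal search, fictional) 50% prevalence."""
    return math.ceil(z**2 * p * (1 - p) / interval**2)

print(sample_size(0.05))    # ~385: the basis for the '400 documents' figure
print(sample_size(0.03))    # ~1068: close to the 1,065 used in my narrative
print(sample_size(0.015))   # ~4269: halving 3% to 1.5% quadruples the sample
```

The formula makes plain why ever-tighter intervals quickly become uneconomical: precision is purchased at a quadratic price.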
But regardless, all of these scholars of legal search make the valid point that only reasonable efforts are required in quality control sampling, not perfection. We have to accept the limited usefulness of random sampling alone as a quality assurance tool because of the margins of error inherent in sampling of the low prevalence data sets common in legal search. Fortunately, random sampling is not our only quality assurance tool. We have many other methods to show reasonable search efforts.
Going Beyond Reliance on Random Sampling Alone to a Multimodal Approach
Random sampling is not a magic cure-all that guarantees quality, or definitively establishes the reasonability of a search, but it helps. In low yield datasets, where there is a low percentage of relevant documents in the total collection, the value of random sampling for Recall is especially suspect. The comments of our scientist friends have shown that. There are inherent limitations to random sampling.
Ever increasing sample sizes are not the solution, even if that was affordable and proportionate. Confidence intervals in sampling of less than two or three percent are generally a waste of time and money. (Remember the sampling statistics rule of thumb of 2=4 that I have explained before wherein a halving of confidence interval error rate, say from 3% to 1.5%, requires a quadrupling of sample size.) Three or four percent confidence interval levels are more appropriate in most legal search projects, perhaps even the 5% interval used in the Mandating Reasonableness article by Roitblat, Oot and Kershaw. Depending on the data set itself, prevalence, other quality control measures, complexity of the case, and the amount at issue, say less than $1,000,000, the five percent based small sample size of approximately 400 documents could well be adequate and reasonable. As usual in the law, it all depends on many circumstances and variables.
The issue of inconsistent reviews between reviewers, the fuzzy lens problem, necessarily limits the effectiveness of all large-scale human reviews. The sample sizes required to make a difference are extremely large. No such reviews can be practically done without multiple reviewers and thus low agreement rates. The gold standard for review of large samples like this is made of lead, not gold. Therefore, even if cost was not a factor, large sample sizes would still be a waste of time.
Moreover, in the real world of legal review projects, there is always a strong component of vagary in relevance. Maybe that was not true in the 2009 TREC experiment, as Grossman and Cormack’s study suggests, but it has been true in the thousands of messy real-world lawsuits that I have handled in the past 32 years. All trial lawyers I have spoken with on the subject agree.
Relevance can be, and usually is, a fluid and variable target depending on a host of factors, including changing legal theories, changing strategies, changing demands, new data, and court rulings. The only real gold standard in law is a judge ruling on specific documents. Even then, they can change their mind, or make mistakes. A single person, even a judge, can be inconsistent from one document to another. See Grossman & Cormack, Inconsistent Responsiveness Determination at pgs. 17-18 where a 2009 TREC Topic Authority contradicted herself 50% of the time when re-examining the same ten documents.
We must realize that random sampling is just one tool among many. We must also realize that even when random sampling is used, Recall is just one measure of accuracy among many. We must utilize the entire 2 x 2 decision matrix.
We must consider the possible applicability of all of the measurements that the search quadrant makes possible, not just recall.
- Recall = A/G
- Precision = A/C
- Elusion = D/F
- Fallout = B/H
- Agreement = (A+E)/I
- Prevalence = G/I
- Miss Rate = D/G
- False Alarm Rate = B/C
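The whole 2 x 2 matrix and the measures above can be computed together in a few lines. The quadrant counts here are hypothetical, chosen by me for illustration, and Agreement is computed as overall accuracy, (A+E)/I, consistent with the letter scheme where G=A+D, C=A+B, F=D+E, H=B+E, and I is the entire collection:

```python
# Hypothetical quadrant counts: A = relevant retrieved, B = irrelevant
# retrieved, D = relevant missed, E = irrelevant correctly rejected.
A, B, D, E = 80, 120, 20, 9_780

C = A + B   # all retrieved
F = D + E   # all rejected
G = A + D   # all relevant
H = B + E   # all irrelevant
I = G + H   # entire collection

measures = {
    "Recall":           A / G,
    "Precision":        A / C,
    "Elusion":          D / F,
    "Fallout":          B / H,
    "Agreement":        (A + E) / I,
    "Prevalence":       G / I,
    "Miss Rate":        D / G,
    "False Alarm Rate": B / C,
}
for name, value in measures.items():
    print(f"{name:<17} {value:.2%}")
```

Note how, in a low prevalence collection like this one (1%), Agreement looks superb (over 98%) even though Precision is a mediocre 40%; no single measure tells the whole story, which is the point of the list above.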
No doubt we will develop other quality control tests, for instance using Prevalence as a guide or target for relevant search, as I described in my seven-part Search Narrative. Just as we must use multimodal search efforts for effective search of large-scale data sets, so too must we use multiple quality control methods when evaluating the reasonability of search efforts. Random sampling is just one tool among many, and, based on the math, maybe not the best method at that, regardless of whether it is for recall, or elusion, or any other binary search quadrant measure.
Just as keyword search must be supplemented by the computer intelligence of predictive coding, so too must random-sample-based quality analysis be supplemented by skilled legal intelligence. That is what I call a Hybrid approach. The best measure of quality is to be found in the process itself, coupled with the people and software involved. A judge called upon to review the reasonability of a search should look at a variety of factors, such as:
- What was done and by whom?
- What were their qualifications?
- What rules and disciplined procedures were followed?
- What measures were taken to avoid inconsistent calls?
- What training was involved?
- What happened during the review?
- Which search methods were used?
- Was it multimodal?
- Was it hybrid, using both human and artificial intelligence?
- How long did it take?
- What did it cost?
- What software was used?
- Who developed the software?
- How long has the software been used?
These are just a few questions that occur to me off the top of my head. There are surely more. Last year in Part Two of Secrets of Search I suggested nine characteristics of what I hope would become an accepted best practice for legal review. I invited peer review and comments on what I may have left out, or any challenges to what I put in, but so far this list of nine remains unchallenged. We need to build on this to create standards so that quality control is not subject to so many uncertainties.
Jason R. Baron, William Webber, myself, and others keep saying this over and over, and yet the Hunger Games of standardless discovery goes on. Without these standards we may all fall prey at any time to a vicious sneak attack by another contestant in the litigation games. A contest that all too often feels like a fight to the death, rather than a cooperative pursuit of truth and justice. It has become so bad now that many lawyers snicker just to read such a phrase.
The point here is, you have to look at the entire process, and not just focus on taking random samples, especially ones that claim to measure recall in low yield collections. By the way, I submit that almost all legal search is of low yield collections, not just searches related to employment law, as some have suggested. Those who think the contrary have too broad a concept of relevance, and little or no understanding of actual trials, cumulative evidence, and the modern big data koan, “relevant is irrelevant.” Even though random sampling is not The Answer we once thought, it should be part of the process. For instance, a random sample Elusion test that finds no Highly Relevant documents should remain an important component of that process.
The no-holds-barred Hunger Games approach to litigation must end now. If we all join together, this will end in victory, not defeat. It will end with alliances and standards. Whatever district you hail from, join us in this noble quest. Turn away from the commercial greed of winning-at-all-costs. Keep your integrity. Keep the faith. Renounce the vicious games, both hide-the-ball and extortion. The world is watching. But we are up for it. We are prepared. We are trained. The odds are ever in our favor. Salute all your colleagues who turn from the games and the leadership of greed and oppression. Salute all who join with us in the rebellion for truth and justice.