Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search – Part Three

January 18, 2015

Please read Part One and Part Two of this article before reading this third and final segment.

First Example of How to Calculate Recall Using the ei-Recall Method

Let us begin with the same simple hypothetical used in In Legal Search Exact Recall Can Never Be Known. Here we assume a review project of 100,000 documents. By the end of the search and review, when we could no longer find any more relevant documents, we decided to stop and run our ei-Recall quality assurance test. We had by then found and verified 8,000 relevant documents, the True Positives. That left 92,000 documents presumed irrelevant that would not be produced, the Negatives.

As a side note, the decision to stop may be somewhat informed by running estimates of the possible recall range attained, based on early prevalence assumptions from a sample of all documents taken at or near the beginning of the project. The prevalence based recall range estimate would not, however, be the sole driver of the decision to stop and test. The prevalence based recall estimates alone can be very unreliable, as shown in In Legal Search Exact Recall Can Never Be Known. That is one of the main reasons for developing the ei-Recall alternative. I explained the thinking behind the decision to stop in Visualizing Data in a Predictive Coding Project – Part Three.

I would not stop the review in most projects (proportionality constraints aside) unless I was confident that I had already found all types of strong relevant documents, and all highly relevant documents, even if they are cumulative. I want to find each and every instance of all hot (highly relevant) documents that exist in the entire collection. I will only stop (proportionality constraints aside) when I think the only relevant documents I have not recalled are of an unimportant, cumulative type; the merely relevant. The truth is, most documents found in e-discovery are of this type; they are merely relevant, and of little to no use to anybody except to help find the strong relevant, new types of relevant evidence, or highly relevant evidence.

Back to our hypothetical. We take a sample of 1,534 documents from the 92,000 Negatives, which gives us a 95% confidence level and a 2.5% confidence interval. This allows us to estimate how many relevant documents had been missed, the False Negatives.

Assume we found only 5 False Negatives. Conversely, we found that 1,529 of the documents picked at random from the Negatives were in fact irrelevant as expected. They were True Negatives.

The percentage of False Negatives in this sample was thus a low 0.33% (5/1534). Using the normal (Gaussian) confidence interval, which is the wrong method here, the projected total number of False Negatives in the entire 92,000 Negatives would be between 5 and 2,604 documents (0.33% + 2.5% = 2.83%; 2.83% * 92,000 = 2,604). Using the binomial interval calculation the range would be from 0.11% to 0.76%. The more accurate binomial calculation eliminates the absurd result of a negative interval on the low end of the recall range (0.33% - 2.5% = -2.17%). The fact that a negative projection arises from using the Gaussian normal distribution demonstrates why the binomial interval calculation should always be used instead, especially in low prevalence collections. From this point forward, in accordance with the ei-Recall method, we will only use the more accurate binomial range calculations. Here the correct range generated by the binomial interval is from 101 (92,000 * 0.11%) to 699 (92,000 * 0.76%) False Negatives. Thus the FNh value is 699, and FNl is 101.

The calculation of the lowest end of the recall range is based on the high end of the False Negatives projection: Rl = TP / (TP+FNh) = 8,000 / (8,000 + 699) = 91.96%.

The calculation of the highest end of the recall range is based on the low end of the False Negatives projection: Rh = TP / (TP+FNl) = 8,000 / (8,000 + 101) = 98.75%.

Our final recall range for this first hypothetical is thus 92% – 99% recall. It was an unusually good result.
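For readers who want to verify these numbers, here is a minimal Python sketch of the calculation. It assumes the exact (Clopper-Pearson) binomial interval and the scipy library; the article does not name a particular binomial calculator, so the last digits may differ slightly from the rounded figures above.

    # Sketch of ei-Recall for the first example, assuming a Clopper-Pearson
    # exact binomial interval (one common choice of "binomial calculator").
    from scipy.stats import beta

    def binomial_interval(hits, sample_size, confidence=0.95):
        """Exact (Clopper-Pearson) confidence interval for a sample proportion."""
        alpha = 1 - confidence
        lo = 0.0 if hits == 0 else beta.ppf(alpha / 2, hits, sample_size - hits + 1)
        hi = 1.0 if hits == sample_size else beta.ppf(1 - alpha / 2, hits + 1, sample_size - hits)
        return lo, hi

    def ei_recall(tp, negatives, fn_found, sample_size, confidence=0.95):
        """Return (Rl, Rh): Rl = TP/(TP+FNh), Rh = TP/(TP+FNl)."""
        p_lo, p_hi = binomial_interval(fn_found, sample_size, confidence)
        fn_low, fn_high = negatives * p_lo, negatives * p_hi   # projected False Negatives
        return tp / (tp + fn_high), tp / (tp + fn_low)

    # Example 1: 8,000 True Positives; 5 False Negatives found in a sample
    # of 1,534 documents drawn from the 92,000 Negatives.
    r_lo, r_hi = ei_recall(tp=8000, negatives=92000, fn_found=5, sample_size=1534)
    print(f"{r_lo:.1%} - {r_hi:.1%}")   # approximately 92% - 99%
    # Note: naively applying the sample's nominal +/-2.5% margin to the observed
    # 0.33% elusion rate would give the nonsensical negative lower bound noted above.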


Ex. 1 – 92% – 99%

It is important to note that we could have still failed this quality assurance test, in spite of the high recall range shown, if any of the five False Negatives found was a highly relevant, or unique strong relevant, document. That is in accord with the accept on zero error standard that I always apply to the final elusion sample, a standard that has nothing directly to do with ei-Recall itself. Still, I recommend that the e-discovery community also adopt it as a corollary when implementing ei-Recall. I have previously explained this zero error quality assurance protocol on this blog several times, most recently in Visualizing Data in a Predictive Coding Project – Part Three, where I explained:

I always use what is called an accept on zero error protocol for the elusion test when it comes to highly relevant documents. If any are highly relevant, then the quality assurance test automatically fails. In that case you must go back and search for more documents like the one that eluded you and must train the system some more. I have only had that happen once, and it was easy to see from the document found why it happened. It was a black swan type document. It used odd language. It qualified as highly relevant under the rules we had developed, but just barely, and it was cumulative. Still, we tried to find more like it and ran another round of training. No more were found, but still we did a third sample of the null set just to be sure. The second time it passed.

Variations of First Example with Higher False Negatives Ranges

I want to provide two variations of this hypothetical where the sample of the null set, Negatives, finds more mistakes, more False Negatives. Variations like this will provide a better idea of the impact of the False Negatives range on the recall calculations. Further, the first example wherein I assumed that only five mistakes were found in a sample of 1,534 is somewhat unusual. A point projection ratio of 0.33% for elusion is on the low side for a typical legal search project. In my experience in most projects a higher rate of False Negatives will be found, say in the 0.5% to 2% range.

Let us assume for the first variation that instead of finding 5 False Negatives, we find 20. That is a quadrupling of the False Negatives. It means that we found 1,514 True Negatives and 20 False Negatives in the sample of 1,534 documents from the 92,000 document discard pile. This creates a point projection of 1.30% (20 / 1534), and a binomial range of 0.8% to 2.01%. This generates a projected range of total False Negatives of from 736 (92,000 * .8%) to 1,849 (92,000 * 2.01%).

Now let’s see how this quadrupling of errors found in the sample impacts the recall range calculation.

The calculation of the low end of the recall range is based on the high end of the False Negatives projection: Rl = TP / (TP+FNh) = 8,000 / (8,000 + 1,849) = 81.23%.

The calculation of the high end of the recall range is based on the low end of the False Negatives projection: Rh = TP / (TP+FNl) = 8,000 / (8,000 + 736) = 91.58%.

Our final recall range for this first variation of the first hypothetical is thus 81% – 92%.

In this first variation the quadrupling of the number of False Negatives found at the end of the project, from 5 to 20, caused an approximately ten percentage point decrease in recall from the first hypothetical, where we attained a recall range of 92% to 99%.


Ex. 2 – 81% – 92%

Let us assume a second variation where, instead of finding 5 False Negatives, we find 40. That is eight times the number of False Negatives found in the first hypothetical. It means that we found 1,494 True Negatives and 40 False Negatives in the sample of 1,534 documents from the 92,000 document discard pile. This creates a point projection of 2.61% (40/1534), and a binomial range of 1.87% to 3.53%. This generates a projected range of total False Negatives of from 1,720 (92,000*1.87%) to 3,248 (92,000*3.53%).

The calculation of the low end of the recall range is based on the high end of the False Negatives projection: Rl = TP / (TP+FNh) = 8,000 / (8,000 + 3,248) = 71.12%.

The calculation of the high end of the recall range is based on the low end of the False Negatives projection: Rh = TP / (TP+FNl) = 8,000 / (8,000 + 1,720) = 82.30%.

Our recall range for this second variation of the first hypothetical is thus 71% – 82%.

In this second variation the eightfold increase in the number of False Negatives found at the end of the project, from 5 to 40, caused an approximately twenty percentage point decrease in recall from the first hypothetical, where we attained a recall range of 92% to 99%.
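Both variations can be reproduced with the hypothetical ei_recall() helper sketched after the first example; only the number of False Negatives found in the sample changes.

    # Reusing the illustrative helper from the first example:
    print(ei_recall(tp=8000, negatives=92000, fn_found=20, sample_size=1534))  # ~ (0.81, 0.92)
    print(ei_recall(tp=8000, negatives=92000, fn_found=40, sample_size=1534))  # ~ (0.71, 0.82)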


Ex. 3 – 71% – 82%

Second Example of How to Calculate Recall Using the ei-Recall Method

We will again go back to the second example used in In Legal Search Exact Recall Can Never Be Known. The second hypothetical assumes a total collection of 1,000,000 documents and that 210,000 relevant documents were found and verified.

In the random sample of 1,534 documents (95%+/-2.5%) from the 790,000 documents withheld as irrelevant (1,000,000 – 210,000) we assume that only ten mistakes were uncovered, in other words, 10 False Negatives. Conversely, we found that 1,524 of the documents picked at random from the discard pile (another name for the Negatives) were in fact irrelevant as expected; they were True Negatives.

The percentage of False Negatives in this sample was thus 0.65% (10/1534). Using the binomial interval calculation the range would be from 0.31% to 1.2%. The range generated by the binomial interval is from  2,449 (790,000*0.31%) to 9,480 (790,000*1.2%) False Negatives.

The calculation of the lowest end of the recall range is based on the high end of the False Negatives projection: Rl = TP / (TP+FNh) = 210,000 / (210,000 + 9,480) = 95.68%.

The calculation of the highest end of the recall range is based on the low end of the False Negatives projection: Rh = TP / (TP+FNl) = 210,000 / (210,000 + 2,449) = 98.85%.

Our recall range for this second hypothetical is thus 96% – 99% recall. This is a highly unusual, truly outstanding result. It is, of course, still subject to the outlier uncertainty inherent in the confidence level. In that sense my labels of “worst” and “best” case scenario on the diagram below are not strictly correct. The true value could be better or worse in about five out of every one hundred samples drawn, in accord with the 95% confidence level. See the discussion near the end of my article In Legal Search Exact Recall Can Never Be Known regarding the role that luck necessarily plays in any random sample. This could have been a lucky draw, but nevertheless, it is just one quality assurance factor among many, and it is still an extremely good recall range achievement.


Ex.4 – 96% – 99%

Variations of Second Example with Higher False Negatives Ranges

I now offer three variations of the second hypothetical where each has a higher False Negative rate. These examples should better illustrate the impact of the elusion sample on the overall recall calculation.

Let us first assume that instead of finding 10 False Negatives, we find 20, a doubling of the rate. This means that we found 1,514 True Negatives and 20 False Negatives in the sample of 1,534 documents in the 790,000 document discard pile. This creates a point projection of 1.30% (20/1534), and a binomial range of 0.8% to 2.01%. This generates a projected range of total False Negatives of from 6,320 (790,000*.8%) to 15,879 (790,000*2.01%).


Now let us see how this doubling of errors in the second sample impacts the recall range calculation.

The calculation of the low end of the recall range is: Rl = TP / (TP+FNh) = 210,000 / (210,000 + 15,879) = 92.97% 

The calculation of the high end of the recall range is: Rh = TP / (TP+FNl) = 210,000 / (210,000 + 6,320) = 97.08%.

Our recall range for this first variation of the second hypothetical is thus 93% – 97%.

The doubling of the number of False Negatives from 10 to 20, caused an approximate 2.5% decrease in recall values from the second hypothetical where we attained a recall range of 96% to 99%.


Ex. 5 – 93% – 97%

Let us assume a second variation where, instead of finding 10 False Negatives at the end of the project, we find 40. That is a quadrupling of the number of False Negatives found in the original second hypothetical. It means that we found 1,494 True Negatives and 40 False Negatives in the sample of 1,534 documents from the 790,000 document discard pile. This creates a point projection of 2.61% (40/1534), and a binomial range of 1.87% to 3.53%. This generates a projected range of total False Negatives of from 14,773 (790,000*1.87%) to 27,887 (790,000*3.53%).

The calculation of the low end of the recall range is now: Rl = TP / (TP+FNh) = 210,000 / (210,000 + 27,887) = 88.28%.

The calculation of the high end of the recall range is now: Rh = TP / (TP+FNl) = 210,000 / (210,000 + 14,773) = 93.43%.

Our recall range for this second variation of the second hypothetical is thus 88% – 93%.

The quadrupling of the number of False Negatives, from 10 to 40, caused an approximately seven percentage point decrease in recall from the original, where we attained a recall range of 96% to 99%.

Ex. 6 – 88% – 93%

If we do a third variation and increase the number of False Negatives found eightfold, from 10 to 80, this changes the point projection to 5.22% (80/1534), with a binomial range of 4.16% to 6.45%. This generates a projected range of total False Negatives of from 32,864 (790,000*4.16%) to 50,955 (790,000*6.45%).

The calculation of the low end of the recall range is: Rl = TP / (TP+FNh) = 210,000 / (210,000 + 50,955) = 80.47%.

The calculation of the high end of the recall range is: Rh = TP / (TP+FNl) = 210,000 / (210,000 + 32,864) = 86.47%.

Our recall range for this third variation of the second hypothetical is thus 80% – 86%.

The eightfold increase of the number of False Negatives, from 10 to 80, caused an approximate 15% decrease in recall values from the second hypothetical where we attained a recall range of 96% to 99%.


Ex. 7 – 80% – 86%

By now you should have a pretty good idea of how the ei-Recall calculation works, and a feel for how the number of False Negatives found impacts the overall recall range.

Third Example of How to Calculate Recall Using the ei-Recall Method where there is Very Low Prevalence

A criticism of many recall calculation methods is that they fail and become completely useless in very low prevalence situations, say 1%, or sometimes even less. Such low prevalence is considered by many to be common in legal search projects.

Obviously it is much harder to find things that are very rare, such as the famous, and very valuable, Inverted Jenny postage stamp with the upside down plane. These stamps exist, but not many. Still, it is at least possible to find them (or buy them), as opposed to a search for a Unicorn or other complete fiction. (Please, Unicorn lovers, no hate mail!) These creatures cannot be found no matter how many searches and samples you take because they do not exist. There is absolute zero prevalence.

This circumstance sometimes happens in legal search, where one side claims that mythical documents must exist because they want them to. They have a strong suspicion of their existence, but no proof. More like hope, or wishful thinking. No matter how hard you look for such smoking guns, you cannot find them. You cannot find something that does not exist. All you can do is show that you made reasonable, good faith efforts to find the Unicorn documents, and they did not appear. Recall calculations make no sense in crazy situations like that because there is nothing to recall. Fortunately that does not happen too often, but it does happen, especially in the wonderful world of employment litigation.

We are not going to talk further about a search for something that does not exist, like a Unicorn, the zero prevalence. We will not even talk about the extremely, extremely rare, like the Inverted Jenny. Instead we are going to talk about prevalence of about 1%, which is still very low.

In many cases, but not all, very low prevalence like 1%, or less, can be avoided, or at least mitigated, by intelligent culling. This certainly does not mean filtering out all documents that do not have certain keywords. There are other, more reliable methods than simple keywords to eliminate superfluous irrelevant documents, including elimination by file type, date ranges, custodians, and email domains, among other things.

When there is a very low prevalence of relevant documents, this necessarily means that there will be a very large Negatives pool, thus diluting the sampling. There are ways to address the large Negatives sample pool, as I discussed in Part One. The most promising method is to cull out the low end of the probability rankings where relevant documents should anyway be non-existent.

Even with the smartest culling possible, low prevalence is often still a problem in legal search. For that reason, and because it is the hardest test for any recall calculation method, I will end this series of examples with a completely new hypothetical that considers a very low prevalence situation of only 1%. This means that there will be a large size Negatives pool: 99% of the total collection.

We will again assume a 1,000,000 document collection, and again assume sample sizes using 95% +/-2.5% confidence level and interval parameters. An initial sample of all documents, taken at the beginning of the project to give us a rough sense of prevalence for search guidance purposes (not recall calculations), projected a range of relevant documents of from 5,500 to 16,100.

The lawyers in this hypothetical legal search project plodded away for a couple of weeks and found and confirmed 9,000 relevant documents, True Positives all. At this point they are finding it very difficult and time consuming to find more relevant documents. What they do find is just more of the same. They are sophisticated lawyers who read my blog and have a good grasp of the nuances of sampling. So they know better than to simply rely on a point projection of prevalence to calculate recall, especially one based on a relatively small sample of a million documents taken at the beginning of the project. See In Legal Search Exact Recall Can Never Be Known. They know that their recall level could be as low as 56% (9,000/16,100), or perhaps far less, in the event the one sample they took was a confidence level outlier, or there was more concept drift than they thought. It could also be near perfect, 100% recall, when they consider the binomial interval range going the other way; the 9,000 documents they had found were well above the low end of the range, 5,500. But they did not really consider that too likely.

They decide to stop the search and take a second 1,534 document sample, but this time of the 991,000 null set (1,000,000 – 9,000). They want to follow the ei-Recall method, and they also want to test for any highly relevant or unique strong relevant documents by following the accept on zero error quality assurance test. They find -1- relevant document in that sample. It is just another more-of-the-same, merely relevant document. They had seen many like it before. Finding a document like that meant that they passed the quality assurance test they had set up for themselves. It also meant that, using the binomial intervals for 1/1534, which run from 0.00% to 0.36%, there is a projected range of False Negatives of from between -0- and 3,568 documents (991,000*0.36%). (Actually, a binomial calculator showing more decimal places than any I have found on the web (hopefully we can fix that soon) would not show exactly zero percent for the low end, but some very small percentage less than one hundredth of a percent, and thus a projection of some documents, not -0- documents, and thus something slightly less than 100% recall.)
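The parenthetical point is easy to verify with a few lines of code. This is a sketch, again assuming a Clopper-Pearson exact interval and scipy; the exact decimals will vary slightly with the calculator used.

    # Exact binomial interval for 1 hit in a sample of 1,534, projected onto
    # the 991,000 document null set (Clopper-Pearson, 95% confidence).
    from scipy.stats import beta

    n, hits, negatives, tp = 1534, 1, 991_000, 9_000
    lo = beta.ppf(0.025, hits, n - hits + 1)     # ~0.0017%, small but not zero
    hi = beta.ppf(0.975, hits + 1, n - hits)     # ~0.36%
    print(f"projected FN: {negatives * lo:.0f} to {negatives * hi:.0f}")          # ~16 to ~3,600
    print(f"recall: {tp / (tp + negatives * hi):.1%} to {tp / (tp + negatives * lo):.1%}")
    # The high end comes out just under 100%, as the parenthetical above notes.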

They then took out the ei-Recall formula and plugged in the values to see what recall range they ended up with. They were hoping it was tighter, and more reliable, than the 56% to 100% recall level they calculated from the first sample alone based on prevalence.

Calculation for the low end of the recall range: Rl = TP / (TP+FNh) = 9,000 / (9,000 + 3,568) = 71.61%.  

Calculation for the high end of the recall range: Rh = TP / (TP+FNl) = 9,000 / (9,000 + 0) = 100%.

The recall range using ei-Recall was 72% – 100%.


Ex. 8 – 72% – 100%

The attorneys’ hopes in this extremely low prevalence hypothetical were met. The 72%-100% estimated recall range was much tighter than the original 56%-100%. It was also more reliable because it was based on a sample taken at the end of the project when relevance was well defined. Although this sample did not, of and by itself, prove that a reasonable legal effort had been made, it did strongly support that position. When considering all of the many other quality control efforts they could report, if challenged, they were comfortable with the results. Assuming that they did not miss a highly relevant document that later turns up in discovery, it is very unlikely they will ever have to redo, or even continue, this particular legal search and review project.

Would the result have been much different if they had doubled the sample size, and thus doubled the cost of this quality control effort? Let us do the math and find out, assuming that everything else was the same.

This time the sample is 3,068 documents from the 991,000 null set. They find two relevant documents, False Negatives, of a kind they had seen many times before. This created a binomial range of 0.01% to 0.24%, projecting a range of False Negatives from 99 to 2,378 (991,000 * 0.01% — 991,000 * 0.24%). That creates a recall range of 79% – 99%.

Rl = TP / (TP+FNh) = 9,000 / (9,000 + 2,378) = 79.1%.  

Rh = TP / (TP+FNl) = 9,000 / (9,000 + 99) = 98.91%.


Ex. 9 – 79% – 99%

In this situation by doubling the sample size the attorneys were able to narrow the recall range from 72% – 100% to 79% – 99%. But was it worth the effort and doubling of  cost? I do not think so, at least not in most cases. But perhaps in larger cases, it would be worth the expense to tighten the range somewhat and so increase somewhat the defensibility of your efforts. After all, we are assuming in this hypothetical that the same proportional results would turn up in a sample size double that of the original. The results could have been much worse, or much better. Either way, your results would be more reliable than an estimate based on a sample half that size, and would have produced a tighter range. Also, you may sometimes want to take a second sample of the same size, if you suspect the first was an outlier.

Let us consider one more example, this time of an even smaller prevalence and larger document collection. This is the hardest challenge of all, a near Inverted Jenny puzzler. Assume a document collection of 2,000,000 and a prevalence based on a first random sample for search-help purposes, where again only one relevant document was found in the sample of 1,534. This suggested there could be as many as 7,200 relevant documents (0.36% * 2,000,000). So in this second hypothetical we are talking about a dataset where the prevalence may be far less than one percent.

Assume next that only 5,000 relevant documents were found, True Positives. A sample of 1,534 of the remaining 1,995,000 documents found -3- relevant, False Negatives. The binomial interval for 3/1534 is from 0.04% to 0.57%, producing a projected range of False Negatives of from between 798 and 11,372 documents (1,995,000 * .04% — 1,995,000 * 0.57%). Under ei-Recall the recall range measured is 31% – 86%.

Rl = TP / (TP+FNh) = 5,000 / (5,000 + 11,372) = 30.54%.  

Rh = TP / (TP+FNl) = 5,000 / (5,000 + 798) = 86.24%.

31% – 86% is a big range. Most would think too big, but remember, it is just one quality assurance indicator among many.


Ex. 10 – 31% – 86%

The size of the range could be narrowed by a larger sample. (It is also possible to take two samples, and, with some adjustment, add them together as one sample. This is not mathematically perfect, but fairly close, if you adjust for any overlaps, which anyway would be unlikely.) Assume the same proportions where we sample 3,068 documents from 1,995,000 Negatives, and find -6- relevant, False Negatives. The binomial range is 0.07% – 0.43%. The projected number of False Negatives is 1,397 – 8,579 (1,995,000*.07% – 1,995,000*.43%). Under ei-Recall the range is 37% – 78%.

Rl = TP / (TP+FNh) = 5,000 / (5,000 + 8,579) = 36.82%.  

Rh = TP / (TP+FNl) = 5,000 / (5,000 + 1,397) = 78.16%.


Ex. 11 – 37% – 78%

The range has been narrowed, but is still very large. In situations like this, where there is a very large Negatives set, I would suggest taking a different approach. As discussed in Part One, you may want to consider a rational culling down of the Negatives. The idea is similar to that behind stratified sampling. You create a subset, or stratum, of the entire collection of Negatives that has a higher, hopefully much higher, prevalence of False Negatives than the entire set. See, e.g., William Webber, Control samples in e-discovery (2013) at pg. 3.

Although Webber’s paper only uses keywords as an example of an easy way to create a stratum, in modern legal search today there are a number of methods that could be used to create the strata, only one of which is keywords. I use a combination of many methods that varies in accordance with the data set and other factors. I call that a multimodal method. In most cases (but not all), this is not too hard to do, even if you are doing the stratification before active machine learning begins. The non-AI based culling methods that I use, typically before active machine learning begins, include parametric Boolean keywords, concept, key player, key time, similarity, file type, file size, domains, etc.

After the predictive coding begins and ranking matures, you can also use probable relevance ranking as a method of dividing documents into strata. It is actually the most powerful of the culling methods, especially when it comes to predicting irrelevant documents. The second filter level is performed at or near the end of a search and review project. (This is all shown in the two-filter diagram above, which I may explain in greater detail in a future blog.) The second AI based filter can be especially effective in limiting the Negatives size for the ei-Recall quality assurance test. The last example will show how this works in practice.

We will begin this example as before, assuming again 2,000,000 documents where the search finds only 5,000. But this time before we take a sample of the Negatives we divide them into two strata. Assume, as we did in the example we considered in Part One, that the predictive coding resulted in a well-defined distribution of ranked documents. Assume that all 5,000 documents found were in the 50%, or higher, probable relevance ranking (shown in red in the diagram). Assume that all of the 1,995,000 presumed irrelevant documents are ranked 49.9%, or less, probable relevant (shown in blue in the diagram). Finally assume that 1,900,000 of these documents are ranked 10% or less probable relevant, thus leaving 95,000 documents ranked between 10.1% and 49.9%.

Assume also that we have good reason to believe based on our experience with the software tool used, and the document collection itself, that all, or almost all, False Negatives are contained in the 95,000 group. We therefore limit our random sample of 1,534 documents to the 95,000 lower midsection of the Negatives. Finally, assume we now find -30- relevant, False Negatives, none of them important.

The binomial range for 30 False Negatives in a sample of 1,534 is 1.32% – 2.78%, and this time the projected number of False Negatives is 1,254 – 2,641 (95,000*1.32% — 95,000*2.78%). Under ei-Recall the range is 65.44% – 79.95%.

Rl = TP / (TP+FNh) = 5,000 / (5,000 + 2,641) = 65.44%.

Rh = TP / (TP+FNl) = 5,000 / (5,000 + 1,254) = 79.95%.
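The same illustrative ei_recall() helper sketched after the first example applies unchanged to the culled stratum; only the Negatives count changes, because the sample is now drawn from the 95,000 document mid-range stratum rather than from all 1,995,000 Negatives.

    # ei-Recall on the culled stratum: sample only the 95,000 mid-ranked Negatives.
    print(ei_recall(tp=5000, negatives=95_000, fn_found=30, sample_size=1534))
    # ~ (0.65, 0.80): interval ~1.32% - 2.78%, projected FN ~1,254 - 2,641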

We see that culling down the Negative set of documents in a defensible manner can lead to a much tighter recall range. Assuming we did the culling correctly, the resulting recall range would also be more accurate. On the other hand, if the culling was wrong, based on incorrect presumptions, then the resulting recall range would be less accurate.


Ex. 12 – 65% – 80%

The fact is, no random sampling techniques can provide completely reliable results in very low prevalence data sets. There is no free lunch, but, at least with ei-Recall the bill for your lunch is honest because it includes ranges. Moreover, with intelligent culling to increase the probable prevalence of False Negatives, you are more likely to get a good meal.

Conclusion

There are five basic advantages of ei-Recall over other recall calculation techniques:

  1. Interval Range values are calculated, not just a deceptive point value. As shown by In Legal Search Exact Recall Can Never Be Known, recall statements must include confidence interval range values to be meaningful.
  2. One Sample only is used, not two, or more. This limits the uncertainties inherent in multiple random samples.
  3. End of Project is when the sample of the Negatives is taken for the calculation. At that time the relevance scope has been fully developed.
  4. Confirmed Relevant documents that have been verified as relevant by iterative reviews, machine and human, are used for the True Positives. This eliminates another variable in the calculation.
  5. Simplicity is maintained in the formula by reliance on basic fractions and common binomial confidence interval calculators. You do not need an expert to use it.

I suggest you try ei-Recall. It has been checked out by multiple information scientists and will no doubt be subject to more peer review here and elsewhere. Be cautious in evaluating any criticisms you may read of ei-Recall from persons with a vested monetary interest in the defense of a competitive formula, especially vendors, or experts hired by vendors. Their views may be colored by their monetary interests. I have no skin in the game. I offer no products that include this method. My only goal is to provide a better method to validate large legal search projects, and so, in some small way, to improve the quality of our system of justice. The law has given me much over the years. This method, and my other writings, are my personal payback.

I offer ei-Recall to anyone and everyone, no strings attached, no payments required. Vendors, you are encouraged to include it in your future product offerings. I do not want royalties, nor even insist on credit (although you can do so if you wish, assuming you do not make it seem like I endorse your product). ei-Recall is all part of the public domain now. I have no product to sell here, nor do I want one. I do hope to create an online calculator for ei-Recall soon, and when I do, that too will be a giveaway.

My time and services as a lawyer to implement ei-Recall are not required. Simplicity is one of its strengths, although it helps if you are part of the eLeet. I think I have fully explained how it works in this lengthy article. Still, if you have any non-legal technical questions about its application, send me an email, and I will try to help you out. Gratis of course. Just realize that I cannot by law provide you with any legal advice. All articles in my blog, including this one, are purely for educational purposes, and are not legal advice, nor in any way a solicitation for legal services. Show this article to your own lawyer or e-discovery vendor. You do not have to be 1337 to figure it out (although it helps).


Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search – Part Two

January 11, 2015

Please read Part One of this article before reading this second segment.

Contingency Table Background

A review of some of the basic concepts and terminology used in this article may be helpful before going further. It is also important to remember that ei-Recall is a method for measuring recall, not attaining recall. There is a fundamental difference. Many of my other articles have discussed search and review methods to achieve recall, but this one does not. See, e.g.:

  1. Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One,  Part Two,  Part Three, and Part Four.
  2. Predictive Coding and the Proportionality Doctrine: a Marriage Made in Big Data, 26 Regent U. Law Review 1 (2013-2014).
  3. Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three.
  4. Three-Cylinder Multimodal Approach To Predictive Coding.

This article is focused on the very different topic of measuring recall as one method among many to assure quality in large-scale document reviews.

Everyone should know that in legal search analysis False Negatives are documents that were falsely predicted to be irrelevant, that are in fact relevant. They are mistakes. Conversely, documents predicted irrelevant, that are in fact irrelevant, are called True Negatives. Documents predicted relevant that are in fact relevant are called True Positives. Documents predicted relevant that are in fact irrelevant are called False Positives.

These terms and formulas derived therefrom are set forth in the Contingency Table, a/k/a Confusion Matrix, a tool widely used in information science. Recall using these terms is the total number of relevant documents found, the True Positives (TP), divided by that same number, plus the total number of relevant documents not found, the False Negatives (FN). Recall is the percentage of total target documents found in any search.

CONTINGENCY TABLE

                     Truly Non-Relevant         Truly Relevant
Coded Non-Relevant   True Negatives (“TN”)      False Negatives (“FN”)
Coded Relevant       False Positives (“FP”)     True Positives (“TP”)

  • The standard formula for Recall using contingency table values is: R = TP / (TP+FN).
  • The standard formula for Prevalence is: P = (TP + FN) / (TP + TN + FP + FN)

The Grossman-Cormack Glossary of Technology Assisted Review. Also see: LingPipe Toolkit class on PrecisionRecallEvaluation.
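The two formulas are simple enough to check with a few lines of arithmetic. Here is a minimal sketch with purely hypothetical counts:

    # Recall and Prevalence from contingency table counts (hypothetical numbers).
    def recall(tp, fn):
        return tp / (tp + fn)

    def prevalence(tp, tn, fp, fn):
        return (tp + fn) / (tp + tn + fp + fn)

    # e.g., a 100,000 document collection: 8,000 True Positives found,
    # 400 relevant documents missed, 500 False Positives.
    print(recall(tp=8000, fn=400))                          # 0.952...
    print(prevalence(tp=8000, tn=91100, fp=500, fn=400))    # 0.084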

General Background on Recall Formulas

Before I get into the examples and math for ei-Recall, I want to provide more general background. In addition, I suggest that you re-read my short description of an elusion test at the end of Part Three of Visualizing Data in a Predictive Coding Project. It provides a brief description of the other quality control applications of the elusion test for False Negatives. If you have not already done so, you should also read my entire article, In Legal Search Exact Recall Can Never Be Known.

I also suggest that you read John Tredennick’s excellent article: Measuring Recall in E-Discovery Review: A Tougher Problem Than You Might Realize, especially Part Two of that article. I give a big Amen to John’s tough problem insights.

For the more technical and mathematically minded, I suggest you read the works of William Webber, including his key paper on this topic, Approximate Recall Confidence Intervals (January 2013, Volume 31, Issue 1, pages 2:1–33) (free version in arXiv), and his many less formal and easier to understand blog posts on the topic: Why confidence intervals in e-discovery validation? (12/9/12); Why training and review (partly) break control sets (10/20/14); Why 95% +/- 2% makes little sense for e-discovery certification (5/25/13); Stratified sampling in e-discovery evaluation (4/18/13); What is the maximum recall in re Biomet? (4/24/13). Special attention should be given to Webber’s recent article on Roitblat’s eRecall, Confidence intervals on recall and eRecall (1/4/15), where it is tested and found deficient on several grounds.

My idea for a recall calculation that includes binomial confidence intervals, like most ideas, is not truly original. It is, as our friend Voltaire puts it, a judicious imitation. For instance, I am told that my proposal to use comparative binomial calculations to determine approximate confidence interval ranges follows somewhat the work of an obscure Dutch medical statistician, P. A. R. Koopman, in the 1980s. See: Koopman, Confidence intervals for the ratio of two binomial proportions, Biometrics 40: 513–517 (1984). Also see: Webber, William, Approximate Recall Confidence Intervals, ACM Transactions on Information Systems, Vol. V, No. N, Article A (October 2012); Duolao Wang, Confidence intervals for the ratio of two binomial proportions by Koopman’s method, Stata Technical Bulletin, 10-58, 2001.

As mentioned, the recall method I propose here is also similar to that promoted by Herb Roitblat – eRecall – except that it avoids eRecall’s fundamental defect: I include binomial intervals in the calculations to provide an elusion recall range, and his method does not. Measurement in eDiscovery (2013). Herb’s method relies solely on point projections and disregards the ranges of both the Prevalence and False Negative projections. That is why no statistician will accept Roitblat’s eRecall, whereas ei-Recall has already been reviewed without objection by two of the leading authorities in the field, William Webber and Gordon Cormack.


ei-Recall is also a superior method because it is based on a specific number of relevant documents found at the end of the project, the True Positives (TP). That is not an estimated number. It is not a projection based on sampling where a confidence interval range and more uncertainty are necessarily created. True Positives in ei-Recall is the number of relevant documents in a legal document production (or privilege log). It is an exact number verified by multiple reviews and other quality control efforts set forth in steps six, seven and eight in Electronic Discovery Best Practices (EDBP), and then produced in step nine (or logged).

In a predictive coding review the True Positives as defined by ei-Recall are the documents predicted relevant, and then confirmed to be relevant in second pass reviews, etc., and produced and logged. (Again see: Step 8 of the EDBP, which I call Protections.) The production is presumed to be a 100% precise production, or at least as close as is humanly possible, and to contain no False Positives. For that reason ei-Recall may not be appropriate in all projects. Still, it could also work, if need be, by estimating the True Positives. The fact that ei-Recall includes interval ranges in and of itself makes it superior to, and more accurate than, any other ratio method.

In the usual application of ei-Recall, only the number of relevant documents missed, the False Negatives, is estimated. The actual number of relevant documents found (TP) is divided by the sum of that same number (TP) and the projected number of False Negatives from the sample of the null set, at both the high (FNh) and low (FNl) ends of the interval. This method is summarized by the following formulas:

Formula for the lowest end of the recall range from the null set sample: Rl = TP / (TP+FNh).

Formula for the highest end of the recall range from the null set sample: Rh = TP / (TP+FNl).

This is very different from the approach used by Herb Roitblat for eRecall. Herb’s approach is to sample the entire collection to calculate a point projection of the probable total number of relevant documents in the collection, which I will here call P. He then takes a second random sample of the null set to calculate the point projection of the probable total False Negatives contained in the null set (FN). Roitblat’s approach only uses point projections and ignores the interval ranges inherent in each sample. My approach uses one sample and includes its confidence interval range. Also, as mentioned, my approach uses a validated number of True Positives found at the end of a review project, and not a projection of the probable total number of relevant documents found (P). Although Herb never uses a formula per se in his paper, Measurement in eDiscovery, to describe his approach, if we use the above described definitions the formula for eRecall would seem to be: eR = P / (P + FN). (Note there are other speculations as to what Roitblat really intends here, as discussed in the comments to Webber’s blog on eRecall. One thing we know for sure is that, although he may change the details of his approach, it never includes a recall range, just a spot projection.)

My approach of making two recall calculations, one for the low end, and another for the high end, is well worth the slight additional time to create a range. Overall the effort and cost of ei-Recall is significantly less than eRecall because only one sample is used in my method, not two. My method significantly improves the reliability of recall estimates and overcomes the defects inherent in ignoring confidence intervals found in eRecall and other methods such as the Basic Ratio Method and Global Method. See, e.g.: Grossman & Cormack, Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ Federal Courts Law Review, Vol. 7, Issue 1 (2014) at 306-310.

The use of range values avoids the trap of using a point projection that may be very inaccurate. The point projections of eRecall may be way off from the true value, as was explained in detail in In Legal Search Exact Recall Can Never Be Known. Moreover, ei-Recall fits in well with the overall work flow of my current two-pass, CAL-based (continuous active learning), hybrid, multimodal search and review method.

Recall Calculation Methods Must Include Range

A fuller explanation of Herb Roitblat’s eRecall proposal, and other similar point projection based proposals, should help clarify the larger policy issues at play in the proposed alternative ei-Recall approach.

Again, I cannot accept Herb Roitblat’s approach to using an Elusion sample to calculate recall because he uses the point projections of prevalence and elusion only, and does not factor in the recall interval ranges. My reason for opposing this simplification was set out in detail in In Legal Search Exact Recall Can Never Be Known. It is scientifically and mathematically wrong to use point projections and not include ranges.

I note that industry leader John Tredennick also disagrees with Herb’s approach. See his recent article: Measuring Recall in E-Discovery Review: A Tougher Problem Than You Might Realize – Part Two. After explaining Herb’s eRecall, John says this:

Does this work? Not so far as I can see. The formula relies on the initial point estimate for richness and then a point estimate for elusion.

I agree with John Tredennick in this criticism of Herb’s method. So too does Bill Dimm, who has a PhD in Physics and is the founder and CEO of Hot Neuron. Bill summarizes Herb’s eRecall method in his article, eRecall: No Free Lunch. He uses an example to show that eRecall does not work at all in low prevalence situations. Of course, all sampling is challenged by extremely low prevalence, even ei-Recall, but at least my interval approach does not hide the limitations of such recall estimates. There is no free lunch. Recall estimates are just one quality control effort among many.

Maura Grossman and Gordon Cormack also challenge the validity of Herb’s method. They refer to Roitblat’s eRecall as a specious argument. Grossman and Cormack make the same judgment about several other approaches that compare the ratios of point projections and show how they all suffer from a basic mathematical statistical error, which they call the Ratio Method Fallacy. Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ supra at 308-309.

In Grossman & Cormack’s Guest Blog: Talking Turkey (e-Discovery Team, 2014) they explained an experiment that they did and reported on in the Comments article where they repeatedly used Roitblat’s eRecall, the direct method, and other methods to estimate recall. They used a review known to have achieved 75% recall and 83% precision, from a collection with 1% prevalence. The results showed that in this review “eRecall provides an estimate that is no better than chance.” That means eRecall was a complete failure as a quality assurance measure.

Although my proposed range method is a comparative Ratio Method, it avoids the fallacy of other methods criticized by Grossman and Cormack. It does so because it includes binomial probability ranges in the recall calculations and eschews the errors of point projection reliance. It is true that the range of recall estimates using ei-Recall may still be uncomfortably large in some low yield projects, but at least it will be real and honest, and, unlike eRecall, it is better than nothing.

No Legal Economic Arguments Justify the Errors of Simplified Point Projections 

The oversimplified point projection ratio approach can lead to a false belief of certainty for those who do not understand probability ranges inherent in random samples. We presume that Herb Roitblat understands the probability range issues, but he chooses to simplify anyway on the basis of what appears to me to be essentially legal-economic arguments, namely proportionality cost-savings, and the inherent vagaries of legal relevance. Roitblat, The Pendulum Swings: Practical Measurement in eDiscovery.

I disagree strongly with Roitblat’s logic. As one scholar in private correspondence pointed out, Herb appears to fall victim to the classic fallacy of the converse. Herb asserts that “if the point estimate is X, there is a 50% probability that the true value is greater than X.” What *is* true (for an unbiased estimate) is that “if the true value is X, there is a 50% probability that the estimate is greater than X.” Assuming the latter implies the former is the classic fallacy of the converse. Think about it. It is a very good point. For a more obvious example of the fallacy of the converse consider this: “Most accidents occur within 25 miles from home; therefore, you are safest when you are far from home.”

Although I disagree with Herb Roitblat’s logic, I do basically agree with many of his non-statistical arguments and observations on document review, including, for instance, the following:

Depending on the prevalence of responsive documents and the desired margin-of-error, the effort needed to measure the accuracy of predictive coding can be more than the effort needed to conduct predictive coding.

Until a few years ago, there was basically no effort expended to measure the efficacy of eDiscovery. As computer-assisted review and other technologies became more widespread, an interest in measurement grew, in large part to convince a skeptical audience that these technologies actually worked. Now, I fear, the pendulum has swung too far in the other direction and it seems that measurement has taken over the agenda.

There is sometimes a feeling that our measurement should be as precise as possible. But when the measure is more precise than the underlying thing we are measuring, that precision gives a false sense of security. Sure, I can measure the length of a road using a yardstick and I can report that length to within a fraction of an inch, but it is dubious whether the measured distance is accurate to within even a half of a yard.

Although I agree with many of the points of Herb’s legal economic analysis in his article, The Pendulum Swings: Practical Measurement in eDiscovery, I disagree with the conclusion. The quality of the search software, and the legal search skills of attorney-users of this software, have both improved significantly in the past few years. It is now possible for relatively high recall levels to be attained, even including ranges, and even without incurring extraordinary efforts and costs as Herb and others suggest. (As a side note, please notice that I am not opining on a specific minimum recall number. That is not helpful because it depends on too many variable factors unique to particular search projects. However, I would point out that in the TREC Legal Track studies in 2008 and 2009 the participants, expert searchers all, attained verified recall levels of only 20% to 70%. See The Legal Implications of What Science Says About Recall. All I am saying is that in my experience our recall efforts have improved and are continually improving as our software and skills improve.)

Further, although relevance and responsiveness can sometimes be vague and elusive as Roitblat points out, and human judgments can be wrong and inconsistent, there are quality control process steps that can be taken to significantly mitigate these problems, including the often overlooked better dialogues with the requesting party. Legal search is not an arbitrary exercise such that it is a complete waste of time to try to accurately measure recall.

I disagree with Herb’s suggestion to the contrary based on his evaluation of legal relevance judgments. He reaches this conclusion based on the very interesting study he did with Anne Kershaw and Patrick Oot on a large-scale document review that Verizon did nearly a decade ago. Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review. In that review Verizon employed 225 contract reviewers and a Twentieth Century linear review method wherein low paid contract lawyers sat in isolated cubicles and read one document after another. The study showed, as Herb summarizes it, that “the reviewers agree with one another on relevance calls only about 50% of the time.” Measurement in eDiscovery at pg. 6. He takes that finding as support for his contention that consistent legal review is impossible and so there is no need to bother with finer points of recall intervals.

I disagree. My experience as an attorney making judgments on the relevancy of documents since 1980 tells me otherwise. It is absurd, even insulting, to call legal judgment a mere matter of coin flipping. Yes, there are well-known issues with consistency in legal review judgments in large-scale reviews, but this just makes the process more challenging, more difficult, not impossible.

Although consistent review may be impossible if large teams of contract lawyers do linear review in isolation using yesterday’s technology, that does not mean consistent legal judgments are impossible. It just means the large team linear review process is deeply flawed. That is why the industry has moved away from the approaches used by the Verizon team review nearly ten years ago. We are now using predictive coding, small teams of SMEs and contract lawyers, and many new innovative quality control procedures, including soon, I hope, ei-Recall. The large team linear review approach of a decade ago, and other quality factors, were the primary causes of the inconsistencies seen in the Verizon approach, not the inherent impossibility of determining legal relevance.

Good Recall Results Are Possible Without Heroic Efforts
But You Do Need Good Software and Good Methods

Even with the consistency and human error challenges inherent in all legal review, and even with the ranges of error inherent in any valid recall calculation, it is, I insist, still possible to attain relatively high recall ranges in most projects. (Again, note that I will not commit to a specific general minimum range.) I am seeing better recall ranges attained in more and more of my projects, and I am certainly not a mythical TAR-whisperer, as Grossman and Cormack somewhat tongue-in-cheek described lawyers who may have extraordinary predictive coding search skills. Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ at pg. 298. Any experienced lawyer with technology aptitude can attain impressive results in large-scale document reviews. They just need to use hybrid, multimodal, CAL-type, quality controlled, search and review methods. They also need to use proven, high quality, bona fide predictive coding software. I am able to teach this in practice with bright, motivated, hard-working, technology savvy lawyers.

Legal search is a new legal skill to be sure, just like countless others in e-discovery and other legal fields. I happen to find the search and review challenges more interesting than the large enterprise preservation problems, but they are both equally difficult and complex. TAR-whispering is probably an easier skill to learn than many others required today in the law. (It is certainly easier than becoming a dog whisperer like Cesar Millan. I know. I’ve tried and failed many times.)

Think of the many arcane choice of law issues U.S. lawyers have faced for over a century in our 50-state, plus federal law system. Those intellectual problems are more difficult than predictive coding. Think of the tax code, securities, M&A, government regulations, class actions. It is all hard. All difficult. But it can all be learned. Like everything else in the law, large-scale document review just requires a little aptitude, hard work and lots of legal practice. It is no different from any other challenge lawyers face. It just happens to require more software skills, sampling, basic math, and AI intuition than any other legal field.

On the other point of bona fide predictive coding software, while I will not name names, as far as I am concerned the only bona fide software on the market today uses active machine learning algorithms. It does not depend instead on some kind of passive learning process (although they too can be quite effective, they are not predictive coding algorithms, and, in my experience, do not provide as powerful a search tool). I am sorry to say that some legal review software on the market today falsely claims to have predictive coding features, when, in fact, it does not. It is only passive learning, more like concept search, than AI-enhanced search. With software like that, or even with good software where the lawyers use poor search and review methods, or do not really know what they are searching for (poor relevance scope), then the efforts required to attain high recall ranges may indeed be very extensive and thus cost prohibitive as Herb Roitblat argues. If your tools and or methods are poor, it takes much longer to reach your goals.

One final point regarding Herb’s argument: I do not think sampling really needs to be as cost prohibitive as he and others suggest. As noted before in In Legal Search Exact Recall Can Never Be Known, one good SME and a skilled contract review attorney can carefully review a sample of 1,534 documents for between $1,000 and $2,000. In large review projects that is hardly a cost prohibitive barrier. There is no need to be thinking in terms of small 385 document sample sizes, which create a huge margin of error of 5%. This is what Herb Roitblat and others do when suggesting that all sampling is ineffective anyway, so just ignore intervals and ranges. Any large project can afford a full sample of 1,534 documents to cut the interval in half to a 2.5% margin of error. Many can afford much larger samples to narrow the interval range even further, especially if the tools and methods used allow them to attain their recall range goals in a fast and effective manner.
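Those two sample sizes are easy to sanity check with the standard worst-case margin of error formula for a simple random sample at 95% confidence (a back-of-the-envelope sketch; real calculators may apply finite population corrections that shift the numbers slightly):

    # Worst-case (p = 0.5) margin of error at 95% confidence.
    from math import sqrt

    def margin_of_error(n, z=1.96, p=0.5):
        return z * sqrt(p * (1 - p) / n)

    print(f"{margin_of_error(385):.2%}")    # ~5.0%
    print(f"{margin_of_error(1534):.2%}")   # ~2.5%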

John Tredennick, who, like me, is an attorney, also disagrees with Herb’s legal-economic analysis in favor of eRecall, but John proposes a solution involving larger sample sizes, wherein the increased cost burden would be shifted onto the requesting party. Measuring Recall in E-Discovery Review: A Tougher Problem Than You Might Realize – Part Two. I do not disagree with John’s assertions in his article, and cost shifting may be appropriate in some cases. It is not, however, my intention to address the cost-shifting arguments here, or the other good points made in John’s article. Instead, my focus in the remaining Part Three of this blog series will be to provide a series of examples of ei-Recall in action. For me, and I suspect for many of you, seeing a method in action is the best way to understand it.

Summary of the Five Reasons ei-Recall is the new Gold Standard

Before moving on to the examples, I wanted to summarize what we have covered so far and go over the five main reasons ei-Recall is superior to all other recall methods. First, and most important, is the fact that ei-Recall calculates a recall range, and not just one number. As shown by In Legal Search Exact Recall Can Never Be Known, recall statements must include confidence interval range values to be meaningful. Recall should not be based on point projections alone. Therefore any recall calculation method must calculate both a high and a low value. The ei-Recall method I offer here is designed to make the correct high-low interval range calculations. That, in itself, makes it a significant improvement over all point projection recall methods.

The second advantage of ei-Recall is that it uses only one random sample, not two or more. This avoids the compounding of variables, uncertainties, and outlier events inherent in any system that uses multiple chance events, multiple random samples. The costs are also controlled better in a one-sample method like this, especially since the one sample is of reasonable size. This contrasts with the Direct Method, which also uses one sample, but the sample has to be insanely large. That is not only very costly, but also introduces more opportunity for human error in the form of inconsistent relevance adjudications.

The timing of the one sample in ei-Recall is another of its advantages. It is taken at the end of the project when the relevance scope has been fully articulated.

Another key advantage of ei-Recall is that the True Positives used for the calculation are not estimated, and are not projected by random samples. They are documents confirmed to be relevant by multiple quality control measures, including multiple reviews of these documents by humans or computers, and often both.

Finally, ei-Recall has the advantage of simplicity and ease of use. It can be carried out by any attorney who knows fractions. The only higher math required, the calculation of binomial confidence intervals, can be done with easily available online calculators. You do not need to hire a statistician to make the recall range calculations using ei-Recall.

To be continued.


Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search – Part One

January 4, 2015

I have uncovered a new method for calculating recall in legal search projects that I call ei-Recall, which stands for elusion interval recall. I offer this to everyone in the e-discovery community in the hope that it will replace the hodgepodge of methods currently used, most of which are statistically invalid. My goal is to standardize a new best practice for calculating recall. Towards this end I will devote the next three blogs to ei-Recall. Parts One and Two will describe the formula in detail, and explain why I think it is the new gold standard. Part Two will also provide a detailed comparison with Herb Roitblat’s eRecall. Part Three will provide a series of examples as to how ei-Recall works.

I received feedback on my ideas and experiments from the top two scientists in the world with special expertise in this area, William Webber and Gordon Cormack. I would likely have presented one of my earlier, flawed methods, but for their patient guidance. I finally settled on the ei-Recall method as the most accurate and reliable of them all. My thanks and gratitude to them both, especially to William, who must have reviewed and responded to a dozen earlier drafts of this blog. He not only corrected logic flaws, and there were many, but also typos! As usual any errors remaining are purely my own, and these are my opinions, not theirs.

ei-Recall is preferable to all other commonly used methods of recall calculation, including Roitblat’s eRecall, for two reasons. First, ei-Recall includes interval-based range values and, unlike eRecall and other simplistic ratio methods, is not based on point projections. Second, and this is critical, ei-Recall is only calculated at the end of a project, and depends on a known, verified count of True Positives in a production. It is thus unlike eRecall, and all other recall calculation methods, which depend on an estimated value for the number of True Positives found.

Yes, this does limit the application of ei-Recall to projects in which great care is taken to bring the precision of the production to near 100%, including second reviews and many quality control cross-checks. But this is in any event part of the workflow in many Continuous Active Learning (CAL) predictive coding projects today. At least it is in mine, where we take great pains to address the client’s concern for maintaining the confidentiality of their data. See: Step 8 of the EDBP (Electronic Discovery Best Practices), which I call Protections; it is the step after first-pass review by CAR (computer assisted review, multimodal predictive coding).

Advanced Summary of ei-Recall

I begin with a high level summary of this method for my more advanced readers. Do not be concerned if this seems fractured and obtuse at first. It will come into clear 3-D focus later as I describe the process in multiple ways and conclude in Part Three with examples.

ei-Recall calculates recall range with two fractions. The numerator of both fractions is the actual number of True Positives found in the course of the review project and verified as relevant. The denominator of both fractions is based on a random sample of the documents presumed irrelevant that will not be produced, the Negatives. The percentage of False Negatives found in the sample allows for a calculation of a binomial range of the total number of False Negatives in the Negative set. The denominator of the low end recall range fraction is the high end number of the projected range of False Negatives, plus the number of True Positives. The denominator of the high end recall range fraction is the low end number of the projected range of False Negatives, plus the number of True Positives.

Here is the full algebraic explanation of ei-Recall, starting with the definitions for the symbols in the formula.

  • Rl stands for the low end of the recall range.
  • Rh stands for the high end of the recall range.
  • TP is the verified total number of relevant documents found in the course of the review project.
  • FNl is the low end of the False Negatives projection range based on the low end of the exact binomial confidence interval.
  • FNh is the high end of the False Negatives projection range based on the high end of the exact binomial confidence interval.

Formula for the low end of the recall range:
Rl = TP / (TP+FNh).

Formula for the high end of the recall range:
Rh = TP / (TP+FNl).

This formula essentially adds the extreme probability ranges to the standard formula for recall, which is: R = TP / (TP+FN).
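For readers who want to see how the two fractions are computed in practice, here is a minimal sketch in Python. It is only an illustration of the formula above, not anyone’s official implementation. It assumes the SciPy library is available and uses the Clopper-Pearson “exact” binomial interval, which is one common way to obtain the exact interval values; the function names and the numbers at the bottom are made up for the example.

from scipy.stats import beta

def exact_binomial_interval(fn_in_sample, sample_size, confidence=0.95):
    # Clopper-Pearson ("exact") interval for the proportion of False Negatives.
    alpha = 1 - confidence
    k, n = fn_in_sample, sample_size
    low = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    high = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return low, high

def ei_recall(tp, negatives, fn_in_sample, sample_size, confidence=0.95):
    # Returns (Rl, Rh), the low and high ends of the recall range.
    p_low, p_high = exact_binomial_interval(fn_in_sample, sample_size, confidence)
    fn_low = p_low * negatives    # FNl: low end of the projected False Negatives
    fn_high = p_high * negatives  # FNh: high end of the projected False Negatives
    return tp / (tp + fn_high), tp / (tp + fn_low)  # Rl = TP/(TP+FNh), Rh = TP/(TP+FNl)

# Made-up numbers: 9,000 verified True Positives, 91,000 Negatives, and
# 10 False Negatives found in an elusion sample of 1,534 documents.
print(ei_recall(tp=9000, negatives=91000, fn_in_sample=10, sample_size=1534))
# Prints roughly (0.89, 0.97), that is, a recall range of about 89% to 97%.

Any easily available online exact binomial calculator will give the same FNl and FNh values; the sketch simply automates the two fractions.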


Quest for the Holy Grail of Recall Calculations

I have spent the last few months in intense efforts to bring this project to conclusion. I have also spent more time writing and rewriting this blog than any I have ever written in my eight plus years of blogging. I wanted to find the best possible recall calculation method for e-discovery work. I convinced myself that I needed to find a new method in order to take my work as a legal search and review lawyer to the next level. I was not satisfied with my old ways and methods of quality control of large legal search projects. I was not comfortable with my prevalence based recall calculations. I was not satisfied with anyone else’s recall methods either. I had heard the message of Gordon Cormack and Maura Grossman clearly stated right here in their guest blog of September 7, 2014: Talking Turkey. In their conclusion they stated:

We hope that our studies so far—and our approach, as embodied in our TAR Evaluation Toolkit—will inspire others, as we have been inspired, to seek even more effective and more efficient approaches to TAR, and better methods to validate those approaches through scientific inquiry.

I had already been inspired to find better methods of predictive coding, and have uncovered an efficient approach with my multimodal CAL method. But I was still not satisfied with my recall validation approach; I wanted to find a better method to scientifically validate my review work.

Like almost everyone else in legal search, including Cormack and Grossman, I had earlier rejected the so-called Direct Method of recall calculation. It is unworkable and very costly, especially in low prevalence collections where it requires sample sizes in the tens of thousands of documents. See, e.g., Grossman & Cormack, Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ Federal Courts Law Review, Vol. 7, Issue 1 (2014) at 306-307 (“The Direct Method is statistically sound, but is quite burdensome, especially when richness is low.”)

Like Grossman and Cormack, I did not much like any of the other sampling alternatives either. Their excellent Comments article discusses and rejects Roitblat’s eRecall, and two other methods by Karl Schieneman and Thomas C. Gricks III, which Grossman and Cormack call the Basic Ratio Method and the Global Method. Supra at 307-308.

I was on a quest of sorts for the Holy Grail of recall calculations. I knew there had to be a better way. I wanted a method that used sampling with interval ranges as a tool to assure the quality of a legal search project. I wanted a method that created as accurate an estimate as possible. I also wanted a method that relied on simple fraction calculations and did not depend on advanced math to narrow the binomial ranges, such as William Webber’s favorite recall equation: the Beta-binomial Half formula, shown below.

[Image: Webber’s Beta-binomial Half formula]

Webber, W., Approximate Recall Confidence Intervals, ACM Transactions on Information Systems, Vol. V, No. N, Article A, Equation 18, at pg. A:13 (October 2012).

Before settling on my much simpler algebraic formula I experimented with many other methods to calculate recall ranges. Most were much more complex and included two or more samples, not just one. I wanted to try to include a sample that I usually take at the beginning of a project to get a rough idea of prevalence with interval ranges. These were the examples shown by my article, In Legal Search Exact Recall Can Never Be Known, and described in the section, Calculating Recall from Prevalence. I wanted to include the first sample, and prevalence based recall calculations based on that first sample, with a second sample of excluded documents taken at the end of the project. Then I wanted to kind of average them somehow, including the confidence interval ranges. Good idea, but bad science. It does not work, statistically or mathematically, especially in low prevalence.

I found a number of other methods, which, at first, looked like the Holy Grail. But I was wrong. They were made of lead, not gold. Some of the ones that I dreamed up were made of fool’s gold! A couple of the most promising methods I tried and rejected used multiple samples of various strata. That is called stratified random sampling, as compared to simple random sampling.

My questionable, but inspired research method for this very time consuming development work consisted of background reading, aimless pondering, sleepless nights, intuition, trial and error (appropriate I suppose for a former trial lawyer), and many consults with the top experts in the field (another old trial lawyer trick). I ran through many other alternative formulas. I did the math in several standard review project scenarios, only to see the flaws of these other methods in certain circumstances, primarily low prevalence.

Every experiment I tried with the added complexity, and added effort, of multiple samples proved to be fruitless. Indeed, most of this work was an exercise in frustration. (It turns out that noted search expert Bill Dimm is right. There is no free lunch in recall.) My experiments, and especially the expert input I received from Webber and Cormack, all showed that the extra complexities were not worth the extra effort, at least not for purposes of recall estimation. Instead, my work confirmed that the best way to channel additional efforts that might be appropriate in larger cases is simply to increase the sample size. This, and my use of confirmed True Positives, are the only sure-fire methods to improve the reliability of recall range estimates. They are the best ways to lower the size of the interval spread that all probability estimates must include.

Finding the New Gold Standard

ei-Recall meets all of my goals for recall calculation. It maintains mathematical and statistical integrity by including probable ranges in the estimate. The size of the range depends on the size of the sample. It is simple and easy to use, and easy to understand. It can thus be completely transparent and easy to disclose. It is also relatively inexpensive and you control the costs by controlling the sample size (although I would not recommend a sample size of less than 1,500 in any legal search project of significant size and value).

Finally, by using verified True Positives, and basing the recall range calculation on only one random sample, one of the null set, instead of two samples, the chance factor inherent to all random sampling is reduced. I described these chance factors in detail in In Legal Search Exact Recall Can Never Be Known, in the section on Outliers and Luck of Random Draws. Outlier events are still possible using ei-Recall, but they are minimized by limiting the sample to the null set and estimating only a projected range of False Negatives. It is true that the prevalence based recall calculations described in In Legal Search Exact Recall Can Never Be Known also use only one random sample, but that is a sample of the entire document collection, taken to estimate a projected range of relevant documents, the True Positives. The number of relevant documents in a collection will (or at least should, in any half-way decent search) be far larger than the number of False Negatives. For that reason alone the variability range (interval spread) of the straight elusion recall method should typically be smaller and more reliable.

Focus Your Sampling Efforts on Finding Errors of Omission

The number of documents presumed irrelevant, the Negatives, or null set, will always be smaller than the total document collection, unless of course you found no relevant documents at all! This means you will always be sampling a smaller dataset when doing an elusion sample than when doing a prevalence sample of the entire collection. Therefore, if you are trying to find your mistakes, the False Negatives, look for them where they might lie: in the smaller Negative set, the null set. Do not look for them in the larger complete collection, which includes the documents you are going to produce, the Positive set. Your errors of omission, which is what you are trying to measure, could not possibly be there. So why include that set of documents in the random sample? That is why I reject the idea of sampling the entire collection, including the Positives, at the end of the project.

The Positives, the documents to be produced, have already been verified enough under my two-pass system. They have been touched multiple times by machines and humans. It is highly unlikely there will be False Positives. Even if there are, the requesting party will not complain about that. Their concern should be on completeness, or recall, especially if any precision errors are minor.

There is no reason to include the Positives in a final recall search in any project with verified True Positives. That just unnecessarily increases the total population size and thereby increases the possibility of an inaccurate sample. Estimates made from a sample of 1,500 documents of a collection of 150,000 documents will always be more accurate, more reliable, than estimates made from a sample of 1,500 documents of a collection of 1,500,000. The only exception is when there is an even distribution of target documents making up half of the total collection – 50% prevalence.

Sample size does not scale perfectly, only roughly, and the lower the prevalence, the more inaccurate it becomes. That is why sampling is not a miracle tool in legal search, and recall measures are range estimates, not certainties. In Legal Search Exact Recall Can Never Be Known. Recall measurement, when done right, as it is in ei-Recall, is a powerful quality assurance tool, to be sure, but it is not the end-all of quality control measures. It should be part of a larger tool kit that includes several other quality measures and techniques. The other quality control methods should be employed throughout the review, not just at the end like ei-Recall. Maura Grossman and Gordon Cormack agree with me on this. Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ supra at 285. They recommend that validation:

consider all available evidence concerning the effectiveness of the end-to-end review process, including prior scientific evaluation of the TAR method, its proper application by qualified individuals, and proportionate post hoc sampling for confirmation purposes.

Ambiguity in the Scope of the Null Set

There is an open question in my proposal as to exactly how you define the Negatives, the presumed irrelevant documents that you sample. This may be varied somewhat depending on the circumstances of the review project. In my definition above I said the Negatives were the documents presumed to be irrelevant that will not be produced. That was intentionally somewhat ambiguous. I will later state with less ambiguity that Negatives are the documents not produced (or logged for privilege). Still, I think this application should be varied sometimes according to the circumstances.

In some circumstances you could improve the reliability of an elusion search by excluding from the null set all documents coded irrelevant by an attorney, whether with or without actual review. The improvement would arise from shrinking the set of documents to be sampled. This would allow you to focus your sample on the documents most likely to have an error.

For example, you could have 50,000 documents out of 900,000 not produced that have actually been read or skimmed by an attorney and coded irrelevant. You could have yet another 150,000 that have not actually been read or skimmed by an attorney, but have been bulk coded irrelevant by an attorney. This would not be uncommon in some projects. So even though you are not producing 900,000 documents, you may have manually coded 200,000 of those, and only 700,000 have been presumed irrelevant on the basis of computer search. Typically in predictive coding driven search that would be because their ranking at the end of the CAL review was too low to warrant further consideration. In a simplistic keyword search they would be documents omitted from attorney review because they did not contain a keyword.

In other circumstances you might want to include the documents attorneys reviewed and coded as irrelevant, for instance, where you were not sure of the accuracy of their coding for one reason or another. Even then you might want to exclude other sets of documents on other grounds. For instance, in predictive coding projects you may want to exclude some bottom strata of the rankings of probable relevance. For example, you could exclude the bottom 25%, or maybe the bottom 10%, or bottom 2%, where it is highly unlikely that any error has been made in predicting the irrelevance of those documents.

In the data visualization diagram I explained in Visualizing Data in a Predictive Coding Project – Part Two (shown right) you could exclude some bottom portion of the ranked documents shown in blue. You could, for instance, limit the Negatives searched to those few documents in the 25% to 50% probable relevance range. Of course, whenever you limit the null set, you have to be careful to adjust the projections accordingly. Thus, if you find 1% False Negatives in a sample of a presumably enriched sub-collection of 10,000 out of 100,000 total Negatives, you cannot just project 1% of 100,000 and assume there are a total of 1,000 False Negatives (plus or minus of course). You have to project the 1% onto the size of the sub-collection sampled, and so it would be 1% of 10,000, or 100 False Negatives, not 1,000, again subject to the confidence interval range, a range that varies according to your sample size.
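A two-line sketch, again in illustrative Python with a made-up function name, of the adjustment just described: project the sampled error rate onto the size of the sub-collection actually sampled, not onto the full set of Negatives.

def project_false_negatives(fn_rate, sampled_subcollection_size):
    # Project the sampled False Negative rate onto the population actually sampled.
    return fn_rate * sampled_subcollection_size

print(project_false_negatives(0.01, 10_000))  # 100.0 False Negatives, not 1,000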

Remember, the idea is to focus your random sample for finding mistakes on the group of documents that are most likely to contain them. There are many possibilities.

In still other scenarios you might want to enlarge the Negatives to include documents that were never included in the review project at all. For instance, if you collected emails from ten custodians, but eliminated three as unlikely to have relevant information as per Step 6 of the EDBP (culling), and only reviewed the email of seven custodians, then you might want to include select documents from the three excluded custodians in the final elusion test.

There are many other variations and issues pertaining to the scope of the Negatives set searched in ei-Recall. There are too many to discuss in this already long article. I just want to point out in this introduction that the makeup and content of the Negatives sampled at the end of the project is not necessarily cut and dried.

Advantage of End Project Sample Reviews

Basing recall calculations on a sample made at the end of a review project is always better than relying on a sample made at the beginning. This is because final relevance standards will have been determined and fully articulated by the end of a project, whereas at the beginning of any review project the initial relevance standards will be tentative. They will typically change in the course of the review. This is known as relevance shift, where the understanding of relevance changes and matures during the course of the project.

This variance in adjudication between samples can be corrected during, and at the end of, the project by careful re-review and correction of the initial sample relevance adjudications. This also requires correcting all codings made during the review in the same way, not just the inconsistencies in the sample codings.

The time and effort spent to reconcile the adjudications might be better spent on a larger sample size for the final elusion sample. Except for major changes in relevance, where you would have to go back and make corrections as part of quality control anyway, it may not be worth the effort to remediate the first sample just so you can use it again at the end of the project alongside an elusion sample. That is because of the unfortunate statistical fact of life that the two recall methods cannot be added to one another to create a third, more reliable number. I know. I tried. The two recall calculations are apples and oranges. Although a comparison between the two range values is interesting, they cannot somehow be stacked together to improve the reliability of either or both of them.

Prevalence Samples May Still Help Guide Search, Even Though They Cannot Be Reliably Used to Calculate Recall

I like to make a prevalence sample at the beginning of a project to get a general idea of the number of relevant documents there might be, and I emphasize general and might, in order to help with my search. I used to make recall calculations from that initial sample too, but no longer (except in small cases, under the theory it is better than nothing), because it is simply too unreliable. The prevalence samples can help with search, but not with recall calculations used to test the quality of the search results. For quality testing it is better to sample the null set and calculate recall using the ei-Recall method.

Still, if you are like me, and like to take a sample at the start of a project for search guidance purposes, then you might as well do the math at the end of the project to see what the recall range estimate is using the prevalence method described in In Legal Search Exact Recall Can Never Be Known. It is interesting to compare the two recall ranges, especially if you take the time and trouble to go back and correct the first prevalence sample adjudications to match the calls made in your second, null set sample (that can mitigate the problem of concept drift and reviewer inconsistencies). Still, go with the recall range values of ei-Recall, not prevalence. It is more reliable. Moreover, do not waste your time, as I did for weeks, trying to somehow average out the results. I traveled down that road and it is a dead-end.

Claim for ei-Recall

My claim is that ei-Recall is the most accurate recall range estimate method possible that uses only algebraic math within everyone’s grasp. (This statement is not exactly true because binomial confidence interval calculations are not simple algebra, but we avoid these calculations by using an online calculator. Many are available.) I also claim that ei-Recall is more reliable, and less prone to error in more situations, than a standard prevalence based recall calculation, even if the prevalence recall includes ranges as I did in In Legal Search Exact Recall Can Never Be Known.

I also claim that my range based method of recall calculation is far more accurate and reliable than any simple point based recall calculations that ignore or hide interval ranges, including the popular eRecall. This latter claim is based on what I proved in In Legal Search Exact Recall Can Never Be Known, and is not novel. It has long been known and accepted by all experts in random sampling that recall projections that do not include high-low ranges are inexact and often worthless and misleading. And yet attorneys and judges are still relying on point projections of recall to certify the reasonableness of search efforts. The legal profession and our courts need to stop relying on such bogus science and turn instead to ei-Recall.

I am happy to concede that scientists who specialize in this area of knowledge like Dr. Webber and Professor Cormack can make slightly more accurate and robust calculations of binomial recall range estimates by using extremely complex calculations such as Webber’s Beta-binomial formula.

Such alternative black box type approaches are, however, disadvantaged by the additional expense of the expert consultations and testimony needed to implement and explain them. (Besides, at the present time, neither Webber nor Cormack are available for such consultations.) My approach is based on multiplication and division, and simple logic. It is well within the grasp of any attorney or judge (or anyone else) who takes the time to study it. My relatively simple system thus has the advantage of ease of use, ease of understanding, and transparency. These factors are very important in legal search.

Rl = TP / (TP+FNh)          Rh = TP / (TP+FNl)

Although the ei-Recall formula may seem complex at first glance, it is really just ratios and proportions. I reject the argument some make that calculations like this are too complex for the average lawyer. Ratios and proportions are part of the Grade 6 Common Core Curriculum. Reducing word problems to ratios and proportions is part of the Grade 7 Common Core, and so too are basic statistics and probability.

Overview of How ei-Recall Works

ei-Recall is designed for use at the end of a search project as a final quality assurance test. A single random sample is taken of the documents that are not marked relevant and so will not be produced or privileged-logged – the Negatives. (As mentioned, the definition and scope of the Negatives can be varied depending on project circumstances.) The sample is taken to estimate the total number of False Negatives, documents falsely presumed irrelevant that are in fact relevant. The estimate projects a range of the probable total number of False Negatives using a binomial interval range in accordance with the sample size. A simplistic and illusory point value projection is not used. The high end of the range of probable False Negatives is shown in the formula and graphic as FNh. The low end of the projected range of False Negatives is FNl.

This type of search is generally called an elusion based recall search. As will be discussed here in some detail, well-known software expert and entrepreneur Herb Roitblat, who has a PhD in psychology, advocates for the use of a similar elusion based recall calculation that uses only the point projection of the total False Negatives. He has popularized a name for this method, eRecall, and uses it with his company’s software.

I here offer a more accurate alternative that avoids the statistical fallacies of point projections. Roitblat’s eRecall, and other ratio calculations like it, ignore the high and low interval range inherent in all sampling. My version includes the interval range, and for this reason an “i” is added to the name: ei-Recall.

ei-Recall is more accurate than eRecall, especially when working with low prevalence datasets, and, unlike eRecall, is not misleading because it shows the total range of recall. It is also more accurate because it uses the exact count of the documents verified as relevant at the end of the project, and does not estimate the True Positives value. I offer ei-Recall to the e-discovery community as a statistically valid alternative, and urge its speedy adoption.

To be continued ….


In Legal Search Exact Recall Can Never Be Known

December 18, 2014

“Uncertainty is an uncomfortable position. But certainty is an absurd one.” VOLTAIRE

In legal search you can never know exactly what recall level you have attained. You can only know a probable range of recall. For instance, you can never know that you have attained 80% recall, but you can know that you have attained between 70% and 90% recall. Even the range is a probable range, not certain. Exact knowledge of recall is impossible because there are too many documents in legal search to ever know for certain how many of them are relevant, and how many are irrelevant.

Difficulty of Recall in Legal Search 

60% Red Fish Recall

In legal search recall is the percentage of target documents found, typically relevant documents. Thus, for instance, if you know that there are 100 relevant documents in a collection of 1,000, and you find 80 of them, then you know that you have attained 80% recall.

Exact recall calculations are possible in small volumes of documents like that because it is possible to know how many relevant documents there are. But legal search today does not involve small collections of documents. Legal search involves tens of thousands of documents, even tens of millions of documents. When you get into large collections of documents like that it is impossible to know how many of the documents in the collection are relevant to any particular legal issue. That has to do with several things: human fallibility, the vagaries of legal relevance, and, to some extent, cost limitations. (Although even with unlimited funds you could never know for sure that you had found all relevant documents in a large collection of documents.)

Sampling Allows for Calculations of Probable Ranges Only

Since you cannot know exactly how many relevant documents there are in a large population of documents, all you can do is sample and estimate recall. When you start sampling you can never know exact values. You can only know probable ranges according to statistics. That, in a nutshell, is why it is impossible to ever know exactly what recall you have attained in a legal search project.

Even though you can never know an exact recall value, it is still worth trying to calculate recall because you can know the probable range of recall that you have attained at the end of a project.

How Probable Range Calculations Are Helpful

This qualified knowledge of recall range provides evidence, albeit limited, that your efforts to respond to a request for production of documents have been proportional and reasonable. The law requires this. Unreasonably weak or negligent search is not permitted under the rules of discovery. Failure to comply with these rules can result in sanctions, or at least costly court ordered production supplements.

Recall range calculations are helpful in that they provide some proof of the success of your search efforts. They also provide some evidence of your quality control efforts. That is the main purpose of recall calculations in e-discovery, to assist in quality control and quality assurance. Either way, probable recall range calculations can significantly buttress the defensibility of your legal search efforts.

20% Red Fish Recall

In some projects the recall range may seem low. Fortunately, there are many other ways to prove reasonable search efforts beyond offering recall measurements. Furthermore, the law generally assumes reasonable efforts have been made until evidence to the contrary has been provided. For that reason evidence of reasonable, proportionate efforts may never be required.

Still, in any significant legal review project I try to make recall calculations for quality control purposes. Now that my understanding of math, sampling, and statistics has matured, when I calculate recall these days I calculate it as a probable range, not a single value. The indisputable mathematical truth is that there is no certainty in recall calculations in e-discovery. Any claims to the contrary are false.

General Example of Recall Range 

Here is a general example of what I mean by recall range, the first of several. You cannot know that you have attained 80% recall. But you can know with some probable certainty, say with the usual 95% confidence level, that you have attained between 70% and 90% recall.

You can also know that the most likely value within the range is 80% recall, but you can never know this for sure. You can only know the range of values, which, in turn, is a function of the confidence interval used in the sampling. The confidence intervals, also known as the margin of error, are in turn a function of the sample size, and, to some extent, also the size of the general collection sampled.


Confidence Levels

Even your knowledge of the recall range created by confidence intervals is subject to a confidence level caveat, typically 95%. That is what I mean by probable range. A confidence level of 95% simply means that if you were to take 100 different samples of the same document collection, then ninety-five times out of a hundred the true recall value would fall inside the confidence interval calculated from the sample. Conversely, five times out of one hundred the true recall value would fall outside the confidence interval. This may sound very complicated, and it can be very hard to understand, but the math component is all just fractions and well within any lawyer’s abilities.
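For readers who prefer to see rather than read, here is a small simulation sketch in Python. It assumes the NumPy and SciPy libraries are available, and the function name and numbers are just for illustration. It draws many random samples of 1,534 documents from a collection with a known 25% prevalence, computes an exact binomial interval for each sample, and counts how often that interval contains the true value.

import numpy as np
from scipy.stats import beta

def clopper_pearson(k, n, confidence=0.95):
    # Exact binomial interval for k "hits" in a sample of n documents.
    alpha = 1 - confidence
    low = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    high = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return low, high

rng = np.random.default_rng(0)
true_prevalence, sample_size, trials = 0.25, 1534, 2000
contained = 0
for _ in range(trials):
    relevant_in_sample = rng.binomial(sample_size, true_prevalence)
    low, high = clopper_pearson(relevant_in_sample, sample_size)
    if low <= true_prevalence <= high:
        contained += 1
print(contained / trials)  # roughly 0.95, or a bit above, since the exact interval is conservative

Roughly 95 intervals out of every 100 contain the true value; the remaining few are the outlier samples discussed below.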

A few more detailed examples should clarify, examples that I have been fortunate enough to have double checked by one of the world’s leading experts on statistical analysis like this, William Webber, who has a PhD in Information Science. He is my go to science consultant. William, like Gordon Cormack, and others, has patiently worked with me over the years to understand this kind of statistical analysis. William graciously reviewed an advance copy of this blog (actually several) and double checked and often corrected these examples. Any mistakes still remaining are purely my own.

For an example, I go back to the hypothetical search project I described in Part Three of Visualizing Data in a Predictive Coding Project. This was a search of 1,000,000 documents where I took a random sample of 1,534 documents. A sample size of 1,534 created a confidence interval of 2.5% and confidence level of 95%. This means your sample value is subject to a 2.5% error rate in both directions, high and low, for a total error range of 5%. This is a 5% error of the total One Million document population (50,000 documents), not just 5% of the 1,534 sample (77 documents).

In my sample of 1,534 documents 384 were determined to be relevant and 1,150 irrelevant. This is a ratio of 25% (384/1534). This does not mean that you can then multiply 25% times the total population and know that you have exactly 250,000 relevant documents. That is where the whole idea of a range of probable knowledge comes in. All you can ever know is that the prevalence is between 22.5% and 27.5%, which is 25% plus or minus 2.5%, the nominal confidence interval. Thus all we can ever know from that one sample is that there are between 225,000 and 275,000 relevant documents. (This simple spread of 2.5% both ways as the interval is called a Gaussian estimation. Dr. Webber points out that this 2.5% range should be called a nominal interval. It is only exact if there happens to be a 50% prevalence of the target in the total population, a so-called normal distribution. Exact interval values can only be attained by use of binomial interval calculations (here 22.88% – 27.28%) that take actual prevalence into consideration. I am going to ignore the binomial adjustment in this blog to try to keep these first examples easier to follow, but, in statistics, the binomial distribution is the preferred calculation for intervals on proportions, not the Gaussian distribution, aka the Normal distribution.)
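As a quick check of the binomial figures in the parenthetical above, here is a short Python sketch (again assuming SciPy is available; different exact-interval calculators can differ slightly in the last decimal place).

from scipy.stats import beta

k, n, collection = 384, 1534, 1_000_000
low = beta.ppf(0.025, k, n - k + 1)     # exact (Clopper-Pearson) lower bound
high = beta.ppf(0.975, k + 1, n - k)    # exact (Clopper-Pearson) upper bound
print(round(low * 100, 2), round(high * 100, 2))      # close to the 22.88 and 27.28 quoted above
print(int(low * collection), int(high * collection))  # projected range of relevant documents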


Even this knowledge of range is subject to the confidence level limitation. In our example the 95% confidence level means that if you were to take a random sample of 1,534 documents one hundred times, then ninety-five times out of that one hundred you would have an interval range that contains the true value. The true value in legal search is a kind of fictitious number representing the actual number of relevant documents in the collection. I say fictitious because, as stated before, in legal search the target we are searching for – relevant documents – is somewhat nebulous, vague and elusive. Certainty is never possible in legal search, just probabilities.

Still, this legal truth problem aside, we assume in statistical sampling that the mid-ratio, here 25%, is the center of the true value, with a range of 2.5% both ways. In our hypothetical the so-called true value is from 225,000 to 275,000 relevant documents. If you repeat the sample of 1,534 documents one hundred times, you will get a variety of different intervals over the number of relevant documents in the collection. In 95% of the cases, the interval will contain the true number of relevant documents.  In 5% of the cases, the true value will fall outside the interval.


In 95% of the samples the different intervals created will include the so-called “true value” of 25%, 250,000 documents

Confidence Level Examples

In several of the one hundred samples you will probably see the exact same or nearly the same numbers. You will again find 384 of the 1,534 sample to be relevant and 1,150 irrelevant. On other samples you may have one or two more or less relevant, still creating a 25% ratio (rounding off the tenths of a percent). On another random draw of 1,534 documents you might find 370 documents are relevant and 1,164 are irrelevant. That is a difference of fourteen documents, and brings the ratio down to 24%. Still, the plus or minus 2.5% range of the 24% value is from 21.5% to 26.5%. The so-called true value of 25% is thus still well inside the range of that sample.

Only when you find 345 or fewer relevant documents, instead of 384 relevant, or when you find 422 or more relevant documents, instead of 384 relevant, will you create the five in one hundred (5%) outlier event inherent in the 95% confidence level. Do the math with me here. It is simple proportions.

If you find 345 relevant documents in your sample of 1,534, which I call the low lucky side of the confidence level, then this creates a ratio of 22.49% (345/1534=0.2249), plus or minus 2.5%. This means a range of from between 19.99% and 24.99%. This projects a range of 199,900 to 249,900 relevant documents in the entire collection. The 24.99% value is just under the interval range of the so-called true value of 25% and 250,000 relevant documents.

At the other extreme, which I call the unlucky side, as I will explain later, if you find 422 relevant documents in your sample of 1,534, then this creates a ratio of 27.51% (422/1534=0.2751), plus or minus 2.5%. This means a range of 25.01% to 30.01%. This projects a range of 250,100 to 300,100 relevant documents in the entire collection.


The 25.01% value at the low end of the 27.51% range of plus or minus 2.5% is just over the so-called true value of 25% and 250,000 relevant documents.

[Combined chart: true value bell curve (left) and unlucky high value bell curve (right)]

In the above combined charts the true value bell curve is shown on the left. The unlucky high value bell curve is shown on the right. The low-end of the high value curve range is 25.01% (shown by the red line). This is just to the right of the 25% center point of the true value curve.

The analysis shows that in this example a variance of only 38 or 39 relevant documents is enough to create the five times out of one hundred sampling event. This means that ninety-five times out of one hundred the number of relevant documents found will be between 346 and 421. Most of the time the number of documents found will be closer to 384. That is what confidence level means. There are important recall calculation implications to this random sample variation that I will spell out shortly, especially where only one random sample is taken.

To summarize, in this hypothetical sample of 1,534 documents, the 95% confidence level means that the outlier result, where an attorney determines that fewer than 346 documents are relevant, or more than 421 documents are relevant, is likely to happen five times out of one hundred. This 75 document variance (421-346=75) is likely to happen because the documents chosen at random will be different. It is inherent to the process of random sampling. The variance happens even if the attorney has been perfectly consistent and correct in his or her judgments of relevance.
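The “do the math with me” proportions above can be reproduced in a couple of lines of Python (illustrative only; the helper name is made up, and the calculation uses the same nominal plus-or-minus 2.5% interval the examples use).

def nominal_range(relevant_in_sample, sample_size=1534, interval=0.025):
    # Sample ratio plus and minus the nominal confidence interval.
    ratio = relevant_in_sample / sample_size
    return ratio - interval, ratio + interval

print(nominal_range(345))  # roughly (0.1999, 0.2499): the top of the range is just under 25%
print(nominal_range(422))  # roughly (0.2501, 0.3001): the bottom of the range is just over 25%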

Inherent Vagaries of Relevance Judgments and Human Consistency Errors Create Quality Control Challenges

This assumption of human perfection in relevance judgment is, of course, false for most legal review projects. I call this the fuzzy lens problem of legal search. See Top Ten e-Discovery Predictions for 2014 (prediction number five). Consistency, even in reviews of small samples of 1,534 documents, only arises when special care and procedures are in place for attorney review, including multiple reviews of all grey area documents and other error detection procedures. This is because of the vagaries of relevance and the inconsistencies in human judgments mentioned earlier. These errors in human legal judgment can be mitigated and constrained, but never eliminated entirely, especially when you are talking about large numbers of samples.

This error component in legal judgments is necessarily a part of all legal search. It adds even more uncertainties to the uncertainties already inherent in all random sampling, expressed as confidence levels and confidence intervals. As Maura Grossman and Gordon Cormack put it recently: “The bottom line is that inconsistencies in responsiveness determinations limit the ability to estimate recall.” Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’ Federal Courts Law Review, Vol. 7, Issue 1 (2014) at 304. The legal judgment component to legal search is another reason to be cautious in relying on recall calculations alone to verify the quality of our work.

Calculating Recall from Prevalence

You can calculate recall, the percent of the total relevant documents found, based upon your sample calculation of prevalence and the final number of relevant documents identified. Again, prevalence means the percentage of relevant documents in the collection. The final number of relevant documents identified is the total number of relevant documents found by the end of a legal search project. These are the total number of documents either produced or logged.

With these two numbers you can calculate recall. You do so by dividing the final number of relevant documents identified by the projected total number of relevant documents based on the prevalence range of the sample. It is really easier than it sounds as a couple of examples will show.

Examples of Calculating Recall from Prevalence

To start off very simply, assume that our prevalence projection was between 10,000 and 15,000 relevant documents in the entire collection. The spot or point projection was 12,500, plus or minus 2,500 documents. (Again, I am still excluding the binomial interval calculation for simplicity of illustration purposes, but would not advise this omission for recall calculations using prevalence.)

Next assume that by the end of the project we had found 8,000 relevant documents. Our recall would be calculated as a range. The high end of the recall range would be created by dividing 8,000, the number of relevant documents found, by the low end of the total number of relevant documents projected for the whole collection, here 10,000. That gives us a high of 80% recall (8,000/10,000). The low end of the recall range is calculated by dividing 8,000 by the high end of the total number of relevant documents projected for the whole collection, here 15,000. That gives us a low of 53% recall (8,000/15,000).
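The same arithmetic in a short Python sketch (illustrative only; the function name is made up, and the cap at 100% anticipates the “lucky” example later in this article, where the raw fraction can exceed one).

def recall_range_from_prevalence(found, projected_low, projected_high):
    # Recall range based on the projected range of relevant documents from a prevalence sample.
    r_low = found / projected_high
    r_high = min(1.0, found / projected_low)
    return r_low, r_high

print(recall_range_from_prevalence(8000, 10_000, 15_000))  # roughly (0.53, 0.80)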


Thus our recall rate for this project is between 53% and 80%, subject again, of course, to the 95% confidence level uncertainty. It would not be correct to simply use the spot projection of prevalence, here 12,500 documents, and say that we had attained a recall of 64% (8,000/12,500). We can only say, with a 95% confidence level, that we attained between 53% and 80% recall.

Yes. I know what you are thinking. You have heard every vendor in the business, and most every attorney who speaks on this topic, myself included, proclaim at one time or another that an exact recall level has been attained in a review project. But these proclamations are wrong. You can only know recall range, not a single value, and even your knowledge of range must have a confidence level caveat. This article is intended to stop that imprecise usage of language. The law demands truth from attorneys and those who would serve them. If there is any profession that understands the importance of truth and precision of language, it is the legal profession.

Let us next consider our prior example where we found 384 relevant documents in our sample of 1,534 documents from a total collection of 1,000,000. This created a prevalence projection of from 225,000 to 275,000 relevant documents. It had a spot or point projection of 25%, with a 2.5% interval range of from 22.5% to 27.5%. (The intervals when the binomial adjustment is used are 22.88% – 27.28%.)

If at the end of the project the producing party had found 210,000 relevant documents, this would mean they may claim a recall of from between 76.36% (210,000/275,000) and 93.33% (210,000/225,000). But even then we would have to make this recall range claim of 76.36% – 93.33% with the 95% confidence level disclaimer.


Impact of 95% Confidence Level

Even if you assume perfect legal judgment and consistency, multiple random draws of the same 1,000,000 collection of documents in this example could result in a projection of less than 225,000 relevant documents, or more than 275,000 relevant documents. As seen, with the 95% confidence level this happens five times out of one hundred. That is the same as one time out of twenty, or 5%.

Those are acceptable odds for almost all scientific and medical research. They are also reasonable for all legal search efforts, so long as you know that this 5% caveat applies, meaning that one out of twenty times your range may be so far off as to not even include the true value, and so long as you understand the impact that a 5% chance outlier sample can have on your recall calculations.

The 5% confidence level ambiguity can have a very profound effect on recall calculations based on prevalence alone. For instance, consider what happens when you take only one random sample and it happens to be a 5% outlier sample. Assume the sample happens to have less than 346 relevant documents in it, or more than 421 relevant documents. If you forget the impact of the 95% confidence level uncertainty, you might take the confidence intervals created by these extremes as certain true values. But they are not certain, not at all. You cannot know whether the one sample you took is an outlier sample without taking more samples. By chance it could have been a sample with an unusually large, or unusually small number of relevant documents in it. You might assume that your sample created a true value, but that would only be true 95% of the time.

You should always remember when taking a random sample that the documents selected may by chance not be truly representative of the whole. They may instead fall within an outlier range. You may have pulled a 5% outlier sample. This would, for instance, be the case in our hypothetical true value of 25% if you pulled a sample that happened to have less than 346 or more than 421 relevant documents.

You might forget this fact of life of random sampling and falsely assume, for instance, that your single sample of 1,534 documents, which happened to have, let’s say, 425 relevant documents in it, was representative of all one million documents. You might assume from this one sample that the prevalence of the whole collection was 27.71% (425/1534) with a 2.5% interval of from between 25.21% to 30.21% (again ignoring for now the binomial adjustment (25.48% – 30.02%)). You might assume that 27.71% was an absolute true value, and the projected relevance range of from 252,100 to 302,100 relevant documents was a certainty.

Only if you took a large number of additional samples would you discover that your first sample was an unlucky outlier that occurs only 2.5% of the time. (You cannot just say take 19 more samples, because each one of those samples would also have a randomness element. But if you took one hundred more samples the “true value” would almost certainly come out.) By repeating the sampling many times, you might find that the average number of relevant documents was actually 384, not the 425 that you happened to draw in the first sample. You would thus find by more sampling that the true value was actually 25%, not 27.71%, that there was probably between 225,000 and 275,000 relevant documents in the entire collection, not between 252,100 and 302,100 as you first thought.

The same thing could happen on what I call the low, lucky side. You could draw a sample with, let’s say, only 342 relevant documents in it the first time out. This would create a spot projection prevalence of 22.29% (342/1534) with a range of 19.79% – 24.79%; projecting to between 197,900 – 247,900 relevant documents. The next series of samples could have an average of 384 relevant documents, our familiar range of 225,000 to 275,000.

Outliers and Luck of Random Draws

So what does this luck of the draw in random sampling mean to recall calculations? And why do I call the low side rarity lucky, and the high side rarity unlucky? The lucky or unlucky characterization is from the perspective of the legal searcher making a production of documents. From the perspective of the requesting party the opposite attributes would apply, especially if only a single sample for recall was taken for quality control purposes.

To go back again to our standard example where we find 384 relevant documents in our sample of 1,534 from a total collection of 1,000,000, our prevalence projection is that there are from 225,000 to 275,000 relevant documents in the total collection. If at the end of the project the producing party has found 210,000 relevant documents, this means, as previously shown, they may claim a recall of from between 76.36% (210,000/275,000) and 93.33% (210,000/225,000). But they should do so with the 95% confidence level disclaimer.

As discussed, the confidence level disclaimer means that one time out of twenty (5%) the estimate may be based on an outlier sample. Thus, for instance, one time out of forty (2.5% of the time) the sample may have an unluckily large number of relevant documents in it, let us assume again 425 relevant, and not 384. As shown, that creates a prevalence spot projection of 27.71% with a range of from 252,100 to 302,100 documents.

Assume again that the producing party finds 210,000 relevant documents. This time they may only claim a recall of from between 69.51% (210,000/302,100) and 83.3% (210,000/252,100).


That is why I call that the unlucky random sample for the producing party. In 95% of the random samples they would have drawn a more representative sample, with a number of relevant documents near 384, and could then have claimed a significantly higher recall range of 76.36% to 93.33%. So based on bad luck alone their recall range has dropped from 76.36% – 93.33% to 69.5% – 83.3%. That is a significant difference, especially if a party is naively putting a great deal of weight on recall value alone.

It is easy to see the flip side of this random coin. The producing party could be lucky (this would happen in 2.5% of the random draws) and by chance draw a sample with fewer relevant documents than the low end of the range. Let us here assume again that the random sample had only 342 relevant documents in it, and not 384. This would create a spot projection prevalence of 22.29% (342/1534) with a range of 19.79% – 24.79%, projecting between 197,900 – 247,900 relevant documents.

Then when the producing party found 210,000 relevant documents it could claim a much higher recall range. It would be from between 84.7% recall (210,000/247,900) to 106% recall (210,000/197,900). The latter, 106%, is, of course, a logical impossibility, but one that happens when calculating recall based on prevalence, especially when not using the more accurate binomial calculation. We take that to mean near 100%, or near total recall.


Under both scenarios the number of relevant documents found was the same, 210,000, but as a result of pure chance, one review project could claim from 84.7% to 100% recall, and another only 69.5% to 83.3% recall. The difference between 84.7%-100% and 69.5%-83.3% is significant, and yet it was all based on the luck of the draw. It had nothing whatsoever to do with effort, or actual success. It was just based on chance variables inherent in sampling statistics. This shows the dangers of relying on recall based on one prevalence sample.

Conclusion

These examples show why I am skeptical of recall calculations, even a recall value that is correctly described in terms of a range, if it is only based on a prevalence sample. If the project can afford it, a better practice is to take a second sample at the end of the project and make recall calculations from the second sample. If the project cannot afford two samples, you would be better off, from the point of view of recall calculations, to skip the first prevalence sample altogether and just rely on an end-of-project sample. Taking two samples doubles the sampling costs from around $1,500 to $3,000, assuming, as I do, that a sample of 1,534 documents can be judged, and quality controlled, for between $1,000 and $2,000. This two-sample review cost may be appropriate in many projects to help determine the success of the search efforts.

When the cost of a second sample is a reasonable, proportionate expense, I suggest that the second sample not repeat the first, that it not sample again the entire collection for a comparative second calculation of prevalence. Instead, I suggest that the second sample be made for calculation of False Negatives. This means that the second sample would be limited to those documents considered to be irrelevant by the end of the project (sometimes called the discard pile or null set). More on this in a coming blog.

