In legal search you can never know exactly what recall level you have attained. You can only know a *probable range* of recall. For instance, you can never know that you have attained 80% recall, but you can know that you have attained between 70% and 90% recall. Even the range is a *probable* range, not certain. Exact knowledge of recall is impossible because there are too many documents in legal search to ever know for certain how many of them are relevant, and how many are irrelevant.

**Difficulty of Recall in Legal Search**

In legal search *recall* is the percentage of target documents found, typically relevant documents. Thus, for instance, if you know that there are 100 relevant documents in a collection of 1,000, and you find 80 of them, then you know that you have attained 80% recall.

Exact recall calculations are possible in small volumes of documents like that because it is possible to know how many relevant documents there are. But legal search today does not involve small collections of documents. It involves tens of thousands, even tens of millions, of documents. When you get into large collections like that it is impossible to know how many of the documents in the collection are relevant to any particular legal issue. That has to do with several things: human fallibility, the vagaries of legal relevance, and, to some extent, cost limitations. (Although even with unlimited funds you could never know for sure that you had found all relevant documents in a large collection of documents.)

**Sampling Allows for Calculations of Probable Ranges Only**

Since you cannot know exactly how many relevant documents there are in a large population of documents, all you can do is sample and estimate recall. When you start sampling you can never know exact values. You can only know *probable ranges* according to statistics. That, in a nutshell, is why it is impossible to ever know exactly what recall you have attained in a legal search project.

Even though you can never know an exact recall value, it is still worth trying to calculate recall because you *can know* the *probable range of recall* that you have attained at the end of a project.

**How Probable Range Calculations Are Helpful**

This qualified knowledge of *recall range* provides evidence, albeit limited, that your efforts to respond to a request for production of documents have been proportional and reasonable. The law requires this. Unreasonably weak or negligent search is not permitted under the rules of discovery. Failure to comply with these rules can result in sanctions, or at least costly court ordered production supplements.

Recall range calculations are helpful in that they provide some proof of the success of your search efforts. They also provide some evidence of your quality control efforts. That is the main purpose of recall calculations in e-discovery, to assist in quality control and quality assurance. Either way, probable recall range calculations can significantly buttress the defensibility of your legal search efforts.

In some projects the recall range may seem low. Fortunately, there are many other ways to prove reasonable search efforts beyond offering recall measurements. Furthermore, the law generally assumes reasonable efforts have been made until evidence to the contrary has been provided. For that reason evidence of reasonable, proportionate efforts may never be required.

Still, in any significant legal review project I try to make recall calculations for quality control purposes. Now that my understanding of math, sampling, and statistics has matured, when I calculate recall these days I calculate it as a probable *range*, not a single value. The *indisputable mathematical truth* is that there is *no certainty* in recall calculations in e-discovery. Any claims to the contrary are false.

**General Example of Recall Range**

Here is a general example of what I mean by *recall range*, the first of several. You cannot know that you have attained 80% recall. But you can know with some probable certainty, say with the usual 95% confidence level, that you have attained between 70% and 90% recall.

You can also know that the *most likely value* within the range is 80% recall, but you can never know for sure. You can only know the *range* of values, which, in turn, is a function of the confidence interval used in the sampling. The confidence interval, also known as the *margin of error*, is in turn a function of the sample size and, to some extent, the size of the collection sampled.

**Confidence Levels**

Even your knowledge of the recall *range* created by confidence intervals is subject to a *confidence level* caveat, typically 95%. That is what I mean by *probable* range. A confidence level of 95% simply means that if you were to take one hundred different samples of the same document collection, then ninety-five times out of one hundred the *true recall value* would fall inside the confidence interval calculated from each sample. Conversely, five times out of one hundred the *true recall value* would fall outside the confidence interval. This may sound very complicated, and it can be very hard to understand, but the math component is all just fractions and well within any lawyer’s abilities.

A few more detailed examples should clarify, examples that I have been fortunate enough to have double checked by one of the world’s leading experts on statistical analysis like this, William Webber, who has a PhD in Information Science. He is my *go to* science consultant. William, like Gordon Cormack, and others, has patiently worked with me over the years to understand this kind of statistical analysis. William graciously reviewed an advance copy of this blog (actually several) and double checked and often corrected these examples. Any mistakes still remaining are purely my own.

For an example, I go back to the hypothetical search project I described in Part Three of *Visualizing Data in a Predictive Coding Project*. This was a search of 1,000,000 documents where I took a random sample of 1,534 documents. A sample size of 1,534 created a confidence *interval* of 2.5% at a confidence *level* of 95%. This means your sample value is subject to a 2.5% error rate in both directions, high and low, for a total error range of 5%. This is 5% of the total one million document population (50,000 documents), not just 5% of the 1,534 sample (77 documents).
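As a rough check on where the 2.5% figure comes from, here is a minimal sketch using the standard normal-approximation formula for the margin of error of a proportion. The function name `margin_of_error`, the worst-case prevalence of 0.5, and the z-value of 1.96 for a 95% confidence level are assumptions of this sketch, not something taken from the article.

```python
import math

def margin_of_error(sample_size, population_size=None, prevalence=0.5, z=1.96):
    """Normal-approximation margin of error for a sampled proportion.

    prevalence=0.5 is the worst case and yields the 'nominal' interval.
    z=1.96 corresponds to a 95% confidence level.
    """
    standard_error = math.sqrt(prevalence * (1 - prevalence) / sample_size)
    if population_size is not None:
        # Finite population correction; negligible when the sample is a
        # tiny fraction of the collection.
        standard_error *= math.sqrt(
            (population_size - sample_size) / (population_size - 1)
        )
    return z * standard_error

moe = margin_of_error(1534, population_size=1_000_000)
print(f"margin of error: {moe:.2%}")                    # about 2.5%
print(f"documents either way: {moe * 1_000_000:,.0f}")  # about 25,000
```

Note how little the collection size matters here: the sample size does almost all of the work, which is why a sample of 1,534 gives roughly the same interval for one million documents as for ten million.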

In my sample of 1,534 documents 384 were determined to be relevant and 1,150 irrelevant. This is a ratio of 25% (384/1534). This does not mean that you can then multiply 25% times the total population and know that you have exactly 250,000 relevant documents. That is where the whole idea of a *range* of probable knowledge comes in. All you can ever know is that the prevalence is between 22.5% and 27.5%, which is 25% plus or minus 2.5%, the *nominal* confidence interval. Thus all we can ever know from that one sample is that there are between 225,000 and 275,000 relevant documents. (This simple spread of 2.5% both ways as the interval is called a *Gaussian* estimation. Dr. Webber points out that this 2.5% range should be called a *nominal* interval. It is only exact if there happens to be a 50% prevalence of the target in the total population, a so-called *normal* distribution. Exact interval values can only be attained by use of *binomial* interval calculations (here 22.88% – 27.28%) that take actual prevalence into consideration. I am going to ignore the *binomial* adjustment in this blog to try to keep these first examples easier to follow, but in statistics the binomial distribution is the preferred calculation for intervals on proportions, not the *Gaussian* distribution, aka the *Normal* distribution.)
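To make the difference between the nominal and binomial intervals concrete, here is a sketch that computes both for 384 relevant documents out of 1,534. The article does not say which binomial method produced 22.88% – 27.28%; the Clopper-Pearson ("exact") interval used below is an assumption, as are the function names, and its output should land close to, but not necessarily exactly on, the quoted figures.

```python
import math

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def clopper_pearson(k, n, confidence=0.95):
    """Clopper-Pearson interval for k successes in n trials, by bisection."""
    alpha = 1 - confidence
    # Lower bound: the prevalence at which seeing k or more relevant
    # documents has probability alpha / 2.
    lo, hi = 1e-12, k / n
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - binomial_cdf(k - 1, n, mid) < alpha / 2:
            lo = mid
        else:
            hi = mid
    lower = (lo + hi) / 2
    # Upper bound: the prevalence at which seeing k or fewer relevant
    # documents has probability alpha / 2.
    lo, hi = k / n, 1 - 1e-12
    for _ in range(60):
        mid = (lo + hi) / 2
        if binomial_cdf(k, n, mid) > alpha / 2:
            lo = mid
        else:
            hi = mid
    upper = (lo + hi) / 2
    return lower, upper

relevant, sample_size = 384, 1534
point = relevant / sample_size
nominal = (point - 0.025, point + 0.025)
binomial = clopper_pearson(relevant, sample_size)
print(f"nominal:  {nominal[0]:.2%} to {nominal[1]:.2%}")
print(f"binomial: {binomial[0]:.2%} to {binomial[1]:.2%}")
```

The binomial interval comes out narrower than the nominal one because the sample's prevalence of about 25% is well away from the 50% worst case that the nominal interval assumes.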

Even this knowledge of range is subject to the confidence *level* limitation. In our example the 95% confidence level means that if you were to take a random sample of 1,534 documents one hundred times, then in ninety-five of those one hundred samples you would have an interval range that *contains* the *true value*. The *true value* in legal search is a kind of fictitious number representing the actual number of relevant documents in the collection. I say *fictitious* because, as stated before, in legal search the target we are searching for – relevant documents – is somewhat nebulous, vague and elusive. Certainty is never possible in legal search, just probabilities.

Still, this *legal truth problem* aside, we assume in statistical sampling that the mid-ratio, here 25%, is the center of the *true value*, with a range of 2.5% both ways. In our hypothetical the so-called *true value* is from 225,000 to 275,000 relevant documents. If you repeat the sample of 1,534 documents one hundred times, you will get a variety of different intervals over the number of relevant documents in the collection. In 95% of the cases, the interval will contain the true number of relevant documents. In 5% of the cases, the true value will fall outside the interval.
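The repeated-sampling idea can be tried out with a small simulation. This is only a sketch under stated assumptions: the true prevalence is fixed at exactly 25%, each document is drawn independently (a close approximation when 1,534 documents are drawn from one million), relevance judgments are perfect, and the function name, trial count, and seed are all hypothetical choices.

```python
import random

def simulate_coverage(true_prevalence=0.25, sample_size=1534, margin=0.025,
                      trials=2000, seed=42):
    """Fraction of simulated samples whose nominal interval contains the true value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one sample and count how many documents come up relevant.
        relevant = sum(rng.random() < true_prevalence for _ in range(sample_size))
        estimate = relevant / sample_size
        # The nominal interval is estimate +/- margin; check whether it
        # contains the true prevalence.
        if abs(estimate - true_prevalence) <= margin:
            hits += 1
    return hits / trials

coverage = simulate_coverage()
print(f"intervals containing the true value: {coverage:.1%}")
```

Because the nominal plus or minus 2.5% interval is somewhat wider than the binomial interval at 25% prevalence, the simulated coverage tends to come out at or a little above the advertised 95%. That is consistent with Dr. Webber's point that the nominal interval is only exact at 50% prevalence.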

**Confidence Level Examples**

In several of the one hundred samples you will probably see the exact same or nearly the same numbers. You will again find 384 of the 1,534 sample to be relevant and 1,150 irrelevant. On other samples you may have one or two more or less relevant, still creating a 25% ratio (rounding off the tenths of a percent). On another random draw of 1,534 documents you might find 370 documents are relevant and 1,164 are irrelevant. That is a difference of fourteen documents, and brings the ratio down to 24%. Still, the plus or minus 2.5% range of the 24% value is from 21.5% to 26.5%. The so-called *true value* of 25% is thus still well inside the range of that sample.

Only when you find 345 or fewer relevant documents, instead of 384 relevant, or when you find 422 or more relevant documents, instead of 384 relevant, will you create the five in one hundred (5%) outlier event inherent in the 95% confidence level. Do the math with me here. It is simple proportions.

If you find 345 relevant documents in your sample of 1,534, which I call the low *lucky side* of the confidence level, then this creates a ratio of 22.49% (345/1534=0.2249), plus or minus 2.5%. This means a range of between 19.99% and 24.99%. This projects a range of 199,900 to 249,900 relevant documents in the entire collection. The 24.99% value at the top of that range is just *under* the so-called *true value* of 25% and 250,000 relevant documents.

At the other extreme, which I call the *unlucky side*, as I will explain later, if you find 422 relevant documents in your sample of 1,534, then this creates a ratio of 27.51% (422/1534=0.2751), plus or minus 2.5%. This means a range of 25.01% to 30.01%. This projects a range of 250,100 to 300,100 relevant documents in the entire collection.

The 25.01% value at the low end of the 27.51% range of plus or minus 2.5% is just *over* the so-called *true value* of 25% and 250,000 relevant documents.

In the above combined charts the *true value* bell curve is shown on the left. The unlucky high value bell curve is shown on the right. The low-end of the high value curve range is 25.01% (shown by the red line). This is just to the right of the 25% center point of the *true value* curve.

The analysis shows that in this example a variance of only 39 or 38 relevant documents is enough to create the five times out of one hundred sampling event. This means that ninety-five times out of one hundred the number of relevant documents found will be between 346 and 421. Most of the time the number of documents found will be closer to 384. That is what confidence level means. There are important recall calculation implications to this random sample variation that I will spell out shortly, especially where only one random sample is taken.
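The 345 and 422 cut-offs used above are simple proportions, and can be found mechanically. In this sketch the function name `outlier_thresholds` is hypothetical, and the nominal 2.5% interval is used, as in the article's examples.

```python
def outlier_thresholds(sample_size, true_prevalence, margin):
    """Return (low, high): the largest count whose interval falls wholly below
    the true prevalence, and the smallest count whose interval falls wholly above it."""
    low = max(k for k in range(sample_size + 1)
              if k / sample_size + margin < true_prevalence)
    high = min(k for k in range(sample_size + 1)
               if k / sample_size - margin > true_prevalence)
    return low, high

low, high = outlier_thresholds(1534, 0.25, 0.025)
print(low, high)  # 345 422
```

Any count from 346 through 421 produces an interval that still contains the 25% value.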

To summarize, in this hypothetical sample of 1,534 documents, the 95% confidence level means that the outlier result where an attorney determines that fewer than 346 documents are relevant, or more than 421 documents are relevant, is likely to happen five times out of one hundred. This 75 document variance (421-346=75) is likely to happen because the documents chosen at random will be different. It is inherent to the process of random sampling. The variance happens even if the attorney has been perfectly consistent and correct in his or her judgments of relevance.

**Inherent Vagaries of Relevance Judgments and Human Consistency Errors Create Quality Control Challenges**

This assumption of human perfection in relevance judgment is, of course, false for most legal review projects. I call this the *fuzzy lens* problem of legal search. *See Top Ten e-Discovery Predictions for 2014* (prediction number five). Consistency, even in reviews of small samples of 1,534 documents, only arises when special care and procedures are in place for attorney review, including multiple reviews of all grey area documents and other error detection procedures. This is because of the vagaries of relevance and the inconsistencies in human judgment mentioned earlier. These errors in human legal judgment can be mitigated and constrained, but never eliminated entirely, especially when you are talking about large numbers of samples.

This error component in legal judgments is necessarily a part of all legal search. It adds even more uncertainties to the uncertainties already inherent in all random sampling, expressed as confidence levels and confidence intervals. As Maura Grossman and Gordon Cormack put it recently: “*The bottom line is that inconsistencies in responsiveness determinations limit the ability to estimate recall.*” *Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review*,’ Federal Courts Law Review, Vol. 7, Issue 1 (2014) at 304. The legal judgment component to legal search is another reason to be cautious in relying on recall calculations alone to verify the quality of our work.

**Calculating Recall from Prevalence**

You can calculate recall, the percent of the total relevant documents found, based upon your sample calculation of *prevalence* and the *final number of relevant documents identified.* Again, *prevalence* means the percentage of relevant documents in the collection. The *final number of relevant documents identified* is the total number of relevant documents found by the end of a legal search project. These are the total number of documents either produced or logged.

With these two numbers you can calculate *recall*. You do so by dividing the *final number of relevant documents identified* by the projected total number of relevant documents based on the prevalence range of the sample. It is really easier than it sounds, as a couple of examples will show.

**Examples of Calculating Recall from Prevalence**

To start off very simple, assume that our prevalence projection was between 10,000 and 15,000 relevant documents in the entire collection. The spot or point projection was 12,500, plus or minus 2,500 documents. (Again, I am still excluding the *binomial* interval calculation for simplicity of illustration purposes, but would not advise this omission for recall calculations using prevalence.)

Next assume that by the end of the project we had found 8,000 relevant documents. Our recall would be calculated as a range. The high end of the recall range would be created by dividing 8,000, the number of relevant documents found, by the low end of the total number of relevant documents projected for the whole collection, here 10,000. That gives us a high of 80% recall (8,000/10,000). The low end of the recall range is calculated by dividing 8,000 by the high end of the total number of relevant documents projected for the whole collection, here 15,000. That gives us a low of 53% recall (8,000/15,000).

Thus our recall rate for this project is between 53% and 80%, subject again, of course, to the 95% confidence level uncertainty. It would **not be correct** to simply use the spot projection of prevalence, here 12,500 documents, and say that we had attained a recall of 64% (8,000/12,500). We can only say, at a 95% confidence level, that we attained between 53% and 80% recall.

Yes. I know what you are thinking. You have heard every vendor in the business, and most every attorney who speaks on this topic, myself included, proclaim at one time or another that an exact recall level has been attained in a review project. *But these proclamations are wrong. *You can only know recall *range*, not a single value, and even your knowledge of range must have a confidence level caveat. This article is intended to stop that *imprecise* usage of language. The law demands *truth* from attorneys and those who would serve them. If there is any profession that understands the importance of truth and precision of language, it is the legal profession.

Let us next consider our prior example where we found 384 relevant documents in our sample of 1,534 documents from a total collection of 1,000,000. This created a projection of from 225,000 to 275,000 relevant documents. It had a spot or point projection of 25%, with a 2.5% interval range of from 22.5% to 27.5%. (The intervals when the *binomial* adjustment is used are 22.88% – 27.28%.)

If at the end of the project the producing party had found 210,000 relevant documents, this would mean they may claim a recall of between 76.36% (210,000/275,000) and 93.33% (210,000/225,000). But even then we would have to make this recall range claim of 76.36% – 93.33% with the 95% confidence level disclaimer.
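Both worked examples follow the same two divisions, which can be sketched in a few lines. The function name `recall_range` is hypothetical; the inputs are the article's own figures.

```python
def recall_range(found, projected_low, projected_high):
    """Recall range from the number of relevant documents found and the
    projected low and high totals of relevant documents in the collection."""
    # Lowest recall: divide by the HIGH end of the projection.
    recall_low = found / projected_high
    # Highest recall: divide by the LOW end of the projection.
    recall_high = found / projected_low
    return recall_low, recall_high

low, high = recall_range(8_000, 10_000, 15_000)
print(f"{low:.0%} to {high:.0%}")   # 53% to 80%

low2, high2 = recall_range(210_000, 225_000, 275_000)
print(f"{low2:.2%} to {high2:.2%}")  # 76.36% to 93.33%
```

Note that the ends cross over: the low end of the prevalence projection produces the high end of the recall range, and vice versa.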

**Impact of 95% Confidence Level**

Even if you assume perfect legal judgment and consistency, multiple random draws of the same 1,000,000 collection of documents in this example could result in a projection of less than 225,000 relevant documents, or more than 275,000 relevant documents. As seen, with the 95% confidence level this happens five times out of one hundred. That is the same as one time out of twenty, or 5%.

That is acceptable odds for almost all scientific and medical research. It is also reasonable for all legal search efforts, so long as you know that this 5% *caveat* applies, that in one out of twenty times your range may be so far off as to not even include the *true value*. And, so long as you understand the impact that a 5% chance outlier sample can have on your recall calculations.

The 5% uncertainty left over by a 95% confidence level can have a very profound effect on recall calculations based on prevalence alone. For instance, consider what happens when you take only one random sample and it happens to be a 5% outlier sample. Assume the sample happens to have fewer than 346 relevant documents in it, or more than 421 relevant documents. If you forget the impact of the 95% confidence level uncertainty, you might take the confidence intervals created by these extremes as certain *true values*. But they are not certain, not at all. You cannot know whether the one sample you took is an outlier sample without taking more samples. By chance it could have been a sample with an unusually large, or unusually small, number of relevant documents in it. You might assume that your sample created a *true value*, but that would only be true 95% of the time.

You should always remember when taking a random sample that the documents selected may by chance not be truly representative of the whole. They may instead fall within an outlier range. You may have pulled a 5% outlier sample. This would, for instance, be the case in our hypothetical *true value* of 25% if you pulled a sample that happened to have fewer than 346 or more than 421 relevant documents.

You might forget this fact of life of random sampling and falsely assume, for instance, that your single sample of 1,534 documents, which happened to have, let’s say, 425 relevant documents in it, was representative of all one million documents. You might assume from this one sample that the prevalence of the whole collection was 27.71% (425/1534) with a 2.5% interval of between 25.21% and 30.21% (again ignoring for now the *binomial* adjustment, 25.48% – 30.02%). You might assume that 27.71% was an absolute *true value*, and that the projected range of 252,100 to 302,100 relevant documents was a certainty.

Only if you took a large number of additional samples would you discover that your first sample was an *unlucky* outlier that occurs only 2.5% of the time. (You cannot just say take 19 more samples, because each one of those samples would also have a randomness element. But if you took one hundred more samples the “true value” would almost certainly come out.) By repeating the sampling many times, you might find that the average number of relevant documents was actually 384, not the 425 that you happened to draw in the first sample. You would thus find by more sampling that the *true value* was actually 25%, not 27.71%, and that there were probably between 225,000 and 275,000 relevant documents in the entire collection, not between 252,100 and 302,100 as you first thought.

The same thing could happen on what I call the low, *lucky* side. You could draw a sample with, let’s say, only 342 relevant documents in it the first time out. This would create a spot projection prevalence of 22.29% (342/1534) with a range of 19.79% – 24.79%; projecting to between 197,900 – 247,900 relevant documents. The next series of samples could have an average of 384 relevant documents, our familiar range of 225,000 to 275,000.

**Outliers and Luck of Random Draws**

So what does this luck of the draw in random sampling mean to recall calculations? And why do I call the low side rarity *lucky*, and the high side rarity *unlucky*? The *lucky* or *unlucky* label is from the perspective of the legal searcher making a production of documents. From the perspective of the requesting party the opposite attributes would apply, especially if only a single sample for recall was taken for quality control purposes.

Let us go back again to our standard example, where we find 384 relevant documents in our sample of 1,534 from a total collection of 1,000,000. Our prevalence projection is that there are from 225,000 to 275,000 relevant documents in the total collection. If at the end of the project the producing party has found 210,000 relevant documents, this means, as previously shown, they may claim a recall of between 76.36% (210,000/275,000) and 93.33% (210,000/225,000). But they should do so with the 95% confidence level disclaimer.

As discussed, the confidence level disclaimer means that one time out of twenty (5%) the projection may be based on an outlier sample. Thus, for instance, one time out of forty (2.5% of the time) the sample may have an unluckily large number of relevant documents in it, let us assume again 425 relevant, and not 384. As shown, that creates a prevalence spot projection of 27.71% with a range of from 252,100 to 302,100 documents.

Assume again that the producing party finds 210,000 relevant documents. This time they may only claim a recall of from between 83.3% (210,000/252,100) and 69.51% (210,000/302,100).

That is why I call that the *unlucky* random sample for the producing party. In 95% of the random samples they would have found between 346 and 421 relevant documents, most often a number close to 384. With 384 they could have claimed a significantly higher recall range of 76.36% to 93.33%. So based on bad luck alone their recall range has dropped from 76.36% – 93.33% to 69.5% – 83.3%. That is a significant difference, especially if a party is naively putting a great deal of weight on recall value alone.

It is easy to see the flip side of this random coin. The producing party could be lucky (this would happen in 2.5% of the random draws) and by chance draw a sample with less than the lower range. Let us here assume again that the random sample had only 342 relevant documents in it, and not 384. This would create a spot projection prevalence of 22.29% (342/1534) with a range of 19.79% – 24.79%; projecting between 197,900 – 247,900 relevant documents.

Then when the producing party found 210,000 relevant documents it could claim a much higher recall range. It would be between 84.7% recall (210,000/247,900) and 106% recall (210,000/197,900). The latter, 106%, is, of course, a logical impossibility, but one that happens when calculating recall based on prevalence, especially when not using the more accurate *binomial* calculation. We take that to mean near 100%, or near total recall.

Under both scenarios the number of relevant documents found was the same, 210,000, but as a result of pure chance, one review project could claim from 84.7% to 100% recall, and another only 69.5% to 83.3% recall. The difference between 84.7% – 100% and 69.5% – 83.3% is significant, and yet it was all based on the luck of the draw. It had nothing whatsoever to do with effort, or actual success. It was just based on chance variables inherent in sampling statistics. This shows the dangers of relying on recall based on one prevalence sample.
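The three scenarios can be laid side by side with a short sketch. The function name and the labels are hypothetical, the nominal 2.5% interval is used as in the examples, and the cap at 100% reflects the point that a recall above 100% is a logical impossibility. Because the sketch carries the unrounded prevalence through the calculation, its figures can differ from the ones above in the first or second decimal place.

```python
def recall_range_from_sample(relevant_in_sample, sample_size, population,
                             found, margin=0.025):
    """Recall range implied by a single prevalence sample (nominal interval)."""
    prevalence = relevant_in_sample / sample_size
    projected_low = (prevalence - margin) * population
    projected_high = (prevalence + margin) * population
    recall_low = found / projected_high
    # Cap the impossible values above 100% that this method can produce.
    recall_high = min(found / projected_low, 1.0)
    return recall_low, recall_high

# Same collection, same 210,000 documents found; only the sample draw differs.
for label, count in [("lucky", 342), ("typical", 384), ("unlucky", 425)]:
    low, high = recall_range_from_sample(count, 1534, 1_000_000, 210_000)
    print(f"{label:8s} {count} relevant in sample: {low:.1%} to {high:.1%}")
```

Nothing about the review itself changes between the three rows, which is the whole problem with leaning on a single prevalence sample.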

**Conclusion**

These examples show why I am skeptical of recall calculations, even a recall value that is correctly described in terms of a range, if it is only based on a prevalence sample. If the project can afford it, a better practice is to take a second sample at the end of the project and make recall calculations from the second sample. If the project cannot afford two samples, you would be better off from the point of view of recall calculations to skip the first prevalence sample altogether, and just rely on a second end of project sample. Taking two samples doubles the sampling costs from around $1,500 to $3,000, assuming, as I do, that a sample of 1,534 documents can be judged, and quality controlled, for between $1,000 and $2,000. This two-sample review cost may be appropriate in many projects to help determine the success of the search efforts.

When the cost of a second sample is a reasonable, proportionate expense, I suggest that the second sample not repeat the first, that it not sample again the entire collection for a comparative second calculation of prevalence. Instead, I suggest that the second sample be made for calculation of *False Negatives*. This means that the second sample would be limited to those documents considered to be irrelevant by the end of the project (sometimes called the *discard pile* or *null set*). More on this in a coming blog.

Ralph,

It’s interesting to see you go full circle on this and recognize the problems with recall in document review. I think the shortcut to all your math and your discussion is just the old axiom that you can never have greater precision than your least precise measure. The least precise measure in document review is the measure of relevance itself, since it is so highly subjective.

Things like predictive coding are great weapons when one’s interest in documents is limited to one’s own measure of relevance. To then project that measure onto other parties is uncertain. There have even been studies on this. I believe it was the Grossman and Cormack study on inconsistent assessment of responsiveness in e-discovery that examined this concept, measured the overlap between different reviewers, and found that overlap was only between 15 and 49 percent. A score that low means that there are a lot of differences or, stated differently, imprecision.

So, do all the calculations and statistics you think help to support your point but in the end none of them mean anything. Not normal distributions or standard deviations or the concept of sampling risk. The real problem with measuring recall is that one can never have more precision than the least precise measure, which will be the measure of relevance itself. Furthermore, there is no way to measure that measure with even repeatable and consistent values between reviewers, particularly when they are opposing sides of an issue.

That said, I look forward to reading your part 2 in this series.

I too am looking forward to the next installments. As you know, I did some thinking about this subject and didn’t find an easy answer to providing recall because of the impact of large numbers in the discard pile. Will stand by to see how you overcome the problem.

Thanks, Ralph. It will be interesting to learn whether your model is sensitive to the materially different qualitative value of more-of-the-same, what you call “redundantly-relevant,” vs the only-ones-that-really-matter, “hot” documents.

I am glad that the fish are a useful illustration. Please attribute their source.

http://www.lucidatainc.com/2012/10/recall-and-precision-understanding-relevancy-in-ediscovery/

Thanks. I was not aware of the source. I changed them to link to your article.