This is going to be a hyper-technical blog for all those professionals in e-discovery who are struggling, like I am, to fully understand the math governing random sampling, particularly as it is applied to our field of legal search. I can say with a high degree of confidence that most of us who specialize in e-discovery employ random sampling in some form or another as part of our quality control efforts. We typically use random sampling in large-scale review projects. But do we really understand all of the intricacies? Probably not.

*Bubble People* and the Future Here Now

I would estimate that 80% of the elite few who attend Sedona, as mentioned in my last blog, use random sampling as part of their e-discovery work. But this is a small group of dedicated specialists, probably only a few hundred strong. They are in what Paul D. Weiner likes to call *the Sedona Bubble*. I have only about a 90% confidence level in that number, however, as I have not yet done a valid poll of the *Sedonites* (not the best word perhaps for Sedona members, but better than *bubble-people*). Moreover, I suspect that my margin of error, aka *confidence interval*, is a high one of 10%. That means that as few as 70% of the Sedonites in fact use sampling, or as many as 90%. *See, e.g.,* “Sampling 101 for the e-Discovery Lawyer,” an appendix to *The Sedona Conference Commentary on Achieving Quality in the E-Discovery Process* (2009) at pgs. 35-39.

This kind of probabilistic thinking is all part of the future practice of law, coming your way soon. How soon? I’ll tell you in a minute. As William Gibson said: *The future is already here – it’s just not very evenly distributed.* Many of my readers may already be there, Sedonites or not, and may already use random sampling and statistics as part of their legal practice. But I am pretty sure, and here I’d go so far as to say I have a 99.9% confidence level, that *most* lawyers in the world do not.

My guess is based on my travels and teachings to many lawyer groups around the U.S., not to mention my interaction with many of those delightful lawyers in towns large and small who go by the label of *opposing counsel*. In other words, these statements and predictions are based on what I have seen, not from a validly random sample of American lawyers. (Hint to the Rand Corporation: here is a good research project for you.) Still, my *wetware* (gooey brain based) estimates, with a 95% confidence level, that less than 2% of all lawyers now use random sampling in any way. Random sampling is still a rare exception in U.S. legal culture. And therein lies the problem, at least in so far as e-discovery quality control is concerned. Sampling now has a very low prevalence rate.

But those of us in the world of e-discovery are used to that. There are still very few full-time specialists in e-discovery. This is changing fast. It has to in order for the profession to cope with the exploding volume and complexity of written evidence, meaning of course, evidence stored electronically. We e-discovery professionals are also used to the scarcity of valuable evidence in any large e-discovery search. Relevant evidence, especially evidence that is actually used at trial, is a very small percentage of the total data stored electronically. *DCG Sys., Inc. v. Checkpoint Techs, LLC*, 2011 WL 5244356 at *1 (N.D. Cal. Nov. 2, 2011) (quoting Chief Judge Rader: only .0074% of e-docs discovered ever make it onto a trial exhibit list). Again, this is a question of low prevalence. So yes, we are used to that. *See* Good, Better, Best: a Tale of Three Proportionality Cases – Part Two; and, *Secrets of Search* article, *Part Three* (*Relevant Is Irrelevant*).

**A Losey Prediction**

I predict that the rate of prevalence of use of sampling and probabilistic thinking by lawyers will increase rapidly over the next ten years. It must. Random sampling is too powerful a tool for the profession to ignore. It has been well proven as an indispensable tool of science and industry. It is probably time for law to also embrace this tool.

But I will do more than make such vague general assertions. I will now get very specific and put hard metrics on my predictions, metrics with which future lawyers can hold me accountable. (I’m not really too worried as I’ll have Adam to defend me, and he’ll probably come up with some good excuses in the 5% unlikely event I’m wrong.)

I hereby predict that … (trumpets sound) … in the year 2022 a random sample polling of American lawyers will show that 20% of the lawyers in fact use random sampling in their legal practice. I make this prediction with a 95% confidence level and a margin of error of only 2%. I even predict how the growth will develop on a year-by-year basis, although my confidence in this detail is lower.

But I will go still further out on the limb, and make my prediction even more specific. Assuming that by the year 2022 there are 1.5 Million lawyers (the ABA estimated there were 1,128,729 resident, active lawyers in 2006), I predict that 300,000 lawyers in the U.S. will be using random sampling by 2022. The confidence interval of 2% by which I qualified my prediction means that the range will be between 18% and 22%, which means between 270,000 lawyers and 330,000 lawyers. I have a 95% level of confidence in my prediction, which means there is a 5% chance I could be way wrong, that there could be fewer than 270,000 using random sampling, or more than 330,000. This is all shown by the familiar bell curve first shown above and below. (Hint – Adam, here’s the *out* to defend my predictions (in the unlikely event you’ll have to.))

I do all of this prognostication somewhat tongue-in-cheek, but with the ulterior motive to provide an example of what I mean by probabilistic thinking. Forget about absolute certainty of knowledge about anything. Forget about perfection. Think reasonability of efforts. Think preponderance of evidence. Think probability. Think in terms of degrees of confidence. For example, I am highly confident that most of you probably get 90% of my humor, give or take 2% of my jokes.

But enough with the pleasantries. I promised a hard-nosed technical math blog for all you super-nerds out there, and now you’re going to get it! (Here is where I predict 50% of my readers will stop reading!)

**The Value and Limitations of Random Sampling**

When you review a random sample drawn from a larger body of data (the “corpus”), and categorize the sample data in some way, for instance by identifying all documents in the sample as either relevant or irrelevant, and you then project the percentage found in the sample onto the entire corpus, you **cannot know for certain** that your percentage is the correct answer (e.g., that only 10% of the total corpus is relevant because only 10% of the sample is relevant). But, if the sample size is large enough, and the selection of the sample is truly random, you **can know** that there is a certain chance, e.g., a 95% chance, or “confidence level,” that you are within a certain margin of error (“confidence interval”) of the correct answer. Put another way, there is a 95% chance that you are correct, at least within a defined plus or minus range.

For my purposes as an e-discovery lawyer concerned with quality control of document reviews, this explanation of *near certainty* is the essence of random probability theory. This kind of probabilistic knowledge, and use of random samples to gain an accurate picture of a larger group, has been used successfully for decades by science, technology, and manufacturing. It is key to both quality control and understanding large sets of data. The legal profession must now also adopt random sampling techniques to accomplish the same goals in large-scale document reviews.

You can use any standard random sample calculator to determine the appropriate size of a random sample, using either a 95% or 99% confidence level, and the confidence interval of your choice. I suggest you use the calculator shown at the top of the random sample page on my FloridaLawFirm.com website. The confidence interval you plug into the calculator represents the margin of error you find acceptable. Fewer documents are required for a valid random sample as the confidence interval increases, or as the confidence level decreases.

In the example above where 10% of the sample was relevant, if a confidence interval of 4% is used, the 10% projected level may be as high as 14% or as low as 6%. Take a corpus of 1,000,000 documents and a random sample of 600 documents, which is the sample size required for a 95% confidence level and a +/- 4% confidence interval. If you find that 60 of the sampled documents are relevant and 540 are irrelevant, then *you can know that there is a 95% chance that the number of relevant documents in the entire corpus is between 60,000 and 140,000.* If a confidence interval of 2% is used, and the corresponding number of randomly selected documents is reviewed (2,395), and again 10% are found to be relevant (240), then the range of relevant documents in the corpus is between 80,000 and 120,000. That is how random probability works in a binary classification system. Here is the standard bell curve graphic illustrating a 95% confidence level:
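The projection in the 600-document example can also be sketched in a few lines of Python. This is only an illustration: the function name `project_range` and its parameters are hypothetical, and the margin of error is simply added to and subtracted from the sample proportion, as in the text.

```python
def project_range(hits, sample_size, corpus_size, margin):
    """Project a sample proportion onto the corpus.

    Returns (low, point, high) document counts, where low/high are the
    point estimate minus/plus the margin of error (confidence interval).
    """
    proportion = hits / sample_size
    low = max(0.0, proportion - margin)   # a proportion cannot fall below 0
    high = min(1.0, proportion + margin)  # or rise above 1
    return (
        round(low * corpus_size),
        round(proportion * corpus_size),
        round(high * corpus_size),
    )


# 60 relevant documents in a sample of 600, corpus of 1,000,000, +/- 4%
print(project_range(60, 600, 1_000_000, 0.04))
```

For the 600-document sample this gives a point estimate of 100,000 relevant documents, with a range of 60,000 to 140,000.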

The variation in sample size required for various confidence levels and intervals is shown in the graph below. It illustrates the sample sizes needed for 90%, 95%, and 99% confidence levels with confidence intervals of 10%, 5% and 2%.

The math at work for calculating sample sizes and confidence intervals involves square root calculations, as will be shown in the *fun math* part below. This essentially requires about a **quadrupling** of sample size in order to achieve a **doubling** of accuracy. Put another way, if you want to cut your error margin in half, you will have to quadruple your sample size. For instance, assuming a population size of 1,000,000, and a 95% confidence level, the sample size required for a 10% confidence interval is 96. The sample size required for a 5% confidence interval is 384. The sample size required for a 2.5% confidence interval is 1534. The sample size required for a 1.25% confidence interval is 6,109.
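The quadrupling pattern can be checked with a short sketch. It assumes the standard sample size formula at 50% prevalence plus one common finite population correction, n = n0 / (1 + (n0 - 1) / N); that correction is an assumption about what the calculators do, not something stated above, although it does reproduce the four figures quoted. The function name `sample_size_fpc` is hypothetical.

```python
Z_95 = 1.96  # standard score for a 95% confidence level


def sample_size_fpc(margin, population, z=Z_95, p=0.5):
    """Sample size for a given margin of error, adjusted for a finite population.

    n0 is the unadjusted size; the finite population correction used here
    is an assumed (but common) choice.
    """
    n0 = z**2 * p * (1 - p) / margin**2
    return n0 / (1 + (n0 - 1) / population)


# Halving the margin of error roughly quadruples the sample size.
for margin in (0.10, 0.05, 0.025, 0.0125):
    n = round(sample_size_fpc(margin, 1_000_000))
    print(f"+/- {margin:.2%}: sample size {n}")
```

Each halving of the interval multiplies the sample size by a little less than four, because the finite population correction trims the larger samples slightly.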

**This is a good rule of thumb to remember.** If you want to cut your error rate (your confidence interval) in half, and thus double your accuracy, your cost to do so will quadruple. It will quadruple, at least approximately, because you will have four times as many documents in the sample to review. Twice the quality at four times the cost. Thus 2=4 in the world of quick calculations for random sampling. Hopefully the picture of my old unmanicured thumb will help you to remember this.

**The Impact of Prevalence on Random Sampling Calculations**

The second calculator shown on my linked page allows you to add another dimension, another criterion, to your probability analysis, namely “prevalence.” This is especially important to understand in the field of legal search where low prevalence rates are common. In the binary example of relevance, the prevalence of the corpus is the percentage of relevant documents. The prevalence percentage has a direct numerical impact on the margin of error (“confidence interval”) applicable to the sample projections. Prevalence is also known as “richness,” as in *target-richness*, or “response distribution.” *See, e.g.,* another sample size calculator by RAOsoft.com that includes these criteria and an explanation.

The first calculator shown on my website assumes what some call the “worst case scenario” for sample prediction, where the prevalence is 50%. This is a perfectly even distribution, which requires the largest sample size to attain a desired confidence level. The top calculator conservatively assumes that half of the corpus will be in the target group, i.e., relevant. A 50/50 prevalence requires the highest number of documents to be sampled for statistical validity, which is why it is called the “worst case scenario.” When the prevalence rate is higher or lower than 50%, the number of documents that must be sampled **decreases**.

Thus, if the prevalence rate is 95%, meaning in our example that 95% of the documents are relevant, or, conversely, if the richness is very low and the prevalence rate is only 5%, a smaller sample is required to attain the same confidence interval. Put another way, review of the same sample size yields a much narrower confidence interval, and thus a much lower margin of error. This is very important to understanding the binary classification of a large corpus of data where only a small amount of the data is responsive, i.e., relevant. (Another example of a binary classification could be privileged or not.)

Try out the second standard random sample calculator shown on my website to see this for yourself. In the first example shown, assuming a corpus of one million documents, with a confidence interval of 4, you see that a sample size of 600 documents is required. This is the largest possible sample size required for the 95% +/- 4. It assumes the worst case scenario of 50% prevalence (i.e. – half of the documents are relevant). Now change the prevalence percentage to 95% in the second calculator, using a sample size of 600, and a corpus of 1,000,000. The confidence interval is now 1.74%. You get the same result when you assume a prevalence rate of only 5%.
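For readers without the calculator at hand, the same comparison can be sketched directly. The sketch assumes the usual interval formula, Z times the square root of p(1-p)/n, with no finite population adjustment (negligible for a corpus of 1,000,000); the function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Margin of error (confidence interval) for prevalence p and sample size n."""
    return z * math.sqrt(p * (1 - p) / n)


# Same 600-document sample, different prevalence rates.
for p in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"prevalence {p:.0%}: +/- {margin_of_error(p, 600):.2%}")
```

The interval is widest at 50% prevalence (about 4%) and shrinks symmetrically toward either extreme, which is why 5% and 95% give the same 1.74%.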

Again, see the Sample Size Calculator at RaoSoft.com for a calculator that allows you to plug-in different prevalence rates (called “response distribution” in that calculator) to determine sample sizes for certain intervals based on prevalence. Bottom line, when you have a corpus with a high or low prevalence, one that is either target rich, or target poor, a smaller sample size is required to attain an acceptable confidence interval. (Note, there are some exceptions where, for instance, there are extreme values (“outliers”) or where there are small corpus sizes.)

A good way to understand prevalence is by example. Start by assuming a 1,000,000 document corpus with a prevalence rate of 5% (one where 5% or less of the documents are relevant). You need only review **456** documents to know the total number of relevant documents with 95% certainty and an error rate of only 2%. Remember, if you had assumed that half of the documents were relevant, then you would have had to review **2,395** documents to attain the same confidence level and interval. See for yourself by trying this out in the standard calculators on my page and on RaoSoft’s.

This characteristic of random sampling must be understood for cost-effective quality control in a corpus with low prevalence. This is important because low prevalence is the norm in legal search, not the even 50/50 split that standard calculators assume by default, where you face the hardest case of separating out half of the corpus.

**Mathematical Formula for Random Sample Size Calculations**

Here is one way of expressing the basic formula behind most standard random sample size calculators:

*n = Z² x p(1-p) ÷ I²*

Description of the symbols in the formula:

**n** = required sample size

**Z** = confidence level, expressed as a standard score (the value of Z in statistics is called the “Standard Score,” wherein a 90% confidence level=1.645, 95%=1.96, and 99%=2.576)

**p** = estimated prevalence of target data (richness)

**I** = confidence interval or margin of error

Putting the formula into words – the required sample size is equal to the confidence level squared, times (the estimated prevalence times one minus the estimated prevalence), then divided by the square of the confidence interval.

Here is an example of the formula in action where we assume a 95% confidence level and confidence interval of 2%, and a prevalence of 4%:

**n = Z² x p(1-p) ÷ I²**

**n = 1.96² x .04(1-.04) ÷ .02²**

**n = 3.8416 x .04(.96) ÷ .0004**

**n = 3.8416 x .0384 ÷ .0004**

**n = .14751744 ÷ .0004**

**n = 368.7936**

The formula shows that with an estimated prevalence of 4% we need a sample size of 369 documents to attain a 95% confidence level with a margin of error of 2%.
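The same arithmetic can be written as a small function, which makes it easy to try other values. This is a minimal sketch of the formula above; the names `Z_SCORES` and `required_sample_size` are hypothetical, and no finite population adjustment is applied.

```python
# Standard scores for common confidence levels
Z_SCORES = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}


def required_sample_size(confidence, p, interval):
    """n = Z^2 x p(1-p) / I^2, returned unrounded."""
    z = Z_SCORES[confidence]
    return z**2 * p * (1 - p) / interval**2


n = required_sample_size(confidence=0.95, p=0.04, interval=0.02)
print(f"unrounded: {n:.4f}, rounded: {round(n)}")
```

Rounding to the nearest whole document follows the convention used in this post; a more conservative practice would always round up.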

It is important to understand that this sample size formula is derived from the **formula for calculating confidence intervals (I)**:

**I = Z x √(p(1-p) ÷ n)**

If you take the “n” value as the unknown (the number to be sampled for a specified confidence interval), assign a value to the confidence level of, say, 95%, so that the value for “Z” is 1.96, and move the “n” to the left side of the equation, the formula now looks like this:

**n = (1.96 ÷ I)² x p(1-p)**

Mathematically this is the same thing as our original formula:

**n = Z² x p(1-p) ÷ I²**

We can easily prove the formulas are identical by example where we again assume a 95% +/- 2%, and a prevalence of 4%:

**I = Z√(p(1-p)/n)**

**.02 = 1.96√(.04(1-.04)/n)**

**n = (1.96/.02)² x .04(.96)**

**n = (98)² x .0384**

**n = 9604 x .0384**

**n = 368.7936**
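For those who prefer to see the algebra in general terms rather than by example, the step from the interval formula to the sample size formula can be sketched as follows, using the same symbols (n, Z, p, I) defined above:

```latex
\begin{align*}
I &= Z\sqrt{\frac{p(1-p)}{n}} \\
I^{2} &= \frac{Z^{2}\,p(1-p)}{n} && \text{(square both sides)} \\
n &= \frac{Z^{2}\,p(1-p)}{I^{2}} = \left(\frac{Z}{I}\right)^{2} p(1-p) && \text{(solve for } n\text{)}
\end{align*}
```

Squaring the interval formula is what puts the square on both Z and I, and it is also why halving I quadruples n.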

Here is another example using the formula I prefer, and following our first assumptions where the estimated prevalence rate is 5% relevant documents, and a 95% confidence level is desired with a confidence interval of 2%. The following relatively simple mathematical calculation provides the required sample size:

**n = 1.96² x .05(1-.05) ÷ .02²**

**n = 3.8416 x .05(.95) ÷ .0004**

**n = 3.8416 x .0475 ÷ .0004**

**n = .182476 ÷ .0004**

**n = 456.19**

Now if you change the prevalence rate from 5% to 50%, the formula increases the required sample size for a 95% confidence with plus or minus 2% as follows:

**n = 1.96² x .5(1-.5) ÷ .02²**

**n = 3.8416 x .5(.5) ÷ .0004**

**n = 3.8416 x .25 ÷ .0004**

**n = .9604 ÷ .0004**

**n = 2401**

Do the math above. Really, it is not that hard. It is all just multiplication and division. It shows that with the lower prevalence rates commonly found in legal search you can make accurate predictions using smaller sample sizes. Further, if you determine sample size based on an assumed 50% prevalence rate, whereas in fact you have a much lower rate, you are actually narrowing your confidence interval, your margin of error.

Thus, if you use a standard calculator that by default has a worst-case 50% distribution or prevalence rate built-in, and review 2,401 documents, which you thought was the sample size necessary to attain a confidence interval of 2%, and you in fact were dealing with a document corpus that only had a 5% prevalence rate, having 95% irrelevant documents, then in fact your calculations will have a confidence interval (error rate) of only .87%, and not the 2% interval you thought. That is a good thing.

Again, don’t believe me. Do the math. Use the Interval formula that the sample size formula is based upon. (You may also need a calculator that does square root.)

**I = Z√(p(1-p)/n)**

**I = 1.96√(.05(1-.05)/2401)**

**I = 1.96√(.05(.95)/2401)**

**I = 1.96√(.0475/2401)**

**I = 1.96√.00001978342357**

**I = 1.96 x .004447856064443**

**I = .00871779788631**
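Another way to convince yourself is by simulation rather than formula. The sketch below repeatedly draws random samples of 2,401 documents from an imaginary corpus with 5% prevalence and counts how often the sample proportion lands within the computed interval. Several things here are assumptions made for illustration: each document is treated as an independent 5% coin flip (in effect an unlimited corpus), the number of trials and the random seed are arbitrary, and the function name `coverage` is hypothetical.

```python
import math
import random


def coverage(p=0.05, n=2401, z=1.96, trials=2000, seed=1):
    """Return (margin, share of simulated samples within +/- margin of p)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    margin = z * math.sqrt(p * (1 - p) / n)
    inside = 0
    for _ in range(trials):
        # Review n randomly chosen documents; each is relevant with probability p.
        hits = sum(rng.random() < p for _ in range(n))
        if abs(hits / n - p) <= margin:
            inside += 1
    return margin, inside / trials


margin, share = coverage()
print(f"margin of error: {margin:.4%}")
print(f"share of samples within the margin: {share:.1%}")
```

The share should come out close to 95%, though not exactly, since document counts are whole numbers and the formula rests on a normal approximation.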

You can also use the second standard calculator on my page. Just plug in a 95% confidence level, a sample size of 2401, a population of 1,000,000, and a prevalence percentage of 5. It should calculate a confidence interval of 0.87. You can also double-check by using the RAOsoft calculator.

**Additional Math Disclaimer**

I have a disclaimer on all of my blog postings. See the top title and the first link on the right hand column: DISCLAIMER. On this particular post I thought it would be a good idea to add yet another level of disclaimer. Although math is math, and these are well accepted formulas and principles, these are still just my personal applications and synthesis of information and rules applicable in the field of statistics and legal search. I reserve the right to go back and make revisions to this post as my understanding deepens and improves. I am an attorney, not an information scientist or statistician. These views should not be relied upon, nor accepted as anyone’s opinion other than my own. You should, of course, always do your own due diligence, study and analysis. Like I said, *do the math.*

As always, if you disagree with the analysis here, or detect any math errors, please let me know. I welcome a free exchange of ideas and information. You can either email me privately, or write a public comment. That is how my blog works. I put my ideas out there for peer-review, and I make corrections as I go along, and before the blogs are ultimately transformed into a book. I appreciate all of the help my learned readers have provided to me over the years since I first began this open writing experiment in 2006. The odds are, your comments will help make my next book even better.

**Conclusion**

This blog has discussed thirteen different scenarios showing probabilistic analysis:

- I began with analysis of *e-discovery expert bubble people*, wherein I estimate, based on anecdotal evidence, that 80% already use random sampling in some manner. I have only a 90% confidence level in that, with a confidence interval of 10%, so it could actually range from 70% to 90%, and maybe a lot more or less.
- Then I moved on to analysis of all lawyers in the world. I estimated that a majority (51% or more) do not use random sampling at all. I put a 99.9% confidence level on that opinion and invited the Rand Corporation to try to prove me wrong.
- Then I turned my half-witty attention to all lawyers in the U.S. and opined that less than 2% use random sampling. I put a 95% confidence level on that one.
- Then I made my prediction that in ten years the number of lawyers in the U.S. using random sampling will increase tenfold from 2% to 20%. I am 95% confident on that projection, but I put a margin of error on it of plus or minus 2%. Based on the ABA’s estimate of the number of lawyers in America, I projected that between 270,000 and 330,000 lawyers will be using random sampling by 2022. Rand Corp., make a note and do a follow-up survey in 2022, would you please?
- I next estimated that my blog readers get 90% of the humor in this blog (or better said, *attempts* at same), with a confidence interval of 2%, meaning between 88% and 92%.
- Serious sampling examples then began where I assumed a 95% confidence level, and 4% confidence interval. A review of a sample of 600 documents found that 60 were relevant (10%). Based on the sample we can project that 100,000 of the documents in the million document corpus would be relevant, with a range of between 6% and 14%, which means between 60,000 and 140,000 documents.
- Another variation of the last example was then considered where a confidence interval of 2% was used, instead of 4%. This required a sample size of 2,395 documents, where 10% were again found to be relevant (240). Since a 2% interval was used, the range of relevant documents projected was narrower, between 80,000 and 120,000.
- Next, I added consideration of prevalence into the sample size formulas and started with an example of a 95% confidence level, and either 5% or 95% prevalence ratio (same either way). With a review of a random sample of 600 documents, and either a 5% or 95% prevalence, I showed that the confidence interval improved from 4% to 1.74%. This is an important point.
- Then I considered a 5% prevalence, where I showed that a sample of only 456 documents provides a 95% certainty and an error rate of 2%. This compared to the need to sample 2,395 documents for a 2% confidence interval if you assume 50% prevalence. Another important point.
- Then I showed the actual mathematical calculations explaining the formulas and used an example of a 95% confidence level, a 2% confidence interval, and a prevalence of 4%. You remember, it went like this and showed you only had to sample 369 documents:

**n = Z² x p(1-p) ÷ I²**

**n = 1.96² x .04(1-.04) ÷ .02²**

**n = 3.8416 x .04(.96) ÷ .0004**

**n = 3.8416 x .0384 ÷ .0004**

**n = .14751744 ÷ .0004**

**n = 368.7936**

- The next formula I ran again assumed a 95% confidence level and 2% interval, but this time changed the prevalence to 5%. The formula showed a required sample size of 456 documents.
- Then I ran the math on 95% +/- 2, but this time assuming a 50% prevalence. The formula showed a required sample size of 2,401 documents.
- Then I ended with another twist where the sample size of 2,401 documents is used, but this time a 5% prevalence is assumed. The interval calculation formula showed that a .87 confidence interval results. That was shown in the only formula where you had to do a square root calculation:

**I = Z√p(1-p)/n**

**I = 1.96√.05(1-.05)/2401**

**I = 1.96√.05(.95)/2401**

**I = 1.96√.0475/2401**

**I = 1.96√.00001978342357**

**I = 1.96 x .004447856064443**

**I = .00871779788631**

I pointed out that you could skip the math entirely if you wanted, and attain the same results by using the random sample size calculators on my page, or on the RAOsoft calculator, or any other of a number of calculators freely available on the web. Depending on what software you are using for review, you might also have this ability built-in. You can also skip formulas and calculators altogether and rely upon charts that list common values. These charts typically assume a prevalence of 50%. *See, e.g.,* Sample Size Table from Research Advisors. It can be helpful to look at these charts anyway to get a feel for how the numbers relate. For instance, look at these tables from Professor Glenn D. Israel of the University of Florida:

Table 1. Sample size for ±3%, ±5%, ±7% and ±10% Precision Levels Where Confidence Level is 95% and P=.5. Columns show the sample size (n) for each precision (e).

| Size of Population | ±3% | ±5% | ±7% | ±10% |
|---|---|---|---|---|
| 500 | a | 222 | 145 | 83 |
| 600 | a | 240 | 152 | 86 |
| 700 | a | 255 | 158 | 88 |
| 800 | a | 267 | 163 | 89 |
| 900 | a | 277 | 166 | 90 |
| 1,000 | a | 286 | 169 | 91 |
| 2,000 | 714 | 333 | 185 | 95 |
| 3,000 | 811 | 353 | 191 | 97 |
| 4,000 | 870 | 364 | 194 | 98 |
| 5,000 | 909 | 370 | 196 | 98 |
| 6,000 | 938 | 375 | 197 | 98 |
| 7,000 | 959 | 378 | 198 | 99 |
| 8,000 | 976 | 381 | 199 | 99 |
| 9,000 | 989 | 383 | 200 | 99 |
| 10,000 | 1,000 | 385 | 200 | 99 |
| 15,000 | 1,034 | 390 | 201 | 99 |
| 20,000 | 1,053 | 392 | 204 | 100 |
| 25,000 | 1,064 | 394 | 204 | 100 |
| 50,000 | 1,087 | 397 | 204 | 100 |
| 100,000 | 1,099 | 398 | 204 | 100 |
| >100,000 | 1,111 | 400 | 204 | 100 |

a = Assumption of normal population is poor (Yamane, 1967). The entire population should be sampled.
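A quick spot check suggests how a table like this can be generated. The sketch below assumes the simplified formula n = N / (1 + N x e²), which is commonly attributed to Yamane; that the table was built this way is an assumption made here, not something stated in the table itself. The formula reproduces the handful of entries checked, although a few of the ±7% entries for the largest populations come out slightly different, so treat it as an approximation. The function name `simplified_sample_size` is hypothetical.

```python
def simplified_sample_size(population, precision):
    """Simplified sample size formula: n = N / (1 + N * e^2).

    Assumes a 95% confidence level and P = .5, as in the table.
    """
    return population / (1 + population * precision**2)


# (population, precision) -> value listed in the table
listed = {
    (500, 0.05): 222,
    (1_000, 0.10): 91,
    (2_000, 0.03): 714,
    (5_000, 0.05): 370,
    (10_000, 0.03): 1_000,
}

for (population, precision), table_value in listed.items():
    computed = round(simplified_sample_size(population, precision))
    print(f"N={population:>6}, e={precision:.0%}: computed {computed}, table {table_value}")
```

Note that this shortcut uses roughly 2 in place of 1.96 for Z, which is why its large-population figure at ±5% is 400 rather than the 384 produced by the formula used earlier in this post.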

Even though calculators and charts make sample size determination easy, it is good to know how to do the math yourself. That provides a solid understanding of what the calculators and charts are doing and why. *Also see* the work of the EDRM on the subject: *Statistical Sampling Applied to Electronic Discovery*; and, *Appendix 2: Application of Sampling to E-Discovery Search Result Evaluation*.

The math we examined shows the importance of *prevalence* to random sample size calculations and confidence interval calculations. This has been overlooked, or at least underestimated, by many in the field of e-discovery. This error often leads to over-sampling and review of more documents than required to obtain reasonable confidence levels and intervals. The routine assumption of a worst-case-scenario of 50% prevalence leads to overkill and unnecessarily large samples for many (but not all) uses of random sampling, including many quality control calculations. We need to start adding prevalence into our equations, and start being more efficient in our quality control metrics.

I look forward to your public and private comments. Hopefully I have caught all of the minor number and math mistakes (I have already spotted and corrected quite a few), but it is late, and I may well have missed some. Please let me know if you see any more errors.

Fascinating post, Ralph–and undoubtedly the toughest to plow through of any you’ve penned. You’re safe in defending your projections on many grounds, not least of which is the variability with which you can define the circa-2022 lawyer population under scrutiny. I wouldn’t expect much change in the utilization level of sampling within the segments of the bar blissfully unconcerned with validating production methodologies in e-discovery.

Still, I will bet you much more than a steak dinner that, despite your confident projection, 20% of U.S. lawyers WILL NOT know the formula for calculating confidence intervals by 2022. Conversely, I can confidently assure you that the population of U.S. lawyers will drop to 20% of its current level if it is made known now that the lawyers of 2022 *must* understand the formula for calculating confidence intervals and work with same on a daily basis. I think we might even lose a few of the Sedona bubble boys and girls.

I just deleted that “know the formula” sentence. There I did get carried away. You can use random sampling without knowing formulas, however, and I stick by my main premise of 20% utilization by 2022. May be too conservative of an estimate if you succeed in your technology education.

Great post Ralph. Brings back not-so-fond memories of quantitative data analysis classes in college. A real mean, median, deviation hoot! I agree with Craig that (discovery) attorneys will not know the formula for calculating confidence intervals. But that doesn’t mean they won’t be using this kind of technology to quickly cull data. Which I think will happen a lot sooner than many realize especially considering data doubles every 18 months…and Skynet is inching closer to reality.

But much in the same way you don’t need to know how Google’s page rank algorithm works you won’t need to know how the statistical sampling magic happens either. It will just work and technology will make it easy to use. Dare I say easy enough for attorneys to use without scaring them away from Sedona or the law in general? =)

Ralph,

So far I have just used random sampling to verify adherence to coding instructions (and take corrective action where necessary). I always ran into the issue of not knowing how large a sample I would need. Your article has convinced me that additional quantitative coursework (perhaps even an MSc in Applied Statistics/Mathematics) would be a worthwhile investment.

I

Jeffrey

Hi Ralph,

Venkat Rangan (CIO of Clearwell) wrote about sampling a while back and noted my article about the topic, which appeared in LTN (12/06/2010). Here’s the link to Venkat’s article and his references to the Sedona materials: http://www.clearwellsystems.com/e-discovery-blog/2011/02/09/how-do-you-sample-electronically-stored-information-esi-in-e-discovery/.

I think the best way for attorneys to “get” the importance of this topic is by analogy: think political polling. With a sample size of about 1,000 people chosen at random, the polling firms (quoted by all the major news media) estimate with about a 95% confidence level what an entire country of about 300 million people is thinking at any given time.

For examples, go to http://www.realclearpolitics.com. RCP takes a handful of poll results and averages them. Attorneys might pull sample sizes of randomly selected documents and conduct the same exercise.

A shorthand formula for choosing the sample size is to know that a reasonable estimate of the error factor is 1 divided by the square root of n. Suppose n =900; then the square root of n is 30. And 1 divided by 30 is 3.3%. So, if you’re happy with that error factor, n = 900.

Or, without the math and straight from Venkat’s article:

“Perhaps the real need is for the requesting party to specify in their Rule 26 (b) meet and confer, that the production be certified for completeness by also including a statement on sampling and its results. A simple request such as, “Sample the data for 98% confidence level and 2% error rate, and report the number of responsive documents” could be sufficient. The producing side can perform random sampling, per the sampling goals for the above request, selecting 13526 documents (based on the sampling table of EDRM Search Guide). This allows the attorneys representing the producing party to certify and sign off on an agreed-upon target.”

Thanks Nick. Good comments. Although I think Venkat’s numbers are unrealistic: “98% confidence level and 2% error rate.” This is way overkill IMO, especially considering the vagaries of human coding, and the great expense of such over-review. Reasonable, proportional efforts must be our pole-star, and I submit that 98% +/- 2 is unreasonable and disproportionate for almost any size project. Most of science and industry, for instance, uses 95% (except perhaps research on the quality of bulletproof vests!).

Ralph, I think we need a consensus on this; a standard. If the standard was as tight as 98% + or – 2%, you can see that one would need to pull a random sample of 13,526 documents from the set. In science, I don’t think anyone would accept less than 95% as statistically significant, but that is certainly a reasonable figure for litigation. And, for a 95% confidence level and an appropriate error rate, the number of documents would go way down, as would the costs. I hope your post sets the wheels in motion.
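To make the cost difference concrete, here is a minimal sketch of the standard sample-size formula for estimating a proportion. It assumes the normal approximation, worst-case 50% prevalence, and no finite-population correction, so its results can differ slightly from published tables such as the one quoted above. The function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size(confidence, margin, prevalence=0.5):
    """Documents needed to estimate a proportion within +/- margin.

    Assumes a simple random sample and the normal approximation;
    prevalence=0.5 is the worst case (largest sample).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z ** 2 * prevalence * (1 - prevalence) / margin ** 2)


for confidence, margin in [(0.95, 0.05), (0.95, 0.02), (0.98, 0.02), (0.98, 0.01)]:
    n = sample_size(confidence, margin)
    print(f"{confidence:.0%} confidence, +/- {margin:.0%}: {n:,} documents")
```

Under these assumptions the formula lands close to 13,526 only at 98% with a margin of plus or minus 1%; at plus or minus 2% it gives roughly a quarter of that. One possible reading is that the quoted table treats 2% as the full width of the interval, but that is an inference, not something the source states.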

Are the prevalence rates (of responsive documents, not just “hits”) in e-discovery really this large as a rule? A 5% prevalence rate in a million document corpus is 50,000 documents. And how do you know the prevalence rate before you’ve done any searching?

I build search test collections at NIST as the head of the TREC project. The prevalence rate of relevant documents we use in the general (not e-discovery) collections we have built is much less than 1%—more like 0.02% (150/750,000=.0002). In this case, uniform sampling is not helpful. The formula will suggest very few documents need to be looked at because an estimate of 0 relevant documents will be well within the error rates with very high confidence.

Ellen Voorhees
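The low-prevalence problem can be seen with a short calculation. This sketch assumes a simple random sample from a corpus large enough that each draw is approximately independent, and it uses the 150-in-750,000 prevalence from the comment. The function name is illustrative.

```python
def prob_no_relevant(prevalence, sample_size):
    """Chance that a random sample contains zero relevant documents.

    Treats each draw as independent, which is a close approximation
    when the sample is small relative to the corpus.
    """
    return (1 - prevalence) ** sample_size


prevalence = 150 / 750_000  # 0.02%, the figure cited in the comment

for n in (385, 2401, 13526):
    p_zero = prob_no_relevant(prevalence, n)
    print(f"sample of {n:,}: {p_zero:.1%} chance of finding no relevant documents")
```

At this prevalence a sample of a few hundred documents will usually contain no relevant documents at all, which is why a uniform sample says so little about a needle-in-a-haystack collection.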

Thank you Ellen for contributing. I remember doing a panel event with you and Jason at Georgetown on search a few years ago and how clear and concise you were in explaining some of the technicalities of search to lawyers. You are certainly one of the best information scientists who is interested in the field of legal search and we all appreciate that.

You asked: “how do you know the prevalence rate before you’ve done any searching?”

Good question and I have two answers:

1. In many cases you don’t, but a quick look around, a couple of hours, will give you a pretty good idea, especially in the kind of databases we are typically dealing with, large unfiltered email collections. The variety among people’s email is not really all that great. After you have seen a few hundred collections of corporate email, you have a pretty good idea of what to expect. Moreover, unlike academic tests of ENRON type databases, in litigation you often have access to the custodians and can ask them questions.

2. You have looked at the same or a very similar collection of data before, i.e., email from the same company. You have a pretty good idea of what’s what, and you have a pretty good feel for relevancy in the type of case, often because you have had hundreds of cases just like it over the past few years. Although there are slight differences in every case and every email collection, there are also great similarities.

What suggestions do you have for the extremely low prevalence scenario you encounter in the Enron database search test at TREC? Also, what do you suggest when you are asked to search for a Unicorn? That is, all searches turn up nothing, because the smoking gun email that the requesting party suspects or hopes may exist does not in fact exist. This proof of a negative is a real problem in the law as you probably already know.

Thanks again for contributing. Your comments are welcome anytime.

By the way, did I get the math part right?

Ralph

Moore’s Law as applied to the “Losey Rule”

Of course that assumes that 293 lawyers are currently using TAR.

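| Year | # of lawyers using TAR |
|------|------------------------|
| 2012 | 293 |
| 2013 | 586 |
| 2014 | 1,172 |
| 2015 | 2,344 |
| 2016 | 4,688 |
| 2017 | 9,375 |
| 2018 | 18,750 |
| 2019 | 37,500 |
| 2020 | 75,000 |
| 2021 | 150,000 |
| 2022 | 300,000 |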
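The figures above look as though they were produced by halving backward from 300,000 in 2022 and rounding, rather than by doubling forward from the 2012 figure (4,688 doubled would be 9,376, not 9,375). A minimal sketch under that assumption, with an illustrative function name:

```python
def doubling_projection(target, target_year, start_year):
    """Work backward from a target, halving once per year, rounding each value."""
    return {
        year: round(target / 2 ** (target_year - year))
        for year in range(start_year, target_year + 1)
    }


projection = doubling_projection(target=300_000, target_year=2022, start_year=2012)

for year, lawyers in projection.items():
    print(f"{year}: {lawyers:,}")
```

Ten annual doublings multiply the starting figure by 1,024, which is why a starting point of about 293 is enough to reach 300,000.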

Losey’s Law ….. good idea.

Excellent job, Ralph. I think that you pretty much nailed it. I always (or at least to a reasonable approximation of always) enjoy statistical humor. Since you already used my best material, I’ll refrain from any further attempts here. I recommend the book and movie Moneyball for a look at how (the right kind of) statistical thinking transformed another industry.

About your predictions regarding the likelihood of lawyers using sampling. Adam may have to defend you (he knows how to find good statisticians to support his efforts). I would say that a higher percentage of litigators will use sampling, but a lower percentage of lawyers will be using it. I think that most lawyers don’t do discovery-intensive work (am I wrong?). So, we need to limit your prediction to lawyers who do litigation or who do discovery-intensive litigation.

About Ellen’s comments. I think that it is usually just safer to base your sampling on the worst case (complete uncertainty). Does it really make much difference to review 400 vs. 600 documents out of a million?
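A quick way to see why the worst case is the safe choice: for a fixed sample size, the margin of error is widest at 50% prevalence and only narrows as prevalence moves toward either extreme. The sketch below uses the normal approximation at a 95% confidence level; the function name and the prevalence values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def margin_at(n, prevalence, confidence=0.95):
    """Normal-approximation margin of error for a sample of n documents."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(prevalence * (1 - prevalence) / n)


for n in (400, 600):
    margins = ", ".join(
        f"{p:.0%} prevalence: +/- {margin_at(n, p):.1%}" for p in (0.5, 0.1, 0.05)
    )
    print(f"n={n}: {margins}")
```

A sample sized for 50% prevalence therefore meets its target margin whatever the true prevalence turns out to be, though the normal approximation itself becomes unreliable at very low prevalence.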

Great article. It got me thinking about a few considerations while utilizing sampling in the eDiscovery context. Your point about prevalence elucidates how important it is to take the time to learn a little bit about the ESI you’re working with. Of course it’s always a good idea to err on the side of caution (i.e. assume the “worst case scenario”), but as your post suggests there are times where introducing assumptions about prevalence into the equation might be exceedingly helpful (especially when trying to save costs). However, I can imagine a wide array of situations where culled or produced data might exhibit atypical or unexpected distributions of prevalence (at least in this context). Therefore, it’s imperative that any attorney/eDiscovery specialist be proactive about understanding the nature of any corpus of data before making any assumptions about prevalence.

Here are some questions that immediately come to mind. Is the corpus of data relatively untouched raw (unrefined) data or has it already been culled? In the latter instance, it would be worthwhile to get a sense about how the data was culled. Does your data come from a select few high-priority custodians that are likely to contain a large amount of relevant data? Has the data been deNISTed or date-range culled? Was the data provided in complete families? Has the data been search-term culled or culled pursuant to a more sophisticated process such as predictive/meaning-based coding? If so, what do you know about the particulars of the process this data was subjected to? Are you dealing with produced data? These may be easy questions if you or a friendly vendor did the culling, but what if the culling was performed by an adversary to the litigation? Will this information be accessible should you need it (i.e. what if they claim attorney-client privilege or work product)? These are some questions that might serve as a starting point.

If you are ever called to explain why you chose a particular sampling strategy, being able to show that you made a deliberate, informed, and hopefully documented decision one way or the other might prove to be invaluable should you need to justify it.

Again, thanks for this article.

[...] Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2… [...]

This is a great reference post on random sampling. I’ve been giving presentations on practical application of statistical sampling in eDiscovery for a few years now. We gave a presentation today in the eDiscovery track at CEIC 2012 in Vegas. Unfortunately we were up against Craig Ball in the same time slot, but we still managed to attract 70 like-minded people (our free beer giveaway probably helped) interested in putting sampling to work in their organizations. Of note, only 3 were attorneys (per my very scientific hand raising poll). It seems that tech guys like me are often more interested in statistics than lawyers (go figure). If you have time to provide feedback, we’d love to hear it. The presentation with notes can be downloaded from our blog post on the presentation here: http://www.lightboxtechnologies.com/2012/05/22/ceic-2012-statistical-analysis-and-data-sampling-presentation/ .

On to Ellen’s question about low prevalence… you’re right about knowing some things about the data without having to search it, at least when you’re not forced to work in a vacuum. The problem I’ve seen all too often in the past is that many attorneys partially or completely isolate themselves from the actual search process and then try to divine search criteria. Sampling won’t help those people if they don’t really care about the results. One option is stratified sampling – exactly because you *do* know something about the document population before searching, through interviews, etc., you can break the data into sections that make sense for the specific case, and sample within the sections in an attempt to avoid skew. You have to be careful, though, not to introduce your own bias into the sampling process when you do this.
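A minimal sketch of the stratified approach described above, assuming proportional allocation (each section contributes to the sample in proportion to its size) and a simple random draw within each section. The section names, sizes, and function names are hypothetical.

```python
import random


def proportional_allocation(sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size.

    Uses largest-remainder rounding so the allocations sum to total_sample.
    """
    total = sum(sizes.values())
    raw = {name: total_sample * size / total for name, size in sizes.items()}
    alloc = {name: int(share) for name, share in raw.items()}
    leftover = total_sample - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda name: raw[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


def stratified_sample(strata, total_sample, seed=None):
    """Draw a simple random sample inside each stratum."""
    rng = random.Random(seed)
    alloc = proportional_allocation(
        {name: len(docs) for name, docs in strata.items()}, total_sample
    )
    return {name: rng.sample(docs, alloc[name]) for name, docs in strata.items()}


# Hypothetical corpus split by custodian group.
strata = {
    "executives": list(range(0, 1_000)),
    "sales": list(range(1_000, 6_000)),
    "it": list(range(6_000, 10_000)),
}
sample = stratified_sample(strata, total_sample=400, seed=7)
print({name: len(docs) for name, docs in sample.items()})
```

Proportional allocation is only one choice. Deliberately oversampling a section expected to be rich in relevant documents is also common, but the per-section results then need to be reweighted before they are combined into a corpus-wide estimate.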

That’s not the real question, though. What do you do when an “estimate of 0 relevant documents [is] well within the error rates?” Well, that’s kind of the point of using statistically valid sampling, isn’t it? You document your procedures, you clear those procedures with the court, and you move on. Sampling isn’t meant to provide 100% accuracy all the time; its purpose is to make sure you’re testing a representative sample at a specific confidence level and interval. If your procedure is well documented and supported by the math, then hopefully the court will support it as well. Sometimes reviewing 100% of the documents is the best choice, but as we know, those pesky humans have their own issues with being consistent and accurate all the time, too.

[...] including from lawyers themselves. Ralph Losey, for interest, has devoted a post in his blog to the topic of sampling, and his recent blog posts narrating an example predictive coding exercise have contained much [...]

[...] The accuracy measure Prevalence (a/k/a Richness or Yield) is also a term you have seen in this blog many times and is starting to come into general usage among legal search experts. It means the percent of relevant documents (the True Positives and False negatives) to the total corpus. Using the formula above, this is G/I, the Total Relevant documents divided by the Total documents. This important measure was referred throughout my Search Narrative, and before that in my blog Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By …. [...]

[...] are generally a waste of time and money. (Remember the sampling statistics rule of thumb of 2=4 that I have explained before wherein a halving of confidence interval error rate, say from 3% to 1.5%, requires a quadrupling of [...]

Two nits, much delayed but important, I think.

First, your calculations of confidence interval break down at low prevalence rates. It makes no sense, for example, to say that your answer is 3% plus or minus 5%, because you cannot (in this context, at least) have a negative probability. For values near the extremes of prevalence, you are better served with calculations based on the Poisson distribution.
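The breakdown can be shown with a small hypothetical example: 3 relevant documents found in a random sample of 100. The sketch compares the normal-approximation interval with an exact binomial (Clopper-Pearson) interval, which is swapped in here in place of the Poisson calculation the comment mentions because it needs only the standard library. The function names and the 3-in-100 figures are assumptions for illustration.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_interval(k, n, confidence=0.95):
    """Normal-approximation interval; can dip below zero at low prevalence."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = k / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def _solve_p(j, n, target):
    """Find p with binom_cdf(j, n, p) == target by bisection (cdf falls as p rises)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(j, n, mid) > target:
            lo = mid  # cdf still too high, so p must be larger
        else:
            hi = mid
    return (lo + hi) / 2


def exact_interval(k, n, confidence=0.95):
    """Clopper-Pearson interval; always stays within [0, 1]."""
    alpha = 1 - confidence
    lower = 0.0 if k == 0 else _solve_p(k - 1, n, 1 - alpha / 2)
    upper = 1.0 if k == n else _solve_p(k, n, alpha / 2)
    return lower, upper


print("normal approximation:", normal_interval(3, 100))
print("exact binomial:      ", exact_interval(3, 100))
```

The normal approximation produces a negative lower bound here, while the exact interval is asymmetric around 3% and stays above zero.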

Second and more important, all of the discussion above depends on the sample of documents selected being truly random. That is a requirement very rarely met, in my experience. Humans are fundamentally unable to select randomly and even our machine-assisted “random” selections are often subject to selection bias. If you choose your documents from the front of the set, from the back, from a particular custodian or by any other criterion, no matter how helpful you think you’re being, you’ve spoiled the randomness of the sample.
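One way to keep human judgment out of the selection is to let a pseudo-random number generator pick from the complete list of document identifiers and to record the seed so the draw can be reproduced later. This is a minimal sketch; the document ID format, corpus size, and function name are hypothetical.

```python
import random


def draw_random_sample(doc_ids, sample_size, seed=None):
    """Select sample_size documents uniformly at random, without replacement.

    Every document in doc_ids has the same chance of selection, regardless
    of its position, custodian, or date. Recording the seed makes the draw
    repeatable for later documentation.
    """
    rng = random.Random(seed)
    return rng.sample(list(doc_ids), sample_size)


# Hypothetical corpus of 100,000 document identifiers.
corpus = [f"DOC-{i:07d}" for i in range(1, 100_001)]
sample = draw_random_sample(corpus, 385, seed=2012)
print(sample[:5])
```

The sample is only as random as the list it is drawn from: if the list already excludes certain custodians or date ranges, the results describe that subset and not the full collection.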

I pretty much agree with both of your comments. I now use the Binomial calculator for more accurate results.

[...] Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By …. [...]

[…] [24] A random sample is step-three in my recommended eight-step methodology for AI enhanced review, aka predictive coding. See eg my blog at http://e-discoveryteam.com/car/ and the EDBP at http://www.edbp.com/search-review/predictive-coding/. For more on random sampling see Losey, R., Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents (Part Two) found at http://e-discoveryteam.com/2013/06/17/comparative-efficacy-of-two-predictive-coding-reviews-of-699082-enron-documents/; and Losey, R. Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022 found at http://e-discoveryteam.com/2012/05/06/random-sample-calculations-and-my-prediction-that-300000-lawye…. […]

[…] Also consider the many possible errors of random sampling, and the over-reliance on inconsistent humans, SMEs and contract reviewers alike. I have written on these TAR pits before too. Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three; and, Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By …; […]