This is going to be a hyper-technical blog for all those professionals in e-discovery who are struggling, like I am, to fully understand the math governing random sampling, particularly as it is applied to our field of legal search. I can say with a high degree of confidence that most of us who specialize in e-discovery employ random sampling in some form or another as part of our quality control efforts. We typically use random sampling in large-scale review projects. But do we really understand all of the intricacies? Probably not.
Bubble People and the Future Here Now
I would estimate that 80% of the elite few who attend Sedona, as mentioned in my last blog, use random sampling as part of their e-discovery work. But this is a small group of dedicated specialists, probably only a few hundred strong. They are in what Paul D. Weiner likes to call the Sedona Bubble. I have only about a 90% confidence level in that number, however, as I have not yet done a valid poll of the Sedonites (not the best word perhaps for Sedona members, but better than bubble-people). Moreover, I suspect that my margin of error, aka confidence interval, is a high one of 10%. That means that as few as 70% of the Sedonites in fact use sampling, or as many as 90%. See, e.g., “Sampling 101 for the e-Discovery Lawyer,” an appendix to The Sedona Conference Commentary on Achieving Quality in the E-Discovery Process (2009) at pgs. 35-39.
This kind of probabilistic thinking is all part of the future practice of law, coming your way soon. How soon? I’ll tell you in a minute. As William Gibson said: The future is already here — it’s just not very evenly distributed. Many of my readers may already be there, Sedonites or not, and may already use random sampling and statistics as part of their legal practice. But I am pretty sure, and here I’d go as far as say I have a 99.9% confidence level, that most lawyers in the world do not.
My guess is based on my travels and teachings to many lawyer groups around the U.S., not to mention my interaction with many of those delightful lawyers in towns large and small who go by the label of opposing counsel. In other words, these statements and predictions are based on what I have seen, not from a validly random sample of American lawyers. (Hint to the Rand Corporation: here is a good research project for you.) Still, my wetware (gooey brain based) estimates, with a 95% confidence level, that less than 2% of all lawyers now use random sampling in any way. Random sampling is still a rare exception in U.S. legal culture. And therein lies the problem, at least in so far as e-discovery quality control is concerned. Sampling now has a very low prevalence rate.
But those of us in the world of e-discovery are used to that. There are still very few full-time specialists in e-discovery. This is changing fast. It has to in order for the profession to cope with the exploding volume and complexity of written evidence, meaning of course, evidence stored electronically. We e-discovery professionals are also used to the scarcity of valuable evidence in any large e-discovery search. Relevant evidence, especially evidence that is actually used at trial, is a very small percentage of the total data stored electronically. DCG Sys., Inc. v. Checkpoint Techs, LLC, 2011 WL 5244356 at *1 (N.D. Cal. Nov. 2, 2011) (quoting Chief Judge Rader: only .0074% of e-docs discovered ever make it onto a trial exhibit list). Again, this is a question of low prevalence. So yes, we are used to that. See Good, Better, Best: a Tale of Three Proportionality Cases – Part Two; and, Secrets of Search article, Part Three (Relevant Is Irrelevant).
A Losey Prediction
I predict that the rate of prevalence of use of sampling and probabilistic thinking by lawyers will increase rapidly over the next ten years. It must. Random sampling is too powerful a tool for the profession to ignore. It has been well proven as an indispensable tool of science and industry. It is probably time for law to also embrace this tool.
But I will do more than make such vague general assertions. I will now get very specific and put hard metrics on my predictions, metrics with which future lawyers can hold me accountable. (I’m not really too worried as I’ll have Adam to defend me, and he’ll probably come up with some good excuses in the 5% unlikely event I’m wrong.)
I hereby predict that … (trumpets sound) … in the year 2022 a random sample polling of American lawyers will show that 20% of the lawyers in fact use random sampling in their legal practice. I make this prediction with a 95% confidence level and a confidence interval (margin of error) of only 2%. I even predict how the growth will develop on a year-by-year basis, although my confidence in this detail is lower.
But I will go still further out on the limb, and make my prediction even more specific. Assuming that by the year 2022 there are 1.5 Million lawyers (the ABA estimated there were 1,128,729 resident, active lawyers in 2006), I predict that 300,000 lawyers in the U.S. will be using random sampling by 2022. The confidence interval of 2% by which I qualified my prediction means that the range will be between 18% and 22%, which means between 270,000 lawyers and 330,000 lawyers. I have a 95% level of confidence in my prediction, which means there is a 5% chance I could be way wrong, that there could be fewer than 270,000 using random sampling, or more than 330,000. This is all shown by the familiar bell curve first shown above and below. (Hint – Adam, here’s the out to defend my predictions (in the unlikely event you’ll have to.))
I do all of this prognostication somewhat tongue-in-cheek, but with the ulterior motive to provide an example of what I mean by probabilistic thinking. Forget about absolute certainty of knowledge about anything. Forget about perfection. Think reasonability of efforts. Think preponderance of evidence. Think probability. Think in terms of degrees of confidence. For example, I am highly confident that most of you probably get 90% of my humor, give or take 2% of my jokes.
But enough with the pleasantries. I promised a hard-nosed technical math blog for all you super-nerds out there, and now you’re going to get it! (Here is where I predict 50% of my readers will stop reading!)
The Value and Limitations of Random Sampling
When you review a random sample of data drawn from a larger collection (“corpus”), and categorize the sample data in some way, for instance by identifying all documents in the sample as either relevant or irrelevant, and you then project the percentage found in the sample onto the entire corpus, you cannot know for certain that your percentage is the correct answer (e.g., that only 10% of the total corpus is relevant because only 10% of the sample is relevant). But, if the sample size is large enough, and the selection of the sample is truly random, you can know that there is a certain chance, e.g., a 95% chance, or “confidence level,” that you are within a certain margin of error (“confidence interval”) of the correct answer. Put another way, there is a 95% chance that you are correct, at least within a defined plus or minus range.
For my purposes as an e-discovery lawyer concerned with quality control of document reviews, this explanation of near certainty is the essence of random probability theory. This kind of probabilistic knowledge, and use of random samples to gain an accurate picture of a larger group, has been used successfully for decades by science, technology, and manufacturing. It is key to both quality control and understanding large sets of data. The legal profession must now also adopt random sampling techniques to accomplish the same goals in large-scale document reviews.
You can use any standard random sample calculator to determine the appropriate size of a random sample, using either a 95% or 99% confidence level, and the confidence interval of your choice. I suggest you use the calculator shown at the top of the random sample page on my FloridaLawFirm.com website. The confidence interval you plug into the calculator represents the margin of error you find acceptable. Fewer documents are required for a valid random sample size as the confidence interval increases, or as the confidence level decreases.
In the example above where 10% of the sample was relevant, if a confidence interval of 4% is used, that means that the 10% projected level may be as high as 14% or as low as 6%. Take a corpus of 1,000,000 documents and a random sample of 600 documents, which is the sample size required for a 95% confidence level and a +/- 4% confidence interval. If you find that 60 of the sampled documents are relevant and 540 are irrelevant, you can know that there is a 95% chance that the number of relevant documents in the entire corpus is between 60,000 and 140,000. If a confidence interval of 2% is used, and the corresponding number of randomly selected documents is reviewed (2,395), and again 10% are found to be relevant (240), then the range of relevant documents in the corpus is between 80,000 and 120,000. That is how random probability works in a binary classification system. Here is the standard bell curve graphic illustrating a 95% confidence level:
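The projection arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the function name and parameters are my own, and the sample rate and margin are passed in as given rather than calculated.

```python
def project_range(corpus_size, sample_rate, margin):
    """Project a sample's relevance rate onto the corpus, plus or minus the margin.

    Illustrative helper only; sample_rate and margin are fractions (0.10 = 10%).
    Returns (low, point estimate, high) as document counts.
    """
    low = max(0.0, sample_rate - margin)   # a rate cannot drop below 0%
    high = min(1.0, sample_rate + margin)  # or rise above 100%
    return tuple(round(corpus_size * r) for r in (low, sample_rate, high))


print(project_range(1_000_000, 0.10, 0.04))  # 95% +/- 4%, sample of 600
print(project_range(1_000_000, 0.10, 0.02))  # 95% +/- 2%, sample of 2,395
```

The two calls reproduce the ranges of 60,000 to 140,000 and 80,000 to 120,000 documents discussed above.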

The variation in sample size required for various confidence levels and intervals is shown in the graph below. It illustrates the sample sizes needed for 90%, 95%, and 99% confidence levels with confidence intervals of 10%, 5% and 2%.

The math at work for calculating sample sizes and confidence intervals involves square root calculations, as will be shown in the fun math part below. This essentially requires about a quadrupling of sample size in order to achieve a doubling of accuracy. Put another way, if you want to cut your error margin in half, you will have to quadruple your sample size. For instance, assuming a population size of 1,000,000, and a 95% confidence level, the sample size required for a 10% confidence interval is 96. The sample size required for a 5% confidence interval is 384. The sample size required for a 2.5% confidence interval is 1534. The sample size required for a 1.25% confidence interval is 6,109.
This is a good rule of thumb to remember. If you want to cut your error rate (your confidence interval) in half, and thus double your accuracy, your cost to do so will quadruple. It will quadruple, at least approximately, because you will have four times as many documents in the sample to review. Twice the quality at four times the cost. Thus 2=4 in the world of quick calculations for random sampling. Hopefully the picture of my old unmanicured thumb will help you to remember this.
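Here is a rough Python sketch of the quadrupling rule. It assumes the worst-case 50% prevalence and the usual finite population correction, which appears to be what the calculators apply to a corpus of 1,000,000. The function name is hypothetical, and results are rounded to the nearest document to match the figures above (rounding up would be the more conservative choice).

```python
def sample_size_fpc(z, margin, population, prevalence=0.5):
    """Sample size for a given margin, with a finite population correction.

    Assumption: the correction n0 / (1 + (n0 - 1) / N) is the one the
    calculators use; the post itself does not say so.
    """
    n0 = z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    return n0 / (1 + (n0 - 1) / population)


previous = None
for margin in (0.10, 0.05, 0.025, 0.0125):
    n = sample_size_fpc(1.96, margin, 1_000_000)
    ratio = "" if previous is None else f" ({n / previous:.2f}x the previous size)"
    print(f"+/- {margin:.2%}: {round(n):,} documents{ratio}")
    previous = n
```

Each halving of the margin multiplies the sample by just under four. It is slightly less than four only because of the finite population correction.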
The Impact of Prevalence on Random Sampling Calculations
The second calculator shown on my linked page allows you to add another dimension, another criterion, to your probability analysis, namely “prevalence.” This is especially important to understand in the field of legal search where low prevalence rates are common. In the binary example of relevance, the prevalence of the corpus is the percentage of relevant documents. The prevalence percentage has a direct numerical impact on the margin of error (“confidence interval”) applicable to the sample projections. Prevalence is also known as “richness,” as in target-richness, or “response distribution.” See eg. another sample size calculator by RAOsoft.com that includes these criteria and an explanation.
The first calculator shown on my website assumes what some call the “worst case scenario” for sample prediction where the prevalence is 50%. This is a perfectly even distribution, which requires the largest sample size to attain a desired confidence level. The top calculator conservatively assumes that half of the corpus will be in the target group, i.e., relevant. When the target rate or prevalence is 50/50, that requires the highest number of documents to be sampled for statistical validity, which is why it is called the “worst case scenario.” When the prevalence rate is higher or lower than 50%, the number of documents that must be sampled decreases.
Thus, if the prevalence rate is 95%, meaning in our example, 95% of the documents are relevant, or, conversely, if the richness is very low, and the prevalence rate is only 5%, again a smaller sample is required to attain the same confidence interval. Put another way, review of the same sample size creates a much lower confidence interval, and thus a much lower margin of error. This is very important to understanding the binary classifications of a large corpus of data where only a small amount of the data is responsive, i.e., is relevant. (Another example of a binary classification could be privileged or not.)
Try out the second standard random sample calculator shown on my website to see this for yourself. In the first example shown, assuming a corpus of one million documents, with a confidence interval of 4, you see that a sample size of 600 documents is required. This is the largest possible sample size required for the 95% +/- 4. It assumes the worst case scenario of 50% prevalence (i.e. – half of the documents are relevant). Now change the prevalence percentage to 95% in the second calculator, using a sample size of 600, and a corpus of 1,000,000. The confidence interval is now 1.74%. You get the same result when you assume a prevalence rate of only 5%.
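For readers who would rather check this in code than in a web calculator, here is a minimal Python sketch. The function name is my own, and it ignores the finite population correction, which barely matters for a sample of 600 out of 1,000,000.

```python
import math


def margin_of_error(z, prevalence, sample_size):
    """Confidence interval (margin of error) for a fixed sample size.

    Illustrative only: I = Z * sqrt(p(1-p)/n), no finite population correction.
    """
    return z * math.sqrt(prevalence * (1 - prevalence) / sample_size)


for prevalence in (0.50, 0.95, 0.05):
    moe = margin_of_error(1.96, prevalence, 600)
    print(f"prevalence {prevalence:.0%}: +/- {moe:.2%}")
```

The 50% case comes out at about 4%, and the 95% and 5% cases both come out at about 1.74%, because p(1-p) is the same for p and 1-p.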
Again, see the Sample Size Calculator at RaoSoft.com for a calculator that allows you to plug-in different prevalence rates (called “response distribution” in that calculator) to determine sample sizes for certain intervals based on prevalence. Bottom line, when you have a corpus with a high or low prevalence, one that is either target rich, or target poor, a smaller sample size is required to attain an acceptable confidence interval. (Note, there are some exceptions where, for instance, there are extreme values (“outliers”) or where there are small corpus sizes.)
A good way to understand prevalence is by example. Assume a 1,000,000 document corpus with a prevalence rate of 5% (one where 5% or less of the documents are relevant). You need only review 456 documents to know with 95% certainty, and an error rate of only 2%, the total number of relevant documents. Remember, if you had assumed that half of the documents were relevant, then you would have had to review 2,395 documents to attain the same confidence level and interval. See for yourself by trying this out in the standard calculators on my page and on RaoSoft’s.
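The comparison can be sketched in Python as follows. The helper is hypothetical, and the finite population correction it applies is an assumption about how the calculators arrive at 2,395 rather than 2,401 for a corpus of 1,000,000.

```python
def sample_size(z, prevalence, margin, population=None):
    """Required sample size, optionally corrected for a finite population."""
    n = z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    if population is not None:
        # Assumed correction: n0 / (1 + (n0 - 1) / N).
        n = n / (1 + (n - 1) / population)
    return n


corpus = 1_000_000
print(round(sample_size(1.96, 0.05, 0.02, corpus)))  # 5% prevalence
print(round(sample_size(1.96, 0.50, 0.02, corpus)))  # worst-case 50% prevalence
print(round(sample_size(1.96, 0.50, 0.02)))          # 50%, no correction
```

Under these assumptions the three results are 456, 2,395 and 2,401. The last figure is what the bare formula gives when the size of the corpus is ignored.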
This characteristic of random sampling must be understood for cost-effective quality control in a corpus with low prevalence. This is important because low prevalence is the norm in legal search, and not the so-called standard normal distribution used in other fields, where you assume the hard-search of separating out half of a 50/50 split.
Mathematical Formula for Random Sample Size Calculations
Here is one way of expressing the basic formula behind most standard random sample size calculators:
n = Z² x p(1-p) ÷ I²
Description of the symbols in the formula:
n = required sample size
Z = standard score for the chosen confidence level (the value of Z in statistics is called the “Standard Score,” wherein a 90% confidence level=1.645, 95%=1.96, and 99%=2.576)
p = estimated prevalence of target data (richness)
I = confidence interval or margin of error
Putting the formula into words – the required sample size is equal to the confidence level squared, times (the estimated prevalence times one minus the estimated prevalence), then divided by the square of the confidence interval.
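If you are curious where the Z values come from, they can be recovered from the standard normal distribution. Here is a small Python sketch using the standard library; the function name is my own.

```python
from statistics import NormalDist


def z_score(confidence_level):
    """Two-sided standard score for a confidence level given as a fraction (0.95 = 95%)."""
    # A two-sided interval leaves half of the excluded probability in each tail.
    return NormalDist().inv_cdf((1 + confidence_level) / 2)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence level -> Z = {z_score(level):.3f}")
```

This prints roughly 1.645, 1.960 and 2.576 for the three confidence levels.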
Here is an example of the formula in action where we assume a 95% confidence level and confidence interval of 2%, and a prevalence of 4%:
n = Z² x p(1-p) ÷ I²
n = 1.96² x .04(1-.04) ÷ .02²
n = 3.8416 x .04(.96) ÷ .0004
n = 3.8416 x .0384 ÷ .0004
n = .14751744 ÷ .0004
n = 368.7936
The formula shows that with an estimated prevalence of 4% we need a sample size of 369 documents to attain a 95% confidence level with a margin of error of 2%.
It is important to understand that this sample size formula is derived from the formula for calculating confidence intervals (I):
I = Z√(p(1-p)/n)
If you take the “n” value as unknown (the number to be sampled for a specified confidence interval), and assign a value to the confidence level of say, 95%, wherein the value for “Z” is thus 1.96, and you move the “n” to the left side of the equation, the formula now looks like this:
n = (1.96/I)² x p(1-p)
Mathematically this is the same thing as our original formula:
n = Z² x p(1-p) ÷ I²
We can easily prove the formulas are identical by example where we again assume a 95% +/- 2%, and a prevalence of 4%:
I = Z√(p(1-p)/n)
.02 = 1.96 √(.04(1-.04)/n)
n = (1.96/.02)² x .04(.96)
n = (98)² x .0384
n = 9604 x .0384
n = 368.7936
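For those who prefer symbols to numbers, the same equivalence can be written out in general form. This is just the algebra behind the worked example, using the post's own symbols n, Z, p and I:

```latex
\begin{align*}
I &= Z\sqrt{\frac{p(1-p)}{n}} \\
I^{2} &= \frac{Z^{2}\,p(1-p)}{n} \\
n &= \frac{Z^{2}\,p(1-p)}{I^{2}} = \left(\frac{Z}{I}\right)^{2} p(1-p)
\end{align*}
```

Squaring both sides removes the square root, and swapping n and I² gives the sample size formula.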
Here is another example using the formula I prefer, and following our first assumptions where the estimated prevalence rate is 5% relevant documents, and a 95% confidence level is desired with a confidence interval of 2%. The following relatively simple mathematical calculation provides the required sample size:
n = 1.96² x .05(1-.05) ÷ .02²
n = 3.8416 x .05(.95) ÷ .0004
n = 3.8416 x .0475 ÷ .0004
n = .182476 ÷ .0004
n = 456.19
Now if you change the prevalence rate from 5% to 50%, the formula increases the required sample size for a 95% confidence with plus or minus 2% as follows:
n = 1.96² x .5(1-.5) ÷ .02²
n = 3.8416 x .5(.5) ÷ .0004
n = 3.8416 x .25 ÷ .0004
n = .9604 ÷ .0004
n = 2401
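The three worked examples can be checked with a short Python sketch of the sample size formula. The function name is illustrative, and no finite population correction is applied, which is why the 50% case gives 2,401 rather than the 2,395 reported by the calculators.

```python
def sample_size(z, prevalence, margin):
    """n = Z^2 * p(1-p) / I^2, with no finite population correction."""
    return z ** 2 * prevalence * (1 - prevalence) / margin ** 2


for prevalence in (0.04, 0.05, 0.50):
    n = sample_size(1.96, prevalence, 0.02)
    print(f"prevalence {prevalence:.0%}: n = {n:.4f}, about {round(n):,} documents")
```

Rounded to whole documents, the results are 369, 456 and 2,401.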
Do the math above. Really, it is not that hard. It is all just multiplication and division. It shows that with the lower prevalence rates commonly found in legal search you can make accurate predictions using lower sample sizes. Further, if you do determine sample size based on an assumed 50% prevalence rate, whereas in fact you have a much lower rate, you are actually lowering your confidence interval, your margin of error.
Thus, if you use a standard calculator that by default has a worst-case 50% distribution or prevalence rate built-in, and review 2,401 documents, which you thought was the sample size necessary to attain a confidence interval of 2%, and you in fact were dealing with a document corpus that only had a 5% prevalence rate, having 95% irrelevant documents, then in fact your calculations will have a confidence interval (error rate) of only .87%, and not the 2% interval you thought. That is a good thing.
Again, don’t believe me. Do the math. Use the Interval formula that the sample size formula is based upon. (You may also need a calculator that does square root.)
I = Z√(p(1-p)/n)
I = 1.96√(.05(1-.05)/2401)
I = 1.96√(.05(.95)/2401)
I = 1.96√(.0475/2401)
I = 1.96√.00001978342357
I = 1.96 x .004447856064443
I = .00871779788631
You can also use the second standard calculator on my page. Just plug in a 95% confidence level, a sample size of 2401, a population of 1,000,000, and a prevalence percentage of 5. It should calculate a confidence interval of 0.87. You can also double-check by using the RAOsoft calculator.
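The same check can be done in Python. This is a sketch under my own assumptions: the function name is hypothetical, and the optional finite population correction is included only to show that it does not change the rounded answer for a corpus of 1,000,000.

```python
import math


def margin_of_error(z, prevalence, sample_size, population=None):
    """I = Z * sqrt(p(1-p)/n), optionally with a finite population correction."""
    moe = z * math.sqrt(prevalence * (1 - prevalence) / sample_size)
    if population is not None:
        # Assumed correction factor: sqrt((N - n) / (N - 1)).
        moe *= math.sqrt((population - sample_size) / (population - 1))
    return moe


print(f"assumed 50% prevalence: +/- {margin_of_error(1.96, 0.50, 2401):.2%}")
print(f"actual 5% prevalence:   +/- {margin_of_error(1.96, 0.05, 2401):.2%}")
print(f"5%, corpus of 1,000,000: +/- {margin_of_error(1.96, 0.05, 2401, 1_000_000):.2%}")
```

The 50% case gives the 2% interval the sample was sized for, while both 5% cases give about 0.87%.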
Additional Math Disclaimer
I have a disclaimer on all of my blog postings. See the top title and the first link on the right hand column: DISCLAIMER. On this particular post I thought it would be a good idea to add yet another level of disclaimer. Although math is math, and these are well accepted formulas and principles, these are still just my personal applications and synthesis of information and rules applicable in the field of statistics and legal search. I reserve the right to go back and make revisions to this post as my understanding deepens and improves. I am an attorney, not an information scientist or statistician. These views should not be relied upon, nor accepted as anyone’s opinion other than my own. You should, of course, always do your own due diligence, study and analysis. Like I said, do the math.
As always, if you disagree with the analysis here, or detect any math errors, please let me know. I welcome a free exchange of ideas and information. You can either email me privately, or write a public comment. That is how my blog works. I put my ideas out there for peer-review, and I make corrections as I go along, and before the blogs are ultimately transformed into a book. I appreciate all of the help my learned readers have provided to me over the years since I first began this open writing experiment in 2006. The odds are, your comments will help make my next book even better.
Conclusion
This blog has discussed thirteen different scenarios showing probabilistic analysis:
- I began with analysis of e-discovery expert bubble people wherein I estimate, based on anecdotal evidence, that 80% already use random sampling in some manner. I have only a 90% confidence level in that, with a confidence interval of 10%, so it could actually range from 70% to 90%, and maybe a lot more or less.
- Then I moved on to analysis of all lawyers in the world. I estimated that a majority (51% or more) do not use random sampling at all. I put a 99.9% confidence level on that opinion and invited the Rand Corporation to try to prove me wrong.
- Then I turned my half-witty attention to all lawyers in the U.S. and opined that less than 2% use random sampling. I put a 95% confidence level on that one.
- Then I made my prediction that in ten years the number of lawyers in the U.S. using random sampling will increase tenfold from 2% to 20%. I am 95% confident on that projection, but I put a margin of error on it of plus or minus 2%. Based on the ABA’s estimate of the number of lawyers in America, I projected that from between 270,000 to 330,000 lawyers will be using random sampling by 2022. Rand Corp., make a note and do a follow-up survey in 2022, would you please?
- I next estimated that my blog readers get 90% of the humor in this blog (or better said, attempts at same), with a confidence interval of 2%, meaning between 88% and 92%.
- Serious sampling examples then began where I assumed a 95% confidence level, and 4% confidence interval. A review of a sample of 600 documents found that 60 were relevant (10%). Based on the sample we can project that 100,000 of the documents in the million document corpus would be relevant, with a range of between 6% and 14%, which means between 60,000 and 140,000 documents.
- Another variation of the last example was then considered where a confidence interval of 2% was used, instead of 4%. This required a sample size of 2,395 documents, where 10% were again found to be relevant (240). Since a 2% interval was used, the range of relevant documents projected was narrower, from between 80,000 and 120,000.
- Next, I added consideration of prevalence into the sample size formulas and started with an example of a 95% confidence level, and either 5% or 95% prevalence ratio (same either way). With a review of a random sample of 600 documents, and either a 5% or 95% prevalence, I showed that the confidence interval improved from 4% to 1.74%. This is an important point.
- Then I considered a 5% prevalence, where I showed that a sample of only 456 documents provides a 95% certainty and an error rate of 2%. This compared to the need to sample 2,395 documents for a 2% confidence interval if you assume 50% prevalence. Another important point.
- Then I showed the actual mathematical calculations explaining the formulas and used an example of a 95% confidence level, a 2% confidence interval, and a prevalence of 4%. You remember, it went like this and showed you only had to sample 369 documents:
n = Z² x p(1-p) ÷ I²
n = 1.96² x .04(1-.04) ÷ .02²
n = 3.8416 x .04(.96) ÷ .0004
n = 3.8416 x .0384 ÷ .0004
n = .14751744 ÷ .0004
n = 368.7936
- The next formula I ran again assumed a 95% confidence level and 2% interval, but this time changed the prevalence to 5%. The formula showed a required sample size of 456 documents.
- Then I ran the math on 95% +/- 2, but this time assuming a 50% prevalence. The formula showed a required sample size of 2,401 documents.
- Then I ended with another twist where the sample size of 2,401 documents is used, but this time a 5% prevalence is assumed. The interval calculation formula showed that a .87% confidence interval results. That was shown in the only formula where you had to do a square root calculation:
I = Z√(p(1-p)/n)
I = 1.96√(.05(1-.05)/2401)
I = 1.96√(.05(.95)/2401)
I = 1.96√(.0475/2401)
I = 1.96√.00001978342357
I = 1.96 x .004447856064443
I = .00871779788631
I pointed out that you could skip the math entirely if you wanted, and attain the same results by using the random sample size calculators on my page, or the RAOsoft calculator, or any of a number of other calculators freely available on the web. Depending on what software you are using for review, you might also have this ability built-in. You can also skip formulas and calculators altogether and rely upon charts that list common values. These charts typically assume a prevalence of 50%. See, e.g., Sample Size Table from Research Advisors. It can be helpful to look at these charts anyway to get a feel for how the numbers relate. For instance, look at these tables from the University of Florida, Professor Glenn D. Israel:
Table 1. Sample size for ±3%, ±5%, ±7% and ±10% Precision Levels Where Confidence Level is 95% and P=.5. Columns show the Sample Size (n) for Precision (e) of:

| Size of Population | ±3% | ±5% | ±7% | ±10% |
|---|---|---|---|---|
| 500 | a | 222 | 145 | 83 |
| 600 | a | 240 | 152 | 86 |
| 700 | a | 255 | 158 | 88 |
| 800 | a | 267 | 163 | 89 |
| 900 | a | 277 | 166 | 90 |
| 1,000 | a | 286 | 169 | 91 |
| 2,000 | 714 | 333 | 185 | 95 |
| 3,000 | 811 | 353 | 191 | 97 |
| 4,000 | 870 | 364 | 194 | 98 |
| 5,000 | 909 | 370 | 196 | 98 |
| 6,000 | 938 | 375 | 197 | 98 |
| 7,000 | 959 | 378 | 198 | 99 |
| 8,000 | 976 | 381 | 199 | 99 |
| 9,000 | 989 | 383 | 200 | 99 |
| 10,000 | 1,000 | 385 | 200 | 99 |
| 15,000 | 1,034 | 390 | 201 | 99 |
| 20,000 | 1,053 | 392 | 204 | 100 |
| 25,000 | 1,064 | 394 | 204 | 100 |
| 50,000 | 1,087 | 397 | 204 | 100 |
| 100,000 | 1,099 | 398 | 204 | 100 |
| >100,000 | 1,111 | 400 | 204 | 100 |

a = Assumption of normal population is poor (Yamane, 1967). The entire population should be sampled.
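To get a feel for how the numbers in such a chart relate, here is a small Python sketch. It rests on my own inference, not on anything stated above: most entries in this table appear consistent with Yamane's simplified formula, n = N / (1 + N x e²), rounded to the nearest whole number. A few entries (for example 20,000 at ±7%) come out a document or two different, so treat this as an approximation of the table rather than its source.

```python
def yamane_sample_size(population, precision):
    """Yamane's simplified formula: n = N / (1 + N * e^2).

    Assumption: this is how most table entries were produced; the formula
    builds in a 95% confidence level and P = .5.
    """
    return population / (1 + population * precision ** 2)


for population in (500, 2_000, 10_000):
    row = [round(yamane_sample_size(population, e)) for e in (0.05, 0.07, 0.10)]
    print(f"population {population:,}: ±5%, ±7%, ±10% -> {row}")
```

For a population of 2,000, for instance, this gives 333, 185 and 95, matching the table row.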
Even though calculators and charts make sample size determination easy, it is good to know how to do the math yourself. That provides a solid understanding of what the calculators and charts are doing and why. Also see the work of the EDRM on the subject: Statistical Sampling Applied to Electronic Discovery; and, Appendix 2: Application of Sampling to E-Discovery Search Result Evaluation.
The math we examined shows the importance of prevalence to random sample size calculations and confidence interval calculations. This has been overlooked, or at least underestimated, by many in the field of e-discovery. This error often leads to over-sampling and review of more documents than required to obtain reasonable confidence levels and intervals. The routine assumption of a worst-case-scenario of 50% prevalence leads to overkill and unnecessarily large samples for many (but not all) uses of random sampling, including many quality control calculations. We need to start adding prevalence into our equations, and start being more efficient in our quality control metrics.
I look forward to your public and private comments. Hopefully I have caught all of the minor number and math mistakes (I have already spotted and corrected quite a few), but it is late, and I may well have missed some. Please let me know if you see any more errors.