Blogging, Proportional Review and Predictive Coding

May 13, 2012

I did an interview recently with Andrew Bartholomew of e-Discovery Beat. I told him he could ask me anything, except for cases involving my law firm. Andrew put the audio of the entire interview online, and added an edited transcript of selections in two segments: part one and part two. Here are a couple of questions that you might find of interest, especially the first one about blogging, which I have been asked about a lot lately.

After last week’s difficult blog on random sampling, this one is an easy-breather. But, don’t worry, I try not to bore. The interview includes a zinger against all abusers of e-discovery. You know who I mean. All those caveman lawyers out there who abuse e-discovery as a blunt tool for extortion. They only use e-discovery to try to drive up the other side’s costs at every turn. They are not really looking for the truth. They will do or say anything to win a case, to make money for themselves. See: Judge David Waxse on Cooperation and Lawyers Who Act Like Spoiled Children.

E-discovery is a powerful tool for truth, a tool for justice. It can be dangerous in the wrong hands. We must all stand up, and stand together, to protect e-discovery from abusive bullies. That includes exercise of your First Amendment rights to free speech and free association. That is what our country is all about.

Blogging

Bartholomew: How did you come to be such a prolific blogger? Where most blogs just skim the surface, your E-Discovery Team blog really dives deep into the issues.

Losey: When I first started doing this in 2006, the blog posts were shorter and I didn’t provide a whole lot of analysis. I was mainly talking about new cases. But after doing this every week for five-and-a-half years now, it has become second nature. I find that my writing evolves as my own understanding evolves.

I’m pretty opinionated at this point because I’ve been doing it so long. I have become the analysis and opinion guy in e-discovery. I don’t try to report on each new case that comes out. Occasionally, I’ll have someone send me an opinion say, for instance, by Judge Scheindlin right off the presses, and I like to rush out there and write something that’s kind of news oriented. But generally speaking I am more of an analysis and commentary kind of guy to help people think it through.

I try to help the profession by sharing the experience of my being a lawyer for 32 years and being an avid technology person my whole life. Being there in the field as a practicing attorney, I see what’s going on. I know what the fights are in the courtrooms. Based on that, I have a lot of source material and information that comes my way. I’m doing analysis anyway as part of my job, so it’s not that hard to share it and write it up on my blog.

Bottom-Line-Driven Proportional Review

Bartholomew: You mentioned e-discovery case-law. Are there any important case-law trends that you’re following at the moment?

Losey: I came out a few months ago on my blog and went public with something I’ve been doing internally at my law firms and that’s bottom-line-driven proportional review. This is something we try to do every chance we get in every case to make sure that our production responses are proportional to the value of the case. It is my way of trying to control what I think is the primary problem in e-discovery today, and that is runaway costs.

It involves estimation and budgeting and figuring out what a project is going to cost before you actually begin. It seems like basic common sense, but you’d be amazed. That is not the way things have been done in the past. There are still plenty of law firms around the country, if not the world, that begin production responses without a set budget or without having a clue what it’s going to cost them. We see examples of this in the case-law almost every day.

I’m now trying to promote this; just get the idea out there. Use your knowledge and experience about what things cost to make an estimate at the very beginning of a case as to what is proportionate to spend on e-discovery. I call it bottom-line-driven proportional review. I want everybody to be making this argument.

Who wouldn’t be in favor of proportionate expenses? Who wouldn’t be in favor of curtailing out-of-control e-discovery costs? Who wouldn’t be in favor of reasonability when it comes to e-discovery?

There are some people that wouldn’t be in favor of it. These are the people I want to stop, the people that use e-discovery as a weapon, not as a valid tool to obtain the truth in order to decide cases.

Predictive Coding and Human Review

Bartholomew: How does the advent of computer-assisted review, or predictive coding, stand to impact the role of human review in e-discovery going forward?

Losey: The need for human input is never going to go away. Predictive coding does not replace human reviewers. Having said that, it may reduce the number of human reviewers, but so will proportional discovery.

If you use predictive coding as a tool, but you don’t use it with a legal method, it’s worthless. A hammer doesn’t build a house. It takes a carpenter to use the hammer to build the house. Predictive coding is just the latest, coolest tool, but it doesn’t replace the carpenter.

It doesn’t replace all the other tools either. I’m the one that said keyword search is very limited, but the truth is, you still need keyword search. It’s still a very valuable and important tool. It’s just not the best tool. But it still needs to be used, and so do the human reviewers.

The other slogan I’m talking about right now is called hybrid (computer and human), multi-modal (many methods) computer-assisted review. This is what it’s all about. It’s having computers help us to do a faster, better, higher quality, yet less expensive, review. Basically it allows us to get more bang for our buck.

If you’re on a budget, you better be delivering the relevant documents within that budget. The best way to do that is with the latest tools; predictive coding is the latest tool. But it’s just a drop down menu on any good software review tool, along with concept review, the similarity feature where you’re grouping words using near de-duplication, as well as keyword search.

The foundation to all of these techniques is expert human review. The human input has become even more important with predictive coding because now you need to bring in experts at the beginning. You need to bring in the people who really know what’s relevant and what isn’t in order to train the computer and generate the seed-set. If anything, the latest predictive coding technologies have elevated the importance of the expert lawyer.

Bartholomew: Are there other issues or trends that we might be hearing about from you on your blogs or future presentations?

Losey: I’m going to continue to talk a lot about predictive coding and using technology because I really believe that the only way to get out of the mess we’re in of having too much information – a problem created by technology – is to use more technology. We have to fight fire with fire. I’m going to keep encouraging the law to use technology and the knowledge and intelligence we have in computers in order to do e-discovery – not only in an inexpensive way, but also in a quality way where you get the information you need.

The new trend I’ve been talking about is the growing importance of information science on the law. It’s one thing to have technology impact the law, but you must balance out the technology with the deep knowledge and real understanding that you can really only get from science. That’s the only way law is going to be able to use technology in an appropriate manner.


Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022

May 6, 2012

This is going to be a hyper-technical blog for all those professionals in e-discovery who are struggling, like I am, to fully understand the math governing random sampling, particularly as it is applied to our field of legal search. I can say with a high degree of confidence that most of us who specialize in e-discovery employ random sampling in some form or another as part of our quality control efforts. We typically use random sampling in large-scale review projects. But do we really understand all of the intricacies? Probably not.

Bubble People and the Future Here Now

I would estimate that 80% of the elite few who attend Sedona, as mentioned in my last blog, use random sampling as part of their e-discovery work. But this is a small group of dedicated specialists, probably only a few hundred strong. They are in what Paul D. Weiner likes to call the Sedona Bubble. I have only about a 90% confidence level in that number, however, as I have not done a valid poll yet of the Sedonites (not the best word perhaps for Sedona members, but better than bubble-people). Moreover, I suspect that my margin of error, aka confidence interval, is a high one of 10%. That means that as few as 70% of the Sedonites in fact use sampling, or as many as 90%. See eg. “Sampling 101 for the e-Discovery Lawyer,” an appendix to The Sedona Conference Commentary on Achieving Quality in the E-Discovery Process (2009) at pgs. 35-39.

This kind of probabilistic thinking is all part of the future practice of law, coming your way soon. How soon? I’ll tell you in a minute. As William Gibson said: The future is already here — it’s just not very evenly distributed. Many of my readers may already be there, Sedonites or not, and may already use random sampling and statistics as part of their legal practice. But I am pretty sure, and here I’d go as far as say I have a 99.9% confidence level, that most lawyers in the world do not.

My guess is based on my travels and teachings to many lawyer groups around the U.S., not to mention my interaction with many of those delightful lawyers in towns large and small who go by the label of opposing counsel. In other words, these statements and predictions are based on what I have seen, not from a validly random sample of American lawyers. (Hint to the Rand Corporation: here is a good research project for you.) Still, my wetware (gooey brain based) estimates, with a 95% confidence level, that less than 2% of all lawyers now use random sampling in any way. Random sampling is still a rare exception in U.S. legal culture. And therein lies the problem, at least in so far as e-discovery quality control is concerned. Sampling now has a very low prevalence rate.

But those of us in the world of e-discovery are used to that. There are still very few full-time specialists in e-discovery. This is changing fast. It has to in order for the profession to cope with the exploding volume and complexity of written evidence, meaning of course, evidence stored electronically. We e-discovery professionals are also used to the scarcity of valuable evidence in any large e-discovery search. Relevant evidence, especially evidence that is actually used at trial, is a very small percentage of the total data stored electronically. DCG Sys., Inc. v. Checkpoint Techs, LLC, 2011 WL 5244356 at *1 (N.D. Cal. Nov. 2, 2011) (quoting Chief Judge Rader: only .0074% of e-docs discovered ever make it onto a trial exhibit list). Again, this is a question of low prevalence. So yes, we are used to that. See Good, Better, Best: a Tale of Three Proportionality Cases – Part Two; and, Secrets of Search article, Part Three (Relevant Is Irrelevant).

A Losey Prediction

I predict that the rate of prevalence of use of sampling and probabilistic thinking by lawyers will increase rapidly over the next ten years. It must. Random sampling is too powerful a tool for the profession to ignore. It has been well proven as an indispensable tool of science and industry. It is probably time for law to also embrace this tool.

But I will do more than make such vague general assertions. I will now get very specific and put hard metrics on my predictions, metrics with which future lawyers can hold me accountable. (I’m not really too worried as I’ll have Adam to defend me, and he’ll probably come up with some good excuses in the 5% unlikely event I’m wrong.)

I hereby predict that … (trumpets sound) … in the year 2022 a random sample polling of American lawyers will show that 20% of the lawyers in fact use random sampling in their legal practice. I make this prediction with a 95% confidence level and a margin of error (confidence interval) of only 2%. I even predict how the growth will develop on a year-by-year basis, although my confidence in this detail is lower.

But I will go still further out on the limb, and make my prediction even more specific. Assuming that by the year 2022 there are 1.5 Million lawyers (the ABA estimated there were 1,128,729 resident, active lawyers in 2006), I predict that 300,000 lawyers in the U.S. will be using random sampling by 2022. The confidence interval of 2% by which I qualified my prediction means that the range will be between 18% and 22%, which means between 270,000 lawyers and 330,000 lawyers. I have a 95% level of confidence in my prediction, which means there is a 5% chance I could be way wrong, that there could be fewer than 270,000 using random sampling, or more than 330,000. This is all shown by the familiar bell curve shown above and below. (Hint – Adam, here’s the out to defend my predictions, in the unlikely event you’ll have to.)

I do all of this prognostication somewhat tongue-in-cheek, but with the ulterior motive to provide an example of what I mean by probabilistic thinking. Forget about absolute certainty of knowledge about anything. Forget about perfection. Think reasonability of efforts. Think preponderance of evidence. Think probability. Think in terms of degrees of confidence. For example, I am highly confident that most of you probably get 90% of my humor, give or take 2% of my jokes.

But enough with the pleasantries. I promised a hard-nosed technical math blog for all you super-nerds out there, and now you’re going to get it! (Here is where I predict 50% of my readers will stop reading!)

The Value and Limitations of Random Sampling

When you review a random sample drawn from a larger collection of data (the “corpus”), and categorize the sample data in some way, for instance by identifying all documents in the sample as either relevant or irrelevant, and you then project the percentage found in the sample onto the entire corpus, you cannot know for certain that your percentage is the correct answer (e.g., that exactly 10% of the total corpus is relevant because 10% of the sample is relevant). But, if the sample size is large enough, and the selection of the sample is truly random, you can know that there is a certain chance, e.g., a 95% chance, or “confidence level,” that you are within a certain margin of error (“confidence interval”) of the correct answer. Put another way, there is a 95% chance that you are correct, at least within a defined plus or minus range.

For my purposes as an e-discovery lawyer concerned with quality control of document reviews, this explanation of near certainty is the essence of random probability theory. This kind of probabilistic knowledge, and use of random samples to gain an accurate picture of a larger group, has been used successfully for decades by science, technology, and manufacturing. It is key to both quality control and understanding large sets of data. The legal profession must now also adopt random sampling techniques to accomplish the same goals in large-scale document reviews.

You can use any standard random sample calculator to determine the appropriate size of a random sample, using either a 95% or 99% confidence level, and the confidence interval of your choice. I suggest you use the calculator shown at the top of the random sample page on my FloridaLawFirm.com website. The confidence interval you plug into the calculator represents the margin of error you find acceptable. Fewer documents are required for a valid random sample as the confidence interval increases, or the confidence level decreases.

In the example above where 10% of the sample was relevant, if a confidence interval of 4 is used, that means that the 10% projected level may be as high as 14% or as low as 6%. Suppose you have a corpus of 1,000,000 documents and review a random sample of 600 documents, which is the sample size required for a 95% confidence level and +/- 4% confidence interval, and you find that 60 of the documents are relevant and 540 are irrelevant. You can then know that there is a 95% chance that the number of relevant documents in the entire corpus is between 60,000 and 140,000. If a confidence interval of 2% is used, and the corresponding number of randomly selected documents is reviewed (2,395), and again 10% were found to be relevant (240), then the range of relevant documents in the corpus is between 80,000 and 120,000. That is how random probability works in a binary classification system.

[Figure: the standard bell curve illustrating a 95% confidence level]
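For readers who like to check this kind of projection in code, here is a minimal Python sketch of the arithmetic in the example. The function name `project_range` and its parameters are hypothetical, chosen only for illustration, and the margin of error is taken as given rather than calculated.

```python
def project_range(sample_proportion, margin, corpus_size):
    """Project a sample proportion onto the corpus, plus or minus the margin of error.

    All three arguments are illustrative inputs: the proportion and margin are
    fractions (0.10 means 10%), corpus_size is a document count.
    """
    low = max(0.0, sample_proportion - margin) * corpus_size
    high = min(1.0, sample_proportion + margin) * corpus_size
    point = sample_proportion * corpus_size
    return round(low), round(point), round(high)

# 60 of 600 sampled documents relevant (10%), +/- 4%, corpus of 1,000,000
print(project_range(60 / 600, 0.04, 1_000_000))  # (60000, 100000, 140000)

# Same 10% finding with the larger sample and a +/- 2% interval
print(project_range(0.10, 0.02, 1_000_000))      # (80000, 100000, 120000)
```

The middle number is the point estimate (100,000 documents); the outer two are the low and high ends of the range discussed above.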

The Impact of Prevalence on Random Sampling Calculations

The second calculator shown on my linked page allows you to add another dimension, another criterion, to your probability analysis, namely “prevalence.” This is especially important to understand in the field of legal search where low prevalence rates are common. In the binary example of relevance, the prevalence of the corpus is the percentage of relevant documents. The prevalence percentage has a direct numerical impact on the margin of error (“confidence interval”) applicable to the sample projections. Prevalence is also known as “richness,” as in target-richness, or “response distribution.” See eg. another sample size calculator by RAOsoft.com that includes these criteria and an explanation.

The first calculator shown on my website assumes what some call the “worst case scenario” for sample prediction where the prevalence is 50%. This is a perfectly even distribution, which requires the largest sample size to attain a desired confidence level. The top calculator conservatively assumes that half of the corpus will be in the target group, e.g., relevant. When the target rate or prevalence is 50/50, that requires the highest number of documents to be sampled for statistical validity, which is why it is called the “worst case scenario.” When the prevalence rate is higher or lower than 50%, the number of documents that must be sampled decreases.

Thus, if the prevalence rate is 95%, meaning in our example, 95% of the documents are relevant, or, conversely, if the richness is very low, and the prevalence rate is only 5%, again a smaller sample is required to attain the same confidence interval. Put another way, review of the same sample size creates a much lower confidence interval, and thus a much lower margin of error. This is very important to understanding the binary classifications of a large corpus of data where only a small amount of the data is responsive, i.e., relevant. (Another example of a binary classification could be privileged or not.)

Try out the second standard random sample calculator shown on my website to see this for yourself. In the first example shown, assuming a corpus of one million documents, with a confidence interval of 4, you see that a sample size of 600 documents is required. This is the largest possible sample size required for the 95% +/- 4. It assumes the worst case scenario of 50% prevalence (i.e. – half of the documents are relevant). Now change the prevalence percentage to 95% in the second calculator, using a sample size of 600, and a corpus of 1,000,000. The confidence interval is now 1.74%. You get the same result when you assume a prevalence rate of only 5%.
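If you would rather not rely on a web calculator, the same comparison can be sketched in a few lines of Python. This is only an illustration: the function name `margin_of_error` is hypothetical, Z is fixed at 1.96 for a 95% confidence level, and the corpus is treated as large enough that its size can be ignored.

```python
import math

def margin_of_error(sample_size, prevalence, z=1.96):
    """Margin of error (as a fraction) for a simple random sample.

    z=1.96 corresponds to a 95% confidence level; prevalence is a fraction.
    """
    return z * math.sqrt(prevalence * (1 - prevalence) / sample_size)

# Same 600-document sample, three different prevalence assumptions
for prevalence in (0.50, 0.95, 0.05):
    moe = margin_of_error(600, prevalence)
    print(f"prevalence {prevalence:.0%}: +/- {moe:.2%}")
# prevalence 50%: +/- 4.00%
# prevalence 95%: +/- 1.74%
# prevalence 5%: +/- 1.74%
```

The 95% and 5% cases give the same answer because the formula only depends on the product of the prevalence and one minus the prevalence.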

Again, see the Sample Size Calculator at RaoSoft.com for a calculator that allows you to plug-in different prevalence rates (called “response distribution” in that calculator) to determine sample sizes for certain intervals based on prevalence. Bottom line, when you have a corpus with a high or low prevalence, one that is either target rich, or target poor, a smaller sample size is required to attain an acceptable confidence interval. (Note, there are some exceptions where, for instance, there are extreme values (“outliers”) or where there are small corpus sizes.)

A good way to understand prevalence is by example. Start by assuming a 1,000,000 document corpus with a prevalence rate of 5% (one where 5% or less of the documents are relevant). You need only review 456 documents to know the total number of relevant documents with 95% certainty and an error rate of only 2%. Remember, if you had assumed that half of the documents were relevant, then you would have had to review 2,395 documents to attain the same confidence level and interval. See for yourself by trying this out in the standard calculators on my page and on RaoSoft’s.
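The two sample sizes in this example can be reproduced with a short Python sketch. One assumption is worth flagging loudly: the post does not say how the calculators arrive at 2,395 rather than the 2,401 the basic formula gives at 50% prevalence, and the sketch assumes they apply a standard finite population correction for the 1,000,000-document corpus. The function names are hypothetical.

```python
def base_sample_size(prevalence, margin, z=1.96):
    """Sample size for a very large corpus: Z^2 * p(1-p) / I^2."""
    return z**2 * prevalence * (1 - prevalence) / margin**2

def adjust_for_corpus(n0, corpus_size):
    """Finite population correction (assumed here, not stated in the post)."""
    return n0 / (1 + (n0 - 1) / corpus_size)

corpus = 1_000_000
for prevalence in (0.50, 0.05):
    n0 = base_sample_size(prevalence, 0.02)
    n = adjust_for_corpus(n0, corpus)
    # Rounded to the nearest whole document to match the figures in the post;
    # rounding up is the more conservative habit.
    print(f"prevalence {prevalence:.0%}: {round(n0):,} uncorrected, {round(n):,} corrected")
```

With a corpus this large the correction barely matters, which is why it changes 2,401 to 2,395 and leaves 456 where it was.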

This characteristic of random sampling must be understood for cost-effective quality control in a corpus with low prevalence. This is important because low prevalence is the norm in legal search, and not the so-called standard normal distribution used in other fields, where you assume the hard-search of separating out half of a 50/50 split.

Mathematical Formula for Random Sample Size Calculations

Here is one way of expressing the basic formula behind most standard random sample size calculators:

n = Z² x p(1-p) ÷ I²

Description of the symbols in the formula:

n = required sample size

Z = the standard score for the chosen confidence level (The value of Z in statistics is called the “Standard Score,” wherein a 90% confidence level=1.645, 95%=1.96, and 99%=2.576)

p = estimated prevalence of target data (richness)

I = confidence interval or margin of error

Putting the formula into words – the required sample size is equal to the Z value for the confidence level squared, times (the estimated prevalence times one minus the estimated prevalence), then divided by the square of the confidence interval.

Here is an example of the formula in action where we assume a 95% confidence level and confidence interval of 2%, and a prevalence of 4%:

n = Z² x p(1-p) ÷ I²
n = 1.96² x .04(1-.04) ÷ .02²
n = 3.8416 x .04(.96) ÷ .0004
n = 3.8416 x .0384 ÷ .0004
n = .14751744 ÷ .0004
n = 368.7936

The formula shows that with an estimated prevalence of 4% we need a sample size of 369 documents to attain a 95% confidence level with a margin of error of 2%.
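The Z values and the worked example above can be checked with Python's standard library. This is a minimal sketch, not anyone's official calculator: the function names `z_score` and `sample_size` are hypothetical, and the Z value is looked up from the normal distribution rather than typed in, so results can differ from the hand calculation in the last decimal places.

```python
from statistics import NormalDist

def z_score(confidence_level):
    """Two-sided standard score for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence_level / 2)

def sample_size(confidence_level, prevalence, margin):
    """n = Z^2 * p(1-p) / I^2, with Z derived from the confidence level."""
    z = z_score(confidence_level)
    return z**2 * prevalence * (1 - prevalence) / margin**2

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence level: Z = {z_score(level):.3f}")
# 90% confidence level: Z = 1.645
# 95% confidence level: Z = 1.960
# 99% confidence level: Z = 2.576

# 95% confidence level, 4% prevalence, 2% confidence interval
print(round(sample_size(0.95, 0.04, 0.02)))  # 369
```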

It is important to understand that this sample size formula is derived from the formula for calculating confidence intervals (I):

I = Z x √(p(1-p) ÷ n)

If you take the “n” value as unknown (the number to be sampled for a specified confidence interval), and assign a value to the confidence level of say, 95%, wherein the value for “Z” is thus 1.96, and you move the “n” to the left side of the equation, the formula now looks like this:

n = (1.96 ÷ I)² x p(1-p)

Mathematically this is the same thing as our original formula:

n = Z² x p(1-p) ÷ I²

We can easily prove the formulas are identical by example where we again assume a 95% +/- 2%, and a prevalence of 4%:

I = Z x √(p(1-p) ÷ n)
.02 = 1.96 x √(.04(1-.04) ÷ n)
n = (1.96/.02)² x .04(.96)
n = (98)² x .0384
n = 9604 x .0384
n = 368.7936

Here is another example using the formula I prefer, and following our first assumptions where the estimated prevalence rate is 5% relevant documents, and a 95% confidence level is desired with a confidence interval of 2%. The following relatively simple mathematical calculation provides the required sample size:

n = 1.96² x .05(1-.05) ÷ .02²
n = 3.8416 x .05(.95) ÷ .0004
n = 3.8416 x .0475 ÷ .0004
n = .182476 ÷ .0004
n = 456.19

Now if you change the prevalence rate from 5% to 50%, the formula increases the required sample size for a 95% confidence with plus or minus 2% as follows:

n = 1.96² x .5(1-.5) ÷ .02²
n = 3.8416 x .5(.5) ÷ .0004
n = 3.8416 x .25 ÷ .0004
n = .9604 ÷ .0004
n = 2401
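To see how the required sample size moves with prevalence, the same formula can be run across a range of values. This is a small Python sketch under the same assumptions as the calculations above (Z of 1.96 for a 95% confidence level, a 2% confidence interval, no adjustment for corpus size); the function name is hypothetical.

```python
def sample_size(prevalence, margin=0.02, z=1.96):
    """n = Z^2 * p(1-p) / I^2 for a 95% confidence level and 2% interval by default."""
    return z**2 * prevalence * (1 - prevalence) / margin**2

# Sweep prevalence from very low to very high
for pct in (1, 5, 10, 25, 50, 75, 90, 95, 99):
    n = sample_size(pct / 100)
    # Rounded to the nearest whole document, as in the hand calculations
    print(f"prevalence {pct:>2}%: n = {round(n):>5,}")
```

The output peaks at 50% prevalence (2,401 documents) and falls away symmetrically on both sides, which is the "worst case scenario" in numeric form.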

Do the math above. Really, it is not that hard. It is all just multiplication and division. It shows that with the lower prevalence rates commonly found in legal search you can make accurate predictions using lower sample sizes. Further, if you do determine sample size based on an assumed 50% prevalence rate, whereas in fact you have a much lower rate, you are actually lowering your confidence interval, your margin of error.

Thus, if you use a standard calculator that by default has a worst-case 50% distribution or prevalence rate built-in, and review 2,401 documents, which you thought was the sample size necessary to attain a confidence interval of 2%, and you in fact were dealing with a document corpus that only had a 5% prevalence rate, having 95% irrelevant documents, then in fact your calculations will have a confidence interval (error rate) of only .87%, and not the 2% interval you thought. That is a good thing.

Again, don’t believe me. Do the math. Use the Interval formula that the sample size formula is based upon. (You may also need a calculator that does square root.)

I = Z x √(p(1-p) ÷ n)
I = 1.96 x √(.05(1-.05) ÷ 2401)
I = 1.96 x √(.05(.95) ÷ 2401)
I = 1.96 x √(.0475 ÷ 2401)
I = 1.96 x √.00001978342357
I = 1.96 x .004447856064443
I = .00871779788631

You can also use the second standard calculator on my page. Just plug in a 95% confidence level, a sample size of 2401, a population of 1,000,000, and a prevalence percentage of 5. It should calculate a confidence interval of 0.87. You can also double-check by using the RAOsoft calculator.
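Another way to double-check, for those who trust experiments more than formulas, is a simulation. The Python sketch below is an illustration only: it draws many random samples of 2,401 documents from a hypothetical corpus where each document has a 5% chance of being relevant (a simplifying assumption that treats the corpus as effectively unlimited), and counts how often the sample percentage lands within the calculated interval. The function name, trial count and seed are arbitrary choices.

```python
import math
import random

def simulate_coverage(prevalence, sample_size, trials=1000, z=1.96, seed=2012):
    """Return the margin of error and the share of simulated samples inside it."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    margin = z * math.sqrt(prevalence * (1 - prevalence) / sample_size)
    hits = 0
    for _ in range(trials):
        # Each sampled document is relevant with probability `prevalence`
        relevant = sum(rng.random() < prevalence for _ in range(sample_size))
        if abs(relevant / sample_size - prevalence) <= margin:
            hits += 1
    return margin, hits / trials

margin, share = simulate_coverage(0.05, 2401)
print(f"margin of error: +/- {margin:.2%}")
print(f"share of simulated samples inside the margin: {share:.1%}")
```

The share should come out in the neighborhood of 95%, not exactly 95%, because document counts are whole numbers and the simulation itself is subject to chance.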

Additional Math Disclaimer

I have a disclaimer on all of my blog postings. See the top title and the first link on the right hand column: DISCLAIMER. On this particular post I thought it would be a good idea to add yet another level of disclaimer. Although math is math, and these are well accepted formulas and principles, these are still just my personal applications and synthesis of information and rules applicable in the field of statistics and legal search. I reserve the right to go back and make revisions to this post as my understanding deepens and improves. I am an attorney, not an information scientist or statistician. These views should not be relied upon, nor accepted as anyone’s opinion other than my own. You should, of course, always do your own due diligence, study and analysis. Like I said, do the math.

As always, if you disagree with the analysis here, or detect any math errors, please let me know. I welcome a free exchange of ideas and information. You can either email me privately, or write a public comment. That is how my blog works. I put my ideas out there for peer-review, and I make corrections as I go along, and before the blogs are ultimately transformed into a book. I appreciate all of the help my learned readers have provided to me over the years since I first began this open writing experiment in 2006. The odds are, your comments will help make my next book even better.

Conclusion

This blog has discussed thirteen different scenarios showing probabilistic analysis:

  1. I began with analysis of e-discovery expert bubble people wherein I estimate, based on anecdotal evidence, that 80% already use random sampling in some manner. I have only a 90% confidence level in that, with a confidence interval of 10%, so it could actually range from 70% to 90%, and maybe a lot more or less.
  2. Then I moved on to analysis of all lawyers in the world. I estimated that a majority (51% or more) do not use random sampling at all. I put a 99.9% confidence level on that opinion and invited the Rand Corporation to try to prove me wrong.
  3. Then I turned my half-witty attention to all lawyers in the U.S. and opined that less than 2% use random sampling. I put a 95% confidence level on that one.
  4. Then I made my prediction that in ten years the number of lawyers in the U.S. using random sampling will increase tenfold from 2% to 20%. I am 95% confident on that projection, but I put a margin of error on it of plus or minus 2%. Based on the ABA’s estimate of the number of lawyers in America, I projected that from between 270,000 to 330,000 lawyers will be using random sampling by 2022. Rand Corp., make a note and do a follow-up survey in 2022, would you please?
  5. I next estimated that my blog readers get 90% of the humor in this blog (or better said, attempts at same), with a confidence interval of 2%, meaning between 88% and 92%.
  6. Serious sampling examples then began where I assumed a 95% confidence level, and 4% confidence interval. A review of a sample of 600 documents found that 60 were relevant (10%). Based on the sample we can project that 100,000 of the documents in the million document corpus would be relevant, with a range of between 6% and 14%, which means between 60,000 and 140,000 documents.
  7. Another variation of the last example was then considered where a confidence interval of 2% was used, instead of 4%. This required a sample size of 2,395 documents, where 10% were again found to be relevant (240). Since a 2% interval was used, the range of relevant documents projected was narrower, from between 80,000 and 120,000.
  8. Next, I added consideration of prevalence into the sample size formulas and started with an example of a 95% confidence level, and either 5% or 95% prevalence ratio (same either way). With a review of a random sample of 600 documents, and either a 5% or 95% prevalence, I showed that the confidence interval improved from 4% to 1.74%. This is an important point.
  9. Then I considered a 5% prevalence, where I showed that a sample of only 456 documents provides a 95% certainty and an error rate of 2%. This compared to the need to sample 2,395 documents for a 2% confidence interval if you assume 50% prevalence. Another important point.
  10. Then I showed the actual mathematical calculations explaining the formulas and used an example of a 95% confidence level, a 2% confidence interval, and a prevalence of 4%. You remember, it went like this and showed you only had to sample 369 documents:
    n = Z² x p(1-p) ÷ I²
    n = 1.96² x .04(1-.04) ÷ .02²
    n = 3.8416 x .04(.96) ÷ .0004
    n = 3.8416 x .0384 ÷ .0004
    n = .14751744 ÷ .0004
    n = 368.7936
  11. The next formula I ran again assumed a 95% confidence level and 2% interval, but this time changed the prevalence to 5%. The formula showed a required sample size of 456 documents.
  12. Then I ran the math on 95% +/- 2, but this time assuming a 50% prevalence. The formula showed a required sample size of 2,401 documents.
  13. Then I ended with another twist where the sample size of 2,401 documents is used, but this time a 5% prevalence is assumed. The interval calculation formula showed that a .87% confidence interval results. That was shown in the only formula where you had to do a square root calculation:
    I = Z x √(p(1-p) ÷ n)
    I = 1.96 x √(.05(1-.05) ÷ 2401)
    I = 1.96 x √(.05(.95) ÷ 2401)
    I = 1.96 x √(.0475 ÷ 2401)
    I = 1.96 x √.00001978342357
    I = 1.96 x .004447856064443
    I = .00871779788631

I pointed out that you could skip the math entirely if you wanted, and attain the same results by using the random sample size calculators on my page, or on the RAOsoft calculator, or any other of a number of calculators freely available on the web. Depending on what software you are using for review, you might also have this ability built-in. You can also skip formulas and calculators altogether and rely upon charts that list common values. These charts typically assume a prevalence of 50%. See eg Sample Size Table from Research Advisors. It can be helpful anyway to look at these charts to get a feel for how the numbers relate. For instance, look at this table from the University of Florida, Professor Glenn D. Israel:

Table 1. Sample size for ±3%, ±5%, ±7% and ±10% Precision Levels Where Confidence Level is 95% and P=.5.
Size of Sample Size (n) for Precision (e) of:
Population ±3% ±5% ±7% ±10%
500 a 222 145 83
600 a 240 152 86
700 a 255 158 88
800 a 267 163 89
900 a 277 166 90
1,000 a 286 169 91
2,000 714 333 185 95
3,000 811 353 191 97
4,000 870 364 194 98
5,000 909 370 196 98
6,000 938 375 197 98
7,000 959 378 198 99
8,000 976 381 199 99
9,000 989 383 200 99
10,000 1,000 385 200 99
15,000 1,034 390 201 99
20,000 1,053 392 204 100
25,000 1,064 394 204 100
50,000 1,087 397 204 100
100,000 1,099 398 204 100
>100,000 1,111 400 204 100
a = Assumption of normal population is poor (Yamane, 1967). The entire population should be sampled.
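
Unlike the formulas worked through above, this table adjusts for the size of the population. Spot-checking a few cells suggests the entries are consistent with Yamane's simplified formula, n = N ÷ (1 + N x e²). That is an inference from the numbers, not something the table itself states, and a few cells (for example ±7% at 20,000) differ from the formula by a document or two. A short sketch, with a made-up function name:

```python
def yamane_sample_size(population, precision):
    """Yamane's simplified formula: n = N / (1 + N * e^2).

    Assumes a 95% confidence level and P = .5, like the table.
    """
    return population / (1 + population * precision ** 2)


if __name__ == "__main__":
    # A handful of cells to compare against the table
    for population, precision in [(2000, 0.03), (5000, 0.05),
                                  (10000, 0.05), (1000, 0.10)]:
        n = round(yamane_sample_size(population, precision))
        print(f"N = {population:,}, e = ±{precision:.0%}: n = {n}")
```

As the population grows, the result approaches 1 ÷ e², which is why the bottom row flattens out at 1,111, 400, 204 and 100.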

Even though calculators and charts make sample size determination easy, it is good to know how to do the math yourself. That provides a solid understanding of what the calculators and charts are doing and why. Also see the work of the EDRM on the subject: Statistical Sampling Applied to Electronic Discovery; and, Appendix 2: Application of Sampling to E-Discovery Search Result Evaluation.

The math we examined shows the importance of prevalence to random sample size calculations and confidence interval calculations. This has been overlooked, or at least underestimated, by many in the field of e-discovery. This error often leads to over-sampling and review of more documents than required to obtain reasonable confidence levels and intervals. The routine assumption of a worst-case-scenario of 50% prevalence leads to overkill and unnecessarily large samples for many (but not all) uses of random sampling, including many quality control calculations. We need to start adding prevalence into our equations, and start being more efficient in our quality control metrics.
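
To put a number on the overkill, here is a small sketch that compares the sample required at a given prevalence with the sample required under the 50% worst-case assumption. The function names are hypothetical. Because Z and the interval cancel out of the ratio, the comparison reduces to 4p(1-p) whatever confidence level and interval you choose.

```python
def required_sample(prevalence, interval=0.02, z=1.96):
    """n = Z^2 x p(1-p) / I^2, the same formula used in the worked examples."""
    return z ** 2 * prevalence * (1 - prevalence) / interval ** 2


def fraction_of_worst_case(prevalence):
    """Sample size at this prevalence as a fraction of the 50% worst case.

    Z and I cancel, leaving p(1-p) / (.5 x .5) = 4p(1-p).
    """
    return 4 * prevalence * (1 - prevalence)


if __name__ == "__main__":
    for p in (0.01, 0.04, 0.05, 0.10, 0.25, 0.50):
        n = round(required_sample(p))
        share = fraction_of_worst_case(p)
        print(f"prevalence {p:.0%}: sample {n}, {share:.0%} of worst case")
```

At a 4% prevalence, for example, the required sample is only about 15% of the worst-case figure (369 versus 2,401 documents at 95% +/- 2%).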

I look forward to your public and private comments. Hopefully I have caught all of the minor number and math mistakes (I have already spotted and corrected quite a few), but it is late, and I may well have missed some. Please let me know if you see any more errors.


Second Ever Order Entered Approving Predictive Coding

April 24, 2012

An order approving predictive coding was entered on April 23, 2012 in Global Aerospace, Inc. v. Landow Aviation, L.P., et al. This is a complex dispute in a Virginia State Court. The defendants’ motion seeking the order was granted. It is not pretty, and not detailed, but appears to be the second such order in the history of Man. I can’t discuss the first case, but I can and will keep posting the next cases as they come rolling out. I predict there will be many this year. Send them to me, anonymously if you like, as in this case, and I will post them here, in full. I understand that in this second case the vendor was OrcaTec. Here it is (by the way, a short order like this, with handwriting, etc., is not uncommon in state court).

You can also download the order in PDF form here.

A Press Release by the vendor involved, OrcaTec, came out the day after I first published this news. It provides further background and some interesting quotes. Here are selected excerpts:

The consolidated case stems from a collapse of a commercial structure, which damaged hundreds of millions of dollars in personal property. The defendants are represented by Schnader Harrison Segal and Lewis LLP of Pittsburgh and Baxter, Baker, Sidle, Conn & Jones, PA of Baltimore. Schnader’s e-Discovery Practice Group, led by Thomas C. Gricks III,  initially directed the collection and preservation of the ESI.  When agreement on production methodology could not be reached, Schnader filed a motion for a protective order to allow the firm to use predictive coding to cull the collection.  …

The order was issued after a hearing on the defendants’ motion on Monday. The plaintiffs had argued against predictive coding, saying that it was not as effective as human review. Gricks presented the arguments for predictive coding to the court, noting that Schnader has been successful in using predictive coding to save time and money on first-pass review, which in this case will be significant.  He was backed up by experts Karl Schieneman of Review Less, Timothy Opsitnick of JurInnov, and Dr. Herbert L. Roitblat of OrcaTec.

 “The critical point of the order is that the Court allowed a party to choose predictive coding as its preferred method of responding to a request for production of ESI.  His decision was an express recognition of the evolution of document review to deal with ever-increasing volumes of data,” said Gricks.

“We were very pleased to be able to show the scientific accuracy of predictive coding to a court in a formal hearing setting,” said Dr. Roitblat.  “Keyword searching seems to be perfectly acceptable to attorneys, even though several studies have focused on its inaccuracy. If keyword searching with 20 percent proven accuracy is okay, how can predictive coding with more than 90 percent demonstrable accuracy be unacceptable? I see this as the first step in that mental barrier coming down for lawyers.” …

While the issued order was quite short, the judge said in the hearing that a producing party gets to use whatever method it wants to use to review documents. The receiving party can then raise issues if it doesn’t get what it thinks it should have in litigation.  Opsitnick said, “The Judge analogized using predictive coding to a choice between using paralegals or senior partners or younger associates to review documents, which we think is correct. Unfortunately, none of the Court’s helpful explanation made it to the Order this time, but this is the first break in the predictive coding logjam.”

“OrcaTec has shown over and over how much time, money and effort predictive coding saves, plus how great the accuracy and transparency is in using it,” said Roitblat.  “We are very grateful that the court recognized the value of predictive coding.”

Schieneman added, “This ruling should give attorneys a real green light for moving ahead with this truly effective technology.”


“Where The Money Goes” – a Report by the Rand Corporation

April 22, 2012

The Rand Corporation is a well-known and prestigious non-profit institution. Its stated charitable purpose is to improve policy and decision-making through research and analysis. It has recently turned its attention to electronic discovery. Rand concluded, as have I, and many others, that the primary problem in e-discovery is the high cost of document review. They found it constitutes 73% of the total cost of e-discovery. For that reason, Rand focused its first report on electronic discovery on this topic, with side comments on the issue of preservation. The study was written by Nicholas M. Pace and Laura Zakaras and is entitled Where The Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery. It can be downloaded for free, both a summary and the full report (131 pages). A nicely bound paper version can be purchased for a modest fee of $20.

The full report is actually much better than the summary, in no small part because it shows the degree of care they used, and the honest disclaimers they make concerning the research. The disclaimers are needed because the study was only based on input from eight corporations. Still, it is a well written report with excellent analysis. I suggest you make time to read the full report.

The Rand Corporation Confirms Our Own Analysis and Makes The Same Bold Recommendations

The report not only analyzes the problem, it recommends a solution. Basically it says what I have been saying for years now: be bold, and take forward-thinking action now to fight the high-cost problem head on. See Impactful, Fast, Bold, Open, Values: Guidance of the “Hacker Way.” As I have said before, the words lawyer and timid are not supposed to go together. Yet that is what we have here when it comes to the Bar’s use of advanced technologies, even when it is in the clients’ best interests. The Rand report recognizes the widespread timidity of many in the legal community, and makes the following recommendation at page 83, one that I strongly endorse:

To truly open the doors to more-efficient ways of conducting large-scale reviews in the face of ever-increasing volumes of digital information, litigants that have complained in the past about the high costs of e-discovery will have to take some very bold steps.

What action does the Rand study recommend as the core solution to the high costs of review? Again, it is the same mantra that almost everyone in the field of e-discovery has been saying: fight the problems caused by technology (i.e., too much information) by the intelligent use of even more technology. By intelligent we mean using the technology as part of a valid legal methodology, one based on the law. Do not just use technology on its own, for its own sake. The technology has to be run by lawyers, not techs. Sorry, my tech friends, lawyers have to drive the CAR, computer assisted review.

The legal method I promote for CAR is called Bottom Line Driven Proportional Review. It is based on the well established legal doctrine of proportionality. See, e.g., Good, Better, Best: a Tale of Three Proportionality Cases, Part One and Part Two. Of course, my way is not the only way for the CAR highway. There are many other valid legal methods to use advanced technologies. There are many other reasonable applications in use by other respected attorneys in the field. The focus on budgeting, estimation, transparency, cooperation, and proportionality is just my particular method, one that I encourage others to follow.

The Rand Report does more than just recommend the use of advanced technology, it actually endorses one particular type of technology, my friend Predictive Coding. That’s right, this prestigious, non-profit, independent group has reached the same conclusions that I have, and many, many others have (in fact, you would be hard pressed to find any bona fide expert to argue against the idea of predictive coding). It is now official. Predictive coding is the best answer we have to the problem of the high costs of e-discovery. Of course, there will be good faith debates for years to come on the best methods to use this new technology, and in what cases it is appropriate. The Rand report discusses all of these considerations.

The conclusion of the report states at pages 97-99:

The most promising alternative available today for large-scale reviews is the use of predictive coding and other computerized categorization strategies that can rank electronic documents by the likelihood that they are relevant, responsive, or privileged. Eyes-on review is still required but only for a much smaller set of documents determined to be the most-likely candidates for production. Empirical research suggests that predictive coding is at least as accurate as humans in traditional large-scale review. Moreover, there is evidence that the number of hours of attorney time that would be required in a large-scale review could be reduced by as much as three-fourths, depending on the nature of the documents and other factors, which would make predictive coding one answer to the critical need of significantly reducing review costs. …

Despite the apparent promise of predictive coding and other computerized categorization techniques, however, the legal world has been reluctant to embrace the new technology. … the key reason is the absence of widespread judicial approval of the methodology, specifically regarding any acknowledgment of the adequacy of the results in actual cases or whether the process was a reasonable way to prevent inadvertent privilege waiver. Without clear signs from the bench that the use of computer-categorized review tools should be considered in the same light as eyes-on review or keyword searching, litigants involved in large-scale reviews are unlikely to employ the technologies on a routine basis. …

The use of computerized categorization techniques, such as predictive coding, will likely become the norm for large-scale reviews in the future, given the likelihood of increasing societal acceptance of artificial intelligence technologies that might have seemed like improbable science fiction only a few decades ago. The problem is that considerable sums of money are being spent unnecessarily today while attitudes slowly change over time. New court rules might move the process forward, but the best catalyst for more-widespread use of predictive coding would be well-publicized instances of successful implementation in cases in which the process has received close judicial scrutiny. It will be up to forward-thinking litigants to make that happen.

Again, I join the call to all forward-thinking litigants to, in the words of Star Trek, boldly go where no man has gone before. See, e.g., Predictive Coding Based Legal Methods for Search and Review; and, New Methods for Legal Search and Review. I am reminded once again of the words of a famous Indian lawyer turned saint: Be the change that you wish to see in the world. Mahatma Gandhi.

By the way, even though this report basically affirms my own analysis and blogs, I had absolutely no involvement in the research or preparation of this report. I am not sure I have even met Nicholas Pace and Laura Zakaras. But I note that two of the top experts in our field did help out the Rand newcomers, namely Thomas Y. Allman and Jason R. Baron. I am of course influenced by their many excellent writings, just as I will henceforth be influenced by the Rand report of Pace and Zakaras. That is how knowledge always advances in every field of law, technology, and science. As my readers well know, my opinions are an amalgamation of the thinking of all of the leaders in the field. Only a few of my thoughts are truly original. If I occasionally appear to be smart and far-seeing, it is only because I am standing on the shoulders of giants. It has always been so.

Rand Describes Predictive Coding

Pace and Zakaras not only recommend predictive coding, they venture deeply into the who, what, when, where and why of the new technology. For instance, they do a nice job of describing how predictive coding works at page 59 of the report:

Predictive coding, sometimes referred to as suggestive coding, is a process by which the computer does the heavy lifting in deciding whether documents are relevant, responsive, or privileged. This process is not to be confused with keyword-based Boolean searches or the similarity detection technologies described in Chapter Four. Near-duplication techniques, clustering, and email threading can help provide organizational structure to the corpus of documents requiring review but do not reduce the document set that has to be reviewed by attorneys for specific aspects, such as responsiveness or privilege. Predictive coding, on the other hand, takes the very substantial next step of automatically assigning a rating (or proximity score) to each document to reflect how close it is to the concepts and terms found in examples of documents attorneys have already determined to be relevant, responsive, or privileged. This assignment becomes increasingly accurate as the software continues to learn from human reviewers about what is, and what is not, of interest. This score and the self-learning function are the two key characteristics that set predictive coding apart from less robust analytical techniques.
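
The two characteristics named in that passage, a score for each document and learning from reviewer decisions, can be illustrated with a toy sketch. It uses a deliberately simple word-weighting scheme (smoothed log-odds, in the spirit of naive Bayes) chosen for brevity. It is an assumption for illustration only, not the method of the Rand report or of any vendor, and the seed documents and function names are made up.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled):
    """labeled: list of (text, is_relevant) pairs coded by a human reviewer.

    Returns a weight per word: positive if the word leans relevant,
    negative if it leans irrelevant (add-one smoothed log-odds).
    """
    relevant, irrelevant = Counter(), Counter()
    for text, is_relevant in labeled:
        (relevant if is_relevant else irrelevant).update(tokenize(text))
    vocab = set(relevant) | set(irrelevant)
    rel_total = sum(relevant.values()) + len(vocab)
    irr_total = sum(irrelevant.values()) + len(vocab)
    return {
        word: math.log((relevant[word] + 1) / rel_total)
        - math.log((irrelevant[word] + 1) / irr_total)
        for word in vocab
    }


def score(weights, text):
    """Higher score = closer to what reviewers have marked relevant."""
    return sum(weights.get(word, 0.0) for word in tokenize(text))


def rank(weights, documents):
    """Order unreviewed documents from most to least likely relevant."""
    return sorted(documents, key=lambda doc: score(weights, doc), reverse=True)


# Hypothetical seed set coded by the attorneys closest to the case
seed = [
    ("roof collapse inspection report", True),
    ("snow load on hangar roof", True),
    ("lunch menu for friday", False),
    ("fantasy football picks", False),
]
unreviewed = ["friday football lunch", "hangar roof inspection", "roof menu"]

weights = train(seed)
ranked = rank(weights, unreviewed)

# The "self-learning" step: a reviewer codes one more document and we retrain
retrained = train(seed + [("roof menu", False)])
```

After the extra coding decision the score of "roof menu" drops, which is the feedback loop the passage describes, only on a tiny scale.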

They go on to point out at page 61 what they call an ironic feature of predictive coding, which, by the way, I now sometimes also like to call Intelligent Review or Probabilistic Review:

As should be clear from this description, predictive coding does not take humans out of the review loop. It requires intensive attorney support throughout the process in order to advance machine learning. Ironically, for a technique that could substantially reduce discovery expenses, the best results will be achieved if the attorneys most closely involved in the case select the seed documents and review sampled extracts, effectively precluding the use of lower cost contract attorneys or LPO vendors for these particular tasks. Moreover, attorney judgment continues to loom large in the process after the application has completed its work, with eyes on review required, for example, to check documents of unknown relevance and responsiveness or look for privileged communications.

Advanced technologies like predictive coding do not replace lawyers. Instead they require better educated lawyers. Still, the days of vast armies of minimum skilled contract lawyers are numbered. Fewer lawyers will be needed for intelligent review, but they will have to be better trained about the case and the technology. They will need to be SMEs – subject matter experts, and technophiles. I know that most contract lawyers will be quite happy about this change, as they have only been willing to suffer through the drudgery of never-ending email reading because of the economy. I predict that many of these lawyers will rise to the occasion and become the best SMEs of the future.

Rand Dares to Mention the Elephant in the Room

The Rand report discusses many resistance factors against the widespread adoption of predictive coding technologies. They even touch on the one that most analysts dare not mention. They raise the issue of the vested financial interests of certain companies and law firms to continue expensive, over-review of documents. Here is how Pace and Zakaras describe it at page 76:

Resistance of External Counsel

Another barrier to the widespread use of predictive coding could well be resistance to the idea of outside counsel motivated not so much by accuracy issues as by the potential loss of a historical revenue stream. Some interviewees reported grumblings from outside counsel when their companies decided to directly handle a fraction of the overall review process or to markedly reduce what was shipped out for review through the use of additional data processing.

My applause to the Rand Corporation for this bold statement of the obvious. I hope they have been warned, as I was when I stood next to the elephant in the picture, not to touch him. If he steps on your toes, your whole foot will be crushed.

Vendor Cost

I always include in my essays on predictive coding a call for vendors to bring down the prices of these advanced software features. The high prices are a serious impediment to adoption by even brave attorneys and forward-thinking litigants. The prices of most vendors today usually restrain the use of predictive coding to big cases. The Rand report once again validates my complaints at page 98:

Moreover, computer applications for conducting review are unlikely to be economically viable options when dealing with smaller document sets, in which any savings in attorney hours might be overwhelmed by vendor costs and machine-training requirements. Existing approaches, such as deduplication, cluster analysis, and email threading, may provide a more practical answer in these situations.

By the way, predictive coding is not a replacement of all other search methods, it is a supplement. It is the current crown jewel of search, to be sure, but it is still just one of many methods. It is one tool in an arsenal of weapons. That is why I call my search method multimodal. It features predictive coding, but includes other types of review too, including keyword search and human eyes-on review. See Predictive Coding Based Legal Methods for Search and Review.

As the Rand report indicates, cases with smaller document sets are not yet economically viable for predictive coding. But, when vendors do finally heed my call and lower prices, predictive coding will be economically viable for many more cases. Then the full arsenal of truth-seeking missiles can be used in even medium-sized cases.

Preservation Woes

The Rand report also looks into corporate complaints of the high cost of preservation. This topic is something of an add-on to the primary topic of review, but it is still well worth reading. Preservation expenses are, after all, present in every case, which is not necessarily true with expensive review costs. The survey showed that preservation has become a significant financial burden for many companies, with many explanations on why, but nobody seemed to have good metrics on the burdens. Rand recommends that corporations begin to systematically track costs in this area. Uncertainty and conflicts in the law of preservation were also discussed, but no recommendations were made. For a new case finding gross negligence in preservation, but only awarding monetary sanctions, see Telecom, Inc. v. Global Crossing Bandwidth, Inc., No. 05-CV-6734T (W.D.N.Y. Mar. 22, 2012). Compare with Aviva USA Corp. v. Vazirani, No. 11-0369 (D. Ariz. Jan. 10, 2012) where monetary sanctions and an adverse inference were granted. Compare both with Spanish Peaks Lodge, LLC v. Keybank National Assoc., No. 10-453 (W.D. Penn. Mar. 15, 2012) where no sanctions were granted. Compare all of these with United Factory Furniture Corp. v. Alterwitz, No. 2:12-cv-00059-KJD-VCF, 2012 WL 1155741 (D. Nev. Apr. 6, 2012) where mirror imaging was ordered for preservation.

Conclusion

Where The Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery is a must read that is within everyone’s budget. It can be downloaded for free, both a summary and the full report (131 pages), but I recommend you read the full report. Although I disagree with a few points in the report, they are not worth examination. For the most part they got it right. It will be interesting to see what companies, if any, heed their call for forward-thinking litigants to take bold steps to use predictive coding. Regardless, kudos to the Rand Corporation, the RAND Institute for Civil Justice and the authors of the report, Nicholas M. Pace and Laura Zakaras, for a job well done.


Good, Better, Best: a Tale of Three Proportionality Cases – Part Two

April 15, 2012

Continuation of Part One of Good, Better, Best: a Tale of Three Proportionality Cases.

The Best Case: DCG Systems

Compared to I-Med Pharma and U.S. ex rel McBride, DCG Systems is the best of the lot. DCG Sys., Inc. v. Checkpoint Techs, LLC, 2011 WL 5244356 (N.D. Cal. Nov. 2, 2011). It is better than the rest because of timing. The issue of proportionality of discovery was raised in DCG Systems at the beginning of the case. It was raised at the 26(f) conference and 16(b) hearing as part of discovery plan discussions. That is what the rules intended. Proportionality protection requires prompt, diligent action.

In I-Med Pharma the party responding to discovery waited to take action until after a stipulation and order to review 64,382,929 hits covering 95 million pages. In U.S. ex rel McBride the party responding to discovery waited until after the email of 230 custodians had been produced, and, in the words of Judge Facciola, a king’s ransom had already been paid.

The lesson is clear. Be a good little lawyer hacker. Be fast, be bold, and be open to impact discovery in a proportional way.  Impactful, Fast, Bold, Open, Values: Guidance of the “Hacker Way.”  Timing is everything, in law and in life. Are we not all trapped in an hour-glass? There is no getting out!

Timing and Rule 26(g)

A key lesson of these three cases is that timing is everything. Consider proportionality from the get go, and remember that it is not only based on the protective order rule, 26(b)(2)(C), it is based on the rule governing a requesting party’s signing a discovery request. I am talking about the Rule 11 of discovery, Rule 26(g)(1)(B)(iii):

(g) Signing Disclosures and Discovery Requests, Responses, and Objections.

(1) Signature Required; Effect of Signature. Every disclosure under Rule 26(a)(1) or (a)(3) and every discovery request, response, or objection must be signed by at least one attorney of record in the attorney’s own name—or by the party personally, if unrepresented—and must state the signer’s address, e-mail address, and telephone number. By signing, an attorney or party certifies that to the best of the person’s knowledge, information, and belief formed after a reasonable inquiry: …

(B) with respect to a discovery request, response, or objection, it is: …

(iii) neither unreasonable nor unduly burdensome or expensive, considering the needs of the case, prior discovery in the case, the amount in controversy, and the importance of the issues at stake in the action.

Judge Paul Grimm calls Rule 26(g) the most overlooked and misunderstood of all of the rules of civil procedure. That is the fault of us lawyers, and, it is also the fault of our judges. Rule 26(g) in subsection (3) requires a court, “on its own,” to sanction anyone who signs a discovery request in violation of the rule. This means that judges must impose sanctions, on their own initiative, whenever they see a disproportionate discovery request. There is no discretion given to judges about this. The rule does not say “may” impose sanctions. It says the court “on motion or on its own, must impose an appropriate sanction on the signer.” Yet, in my thirty-two years of legal practice in federal court, I have never once seen this done by a judge. Have you?

Here is the language of subsection (3) of Rule 26(g). It should be in all of your discovery briefs.

(3) Sanction for Improper Certification. If a certification violates this rule without substantial justification, the court, on motion or on its own, must impose an appropriate sanction on the signer, the party on whose behalf the signer was acting, or both. The sanction may include an order to pay the reasonable expenses, including attorney’s fees, caused by the violation.

Lawyers need to start including this rule in their initial analysis of any discovery request. If one side refuses to engage in cooperative discussions to narrow discovery requests, if, for instance, they refuse to limit discovery to the actual factual issues in the case, then Rule 26(g) must be squarely brought to the attention of the supervising judge. There is no time to wait. We are all trapped in an hour-glass, and a billable one at that!

As Judge Waxse has pointed out, there is a clear path in the rules to deal with non-cooperators, and Rule 26(g) is one of the road signs on that path. See: Judge David Waxse on Cooperation and Lawyers Who Act Like Spoiled Children. But you have to time your motions. You have to seek protection before you pay the piper, but after you make a good faith effort to cooperate. Timing is everything.

Model Patent Order

The Patent Bar is trying an experiment to control runaway e-discovery costs in patent litigation. They have a committee composed of a handful of patent lawyers and a few key judges who are well-known in patent law: Chief Judge James Ware (ND Cal), Judge Virginia Kendall (ND Ill), Magistrate Judge Chad Everingham (ED Tex), and Chief Judge Randall Rader (Fed. Cir.). They have come up with what they call a Model Order Limiting E-Discovery in Patent Cases. They explain that the Model Order:

… is intended to be a helpful starting point for district courts to use in requiring the responsible, targeted use of e-discovery in patent cases. The goal of this Model Order is to promote economic and judicial efficiency by streamlining ediscovery, particularly email production, and requiring litigants to focus on the proper purpose of discovery—the gathering of material information—rather than permitting unlimited fishing expeditions. It is further intended to encourage discussion and public commentary by judges, litigants, and other interested parties regarding e-discovery problems and potential solutions.

The Model Order is inspired by Rule 30, which presumptively limits cases to ten depositions and seven hours per deposition. The Committee notes that email is the biggest time-waster in patent litigation (well, except for Qualcomm of course), and so it uses this same limiting approach for email discovery. It limits initial e-discovery to email from five custodians and five keywords per custodian. The Committee is careful to note that “the parties may jointly agree to modify these limits or request court modification for good cause.” Even if they do not agree, or there is no order permitting more email discovery, a requesting party is still entitled to more if they pay for it. This is their approach to proportionality:

This is not to say a discovering party should be precluded from obtaining more e-discovery than agreed upon by the parties or allowed by the court. Rather, the discovering party shall bear all reasonable costs of discovery that exceeds these limits. This will help ensure that discovery requests are being made with a true eye on the balance between the value of the discovery and its cost.

The Model Order also addresses concerns regarding waiver of attorney-client privilege and work product protection in order to minimize human pre-production review. It does so by including Rule 502(d) non-waiver language into the standard order. The Order itself is pretty short and simple, which is one of its virtues, so I reproduce it here in its entirety:


Plaintiff,
v.
Defendant.

[MODEL] ORDER REGARDING E-DISCOVERY IN PATENT CASES

The Court ORDERS as follows:

1. This Order supplements all other discovery rules and orders. It streamlines Electronically Stored Information (“ESI”) production to promote a “just, speedy, and inexpensive determination” of this action, as required by Federal Rule of Civil Procedure 1.

2. This Order may be modified for good cause. The parties shall jointly submit any proposed modifications within 30 days after the Federal Rule of Civil Procedure 16 conference. If the parties cannot resolve their disagreements regarding these modifications, the parties shall submit their competing proposals and a summary of their dispute.

3. Costs will be shifted for disproportionate ESI production requests pursuant to Federal Rule of Civil Procedure 26. Likewise, a party’s nonresponsive or dilatory discovery tactics will be cost-shifting considerations.

4. A party’s meaningful compliance with this Order and efforts to promote efficiency and reduce costs will be considered in cost-shifting determinations.

5. General ESI production requests under Federal Rules of Civil Procedure 34 and 45 shall not include metadata absent a showing of good cause. However, fields showing the date and time that the document was sent and received, as well as the complete distribution list, shall generally be included in the production.

6. General ESI production requests under Federal Rules of Civil Procedure 34 and 45 shall not include email or other forms of electronic correspondence (collectively “email”). To obtain email parties must propound specific email production requests.

7. Email production requests shall only be propounded for specific issues, rather than general discovery of a product or business.

8. Email production requests shall be phased to occur after the parties have exchanged initial disclosures and basic documentation about the patents, the prior art, the accused instrumentalities, and the relevant finances. While this provision does not require the production of such information, the Court encourages prompt and early production of this information to promote efficient and economical streamlining of the case.

9. Email production requests shall identify the custodian, search terms, and time frame. The parties shall cooperate to identify the proper custodians, proper search terms and proper timeframe.

10. Each requesting party shall limit its email production requests to a total of five custodians per producing party for all such requests. The parties may jointly agree to modify this limit without the Court’s leave. The Court shall consider contested requests for up to five additional custodians per producing party, upon showing a distinct need based on the size, complexity, and issues of this specific case. Should a party serve email production requests for additional custodians beyond the limits agreed to by the parties or granted by the Court pursuant to this paragraph, the requesting party shall bear all reasonable costs caused by such additional discovery.

11. Each requesting party shall limit its email production requests to a total of five search terms per custodian per party. The parties may jointly agree to modify this limit without the Court’s leave. The Court shall consider contested requests for up to five additional search terms per custodian, upon showing a distinct need based on the size, complexity, and issues of this specific case. The search terms shall be narrowly tailored to particular issues. Indiscriminate terms, such as the producing company’s name or its product name, are inappropriate unless combined with narrowing search criteria that sufficiently reduce the risk of overproduction. A conjunctive combination of multiple words or phrases (e.g., “computer” and “system”) narrows the search and shall count as a single search term. A disjunctive combination of multiple words or phrases (e.g., “computer” or “system”) broadens the search, and thus each word or phrase shall count as a separate search term unless they are variants of the same word. Use of narrowing search criteria (e.g., “and,” “but not,” “w/x”) is encouraged to limit the production and shall be considered when determining whether to shift costs for disproportionate discovery. Should a party serve email production requests with search terms beyond the limits agreed to by the parties or granted by the Court pursuant to this paragraph, the requesting party shall bear all reasonable costs caused by such additional discovery.

12. The receiving party shall not use ESI that the producing party asserts is attorney-client privileged or work product protected to challenge the privilege or protection.

13. Pursuant to Federal Rule of Evidence 502(d), the inadvertent production of a privileged or work product protected ESI is not a waiver in the pending case or in any other federal or state proceeding.

14. The mere production of ESI in a litigation as part of a mass production shall not itself constitute a waiver for any purpose.
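For readers who like to see a rule as code, here is a minimal sketch of one way the counting rule in paragraph 11 might be tallied. It is only my reading of the text, not an official tool. The function names, the way a request is modeled (a list of OR-ed alternatives, each a tuple of AND-ed words), and the `variant_of` mapping are all assumptions made for illustration; the Model Order itself does not define what counts as a "variant."

```python
# Sketch only: one reading of paragraph 11's counting rule, not an official tool.
# A request is modeled as a list of OR-ed alternatives; each alternative is a
# tuple of words or phrases joined by AND (or another narrowing connector).

DEFAULT_LIMIT = 5  # per custodian, before any agreed or court-granted increase


def count_search_terms(alternatives, variant_of=None):
    """Count search terms under the assumed reading of paragraph 11.

    alternatives: list of tuples, e.g. [("computer", "system")] for
                  "computer AND system", or [("computer",), ("system",)]
                  for "computer OR system".
    variant_of:   hypothetical mapping from a word to its base form, so that
                  OR-ed variants of the same word are counted once.
    """
    variant_of = variant_of or {}
    counted = set()
    for alt in alternatives:
        # Normalize each word to its base form; a conjunction stays one unit.
        key = tuple(sorted(variant_of.get(w.lower(), w.lower()) for w in alt))
        counted.add(key)
    return len(counted)


def within_limit(alternatives, variant_of=None, limit=DEFAULT_LIMIT):
    """True if the request stays at or under the per-custodian limit."""
    return count_search_terms(alternatives, variant_of) <= limit
```

Under this reading, "computer AND system" is one term, "computer OR system" is two, and "compute OR computer OR computing" is one if the caller treats them as variants of the same word.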

This Model Order is a terrific first experiment to try to rein in disproportionate e-discovery expenses and stop wasting everybody’s time. Still, the plaintiff in DCG Systems did not like it and tried to avoid its application in its case. I have my own criticisms of the Model Order, including the obvious one of reliance on five blind keywords, and that puzzling paragraph five on metadata, but I will save those for the conclusion.

DCG Sys., Inc. v. Checkpoint Techs, LLC

DCG Systems is a garden variety patent case between two companies with competing patent rights. It is not the other very common type of patent case, where a small patent troll with only a little ESI sues a big company with lots of ESI. Those are called NPE (non-practicing entity) cases. This means that in the DCG Systems case both companies could find e-discovery equally troubling. The plaintiff, DCG Systems, Inc., argued that the Model Order should not be applied to its case because the Order was primarily designed for the David and Goliath, troll versus big company, NPE type of patent case.

United States Magistrate Judge Paul S. Grewal did not agree:

The court is not persuaded by DCG’s argument for at least two reasons. First, although the undersigned will not presume to know whether Chief Judge Rader or any of the esteemed members of the subcommittee were focused exclusively on reducing discovery costs in so-called “NPE” cases, there is nothing in the language of the Chief Judge’s speech or the text of the model order so limiting its application. Second, and more fundamentally, there is no reason to believe that competitor cases present less compelling circumstances in which to impose reasonable restrictions on the timing and scope of email discovery. To the extent DCG faces unique or particularly undue constraints as a result of the limitations, it remains free, under the Model Order, to seek relief from the court. But in general copying and the availability of an injunction are issues that are impacted by such restrictions no more than the myriad of other issues (e.g., inducement, state of the art, willfulness) that are present in just about all patent cases. And if competitor cases such as this lack the asymmetrical production burden often found in NPE cases, so that two parties might benefit from production restrictions, the Model Order would seem more appropriate, not less.

I know nothing about patent cases, but I do know e-discovery, and Judge Grewal’s argument sounds compelling. Judge Grewal ends his opinion with the following cautionary comment, words that I again completely agree with:

Perhaps the restrictions of the Model Order will prove undue. In that case, the court is more than willing to entertain a request to modify the limits. But only through experimentation of at least the modest sort urged by the Chief Judge will courts and parties come to better understand what steps might be taken to address what has to date been a largely unchecked problem.

We have to take new steps to control e-discovery costs, to make them proportionate. That is why I came up with my Bottom Line Driven Proportional Review approach. But the Patent Committee approach has the advantage of far greater simplicity. Moreover, little or no skill in e-discovery is required to implement this proportionality reform. Still, I am troubled by the reliance on Go Fish keyword search methods. See Child’s Game of “Go Fish” is a Poor Model for e-Discovery Search. The lack of precision and recall in blind keyword search makes this method both expensive and ineffective.

Methods aside, the Model Order Limiting E-Discovery in Patent Cases makes an important first step in litigation reform. The DCG Systems case shows timely application of the Model Order. The opinion also includes good language explaining the order and why courts should try using it to attain proportionality. (Its use is at the discretion of the presiding judge.) You may want to use Judge Grewal’s language in DCG Systems in your case memos, patent or otherwise, to show the need to control e-discovery:

Critically, the email production requests must focus on particular issues for which that type of discovery is warranted. The requesting party must further limit each request to a total of five search terms and the responsive documents must come from only a defined set of five custodians. These restrictions are designed to address the imbalance of benefit and burden resulting from email production in most cases. As Chief Judge Rader noted in his recent address in Texas on the “The State of Patent Litigation” in which he unveiled the Model Order, “[g]enerally, the production burden of expansive e-requests outweighs their benefits. I saw one analysis that concluded that .0074% of the documents produced actually made their way onto the trial exhibit list - less than one document in ten thousand. And for all the thousands of appeals I’ve evaluated, email appears more rarely as relevant evidence.”

Remember that statistic and use it. Only .0074% of e-docs discovered ever make it onto a trial exhibit list, much less ever get used to make a difference in a case. That is why in my Secrets of Search article, Part Three, I say Relevant Is Irrelevant and point out the old trial psychology rule of 7±2, to argue for higher culling rates in e-discovery search.
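If you want to check the arithmetic behind that statistic before you quote it, here is a quick sketch. The only input is the .0074% figure from Chief Judge Rader’s speech; the variable names are mine, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

# .0074% written as an exact fraction: 0.0074 / 100 = 74 / 1,000,000
rate = Fraction(74, 1_000_000)

# Expected trial exhibits per ten thousand documents produced
per_ten_thousand = rate * 10_000   # 74/100, i.e. 0.74 of a document

# Documents produced for every one that reaches the exhibit list
docs_per_exhibit = 1 / rate        # 1,000,000 / 74, roughly 13,514

print(float(per_ten_thousand))
print(round(docs_per_exhibit))
```

So the “less than one document in ten thousand” gloss holds up: the rate works out to about three quarters of a document per ten thousand produced, or roughly one exhibit for every 13,500 documents.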

More Authorities on Proportionality

Want to learn more about proportionality? Don’t rely on a keyword search to find the cases. As seen, they often do not even use the word proportionality. Try these additional articles, cases, and Mr. Shepherd instead. Mr. Google will help you find still more.

  • The Sedona Conference® Commentary on Proportionality in Electronic Discovery.
  • Bottom Line Driven Proportional Review.
  • Discovery As Abuse.
  • An Old Case With a New Opinion Demonstrating Perfect Proportionality.
  • Rimkus Consulting Group v. Cammarata, 688 F.Supp. 2d 598, 613 (S.D. Tx. 2010) (the Rules require that the parties engage in “reasonable efforts” and what is reasonable “depends on whether what was done – or not done – was proportional to that case…”)
  • Moody v. Turner Corp. Case No. 1:07-cv-692. (S.D. OH, 2010) (“…the mere availability of such vast amounts of electronic information can lead to a situation of the ESI-discovery-tail wagging the poor old merits-of-the-dispute dog.”)
  • Dilley v. Metro. Life Ins. Co., 256 F.R.D. 643, 644 (N.D. Cal. 2009) (“The court must limit discovery if it determines that ‘the burden or expense of the proposed discovery outweighs its likely benefit,’ considering certain factors including ‘the importance of the issues at stake in the action, and the importance of the discovery in resolving the issues.’” ) (quoting FED. R. CIV. P. 26(b)(2)(C)(iii))
  • Averett v. Honda of Am. Mfg., Inc., No. 2:07-cv-1167, 2009 WL 799638, at *2 (“the court always has a duty to limit discovery under Rule 26(b)(2)(C)(i)-(iii)”)
  • Wood v. Capital One Services, LLC, No. 5:09-CV-1445 (NPM/DEP), 2011 WL 2154279, at *1-3, *7 (N.D.N.Y, 2011) (the “rule of proportionality” dictated that the plaintiff’s motion be denied “without prejudice to his right to renew the motion to compel in the event he is willing to underwrite the expense associated with any such search.”)
  • Thermal Design, Inc. v. Guardian Building Products, Inc., No. 08-C-828 (E.D. Wis., 2011) (Judge refused to approve plaintiff’s electronic fishing expedition simply because the defendant had the financial resources to pay for the searches. The financial resources of the defendant are not tantamount to good cause under FRCP 26(b)(2)(C).)
  • General Steel Domestic Sales, LLC v. Chumley, No. 10-cv-01398 (D. Colo., 2011) (Judge rejected defendant’s request for the production of every recorded sales call on plaintiff’s database for a two-year period because it would take four years to listen to the calls to identify potentially responsive information.)
  • Daugherty v. Murphy, No. 1:06-cv-0878-SEB-DML, 2010 WL 4877720, at *5 (S.D. Ind., 2010) (The cost and burden of the additional production outweighed the benefit. The defendant’s sworn testimony on burden and cost was credible.)
  • Willnerd v. Sybase, 2010 U.S. Dist. LEXIS 121658 (SD Id., 2010)(“… a search of the employees’ e-mails would amount to the proverbial fishing expedition — an exploration of a sea of information with scarcely more than a hope that it will yield evidence to support a plausible claim of defamation. … In employing the proportionality standard of Rule 26(b)(2)(C), as suggested by Willnerd, the Court balances Willnerd’s interest in the documents requested, against the not-inconsequential burden of searching for and producing documents.”)
  • Rodriguez-Torres v. Gov. Dev. Bank of P.R., 265 F.R.D. 40 (D. P.R., 2010) (“… the Court determines that the ESI requested is not reasonably accessible because of the undue burden and cost. The Court finds that $35,000 is too high of a cost for the production of the requested ESI in this type of action. Moreover, the Court is very concerned over the increase in costs that will result from the privilege and confidentiality review that Defendant GDB will have to undertake on what could turn out to be hundreds or thousands of documents.”)
  • Madere v. Compass Bank, 2011 U.S. Dist. LEXIS 124758 (WD Tx. 2011) (“As the cost to restore Compass Bank’s backup tapes ‘outweighs its likely benefit,’ especially in light of the amount in controversy, the Court DENIES Madere’s request for production.”)
  • Convolve, Inc. v. Compaq Comp. Corp, 223 F.R.D. 162 (SDNY 2004) (The production request “would require an expenditure of time and resources far out of proportion to the marginal value of the materials to this litigation.”)
  • United Central Bank v. Kanan Fashions, Inc., 2010 U.S. Dist. LEXIS 83700 (DN Ill, 2010) (Restrictive date range required, but further protection from excessive burden denied due to failure to support the contentions of high cost to comply with specific facts.)
  • High Voltage Beverages, LLC v. Coca-Cola Co., 2009 U.S.Dist. LEXIS 88259 (WD NC, 2009) (“Under Rule 26(b)(2)(C)(i), the court finds that requiring defendant to sift sand for documents it has already produced would be unreasonably duplicative of earlier efforts and that the material contained therein is likely available from other sources, to wit, an earlier production of documents. … Under Rule 26(b)(2)(C)(iii), defendant has made an unrebutted showing that the man-hours and expense of reviewing the collection would be extraordinary, and it appears to the court that the burden or expense of the proposed discovery outweighs its likely benefit. Thus, the court find that it would be disproportional to require defendant to review such information prior to producing it to plaintiff and deny plaintiff’s request.”)
  • Bassi Bellotti S.p.A. v. Transcon. Granite, Inc., 2010 U.S. Dist. LEXIS 93055 (D. Md., 2010) (“… Federal Rules do impose an obligation upon courts to limit the frequency or extent of discovery sought in certain circumstances, such as when the discovery requested is unreasonably duplicative or cumulative, or the burden or expense of the proposed discovery outweighs the likely benefit, considering the needs of the case, the importance of the issues at stake in the action, and the importance of the discovery in resolving those issues.”)
  • Call of the Wild Movie, LLC v. Does 1-1062, No. 10-455 (BAH), — F. Supp. 2d —-, 2011 WL 996786, at *18-20 (D.D.C., 2011) (granting motion to compel because the request was narrow and the ESI requested was important, compared with an insufficient showing of undue burden.)
  • Hock Foods, Inc. v. William Blair & Co., LLC, No. 09-2588-KHV, 2011 WL 884446, at *9 (D. Kan. 2011) (Sebelius, Mag. J.) (denying in part a motion to compel in light of costs estimated between $1.2 and $3.6 million to search 12,000 gigabytes of data in order to answer an overbroad interrogatory.)
  • Diesel Mach., Inc. v. Manitowoc Crane, Inc., No. CIV 09-cv-4087-RAL, 2011 WL 677458, at *2-3 (D.S.D., 2011) (motion to compel the production of documents in native format was denied because no explanation provided on why information contained in native format was necessary to facts of case when those same documents had already been produced as PDFs).
  • Tucker v. American Intern. Group, Inc., 2012 WL 902930 (D. Conn. Mar. 15, 2012) (Plaintiff’s non-party Rule 45 subpoena to inspect hard drives asked the Court “to allow plaintiff “essentially carte blanche access to rummage through Marsh’s electronically stored information, purportedly in the hope that the needle she is looking for lurks somewhere in that haystack. … [T]he burdens of plaintiff’s proposed inspection upon Marsh outweigh the benefits plaintiff might obtain were she to obtain the emails through a Datatrack inspection. Plaintiff seeks to search, inter alia, the mirror images of eighty-three laptops — in effect, to dredge an ocean of Marsh’s electronically stored information and records in an effort to capture a few elusive, perhaps non-existent, fish. … Courts are obliged to recognize that non-parties should be protected with respect to significant expense and burden of compelled inspections under Fed. R. Civ. P. 45(c)(2)(B)(ii). … Moreover, courts have focused on the importance of the Rule 26(b)(2)(C) proportionality limit to implement fair and efficient operation of discovery. … Balancing the prospective burden to Marsh against the likely benefit to plaintiff from the proposed inspection, the Court concludes that the circumstances do not warrant compelling Marsh to endure inspection of its computer records by Datatrack.”)

Conclusion

DCG Systems, Inc. v. Checkpoint Techs, LLC is, by far, the best of the three cases, but it is still far from perfect. It embraces proportionality, and will no doubt save the parties lots of money in e-discovery, but at what cost? Litigation is about finding justice. If you lose that, you lose everything.

Rule 1 says, among other things, that litigation should be speedy and inexpensive. Limiting discovery to five keywords and five custodians will get you that. But Rule 1 also says litigation should be just. That is, after all, the whole point of litigation. In America, as in most of the civilized world, we don’t just go through the motions of legal process in a fast and cursory manner. Court systems are not just an empty charade. The heart of law as we know it is due process. We decide cases on the merits, on the facts, on the evidence; not just on the whim of judges or juries. That is what justice means to us. I am concerned about arbitrary limits on e-discovery that save money and speed things along, but do so at the price of justice.

Judge Paul S. Grewal, who decided DCG Systems, shares these concerns, I am sure. So too does the Patent Bar who adopted this Model Order, and Chief Judge Randall Rader who promotes it. They are, like all bona fide professionals in the Law, trying hard to find a proportional balance between benefit and burden, to know when enough is enough in the search for evidence. They don’t want too much, like some unscrupulous attorneys for whom e-discovery is little more than a legal tool of extortion. They don’t want too little, like some equally unscrupulous attorneys who play hide the ball. Good attorneys are like Goldilocks; they are looking for the just-right amount of e-discovery. They are looking for proportionality.

The patent judges show this concern in the pains they take to say that the five/five rule is just a starting point. They make clear that more e-discovery outside of these limits may be appropriate, that parties can always move the court for additional discovery. For instance, Judge Grewal in DCG Systems says: “Perhaps the restrictions of the Model Order will prove undue. In that case, the court is more than willing to entertain a request to modify the limits.” The Model Order shows the same concern that justice not be sacrificed at the altar of efficiency: “The Court shall consider contested requests for up to five additional custodians per producing party, upon showing a distinct need based on the size, complexity, and issues of this specific case.”

My main criticism of the case and Model Order, aside again from the bizarre comment in paragraph five against metadata, pertains to the reliance on Go Fish type keyword search. It is not so much the arbitrary limit to five keywords that bothers me, much less the limit to five custodians, which I think is fine. What bothers me about the Model Order, and bothers every other expert I have talked to, is the reliance on keyword search alone, and blind-pick keyword search at that. It should bother anyone who has read the scientific studies. The Model Order is promoting the worst kind of search: the blind keyword guessing kind. That is inadvertent, I’m sure. The lawyers and judges behind the Model Order were not aware of the limits of keywords picked by blind guessing. When they become aware, I assume they will consider appropriate revisions to the Model Order.

The Model Order should be reformed to require that basic metrics be shared on proposed keywords. It should require enough disclosure so that the keyword picks are not blind. A requesting party should be permitted some keyword testing before five terms are settled upon. The Order is a good start, but it needs tweaking so that the keyword searches can be more effective. I am sure there are many search experts who would help the Committee if asked. I hope the Committee does ask, because the Patent Bar’s heart is in the right place, a proportionality place.
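To make “basic metrics” a little more concrete, here is a rough sketch of the kind of report I have in mind. It assumes the producing party has hand-reviewed a small sample of documents and marked each as relevant or not; the function name, the data layout, and the simple substring matching are all my own illustrative assumptions, not anything the Model Order or any review platform prescribes.

```python
def keyword_metrics(sample, terms):
    """Report hits, precision and recall for each proposed term.

    sample: list of (text, is_relevant) pairs from a hand-reviewed sample.
    terms:  list of proposed keywords (matched as case-insensitive substrings).
    """
    total_relevant = sum(1 for _, relevant in sample if relevant)
    report = {}
    for term in terms:
        hits = [(text, relevant) for text, relevant in sample
                if term.lower() in text.lower()]
        true_hits = sum(1 for _, relevant in hits if relevant)
        report[term] = {
            "hits": len(hits),
            # Share of the term's hits that were actually relevant
            "precision": true_hits / len(hits) if hits else None,
            # Share of all relevant sample documents the term found
            "recall": true_hits / total_relevant if total_relevant else None,
        }
    return report


# Hypothetical four-document sample, for illustration only
sample = [
    ("The probe system failed calibration", True),
    ("Lunch order for the system team", False),
    ("Laser probe prior art memo", True),
    ("Holiday party schedule", False),
]
report = keyword_metrics(sample, ["system", "probe"])
```

Even numbers this crude, exchanged before the five terms are locked in, would tell a requesting party whether a proposed keyword is pulling in mostly noise or missing most of what matters.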

Now please, would someone get me out of this damn time bottle?
