Litigation, e-Discovery, e-Motions, and the Triune Brain

May 27, 2012

To understand e-discovery you must understand litigation. To understand litigation you must understand the emotions of the litigants. They are primarily negative (think fear and loathing, hatred and greed); after all, they are in a dispute and animosities are often high. An attorney must see these emotions and understand how they impact the conduct of the plaintiffs and defendants in a lawsuit. But, at the same time, an attorney must be careful not to get caught up in the emotions.

A good attorney is detached from the emotions of the parties. They serve as an objective voice of reason; an independent source of wise counsel. A good attorney understands the conflict, but is never part of the hostilities. A good attorney is above the fray. He or she is careful to never inflame the passions of their client. A good attorney is a peacemaker who resolves disputes rather than encouraging them. Unfortunately, not all attorneys are good attorneys. Some are mediocre. Some are outright unconscionable.

Exploitative Lawyers Are A Very Old Problem

Despite the best efforts of the Bar to screen out applicants of poor moral character, there have always been bad attorneys in the Bar. There have always been lawyers who take advantage of the intense emotions inherent in conflict, attorneys who manipulate people to their own benefit. They do not seek to calm the passions, but rather to inflame them. They see it as an effective tool to make money, to attain power and fame.

This kind of dispute exploitation is not unique to plaintiffs’ counsel, and this is not an essay bashing the plaintiffs’ Bar. Even though defense counsel do not have the same temptation of big contingency fee awards, they too can seek to encourage the subjective passions and views of their clients so as to prolong litigation and pad their fees. I have seen this kind of unethical conduct by both sides. It most frequently raises its ugly head where big money and big fees are involved on both sides.

This kind of unethical conduct by attorneys is not at all new. In fact, it has been plaguing the legal profession for centuries. Its extreme forms were long ago made crimes and torts, often known in the common law as champerty and barratry. Today it goes by the terms of malicious prosecution, abuse of process, or more commonly, ambulance chasing. It is still illegal and unethical for attorneys, but rarely prosecuted, or even punished.

Lawyers Must Be Above The Fray

Most litigation attorneys understand full well that their role is to be the objective voice of reason. Of course they sympathize with their client’s plight. They hear and acknowledge their client’s emotions, but they do not encourage them. They do not add fuel to the fire of anger, hatred, fear, and greed. They look to resolve disputes, to calm the emotions. They try to turn the parties to logic and reason. That does not mean that they become Vulcan-like, empty, and cold. A good attorney will argue vigorously and with passion the legal positions of their clients, but they will do so without any personal enmity, hostility, or animosity to the other side, much less to the opposing counsel or the judge.

The entire legal system depends upon lawyers to see, but be above, the fray of fragile emotions. It depends on attorneys to keep a clear head, to always remember that it is not their dispute, it is their client’s dispute. It depends on attorneys to reduce, not encourage, the animosities inherent in litigation.

If my thirty-two years as a litigation attorney have taught me anything, it is that you cannot be effective as a lawyer unless you maintain objectivity, unless your decisions are driven by reason, not emotion. A lawyer must be aware of the emotions of the litigants, and understand how they drive the parties’ behavior, but they must always be above them. Failing that, they fail as a lawyer.

This can sometimes be very difficult. I acknowledge that I have sometimes been caught up in the passions of my clients, of a dispute with opposing counsel, at least for a time. But then I would catch myself and settle down. Anger must always be controlled. Even excessive joy and gloating in victory is not a good thing.

I was lucky to have four mentors as a young-pup attorney in my twenties and thirties. I would always get upset when an opposing party lied under oath. I would get really pissed when an opposing counsel played a dirty trick on us. In the world of state and federal litigation in Orlando, Florida, we dealt with some really bad apples; often they were out-of-towners.

I have seen outright intentional lies and fraud by attorneys of all sorts, from forgetting to copy us with memos, intentional misstatements of law and mischaracterizations of cases, obvious hide-the-ball tactics, extortionate discovery requests made just to harass, sandbagging of all sorts, mud-slinging, even against opposing counsel, as well as outright lies and disrespect to a judge. I have even seen attempts to slip in new fake exhibits at trial. I have also seen champerty and barratry by attorneys, solicitation of suits of all kinds, some of which were quite frivolous. I have seen abuse of process and malicious prosecution. I have seen attorneys who are totally in the fray of all disputes they enter. They curse and shout at everyone in sight.

I have seen attorneys who obviously pander to their clients’ animosities, who expand and prolong litigation needlessly. Unfortunately, this is quite common among the ethically challenged members of the Bar, and is frequently rewarded financially. Ever wondered how the not-so-smart attorney has so much money?  Now you know (either that or the silver spoon syndrome). There are many slang references to this practice among lawyers, including milking a client, and keeping the fires going until the client runs dry, runs out of money, and then, and only then, talking about settlement. This applies to some defense counsel as well, as Craig Ball was quick to point out in his comment below. I am pretty sure all experienced lawyers have seen this kind of thing all too often. No wonder the law and lawyers have such a bad reputation. We are not doing a very good job of policing ourselves.

Lesson of a Young Lawyer and the Triune Brain

I used to get quite upset as a young lawyer when this kind of thing happened to me. But one of my mentors, Tom Moran, had a slogan: Don’t get mad, get even. He would say that quite often when I was an associate working his cases. It was a lesson of harnessing emotions into action, of using them for motivation, and not allowing them to use you, and push you into emotional reactions you would later regret.

After several such lessons, when I saw for myself the effectiveness of this strategy, I eventually learned. Oh, I still get angry, even to this day, at what some lawyers do, but I never act on this anger. I channel it to work even harder. My actions as a lawyer are governed by my thinking, my intellect, not my feelings.

A little diversion into old brain studies would be helpful at this point. Some consider the human brain to be composed of three distinct structures. Neuroscientist and physician Dr. Paul D. MacLean (1913-2007) of Yale Medical School called our brain a triune brain. It is an evolutionary view of our brain and mind functioning. Our neocortex is the grey matter on top. It is where our higher human functions originate, our reasoning and language skills. But below it are two older brains that came earlier in evolution, the mammalian brain (limbic system), and the still older reptilian brain (brain stem and cerebellum). The oldest reptilian brain governs basic instincts, your autonomic nervous system; it is where our fight or flight instincts reside. The mammalian brain is the home of feelings, of pleasure and pain. You share this brain with your dog.

A lawyer should not deny their feelings, or the emotions of the parties, or ignore them. That would be a mistake. But we should not allow them to dictate our actions. We should use all three of our brains, but make sure that the neocortex is in control. After all, who wants a dog for a lawyer? Yes, they can be cute, but their bark is ultimately unpersuasive.

The old saying drilled into me as a young associate – don’t get mad, get even – was not about getting revenge and allowing the lower brains to control. Just the contrary. Our goal was always about justice for our clients, not personal revenge. Our satisfaction came in seeing justice done. More often than not, it was, and the bad-apple-type-attorneys failed to deceive. I am happy to report that after a lifetime in the trenches of the law, I am still upbeat about our system of justice. It is imperfect, but it is the best in the world.

Conclusion

Lawyers who allow themselves to get caught up in the emotions of their client, or the emotions of the dispute itself, do a disservice to their client, and a disservice to the profession. We are peacemakers, not inciters. We are officers of the court, sworn to bring disputes to a reasoned and just resolution. As Rule One requires, we are “to secure the just, speedy, and inexpensive determination of every action and proceeding.” This can only be achieved by our intellect, our higher uniquely human brain functions, not by our reptilian or mammalian brains. We have evolved from the slime, the mud. We should stay in the upright world of moral conduct.

This dictate to rise above the bitter fray of litigation hostilities, and rely instead on reason and facts, applies to all aspects of litigation, including e-discovery. Do not ever let the opposing party’s or opposing counsel’s actions provoke you to an emotional response. Do not respond with anger to an e-discovery requesting extortionist, or a hide-the-ball responding illusionist. Do not reply in kind. As my mother would say, two wrongs don’t make a right.

This does not mean to capitulate or allow yourself to be bullied. This does not mean to just sit back and allow our legal system to be abused. It means to stay cool, to use your higher brain functions, your intellect. Use logic, language, reason, and truth to obtain justice. The pen is mightier than the sword. Given time the calm voice of truth and reason will almost always prevail. Never stoop to the level of another lawyer who has chosen the dark side, who is controlled by, rather than controls, his or her limbic system. Clear thinking and cold facts will best serve them their just deserts.


Blogging, Proportional Review and Predictive Coding

May 13, 2012

I did an interview recently with Andrew Bartholomew of e-Discovery Beat. I told him he could ask me anything, except for cases involving my law firm. Andrew put the audio of the entire interview online, and added an edited transcript of selections in two segments: part one and part two. Here are a couple of questions that you might find of interest, especially the first one about blogging, which I have been asked about a lot lately.

After last week’s difficult blog on random sampling, this one is an easy-breather. But, don’t worry, I try not to bore. The interview includes a zinger against all abusers of e-discovery. You know who I mean. All those caveman lawyers out there who abuse e-discovery as a blunt tool for extortion. They only use e-discovery to try to drive up the other side’s costs at every turn. They are not really looking for the truth. They will do or say anything to win a case, to make money for themselves. See: Judge David Waxse on Cooperation and Lawyers Who Act Like Spoiled Children.

E-discovery is a powerful tool for truth, a tool for justice. It can be dangerous in the wrong hands. We must all stand up, and stand together, to protect e-discovery from abusive bullies. That includes exercise of your First Amendment rights to free speech and free association. That is what our country is all about.

Blogging

Bartholomew: How did you come to be such a prolific blogger? Where most blogs just skim the surface, your E-Discovery Team blog really dives deep into the issues.

Losey: When I first started doing this in 2006, the blog posts were shorter and I didn’t provide a whole lot of analysis. I was mainly talking about new cases. But after doing this every week for five-and-a-half years now, it has become second nature. I find that my writing evolves as my own understanding evolves.

I’m pretty opinionated at this point because I’ve been doing it so long. I have become the analysis and opinion guy in e-discovery. I don’t try to report on each new case that comes out. Occasionally, I’ll have someone send me an opinion say, for instance, by Judge Scheindlin right off the presses, and I like to rush out there and write something that’s kind of news oriented. But generally speaking I am more of an analysis and commentary kind of guy to help people think it through.

I try to help the profession by sharing the experience of my being a lawyer for 32 years and being an avid technology person my whole life. Being there in the field as a practicing attorney, I see what’s going on. I know what the fights are in the courtrooms. Based on that, I have a lot of source material and information that comes my way. I’m doing analysis anyway as part of my job, so it’s not that hard to share it and write it up on my blog.

Bottom-Line-Driven Proportional Review

Bartholomew: You mentioned e-discovery case-law. Are there any important case-law trends that you’re following at the moment?

Losey: I came out a few months ago on my blog and went public with something I’ve been doing internally at my law firms and that’s bottom-line-driven proportional review. This is something we try to do every chance we get in every case to make sure that our production responses are proportional to the value of the case. It is my way of trying to control what I think is the primary problem in e-discovery today, and that is runaway costs.

It involves estimation and budgeting and figuring out what a project is going to cost before you actually begin. It seems like basic common sense, but you’d be amazed. That is not the way things have been done in the past. There are still plenty of law firms around the country, if not the world, that begin production responses without a set budget or without having a clue what it’s going to cost them. We see examples of this in the case-law almost every day.

I’m now trying to promote this; just get the idea out there. Use your knowledge and experience about what things cost to make an estimate at the very beginning of a case as to what is proportionate to spend on e-discovery. I call it bottom-line-driven proportional review. I want everybody to be making this argument.

Who wouldn’t be in favor of proportionate expenses? Who wouldn’t be in favor of curtailing out-of-control e-discovery costs? Who wouldn’t be in favor of reasonability when it comes to e-discovery?

There are some people that wouldn’t be in favor of it. These are the people I want to stop, the people that use e-discovery as a weapon, not as a valid tool to obtain the truth in order to decide cases.

Predictive Coding and Human Review

Bartholomew: How does the advent of computer-assisted review, or predictive coding, stand to impact the role of human review in e-discovery going forward?

Losey: The need for human input is never going to go away. Predictive coding does not replace human reviewers. Having said that, it may reduce the number of human reviewers, but so will proportional discovery.

If you use predictive coding as a tool, but you don’t use it with a legal method, it’s worthless. A hammer doesn’t build a house. It takes a carpenter to use the hammer to build the house. Predictive coding is just the latest, coolest tool, but it doesn’t replace the carpenter.

It doesn’t replace all the other tools either. I’m the one that said keyword search is very limited, but the truth is, you still need keyword search. It’s still a very valuable and important tool. It’s just not the best tool. But it still needs to be used, and so do the human reviewers.

The other slogan I’m talking about right now is called hybrid (computer and human), multi-modal (many methods) computer-assisted review. This is what it’s all about. It’s having computers help us to do a faster, better, higher quality, yet less expensive, review. Basically it allows us to get more bang for our buck.

If you’re on a budget, you better be delivering the relevant documents within that budget. The best way to do that is with the latest tools; predictive coding is the latest tool. But it’s just a drop down menu on any good software review tool, along with concept review, the similarity feature where you’re grouping words using near de-duplication, as well as keyword search.

The foundation to all of these techniques is expert human review. The human input has become even more important with predictive coding because now you need to bring in experts at the beginning. You need to bring in the people who really know what’s relevant and what isn’t in order to train the computer and generate the seed-set. If anything, the latest predictive coding technologies have elevated the importance of the expert lawyer.

Bartholomew: Are there other issues or trends that we might be hearing about from you on your blogs or future presentations?

Losey: I’m going to continue to talk a lot about predictive coding and using technology because I really believe that the only way to get out of the mess we’re in of having too much information – a problem created by technology – is to use more technology. We have to fight fire with fire. I’m going to keep encouraging the law to use technology and the knowledge and intelligence we have in computers in order to do e-discovery – not only in an inexpensive way, but also in a quality way where you get the information you need.

The new trend I’ve been talking about is the growing importance of information science on the law. It’s one thing to have technology impact the law, but you must balance out the technology with the deep knowledge and real understanding that you can really only get from science. That’s the only way law is going to be able to use technology in an appropriate manner.


Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022

May 6, 2012

This is going to be a hyper-technical blog for all those professionals in e-discovery who are struggling, like I am, to fully understand the math governing random sampling, particularly as it is applied to our field of legal search. I can say with a high degree of confidence that most of us who specialize in e-discovery employ random sampling in some form or another as part of our quality control efforts. We typically use random sampling in large-scale review projects. But do we really understand all of the intricacies? Probably not.

Bubble People and the Future Here Now

I would estimate that 80% of the elite few who attend Sedona, as mentioned in my last blog, use random sampling as part of their e-discovery work. But this is a small group of dedicated specialists, probably only a few hundred strong. They are in what Paul D. Weiner likes to call the Sedona Bubble. I have only about a 90% confidence level in that number, however, as I have not yet done a valid poll of the Sedonites (not the best word perhaps for Sedona members, but better than bubble-people). Moreover, I suspect that my margin of error, aka confidence interval, is a high one of 10%. That means that as few as 70% of the Sedonites in fact use sampling, or as many as 90%. See, e.g., “Sampling 101 for the e-Discovery Lawyer,” an appendix to The Sedona Conference Commentary on Achieving Quality in the E-Discovery Process (2009) at pgs. 35-39.

This kind of probabilistic thinking is all part of the future practice of law, coming your way soon. How soon? I’ll tell you in a minute. As William Gibson said: The future is already here – it’s just not very evenly distributed. Many of my readers may already be there, Sedonites or not, and may already use random sampling and statistics as part of their legal practice. But I am pretty sure, and here I’d go as far as to say I have a 99.9% confidence level, that most lawyers in the world do not.

My guess is based on my travels and teachings to many lawyer groups around the U.S., not to mention my interaction with many of those delightful lawyers in towns large and small who go by the label of opposing counsel. In other words, these statements and predictions are based on what I have seen, not from a validly random sample of American lawyers. (Hint to the Rand Corporation: here is a good research project for you.) Still, my wetware (gooey brain based) estimates, with a 95% confidence level, that less than 2% of all lawyers now use random sampling in any way. Random sampling is still a rare exception in U.S. legal culture. And therein lies the problem, at least in so far as e-discovery quality control is concerned. Sampling now has a very low prevalence rate.

But those of us in the world of e-discovery are used to that. There are still very few full-time specialists in e-discovery. This is changing fast. It has to in order for the profession to cope with the exploding volume and complexity of written evidence, meaning of course, evidence stored electronically. We e-discovery professionals are also used to the scarcity of valuable evidence in any large e-discovery search. Relevant evidence, especially evidence that is actually used at trial, is a very small percentage of the total data stored electronically. DCG Sys., Inc. v. Checkpoint Techs, LLC, 2011 WL 5244356 at *1 (N.D. Cal. Nov. 2, 2011) (quoting Chief Judge Rader: only .0074% of e-docs discovered ever make it onto a trial exhibit list). Again, this is a question of low prevalence. So yes, we are used to that. See Good, Better, Best: a Tale of Three Proportionality Cases – Part Two; and, Secrets of Search article, Part Three (Relevant Is Irrelevant).

A Losey Prediction

I predict that the rate of prevalence of use of sampling and probabilistic thinking by lawyers will increase rapidly over the next ten years. It must. Random sampling is too powerful a tool for the profession to ignore. It has been well proven as an indispensable tool of science and industry. It is probably time for law to also embrace this tool.

But I will do more than make such vague general assertions. I will now get very specific and put hard metrics on my predictions, metrics with which future lawyers can hold me accountable. (I’m not really too worried as I’ll have Adam to defend me, and he’ll probably come up with some good excuses in the 5% unlikely event I’m wrong.)

I hereby predict that … (trumpets sound) … in the year 2022 a random sample polling of American lawyers will show that 20% of the lawyers in fact use random sampling in their legal practice. I make this prediction with a 95% confidence level and a confidence interval of only 2%. I even predict how the growth will develop on a year-by-year basis, although my confidence in this detail is lower.

But I will go still further out on the limb, and make my prediction even more specific. Assuming that by the year 2022 there are 1.5 million lawyers (the ABA estimated there were 1,128,729 resident, active lawyers in 2006), I predict that 300,000 lawyers in the U.S. will be using random sampling by 2022. The confidence interval of 2% by which I qualified my prediction means that the range will be between 18% and 22%, which means between 270,000 lawyers and 330,000 lawyers. I have a 95% level of confidence in my prediction, which means there is a 5% chance I could be way wrong, that there could be fewer than 270,000 using random sampling, or more than 330,000. This is all shown by the familiar bell curve first shown above and below. (Hint – Adam, here’s the out to defend my predictions (in the unlikely event you’ll have to.))

I do all of this prognostication somewhat tongue-in-cheek, but with the ulterior motive to provide an example of what I mean by probabilistic thinking. Forget about absolute certainty of knowledge about anything. Forget about perfection. Think reasonability of efforts. Think preponderance of evidence. Think probability. Think in terms of degrees of confidence. For example, I am highly confident that most of you probably get 90% of my humor, give or take 2% of my jokes.

But enough with the pleasantries. I promised a hard-nosed technical math blog for all you super-nerds out there, and now you’re going to get it! (Here is where I predict 50% of my readers will stop reading!)

The Value and Limitations of Random Sampling

When you review a random sample of data drawn from a larger collection (“corpus”), categorize the sample data in some way, for instance by identifying all documents in the sample as either relevant or irrelevant, and then project the percentage found in the sample onto the entire corpus, you cannot know for certain that your percentage is the correct answer (e.g., that only 10% of the total corpus is relevant because only 10% of the sample is relevant). But, if the sample size is large enough, and the selection of the sample is truly random, you can know that there is a certain chance, e.g., a 95% chance, or “confidence level,” that you are within a certain margin of error (“confidence interval”) of the correct answer. Put another way, there is a 95% chance that you are correct, at least within a defined plus or minus range.

For my purposes as an e-discovery lawyer concerned with quality control of document reviews, this explanation of near certainty is the essence of random probability theory. This kind of probabilistic knowledge, and use of random samples to gain an accurate picture of a larger group, has been used successfully for decades by science, technology, and manufacturing. It is key to both quality control and understanding large sets of data. The legal profession must now also adopt random sampling techniques to accomplish the same goals in large-scale document reviews.
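A minimal sketch of the projection just described, in Python. The function name project_range and the figures plugged in (a 10% relevance rate in the sample, a 1,000,000 document corpus, and confidence intervals of 4% and 2%) are illustrative assumptions, not the output of any particular calculator. All it does is shift the sample percentage up and down by the confidence interval and scale the result to the corpus.

```python
def project_range(sample_rate, interval, corpus_size):
    """Project a sample rate onto a corpus as a (low, high) document count.

    sample_rate and interval are fractions (0.10 means 10%).
    Hypothetical helper for illustration only.
    """
    low = max(0.0, sample_rate - interval)   # a rate cannot drop below 0%
    high = min(1.0, sample_rate + interval)  # or rise above 100%
    return round(low * corpus_size), round(high * corpus_size)


corpus = 1_000_000
for interval in (0.04, 0.02):
    low, high = project_range(0.10, interval, corpus)
    print(f"10% +/- {interval:.0%}: between {low:,} and {high:,} relevant documents")
```

The range says nothing about where inside it the true count falls. It only says that, at the chosen confidence level, the true count is probably somewhere in it.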

You can use any standard random sample calculator to determine the appropriate size of a random sample, using either a 95% or 99% confidence level, and the confidence interval of your choice. I suggest you use the calculator shown at the top of the random sample page on my FloridaLawFirm.com website. The confidence interval you plug into the calculator represents the margin of error you find acceptable. Fewer documents are required for a valid random sample size as the confidence interval increases, or the confidence level decreases.

In the example above where 10% of the sample was relevant, if a confidence interval of 4 is used, that means that the 10% projected level may be as high as 14% or as low as 6%. Take a corpus of 1,000,000 documents and a random sample of 600 documents, which is the sample size required for a 95% confidence level and +/- 4% confidence interval. If you find that 60 of the sampled documents are relevant, and 540 are irrelevant, you can know that there is a 95% chance that the number of relevant documents in the entire corpus is between 60,000 and 140,000. If a confidence interval of 2% is used, and the corresponding number of randomly selected documents is reviewed (2,395), and again 10% are found to be relevant (240), then the range of relevant documents in the corpus is between 80,000 and 120,000. That is how random probability works in a binary classification system. Here is the standard bell curve graphic illustrating a 95% confidence level:

The variation in sample size required for various confidence levels and intervals is shown in the graph below. It illustrates the sample sizes needed for 90%, 95%, and 99% confidence levels with confidence intervals of 10%, 5% and 2%.

The math at work for calculating sample sizes and confidence intervals involves square root calculations, as will be shown in the fun math part below. This essentially requires about a quadrupling of sample size in order to achieve a doubling of accuracy. Put another way, if you want to cut your error margin in half, you will have to quadruple your sample size. For instance, assuming a population size of 1,000,000, and a 95% confidence level, the sample size required for a 10% confidence interval is 96. The sample size required for a 5% confidence interval is 384. The sample size required for a 2.5% confidence interval is 1,534. The sample size required for a 1.25% confidence interval is 6,109.
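Here is a rough Python sketch of that quadrupling. It assumes the worst case prevalence of 50% and a Z value of 1.96 for the 95% confidence level. It also applies a finite population correction for the 1,000,000 document corpus and rounds to the nearest whole document. Those last two choices are my assumptions about how the standard calculators arrive at their numbers, not something the calculators document here, and the function name sample_size is made up for the illustration.

```python
def sample_size(interval, population, z=1.96, prevalence=0.5):
    """Approximate sample size for a margin of error given as a fraction.

    Uses n0 = z^2 * p * (1 - p) / interval^2, then a finite population
    correction (an assumption on my part), then rounds to the nearest integer.
    """
    n0 = z ** 2 * prevalence * (1 - prevalence) / interval ** 2
    corrected = n0 / (1 + (n0 - 1) / population)
    return round(corrected)


previous = None
for interval in (0.10, 0.05, 0.025, 0.0125):
    n = sample_size(interval, population=1_000_000)
    growth = "" if previous is None else f" ({n / previous:.2f}x the previous size)"
    print(f"+/- {interval:.2%}: {n:,} documents{growth}")
    previous = n
```

Each halving of the interval multiplies the sample by almost exactly four, because the interval appears squared in the denominator. The multiplier drifts slightly below four as the sample becomes a noticeable fraction of the corpus.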

This is a good rule of thumb to remember. If you want to cut your error rate, your confidence interval, in half, and thus double your accuracy, your cost to do so will quadruple. It will quadruple, at least approximately, because you will have four times as many documents in the sample to review. Twice the quality at four times the cost. Thus 2=4 in the world of quick calculations for random sampling. Hopefully the picture of my old unmanicured thumb will help you to remember this.

The Impact of Prevalence on Random Sampling Calculations

The second calculator shown on my linked page allows you to add another dimension, another criterion, to your probability analysis, namely “prevalence.” This is especially important to understand in the field of legal search where low prevalence rates are common. In the binary example of relevance, the prevalence of the corpus is the percentage of relevant documents. The prevalence percentage has a direct numerical impact on the margin of error (“confidence interval”) applicable to the sample projections. Prevalence is also known as “richness,” as in target-richness, or “response distribution.” See, e.g., another sample size calculator by RAOsoft.com that includes these criteria and an explanation.

The first calculator shown on my website assumes what some call the “worst case scenario” for sample prediction where the prevalence is 50%. This is a perfectly even distribution, which requires the largest sample size to attain a desired confidence level. The top calculator conservatively assumes that half of the corpus will be in the target group (i.e., relevant) and half will not. When the target rate or prevalence is 50/50, that requires the highest number of documents to be sampled for statistical validity, which is why it is called the “worst case scenario.” When the prevalence rate is higher or lower than 50%, the number of documents that must be sampled decreases.

Thus, if the prevalence rate is 95%, meaning in our example, 95% of the documents are relevant, or, conversely, if the richness is very low, and the prevalence rate is only 5%, again a smaller sample is required to attain the same confidence interval. Put another way, review of the same sample size yields a much narrower confidence interval, and thus a much lower margin of error. This is very important to understanding the binary classifications of a large corpus of data where only a small amount of the data is responsive, i.e., relevant. (Another example of a binary classification could be privileged or not.)

Try out the second standard random sample calculator shown on my website to see this for yourself. In the first example shown, assuming a corpus of one million documents, with a confidence interval of 4, you see that a sample size of 600 documents is required. This is the largest possible sample size required for the 95% +/- 4. It assumes the worst case scenario of 50% prevalence (i.e. – half of the documents are relevant). Now change the prevalence percentage to 95% in the second calculator, using a sample size of 600, and a corpus of 1,000,000. The confidence interval is now 1.74%. You get the same result when you assume a prevalence rate of only 5%.
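The same check can be sketched in a few lines of Python. This assumes the usual normal approximation for a proportion, with a Z value of 1.96 for the 95% confidence level, and it leaves out any finite population correction, which at 600 documents out of 1,000,000 does not change the result at two decimal places. The function name margin_of_error is hypothetical.

```python
import math


def margin_of_error(sample_size, prevalence, z=1.96):
    """Margin of error (as a fraction) for a sample of the given size.

    Normal approximation: z * sqrt(p * (1 - p) / n). Illustration only.
    """
    return z * math.sqrt(prevalence * (1 - prevalence) / sample_size)


for prevalence in (0.50, 0.95, 0.05):
    moe = margin_of_error(600, prevalence)
    print(f"prevalence {prevalence:.0%}, sample of 600: +/- {moe:.2%}")
```

The 95% and 5% cases come out the same because the formula only sees the product of the prevalence and one minus the prevalence, which is identical for both.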

Again, see the Sample Size Calculator at RaoSoft.com for a calculator that allows you to plug-in different prevalence rates (called “response distribution” in that calculator) to determine sample sizes for certain intervals based on prevalence. Bottom line, when you have a corpus with a high or low prevalence, one that is either target rich, or target poor, a smaller sample size is required to attain an acceptable confidence interval. (Note, there are some exceptions where, for instance, there are extreme values (“outliers”) or where there are small corpus sizes.)

A good way to understand prevalence is by example. Assume a 1,000,000 document corpus with a prevalence rate of 5% (one where 5% or less of the documents are relevant). You need only review 456 documents to know the total number of relevant documents with 95% certainty and an error rate of only 2%. Remember, if you had assumed that half of the documents were relevant, then you would have had to review 2,395 documents to attain the same confidence level and interval. See for yourself by trying this out in the standard calculators on my page and on RaoSoft’s.

This characteristic of random sampling must be understood for cost-effective quality control in a corpus with low prevalence. This is important because low prevalence is the norm in legal search, not the even 50/50 split that standard calculators assume by default as the hardest case.

Mathematical Formula for Random Sample Size Calculations

Here is one way of expressing the basic formula behind most standard random sample size calculators:

n = Z² x p(1-p) ÷ I²

Description of the symbols in the formula:

n = required sample size

Z = confidence level (The value of Z in statistics is called the “Standard Score,” wherein a 90% confidence level=1.645, 95%=1.96, and 99%=2.576)

p = estimated prevalence of target data (richness)

I = confidence interval or margin of error
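
If you want to check the Z values yourself, here is a minimal sketch in Python. It assumes the usual two-sided convention, where a 95% confidence level leaves 2.5% in each tail of the normal curve. The function name z_score is only illustrative and does not come from any calculator mentioned in this post.

```python
from statistics import NormalDist


def z_score(confidence_level: float) -> float:
    """Two-sided Z value for a confidence level given as a fraction (e.g. 0.95)."""
    # Split the leftover probability evenly between the two tails.
    tail = (1.0 - confidence_level) / 2.0
    # inv_cdf returns the point on the standard normal curve with that
    # much probability to its left.
    return NormalDist().inv_cdf(1.0 - tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence level -> Z = {z_score(level):.3f}")
```

Rounded to three decimals, the printed values should line up closely with the standard scores listed above.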

Putting the formula into words – the required sample size is equal to the Z value for the confidence level squared, times (the estimated prevalence times one minus the estimated prevalence), then divided by the square of the confidence interval.

Here is an example of the formula in action where we assume a 95% confidence level and confidence interval of 2%, and a prevalence of 4%:

n = Z² x p(1-p) ÷ I²
n = 1.96² x .04(1-.04) ÷ .02²
n = 3.8416 x .04(.96) ÷ .0004
n = 3.8416 x .0384 ÷ .0004
n = .14751744 ÷ .0004
n = 368.7936

The formula shows that with an estimated prevalence of 4% we need a sample size of 369 documents to attain a 95% confidence level with a margin of error of 2%.
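
The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the formula above; the function name sample_size and its parameter names are my own labels. It rounds up at the end because you cannot review a fraction of a document.

```python
import math


def sample_size(z: float, p: float, interval: float) -> float:
    """n = Z^2 x p(1-p) / I^2, returned before any rounding."""
    return (z ** 2) * p * (1.0 - p) / (interval ** 2)


# 95% confidence level (Z = 1.96), 4% prevalence, 2% confidence interval.
n = sample_size(z=1.96, p=0.04, interval=0.02)
print(round(n, 4))   # the raw result of the formula
print(math.ceil(n))  # rounded up to whole documents
```

Changing p to another prevalence estimate shows how quickly the required sample size moves.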

It is important to understand that this sample size formula is derived from the formula for calculating confidence intervals (I).

If you take the “n” value as unknown (the number to be sampled for a specified confidence interval), assign a value to the confidence level of, say, 95%, so that the value for “Z” is 1.96, and move the “n” to the left side of the equation, the formula now looks like this:

n = (1.96 ÷ I)² x p(1-p)

Mathematically this is the same thing as our original formula:

n = Z² x p(1-p) ÷ I²

We can easily prove the formulas are identical by example where we again assume a 95% +/- 2%, and a prevalence of 4%:

I = Z√(p(1-p)/n)
.02 = 1.96√(.04(1-.04)/n)
n = (1.96/.02)² x .04(.96)
n = (98)² x .0384
n = 9604 x .0384
n = 368.7936
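
A quick way to convince yourself that the two formulas are the same is to run the numbers in both directions. The sketch below is illustrative only, with function names of my own choosing. It computes n both ways and then feeds that n back into the interval formula to confirm that the 2% interval comes back out.

```python
import math


def interval_from_n(z: float, p: float, n: float) -> float:
    """Interval formula: I = Z x sqrt(p(1-p)/n)."""
    return z * math.sqrt(p * (1.0 - p) / n)


def n_rearranged(z: float, p: float, i: float) -> float:
    """Interval formula solved for n: n = (Z/I)^2 x p(1-p)."""
    return (z / i) ** 2 * p * (1.0 - p)


def n_original(z: float, p: float, i: float) -> float:
    """Original sample size formula: n = Z^2 x p(1-p) / I^2."""
    return z ** 2 * p * (1.0 - p) / i ** 2


z, p, i = 1.96, 0.04, 0.02
n1 = n_original(z, p, i)
n2 = n_rearranged(z, p, i)

# Both forms agree, and plugging n back in recovers the 2% interval.
print(math.isclose(n1, n2))
print(math.isclose(interval_from_n(z, p, n1), i))
```

Both lines should print True, apart from the tiny rounding differences that math.isclose is designed to ignore.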

Here is another example using the formula I prefer, and following our first assumptions where the estimated prevalence rate is 5% relevant documents, and a 95% confidence level is desired with a confidence interval of 2%. The following relatively simple mathematical calculation provides the required sample size:

n = 1.96² x .05(1-.05) ÷ .02²
n = 3.8416 x .05(.95) ÷ .0004
n = 3.8416 x .0475 ÷ .0004
n = .182476 ÷ .0004
n = 456.19

Now if you change the prevalence rate from 5% to 50%, the formula increases the required sample size for a 95% confidence with plus or minus 2% as follows:

n = 1.96² x .5(1-.5) ÷ .02²
n = 3.8416 x .5(.5) ÷ .0004
n = 3.8416 x .25 ÷ .0004
n = .9604 ÷ .0004
n = 2401

Do the math above. Really, it is not that hard. It is all just multiplication and division. It shows that with the lower prevalence rates commonly found in legal search you can make accurate predictions using lower sample sizes. Further, if you do determine sample size based on an assumed 50% prevalence rate, whereas in fact you have a much lower rate, you are actually lowering your confidence interval, your margin of error.
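
You may notice that the formula gives 2,401 for a 50% prevalence, while the calculator figure quoted earlier for a one million document corpus was 2,395. A likely explanation is that calculators which ask for a population size apply a finite population correction. That is my assumption about how those calculators work, not something documented in this post. Here is a minimal sketch of the standard correction, n = n0 ÷ (1 + (n0 - 1) ÷ N), with illustrative function names:

```python
def sample_size(z: float, p: float, interval: float) -> float:
    """Uncorrected sample size: n0 = Z^2 x p(1-p) / I^2."""
    return z ** 2 * p * (1.0 - p) / interval ** 2


def finite_population_adjust(n0: float, population: int) -> float:
    """Standard finite population correction: n = n0 / (1 + (n0 - 1) / N)."""
    return n0 / (1.0 + (n0 - 1.0) / population)


corpus = 1_000_000
for prevalence in (0.50, 0.05):
    n0 = sample_size(1.96, prevalence, 0.02)
    n = finite_population_adjust(n0, corpus)
    print(f"prevalence {prevalence:.0%}: formula {n0:.2f}, "
          f"corrected for {corpus:,} documents {n:.2f}")
```

With a corpus this large the correction is small: roughly 2,395 instead of 2,401 at 50% prevalence, and still about 456 at 5% prevalence.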

Thus, suppose you use a standard calculator that has the worst-case 50% prevalence rate built in, and you review 2,401 documents, which you thought was the sample size necessary to attain a confidence interval of 2%. If the document corpus in fact has only a 5% prevalence rate (95% irrelevant documents), then your calculations will have a confidence interval (error rate) of only .87%, and not the 2% interval you thought. That is a good thing.

Again, don’t believe me. Do the math. Use the Interval formula that the sample size formula is based upon. (You may also need a calculator that does square root.)

I = Z√(p(1-p)/n)
I = 1.96√(.05(1-.05)/2401)
I = 1.96√(.05(.95)/2401)
I = 1.96√(.0475/2401)
I = 1.96√.00001978342357
I = 1.96 x .004447856064443
I = .00871779788631
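
If you do not have a square root calculator handy, the interval formula is easy to sketch in Python. The function name margin_of_error is just a label of mine. The sketch ignores corpus size, which makes almost no difference when the corpus is a million documents.

```python
import math


def margin_of_error(z: float, p: float, n: int) -> float:
    """I = Z x sqrt(p(1-p)/n), returned as a fraction (0.02 means 2%)."""
    return z * math.sqrt(p * (1.0 - p) / n)


# The example above: 2,401 documents sampled, 5% prevalence.
print(f"{margin_of_error(1.96, 0.05, 2401):.2%}")

# The earlier 600-document example at 50%, 95% and 5% prevalence.
for prevalence in (0.50, 0.95, 0.05):
    print(f"{prevalence:.0%} prevalence: {margin_of_error(1.96, prevalence, 600):.2%}")
```

The 95% and 5% prevalence lines print the same interval, because p(1-p) comes out the same either way.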

You can also use the second standard calculator on my page. Just plug in a 95% confidence level, a sample size of 2401, a population of 1,000,000, and a prevalence percentage of 5. It should calculate a confidence interval of 0.87. You can also double-check by using the RaoSoft calculator.

Additional Math Disclaimer

I have a disclaimer on all of my blog postings. See the top title and the first link on the right hand column: DISCLAIMER. On this particular post I thought it would be a good idea to add yet another level of disclaimer. Although math is math, and these are well accepted formulas and principles, these are still just my personal applications and synthesis of information and rules applicable in the field of statistics and legal search. I reserve the right to go back and make revisions to this post as my understanding deepens and improves. I am an attorney, not an information scientist or statistician. These views should not be relied upon, nor accepted as anyone’s opinion other than my own. You should, of course, always do your own due diligence, study and analysis. Like I said, do the math.

As always, if you disagree with the analysis here, or detect any math errors, please let me know. I welcome a free exchange of ideas and information. You can either email me privately, or write a public comment. That is how my blog works. I put my ideas out there for peer-review, and I make corrections as I go along, and before the blogs are ultimately transformed into a book. I appreciate all of the help my learned readers have provided to me over the years since I first began this open writing experiment in 2006. The odds are, your comments will help make my next book even better.

Conclusion

This blog has discussed thirteen different scenarios showing probabilistic analysis:

  1. I began with analysis of e-discovery expert bubble people wherein I estimate, based on anecdotal evidence, that 80% already use random sampling in some manner. I have only a 90% confidence level in that, with a confidence interval of 10%, so it could actually range from 70% to 90%, and maybe a lot more or less.
  2. Then I moved on to analysis of all lawyers in the world. I estimated that a majority (51% or more) do not use random sampling at all. I put a 99.9% confidence level on that opinion and invited the Rand Corporation to try to prove me wrong.
  3. Then I turned my half-witty attention to all lawyers in the U.S. and opined that less than 2% use random sampling. I put a 95% confidence level on that one.
  4. Then I made my prediction that in ten years the number of lawyers in the U.S. using random sampling will increase tenfold from 2% to 20%. I am 95% confident on that projection, but I put a margin of error on it of plus or minus 2%. Based on the ABA’s estimate of the number of lawyers in America, I projected that between 270,000 and 330,000 lawyers will be using random sampling by 2022. Rand Corp., make a note and do a follow-up survey in 2022, would you please?
  5. I next estimated that my blog readers get 90% of the humor in this blog (or better said, attempts at same), with a confidence interval of 2%, meaning between 88% and 92%.
  6. Serious sampling examples then began where I assumed a 95% confidence level, and 4% confidence interval. A review of a sample of 600 documents found that 60 were relevant (10%). Based on the sample we can project that 100,000 of the documents in the million document corpus would be relevant, with a range of between 6% and 14%, which means between 60,000 and 140,000 documents.
  7. Another variation of the last example was then considered where a confidence interval of 2% was used, instead of 4%. This required a sample size of 2,395 documents, where 10% were again found to be relevant (240). Since a 2% interval was used, the range of relevant documents projected was narrower, from between 80,000 and 120,000.
  8. Next, I added consideration of prevalence into the sample size formulas and started with an example of a 95% confidence level, and either 5% or 95% prevalence ratio (same either way). With a review of a random sample of 600 documents, and either a 5% or 95% prevalence, I showed that the confidence interval improved from 4% to 1.74%. This is an important point.
  9. Then I considered a 5% prevalence, where I showed that a sample of only 456 documents provides a 95% certainty and an error rate of 2%. This compared to the need to sample 2,395 documents for a 2% confidence interval if you assume 50% prevalence. Another important point.
  10. Then I showed the actual mathematical calculations explaining the formulas and used an example of a 95% confidence level, a 2% confidence interval, and a prevalence of 4%. You remember, it went like this and showed you only had to sample 369 documents:
    n = Z² x p(1-p) ÷ I²
    n = 1.96² x .04(1-.04) ÷ .02²
    n = 3.8416 x .04(.96) ÷ .0004
    n = 3.8416 x .0384 ÷ .0004
    n = .14751744 ÷ .0004
    n = 368.7936
  11. The next formula I ran again assumed a 95% confidence level and 2% interval, but this time changed the prevalence to 5%. The formula showed a required sample size of 456 documents.
  12. Then I ran the math on 95% +/- 2, but this time assuming a 50% prevalence. The formula showed a required sample size of 2,401 documents.
  13. Then I ended with another twist where the sample size of 2,401 documents is used, but this time a 5% prevalence is assumed. The interval calculation formula showed that a .87% confidence interval results. That was shown in the only formula where you had to do a square root calculation:
    I = Z√(p(1-p)/n)
    I = 1.96√(.05(1-.05)/2401)
    I = 1.96√(.05(.95)/2401)
    I = 1.96√(.0475/2401)
    I = 1.96√.00001978342357
    I = 1.96 x .004447856064443
    I = .00871779788631
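
The projection step in the sixth scenario above, going from a sample result to a range for the whole corpus, can be sketched like this. The function and parameter names are illustrative only. The sketch applies the interval as a simple plus or minus around the sample rate, which is how the examples in this post use it.

```python
def project_range(corpus_size: int, sample_size: int,
                  relevant_in_sample: int, interval: float):
    """Return (point estimate, low, high) counts of relevant documents."""
    rate = relevant_in_sample / sample_size
    # Keep the bounds between 0% and 100% of the corpus.
    low = max(0.0, rate - interval)
    high = min(1.0, rate + interval)
    return (round(corpus_size * rate),
            round(corpus_size * low),
            round(corpus_size * high))


# 600 documents sampled from one million, 60 found relevant, interval of 4%.
estimate, low, high = project_range(1_000_000, 600, 60, 0.04)
print(f"about {estimate:,} relevant documents, between {low:,} and {high:,}")
```

For these inputs the sketch reproduces the 60,000 to 140,000 range given in that scenario.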

I pointed out that you could skip the math entirely if you wanted, and attain the same results by using the random sample size calculators on my page, or the RaoSoft calculator, or any of a number of other calculators freely available on the web. Depending on what software you are using for review, you might also have this ability built in. You can also skip formulas and calculators altogether and rely upon charts that list common values. These charts typically assume a prevalence of 50%. See, e.g., Sample Size Table from Research Advisors. It can still be helpful to look at these charts to get a feel for how the numbers relate. For instance, look at these tables from the University of Florida, Professor Glenn D. Israel:

Table 1. Sample size for ±3%, ±5%, ±7% and ±10% Precision Levels Where Confidence Level is 95% and P=.5.

Size of        Sample Size (n) for Precision (e) of:
Population     ±3%      ±5%      ±7%      ±10%
500            a        222      145      83
600            a        240      152      86
700            a        255      158      88
800            a        267      163      89
900            a        277      166      90
1,000          a        286      169      91
2,000          714      333      185      95
3,000          811      353      191      97
4,000          870      364      194      98
5,000          909      370      196      98
6,000          938      375      197      98
7,000          959      378      198      99
8,000          976      381      199      99
9,000          989      383      200      99
10,000         1,000    385      200      99
15,000         1,034    390      201      99
20,000         1,053    392      204      100
25,000         1,064    394      204      100
50,000         1,087    397      204      100
100,000        1,099    398      204      100
>100,000       1,111    400      204      100

a = Assumption of normal population is poor (Yamane, 1967). The entire population should be sampled.

Even though calculators and charts make sample size determination easy, it is good to know how to do the math yourself. That provides a solid understanding of what the calculators and charts are doing and why. Also see the work of the EDRM on the subject: Statistical Sampling Applied to Electronic Discovery; and, Appendix 2: Application of Sampling to E-Discovery Search Result Evaluation.

The math we examined shows the importance of prevalence to random sample size calculations and confidence interval calculations. This has been overlooked, or at least underestimated, by many in the field of e-discovery. This error often leads to over-sampling and review of more documents than required to obtain reasonable confidence levels and intervals. The routine assumption of a worst-case-scenario of 50% prevalence leads to overkill and unnecessarily large samples for many (but not all) uses of random sampling, including many quality control calculations. We need to start adding prevalence into our equations, and start being more efficient in our quality control metrics.

I look forward to your public and private comments. Hopefully I have caught all of the minor number and math mistakes (I have already spotted and corrected quite a few), but it is late, and I may well have missed some. Please let me know if you see any more errors.


What Goes On In Sedona?

April 29, 2012

What goes on in Sedona … stays in Sedona. I’m not sure if Vegas copied the slogan from The Sedona Conference®, or vice versa, but when you attend a Sedona Conference event, it is one of the first things you learn. It is an iron clad rule, one which I have never broken. (Who wants to risk that kind of bad karma?) Because of this rule, even though I attended the Mid-Year meeting of Sedona this week (held in Denver, not Sedona), there is very little I can say about it. The Sedona rule keeps me quiet.

What I Can Say About The Sedona Conference

Under the Sedona rule, I can say that I had a great time, met with many of the top people in the field, and learned new things of interest. Also, on a personal note, I can say I saw several old friends and made some new ones, all good folks who share a common passion for e-discovery. But that is about all I can say.

I’ve been going to Sedona events since 2006, usually to both the mid-year and annual meetings. The quality of the events is head and shoulders above all others. In fact, it is the only event I will go to, even if I don’t happen to be a designated speaker. My time is limited. Life is short. So I choose to follow the path to Sedona.

You can learn a lot in Sedona when you turn off your inner chatter, and your external gizmos (iPhones, iPads, etc.), and just listen for a while. Truth be known, not everyone does that of course, me included (especially this week), but when you do, you get more out of it. A Sedona Conference is the best place around to try to understand many differing points of view, to experiment in a new kind of education program, one based on mutual respect and dialogue, not one-upmanship and argument. Dialogue is the key word here. Here is the official description of the mission of The Sedona Conference® (“TSC”):

The mission of TSC is to drive the reasoned and just advancement of law and policy by stimulating ongoing dialogue amongst leaders of the bench and bar to achieve consensus on critical issues. TSC brings together the brightest minds in a dialogue-based, think-tank setting with the goal of creating practical solutions and recommendations of immediate benefit to the bench and bar.

The Sedona website goes on to explain that its “conferences are dialogue-based mini-sabbaticals for the nation’s leading jurists, lawyers, and experts that allow them to examine leading edge issues of law and policy.”

Sedona Membership is Open to All

Curious about what really goes on at the secretive Sedona Conference? You should be. Its dialogue-based mini-sabbaticals are unlike any other event. You won’t find out from me, but still, not to worry, The Sedona Conference® is open to membership by anyone. It is not too expensive either. Just go to their newly refurbished website – www.thesedonaconference.org – and sign-up at the new member page. It costs $395 for an annual membership and you can pay online.

Attendance at meetings like the Mid-Year I just attended is, of course, an additional charge. Also, attendance at these events is limited. That is no bull. Sedona Conference events always have more people sign up to attend than they will allow in. Sedona caps the size of these events in order to maintain quality. Sure, they could make much more money if they wanted to, just by making the conferences bigger. But, unlike many events, this is not about money. The Sedona Conference is a bona fide 501(c)(3) non-profit foundation. It is all about education and advancing the law in a reasonable and just manner. It is not a business, and the founder of The Sedona Conference®, Richard G. Braman, is all about ideas, not money.

Richard Braman

Richard is a cool dude who looks more like a pirate than a lawyer. He’s into jazz, not possessions. No doubt that’s one reason he retired as a successful antitrust lawyer and moved to far out Sedona. Sedona – you know, the New Age paradise of energy vortexes, crystals, and spiritual visionaries. Sedona is an appropriate place for Braman. He is a lawyer visionary.

Richard Braman is the real thing. Although he is not a touchy-feely kind of guy, it is obvious he is motivated by love of profession, not money. He is deep into reason too, Man, don’t get me wrong. But he is balanced, just like the scales of Justice. So too, The Sedona Conference is open to all, defendants and plaintiffs alike, inside counsel, outside counsel, private lawyers, government lawyers, retired lawyers, practicing lawyers, judges, professors, scientists, paralegals, techs, super-geeks, and vendors of all sorts.

That is why I am proud to be a part of Richard’s mission, even if he does look like a pirate! His ship is pure. His mission is true. It is for all people who care about the pursuit of justice. The Working Group One part of the Sedona Conference is for those of us who really care about electronic discovery. (Sedona has other working groups too, ones that are focused on other areas of law: Antitrust, Complex Litigation, Intellectual Property Rights and International.)

The Many Sedona Publications on Electronic Discovery

The Sedona Conference® Working Group 1 on Electronic Document Retention and Production is called a working group for a reason. The group does not just meet and dialogue, it works. It creates outstanding writings to advance the law of e-discovery. The list of publications of Working Group 1 since 2003 is impressive. Moreover, true to their non-profit status, all of these publications can be obtained without charge from the Sedona website. Here is the official list, and I can tell you, there are several more in progress, I just can’t tell you what they are.

Conclusion

People like Richard Braman, and organizations like The Sedona Conference, are very rare in this over-commercialized, hyped up culture of ours. When you find a legal visionary with class, you should look deeply at his gifts, you should follow. Not him as a person. I’m not into that New Age guru stuff, and neither is Richard. You should follow his ideas. You should join his class and learn all you can. You should learn to dialogue. The Sedona Conference® is the only e-discovery organization I have joined. It is the only e-discovery group and education program I endorse. (Well, except mine of course:  e-DiscoveryTeamTraining® (yeah, I copied the trademark idea from him too)).

So, join up. Go to Sedona. You won’t find energy vortexes and doorways to other dimensions. You won’t find Gurus and Yodas (well, ok, maybe a couple of Yodas find you will). If you go to a Sedona event, wherever it is held, you will encounter a different and better approach to legal education. I look forward to meeting you there, maybe this Fall at the next big event, the Annual meeting. If so, we can toast a sunset and swap a few stories. All of which will be kept secret of course – a deep secret, but one filled with light, not darkness. For that is the Sedona way.


Second Ever Order Entered Approving Predictive Coding

April 24, 2012

An order approving predictive coding was entered on April 23, 2012 in Global Aerospace, Inc. v. Landow Aviation, L.P., et al. This is a complex dispute in a Virginia State Court. The defendants’ motion seeking the order was granted. It is not pretty, and not detailed, but appears to be the second such order in the history of Man. I can’t discuss the first case, but I can and will keep posting the next cases as they come rolling out. I predict there will be many this year. Send them to me, anonymously if you like, as in this case, and I will post them here, in full. I understand that in this second case the vendor was OrcaTec. Here it is (by the way, a short order like this, with handwriting, etc., is not uncommon in state court).

You can also download the order in PDF form here.

A Press Release by the vendor involved, OrcaTec, came out the day after I first published this news. It provides further background and some interesting quotes. Here are selected excerpts:

The consolidated case stems from a collapse of a commercial structure, which damaged hundreds of millions of dollars in personal property. The defendants are represented by Schnader Harrison Segal and Lewis LLP of Pittsburgh and Baxter, Baker, Sidle, Conn & Jones, PA of Baltimore. Schnader’s e-Discovery Practice Group, led by Thomas C. Gricks III,  initially directed the collection and preservation of the ESI.  When agreement on production methodology could not be reached, Schnader filed a motion for a protective order to allow the firm to use predictive coding to cull the collection.  …

The order was issued after a hearing on the defendants’ motion on Monday. The plaintiffs had argued against predictive coding, saying that it was not as effective as human review. Gricks presented the arguments for predictive coding to the court, noting that Schnader has been successful in using predictive coding to save time and money on first-pass review, which in this case will be significant.  He was backed up by experts Karl Schieneman of Review Less, Timothy Opsitnick of JurInnov, and Dr. Herbert L. Roitblat of OrcaTec.

 “The critical point of the order is that the Court allowed a party to choose predictive coding as its preferred method of responding to a request for production of ESI.  His decision was an express recognition of the evolution of document review to deal with ever-increasing volumes of data,” said Gricks.

“We were very pleased to be able to show the scientific accuracy of predictive coding to a court in a formal hearing setting,” said Dr. Roitblat.  “Keyword searching seems to be perfectly acceptable to attorneys, even though several studies have focused on its inaccuracy. If keyword searching with 20 percent proven accuracy is okay, how can predictive coding with more than 90 percent demonstrable accuracy be unacceptable? I see this as the first step in that mental barrier coming down for lawyers.” …

While the issued order was quite short, the judge said in the hearing that a producing party gets to use whatever method it wants to use to review documents. The receiving party can then raise issues if it doesn’t get what it thinks it should have in litigation.  Opsitnick said, “The Judge analogized using predictive coding to a choice between using paralegals or senior partners or younger associates to review documents, which we think is correct. Unfortunately, none of the Court’s helpful explanation made it to the Order this time, but this is the first break in the predictive coding logjam.”

“OrcaTec has shown over and over how much time, money and effort predictive coding saves, plus how great the accuracy and transparency is in using it,” said Roitblat.  “We are very grateful that the court recognized the value of predictive coding.”

Schieneman added, “This ruling should give attorneys a real green light for moving ahead with this truly effective technology.”