“The Hacker Way” – What the e-Discovery Industry Can Learn From Facebook’s Management Ethic

August 18, 2013

Facebook’s regulatory filing for its initial public stock offering included a letter to potential investors by 27-year-old billionaire Mark Zuckerberg. The letter describes the culture and approach to management that he follows as CEO of Facebook. Zuckerberg calls it the Hacker Way. Mark did not invent this culture. In a way, it invented him. It molded him and made him and Facebook what they are today. This letter reveals the secrets of Mark’s success and establishes him as the current child prodigy of the Hacker Way.

Too bad most of the CEOs in the e-discovery industry have not read the letter, much less understood how Facebook operates. They are clueless about the management ethic it takes to run a high-tech company.

An editorial in Law Technology News explains why I think most of the CEOs in the e-discovery software industry are just empty suits. They do not understand modern software culture. They think the Hacker Way is a security threat. They are incapable of creating insanely great software. They cannot lead with the kind of inspired genius that the legal profession now desperately needs from its software vendors to survive the data deluge. From what I have seen, most of the pointy-haired management types that now run e-discovery software companies should be thrown out. They should be replaced with Hacker-savvy management before their once proud companies go the way of the BlackBerry. The LTN article, Vendor CEOs: Stop Being Empty Suits & Embrace the Hacker Way, has more details on the slackers in silk suits. This essay, a partial rerun from a prior blog, gives you the background on Facebook’s Hacker Way.

Hacker History

The Hacker Way tradition has been around since at least the sixties. It has little or nothing to do with illegal computer intrusions. Moreover, to be clear, NSA leaker Edward Snowden is no hacker. All he did was steal classified information, put it on a thumb drive, meet the press, and then flee the country, to communist dictatorships no less. That has nothing to do with the Hacker Way and everything to do with politics.

The Hacker Way – often called the hacker ethic – has nothing to do with politics. It did not develop in government like the Internet did, but in the hobby of model railroad building and MIT computer labs. This philosophy is well-known and has influenced many in the tech world, including the great Steve Jobs (who never fully embraced its openness doctrines), and Steve’s hacker friend, Steve Wozniak, the laughing Yoda of the Hacker Way. The Hacker approach is primarily known to software coders, but can apply to all kinds of work. Even a few lawyers know about the hacker work ethic and have been influenced by it.

Who is Mark Zuckerberg?

We have all seen a movie version of Mark Zuckerberg in The Social Network. The real Zuckerberg, by the way, will still own 56.9% voting control of Facebook after the public offering later this year. But who is Mark Zuckerberg really? His Facebook page may reveal some of his personal life and ideas, but how did he create a hundred-billion-dollar company so fast?

How did he change the world at such a young age? There are now over 850 million people on Facebook with over 100 billion connections. On any one day there are over 500 million people using Facebook. These are astonishing numbers. How did this kind of creative innovation and success come about? What drove Mark and his hacker friends to labor so long, and so well? The letter to investors that Mark published gives us a glimpse into the answer and a glimpse into the real Mark Zuckerberg. Do I have your full attention yet?

The Hacker Way philosophy described in the investor letter explains the methods used by Mark Zuckerberg and his team to change the world. Regardless of who Mark really is, greedy guy or saint (or like Steve Jobs, perhaps a strange combination of both), Mark’s stated philosophy is very interesting. It has applications to anyone who wants to change the world, including those of us trying to change the law and e-discovery.

Hacker Culture and Management

Mark’s letter to investors explains the unique culture and approach to management inherent in the Hacker Way that he and Facebook have adopted.

As part of building a strong company, we work hard at making Facebook the best place for great people to have a big impact on the world and learn from other great people. We have cultivated a unique culture and management approach that we call the Hacker Way.

The word “hacker” has an unfairly negative connotation from being portrayed in the media as people who break into computers. In reality, hacking just means building something quickly or testing the boundaries of what can be done. Like most things, it can be used for good or bad, but the vast majority of hackers I’ve met tend to be idealistic people who want to have a positive impact on the world.

The Hacker Way is an approach to building that involves continuous improvement and iteration. Hackers believe that something can always be better, and that nothing is ever complete. They just have to go fix it — often in the face of people who say it’s impossible or are content with the status quo.

Hackers try to build the best services over the long term by quickly releasing and learning from smaller iterations rather than trying to get everything right all at once. To support this, we have built a testing framework that at any given time can try out thousands of versions of Facebook. We have the words “Done is better than perfect” painted on our walls to remind ourselves to always keep shipping.

Hacking is also an inherently hands-on and active discipline. Instead of debating for days whether a new idea is possible or what the best way to build something is, hackers would rather just prototype something and see what works. There’s a hacker mantra that you’ll hear a lot around Facebook offices: “Code wins arguments.”

Hacker culture is also extremely open and meritocratic. Hackers believe that the best idea and implementation should always win — not the person who is best at lobbying for an idea or the person who manages the most people.

To encourage this approach, every few months we have a hackathon, where everyone builds prototypes for new ideas they have. At the end, the whole team gets together and looks at everything that has been built. Many of our most successful products came out of hackathons, including Timeline, chat, video, our mobile development framework and some of our most important infrastructure like the HipHop compiler.

To make sure all our engineers share this approach, we require all new engineers — even managers whose primary job will not be to write code — to go through a program called Bootcamp where they learn our codebase, our tools and our approach. There are a lot of folks in the industry who manage engineers and don’t want to code themselves, but the type of hands-on people we’re looking for are willing and able to go through Bootcamp.

So sayeth Zuckerberg. Hands-on is the way.

Application of the Hacker Way to e-Discovery

E-discovery needs that same hands-on approach. E-discovery lawyers need to go through bootcamp too, even if they primarily just supervise others. Even senior partners should go, at least if they purport to manage and direct e-discovery work. Partners should, for example, know how to use the search and review software themselves, and from time to time, do it, not just direct junior partners, associates, and contract lawyers. You cannot manage others at a job unless you can actually do the job yourself. That is the hacker key to successful management.

Also, as I often say, to be a good e-discovery lawyer, you have to get your hands dirty in the digital mud. Look at the documents, don’t just theorize about them or what might be relevant. Bring it all down to earth. Test your keywords, don’t just negotiate them. Prove your search concept by the metrics of the search results. See what works. When it doesn’t, change the approach and try again. Plus, in the new paradigm of predictive coding, where keywords are just a start, the SMEs must get their hands dirty. They must use the software to train the machine. That is how the artificial intelligence aspects of predictive coding work. The days of hands-off theorists are over. Predictive coding work is the ultimate example of code wins arguments.

Iteration is king of ESI search and production. Phased production is the only way to do e-discovery productions. There is no one final, perfect production of ESI. As Voltaire said, the perfect is the enemy of the good. For e-discovery to work properly it must be hacked. It needs lawyer hackers. It needs SMEs who can train the machine on what is relevant, on what evidence must be found to do justice. Are you up to the challenge?

Mark’s Explanation to Investors of the Hacker Way of Management

Mark goes on to explain in his letter to investors how the Hacker Way translates into the core values for Facebook management.

The examples above all relate to engineering, but we have distilled these principles into five core values for how we run Facebook:

Focus on Impact

If we want to have the biggest impact, the best way to do this is to make sure we always focus on solving the most important problems. It sounds simple, but we think most companies do this poorly and waste a lot of time. We expect everyone at Facebook to be good at finding the biggest problems to work on.

Move Fast

Moving fast enables us to build more things and learn faster. However, as most companies grow, they slow down too much because they’re more afraid of making mistakes than they are of losing opportunities by moving too slowly. We have a saying: “Move fast and break things.” The idea is that if you never break anything, you’re probably not moving fast enough.

Be Bold

Building great things means taking risks. This can be scary and prevents most companies from doing the bold things they should. However, in a world that’s changing so quickly, you’re guaranteed to fail if you don’t take any risks. We have another saying: “The riskiest thing is to take no risks.” We encourage everyone to make bold decisions, even if that means being wrong some of the time.

Be Open

We believe that a more open world is a better world because people with more information can make better decisions and have a greater impact. That goes for running our company as well. We work hard to make sure everyone at Facebook has access to as much information as possible about every part of the company so they can make the best decisions and have the greatest impact.

Build Social Value

Once again, Facebook exists to make the world more open and connected, and not just to build a company. We expect everyone at Facebook to focus every day on how to build real value for the world in everything they do.


Applying the Hacker Way of Management to e-Discovery


Focus on Impact

Law firms, corporate law departments, and vendors need to focus on solving the most important problems: the high costs of e-discovery and the lack of skills. The cost problem primarily arises from review expenses, so focus on that. The way to have the biggest impact here is to solve the needle in the haystack problem. Costs can be dramatically reduced by improving search. In that way we can focus and limit our review to the most important documents. This incorporates the search principles of Relevant Is Irrelevant and 7±2 that I addressed in Secrets of Search, Part III. My own work has been driven by this hacker focus on impact and led to my development of Bottom Line Driven Proportional Review and multimodal predictive coding search methods. Other hacker-oriented lawyers and technologists have developed their own methods to give clients the most bang for their buck.

The other big problem in e-discovery is that most lawyers do not know how to do it, and so they avoid it altogether. This in turn drives up the costs for everyone because it means the vendors cannot yet realize large economies of scale. Again, many lawyers and vendors understand that lack of education and skill sets is a key problem and are focusing on it.

Move Fast

This is an especially challenging dictate for lawyers and law firms because they are overly fearful of making mistakes, of breaking things as Facebook puts it. They are afraid of looking bad and of malpractice suits. But the truth is, professional malpractice suits are very rare in litigation. Such suits happen much more often in other areas of the law, like estates and trusts, property, and tax. As far as looking bad goes, they should be more afraid of the bad publicity from not moving fast enough, which is a much more common problem, one that we see daily in sanctions cases. Society is changing fast; if you aren’t too, you’re falling behind.

The problem of slow adoption also afflicts the bigger e-discovery vendors, who often drown in bureaucracy and are afraid to make big decisions. That is why you see individuals like me starting an online education program, while the big boys keep on debating. I have already changed my e-Discovery Team Training program six times since it went public almost two years ago. “Code wins arguments.” Lawyers must be especially careful of the thinking man’s disease, paralysis by analysis, if they want to remain competitive.

A few lawyers and e-discovery vendors understand this hacker maxim and do move fast. A few vendors appreciate the value of getting there first, but fewer law firms do. It seems hard for most of law firm management to understand that the risks of lost opportunities are far more dangerous and certain than the risks of making a few mistakes along the way. The slower, too conservative law firms are already starting to see their clients move business to the innovators, the few law firms who are moving fast. These firms have more than just puffed-up websites claiming e-discovery expertise; they have dedicated specialists and, in e-discovery at least, they are now far ahead of the rest of the crowd. Will the slow and timid ever catch up, or will they simply dissolve like Heller Ehrman, LLP?

Be Bold

This is all about taking risks and believing in your visions. It is directly related to moving fast and embracing change; not for its own sake, but to benefit your clients. Good lawyers are experts in risk analysis. There is no such thing as zero-risk, but there is certainly a point of diminishing returns for every litigation activity that is designed to control risks. Good lawyers know when enough is enough and constantly consult with their clients on cost benefit analysis. Should we take more depositions? Should we do another round of document checks for privilege? Often lawyers err on the side of caution, without consulting with their clients on the costs involved. They follow an overly cautious approach wherein the lawyers profit by more fees. Who are they really serving when they do that?

The adoption of predictive coding provides a perfect example of how some firms and vendors understand technology and are bold, and others do not and are timid. The legal profession is like any other industry, it rewards the bold, the innovators who create new legal methods and law for the benefit of their clients. What client wants a wimpy lawyer who is over-cautious and just runs up bills? They want a bold lawyer, who at the same time remains reasonable, and involves them in the key risk-reward decisions inherent in any e-discovery project.

Be Open

In the world of e-discovery this is all about transparency and strategic lowering of the wall of work product. Transparency is a proven method for building trust in discovery. Selective disclosure is what cooperation looks like. It is what is supposed to happen at Rule 26(f) conferences, but seldom does. The attorneys that use openness as a tool are saving their clients needless expense and disputes. They are protecting them from dreaded redos, where a judge finds that you did a review wrong and requires you to do it again, usually under very short timelines. There are limits to openness of course, and lawyers have an inviolate duty to preserve their clients’ secrets. But that still leaves room for disclosure of information on your own methods of search and review when doing so will serve your client’s interests.

Build Social Value 

The law is not a business. It is a profession. Lawyers and law firms exist to do justice. That is their social value. We should never lose sight of that in our day-to-day work. Vendors who serve the legal profession must also support these lofty goals in order to provide value. In e-discovery we should serve the prime directive, the dictates of Rule 1, for just, speedy, and inexpensive litigation. We should focus on legal services that provide that kind of social value. Profits to the firm should be secondary. As Zuckerberg said in the letter to potential investors:

Simply put: we don’t build services to make money; we make money to build better services.

This social value model is not naive, it works. It eventually creates huge financial rewards, as a number of e-discovery vendors and law firms are starting to realize. But that should never be the main point.


Facebook and Mark Zuckerberg should serve as an example to everyone, including e-discovery lawyers and vendors. I admit it is odd that we should have to turn to our youth for management guidance, but you cannot argue with success. We should study Zuckerberg’s 21st Century management style and Hacker Way philosophy. We can learn from its tremendous success. Zuckerberg and Facebook have proven that these management principles work in the digital age. It is true if it works. That is the pragmatic tradition of American philosophy. We live in fast changing times. Embrace change that works. As the face of Facebook says: “The riskiest thing is to take no risks.”

Why Would a Receiving Party Want to Use Predictive Coding?

August 12, 2013

Predictive coding software is not just a game-changer for producing parties, it is invaluable for receiving parties as well, especially those faced with document dumps. Good predictive coding software ranks all documents in a collection according to the attorney trainer’s conception of relevance (or responsiveness). The software then orders all of the documents, from the most important, to the least relevant. The document ranking feature thereby empowers a receiving party to cull out the marginally irrelevant, or totally irrelevant documents that often clutter document productions. The receiving party can then review only the documents that it thinks are of the most importance to the case, the documents they want, and ignore the rest. That saves valuable time and effort, and transforms an ugly, imprecise document dump into a delicious, high-tech feast of low hanging fruit.
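The rank-then-cull workflow described above can be sketched in a few lines of Python. This is a hypothetical illustration, not any vendor’s actual interface; the function name and the score values stand in for whatever relevance probabilities a real review platform reports:

```python
def cull_by_rank(docs, scores, cutoff=0.5):
    """Sort documents by predicted relevance and keep only those at or
    above the cutoff, so reviewers read the likely-important documents
    first and can ignore the marginal or irrelevant remainder."""
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, score in ranked if score >= cutoff]

# A toy production: four documents with machine-assigned relevance scores.
production = ["hot_memo", "routine_email", "spam", "key_contract"]
scores = [0.95, 0.40, 0.05, 0.88]

to_review = cull_by_rank(production, scores, cutoff=0.5)
# to_review == ["hot_memo", "key_contract"]
```

The receiving party reviews only the top-ranked slice; lowering the cutoff widens the net if the first pass looks too thin.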


Document Dumps

Document dumps are sometimes an intentional bad faith tactic by a producing party designed to hide evidence or overwhelm the requesting party, but are usually inadvertent or the result of carelessness. See, e.g., Gottlieb v. Iskowitz, 2012 WL 2337290 (Cal. Ct. App. June 20, 2012) (default judgment entered as sanction for intentional, bad faith document dump in violation of court order); Branhaven LLC v. Beeftek, Inc., _F.R.D._, 2013 WL 388429 at *3 (D. Md. Jan. 4, 2013) (sanctions entered under Rule 26(g) for discovery abuses, including a document dump); Losey, R., The Increasing Importance of Rule 26(g) to Control e-Discovery Abuses; Denny & Cochran, The ESI Document Dump in White Collar Cases; In re Fontainebleau Las Vegas, 2011 U.S. Dist. LEXIS 4105 (S.D. Fla. 2011); Fisher-Price, Inc. v. Kids II, Inc., 2011 WL 6409665 (W.D.N.Y. Aug. 10, 2012); Rajala v. McGuire Woods, LLP, 2013 WL 50200 (D. Kan. Jan. 3, 2013) (no sanctions entered and clawback enforced because the judge saw no evidence that plaintiff’s counsel, in producing the privileged material, “intended to overwhelm or burden the receiving party [with] documents largely irrelevant to the litigation.”)

I would like to think the intentional bad faith actions are rare. That has been my experience. What I usually see are over-productions arising out of concern of missing relevant documents. The producing party does not want to face expensive motions to compel, or worse, motions for sanctions for withholding relevant documents, as seen for instance in several of the cases cited above.

Other producing parties make document dumps because they simply do not know any better. They do not have predictive coding software and are relying on imprecise search methods, like keyword or human review. Alternatively, they have predictive coding type software, but they licensed the wrong kind of software (hey, it was cheaper) and it was not any good. Sometimes they chose good software, but do not know how to use it properly. Most vendors have very poor education systems and few are willing to tell their clients, who are often hard-nosed lawyers, that they are not using it right. The reality is, despite the ads, there is no easy button on good search. Just the contrary. Other times, a lawyer will have the right tools, and the skills to use them, but not the time to do the job right. They have a deadline to meet and do not have time to make a more precise production. They would rather err on the side of recall and full production.

For all of these reasons, and more, a producing party will often err on the side of over-production, either on purpose or due to negligence or inadequate time. Sometimes even when the producing party does a perfect job, there will still be too many documents for a receiving party to review them all within the time and money constraints of the case. Is this still a document dump? Not really, but it is still too-much-information, more than the receiving party needs.

Whatever the cause, in many cases, especially large cases, the receiving party ends up with a haystack of documents that effectively hides the information they need to prepare for trial. The receiving party then has essentially the same problem the producing party had, too-much-information, although on a lesser scale. Instead of making a fuss, and engaging in often futile motion practice, the smart receiving party will use predictive coding to sort the documents. They will then only review the documents that they want to see. That is the great strength and beauty of relevancy ranking, a feature that can only be found with predictive coding type software, as I will explain in greater detail in a future blog.

Relevant Is Irrelevant

Remember my fourth secret of search, relevant is irrelevant? See Secrets of Search – Part III. This Zen koan means that merely relevant documents are not important to a case. What really counts are the hot documents, the highly relevant. The others are nothing, just more of the same.

There can be hundreds of thousands of technically relevant documents in a collection, especially when an overly broad production request has been made, and complied with. The proper response to an over-broad request is, of course, an objection, dialogues with the requesting party, and failing resolution, a motion for protective order. But sometimes the court may get it wrong and order over-production, or the responding party might not care. Perhaps, for instance, the law firm profits from over-review. I have heard that still goes on. Or perhaps the responding party wanted to go ahead and produce everything requested, saving only privileged documents, for other reasons, such as saving money by forgoing a careful review. Maybe they are making a document dump on purpose to hide the needles of hot documents in a haystack of merely relevant. Would it not be ethical to do so if the requesting party asked for this dump, perhaps even insisted on it? The requesting party is just getting what they asked for.


Whatever the reasons, sometimes the requesting party receives far too many documents, many more than they wanted. They are then in the position that most producing parties are in when they review their clients’ documents: they have too much information. That is why relevancy ranking ability is a great reason for a receiving party to use predictive coding software to review large document productions. Even if the production is not a dump at all, it is just large, the receiving party needs help from predictive coding software to go beyond the merely relevant to the highly relevant.

Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents

June 17, 2013

This is the conclusion of the report on the Enron document review experiment that I began in my last blog, A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. The conclusion is an analysis of the relative effectiveness of the two reviews. Prepare for surprises. Artificial Intelligence has come a long way.

The Monomodal method, which I nicknamed Borg review for its machine dominance, did better than anticipated. Still, as the graphic suggests, it came up short in the key component of finding Hot documents. Yes, there is still a place for keyword and other types of search. But it is growing smaller every year.

Description of the Two Types of Predictive Coding Review Methods Used

When evaluating the success of the Monomodal, all-predictive-coding approach in the second review, please remember that this is not pure Borg. I would not spend 52 hours of my life doing that kind of review. I doubt any SME or search expert would do so. Instead, I did my version of the Borg review, which is quite different from that endorsed by several vendors. I call my version the Enlightened Hybrid Borg Monomodal review. Losey, R., Three-Cylinder Multimodal Approach To Predictive Coding. I used all three cylinders described in this article: one for random, a second for machine analysis, and a third cylinder powered by human input. The only difference from full Multimodal review is that the third engine of human input was limited to predictive coding based ranked searches.

This means that in the version of Monomodal review tested the random selection of documents played only a minor role in training (thus an Enlightened approach). It also means that the individual SME reviewer was allowed to supplement the machine selected documents with his own searches, which I did, so long as the searches were predictive coding based (thus the Hybrid approach, Man and Machine). For example, with the Hybrid approach to Monomodal the reviewer can select documents for review for possible training based on their ranked positions. The reviewer does not have to rely entirely on the computer algorithms to select all of the documents for review.

The primary difference between my two reviews was that the first Multimodal method used several search methods to find documents for machine training, including especially keyword and similarity searches, whereas the second did not. Only machine learning type searches were used in the Monomodal search. Otherwise I used essentially the same approach as I would in any litigation, and budgeted my time and expense to 52 hours for each project.

Both Reviews Were Bottom Line Driven

Both the Monomodal and Multimodal reviews were tempered by a Bottom Line Driven approach. This means the goal of the predictive coding culling reviews was a reasonable effort where an adequate number of relevant documents were found. It was not an unrealistic, over-expensive effort. It did not include a vain pursuit of more of the same type documents. These documents would never find their way into evidence anyway, and would never lead to new evidence. They would only make the recall statistics look good. The law does not require that. (Look out for vendors and experts who promote the vain approach of high recall just to line their own pockets.) The law requires reasonable efforts proportional to the value of the case and the value of the evidence. It does not require perfection. In most cases it is a waste of money to try.


In both reviews I stopped the iterative machine training when few new documents were located in the last couple of rounds. I stopped when the documents predicted as relevant were primarily just more of the same or otherwise not important. It was somewhat fortuitous that this point was reached after about the same amount of effort, even though I had only gone through 5 rounds of training in Multimodal, as compared to 50 rounds in Monomodal. I was about at the same point of new-evidence-exhaustion in both reviews and these final stats reflect the close outcomes.
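The stopping decision described above, halting the training iterations when recent rounds surface few new relevant documents, can be expressed as a simple rule of thumb. This is only a sketch; the function name and the threshold are hypothetical, and in real projects the call is a judgment informed by what the documents actually are, not just the counts:

```python
def reached_exhaustion(new_relevant_per_round, window=2, max_new=5):
    """Return True when each of the last `window` training rounds produced
    no more than `max_new` newly found relevant documents, i.e. machine
    training has hit the point of diminishing returns."""
    if len(new_relevant_per_round) < window:
        return False
    return all(n <= max_new for n in new_relevant_per_round[-window:])

# Example: the yield per round tapers off, so training stops after round five.
yields = [210, 170, 95, 4, 2]
stop_now = reached_exhaustion(yields)   # True: the last two rounds found almost nothing
```

A counting rule like this only flags the stopping point; whether the stragglers are merely more of the same, or new types of evidence, still requires human review.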

There is no question in my mind that more relevant documents could have been found in both reviews if I had done more rounds of training. But I doubt that new, unique types of relevant documents would have been uncovered, especially in the first Multimodal review. In fact, I tested this theory after the first Multimodal review was completed and did a sixth round of training not included in these metrics. I called it my post hoc analysis and it is described at pages 74-84 of the Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. I found 32 technically relevant documents in the sixth round, as expected, and, again as expected, none were of any significance.

In both reviews the decision to stop was tested, and passed, based on my version of the elusion test of the null-set (all documents classified as irrelevant and thus not to be produced). My elusion test has a strict accept-on-zero-error policy for Hot documents. This test does not prove that all Hot documents have been found. It just creates a testing condition such that if any Hot documents are found in the sample, then the test failed and more training is required. In the random sample quality assurance tests for both reviews no Hot documents were found, and no new relevant documents of any significance were found, so the tests were passed. (Note that the test passed in the second Monomodal review, even though, as will be shown, the second review did not locate four unique Hot documents found in the first review.) In both elusion tests the false negatives found in the random sample were all just unimportant more of the same type documents that I did not care about anyway.
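The accept-on-zero-error elusion test described above can be sketched as follows. This is an illustrative reconstruction, not the exact procedure used in the reviews; `label_fn` stands in for the SME’s manual coding of each sampled document:

```python
import random

def elusion_test(null_set, sample_size, label_fn, seed=None):
    """Accept-on-zero-error elusion test of the null set (documents the
    review classified as irrelevant). A random sample is drawn and coded
    by the SME; finding even one Hot document fails the test and sends
    the project back for more training. Merely relevant false negatives
    are reported but do not by themselves fail this version of the test."""
    rng = random.Random(seed)
    sample = rng.sample(null_set, min(sample_size, len(null_set)))
    labels = [label_fn(doc) for doc in sample]
    return {
        "passed": labels.count("hot") == 0,
        "hot_found": labels.count("hot"),
        "relevant_found": labels.count("relevant"),
        "sampled": len(sample),
    }
```

Note what a pass does and does not mean: it does not prove that no Hot documents remain in the null set, only that none turned up in the sample, which is exactly the limitation acknowledged above.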

Neither of my Enron reviews was perfect, and the recall and F1 tests reflect that, but they were both certainly reasonable and should survive any legal challenge. If I had gone on with further rounds of training and review, the recall would have improved, but to little or no effect. The case itself would not have been advanced, which is the whole point of e-discovery, not the establishment of artificial metrics. With the basic rule of proportionality in mind the additional effort of more rounds of review would not have been worth it. Put another way, it would have been unreasonable to have insisted on greater recall or F1 scores in these projects.

It is never a good idea to have a preconceived notion of a minimum recall or F1 measure. It all depends on the case itself, and the documents. You may know about the case and scope of relevance (although frequently that matures as the project progresses), but you usually do not know about the documents. That is the whole point of the review.

It is also important to recognize that both of these predictive coding reviews, Multi and Monomodal, did better than any manual review. Moreover, they were both far, far less expensive than traditional reviews. These last considerations will be addressed in an upcoming blog, not here. Instead I will focus on objective measures of prevalence, recall, precision, and total document retrieval comparisons. Yes, that means more math, but not much.

Summary of Prevalence and Comparative Recall Calculations

A total of three simple random samples were taken of the entire 699,082 document dataset, as described with greater particularity in the search narratives. Predictive Coding Narrative (2012); Borg Challenge Report (2013). A random sample of 1,507 documents was made in the first review, wherein 2 relevant documents were found. This showed a prevalence rate of 0.13%. Two more random samples were taken in the second review, of 1,183 documents each. The total random sample in the second review was thus 2,366 documents, with 5 relevant found. This showed a prevalence rate of 0.21%. Thus a total of 3,873 random sampled documents were reviewed and a total of 7 relevant documents found.

Since three different samples were taken, some overlap in sampled documents was possible. Nevertheless, since these three samples were each made without replacement, we can combine them for purposes of the simple binomial confidence intervals estimated here.

By combining all three samples, with a total of 3,873 documents reviewed and 7 relevant documents found, you have a prevalence of 0.18%. The spot projection of 0.18% over the entire 699,082 document dataset is 1,264. Using a binomial calculation to determine the confidence interval, at a 95% confidence level, the error range is from 0.07% to 0.37%. This represents a range of between 489 and 2,587 projected relevant documents in the entire dataset.
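For readers who want to check the interval arithmetic, here is a minimal sketch using the exact (Clopper-Pearson) binomial interval. The calculator used for the narrative may differ slightly in method, and small differences after rounding are expected; the function names here are mine.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); k is small here, so the sum is cheap."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, conf=0.95):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion,
    found by bisection since the CDF is monotone in p."""
    alpha = 1 - conf
    def solve(target_cdf, k):
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > target_cdf:
                lo = mid       # CDF too high means p is too small
            else:
                hi = mid
        return (lo + hi) / 2
    lower = 0.0 if x == 0 else solve(1 - alpha / 2, x - 1)
    upper = 1.0 if x == n else solve(alpha / 2, x)
    return lower, upper

# 7 relevant found in a combined sample of 3,873
lo, hi = clopper_pearson(7, 3873)
total = 699_082
print(f"interval: {lo:.4%} to {hi:.4%}")
print(f"projected range: {lo * total:.0f} to {hi * total:.0f} documents")
```

The exact interval comes out to roughly 0.073% to 0.372%; rounding those to 0.07% and 0.37% before multiplying by 699,082, as the text does, gives the 489 to 2,587 range reported above.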

From the perspective of the reviewer, the low end of the projected range represents the best-case-scenario for calculating recall. Here we know the 489 figure cannot be correct because both reviews found more relevant documents than that: Multimodal found 661 and Monomodal found 579. Taking a conservative view for recall calculation purposes, and assuming that the 63 documents considered relevant in one review, and not in the other, were in fact all relevant, we have a minimum floor of 955 relevant documents. Thus under the best-case-scenario, the 955 found represents all of the relevant documents in the corpus, not the 489 or 661 counts.

From the perspective of the reviewer, the high end of the projected range in the above binomial calculations – 2,587 – represents the worst-case-scenario for calculating recall. It has the same probability of being correct as the low-end projection of 489. It is a possibility, albeit a slim one, and certainly less likely than the 955 minimum floor we were able to set using the binomial calculation tempered by actual experience.

Under the most-likely-scenario, the spot projection, there are 1,264 relevant documents. This is shown in the bell curve below. Note that since the random sample calculations are all based on a 95% confidence level, there was a 2.5% chance that fewer than 489, and a 2.5% chance that more than 2,587, relevant documents would be found (the left and right edges of the curve). Also note that the spot projection of 1,264 has the highest probability (9.5%) of being the correct estimate. Moreover, the closer to 1,264 you come on the bell curve, the higher the probability of accuracy. Therefore, it is more likely that there are 1,500 relevant documents than 1,700, and more likely that there are 1,100 documents than 1,000.


The recall calculations under all three scenarios are as follows:

  • Under the most-likely-scenario using the spot projection of 1,264:
    • Monomodal (Borg) retrieval of 579 = 46% recall.
    • Multimodal retrieval of 661 = 52% recall (that’s 13% better than Monomodal (6/46)).
    • Projected relevant documents not found by best effort, Multimodal = 603.


  • Under the worst-case-scenario using the maximum count projection of 2,587:
    • Monomodal (Borg) retrieval of 579 = 22% recall.
    • Multimodal retrieval of 661 = 26% recall (that’s 18% better than Monomodal (4/22)).
    • Projected relevant documents not found by best effort, Multimodal = 1,926.
  • Under the best-case-scenario using the minimum floor of 955:
    • Monomodal (Borg) retrieval of 579 = 61% recall.
    • Multimodal retrieval of 661 = 69% recall (that’s 13% better than Monomodal (8/61)).
    • Projected relevant documents not found by best effort, Multimodal = 334.

In summary, the prevalence projections from the three random samples suggest that the Multimodal method recalled from 26% to 69% of the total number of relevant documents, with the most likely result being 52% recall. The prevalence projections suggest that the Monomodal method recalled from 22% to 61% of the total number of relevant documents, with the most likely result being 46% recall. The metrics thus suggest that Multimodal attained a recall level from 13% to 18% better than that attained by the Monomodal method.
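The recall arithmetic above is simple division; a minimal sketch (document counts taken from this study, scenario labels mine):

```python
# documents each method classified as relevant
retrieved = {"Multimodal": 661, "Monomodal": 579}

# total-relevant estimates under the three scenarios discussed above
scenarios = {"best case": 955, "most likely": 1264, "worst case": 2587}

for label, total_relevant in scenarios.items():
    for method, found in retrieved.items():
        print(f"{label}: {method} recall = {found / total_relevant:.0%}")
```

This reproduces the rounded figures above (69%/61%, 52%/46%, and 26%/22%). Note that the relative edge of Multimodal over Monomodal computed from the raw counts (661/579) is about 14% in every scenario; the 13% and 18% figures in the text come from comparing the rounded recall percentages.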

Precision and F1 Comparisons 

The first Multimodal review classified 661 documents as relevant. The second review re-examined 403 of those 661 documents. The second review agreed with the relevant classification of 285 documents and disagreed with 118. Assuming that the second review was correct, and the first review incorrect, the precision rate was 71% (285/403).

When the content of these documents is examined, and the duplicate and near duplicate documents are removed from the analysis as previously explained, the Multimodal review classified 369 different unique documents as relevant. The second review re-examined 243 of those 369 documents. The second review agreed with the relevant classification of 211 documents and disagreed with 32. Assuming that the second review was correct, and the first review incorrect, the precision rate was 87% (211/243).

Conversely, if you assume the conflicting second review calls were incorrect, and the SME got it right on all of them the first time, the precision rate for the first review would be 100%. That is because all of the documents identified by the first review as relevant to the information request would in fact stand confirmed as relevant. As discussed previously, all of the disputed calls concerned ambiguous or borderline grey area documents. The classification of these documents is inherently arbitrary, to some extent, and they are easily subject to concept shift. The author takes no view as to the absolute correctness of the conflicting classifications.

The second Monomodal review classified 579 documents as relevant. The first review had examined 323 of those 579 documents, agreeing with the relevant classification of 285 and disagreeing with 38. Assuming that the first review was correct, and the second review incorrect, the agreement rate on relevant classifications was 88% (285/323).

When the content of these documents is examined, and the duplicate and near duplicate documents are removed from the analysis as previously explained, the Monomodal review classified 427 different unique documents as relevant. The first review had examined 242 of those 427 documents. The first review agreed with the relevant classification of 211 documents and disagreed with 31. Assuming that the first review was correct, and the second review incorrect, the precision rate was again 87% (211/242).

Assuming the conflicting first review calls were incorrect, and the SME got it right on all of them the second time, then again the precision rate for the second review would be 100%. That is because all of the documents identified by the second review as relevant to the information request would in fact stand confirmed as relevant.

In view of the inherent ambiguity of all of the documents with conflicting coding, the measurement of precision in these two projects is of questionable value. Nevertheless, assuming that the later calls on inconsistently coded documents were the correct ones, when you do not account for duplicate and near duplicate documents the relevant calls of the second Monomodal review were confirmed at a 24% higher rate than those of the first Multimodal review (88% versus 71%). However, when the duplicate and near duplicate documents are removed for a more accurate assessment, the precision rates of both reviews were almost identical at 87%.
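The agreement arithmetic in this section reduces to a few ratios; a minimal sketch reproducing the figures above (the counts come from this study, the variable names are mine):

```python
# Agreement rates on the relevant calls that both reviews examined
multi_all = 285 / 403      # Multimodal relevant calls confirmed on re-review
mono_all = 285 / 323       # Monomodal relevant calls confirmed by the first review
multi_unique = 211 / 243   # same, after removing duplicates/near-duplicates
mono_unique = 211 / 242    # same, after removing duplicates/near-duplicates

print(f"raw:    Multimodal {multi_all:.0%}, Monomodal {mono_all:.0%}")
print(f"unique: Multimodal {multi_unique:.0%}, Monomodal {mono_unique:.0%}")
# the "24% more consistent" figure above is the relative gap between the raw rates
print(f"relative gap: {(mono_all - multi_all) / multi_all:.1%}")
```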

The F1 measure is the harmonic mean of the precision and recall rates. The formula for calculating the harmonic mean is not too difficult: 2/(1/P + 1/R), where P is precision and R is recall. Thus, using the 87% precision rate for both methods, the F1 ranges for the projects are:

  • 40% to 77% for Multimodal
  • 35% to 71% for Monomodal

The F1 measures for most-likely-scenario spot projections for both are:

  • 65% for Multimodal
  • 61% for Monomodal
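A minimal sketch of the harmonic mean computation, using the rounded recall percentages from the scenarios above; small differences of a point or so from the rounded F1 figures in the text are to be expected from rounding at intermediate steps:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: 2/(1/P + 1/R)."""
    return 2 / (1 / precision + 1 / recall)

precision = 0.87   # the deduplicated precision rate for both methods
recalls = {
    "Multimodal": (0.26, 0.52, 0.69),   # worst case, spot, best case
    "Monomodal": (0.22, 0.46, 0.61),
}
for method, (worst, spot, best) in recalls.items():
    print(f"{method}: F1 = {f1(precision, worst):.1%} / "
          f"{f1(precision, spot):.1%} / {f1(precision, best):.1%}")
```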

In summary, since the precision rates of the two methods were identical at a respectable 87%, the comparisons between the F1 rates closely track the recall comparisons. The Multimodal F1 of 40% for the worst-case-scenario was 14% better than the Monomodal F1 of 35%. The Multimodal F1 of 77% for the best-case-scenario was 8% better than the Monomodal F1 of 71%. The most likely spot projection differential between 65% and 61% again shows Multimodal with a 7% improvement over Monomodal.

Comparisons of Total Counts of Relevant Documents

The first review using the Multimodal method found 661 relevant documents. The second review using the Monomodal method found 579 relevant documents. This means that Multimodal found 82 more relevant documents than Monomodal. That is a 14% improvement. This is shown by the roughly proportional circles below.


Analysis of the content of these relevant documents showed that:

  • The set of 661 relevant documents found by the first Multimodal review contained 292 duplicate or near duplicate documents, leaving only 369 different unique documents. There were 74 duplicates or near duplicates in the 285 documents coded relevant by both Multimodal and Monomodal, and 218 duplicates in the 376 documents that were only coded relevant in the Multimodal review. (As the most extreme example, the 376 documents contained one email with the subject line Enron Announces Plans to Merge with Dynegy dated November 9, 2001, that had 54 copies.)
  • The set of 579 relevant documents found by second Monomodal review contained 152 duplicate or near duplicate documents, leaving only 427 different unique documents. There were 74 duplicates or near duplicates in the 285 documents coded relevant by both Multimodal and Monomodal, and 78 duplicates in the 294 documents that were only coded relevant in the Monomodal review. (As the most extreme example, the 294 documents contained one email with the subject line NOTICE TO: All Current Enron Employees who Participate in the Enron Corp. Savings Plan dated January 3, 2002, that had 39 copies.)
  • Therefore when you exclude the duplicate or near duplicate documents the Monomodal method found 427 different documents and the Multimodal method found 369. This means the Monomodal method found 58 more unique relevant documents than Multimodal, an improvement of 16%. This is shown by the roughly proportional circles below.

On the question of effectiveness of retrieval of relevant documents under the two methods it looks like a draw. The Multimodal method found 14% more relevant documents, and likely attained a recall level from between 13% to 18% better than attained by the Monomodal method. But after removal of duplicates and near duplicates, the Monomodal method found 16% more unique relevant documents.

This result is quite surprising to the author who had expected the Multimodal method to be far superior. The author suspects the unexpectedly good results in the second review over the first, at least from the perspective of unique relevant documents found, may derive, at least in part, from the SME’s much greater familiarity and expertise with predictive coding techniques and Inview software by the time of the second review. Also, as mentioned, some slight improvements were made to the Inview software itself just before the second review, although it was not a major upgrade. The possible recognition of some documents in the second review from the first could also have had some slight impact.

Hot Relevant Document Differential

The first review using the Multimodal method found 18 Hot documents. The second review using the Monomodal method included only 13 Hot documents. This means that Multimodal found 5 more Hot documents than Monomodal. That is a 38% improvement. This is shown by the roughly proportional circles below.


Analysis of the content of these Hot documents showed that:

  • The set of 18 Hot documents found by first Multimodal review contained 7 duplicate or near duplicate documents, leaving only 11 different unique documents.
  • The set of 13 Hot documents found by second Monomodal review contained 6 duplicate or near duplicate documents, leaving only 7 different unique documents. Also, as mentioned, all 13 of the Hot documents found by Monomodal were also found by Multimodal, whereas Multimodal found 5 Hot documents that Monomodal did not.
  • Therefore when you exclude the duplicate or near duplicate documents the Multimodal method found 11 different documents and the Monomodal method found 7. This means the Multimodal method found 4 more unique Hot documents than Monomodal, an improvement of 57%. This is shown by the roughly proportional circles below.



On the question of effectiveness of retrieval of Hot documents the Multimodal method did 57% better than Monomodal. Thus, unlike the comparison of effectiveness of retrieval of relevant documents, which was a close draw, the Multimodal method was far more effective in this category. In the author’s view the ability to find Hot documents is much more important than the ability to find merely relevant documents. That is because in litigation such Hot documents have far greater probative value as evidence than merely relevant documents. They can literally make or break a case.

In other writings the author has coined the phrase Relevant is Irrelevant to summarize the argument that Hot documents are far more significant in litigation than merely relevant documents. The author contends that the focus of legal search should always be on retrieval of Hot documents, not relevant documents. Losey, R. Secrets of Search – Part III (2011) (the 4th secret). This is based in part on the well-known rule of 7 +/- 2 that is often relied upon by trial lawyers and psychologists alike as a limit to memory and persuasion. Id. (the 5th and final secret of search).

To summarize, this study suggests that the hybrid multimodal search method, one that uses a variety of search methods to train the predictive coding classifier, is significantly more effective (57%) at finding highly relevant documents than the hybrid monomodal method. When comparing the effectiveness of retrieval of merely relevant documents, the two methods did, however, perform about the same. Still, the edge in performance must again go to Multimodal because of the 7% to 14% better projected F1 measures.

An Elusive Dialogue on Legal Search: Part Two – Hunger Games and Hybrid Multimodal Quality Controls

September 3, 2012

This is a continuation of last week’s blog, An Elusive Dialogue on Legal Search: Part One where the Search Quadrant is Explained. The quadrant and random sampling are not as elusive as Peeta Mellark in The Hunger Games, shown right, but almost. Indeed, as most of us lawyers did not major in math or information science, these new techniques can be hard to grasp. Still, to survive in the vicious games often played these days in litigation, we need to find a way. If we do, we can not only survive, we can win, even if we are from District 12 and the whole world is watching our every motion.

The emphasis in the second part of this essay is on quality controls and how such efforts, like search itself, must be multimodal and hybrid. We must use a variety of quality assurance methods – we must be multimodal. To use the Hunger Games analogy, we must use both bow and rope, and camouflage too. And we must employ both our skilled human legal intelligence and our computer intelligence – we must be hybrid; Man and machine, working together in perfect harmony, but with Man in charge. That is the only way to survive the Hunger Games of litigation in the 21st Century. The only way the odds will be ever in your favor.

Recall and Elusion

But enough fun with Hunger Games, Search Quadrant terminology, nothingness, and math, and back to Herb Roitblat’s long comment on my earlier blog, Day Nine of a Predictive Coding Narrative.

Recall and Precision are the two most commonly used measures, but they are not the only ones. The right measure to use is determined by the question that you are trying to answer and by the ease of asking that question.

Recall and Elusion are both designed to answer the question of how complete we were at retrieving all of the responsive documents. Recall explicitly asks “of all of the responsive documents in the collection, what proportion (percentage) did we retrieve?” Elusion explicitly asks “What proportion (percentage) of the rejected documents were truly responsive?” As recall goes up, we find more of the responsive documents, elusion, then, necessarily goes down; there are fewer responsive documents to find in the reject pile. For a given prevalence or richness as the YY count goes up (raising Recall), the YN count has to go down (lowering Elusion). As the conversation around Ralph’s report of his efforts shows, it is often a challenge to measure recall.

This last comment was referring to prior comments made in my same Day Nine Narrative blog by two other information scientists William Webber and Gordon Cormack. I am flattered that they all seem to read my blog, and make so many comments, although I suspect they may be master game-makers of sorts like we saw in Hunger Games.

The earlier comments of Webber and Cormack pertained to point projection of yield and the lower and upper intervals derived from random samples. All things I was discussing in Day Nine. Gordon’s comments focused on the high-end of possible interval error and said you cannot know anything for sure about recall unless you assume the worst case scenario high-end of the confidence interval. This is true mathematically and scientifically, I suppose (to be honest, I do not really know if it is true or not, but I learned long ago not to argue science with a scientist, and they do not seem to be quibbling amongst themselves, yet.) But it certainly is not true legally, where reasonability and acceptable doubt (a kind of level of confidence), such as a preponderance of the evidence, are always the standard, not perfection and certainty. It is not true in manufacturing quality controls either.

But back to Herb’s comment, where he picks up on their math points and elaborates concerning the Elusion test that I used for quality control.

Measuring recall requires you to know or estimate the total number of responsive documents. In the situation that Ralph describes, responsive documents were quite rare, estimated at around 0.13% prevalence. One method that Ralph used was to relate the number of documents his process retrieved with his estimated prevalence. He would take as his estimate of Recall, the proportion of the estimated number of responsive documents in the collection as determined by an initial random sample.

Unfortunately, there is considerable variability around that prevalence estimate. I’ll return to that in a minute. He also used Elusion when he examined the frequency of responsive documents among those rejected by his process. As I argued above, Elusion and Recall are closely related, so knowing one tells us a lot about the other.

One way to use Elusion is as an accept-on-zero quality assurance test. You specify the maximum acceptable level of Elusion, as perhaps some reasonable proportion of prevalence. Then you feed that value into a simple formula to calculate the sample size you need (published in my article the Sedona Conference Journal, 2007). If none of the documents in that sample comes up responsive, then you can say with a specified level of confidence that responsive documents did not occur in the reject set at a higher rate than was specified. As Gordon noted, the absence of a responsive document does not prove the absence of responsive documents in the collection.
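The accept-on-zero sample size calculation Herb mentions can be sketched with the standard zero-defect formula; this is the generic statistical result, and his Sedona Conference Journal article may present it in a different form. The idea: for a zero-hit sample of size n to support the claim "elusion is below t" at confidence C, we need (1 − t)^n ≤ 1 − C.

```python
from math import ceil, log

def accept_on_zero_sample_size(max_elusion, confidence=0.95):
    """Smallest n such that, if 0 responsive documents turn up in a random
    sample of n, we can say elusion <= max_elusion at the stated confidence:
    solve (1 - max_elusion)**n <= 1 - confidence for n."""
    return ceil(log(1 - confidence) / log(1 - max_elusion))

# e.g. to support "elusion is below 0.5%" on a zero-hit sample at 95% confidence
print(accept_on_zero_sample_size(0.005))   # 598
print(accept_on_zero_sample_size(0.01))    # 299
```

The familiar "rule of three" (n ≈ 3/t) is a quick approximation of the same formula.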

The Sedona Conference Journal article Herb referenced here is called Search & Information Retrieval Science. Also, please recall that my narrative states, without using the exact same language, that my accept-on-zero quality assurance test pertained to Highly Relevant documents, not relevant documents. I decided in advance that if my random sample of excluded documents included any that were Highly Relevant documents, then I would consider the test a failure and initiate another round of predictive coding. My standard for merely relevant documents was secondary and more malleable, depending on the probative value and uniqueness of any such false negatives. False negatives are what Herb calls YN, and we also now know is called D in the Search Quadrant with totals shown again below.

Back to Herb’s comment. (Herb, by the way, looks a bit like President Snow, don’t you think?) He is now going to start talking about Recall, which as we now know is A/G, a measure of accuracy that I did not directly calculate or claim.

If you want to directly calculate the recall rate after your process, then you need to draw a large enough random sample of documents to get a statistically useful sample of responsive documents. Recall is the proportion of responsive documents that have been identified by the process. The 95% confidence range around an estimate is determined by the size of the sample set. For example, you need about 400 responsive documents to know that you have measured recall with a 95% confidence level and a 5% confidence interval. If only 1% of the documents are responsive, then you need to work pretty hard to find the required number of responsive documents. The difficulty of doing consistent review only adds to the problem. You can avoid that problem by using Elusion to indirectly estimate Recall.
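Herb’s "about 400" figure is the standard worst-case sample size for a 95% confidence level and a ±5% confidence interval; a sketch of the usual normal-approximation formula (function name mine):

```python
from math import ceil

def sample_size(conf_interval=0.05, z=1.96, p=0.5):
    """Normal-approximation sample size for estimating a proportion to within
    +/- conf_interval at the confidence level implied by z (1.96 for 95%).
    p = 0.5 is the worst case, giving the largest required sample."""
    return ceil(z**2 * p * (1 - p) / conf_interval**2)

print(sample_size())       # 385, commonly rounded up to "about 400"
print(sample_size(0.03))   # a tighter 3% interval needs a much larger sample
```

As the second call shows, tightening the interval below 3% pushes the required sample past a thousand documents, which is the neighborhood of the 1,065-document sample used in the narrative.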

The Fuzzy Lens Problem Again

The reference to the difficulty of doing consistent review refers to the well documented inconsistency of classification among human reviewers. That is what I called, in Secrets of Search, Part One, the fuzzy lens problem that makes recall such an ambiguous measure in legal search. It is ambiguous because when large data sets are involved the value for G (total relevant) is dependent upon human reviewers. The inconsistency studies show that the gold standard of measurement by human review is actually just dull lead.

Let me explain again in shorthand, and please feel free to refer to the Secrets of Search trilogy and the original studies for the full story. Roitblat’s own well-known study of a large-scale document review showed that human reviewers only agreed with each other an average of 28% of the time. Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010. An earlier study by one of the leading information scientists in the world, Ellen M. Voorhees, found a 40% agreement rate between human reviewers. Variations in relevance judgments and the measurement of retrieval effectiveness, 36:5 Information Processing & Management 697, 701 (2000). Voorhees concluded that with 40% agreement rates it was not possible to measure recall any higher than 65%. Information scientist William Webber calculated that with a 28% agreement rate a recall rate cannot be reliably measured above 44%. Herb Roitblat and I dialogued about this issue before, most recently in Reply to an Information Scientist’s Critique of My “Secrets of Search” Article.

I prepared the graphics below to illustrate this problem of measurement and the futility of recall calculations when the measurements are made by inconsistent reviewers.

Until we can crack the inconsistent reviewer problem, we can only measure recall vaguely, as we see on the left, or at best the center, and can only make educated guesses as to the reality on the right. The existence of the error has been proven, but as Maura Grossman and Gordon Cormack point out, there is a dispute as to the cause of the error. In one analysis that they did of TREC results they concluded that the inconsistencies were caused by human error, not a difference of opinion on what was relevant or not. Inconsistent Responsiveness Determination in Document Review: Difference of Opinion or Human Error? But, regardless of the cause, the error remains.

Back to Herb’s Comment.

One way to assess what Ralph did is to compare the prevalence of responsive documents in the set before doing predictive coding with their prevalence after using predictive coding to remove as many of the responsive documents as possible. Is there a difference? An ideal process will have removed all of the responsive documents, so there will be none left to find in the reject pile.

That question of whether there is a difference leads me to my second point. When we use a sample to estimate a value, the size of the sample dictates the size of the confidence interval. We can say with 95% confidence that the true score lies within the range specified by the confidence interval, but not all values are equally likely. A casual reader might be led to believe that there is complete uncertainty about scores within the range, but values very near to the observed score are much more likely than values near the end of the confidence interval. The most likely value, in fact, is the center of that range, the value we estimated in the first place. The likelihood of scores within the confidence interval corresponds to a bell shaped curve.

This is a critical point. It means that the point projections, a/k/a the spot projections, can be reliably used. It means that even though you must always qualify any findings based upon random sampling by stating the applicable confidence interval, the possible range of error, you may still reliably use the observed score of the sample in most data sets, if a large enough sample size is used to create low confidence interval ranges. Back to Herb’s Comment.

Moreover, we have two proportions to compare, which affects how we use the confidence interval. We have the proportion of responsive documents before doing predictive coding. The confidence interval around that score depends on the sample size (1507) from which it was estimated. We have the proportion of responsive documents after predictive coding. The confidence interval around that score depends on its sample size (1065). Assuming that these are independent random samples, we can combine the confidence intervals (consult a basic statistics book for a two sample z or t test or http://facstaff.unca.edu/dohse/Online/Stat185e/Unit3/St3_7_TestTwoP_L.htm), and determine whether these two proportions are different from one another (0.133% vs. 0.095%). When we do this test, even with the improved confidence interval, we find that the two scores are not significantly different at the 95% confidence level. (try it for yourself here: http://www.mccallum-layton.co.uk/stats/ZTestTwoTailSampleValues.aspx.). In other words, the predictive coding done here did not significantly reduce the number of responsive documents remaining in the collection. The initial proportion 2/1507 was not significantly higher than 1/1065. The number of responsive documents we are dealing with in our estimates is so small, however, that a failure to find a significant difference is hardly surprising.
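Herb’s comparison of the two proportions can be sketched with a standard pooled two-sample z-test; this is the generic textbook test, not necessarily the exact method behind the calculators he links, and the function name is mine.

```python
from math import erf, sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z-test for equality of two proportions.
    Returns the z statistic and the two-tailed p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal tail area
    return z, p_value

# 2 relevant in 1,507 sampled before the review vs. 1 in 1,065 sampled after
z, p = two_proportion_z(2, 1507, 1, 1065)
print(f"z = {z:.2f}, p = {p:.2f}")   # far below the 1.96 needed for significance
```

The test confirms Herb’s point: with so few responsive documents in either sample, the drop from 0.133% to 0.095% is nowhere near statistically significant.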

This paragraph appears to me to have assumed that my final quality control test was a test for Recall, and it uses the upper limit, the worst case scenario, as the defining measurement. Again, as I said in the narrative and in replies to other comments, I was testing for Elusion, not Recall. Further, the Elusion test (D/F) here was for Highly Relevant documents, not relevant, and none were found, 0%. None were found in the first random sample at the beginning of the project, and none were found in the second random sample at the end. The yields referred to by Herb are for relevant documents, not Highly Relevant. The value of D, False Negatives, in the elusion test was thus zero. As we have discussed, when that happens, where the numerator in a fraction is zero, the result of the division is also always zero, which, in an Elusion test, is exactly what you are looking for. You are looking for nothing and happy to find it.

The final sentence in Herb’s last paragraph is key to understanding his comment: The number of responsive documents we are dealing with in our estimates is so small, however, that a failure to find a significant difference is hardly surprising. It points to the inherent difficulty of using random sampling to measure recall in low yield document sets, where prevalence is low. But there is still some usefulness for random sampling in these situations, as the conclusion of his Comment shows.

Still, there is other information that we can glean from this result. The difference in the two proportions is approximately 28%. Predictive coding reduced by 28% the number of responsive documents unidentified in the collection. Recall, therefore, is also estimated to be 28%. Further, we can use the information we have to compute the precision of this process as approximately 22%. We can use the total number of documents in the collection, prevalence estimates, and elusion to estimate the entire 2 x 2 decision matrix.

For eDiscovery to be considered successful we do not have to guarantee that there are no unidentified responsive documents, only that we have done a reasonable job searching for them. The observed proportions do have some confidence interval around them, but they remain as our best estimate of the true percentage of responsive documents both before predictive coding and after. We can use this information and a little basic algebra to estimate Precision and Recall without the huge burden of measuring Recall directly.
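Herb’s indirect estimate of Recall from prevalence and Elusion can be sketched as follows. This is a simplification that treats the reject pile as roughly the whole collection, which is nearly true at such low richness; the function name is mine.

```python
def recall_from_elusion(prevalence, elusion):
    """Estimate recall as the fractional drop in responsive-document density:
    (prevalence before review - elusion after review) / prevalence."""
    return (prevalence - elusion) / prevalence

# 2 of 1,507 responsive before the review; 1 of 1,065 left in the reject pile
est = recall_from_elusion(2 / 1507, 1 / 1065)
print(f"estimated recall: {est:.0%}")   # in line with the ~28% figure above
```

As Herb notes, with the total collection size and a prevalence estimate in hand, the same handful of numbers lets you fill in the entire 2 x 2 decision matrix and back out a precision estimate as well.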

These are great points made by Herb Roitblat in the last paragraph regarding reasonability. They show how lawyer-like he has become after working with our kind for so many years, rather than with professor types like my brother in the first half of his career. Herb now well understands the difference between law and science and what this means for legal search.

Law is not a Science, and Neither Is Legal Search

To understand the numbers and the need for reasonable efforts that accept high margins of error, we must understand the futility of increasing sample sizes to try to cure the upper limit of confidence. William Webber, in his Comment of August 6, 2012 at 10:28 pm, said that “it is, unfortunately, very difficult to place a reassuring upper bound on a very rare event using random sampling.” (emphasis added) Dr. Webber goes on to explain that to attain even a 50% confidence interval would require a final quality control sample of 100,000 documents. Remember, there were only 699,082 documents to begin with, so that is obviously no solution at all. It is about as reassuring as the Hunger Games slogan, may the odds be ever in your favor, when we all know that all but 1 of the 24 gamers must die.

Aside from the practical cost and time issues, the fuzzy lens problem of poor human judgments also makes the quest for reassuring bounds of error a fool’s errand. The perfection is illusory. It cannot be attained, or more correctly put, if you do attain high recall in a large data set, you will never be able to prove it. Do not be fooled by the slogans and the flashy, facile analysis.

Fortunately, the law has long recognized the frailty of all human endeavors. The law necessarily has different standards for acceptable error and risks than does math and science. The less-than-divine standards also apply to manufacturing quality control where small sample sizes have long been employed for acceptable risks. There too, like in a legal search for relevance, the prevalence of defective items sampled for is typically very low.

Math and science demand perfection. But the law does not. We demand reasonability and good faith, not perfection. Some scientists may think that we are settling, but it is more like practical realism, and it is certainly far better than unreasonable and bad faith. Unlike science and math, the law is used to uncertainties. Lawyers and judges are comfortable with that. For example, we are reassured enough to allow civil verdicts when a judge or jury decides that it is more likely than not that the defendant is at fault, a 51% standard of proof. Law and justice demand reasonable efforts, not perfection.

I know Herb Roitblat agrees with me because this is the fundamental thesis of the fine paper he wrote with two lawyers, Patrick Oot and Anne Kershaw, entitled: Mandating Reasonableness in a Reasonable Inquiry. At pages 557-558 they sum up saying (footnote omitted):

We do not suggest limiting the court system’s ability to discover truth. We simply anticipate that judges will deploy more reasonable and efficient standards to determine whether a litigant met his Rule 26(g) reasonable inquiry obligations. Indeed, both the Victor Stanley and William A. Gross Construction decisions provide a primer for the multi-factor analysis that litigants should invoke to determine the reasonableness of a selected search and review process to meet the reasonable inquiry standard of Rule 26(f): 1. Explain how what was done was sufficient; 2. Show that it was reasonable and why; 3. Set forth the qualifications of the persons selected to design the search; 4. Carefully craft the appropriate keywords with input from the ESI’s custodians as to the words and abbreviations they use; and 5. Use quality control tests on the methodology to assure accuracy in retrieval and the elimination of false positives.

As to the fifth criterion, which we are discussing here, of quality control tests, Roitblat, Oot and Kershaw assert in their article at page 551 that: “A litigant should sample at least 400 results of both responsive and non-responsive data.” This is the approximate sample size when using a 95% confidence level and a 5% confidence interval. (Note in my sampling I used less than a 3% confidence interval with a much larger sample size of 1,065 documents.) To support this assertion that a sample size of 400 documents is reasonable, the authors in footnote 77 refer to an email they have on file from Maura Grossman regarding legal search of data sets in excess of 100,000 documents, which concluded with the statement:

Therefore, it seemed to me that, for the average matter with a large amount of ESI, and one which did not warrant hiring a statistician for a more careful analysis, a sample size of 400 to 600 documents should give you a reasonable view into your data collection, assuming the sample is truly randomly drawn.

Personally, I think a larger sample size than 400-600 documents is needed for quality control tests in large cases. The efficacy of this small calculated sample size using a 5% confidence interval assumes a prevalence of 50%, in other words, that half of the documents sampled are relevant. This is an obvious fiction in all legal search, just as it is in all sampling for defective manufactured goods. That is why I sampled 1,065 documents using a 3% interval. Still, in smaller cases, it may be very appropriate to just sample 400-600 documents using a 5% interval. It all depends, as I will elaborate further in the conclusion.
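The arithmetic behind these sample sizes is simple enough to sketch. Here is a minimal Python version of the standard formula n = z²p(1−p)/E², assuming z = 1.96 for a 95% confidence level, worst-case prevalence p = 0.5, and an optional finite-population correction (the function is my illustration, not anyone’s official calculator):

```python
import math

def sample_size(interval, z=1.96, p=0.5, population=None):
    """Sample size for a given confidence interval (margin of error).
    z = 1.96 corresponds to 95% confidence; p = 0.5 is the worst-case
    prevalence assumption that maximizes the required sample. The
    finite-population correction barely matters for large collections."""
    n0 = z ** 2 * p * (1.0 - p) / interval ** 2
    if population is not None:
        n0 = n0 / (1.0 + n0 / population)
    return math.ceil(n0)

print(sample_size(0.05))                      # 385: the roughly 400-document sample
print(sample_size(0.03, population=699_082))  # 1,066: within rounding of the 1,065 used here
# The 2=4 rule of thumb: halving the interval quadruples the sample size,
# because n grows as 1/interval^2.
print(sample_size(0.015) / sample_size(0.03))  # ~4
```

This also shows why the 2=4 rule of thumb holds: the interval appears squared in the denominator, so cutting it in half multiplies the required sample by four.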

But regardless, all of these scholars of legal search make the valid point that only reasonable efforts are required in quality control sampling, not perfection. We have to accept the limited usefulness of random sampling alone as a quality assurance tool because of the margins of error inherent in sampling of the low prevalence data sets common in legal search. Fortunately, random sampling is not our only quality assurance tool. We have many other methods to show reasonable search efforts.

Going Beyond Reliance on Random Sampling Alone to a Multimodal Approach

Random sampling is not a magic cure-all that guarantees quality, or definitively establishes the reasonability of a search, but it helps. In low yield datasets, where there is a low percentage of relevant documents in the total collection, the value of random sampling for Recall is especially suspect. The comments of our scientist friends have shown that. There are inherent limitations to random sampling.

Ever increasing sample sizes are not the solution, even if that was affordable and proportionate. Confidence intervals in sampling of less than two or three percent are generally a waste of time and money. (Remember the sampling statistics rule of thumb of 2=4 that I have explained before wherein a halving of confidence interval error rate, say from 3% to 1.5%, requires a quadrupling of sample size.) Three or four percent confidence interval levels are more appropriate in most legal search projects, perhaps even the 5% interval used in the Mandating Reasonableness article by Roitblat, Oot and Kershaw. Depending on the data set itself, prevalence, other quality control measures, complexity of the case, and the amount at issue, say less than $1,000,000, the five percent based small sample size of approximately 400 documents could well be adequate and reasonable. As usual in the law, it all depends on many circumstances and variables.

The issue of inconsistent reviews between reviewers, the fuzzy lens problem, necessarily limits the effectiveness of all large-scale human reviews. The sample sizes required to make a difference are extremely large. No such reviews can be practically done without multiple reviewers and thus low agreement rates. The gold standard for review of large samples like this is made of lead, not gold. Therefore, even if cost was not a factor, large sample sizes would still be a waste of time.

Moreover, in the real world of legal review projects, there is always a strong component of vagary in relevance. Maybe that was not true in the 2009 TREC experiment, as Grossman and Cormack’s study suggests, but it has been true in the thousands of messy real-world lawsuits that I have handled in the past 32 years. All trial lawyers I have spoken with on the subject agree.

Relevance can be, and usually is, a fluid and variable target depending on a host of factors, including changing legal theories, changing strategies, changing demands, new data, and court rulings. The only real gold standard in law is a judge ruling on specific documents. Even then, they can change their mind, or make mistakes. A single person, even a judge, can be inconsistent from one document to another. See Grossman & Cormack, Inconsistent Responsiveness Determination at pgs. 17-18 where a 2009 TREC Topic Authority contradicted herself 50% of the time when re-examining the same ten documents.

We must realize that random sampling is just one tool among many. We must also realize that even when random sampling is used, Recall is just one measure of accuracy among many. We must utilize the entire 2 x 2 decision matrix.

We must consider the possible applicability of all of the measurements that the search quadrant makes possible, not just recall. In the quadrant, A is the relevant documents retrieved (true positives), B the irrelevant documents retrieved (false positives), D the relevant documents missed (false negatives), and E the irrelevant documents correctly left behind (true negatives); C = A+B is everything retrieved, F = D+E everything not retrieved, G = A+D everything relevant, H = B+E everything irrelevant, and I the entire collection.

  • Recall = A/G
  • Precision = A/C
  • Elusion = D/F
  • Fallout = B/H
  • Agreement = (A+E)/I
  • Prevalence = G/I
  • Miss Rate = D/G
  • False Alarm Rate = B/C
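All eight measures fall out of the same four cell counts, so they can be computed together. A minimal sketch using the same letters, and taking Agreement as overall accuracy, (A+E)/I (the function name and example numbers are mine):

```python
# Search quadrant cells: A = relevant & retrieved, B = irrelevant & retrieved,
# D = relevant & missed, E = irrelevant & not retrieved.
# Totals: C = A+B (retrieved), F = D+E (not retrieved),
# G = A+D (all relevant), H = B+E (all irrelevant), I = everything.

def quadrant_metrics(A: int, B: int, D: int, E: int) -> dict:
    C, F = A + B, D + E
    G, H = A + D, B + E
    I = A + B + D + E
    return {
        "recall": A / G,
        "precision": A / C,
        "elusion": D / F,
        "fallout": B / H,
        "agreement": (A + E) / I,
        "prevalence": G / I,
        "miss_rate": D / G,
        "false_alarm_rate": B / C,
    }

# Example: 1,000 documents, 10% prevalence, 80 of the 100 relevant found.
m = quadrant_metrics(A=80, B=20, D=20, E=880)
print(m["recall"])   # 0.8
print(m["elusion"])  # ~0.022: low elusion even with 20 relevant documents missed
```

Note how the example illustrates the low-prevalence trap discussed above: elusion looks reassuringly tiny (about 2%) even though a full fifth of the relevant documents were missed.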

No doubt we will develop other quality control tests, for instance using Prevalence as a guide or target for relevant search as I described in my seven part Search Narrative. Just as we must use multimodal search efforts for effective search of large-scale data sets, so too must we use multiple quality control methods when evaluating the reasonability of search efforts. Random sampling is just one tool among many, and, based on the math, maybe not the best method at that, regardless of whether it is for recall, or elusion, or any other binary search quadrant measure.

Just as keyword search must be supplemented by the computer intelligence of predictive coding, so too must random-sample-based quality analysis be supplemented by skilled legal intelligence. That is what I call a Hybrid approach. The best measure of quality is to be found in the process itself, coupled with the people and software involved. A judge called upon to review the reasonability of a search should look at a variety of factors, such as:

  • What was done and by whom?
  • What were their qualifications?
  • What rules and disciplined procedures were followed?
  • What measures were taken to avoid inconsistent calls?
  • What training was involved?
  • What happened during the review?
  • Which search methods were used?
  • Was it multimodal?
  • Was it hybrid, using both human and artificial intelligence?
  • How long did it take?
  • What did it cost?
  • What software was used?
  • Who developed the software?
  • How long has the software been used?


These are just a few questions that occur to me off the top of my head. There are surely more. Last year in Part Two of Secrets of Search I suggested nine characteristics of what I hope would become an accepted best practice for legal review. I invited peer review and comments on what I may have left out, or any challenges to what I put in, but so far this list of nine remains unchallenged. We need to build on this to create standards so that quality control is not subject to so many uncertainties.

Jason R. Baron, William Webber, myself, and others keep saying this over and over, and yet the Hunger Games of standardless discovery goes on. Without these standards we may all fall prey at any time to a vicious sneak attack by another contestant in the litigation games. A contest that all too often feels like a fight to the death, rather than a cooperative pursuit of truth and justice. It has become so bad now that many lawyers snicker just to read such a phrase.

The point here is, you have to look at the entire process, and not just focus on taking random samples, especially ones that claim to measure recall in low yield collections. By the way, I submit that almost all legal search is of low yield collections, not just searches related to employment law, as some have suggested. Those who think the contrary have too broad a concept of relevance, and little or no understanding of actual trials, cumulative evidence, and the modern big data koan, “relevant is irrelevant.” Even though random sampling is not The Answer we once thought, it should be part of the process. For instance, a random sample elusion test that finds no Highly Relevant documents should remain an important component of that process.
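A random sample elusion test of that kind can be sketched in a few lines: draw a random sample from the null set (the documents slated for non-production), review the sample, and pass only if no Highly Relevant documents turn up. The function name, labels, and workflow here are purely illustrative, not any standard protocol; `review` stands in for human review.

```python
import random

def elusion_test(null_set_ids, review, sample_size=1065, confidence=0.95, seed=None):
    """Sample the null set and review the sample. Pass only if no Highly
    Relevant documents are found. When the sample is clean, also report the
    exact upper bound on the relevant documents the null set may conceal."""
    rng = random.Random(seed)
    n = min(sample_size, len(null_set_ids))
    sample = rng.sample(list(null_set_ids), n)
    hits = [doc for doc in sample if review(doc) == "highly_relevant"]
    upper_bound = 1.0 - (1.0 - confidence) ** (1.0 / n)  # valid when zero hits
    return {"passed": not hits, "hits": hits,
            "prevalence_upper_bound": upper_bound if not hits else None}

# Illustrative run: a 10,000-document null set where review finds nothing.
result = elusion_test(range(10_000), review=lambda doc: "irrelevant", seed=42)
print(result["passed"])  # True
```

Consistent with the caveats above, a clean 1,065-document sample only supports an upper bound of roughly 0.28% prevalence in the null set at 95% confidence, which is why such a test should be one quality measure among many, not the whole showing of reasonability.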

The no-holds-barred Hunger Games approach to litigation must end now. If we all join together, this will end in victory, not defeat. It will end with alliances and standards. Whatever district you hail from, join us in this noble quest. Turn away from the commercial greed of winning-at-all-costs. Keep your integrity. Keep the faith. Renounce the vicious games; both hide-the-ball and extortion. The world is watching. But we are up for it. We are prepared. We are trained. The odds are ever in our favor. Salute all your colleagues who turn from the games and the leadership of greed and oppression. Salute all who join with us in the rebellion for truth and justice.









