Day Nine of a Predictive Coding Narrative: A scary search for false-negatives, a comparison of my CAR with the Griswold’s, and a moral dilemma

This is my sixth in a series of narrative descriptions of a search of 699,082 Enron emails to find evidence concerning involuntary employee terminations. The preceding narratives are:

In this sixth installment I continue my description, this time covering day nine of the project. Here I do a quality control review of a random sample to evaluate my decision in day eight to close the search.

Ninth Day of Review (4 Hours)

I began by generating a random sample of 1,065 documents from the entire null set (95% +/- 3%) of all documents not reviewed. I was going to review this sample as a quality control test of the adequacy of my search and review project. I would personally review all of them to see if any were False Negatives, in other words, relevant documents, and if relevant, whether any were especially significant or Highly Relevant.

I was looking to see if there were any documents left on the table that should have been produced. Remember that I had already personally reviewed all of the documents that the computer had predicted were likely to be relevant (51% probability). I considered the upcoming random sample review of the excluded documents to be a good way to check the accuracy of reliance on the computer’s predictions of relevance.

I know it is not the only way, and there are other quality control measures that could be followed, but this one makes the most sense to me. Readers are invited to leave comments on the adequacy of this method and other methods that could be employed instead. I have yet to see a good discussion of this issue, so maybe we can have one here.

If my decision in day eight to close the search was correct, then virtually all of the predicted irrelevant files should be irrelevant. For that reason I expected the manual review of the null set to go very fast. I expected to achieve speeds of up to 500 files per hour and to be able to complete the task in a few hours. Well, anyway, that was my hope. I was not, however, going to rush or in any way deviate from my prior review practices.

To be honest, I also hoped that I would not discover any Hot (Highly Relevant) documents in the null set. If I did, that would mean that I would have to go back and run more learning sessions. I would have to keep working to expand the scope so that the next time there would be no significant False Negatives. I was well aware of my personal prejudice not to find such documents, and so was careful to be brutally honest in my evaluation of documents. I wanted to be sure that I was consistent with past coding, that I continued the same evaluation standards employed throughout the project. If that led to discovery of hot documents and more work on my part, then so be it.

Scope of Null Set

I began the Null Set review by noting that the random sample picked up some documents that had already been categorized as Irrelevant, as expected. I could have excluded them from the Null Set, but that did not seem appropriate, as I wanted the sample to be completely random from “all excluded” documents, whether previously categorized or not. But I could be wrong on that principle and will seek input from information scientists on that issue. What do you think? Scientist or not, feel free to leave a comment below. Anyway, I do not think it makes much difference, as only 126 of the randomly selected documents had been previously categorized.

Review of the Null Set

Next I sorted by file type to look for any obviously irrelevant files I could bulk tag. None were found. I did see one PowerPoint and was surprised to find it had slides pertaining to layoffs, both voluntary and involuntary, as part of the Enron bankruptcy (control number 12114291).

Following my prior rules of relevance I had to conclude this document was relevant under the expanded scope I had been using at the end, although it was not really important, and certainly not Highly Relevant. It looked like this might be a privileged document too, but that would not make any difference to my quality control analysis. It still counted.

By itself the document was not significant, but I had just started the review and already found a relevant document, a false-negative. If I kept finding documents like this I knew I was in trouble. My emotional confidence in the decision to stop the search had dropped considerably. I began bracing for the possibility of several more days of work to complete the project.

I then used a few other sort techniques for some bulk coding. Sorting on the “From” field turned up a few obvious junk documents based on sender. Note that using the Shortcut Keys can help with speed. I especially like shifting into and out of Power Mode (for review) with F6, and then using the Alt-Arrow keys for rapid movement, especially from one doc to the next. Keeping your hand positioned over the keys like a video game allows for very rapid irrelevancy tagging and movement from one doc to the next. You can do up to 20 individual docs per minute that way (3 seconds per doc), if the connection speed is good.

Most of these irrelevant docs are obvious and only a quick glance allows you to confirm this, so that is why you can get up to a 3 seconds per doc coding rate, even without mass categorization. Only a few in the null set required careful reading, where it may take a minute, but rarely more, to determine relevance.

This review took a bit longer than expected, primarily because I was in the office and kept getting interrupted. Starting and stopping always slows you down (except for periodic attention breaks, which actually speed you up). Not including the interruptions, it still took 4 hours to review these 1,065 documents. That means I “only” went about 266 files per hour.

The good news is I did not find another relevant document, or even an arguably relevant document. One false negative out of 1,065 is an error rate of only .1% (actually .093%), and thus a 99.9% accuracy rate, a/k/a a .1% elusion (the proportion of non-produced documents that are responsive). See Roitblat, H. L., The process of electronic discovery. Also, and this is very important to me, the one false negative document found was not important.
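A minimal sketch of the elusion arithmetic, using the sample numbers above:

```python
# Elusion: the proportion of non-produced (null set) documents that are
# actually responsive, estimated here from the quality-control sample.
sample_size = 1065       # random sample drawn from the null set
false_negatives = 1      # relevant documents found in that sample

elusion = false_negatives / sample_size   # point estimate, ~0.093%
accuracy = 1 - elusion                    # ~99.9%

print(f"elusion:  {elusion:.3%}")
print(f"accuracy: {accuracy:.3%}")
```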

For these reasons, I declared the search project a success and over. I was relieved and happy.

Recap – Driving a CAR at 13,444 Files Per Hour

I searched an Enron database of 699,082 documents over nine days. That was a Computer Assisted Review (“CAR”) using predictive coding methods and a hybrid multimodal approach. It took me 52 hours to complete the search project. (Day 1 – 8.5 hrs; day 2 – 3.5; day 3 – 4; 4 – 8; 5 – 4; 6 – 4; 7 – 7; 8 – 9; 9 – 4.) This means that my hybrid CAR cruised through the project at an average speed of 13,444 files per hour.
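The throughput figure is simple arithmetic over the daily hours just listed:

```python
# Review throughput across the nine review days of the project.
total_docs = 699_082
daily_hours = [8.5, 3.5, 4, 8, 4, 4, 7, 9, 4]   # days 1 through 9

total_hours = sum(daily_hours)           # 52 hours
docs_per_hour = total_docs / total_hours

print(total_hours, round(docs_per_hour))   # 52.0 13444
```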

That’s fast by any standard. If it were a car going 13,444 miles per hour, that would be over seventeen times the speed of sound.

This kind of review speed compares very favorably to the two other competing modes of search and review, manual linear review and keyword search. Both of these other reviews are computer assisted, but only marginally so.

The Model-T version of CAR is linear review.  (It is computer assisted only in the sense that the reviewer uses a computer to look at the documents and code them.) A good reviewer, with average speed-reading capacities, can attain review speeds of 50 documents per hour. That’s using straight linear review and the kind of old-fashioned software that you still find in most law firms today. You know, the inexpensive kind of software with few if any bells and whistles designed to speed up review. I have incidentally described some of these review enhancement features during this narrative. These enhancements, common to all top software on the market today, not just Kroll Ontrack’s Inview, made it possible for me to attain maximum document reading speeds of up to 1,200 files per hour (3 seconds per document) during the final null-set review. I am a pretty fast reader, and have over 32 years of experience in making relevancy calls on documents, but without these enhancements my review of documents can rarely go over 100 files per hour.

A reviewer at an average rate of 50 docs per hour would, assuming no breaks, take 1,382 hours to complete the project. As you have seen in this narrative, I completed the project in 52 hours. I did so by relying in a hybrid manner on my computer to work with me, under my direct supervision and control, to review most of the documents for me.

The comparison shows that manual review is at least twenty-six times slower than hybrid multimodal. I say at least because the manual review calculation does not include the need for second reviews and other quality control efforts, so in actuality a pure linear review would probably take over 1,700 man-hours.

So much for linear review, especially when testing shows that such manual review over large scales is no more accurate. See, e.g., Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review, Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010. In fact, the Roitblat, et al. study showed that a second set of professional human reviewers only agreed with the first set of reviewers of a large collection of documents 28% of the time, suggesting error rates with manual review of 72%!

Saving 92% (even with a billing rate twice as high)

Consider the costs of these CAR rides, which are central to my bottom-line-driven proportional review approach. It would be unfair to do a direct comparison and say that a manual review CAR costs 26 times more than a predictive coding CAR. Or, put another way, that the state-of-the-art predictive coding CAR costs 96.2% less than the Model-T. It is an unfair comparison because the billing rate of a predictive-coding-skilled attorney would not be the same as that of a linear document reviewer.

Still, even if you assumed the skilled reviewer charged twice as much, the predictive coding review would still cost 13 times less.

Let’s put some dollars on this to make it more real. If that manual reviewer, the old-fashioned Model-T-driving attorney, charged $250 per hour for his services, then the 1,382 hours would generate a fee of at least $345,500. On the other hand, at a double rate of $500 per hour, my 52 hours of work would cost the client $26,000. That represents a savings of $319,500.
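The fee arithmetic above can be checked in a few lines:

```python
# Fee comparison: linear review at $250/hr for 1,382 hours versus the
# predictive coding review at a doubled $500/hr rate for 52 hours.
manual_fee = 1_382 * 250       # $345,500
predictive_fee = 52 * 500      # $26,000

savings = manual_fee - predictive_fee   # $319,500
savings_pct = savings / manual_fee      # ~92%

print(savings, f"{savings_pct:.0%}")
```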

My review, even at double rates, still cost only 8% of what the old-timey low-rates lawyer would have charged. That is a 92% savings.

This is significantly more than the estimate of a 75% savings made in the Rand Report, but in the same dramatic-savings neighborhood. Where The Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery (2012); also see my blog on the Rand Report. I wonder when insurers are going to catch on to this?

Griswold’s Keyword Car

But what about the reviewer driving the keyword search CAR, the gas guzzler that seemed so cool in the 90s? What if contract reviewers were used for the first review, and full-fee lawyers used only for heavy lifting and final review? Yes, it would be cheaper than all-manual linear review. But by how much? And, here is the most important part, at what cost to accuracy? How would the Griswold keyword wagon compare to the 2012 hybrid CAR with a predictive coding search engine?

First, let’s give the Griswolds some credit. Keyword search was great when it was first used by lawyers for document review in the 1990s. It sure beat looking at everything. Use of keyword search culling to limit review to the documents with keyword hits limited the number of documents to be reviewed and thus limited the cost. It is obviously less expensive than linear review of all documents. But, it is still significantly more expensive than multimodal predictive coding culling before review. Importantly, keyword search alone is also far less accurate.

I have seen negotiated keyword search projects recently where manual review of the documents with hits showed that 99% of them were not relevant. In other words, the requesting party’s keywords produced an astonishingly low precision rate of 1%. And this happened even though the keywords were tested (at least somewhat), hit-count metrics were studied, several proposed terms were rejected, and a judge (arbitrator) was actively involved. In other words, it was not a completely blind Go Fish keyword guessing game.

In that same case, after I became involved, the arbitrator then approved predictive coding (yes, not all such orders are published, nor the subject of sensationalist media-feeding frenzies). I cannot yet talk about the specifics of the case, but I can tell you that the precision rate went from 1% using keywords, to 68% using predictive coding. Perhaps someday I will be able to share the order approving predictive coding and my reports to the tribunal on the predictive coding search. Suffice it to say that it went much like this Enron search, but the prevalence and yield were much higher in that project, and thus the number of relevant documents found was also much higher.

But don’t just take my word for it on cost savings. Look at case law where keyword search was used along with contract reviewers. In re Fannie Mae Securities Litigation, 552 F.3d 814 (D.C. Cir. 2009). True, the keyword search in the case was poorly done, but they did not review everything. The DOJ lawyers reviewed 660,000 emails and attachments with keyword hits at a cost of $6,000,000. The DOJ only did the second reviews and final quality control. Contract lawyers did the first review, and yet it still cost $9.09 per document.

Further, in the Roitblat, et al Electronic Discovery Institute study a review of 2.3 million documents by contract reviewers cost $14,000,000. This is a cost of $6.09 per document. This compares with my review of 699,082 documents for $26,000. The predictive coding review cost less than four cents a document. Also see Maura Grossman & Gordon Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Rich. J.L. & Tech., Spring 2011.
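The per-document figures reduce to three divisions, using the numbers from the case and study just cited:

```python
# Cost per document for the three reviews discussed above.
fannie_mae_keyword = 6_000_000 / 660_000     # DOJ keyword-culled review
edi_linear_study   = 14_000_000 / 2_300_000  # contract-reviewer linear review
this_hybrid_car    = 26_000 / 699_082        # hybrid predictive coding review

for label, cost in [("keyword", fannie_mae_keyword),
                    ("linear", edi_linear_study),
                    ("predictive", this_hybrid_car)]:
    print(f"{label}: ${cost:.2f} per document")
```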

That is the bottom line: four cents per document versus six dollars and nine cents per document. That is the power of predictive culling and precision. It is the difference between a hybrid, predictive coding, targeted approach with high precision, and a keyword search, gas-guzzler, shotgun approach with very low precision. The recall rates are also, I suggest, at least as good, and probably better, when using far more precise predictive coding, instead of keywords. Hopefully my lengthy narrative here of a multimodal approach, including predictive coding, has helped to show that. Also see the studies cited above and my prior trilogy Secrets of Search: Parts One, Two, and Three.

92% Savings Is Not Possible Under Real World Conditions

In future articles I may opine at length on how my review of the Enron database was able to achieve such dramatic cost savings, 92% ($26,000 vs. $345,500.) Suffice it to say for now that I do not think this kind of 92% savings is possible in real world conditions, that 50%-75% is more realistic.

Even then, the 50%-75% savings assumes a modicum of cooperation between the parties. My review was done with maximum system efficiency, and thus resulted in maximum savings, because I was the requesting party, the responding party, the reviewer, the judge, and appeals court all rolled into one. There was no friction in the system. No vendor costs. No transaction costs or delays. No carrying costs. No motion costs. No real disagreements, just dialogue (and inner dialogue at that).

In the real world there can be tremendous transaction costs and inefficiencies caused by other parties, especially the requesting party’s attorney, called opposing counsel for a reason. Often opposing counsel object to everything and anything without thinking, or any real reason, aside from the fact that if you want it, that means it must be bad for their client. This is especially true when the requesting party’s legal counsel have little or no understanding of legal search.

Sometimes the litigation friction costs are caused by honest disagreements, such as good faith disagreements on scope of relevance. That is inevitable and should not really cost that much to work out and get rulings on. But sometimes the disagreements are not in good faith. Sometimes the real agenda of a requesting party is to make the other side’s e-discovery as expensive as possible.

Unfortunately, anyone who wants to game the system to intentionally drive up discovery costs can do so. The only restraint on this is an active judiciary. With a truly dedicated obstructionist the 50%-75% savings from predictive coding could become far less, even nil. Of course, even without predictive coding as an issue, a dedicated obstructionist will find a way to drive up the costs of discovery. Discovery as abuse did not just spring up last year. See Judge Frank H. Easterbrook, Discovery As Abuse, 69 B.U. L. REV. 635 (1989). That is just how some attorneys play the game and they know a million ways to get away with it.

From my perspective as a practicing attorney it seems to be getting worse, not better, especially in high-stakes contingency cases. I have written about this quite a few times lately without dealing with case specifics, which, of course, I cannot do. See, e.g.:

These transaction costs, including especially the friction inherent in the adversarial system, explain the difference between a 92% savings in an ideal world, and a 75%-50% savings in a real world, under good conditions, or perhaps no savings at all under bad conditions.

I readily admit this, but consider the implications of this observation. Consider the heavy price the adversary system imposes on legal search. Craig Ball, who, like me, is no stranger to high-stakes contingency litigation, recently made a good observation on human nature that sheds light on this situation in his LTN article Taking Technology-Assisted Review to the Next Level:

It’s something of a miracle that documentary discovery works at all. Discovery charges those who reject the theory and merits of a claim to identify supporting evidence. More, it assigns responsibility to find and turn over damaging information to those damaged, trusting they won’t rationalize that incriminating material must have had some benign, non-responsive character and so need not be produced. Discovery, in short, is anathema to human nature.

A well-trained machine doesn’t care who wins, and its “mind” doesn’t wander, worrying about whether it’s on track for partnership.

What, dear readers, do you see as an option to our current adversarial-based system of e-discovery? What changes in our system might improve the efficiency of legal search and thus dramatically lower costs? Although I am grateful to the many attorneys and judges laboring over still more rule changes, I personally doubt that more band-aid tweaks to our rules will be sufficient. We are, after all, fighting against human nature as Craig Ball points out.

I suspect that a radical change to our current procedures may be necessary to fix our discovery system, that technology and rule tweaks alone may be inadequate. But I will save that thought for another day. It involves yet another paradigm shift, one that I am sure the legal profession is not yet ready to accept. Let’s just say the Sedona Conference Cooperation Proclamation is a step in that direction. For more clues read my science fiction about what legal search might be like in 50 years: A Day in the Life of a Discovery Lawyer in the Year 2062: a Science Fiction Tribute to Ray Bradbury. In the meantime, I look forward to your comments on this overall search project, on my final quality control check, and on the implications for what may come next for legal search.

In the Interests of Science

When I first wrote this narrative I planned to end at this point. The last paragraph was to be my last words on this narrative. That would have been in accord with real-world practices in legal search and review, where the project ends with a final quality control check and production. The 659 documents identified as relevant to involuntary employee termination would be produced, and, in most cases, that would be the end of it.

In legal practice you do not look back (unless the court orders you to). You make a decision and you implement. Law is not a science. It is a profession where you get a job done under tight deadlines and budgets. You make reasonable efforts and understand that perfection is impossible, that perfect is the enemy of the good.

But this is not a real world exercise. If it was, then confidentiality duties would not have allowed me to describe my work to begin with. This is an academic exercise, a scientific experiment of sorts. Its purpose is training, to provide the legal community with greater familiarity with the predictive coding process. For that reason I am compelled to share with you my thoughts and doubts of last week, in late July 2012, when I was rewriting and publishing Days Seven and Eight of the narrative.

I started to wonder in earnest whether my decision to stop after five rounds of predictive coding was correct. I described the decision and rationale in my Day Eight narrative. As I concluded in the Enough Is Enough heading: I was pretty sure that further rounds of search would lead to the discovery of more relevant documents, but thought it very unlikely any more significant relevant documents would be found. But now I am having second thoughts.

Troubling Questions

What if I was wrong? What if running another round would have led to the discovery of more significant relevant documents, and not just cumulative, insignificant relevant documents as I thought? What if a bunch of hot documents turned up? What if a whole new line of relevance was uncovered?

I also realized that it would only take a few more hours to run a sixth round of predictive coding and find out. Thanks to the generosity of Kroll Ontrack, the database was still online and intact. I could do it. But should I do it? Should I now take the time to test my decision? Was my decision to stop after five rounds right, or was it wrong? And if it was wrong, how wrong was it?

I knew that if I now tested the decision by running a sixth round, the test would provide more information on how predictive coding works, on how a lawyer’s use of it works. It would lead to more pieces of truth. But was it worth the time, or the risk?

Chance and Choice

The personal risks here are real. Another round could well disprove my own decision. It could show that I was mistaken. That would be an embarrassing setback, not only for me personally, but also for the larger, more important cause of encouraging the use of advanced technology in legal practice. As I said in Day One of the narrative, I took the time to do this in the hope that such a narrative will encourage more attorneys and litigants to use predictive coding technology. If I now go the extra mile to test my own supposition, and the test reveals failure and delusion on my part, what would that do for the cause of encouraging others to take up the gauntlet? Was my own vanity now forcing me to accept needless risks that could not only harm myself, but others?

Of course, I could do the experiment and reveal it only if it was positive, or at least not too embarrassing, and hide it if it was embarrassing. That way I could protect my own reputation and protect the profession. But I knew that I could never live with that. I knew that if I ran the experiment, then no matter how embarrassing the results proved to be, there was no way I could hide them and still keep my self-respect. I knew that it would be better to be humbled than be a fraud. I knew that if I did this, if I took the time to go back and double-check my decision, I would have to go all the way, pride and professional reputation be damned. I would have to tell all. If it was a story of delusion that discouraged other lawyers from adopting technology, then so be it. Truth should always triumph. Maybe other lawyers should be discouraged. Maybe I should be more skeptical of my own abilities. After all, even though I have been doing legal search in one form or another all my career, I have only been doing predictive coding for a little over a year.

Of course, I did not have to run the test at all. No one but a few folks at Kroll Ontrack would even know that it was still possible to do so. Everyone would assume that the database had been taken down. By any logical analysis I should not run this test. I had little to gain if the test worked and confirmed my theory, and much to lose if it did not. Reason said I should just walk away and stick to my plan and end the narrative now. No one would ever know, except of course, I would know. Damn.

As I write this I realize that I really have no choice. I have to take the chance. A clean conscience is more important than a puffed ego, more important even than encouragement of the profession to adopt predictive coding. Anyway, what good is such encouragement if it is based on a lie, or even just an incomplete truth? I do not want to encourage a mistake. Yes, it means more work, more risk. But I feel that I have to do it. I choose to take a chance.

As I write this, I have not yet performed this experiment, and so I have no idea how it will turn out. But tomorrow is another day, the tenth day, wherein I will step outside of my normal protocol. I will run a sixth round of predictive coding to test and evaluate my decision to stop after five rounds.

To be continued . . . .

22 Responses to Day Nine of a Predictive Coding Narrative: A scary search for false-negatives, a comparison of my CAR with the Griswold’s, and a moral dilemma

  1. Ralph,

    As you know, the completeness of a search is measured by recall — the fraction of responsive documents that are returned by the search.

    Your search found 659 responsive documents. To estimate recall, you need an estimate of the total number of responsive documents in the collection.

    Your sample of 1,065 documents contained one responsive document. A binomial confidence interval calculator (x=1, n=1065) shows the confidence interval for that sample (at the 95% confidence level) to be 0.0025% – 0.52%; that is, with 95% confidence, the number of missed relevant documents is between 18 and 3,629.

    In other words, the total number of responsive documents in the collection is between 677 (659+18) and 4,288 (659+3,629), and your recall is between 15.4% (659/4,288) and 97.34% (659/677).
    In short, because the prevalence of responsive documents in the collection is so low, the sampling you did to validate your search process does not show that your search was adequate. The most you can conclude from this sample alone is that, with 95% confidence, your recall is at least 15.4%.
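The interval above can be reproduced with an exact (Clopper-Pearson) computation. The sketch below uses only the Python standard library, finding each bound by bisection on the binomial CDF rather than calling a beta-inverse routine:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, conf=0.95):
    """Exact (Clopper-Pearson) two-sided confidence interval for a
    binomial proportion, found by bisection on the binomial CDF."""
    alpha = (1 - conf) / 2

    def bisect(below_bound):
        lo, hi = 0.0, 1.0
        for _ in range(60):             # 60 halvings: ample precision
            mid = (lo + hi) / 2
            if below_bound(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= x) drops to alpha.
    lower = 0.0 if x == 0 else bisect(lambda p: 1 - binom_cdf(x - 1, n, p) < alpha)
    # Upper bound: the p at which P(X <= x) drops to alpha.
    upper = 1.0 if x == n else bisect(lambda p: binom_cdf(x, n, p) > alpha)
    return lower, upper

lo, hi = clopper_pearson(x=1, n=1065)
print(f"{lo:.4%} to {hi:.4%}")   # roughly 0.0024% to 0.52% of the null set
```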


    • Ralph Losey says:

      Thanks for your comment Gordon,

      I want to respond on some issues here in the hope that it will stimulate further dialogue and input. I would especially like to hear further comments on the “elusion” test and the distinctions I make below in response to Gordon’s comments on testing a null set versus a random sample “recall” test, and also on the distinction I make in quality control at or near the end of a search among merely relevant, significant relevant, and highly relevant documents.

      1. First, I think there may have been some miscommunication here. I was not attempting to do a recall measurement in the quality control test in Day Nine that you commented on. It was instead a type of quality control test for “elusion” as used by Herb Roitblat and others. I am not sure that sampling the excluded set only (not including the documents I categorized as relevant), which is what I did here, is even a proper sample for a recall test. Would not a correct recall measurement involve a sample of the entire corpus? I have heard varying opinions from your colleagues on that, and in any event, as I stated, I was not attempting to measure recall here.

      2. I did do a proper sample of the entire corpus at the beginning of the review as described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane. There I described how I found 2 relevant documents in a random sample of 1,507 documents from the entire corpus, generating a projected yield of 928 (.13%*699,082=928). I explained how this was a spot or point projection, my goal for retrieval, and with William Webber’s help in the calculations, I explained that the interval for this projection using 95% +/- 3% is between 112 and 3,345 relevant documents. (Actually, to be technically correct, I should have used a confidence interval of 2.52% since KO over-sampled, and my random sample was of 1,507 documents from a corpus of 699,082, which would make the range even tighter.) I revised the post after its first publication, and perhaps you did not see the revised version with William’s full comments and interval range calculations?

      I discussed recall, not elusion, at the end of the Eighth Day in Days Seven and Eight of a Predictive Coding Narrative. I noted that the 659 relevant documents found represented a 71% recall (659/928) of my target spot projection. I also went on to note the point that you appear to be making here: that this was just a spot in a target range caused by the confidence interval of between 112 and 3,345 documents. It could be 100% recall (although, as I said, I did not think that to be true), or, worst-case scenario, which you have focused on, it could be only 20% recall (659/3345).

      3. The elusion test that you commented on was not aimed at finding merely-relevant documents. Its purpose was not to make a recall measurement as to the original binary task of relevant or irrelevant. Indeed, I expected to find some merely relevant documents and knew that recall was not 100%. But I did think that I had found all Highly Relevant documents and what was left was merely relevant and thus of no real legal probative value. In other words, that I had reached a proportional-efforts tipping point, and further search would not be worth the time or trouble from the point of view of evidentiary and discovery value.

      The elusion test and sample described in Day Nine was to find by random sample Highly Relevant, or otherwise significant relevant documents. I had in effect changed the search parameters from a simple binomial to a more subtle weighted relevance measure. I explained that significant relevant documents would include those that were not only relevant, but of a new type, and not just cumulative, that is, of a type discovered before. I explained that if I found even one Highly Relevant document, I would consider that a failed test and require more training. I considered discovery of a significant relevant document, but not in itself Highly Relevant in nature, to be a possible trigger of further rounds depending on its exact nature. Certainly discovery of several such significant documents, even though not Highly Relevant, would be a test-failure event triggering the need for further rounds of training.

      To be perfectly accurate the results of my elusion search were ZERO – 0 – no false negatives. No highly relevant documents were found, and no significant relevant documents were found. All that was found was one mere-relevant document of a cumulative nature of no importance to my view of the discovery project. Plug zero into your mentioned calculators and see what happens!
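The yield and recall arithmetic discussed in points 2 and 3 above reduces to a few divisions; a minimal sketch using the narrative's own figures:

```python
# Yield projection from the initial random sample of the entire corpus
# (point 2 above), and the recall figures discussed against it.
corpus = 699_082
sample_n = 1507
sample_relevant = 2

prevalence = sample_relevant / sample_n   # ~0.13%
yield_point = prevalence * corpus         # spot projection, ~928 documents

found = 659
recall_vs_point = found / yield_point     # ~71% of the spot projection
recall_worst = found / 3345               # ~20% at the interval's upper bound

print(round(yield_point), f"{recall_vs_point:.0%}", f"{recall_worst:.0%}")
```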

      I hope others will join in this discussion of quality control efforts.

      • Ralph,

        Did you state the number of highly relevant documents that you found? If you tell me that number, I can compute a recall (of highly relevant documents) confidence interval for you. Or you can do it yourself. Just plug (x=0, n=1065) into the confidence interval calculator, and repeat the calculations, substituting the number of highly relevant documents you found, in place of 659.



      • Ralph Losey says:

        In both random samples, including the first one of 1,507 documents, I found ZERO Highly Relevant documents. In the actual multimodal searches that I performed I found a total of 18 Highly Relevant documents. Remember, I was searching for both relevant and highly relevant documents throughout the project, and testing for relevant documents of any kind or weight in the first sample to determine yield. I changed from binomial to flex-weighted relevance in the final elusion sample test only.

      • My question was unclear. How many highly relevant documents, overall, did your search find? I was aware that your samples contained no highly relevant documents.

      • Okay, I just noticed that you said you found a total of 18 highly relevant documents. Your sample of 1,065 contained 0 highly relevant documents. Using a binomial calculator, we find that, with 95% confidence, between 0% and 0.35% of the “null set” (or excluded documents) were highly relevant. That is, between 0 and 2,450 highly relevant documents were missed. So all you can claim based on your sample is that “with 95% confidence, no more than 2,450 highly relevant documents were missed.”

        If we translate that to recall, you can state “with 95% confidence, the recall for highly relevant documents was at least 0.73%.” (That is because you found 18 highly relevant documents during your search, and there may have been as many as 2,450 + 18 = 2,468 in total; 18/2,468 is 0.73%. Note that this is *not* a recall of 73%; it is a recall of 0.73%.)
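        Gordon’s arithmetic can be reproduced with a short script. This is my own sketch, not his actual calculator: it assumes a two-sided 95% Clopper-Pearson interval (with zero positives the upper bound reduces to 1 − (α/2)^(1/n)) and a round null-set size of 700,000, which is an assumption for illustration.

        ```python
        # Sketch of the confidence-interval arithmetic above (assumptions noted in text).
        alpha = 0.05
        n = 1065                                # size of the null-set sample
        p_upper = 1 - (alpha / 2) ** (1 / n)    # ~0.0035, i.e. about 0.35%

        null_set = 700_000                      # rough null-set size (assumed, for illustration)
        missed_upper = p_upper * null_set       # roughly 2,400-2,450 documents

        found = 18                              # highly relevant documents the search found
        recall_lower = found / (found + missed_upper)   # ~0.007, i.e. about 0.73%
        ```

        The lower bound on recall is so small precisely because the upper bound on missed documents, not the point estimate of zero, drives the calculation.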

      • Ralph Losey says:

        My sample of 1,507 documents from the entire corpus had no highly relevant, and my later sample of 1,065 documents from the null-set had no highly relevant. Zero in both random samples.

      • Ralph, Gordon,

        Hi! This is a very interesting discussion. The confidence interval that Gordon states, based on Ralph’s QA sample of 1,065 documents, is correct. One cannot easily incorporate the 1,507 documents in the initial sample into this estimate, since the production has been based upon this initial sample (it was used to train the classifier). And since (as I understand it) Ralph has reviewed all of the documents that are to be produced, there is no sampling uncertainty in that estimate, so Gordon’s calculation of an interval on recall is also correct. (If the produced set were itself sampled, a more complex interval on recall would be needed. See my tutorial on intervals on a proportion for a discussion of the simple case.)

        It is, unfortunately, very difficult to place a reassuring upper bound on a very rare event using random sampling. If, for instance, a judge were now to intervene and say “Ralph, I want your quality assurance sample to give 95% confidence that the 18 highly relevant documents you’ve found make up at least half the highly relevant documents in the collection”, then we’d need to sample somewhere in the region of 100,000 documents to achieve this level of confidence, even if none of these documents proved to be highly relevant.

        The broader question though is how such samples and confidence intervals should be incorporated into a quality assurance protocol. The confidence interval on recall makes no assumptions about the reliability of our production; that is part of the discipline of quality assurance sampling (and frequentist statistics in general). But we ourselves do have beliefs about the production: that the initial sample (however subsequently compromised as a sample) suggests highly relevant documents are rare; that Ralph is an experienced and insightful searcher; that the search and review process employed represents good practice; that appropriate keywords have been chosen; that the predictive coding algorithm is an effective classifier; and so forth. These beliefs are rational and reasonable beliefs, based upon past experience and current evidence. They cannot, though, be incorporated into our confidence interval, because they cannot be stated as precise and objective probabilities. On the other hand, though, we also can’t say that we subjectively believe that there is a 1 in 20 probability that there are 2450 or more highly relevant documents left in the collection, because the method that generated this estimate incorporates none of our prior beliefs.

        The pitfalls of merging multiple sources of evidence and experience emphasize the importance of the field adopting standards for good practice in document review, predictive coding, and quality assurance. Sampling and estimation will be a crucial part of such review protocols, but they will only be a part. Jason Baron’s goal of establishing an ANSI workgroup on quality standards in e-discovery is the way forward.

    • I understand and I stand by my calculations. Zero in the sample does not imply zero in the population.

      • Ralph Losey says:

        Of course, as I did find 18. One other point of clarification. My relevancy count answers may have been confusing. The 18 highly relevant were included in the total relevant count of 659. It was a sub-category of relevant. In other words, there were 641 relevant, plus 18 highly relevant, total 659.

  2. Two points:

    First, the argument for multilevel relevance coding appears to be reinforced by the comments above. It needs to become embedded in the pantheon of formal protocols. Long overdue.

    Second, given the usually minute ratio of relevant documents to the total document population, primarily using random sampling to quantify the likelihood of missing relevant documents — especially highly relevant documents that comprise the bulk of the evidentiary payload — seems a bit misdirected. This problem seems more akin to outlier detection. Statistician’s blinders perhaps.

    Machine learning in TAR is used to predictively classify documents based upon linguistic relationships in the content. No one sees the potential for predictive classification based upon relationships of other attributes, which can be derived from the relevant document set that has been discovered? Or is it thought that missed relevant documents are randomly dispersed in the no-hit set?

  3. John Hunt says:

    Heard Mark T Pappas has been brought back from Belize to face the music in a Baltimore courtroom. Think he pissed off Victor Stanley? How’s His Fuvista line doing now? Keep me posted as I’m loving all this! John Hunt

  4. […] who are following Ralph Losey’s live-blogged production of material on involuntary termination from the EDRM Enron collection will know that he has reached what was to be the quality assurance step (though he has decided to […]

  5. […] Day Nine of a Predictive Coding Narrative: A scary search for false-negatives, a comparison of my CA… by Ralph Losey. […]

  6. […] Day Nine of a Predictive Coding Narrative: A scary search for false-negatives, a comparison of my C…. […]

  7. Herbert L. Roitblat says:

    I’m sorry for being late to this conversation. I hope that this contribution is worthwhile.

    The goal of predictive coding, eDiscovery, or even information retrieval in general, is to separate the responsive from the nonresponsive documents. If we assume for a bit that there is some authoritative definition of responsiveness, then we can divide the decisions about every document into one of four categories: those that are called responsive and truly are (YY), those that are truly responsive but are called nonresponsive (YN), those that are truly nonresponsive but are called responsive (NY), and those that are truly nonresponsive and are called nonresponsive (NN). All measures of accuracy are derived from this 2 x 2 decision matrix (two rows and two columns showing all of the decision combinations).

    There are many measures of accuracy. These measures include: Recall (YY/(YY+YN)), Precision (YY/(YY+NY)), Elusion (YN/(YN+NN)), Fallout (NY/(NY+NN)), and Agreement ((YY+NN)/(YY+YN+NY+NN)). Richness or prevalence is (YY+YN)/(YY+YN+NY+NN). There are several different ways of combining the information in this matrix to get an accuracy level, but these alternatives are all derived from the decision matrix.
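    Since every measure in this taxonomy is a simple ratio over the four cells, the whole family can be sketched in a few lines. This is my own illustrative helper, not anyone’s library, and the cell counts in the example are hypothetical:

    ```python
    def matrix_measures(yy, yn, ny, nn):
        """Accuracy measures derived from the 2x2 decision matrix.
        Naming follows the text: first letter is the true status
        (Y = responsive), second letter is the call made."""
        total = yy + yn + ny + nn
        return {
            "recall":     yy / (yy + yn),
            "precision":  yy / (yy + ny),
            "elusion":    yn / (yn + nn),
            "fallout":    ny / (ny + nn),
            "agreement":  (yy + nn) / total,
            "prevalence": (yy + yn) / total,
        }

    # Hypothetical counts, chosen only to exercise the formulas:
    m = matrix_measures(yy=80, yn=20, ny=40, nn=860)
    # recall 0.8, precision ~0.667, elusion ~0.023, agreement 0.94
    ```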

    Recall and Precision are the two most commonly used measures, but they are not the only ones. The right measure to use is determined by the question that you are trying to answer and by the ease of asking that question.

    Recall and elusion are both designed to answer the question of how complete we were at retrieving all of the responsive documents. Recall explicitly asks “of all of the responsive documents in the collection, what proportion (percentage) did we retrieve?” Elusion explicitly asks “What proportion (percentage) of the rejected documents were truly responsive?” As recall goes up, we find more of the responsive documents; elusion then necessarily goes down, because there are fewer responsive documents left to find in the reject pile. For a given prevalence or richness, as the YY count goes up (raising Recall), the YN count has to go down (lowering Elusion). As the conversation around Ralph’s report of his efforts shows, it is often a challenge to measure recall.

    Measuring recall requires you to know or estimate the total number of responsive documents. In the situation that Ralph describes, responsive documents were quite rare, estimated at around 0.13% prevalence. One method that Ralph used was to relate the number of documents his process retrieved to his estimated prevalence. He took as his estimate of Recall the proportion that the retrieved documents made up of the estimated total number of responsive documents in the collection, as determined by an initial random sample.

    Unfortunately, there is considerable variability around that prevalence estimate. I’ll return to that in a minute. He also used Elusion when he examined the frequency of responsive documents among those rejected by his process. As I argued above, Elusion and Recall are closely related, so knowing one tells us a lot about the other.

    One way to use Elusion is as an accept-on-zero quality assurance test. You specify the maximum acceptable level of Elusion, as perhaps some reasonable proportion of prevalence. Then you feed that value into a simple formula to calculate the sample size you need (published in my article in the Sedona Conference Journal, 2007). If none of the documents in that sample comes up responsive, then you can say with a specified level of confidence that responsive documents did not occur in the reject set at a higher rate than was specified. As Gordon noted, the absence of a responsive document does not prove the absence of responsive documents in the collection.
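    The article’s exact formula is not quoted here, but the standard accept-on-zero calculation solves (1 − p)ⁿ ≤ 1 − confidence for n; a sketch under that assumption:

    ```python
    import math

    def accept_on_zero_sample_size(max_rate, confidence=0.95):
        """Smallest sample size n such that, if a random sample of n
        documents contains zero responsive ones, we can say with the
        given confidence that the true rate is below max_rate."""
        return math.ceil(math.log(1 - confidence) / math.log(1 - max_rate))

    # E.g., to certify elusion below 1% at 95% confidence, all documents
    # in a sample of this size must come up nonresponsive:
    n_needed = accept_on_zero_sample_size(0.01)   # 299
    ```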

    If you want to directly calculate the recall rate after your process, then you need to draw a large enough random sample of documents to get a statistically useful sample of responsive documents. Recall is the proportion of responsive documents that have been identified by the process. The 95% confidence range around an estimate is determined by the size of the sample set. For example, you need about 400 responsive documents to know that you have measured recall with a 95% confidence level and a 5% confidence interval. If only 1% of the documents are responsive, then you need to work pretty hard to find the required number of responsive documents. The difficulty of doing consistent review only adds to the problem. You can avoid that problem by using Elusion to indirectly estimate Recall.
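    The “about 400 responsive documents” figure matches the standard worst-case sample-size formula for estimating a proportion, n = z²p(1−p)/E²; this sketch assumes that formula rather than any particular source:

    ```python
    import math

    z = 1.96   # z-score for 95% confidence
    e = 0.05   # +/-5% confidence interval
    p = 0.5    # worst-case (most conservative) proportion

    # Responsive documents needed to measure recall at 95% +/- 5%:
    n = math.ceil(z**2 * p * (1 - p) / e**2)   # 385, i.e. roughly 400
    ```

    At 1% prevalence, finding 385 responsive documents by random sampling would mean reviewing on the order of 40,000 documents, which is why the indirect Elusion route is attractive.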

    One way to assess what Ralph did is to compare the prevalence of responsive documents in the set before doing predictive coding with their prevalence after using predictive coding to remove as many of the responsive documents as possible. Is there a difference? An ideal process will have removed all of the responsive documents, so there will be none left to find in the reject pile.

    That question of whether there is a difference leads me to my second point. When we use a sample to estimate a value, the size of the sample dictates the size of the confidence interval. We can say with 95% confidence that the true score lies within the range specified by the confidence interval, but not all values are equally likely. A casual reader might be led to believe that there is complete uncertainty about scores within the range, but values very near to the observed score are much more likely than values near the ends of the confidence interval. The most likely value, in fact, is the center of that range, the value we estimated in the first place. The likelihood of scores within the confidence interval corresponds to a bell-shaped curve.

    Moreover, we have two proportions to compare, which affects how we use the confidence interval. We have the proportion of responsive documents before doing predictive coding. The confidence interval around that score depends on the sample size (1,507) from which it was estimated. We have the proportion of responsive documents after predictive coding. The confidence interval around that score depends on its sample size (1,065). Assuming that these are independent random samples, we can combine the confidence intervals (consult a basic statistics book for a two-sample z or t test) and determine whether these two proportions are different from one another (0.133% vs. 0.095%). When we do this test, even with the improved confidence interval, we find that the two scores are not significantly different at the 95% confidence level. In other words, the predictive coding done here did not significantly reduce the number of responsive documents remaining in the collection. The initial proportion of 2/1,507 was not significantly higher than 1/1,065. The number of responsive documents we are dealing with in our estimates is so small, however, that a failure to find a significant difference is hardly surprising.
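    The test described above can be sketched as a standard pooled two-proportion z-test; this is my own illustration, not a specific calculator:

    ```python
    import math

    def two_proportion_z(x1, n1, x2, n2):
        """Pooled two-sample z statistic for the difference of proportions."""
        p1, p2 = x1 / n1, x2 / n2
        pooled = (x1 + x2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    # Before predictive coding: 2 responsive in 1,507; after: 1 in 1,065.
    z = two_proportion_z(2, 1507, 1, 1065)
    # |z| is well under the 1.96 critical value, so the difference
    # is not significant at the 95% confidence level.
    ```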

    Still, there is other information that we can glean from this result. The relative difference between the two proportions is approximately 28%. Predictive coding reduced by 28% the number of responsive documents unidentified in the collection. Recall, therefore, is also estimated to be 28%. Further, we can use the information we have to compute the precision of this process as approximately 22%. We can use the total number of documents in the collection, prevalence estimates, and elusion to estimate the entire 2 x 2 decision matrix.
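    The 28% figure can be reproduced from the two sample proportions with the Elusion-based recall estimate (recall ≈ 1 − Elusion/Prevalence); a sketch using the narrative’s numbers:

    ```python
    prevalence = 2 / 1507   # responsive rate before predictive coding (~0.133%)
    elusion = 1 / 1065      # responsive rate in the null set afterward (~0.094%)

    # Elusion-based recall estimate: the fraction by which predictive
    # coding reduced the rate of unfound responsive documents.
    recall_est = 1 - elusion / prevalence   # ~0.29, i.e. roughly 28-29%
    ```

    With only 2 and 1 responsive documents in the underlying samples, this point estimate is of course very noisy, which is the comment’s own caveat.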

    For eDiscovery to be considered successful we do not have to guarantee that there are no unidentified responsive documents, only that we have done a reasonable job searching for them. The observed proportions do have some confidence interval around them, but they remain as our best estimate of the true percentage of responsive documents both before predictive coding and after. We can use this information and a little basic algebra to estimate Precision and Recall without the huge burden of measuring Recall directly.

  8. […] But enough fun with Hunger Games, Search Quadrant terminology, nothingness, and math, and back to Herb Rotiblat’s long comment on my earlier blog, Day Nine of a Predictive Coding Narrative. […]

  9. […] year with multimodal search allowed me to review 699,082 Enron documents in just 52 hours. See Day Nine of a Predictive Coding Narrative. But perhaps a monomodal approach that just used predictive coding would have taken far less time […]

  10. Susmita Ramani says:

    Do you know of a quick and easy primer on appropriate sampling methods – what is and is not okay? Thank you!

  11. […] Nine of a Predictive Coding Narrative: A Scary Search for False Negatives (And More) – (Ralph […]

  12. mikerossander says:

    Apologies that I am very late to this thread but there is a factor that I think was missed in the cost-benefit calculation – that is, the size of the total document set.

    To this point, how many documents did you put eyes on compared to the total number of documents in the data set? (Your argument that this would be the logical stopping point in a real production is compelling so the cost-benefit calculations should be based on reviews to here as well.)

    If I’m following the thread correctly, you tagged 2663 for training and another 939 for QA but I haven’t been able to tally the ones you read and rated but excluded from training. For the sake of argument, I’ll swag the total of documents reviewed at 5000. That works out to 96 docs per hour or $5.20 per document. Those seem low given a) the doubled rate and b) the high standard of review that you conducted.
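    The per-document arithmetic implies a billing rate of about $500 per hour, which is my inference rather than a figure the commenter states; a sketch:

    ```python
    docs_reviewed = 5000    # the commenter's "swag" of total eyes-on documents
    hours = 52              # total review hours reported in the narrative
    hourly_rate = 500       # assumed billing rate in dollars (hypothetical)

    docs_per_hour = docs_reviewed / hours          # ~96 documents per hour
    cost_per_doc = hourly_rate / docs_per_hour     # ~$5.20 per document
    ```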

    However many you put eyes-on, I think that number is likely to be close to a fixed cost for a production. That is, for any data set of similar distribution, you will have to go through about the same number of iterations and put eyes on about the same number of documents whether there are 100,000 in the total population or 100,000,000.

    Okay, it’s not perfectly fixed, since population size is a factor in the original sampling, but it doesn’t scale linearly, either.

    My point is that your 92% cost improvement and the extrapolated 13,444 documents per hour are artifacts of the total population size as much as anything else. If the data set had started with only 300,000 documents, your cost savings compared to a linear review would still have been positive but far lower.

    I think this is an important line of reasoning to explore because it might yield some benchmarks about cases that are too small to justify the cost/effort of predictive coding.
