- Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron.
- Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane.
- Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree.
- Days Five and Six of a Predictive Coding Narrative: Deep into the weeds and a computer mind-meld moment.
- Days Seven and Eight of a Predictive Coding Narrative: Where I have another hybrid mind-meld and discover that the computer does not know God.
In this sixth installment I continue my description, this time covering day nine of the project. Here I do a quality control review of a random sample to evaluate my decision in day eight to close the search.
Ninth Day of Review (4 Hours)
I began by generating a random sample of 1,065 documents from the entire null set of all documents not reviewed (95% confidence level, ±3% confidence interval). I was going to review this sample as a quality control test of the adequacy of my search and review project. I would personally review all of them to see if any were False Negatives, in other words, relevant documents, and, if relevant, whether any were especially significant or Highly Relevant.
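For readers who want to check where the 1,065 figure comes from, the sample-size arithmetic can be sketched in a few lines of Python. This is my reconstruction using the usual normal-approximation formula with a finite population correction and worst-case 50% prevalence; the review platform may compute it slightly differently.

```python
def sample_size(population: int, z: float = 1.96, margin: float = 0.03,
                p: float = 0.5) -> int:
    """Sample size at a given confidence (z = 1.96 for 95%) and margin of
    error, with finite population correction. p = 0.5 is the worst-case
    prevalence assumption, which maximizes the required sample."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population size (~1,067)
    return round(n0 / (1 + (n0 - 1) / population))

print(sample_size(699_082))  # → 1065
```

Note that for a population this large the finite population correction barely matters; the required sample size is driven almost entirely by the confidence level and margin of error, not by the size of the null set.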
I was looking to see if there were any documents left on the table that should have been produced. Remember that I had already personally reviewed all of the documents that the computer had predicted were likely to be relevant (51% probability). I considered the upcoming random sample review of the excluded documents to be a good way to check the accuracy of reliance on the computer’s predictions of relevance.
I know it is not the only way, and there are other quality control measures that could be followed, but this one makes the most sense to me. Readers are invited to leave comments on the adequacy of this method and other methods that could be employed instead. I have yet to see a good discussion of this issue, so maybe we can have one here.
If my decision in day eight to close the search was correct, then virtually all of the predicted irrelevant files should be irrelevant. For that reason I expected the manual review of the null set to go very fast. I expected to achieve speeds of up to 500 files per hour and to be able to complete the task in a few hours. Well, anyway, that was my hope. I was not, however, going to rush or in any way deviate from my prior review practices.
To be honest, I also hoped that I would not discover any Hot (Highly Relevant) documents in the null set. If I did, that would mean that I would have to go back and run more learning sessions. I would have to keep working to expand the scope so that the next time there would be no significant False Negatives. I was well aware of my personal prejudice not to find such documents, and so was careful to be brutally honest in my evaluation of documents. I wanted to be sure that I was consistent with past coding, that I continued the same evaluation standards employed throughout the project. If that led to discovery of hot documents and more work on my part, then so be it.
Scope of Null Set
I began the Null Set review by noting that the random sample picked up some documents that had already been categorized as Irrelevant, as expected. I could have excluded them from the Null Set, but that did not seem appropriate, as I wanted the sample to be completely random from “all excluded” documents, whether previously categorized or not. But I could be wrong on that principle and will seek input from information scientists on the issue. What do you think? Scientist or not, feel free to leave a comment below. In any event, I do not think it makes much difference, as only 126 of the randomly selected documents had been previously categorized.
Review of the Null Set
Next I sorted by file type to look for any obviously irrelevant documents I could bulk tag. None were found. I did see one PowerPoint and was surprised to find it had slides pertaining to layoffs, both voluntary and involuntary, as part of the Enron bankruptcy (control number 12114291).
Following my prior rules of relevance I had to conclude this document was relevant under the expanded scope I had been using at the end, although it was not really important, and certainly not Highly Relevant. It looked like this might be a privileged document too, but that would not make any difference to my quality control analysis. It still counted.
By itself the document was not significant, but I had just started the review and already found a relevant document, a false-negative. If I kept finding documents like this I knew I was in trouble. My emotional confidence in the decision to stop the search had dropped considerably. I began bracing for the possibility of several more days of work to complete the project.
I then used a few other sort techniques for some bulk coding. Sorting by the “From” field turned up a few obvious junk documents based on sender. Note that using the shortcut keys can help with speed. I especially like shifting into and out of Power Mode (for review) with F6, and then using the ALT+arrow keys for rapid movement, especially from one doc to the next. Keeping your hand positioned over the keys like a video game allows for very rapid irrelevancy tagging and movement between documents. You can do up to 20 individual docs per minute that way (3 seconds per doc), if the connection speed is good.
Most of these irrelevant docs are obvious and only a quick glance allows you to confirm this, so that is why you can get up to a 3 seconds per doc coding rate, even without mass categorization. Only a few in the null set required careful reading, where it may take a minute, but rarely more, to determine relevance.
This review took a bit longer than expected, primarily because I was in the office and kept getting interrupted. Starting and stopping always slows you down (except for periodic attention breaks, which actually speed you up). Not including the interruptions, it still took 4 hours to review these 1,065 documents. That means I “only” went about 266 files per hour.
The good news is that I did not find another relevant document, or even an arguably relevant document. One false negative out of 1,065 is an error rate of only .1% (actually .094%), and thus 99.9% accuracy, a/k/a .1% elusion (the proportion of non-produced documents that are responsive). See Roitblat, H. L., The process of electronic discovery. Also, and this is very important to me, the one false negative document found was not important.
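The elusion arithmetic above is simple division, and can be verified in a couple of lines:

```python
false_negatives = 1        # relevant documents found in the null set sample
null_set_sample = 1_065    # size of the random sample reviewed

elusion = false_negatives / null_set_sample   # proportion of sampled null set that was relevant
accuracy = 1 - elusion
print(f"elusion: {elusion:.4%}, accuracy: {accuracy:.1%}")  # → elusion: 0.0939%, accuracy: 99.9%
```

Keep in mind this is a point estimate from the sample; the true elusion rate across the whole null set carries the sampling margin of error discussed earlier.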
For these reasons, I declared the search project a success and over. I was relieved and happy.
Recap – Driving a CAR at 13,444 Files Per Hour
I searched an Enron database of 699,082 documents over nine days. That was a Computer Assisted Review (“CAR”) using predictive coding methods and a hybrid multimodal approach. It took me 52 hours to complete the search project. (Day 1 – 8.5 hrs; day 2 – 3.5; day 3 – 4; 4 – 8; 5 – 4; 6 – 4; 7 – 7; 8 – 9; 9 – 4.) This means that my hybrid CAR cruised through the project at an average speed of 13,444 files per hour.
That’s fast by any standard. If those files per hour were miles per hour, I would have been going over seventeen times the speed of sound.
This kind of review speed compares very favorably to the two other competing modes of search and review, manual linear review and keyword search. Both of these other reviews are computer assisted, but only marginally so.
The Model-T version of CAR is linear review. (It is computer assisted only in the sense that the reviewer uses a computer to look at the documents and code them.) A good reviewer, with average speed-reading capacities, can attain review speeds of 50 documents per hour. That’s using straight linear review and the kind of old-fashioned software that you still find in most law firms today. You know, the inexpensive kind of software with few if any bells and whistles designed to speed up review. I have incidentally described some of these review enhancement features during this narrative. These enhancements, common to all top software on the market today, not just Kroll Ontrack’s Inview, made it possible for me to attain maximum document reading speeds of up to 1,200 files per hour (3 seconds per document) during the final null-set review. I am a pretty fast reader, and have over 32 years of experience in making relevancy calls on documents, but without these enhancements my review of documents can rarely go over 100 files per hour.
A reviewer at an average rate of 50 docs per hour would, assuming no breaks, take 1,382 hours to complete the project. As you have seen in this narrative, I completed the project in 52 hours. I did so by relying in a hybrid manner on my computer to work with me, under my direct supervision and control, to review most of the documents for me.
The comparison shows that manual review is at least twenty-six times slower than hybrid multimodal. I say at least because the manual review calculation does not include the need for second reviews and other quality control efforts, so in actuality a pure linear review would probably take over 1,700 man-hours.
So much for linear review, especially when testing shows that such manual review at large scales is no more accurate. See, e.g., Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review, Journal of the American Society for Information Science and Technology, 61(1):70–80 (2010). In fact, the Roitblat et al. study showed that a second set of professional human reviewers agreed with the first set of reviewers of a large collection of documents only 28% of the time, suggesting error rates with manual review of as much as 72%!
Saving 92% (even with a billing rate twice as high)
Consider the costs of these CAR rides, which are central to my bottom-line-driven proportional review approach. It would be unfair to do a direct comparison and say that a manual review CAR costs 26 times more than a predictive coding CAR, or, put another way, that the state-of-the-art predictive coding CAR costs 96.2% less than the Model-T. It is an unfair comparison because the billing rate of an attorney skilled in predictive coding would not be the same as that of a linear document reviewer.
Still, even if you assumed the skilled reviewer charged twice as much, the predictive coding review would still cost 13 times less.
Let’s put some dollars on this to make it more real. If that manual reviewer, the old-fashioned Model-T-driving attorney, charged $250 per hour for his services, then the 1,382 hours would generate a fee of at least $345,500. On the other hand, at a double rate of $500 per hour, my 52 hours of work would cost the client $26,000. That represents a savings of $319,500.
My review, even at double rates, still cost only 8% of what the old-timey low-rate lawyer would have charged. That is a 92% savings.
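The fee comparison in the two paragraphs above reduces to a few multiplications, which readers can check for themselves:

```python
manual_hours = 1_382   # linear review estimate from the text
manual_rate = 250      # $/hour for the Model-T reviewer
pc_hours = 52          # hours actually spent on this predictive coding project
pc_rate = 500          # $/hour, doubling the rate for the skilled reviewer

manual_fee = manual_hours * manual_rate   # $345,500
pc_fee = pc_hours * pc_rate               # $26,000
savings = manual_fee - pc_fee             # $319,500
pct_saved = 1 - pc_fee / manual_fee
print(f"${manual_fee:,} vs ${pc_fee:,}; savings ${savings:,} ({pct_saved:.0%})")
```

This prints `$345,500 vs $26,000; savings $319,500 (92%)`, matching the figures in the text (the exact ratio is about 92.5%).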
This is significantly more than the estimate of a 75% savings made in the Rand Report, but in the same dramatic-savings neighborhood. Where The Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery (2012); also see my blog on the Rand Report. I wonder when insurers are going to catch on to this?
Griswold’s Keyword Car
But what about the reviewer driving the keyword search CAR, the gas guzzler that seemed so cool in the 90s? What if contract reviewers were used for the first review, and full-fee lawyers were used only for heavy lifting and final review? Yes, it would be cheaper than all-manual linear review. But by how much? And, here is the most important part, at what cost to accuracy? How would the Griswold keyword wagon compare to the 2012 hybrid CAR with a predictive coding search engine?
First, let’s give the Griswolds some credit. Keyword search was great when it was first used by lawyers for document review in the 1990s. It sure beat looking at everything. Use of keyword search culling to limit review to the documents with keyword hits limited the number of documents to be reviewed and thus limited the cost. It is obviously less expensive than linear review of all documents. But, it is still significantly more expensive than multimodal predictive coding culling before review. Importantly, keyword search alone is also far less accurate.
I have seen negotiated keyword search projects recently where manual review of the documents with hits showed that 99% of them were not relevant. In other words, the requesting party’s keywords produced an astonishingly low precision rate of 1%. And this happened even though the keywords were tested (at least somewhat), hit-count metrics were studied, several proposed terms were rejected, and a judge (arbitrator) was actively involved. In other words, it was not a completely blind Go Fish keyword guessing game.
In that same case, after I became involved, the arbitrator then approved predictive coding (yes, not all such orders are published, nor the subject of sensationalist media-feeding frenzies). I cannot yet talk about the specifics of the case, but I can tell you that the precision rate went from 1% using keywords, to 68% using predictive coding. Perhaps someday I will be able to share the order approving predictive coding and my reports to the tribunal on the predictive coding search. Suffice it to say that it went much like this Enron search, but the prevalence and yield were much higher in that project, and thus the number of relevant documents found was also much higher.
But don’t just take my word for it on cost savings. Look at case law where keyword search was used along with contract reviewers. In re Fannie Mae Securities Litigation, 552 F.3d 814 (D.C. Cir. 2009). True, the keyword search in that case was poorly done, but they did not review everything. The DOJ lawyers reviewed 660,000 emails and attachments with keyword hits at a cost of $6,000,000. The DOJ only did the second reviews and final quality control; contract lawyers did the first review, and yet it still cost $9.09 per document.
Further, in the Roitblat et al. Electronic Discovery Institute study, a review of 2.3 million documents by contract reviewers cost $14,000,000. That is a cost of $6.09 per document. This compares with my review of 699,082 documents for $26,000: the predictive coding review cost less than four cents a document. Also see Maura Grossman & Gordon Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Rich. J.L. & Tech., Spring 2011.
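The per-document costs cited above are simple division, and lining them up side by side makes the gap stark:

```python
per_doc = {
    "Fannie Mae keyword review":     6_000_000 / 660_000,     # ≈ $9.09 per doc
    "EDI contract-reviewer study":   14_000_000 / 2_300_000,  # ≈ $6.09 per doc
    "this predictive coding review": 26_000 / 699_082,        # ≈ $0.04 per doc
}
for label, cost in per_doc.items():
    print(f"{label}: ${cost:.2f} per document")
```

Note these are costs per document in the collection subjected to each review method, not per document produced, so the comparison favors methods that cull aggressively before human eyes ever see a file.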
That is the bottom line: four cents per document versus six dollars and nine cents per document. That is the power of predictive culling and precision. It is the difference between a hybrid, predictive coding, targeted approach with high precision, and a keyword search, gas-guzzler, shotgun approach with very low precision. The recall rates are also, I suggest, at least as good, and probably better, when using far more precise predictive coding, instead of keywords. Hopefully my lengthy narrative here of a multimodal approach, including predictive coding, has helped to show that. Also see the studies cited above and my prior trilogy Secrets of Search: Parts One, Two, and Three.
92% Savings Is Not Possible Under Real World Conditions
In future articles I may opine at length on how my review of the Enron database was able to achieve such dramatic cost savings, 92% ($26,000 vs. $345,500). Suffice it to say for now that I do not think this kind of 92% savings is possible under real world conditions; 50%-75% is more realistic.
Even then, the 50%-75% savings assumes a modicum of cooperation between the parties. My review was done with maximum system efficiency, and thus resulted in maximum savings, because I was the requesting party, the responding party, the reviewer, the judge, and appeals court all rolled into one. There was no friction in the system. No vendor costs. No transaction costs or delays. No carrying costs. No motion costs. No real disagreements, just dialogue (and inner dialogue at that).
In the real world there can be tremendous transaction costs and inefficiencies caused by other parties, especially the requesting party’s attorney, called opposing counsel for a reason. Often opposing counsel object to everything and anything without thinking, or any real reason, aside from the fact that if you want it, that means it must be bad for their client. This is especially true when the requesting party’s legal counsel have little or no understanding of legal search.
Sometimes the litigation friction costs are caused by honest disagreements, such as good faith disagreements on scope of relevance. That is inevitable and should not really cost that much to work out and get rulings on. But sometimes the disagreements are not in good faith. Sometimes the real agenda of a requesting party is to make the other side’s e-discovery as expensive as possible.
Unfortunately, anyone who wants to game the system to intentionally drive up discovery costs can do so. The only restraint on this is an active judiciary. With a truly dedicated obstructionist the 50%-75% savings from predictive coding could become far less, even nil. Of course, even without predictive coding as an issue, a dedicated obstructionist will find a way to drive up the costs of discovery. Discovery as abuse did not just spring up last year. See Judge Frank H. Easterbrook, Discovery As Abuse, 69 B.U. L. REV. 635 (1989). That is just how some attorneys play the game and they know a million ways to get away with it.
From my perspective as a practicing attorney it seems to be getting worse, not better, especially in high-stakes contingency cases. I have written about this quite a few times lately without dealing with case specifics, which, of course, I cannot do. See, e.g.:
- Discovery As Abuse
- E-Discovery Gamers: Join Me In Stopping Them
- Judge David Waxse on Cooperation and Lawyers Who Act Like Spoiled Children
These transaction costs, especially the friction inherent in the adversarial system, explain the difference between a 92% savings in an ideal world and a 50%-75% savings in the real world under good conditions, or perhaps no savings at all under bad conditions.
I readily admit this, but consider the implications of this observation. Consider the heavy price the adversary system imposes on legal search. Craig Ball, who, like me, is no stranger to high-stakes contingency litigation, recently made a good observation on human nature that sheds light on this situation in his LTN article Taking Technology-Assisted Review to the Next Level:
It’s something of a miracle that documentary discovery works at all. Discovery charges those who reject the theory and merits of a claim to identify supporting evidence. More, it assigns responsibility to find and turn over damaging information to those damaged, trusting they won’t rationalize that incriminating material must have had some benign, non-responsive character and so need not be produced. Discovery, in short, is anathema to human nature.
A well-trained machine doesn’t care who wins, and its “mind” doesn’t wander, worrying about whether it’s on track for partnership.
What, dear readers, do you see as an option to our current adversarial-based system of e-discovery? What changes in our system might improve the efficiency of legal search and thus dramatically lower costs? Although I am grateful to the many attorneys and judges laboring over still more rule changes, I personally doubt that more band-aid tweaks to our rules will be sufficient. We are, after all, fighting against human nature as Craig Ball points out.
I suspect that a radical change to our current procedures may be necessary to fix our discovery system, that technology and rule tweaks alone may be inadequate. But I will save that thought for another day. It involves yet another paradigm shift, one that I am sure the legal profession is not yet ready to accept. Let’s just say the Sedona Conference Cooperation Proclamation is a step in that direction. For more clues read my science fiction about what legal search might be like in 50 years. A Day in the Life of a Discovery Lawyer in the Year 2062: a Science Fiction Tribute to Ray Bradbury. In the meantime, I look forward to your comments, both on this overall search project, my final quality control check, and the implications for what may come next for legal search.
In the Interests of Science
When I first wrote this narrative I planned to end at this point. The last paragraph was to be my last words on this narrative. That would have been in accord with real world practices in legal search and review, where the project ends with a final quality control check and production. The 659 documents identified as relevant to involuntary employee termination would be produced, and, in most cases, that would be the end of it.
In legal practice you do not look back (unless the court orders you to). You make a decision and you implement. Law is not a science. It is a profession where you get a job done under tight deadlines and budgets. You make reasonable efforts and understand that perfection is impossible, that perfect is the enemy of the good.
But this is not a real world exercise. If it was, then confidentiality duties would not have allowed me to describe my work to begin with. This is an academic exercise, a scientific experiment of sorts. Its purpose is training, to provide the legal community with greater familiarity with the predictive coding process. For that reason I am compelled to share with you my thoughts and doubts of last week, in late July 2012, when I was rewriting and publishing Days Seven and Eight of the narrative.
I started to wonder in earnest whether my decision to stop after five rounds of predictive coding was correct. I described the decision and rationale in my Day Eight narrative. As I concluded in the Enough Is Enough heading: I was pretty sure that further rounds of search would lead to the discovery of more relevant documents, but thought it very unlikely any more significant relevant documents would be found. But now I am having second thoughts.
What if I was wrong? What if running another round would have led to the discovery of more significant relevant documents, and not just cumulative, insignificant relevant documents as I thought? What if a bunch of hot documents turned up? What if a whole new line of relevance was uncovered?
I also realized that it would only take a few more hours to run a sixth round of predictive coding and find out. Thanks to the generosity of Kroll Ontrack, the database was still online and intact. I could do it. But should I do it? Should I now take the time to test my decision? Was my decision to stop after five rounds right, or was it wrong? And if it was wrong, how wrong was it?
I knew that if I now tested the decision by running a sixth round, the test would provide more information on how predictive coding works, on how a lawyer’s use of it works. It would lead to more pieces of truth. But was it worth the time, or the risk?
Chance and Choice
The personal risks here are real. Another round could well disprove my own decision. It could show that I was mistaken. That would be an embarrassing setback, not only for me personally, but also for the larger, more important cause of encouraging the use of advanced technology in legal practice. As I said in Day One of the narrative, I took the time to do this in the hope that such a narrative will encourage more attorneys and litigants to use predictive coding technology. If I now go the extra mile to test my own supposition, and the test reveals failure and delusion on my part, what would that do for the cause of encouraging others to take up the gauntlet? Was my own vanity now forcing me to accept needless risks that could not only harm myself, but others?
Of course, I could do the experiment and reveal it only if the results were positive, or at least not too embarrassing, and hide them if they were not. That way I could protect my own reputation and protect the profession. But I knew that I could never live with that. I knew that if I ran the experiment, then no matter how embarrassing the results proved to be, there was no way I could hide them and still keep my self-respect. I knew that it would be better to be humbled than be a fraud. I knew that if I did this, if I took the time to go back and double-check my decision, I would have to go all the way, pride and professional reputation be damned. I would have to tell all. If it was a story of delusion that discouraged other lawyers from adopting technology, then so be it. Truth should always triumph. Maybe other lawyers should be discouraged. Maybe I should be more skeptical of my own abilities. After all, even though I have been doing legal search in one form or another all my career, I have only been doing predictive coding for a little over a year.
Of course, I did not have to run the test at all. No one but a few folks at Kroll Ontrack would even know that it was still possible to do so. Everyone would assume that the database had been taken down. By any logical analysis I should not run this test. I had little to gain if the test worked and confirmed my theory, and much to lose if it did not. Reason said I should just walk away and stick to my plan and end the narrative now. No one would ever know, except of course, I would know. Damn.
As I write this I realize that I really have no choice. I have to take the chance. A clean conscience is more important than a puffed ego, more important even than encouragement of the profession to adopt predictive coding. Anyway, what good is such encouragement if it is based on a lie, or even just an incomplete truth? I do not want to encourage a mistake. Yes, it means more work, more risk. But I feel that I have to do it. I choose to take a chance.
As I write this, I have not yet performed this experiment, and so I have no idea how it will turn out. But tomorrow is another day, the tenth day, wherein I will step outside of my normal protocol. I will run a sixth round of predictive coding to test and evaluate my decision to stop after five rounds.
To be continued . . . .