Reply to an Information Scientist’s Critique of My “Secrets of Search” Article

One of the leading information scientists in the field of legal search, Dr. Herbert Roitblat, was kind enough to write a detailed critique of my Secrets of Search series. I tried to post a response on his blog, Information Discovery, where the critique appeared. But the website would not accept my comments for some technical reason, so I am replying here, knowing that Herb will find them, and maybe some of his readers too. I bring Dr. Roitblat’s comments to your attention, even though they are in the nature of a critique, because I want my readers to hear all sides of the story, not just mine. I am just a lawyer, and especially welcome peer review from the scientific community. That is part of a team approach to e-discovery, where Law, Science, and IT work together, and learn from each other, to do e-discovery right. The world is too complex, the electronic haystacks too vast, for lawyers to find relevant evidence without such an interdisciplinary team approach.

It is also interesting to see that a lot of Herb Roitblat’s stated disagreements appear to be based on misunderstandings of what I was trying to say. That is common in interdisciplinary team efforts. I accept that it was probably my fault, as I was writing about information science topics in Secrets of Search. But I don’t beat myself up too much about it because it is just so damned difficult to write intelligibly on the matrix between e-discovery law and information science. That is one reason that almost no one even tries. Still, this apparent miscommunication presents an opportunity. By addressing Herb’s issues we can attain greater clarity of the emerging consensus. His comments, and my responses, suggest that we both agree on far more than we disagree. From my perspective, at least, the points of disagreement are really minor and technical. They pale in comparison to our mutual agreement as to the superiority of technology assisted review over mere manual review.

Before I post the reply, I have to give you an idea of Dr. Roitblat’s commentary, so you can understand what I’m replying to. But better still, take a few minutes to read his entire article, On Some Selected Search Secrets. Only in this way will my response make full sense. I do not really know Herb, although I think we’ve met at some events over the years, but I certainly know of him. He is one of the few information scientists around who is focused on legal search and actually makes a living out of it. (Apparently he also likes dolphins and killer whales.) He owns a company, OrcaTec, that, in his words, provides professional services and software for information discovery and information management. My Secrets of Search article frequently cited one of his works with the e-Discovery Institute: Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review, Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010.

Summary of Herb Roitblat’s Critique

Here is Herb’s analysis, which begins in a very flattering manner that I don’t deserve:

Ralph Losey recently wrote an important series of blog posts (here, here, and here) describing five secrets of search. He pulled together a substantial array of facts and ideas that should have a powerful impact on eDiscovery and the use of technology in it. He raised so many good points that it would take up all of my time just to enumerate them. He also highlighted the need for peer review. In that spirit I would like to address a few of his conclusions in the hope of furthering discussions among lawyers, judges, and information scientists about the best ways to pursue eDiscovery.

These are the problematic points I would like to consider:
1. Machines are not that good at categorizing documents. They are limited to about 65% precision and 65% recall.
2. Webber’s analysis shows that human review is better than machine review.
3. Reviewer quality is paramount.
4. Human review is good for small volumes, but not large ones.
5. Random samples with 95% confidence levels +/- 2% are unrealistically high.

Ralph’s Response: Thanks for taking the time to provide input on my article. I appreciate your comments, and actually agree with most of the points you make. I think you may have misunderstood some of what I was saying, and that your disagreement is actually agreement, but I much appreciate your clarifications. I will respond based on your enumerated points above.

Dr. Roitblat then explains the five problems that he had with a few of the conclusions that I made in Secrets of Search. Again, I urge you to read all of his comments, but for ease of reference, I here quote what I think is the essence of each of his five issues, and then follow with my response.

Issue [1]: Machines are not that good at categorizing documents. They are limited to about 65% precision and 65% recall. Losey quotes extensively from a paper written by William Webber, which reanalyzes some results from the TREC Legal Track, 2009, and some other sources. Like Losey’s commentary, this paper also has a lot to recommend it. Some of the conclusions that Losey reaches are fairly attributable to Webber, but some go beyond what Webber would probably be comfortable with. The most significant fact, because important arguments are based on it, is a description of some work by Ellen Voorhees that concluded that 65% recall at 65% precision is the best performance one can expect. The problem is that this 65% factoid is taken out of context. In the context of the TREC studies and the way that documents are ultimately determined to be relevant or not, this is thought to be the best that can be achieved. The 65% is not a fact of nature. It says, actually, nothing about the accuracy of the predictive coding systems being studied. Losey notes that this limit is due to the inherent uncertainty in human judgments of relevance, but goes on to claim that this is a limit on machine-based or machine assisted categorization. It is not. …

Ralph’s Response [1]: I agree with you. I was not trying to say 65% precision or recall is all that is possible to attain, just that the fuzziness of our lenses makes it hard to prove any more than that, unless special review controls are put in place for the measurements. These controls have been lacking in most legal tests to date. TREC is making progress with limited subject matter expert input, but even there, thanks to monetary constraints, we still have a ways to go to use a true gold standard that could improve our measurements. So I agree with you that 65% is no “fact of nature,” as you put it, or inherent limitation in human relevancy determinations. (I am not ruling that possibility out entirely, but if such a mental limit does exist, my experience tells me that it is higher than 65%.) This fuzziness issue is more than a mere anomaly and deserves widespread discussion and recognition. Insofar as large-scale human reviews are concerned, reviews unassisted by technology, the kind of reviews that were common in the past, the 65% fuzzy focus may well be an inherent human limit. With predictive coding and other automated processes, however, this barrier can be broken. Finally, I like your suggestion to improve TREC experiments by using both an authoritative training set and an authoritative judgment set.
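To make the measurement point concrete, here is a minimal sketch, with purely hypothetical assessors and document numbers of my own invention, of how scoring one human reviewer’s relevance calls against another’s as the “gold standard” caps measured precision and recall at the reviewers’ level of agreement, no matter how good either reviewer actually is:

```python
def precision_recall(retrieved, relevant):
    """Score one set of relevance calls against another treated as truth."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Two hypothetical assessors each mark 10 documents relevant, agreeing on 6.
assessor_a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}     # treated as the gold standard
assessor_b = {1, 2, 3, 4, 5, 6, 11, 12, 13, 14}  # the judgments being measured

p, r = precision_recall(assessor_b, assessor_a)
# With only 6 of 10 calls in common, assessor B measures no better than
# 0.6 precision and 0.6 recall against A, however diligent B may be.
```

The same arithmetic applies when the “retrieved” set comes from software rather than a second human, which is why a fuzzy gold standard limits what we can measure, not what the systems can do.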

Issue [2]: Webber’s analysis shows that human review is better than machine review. I have no doubt that human review could sometimes be better than machine-assisted review, but the data discussed by Webber do not say anything one way or the other about this claim. Webber did, in fact, find that some of the human reviewers showed higher precision and recall than did the best-performing computer system on some tasks. But, because of the methods used, we don’t know whether these differences were due merely to chance, to specific methods used to obtain the scores, or to genuine differences among reviewers. Moreover, the procedure prevents us from making a valid statistical comparison. …

Ralph’s Response [2]: Again, I agree with you. I get that Webber’s analysis suggests that humans are only sometimes better, not always. In fact, I would go much further and say that humans always lose at large-scale review (weeks on end of 8 hours a day reviewing hundreds of thousands of boring documents) when paired against today’s good software. Still, Webber pointed out what no one else had before about the TREC results: that the humans sometimes did win on the small scale, even when substandard manual review methods were used. I think it is wrong to just sweep that under the rug as an anomaly or luck. This realization of human abilities is important for proper application of the predictive coding process, where, in my opinion, input by experts on the seed coding is key. These experts need a clear understanding of what is relevant, and what is not. Otherwise, no matter how good the software, the computer principle of garbage in, garbage out, will control.

This realization of the continued importance of Man in the technology equation is also important to defeat the sophistic arguments of some plaintiffs’ lawyers (or, better put, “requesting party” lawyers). They are now arguing in multiple courts around the country that a defendant (responding party) should forego any manual review and just turn over documents based solely on automated review. They use that argument to oppose motions for protective orders based on excessive cost and burden of review. They are misusing distorted reports of scientific research to try to force quick peek disclosures. But the truth is, automated coding is not good enough yet to dispense with final manual quality control reviews to protect confidential information in a litigation context. Webber’s findings help prove that. The advantage to plaintiffs’ counsel of such a disingenuous, forced quick peek strategy is obvious and substantial. Clawbacks and Rule 502 are inadequate protections. Once the bell has been rung, the damage is done, regardless of whether the documents are returned. The main point I was trying to make by publicizing Webber’s finding is that humans still have a place at the table, not that they should sit there alone without reliance on the latest software for culling review. I suspect you agree with me on that.

Issue [3]: Reviewer quality is paramount. Webber found that some assessors performed better than others. Continuing the argument of the previous section, though, we cannot infer from this that some assessors were more talented, skilled, or better prepared than others. … The best reviewers on each topic could have been the best because they got lucky and got an easy bin, or they got a bin with a large number of responsive documents, or just by chance. Unless we disentangle these three possibilities, we cannot claim that some reviewers were better or that reviewer quality matters. In fact, these data provide no evidence one way or the other relative to these claims. … In some sense, the ideal would be for the senior attorney in the case to read every single document with no effect of fatigue, boredom, distraction, or error. Instead, the current practice is to farm out first pass review to either a team of anonymous, ad hoc, or inexpensive reviewers or to search by keyword. Even if Losey were right, the standard is to use the kind of reviewers that he says are not up to the task.

Ralph’s Response [3]: I think you misunderstood my point and again assumed incorrectly that I was advocating for large-scale manual review. I am not. I agree the reviewers are not up to the task, even the best. As explained above, I think humans cannot perform well over long periods of time, and so I am not advocating against machine review; I am advocating for hybrid review, man and machine working together. Like you, I advocate for change. So really we agree.

But I do disagree with some of your statements here. To paraphrase Shakespeare: methinks thou dost protest too much. I don’t think it is wrong to assume a correlation between accuracy and skill. That connection is based on experience and common sense. All large-scale review project metrics show that some reviewers are better than others, just like some trial lawyers are better than others, and some scientists, etc. It is inherent that we all perform at different levels at different tasks. I do not understand the need to try to explain all of the variance as just luck or chance. (As the great golfer Gary Player used to say: the more I practice, the luckier I get.) Although I concede some chance or luck is possible, the same could be said of the software tested. Perhaps the “winning software” just got lucky. I would not seriously make that argument, so I am surprised to hear it made about the reviewers here. TREC only tried to measure comparisons, as you said, and lady luck knows no favorites.

Issue [4]: Human review is good for small volumes, but not large ones. This claim may also be true, but the present data do not provide any evidence for or against it. The evidence that Losey cites in support of this claim is the same evidence that, I argued, failed to show that human review is better than machine review. It requires the same circular reasoning. … Based on other evidence from psychology and other areas, it is likely that performance will decline somewhat with larger document sets, but there is no evidence here for that. If this were the only factor, we could arrange the situation so that reviewers only looked at 500 documents at a time before they took a break.

Ralph’s Response [4]: I agree this was not tested. I was again relying on my experience outside of these experiments, and on common sense built from over 30 years of doing document review, paper and electronic, big and small. But I get your point of scientific discipline: it was not tested, and so not here proven. Still, I’m not a scientist, nor do I care to become one. Also, I write primarily for lawyers, not scientists (although I am very happy a few of you are interested enough to read my articles too, at least when they touch on your work). I am a lawyer interested in learning from science for purposes of improving law, not vice versa, although that may be a secondary benefit. That would depend on scientists like yourself. Also, as you know, there is more to establishing best practices in review processes than simply adding in periodic breaks.

Issue [5]: Random samples with 95% confidence levels +/- 2% confidence intervals are unrealistically high. It’s not entirely clear what this claim means. On the one hand, there is a common misperception of what it means to have a 95% confidence level. Some people mistakenly assume that the confidence level refers to the accuracy of the results. But the confidence level is not the same thing as the accuracy level. … I suspect that Losey means something different. I suspect that he is referring to the relatively weak levels of agreement found by the TREC studies and others. If our measurement is not very precise, then we can hardly expect that our estimates will be more precise.

Ralph’s Response [5]: You have correctly divined my intent here on sampling. I was again referring to the measurement fuzziness issue reported by your scientific colleagues, Voorhees, Webber, and to some extent Oard. I understand that you are uncomfortable with their findings and conclusions on accuracy. I sincerely hope that you and other scientists will work this issue out.
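For readers who want the arithmetic behind that sampling language, here is a minimal sketch of the standard sample-size formula for estimating a proportion, assuming simple random sampling and the worst-case 50% prevalence (the function name and defaults are mine, for illustration only):

```python
import math

def sample_size(z=1.96, margin=0.02, prevalence=0.5):
    """Minimum simple random sample size to estimate a proportion within
    +/- margin, at the confidence level implied by z (1.96 ~ 95%)."""
    raw = z ** 2 * prevalence * (1 - prevalence) / margin ** 2
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(raw, 6))

n = sample_size()  # 95% confidence level, +/- 2% confidence interval
# n == 2401 sampled documents, regardless of collection size
# (for collections much larger than the sample).
```

Note that, as Herb says, the confidence interval describes sampling error only; it says nothing about the accuracy of the human relevance judgments used to score the sample.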

I want accurate measurements too, especially when important points of justice are at stake. I want all of the scientific research out there for full public view, even the troubling preliminary conclusions of Voorhees, Webber and Oard. If the measurements are disputed, I want full disclosure on that. If it takes more money, time and effort to get these measurements done properly in scientific testing, then let’s raise the funds to do it right. I support the important scientific research now going on in legal search. On that point I suspect we once again agree.

Again, thanks for your comments on my article.


DEAR READERS: I’m off to LegalTech, where I will not only be presenting with Craig Ball and Judge Andrew Peck in the much hyped debate on Tuesday, January 31st at 4:00 at the Sutton Center on the 2nd floor (sponsored by BIA), but I will also be presenting three more times on predictive coding related subjects. I am thinking of preparing for all of that the way Pat Sajak prepared to host Wheel of Fortune.

On Monday the 30th, I present at 12:30 on The Promise and Challenge of Predictive Coding and Other Disruptive Technologies with Judge Andrew Peck, Maura Grossman and Dean Gonsowski (sponsored by Clearwell/Symantec).

On Wednesday, February 1st, I present at 10:30 on Technology Assisted Review: When to Use it and How to Defend It, with Maura Grossman, Judge Frank Maas, and Ann Marie Gibbs (sponsored by Daegis).

My last gig on Wednesday is at 1:45 in the Sutton South Parlor on E-discovery Circa 2015: Will Aggressive Preservation/Collection and Predictive Coding be Commonplace? My fellow panelists are David Kessler, Robert Trenchard, Julie Colgan, Stephanie Blair, and Craig Carpenter (sponsored by Recommind and ARMA).

If you see me around, please stop and say hello. I like to meet all of my readers whenever possible. Please forgive me if you catch me at a time during the day when I don’t have time to chat, but I always have time to shake hands and say hello.

Responses to Reply to an Information Scientist’s Critique of My “Secrets of Search” Article

  1. Herbert L. Roitblat says:

    Thanks, so much, Ralph. My comments about your posts are, in my opinion, well deserved. You do a great service to the community when you discuss these issues. I appreciate your kind words as well.

    I am very pleased to learn that I misunderstood you on some of your points. Frankly, I was hoping that that was the case because I know that you have advocated along these lines. As you said, writing so that your words can be understood by a broad audience can be painfully difficult.

    I have only one additional point to try to clarify. Occam’s razor says, basically, don’t attribute results to a more complex process when a simpler one will do. I, too, agree that some reviewers must be more skilled than others. My point was not to deny that common-sense observation, merely to point out that these data were not collected in a way that would allow us to know whether in this particular case, the difference among reviewers was due to differences in skill or to something simpler (e.g., chance). We would have to design a different study to answer that question. Put simply, these results do not provide enough information to tell the difference between the two explanations.

    I hope to see you at Legal Tech. I will be spending most of my time in the OrcaTec Booth, #1421, but I will try to get to some of your presentations. I love talking about this stuff. Let me say again, that I appreciate the service that you provide to the community. Your blog, and Chris Dale’s are two of my favorites.

  4. […] is continuously growing.  According to e-Discovery experts like Magistrate Judge Andrew Peck and Ralph Losey, keyword searching is an outdated methodology for identifying  potentially relevant […]

  7. […] devoted a blog to responding to one of Herb’s lengthy, and good blog Comments. The first was Reply to an Information Scientist’s Critique of My “Secrets of Search” Article that appeared in late January […]

  8. […] Let me explain again in shorthand, and please feel free to refer to the Secrets of Search trilogy and original studies for the full story. Roitblat’s own well-known study of a large-scale document review showed that human reviewers only agreed with each other on average 28% of the time. Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review, Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010. An earlier study by one of the leading information scientists in the world, Ellen M. Voorhees, found a 40% agreement rate between human reviewers. Variations in relevance judgments and the measurement of retrieval effectiveness, 36:5 Information Processing & Management 697, 701 (2000). Voorhees concluded that with 40% agreement rates it was not possible to measure recall any higher than 65%. Information scientist William Webber calculated that with a 28% agreement rate a recall rate cannot be reliably measured above 44%. Herb Roitblat and I dialogued about this issue before in Reply to an Information Scientist’s Critique of My “Secrets of Search” Article. […]

