Secrets of Search – Part One

Two weeks ago I said I would write a blog revealing the secrets of search experts. I am referring to the few technophiles, lawyers, and scientists in the e-discovery world who specialize in the search for relevant electronic evidence in large chaotic collections of ESI such as email. I promised the exposé would include a secret deeply hidden in shadows, one only half-known by a few. Before I can get to the dark secret, I must lay bare a few other search secrets that are not so hidden.

A Secret of Search Already Known to Many

The first secret of search here exposed is the same kind of secret as those revealed in Spilling the Beans on a Dirty Little Secret of Most Trial Lawyers. You probably have heard it already, especially if you have read Judge Peck’s famous wake-up call opinion in William A. Gross Construction Associates, Inc. v. American Manufacturers Mutual Insurance Co., 256 F.R.D. 134, 136 (S.D.N.Y. 2009). He repeated it again recently in his article Predictive Coding: Reading the Judicial Tea Leaves, (Law Tech. News, Oct. 17, 2011), that I wrote about in Judge Peck Calls Upon Lawyers to Use Artificial Intelligence and Jason Baron Warns of a Dark Future of Information Burn-Out If We Don’t. Despite these writings and many CLEs on the subjects, most of your less informed colleagues in the law still don’t know these things, much less litigants or the public at large. It would seem that Jason R. Baron’s dark vision of a future where no one can find anything is still a very real possibility.

The wake-up call on search has a long way to go before it is a shot heard round the world. I am reminded of that on almost a daily basis as I interact, usually indirectly, with opposing counsel in employment cases around the country. They often insist on antiquated search methods. So bear with me while I begin by repeating what you may have already heard before. I promise that the exposé of these more common secrets will also set the stage for revealing the seventh step of incompetence causality that I mentioned in last week’s blog, Tell Me Why?, and the one deep dark search secret that you probably have not heard before. Yes, the one is related to the other.

The First Secret: Keyword Search Is Remarkably Ineffective at Recall

First of all, and let me put this in very plain vernacular so that it will sink in, keyword search sucks. It does not work, that is, unless you consider a method that misses 80% of relevant evidence to be a successful method. Keyword search alone only catches 20% of relevant evidence in a large, complex dataset, such as an email collection. Yes, it works on Google, it works on Lexis and Westlaw, but it sucks in the legal world of evidence gathering. It only provides reliable recall value when used as part of a multimodal process that uses other search methods and quality controls, such as iterative testing, sampling, and adjustments. It fails miserably when used in the Go Fish context of blind guessing, which is the negotiated method still used by most lawyers today. I have written about this many times before and will not repeat it here again. See, e.g., Child’s Game of “Go Fish” is a Poor Model for e-Discovery Search.

Keyword Search Still Has a Place in Best Practices

Keyword search still has a place at the table of Twenty-First Century search, but only when used as part of a multimodal search package with other search tools, and only when the multimodal search is used properly with iterative processes, real-time adjustments, testing, sampling, expert input and supervision, and other quality control procedures. For one very sophisticated example of what I mean, consider the following description by Recommind, Inc. of their patented Predictive Coding process that is embedded in their software review tool, Axcelerate. Their software uses advanced AI guided search processes, but keywords are still one of the many search tools used in that process:

The Predictive Coding starts with a person knowledgeable about the matter, typically a lawyer, developing an understanding of the corpus while identifying a small number of documents that are representative of the category(ies) to be reviewed and coded (i.e. relevance, responsiveness, privilege, issue-relation). This case manager uses sophisticated search and analytical tools, including keyword, Boolean and concept search, concept grouping and more than 40 other automatically populated filters collectively referred to as Predictive Analytics, to identify probative documents for each category to be reviewed and coded. The case manager then drops each small seed set of documents into its relevant category and starts the “training” process, whereby the system uses each seed set to identify and prioritize all substantively similar documents over the complete corpus. The case manager and review team (if any) then review and code all “computer suggested” documents to ensure their proper categorization and further calibrate the system. This iterative step is repeated … (emphasis added)

The final step in the process employs Predictive Sampling methodology to ensure the accuracy and completeness of the Predictive Coding process (i.e. precision and recall) within an acceptable error rate …

Sklar, Howard, Using Built-In Sampling to Overcome Defensibility Concerns with Computer-Expedited Review, Recommind DESI IV Position Paper. Here is the diagram that Recommind now uses to describe their overall process, which they gave me permission to use:

Note that keyword search, including Boolean refinements, is used as part of the seed set generation step, which they call the first Predictive Analytics step in their multimodal process. By the way, as I will explain when I reveal the second search secret in a minute, that 95%-99% accuracy statement you see in their chart should be taken with a very large grain of salt. Still, aside from the dubious percentages claimed in this chart, the actual search methods and processes used are good. If you like videos and images to help this all sink in, check out Recommind’s YouTube video that has a good, albeit over-simplistic explanation of predictive coding:

Proof of the Inadequacies of Keyword Search When Not Used as Part of a Multimodal Process

Want scientific proof of the incompetence of keyword search alone when not used as part of a modern multimodal process? Look at the landmark research on Boolean search by information scientists David Blair and M.E. Maron in 1985. The study involved a 40,000 document case (350,000 pages). The lawyers, who were experts in keyword search, estimated that the Boolean searches they ran uncovered 75% of the relevant documents. In fact, they had only found 20%. Blair, David C., & Maron, M. E., An evaluation of retrieval effectiveness for a full-text document-retrieval system; Communications of the ACM Volume 28, Issue 3 (March 1985).

Delusion is a wonderful thing, is it not? We are confident our search terms uncovered 75% of the relevant evidence. Really? Still, no one likes the fool who points out that the emperor is naked, especially the emperor and his tailors who frequently pay all of the bills. Still, here I must go, where angels fear to tread. I must point out what science says.

Please join me in this Quixotic quest. Spread the word. Somebody has to do it. We must all continue to tell the unpopular truth, lest Baron’s dark vision of a future world comes true. A world of injustice where relevant evidence is lost in ESI skyscrapers of junk, where cases are decided on false testimony and whim. We don’t want that world. We have worked way too hard over centuries to build our systems of justice to let a few billion terabytes of ESI destroy them. But destroy them they will, if we are complacent. Baron’s dystopian nightmares are real.

Want more recent scientific proof of the Emperor’s old clothes? See the research conducted by the National Institute of Standards and Technology TREC Legal Track. It has again confirmed that keyword search alone still finds only about 20%, on average, of relevant ESI in the search of a large data-set. In batch tests in 2009 of negotiated keyword terms they did much worse. Hedin, Tomlinson, Baron, Oard, 2009 TREC Legal Track Overview, TREC legal track at §3.10.9. The Boolean searches had a mean precision ratio of 39%, but recall averaged less than 4%. Yes! You read that right. The negotiated keywords missed 96% of the documents. Oopsie. I wonder how many times lawyers have done this in practice and never known it? We are confident our search terms uncovered 75% of the relevant evidence.

Please note this awful 4% recall came out of what they called the batch tasks, where there were no subject matter experts, testing, or appeals. These safeguards were present only in the interactive tasks. The batch tasks are thus like my Go Fish scenario, where people simply guess keywords in the blind, and never test, sample, refine and iterate.
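For readers who want the two measures pinned down, here is a minimal Python sketch of how precision and recall are computed from a set of retrieved documents and a set of documents judged relevant. The document IDs and counts are hypothetical, chosen only to mimic the batch-task pattern of tolerable precision with very low recall; they are not TREC data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document IDs."""
    hits = len(retrieved & relevant)  # relevant documents actually found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: 1,000 relevant documents (IDs 0-999).
relevant = set(range(1000))

# Hypothetical keyword search: returns 100 documents, 40 of them relevant.
retrieved = set(range(40)) | set(range(5000, 5060))

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

In this made-up case the search looks respectable if you only inspect what it returned (40% of the hits are relevant), yet it leaves 96% of the relevant documents behind. That gap is invisible unless someone samples the documents the search did not return.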

The same research also shows that alternative multimodal methods do much better. They still use some keyword based search tools, but also use predictive coding and other artificial intelligence algorithms with seed-set iteration and sampling methodologies. I wrote about these new methods in Judge Peck Calls Upon Lawyers to Use Artificial Intelligence and Jason Baron Warns of a Dark Future of Information Burn-Out If We Don’t and before that in The Information Explosion and a Great Article by Grossman and Cormack on Legal Search.

Want still more recent proof? The final report on the 2010 TREC tests has not been completed, but many participants’ reports are final. I have done some deep digging and read most of them, and the draft summary report, in order to try to bring to you the latest evidence on search. The 2010 tests once again confirm our little secret on the absurd ineffectiveness of keyword search alone. The confirmation comes inadvertently from the tests done by a fine team of information science graduate students from the Indian Statistical Institute, Kolkata, in West Bengal, India. They participated in the 2010 TREC Legal Interactive task in Topic 301 and Topic 302. (Yes, science is very international, including information science and TREC Legal Track.) They performed what proved to be an interesting (to me) experiment, although for reasons other than what they intended.

The Indian Statistical Institute had an AI predictive software coding tool using clustering techniques that they wanted to test. But the software could not handle the high volumes of email involved in the 2010 test: 685,592 items. So they had no choice but to cull down the amount of email somewhat before they could use their software. For that reason they decided to use keywords to cull down the corpus (a term information scientists love to use) before running their AI clustering software. Here is their own description of the process:

We attempted to apply DFR-BM25 ranking model on the TREC legal corpus. We chose Terrier 3.0 as this toolkit has most of the IR methods implemented within. But as we received the TREC legal data set we realized that it would be difficult to manage such a large volume of data. So, we decided to reduce the corpus size by Boolean retrieval. We chose Lemur 4.11 as it supports various useful Boolean query operators which would suit our purpose. On the set obtained from Boolean retrieval we decided to apply ranked retrieval techniques. … The use of Boolean retrieval has the disadvantage that it will limit further search to the documents retrieved at this stage and have an adverse effect on our recall performance. But it would scale down the huge corpus size considerably (see Table 1) and enable us to perform our experiments on a smaller set which would reduce processing time.

That use of keyword Boolean as an upfront filter turns out to have been a mistake, at least insofar as any quest for good recall was concerned. Who knows, maybe they thought their keywords would be better than the lawyer-derived keywords in the famous Blair Maron study. I see this kind of mistake made by opposing counsel all of the time. We are confident our search terms will uncover 75% of the relevant evidence. They think their keywords are so good that they could not possibly miss 80% of all relevant documents in the corpus. They have an almost superstitious belief in the magical power of keywords, and think that their Boolean is better than your Boolean. Hogwash! All keyword search sucks, no matter who you are, or how many lawsuits you’ve won, or Google sites you’ve found.

The computer algorithms used in the 1985 Blair Maron test are essentially the same used today for keyword search. Keyword search is pretty simple index matching stuff. Antiquated software really. It works fine in academic settings with artificially controlled data sets or organized databases, but it does not survive contact with the real world where words and symbols are chaotic and vague, just like the people who create them. In real world email collections the meaning of documents is hidden in subtle, and not-so-subtle, word and phrase variations, misspellings, abbreves, slang, obtusity, etc. In reality, when large data sets are involved, no human is smart enough to guess the right keywords.

Getting back to the 2010 TREC study, in topic 301 the use of Boolean retrieval allowed the scientists from India to reduce the initial corpus from 685,592 to 2,715. Then they ran their sophisticated software on the whittled down corpus. The final metrics must have been disappointing. The TREC judges found that their precision in topic 301 was pretty good. It was 87% (meaning 87% of the items retrieved were determined to be relevant after an appeal process). But their recall was simply terrible, only 3% (meaning their method failed to retrieve an estimated 97% of the relevant documents in the original 685,592 collection). Random guessing might have done as well in the recall department, maybe even in the F1 measure (the harmonic mean of precision and recall).

In their other interactive task topic 302 the results were comparable. They attained a precision rate of 69% and a recall rate of 9%. Again this means that they left 91% of the relevant documents on the table and only managed to find 9% of the relevant documents.
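To see how badly low recall drags down the combined F1 measure mentioned above, here is a short Python sketch that plugs in the precision and recall figures reported for the two topics. The function name is my own and the calculation is just the standard harmonic mean; it is not taken from the TREC reports.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above for the two interactive topics.
reported = {
    "Topic 301": (0.87, 0.03),
    "Topic 302": (0.69, 0.09),
}

for topic, (p, r) in reported.items():
    print(f"{topic}: precision={p:.0%} recall={r:.0%} F1={f1_score(p, r):.1%}")
```

Because the harmonic mean is dominated by the smaller of the two numbers, an 87% precision rate cannot rescue a 3% recall rate: the F1 for Topic 301 comes out at roughly 6%, and for Topic 302 at roughly 16%.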

The Second Search Secret (Known Only to a Few): The Gold Standard to Measure Review is Really Made Out of Lead

The so-called gold standard used to judge recall and precision rates in information science studies is human review. This brings up an even more important secret of search, a subtle secret known only to a few. Experiments in TREC conducted well before the legal track even began showed that we humans are very poor at making relevancy determinations in large data sets. This is a very inconvenient truth because it puts all precision and recall measurements in doubt. It means that the recall and precision measures we use are more like rough estimates than calculations. It may be the measurements can be improved by expensive remedial, three-pass expert human reviews, and other methods, but even that has yet to be proven. But see Cormack, Grossman, Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error? (2011) (humans can agree and create a gold standard if relevance is defined clearly enough to reviewers and if objective mistakes by reviewers (as opposed to subjective disagreements) are identified and corrected).

This secret of human inadequacy and resulting measurement vagaries in large data-set reviews has been known in the information science world since at least 2000. I understand from inquiring of Doug Oard, a well-known information scientist and one of the TREC Legal Track founders, that the problem of the “fuzziness” of relevance judgments remains an important and ongoing discussion among scientists. Apparently the “fuzziness” issue is far less of a problem when simply trying to compare one system with another, and determine which one is better, than it is when trying to report a correct (“absolute”) value for some quantity such as recall or precision. I corresponded with Doug Oard on this issue and he advised me that:

The Legal Track of TREC has generated quite a lot of attention to the problem of absolute evaluation simply because the law, properly, has a need for that information. But the law also has a need for relative evaluation (which can help to answer questions like “did you use the best available approach under the circumstances”), and “fuzziness” is well-known to have only limited effects on such relative comparisons

So even though our measurements are too fuzzy to ever really say with any assurance that there is 95%-99% accuracy, they can tell us how one method compares with another. For instance, we can know that keyword search sucks when compared with multimodal; we just cannot know exactly how well either of them does.

The fuzziness of recall measurements may explain the wide divergences in measurements of search effectiveness. For instance, it could explain how the 2009 batch tests of keywords only measured a remarkably low 4% recall rate. 2009 TREC Legal Track Overview, TREC legal track at §3.10.9. It may have been better than that, more in line with the usual 20% recall rates that other experiments have shown, but we do not really know because the gold standard measurements can fluctuate wildly. Again this is all because average one-pass human review is known to be unreliable.

The fuzziness issue is one of several important topics addressed in an interesting paper written this year by a young information scientist, William Webber, entitled Re-examining the Effectiveness of Manual Review. Webber is an Australian now doing his post-doctoral work with Professor Oard. His paper arose out of an e-discovery search conference held this year in China of all places, the SIGIR 2011 Information Retrieval for E-Discovery (SIRE) Workshop, July 28, 2011, Beijing, China. You may have heard about this event from some of its other attendees, including Jason R. Baron, Patrick Oot, Jonathan Redgrave, Conor Crowley, Bill Butterfield, Doug Oard, and David Lewis. Anyway, Webber in his China paper explains:

It is well-known that human assessors frequently disagree on the relevance of a document to a topic. Voorhees [2000] found that experienced TREC assessors, albeit working from only sentence-length topic descriptions, had an average overlap (size of intersection divided by size of union) of between 40% and 50% on the documents they judged to be relevant. Voorhees concludes that 65% recall at 65% precision is the best retrieval effectiveness achievable, given the inherent uncertainty in human judgments of relevance. Bailey et al. [2008] survey other studies giving similar levels of inter-assessor agreement.
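The overlap measure Webber describes (size of intersection divided by size of union) is easy to compute. Here is a minimal Python sketch using two hypothetical assessors; the document IDs and counts are invented for illustration and are not drawn from Voorhees or TREC.

```python
def overlap(judged_a, judged_b):
    """Size of intersection divided by size of union of two relevance sets."""
    union = judged_a | judged_b
    if not union:
        return 0.0
    return len(judged_a & judged_b) / len(union)


# Hypothetical judgments: each assessor marks 60 documents relevant,
# and they agree on 36 of them (IDs 24 through 59).
assessor_a = set(range(0, 60))
assessor_b = set(range(24, 84))

print(f"overlap = {overlap(assessor_a, assessor_b):.0%}")
```

Note what this invented example implies: two reviewers who each agree with the other on 60% of their own picks (36 of 60) still score only about 43% on overlap, because the union in the denominator counts every document that either of them marked.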

Can anyone validly claim absolute recall or precision rates in large data set reviews that are more than 65% when the determinations are made by single pass human review? Apparently not. Maybe double or triple pass review can create a true gold standard. I know that is what TREC is now striving for using sampling and an appeals process in the experiments since 2009. But has that been proven? I don’t think so, at least that is my impression after reading Webber’s work.

Webber’s China paper goes on to explain the well-known study by Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010.

For their study, the authors revisit the outcome of an earlier, in-house manual review. The original review surveyed a corpus of 2.3 million documents in response to a regulatory request, and produced 176,440 as responsive to the request; the process took four months and cost almost $14 million. Roitblat et al. had two automated systems and two manual review teams review the documents again for relevance to the original request. The automated systems worked on the entire corpus; the manual review teams looked at a sample of 5,000 documents.

Roitblat et al. (Table 1) found that the overlap between the relevance sets of the two manual teams was only 28%, even lower than the 40% to 50% observed in Voorhees [2000] for TREC AdHoc assessors. The overlap between the new and the original productions was also low, 16% for each of the manual teams, and 21% and 23% for the automatic systems. …

The effectiveness scores calculated on the original production seemingly show that the automated systems are as reliable as the manual reviewers. However, as Roitblat et al. note, the original production is a questionable gold standard, since it likely is subject to the same variability in human assessment that the study itself demonstrates. Instead, the claim Roitblat et al. make for automated review is a more cautious one; namely, that two manual reviews are no more likely to produce results consistent with each other than an automated review is with either of them.

Given the remarkably low level of agreement observed by Roitblat et al., their conclusion might seem a less than reassuring one; an attorney might ask not, which of these methods is superior, but, is either of these methods acceptable? More importantly, the study does not address the attorney’s fundamental question: does automated or does manual review result in a production that more reliably meets the overseeing attorney’s conception of relevance?

Think about that. Lawyers are on average even worse than non-lawyers in making relevancy reviews. We only agree 28% of the time, compared to earlier non-lawyer tests noted by Voorhees showing 40% agreement rates. The 40% agreement rates showed that the best retrieval effectiveness achievable, given the inherent uncertainty in human judgments of relevance, was only 65% recall and 65% precision. See Ellen M. Voorhees, Variations in relevance judgments and the measurement of retrieval effectiveness, 36:5 Information Processing & Management 697, 701 (2000). I wondered what an even lower 28% agreement rate as found in the Roitblat et al. study meant. In private correspondence with Webber to prepare this essay, he advised me that a 28% agreement rate produces a mean precision and recall rate of 44%.
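Webber did not show me his math, so what follows is only my own back-of-the-envelope sketch of one way to get from an overlap rate to a precision and recall figure. It rests on a simplifying assumption that is mine, not his: that the two reviewers each mark the same number of documents relevant. In that case, with n documents each and k in common, overlap is k / (2n - k) and the precision (and recall) of one reviewer measured against the other is k / n, which works out to 2 × overlap / (1 + overlap).

```python
def mutual_effectiveness(overlap):
    """Precision (= recall) of one reviewer measured against another,
    ASSUMING both mark the same number of documents relevant.

    With equal set sizes n and k shared documents:
        overlap = k / (2n - k)  and  precision = recall = k / n,
    so precision = recall = 2 * overlap / (1 + overlap).
    """
    return 2 * overlap / (1 + overlap)


# 28% is the Roitblat et al. overlap; 40% and 50% bracket the Voorhees range.
for j in (0.28, 0.40, 0.50):
    print(f"overlap {j:.0%} -> precision = recall = {mutual_effectiveness(j):.0%}")
```

Under that equal-size assumption a 28% overlap yields about 44%, which matches the number Webber gave me, and the 40% to 50% overlap range yields about 57% to 67%, which brackets the 65% figure from Voorhees. Whether this is how either of them actually arrived at their numbers I cannot say.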

It seems to me as if Webber and Voorhees are saying that on average the best that lawyers can ever do using the so-called gold standard of human review for measurement is something like 44% to 65% recall. Any measurements higher than that are suspect because the gold standard itself is suspect. I think Webber, Voorhees, and others are saying that the human relevancy determinations lens we are using to study these processes is too fuzzy, too out of focus, to give us any real confidence in exactly what we are seeing, but the fuzzy lens does allow us to compare one method against another.

The Triple Pass Solution

Although I do not understand the math on the fuzziness issue, I understand it in an intuitive way from over thirty years of arguing with other attorneys and judges over relevancy. I also know from the thousands of vague requests for production I have read and tried to respond to. In the law we use a kind of triple pass quality control method based on disagreements of experts. The triple-pass method has evolved in the common law tradition over the past few centuries. We never simply rely on one tired lawyer’s opinion. One lawyer expresses their view on relevance, then another lawyer, opposing counsel, uses their independent judgment to either agree or disagree, and, if they disagree, to object. A third expert, a judge, then hears argument from both sides and makes a final determination. Without such triple expert input and review the determination of the legal relevance of evidence in legal proceedings would also be unreliable.

TREC has been trying to use such a triple pass method since 2009 to buttress the accuracy of its findings. The first reviewers make their determinations, then the participants make theirs. If the participants disagree, then the participants can ask for a ruling from the subject matter expert who had been guiding the participants with up to ten hours of consults. The first review team has no such appeal rights and far less guidance. Also, the first pass reviewers cannot present their side of the arguments to the judge. Not surprisingly under these conditions, if and when the participants appeal, the reports show that the expert judges usually rule with the participants. They have, after all, had ongoing ex parte communications with them and don’t hear from the other side. Not exactly the same triple play as in the real world of American justice, but it is far better than the flawed single human review that Voorhees initially studied. Moreover, it is improving each year as TREC’s experiments are refined. To get closer to real world practice would require a lot more money for the experiments.

In my view the inherent fuzziness (or not) of human relevance capacities is a significant problem that needs a lot of further study. Think of the implications on our current legal practice. (Hint – this has something to do with the seventh insight into trial lawyer resistance as I will explain in Part II of Secrets of Search.)

Not Too Fuzzy To Allow Valid Comparisons

Although the measures are fuzzy, they are not too fuzzy to make comparisons between reviews. So, for instance, you can compare two human reviews and use the differences to show just how vague and inaccurate human review really is. This would be a comparison to establish the fuzziness of the gold standard you use to make recall, precision and other measurements.

The study by Roitblat et al. sponsored by the Electronic Discovery Institute (EDI) did just that. It proved the incredible inconsistencies of single pass human review in large data-sets. This study examined a real world event where Verizon paid $14,000,000 for contract reviewers to review 2.3 million documents in four months. (This is, by the way, a cost of $6.09 per document for review and logging only, a pretty good price for those days.) A second review by other reviewers commissioned by the study only agreed with 16% of the first determinations. Yes, there was only a 16% agreement rate. Incredible. Does that not suggest likely error rates of 84%?!

Surely this study by EDI is the death-blow to large-scale human reviews that are not in some way computer assisted to at least cull out documents before review. Why should anyone spend $14 Million for such a poor quality product after seeing this study? (Yet, I’m told they still do this in the world of mergers and acquisitions and second reviews.) This is especially true when you consider that machine assisted review is much faster and less expensive. Further, as the studies also show, the computer assisted review is at least as reliable as most of the human reviewers (but maybe not all, as will be explained; that is yet another search secret).

With these limitations of human review and measurements in mind consider the paper by Maura R. Grossman and Gordon V. Cormack, which analyzed the 2009 TREC legal track studies on this issue. Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review. Richmond Journal of Law and Technology, 17(3):11:1–48, 2011. I have written about their paper before: The Information Explosion and a Great Article by Grossman and Cormack on Legal Search. Grossman and Cormack found that:

[T]he levels of performance achieved by two technology-assisted processes exceed those that would have been achieved by the official TREC assessors – law students and lawyers employed by professional review companies – had they conducted a manual review of the entire document collection.

Id. at 4. This was good research and a great paper as I’ve noted before, but the gold standard was again just human reviewers and so subject to the vagaries of fuzzy measurement when it comes to calculating absolute values. As mentioned, TREC is working on this issue with their appeals process, but due to economic constraints, it still differs from actual practice in several ways as mentioned. The first reviewers have relatively limited upfront instruction and training on the relevance issues, only limited contact with subject matter experts during the review, no testing or sampling feedback, and no appeal rights.

Also, the human review in TREC 2009 did not meet minimum ethical standards of supervision established by most state Bar associations that have considered the propriety of delegated review to contract lawyers. Most Bar associations require direct supervision of contract lawyers by counsel of record, and, in my opinion, that requires direct, ongoing contact to supervise. Aside from the supervision issue, the statistics were skewed by a one-sided appeals process where the judge only heard one side of a relevancy argument from the party they trained in relevance. It reminds me of a secret for getting an “A” in law school from some Professors: just tell them what you think they want to hear, not what you really think.

For that reason the win observed by Grossman and Cormack may not say as much about technology as it does about methodologies. Also, the paper focuses on the two technology-assisted processes that were better. What about the other technology-assisted processes that were not better?

Aside from these methodology concerns, as Webber points out, none of the studies so far, by TREC or anyone else, have addressed the key issue of concern to lawyers:

… which is not how much different review methods agreed or disagree with each other (as in the study by Roitblat et al. [2010]), nor even how close automated or manual review methods turn out to have come to the topic authority’s gold standard (as in the study by Grossman and Cormack [2011]).  Rather, it is this: which method can a supervising attorney, actively involved in the process of production, most reliably employ to achieve their overriding goal, to create a production consistent with their conception of relevance. (emphasis added)

Let’s Spend the Money Necessary to Turn Lead Into Gold

How can the studies and scientific research ever give us an answer to Webber’s question, to our question, if the measuring device, the gold standard, is too fuzzy to make absolute measurements, just comparative ones? It seems to me the solution is a series of multimillion dollar scientific experiments, instead of the shoestring-budget type projects we have had so far. We need experiments where the three-pass gold standard developed by the law is employed, where time consuming quality controls are employed for both automated and manual reviews, and for various types of combined multimodal methods. We need to transform our lead standard into a bona fide gold standard. Yes, that means expensive relevancy determinations made by three-expert, triple-pass, statistically checked, state of the art reviews. But we have to go for the gold. We need absolute measurements we can trust and bank on to do justice.

These kinds of scientific experiments will be expensive, but I think we should do them. Gold for gold. It is worth it. After all, billions of dollars in fees are spent each year on e-discovery review. Trillions of dollars more ride on the outcome of litigation. What if a method we employ does not work as well as we think, and a key privileged document is overlooked? It could be game over. What if we end up looking at way too many of the wrong documents? How much money is lost already each year doing that? What if the 50% recall measurement you made is rejected by the court as too low, when in fact it was a 95% recall rate? What if the 95% recall measurement is really just 50%? With these constraints on measurements, how much recall should be considered legally sufficient? Should these measurements be used at all? Or should we just use methods that compare well with others, that use best practices, and not try to quantify precision and recall?

We need to really know what we are doing. We cannot just be alchemists playing with quicksilver. We need real science to verify exactly how accurate our methods are, not just compare them. We need to know more than comparative values. We need absolute measures. Heisenberg be damned, we need certainty in the law, or at least a lot more of it than we have now. Sure we know that computer assisted review is faster, cheaper and at least as good as average human review. But what recall rates do any of them really achieve? Sure we know that keyword search sucks, that multimodal is comparatively much better. But how much better? Is the true rate of recall for keywords 20% or 4%, or is it 44%? What is the true rate of recall for our top multimodal search techniques today, the ones like Recommind’s that use keywords, Predictive Coding and a variety of other tools and methods? Is it 97% or is it 44%, or less? We need hard numbers, not just comparisons.

Law and IT alone cannot give us the answers. The e-discovery team also needs scientists. We need to know what kind of recall rates and precision rates we are capable of measuring with a confidence level in the 90s, not just 44% to 65%. Is plus or minus 44% recall really the best anyone can hope for? Is the confidence level such that a measure of 44% recall might actually be much higher, might actually be 98%? And vice versa? Are we just kidding ourselves with all of the recall measures we now have? Apparently so. All we can tell for sure right now is which method is better than another. That is not enough for the law. We need much more certainty than that.
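
To make the question of measurement certainty a little more concrete, here is a minimal sketch of how wide the uncertainty around a measured recall rate can be from sampling alone. It assumes recall is estimated from a simple random sample of documents already judged relevant, and it uses a Wilson score interval at roughly 95% confidence. The function name and the sample sizes are hypothetical, chosen only for illustration. It captures sampling error only and says nothing about the fuzziness of the relevance judgments themselves, which is the deeper problem discussed above.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95% confidence).

    hits: relevant documents in the sample that the review method found
    n:    total relevant documents in the sample
    """
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical samples: the same 44% measured recall at two sample sizes.
for hits, n in [(44, 100), (440, 1000)]:
    low, high = wilson_interval(hits, n)
    print(f"{hits}/{n} found: recall between {low:.1%} and {high:.1%}")
```

The larger sample narrows the interval, but no sample size repairs a gold standard that is itself unreliable.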

The secret is now out and we have to address it. We have to talk about it. We have to perform experiments and peer review these experiments. I personally think the law’s triple-pass methods with the latest quality control techniques will produce significantly higher rates of agreement, maybe even in the 90s, but who actually knows until we pay for the experiments?

I think the research that TREC and EDI have done to date is a good start, but not the final word by any means. We need many more open scientific experiments. The testing must be improved and several more groups should join in. Our major information science universities worldwide should join in. So should the National Science Foundation and other charitable organizations. So too should the big companies that can afford to finance pure research. How about Google? IBM? Microsoft? EMC? HP? Xerox? How about your company? Every e-discovery company should have some skin in this game.

The budgets of the testing organizations need to be ramped up for all of these experiments. We need gold to make measurements with a true gold standard, to give us real answers, not just qualified comparisons. I will make a donation and participate in fundraisers for that kind of scientific research. Will you? Will your company or firm join in?


There is still more to the insights contained in Webber’s research in Re-examining the Effectiveness of Manual Review. But this week’s essay is already too long, so that, my friends, will have to wait again for next week. Webber’s work and the discussion so far set the stage for an even deeper and darker secret of search, the one that ties into the seventh insight into lawyer resistance to e-discovery. That will come at the conclusion of next week’s blog, Secrets of Search – Part Two.

39 Responses to Secrets of Search – Part One

  1. Pete says:


    Another great article. Lots of important insight here. I would like to know a bit more about what percentage of the hot documents are consistently caught by the various review methods, and whether that rate is consistent with the rate for the less-interesting “responsive” group, or higher or lower.

    Also, is a 25% accuracy rate on a 1 million document production really a problem? Obviously, there is always a chance of one (or several) super-hot outlier documents being missed. But in terms of the much more financially achievable goal of a statistically defensible sample size, how much of a problem is this really causing? (I am not suggesting that the answer is “not much”–I want to know your thoughts.)

  2. Ralph,

    Hi! Great post, and thanks for your interest in my work.

    We don’t know enough yet to know whether agreement amongst lawyers in review is higher or lower than (say) amongst TREC document assessors. What we do know is that agreement differs greatly between different tasks and groups of assessors. The overlap of 28% observed by Roitblat, Kershaw and Oot is certainly quite low; on the other hand, dual-assessment of one of the topics at TREC 2010 showed a more reasonable overlap of 55%, giving an upper-bound recall/precision of 0.71 (still below, though, the 75% level which seems to be a rule-of-thumb threshold for a reasonable production).
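
As a rough illustration of how an overlap figure relates to a bound like 0.71, the sketch below assumes that "overlap" means the Jaccard index between two assessors' relevant sets (intersection over union), and that the bound is the mutual F1 score between them. Under that assumption, F1 = 2J / (1 + J), and when both assessors mark about the same number of documents relevant, precision and recall each equal that value. The function name is hypothetical, and the comment above does not spell out the calculation, so treat this as one reading that happens to be consistent with the quoted numbers.

```python
def f1_bound_from_overlap(overlap: float) -> float:
    """Mutual F1 implied by a Jaccard overlap between two assessors.

    overlap = |A and B| / |A or B|, so F1 = 2 * overlap / (1 + overlap).
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    return 2.0 * overlap / (1.0 + overlap)


# Overlap figures quoted in the comment above.
for source, overlap in [
    ("Roitblat, Kershaw and Oot", 0.28),
    ("TREC 2010 dual assessment", 0.55),
]:
    bound = f1_bound_from_overlap(overlap)
    print(f"{source}: overlap {overlap:.0%} -> bound {bound:.2f}")
```

On this reading, the 28% overlap corresponds to a bound of about 0.44, and the 55% overlap to about 0.71.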

    An important question here is whether disagreement in relevance assessment is due primarily to differences in conception of relevance, or to assessor error. Grossman and Cormack (DESI, 2011), through a post-review of TREC 2009 assessments, find that assessor error is the chief culprit; in other words, it is possible to write detailed enough criteria to objectively determine the relevance status of most (but never quite all) documents. That would be a hopeful conclusion: a reasonable gold standard is achievable, against which retrieval effectiveness and review accuracy can be measured. But Grossman and Cormack’s is only one study.

    One of the reasons it would be desirable to find that criteria of relevance documents can be sufficient to objectively determine relevance (in most cases) is that it helps address an otherwise thorny problem: we measure the quality of an e-discovery retrieval by how well it replicates the overseeing attorney’s conception of relevance; but the quality of the resulting production depends also on what the court would find reasonable if a dispute arose. A detailed criteria-of-relevance document acts as a medium between them: the court can assess the reasonableness of the criteria (though some familiarity with the corpus may be required to determine that), while the quality of the retrieval can be measured against the criteria. For this reason it seems to me important that even technology-assisted production methods do not skip the step of responsiveness criteria generation for the seductive allure but intangible substance of predictive coding alone.

    On the question of assessor error, a change in the multi-tier review procedure between TREC 2009 and TREC 2010 provides interesting evidence on how subject to error overseeing lawyers may be. Final review by the overseeing lawyer (known as the “topic authority” or TA) is sparked by appeals of initial assessments, lodged by participating teams. In TREC 2009, teams submitted detailed rationales for their appeals; in TREC 2010, however, the appealed documents were sent uncommented, and without notification of the initial assessment, to the TA. The appeal success rate fell from 78% in 2009 to 38% in 2010, and there is independent (though circumstantial and statistical) evidence that most of this difference was due to errors being missed (which is to say, like errors being committed) by the TA. In other words, even experienced e-discovery lawyers are subject to errors of inattention and fatigue. So cutting out the middle-man by having the supervising attorney provide predictive coding directly to the machine is no panacea, since the attorney herself is subject to significant error and shifts in conception of relevance, particularly if no formal criteria of relevance are drawn up.

    In sum, what is required is quality assurance at both ends of the process: qualitative QA of the supervising attorney, to verify that their conception of relevance is consistent and sufficiently thorough; and quantitative QA of the production, whether manually or automatically generated, to verify that it follows the supervising attorney’s conception of relevance.

  3. […] you take a look at William Webber’s public comment to my blog, Secrets of Search, Part One. Click here to jump right to his Comment. William is the scientist whose paper was featured in the blog. In Part Two of Secrets of Search, […]

  4. Jim Cook says:

    Mr. Webber’s observation that the topic authority lawyer’s concept of relevance is subject to

    “significant error and shifts in conception of relevance, particularly if no formal criteria of relevance are drawn up”

    is a very important observation.

    Essentially anything of reasonable complexity needs a written document to keep people anchored to a specific objective. In software development I’ve dealt with the problem of creeping scope (i.e. feature creep) when requirements for a particular release were not documented, understood and agreed upon by everyone responsible for a given release.

    There may be valid reasons to change objectives but it should be done explicitly rather than letting it just happen. Written documents are essential to controlling “creep”.

    This makes a good argument that parties should, among other things, discuss and document what is “relevant” to a particular matter to prevent “discovery creep”. If things need to change, then the relevance criteria should be revised as necessary with information about what is driving the need for change.

    • Jim,

      Hi! Interesting observation. I’d add, though, that it is natural and desirable that the supervising lawyer’s conception of relevance (perhaps not in its foundations, but in its ramifications) change and improve as they become more familiar with the corpus. The detailed criteria guidelines produced in TREC are drawn up after the topic authority has spent many hours answering clarification questions from participants.

      Automated (“predictive coding”) systems can actually react better to this than manual review efforts, at least in principle, since the automated system is able to distinguish between recent and ancient assessments of responsiveness, hopefully detect when they change, and propose earlier assessments that now appear anomalous for re-assessment. The down-side of the predictive coding approach, however, is that no explicit criteria of relevance may be drawn up; the conception of relevance is contained solely in the set of documents labelled as relevant (and the feature weights the algorithm has learnt from them, which are not human interpretable).

      It seems to me that an explicit criteria of relevance document would be desirable as an end product even where predictive coding has been employed, to allow third parties to check the reasonableness of the assumptions and the conformance of the results to these assumptions. But I’ve been told by an experienced e-discovery practitioner that predictive coding makes responsiveness criteria documents redundant. (I should also point out, by the way, that I am not a lawyer, and have no direct real-world experience in performing an e-discovery production; what seems to me correct in principle and theoretically may not be workable or desirable in practice.)

      • Jim Cook says:


        I agree that a criteria of relevance document (CRD) would change over time. As you note, predictive coding systems may have the ability to use the chronology of various document sets identified as relevant over time to create relevance weighting factors. I don’t have enough experience with any of the specific products and their algorithms to know which predictive coding systems are able to incorporate chronology or how they might use chronology. So, I can’t make any specific judgment about whether these systems would actually make a CRD document redundant based on my personal experience.

        However, I still think that there is value in a CRD beyond its use as an end product for third party or after-the-fact assessment based on the following reasons.

        Let’s assume that a “topic authority” (TA) lawyer identifies various subsets of documents over some period of time as relevant or responsive. Those document subsets are used by the automated system’s algorithms to select a set of presumably responsive documents from the full document set that need to be reviewed. Let’s also assume that this is done iteratively and with appropriate sampling and testing so that we are reasonably assured that the responsive document set is the best we can obtain within the time and cost constraints imposed.

        After this process, we have a subset of documents to be reviewed and the various subsets of documents identified by the TA that were used by the predictive coding engine. As you noted, the feature or relevance weighting information is locked up in the software and is not likely to be usable by humans (at least at this time). I don’t rule out the possibility that predictive coding engines might be able to produce information about the relevance criteria at some future time in a form that would be usable by humans.

        What we do not have is any information in a concise summary form about the relevance criteria that the TA used to make his/her judgment calls in selecting the various document subsets fed into the predictive coding engine. That is all in the mind of the TA and this is what I think should be in a CRD.

        If the review document set is large (it could be large even if there is 95-99%+ culling if the full document set is very large), then there is likely to be a human review stage that needs multiple reviewers. The CRD would be a useful resource for training human reviewers and also for referral by the human reviewers during the course of the review phase.

        If the full document set is very large or involves multiple areas of language or knowledge, then there may have to be more than one TA and they may have to communicate with each other about overlaps in their respective areas. The CRD provides a useful communications nexus for the TAs to exchange (and capture) this type of information. The TAs could do this in person or by phone but then the exchange of information would not be captured (unless recorded) and would again not be generally available to anyone who might need it.

        Also, TAs are human and humans have limitations. If all we have is a set of selected document subsets, we don’t really have information in a concise form about why those decisions were made. This makes it very difficult to quality check the decisions of TAs by clients or other people with knowledge of the issues and the matter. I believe that this type of quality checking should be done before and during the relevant document selection process. The earlier you can course correct in any complex project (software development and e-Discovery are just specific types of complex projects) the better. Without a CRD, it’s more difficult for others to evaluate the selections and identify errors or omissions in the criteria used by the TA or TAs early in the project.

        These are some of the reasons why I think that a CRD is a useful tool at all stages of an e-Discovery project. If these reasons are persuasive that a CRD is needed, then the issue becomes in what form or forms should a CRD be produced and maintained.

        Switching back to software development, the analog to a CRD would be a Software Requirements Specification (SRS). Many years ago, SRS documents morphed into large binders containing hundreds or thousands of pages of text and diagrams used by a development team as the basis for what the team was to design, code, document, test and finally release. The problem with these types of paper SRS documents was that it was impossible to distribute them to everyone that could use the information and to maintain them when dozens of paper copies were distributed. So, the SRS documents were rarely if ever updated or maintained as things changed and the development team acquired a deeper insight into the problem(s) to be solved by the software release. So, when the development team started coding, the information contained in the SRS documents was less and less reflective of the changing requirements. Eventually the development team simply ignored the SRS documents, which just sat on a shelf gathering dust. As software development became a much more incremental process, large SRS documents became useless and were often abandoned.

        Unfortunately what is lost when SRS documents are abandoned is the process of capturing communications and decisions about a complex project that would provide insights to a development team. I think part of the value of the CRD is capturing the communications and decisions regarding relevance that will provide insights to an e-Discovery team.
        If a CRD is thought of as a large paper memo or a word processing document that is just passed around by email, it will be no more useful than the old paper SRS document binders.

        So, I think that a good format for a CRD would be a protected wiki based website. This would allow many individuals to be readers and/or collaborators against a single source document that can be maintained and updated as required. Exactly what would be in such a CRD and how it would be organized is a topic for a much longer discussion.

        An even larger discussion could be about which tools and communication methods are effective in managing large complex projects. However, that is a big drift off the topic of Ralph’s article.

      • Jim,

        Hi! Thanks, a very interesting and thought-provoking reply. It made me wonder what features a predictive coding (or, in research parlance, text classification) tool could provide to help reverse-engineer a CRD.

  5. […] SIRE paper has since generously been picked up by Ralph Losey in his blog post, Secrets of Search — Part One. Ralph’s post first stresses the inadequacy of keyword search alone as a tool in e-discovery. […]

  6. Great piece, well constructed.

    I wonder if the problems of consensus are compounded somewhat needlessly by the blunt edge of the measuring tool, that being “relevance”. Switching from binary to scaled relevance scores might provide insight. It would be interesting to find, as I suspect, that convergence increases with perceived relevance.

    If so, the problem of disparities among reviewers in their conclusions about relevance might be a bit of a tempest in a teapot.

    • The problem of graded relevance does need to be addressed. The particular question is, how many documents that the supervising attorney would consider “hot” are missed by manual reviewers altogether? You would want to assign a higher penalty for missing “hot” documents than for missing “responsive, but not hot” ones, but such that the total penalty for missing all documents would be the same. Unfortunately, we don’t have legal collections with multi-grade relevance judgments. There is some work outside the legal domain on assessor agreement for multi-grade relevance; see Bailey et al., “Relevance assessment: are judges exchangeable and does it matter?”, SIGIR 2008. They find that inexpert judges rarely regard as irrelevant documents that experts regard as highly relevant. However, the dataset is unusual (popular write-ups of scientific discoveries), and there is a confounding effect that non-experts have a broader conception of relevance; they tend to think that an article is related to some topic, where experts think that it is not.
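
One way to picture the penalty scheme described above is a weighted recall, sketched here under stated assumptions: each gold-standard document carries a grade, each grade has a weight, and the score is normalized by the total weight so that missing every document always costs the same. The grade labels, the weights (5 for “hot”, 1 for “responsive”) and the function name are hypothetical, picked purely for illustration.

```python
def graded_recall(gold: dict[str, str], retrieved: set[str],
                  weights: dict[str, float]) -> float:
    """Recall in which each relevant document counts in proportion to its grade weight.

    gold:      document id -> grade label (relevant documents only)
    retrieved: document ids returned by the review
    weights:   grade label -> weight (hypothetical values)
    """
    total = sum(weights[grade] for grade in gold.values())
    if total <= 0:
        raise ValueError("gold standard must carry positive total weight")
    found = sum(weights[grade] for doc, grade in gold.items() if doc in retrieved)
    return found / total


# Hypothetical collection: one hot document and five merely responsive ones.
gold = {"d1": "hot", "d2": "responsive", "d3": "responsive",
        "d4": "responsive", "d5": "responsive", "d6": "responsive"}
weights = {"hot": 5.0, "responsive": 1.0}

# The review finds every responsive document but misses the hot one.
retrieved = {"d2", "d3", "d4", "d5", "d6"}
plain = len(retrieved & gold.keys()) / len(gold)
print(f"plain recall:  {plain:.2f}")
print(f"graded recall: {graded_recall(gold, retrieved, weights):.2f}")
```

With these made-up weights, missing the single hot document drops the graded score well below the plain recall figure, which is the effect the penalty is meant to have.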

      Mind you, Grossman and Cormack suggest that most assessor errors are due to inattention, and it is not clear that inattention errors would be less of an issue on “hot” documents. Nevertheless, this is an issue that clearly needs to be addressed experimentally.

  7. […] lots of decisions, they make lots of errors.  My friend and fellow commentator, Ralph Losey, lately blogged about the shortcomings of search and review calling them “dark secrets.”  Don’t […]

  8. […] Secrets of Search – Part One […]

  9. […] that: Spilling the Beans on a Dirty Little Secret of Most Trial Lawyers and Tell Me Why?  In Secrets of Search – Part One we left off with a review of some of the analysis on fuzziness of recall measurements included in […]

  10. […] can’t say that I really saw this, but Ralph Losey’s piece on the accuracy of e-discovery software comes close. I’ve often linked to John Markoff’s New York Times piece on how […]

  12. […] you cannot just dispense with final manual review. As I explained in my series Secrets of Search, Parts One, Two and Three, we are not going to turn that over to the Borg anytime soon. I’ve asked […]

  13. […] Losey recently wrote an important series of blog posts (here, here, and here) describing five secrets of search. He pulled together a substantial array of facts […]

  14. […] find all relevant ESI? I examined this question at length in my Secrets of Search series, volumes one, two and three. Still, people find it hard to accept, especially in view of the unregulated clamor […]

  15. […] Secrets of Search – Part One ( […]

  16. […] of this series on the Secrets of Search where all will be revealed. Secrets of Search: Parts One, Two, and Three. (Well, to be entirely honest, not all will be revealed. I’m still going to keep […]

  17. […] Waldo?” Approach to Search and My Mock Debate with Jason Baron; Secrets of Search: Parts One, Two, and […]

  18. […] Secrets of Search Parts One, Two and Three, I outlined the five key characteristics of effective search today, using the rubric […]

  19. […] get the picture. (Side note: when I say keyword search sucks, as I did in Secrets of Search: Part One, this is the kind of search I am referring to: the blind guessing, Go Fish, linear kind with no […]

  20. […] an information scientist whose excellent work in the field of legal search and statistics I have described before. I asked Webber to evaluate my math and analysis. He was kind enough to provide the following […]

  21. […] the five rules of search safety. They were explained in my Secrets of Search trilogy in parts One, Two, and Three and are shown below in another version of the Olympic […]

  22. […] well documented inconsistency of classification among human reviewers. That is what I called in Secrets of Search, Part One, as the fuzzy lens problem that makes recall such an ambiguous measure in legal search. It […]

  23. […] manual review of large amounts of documents is not the gold standard it was once thought to be. Secrets of Search, Part One. In fact, I have taken to calling typical manual review a lead standard because every time […]

  25. […] I explained in my series Secrets of Search, Parts One, Two and Three, the latest AI enhanced software is far better than keyword search, but not yet […]

  27. […] the 2011 TREC Legal Track – Part One, Part Two and Part Three; Secrets of Search: Parts One, Two, and Three. Also see Jason Baron, DESI, Sedona and […]

  28. […] According to information scientist, William Webber, who has provided me with invaluable assistance in understanding all of these studies, the 70% disagreement rate between all three reviewers (overlap of only 30%) would place a practical limit on precision and recall calculation of approximately 45%. This is the fuzzy lens problem I have written about before. Secrets of Search – Part One. […]

  29. […] you try to observe them, they will change. I call this the fuzzy lens phenomena of big data in my Secrets of Search essays. Let us all get real and realize relativity. It is all probability now, not Newtonian […]

  30. […] as the trees grow. It is the same old problem of garbage in, garbage out. I addressed this in Part One on this article, in the section, The Second Search Secret (Known Only to a Few): The Gold Standard […]
