This is Part Two of the blog that I started last week on the Secrets of Search, which was in turn a sequel to two blogs before that: Spilling the Beans on a Dirty Little Secret of Most Trial Lawyers and Tell Me Why? In Secrets of Search – Part One we left off with a review of some of the analysis on fuzziness of recall measurements included in the August 2011 research report of information scientist, William Webber: Re-examining the Effectiveness of Manual Review. We begin part two with the meat of his report and another esoteric search secret. This will finally set the stage for the deepest secret of all and the seventh insight into trial lawyer resistance to e-discovery.
Summarizing Part One of this Blog Post and the First Two Secrets of Search
I can quickly summarize the first two secrets with popular slang: keyword search sucks, and so does manual review (although not quite as bad), and because most manual review sucks, most so-called objective measurements of precision and recall are unreliable. Sorry to go all negative on you, but only by outing these not-so-little search secrets can we establish a solid foundation for our efforts with the discovery of electronic evidence. The truth must be told, even if it sucks.
I also explained that keyword search would not be so bad if it were not done blindly like a game of Go Fish, where it achieves really pathetic recall percentages in the 4% to 20% range (the TREC batch tasks). It still has a place with smarter software and improved, cooperation based Where’s Waldo type methods and quality controls. In that same vein I explained that manual review can probably also be made good enough for accurate scientific measurements. But, in order to do so, the manual reviews would have to replicate the state-of-the-art methods we have developed in private practice, and that is expensive. I concluded that we should come up with the money for better scientific research so we could afford to do that. We could then develop and test a new gold standard for objective search measurements. Scientific research could then test, accurately measure, and guide the latest hybrid processes the profession is developing for computer assisted review.
Another conclusion you could also fairly draw is that since the law already accepts linear manual review and keyword search as reasonable methods to respond to discovery requests, the law has set a very low standard and so we do not need better science. All you need to do to establish that an alternative method is legally reasonable is to show that it does as well as the previously accepted keyword and manual methods. That kind of comparison sets a low hurdle, one that even our existing fuzzy research proves we have already met. This means we already have a green light under the law, or logically we should have, to proceed with computer assisted review. Judge Peck’s article on predictive coding stated an obvious logical conclusion based upon the evidence.
You could, and I think should, also conclude that any expectation that computer assisted reviews have to be near perfect to be acceptable is misplaced. The claim that some vendors make as to near perfection by their search methods is counter to existing scientific research. It is wrong, mere marketing puff, because the manual based measurements of recall and precision are too fuzzy to measure that closely. If any computer assisted or other type of review comes up with 44%, it might in fact be perfect by an actual objective standard, and vice versa. Allegedly objective measurements of high recall rates in search are, for the time being at least, an illusion. It is a dangerous delusion too because this misinformation could be used against producing parties to try to drive up the costs of production for ulterior motives. Let’s start getting real about objective recall claims.
In any event, most computer assisted search is already better than average keyword or manual search, so it should be accepted as reasonable under the law without confidence inflation. We don’t need perfection in the law, and we don’t need to keep reviewing and re-reviewing to try to reach some magic, way-too-high measure of recall. Although we should always try to improve and to get more and more of the truth, we should also remember that there is only so much truth that any of us can afford when faced with big data sets and limited financial resources.
As I have said time and again when discussing e-discovery efforts in general, including preservation related efforts, the law demands reasonable efforts not perfection. Now science buttresses this position in document productions by showing that we have never had perfection in search of large numbers of documents, not with manual, and certainly not with keyword, and, here is the kicker, it is not possible to objectively measure it anyway!
At least not yet. Not until we start treating our ignorance of the processes of search and discovery as a disease. Then maybe we will start allocating our charitable and scientific efforts accordingly, so we can have better measurements. Then with reliable and more accurate measurements, with solid gold objective standards, we can create more clearly defined best practices, ones that are not surrounded with marketing fluff. More on this later, but first let’s move on to another secret that comes out of Webber’s research. I’m afraid it will complicate matters even further, but life is often like that. We live in a very complex and imperfect world.
The Third Search Secret (Known Only to a Very Few): e-Discovery Watson May Still Not Be Able to Beat Our Champions
Webber’s report reveals that there is more to the man versus machine question than we first thought. His drill down analysis of the 2009 TREC interactive tasks shows that the computer assisted reviews were not the hands down victors over human reviewers as we first thought, at least not victors over many of the well-trained, exceptional reviewer men and women. Putting aside the whole fuzziness issue, Webber’s research suggests that the TREC and EDI tests so far have been the equivalent of putting Watson up against the average Jeopardy contestants, you know, the poor losers you see each week who, like me, usually fail to guess anything right.
The real test of IBM’s Watson, the real proof, didn’t come until Watson went up against the champions, the true professionals at the game. We have not seen that yet in TREC or the EDI studies. But the current organizers know this, and they are trying to level the playing field with multi-pass reviews and, as Webber notes, trying to answer the question we lawyers really want to know, the one that has not been answered yet, namely which Watson, which method can an attorney most reliably employ to create a production consistent with their conception of relevance.
Webber in his research and report digs deep into the TREC 2009 results and looks at the precision and recall rates of individual first pass reviewers. Re-examining the Effectiveness of Manual Review. He found that while Grossman and Cormack were accurate to say that overall two of the top machines did better than man, the details showed that:
Only for Topic 203 does the best automated system clearly outperform the best manual reviewer. As before, the professional manual review team for Topic 207 stands out. Several reviewers outperform the best automated system, and even the weaker individual reviewers have both precision and recall above 0.5.
This means the best team of professional reviewers who participated in Topic 207 actually beat the best machines! They did this in spite of the mentioned inequities in training, supervision, and appeal. Did you know that secret? I’m told that Topic 207 was an easy one having to do with junk filters, but still, easy or not, the human team won.
There is still more to this secret. When you drill down even further you find that certain individual reviewers on each topic team actually beat the best machines on each topic in some way, even if their entire human team did not. That’s right, the top machines were defeated by a few champion humans in most every event. Humans won even though they were disadvantaged by not having an even playing field. I guarantee that this is a secret you have never heard before (unless you went to China) because Webber just discovered it from his painstaking analysis of the 2009 TREC results. Chin up contract reviewers, the reports of your death have been greatly exaggerated. Watson has not beaten you yet; in fact, Watson still needs you to set up the gold standard to determine who wins.
Webber’s research shows that a competition between the best Watsons and best reviewers is still a very close race where humans often win. Please note this analysis assumes no time limits or cost limits for the human review, which are, of course, false assumptions in legal practice. This is why pure manual review is still, or should be, as dead as a doornail. The future is a team approach where humans use machines in a nonlinear fashion, not vice versa. More on this later.
Webber’s findings are the result of something that is not a secret to anyone who has ever been involved in a large search project: not all reviewers are created equal. Some are far better than others. There are many good psychological, intelligence, and project management and methodology reasons for this, especially the management and methodology issues. See, e.g., the must read guest blog by contract review attorney Larry Chapin, Contract Coders: e-Discovery’s “Wasting Asset”?
The facts supporting Webber’s findings on individual reviewer excellence are shown in Figure 2 of his paper on the variability in review team reliability. Re-examining the Effectiveness of Manual Review. The small red crosses in each figure (except flawed task 205) show the computer’s best efforts. Note how many individual reviewers (a bin is 500 documents that were reviewed by one specific reviewer) were able to beat the computer’s best efforts in either precision, or recall, or both. They are shown as either to the right or above the red cross. If above this means they were more precise. If to the right, they had better recall.
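For readers who want the two measures behind those crosses pinned down, here is a minimal Python sketch of how precision and recall could be computed for one reviewer's bin against a gold standard. The function name and the toy document ids are illustrative assumptions, not taken from Webber's paper, and the result is only as good as the gold standard it is measured against.

```python
def precision_recall(reviewer_relevant, gold_relevant):
    """Return (precision, recall) for one reviewer's bin.

    reviewer_relevant: set of document ids the reviewer coded relevant
    gold_relevant: set of document ids the gold standard says are relevant
    """
    true_positives = len(reviewer_relevant & gold_relevant)
    # Precision: of the documents the reviewer marked relevant, how many really are.
    precision = true_positives / len(reviewer_relevant) if reviewer_relevant else 0.0
    # Recall: of the truly relevant documents, how many the reviewer found.
    recall = true_positives / len(gold_relevant) if gold_relevant else 0.0
    return precision, recall


# Hypothetical bin: the gold standard marks documents 1 through 10 relevant.
gold = set(range(1, 11))
# The reviewer marks 8 documents relevant, 7 of which match the gold standard.
reviewer = {1, 2, 3, 4, 5, 6, 7, 42}

precision, recall = precision_recall(reviewer, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this made-up bin the reviewer is quite precise but misses three relevant documents, which is the kind of trade-off the figure plots for each reviewer.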
William Webber summarizes these findings in his blog recently by saying:
The best reviewers have a reliability at or above that of the technology-assisted system, with recall at 0.7 and precision at 0.9, while other reviewers have recall and precision scores as low as 0.1. This suggests that using more reliable reviewers, or (more to the point) a better review process, would lead to substantially more consistent and better quality review. In particular, the assessment process at TREC provided only for assessors to receive written instructions from the topic authority, not for the TA to actively manage the assessment process, by (for instance) performing an early check on assessments and correcting misconceptions of relevance or excluding unreliable assessors. Now, such supervision of review teams by overseeing attorneys may (regrettably) not always occur in real productions, but it should surely represent best practice.
Webber, W., How Accurate Can Manual Review Be? IREvalEtAl (12/15/11). Better review process and project management are key, which is the next part of the secret.
How to Be Better Than Borg
Webber’s research shows that some of the human reviewers in TREC stood out as better than Borg. They beat the machines. Does this really surprise anyone in the review industry? Sure, human review may be (should be) dead as a way to review all documents in large-scale reviews, but it is alive and well as the most reliable method for final check of computer suggested coding, a final check for classifications like privilege before production.
This is a picture of humans and machines working together as a team, as friends, but not as Borg implants where machines dictate, nor as human slaves where smart machines are not allowed. I know that George Socha, whom I quoted in Tell Me Why?, much like one of my fictional heroes, Jean Luc Picard, was glad to escape the Borg enslavement. So too would most contract lawyers who are stuck in dead-end review jobs with cruel employers. By the way, this embarrassing, unprofessional, contract-lawyers-as-slaves mentality was shown dramatically by some of the reader comments to Contract Coders: e-Discovery’s “Wasting Asset”? They report incredible incidents of abuse by some law firms. Some of the private complaints I have heard from document reviewers about abuse and mismanagement are even worse than these public comments. The primary rule of any relationship must always be mutual respect. That applies to contract lawyers, and, if they are a part of your team, even to artificial intelligence agents like Watson, Siri, and their predictive coding cousins. Get to know and understand your entire team, and appreciate their respective strengths and weaknesses.
Webber’s study shows that the quality of the individual human reviewers on a team is paramount. He makes several specific recommendations in section 3.4 of his report for improving review team quality, including:
Dual assessment, for instance, can help catch random errors of inattention, while second review by an authoritative reviewer such as the supervising attorney can correct misconceptions of relevance during the review process, and adjust for assessor errors once it is complete [Webber et al., 2010]. …
[S]ignificant divergence from the median appears to be a partial, though not infallible, indicator of reviewer unreliability. A simple approach to improving review team quality is to exclude those reviewers whose proportion relevant are significantly different from the median, and re-apportion their work to the more reliable reviewers. …
Fully excluding reviewers based solely on the proportion of documents they find relevant is a crude technique. Nevertheless, the results of this section suggest that this proportion is a useful, if only partial, indicator of reliability, one which could be combined with additional evidence to alert review managers when their review process is diverging from a controlled state. It may be that review teams with better processes, such as the team from Topic 207, already use such techniques. Therefore, they need to be considered when a benchmark for manual review quality is being established, against which automatic techniques can be compared.
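The outlier check Webber describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reviewer names and proportions are made up, and the fixed `tolerance` cutoff is a hypothetical stand-in for whatever statistical test of "significantly different from the median" a real review manager would use.

```python
from statistics import median


def flag_outlier_reviewers(proportion_relevant, tolerance=0.15):
    """Return reviewers whose proportion relevant is far from the team median.

    proportion_relevant: dict mapping reviewer name -> fraction of their
        documents that they coded relevant
    tolerance: hypothetical cutoff (absolute distance from the median);
        a real process would use a proper significance test instead
    """
    team_median = median(proportion_relevant.values())
    return sorted(
        name
        for name, proportion in proportion_relevant.items()
        if abs(proportion - team_median) > tolerance
    )


# Made-up team: four reviewers find about 10% relevant, one finds 45%.
team = {"A": 0.10, "B": 0.12, "C": 0.11, "D": 0.45, "E": 0.09}
flagged = flag_outlier_reviewers(team)
print(flagged)
```

As the quoted passage cautions, a flag like this is a partial indicator at best. It tells a manager where to look, not whom to fire.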
Webber’s conclusion summarizes his findings and bears close scrutiny, so I quote it here in full:
5. CONCLUSIONS. The original review from which Roitblat et al. draw their data cost $14 million, and took four months of 100-hour weeks to complete. The cost, effort, and delay underline the need for automated review techniques, provided they can be shown to be reliable. Given the strong disagreement between manual reviews, even some loss in review accuracy might be acceptable for the efficiency gained. If, though, automated methods can conclusively be demonstrated to be not just cheaper, but more reliable, than manual review, then the choice requires no hesitation. Moreover, such an achievement for automated text-processing technology would mark an epoch not just in the legal domain, but in the wider world.
Two recent studies have examined this question, and advanced evidence that automated retrieval is at least as consistent as manual review [Roitblat et al., 2010], and in fact seems to be more reliable [Grossman and Cormack, 2011]. These results are suggestive, but (we argue) not conclusive as they stand. For the latter study in particular (leaving questions of potential bias in the appeals process aside), it is questionable whether the assessment processes employed in the track truly are representative of a good quality manual review process.
We have provided evidence of the greatly varying quality of reviewers within each review team, indicating a lack of process control (unsurprising since for four of the seven topics the reviewers were not a genuine team). The best manual reviewers were found to be as good as the best automated systems, even with the asymmetry in the evaluation setup. The one, professional team that does manage greater internal consistency in their assessors is also the one team that, as group, outperforms the best automated method. We have also pointed out a simple, statistically based method for improving process control, by observing the proportion of documents found relevant by each assessor, and counseling or excluding those who appear to be outliers.
Above all, it seems that previous studies (and this one, too) have not directly addressed the crucial question, which is not how much different review methods agreed or disagree with each other (as in the study by Roitblat et al. ), nor even how close automated or manual review methods turn out to have come to the topic authority’s gold standard (as in the study by Grossman and Cormack ). Rather, it is this: which method can a supervising attorney, actively involved in the process of production, most reliably employ to achieve their overriding goal, to create a production consistent with their conception of relevance. There is good, though (we argue) so far inconclusive, evidence that an automated method of production can be as reliable a means to this end as a (much more expensive) full manual review. Quantifying the tradeoff between manual effort and automation, and validating protocols for verifying the correctness of either approach in practice, are particularly relevant in the multi-stage, hybrid work-flows of contemporary legal review and production. Given the importance of the question, we believe that it merits the effort of a more conclusive empirical answer.
The evidence shows that it is at least very difficult, perhaps even impossible (I await more science to form a definite opinion), for us humans to maintain the concentration necessary to review tens of thousands of documents, day in and day out, for weeks. Sure we can do it for a few hours, and for 500 or so documents, but for 8-10 hours a day with tens or hundreds of thousands of documents for weeks on end? I doubt it. We need help. We need suggestive coding. We need a team that includes smart computers.
Know Your Team’s Strengths and Weaknesses
The challenge to human reviewers becomes ridiculously hard when you ask them to not only make relevancy calls, but, at the same time, to also make privilege calls, and confidentiality calls, and, here is the worst, multiple case issues categorization calls, a/k/a, issue tagging. Experience shows that the human mind cannot really handle more than five or six case issues at a time, at least when reviewing all day. But I keep hearing tales of lawyers asking reviewers to make ten to twenty case issue calls for weeks on end. If you think it is hard to get consistent relevancy calls, just think of the problem of putting relevant docs into ten to twenty buckets. Might as well throw darts. That is a scientific experiment I’d like to see, one testing the efficacy of case issue tags. How many categorizations can humans really handle before it becomes a complete waste of time?
I call on e-discovery lawyers everywhere to better understand their team members and stop asking them to do the impossible. Issue tagging must be kept simple and straightforward for the human members of your team to deal with it. Using ten to twenty case-issue tags is a complete waste of time, perhaps with the exception of seed-set training, as thereafter Watson has no such limitations. But in so far as the final, out-the-door review goes, do not encumber your humans with mission impossible tasks. Know your team members, their strengths and weaknesses. Know what the humans do best, like catch obvious bloopers beyond the ken of present day AI agents, and do not expect them to be as tireless as machines.
The review process improvements mentioned by Webber, and other safeguards touted by most professional review companies who truly understand and care about the strengths and weaknesses of their team, will certainly mitigate the problems inherent in all human review. In my mind the most important of these are experience, training, mutual respect, good working conditions, motivation, and quality controls, including quick terminations or reassignments when called for. More innovative methods are, I believe, just around the corner, such as game theory applications discussed by Lawrence Chapin in Contract Coders: e-Discovery’s “Wasting Asset”? But the bottom line will always be that computers are much better at complex repetitive drudgery tasks such as reviewing tens of thousands, or millions, of documents. Thankfully our minds are not designed for this, whereas computers are.
Reviewers Need Subject Matter Expertise and Money Motivation
Based on my experience as a reviewer and supervisor, the human challenges to make review determinations over large scales of data are magnified when the human reviewers are not themselves subject matter experts, and magnified even further when the reviewers have no experience in the process. This was not only true of all of the student volunteer reviewers at TREC, but is also sometimes true in real world practice as well. That just invites error. Training is part of the solution to that.
It is also my supposition that in our culture the errors are magnified again when there is no, or inadequate, compensation provided. All TREC reviewers were unpaid volunteers except for the professional review team members. They were paid by the companies they work for, although those companies were not paid, and the rate of pay to the individuals is unknown. Still, can you be surprised that the top reviewers, the ones who beat the machines, were all paid, and only a few of the student teams came close? In our culture money is a powerful motivator. That is another reason to have better funded experiments that come closer to real world conditions. The test subjects in our experiments should be paid.
The same principle applies in the real world too. Contract review companies should stop competing on price alone and we consumers should stop being fooled by that. Quality is job number one, or should be. Do you really think the company with the lowest price is providing the best service? Do you think their attorney reviewers don’t resent this kind of low pay, sometimes in the $15-$20 per hour range? Most of these lawyers have six-figure student loans to pay off. They deserve a fair wage and, I hypothesize, will perform better if they are paid better.
To test my money-motivation theory I’d love to see an experiment where one review team is paid $25 an hour, and another is paid $75. Be real and let them know which team they are on. Then ask both to review the same documents involving weeks of grueling, boring work. Add in the typical vagaries of relevance, and equal supervision and training, and then see which team does better. Maybe add another variation where there is a stick added to the carrot and you can be fired for too many mistakes. Anyone willing to fund such a study? A contract review company perhaps? (Doubtful!) Better yet, perhaps there is a tech company out there willing to do so, one that competes with cheap human review teams? They should be motivated by money to finance such research (why would most contract review companies want this investigated?). The research would, of course, have to be done by bona fide third-party scientists in a peer review setting. We don’t want the profit motive messing with the truth and objective science.
Secret of Sampling
There is one more fundamental thing you need to understand about the TREC tests, indeed all scientific tests, one which I suppose you could also call a secret since so few people seem to know it, and that is, no one, I repeat, no person, ever sat down and looked at all of the 685,592 documents under consideration in the 2010 TREC Legal Track interactive tasks. No one has ever looked at all of the documents in any TREC task. No person, much less a team of subject matter experts with three-pass reviews as I discussed in Part One, has determined the individual relevancy, or not, of all of these documents by which to judge the results of the software assisted reviews. All that happened (and I don’t mean that negatively) is that a random sample of the 685,592 documents was reviewed by a variety of people.
I have no trouble with sampling and do not think it really matters that only a random sample of the 685,592 corpus was reviewed. Sampling and math are the most powerful tools in every information scientist’s pocket. It seems like magic (much like the hash algorithms), but random sampling has been proven time and again to be reliable. For instance, a sample of 2,345 documents is needed to know the contents of 100,000, with a 95% confidence level and a +/-2% confidence interval. Yet for a collection of 1,000,000 with the same confidence level and interval, a sample of only 2,395 is required (just 50 more to sample 900,000 more documents). If you add another zero and seek to know about 10,000,000 documents, you need only sample 2,400.
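Those three sample sizes can be reproduced with the standard sample-size formula for a proportion plus a finite population correction. The sketch below is a worked example under assumptions of my own choosing: a z-score of 1.96 for 95% confidence, the worst-case proportion of 0.5, and rounding to the nearest whole document. It is not a statement of how any particular calculator is implemented.

```python
def sample_size(population, margin=0.02, z=1.96, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    population: number of documents in the collection
    margin: confidence interval half-width (0.02 means +/-2%)
    z: z-score for the confidence level (1.96 is the usual value for 95%)
    p: assumed proportion; 0.5 is the most conservative choice
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction shrinks it slightly for smaller collections.
    n = n0 / (1 + (n0 - 1) / population)
    return round(n)


for docs in (100_000, 1_000_000, 10_000_000):
    print(f"{docs:>10,} documents -> sample of {sample_size(docs):,}")
```

The uncorrected figure here is about 2,401, which is why the required sample barely moves as the collection grows from a hundred thousand documents to ten million.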
To play with the metrics yourself I suggest you see the calculator at http://www.surveysystem.com/sscalc.htm. For a good explanation of sampling see: Application of Simple Random Sampling (SRS) in eDiscovery, Manuscript By Doug Stewart, submitted to the Organizing Committee of the Fourth DESI Workshop on Setting Standards for Electronically Stored Information in Discovery Proceedings on April 20, 2011. Sampling is important. As I have been saying for over two years now, all e-discovery software should include a sampling button as a basic feature. (Many vendors have taken my advice, and I keep asking some of them to whom I made specific demands, to now call the new feature the Ralph Button, but they just laugh. Oh well:)
If the Human Review is Unreliable, Then so is the Gold Standard
The problem with average human review and the comparative measurements of computer assisted alternatives is not with the sampling techniques used to measure. The problem is that if the sample set created by average Joe or Jane reviewer is flawed, then so is the projection. Sampling has the same weakness as AI agent software, including predictive coding seed sets. If the seeds selected are bad, then the trees they grow will be bad too. They won’t look at all like what you wanted and the errors will magnify as the trees grow. It is the same old problem of garbage in, garbage out. I addressed this in Part One of this article, in the section, The Second Search Secret (Known Only to a Few): The Gold Standard to Measure Review is Really Made Out of Lead, but it bears repetition. It is a critical point that has been swept under the carpet until now.
Like it or not, aside from a few top reviewers working with relatively small sets, like the champs in TREC, most human review of relevancy in large-scale reviews is basically garbage, unless it is very carefully managed and constantly safeguarded by statistical sampling and other procedures. Also, if there is no clear definition of relevance, or if relevance is a constantly moving target, or both as is often the case, then the reviewers’ work will be poor (inconsistent), no matter what methods you use. Note this clear understanding of relevance is often missing in real world reviews for a variety of reasons, including the requesting party’s refusal to clarify under mistaken notions of work product protection, vigorous advocacy, and the like.
Even in TREC, where they claim to have clear relevancy definitions and the review sets were not that large, I’m told by Webber that:
TREC assessors disagree with themselves between 15% to 19% of the times when shown the same document twice (due to undetected duplication in the corpus).
That’s right, the same reviewers looking at the same document at different times disagreed with themselves between 15% and 19% of the time. For authority Webber refers to: Scholer et al., Quantifying Test Collection Quality Based on the Consistency of Relevance Judgements. As you start adding multiple reviewers to a project the disagreement rates naturally get much higher. That is in accord with most everyone’s experience and the scientific tests. If people cannot agree with themselves on questions of relevance, how can you expect them to agree with others? Despite a few champs, human relevancy review is generally very fuzzy.
Some Things Can Still Be Seen Through the Fuzzy Lenses
The exception to the fuzzy measurements problem, which I noted in Part One, is that the measures are not too vague for purposes of comparison, at least that is what the scientists tell me. Also, and this is very important, when you add the utility measures of time and money to review evaluation, which in the real world of litigation we must do, but has not yet been done in scientific testing, and do not just rely on the abstract measures of precision and recall, then computer assisted review must always win, at least in large-scale projects. We never have the time and money to manually review hundreds of thousands, or millions, of documents, just because they are in the custody of a person of interest. I don’t care what kind of cheap, poor quality labor you use. As Jason Baron likes to point out, at a fast review speed of 100 files per hr, and a cost of $50 per hour for a reviewer, it would still take $500 Million and 10 Million hours to review the 1 Billion emails in the White House.
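The White House arithmetic is easy to check. Here is a small Python sketch using only the figures quoted above; the function and parameter names are illustrative, not from any source.

```python
def manual_review_cost(documents, docs_per_hour, dollars_per_hour):
    """Return (reviewer hours, total cost in dollars) for a full linear review."""
    hours = documents / docs_per_hour
    cost = hours * dollars_per_hour
    return hours, cost


# Figures from the example: 1 billion emails, 100 files per hour, $50 per hour.
hours, cost = manual_review_cost(1_000_000_000, 100, 50)
print(f"{hours:,.0f} reviewer hours at a cost of ${cost:,.0f}")
```

Plug in your own collection size and review rate and the same two lines of math give a rough budget for any pure manual review.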
When you consider the utility measures of time and cost, it is obvious that pure manual review is dead. Even our weak, fuzzy comparative testing lens shows that manual and computer review precision and recall are about equal, and maybe the computer is even leading (hard to tell with these fuzzy lenses on). But when you add the time and cost measures, the race is not even close. Computers are far faster and should also be much cheaper. The need for computer assisted review to cull down the corpus, and then assist in the coding, is painfully obvious. The EDI study of a $14 Million review project by all too human contract coders with an overlap rate of only 28% proved that. Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010.
Going for the Gold
The old gold standard of average human reviewers, working in dungeons <smile>, unassisted by smart technology, and not properly managed, has been exposed as a fraud. What else do you call a 28% overlap rate? We must now develop a new gold standard, a new best practice for big data review. And we must do so with the help and guidance of science and testing. The exact contours of the new gold are now under development in dozens of law firms, private companies, and universities around the world. Although we do not know all of the details, we know it will involve:
- Bottom Line Driven Proportional Review where the projected costs of review are estimated at the beginning of a project (more on this in a future blog);
- High quality tech assisted review, with predictive coding type software, and multiple expert review of key seed-set training documents using both subject matter experts (attorneys) and AI experts (technologists);
- Direct supervision and feedback by the responsible lawyer(s) (merits counsel) signing under 26(g);
- Extensive quality control methods, including training and more training, sampling, positive feedback loops, clever batching, and sometimes, quick reassignment or firing of reviewers who are not working well on the project;
- Experienced, well motivated human reviewers who know and like the AI agents (software tools) they work with;
- New tools and psychological techniques (e.g. game theory, story telling) to facilitate prolonged concentration (beyond just coffee, $, and fear) to keep attorney reviewers engaged and motivated to perform the complex legal judgment tasks required to correctly review thousands of usually boring documents for days on end (voyeurism will only take you so far);
- Highly skilled project managers who know and understand their team, both human and computer, and the new tools and techniques under development to help coach the team;
- Strategic cooperation between opposing counsel with adequate disclosures to build trust and mutually acceptable relevancy standards; and,
- Final, last-chance review of a production set before going out the door by spot checking, judgmental sampling (i.e. search for those attorney domains one more time), and random sampling.
I have probably missed a few key factors. This is a group effort and I cannot talk to everyone, nor read all of the literature. If you think I have missed something key here, please let me know. Of course we also need understanding clients who demand competence, and judges willing to get involved when needed to rein in intransigent non-cooperators and to enforce fair proportionality. Also, you should always go for confidentiality and clawback agreements and orders.
Technology Assisted Review
When I say technology assisted review in the best practices list above, which is now a popular phrase, I mean the same thing as computer assisted review. I mean a review method where computerized processes are used to cull down the corpus, and then again to assist in the coding. In the first step, technology is used to cull out final selections of documents from a larger corpus for humans to review before final production. The probable irrelevant documents are culled out and not subject to any further human reviews, except perhaps for quality control random sampling. Keyword search is one very primitive example of that computer assisted culling. Concept search is another more recent, advanced example. There are many others. Think for instance of Axcellerate’s 40 automatically populated filters, which they collectively refer to as their Predictive Analytics™ step that I described in Part One of Secrets of Search.
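To make the quality control random sampling of the culled-out documents a little more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not a description of any vendor's tool: the function names `draw_sample` and `wilson_interval` are hypothetical, the 95% level (z = 1.96) is just a common choice, and the Wilson score interval is one of several reasonable ways to put error bars on a sampled proportion. It also ignores any finite population correction.

```python
import math
import random


def draw_sample(doc_ids, sample_size, seed=None):
    """Pick a simple random sample of document ids for human spot checking."""
    rng = random.Random(seed)
    doc_ids = list(doc_ids)
    return rng.sample(doc_ids, min(sample_size, len(doc_ids)))


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%).

    hits: sampled documents the human reviewer coded as relevant
    n:    total number of sampled documents
    """
    if n <= 0:
        raise ValueError("sample must contain at least one document")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical numbers: 50,000 culled-out documents, 400 sampled,
# 8 of the sampled documents turn out to be relevant after all.
culled = range(50_000)
sample = draw_sample(culled, 400, seed=1)
low, high = wilson_interval(hits=8, n=len(sample))
print(f"Estimated share of relevant documents left behind: {low:.1%} to {high:.1%}")
```

Keep in mind that the interval is only as good as the human judgments behind the sample, which is exactly the fuzziness problem this post is about.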
These days the software is so smart that technology assisted review can not only intelligently cull out likely irrelevant documents, it can also make predictions for how the remaining relevant documents should be categorized. That is the second step where all of the remaining documents are reviewed by software to predict key classifications like privileged, confidential, hot, and maybe even a few case specific issues. The software predicts how a human will likely code a document and batches documents out in groups accordingly. This predictive coding, combined with efficient document batching (putting into sets of documents for human review), makes the human review work easier and more efficient. For instance, one reviewer, or small review team, might be assigned all of the probable privileged documents, another the probable confidential for redaction, a third the probable hot documents, and the remaining documents divided into teams by case issue tags, or maybe by date, or custodian, all depending on the specifics of the case. It is an art, but one that can and should be measured and guided by science.
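For readers who like to see the idea in code, the routing and batching just described might be sketched roughly as follows. This is a toy illustration in Python, not how any particular review platform works. The category names, the 0.5 threshold, the "general" queue, and the functions `route_documents` and `make_batches` are all my own assumptions, and the scores stand in for whatever predictions the software actually produces.

```python
from collections import defaultdict


def route_documents(predictions, threshold=0.5, default_queue="general"):
    """Send each document to the review queue of its most likely category.

    predictions: dict of doc_id -> {category: predicted score between 0 and 1}
    Documents with no score at or above the threshold go to the default queue.
    """
    queues = defaultdict(list)
    for doc_id, scores in predictions.items():
        if scores:
            label, score = max(scores.items(), key=lambda kv: kv[1])
        else:
            label, score = default_queue, 0.0
        queue = label if score >= threshold else default_queue
        queues[queue].append(doc_id)
    return dict(queues)


def make_batches(doc_ids, batch_size):
    """Split one queue into fixed-size batches for individual reviewers."""
    return [doc_ids[i:i + batch_size] for i in range(0, len(doc_ids), batch_size)]


# Hypothetical predictions for three documents.
predictions = {
    "doc-001": {"privileged": 0.91, "hot": 0.20},
    "doc-002": {"privileged": 0.10, "hot": 0.72},
    "doc-003": {"privileged": 0.15, "hot": 0.30},
}
for queue, docs in route_documents(predictions).items():
    print(queue, make_batches(docs, batch_size=50))
```

In a real project the default queue would itself be split by issue tag, date, or custodian, as the case requires.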
I contrast this kind of technology assisted review with pure Borg type computer controlled review, where there is complete computer delegation, where the computer does all, with little or no human involvement, except for the first seed set generation of relevancy patterns. Here we trust the AI agent and produce all documents determined to be relevant and not-privileged. No human does a double-check of the computer’s coding before the documents go out the door. In my opinion, we are still far away from such total delegation, although I don’t rule it out someday. (Resistance is futile.) Do you agree?
Is anyone out there relying on 100% computer review with no human eye quality controls? Conversely, is there anyone out there who still uses pure (100%) human review? Who has humans (lawyers or paralegals) review all documents in a custodian collection (assuming, as you should, that there are thousands or tens of thousands of documents in the collection)? Is there anyone who does not rely on some little brother of Watson to review and cull out at least some of the corpus first?
More Research Please
The fuzzy standard of most human review is an inconvenient truth known to all information scientists. As we have seen, it has been known to TREC researchers since at least 2000 with the study by Ellen Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness, 36:5 Information Processing & Management 697, 701 (2000). Yet I for one have not heard much discussion about it. This flaw cuts to the core of information science, because without accurate, objective measurements, there can be no science. For that reason scientists have come up with many techniques to try to overcome the inherent fuzziness of relevancy determinations, in and outside of legal search. I concede they are making progress, and TREC legal track is, for instance, getting better every year, but, like Voorhees and Webber, I insist there is still a long way to go.
Maybe the best software programs (whatever they are) are far better than our best reviewers under ideal conditions (that’s what I think), maybe not. But the truth is, we don’t really know what our real precision and recall rates are now, we don’t really know how much of the truth we are finding. The measures are, after all, so vague, so human dependent. What are we to make of our situation in legal review where the Roitblat et al. study shows an overlap rate of only 28%? Here is Webber’s more precise information science language explanation that he made in reviewing my blog article in his blog:
The most interesting part of Ralph’s post, and the most provocative, both for practitioners and for researchers, arises from his reflections on the low levels of assessor agreement, at TREC and elsewhere, surveyed in the background section of my SIRE paper. Overlap (measured as the Jaccard coefficient; that is, size of intersection divided by size of union) between relevant sets of assessors is typically found to be around 0.5, and in some (notably, legal) cases can be as low as 0.28. If one assessor were taken as the gold standard, and the effectiveness of the other evaluated against it, then these overlaps would set an upper limit on F1 score (harmonic mean of precision and recall) of 0.66 and 0.44, respectively. Ralph then provocatively asks, if this is the ground truth on which we are basing our measures of effectiveness, whether in research or in quality assurance and validation of actual productions, then how meaningful are the figures we report? At the most, we need to normalize reported effectiveness scores to account for natural disagreement between human assessors (something which can hardly be done without task-specific experimentation, since it varies so greatly between tasks). But if our upper bound F1 is 0.66, then what are we to make of rules-of-thumb such as “75% recall is the threshold for an acceptable production”?
As Webber well knows, this means that such 75% or higher rules-of-thumb for acceptable recall are just wishful thinking. It means they should be disregarded because they are counter to the actual evidence of measurement deficiencies. The evidence instead shows that, when the overlap between reviewers is only 28%, the maximum possible F1 score (the harmonic mean of precision and recall) that can be measured objectively is only 44%. Demands in litigation for objective search recall rates higher than 44% fly in the face of the EDI study. Such a demand is unreasonable on its face, never mind the legal precedent for accepting keyword search or manual review. I understand that the research also shows that technology assisted reviews are at least as good as manual, but that begs the real question as to how good either of them is!
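For those who want to check the arithmetic behind the 0.66 and 0.44 figures in Webber's comment, here is a short Python sketch. The function names are mine, and the two example "reviewers" are made-up sets of document numbers chosen only so that their overlap comes out to 0.28. When one reviewer's relevant set is treated as the gold standard for the other, the F1 score follows directly from the Jaccard overlap J as F1 = 2J / (1 + J), because the two set sizes added together equal the union plus the intersection.

```python
def jaccard(a, b):
    """Overlap: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def f1_between(a, b):
    """F1 of one reviewer's relevant set scored against the other's."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def f1_from_jaccard(j):
    """Convert a Jaccard overlap into the corresponding F1 score."""
    return 2 * j / (1 + j)


# Two hypothetical reviewers who each tag 64 documents as relevant,
# agreeing on 28 of the 100 documents that either one tagged.
reviewer_a = set(range(0, 64))
reviewer_b = set(range(36, 100))

print(jaccard(reviewer_a, reviewer_b))     # overlap of 0.28
print(f1_between(reviewer_a, reviewer_b))  # 0.4375, the 44% figure
print(f1_from_jaccard(0.5))                # about 0.667, quoted as 0.66
```

So the 44% is not a property of any search tool. It is simply what a 28% overlap between two human reviewers looks like when it is restated as an F1 score.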
I personally find it hard to believe that with today’s technology assisted reviews we are not in fact doing much better than 44% or 66% recall, but then I think back to the lawyers in the 1980s in the Blair and Maron study: We are confident our search terms uncovered 75% of the relevant evidence. Well, who knows, maybe they did, but the measurements were wrong. Who knows how well any of us are doing in big data reviews? The fuzziness of the measures is an inconvenient truth that must be faced. The 44% max objective rate creates a lack of confidence interval that must be corrected. We have to significantly improve the gold standard, we have to upgrade the quality of reviews used for measurements.
This is one reason I call for more research, and better funded research. We need to know how much of the truth we are finding, we need a recall rate we can count on to do justice. Large corporations should especially step up to the plate and fund pure scientific research, not just product development. I trust you that it works, but, as President Reagan said, I still want you to verify. I still want you to show me exactly how well it works, and I want you to do it with objective, peer-reviewed science, and to use a gold standard that I can trust.
Trust But Verify
As it now stands, the confidence rates and error margins are too low for me to entirely trust Watson, much less his little brothers. The computer was, after all, trained by humans, and they can be unreliable. Garbage in, garbage out. I will only trust a computer trained by several humans, checking against each other, and all of them experts, well paid experts at that. Even then, I’d like to have a final expert review of the documents finally selected for production before they actually go out the door. After all, the determinations and samples are based on all too human judgments. If the stakes are high, and they usually are in litigation, especially where privileges and confidential information are involved, there needs to be a final check before documents are produced. That is the true gold standard in my world. Do you agree? Please leave a comment below.
Apology and Holiday Greetings from Ralph
Now I must apologize to my readers. I promised a two-part blog on Secrets of Search where the deepest secret would be revealed in Part Two, along with the seventh insight into why most lawyers in the world do not want to do e-discovery. But admit it, this Part Two is already too long, isn’t it (over 7,100 words)? How long can we mere mortals maintain our attention on this stuff? You already have a lot to think about here. So, it looks like I lied before. It now seems to me better to wait and finish this article in a Part Three, rather than ask you to read on and on.
So stay tuned friends, I promise this soap opera will finally come to a conclusion next time, when we are all much fresher and finally ready to hear the truth, the whole truth, and nothing but the truth about the secrets of search. (And yes, I really have four monitors at my desk, actually I have five when you include my personal MacBook Pro, which is by far my favorite computer.) Oh yeah, and the next blog may be late too. We’ll see how busy Santa keeps me. Happy Holidays!