This is part two of my description and analysis of the official report of the 2011 TREC Legal Track search project. I urge you to read the original report that was published in July 2012: Overview of the TREC 2011 Legal Track, and of course, be sure to read Part One first.
Trying to Turn the Lead Standard Back to Gold
I have pointed out many times that manual review of large amounts of documents is not the gold standard it was once thought to be. Secrets of Search, Part One. In fact, I have taken to calling typical manual review a lead standard because every time scientists analyze manual review in large-scale projects they find incredibly high error rates. The inconsistency in coding of the Verizon project reported by the e-Discovery Institute was an astonishing 72%. I refer to this problem, as do a few information scientists I have spoken with about it, as the fuzzy lens problem.
This problem is well-known to the TREC Legal Track Coordinators (the folks in charge). In fact, two of the Coordinators in 2011 wrote an article dealing with an aspect of the problem. Grossman & Cormack, Inconsistent Responsiveness Determination in Document Review: Difference of Opinion or Human Error? They were trying to explain the cause of the high error rates, which is disputed. Maura Grossman and Gordon Cormack contend that the inconsistencies by prior TREC manual reviewers were caused by human error, not by a difference of opinion on what was relevant or not. Other information science researchers studying the problem have reached a different conclusion, suggesting that relevancy judgments are inherently fuzzy and not just the result of human error. Webber, Chandar & Carterette, Alternative Assessor Disagreement and Retrieval Depth (2012) (found a strong and consistent relationship between document rank and probability of disagreement). Regardless of the cause, the error remains.
The 2011 Coordinators made significant attempts to rehabilitate the once gold standard, and to make the human review of samples used to calculate test scores more reliable. They have no choice but to do so, for otherwise scientific studies and research are extremely limited. Glasses must be put on the manual review process to correct for the fuzzy lens problem and allow for clearer measurements of the efficacy of search. For instance, how accurate is the coordinators' estimated number of responsive documents for each topic, upon which each participant's effectiveness score is based?
If Grossman and Cormack are correct, and legal relevance itself is not the cause of the fuzziness, then it should be possible to correct the measurement problems, or at least dramatically mitigate the errors, by safeguarding against human mistakes. If they are wrong, then these procedures will be less effective at mitigating the consistency errors.
Personally I think legal relevance itself is inherently fuzzy, to a point, and that both human error and vague relevancy are to blame for inconsistent reviews. Regardless, I concede, as does everyone else that I am aware of, that human error is at least partly to blame, and thus that corrective lenses for this problem are appropriate. I just think we should think in terms of bifocals: one lens to correct for human errors, simple mistakes, and another lens to correct for relevancy vagaries. But regardless of our efforts, the results may always be somewhat fuzzy, much like law itself.
Here is how the report at page two describes the gold standard problem.
In order to measure the efficacy of TREC participants’ efforts, it is necessary to compare their results to a gold standard indicating whether or not each document in the collection is responsive to a particular discovery request. The learning task had three distinct topics, each representing a distinct request for production.
Ideally, a gold standard would indicate the responsiveness of each document to each topic. Because it would be impractical to use human assessors to render these two million assessments, a sample of documents was identified for each topic, and assessors were asked to code only the documents in the sample as responsive or not. Since errors in the gold standard can have substantial impact on evaluation, redundant independent assessments were made for the majority of the sampled documents, and disagreements were adjudicated by the Topic Authority.
I suspect the comment about the impracticality of human assessors rendering two million assessments is somewhat tongue-in-cheek. Not only would it be impractical, meaning too expensive and time-consuming, it would also be completely worthless. No serious student of legal search today still thinks that review of all documents, with no computer filtering of any kind, is a gold standard.
I suppose the Coordinators still have to write like this and talk about review of every document as a gold standard, as a bridge to hopelessly out-of-date lawyers who may read the report. I am talking about the lawyers who have never done large-scale document reviews, except at such a high level that they do not really know what is going on (not uncommon). These lawyers, typically my age, still live in a dream world based upon their experience of reading thousands of paper documents in the eighties and nineties. These old paper-lawyers still extrapolate their small volume paper experience onto today's high volume digital reality. They imagine that hundreds of lawyers carefully reading millions of documents is still the gold standard. Ask any contract lawyer off the record who has worked in a tiny cubicle for months on end what a gold standard it really is. Try it yourself for a few days. Read the studies.
The authors go on to describe the new gold standard with corrective lenses that they devised for measurements of the 2011 task.
A total of 16,999 documents – about 5,600 per topic – were selected and assessed to form the gold standard. The documents that were selected met one or more of the following four criteria:
1. All documents that were identified by the Track coordinators to be potentially responsive in the course of developing the topics before the start of the task;
2. All documents submitted by any team for responsiveness determination;
3. All documents ranked among the 100 most probably responsive by any submission;
4. A uniform random sample of the remaining documents.
As you can see from this description, the corrective lens created in 2011 was redundant independent assessment. The documents were selected for a second assessment based on judgmental and random sampling. The first three criteria, all judgmental sampling, selected 11,612 documents. The fourth criterion, random sampling, selected only 5,387 documents. (Although the report does not specify the exact sampling criteria, the 5,387 number is apparently based on a 95% confidence level, a confidence interval of 1.33%, and an assumed prevalence of 50%.) Each of the 5,387 randomly sampled documents was reviewed twice in this final quality control step, since most would not have been previously reviewed, and thus that was the only way the review of these documents could be redundant.
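The sample-size arithmetic implied by that parenthetical can be checked with the standard formula for estimating a proportion. This is only a sketch of my inference, assuming simple random sampling, the normal approximation, and z = 1.96 for 95% confidence; the report itself does not state its method:

```python
import math

def sample_size(margin, z=1.96, prevalence=0.5):
    """Sample size needed to estimate a proportion within +/- margin."""
    return math.ceil(z**2 * prevalence * (1 - prevalence) / margin**2)

def margin_of_error(n, z=1.96, prevalence=0.5):
    """Margin of error achieved by a simple random sample of size n."""
    return z * math.sqrt(prevalence * (1 - prevalence) / n)

print(sample_size(0.0133))              # 5430 -- close to the 5,387 actually drawn
print(round(margin_of_error(5387), 4))  # 0.0134 -- about a 1.33-1.34% interval
```

The slight mismatch (5,430 versus 5,387) suggests the coordinators used a slightly different z-value or rounding, but a 1.33% interval at 50% assumed prevalence is clearly the right neighborhood.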
In cases where the two reviews differed, the designated Topic Authority adjudicated the conflicting assessments. Thus a triple review was possible under criteria two and four above for documents where the first two reviews were inconsistent. The documents selected under criterion one above (the topic coordinators' documents) do not appear to have been subject to a redundancy check or Topic Authority review. The criterion three documents (the 100 highest ranked by any submission) also do not appear to have been double-checked.
The reviews were done by four e-discovery vendors with professional review teams who generously donated their services: ACT Litigation Services, Inc., Business Intelligence Associates, Inc. (“BIA”), Daegis, and IE Discovery, Inc. It is not known whether the individual reviewers were paid, but the authors of the study stated that they presumed they were. (I hope they were too, and at their full rates, but perhaps these companies would leave a comment below to confirm these assumptions? This has to do with the observation that I made in a prior article that money is a motivator. Secrets of Search – Part II)
The particular qualifications and experience of the reviewers were also not stated. But knowing these companies, and knowing that they are all top-notch, it is safe to assume that the reviewers they employed were all attorneys who were very experienced with review.
The report at page three also states that the professional review companies each used their own established commercial practice, including their quality assurance procedures, but it does not describe what those practices and procedures were, nor how they might differ from one company to another. It would be good to know what these procedures were, and to have all reviewers follow the same procedures. Still, I know several of these companies, and know that they have excellent internal procedures. I do not know what variables there may be between them. Ideally one review company would have done the whole thing, but that is asking a lot for pro bono work.
Great efforts were made in TREC 2011 for standardization to try to eliminate variables in review. For instance, each reviewer was provided with an orientation. Although the report does not specifically state this, I have to assume that the orientation for each reviewer was exactly the same. This could have been done, for instance, by all reviewers attending the same phone conference, or by all watching the same videotaped presentation. I think it was the former, but I am not sure from the report. I certainly hope it was not ad hoc orientation done at different times, or provided by different people, answering different questions and the like.
Each reviewer was also provided with detailed guidelines created by a Topic Authority on how to determine relevancy on the three issues. Again, I assume they were each given the exact same notebook. The review platforms provided to the reviewers were not exactly the same, although I assume the issue tags in each were identical. The differences in software were probably not important, but they do introduce another variable.
It is interesting to note that each review software platform included a neat feature called a “seek assistance” link. All reviewers were encouraged to use it to request that the Topic Authority resolve any uncertainties. The information on individual reviewer usage was probably kept (I cannot tell from the report), and this may be useful for the ad hoc scientific analysis of the 2011 project that will certainly follow.
We are also not told in the report the total number of reviewers, but again this information might be available for scientific research. In my experience the quality of any review is dramatically affected by the number of reviewers. The fewer the better, assuming they are not overburdened and have enough time to complete the task. There is no information in the Overview report on these key facts, although again, it may exist in supplemental reports.
This quality control process was designed primarily to catch human errors, but, it seems to me, the process also addresses the relevance malleability problem, at least to some degree. No doubt future studies will analyze the effectiveness of this attempt.
How Well Did the 2011 Participants Do?
I know this is the question on everyone's mind. But, to be honest, it is difficult to answer this question based on what you will read in the report. One almost gets the feeling of deliberate obfuscation. The charts shown below summarize some of the results and give you an example of what I am talking about. In view of the controversies TREC has been involved in over the past few years, this inclination to technical scientific obtuseness is understandable. As a personal player in some of the e-disco Hunger Games going on right now, I cannot really say much more.
Read the Overview of the TREC 2011 Legal Track report for yourself, including the charts, and the reports of participants that will be described at the end. Then also look for and study the additional commentaries and scientific reports that are bound to come out about this important experiment, the 2011 TREC Legal Track. You will find the results summarized in various lengthy charts and tables in the Overview. These tasks are not designed to find winners and losers. They are designed for participants to test out approaches of their own choosing within the context of a task that the coordinators design, and that is how most of the participants responded.
Here is how the coordinators address the contentiousness issues and their findings:
Some participants may have conducted an all-out effort to achieve the best possible results, while others may have conducted experiments to illuminate selected aspects of document review technology. It is inappropriate – and forbidden by the TREC participation agreement – to claim that the results presented here show that one participant’s system or approach is generally better than another’s. It is also inappropriate to compare the results of TREC 2011 with the results of past TREC Legal Track exercises, as the test conditions as well as the particular techniques and tools employed by the participating teams are not directly comparable.
An excerpt from a conclusion to the report, along with the report’s charts, provide about as clear an answer as you will find to the question of how well the participants did in 2011.
The 2011 TREC Legal Track was the sixth since the Track’s inception in 2006, and the third that has used a collection based on Enron email (see [5, 14, 12, 11, 6]). From 2008 through 2011, the results show that the technology-assisted review efforts of several participants achieve recall scores that are about as high as might reasonably be measured using current evaluation methodologies. These efforts require human review of only a fraction of the entire collection, with the consequence that they are far more cost-effective than manual review. There is still plenty of room for improvement in the efficiency and effectiveness of technology-assisted review efforts, and, in particular, the accuracy of intra-review recall estimation tools, so as to support a reasonable decision that “enough is enough” and to declare the review complete. Commensurate with improvements in review efficiency and effectiveness is the need for improved external evaluation methodologies that address the limitations of those used in the TREC Legal Track and similar efforts.
If you are looking for a take-away here, a blog-soundbite, I think it is found in this key excerpt from the above longer quote:
[T]he results show that the technology-assisted review efforts of several participants achieve recall scores that are about as high as might reasonably be measured using current evaluation methodologies. These efforts require human review of only a fraction of the entire collection, with the consequence that they are far more cost-effective than manual review.
Again, for a deeper understanding, I suggest you study the Overview of the TREC 2011 Legal Track report and its many charts. The charts especially bear close scrutiny. The report may not give you the clear answer to your questions about who won TREC, or how good any particular search software or strategy may be. But, with respect dear readers, these are not the right questions to ask. Beware of all marketing claims about search, and especially ones with easy buttons and claims of near perfect recall. It is a myth. See The Legal Implications of What Science Says About Recall.
Olympics and a Proposal for a John Henry Duel to the Death
Win-lose questions are not what TREC Legal Track is all about. Contrary to what some pseudo-experts have said, TREC is not a bake-off, much less an Olympics, and it certainly does not establish standards. I have no idea where the last assertion comes from, except perhaps that NIST is a co-sponsor. All the TREC officials are laughing at that assertion. Standards are a good idea, and a necessary next step. But that is not what TREC is all about. Instead, TREC Legal Track is an incubator of sorts, an event to test out software and strategies. It is not a race to be won.
If you want an Olympics, a race, you will have to start another event. Maybe a private vendor-sponsored group will do that. But it would have to have impeccable judges, and you know the thing about races: there are winners and losers. How many companies would really be willing to risk a loss?
Still, maybe we could have a John Henry type contest, where “steel-driving” manual review teams go head-to-head with software-powered partial reviews. Then there could be many winners, and only the pure, pro-linear, read-everything vendors would lose. And lose they surely would.
It would cost a million bucks or more to put on and do even a limited John Henry duel correctly, and there would still be the fuzzy-lens problem to declare a winner, but that would not matter, because it would not even be close. No one outside of Vegas would bet on the contract reviewers. Their work is as doomed as the buggy whip. They were never as heroic as John Henry anyway, who, by the way, was a freed slave. Plus remember, although John Henry won his contest with the steam-hammer that day, he died in the process.
Still, aside from the costs involved (and, as President Nixon famously said, we could raise a million dollars), I doubt very much that this kind of classic John Henry test will ever happen. Everyone in the business knows that humans with computer tools are better, that the gold standard of hundreds of reviewers spending months reading emails is a total fabrication, a myth. The manual reviewer attorneys know this better than anyone. What a joke. No manual review company would ever dare to participate. All the ones I know have already adopted some computer assistance and hybrid approach anyway.
Another Proposal for a Hybrid v. Borg Olympics
The only real test left is Hybrid v Borg. Most e-discovery specialists who are actually practicing lawyers say, bring it on Borg! We believe in the value of skilled legal judgment. We are not impressed with the one-off study by William Webber using high school students to do review. See my July 23, 2012 LTN article, Can High School Students Review E-Discovery Documents? We know from hard experience how difficult and subtle relevancy determinations can be, not to mention the many other types of legal distinctions that legal search requires, such as privilege, confidentiality, and issues of all types.
We think the “resistance is futile” talk is a bluff. Humans using computers, hybrid liberated Seven of Nine types, will beat computers that control, any day. The Borg queen and her hive-mind, monomodal approach are no match for the e-discovery federation using a multimodal approach. In the Borg predictive-coding-only approach, which TREC called Automatic, the reviewers just do what they are told and only review the documents that the computer selects. The computer drives the CAR. In the multimodal Hybrid approach, which TREC called TechAssist, the skilled attorneys have a say in what documents are reviewed and all types of search methods are used. We humans drive the Computer Assisted Review. The computer assists us, not vice versa.
Someday pure automated legal search may improve and be far better than it is now. Someday the baby Borg may grow up. Someday legal search may just be a random stroll where human legal judgment is marginalized. The CAR may drive itself. In the meantime, I prefer the Seven of Nine Hybrid human dominant approach. I for one want to keep my hands firmly on the wheel of the CAR.
Still, I know that many non-practicing attorney search experts think that we pro-Hybrid searchers are wrong, that the software is already there. They think we are just being needlessly over-diligent by using all of the other old tools, like Boolean keyword, concept, similarity and the like. We are just wasting our time with those old methods. Let the computer do it, after all, it does not have human biases. They would just use the top of my multimodal search pyramid. The human would just mark yes or no, relevant or not, on the documents the software selects for them.
We could settle this dispute outside of TREC in an Olympics type contest. Again, like the John Henry proposal, there could be many winners to this too. Let us pit teams using monomodal predictive coding only, the Automatic approach, against teams using multimodal with predictive coding included, the TechAssist approach. In fact, any team could compete against itself using dual methods. For instance, although I do not usually do so, I could use Inview in a monomodal way. I could just use random selection and predictive coding only. I could refrain from using any of the other search methods in Inview to find and supplement seed sets. Then I could do the same search using all of the search capacities, the full search pyramid, including my own judgment.
I assume the hybrid approach would be far better than the pure predictive coding approach, but at what cost? Which method would require the most time? I am not completely sure about that. My experiment this year with multimodal search allowed me to review 699,082 Enron documents in just 52 hours. See Day Nine of a Predictive Coding Narrative. But perhaps a monomodal approach that just used predictive coding would have taken far less time with equivalent results?
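For a sense of scale, the throughput figures just cited work out as follows. This is back-of-the-envelope arithmetic only; the 50 documents-per-hour manual pace is a hypothetical figure of my own for comparison, not something from the report:

```python
total_docs = 699_082   # Enron documents in the multimodal experiment cited above
hours = 52             # total search-and-review time reported

# Effective throughput with computer-assisted, multimodal review
effective_rate = total_docs / hours
print(round(effective_rate))  # ~13,444 documents per hour, effectively

# For contrast, a purely manual linear review at an assumed 50 docs/hour
manual_hours = total_docs / 50
print(round(manual_hours))    # ~13,982 hours of eyes-on reading
```

The effective rate is only possible because the software culls the bulk of the collection; no human actually reads thirteen thousand documents an hour.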
We need an open Olympics to find out whether the old methods are obsolete as some contend. We need an open battle of the Borg. Unlike my proposed John Henry Olympics, the Borg sympathizers have a chance. Even if they lose the battles, they might still win the wars, if they do not lose by too much. If the multimodal approach only attains a marginal improvement in accuracy, but at a much greater time and effort expenditure, is it really worth it?
Maybe the Borg are more mature than I think? I have been wrong before. Just ask my wife. Plus, the older I get, the less I know. In a few years I may know Nothing and forget to check for poison in my wine. Anyway, I am not quite arrogant enough now to assert the absolute superiority of Hybrid Multimodal, at least not without further experiments. It is not a done deal. I will give the Borg that much. The 2011 TREC results suggest that TechAssist is better than Automatic, but the results were inconclusive. Plus, TREC is not a contest.
We need another forum for a true contest like that. We need another e-discovery Olympics, this time a Borg v Hybrid Olympics. This contest could maximize the winners, and minimize the losers, by requiring all teams to use dual approaches. Like TREC itself, such a contest would have to be above reproach, with objective science-based judging and use of expensive adequate corrective lenses for a credible gold standard. Again, it would probably cost a million dollars to do right. At the end of the day, there would be winners and losers. For that reason, most vendors would probably not have the stomach for it. Still, if they did, it could teach us a lot. It could even advance one of the goals of TREC itself, and speed the transfer of technology from research labs into commercial products.
To summarize, I propose two Olympics for those vendors interested in proving that their software and methods are truly the best – a John Henry, Man versus Machine Olympics, and a Borg, Hybrid-Multimodal versus Pure-Predictive-Coding Olympics. Such open contests are risky to commercial participants, and probably will not come about for that reason. But only such events, not TREC, will satisfy natural consumer interest. Either that or an independent testing laboratory type group, but nothing like that exists now.
Standards and Best Practices
Another misconception some people have about TREC is that it has created legal search standards or best-practices. That is not at all what TREC Legal Track is about. A new standards-focused group may be coming, hopefully soon. I expect Jason R. Baron will take action on standards next year, and it will be vendor oriented, not legal-services oriented. As to best-practices for legal services, I am working on that and will go public with my proposals soon.
I consider best-practices to be different from standards. Standards for me represent a kind of base-line agreement of minimal requirements. They are consensus driven. Best-practices are more aspirational in nature. You have seen me write about best-practices for search and review for months now. My Olympic rings symbols above are summary diagrams that I created on this topic. But my work as an e-discovery lawyer, where I have been doing nothing else for the past six years, has concerned all aspects of the legal practice of e-discovery, not just search and review. Although, as you may surmise, search topics are my favorite.
EDRM and EDBP
For the past several years I have been working on a best-practices model for all of e-discovery legal practice, including, but not limited to, search. Recently I had a breakthrough. I am close to going public with this work. Look for an announcement here in the next thirty days. It will have a practicing attorney perspective and use a work-flow model. The model to summarize the collection of best-practices will look somewhat similar to the well-known EDRM nine-step chart. That is the diagram below that the EDRM community of vendors, consultants, and law-firms came up with over seven years ago. Although all flow charts look somewhat alike, I am working hard to make the look and feel of my chart as distinctive as possible. I want to avoid any confusion with the EDRM chart.
Unlike the EDRM, my best-practices model will not be a reference model for all of electronic discovery. It will not attempt to cover all activities that go on in a project, only the legal services. That means it will not be concerned with any vendor or consultant activities. Their activities are not legal services. Indeed, they are not permitted to practice law or provide legal advice of any kind. My chart will focus solely on legal practice and legal services. It will be by and for lawyers only.
Thus, unlike the EDRM chart, the summary chart of best-practices that I have developed will be a completely lawyer-centric model. I have decided to call it Electronic Discovery Best Practices (“EDBP”) and last night reserved EDBP.net as its future home page.
The model I create and describe will not be a committee creation. That will be a strength, because it means it can change quickly and will not depend on compromises and politics. But I recognize that it is also a weakness. Although I am by nature an amalgamator and synthesizer of the thoughts of many, any creation by a single person lacks the wisdom of the crowd. I will correct for this weakness by inviting and remaining open to input. I will invite private and public comments on legal best practices from any practicing attorney. Dialogue will be welcome. No membership in any special group will be required for an attorney to provide input. I will also arrange for successors when the time comes for me to move on.
I expect the model to change from time to time as legal practice and technology changes, and as our analysis changes. If the EDBP does not change in seven years, the project will have been a failure. As Heraclitus said, the only constant is change. I expect the EDBP to change annually.
The purpose of EDBP will be to provide a model of best practices for use by law firms and corporate law departments. It will be designed to help all lawyers striving to implement best-practices for e-discovery in their legal practice.
The EDBP will not address best-practices in the vendor community, nor standards for vendors. EDBP is meant to complement the fine work done by EDRM, but its scope is different. EDBP will be limited to legal services and will not include vendor work. The EDBP will not provide standards for legal services either. It will not presume to establish minimum base levels of performance, wherein failure might be malpractice. To the contrary, EDBP will be aspirational and goal-oriented. It will be a lawyer-centric e-discovery reference that embodies an evolving understanding of excellence in legal services. The aspirational goals will, of course, always be subject to proportionality constraints. In litigation, one size does not fit all. What is a best practice for one size case might be needless over-kill for another.
Stay tuned for the public unveiling of the EDBP, but first I want to finish my analysis of TREC 2011. The third and final part of this Analysis of the Official Report on the 2011 TREC Legal Track will be published next week. It will consider the ten reports from the individual participants in the 2011 TREC Legal Track.
To be continued …
Ralph – as one of the companies whose human reviewers were proud to help code these documents, I can provide you with a few of the variables you questioned. First, our review attorneys are all salaried, so while we provided the work to TREC pro bono, nobody went without a paycheck. 😉 For TREC 2010, we used a small group (probably 10 reviewers), but for TREC 2011, we used a VERY small group – only three or four of our best reviewers. The reviewers who worked on TREC 2011 have a 98% or higher accuracy rating internally.
Our reviewers thoroughly enjoyed participating in the TREC project. It was exciting to be included in something that has such an impact (whether I agree with the impact or not!). So to answer your question regarding motivation for success, I’d say the team who reviewed from BIA were definitely motivated to perform well.
While the TREC paper states that the companies “established commercial practice, including their quality assurance procedures”, I’m not sure I’d agree. We were limited to an extremely basic review program; we were not supposed to take document family relationships into account; we were not supposed to take our own understanding of issues into account (basically, “tag the document on its face, regardless of what the previous or next document says”), etc. Our own internal practice differs enormously. Additionally, our quality assurance process is partially based on software (that we could not use, as the documents weren’t loaded into that software), as well as sampling alongside counsel to make sure we’re all on the same page. TREC allows for questions, but at no point do they come back and say, “this document is incorrect”, which would allow our team to adjust their understanding and make corrections.
At this point, I’m a firm believer in technology assisted coding – using both machines and humans. I have seen our TAC team code a large set of documents (300,000) at a 96% accuracy, while temporary reviewers (not our team) who reviewed the same documents had a 50% to 60% accuracy rating. If documents were always coded purely for relevancy, then I have no doubt that properly trained machines would “win” most of the time. However, nearly all of our reviews are for responsiveness (which, as we know, is not the same as relevance), privilege, confidentiality, issue tagging, etc. I have not seen any machine coding programs yet that are sophisticated enough to take all of those different areas into account. For those reasons, I am not ready to give up the humans. I’m happy to let the computers help though!