Our Team’s Final Report on its participation in the 2016 TREC ESI search Conference has now been published online by NIST and can be found here; the final corrected version can be found here. TREC stands for Text Retrieval Conference. It is co-sponsored by a group within the National Institute of Standards and Technology (NIST), which in turn is an agency of the U.S. Commerce Department. The stated purpose of the annual TREC conference is to encourage research in information retrieval from large text collections.
The other co-sponsor of TREC is the United States Department of Defense. That’s right, the DOD is the official co-sponsor of this event, although TREC almost never mentions that. Can you guess why the DOD is interested? No one talks about it at TREC, but I have some purely speculative ideas. Recall that the NSA is part of the DOD.
We participated in one of several TREC programs in both 2015 and 2016, the one closest to legal search, called the Total Recall Track. The leaders and administrators of this Track were Professors Gordon Cormack and Maura Grossman. They also participated in their own Track each year.
[T]o speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems.
Our participation in TREC in 2015 and 2016 has demonstrated substantial improvements in retrieval methodologies. That is what we set out to do. That is the whole point of the collaboration between the Department of Commerce and the Department of Defense in establishing TREC.
The e-Discovery Team has a commercial interest in participation in TREC, not a defense or police interest. Although from what we saw with the FBI’s struggles to search email last year, the federal government needs help. We were very unimpressed by the FBI’s prolonged efforts to review the Clinton email collection. I was one of the few e-discovery lawyers to correctly call the whole Clinton email server “scandal” a political tempest in a teapot. I still do and I am still outraged by how her email review was handled by the FBI, especially with the last-minute “revelations.”
The executive agencies of the federal government have been conspicuously absent from TREC. They seem incapable of effective search, which may well be a good thing. Still, we have to believe that the NSA and other defense agencies are able to do a far better job at large-scale search than the FBI. Consider their ongoing large-scale metadata and text interception efforts, including the once Top Secret PRISM operation. Maybe it is a good thing the NSA does not share its abilities with the FBI, especially these days. Who knows? We certainly will not.
The e-Discovery Team’s commercial interest is to transfer Predictive Coding technology from our research labs into commercial products, namely our Predictive Coding 4.0 Method using Kroll Discovery EDR software. In our case, at the present time, “commercial products” means our search methods, time and consultations. But who knows, it may be reduced to a robot product someday like our Mr. EDR.
The e-Discovery Team method can be used on other document review platforms as well, not just Kroll’s, but only if they have strong active machine learning features. Active machine learning is what everyone at TREC was testing, although we appear to have been the only participant to focus on a particular method of operation. And we were the only team led by a practicing attorney, not an academic or software company. (Catalyst also fielded a team in 2015 and 2016, headed by information science Ph.D. Jeremy Pickens.)
The e-Discovery Team wanted to test the hybrid multimodal software methods we use in legal search to demonstrate substantial improvements in retrieval methodologies on real-world problems. We have now done so twice, participating in both the 2015 and 2016 Total Recall Tracks. The results in 2016 were even better than 2015. We obtained remarkable results in document review speed, recall and precision; although, as we admit, the search challenges presented at TREC 2016 were easier than most projects we see in legal discovery. Still, to use the quaint language of TREC, we have demonstrated the robustness of our methods and software.
These demonstrations, and all of the reporting and analysis involved, have taken hundreds of hours of our time, but there was no other venue around to test our retrieval methodologies on real-world problems. The demonstrations are now over. We have proven our case. Our standard Predictive Coding method has been tested and its effectiveness demonstrated. No one else has tested and proven their predictive coding methods as we have done. We have proven that our hybrid multimodal method of AI-Enhanced document review is the gold standard. We will continue to make improvements in our method and software, but we are done with participation in federal government programs to prove our standard, even one run by the National Institute of Standards and Technology.
To prove our point that we have now demonstrated substantial improvements in retrieval methodologies, we quote below Section 5.1 of our official TREC report, but we urge you to read the whole thing. It is 164 pages. This section of our report covers our primary research question only. We investigated three additional research questions not included below.
What Recall, Precision and Effort levels will the e-Discovery Team attain in TREC test conditions over all thirty-four topics using the Team’s Predictive Coding 4.0 hybrid multimodal search methods and Kroll Ontrack’s software, eDiscovery.com Review (EDR)?
Again, as in the 2015 Total Recall Track, the Team attained very good results with high levels of Recall and Precision in all topics, including perfect or near perfect results in several topics using the corrected gold standard. The Team did so even though it only used five of the eight steps in its usual methodology, intentionally severely constrained the amount of human effort expended on each topic and worked on a dataset stripped of metadata. The Team’s enthusiasm for the record-setting results, which were significantly better than its 2015 effort, is tempered by the fact that the search challenges presented in most of the topics in 2016 were not difficult and the TREC relevance judgments had to be corrected in most topics. …
This next chart uses the corrected standard. It is the primary reference chart we use to measure our results. Unfortunately, it is not possible to make any comparisons with BMI standards because we do not know the order in which the BMI documents were submitted.
The average results obtained across all thirty-four topics at the time of reasonable call using the corrected standard are shown below in bold. The average scores using the uncorrected standard are shown for comparison in parentheses.
- 91.57% Recall (75.46%)
- 65.90% Precision (57.12%)
- 76.64% F1 (57.69%)
- 124 Docs Reviewed Effort (124)
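For readers less familiar with these measures, F1 is simply the harmonic mean of Recall and Precision, which is why the averages above hang together the way they do. The short sketch below, in plain Python, merely illustrates that standard formula applied to the reported averages; it is not part of any TREC submission code.

```python
# A minimal illustration of the standard F1 formula.
# F1 is the harmonic mean of Recall and Precision.

def f1_score(recall: float, precision: float) -> float:
    """Return the harmonic mean of recall and precision."""
    return 2 * precision * recall / (precision + recall)

# Average scores at the time of reasonable call (corrected standard)
recall = 0.9157
precision = 0.6590

print(f"F1 = {f1_score(recall, precision):.2%}")  # prints: F1 = 76.64%
```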
At the time of reasonable call the Team had recall scores greater than 90% in twenty-two of the thirty-four topics and greater than 80% in five more topics. Recall of greater than 95% was attained in fourteen topics. These Recall scores under the corrected standard are shown in the below chart. The results are far better than we anticipated, including six topics with total recall – 100%, and two topics with both total recall and perfect precision, topic 417 Movie Gallery and topic 434 Bacardi Trademark.
At the time of reasonable call the Team had precision scores greater than 90% in thirteen of the thirty-four topics and greater than 75% in three more topics. Precision of greater than 95% was attained in nine topics. These Precision scores under the corrected standard are shown in the below chart. Again, the results were, in our experience, incredibly good, including three topics with perfect precision at the time of the reasonable call.
At the time of reasonable call the Team had F1 scores greater than 90% in twelve of the thirty-four topics and greater than 75% in two more. F1 of greater than 95% was attained in eight topics. These F1 scores under the corrected standard are shown in the below chart. Note there were two topics with a perfect score, Movie Gallery (100%) and Bacardi Trademark (100%), and three more that were near perfect: Felon Disenfranchisement (98.5%), James V. Crosby (97.57%), and Elian Gonzalez (97.1%).
We were lucky to attain two perfect scores in 2016 (we attained one in 2015), in topic 417 Movie Gallery and topic 434 Bacardi Trademark. The perfect score of 100% F1 was obtained in topic 417 by locating all 5,945 documents relevant under the corrected standard after reviewing only 66 documents. This topic was filled with form letters and was a fairly simple search.
The perfect score of 100% F1 was obtained in topic 434 Bacardi Trademark by locating all 38 documents relevant under the corrected standard after reviewing only 83 documents. This topic had some legal issues involved that required analysis, but the reviewing attorney, Ralph Losey, is an SME in trademark law, so this did not pose any problems. The legal issues were easy and not critical to a determination of relevance. This was a simple search involving distinct language and players. All but one of the 38 relevant documents were found by tested, refined keyword searches. One additional relevant document was found by a similarity search. Predictive coding searches were run after the keyword searches and nothing new was uncovered. Here machine learning merely performed a quality assurance role, verifying that all relevant documents had indeed been found.
The Team proved once again, as it did in 2015, that perfect recall and perfect precision are possible, albeit rare, using the Team’s methods on fairly simple search projects.
The Team’s top ten projects attained remarkably high scores, with an average Recall of 95.66%, an average Precision of 97.28% and an average F-Measure of 96.42%. The top ten are shown in the chart below.
In addition to Recall, Precision and F1, the Team per TREC requirements also measured the effort involved in each topic search. We measured effort by the number of documents that were actually human-reviewed prior to submission and coded relevant or irrelevant. We also measured effort by the total human time expended for each topic. Overall, the Team human-reviewed only 6,957 documents to find all the 34,723 relevant documents within the overall corpus of 9,863,366 documents. The total time spent by the Team to review the 6,957 documents, and do all the search and analysis and other work using our Hybrid Multimodal Predictive Coding 4.0 method, was 234.25 hours.
It is typical in legal search to try to measure the efficiency of a document review by the number of documents classified by an attorney in an hour. For instance, a typical contract review attorney can read and classify an average of 50 documents per hour. The Team classified 9,863,366 documents by reviewing only 6,957 documents, taking a total of 234.25 hours. The Team’s overall review rate for the entire corpus was thus 42,106 files per hour (9,863,366/234.25).
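The arithmetic behind that rate is simple division, but for readers who want to reproduce it, here is a minimal sketch using the figures reported above. The speed-up line at the end is our own derived contrast against the 50-documents-per-hour baseline, not a number taken from the TREC report.

```python
# Reproducing the review-rate arithmetic reported above.
corpus_size = 9_863_366   # total documents across all 34 topics
total_hours = 234.25      # total Team hours for search, analysis and review

overall_rate = corpus_size / total_hours
print(f"Overall classification rate: {overall_rate:,.0f} files per hour")  # ~42,106

# Contrast with a typical contract review attorney reading 50 documents per hour
manual_rate = 50
print(f"Speed-up over linear review: {overall_rate / manual_rate:,.0f}x")  # ~842x
```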
In legal search it is also typical, indeed mandatory, to measure the costs of review and bill clients accordingly. If we assume a high attorney rate of $500 per hour, then the total cost of the review of all 34 Topics would be $117,125. That is a cost of just over $0.01 per document. In a traditional legal review, where a lawyer reviews one document at a time, the cost would be far higher. Even if you assume a low attorney rate of $50 per hour and a review speed of 50 files per hour, the total cost to review every document for every issue would be $9,863,366. That is a cost of $1.00 per document, which is actually low by legal search standards.
Analysis of project duration is also very important in legal search. Instead of the 234.25 hours expended by our Team using Predictive Coding 4.0, traditional linear review would have taken 197,267 hours (9,863,366/50). In other words, the review of thirty-four projects, which we did part-time after work over one summer, would have taken a team of two lawyers using traditional methods, working 8 hours a day, every day, over 33 years! These kinds of comparisons are common in legal search.
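For anyone who wants to check these comparisons, the sketch below reproduces the cost and duration arithmetic using the same assumed rates stated above ($500 and $50 per hour, 50 files per hour, two lawyers working 8-hour days); every figure is derived from the numbers already reported, nothing more.

```python
# Reproducing the cost and duration comparisons with the rates assumed above.
corpus_size = 9_863_366
team_hours = 234.25                      # Predictive Coding 4.0, all 34 topics

# Team cost at the high $500/hour attorney rate
team_cost = team_hours * 500
print(f"Team cost: ${team_cost:,.0f}")                              # $117,125
print(f"Team cost per document: ${team_cost / corpus_size:.3f}")    # ~$0.012

# Traditional linear review at 50 files/hour and a low $50/hour rate
linear_hours = corpus_size / 50
linear_cost = linear_hours * 50
print(f"Linear review: {linear_hours:,.0f} hours, ${linear_cost:,.0f}")  # 197,267 hours, $9,863,366

# Two lawyers working 8 hours a day, every day of the year
years = linear_hours / (2 * 8 * 365)
print(f"Linear review duration: about {years:.1f} years")           # ~33.8 years
```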
Detailed descriptions of the searches run in all thirty-four topics are included in the Appendix.
We also reproduce below Section 1.0, Summary of Team Efforts, from our 2016 TREC Report. For more information on what we learned in the 2016 TREC see also: Complete Description in 30,114 Words and 10 Videos of the e-Discovery Team’s “Predictive Coding 4.0” Method of Electronic Document Review. Nine new insights that we learned in the 2016 research are summarized in the below diagram and described more specifically in the article.
Excerpt From Team’s 2016 Report
1.1 Summary of Team’s Efforts. The e-Discovery Team’s 2016 Total Recall Track Athome project started June 3, 2016, and concluded on August 31, 2016. Using a single expert reviewer in each topic, the Team classified 9,863,366 documents in thirty-four review projects.
The topics searched in 2016 and their issue names are shown in the chart below. Also included are the first name of the e-Discovery Team member who did the review for each topic, the total time spent by that reviewer and the number of documents manually reviewed to find all of the relevant documents in that topic. The total time of all reviewers on all projects was 234.25 hours. All relevant documents, totaling 34,723 by Team count, were found by manual review of 6,957 documents. The thirteen topics in red were considered mandatory by TREC and the remaining twenty-one were optional. The e-Discovery Team did all topics.
They were all one-person, solo efforts, although there was coordination and communications between Team members on the Subject Matter Expert (SME) type issues encountered. This pertained to questions of true relevance and errors found in the gold standard for many of these topics. A detailed description of the search for each topic is contained in the Appendix.
In each topic the assigned Team attorney personally read and evaluated for true relevance every email that TREC returned as a relevant document, and every email that TREC unexpectedly returned as irrelevant. Some of these were read and studied multiple times before we made our final calls on true relevance, determinations that took into consideration and gave some deference to the TREC assessor adjudications, but were not bound by them. Many other emails that the Team members considered irrelevant, and TREC agreed, were also personally reviewed as part of their search efforts. As mentioned, there were sometimes consultations and discussions between Team members as to the unexpected TREC opinions on relevance.
This contrasts sharply with participants in the Sandbox division. They never made any effort to determine where their software erred in predicting relevance, or for any other reason. They accepted as a matter of faith the correctness of all of TREC’s prior relevance assessments. To these participants, who were all academic institutions, the ground truth itself as to relevance was of no relevance. Apparently, that did not matter to their research.
All thirty-four topics presented search challenges to the Team that were easier, some far easier, than the Team’s attorneys typically face when leading legal document review projects. (If the Bush email had not been altered by omission of metadata, the searches would have been even easier.) The details of the searches performed in each of the thirty-four topics are included in the Appendix. The search challenges presented by these topics were roughly equivalent to the most simplistic challenges that the e-Discovery Team might face in projects involving relatively simple legal disputes. A few of the search topics in 2016 included quasi-legal issues, more than were found in the 2015 Total Recall Track. This is a revision that the Team requested and appreciated because it allowed some, albeit very limited, testing of legal judgment and analysis in the determination of true relevance in these topics. In legal search, legal analysis skills are obviously very important to relevance determinations. In most of the 2016 Total Recall topics, however, no special legal training or analysis was required for a determination of true relevance.
At Home participants were asked to track and report their manual efforts. The e-Discovery Team did this by recording the number of documents that were human reviewed and classified prior to submission. More were reviewed after submission as part of the Team’s TREC relevance checking. Virtually all documents human reviewed were also classified, although not all classified documents were used for active training of the software classifier. The Team also tracked effort by the number of attorney hours worked, as is traditional in legal services. Although the amount of time varied somewhat by topic, the average time spent per topic was only 6.89 hours. The average review and classification speed across all projects was 42,106 files per hour (9,863,366/234.25).
Again, for the full picture and complete details of our work, please see the e-Discovery Team’s complete 164-page report to TREC on its participation in the 2016 Total Recall Track.