Concept Drift and Consistency: Two Keys To Document Review Quality

High-quality, effective legal search, by which I mean a document review project that is high in recall, precision and efficiency, and proportionally low in cost, is the holy grail of e-discovery. Like any worthy goal it is not easy to attain, but unlike the legendary grail, there is no secret to finding it. As most experts already well know, it can be attained by:

  1. Following proven document search and review protocols;
  2. Using skilled personnel;
  3. Using good multimodal software with active machine learning features; and,
  4. Following proven methods for quality control and quality assurance.

Effective legal search is the perfect blend of recall and proportionate precision. See: Rule 26(b)(1), FRCP (creating a nexus between relevance and six proportionality criteria). The proportionality aspect keeps the cost down, or at least at a spend level appropriate to the case. The quality control aspects guarantee that effective legal review is attained in every project.
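
For readers who want the arithmetic behind those two measures, and behind the F1 scores reported later in this article, here is a minimal sketch. It is my own illustration of the standard formulas, not part of any TREC protocol, and the example numbers are hypothetical:

```python
# Minimal sketch (illustration only): how recall, precision, and the F1 score
# discussed in this post are computed from a review's results.

def recall(true_positives: int, false_negatives: int) -> float:
    """Share of all relevant documents that the review actually found."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives: int, false_positives: int) -> float:
    """Share of documents marked relevant that really are relevant."""
    return true_positives / (true_positives + false_positives)

def f1(r: float, p: float) -> float:
    """Harmonic mean of recall and precision; high only when both are high."""
    return 2 * p * r / (p + r)

# Hypothetical example: a review finds 900 of 1,000 relevant documents
# (recall 0.90) while marking 1,125 documents relevant overall (precision 0.80).
print(round(f1(0.90, 0.80), 3))  # 0.847
```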

The Importance of Quality Control Was a Lesson of TREC 2015

This need for quality measures was one of the many lessons we re-learned in the 2015 TREC experiments. These scientific experiments (it is not a competition) were sponsored by the National Institute of Standards and Technology. They are designed to test the latest in information retrieval technology, which at this point means the latest active machine learning software and methods. My e-Discovery Team participated in the TREC Total Recall Track in 2015. We had to dispense with most of our usual quality methods to save time and to fit into the TREC experiment format. We had to skip steps one, three, and seven, where most of our quality control and quality assurance methods are deployed. These methods take time, but they are key to consistent quality, and we would not do a large commercial project without them.

[Diagram: Predictive Coding Search, by Ralph Losey]

By skipping step one, which we had to do because of the TREC experiment format, and by skipping steps three and seven, where most of the quality control measures are situated, to save time, we were able to do mission impossible. A couple of attorneys working alone were able to complete thirty review projects in just forty-five days, and on a part-time, after-hours basis at that. It was a lot of work, approximately 360 hours, but it was exciting work, much like an Easter egg hunt with race cars. It is fun to see how fast you can find and classify relevant documents and still stay on track. Indeed, I could never have done it without the full support and help of the software and top experts at Kroll Ontrack. At this point they know these eight-step 3.0 methods pretty well.

In all, we classified over seventeen million documents as relevant or irrelevant. We did so at a truly thrilling average review speed of 47,261 files per hour! Think about that the next time your document review company brags that it can review from 50 to 100 files per hour. (If that were miles per hour, not files per hour, it would be almost twice as fast as Man has ever gone (Apollo 10 reentry).) Reviewers augmented with the latest AI, the latest CARs (computer assisted review), might as well be in a different Universe. Although 47,261 files per hour might be a record speed for multiple projects, it is still almost a thousand times faster than humans can go alone. Moreover, any AI-enhanced review project these days is able to review documents at speeds undreamed of just a few years ago.
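
The speed figure is simple division, but it is worth seeing the arithmetic laid out. The sketch below uses the approximate document count and hours reported above; because the inputs are rounded, the computed rate only approximates the 47,261 files-per-hour figure:

```python
# Back-of-the-envelope check of the review-speed claims above, using the
# approximate figures from this post.
documents_classified = 17_000_000   # "over seventeen million documents"
hours_worked = 360                  # "approximately 360 hours"

files_per_hour = documents_classified / hours_worked
print(f"{files_per_hour:,.0f} files per hour")              # roughly 47,222

# Compared with traditional human review at 50 to 100 files per hour:
print(f"{47_261 / 50:,.0f}x faster than 50 files/hour")     # ~945x, "almost a thousand times"
print(f"{47_261 / 100:,.0f}x faster than 100 files/hour")   # ~473x
```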

In most of the thirty review projects we were able to go that fast and still attain extraordinarily high precision and recall. In fact, we did so at levels never before seen at past TREC Legal Tracks, but we had a few problem projects too. In only twelve of the thirty projects were we able to attain record-setting high F1 scores, where both recall and precision are high. This TREC, like others in the past, had some challenging aspects, especially the search for target posts in the ten BlackHat World Forum review projects.

To get an idea of how well we did in 2015, as compared to prior legal teams at TREC, I did extensive research on the TREC Legal Tracks of old, as well as the original Blair and Maron study. Here are the primary texts I consulted:

  • Grossman and Cormack, Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review, CoRR abs/1504.06868, at pgs. 2-3 (estimating a Blair and Maron precision score of 20% and listing the top scores (without attribution) in most TREC years);
  • Grossman and Cormack, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR '14, July 6–11, 2014, at pgs. 24-27;
  • Hedin, Tomlinson, Baron, and Oard, Overview of the TREC 2009 Legal Track;
  • Cormack, Grossman, Hedin, and Oard; Overview of the TREC 2010 Legal Track;
  • Grossman, Cormack, Hedin, and Oard, Overview of the TREC 2011 Legal Track;
  • Losey, The Legal Implications of What Science Says About Recall (1/29/12).

Based on this research I prepared the following chart showing the highest F1 scores attained during these scientific tests. (Note that my original blog also identified the names of the participants with these scores, information gained from my analysis of public information, namely the five above-cited publications. Unidentified persons, whom I must assume were among the entities named, complained about my disclosure. They did not complain to me, but to TREC. Out of respect for NIST, the chart below has been amended to omit these names. My attitude towards the whole endeavor has, however, been significantly changed as a result.)

[Chart: Highest F1 scores attained at past TREC Legal Tracks, by year, with participant names omitted]

This is not a listing of the average score per year; such scores would be far, far lower. Rather, it shows the very best effort attained by any participant in that year in any topic. These are the high, high scores. Now compare that with not only our top score, which was 100%, but our top twelve scores. (Of course, the TREC events each year have varying experiments and test conditions, so direct comparisons between TREC studies are never valid, but general comparisons are instructive and frequently made in the cited literature.)

On twelve of the topics in 2015 the e-Discovery Team attained F1 scores of 100%, 99%, 97%, 96%, 96%, 95%, 95%, 93%, 87%, 85%, 84% and 82%. One high score, as we have seen in past TRECs, might just be chance, but not twelve. The chart below identifies our top twelve results and the topic numbers where they were attained. For more information on how we did, see the e-Discovery Team's 2015 Preliminary TREC Report. Also, come hear us speak at Legal Tech in New York on February 3, 2016, from 10:30-11:45 am. I will answer all questions that I can within the framework of my mandatory NDA with TREC. Joining me on the panel will be my teammate at TREC, Jim Sullivan, as well as Jason R. Baron of Drinker Biddle & Reath, and Emily A. Cobb of Ropes & Gray. I am not sure if Mr. EDR will be able to make it or not.

[Chart: e-Discovery Team's top twelve TREC 2015 F1 scores, by topic number]

The numbers and graphs speak for themselves, but still, not all of our thirty projects attained such stellar results. In eighteen of the projects our F1 score was less than 80%, even though our recall alone, or in some topics our precision, was higher. (Full discussion and disclosure will be made in the as yet unpublished e-Discovery Team Final Report.) Our mixed results at TREC were due to a variety of factors, some inherent in the experiments themselves (mainly the omission of Step 1, the difficulty of some topics, and the debatable gold standards for some of the topics), but also, to some extent, the omission of our usual quality control methods. Skipping Steps 3 and 7 was no doubt at least a factor in the below-average performance, by our standards, in some of the eighteen projects we were disappointed with. Thus one of the take-away lessons from our TREC research was the continued importance of a variety of quality control methods. See, e.g.: ZeroErrorNumerics.com. It is an extra expense, and takes time, but it is well worth it.

Consistency and Concept Drift

The rest of this article will discuss two of the most important quality control considerations: consistency and concept drift. Both have to do with the human review and classification of documents, which is step five in the eight-step standard workflow for predictive coding. On the surface the goals of consistency and drift in document review might seem opposite, but they are not. This article will explain what they are, why they are complementary rather than opposite, and why they are important to quality control in document review.

Consistency here refers to the coding of the same or similar documents, and document types, in the same manner. This means that a single reviewer determines relevance in a consistent manner throughout the course of a review project. It also means that multiple reviewers determine relevance in a consistent manner with each other. This is a very difficult challenge, especially when dealing with grey area documents and large projects.

The problem of inconsistent classifications of documents by human reviewers, even very expert reviewers, has been well documented in multiple information retrieval experiments. See, e.g.: Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt 697 (2000); Losey, Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Two (12/2/13). Fortunately, the best document review and search software now has multiple features that you can use to help reduce inconsistency, including the software I now use. See, e.g.: MrEDR.com.
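
One simple way to monitor the consistency problem is to measure how much two reviewers' relevance calls actually agree. The sketch below computes an overlap statistic of the kind Voorhees used to compare TREC assessors (size of the intersection of the two relevant sets divided by the size of their union); the reviewer sets and document IDs are, of course, hypothetical:

```python
# Hedged sketch: measuring inter-reviewer consistency with the "overlap"
# statistic (intersection / union of the two reviewers' relevant-document sets).
# The example data below is made up for illustration.

def overlap(relevant_a: set, relevant_b: set) -> float:
    """Overlap (Jaccard-style agreement) of two reviewers' relevant sets."""
    if not relevant_a and not relevant_b:
        return 1.0  # trivially consistent: neither reviewer found anything relevant
    return len(relevant_a & relevant_b) / len(relevant_a | relevant_b)

reviewer_1 = {"DOC-001", "DOC-002", "DOC-003", "DOC-007", "DOC-009"}
reviewer_2 = {"DOC-001", "DOC-003", "DOC-007", "DOC-010"}

print(f"Overlap: {overlap(reviewer_1, reviewer_2):.0%}")  # 50%
```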

Concept drift is a scientific term from the fields of machine learning and predictive analytics. (Legal Search Science is primarily informed by these fields, as well as by information retrieval. See, e.g.: LegalSearchScience.com.) As Wikipedia puts it, concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. In Legal Search the target we are trying to predict is legal relevance. See, e.g.: Rule 26(b)(1), FRCP.

The target is all of the relevant documents in a large collection, a corpus. The documents themselves do not change, of course, but whether they are relevant or not does change. The statistical properties are the contents of individual documents, including their metadata, that make them relevant or not. What changes is which of these properties signal relevance, and thus which documents are predicted to be relevant, as the concept of relevance evolves during the course of a document review.

In Legal Search, concept drift emerges from lawyers' changing understanding of the relevance of documents. In the Law this may also be referred to as relevance shift, or concept shift. In some cases the change is the result of changes in an individual lawyer's analysis. In others it is the result of formalized judicial processes, such as new orders or amended complaints. Most large cases have elements of both. Quality control requires that concept drift be done intentionally and with retroactive corrections for consistency. Concept drift, as used in this article, is an intentional deviation, a pivot in coding to match an improved understanding of relevance.
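
To make the idea of an intentional pivot with retroactive correction concrete, here is a hedged sketch of the kind of bookkeeping involved. The function and field names are hypothetical, not features of any particular review platform; the point is simply that when the relevance concept shifts, documents already coded under the old concept get re-checked against the new one:

```python
# Hypothetical sketch of retroactive correction after an intentional concept drift.
# When the definition of relevance pivots (for example, an amended complaint adds
# a new issue), documents coded under the old definition are queued for re-review
# so the whole corpus stays consistent with the current concept of relevance.

from dataclasses import dataclass

@dataclass
class CodedDocument:
    doc_id: str
    coded_relevant: bool
    coding_round: int          # which version of the relevance concept was in force

def retroactive_recheck(coded_docs: list[CodedDocument],
                        current_round: int,
                        matches_new_concept) -> list[str]:
    """Return IDs of documents coded under an older relevance concept whose
    coding may now be wrong and which therefore need human re-review."""
    needs_rereview = []
    for doc in coded_docs:
        if doc.coding_round < current_round:
            # matches_new_concept is a stand-in for whatever signal is available:
            # a keyword hit, a predictive-coding rank, or a similar-document cluster.
            if matches_new_concept(doc.doc_id) != doc.coded_relevant:
                needs_rereview.append(doc.doc_id)
    return needs_rereview
```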

Conversely, from a quality control perspective, you are trying to avoid two common project management errors. You are trying to avoid concept freeze, where your initial relevance instructions never shift, never drift, during the course of a review. You are also trying to avoid inconsistencies, typically by reviewers, but really from any source.

[Diagram: The two project management errors to avoid: concept freeze and inconsistency]

To be continued ….
