This is Part Two of a three-part blog series, so please read Part One first.
Scientific Experiments on Inconsistencies of Relevance Determinations in Large Scale Document Reviews
The foundational work in this area was done by Ellen M. Voorhees: Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt 697 (2000). The second study of interest to lawyers on this subject came ten years later, from Herbert L. Roitblat, Anne Kershaw and Patrick Oot: Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, Journal of the American Society for Information Science and Technology, 61 (2010) (draft found at Clearwell Systems). The next study with significant new data on inconsistent relevance determinations was by Maura Grossman and Gordon Cormack in 2012: Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012); see also Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011). The fourth and most recent study on the subject with new data is my own review experiment, done in 2012 and 2013: A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (2013).
Voorhees Study
Ellen M. Voorhees is a Computer Scientist at the National Institute of Standards and Technology (NIST). Her title is Group Leader in the Information Access Division of NIST. Her primary responsibility at NIST is to manage the Text REtrieval Conference (TREC) project. Voorhees is well qualified for this work. She received a B.S. in computer science from Pennsylvania State University, and Master’s and Ph.D. degrees in computer science from Cornell University. Ellen has made several comments on this blog and has been mentioned here several times before. I am grateful for the assistance she has provided over the years in helping me understand scientific research in this field.
Ellen Voorhees’ study was based upon analysis of TREC 4 data from 1995. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt 697 (2000). The relevance determinations of three expert reviewers were studied. There were 49 search topics run against the same set of 13,435 documents. This was done serially, not all at once, and the document set remained static.
The reviewer who created a particular topic was called the “primary assessor.” The two reviewers who independently double-checked the first relevance determinations were called the “secondary assessors.” Each reviewer created some of the topics and took turns serving as primary or secondary assessor across the 49 topics. For each of the 49 topics the two secondary assessors reviewed 200 documents that the primary assessor had marked relevant, if there were that many. If more than 200 documents had been marked relevant, which was apparently infrequent, a random sample of 200 of them was taken. A random sample of 200 documents judged irrelevant was also taken for each topic and given to each secondary assessor for review along with the documents previously judged relevant.
The secondary assessors would then each independently review the 400 or fewer documents provided to them for each topic and determine which were relevant and which irrelevant. This means there were three reviews made of up to 400 documents in each of the 49 topics. Presumably the secondary assessors knew that 200 of the documents provided to them for each topic had been judged irrelevant and that up to another 200 had been judged relevant. If I have read the report correctly, this means that a secondary assessor who received 400 documents would know that exactly half had been judged relevant. In my experience such information about prevalence is a strong clue that can assist a reviewer, and it would certainly influence their determinations.
Nevertheless, in spite of this prevalence knowledge, and in spite of the fact that all three assessors were search experts with very similar backgrounds as retired information analysts, the inconsistencies in their determinations were quite high (although Voorhees seemed to think it was a relatively low disagreement rate, at least as compared to prior smaller studies). As is shown in Table 1 of the report, the average agreement rate on determinations that documents were relevant, called overlap, between the primary assessor and one of the secondary assessors was 42%. The overlap between the primary assessor and the other secondary assessor was 49%. The average overlap between the two secondary assessors was 43%, and the overlap between all three of the assessors for the 49 topics was an average of only 30%. This means that two secondary reviewers disagreed on relevance 57% of the time, and three reviewers disagreed 70% of the time.
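To make the overlap arithmetic concrete, here is a minimal sketch in Python of how pairwise and three-way overlap (what the later studies call a Jaccard index) can be computed from each assessor’s set of relevance calls. The document identifiers and judgment sets below are hypothetical illustrations, not Voorhees’ actual data.

```python
def overlap(*judgment_sets):
    """Overlap (Jaccard index): documents every assessor marked relevant,
    divided by documents any assessor marked relevant."""
    intersection = set.intersection(*judgment_sets)
    union = set.union(*judgment_sets)
    return len(intersection) / len(union) if union else 0.0

# Hypothetical relevance calls by three assessors on one topic
primary     = {"doc01", "doc02", "doc03", "doc04", "doc05", "doc06"}
secondary_a = {"doc01", "doc02", "doc03", "doc07"}
secondary_b = {"doc02", "doc03", "doc05", "doc08", "doc09"}

print(f"primary vs. secondary A: {overlap(primary, secondary_a):.0%}")
print(f"primary vs. secondary B: {overlap(primary, secondary_b):.0%}")
print(f"secondary A vs. B:       {overlap(secondary_a, secondary_b):.0%}")
print(f"all three assessors:     {overlap(primary, secondary_a, secondary_b):.0%}")
```

Note that overlap counts only relevance calls; agreement on documents that everyone marks irrelevant is excluded, which is one reason overlap figures run so much lower than the overall agreement percentages reported in the studies discussed below.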
Here is Ellen Voorhees’ more detailed report on the reviewer inconsistencies:
The overlap shown in Table 1 for pairs of assessors is greater than the overlap in the earlier studies. This is not surprising since the NIST assessors all have a similar background (retired information analyst), had the same training for the TREC task, and judged documents under identical conditions. Indeed, it is perhaps surprising that the overlap is not higher than it is given how similar the judges are; this is yet more evidence for the variability of relevance judgments. For some topics, the judgment sets produced by the three assessors are nearly identical, but other topics have very large differences in the judgment sets. For example, the primary assessor judged 133 documents as relevant for Topic 219, and yet no document was unanimously judged relevant. One secondary assessor judged 78 of the 133 irrelevant, and the other judged all 133 irrelevant (though judged one other document relevant). Across all topics, 30% of the documents that the primary assessor marked relevant were judged non relevant by both secondary assessors. In contrast, less than 3% of the documents judged non relevant by the primary assessor were considered relevant by both secondary assessors.
Based upon the 57% inconsistency rate between two reviewers (43% overlap), Voorhees calculated that you could never know whether recall or precision rates of higher than 65% had been attained:
The recall and precision scores … for the two sets of secondary judgments imply a practical upper bound on retrieval system performance is 65% precision at 65% recall since that is the level at which humans agree with one another.
According to information scientist William Webber, who has provided me with invaluable assistance in understanding all of these studies, the 70% disagreement rate between all three reviewers (overlap of only 30%) would place a practical limit on precision and recall calculations of approximately 45%. This is the fuzzy lens problem I have written about before. Secrets of Search – Part One.
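Where do ceilings like 65% and 45% come from? Here is a rough back-of-the-envelope way to see it; this is my own simplification, not the calculation that Voorhees or Webber actually performed. Assume two assessors, A and B, each mark about the same number of documents relevant, say n each, and agree on i of them. Then the overlap J determines the recall and precision that either assessor would score if measured against the other as the gold standard:

```latex
\[
  J \;=\; \frac{|A \cap B|}{|A \cup B|} \;=\; \frac{i}{2n - i}
  \quad\Longrightarrow\quad
  \text{recall} \;=\; \text{precision} \;=\; \frac{i}{n} \;=\; \frac{2J}{1 + J}
\]
```

Plugging in the three-assessor overlap of 30% gives 2(0.30)/1.30, or about 46%, close to the 45% limit just mentioned; a 43% overlap gives about 60%, in the same ballpark as Voorhees’ 65% bound, which she computed directly from the secondary assessors’ actual recall and precision scores rather than from this simplification.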
These measurement ceilings frustrate any attempt to measure absolute recall or precision ratios, so all that software or search-method testing can ever do reliably is make relative comparisons. But, as will be explained later in this article, Ellen Voorhees confirmed in this study that comparisons between participants in the same search event are scientifically valid and can be relied upon to test relative performance. At the present time, at least, that is as clear a picture as we can get in Big Data search.
Roitblat, Kershaw, and Oot Study
The next study with new experimental results on reviewer inconsistencies was performed by Herbert L. Roitblat, Anne Kershaw and Patrick Oot. All three are well-known professionals in the e-discovery world. Their study considered a modern, real-world e-discovery review project, circa 2005, involving professional contract attorney reviewers, project managers, and quality controls. Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review. The project involved a “second request” by the Justice Department for production of certain documents the Department specified as relevant to its investigation of whether to approve Verizon’s acquisition of the long-distance carrier MCI for $8.44 billion.
According to the Roitblat, Kershaw, and Oot report a total of 1,600,047 documents were reviewed by two teams of contract attorneys retained by Verizon. One focused on review for relevance, and one focused on review for privilege. A total of 225 attorneys participated in this initial review. The attorneys spent about 4 months, 7 days a week, and 16 hours per day on the review at a total cost of $13,598,872.61 or about $8.50 per document. After review, a total of 176,440 items were produced to the Justice Department as relevant and not privileged.
Two re-review teams of professional reviewers were retained by the Electronic Discovery Institute (EDI), which sponsored the Roitblat, Kershaw, and Oot study. Each team was provided with the same random sample of 5,000 documents from the original review, and each independently classified every document in the sample. A study of their classifications showed that the two re-review teams disagreed on the classification of 1,487 documents. These documents were submitted to a senior Verizon litigator (Patrick Oot), who made a final decision on their relevance. He did so without knowledge of the specific decisions made about each document during the first review. EDI also had two e-discovery vendors run computer-assisted reviews of some type to independently search the entire 1,600,047-document corpus for relevant documents.
The Roitblat, Kershaw, and Oot study found that the two re-review teams agreed with the original review on about 76% and 72% of the documents, and with one another on about 70% of the documents. These percentages cover all determinations, both relevant and irrelevant. When the overlap is calculated as in the Voorhees study, where only relevance determinations are considered, the numbers drop sharply. The overlap between the two re-review Teams A and B was 28.1%. The overlap between Team A and the original production was only 16.3%, and the overlap between Team B and the original production was only 15.8%. By comparison, the overlap in the productions made using the two automated systems was significantly higher: 21% and 23%. According to William Webber’s analysis, even the 28% agreement rate between the two teams produces a maximum possible precision and recall measurement of about 44%. The even lower 16% overlap between the re-review teams and the original production would place a still lower limit on recall and precision calculations.
Grossman and Cormack Study
The next study of interest with new data on reviewer inconsistencies was by the best known couple in e-discovery, Gordon Cormack and Maura Grossman. Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012). It considered data from the TREC 2009 Legal Track Interactive Task. That TREC used a triple-pass method to try to mitigate the problem of inconsistent reviewer adjudication when measuring precision and recall. Without going into the details, in the triple-pass method the first-pass reviewers make their determinations and the participating teams make theirs. Where the participants disagree with a first-pass determination, they can appeal for a ruling from the topic’s subject matter expert (called a Topic Authority), who then makes a third and final adjudication. In the 2009 TREC the review teams that filed appeals attained a suspicious 89% success ratio, a result contrary to legal experience, and one which raised questions as to the efficacy of the review process. See William Webber, Re-examining the Effectiveness of Manual Review (2011).
This triple-pass method was followed in 2009, the first year TREC used the Enron dataset instead of the prior tobacco litigation collection, which was by then very dated and dirtied with paper-scanning errors. The Grossman and Cormack study looked at inconsistencies in the relevance determinations made under the triple-pass procedure. Maura and Gordon also personally re-reviewed a random sample of documents for which the first-pass reviewer’s responsiveness determinations were reversed by the TREC Topic Authority. They took a random sample of one hundred such documents from the appeals in each of the seven topics, for a total of 700 documents re-reviewed by them. (I personally repeated their experiment and re-reviewed these same documents under Gordon’s auspices this summer. Any attorney is invited to do the same, and need only contact Gordon to receive access to the 700-document test collection. It will provide you with greater insight into their experiment, and them with more data to improve their analysis.)
As to the seven topics considered in 2009, the average agreement for documents coded responsive by the first-pass reviewers was 71.2 percent (28.8% inconsistent), while the average agreement for documents coded non-responsive by the first-pass reviewer was 97.4 percent (2.6% inconsistent). Id. at 274 (parentheticals added). Specifically, over the seven topics studied in 2009 TREC:
A total of 49,285 documents—about seven thousand per topic—were assessed during the first-pass review. A total of 2,976 documents (5 percent) were appealed and therefore adjudicated by the Topic Authority. Of those appeals, 2,652 (89 percent) were successful; that is, the Topic Authority disagreed with the first-pass reviewer 89 percent of the time.
Id. at 281. Thus, while the overlap of relevance determinations was a very respectable 71.2% overall (this assumes all non-appealed determinations were correct, which is obviously not the case since some teams did not bother to appeal), the overlap among appealed decisions was a very low 11%.
When Gordon and Maura made a fourth review of a random sample of 700 appealed relevance classifications, and then compared their 700 determinations with those of the appeal reviewer (the third review, by the Topic Authority), they found, to their professed surprise, that they agreed with the Topic Authority’s final determinations 90% of the time. They found another 5 percent or so of the documents to be clearly responsive or clearly non-responsive, contradicting the determination of the Topic Authority. Only 5 percent of the documents were found to be arguable. Id. at 285. (No word yet from them on how my review compared with theirs, but I doubt my agreement rates were as high, no doubt in part because I found several of the Topic Authorities’ relevance definitions and instructions vague and confusing. For that reason I did not feel that I had a good understanding of several of the topics.)
Maura Grossman had served as the Topic Authority on one of the seven topics (Topic 204), and so during TREC she had made the third review of the documents appealed in that topic. When she re-reviewed ten disputed documents from the same topic over a year later as part of this study, she found that she disagreed with herself on five of the ten documents:
For three of the ten documents, the Topic Authority contradicted her earlier assessment; for two of the ten, the Topic Authority coded the documents as arguable. For only half of the documents did the Topic Authority unequivocally confirm her previous coding decision.
Id. at 286. Thus the overlap in Maura’s determinations on the same documents was 50%, with a disagreement or inconsistency rate of 50%. These inconsistencies arose in the topic in which she was designated as the SME. For details, see Gordon Cormack’s explanatory comment below.
Losey Study
The final study with new data on reviewer inconsistencies was mine. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (2013). Here I reviewed 699,082 Enron documents by myself, twice, in two review projects. Each project had the same relevance target and used nearly the same Kroll Ontrack predictive-coding-enhanced software, but a different methodology. In the first review I used my normal multimodal methods; in the second review, nine months later, I used what I call a Borg, or monomodal, methodology. In a later post hoc analysis of these two reviews I discovered that I had made 63 inconsistent relevance determinations on the same documents. I thus inadvertently contributed to the research on relevance inconsistencies.
Here are the specifics on this aspect of my research. In the first multimodal project I read approximately 2,500 individual documents to categorize the entire set. I found 597 relevant documents. In the second monomodal project I read 12,000 documents. I found 376 relevant documents. After removal of duplicate documents, which were all coded consistently thanks to simple quality controls employed in both projects, there were a total of 274 different documents coded relevant by one or both methods.
Of the 274 overlapping relevant categorizations, 63 were inconsistent. In the first project I found 31 documents to be irrelevant that I determined to be relevant in the second project. In the second project I found 32 documents to be irrelevant that I had determined to be relevant in the first project. An inconsistency in the coding of 63 out of 274 relevant documents represents an inconsistency rate of 23%.
While this kind of inconsistency by an SME might be seen as surprising, even embarrassing to some, this 23% inconsistency rate is in fact the lowest on record for a large-scale document review, and thus the best result ever documented. It is also better than the only other recorded instance of a single reviewer re-coding the same documents over a year later, namely Maura Grossman’s re-review of ten documents in Topic 204, mentioned above.
My re-review of 274 documents, in which I made 63 errors, yields an overlap or Jaccard index of 77% (211/274), which again is the highest on record. See Grossman Cormack Glossary, Ver. 1.3 (2012) (defining the Jaccard index and going on to state that expert reviewers commonly achieve Jaccard index scores of about 50%, and that scores exceeding 60% are very rare). This overlap or Jaccard index is shown in the Venn diagram below.
By comparison, the Jaccard indices in the Voorhees study were only 30% (three reviewers) and 45% (two reviewers). The Jaccard index in the Roitblat, Kershaw and Oot study was only 16% (two reviewers). The Jaccard index of the Topic Authority reviews done twice on the same documents by Maura Grossman in the 2009 TREC was 50% (one reviewer). The overlap for two reviewers’ relevance calls was 71% in the Grossman and Cormack study, if you assume all unappealed decisions were correct; if you consider only the appealed decisions, the Jaccard index was a dismal 11%. Thus, in the context of the only tests we have on the consistency of document review, the consistent coding of 211 out of 274 documents was extremely high.
Further, when you also consider the determinations of not-relevant, as was done in the Grossman and Cormack study, my consistency rate jumps to about 99% (1% inconsistent). Compare this with the Grossman and Cormack study, where agreement on non-relevant adjudications, assuming all non-appealed decisions were correct, was 97.4 percent (2.6% inconsistent).
Prior Studies Have Not Addressed the Impact of Inconsistencies on Machine Training
None of the studies to date on relevance coding inconsistencies was designed to evaluate the impact of such inconsistencies on active machine learning. Recent data obtained by the Electronic Discovery Institute in its Oracle project may, however, make it possible for scientists to make such evaluations in the future. Bay, M., EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov. 2013). As reported in Monica Bay’s article, the data on inconsistencies and on the number of reviewers used by each participating team may make it possible for scientists to prove, or disprove, my theory that less is more: that consistent input by bona fide experts, SMEs, is critical to attaining comparatively high performance in real-world legal search projects. That is what I have been referring to for some time as the Army of One approach. See, e.g., LegalSearchScience.com; The Solution to Empty-Suits in the Board Room: The “Hacker Way” of Management – Part Two; New Developments in Advanced Legal Search: the emergence of the “Multimodal Single-SME” approach.
The Voorhees study addressed the issue of whether it was possible for TREC to ever make reliable comparisons of the relative effectiveness of search software and methods in view of the inconsistency of the human reviewers that TREC depended upon to verify the accuracy of the search results. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. Voorhees was responding to other scientists who questioned the value of TREC’s work in this area based on the assumption that all relevance determinations are inherently subjective. They argued that since relevance is just subjective, the test methods used by TREC to evaluate the comparative efficacy of different methods and software are invalid. (As a side note, if this were true, the evidentiary foundations of the common law tradition, based as it is on the admission of only relevant evidence, would also be called into question as a mere subjective exercise of judicial power.)
Ellen Voorhees assumed as true that relevance determinations are highly subjective, but concluded that the TREC research was nonetheless effective in making valid comparisons between search methods and search software. For instance, she states in her conclusion that it is a fact that “relevance” is idiosyncratic. Despite this assumption of the idiosyncratic nature of relevance determinations, Ellen concluded that:
[T]he relative effectiveness of different retrieval strategies is stable despite marked differences in the relevance judgments used to define perfect retrieval. … These results validate the use of the TREC test collections for comparative retrieval experiments.
The study by Herbert L. Roitblat, Anne Kershaw and Patrick Oot also did not concern the impact of inconsistencies on machine training. Instead, their study addressed the question of whether automated systems could categorize documents at least as well as human reviewers. Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review. They were responding to the then (2010) still widely held belief that the gold standard in legal review was for an attorney to study every document to determine relevance, and that computer-assisted review was not as reliable. Roitblat, Kershaw and Oot used data from a real matter based on a Department of Justice request for information about a merger. They found that computer-assisted review was at least as good as manual review, and much more cost-effective.
The studies of Maura Grossman and Gordon Cormack also did not concern the impact of inconsistencies on machine training. Their first report, like the Roitblat, Kershaw and Oot study, evaluated the comparative effectiveness of computer-assisted review versus manual review. Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review. They concluded that the technology-assisted methods were not only more efficient and cost-effective than exhaustive manual review, but were also significantly superior in precision, recall and F1 measures.
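For readers unfamiliar with the last of those measures, F1 is the standard harmonic mean of precision and recall:

```latex
\[
  F_1 \;=\; \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```

Because the harmonic mean is dragged down by whichever of the two is lower, a method cannot post a high F1 score by maximizing recall at the expense of precision, or vice versa.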
Maura and Gordon’s second study considered reviewer inconsistencies, but was concerned with the question as to why human reviewers are so inconsistent. Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error? Their findings challenged the assumptions of Ellen Voorhees and others that relevancy determinations were inherently subjective, and that this explained the high rate of reviewer inconsistencies.
Gordon and Maura considered data from the TREC 2009 Legal Track Interactive Task. They concluded that only 5% of the inconsistencies in determinations that a document was relevant were attributable to differences of opinion, and that 95% were attributable to human error. They concluded that most inconsistent categorizations were caused by carelessness, such as not following instructions, and were not caused by differences in subjective evaluations.
The accuracy of their conclusion that relevance is not inherently subjective has important consequences not only for the philosophy of law, which their study did not discuss, but also for whether, and how, quality control procedures can be implemented to reduce the inconsistencies in human review. This is of utmost importance to researchers and reviewers like myself who are trying to improve methods of predictive coding by making the machine training as accurate as possible. My personal view on the objective versus subjective relevance controversy is generally consistent with Maura and Gordon’s. As stated in A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents:
The findings in this study thus generally support the conclusions of Cormack and Grossman that most inconsistencies in document classifications are due to human error, not the presence of borderline documents or the inherent ambiguity of all relevancy determinations. Id. Of the 3,274 different documents the SME read in both projects in the instant study only 63 were seen to be borderline grey area types, which is less than 2%.
My study also did not concern the impact of inconsistencies on machine training. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. I was instead concerned with comparing the relative effectiveness of two different predictive coding methods. More specifically my research implemented the suggestion found at the conclusion of Grossman and Cormack’s report, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, at page 33:
The particular processes found to be superior in this study are both interactive, employing a combination of computer and human input. While these processes require the review of orders of magnitude fewer documents than exhaustive manual review, neither entails the naive application of technology absent human judgment. Future work may address which technology-assisted review process(es) will improve most on manual review, not whether technology-assisted review can improve on manual review.
My 2013 study compared a technology-assisted review process that included what Grossman and Cormack called a combination of computer and human input, a process that I call a hybrid multimodal approach, with another process that had less human input, a monomodal approach that I sometimes call the Borg methodology. My study found that the multimodal process was superior, especially at locating highly relevant documents, but not by as much as I had expected. (The Borg method was, however, more boring than I had expected!)
As mentioned, after completing this experiment I discovered that I had made 63 inconsistent review determinations on the same documents. Since a total of 274 identical documents were re-reviewed, my commission of only 63 errors created a record-high overlap or Jaccard index of 77% (211/274). When you consider all same-document re-reviews, both relevant and irrelevant, the numbers are even better. In both projects I coded 31,109 identical unique documents as irrelevant. Of those 31,109 overlapping documents, I actually read and reviewed approximately 3,000 and bulk-coded the rest (28,109).
Thus in both projects I read and individually reviewed 3,274 unique documents: 3,000 documents were marked irrelevant and 274 marked relevant. This is shown in the Venn diagram below. Of the 3,274 identical documents reviewed there were only 63 known inconsistencies. This represents an overall inconsistency error rate of 1.9%. Thus the agreement rate for the review of both relevant and irrelevant documents was 98.1% (3,211/3,274).
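The gap between the two ways of counting is easy to reproduce. The short Python sketch below uses only the figures reported above (3,274 documents read in both projects, 274 coded relevant by one or both, 211 coded relevant by both) and shows how overall agreement can sit near 98% while the relevance-only Jaccard index is 77%.

```python
# Figures reported above for the two Enron review projects
docs_reviewed_in_both = 3274  # unique documents read and coded in both projects
relevant_by_either = 274      # coded relevant in at least one of the two projects
relevant_by_both = 211        # coded relevant in both projects
inconsistent = relevant_by_either - relevant_by_both  # 63 disagreements

# Relevance-only overlap (Jaccard index): agreement on relevance calls only
jaccard = relevant_by_both / relevant_by_either
print(f"Jaccard index on relevance calls: {jaccard:.0%}")   # ~77%

# Overall agreement: every document coded the same way in both projects,
# including the large majority coded irrelevant both times
agreement = (docs_reviewed_in_both - inconsistent) / docs_reviewed_in_both
print(f"Overall agreement rate: {agreement:.1%}")            # ~98.1%
```

Because the reviewed set is dominated by documents that both passes coded irrelevant, the overall agreement rate is inflated by prevalence; the Jaccard index strips that out, which is why it is the sterner of the two measures.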
The inclusion of all review determinations in a consistency analysis, not just the decisions where a document is classified as relevant, provides critical information for understanding the reasonableness of disclosure positions in litigation, specifically on whether non-relevant training documents used in predictive coding searches should be disclosed to the requesting party. This was discussed in the conclusions section of my 2013 report, A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. Inclusion of both relevant and irrelevant determinations is also appropriate when analyzing active machine learning, where training on irrelevance is just as important as training on relevance.
To be continued …. The conclusion in Part Three is coming soon.