Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Two

This is part-two of a three-part blog, so please read Part One first.

Scientific Experiments on Inconsistencies of Relevance Determinations in Large Scale Document Reviews

The base work in this area was done by Ellen M. Voorhees: Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt 697 (2000). The second study of interest to lawyers on this subject came ten years later by Herbert L. Roitblat, Anne Kershaw and Patrick Oot, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, Journal of the American Society for Information Science and Technology, 61 (2010) (draft found at Clearwell Systems). The next study with significant new data on inconsistent relevance review determinations was by Maura Grossman and Gordon Cormack in 2012: Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012); also see Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011). The fourth and last study on the subject with new data is my own review experiment done in 2012 and 2013. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (2013).

Voorhees Study

Ellen M. Voorhees is a Computer Scientist at the National Institute of Standards and Technology (NIST). Her title is Group Leader in the Information Access Division of NIST. Her primary responsibility at NIST is to manage the Text REtrieval Conference (TREC) project. Voorhees is well qualified for this work. She received a B.S. in computer science from Pennsylvania State University, and Masters and Ph.D. degrees in computer science from Cornell University. Ellen has made several comments to this blog, and been mentioned several times before. I am grateful for the assistance she has provided to me over the years to help understand scientific research in this field.

Ellen Voorhees’ study was based upon analysis of TREC 4 data from 1995. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt 697 (2000). The relevance determinations of three expert reviewers were studied. There were 49 search topics run against the same set of 13,435 documents. This was done serially, not all at once, and the document set remained static.

The reviewer who created a particular topic was called the “primary assessor.” The two reviewers who independently double-checked the first relevance determinations were called the “secondary assessors.” Each reviewer created some of the topics and took turns serving as primary or secondary assessor for the 49 topics. For each of the 49 topics the two secondary assessors reviewed 200 documents that the primary assessor had marked relevant, if there were that many. If there were more than 200 documents ranked relevant, which was apparently infrequent, then a random sample of 200 relevant would be taken. A random sample of 200 documents judged irrelevant was also taken for each topic and given to each secondary assessor for review along with the 200 documents previously judged relevant.

The secondary assessors would then each independently review the 400 or less documents provided to them for each topic and determine which were relevant and which irrelevant. This means there were three reviews made of up to 400 documents in 49 topics. Presumably the secondary reviewers all knew that 200 of the documents provided to them for review on each topic were irrelevant and up to another 200 were supposed to be relevant. If I have read the report correctly, this means that if a secondary assessor received 400 documents they would know that exactly half were relevant. In my experience such information about prevalence is a strong clue that can assist a reviewer, and certainly would influence their determinations.

Nevertheless, in spite of this prevalence knowledge, and in spite of the fact that all three assessors were search experts with very similar backgrounds as retired information analysts, the inconsistencies in their determinations were quite high (although Voorhees seemed to think it was a relatively low disagreement rate, at least as compared to prior smaller studies). As is shown in Table 1 of the report, the average agreement rate on determinations that documents were relevant, called overlap, between the primary assessor and one of the secondary assessors was 42%. The overlap between the primary assessor and the other secondary assessor was 49%. The average overlap between the two secondary assessors was 43%, and the overlap between all three of the assessors for the 49 topics was an average of only 30%. This means that two secondary reviewers disagreed on relevance 57% of the time, and three reviewers disagreed 70% of the time.

Here is Ellen Voorhees’ more detailed report on the reviewer inconsistencies:

The overlap shown in Table 1 for pairs of assessors is greater than the overlap in the earlier studies. This is not surprising since the NIST assessors all have a similar background (retired information analyst), had the same training for the TREC task, and judged documents under identical conditions. Indeed, it is perhaps surprising that the overlap is not higher than it is given how similar the judges are; this is yet more evidence for the variability of relevance judgments. For some topics, the judgment sets produced by the three assessors are nearly identical, but other topics have very large differences in the judgment sets. For example, the primary assessor judged 133 documents as relevant for Topic 219, and yet no document was unanimously judged relevant. One secondary assessor judged 78 of the 133 irrelevant, and the other judged all 133 irrelevant (though judged one other document relevant). Across all topics, 30% of the documents that the primary assessor marked relevant were judged non relevant by both secondary assessors. In contrast, less than 3% of the documents judged non relevant by the primary assessor were considered relevant by both secondary assessors.

Based upon the 57% inconsistency rate between two reviewers (43% overlap), Voorhees calculated that you could never know whether recall or precision rates of higher than 65% had been attained:

The recall and precision scores … for the two sets of secondary judgments imply a practical upper bound on retrieval system performance is 65% precision at 65% recall since that is the level at which humans agree with one another.

According to information scientist, William Webber, who has provided me with invaluable assistance in understanding all of these studies, the 70% disagreement rate between all three reviewers (overlap of only 30%) would place a practical limit on precision and recall calculation of approximately 45%. This is the fuzzy lens problem I have written about before. Secrets of Search – Part One.

This fact frustrates any attempts to measure absolute recall or precision ratios, such that all software or search-methods testing can ever do reliably is make comparisons. But, as will be explained later in this article, at least Ellen Voorhees confirmed by this study that valid comparisons by participants who enroll in the same search event are scientifically valid and can be relied upon to test relative performance. At the present time at least, that is as clear a picture as we can get in Big Data search.

Roitblat, Kershaw, and Oot Study

The next study with new experimental results on reviewer inconsistencies was performed by Herbert L. Roitblat, Anne Kershaw and Patrick Oot. All three experts are well-known professionals in the e-discovery world. Their study considered a modern, real-world, e-discovery review project involving professional contract attorneys reviewers, project managers, and quality controls circa 2005. Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review. The project involved a “second request” by the Justice Department for production of certain documents the Department specified as relevant to its investigation on whether to approve Verizon’s acquisition of long-distance carrier, MCI, for $8.44 Billion.

According to the Roitblat, Kershaw, and Oot report a total of 1,600,047 documents were reviewed by two teams of contract attorneys retained by Verizon. One focused on review for relevance, and one focused on review for privilege. A total of 225 attorneys participated in this initial review. The attorneys spent about 4 months, 7 days a week, and 16 hours per day on the review at a total cost of $13,598,872.61 or about $8.50 per document. After review, a total of 176,440 items were produced to the Justice Department as relevant and not privileged.

Two re-review teams of professional reviewers were retained by the Electronic Discovery Institute (EDI) who sponsored the Roitblat, Kershaw, and Oot Study. They were each provided with the same random sample of 5,000 documents from the original review. The two teams independently classified each of these 5,000 documents. A study of their classifications showed that the two re-review teams disagreed on the classification of 1,487 documents. These documents were submitted to a senior Verizon litigator (Patrick Oot), who made a final decision on relevance of these documents. Patrick did so without knowledge of the specific decisions made about each document during the first review. EDI also had two e-discovery vendors running computer-assisted reviews of some type to independently search the entire 1,600,047 corpus for relevant documents.

The Roitblat, Kershaw, and Oot study found that the two re-review teams agreed with the original review on about 76% and 72% of the documents. They agreed with one another on about 70% of the documents. These percentages refer to all determinations, both relevant and irrelevant. The comparative relevance overlap calculation as made in the Voorhees study, where only relevance determinations are considered, was only 28%. Specifically, the overlap between Team A and the original production was 16.3%; and the overlap between Team B and the original production was 15.8%. By comparison the overlap in productions using the two automated systems was significantly higher: 21% and 23%.

The overlap between the two re-review Teams A and B was also studied and found to be 28.1%. According to William Webber’s analysis even the 28% agreement rate found here produces a maximum possible precision and recall measurement rate of 44%. The even lower 16% overlap in the two manual review productions would put an even lower limit on recall and precision calculations.

Grossman and Cormack Study

The next study of interest with new data on reviewer inconsistencies was by the best known couple in e-discovery, Gordon Cormack and Maura Grossman. Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012). It considered data from the TREC 2009 Legal Track Interactive Task. This TREC used a triple pass method to try to mitigate the problem of inconsistent reviewer adjudication in measuring precision and recall. Without going into the details, in the triple pass method the first reviewers make their determinations, then the participants make theirs. Then if the participants disagree they can ask for a ruling from the topic’s subject matter expert (called a Topic Authority), who will, at that point make a third and final review adjudication. In the 2009 TREC the review teams that asked for appeals attained a suspicious 89% success ratio, a result contrary to legal experience, and one which raised questions as to efficacy of the review process. See William Webber, Re-examining the Effectiveness of Manual Review (2011).

This triple pass method was followed in 2009, the first year TREC used the Enron dataset, instead of the prior tobacco litigation collection, which was by then very dated and dirtied with paper scanning errors. The Grossman and Cormack study looked at inconsistencies in relevance determinations in the triple pass procedure. Maura and Gordon also personally re-reviewed a random sample of documents for which the first-pass reviewer’s responsiveness determinations were reversed by the TREC Topic Authority. They took a random sample of one-hundred such documents from the appeals of each of the seven topics. A total of 700 documents were thus re-reviewed by them. (I personally repeated their experiment and re-reviewed these same documents under Gordon’s auspices this summer. Any attorney is invited to do the same, and need only contact Gordon to receive access to the 700 document test collection. It will provide you with a greater insight into their experiment, and them with more data to improve their analysis.)

As to the seven topics considered in 2009, the average agreement for documents coded responsive by the first-pass reviewers was 71.2 percent (28.8% inconsistent), while the average agreement for documents coded non-responsive by the first-pass reviewer was 97.4 percent (2.6% inconsistent). Id. at 274 (parentheticals added). Specifically, over the seven topics studied in 2009 TREC:

A total of 49,285 documents—about seven thousand per topic—were assessed during the first-pass review. A total of 2,976 documents (5 percent) were appealed and therefore adjudicated by the Topic Authority. Of those appeals, 2,652 (89 percent) were successful; that is, the Topic Authority disagreed with the first-pass reviewer 89 percent of the time.

Id. at 281. Thus while the overlap of relevance determinations was a very respectable 71.2% overall (this assumes all non-appealed determinations were correct, which is obviously not correct since some teams did not bother to appeal), the overlap among appealed decision was a very low 11%.

When Gordon and Maura made a fourth review of a random sample of 700 appealed relevance classifications, and then compared their 700 determinations with that of the appeal reviewer (the third review by the Topic Authority), they found, they say to their surprise, that they agreed with the Topic Authority’s final determinations 90% of the time. They found another 5 percent or so of the documents to be clearly responsive or clearly non-responsive, contradicting the determination of the Topic Authority. Only 5 percent of the documents were found to be arguable. Id. at 285. (No word yet from them on how my review compared with theirs, but I doubt my agreement rates were as high, no doubt in part because I found several of the Topic Authorities’ definitions and instructions of relevance vague and confusing. For that reason I did not feel like I had a good understanding of several of the topics.)

Maura Grossman had served as the Topic Authority on one of the seven topics (Topic 204) and so during TREC had made the third review of the documents appealed. When she re-reviewed ten disputed documents in this same topic over a year later as part of this study she found that she disagreed with herself on five of the ten same documents considered:

For three of the ten documents, the Topic Authority contradicted her earlier assessment; for two of the ten, the Topic Authority coded the documents as arguable. For only half of the documents did the Topic Authority unequivocally confirm her previous coding decision.

Id. at 286. Thus the overlap in Maura’s determinations of the same documents was 50%, with a disagreement or inconsistency rate of 50%. These inconsistencies arose in the topic is which she was designated as the SME. For details, see Gordon Cormack’s explanatory comment below.

Losey Study

The final study with new data on reviewer inconsistencies was mine. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (2013). Here I reviewed 699,082 Enron documents by myself, twice, on two review projects. Each project had the same relevance target, and used almost the same Kroll Ontrack predictive coding enhanced software, but used different methodologies. In the first review I used my normal multimodal methods, in the second review nine months later I used what I called a Borg or monomodal methodology. In a later post hoc analysis of these two reviews I discovered that I had made 63 inconsistent relevance determinations of the same documents. I thus inadvertently contributed to the research on relevance inconsistencies.

Here are the specifics on this aspect of my research. In the first multimodal project I read approximately 2,500 individual documents to categorize the entire set. I found 597 relevant documents. In the second monomodal project I read 12,000 documents. I found 376 relevant documents. After removal of duplicate documents, which were all coded consistently thanks to simple quality controls employed in both projects, there were a total of 274 different documents coded relevant by one or both methods.

Of the 274 overlapping relevant categorizations, 63 of them were inconsistent. In the first project I found 31 documents to be irrelevant that I determined to be relevant in the second project. In the second project I found 32 documents to be irrelevant that I had determined to be relevant in the first project. An inconsistency of coding of 63 out of 274 relevant documents represents an inconsistency rate of 23%.

While this kind of inconsistency by an SME might be seen as surprising, even embarrassing to some, in fact, this 23% inconsistency rate represents the lowest on record for a large-scale document review, and thus the best result ever documented. It is also better than the only other recorded instance of a single reviewer re-coding the same documents over a year later, namely Maura Grossman’s mentioned re-review of ten documents in Topic 204.

My re-review of 274 documents, where I made 63 errors, creates an overlap or Jaccard index of 77% (211/274), which again is the highest on record. See Grossman Cormack Glossary, Ver. 1.3 (2012) (defines the Jaccard index and goes on to state that expert reviewers commonly achieve Jaccard Index scores of about 50%, and scores exceeding 60% are very rare.) This overlap or Jaccard index is shown by the Venn diagram below.

By comparison the Jaccard index in the Voorhees studies were only 30% (three reviewers) and 45% (two reviewers). The Jaccard index of the Roitblat, Kershaw and Oot study was only 16% (two reviewers). The Jaccard index of the topic authority reviews done twice on the same documents by Maura Grossman in the 2009 TREC was 50% (one reviewer). The overlap for two reviewers relevance calls was 71% in the Grossman Cormack study, if you assume all unappealed decisions were correct, but if you only consider the appealed decisions, the Jaccard Index was a dismal 11%. Thus in context of the only tests we have on this subject of consistency of document review, the consistent coding of 211 out of 274 documents was extremely high.

Further, when you consider the determinations of not-relevant, as was done in the Grossman Cormack study, my consistency rate jumps to about 99% (01% inconsistent). Compare this with the Grossman Cormack study where agreement on non-relevant adjudications, assuming all non-appealed decisions were correct, was 97.4 percent (2.6% inconsistent).

Prior Studies Have Not Addressed the Impact of Inconsistencies on Machine Training

None of the studies to date on relevance coding inconsistencies were made to evaluate the impact of such inconsistencies on active machine learning. Recent data obtained by the Electronic Discovery Institute in their Oracle project, may, however, make it possible for scientists to make such evaluations in the future. Bay, M., EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov., 2013). As reported in the Monica Bay article the data on inconsistencies and number of reviewers used by each participating team may make it possible for scientists to now prove, or disprove, my theory that less is more, that consistent input by bona fide experts, SMEs, is critical to attaining comparatively high performance in real-world legal search projects. That is what I have been referring to for some time as the Army of One approach. See eg. LegalSearchScience.com; The Solution to Empty-Suits in the Board Room: The “Hacker Way” of Management – Part Two; New Developments in Advanced Legal Search: the emergence of the “Multimodal Single-SME” approach.

The Voorhees study addressed the issue of whether it was possible for TREC to ever make reliable comparisons of the relative effectiveness of search software and methods in view of the inconsistency of the human reviewers that TREC depended upon to verify the accuracy of the search results. Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness. Voorhees was responding to other scientists who questioned the value of TREC’s work in this area based on the assumption that all relevance determinations are inherently subjective. They argued that since relevance is just subjective, the test methods used by TREC to evaluate the comparative efficacy of different methods and software are invalid. (As a side note, if this were true, the evidentiary foundations of the common law tradition, based as it is on the admission of only relevant evidence, would also be called into question as a mere subjective exercise of judicial power.)

Ellen Voorhees assumed as true that relevancy determinations are very subjective, but concluded that the TREC research was anyway effective in making valid comparisons between search methods and search software. For instance, she states in her conclusion, that it is a fact that “relevance” is idiosyncratic. Despite her assumption of the idiosyncratic nature of relevance determinations, Ellen concluded that:

[T]he relative effectiveness of different retrieval strategies is stable despite marked differences in the relevance judgments used to define perfect retrieval. … These results validate the use of the TREC test collections for comparative retrieval experiments.

The study by Herbert L. Roitblat, Anne Kershaw and Patrick Oot, also did not concern the impact of inconsistencies on machine training. Instead, their study addressed the issue of whether automated systems could categorize documents at least as well as human reviewers could. Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review . They were responding to the then (2010) still widely held belief that the gold standard in legal review was for an attorney to study every document to determine relevance, that computer-assisted review was not as reliable. Roitblat, Kershaw and Oot used data from a real matter based on a Department of Justice request for information about a merger. They found that computer-assisted review was at least as good as manual review, and much more cost-effective.

The studies of Maura Grossman and Gordon Cormack also did not concern the impact of inconsistencies on machine training. Their first report, like the Roitblat, Kershaw and Oot study, evaluated the comparative effectiveness of computer-assisted review versus manual review. Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review. They concluded that the technology-assisted methods were not only more efficient and cost-effective exhaustive manual review, but were significantly superior in precision, recall and F1 measures.

Maura and Gordon’s second study considered reviewer inconsistencies, but was concerned with the question as to why human reviewers are so inconsistent. Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error? Their findings challenged the assumptions of Ellen Voorhees and others that relevancy determinations were inherently subjective, and that this explained the high rate of reviewer inconsistencies.

Gordon and Maura considered data from the TREC 2009 Legal Track Interactive Task. They concluded that only 5% of the inconsistencies in determinations that a document were relevant were attributable to differences in opinion, that 95% were attributable to human error. They concluded that most reviewer categorizations were caused by carelessness, such as not following instructions, and were not caused by differences in subjective evaluations.

The accuracy of their conclusion that relevance is not inherently subjective not only has important consequences for philosophy of law, which their study did not discuss, but also for whether, and how, quality control procedures can be implemented to reduce the inconsistencies in human review. This is of utmost importance to researchers and reviewers like myself trying to improve methods of predictive coding by making the machine training as accurate as possible. My personal view on the objective versus subjective relevance controversy is generally consistent with Maura and Gordon’s. As stated in A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents:

The findings in this study thus generally support the conclusions of Cormack and Grossman that most inconsistencies in document classifications are due to human error, not the presence of borderline documents or the inherent ambiguity of all relevancy determinations. Id. Of the 3,274 different documents the SME read in both projects in the instant study only 63 were seen to be borderline grey area types, which is less than 2%.

My study also did not concern the impact of inconsistencies on machine training. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. I was instead concerned with comparing the relative effectiveness of two different predictive coding methods. More specifically my research implemented the suggestion found at the conclusion of Grossman and Cormack’s report, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, at page 33:

The particular processes found to be superior in this study are both interactive, employing a combination of computer and human input. While these processes require the review of orders of magnitude fewer documents than exhaustive manual review, neither entails the naive application of technology absent human judgment. Future work may address which technology-assisted review process(es) will improve most on manual review, not whether technology-assisted review can improve on manual review.

My 2013 study compared a technology-assisted review process that included what Grossman and Cormack called a combination of computer and human input, a process that I call a hybrid multimodal approach, with another process that had less human input, a monomodal approach that I sometimes call the Borg methodology. My study found that the multimodal process was superior, especially at locating highly relevant documents, but not by as much as I had expected. (The Borg method was, however, more boring that I had expected!)

As mentioned, after completing this experiment I discovered that I had made 63 inconsistent review determinations of the same documents. Since a total of 274 identical documents were re-reviewed, my commission of only 63 errors created a record high overlap or Jaccard index of 77% (211/274). When you consider all same-document re-reviews, both relevant and irrelevant, the numbers are even better. In both projects I coded 31,109 identical unique documents as irrelevant. Of the 31,109 total overlapping documents coded, I actually read and reviewed approximately 3,000 of these documents and bulk coded the rest (28,109).

Thus in both projects I read and individually reviewed 3,274 unique documents: 3,000 documents were marked irrelevant and 274 marked relevant. This is shown in the Venn diagram below. Of the 3,274 identical documents reviewed there were only 63 known inconsistencies. This represents an overall inconsistency error rate of 01.9%. Thus the Agreement rate for review of both relevant and irrelevant documents was 98.1% (3274/3,337).

The inclusion of all review determinations in a consistency analysis, not just review decisions where a document is classified as relevant, provides critical information to understand the reasonability of disclosure positions in litigation, specifically in whether non-relevant training documents used in predictive coding searches should be disclosed to the requesting party. This was discussed in the conclusions section of my 2013 report. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. Inclusion of both relevant and irrelevant determinations is also appropriate when analyzing active machine learning where the training on irrelevance is just as important as the training on relevance.

To be continued …. The conclusion in Part Three is coming soon.

This entry was posted on Monday, December 2nd, 2013 at 10:23 am and is filed under Lawyers Duties, Metadata, Review, Search, Technology. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

17 Responses to Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Two

Legal Search Science | e-Discovery Team ® says:

December 2, 2013 at 10:44 am

[…] When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and […]

Loading...

Reply
Jeremy Pickens says:

December 2, 2013 at 12:59 pm

Ok, I’ve got about a hundred comments I could make, but I’m going to try to be good and keep it short and small.

First thing that struck me. You wrote:

Recent data obtained by the Electronic Discovery Institute in their Oracle project, may, however, make it possible for scientists to make such evaluations in the future. Bay, M., EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov., 2013). As reported in the Monica Bay article the data on inconsistencies and number of reviewers used by each participating team may make it possible for scientists to now prove, or disprove, my theory that less is more, that consistent input by bona fide experts, SMEs, is critical to attaining comparatively high performance in real-world legal search projects.”

When I read the LTN report, it said this:

The study “considered multiple evaluation systems using litigation data from real high-stakes litigation where the producing party was confident that it conducted a meticulous attorney-based document review to respond to the document request,” he explained. Said Oracle’s Chakraborty: “The document review originally conducted by outside counsel in the study matter was a rigorous undertaking that included a thorough review with multiple quality checks. The review team was comprised of both law firm associates and contract attorneys.”

Perhaps I am reading this wrong, but it sounds like the final ground truth in the Oracle study was established, in part, by contract attorneys.

So look, I have no problem with using quality-checked contract reviewers to judge documents. But part of your argument is that non-SMEs are even more unreliable. So how is it that you think non-SME ground truth can be used to validate different training regimens? Isn’t that a paradox?

Loading...

Reply
- Ralph Losey says:
  
  December 3, 2013 at 9:22 am
  
  I do not know any details of the Oracle production or review aside from the LTN quote, but just because contract reviewers were used in the review does not mean they were driving the CAR, nor even making any “final” relevance determinations at all. You make a big assumption there. What we do know is that this was a state-of-the-art, real world, litigation production. It was thus an excellent comparator. Moreover, to my knowledge we have never had a test collection like that before. TREC collections must by regulation be public and the test runs are all done with artificial hypotheticals and volunteer reviewers. The Verizon study was close, as it was real world, but it was not in a litigation context, just a document review for merger approval, and was, I think, done with high time pressures and hundreds of reviewers.
  
  If you study my review best practices you will see that I still use contract lawyers too, just not for AI training, and not for final relevancy calls. Under my methods contract lawyers are primarily used only in what I call the “second pass” reviews, usually performed after the AI training is complete. They can make relevancy judgments (more accurately irrelevancy judgments) but they are always double checked by an SME or SME delegate. They mainly do redaction and privilege logging, and other very time consuming tasks for which an expensive SME is not needed.
  
  By the way, anything you would share concerning your “inconsistency smoothing” algorithmic work would be of interest to readers I am sure.
  
  Loading...
  
  Reply
  - Jeremy Pickens says:
    
    December 3, 2013 at 10:30 am
    
    Under my methods contract lawyers are primarily used only in what I call the “second pass” reviews, usually performed after the AI training is complete. They can make relevancy judgments (more accurately irrelevancy judgments) but they are always double checked by an SME or SME delegate.
    
    No, I never assumed they were driving the CAR (as in, used for training data) in this scenario. I assumed Oracle had used them the way you are describing: To provide testing data. To see what the outcome of the process actually was. To provide the judgments that get used to determine final precision, recall, F1, etc. scores.
    
    Where I am a little doubtful is when you say that an SME always double checks these judgments. That doesn’t make any sense to me. If every document that the contract reviewer is judging during this “second pass” (aka evaluation aka testing) phase is again judged by an SME, why bother with the contract reviewer in the first place? That’s just wasted time and effort, is it not?
    
    So I suspect that what Oracle has done, after having used contract reviewers in exactly the way you describe, for post-AI results evaluation, is to do some high level QC but let most of the judgments stand, as is. And if that’s indeed the case, then again, we’re in paradox land.
    
    It really would be nice if we could get someone from Oracle to comment on this, because otherwise it is very difficult to reach any conclusions if we don’t know what actually happened.
    
    Loading...
  - Ralph Losey says:
    
    December 3, 2013 at 5:15 pm
    
    You are assuming that contract lawyers make many relevance changes. Not in my world. That should not happen if your training the system properly and your document ranking is working correctly. For example, in one project I was involved with the contract reviewers were finding about 2% relevant. It would have cost a small fortune for them to review it all, no matter what the discounted rate. Then I was brought in and did my predictive coding thing for a few days, and marked the whole set either relevant or not. The less than 50% predicted probability was marked relevant and the contract lawyers then did second pass, where on one their jobs was to confirm my prediction of relevance.
    
    Then the contract lawyers found almost all were relevant at the very high end, and found 80% to 90% relevance in the top 20% where most of the predicted relevant were sorted to. They liked that. Me and my surrogates then only had to double check their reversals from relevant to irrelevant. Then when they contract lawyers reached the relatively few docs in the 80% to 50% probable relevance range, the prevalence was lower. More reversals.
    
    We did not look at most of the docs less than probable relevant (49% and under). But we did sample the less than 50% majority of docs. That confirmed the predictions. Not perfectly mind you. I understand the limits of statistics and the remote, but real possibility of still finding relevant documents that you and other scientists are fond of pointing out (which you sometimes call the “long tail”), but it was confirmed within reason. That is, after all, what the law requires, reasonable, proportionate efforts, not perfection and mathematical certitude that many causal-type scientists are used to. I’m not saying you are one of “those” mind you. You appear to be more of a quantum relativity type to me! Law has always been there, dealing with probabilities and self-organization from chaos, not old-fashioned (and now disproven) Newtonian causality. We are used to probable relevance and not knowing for sure if the cat is dead.
    
    Anyway, that production I was talking about went very well. The documents needed for justice were all found, and then some, and the clients saved a lot of money in the process. Contract lawyers, myself, and a few of my surrogates worked hand in hand and the duplication of review was not too bad, and assured everyone of quality control.
    
    Loading...
  - Jeremy Pickens says:
    
    December 3, 2013 at 10:55 am
    
    By the way, anything you would share concerning your “inconsistency smoothing” algorithmic work would be of interest to readers I am sure.
    
    I will, but my goal is not to take over your blog 🙂 So, in another forum.
    
    Loading...
  - Jeremy Pickens says:
    
    December 4, 2013 at 11:38 am
    
    You are assuming that contract lawyers make many relevance changes.
    
    Not.. quite. I’m.. well.. I think we’re talking at cross-purposes here. I mostly agree with the gist of what you’re saying (though I think there are still one or two hidden gotchas that you’re not considering), but that gist is not really what I’m talking about here. I take full responsibility for not being very good at explaining myself via comment text. I think this discussion would be better in person, with a whiteboard or napkin or something else that I could sketch on.
    
    Loading...
Gordon V. Cormack says:

December 2, 2013 at 2:11 pm

Ralph,

Two Desi-V papers, and a white paper by Jeremy, investigate the impact of training errors on predictive coding for document review. In a nutshell, the impact is “not much.”

Jianlin Cheng, Amanda Jones, Caroline Privault and Jean-Michel Renders, Soft Labeling for Multi-Pass Document Review. http://www.umiacs.umd.edu/~oard/desi5/research/Cheng-final.pdf

Johannes C. Scholtes, Tim van Cann, Mary Mack, The Impact of Incorrect Training Sets and Rolling Collections on Technology-Assisted Review. http://www.umiacs.umd.edu/~oard/desi5/additional/Scholtes.pdf

(Sorry Jeremy, I don’t have a link to your work.)

Gordon

Loading...

Reply
- Ralph Losey says:
  
  December 9, 2013 at 9:30 am
  
  Thanks for the comment Gordon, who, in case any of my readers do not know, is another star scientist in the field of legal search. Sorry it took me so long to approve your comment, but just noticed it now, even though you posted it several days ago. (Since it had links, it needed my approval as part of the spam filter.) I’ll check out the papers you mention. Already knew about Jeremy’s. Thanks again.
  
  Loading...
  
  Reply
Gordon V. Cormack says:

December 2, 2013 at 2:20 pm

Ralph,

To date, I am aware of no study other than yours has measured intra-assessor overlap on the same ediscovery review task. Certainly not Grossman & Cormack.

In Grossman & Cormack, Maura did not re-review the several hundred documents she reviewed as topic authority from TREC 2009. Cormack reviewed a sample of 100, and of those 100, he disagreed with 10 of Maura’s judgements. Maura re-reviewed only these 10 documents. Of the 10, Maura held her ground on 5, coded 2 as “arguable,” and reversed herself on 3. Furthermore, Cormack held (prior to Maura’s re-review) that the 3 on which Maura reversed herself were “arguable.” (Note that “arguable” was not an option in the original TREC 2009 review so switching to arguable should not be scored as a reversal.)

These ten documents comprise a tiny judgmentally sampled fraction of all the documents that Maura reviewed at TREC 2009, which themselves were a judgmentally sampled fraction of the review set. The documents that Maura reviewed at TREC were only those that were appealed and were hence controversial, and the ten of those that were selected by me were especially controversial. You simply cannot conclude that she would have reversed herself this frequently had she conducted a second review of a representative set of documents.

regards,
Gordon

Loading...

Reply
- Ralph Losey says:
  
  December 3, 2013 at 9:31 am
  
  Thanks for that comment and explanation. I have just corrected my blog explanation on that point accordingly. I consider the errors an anomaly and of no statistical importance since the sample was so small, and, as you point out, not at all representative of the topic collection.
  
  Loading...
  
  Reply
William Webber says:

December 3, 2013 at 3:42 pm

Ralph,

Hi! Thanks for a great summary post of research findings on inter-assessor agreement.

Note that Jeremy and I had a short paper at this year’s SIGIR in which we took the Voorhees dataset you describe here, and examined what effect the use of alternative (non-authoritative) assessors had upon the reliability of machine classification. The paper can be found here:

http://www.williamwebber.com/research/papers/wp13sigir.pdf

The takeaway finding was that using non-authoritative trainers meant that on average 25% more documents had to be reviewed in order to achieve the same level of recall (on this particular dataset, which admittedly is not very representative of what is found in e-discovery). This might, though, work out as cheaper overall if the non-authoritative trainers themselves were cheaper than the authoritative one.

William

Loading...

Reply
Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Three | e-Discovery Team ® says:

December 8, 2013 at 6:29 pm

[…] This is part-three of a three-part blog, so please read Part One and Part Two first. […]

Loading...

Reply
Top Ten e-Discovery Predictions for 2014 | e-Discovery Team ® says:

January 1, 2014 at 11:18 am

[…] how inconsistent human reviewers are, even when using search experts. See Less Is More, parts One, Two and Three. They still try to fix the old methods, and try to use human reviewers to measure what […]

Loading...

Reply
Less is More: In Predictive Coding, Fewer Reviewers Are Better | @ComplexD says:

January 26, 2014 at 8:25 am

[…] This is part-three of a three-part blog, so please read Part One and Part Two first. […]

Loading...

Reply
Beware of the TAR Pits! – Part One | e-Discovery Team ® says:

February 16, 2014 at 4:31 pm

[…] When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three; and, Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using […]

Loading...

Reply
Beware of the TAR Pits! – Part Two | e-Discovery Team ® says:

February 23, 2014 at 6:42 pm

[…] Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Two, and search of Jaccard in my […]

Loading...

Reply