A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents

This is my first report comparing two different searches of 699,082 Enron documents that I performed in 2012 and 2013. I understand this may be the first study of the outcomes of two searches of a large dataset by a single reviewer. It is certainly the first such study concerning legal search. The inconsistencies between the two reviews are, I am told, of scientific interest. The fact that I used two different predictive coding methods in my experiment is also of some interest. This blog is a draft of what I hope will become a formal technical article. My private thanks to those scientists and other experts who have already provided criticisms, suggestions and encouragement of this study.

This draft report sets forth the metrics of both reviews in detail and provides a preliminary analysis of the consistencies and inconsistencies of the document classifications. I conclude with my opinion of the legal implications of these findings for the current debate over disclosure of irrelevant documents used in machine training. In a future blog I will provide a preliminary analysis of the comparative effectiveness of the two methods used in the reviews.

I welcome peer reviews, criticisms, and suggestions from scientists and academics with an interest in this study. I also welcome dialogue with attorneys concerning the legal implications of these new findings. Private comments may be made by email to Ralph.Losey@gmail.com and public comments in the comment section at the end of this article.

Objective Report of the Two Reviews

The 699,082-document Enron dataset reviewed is the EDRM derived version of emails and attachments. It was processed and hosted by Kroll Ontrack on two different accounts. Both reviews used Kroll Ontrack’s Inview software, although the second review used a slightly upgraded version. Both reviews had the same goal: to find all documents related to involuntary employee termination, not voluntary termination. A simple classification scheme was used in which every document was coded as irrelevant, relevant, or relevant and hot (highly relevant).

The review work was performed by a single subject matter expert (SME) on employee termination, namely the author, a partner in the Jackson Lewis law firm, which specializes in employment litigation. The author is in charge of the firm’s electronic discovery and has thirty-three years of experience with legal document reviews.

The first review was done in May and June 2012 over eight days. The second was done in January and February 2013 over approximately twelve days. Both reviews were done solo by the same SME without outside help or assistance. The SME expended approximately 52 hours on each project, for a total of 104 hours. That was 52 hours of review and analysis time per project; it does not include time spent writing up the search reports or waiting on computer processing.

The original purpose of the first review was to improve the author’s familiarity with the predictive coding features of Inview and provide a narrative for instructional purposes of his use of the bottom line driven hybrid multimodal approach to review that he endorses. The author prepared a detailed narrative describing this first review project published on his e-Discovery Team blog. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron (2012).

The purpose of the second review was to perform an experiment to evaluate the impact of using a different methodology to do the same review. In the second review the author used a bottom line driven hybrid monomodal approach. A series of videos and blogs describing the review have also been published on the author’s blog. Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (2013). The video reports include satirical segments based on the Star Trek Borg villains to try to convey the boring, stifling qualities of the Monomodal review method.

The review method used in the first review is called Multimodal in this report, and the second method is called Monomodal. The second approach is also sometimes nicknamed the Borg method, more specifically the Hybrid Enlightened Borg approach. Losey, R., Three-Cylinder Multimodal Approach To Predictive Coding. The author does not endorse the Monomodal method, but wanted to know how effective it was compared to the Multimodal method. The author has discussed these two methods of search and review at length in many articles. See the CAR page of the e-Discovery Team blog for a complete listing.

Since these contrasting review methods are described in detail elsewhere, only a simple summary is provided here. The two methods both use predictive coding analysis and document ranking, and both use human (SME) judgment to select documents for training in an active machine learning process. The primary difference is that the Monomodal method uses only the predictive coding search and review techniques, whereas the Multimodal method used predictive coding methods plus a variety of other search methods, especially keyword search and similarity search. The Multimodal method used multiple modes of search to find training documents for active machine learning.

Attempt to Emulate Two Separate Reviews

An attempt was made to keep each of the reviews as separate and independent as possible. The goal was to prevent the SME’s memory of coding a document one way in the first review from influencing his coding of the same document in the second review. For that reason, and others, the SME never consulted the classifications made in the first review as part of the second. In fact, the Kroll Ontrack review platform for the first review project was never opened after the first project completed until just recently, to make this comparative analysis. Further, the SME intentionally did not review notes of the first project to try to refresh his memory for the second. To the contrary, the SME tried as far as possible to forget his first reviews and approach the second project as a tabula rasa. That is one reason there was a seven-month delay between the two reviews.

In general the SME self-reports a good but not exceptional memory for document recollection. Moreover, in the seven-month interim between the two reviews (May-June 2012 to January-February 2013), the SME had done many other review projects. He had literally read tens of thousands of other documents during that time period, none of which were part of this Enron database.

For those reasons the SME self-reports that his attempt to start afresh was largely successful. He did not recognize most of the documents he saw for the second time in the second review, but he did recognize a few, and recalled his prior classifications for some of them. It was not possible for him to completely forget all of the classifications he had made in the first review during the course of the second review. The ones he recognized tended to be the more memorable documents (such as the irrelevant photos of naked women that he stumbled upon in both reviews, and the Ken Lay emails). He did recall those documents and his previous classifications of those documents. But this involved a very small number of documents. The SME estimates that he recognized less than 100 unique documents (not including duplicates and near duplicates of the same documents, of which there are many in the 699,082 EDRM Enron dataset).

Also, the SME recognized between 10 and 20 grey-area documents where the relevancy determinations were difficult and, to a certain extent, arbitrary. He knew that he had seen them before, but could not recall how he had previously coded them, and, as mentioned, he made no effort to do so. His analysis and internal debate on these and all other documents reviewed concerned whether they were relevant or not. The classifications were made entirely anew on all documents, especially these ambiguous ones, rather than by relying on the SME’s uncertain memory of how they were previously classified.

Caveats and Speculations

In spite of these efforts to emulate two separate reviews, the recollection of the SME on some documents should be taken into consideration and the metrics on inconsistent reviews taken as a floor. If there had been a longer delay in time between the two reviews, say two years instead of seven months, it is reasonable to assume the inconsistencies would increase. The author would, however, expect any such increase to be relatively minor.

It is also important to note the SME’s impression (admittedly subjective, but based on over thirty years of experience with document review and relevancy determinations) that if he had studied his prior reviews before beginning the second review, and had otherwise taken some minimal efforts to refresh his memory, he would have significantly reduced the number of inconsistencies. Further, the author believes that a shorter delay between the reviews (for instance, ten days instead of seven months) would also have lessened the inconsistency rate with no additional efforts on his part.

The imposition of quality control procedures designed for consistencies between the two reviews would, in the author’s view, have drastically reduced the inconsistency rate. Again, any such procedures were intentionally omitted here to try to emulate, as far as possible, two completely separate and independent reviews.

Summary of Metrics of the Two Reviews

In the first review, which used the Multimodal method, 146,714 documents were coded as follows:

  • 1,507 random sample generated at the beginning of the project, plus
  • 1,605 from null-set random sample at the end of the project, plus
  • 1,000 machine selected from five rounds of training (sub-total 4,112), plus
  • 142,602 human judgmental selected.

The coding classifications were 661 relevant and 146,053 irrelevant. (This count does not include the approximate 30 additional relevant documents found in the post-hoc analysis of the project.)

This 661 total includes 18 documents considered Highly Relevant or Hot.

Further, it should be noted that the remaining 552,368 documents (699,082 − 146,714) were never classified. They were considered by the SME to be irrelevant due to low predictive coding ranking and other reasons, and were treated as irrelevant even though the SME never classified them, not even through bulk coding.

Of the 146,714 total documents categorized only approximately 2,500 were actually read and individually reviewed by the SME. This study of inconsistent classifications only considers these documents. (Note that only 1,981 documents are recorded in Inview software as reviewed, but the SME read many documents without categorization and used bulk coding instead, and thus this count is artificially low. The SME believes the upward adjustment to 2,500 is approximately correct). This means the SME categorized or coded 144,214 documents by using Inview software’s mass categorization features, which allows for categorization without actually reviewing each individual document. This is common for duplicative documents or document types.

In the first review, of the 661 documents classified as relevant only 333 were specified for training. The 333 training documents include all documents identified as relevant in the machine selected document sets. The 328 documents classified as relevant by the SME but not specified for training were all derived from the 142,602 human judgmental selected documents. These documents were not specified for use as training documents because the SME thought it might skew or confuse the machine learning to include them. Documents were sometimes excluded because of unusual traits and characteristics of the relevant document, or to avoid excessive weighting of particular document types that might bias the training; for example, where duplicates or near duplicates had already been used several times for training. (As mentioned, this was one of my first predictive coding projects, and I am not sure this strategy of mass withholding of documents from training to mitigate bias was correct. If I had a do-over I would probably train on more documents and trust the software more to sort it out.) Some documents specified for training by the SME were not in fact used for training, but were instead used by the Inview software as part of the initial control set for testing purposes. Documents in a control set for testing purposes are not also used for machine training. Only 1 of the 333 relevant documents specified for training by the SME in the first review was so removed from training and used instead in the control set.

In the first review, of the 146,053 documents classified as irrelevant only 2,586 were specified for training. The 2,586 training documents include all documents identified as irrelevant in the machine selected document sets. The 143,467 documents classified as irrelevant by the SME but not specified for training were all derived from the 142,602 human judgmental selected documents or the random samples. These documents were not specified for use as training documents because the SME thought it might skew or confuse the machine learning to include them. Documents were sometimes excluded because of unusual traits and characteristics of the irrelevant document, or to avoid excessive weighting of particular document types that might bias the training; for example, where duplicates or near duplicates had already been used several times for training. 1,063 of the 2,586 documents specified for training by the SME were not in fact used for training by the Inview software; they were instead used as part of the initial control set for testing purposes. Therefore, after removal of the control set of 1,063 irrelevant documents used for testing, only 1,523 irrelevant documents were used for machine training.

In the second review, which used the Monomodal method, 48,959 documents were coded as follows:

  • 10,000 machine selected, not random, with exactly 200 documents in each of the 50 rounds, plus
  • 2,366 random selected by two 1,183 random samples, one at the beginning and another at the end of the project, plus
  • 36,593 human judgmental selected.

The coding classifications were 579 documents relevant and 48,380 irrelevant.

This 579 total includes 13 documents considered Highly Relevant or Hot.

Again, it should be noted that the remaining 650,123 documents (699,082 − 48,959) were considered by the SME to be irrelevant due to low predictive coding ranking and other reasons. They were treated as irrelevant even though the SME never classified them, not even through bulk coding.

Of the 48,959 total documents categorized only approximately 12,000 were actually read and individually reviewed by the SME. This study of inconsistent classifications only considers these documents.  (Again note this is a best estimate as explained above. Inview records 11,601 as physically reviewed.) This means the SME categorized or coded 36,959 documents by using Inview software’s mass categorization features.

The first Multimodal review identified 18 highly relevant or Hot Documents. The second Monomodal review found only 13 of these 18 Hot documents. No Hot documents were found in the Monomodal review that had not been found in the Multimodal review. Five Hot documents were found in the Multimodal Review that were not also found in the Monomodal review. All were individually reviewed by the SME in both projects.

The Monomodal review thus found only 72% of the Hot documents found by the earlier Multimodal review. Put another way, the Multimodal method did 38% better in finding the total Hot documents than Monomodal.

In the second review of the 579 documents classified as relevant only 577 were specified for training. The 2 documents categorized as relevant and not specified for training were from the 36,593 human judgmental selected documents for the same reasons mentioned in the first review. Further, 1 relevant document specified for training was not in fact used to train the system, but was instead used by the Inview software as part of the control set. Therefore only 576 relevant documents were used for machine training.

In the second review of the 48,380 documents classified as irrelevant only 10,948 were used for training. All 10,000 documents identified as irrelevant in the machine selected document sets were used for training. The 37,432 documents classified as irrelevant by the SME and not used for training were all derived from the 36,593 human judgmental selected documents and the random samples. These documents were not used for training for the reasons previously described, but primarily to avoid confusing cumulative training that might bias the training. In addition, of the 10,948 irrelevant documents specified for training, 1,063 were diverted by the Inview software for use in the control set, and thus used for testing and not machine training. Therefore only 9,885 irrelevant documents were used for machine training.

A comparison of the relevant documents found by each method showed the following:

  • The 661 relevant found by Multimodal included 376 documents not found by Monomodal, which means 57% were unique. The 661 relevant included 18 Hot documents, 5 of which were not found by Monomodal, which means 28% were unique.
  • The 579 relevant found by Monomodal included 294 documents not found by Multimodal, which means 51% were unique. The 579 relevant included 13 Hot documents, all of which were also found by Multimodal, which means 0% were unique.
  • There were a total of 955 relevant documents found by the two methods combined.
  • There were 285 relevant documents found by both the Multimodal and Monomodal methods, which is 30% of the total 955 found.

The comparisons between the two reviews of relevant document classifications are shown in the Venn diagram below.

[Venn diagram: relevant documents found by the two methods]

The 285 relevant documents found in both reviews represent an Overlap or Jaccard index of 29.8% (285/(376+579) = 285/955). Ellen M. Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt. 697, 700 (2000) (“Overlap is defined as the size of the intersection of the relevant document sets divided by the size of the union of the relevant document sets.”); Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII Rich. J.L. & Tech. 11 (2011), pgs. 10-11.
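As a sanity check, the overlap calculation can be reproduced in a few lines of Python from the counts reported above (the variable names are mine, for illustration only):

```python
def jaccard(intersection, union):
    """Jaccard index: size of the intersection divided by size of the union."""
    return intersection / union

multimodal_relevant = 661   # relevant documents found by Multimodal
monomodal_relevant = 579    # relevant documents found by Monomodal
found_by_both = 285         # relevant documents found in both reviews

# Union = all distinct relevant documents found by either review
union = multimodal_relevant + monomodal_relevant - found_by_both

print(f"Union: {union}")                                      # 955
print(f"Jaccard index: {jaccard(found_by_both, union):.1%}")  # 29.8%
```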

32 Different Documents Reviewed and Coded Relevant by Multimodal and Irrelevant by Monomodal

A study was made for this report of the content of the 376 documents that were only marked as relevant in the Multimodal review performed in May-June 2012, and not marked as relevant in the later January-February 2013 Monomodal (Borg) review. The Inview software shows that the SME had in fact individually reviewed 118 of these 376 documents in the second Monomodal review and determined them to be irrelevant.

A study of the content of these 118 documents shows that 86 of the 118 documents were duplicates, or near duplicates, leaving a total of 32 unique documents with inconsistent SME review classifications. When the SME found or was presented with these same 32 documents in the earlier May-June 2012 Multimodal review he had marked them as relevant.

This is evidence of inconsistent reviews by the SME showing concept drift. All of these documents were grey-area types where the SME changed his view of relevance to be more restrictive. The SME had narrowed his concept of relevance.

A study of these 32 documents shows that there were no obvious errors made in the coding. It is therefore reasonable to attribute all of the inconsistent classifications to concept shift on these documents, not pure human error, such as where the SME intended to mark a document relevant, but accidentally clicked on the irrelevant coding button instead. (This kind of error did happen in the course of the review but quality control efforts easily detected these errors.)

31 Different Documents Reviewed and Coded Relevant by Monomodal and Irrelevant by Multimodal

A study was also made for this report of the content of the 294 documents that were only marked as relevant in the Monomodal (Borg) review performed in January-February 2013, and not marked as relevant in the earlier May-June 2012 Multimodal review. The Inview software shows that the SME had in fact individually reviewed 38 of these 294 documents in the first Multimodal review and determined them to be irrelevant.

A study of the content of these 38 documents shows that 7 of the 38 documents were duplicates, or near duplicates, leaving a total of 31 unique documents with inconsistent SME review classifications. When the SME found or was presented with these same 31 documents in the later January 2013 Monomodal review he had marked them as relevant.

This is again evidence of inconsistent reviews by the SME showing concept drift. All of these documents were grey area types where the SME changed his view of relevance, but this time to be more inclusive. The SME had expanded his concept of relevance.

A study of these 31 documents shows that there were no obvious errors made in the coding.  It is therefore reasonable to attribute all of the inconsistent classifications to concept shift on these documents, not pure human error.

211 Different Documents Reviewed and Coded Relevant by Both Multimodal and Monomodal 

A study was also made for this report of the content of the 285 relevant documents found by both the Multimodal and Monomodal methods. In both projects all documents coded as relevant by the SME had been individually reviewed by him before final classification. A study of the content of these 285 documents shows that 74 of them were duplicates, or near duplicates, leaving a total of 211 unique documents with consistent SME review classifications. When the SME found or was presented with these same 211 documents in both projects he had marked them as relevant.

274 Different Documents Reviewed and Coded Relevant by One or Both Methods

To summarize the prior unique total relevant document counts, after removal of all duplicates or near duplicates there were a total of 274 different documents coded relevant by one or both methods. This compares to the earlier 955 total relevant document count before deduplication.

11 Different Documents Reviewed and Coded as Hot by One or Both Methods

A study was also made of the content of the 18 documents coded as Hot. In both projects all documents coded as Hot by the SME had been individually reviewed by him before final classification. A study of the content of these 18 documents shows that 7 of them were duplicates, or near duplicates, leaving a total of 11 unique documents. There was only 1 duplicate among the 5 Hot documents that the Multimodal review located and the Monomodal review did not. There were 6 more duplicates among the 13 other Hot documents discovered in both reviews. Therefore, after removing a total of 7 duplicate documents there were 11 unique Hot documents. (These 11 unique Hot documents are also included within the total 274 unique relevant documents count.) Monomodal found 7 and missed 4; Multimodal found all 11. The Monomodal review thus missed 36% of the unique Hot documents. Put another way, the Multimodal method did 57% better in finding the unique Hot documents than Monomodal.
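The Hot-document percentages above reduce to simple ratios, which can be checked as follows (again, the variable names are mine, taken from the report's counts):

```python
total_unique_hot = 11   # unique Hot documents; Multimodal found all of them
monomodal_found = 7     # unique Hot documents found by Monomodal
monomodal_missed = total_unique_hot - monomodal_found   # 4

# Fraction of unique Hot documents the Monomodal review missed: 4/11
print(f"Monomodal missed: {monomodal_missed / total_unique_hot:.0%}")  # 36%

# Multimodal's relative advantage: (11 - 7) / 7
print(f"Multimodal advantage: {monomodal_missed / monomodal_found:.0%}")  # 57%
```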

This differential between the unique Hot documents discovered in both reviews is shown in this Venn diagram. The Jaccard index for Hot document classification was 64% (7/(7+4)).

[Venn diagram: unique Hot documents found by the two methods]

Documents Categorized as Irrelevant by Both Multimodal and Monomodal 

The Multimodal method review categorized 146,053 documents as irrelevant. Of that total, 1,517 were categorized after review of each document, and 144,536 were bulk coded without the SME reviewing each individual document.

The Monomodal method review categorized 48,380 documents as irrelevant. Of that total, 11,083 were categorized after review of each document, and 37,297 were bulk coded without the SME reviewing each individual document.

The Agreement in coding the same documents irrelevant in both reviews was 31,109.

Of the 31,109 total documents categorized as irrelevant in both projects only approximately 3,000 were actually read and individually reviewed by the SME in both projects. This study of inconsistent classifications only considers these documents. (Note that only 2,500 documents are recorded in Inview software as reviewed, but the SME read many documents without categorization and used bulk coding instead, and thus this count is artificially low. The SME believes a 500 document upward adjustment to 3,000 is approximately correct.) This means the SME categorized or coded 28,109 of the 31,109 overlapping irrelevant documents by using Inview software’s mass categorization features.

Concept Drift Analysis

First, it is interesting to see that the change in concept drift from the first project to the second was approximately equal in both directions. Although the total counts were different due to duplicate documents, the SME changed his opinion in the second review from irrelevant to relevant on 31 different documents, and from relevant to irrelevant on 32 different documents.

The overall metrics of inconsistent coding of 274 unique relevant documents are as follows:

  • 211 different documents were coded relevant consistently in both reviews;
  • An additional 63 different documents were coded inconsistently, of which,
    • 49% (31) were first coded irrelevant in Multimodal and then coded relevant in Monomodal (false positives).
    • 51% (32) were first coded relevant in Multimodal and then coded irrelevant in Monomodal (false negatives).

An inconsistency of coding of 63 out of 274 relevant documents represents an inconsistency rate of 23%. Put another way, the coding on documents determined to be relevant was consistent 77% of the time. Again, this latter calculation is known as the Jaccard measure. See Voorhees, Variations, supra, and Grossman & Cormack, Technology-Assisted Review, supra. See also William Webber, How accurate can manual review be? Again, the Jaccard index is formally defined as the size of the intersection, here 211, divided by the size of the union of the sample sets, here 274 (211+32+31). Therefore the Jaccard index for the individual review of relevant documents in the two projects is 77% (211/274). This is shown by the Venn diagram below.

[Venn diagram: consistent and inconsistent codings of the 274 unique relevant documents]

Several prior studies have been made of reviews for relevant documents that employed the Jaccard measure. The best known is the study by Ellen Voorhees that analyzed agreement among professional analysts (SMEs) in the course of a TREC study. It found that two SMEs (retired intelligence officers) agreed on responsiveness on only 45% of the documents. When three SMEs were considered they agreed on only about 30% of the documents. Voorhees, Variations, supra. Also see Grossman & Cormack, Technology-Assisted Review, supra. It appears from the Voorhees report that the SMEs in that study were examining different documents that did not include duplicates. For that reason the Jaccard measure over different (deduplicated) documents in the instant study, 77%, would be the appropriate comparison, not the 29.8% measure computed when duplicate documents were included.

A more recent study of a legal project using contract lawyers found a Jaccard measure of 16% between the first review and follow-up reviews based on samples of the first. Roitblat, Kershaw & Oot, Document categorization in legal electronic discovery: computer classification vs. manual review, Journal of the American Society for Information Science and Technology, 61(1):70–80. The Jaccard index numbers were extrapolated by Grossman and Cormack in Technology-Assisted Review, supra at pgs. 13-14. Also see the Grossman-Cormack Glossary, Ver. 1.3 (2012), which defines the Jaccard index and goes on to state that expert reviewers commonly achieve Jaccard index scores of about 50%, and that scores exceeding 60% are very rare.

Analysis of Agreement in Coding Irrelevant Documents

The author is aware that comparisons of coding of irrelevant documents are not typically considered important in information retrieval studies for a variety of reasons, including the different prevalence rates in review projects. For that reason studies typically report the Jaccard measure for relevant classifications only. Still, in view of the legal debate concerning the disclosure of irrelevant documents, this paper includes a brief examination of the total Agreement rates, including irrelevancy determinations. Further, Agreement rates are interesting and appropriate here since both studies concern a review of the exact same Enron dataset of 699,082 documents, and thus the same prevalence, and they rely not on random samples but on two full reviews.

The high Agreement rates on irrelevant classifications in the two reviews are of special significance in the author’s opinion because of the current debate in the legal community concerning procedures for predictive coding review. Several courts have already adopted the position that all relevant and all irrelevant documents used in training should be disclosed to a requesting party, even though the legal rules of procedure only require disclosure of relevant documents. Da Silva Moore et al. v. Publicis Groupe SA, 2012 WL 607412 (S.D.N.Y. Feb. 24, 2012) (Peck, M.J.), aff’d, 2012 WL 1446534 (S.D.N.Y. April 26, 2012) (Carter, J.); Global Aerospace Inc., et al. v. Landow Aviation, L.P., et al., 2012 WL 1431215 (Va. Cir. Ct. April 23, 2012); In re Actos (Pioglitazone) Products, MDL No. 6-11-md-2299 (W.D. La. July 27, 2012). Many attorneys and litigants take the contrary position that irrelevant documents should never be disclosed, even in the context of active machine learning. See Solomon, R., Are Corporations Ready To Be Transparent And Share Irrelevant Documents With Opposing Counsel To Obtain Substantial Cost Savings Through The Use of Predictive Coding?, Metropolitan Corporate Counsel 20:11 (Nov. 2012).

Although the author has been flexible on this issue in some cases, before these results were studied the author had been advocating a do-not-disclose irrelevant documents position. Losey, R., Keywords and Search Methods Should Be Disclosed, But Not Irrelevant Documents (May 26, 2013). The author now contends that the Agreement and Jaccard index data shown in this study support a compromise position where limited disclosure may sometimes be appropriate, but only of borderline documents where irrelevancy is uncertain or likely subject to debate.

In the author’s opinion the inclusion of analysis of irrelevant coding by the SME in these two reviews allows for a more complete analysis and understanding of the types of documents and document classifications that cause inconsistent reviews. Again, to do this fairly the universe of classifications has been limited to those where the SME actually reviewed the documents, and also duplicate document counts have been eliminated. This seems to be the best measure to provide a clear indication of the types of documents that are inconsistently coded.

The inclusion of all review determinations in a consistency analysis, not just review decisions where a document is classified as relevant, provides critical information to understand the reasonability of disclosure positions in litigation. This is discussed in the conclusions below. This also seems appropriate when analyzing active machine learning where the training on irrelevance is just as important as the training on relevance.

In both projects the SME coded the same 31,109 unique documents as irrelevant. Of these 31,109 overlapping documents, the SME actually read and reviewed approximately 3,000 and bulk coded the rest (28,109).

Thus in both projects the SME read and individually reviewed 3,274 unique documents: 3,000 documents were marked irrelevant and 274 marked relevant. This is shown in the Venn diagram below. Of the 3,274 identical documents reviewed there were only 63 inconsistencies. This represents an overall inconsistency error rate of 1.9% (63/3,274). Thus the Agreement rate for review of both relevant and irrelevant documents is 98.1% (3,211/3,274).
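The arithmetic behind these two headline figures can be reconstructed in a few lines. This is a sketch, not part of the study itself; it assumes that the 274 relevant documents represent those coded relevant in at least one of the two reviews, which is consistent with the 77% Jaccard figure reported later:

```python
# Counts reported above for the documents the SME read in both projects
reviewed = 3274        # unique documents individually reviewed in both reviews
disagreements = 63     # documents coded inconsistently between the two reviews
relevant_union = 274   # documents coded relevant in at least one review (assumed)

# Overall Agreement rate across relevant and irrelevant classifications
agreement = (reviewed - disagreements) / reviewed
print(f"{agreement:.1%}")  # 98.1%

# Jaccard index on relevance: intersection over union of the two relevant sets
relevant_in_both = relevant_union - disagreements
jaccard = relevant_in_both / relevant_union
print(f"{jaccard:.1%}")    # 77.0%
```

Note that the Agreement rate counts the easy irrelevant calls, while the Jaccard index is computed only over documents coded relevant, which is why the two figures diverge so sharply.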

[Venn diagram: inconsistency comparison]

Conclusions Regarding Inconsistent Reviews

These results suggest that when only one highly motivated SME reviewer is involved, overall consistency rates in review are much higher than when multiple non-SME reviewers of questionable motivation (contract reviewers) are involved (77% v. 16%), or multiple SMEs of unknown motivation and knowledge (the retired intelligence officers in the Voorhees study) (77% v. 45% with two SMEs, and 30% with three SMEs). These comparisons are shown visually in the graph below.

[Graph: Review Consistency Rates]

These results also suggest that with one SME reviewer the classification of irrelevant documents is nearly uniform (98%+ Agreement), and that the inconsistencies primarily lie in relevant categorizations (77% Jaccard) of borderline relevant documents. (A caveat should be made that this observation is based on unfiltered data, and not a keyword collection or data otherwise distorted with artificially high prevalence rates.)
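For readers unfamiliar with the metric, the Jaccard index used throughout this report is the size of the intersection of the two sets of documents coded relevant, divided by the size of their union. A minimal sketch, with hypothetical document IDs for illustration only:

```python
def jaccard_index(coding_a, coding_b):
    """Jaccard index: |A ∩ B| / |A ∪ B| over the documents coded relevant."""
    a, b = set(coding_a), set(coding_b)
    if not (a | b):
        return 1.0  # vacuously identical when neither review coded anything relevant
    return len(a & b) / len(a | b)

# Hypothetical example: two reviews sharing 2 of 4 total relevant documents
review_1 = {"doc1", "doc2", "doc3"}
review_2 = {"doc1", "doc2", "doc4"}
print(jaccard_index(review_1, review_2))  # 0.5
```

Because irrelevant documents appear in neither set, the Jaccard index isolates disagreement on relevance calls and ignores the far larger mass of easy irrelevant classifications.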

The 77% Jaccard measure is consistent with the test reported by Grossman and Cormack of an SME (Topic Authority in TREC language) reviewing her own prior adjudications of ten documents and disagreeing with herself on three of the ten classifications, and classifying another two as borderline. Grossman & Cormack, Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012) at pgs. 17-20.

The overall Agreement rate of 98% of all relevancy determinations, including irrelevant classifications where almost all classifications are easy and obvious, strongly suggests that the very low Jaccard index rates measured in previous studies of 16% to 45% were more likely caused by human error, not document relevance ambiguity or genuine disagreement on the scope of relevance. A secondary explanation for the low scores is lack of significant subject matter expertise such that the reviewers were not capable of recognizing a clearly relevant document. Even if you only consider the determinations of relevancy, and exclude determinations of irrelevancy, the 77% Jaccard index is still significantly greater than the prior 16% to 45% consistency rates.

The findings in this study thus generally support the conclusions of Cormack and Grossman that most inconsistencies in document classifications are due to human error, not the presence of borderline documents or the inherent ambiguity of all relevancy determinations. Id. Of the 3,274 different documents the SME read in both projects in the instant study only 63 were seen to be borderline grey area types, which is less than 2%. There are certainly more grey area relevant documents than that in the 3,274 documents reviewed (excluding the duplication and near duplication issue), but they did not come to the author’s attention in this post-hoc analysis because the SME was consistent in review of these other borderline documents. Still, the findings in this study support the conclusions of Grossman and Cormack that only approximately 5% of documents in a typical unfiltered predictive coding review project are of a borderline grey area type.

The findings and conclusions support the use of SMEs with in-depth knowledge of the legal subject, and the use of as few SMEs to do the review as possible. The study also strongly suggests that the greatest consistency in document review arises from the use of one SME only.

These findings and conclusions also reinforce the need for strong quality control measures in large reviews where multiple reviewers must be used, especially when the reviewers are relatively low-paid, non-SMEs.

The inconsistencies (the complement of the Jaccard index) shown in this study for determinations of relevance, excluding the classifications of irrelevant documents, were relatively small – 23%, as compared to 55%, 70% and 84% in prior studies. Moreover, as mentioned, they were all derived from grey area or borderline type documents, where relevancy was a matter of interpretation. In the author’s experience documents such as this tend to have low probative value. If they were significant to litigation discovery, then they usually would not be of a grey area, subjective type. They would instead be obviously relevant. I say usually because the author has seen rare exceptions, typically in situations where one borderline document leads to other documents with strong probative value. Still, this is unusual. In most situations the omission of borderline ambiguous documents, and others like them, would have little or no impact on the case.

These observations, especially the high consistency of irrelevance classifications (98%+), support the strict limitation of disclosure of irrelevant documents as part of a cooperative litigation discovery process. Instead, only documents that a reviewer knows are of a grey area type or likely to be subject to debate should be disclosed. (The SME in this study was personally aware of the ambiguous type grey area documents when originally classifying these documents. They were obvious because it was difficult to decide if they were within the border of relevance, or not. The ambiguity would trigger an internal debate where a close question decision would ultimately be made.)

Even when limiting disclosure of irrelevant documents to those that are known to be borderline, disclosure of the actual documents themselves may frequently not be necessary. A summary of the documents with explanation of the rationale as to the ultimate determination of irrelevance should often suffice. The disclosure of a description of the borderline documents will at least begin a relevancy dialogue with the requesting party. Only if the abstract debate fails to reach agreement would disclosure of the actual documents be required. Even then it could be done in camera to a neutral third-party, such as a judge or special master. Alternatively, disclosure could be made with additional confidentiality restrictions pending a ruling by the court.

I am interested in what conclusions others may draw from these metrics regarding concept drift from one review project to the next, inconsistencies of single human reviewers, and other issues here discussed. The author welcomes public and private comments. Private comments may be made by email to Ralph.Losey@gmail.com and public remarks in the comment section below. Marketing type comments will be deleted.

11 Responses to A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents

  1. Ellen Voorhees says:

    Hi, Ralph. Interesting study.

    I do disagree with one point, though. You say:


    The overall Agreement rate of 98% of all relevancy determinations, including irrelevant classifications where almost all classifications are easy and obvious, strongly suggests that the very low Jaccard index rates measured in previous studies of 16% to 45% were more likely caused by human error, not document relevance ambiguity or genuine disagreement on the scope of relevance. A secondary explanation for the low scores is lack of significant subject matter expertise such that the reviewers were not capable of recognizing a clearly relevant document. Even if you only consider the determinations of relevancy, and exclude determinations of irrelevancy, the 77% Jaccard index is still significantly greater than the prior 16% to 45% consistency rates.
    —-

The disagreements among the TREC assessors (the 45% overlap agreement) were definitely NOT due mostly to human error (though, of course, some errors were made). It was also not a case of insufficient expertise. In that study, the differences among assessors would be equivalent to differences of opinion among different topic authorities, not among different reviewers who were all trying to match one topic authority’s conception of relevance. And that is a key point: different people *do* have different conceptions of what is relevant, akin to your loosening or narrowing the scope of what you considered relevant.

    I completely agree with the conclusion that to get the most consistent judgments you want to use as few reviewers as possible and they should all be judging against one person’s opinion as to what is relevant. (And that is precisely why TREC only uses one assessor per topic for most of its collections.) But once one side has perfectly judged its documents according to its lead attorney’s conception of relevance, the other side may well—and probably will—disagree on some of those decisions.

    Ellen Voorhees

    • Ralph Losey says:

Thank you for the comment. Also, I note you did not correct my assumption about your prior study of the intelligence officers’ inconsistencies: that the documents they examined did not include duplicates, and thus that my removal of duplicates from my datasets was needed for a fair comparison. If that assumption is incorrect, please let me know.

      As to the merits of your comment: “The disagreements among the TREC assessors (the 45% overlap agreement) was definitely NOT due to mostly human error (though, of course, there were some errors made).” – I take it you also do not agree with the Cormack & Grossman paper that concludes, as I did, that human error was the primary contributor to inconsistencies, not inherent ambiguity of relevance calls. Have you set forth particular rebuttals or support for your position on this issue in a paper that I could examine? Or can you refer me to other papers on this subject that supports your contention?

Again, thanks for taking the time to comment. This is a preliminary draft and my analysis is certainly not final. All of your input will be carefully studied and considered. I realize that this is your field of special expertise, and I am just a lawyer trying to understand legal document reviews better by dabbling in this field. Like you I am after the truth, not to vindicate any particular position. Frankly I had previously disagreed with the Cormack and Grossman conclusion too, but my experience with this study is changing my mind. I can’t get past the fact that the ambiguous documents, which we all agree exist in any dataset for most any relevance classification, made up only 2%, although I concede, as did Grossman and Cormack in their study, that it was probably around 5%. If that is correct, how can 5% ambiguous documents explain so many inconsistent reviews? Human error and inadequate expertise seem to be valid and reasonable explanations to me; especially as compared to the alternative that a majority of relevance calls are just a matter of conjecture. As a lawyer accustomed to rulings on relevance I find that hard to accept. Indeed, our system of justice and evidentiary proof is built on the assumption that relevance is objective, not subjective.

      Ralph

      • Ellen Voorhees says:

        The document collections on which the original TREC relevance study was conducted were newswire collections. So there were probably some close duplicates (re-writes correcting/updating some info) but we treated each document as if it were independent for both system scoring and in the relevance study. I’m sure the amount of duplication that is present in those collections is very much smaller than in the Enron collection. Assuming duplicates were bulk coded, then I think it is correct to compute the agreement levels over the de-duped numbers. (If they were not bulk-coded, that is another level of (in)consistency checking that can be done: were identical documents coded identically.)

I don’t explicitly disagree with the Cormack and Grossman conclusion, because my reading [perhaps incorrect] of their paper is that their error-rate claim applies to the situation in which a cadre of reviewers was trying to match an articulated common relevance definition. The articulation may have been imperfect, but there was an explicit expectation that the reviewers were to judge by the *topic authority’s* concept of relevance (not the reviewer’s own concept of relevance). As I was trying to say in my earlier comment, trying to match a common definition of relevance is very different from comparing individual independent definitions of relevance. In the case of the one common definition, you can call judgments that differ from the common definition ‘errors’ when they directly conflict with that common definition. When you have independent definitions of relevance, there is much less ground for calling one judgment an error as compared to a differing judgment.

This is one area where legal discovery differs from most of IR research. For legal discovery, there really is some one person (the lead attorney) whose definition of relevance has more authority than other definitions, and reviewers are rightly expected to match that definition. In most other IR tasks, the only person’s opinion that matters (for the sake of evaluating how well the system did) is the user doing the search. If you and I both use the same query to answer the same information need, but we have different ideas as to what is relevant to that need, I am going to be less satisfied with the system’s output if it matches your conception of relevance, even though you will think the system is performing perfectly. And there is no basis to pick either one of our definitions as the (one and only) ‘correct’ one.

I have not written anything directly on error rates vs. differences of opinion. My claim that the TREC study differences were not mostly errors is based on my looking at some of the conflicts—and having no basis to claim that one assessor’s opinion was any more valid than another’s. (Also note that as an IR test collection builder, the main complaint lobbed my way is that test collections do not have a sufficiently nuanced conception of relevance—that simple binary (or even 3-5 level) relevance scales are insufficient to capture the true nature of relevance. The citation I use for an information scientist’s viewpoint of relevance is Linda Schamber, Relevance and Information Behavior, 29 Annual Review of Information Science and Technology 3–48 (1994).)

I do think that your desire for relevance to be deemed ‘objective’ rather than ‘subjective’ is doomed—but I also think it doesn’t matter that much. I think it is doomed in the sense that, given the task of classifying a set of items into a small number of disjoint classes where each item must be placed in exactly one class, every non-trivial set of items is going to have multiple reasonable classifications. I say this based on my work with the TREC question answering track. In that track, the questions were short-answer, factoid-type questions (“Who invented the paper clip?” “When were the Crusades?”) and assessors were asked to judge the correctness of answer strings. Even here, we had different assessors give somewhat different judgments, because assessors differed in the granularity of the answer strings they were willing to accept as correct. For names, some assessors accepted last-name-only, while others wanted more complete names. For dates, some wanted precise time frames, while others accepted more generic designations (the “Middle Ages”). For locations, there were geographic designation differences: if something happens in Hollywood, is it ok to give the answer as LA? as California? as US? as Earth? Different people draw the line differently (and the same person draws the line differently depending on the precise context of the question). (The details of the QA track assessor-agreement study are written up in Ellen M. Voorhees and Dawn M. Tice, Building a Question Answering Test Collection, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2000, pp. 200–207.)

How can it not be a big deal that relevance is not objective? Because the disagreements tend to be on edge cases that don’t cause material differences in the final result of the end task. For retrieval system evaluation it doesn’t matter very much because average scores are stable despite the differences. For jurisprudence it doesn’t matter very much because the probative documents are almost certainly going to be agreed on as responsive, and because the parties in a case could come to a common conception of relevance even if they don’t start out that way (says she who has no legal training whatsoever nor any experience in legal proceedings).

        Ellen

  2. […] the scientific aspects. For my own recent contribution to the science of search, see: Losey, R., A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications … […]

  3. […] the conclusion of the report on the Enron document review experiment that I began in my last blog. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications …. The conclusion is an analysis of the relative effectiveness of the two reviews. Prepare for […]

  4. […] Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents, found in two parts at http://e-discoveryteam.com/2013/06/11/a-modest-contribution-to-the-science-of-search-report-and-anal…, […]


  6. […] A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications …. (Part One). […]

  7. […] The base work in this area was done by Ellen M. Voorhees: Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt  697 (2000). The second study of interest to lawyers on this subject came ten years later by Herbert L. Roitblat, Anne Kershaw and Patrick Oot, Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, Journal of the American Society for Information Science and Technology, 61 (2010) (draft found at Clearwell Systems). The next study with significant new data on inconsistent relevance review determinations was by Maura Grossman and  Gordon Cormack in 2012: Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012); also see Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011). The fourth and last study on the subject with new data is my own review experiment done in 2012 and 2013. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications … (2013). […]


  9. […] and video series comparing two different kinds of predictive coding search methods). Also see: A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications …. (Part One); Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron […]
