A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents

June 11, 2013

This is my first report comparing two different searches of 699,082 Enron documents that I performed in 2012 and 2013. I understand this may be the first study of the outcomes of two searches of a large dataset by a single reviewer. It is certainly the first such study concerning legal search. The inconsistencies between the two reviews are, I am told, of scientific interest. The fact that I used two different predictive coding methods in my experiment is also of some interest. This blog is a draft of what I hope will become a formal technical article. My private thanks to those scientists and other experts who have already provided criticisms, suggestions and encouragement of this study.

This draft report sets forth the metrics of both reviews in detail and provides a preliminary analysis of the consistencies and inconsistencies of the document classifications. I conclude with my opinion of the legal implications of these findings on the current debate over disclosure of irrelevant documents used in machine training. In a future blog I will provide a preliminary analysis of the comparative effectiveness of the two methods used in the reviews.

I welcome peer reviews, criticisms, and suggestions from scientists and academics with an interest in this study. I also welcome dialogue with attorneys concerning the legal implications of these new findings. Private comments may be made by email to Ralph.Losey@gmail.com and public comments in the comment section at the end of this article.

Objective Report of the Two Reviews

The dataset of 699,082 Enron documents reviewed is the EDRM-derived version of emails and attachments. It was processed and hosted by Kroll Ontrack on two different accounts. Both reviews used Kroll Ontrack’s Inview software, although the second review used a slightly upgraded version. Both reviews had the same goal: to find all documents related to involuntary employee termination, not voluntary termination. A simple classification scheme was used in which every document was coded as irrelevant, relevant, or relevant and hot (highly relevant).

The review work was performed by a single subject matter expert (SME) on employee termination, namely the author, a partner in the Jackson Lewis law firm, which specializes in employment litigation. The author is in charge of the firm’s electronic discovery and has thirty-three years of experience with legal document reviews.

The first review was done in May and June 2012 over eight days. The second was done in January and February 2013 over approximately twelve days. Both reviews were done solo by the same SME without outside help or assistance. The SME expended approximately 52 hours on each project, for a total of 104 hours. Each 52-hour figure covers review and analysis time; it does not include time to write up the search reports or wait on computer processing.

The original purpose of the first review was to improve the author’s familiarity with the predictive coding features of Inview and provide a narrative for instructional purposes of his use of the bottom line driven hybrid multimodal approach to review that he endorses. The author prepared a detailed narrative describing this first review project published on his e-Discovery Team blog. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron (2012).

The purpose of the second review was to perform an experiment to evaluate the impact of using a different methodology to do the same review. In the second review the author used a bottom line driven hybrid monomodal approach. A series of videos and blogs describing the review have also been published on the author’s blog. Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (2013). The video reports include satirical segments based on the Star Trek Borg villains to try to convey the boring, stifling qualities of the Monomodal review method.

The review method used in the first review is called Multimodal in this report, and the second method is called Monomodal.  A nickname is also sometimes used for the second approach, where it is called the Borg method, more specifically the Hybrid Enlightened Borg approach. Losey, R., Three-Cylinder Multimodal Approach To Predictive Coding. The author does not endorse the Monomodal method, but wanted to know how effective it was compared to the Multimodal method. The author has discussed these two methods of search and review at length in many articles. See CAR page of e-Discovery Team blog for a complete listing.

Since these contrasting review methods are described in detail elsewhere only a simple summary is provided now. The two methods both use predictive coding analysis and document ranking, and both use human (SME) judgment to select documents for training in an active machine learning process. The primary difference is that the Monomodal method only uses the predictive coding search and review techniques, whereas the Multimodal used predictive coding methods, plus a variety of other search methods, including especially keyword search and similarity search. The Multimodal method used multiple modes of search to find training documents for active machine learning.

Attempt to Emulate Two Separate Reviews

An attempt was made to keep each of the reviews as separate and independent as possible. The goal was to keep the SME’s memory of coding a document one way in the first review from influencing his coding of the same document in the second review. For that reason, and others, the SME never consulted the classifications made in the first review as part of the second. In fact, the Kroll Ontrack review platform for the first review project was never opened after the first project completed until just recently to make this comparative analysis. Further, the SME intentionally did not review notes of the first project to try to refresh his memory for the second. To the contrary, the SME tried as far as possible to forget his first reviews and approach the second project as a tabula rasa. That is one reason there was a seven-month delay between the two reviews.

In general the SME self-reports a good but not exceptional memory for document recollection. Moreover, in the seven-month interim between the two reviews (May-June 2012 to January-February 2013), the SME had done many other review projects. He had literally read tens of thousands of other documents during that time period, none of which were part of this Enron database.

For those reasons the SME self-reports that his attempt to start afresh was largely successful. He did not recognize most of the documents he saw for the second time in the second review, but he did recognize a few, and recalled his prior classifications for some of them. It was not possible for him to completely forget all of the classifications he had made in the first review during the course of the second review. The ones he recognized tended to be the more memorable documents (such as the irrelevant photos of naked women that he stumbled upon in both reviews, and the Ken Lay emails). He did recall those documents and his previous classifications of those documents. But this involved a very small number of documents. The SME estimates that he recognized less than 100 unique documents (not including duplicates and near duplicates of the same documents, of which there are many in the 699,082 EDRM Enron dataset).

Also, the SME recognized between 10 and 20 grey area type documents where the relevancy determinations were difficult and to a certain extent arbitrary. He knew that he had seen them before, but could not recall how he had previously coded these documents. As mentioned, the SME made no effort to do so. His analysis and internal debate on these and all other documents reviewed concerned whether they were relevant, or not. The classifications were made entirely anew on all documents, especially including these ambiguous documents, rather than trying to rely on the SME’s uncertain memory of how they were previously classified.

Caveats and Speculations

In spite of these efforts to emulate two separate reviews, the recollection of the SME on some documents should be taken into consideration and the metrics on inconsistent reviews taken as a floor. If there had been a longer delay in time between the two reviews, say two years instead of seven months, it is reasonable to assume the inconsistencies would increase. The author would, however, expect any such increase to be relatively minor.

It is also important to note the SME’s impression (admittedly subjective, but based on over thirty years of experience with document review and relevancy determinations), that if he had studied his prior reviews before beginning the second review, and if he had otherwise taken some minimal efforts to refresh his memory, then he would have significantly reduced the number of inconsistencies. Further, the author believes that a shorter delay in time between the reviews (for instance, 10 days instead of 10 months) would also have lessened the inconsistency rate with no additional efforts on his part.

The imposition of quality control procedures designed for consistencies between the two reviews would, in the author’s view, have drastically reduced the inconsistency rate. Again, any such procedures were intentionally omitted here to try to emulate, as far as possible, two completely separate and independent reviews.

Summary of Metrics of the Two Reviews

In the first review, which used the Multimodal method, 146,714 documents were coded as follows:

  • 1,507 random sample generated at the beginning of the project, plus
  • 1,605 from null-set random sample at the end of the project, plus
  • 1,000 machine selected from five rounds of training (sub-total 4,112), plus
  • 142,602 human judgmental selected.

The coding classifications were 661 relevant and 146,053 irrelevant. (This count does not include the approximate 30 additional relevant documents found in the post-hoc analysis of the project.)

This 661 total includes 18 documents considered Highly Relevant or Hot.

Further, it should be noted that the remaining 552,368 documents (699,082-146,714) not classified were considered by the SME to be irrelevant due to low predictive coding ranking and other reasons. They were treated as irrelevant even though not classified by the SME through bulk coding.

Of the 146,714 total documents categorized only approximately 2,500 were actually read and individually reviewed by the SME. This study of inconsistent classifications only considers these documents. (Note that only 1,981 documents are recorded in Inview software as reviewed, but the SME read many documents without categorization and used bulk coding instead, and thus this count is artificially low. The SME believes the upward adjustment to 2,500 is approximately correct.) This means the SME categorized or coded 144,214 documents by using Inview software’s mass categorization features, which allow for categorization without actually reviewing each individual document. This is common for duplicative documents or document types.

In the first review of the 661 documents classified as relevant only 333 were specified for training. The 333 training documents include all documents identified as relevant in the machine selected document sets. The 328 documents classified as relevant by the SME but not specified for training were all derived from the 142,602 human judgmental selected documents. These documents were not specified for use as training documents because the SME thought it might skew or confuse the machine learning to include them. Documents were sometimes excluded because of unusual traits and characteristics of the relevant document, or to avoid excessive weighting of particular document types that might bias the training, for example, where other duplicates or near duplicates had already been used several times for training. (As mentioned, this was one of my first predictive coding projects, and I am not sure this strategy of mass withholding of documents from training to mitigate against bias was correct. If I had a do-over I would probably train on more documents and trust the software more to sort it out.) Some documents specified for training by the SME were not in fact used for training, but were instead only used by the Inview software as part of the initial control set for testing purposes. Documents in a control set for testing purposes are not also used for machine training. Only 1 of the 333 relevant documents here specified for training by the SME in the first review was so removed from training and instead used in the control set.

In the first review of the 146,053 documents classified as irrelevant only 2,586 were specified for training. The 2,586 training documents include all documents identified as irrelevant in the machine selected document sets. The 143,467 documents classified as irrelevant by the SME but not specified for training were all derived from the 142,602 human judgmental selected documents or the random samples. These documents were not specified for use as training documents because the SME thought it might skew or confuse the machine learning to include them. Documents were sometimes excluded because of unusual traits and characteristics of the irrelevant document, or to avoid excessive weighting of particular document types that might bias the training. For example, where other duplicates or near duplicates had already been used several times for training. 1,063 of the 2,586 documents specified for training by the SME were not, in fact, used for training by the Inview software. They were instead used by the Inview software as part of the initial control set for testing purposes. Therefore after removal of the control set of 1,063 irrelevant documents used for testing, only 1,523 irrelevant documents were used for machine training.

In the second review, which used the Monomodal method, 48,959 documents were coded as follows:

  • 10,000 machine selected, not random, with exactly 200 documents in each of the 50 rounds, plus
  • 2,366 random selected by two 1,183 random samples, one at the beginning and another at the end of the project, plus
  • 36,593 human judgmental selected.

The coding classifications were 579 documents relevant and 48,380 irrelevant.

This 579 total includes 13 documents considered Highly Relevant or Hot.

Again, it should be noted that the remaining 650,123 documents (699,082-48,959) were considered by the SME to be irrelevant due to low predictive coding ranking and other reasons. They were treated as irrelevant even though not classified by the SME through bulk coding.

Of the 48,959 total documents categorized only approximately 12,000 were actually read and individually reviewed by the SME. This study of inconsistent classifications only considers these documents.  (Again note this is a best estimate as explained above. Inview records 11,601 as physically reviewed.) This means the SME categorized or coded 36,959 documents by using Inview software’s mass categorization features.
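The tallies reported for both reviews can be cross-checked with a few lines of arithmetic. The following Python sketch is illustrative only: the dictionary layout and the function name `reconcile` are assumptions made for this example, while the figures are those reported above (the individually reviewed counts are the SME's estimates).

```python
# Counts as reported in this section; the data layout is illustrative.
CORPUS_SIZE = 699_082

reviews = {
    "Multimodal": {
        # initial random sample, null-set sample, machine selected, judgmental
        "sources": [1_507, 1_605, 1_000, 142_602],
        "relevant": 661,
        "irrelevant": 146_053,
        "individually_reviewed": 2_500,  # SME's estimate
    },
    "Monomodal": {
        # machine selected, two random samples, judgmental
        "sources": [10_000, 2_366, 36_593],
        "relevant": 579,
        "irrelevant": 48_380,
        "individually_reviewed": 12_000,  # SME's estimate
    },
}


def reconcile(review):
    """Derive the totals implied by one review's reported counts."""
    coded = sum(review["sources"])
    # The relevant/irrelevant split must account for every coded document.
    assert coded == review["relevant"] + review["irrelevant"]
    return {
        "coded": coded,
        "not_coded": CORPUS_SIZE - coded,
        "bulk_coded": coded - review["individually_reviewed"],
    }


for name, review in reviews.items():
    print(name, reconcile(review))
```

Run as written, the derived totals should match the figures stated in the text for coded, unclassified, and bulk coded documents in each review.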

The first Multimodal review identified 18 highly relevant or Hot Documents. The second Monomodal review found only 13 of these 18 Hot documents. No Hot documents were found in the Monomodal review that had not been found in the Multimodal review. Five Hot documents were found in the Multimodal Review that were not also found in the Monomodal review. All were individually reviewed by the SME in both projects.

The Monomodal review thus found only 72% of the Hot documents found by the earlier Multimodal review. Put another way, the Multimodal method did 38% better in finding the total Hot documents than Monomodal.

In the second review of the 579 documents classified as relevant only 577 were specified for training. The 2 documents categorized as relevant and not specified for training were from the 36,593 human judgmental selected documents for the same reasons mentioned in the first review. Further, 1 relevant document specified for training was not in fact used to train the system, but was instead used by the Inview software as part of the control set. Therefore only 576 relevant documents were used for machine training.

In the second review of the 48,380 documents classified as irrelevant only 10,948 were used for training. All 10,000 documents identified as irrelevant in the machine selected document sets were used for training. The 37,432 documents classified as irrelevant by the SME and not used for training were all derived from the 36,593 human judgmental selected documents and the random samples. These documents were not used for training for the reasons previously described, but primarily to avoid confusing cumulative training that might bias the training. In addition, of the 10,948 irrelevant documents specified for training, 1,063 were diverted by the Inview software for use in the control set, and thus used for testing and not machine training. Therefore only 9,885 irrelevant documents were used for machine training.
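The training-set bookkeeping for both reviews follows the same two subtractions: documents withheld from training, and documents diverted by the software to the control set. The sketch below restates that arithmetic in Python. The function name `training_counts` and the table layout are assumptions for illustration, and the figure of 332 relevant documents used for training in the first review is implied by the text rather than stated in it.

```python
def training_counts(classified, specified, diverted_to_control):
    """Return (withheld from training, actually used for machine training)."""
    withheld = classified - specified
    used = specified - diverted_to_control
    return withheld, used


# (classified, specified for training, diverted to the control set)
reported = {
    ("Multimodal", "relevant"): (661, 333, 1),
    ("Multimodal", "irrelevant"): (146_053, 2_586, 1_063),
    ("Monomodal", "relevant"): (579, 577, 1),
    ("Monomodal", "irrelevant"): (48_380, 10_948, 1_063),
}

for (review, label), counts in reported.items():
    withheld, used = training_counts(*counts)
    print(f"{review:10s} {label:10s} withheld={withheld:>7,} used={used:>6,}")
```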

A comparison of the relevant documents found by each method showed the following:

  • The 661 relevant found by Multimodal included 376 documents not found in Monomodal, which means 57% were unique. The 661 relevant included 18 Hot documents, 5 of which were not found by Monomodal, which means 28% were unique.
  • The 579 relevant found by Monomodal included 294 documents not found in Multimodal, which means 51% were unique. The 579 relevant included 13 Hot documents, all of which were also found by Multimodal, which means 0% were unique.
  • There were a total of 955 relevant documents found by one or both of the Multimodal and Monomodal methods.
  • There were 285 relevant documents found by both the Multimodal and Monomodal methods, which is 30% of the total 955 found.

The comparisons between the two reviews of relevant document classifications are shown in the Venn diagram below.

[Figure: Venn diagram comparing the relevant documents found by the two methods]

The 285 relevant documents found in both reviews represent an Overlap or Jaccard index of 29.8% (285/(376+579)). Ellen M. Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt 697, 700 (2000) (“Overlap is defined as the size of the intersection of the relevant document sets divided by the size of the union of the relevant document sets.”); Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011), pgs 10-11.
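The Overlap definition quoted from Voorhees maps directly onto set operations. The Python sketch below is a minimal illustration, not the tooling used in the study: the document IDs are synthetic and arranged only so that the set sizes match the reported counts, and the function name `jaccard` is an assumption.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Synthetic IDs, arranged only so the set sizes match the reported counts.
multimodal_relevant = set(range(0, 661))    # 661 documents
monomodal_relevant = set(range(376, 955))   # 579 documents

assert len(multimodal_relevant - monomodal_relevant) == 376  # Multimodal only
assert len(monomodal_relevant - multimodal_relevant) == 294  # Monomodal only
assert len(multimodal_relevant & monomodal_relevant) == 285  # found by both

print(f"{jaccard(multimodal_relevant, monomodal_relevant):.1%}")  # 29.8%
```

With real data the two sets would hold the actual document identifiers coded relevant in each review; the calculation itself would not change.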

32 Different Documents Reviewed and Coded Relevant by Multimodal and Irrelevant by Monomodal

A study was made for this report of the content of the 376 documents that were only marked as relevant in the Multimodal review performed in March 2012, and not marked as relevant in the later January 2013 Monomodal (Borg) review. The Inview software shows that the SME had in fact individually reviewed 118 of these 376 documents in the second Monomodal review and determined them to be irrelevant.

A study of the content of these 118 documents shows that 86 of the 118 documents were duplicates, or near duplicates, leaving a total of 32 unique documents with inconsistent SME review classifications. When the SME found or was presented these same 32 documents in the earlier March 2012 Multimodal review he had marked them as relevant.

This is evidence of inconsistent reviews by the SME showing concept drift. All of these documents were grey-area types where the SME changed his view of relevance to be more constrictive. The SME had narrowed his concept of relevance.

A study of these 32 documents shows that there were no obvious errors made in the coding. It is therefore reasonable to attribute all of the inconsistent classifications to concept shift on these documents, not pure human error, such as where the SME intended to mark a document relevant, but accidentally clicked on the irrelevant coding button instead. (This kind of error did happen in the course of the review but quality control efforts easily detected these errors.)

31 Different Documents Reviewed and Coded Relevant by Monomodal and Irrelevant by Multimodal

A study was also made for this report of the content of the 294 documents that were only marked as relevant in the Monomodal (Borg) review performed in January 2013, and not marked as relevant in the earlier March 2012 Multimodal review. The Inview software shows that the SME had in fact individually reviewed 38 of these 294 documents in the first Multimodal review and determined them to be irrelevant.

A study of the content of these 38 documents shows that 7 of the 38 documents were duplicates, or near duplicates, leaving a total of 31 unique documents with inconsistent SME review classifications. When the SME found or was presented with these same 31 documents in the later January 2013 Monomodal review he had marked them as relevant.

This is again evidence of inconsistent reviews by the SME showing concept drift. All of these documents were grey area types where the SME changed his view of relevance, but this time to be more inclusive. The SME had expanded his concept of relevance.

A study of these 31 documents shows that there were no obvious errors made in the coding.  It is therefore reasonable to attribute all of the inconsistent classifications to concept shift on these documents, not pure human error.

211 Different Documents Reviewed and Coded Relevant by Both Multimodal and Monomodal 

A study was also made for this report of the content of the 285 relevant documents found by both the Multimodal and Monomodal methods. In both projects all documents coded as relevant by the SME had been individually reviewed by him before final classification. A study of the content of these 285 documents shows that 74 of them were duplicates, or near duplicates, leaving a total of 211 unique documents with consistent SME review classifications. When the SME found or was presented with these same 211 documents in both projects he had marked them as relevant.

274 Different Documents Reviewed and Coded Relevant by One or Both Methods

To summarize the prior unique total relevant document counts, after removal of all duplicates or near duplicates there were a total of 274 different documents coded relevant by one or both methods. This compares to the earlier 955 total relevant document count before deduplication.

11 Different Documents Reviewed and Coded as Hot By Both Multimodal and Monomodal

A study was also made of the content of the 18 documents coded as Hot. In both projects all documents coded as Hot by the SME had been individually reviewed by him before final classification. A study of the content of these 18 documents shows that 7 of them were duplicates, or near duplicates, leaving a total of 11 unique documents. There was only 1 duplicate in the 5 Hot documents that the Multimodal review located and the Monomodal review did not. There were 6 more duplicates found in the 13 other Hot documents discovered in both reviews. Therefore, after removing a total of 7 duplicate documents there were a total of 11 unique Hot documents. (These 11 unique Hot documents are also included within the total 274 unique Relevant documents count.) Monomodal found 7 and missed 4. Multimodal found all 11. Monomodal review thus missed 36% of the Hot documents. Put another way, the Multimodal methods did 57% better in finding the unique hot documents than Monomodal.

This differential between the different unique Hot documents discovered in both reviews is shown in this Venn diagram. The Jaccard Index for Hot document classification was 64% (7/(7+4)).
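The Hot document percentages, before and after deduplication, come from two simple ratios. The sketch below shows the arithmetic in Python; the helper names `share_found` and `relative_gain` are assumptions for illustration. Because every Hot document found by Monomodal was also found by Multimodal, the share found by Monomodal equals the Jaccard index for the Hot classifications.

```python
def share_found(found, total):
    """Fraction of the known Hot documents that a review located."""
    return found / total


def relative_gain(better, worse):
    """How much larger `better` is than `worse`, as a fraction of `worse`."""
    return better / worse - 1


# (label, Hot found by Multimodal, Hot found by Monomodal)
hot_counts = [
    ("all Hot documents", 18, 13),
    ("unique Hot documents", 11, 7),
]

for label, multimodal, monomodal in hot_counts:
    found = share_found(monomodal, multimodal)
    gain = relative_gain(multimodal, monomodal)
    print(f"{label}: Monomodal found {found:.0%}, Multimodal did {gain:.0%} better")
```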


Documents Categorized as Irrelevant by Both Multimodal and Monomodal 

The Multimodal method review categorized 146,053 documents as irrelevant. Of that total, 1,517 were categorized after review of each document, and 144,536 were bulk coded without the SME reviewing each individual document.

The Monomodal method review categorized 48,380 documents as irrelevant. Of that total, 11,083 were categorized after review of each document, and 37,297 were bulk coded without the SME reviewing each individual document.

The Agreement in coding the same documents irrelevant in both reviews was 31,109.

Of the 31,109 total documents categorized as irrelevant in both projects only approximately 3,000 were actually read and individually reviewed by the SME in both projects. This study of inconsistent classifications only considers these documents. (Note that only 2,500 documents are recorded in Inview software as reviewed, but the SME read many documents without categorization and used bulk coding instead, and thus this count is artificially low. The SME believes a 500 document upward adjustment to 3,000 is approximately correct.) This means the SME categorized or coded 28,109 of the 31,109 overlapping irrelevant documents by using Inview software’s mass categorization features.

Concept Drift Analysis

First, it is interesting to see that the change in concept drift from the first project to the second was approximately equal in both directions. Although the total counts were different due to duplicate documents, the SME changed his opinion in the second review from irrelevant to relevant on 31 different documents, and from relevant to irrelevant on 32 different documents.

The overall metrics of inconsistent coding of 274 unique relevant documents are as follows:

  • 211 different documents were coded relevant consistently in both reviews;
  • An additional 63 different documents were coded inconsistently, of which,
    • 49% (31) were first coded irrelevant in Multimodal and then coded relevant in Monomodal (false positives).
    • 51% (32) were first coded relevant in Multimodal and then coded irrelevant in Monomodal (false negatives).

An inconsistency of coding of 63 out of 274 relevant documents represents an inconsistency rate of 23%. Put another way, the coding on documents determined to be relevant was consistent 77% of the time. Again, this latter calculation is known as the Jaccard measure. See Voorhees’ Variations, supra, and Grossman & Cormack, Technology Assisted Review, supra. Also see William Webber, How accurate can manual review be? Again, the Jaccard index is formally defined as the size of the intersection, here 211, divided by the size of the union of the sample sets, here 274 (211+32+31). Therefore the Jaccard index for the individual review of relevant documents in the two projects is 77% (211/274). This is shown by the Venn diagram below.
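The deduplicated consistency figures can be reproduced from the three counts reported above. The Python sketch below is illustrative; the function name `consistency_metrics` and the key names are assumptions, and the inputs are the 211 consistent, 32 relevant-then-irrelevant, and 31 irrelevant-then-relevant unique documents.

```python
def consistency_metrics(consistent, dropped, added):
    """Summarize agreement on documents coded relevant in at least one review.

    consistent: coded relevant in both reviews
    dropped:    relevant in the first review, irrelevant in the second
    added:      irrelevant in the first review, relevant in the second
    """
    inconsistent = dropped + added
    union = consistent + inconsistent
    return {
        "union": union,
        "inconsistent": inconsistent,
        "inconsistency_rate": inconsistent / union,
        "jaccard": consistent / union,
        "share_dropped": dropped / inconsistent,
        "share_added": added / inconsistent,
    }


metrics = consistency_metrics(consistent=211, dropped=32, added=31)
for name, value in metrics.items():
    # Counts print as integers, ratios as percentages.
    print(name, value if isinstance(value, int) else f"{value:.0%}")
```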


Several prior studies have been made of reviews for relevant documents that employed the Jaccard measure. The best known is the study of Ellen Voorhees that analyzed agreement among professional analysts (SMEs) in the course of a TREC study. It was found that two SMEs (retired intelligence officers) agreed on responsiveness on only 45% of the documents. When three SMEs were considered they agreed on only about 30% of the documents. Voorhees, Variations, supra. Also see Grossman & Cormack, Technology Assisted Review, supra. It appears from the Voorhees report that the SMEs in this study were examining different documents that did not include duplicates. For that reason the Jaccard measure of different documents in the instant study of 77% would be the appropriate comparison, not the measure of 30% when duplicate documents were included.

A more recent study of a legal project using contract lawyers had Jaccard measures of 16% between the first review and follow-up reviews based on samples of the first. Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review, Journal of the American Society for Information Science and Technology, 61(1):70–80. The Jaccard index numbers were extrapolated by Grossman and Cormack in Technology Assisted Review, supra at pgs. 13-14. Also see Grossman Cormack Glossary, Ver. 1.3 (2012) that defines the Jaccard index and goes on to state that expert reviewers commonly achieve Jaccard Index scores of about 50%, and scores exceeding 60% are very rare.

Analysis of Agreement in Coding Irrelevant Documents

The author is aware that comparisons of coding of irrelevant documents are not typically considered important in information retrieval studies for a variety of reasons, including the different prevalence rates in review projects. For that reason studies typically include the Jaccard measure for comparison of relevant classifications only. Still, in view of the legal debate concerning the disclosure of irrelevant documents, this paper includes a brief examination of the total Agreement rates, including irrelevancy determinations. Further, Agreement rates are interesting and appropriate here since both studies consider a review of the exact same Enron dataset of 699,082 documents, and thus the same prevalence, and they are not relying on random samples, but on two full reviews.

The high Agreement rates on irrelevant classifications in the two reviews are of special significance in the author’s opinion because of the current debate in the legal community concerning procedures for predictive coding review. Several courts have already adopted the position that all relevant and all irrelevant documents used in training should be disclosed to a requesting party, even though the legal rules of procedure only require disclosure of relevant documents. Da Silva Moore et al. v. Publicis Groupe SA, 2012 WL 607412 (S.D.N.Y. Feb. 24, 2012) (Peck, M.J.), aff’d, 2012 WL 1446534 (S.D.N.Y. April 26, 2012) (Carter, J.); Global Aerospace Inc., et al. v. Landow Aviation, L.P., et al., 2012 WL 1431215 (Va. Cir. Ct. April 23, 2012); In re Actos (Pioglitazone) Products, MDL No. 6-11-md-2299 (W.D. La. July 27, 2012). Many attorneys and litigants take the contrary position that irrelevant documents should never be disclosed, even in the context of active machine learning. See Solomon, R., Are Corporations Ready To Be Transparent And Share Irrelevant Documents With Opposing Counsel To Obtain Substantial Cost Savings Through The Use of Predictive Coding, Metropolitan Corporate Counsel 20:11 (Nov. 2012).

Although the author has been flexible on this issue in some cases, before these results were studied the author had been advocating a do-not-disclose irrelevant documents position. Losey, R., Keywords and Search Methods Should Be Disclosed, But Not Irrelevant Documents (May 26, 2013). The author now contends that the Agreement and Jaccard index data shown in this study support a compromise position where limited disclosure may sometimes be appropriate, but only of borderline documents where irrelevancy is uncertain or likely subject to debate.

In the author’s opinion the inclusion of analysis of irrelevant coding by the SME in these two reviews allows for a more complete analysis and understanding of the types of documents and document classifications that cause inconsistent reviews. Again, to do this fairly the universe of classifications has been limited to those where the SME actually reviewed the documents, and also duplicate document counts have been eliminated. This seems to be the best measure to provide a clear indication of the types of documents that are inconsistently coded.

The inclusion of all review determinations in a consistency analysis, not just review decisions where a document is classified as relevant, provides critical information to understand the reasonability of disclosure positions in litigation. This is discussed in the conclusions below. This also seems appropriate when analyzing active machine learning where the training on irrelevance is just as important as the training on relevance.

In both projects the SME coded 31,109 identical unique documents as irrelevant. Of the 31,109 total overlapping documents coded, the SME actually read and reviewed approximately 3,000 of these documents and bulk coded the rest (28,109).

Thus in both projects the SME read and individually reviewed 3,274 unique documents: 3,000 documents were marked irrelevant and 274 marked relevant. This is shown in the Venn diagram below. Of the 3,274 identical documents reviewed there were only 63 inconsistencies. This represents an overall inconsistency error rate of 1.9%. Thus the Agreement rate for review of both relevant and irrelevant documents is 98.1% (3,274/3,337).
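
The arithmetic behind these two rates can be checked in a few lines. This is only a sketch: the function name is illustrative, and it takes the fraction quoted above (3,274 out of 3,337) at face value as consistent classifications over total classifications, without re-deriving how the underlying counts partition.

```python
def agreement_rate(consistent: int, inconsistent: int) -> float:
    """Share of classifications that matched across the two reviews."""
    total = consistent + inconsistent
    if total == 0:
        raise ValueError("no classifications to compare")
    return consistent / total

# Counts taken from the fraction quoted in the text (3,274 / 3,337).
rate = agreement_rate(consistent=3274, inconsistent=63)
print(f"Agreement: {rate:.1%}")          # Agreement: 98.1%
print(f"Inconsistency: {1 - rate:.1%}")  # Inconsistency: 1.9%
```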

[Figure: Inconsistency_compare]

Conclusions Regarding Inconsistent Reviews

These results suggest that when only one human reviewer is involved, and that reviewer is a highly motivated SME, the overall consistency rates in review are much higher than when multiple non-SME reviewers of questionable motivation (contract reviewers) are involved (77% v. 16%), or multiple SMEs of unknown motivation and knowledge (retired intelligence officers in the Voorhees study) (77% v. 45% with two SMEs, and 30% with three SMEs). These comparisons are shown visually in this graph.


These results also suggest that with one SME reviewer the classification of irrelevant documents is nearly uniform (98%+ Agreement), and that the inconsistencies primarily lie in relevant categorizations (77% Jaccard) of borderline relevant documents. (A caveat should be made that this observation is based on unfiltered data, and not a keyword collection or data otherwise distorted with artificially high prevalence rates.)
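
The gap between the two measures follows from how they are defined. The sketch below uses hypothetical toy data (not the Enron counts) and illustrative function names. It shows that the Jaccard index looks only at documents coded relevant in at least one review, while Agreement also counts every document coded irrelevant both times, so Agreement stays high whenever irrelevant documents dominate the collection.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 1.0  # two empty sets are treated as identical (a convention)
    return len(a & b) / len(union)

def agreement(a: set, b: set, universe: set) -> float:
    """Share of all documents coded the same way (relevant or irrelevant)."""
    disagreements = a ^ b  # coded relevant in exactly one of the two reviews
    return 1 - len(disagreements) / len(universe)

# Hypothetical toy data: 1,000 documents, few of them relevant.
universe = set(range(1000))
review_one = set(range(0, 40))    # documents 0-39 coded relevant
review_two = set(range(10, 50))   # documents 10-49 coded relevant

print(f"Jaccard:   {jaccard(review_one, review_two):.0%}")              # Jaccard:   60%
print(f"Agreement: {agreement(review_one, review_two, universe):.0%}")  # Agreement: 98%
```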

The 77% Jaccard measure is consistent with the test reported by Grossman and Cormack of an SME (Topic Authority in TREC language) reviewing her own prior adjudications of ten documents and disagreeing with herself on three of the ten classifications, and classifying another two as borderline. Grossman & Cormack, Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012) at pgs. 17-20.

The overall Agreement rate of 98% of all relevancy determinations, including irrelevant classifications where almost all classifications are easy and obvious, strongly suggests that the very low Jaccard index rates measured in previous studies of 16% to 45% were more likely caused by human error, not document relevance ambiguity or genuine disagreement on the scope of relevance. A secondary explanation for the low scores is lack of significant subject matter expertise such that the reviewers were not capable of recognizing a clearly relevant document. Even if you only consider the determinations of relevancy, and exclude determinations of irrelevancy, the 77% Jaccard index is still significantly greater than the prior 16% to 45% consistency rates.

The findings in this study thus generally support the conclusions of Cormack and Grossman that most inconsistencies in document classifications are due to human error, not the presence of borderline documents or the inherent ambiguity of all relevancy determinations. Id. Of the 3,274 different documents the SME read in both projects in the instant study only 63 were seen to be borderline grey area types, which is less than 2%. There are certainly more grey area relevant documents than that in the 3,274 documents reviewed (excluding the duplication and near duplication issue), but they did not come to the author’s attention in this post-hoc analysis because the SME was consistent in review of these other borderline documents. Still, the findings in this study support the conclusions of Grossman and Cormack that only approximately 5% of documents in a typical unfiltered predictive coding review project are of a borderline grey area type.

The findings and conclusions support the use of SMEs with in-depth knowledge of the legal subject, and the use of as few SMEs to do the review as possible. The study also strongly suggests that the greatest consistency in document review arises from the use of one SME only.

These findings and conclusions also reinforce the need for strong quality control measures in large reviews where multiple reviewers must be used, especially when the reviewers are relatively low-paid, non-SMEs.

The inconsistencies (opposite of Jaccard index) shown in this study of determinations of relevance, excluding the classifications of irrelevant, were relatively small: 23%, as compared to 55%, 70% and 84% in prior studies. Moreover, as mentioned, they were all derived from grey area or borderline type documents, where relevancy was a matter of interpretation. In the author’s experience documents such as these tend to have low probative value. If they were significant to litigation discovery, then they usually would not be of a grey area, subjective type. They would instead be obviously relevant. The qualifier usually is needed because the author has seen rare exceptions, typically in situations where one borderline document leads to other documents with strong probative value. Still, this is unusual. In most situations the omission of borderline ambiguous documents, and others like them, would have little or no impact on the case.

These observations, especially the high consistency of irrelevance classifications (98%+), support the strict limitation of disclosure of irrelevant documents as part of a cooperative litigation discovery process. Instead, only documents that a reviewer knows are of a grey area type or likely to be subject to debate should be disclosed. (The SME in this study was personally aware of the ambiguous type grey area documents when originally classifying these documents. They were obvious because it was difficult to decide if they were within the border of relevance, or not. The ambiguity would trigger an internal debate where a close question decision would ultimately be made.)

Even when limiting disclosure of irrelevant documents to those that are known to be borderline, disclosure of the actual documents themselves may frequently not be necessary. A summary of the documents with explanation of the rationale as to the ultimate determination of irrelevance should often suffice. The disclosure of a description of the borderline documents will at least begin a relevancy dialogue with the requesting party. Only if the abstract debate fails to reach agreement would disclosure of the actual documents be required. Even then it could be done in camera to a neutral third-party, such as a judge or special master. Alternatively, disclosure could be made with additional confidentiality restrictions pending a ruling by the court.

I am interested in what conclusions others may draw from these metrics regarding concept drift from one review project to the next, inconsistencies of single human reviewers, and other issues here discussed. The author welcomes public and private comments. Private comments may be made by email to Ralph.Losey@gmail.com and public remarks in the comment section below. Marketing type comments will be deleted.

Secrets of Search – Part II

December 18, 2011

This is Part Two of the blog that I started last week on the Secrets of Search, which was in turn a sequel to two blogs before that: Spilling the Beans on a Dirty Little Secret of Most Trial Lawyers and Tell Me Why?  In Secrets of Search – Part One we left off with a review of some of the analysis on fuzziness of recall measurements included in the August 2011 research report of information scientist, William Webber: Re-examining the Effectiveness of Manual Review. We begin part two with the meat of his report and another esoteric search secret. This will finally set the stage for the deepest secret of all and the seventh insight into trial lawyer resistance to e-discovery.

Summarizing Part One of this Blog Post and the First Two Secrets of Search

I can quickly summarize the first two secrets with popular slang: keyword search sucks, and so does manual review (although not quite as bad), and because most manual review sucks, most so-called objective measurements of precision and recall are unreliable. Sorry to go all negative on you, but only by outing these not-so-little search secrets can we establish a solid foundation for our efforts with the discovery of electronic evidence. The truth must be told, even if it sucks.

I also explained that keyword search would not be so bad if it were not done blindly like a game of Go Fish, where it achieves really pathetic recall percentages in the 4% to 20% range (the TREC batch tasks). It still has a place with smarter software and improved, cooperation based Where’s Waldo type methods and quality controls. In that same vein I explained that manual review can probably also be made good enough for accurate scientific measurements. But, in order to do so, the manual reviews would have to replicate the state-of-the-art methods we have developed in private practice, and that is expensive. I concluded that we should come up with the money for better scientific research so we could afford to do that. We could then develop and test a new gold standard for objective search measurements. Scientific research could then test, accurately measure, and guide the latest hybrid processes the profession is developing for computer assisted review.

Another conclusion you could also fairly draw is that since the law already accepts linear manual review and keyword search as reasonable methods to respond to discovery requests, the law has set a very low standard and so we do not need better science. All you need to do to establish that an alternative method is legally reasonable is to show that it does as well as the previously accepted keyword and manual methods. That kind of comparison sets a low hurdle, one that even our existing fuzzy research proves we have already met. This means we already have a green light under the law, or logically we should have, to proceed with computer assisted review. Judge Peck’s article on predictive coding stated an obvious logical conclusion based upon the evidence.

You could, and I think should, also conclude that any expectation that computer assisted reviews have to be near perfect to be acceptable is misplaced. The claim that some vendors make as to near perfection by their search methods is counter to existing scientific research. It is wrong, mere marketing puff, because the manual based measurements of recall and precision are too fuzzy to measure that closely. If any computer assisted or other type of review comes up with 44%, it might in fact be perfect by an actual objective standard, and vice versa. Allegedly objective measurement of high recall rates in search is, for the time being at least, an illusion. It is a dangerous delusion too because this misinformation could be used against producing parties to try to drive up the costs of production for ulterior motives. Let’s start getting real about objective recall claims.

In any event, most computer assisted search is already better than average keyword or manual search, so it should be accepted as reasonable under the law without confidence inflation. We don’t need perfection in the law, we don’t need to keep reviewing and re-reviewing to try to reach some magic, way-too-high measure of recall. Although we should always try to get more and more of the truth, we should always try to improve, we should also remember that there is only so much truth that any of us can afford when faced with big data sets and limited financial resources.

As I have said time and again when discussing e-discovery efforts in general, including preservation related efforts, the law demands reasonable efforts not perfection. Now science buttresses this position in document productions by showing that we have never had perfection in search of large numbers of documents, not with manual, and certainly not with keyword, and, here is the kicker, it is not possible to objectively measure it anyway!

At least not yet. Not until we start taking our ignorance of the processes of search and discovery as a disease. Then maybe we will start allocating our charitable and scientific efforts accordingly, so we can have better measurements. Then with reliable and more accurate measurements, with solid gold objective standards, we can create more clearly defined best practices, ones that are not surrounded with marketing fluff. More on this later, but first let’s move onto another secret that comes out of Webber’s research. I’m afraid it will complicate matters even further, but life is often like that. We live in a very complex and imperfect world.

The Third Search Secret (Known Only to a Very Few): e-Discovery Watson May Still Not Be Able to Beat Our Champions

Webber’s report reveals that there is more to the man versus machine question than we first thought. His drill down analysis of the 2009 TREC interactive tasks shows that the computer assisted reviews were not the hands down victors over human reviewers as we first thought, at least not victors over many of the well-trained, exceptional reviewer men and women. Putting aside the whole fuzziness issue, Webber’s research suggests that the TREC and EDI tests so far have been the equivalent of putting Watson up against the average Jeopardy contestants, you know, the poor losers you see each week who, like me, usually fail to guess anything right.

The real test of IBM’s Watson, the real proof, didn’t come until Watson went up against the champions, the true professionals at the game. We have not seen that yet in TREC or the EDI studies. But the current organizers know this, and they are trying to level the playing field with multi-pass reviews and, as Webber notes, trying to answer the question we lawyers really want to know, the one that has not been answered yet, namely which Watson, which method can an attorney most reliably employ to create a production consistent with their conception of relevance.

Webber in his research and report digs deep into the TREC 2009 results and looks at the precision and recall rates of individual first pass reviewers. Re-examining the Effectiveness of Manual Review. He found that while Grossman and Cormack were accurate to say that overall two of the top machines did better than man, the details showed that:

Only for Topic 203 does the best automated system clearly outperform the best manual reviewer. As before, the professional manual review team for Topic 207 stands out. Several reviewers outperform the best automated system, and even the weaker individual reviewers have both precision and recall above 0.5.

This means the best team of professional reviewers who participated in Topic 207 actually beat the best machines! They did this in spite of the mentioned inequities in training, supervision, and appeal. Did you know that secret? I’m told that Topic 207 was an easy one having to do with junk filters, but still, easy or not, the human team won.

There is still more to this secret. When you drill down even further you find that certain individual reviewers on each team topic actually beat the best machines on each topic in some way, even if their entire human team did not. That’s right, the top machines were defeated by a few champion humans in most every event. Humans won even though they were disadvantaged by not having an even playing field. I guarantee that this is a secret you have never heard before (unless you went to China) because Webber just discovered it from his painstaking analysis of the 2009 TREC results. Chin up contract reviewers, the reports of your death have been greatly exaggerated. Watson has not beaten you yet; in fact, Watson still needs you to set up the gold standard to determine who wins.

Webber’s research shows that a competition between the best Watsons and best reviewers is still a very close race where humans often win. Please note this analysis assumes no time limits or cost limits for the human review, which are, of course, false assumptions in legal practice. This is why pure manual review is still, or should be, as dead as a doornail. The future is a team approach where humans use machines in a nonlinear fashion, not vice versa. More on this later.

Webber’s findings are the result of something that is not a secret to anyone who has ever been involved in a large search project: all reviewers are not created equal. Some are far better than others. There are many good psychological, intelligence, and project management and methodology reasons for this, especially the management and methodology issues. See, e.g., the must-read guest blog by contract review attorney Larry Chapin, Contract Coders: e-Discovery’s “Wasting Asset”?

The facts supporting Webber’s findings on individual reviewer excellence are shown in Figure 2 of his paper on the variability in review team reliability. Re-examining the Effectiveness of Manual Review. The small red crosses in each figure (except flawed task 205) show the computer’s best efforts. Note how many individual reviewers (a bin is 500 documents that were reviewed by one specific reviewer) were able to beat the computer’s best efforts in either precision, or recall, or both. They are shown as either to the right or above the red cross. If above this means they were more precise. If to the right, they had better recall.
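
The comparison the figure makes can be expressed as a small calculation. The following is a sketch with hypothetical counts, not Webber’s data, and the class and function names are illustrative. Each reviewer bin and the best automated system gets a precision and a recall score against the gold standard, and a reviewer beats the system on a measure when the reviewer’s score on that measure is higher.

```python
from dataclasses import dataclass

@dataclass
class Score:
    name: str
    true_pos: int   # coded relevant, and relevant per the gold standard
    false_pos: int  # coded relevant, but irrelevant per the gold standard
    false_neg: int  # coded irrelevant, but relevant per the gold standard

    @property
    def precision(self) -> float:
        found = self.true_pos + self.false_pos
        return self.true_pos / found if found else 0.0

    @property
    def recall(self) -> float:
        relevant = self.true_pos + self.false_neg
        return self.true_pos / relevant if relevant else 0.0

def beats(reviewer: Score, system: Score) -> list[str]:
    """Return the measures on which the reviewer outscores the system."""
    wins = []
    if reviewer.precision > system.precision:
        wins.append("precision")
    if reviewer.recall > system.recall:
        wins.append("recall")
    return wins

# Hypothetical counts, not taken from Webber's data.
system = Score("best system", true_pos=70, false_pos=20, false_neg=30)
reviewers = [
    Score("reviewer A", true_pos=45, false_pos=3, false_neg=15),
    Score("reviewer B", true_pos=10, false_pos=40, false_neg=50),
]
for r in reviewers:
    print(r.name, beats(r, system) or "neither")
```

In this toy data reviewer A beats the system on both measures and reviewer B on neither, which mirrors the spread between the best and weakest bins described above.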

William Webber summarizes these findings in his blog recently by saying:

The best reviewers have a reliability at or above that of the technology-assisted system, with recall at 0.7 and precision at 0.9, while other reviewers have recall and precision scores as low as 0.1. This suggests that using more reliable reviewers, or (more to the point) a better review process, would lead to substantially more consistent and better quality review. In particular, the assessment process at TREC provided only for assessors to receive written instructions from the topic authority, not for the TA to actively manage the assessment process, by (for instance) performing an early check on assessments and correcting misconceptions of relevance or excluding unreliable assessors. Now, such supervision of review teams by overseeing attorneys may (regrettably) not always occur in real productions, but it should surely represent best practice.

Webber, W., How Accurate Can Manual Review Be? IREvalEtAl (12/15/11). Better review process and project management are key, which is the next part of the secret.

How to Be Better Than Borg

Webber’s research shows that some of the human reviewers in TREC stood out as better than Borg. They beat the machines. Does this really surprise anyone in the review industry? Sure, human review may be (should be) dead as a way to review all documents in large-scale reviews, but it is alive and well as the most reliable method for final check of computer suggested coding, a final check for classifications like privilege before production.

This is a picture of humans and machines working together as a team, as friends, but not as Borg implants where machines dictate, nor as human slaves where smart machines are not allowed. I know that George Socha, whom I quoted in Tell Me Why?, much like one of my fictional heroes, Jean-Luc Picard, was glad to escape the Borg enslavement. So too would most contract lawyers who are stuck in dead-end review jobs with cruel employers. By the way, this embarrassing, unprofessional, contract-lawyers-as-slaves mentality was shown dramatically by some of the reader comments to Contract Coders: e-Discovery’s “Wasting Asset”? They report incredible incidents of abuse by some law firms. Some of the private complaints I have heard from document reviewers about abuse and mismanagement are even worse than these public comments. The primary rule of any relationship must always be mutual respect. That applies to contract lawyers, and, if they are a part of your team, even to artificial intelligence agents like Watson, Siri, and their predictive coding cousins. Get to know and understand your entire team and appreciate their respective strengths and weaknesses.

Webber’s study shows that the quality of the individual human reviewers on a team is paramount. He makes several specific recommendations in section 3.4 of his report for improving review team quality, including:

Dual assessment, for instance, can help catch random errors of inattention, while second review by an authoritative reviewer such as the supervising attorney can correct misconceptions of relevance during the review process, and adjust for assessor errors once it is complete [Webber et al., 2010]. …

[S]ignificant divergence from the median appears to be a partial, though not infallible, indicator of reviewer unreliability. A simple approach to improving review team quality is to exclude those reviewers whose proportion relevant are significantly different from the median, and re-apportion their work to the more reliable reviewers. …

Fully excluding reviewers based solely on the proportion of documents they find relevant is a crude technique. Nevertheless, the results of this section suggest that this proportion is a useful, if only partial, indicator of reliability, one which could be combined with additional evidence to alert review managers when their review process is diverging from a controlled state. It may be that review teams with better processes, such as the team from Topic 207, already use such techniques. Therefore, they need to be considered when a benchmark for manual review quality is being established, against which automatic techniques can be compared.
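
The proportion-relevant check quoted above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the reviewer names and proportions are hypothetical, and the fixed max_gap cutoff is an arbitrary stand-in for whatever significance test a review manager would actually apply. It is not Webber’s method.

```python
from statistics import median

def flag_outliers(proportion_relevant: dict[str, float],
                  max_gap: float = 0.15) -> list[str]:
    """Flag reviewers whose proportion relevant is far from the team median.

    max_gap is an arbitrary illustrative cutoff, not a value from Webber's report.
    """
    mid = median(proportion_relevant.values())
    return sorted(
        name for name, p in proportion_relevant.items()
        if abs(p - mid) > max_gap
    )

# Hypothetical share of documents each reviewer coded relevant.
team = {"R1": 0.11, "R2": 0.09, "R3": 0.12, "R4": 0.48, "R5": 0.10}
print(flag_outliers(team))  # ['R4']
```

As the quoted passage cautions, a flag like this is only a partial indicator. It is a prompt for a second look at the reviewer’s work, not proof of unreliability.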

Webber’s conclusion summarizes his findings and bears close scrutiny, so I quote it here in full:

5. CONCLUSIONS. The original review from which Roitblat et al. draw their data cost $14 million, and took four months of 100-hour weeks to complete. The cost, effort, and delay underline the need for automated review techniques, provided they can be shown to be reliable. Given the strong disagreement between manual reviews, even some loss in review accuracy might be acceptable for the efficiency gained. If, though, automated methods can conclusively be demonstrated to be not just cheaper, but more reliable, than manual review, then the choice requires no hesitation. Moreover, such an achievement for automated text-processing technology would mark an epoch not just in the legal domain, but in the wider world.

Two recent studies have examined this question, and advanced evidence that automated retrieval is at least as consistent as manual review [Roitblat et al., 2010], and in fact seems to be more reliable [Grossman and Cormack, 2011]. These results are suggestive, but (we argue) not conclusive as they stand. For the latter study in particular (leaving questions of potential bias in the appeals process aside), it is questionable whether the assessment processes employed in the track truly are representative of a good quality manual review process.

We have provided evidence of the greatly varying quality of reviewers within each review team, indicating a lack of process control (unsurprising since for four of the seven topics the reviewers were not a genuine team). The best manual reviewers were found to be as good as the best automated systems, even with the asymmetry in the evaluation setup. The one, professional team that does manage greater internal consistency in their assessors is also the one team that, as group, outperforms the best automated method. We have also pointed out a simple, statistically based method for improving process control, by observing the proportion of documents found relevant by each assessor, and counseling or excluding those who appear to be outliers.

Above all, it seems that previous studies (and this one, too) have not directly addressed the crucial question, which is not how much different review methods agreed or disagree with each other (as in the study by Roitblat et al. [2010]), nor even how close automated or manual review methods turn out to have come to the topic authority’s gold standard (as in the study by Grossman and Cormack [2011]). Rather, it is this: which method can a supervising attorney, actively involved in the process of production, most reliably employ to achieve their overriding goal, to create a production consistent with their conception of relevance. There is good, though (we argue) so far inconclusive, evidence that an automated method of production can be as reliable a means to this end as a (much more expensive) full manual review. Quantifying the tradeoff between manual effort and automation, and validating protocols for verifying the correctness of either approach in practice, are particularly relevant in the multi-stage, hybrid work-flows of contemporary legal review and production. Given the importance of the question, we believe that it merits the effort of a more conclusive empirical answer.

The evidence shows that it is at least very difficult, perhaps even impossible (I await more science to form a definite opinion), for us humans to maintain the concentration necessary to review tens of thousands of documents, day in and day out, for weeks. Sure we can do it for a few hours, and for 500 or so documents, but for 8-10 hours a day with tens or hundreds of thousands of documents for weeks on end? I doubt it. We need help. We need suggestive coding. We need a team that includes smart computers.

Know Your Team’s Strengths and Weaknesses

The challenge to human reviewers becomes ridiculously hard when you ask them to not only make relevancy calls, but, at the same time, to also make privilege calls, and confidentiality calls, and, here is the worst, multiple case issues categorization calls, a/k/a, issue tagging. Experience shows that the human mind cannot really handle more than five or six case issues at a time, at least when reviewing all day. But I keep hearing tales of lawyers asking reviewers to make ten to twenty case issue calls for weeks on end. If you think it is hard to get consistent relevancy calls, just think of the problem of putting relevant docs into ten to twenty buckets. Might as well throw darts. That is a scientific experiment I’d like to see, one testing the efficacy of case issue tags. How many categorizations can humans really handle before it becomes a complete waste of time?

I call on e-discovery lawyers everywhere to better understand their team members and stop asking them to do the impossible. Issue tagging must be kept simple and straightforward for the human members of your team to deal with it. Using ten to twenty case-issue tags is a complete waste of time, perhaps with the exception of seed-set training, as thereafter Watson has no such limitations. But in so far as the final, out-the-door review goes, do not encumber your humans with mission impossible tasks. Know your team members, their strengths and weaknesses. Know what the humans do best, like catch obvious bloopers beyond the ken of present day AI agents, and do not expect them to be as tireless as machines.

The review process improvements mentioned by Webber, and other safeguards touted by most professional review companies who truly understand and care about the strengths and weaknesses of their team, will certainly mitigate the problems inherent in all human review. In my mind the most important of these are experience, training, mutual respect, good working conditions, motivation, and quality controls, including quick terminations or reassignments when called for. More innovative methods are, I believe, just around the corner, such as game theory applications discussed by Lawrence Chapin in Contract Coders: e-Discovery’s “Wasting Asset”? But the bottom line will always be that computers are much better at complex repetitive drudgery tasks such as reviewing tens of thousands, or millions, of documents. Thankfully our minds are not designed for this, whereas computers are.

Reviewers Need Subject Matter Expertise and Money Motivation

Based on my experience as a reviewer and supervisor, the human challenges to make review determinations over large scales of data are magnified when the human reviewers are not themselves subject matter experts, and magnified even further when the reviewers have no experience in the process. This was not only true of all of the student volunteer reviewers at TREC, but is also sometimes true in real world practice as well. That is just invited error. Training is part of the solution to that.

It is also my supposition that in our culture the errors are magnified again when there is no, or inadequate, compensation provided. All TREC reviewers were unpaid volunteers except for the professional review team members. They were paid by the companies they work for, although those companies were not paid, and the rate of pay to the individuals is unknown. Still, can you be surprised that the top reviewers, the ones who beat the machines, were all paid, and only a few of the student teams came close? In our culture money is a powerful motivator. That is another reason to have better funded experiments that come closer to real world conditions. The test subjects in our experiments should be paid.

The same principle applies in the real world too. Contract review companies should stop competing on price alone and we consumers should stop being fooled by that. Quality is job number one, or should be. Do you really think the company with the lowest price is providing the best service? Do you think their attorney reviewers don’t resent this kind of low pay, sometimes in the $15-$20 per hour range? Most of these lawyers have six-figure student loans to pay off. They deserve a fair wage and, I hypothesize, will perform better if they are paid better.

To test my money-motivation theory I’d love to see an experiment where one review team is paid $25 an hour, and another is paid $75. Be real and let them know which team they are on. Then ask both to review the same documents involving weeks of grueling, boring work. Add in the typical vagaries of relevance, and equal supervision and training, and then see which team does better. Maybe add another variation where there is a stick added to the carrot and you can be fired for too many mistakes. Anyone willing to fund such a study? A contract review company perhaps? (Doubtful!) Better yet, perhaps there is a tech company out there willing to do so, one that competes with cheap human review teams? They should be motivated by money to finance such research (why would most contract review companies want this investigated?). The research would, of course, have to be done by bona fide third-party scientists in a peer review setting. We don’t want the profit motive messing with the truth and objective science.

Secret of Sampling

There is one more fundamental thing you need to understand about the TREC tests, indeed all scientific tests, one which I suppose you could also call a secret since so few people seem to know it, and that is, no one, I repeat, no person, ever sat down and looked at all of the 685,592 documents under consideration in the 2010 TREC Legal Track interactive tasks. No one has ever looked at all of the documents in any TREC task. No person, much less a team of subject matter experts with three-pass reviews as I discussed in Part One, has determined the individual relevancy, or not, of all of these documents by which to judge the results of the software assisted reviews. All that happened (and I don’t mean that negatively), is that a random sample of the 685,592 documents was reviewed by a variety of people.

I have no trouble with sampling and do not think it really matters that only a random sample of the 685,592 corpus was reviewed. Sampling and math are the most powerful tools in every information scientist’s pocket. It seems like magic (much like the hash algorithms), but random sampling has been proven time and again to be reliable. For instance, a sample of 2,345 documents is needed to know the contents of 100,000, with a 95% confidence level and a +/-2% confidence interval. Yet for a collection of 1,000,000 with the same confidence levels, a sample of only 2,395 is required (just 50 more to sample 900,000 more documents). If you add another zero and seek to know about 10,000,000 documents, you need only sample 2,400.
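
These figures can be reproduced with the standard textbook formula for estimating a proportion, adjusted by the finite population correction. The sketch below is a minimal illustration: the function name is illustrative, it assumes the most conservative prevalence (p = 0.5), and it rounds to the nearest whole document because that is what matches the figures above. Many practitioners round up instead, which would give 2,396 for the one-million case.

```python
def sample_size(population: int, margin: float = 0.02,
                z: float = 1.96, p: float = 0.5) -> int:
    """Sample size for a proportion, with finite population correction.

    z = 1.96 corresponds to a 95% confidence level; p = 0.5 is the
    most conservative (largest-sample) assumption about prevalence.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for an unlimited population
    n = n0 / (1 + (n0 - 1) / population)        # finite population correction
    return round(n)  # nearest whole document, matching the figures in the text

for size in (100_000, 1_000_000, 10_000_000):
    print(f"{size:>10,} documents -> sample of {sample_size(size):,}")
```

The required sample barely grows with the collection because the uncorrected size (about 2,401 at these settings) acts as a ceiling. The correction only matters when the collection is small relative to that number.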

To play with the metrics yourself I suggest you see the calculator at http://www.surveysystem.com/sscalc.htm. For a good explanation of sampling see: Application of Simple Random Sampling (SRS) in eDiscovery, Manuscript By Doug Stewart, submitted to the Organizing Committee of the Fourth DESI Workshop on Setting Standards for Electronically Stored Information in Discovery Proceedings on April 20, 2011. Sampling is important. As I have been saying for over two years now, all e-discovery software should include a sampling button as a basic feature. (Many vendors have taken my advice, and I keep asking some of them to whom I made specific demands, to now call the new feature the Ralph Button, but they just laugh. Oh well:)

If the Human Review is Unreliable, Then so is the Gold Standard

The problem with average human review and the comparative measurements of computer assisted alternatives is not with the sampling techniques used to measure. The problem is that if the sample set created by average Joe or Jane reviewer is flawed, then so is the projection. Sampling has the same weakness as AI agent software, including predictive coding seed sets. If the seeds selected are bad, then the trees they grow will be bad too. They won’t look at all like what you wanted and the errors will magnify as the trees grow. It is the same old problem of garbage in, garbage out. I addressed this in Part One of this article, in the section, The Second Search Secret (Known Only to a Few): The Gold Standard to Measure Review is Really Made Out of Lead, but it bears repetition. It is a critical point that has been swept under the carpet until now.

Like it or not, aside from a few top reviewers working with relatively small sets, like the champs in TREC, most human review of relevancy in large-scale reviews is basically garbage, unless it is very carefully managed and constantly safeguarded by statistical sampling and other procedures. Also, if there is no clear definition of relevance, or if relevance is a constantly moving target, or both as is often the case, then the reviewers’ work will be poor (inconsistent), no matter what methods you use. Note that this clear understanding of relevance is often missing in real world reviews for a variety of reasons, including the requesting party’s refusal to clarify under mistaken notions of work product protection, vigorous advocacy, and the like.

Even in TREC, where they claim to have clear relevancy definitions and the review sets were not that large, I’m told by Webber that:

TREC assessors disagree with themselves between 15% to 19% of the times when shown the same document twice (due to undetected duplication in the corpus).

That’s right, the same reviewers looking at the same document at different times disagreed with themselves between 15% and 19% of the time. For authority Webber refers to: Scholer et al., Quantifying Test Collection Quality Based on the Consistency of Relevance Judgements. As you start adding multiple reviewers to a project the disagreement rates naturally get much higher. That is in accord with most everyone’s experience and the scientific tests. If people cannot agree with themselves on questions of relevance, how can you expect them to agree with others? Despite a few champs, human relevancy review is generally very fuzzy.

Some Things Can Still Be Seen Through the Fuzzy Lenses

The exception to the fuzzy measurements problem, which I noted in Part One, is that the measures are not too vague for purposes of comparison, at least that is what the scientists tell me. Also, and this is very important, when you add the utility measures of time and money to review evaluation, which in the real world of litigation we must do, but which has not yet been done in scientific testing, and do not just rely on the abstract measures of precision and recall, then computer assisted review must always win, at least in large-scale projects. We never have the time and money to manually review hundreds of thousands, or millions, of documents, just because they are in the custody of a person of interest. I don’t care what kind of cheap, poor quality labor you use. As Jason Baron likes to point out, at a fast review speed of 100 files per hour, and a cost of $50 per hour for a reviewer, it would still take $500 Million and 10 Million hours to review the 1 Billion emails in the White House.
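The arithmetic behind that White House example is easy to rerun with your own numbers. This is only a back-of-the-envelope sketch; the function name is hypothetical, and it ignores everything except raw reviewer hours (no management, quality control, or second-pass review).

```python
def manual_review_cost(documents, docs_per_hour, hourly_rate):
    """Return (reviewer hours, total dollars) for a purely manual review."""
    hours = documents / docs_per_hour
    return hours, hours * hourly_rate


# The figures quoted above: 1 billion emails, 100 files per hour, $50 per hour.
hours, dollars = manual_review_cost(1_000_000_000, 100, 50)
print(f"{hours:,.0f} hours, ${dollars:,.0f}")
```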

When you consider the utility measures of time and cost, it is obvious that pure manual review is dead. Even our weak, fuzzy comparative testing lens shows that manual and computer review precision and recall are about equal, and maybe the computer is even leading (hard to tell with these fuzzy lenses on). But when you add the time and cost measures, the race is not even close. Computers are far faster and should also be much cheaper. The need for computer assisted review to cull down the corpus, and then assist in the coding, is painfully obvious. The EDI study of a $14 Million review project by all too human contract coders with an overlap rate of only 28% proved that. Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010.

Going for the Gold

The old gold standard of average human reviewers, working in dungeons <smile>, unassisted by smart technology, and not properly managed, has been exposed as a fraud. What else do you call a 28% overlap rate? We must now develop a new gold standard, a new best practice for big data review. And we must do so with the help and guidance of science and testing. The exact contours of the new gold are now under development in dozens of law firms, private companies, and universities around the world. Although we do not know all of the details, we know it will involve:

  1. Bottom Line Driven Proportional Review where the projected costs of review are estimated at the beginning of a project (more on this in a future blog);
  2. High quality tech assisted review, with predictive coding type software, and multiple expert review of key seed-set training documents using both subject matter experts (attorneys) and AI experts (technologists);
  3. Direct supervision and feedback by the responsible lawyer(s) (merits counsel) signing under 26(g);
  4. Extensive quality control methods, including training and more training, sampling, positive feedback loops, clever batching, and sometimes, quick reassignment or firing of reviewers who are not working well on the project;
  5. Experienced, well motivated human reviewers who know and like the AI agents (software tools) they work with;
  6. New tools and psychological techniques (e.g. game theory, story telling) to facilitate prolonged concentration (beyond just coffee, $, and fear) to keep attorney reviewers engaged and motivated to perform the complex legal judgment tasks required to correctly review thousands of usually boring documents for days on end (voyeurism will only take you so far);
  7. Highly skilled project managers who know and understand their team, both human and computer, and the new tools and techniques under development to help coach the team;
  8. Strategic cooperation between opposing counsel with adequate disclosures to build trust and mutually acceptable relevancy standards; and,
  9. Final, last-chance review of a production set before going out the door by spot checking, judgmental sampling (i.e. search for those attorney domains one more time), and random sampling.

I have probably missed a few key factors. This is a group effort and I cannot talk to everyone, nor read all of the literature. If you think I have missed something key here, please let me know. Of course we also need understanding clients who demand competence, and judges willing to get involved when needed to rein in intransigent non-cooperators and to enforce fair proportionality. Also, you should always go for confidentiality and clawback agreements and orders.

Technology Assisted Review

When I say technology assisted review in the best practices list above, which is now a popular phrase, I mean the same thing as computer assisted review. I mean a review method where computerized processes are used to cull down the corpus, and then again to assist in the coding. In the first step technology is used to cull out final selections of documents from a larger corpus for humans to review before final production. The probable irrelevant documents are culled-out and not subject to any further human reviews, except perhaps for quality control random sampling. Keyword search is one very primitive example of that computer assisted culling. Concept search is another more recent, advanced example. There are many others. Think for instance of Axcellerate’s 40 automatically populated filters, which they collectively refer to as their Predictive Analytics step that I described in Part One of Secrets of Search.

These days the software is so smart that technology assisted review can not only intelligently cull out likely irrelevant documents, it can also make predictions for how the remaining relevant documents should be categorized. That is the second step where all of the remaining documents are reviewed by software to predict key classifications like privileged, confidential, hot, and maybe even a few case specific issues. The software predicts how a human will likely code a document and batches documents out in groups accordingly. This predictive coding, combined with efficient document batching (putting documents into sets for human review), makes the human review work easier and more efficient. For instance, one reviewer, or small review team, might be assigned all of the probable privileged documents, another the probable confidential for redaction, a third the probable hot documents, and the remaining documents divided into teams by case issue tags, or maybe by date, or custodian, all depending on the specifics of the case. It is an art, but one that can and should be measured and guided by science.
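To make the batching idea concrete, here is a toy Python sketch. It is not how any particular vendor’s software works; the function name, the category labels, and the batch size are all hypothetical, and it assumes the software has already attached one predicted category to each document.

```python
from collections import defaultdict


def batch_by_prediction(predictions, batch_size=2):
    """Group documents by predicted category, then split into review batches.

    predictions: iterable of (doc_id, predicted_category) pairs, assumed to
    come from whatever predictive coding software is in use (hypothetical).
    """
    queues = defaultdict(list)
    for doc_id, category in predictions:
        queues[category].append(doc_id)
    # Slice each category's queue into fixed-size batches for a review team.
    return {
        category: [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]
        for category, ids in queues.items()
    }


predicted = [
    ("doc1", "privileged"), ("doc2", "hot"), ("doc3", "privileged"),
    ("doc4", "confidential"), ("doc5", "privileged"),
]
for category, batches in batch_by_prediction(predicted).items():
    print(category, batches)
```

In a real project the grouping key might just as well be case issue, date, or custodian, as noted above.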

I contrast this kind of technology assisted review with pure Borg type computer controlled review, where there is complete computer delegation, where the computer does all, with little or no human involvement, except for the first seed set generation of relevancy patterns. Here we trust the AI agent and produce all documents determined to be relevant and not-privileged. No human does a double-check of the computer’s coding before the documents go out the door. In my opinion, we are still far away from such total delegation, although I don’t rule it out someday. (Resistance is futile.) Do you agree?

Is anyone out there relying on 100% computer review with no human eye quality controls? Conversely, as to the opposite, is there anyone out there who still uses pure (100%) human review? Who has humans (lawyers or paralegals) review all documents in a custodian collection (assuming, as you should, that there are thousands or tens of thousands of documents in the collection)? Is there anyone who does not rely on some little brother of Watson to review and cull out at least some of the corpus first?

More Research Please

The fuzzy standard of most human review is an inconvenient truth known to all information scientists. As we have seen, it has been known to TREC researchers since at least 2000 with the study by Ellen Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness, 36:5 Information Processing & Management 697, 701 (2000). Yet I for one have not heard much discussion about it. This flaw cuts to the core of information science, because without accurate, objective measurements, there can be no science. For that reason scientists have come up with many techniques to try to overcome the inherent fuzziness of relevancy determinations, in and outside of legal search. I concede they are making progress, and TREC Legal Track is, for instance, getting better every year, but, like Voorhees and Webber, I insist there is still a long way to go.

Maybe the best software programs (whatever they are) are far better than our best reviewers under ideal conditions (that’s what I think), maybe not. But the truth is, we don’t really know what our real precision and recall rates are now, we don’t really know how much of the truth we are finding. The measures are, after all, so vague, so human dependent. What are we to make of our situation in legal review where the Roitblat et al study shows an overlap rate of only 28%? Here is Webber’s more precise information science language explanation that he made in reviewing my blog article in his blog:

The most interesting part of Ralph’s post, and the most provocative, both for practitioners and for researchers, arises from his reflections on the low levels of assessor agreement, at TREC and elsewhere, surveyed in the background section of my SIRE paper. Overlap (measured as the Jaccard coefficient; that is, size of intersection divided by size of union) between relevant sets of assessors is typically found to be around 0.5, and in some (notably, legal) cases can be as low as 0.28. If one assessor were taken as the gold standard, and the effectiveness of the other evaluated against it, then these overlaps would set an upper limit on F1 score (harmonic mean of precision and recall) of 0.66 and 0.44, respectively. Ralph then provocatively asks, if this is the ground truth on which we are basing our measures of effectiveness, whether in research or in quality assurance and validation of actual productions, then how meaningful are the figures we report? At the most, we need to normalize reported effectiveness scores to account for natural disagreement between human assessors (something which can hardly be done without task-specific experimentation, since it varies so greatly between tasks). But if our upper bound F1 is 0.66, then what are we to make of rules-of-thumb such as “75% recall is the threshold for an acceptable production”?

As Webber well knows, this means that such 75% or higher rules-of-thumb for acceptable recall are just wishful thinking. It means they should be disregarded because they are counter to the actual evidence of measurement deficiencies. The evidence instead shows that, with a 28% overlap, the maximum F1 score (the harmonic mean of precision and recall) that can be measured objectively is only 44%. Demands in litigation for objective search recall rates higher than 44% fly in the face of the EDI study. It is an unreasonable request on its face, never mind the legal precedent for accepting keyword search or manual review. I understand that the research also shows that technology assisted reviews are at least as good as manual, but that begs the real question as to how good either of them is!
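The ceilings Webber quotes follow directly from the definitions. If one reviewer’s relevant set is treated as the gold standard and the other’s is scored against it, the F1 score works out to 2J / (1 + J), where J is the overlap (Jaccard coefficient) between the two sets. Here is a small Python sketch; the function names and the two toy reviewer sets are my own, for illustration only.

```python
def jaccard(a, b):
    """Overlap between two relevant sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def f1_from_jaccard(j):
    """F1 of one reviewer scored against the other, given their overlap j."""
    return 2 * j / (1 + j)


# The overlaps quoted by Webber: about 0.5 typically, 0.28 in the EDI study.
for overlap in (0.5, 0.28):
    print(f"overlap {overlap:.2f} -> F1 ceiling {f1_from_jaccard(overlap):.4f}")

# Toy example: two reviewers who each mark four documents and share two.
reviewer_a = {1, 2, 3, 4}
reviewer_b = {3, 4, 5, 6}
print(f1_from_jaccard(jaccard(reviewer_a, reviewer_b)))
```

An overlap of 0.5 gives 0.6667, which Webber reports as 0.66, and an overlap of 0.28 gives 0.4375, which he reports as 0.44.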

I personally find it hard to believe that with today’s technology assisted reviews we are not in fact doing much better than 44% or 66% recall, but then I think back to the lawyers in the 1980s in the Blair and Maron study: We are confident our search terms uncovered 75% of the relevant evidence. Well, who knows, maybe they did, but the measurements were wrong. Who knows how well any of us are doing in big data reviews? The fuzziness of the measures is an inconvenient truth that must be faced. The 44% max objective rate creates a lack of confidence interval that must be corrected. We have to significantly improve the gold standard, we have to upgrade the quality of reviews used for measurements.

This is one reason I call for more research, and better funded research. We need to know how much of the truth we are finding, we need a recall rate we can count on to do justice. Large corporations should especially step up to the plate and fund pure scientific research, not just product development. I trust you that it works, but, as President Reagan said, I still want you to verify. I still want you to show me exactly how well it works, and I want you to do it with objective, peer-reviewed science, and to use a gold standard that I can trust.

Trust But Verify

As it now stands, the confidence rates and error margins are too low for me to entirely trust Watson, much less his little brothers. The computer was, after all, trained by humans, and they can be unreliable. Garbage in, garbage out. I will only trust a computer trained by several humans, checking against each other, and all of them experts, well paid experts at that. Even then, I’d like to have a final expert review of the documents finally selected for production before they actually go out the door. After all, the determinations and samples are based on all too human judgments. If the stakes are high, and they usually are in litigation, especially where privileges and confidential information are involved, there needs to be a final check before documents are produced. That is the true gold standard in my world. Do you agree? Please leave a comment below.

Apology and Holiday Greetings from Ralph

Now I must apologize to my readers. I promised a two-part blog on Secrets of Search where the deepest secret would be revealed in Part Two, along with the seventh insight into why most lawyers in the world do not want to do e-discovery. But admit it, this Part Two is already too long isn’t it (over 7,100 words)? How long can we mere mortals maintain our attention on this stuff? You already have a lot to think about here. So, it looks like I lied before. It now seems to me better to wait and finish this article in a Part III, rather than ask you to read on and on.

So stay tuned friends, I promise this soap opera will finally come to a conclusion next time, when we are all much fresher and finally ready to hear the truth, the whole truth, and nothing but the truth about the secrets of search. (And yes, I really have four monitors at my desk, actually I have five when you include my personal MacBook Pro, which is by far my favorite computer.) Oh yeah, and the next blog may be late too. We’ll see how busy Santa keeps me. Happy Holidays!
