This is the conclusion of the report on the Enron document review experiment that I began in my last blog. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. The conclusion is an analysis of the relative effectiveness of the two reviews. Prepare for surprises. Artificial Intelligence has come a long way.
The Monomodal method, which I nicknamed Borg review for its machine dominance, did better than anticipated. Still, as the graphic suggests, it came up short in the key component: finding Hot documents. Yes. There is still a place for keyword and other types of search. But it is growing smaller every year.
Description of the Two Types of Predictive Coding Review Methods Used
When evaluating the success of the Monomodal all-predictive-coding approach in the second review, please remember that this is not pure Borg. I would not spend 52 hours of my life doing that kind of review. I doubt any SME or search expert would do so. Instead, I did my version of the Borg review, which is quite different from that endorsed by several vendors. I call my version the Enlightened Hybrid Borg Monomodal review. Losey, R., Three-Cylinder Multimodal Approach To Predictive Coding. I used all three cylinders described in this article: one for random selection, a second for machine analysis, and a third powered by human input. The only difference from full Multimodal review is that the third engine of human input was limited to predictive coding based ranked searches.
This means that in the version of Monomodal review tested the random selection of documents played only a minor role in training (thus an Enlightened approach). It also means that the individual SME reviewer was allowed to supplement the machine selected documents with his own searches, which I did, so long as the searches were predictive coding based (thus the Hybrid approach, Man and Machine). For example, with the Hybrid approach to Monomodal the reviewer can select documents for review for possible training based on their ranked positions. The reviewer does not have to rely entirely on the computer algorithms to select all of the documents for review.
The primary difference between my two reviews was that the first Multimodal method used several search methods to find documents for machine training, including especially keyword and similarity searches, whereas the second did not. Only machine learning type searches were used in the Monomodal search. Otherwise I used essentially the same approach as I would in any litigation, and budgeted my time and expense to 52 hours for each project.
Both Reviews Were Bottom Line Driven
Both the Monomodal and Multimodal reviews were tempered by a Bottom Line Driven approach. This means the goal of the predictive coding culling reviews was a reasonable effort where an adequate number of relevant documents were found. It was not an unrealistic, over-expensive effort. It did not include a vain pursuit of more of the same type of documents, which would never find their way into evidence anyway, and would never lead to new evidence. They would only make the recall statistics look good. The law does not require that. (Look out for vendors and experts who promote the vain approach of high recall just to line their own pockets.) The law requires reasonable efforts proportional to the value of the case and the value of the evidence. It does not require perfection. In most cases it is a waste of money to try for it.
In both reviews I stopped the iterative machine training when few new documents were located in the last couple of rounds. I stopped when the documents predicted as relevant were primarily just more of the same or otherwise not important. It was somewhat fortuitous that this point was reached after about the same amount of effort, even though I had only gone through 5 rounds of training in Multimodal, as compared to 50 rounds in Monomodal. I was about at the same point of new-evidence-exhaustion in both reviews and these final stats reflect the close outcomes.
There is no question in my mind that more relevant documents could have been found in both reviews if I had done more rounds of training. But I doubt that new, unique types of relevant documents would have been uncovered, especially in the first Multimodal review. In fact, I tested this theory after the first Multimodal review was completed and did a sixth round of training not included in these metrics. I called it my post hoc analysis and it is described at pages 74-84 of the Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. I found 32 technically relevant documents in the sixth round, as expected, and, again as expected, none were of any significance.
In both reviews the decision to stop was tested, and passed, based on my version of the elusion test of the null-set (all documents classified as irrelevant and thus not to be produced). My elusion test has a strict accept-on-zero-error policy for Hot documents. This test does not prove that all Hot documents have been found. It just creates a testing condition such that if any Hot documents are found in the sample, then the test failed and more training is required. In the random sample quality assurance tests for both reviews no Hot documents were found, and no new relevant documents of any significance were found, so the tests were passed. (Note that the test passed in the second Monomodal review, even though, as will be shown, the second review did not locate four unique Hot documents found in the first review.) In both elusion tests the false negatives found in the random sample were all just unimportant more of the same type documents that I did not care about anyway.
Neither of my Enron reviews were perfect, and the recall and F1 tests reflect that, but they were both certainly reasonable and should survive any legal challenge. If I had gone on with further rounds of training and review, the recall would have improved, but to little or no effect. The case itself would not have been advanced, which is the whole point of e-discovery, not the establishment of artificial metrics. With the basic rule of proportionality in mind the additional effort of more rounds of review would not have been worth it. Put another way, it would have been unreasonable to have insisted on greater recall or F1 scores in these projects.
It is never a good idea to have a preconceived notion of a minimum recall or F1 measure. It all depends on the case itself, and the documents. You may know about the case and scope of relevance (although frequently that matures as the project progresses), but you usually do not know about the documents. That is the whole point of the review.
It is also important to recognize that both of these predictive coding reviews, Multimodal and Monomodal, did better than any manual review. Moreover, they were both far, far less expensive than traditional reviews. These cost considerations will be addressed in an upcoming blog, not here. Instead I will focus on objective measures of prevalence, recall, precision, and total document retrieval comparisons. Yes, that means more math, but not much.
Summary of Prevalence and Comparative Recall Calculations
A total of three simple random samples were taken of the entire 699,082 dataset as described with greater particularity in the search narratives. Predictive Coding Narrative (2012); Borg Challenge Report (2013). A random sample of 1,507 documents was made in the first review wherein 2 relevant documents were found. This showed a prevalence rate of 0.13%. Two more random samples were taken in the second review of 1,183 documents in each sample. The total random sample in the second review was thus 2,366 documents with 5 relevant found. This showed a prevalence rate of 0.21%. Thus a total of 3,873 random sampled documents were reviewed and a total of 7 relevant documents found.
Since three different samples were taken some overlap in sampled documents was possible. Nevertheless, since these three samples were each made without replacement we can combine them for purposes of the simple binomial confidence intervals estimated here.
By combining all three samples, with a total of 3,873 documents reviewed and 7 relevant documents found, you have a prevalence of 0.18%. The spot projection of 0.18% over the entire 699,082 document dataset is 1,264. Using a binomial calculation to determine the confidence interval, and using a confidence level of 95%, the error range is from 0.07% to 0.37%. This represents a range of between 489 and 2,587 projected relevant documents in the entire dataset.
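For readers who want to check the interval, the calculation above can be reproduced with an exact (Clopper-Pearson) binomial confidence interval. The following is a minimal, standard-library-only Python sketch, not the calculator used in the actual reviews; the function names and the bisection approach are my own.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); direct summation is fine here since k is small."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(x, n, conf=0.95):
    """Exact two-sided binomial confidence interval for x successes in n trials."""
    alpha = 1 - conf

    def bisect(too_small, lo=0.0, hi=1.0):
        # too_small(p) is True while p is still below the bound we are seeking
        for _ in range(100):
            mid = (lo + hi) / 2
            if too_small(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if x == 0 else bisect(lambda p: 1 - binom_cdf(x - 1, n, p) < alpha / 2)
    upper = 1.0 if x == n else bisect(lambda p: binom_cdf(x, n, p) > alpha / 2)
    return lower, upper

low, high = clopper_pearson(7, 3873)
print(f"{low:.2%} to {high:.2%}")   # about 0.07% to 0.37%
print(round(7 / 3873 * 699_082))    # spot projection over the corpus: 1264
```

Multiplying the rounded interval endpoints by the 699,082 document corpus yields the 489 to 2,587 range of projected relevant documents used below.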
From the perspective of the reviewer the low projected range represents the best-case-scenario for calculating recall. Here we know that 489 relevant documents is not correct because both reviews found more relevant documents than that: Multimodal found 661 and Monomodal found 579. Taking a conservative view for recall calculation purposes, and assuming that the 294 documents coded relevant only in the second Monomodal review were in fact all relevant, we have a minimum floor of 955 relevant documents (661 + 294). Thus under the best-case-scenario, the 955 found represents all of the relevant documents in the corpus, not the 489 or 661 counts.
From the perspective of the reviewer the high projected range in the above binomial calculation – 2,587 – represents the worst-case-scenario for calculating recall. It has the same probability of being correct as the low range projection of 489. It is a possibility, albeit a slim one, and certainly less likely than the 955 minimum floor we were able to set using the binomial calculation tempered by actual experience.
Under the most-likely-scenario, the spot projection, there are 1,264 relevant documents. This is shown in the bell curve below. Note that since the random sample calculations are all based on a 95% confidence level, there was a 2.5% chance that fewer than 489 relevant documents would be found, and a 2.5% chance that more than 2,587 would be found (the left and right tails of the curve). Also note that the spot projection of 1,264 has the highest probability (9.5%) of being the correct estimate. Moreover, the closer to 1,264 you come on the bell curve the higher the probability of likely accuracy. Therefore, it is more likely that there are 1,500 relevant documents than 1,700, and more likely that there are 1,100 documents than 1,000.
The recall calculations under all three scenarios are as follows:
- Under the most-likely-scenario using the spot projection of 1,264:
- Monomodal (Borg) retrieval of 579 = 46% recall.
- Multimodal retrieval of 661 = 52% recall (that’s 13% better than Monomodal (6/46)).
- Projected relevant documents not found by best effort, Multimodal = 603.
- Under the worst-case-scenario using the maximum count projection of 2,587:
- Monomodal (Borg) retrieval of 579 = 22% recall.
- Multimodal retrieval of 661 = 26% recall (that’s 18% better than Monomodal (4/22)).
- Projected relevant documents not found by best effort, Multimodal = 1,926.
- Under the best-case-scenario using the minimum floor of 955:
- Monomodal (Borg) retrieval of 579 = 61% recall.
- Multimodal retrieval of 661 = 69% recall (that’s 13% better than Monomodal (8/61)).
- Projected relevant documents not found by best effort, Multimodal = 294.
In summary, the prevalence projections from the three random samples suggest that the Multimodal method recalled from between 26% to 69% of the total number of relevant documents, with the most likely result being 52% recall. The prevalence projections suggest that the Monomodal method recalled from between 22% to 61% of the total number of relevant documents, with the most likely result being a 46% recall. The metrics thus suggest that Multimodal attained a recall level from between 13% to 18% better than attained by the Monomodal method.
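The three scenarios above reduce to simple arithmetic. The following sketch reproduces the recall figures and the relative edge of Multimodal over Monomodal (the variable names are my own; the relative edge is computed from the rounded percentage points, as in the text):

```python
# Relevant documents retrieved by each method in the two reviews
retrieved = {"Multimodal": 661, "Monomodal": 579}

# Total relevant documents in the corpus under each scenario
scenarios = {"best-case": 955, "most-likely": 1264, "worst-case": 2587}

for name, total_relevant in scenarios.items():
    multi = round(100 * retrieved["Multimodal"] / total_relevant)
    mono = round(100 * retrieved["Monomodal"] / total_relevant)
    # relative improvement of Multimodal over Monomodal, in rounded points
    edge = round(100 * (multi - mono) / mono)
    print(f"{name}: Multimodal {multi}% vs Monomodal {mono}% (+{edge}%)")
```

Running this prints 69% vs 61% (best-case), 52% vs 46% (most-likely), and 26% vs 22% (worst-case), matching the 13% to 18% relative advantage reported above.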
Precision and F1 Comparisons
The first Multimodal review classified 661 documents as relevant. The second review re-examined 403 of those 661 documents. The second review agreed with the relevant classification of 285 documents and disagreed with 118. Assuming that the second review was correct, and the first review incorrect, the precision rate was 71% (285/403).
When the content of these documents is examined, and the duplicate and near duplicate documents are removed from the analysis as previously explained, the Multimodal review classified 369 different unique documents as relevant. The second review re-examined 243 of those 369 documents. The second review agreed with the relevant classification of 211 documents and disagreed with 32. Assuming that the second review was correct, and the first review incorrect, the precision rate was 87% (211/243).
Conversely, if you assume the conflicting second review calls were incorrect, and the SME got it right on all of them the first time, the precision rate for the first review would be 100%. That is because all of the documents identified by the first review as relevant to the information request would in fact stand confirmed as relevant. As discussed previously, all of the disputed calls concerned ambiguous or borderline grey area documents. The classification of these documents is inherently arbitrary, to some extent, and they are easily subject to concept shift. The author takes no view as to the absolute correctness of the conflicting classifications.
The second Monomodal review classified 579 documents as relevant. The second review re-examined 323 of those 579 documents and agreed with the relevant classification of 285 documents and disagreed with 38. Assuming that the first review was correct, and the second review incorrect, the agreement rate on relevant classifications was 88% (285/323).
When the content of these documents is examined, and the duplicate and near duplicate documents are removed from the analysis as previously explained, the Monomodal review classified 427 different unique documents as relevant. The first review had examined 242 of those 427 documents. The first review agreed with the relevant classification of 211 documents and disagreed with 31. Assuming that the first review was correct, and the second review incorrect, the precision rate was again 87% (211/242).
Assuming the conflicting first review calls were incorrect, and the SME got it right on all of them the second time, then again the precision rate for the second review would be 100%. That is because all of the documents identified by the second review as relevant to the information request would in fact stand confirmed as relevant.
In view of the inherent ambiguity of all of the documents with conflicting coding the measurement of precision in these two projects is of questionable value. Nevertheless, assuming that the inconsistencies in coding were always correct, when you do not account for duplicate and near duplicate documents the second Monomodal review was 24% more consistent with the first Multimodal review. However when the duplicates and near duplicate documents are removed for a more accurate assessment, the precision rates of both reviews were almost identical at 87%.
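Since precision here is really an agreement rate between the two reviews, all four figures above follow from one small calculation. A minimal sketch (the function name is my own):

```python
def agreement_rate(agreed, disagreed):
    """Percentage of re-examined relevant calls that the other review confirmed."""
    return round(100 * agreed / (agreed + disagreed))

# With duplicates and near duplicates included
print(agreement_rate(285, 118))  # first (Multimodal) review: 71
print(agreement_rate(285, 38))   # second (Monomodal) review: 88

# With duplicates and near duplicates removed
print(agreement_rate(211, 32))   # Multimodal: 87
print(agreement_rate(211, 31))   # Monomodal: 87
```

The deduplicated rates converge at 87% for both methods, which is why the apparent 24% consistency advantage of the second review disappears once duplicates are removed.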
The F1 measurement is the harmonic mean of the precision and recall rates. The formula for calculating the harmonic mean is not too difficult: 2/(1/P + 1/R) where P is precision and R is recall. Thus using the more accurate 87% precision rate for both, the harmonic mean ranges for the projects are:
- 40% to 77% for Multimodal
- 35% to 71% for Monomodal
The F1 measures for most-likely-scenario spot projections for both are:
- 65% for Multimodal
- 61% for Monomodal
In summary, since the precision rates of the two methods were identical at a respectable 87%, the comparisons between the recall rates and F1 rates track each other closely. The Multimodal F1 of 40% for the worst-case-scenario was 14% better than the Monomodal F1 of 35%. The Multimodal F1 of 77% for the best-case-scenario was 8% better than the Monomodal F1 of 71%. The most likely spot projection F1 of 65% for Multimodal again shows a 7% improvement over the Monomodal F1 of 61%.
Comparisons of Total Counts of Relevant Documents
The first review using the Multimodal method found 661 relevant documents. The second review using the Monomodal method found 579 relevant documents. This means that Multimodal found 82 more relevant documents than Monomodal. That is a 14% improvement. This is shown by the roughly proportional circles below.
Analysis of the content of these relevant documents showed that:
- The set of 661 relevant documents found by the first Multimodal review contained 292 duplicate or near duplicate documents, leaving only 369 different unique documents. There were 74 duplicates or near duplicates in the 285 documents coded relevant by both Multimodal and Monomodal, and 218 duplicates in the 376 documents that were only coded relevant in the Multimodal review. (As the most extreme example, the 376 documents contained one email with the subject line Enron Announces Plans to Merge with Dynegy dated November 9, 2001, that had 54 copies.)
- The set of 579 relevant documents found by second Monomodal review contained 152 duplicate or near duplicate documents, leaving only 427 different unique documents. There were 74 duplicates or near duplicates in the 285 documents coded relevant by both Multimodal and Monomodal, and 78 duplicates in the 294 documents that were only coded relevant in the Monomodal review. (As the most extreme example, the 294 documents contained one email with the subject line NOTICE TO: All Current Enron Employees who Participate in the Enron Corp. Savings Plan dated January 3, 2002, that had 39 copies.)
- Therefore when you exclude the duplicate or near duplicate documents the Monomodal method found 427 different documents and the Multimodal method found 369. This means the Monomodal method found 58 more unique relevant documents than Multimodal, an improvement of 16%. This is shown by the roughly proportional circles below.
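The unique document comparison in the bullets above is simple subtraction, sketched here for clarity (the variable names are my own):

```python
# Each review's relevant set: (total documents found, duplicates/near-duplicates within it)
multimodal = (661, 292)
monomodal = (579, 152)

unique_multi = multimodal[0] - multimodal[1]   # 369 unique documents
unique_mono = monomodal[0] - monomodal[1]      # 427 unique documents
print(unique_mono - unique_multi)              # Monomodal found 58 more
print(round(100 * (unique_mono - unique_multi) / unique_multi))  # a 16% edge
```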
On the question of effectiveness of retrieval of relevant documents under the two methods it looks like a draw. The Multimodal method found 14% more relevant documents, and likely attained a recall level from between 13% to 18% better than attained by the Monomodal method. But after removal of duplicates and near duplicates, the Monomodal method found 16% more unique relevant documents.
This result is quite surprising to the author who had expected the Multimodal method to be far superior. The author suspects the unexpectedly good results in the second review over the first, at least from the perspective of unique relevant documents found, may derive, at least in part, from the SME’s much greater familiarity and expertise with predictive coding techniques and Inview software by the time of the second review. Also, as mentioned, some slight improvements were made to the Inview software itself just before the second review, although it was not a major upgrade. The possible recognition of some documents in the second review from the first could also have had some slight impact.
Hot Relevant Document Differential
The first review using the Multimodal method found 18 Hot documents. The second review using the Monomodal method included only 13 Hot documents. This means that Multimodal found 5 more Hot documents than Monomodal. That is a 38% improvement. This is shown by the roughly proportional circles below.
Analysis of the content of these Hot documents showed that:
- The set of 18 Hot documents found by first Multimodal review contained 7 duplicate or near duplicate documents, leaving only 11 different unique documents.
- The set of 13 Hot documents found by second Monomodal review contained 6 duplicate or near duplicate documents, leaving only 7 different unique documents. Also, as mentioned, all 13 of the Hot documents found by Monomodal were also found by Multimodal, whereas Multimodal found 5 Hot documents that Monomodal did not.
- Therefore when you exclude the duplicate or near duplicate documents the Multimodal method found 11 different documents and the Monomodal method found 7. This means the Multimodal method found 4 more unique Hot documents than Monomodal, an improvement of 57%. This is shown by the roughly proportional circles below.
On the question of effectiveness of retrieval of Hot documents the Multimodal method did 57% better than Monomodal. Thus, unlike the comparison of effectiveness of retrieval of relevant documents, which was a close draw, the Multimodal method was far more effective in this category. In the author’s view the ability to find Hot documents is much more important than the ability to find merely relevant documents. That is because in litigation such Hot documents have far greater probative value as evidence than merely relevant documents. They can literally make or break a case.
In other writings the author has coined the phrase Relevant is Irrelevant to summarize the argument that Hot documents are far more significant in litigation than merely relevant documents. The author contends that the focus of legal search should always be on retrieval of Hot documents, not relevant documents. Losey, R. Secrets of Search – Part III (2011) (the 4th secret). This is based in part on the well-known rule of 7 +/- 2 that is often relied upon by trial lawyers and psychologists alike as a limit to memory and persuasion. Id. (the 5th and final secret of search).
To summarize, this study suggests that the hybrid multimodal search method, one that uses a variety of search methods to train the predictive coding classifier, is significantly more effective (57%) at finding highly relevant documents than the hybrid monomodal method. When comparing the effectiveness of retrieval of merely relevant documents, the two methods performed about the same. Still, the edge in performance must again go to Multimodal because of its 7% to 14% better projected F1 measures.