Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents

June 17, 2013

This is the conclusion of the report on the Enron document review experiment that I began in my last blog, A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. The conclusion is an analysis of the relative effectiveness of the two reviews. Prepare for surprises. Artificial Intelligence has come a long way.

The Monomodal method, which I nicknamed Borg review for its machine dominance, did better than anticipated. Still, it came up short in the key component, as the graphic suggests, of finding Hot documents. Yes. There is still a place for keyword and other types of search. But it is growing smaller every year.

Description of the Two Types of Predictive Coding Review Methods Used

When evaluating the success of the Monomodal, all-predictive-coding approach in the second review, please remember that this is not pure Borg. I would not spend 52 hours of my life doing that kind of review. I doubt any SME or search expert would do so. Instead, I did my version of the Borg review, which is quite different from that endorsed by several vendors. I call my version the Enlightened Hybrid Borg Monomodal review. Losey, R., Three-Cylinder Multimodal Approach To Predictive Coding. I used all three cylinders described in this article: one for random, a second for machine analysis, and a third cylinder powered by human input. The only difference from full Multimodal review is that the third engine of human input was limited to predictive coding based ranked searches.

This means that in the version of Monomodal review tested, the random selection of documents played only a minor role in training (thus an Enlightened approach). It also means that the individual SME reviewer was allowed to supplement the machine selected documents with his own searches, which I did, so long as the searches were predictive coding based (thus the Hybrid approach, Man and Machine). For example, with the Hybrid approach to Monomodal the reviewer can select documents for review for possible training based on their ranked positions. The reviewer does not have to rely entirely on the computer algorithms to select all of the documents for review.

The primary difference between my two reviews was that the first Multimodal method used several search methods to find documents for machine training, including especially keyword and similarity searches, whereas the second did not. Only machine learning type searches were used in the Monomodal search. Otherwise I used essentially the same approach as I would in any litigation, and budgeted my time and expense to 52 hours for each project.

Both Reviews Were Bottom Line Driven

Both the Monomodal and Multimodal reviews were tempered by a Bottom Line Driven approach. This means the goal of the predictive coding culling reviews was a reasonable effort where an adequate number of relevant documents were found. It was not an unrealistic, over-expensive effort. It did not include a vain pursuit of more of the same type documents. These documents would never find their way into evidence anyway, and would never lead to new evidence. They would only make the recall statistics look good. The law does not require that. (Look out for vendors and experts who promote the vain approach of high recall just to line their own pockets.) The law requires reasonable efforts proportional to the value of the case and the value of the evidence. It does not require perfection. In most cases it is a waste of money to try.


In both reviews I stopped the iterative machine training when few new documents were located in the last couple of rounds. I stopped when the documents predicted as relevant were primarily just more of the same or otherwise not important. It was somewhat fortuitous that this point was reached after about the same amount of effort, even though I had only gone through 5 rounds of training in Multimodal, as compared to 50 rounds in Monomodal. I was about at the same point of new-evidence-exhaustion in both reviews and these final stats reflect the close outcomes.

There is no question in my mind that more relevant documents could have been found in both reviews if I had done more rounds of training. But I doubt that new, unique types of relevant documents would have been uncovered, especially in the first Multimodal review. In fact, I tested this theory after the first Multimodal review was completed and did a sixth round of training not included in these metrics. I called it my post hoc analysis and it is described at pages 74-84 of the Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. I found 32 technically relevant documents in the sixth round, as expected, and, again as expected, none were of any significance.

In both reviews the decision to stop was tested, and passed, based on my version of the elusion test of the null-set (all documents classified as irrelevant and thus not to be produced). My elusion test has a strict accept-on-zero-error policy for Hot documents. This test does not prove that all Hot documents have been found. It just creates a testing condition such that if any Hot documents are found in the sample, then the test failed and more training is required. In the random sample quality assurance tests for both reviews no Hot documents were found, and no new relevant documents of any significance were found, so the tests were passed. (Note that the test passed in the second Monomodal review, even though, as will be shown, the second review did not locate four unique Hot documents found in the first review.) In both elusion tests the false negatives found in the random sample were all just unimportant more of the same type documents that I did not care about anyway.
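The accept-on-zero-error rule can be sketched in a few lines of Python. This is only an illustration, not the procedure actually run in Inview: the function names, the string codes, and the one-sided bound are my assumptions. The bound shows why a passed test does not prove that all Hot documents were found; it only caps the Hot prevalence that is still consistent with a clean sample.

```python
def elusion_test_passes(sample_codes):
    """Accept-on-zero-error: the test fails if any sampled null-set document is Hot."""
    return not any(code == "hot" for code in sample_codes)


def zero_error_upper_bound(sample_size, confidence=0.95):
    """One-sided upper bound on prevalence when a random sample shows zero hits.

    Solves (1 - p) ** sample_size = 1 - confidence for p.
    """
    return 1 - (1 - confidence) ** (1 / sample_size)


# Hypothetical sample codes, for illustration only.
print(elusion_test_passes(["irrelevant", "relevant", "irrelevant"]))  # passes
print(elusion_test_passes(["irrelevant", "hot"]))                     # fails

# Bound for a clean sample of 1,605 documents (the null-set sample size
# reported for the first review).
print(f"{zero_error_upper_bound(1605):.3%}")
```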

Neither of my Enron reviews was perfect, and the recall and F1 tests reflect that, but they were both certainly reasonable and should survive any legal challenge. If I had gone on with further rounds of training and review, the recall would have improved, but to little or no effect. The case itself would not have been advanced, which is the whole point of e-discovery, not the establishment of artificial metrics. With the basic rule of proportionality in mind the additional effort of more rounds of review would not have been worth it. Put another way, it would have been unreasonable to have insisted on greater recall or F1 scores in these projects.

It is never a good idea to have a preconceived notion of a minimum recall or F1 measure. It all depends on the case itself, and the documents. You may know about the case and scope of relevance (although frequently that matures as the project progresses), but you usually do not know about the documents. That is the whole point of the review.

It is also important to recognize that both of these predictive coding reviews, Multi and Monomodal, did better than any manual review. Moreover, they were both far, far, less expensive than traditional reviews. These last considerations will be addressed in an upcoming blog, not here. Instead I will focus on objective measures of prevalence, recall, precision, and total document retrieval comparisons. Yes, that means more math, but not much.

Summary of Prevalence and Comparative Recall Calculations

A total of three simple random samples were taken of the entire 699,082 dataset as described with greater particularity in the search narratives. Predictive Coding Narrative (2012); Borg Challenge Report (2013). A random sample of 1,507 documents was made in the first review wherein 2 relevant documents were found. This showed a prevalence rate of 0.13%.  Two more random samples were taken in the second review of 1,183 documents in each sample. The total random sample in the second review was thus 2,366 documents with 5 relevant found. This showed a prevalence rate of 0.21%. Thus a total of 3,873 random sampled documents were reviewed and a total of 7 relevant documents found.

Since three different samples were taken some overlap in sampled documents was possible. Nevertheless, since these three samples were each made without replacement we can combine them for purposes of the simple binomial confidence intervals estimated here.

By combining all three samples with a total of 3,873 documents reviewed, and 7 relevant documents found, you have a prevalence of 0.18%. The spot projection of 0.18% over the entire 699,082 dataset is 1,264. Using a Binomial calculation to determine the confidence interval, and using a confidence level of 95%, the error range is from 0.07% to 0.37%. This represents a range of between 489 and 2,587 projected relevant documents in the entire dataset.
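For readers who want to check the interval, here is a minimal sketch in Python using only the standard library. It assumes the Binomial calculation was an exact (Clopper-Pearson) interval, which the report does not state; the function names are mine. Under that assumption the bounds should land close to the 0.07% to 0.37% range, and the projected counts of 489 and 2,587 appear to follow from applying the percentages rounded to two decimals to the 699,082 documents.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_decreasing(f, target, steps=100):
    """Bisect on [0, 1] for f(p) == target, where f decreases as p grows."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, confidence=0.95):
    """Exact two-sided binomial interval for k hits in n samples."""
    alpha = 1 - confidence
    # Lower bound: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


corpus = 699_082
hits, sampled = 7, 3_873

lower, upper = clopper_pearson(hits, sampled)
print(f"prevalence {hits / sampled:.2%}, interval {lower:.2%} to {upper:.2%}")
print("spot projection:", round(hits / sampled * corpus))
# Projected counts using the rates rounded to two decimals of a percent.
print("low:", round(round(lower, 4) * corpus), "high:", round(round(upper, 4) * corpus))
```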

From the perspective of the reviewer the low projected range represents the best-case-scenario for calculating recall. Here we know the 489 relevant documents is not correct because both reviews found more relevant documents than that. The Multimodal found 661 and the Monomodal found 579. Taking a conservative view for recall calculation purposes, and assuming that the 63 documents considered relevant in one review, and not in another, were in fact all relevant, this means we have a minimum floor of 955 relevant documents. Thus under the best-case-scenario, the 955 found represents all of the relevant documents in the corpus, not the 489 or 661 counts.

From the perspective of the reviewer the high projected range in the above binomial calculations – 2,587 – represents the worst-case-scenario for calculating recall. It has the same probability of being correct as the 489 low range projection had. It is a possibility, albeit slim, and certainly less likely than the 955 minimum floor we were able to set using the binomial calculation tempered by actual experience.

Under the most-likely-scenario, the spot projection, there are 1,264 relevant documents. This is shown in the bell curve below. Note that since the random sample calculations are all based on a 95% probability level, there is a 2.5% chance that there are fewer than 489 relevant documents and a 2.5% chance that there are more than 2,587 (the left and right edges of the curve). Also note that the spot projection of 1,264 has the highest probability (9.5%) of being the correct estimate. Moreover, the closer to 1,264 you come on the bell curve the higher the probability of likely accuracy. Therefore, it is more likely that there are 1,500 relevant documents than 1,700, and more likely that there are 1,100 documents than 1,000.

[Figure: Enron prevalence projection]

The recall calculations under all three scenarios are as follows:

  • Under the most-likely-scenario using the spot projection of 1,264:
    • Monomodal (Borg) retrieval of 579 = 46% recall.
    • Multimodal retrieval of 661 = 52% recall (that’s 13% better than Monomodal (6/46)).
    • Projected relevant documents not found by best effort, Multimodal = 603.

[Figure: Enron prevalence graph, most-likely scenario]

  • Under the worst-case-scenario using the maximum count projection of 2,587:
    • Monomodal (Borg) retrieval of 579 = 22% recall.
    • Multimodal retrieval of 661 = 26% recall (that’s 18% better than Monomodal (4/22)).
    • Projected relevant documents not found by best effort, Multimodal = 1,926.
  • Best Case scenario = 955 relevant.
    • Monomodal (Borg) retrieval of 579 = 61% recall.
    • Multimodal retrieval of 661 = 69% recall (that’s 13% better than Monomodal (8/61)).
    • Projected relevant documents not found by best effort, Multimodal = 294.
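The recall figures in the list above are simple divisions, and can be reproduced with a short Python sketch. The variable and function names are mine; the counts are the ones reported in this article. The best-case total is the floor of 955, which is the union of the two sets of relevant documents (661 + 579 less the 285 coded relevant in both reviews).

```python
def recall(found, total_relevant):
    """Share of all relevant documents that a review actually retrieved."""
    return found / total_relevant


found = {"Monomodal": 579, "Multimodal": 661}
scenarios = {
    "most likely": 1_264,            # spot projection
    "worst case": 2_587,             # top of the 95% interval
    "best case": 661 + 579 - 285,    # floor: union of both reviews = 955
}

for name, total in scenarios.items():
    for method, count in found.items():
        print(f"{name:12s} {method:10s} {recall(count, total):.0%}")
    # Relevant documents projected as missed by the better effort, Multimodal.
    print(f"{name:12s} missed by Multimodal: {total - found['Multimodal']}")
```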

In summary, the prevalence projections from the three random samples suggest that the Multimodal method recalled between 26% and 69% of the total number of relevant documents, with the most likely result being 52% recall. The prevalence projections suggest that the Monomodal method recalled between 22% and 61% of the total number of relevant documents, with the most likely result being a 46% recall. The metrics thus suggest that Multimodal attained a recall level between 13% and 18% better than that attained by the Monomodal method.

Precision and F1 Comparisons 

The first Multimodal review classified 661 documents as relevant. The second review re-examined 403 of those 661 documents. The second review agreed with the relevant classification of 285 documents and disagreed with 118. Assuming that the second review was correct, and the first review incorrect, the precision rate was 71% (285/403).

When the content of these documents is examined, and the duplicate and near duplicate documents are removed from the analysis as previously explained, the Multimodal review classified 369 different unique documents as relevant. The second review re-examined 243 of those 369 documents. The second review agreed with the relevant classification of 211 documents and disagreed with 32. Assuming that the second review was correct, and the first review incorrect, the precision rate was 87% (211/243).

Conversely, if you assume the conflicting second review calls were incorrect, and the SME got it right on all of them the first time, the precision rate for the first review would be 100%. That is because all of the documents identified by the first review as relevant to the information request would in fact stand confirmed as relevant. As discussed previously, all of the disputed calls concerned ambiguous or borderline grey area documents. The classification of these documents is inherently arbitrary, to some extent, and they are easily subject to concept shift. The author takes no view as to the absolute correctness of the conflicting classifications.

The second Monomodal review classified 579 documents as relevant. The first review had examined 323 of those 579 documents. The first review agreed with the relevant classification of 285 documents and disagreed with 38. Assuming that the first review was correct, and the second review incorrect, the agreement rate on relevant classifications was 88% (285/323).

When the content of these documents is examined, and the duplicate and near duplicate documents are removed from the analysis as previously explained, the Monomodal review classified 427 different unique documents as relevant. The first review had examined 242 of those 427 documents. The first review agreed with the relevant classification of 211 documents and disagreed with 31. Assuming that the first review was correct, and the second review incorrect, the precision rate was again 87% (211/242).

Assuming the conflicting first review calls were incorrect, and the SME got it right on all of them the second time, then again the precision rate for the second review would be 100%. That is because all of the documents identified by the second review as relevant to the information request would in fact stand confirmed as relevant.

In view of the inherent ambiguity of all of the documents with conflicting coding, the measurement of precision in these two projects is of questionable value. Nevertheless, assuming that the other review's conflicting calls were always correct, when you do not account for duplicate and near duplicate documents the second Monomodal review was 24% more consistent with the first Multimodal review (88% versus 71%). However, when the duplicates and near duplicate documents are removed for a more accurate assessment, the precision rates of both reviews were almost identical at 87%.
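The four rates discussed in this section can be checked with a short Python sketch. It treats the other review's calls as the ground truth, which is the assumption stated above; the function name and dictionary labels are mine.

```python
def precision(agreed, reexamined):
    """Share of re-examined relevant calls that the other review confirmed."""
    return agreed / reexamined


rates = {
    "Multimodal, all documents": precision(285, 403),
    "Multimodal, unique documents": precision(211, 243),
    "Monomodal, all documents": precision(285, 323),
    "Monomodal, unique documents": precision(211, 242),
}

for label, rate in rates.items():
    print(f"{label:30s} {rate:.0%}")
```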

The F1 measurement is the harmonic mean of the precision and recall rates.  The formula for calculating the harmonic mean is not too difficult: 2/(1/P + 1/R) where P is precision and R is recall. Thus using the more accurate 87% precision rate for both, the harmonic mean ranges for the projects are:

  • 40% to 77% for Multimodal
  • 35% to 71% for Monomodal
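As a minimal illustration of the formula, here is the harmonic mean in Python, applied to the Multimodal worst-case and best-case recall rates with the 87% precision rate. The function name is mine. Note that results can shift by a point depending on whether rounded or unrounded recall rates are used as inputs.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: 2 / (1/P + 1/R)."""
    return 2 / (1 / precision + 1 / recall)


# Multimodal, using the rounded rates reported above.
worst = f1(0.87, 0.26)
best = f1(0.87, 0.69)
print(f"Multimodal F1 range: {worst:.0%} to {best:.0%}")
```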

The F1 measures for most-likely-scenario spot projections for both are:

  • 65% for Multimodal
  • 61% for Monomodal

In summary, since the precision rates of the two methods were nearly identical at a respectable 87%, the comparisons between the recall rates and the F1 rates are similar. The Multimodal F1 of 40% for the worst-case-scenario was 14% better than the Monomodal F1 of 35%. The Multimodal F1 of 77% for the best-case-scenario was 8% better than the Monomodal F1 of 71%. The most likely spot projection differential between 61% and 65% shows Multimodal with a 7% improvement over Monomodal.

Comparisons of Total Counts of Relevant Documents

The first review using the Multimodal method found 661 relevant documents. The second review using the Monomodal method found 579 relevant documents. This means that Multimodal found 82 more relevant documents than Monomodal. That is a 14% improvement. This is shown by the roughly proportional circles below.

[Figure: proportional circles comparing total relevant documents found]

Analysis of the content of these relevant documents showed that:

  • The set of 661 relevant documents found by the first Multimodal review contained 292 duplicate or near duplicate documents, leaving only 369 different unique documents. There were 74 duplicates or near duplicates in the 285 documents coded relevant by both Multimodal and Monomodal, and 218 duplicates in the 376 documents that were only coded relevant in the Multimodal review. (As the most extreme example, the 376 documents contained one email with the subject line Enron Announces Plans to Merge with Dynegy dated November 9, 2001, that had 54 copies.)
  • The set of 579 relevant documents found by the second Monomodal review contained 152 duplicate or near duplicate documents, leaving only 427 different unique documents. There were 74 duplicates or near duplicates in the 285 documents coded relevant by both Multimodal and Monomodal, and 78 duplicates in the 294 documents that were only coded relevant in the Monomodal review. (As the most extreme example, the 294 documents contained one email with the subject line NOTICE TO: All Current Enron Employees who Participate in the Enron Corp. Savings Plan dated January 3, 2002, that had 39 copies.)
  • Therefore when you exclude the duplicate or near duplicate documents the Monomodal method found 427 different documents and the Multimodal method found 369. This means the Monomodal method found 58 more unique relevant documents than Multimodal, an improvement of 16%. This is shown by the roughly proportional circles below.

[Figure: proportional circles comparing unique relevant documents found]

On the question of effectiveness of retrieval of relevant documents under the two methods it looks like a draw. The Multimodal method found 14% more relevant documents, and likely attained a recall level between 13% and 18% better than that attained by the Monomodal method. But after removal of duplicates and near duplicates, the Monomodal method found 16% more unique relevant documents.

This result is quite surprising to the author who had expected the Multimodal method to be far superior. The author suspects the unexpectedly good results in the second review over the first, at least from the perspective of unique relevant documents found, may derive, at least in part, from the SME’s much greater familiarity and expertise with predictive coding techniques and Inview software by the time of the second review. Also, as mentioned, some slight improvements were made to the Inview software itself just before the second review, although it was not a major upgrade. The possible recognition of some documents in the second review from the first could also have had some slight impact.

Hot Relevant Document Differential

The first review using the Multimodal method found 18 Hot documents. The second review using the Monomodal method included only 13 Hot documents. This means that Multimodal found 5 more Hot documents than Monomodal. That is a 38% improvement. This is shown by the roughly proportional circles below.

[Figure: proportional circles comparing Hot documents found]

Analysis of the content of these Hot documents showed that:

  • The set of 18 Hot documents found by the first Multimodal review contained 7 duplicate or near duplicate documents, leaving only 11 different unique documents.
  • The set of 13 Hot documents found by the second Monomodal review contained 6 duplicate or near duplicate documents, leaving only 7 different unique documents. Also, as mentioned, all 13 of the Hot documents found by Monomodal were also found by Multimodal, whereas Multimodal found 5 Hot documents that Monomodal did not.
  • Therefore when you exclude the duplicate or near duplicate documents the Multimodal method found 11 different documents and the Monomodal method found 7. This means the Multimodal method found 4 more unique Hot documents than Monomodal, an improvement of 57%. This is shown by the roughly proportional circles below.

[Figure: proportional circles comparing unique Hot documents found]
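The count comparisons in this section and the preceding one all follow the same arithmetic: subtract the duplicates and near duplicates to get unique documents, then express the difference as a percentage of the smaller count. A short Python sketch, with names of my own choosing and the counts reported above:

```python
def pct_more(larger, smaller):
    """How much larger one count is than another, relative to the smaller one."""
    return (larger - smaller) / smaller


# (total found, duplicates or near duplicates) for each review.
relevant = {"Multimodal": (661, 292), "Monomodal": (579, 152)}
hot = {"Multimodal": (18, 7), "Monomodal": (13, 6)}

unique_relevant = {m: total - dupes for m, (total, dupes) in relevant.items()}
unique_hot = {m: total - dupes for m, (total, dupes) in hot.items()}

print("unique relevant:", unique_relevant)
print("unique Hot:", unique_hot)
print(f"Multimodal, all relevant:    {pct_more(661, 579):.0%} more")
print(f"Monomodal, unique relevant:  {pct_more(427, 369):.0%} more")
print(f"Multimodal, all Hot:         {pct_more(18, 13):.0%} more")
print(f"Multimodal, unique Hot:      {pct_more(11, 7):.0%} more")
```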

Conclusion

On the question of effectiveness of retrieval of Hot documents the Multimodal method did 57% better than Monomodal. Thus, unlike the comparison of effectiveness of retrieval of relevant documents, which was a close draw, the Multimodal method was far more effective in this category. In the author’s view the ability to find Hot documents is much more important than the ability to find merely relevant documents. That is because in litigation such Hot documents have far greater probative value as evidence than merely relevant documents. They can literally make or break a case.

In other writings the author has coined the phrase Relevant is Irrelevant to summarize the argument that Hot documents are far more significant in litigation than merely relevant documents. The author contends that the focus of legal search should always be on retrieval of Hot documents, not relevant documents. Losey, R. Secrets of Search – Part III (2011) (the 4th secret). This is based in part on the well-known rule of 7 +/- 2 that is often relied upon by trial lawyers and psychologists alike as a limit to memory and persuasion. Id. (the 5th and final secret of search).

To summarize, this study suggests that the hybrid multimodal search method, one that uses a variety of search methods to train the predictive coding classifier, is significantly more effective (57%) at finding highly relevant documents than the hybrid monomodal method. When comparing the effectiveness of retrieval of merely relevant documents the two methods did, however, perform about the same. Still, the edge in performance must again go to Multimodal because of the 7% to 14% better projected F1 measures.


A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents

June 11, 2013

This is my first report comparing two different searches of 699,082 Enron documents that I performed in 2012 and 2013. I understand this may be the first study of the outcomes of two searches of a large dataset by a single reviewer. It is certainly the first such study concerning legal search. The inconsistencies between the two reviews are, I am told, of scientific interest. The fact that I used two different predictive coding methods in my experiment is also of some interest. This blog is a draft of what I hope will become a formal technical article. My private thanks to those scientists and other experts who have already provided criticisms, suggestions and encouragement of this study.

This draft report sets forth the metrics of both reviews in detail and provides a preliminary analysis of the consistencies and inconsistencies of the document classifications. I conclude with my opinion of the legal implications of these findings on the current debate over disclosure of irrelevant documents used in machine training. In a future blog I will provide a preliminary analysis of the comparative effectiveness of the two methods used in the reviews.

I welcome peer reviews, criticisms, and suggestions from scientists and academics with an interest in this study. I also welcome dialogue with attorneys concerning the legal implications of these new findings. Private comments may be made by email to Ralph.Losey@gmail.com and public comments in the comment section at the end of this article.

Objective Report of the Two Reviews

The 699,082 Enron dataset reviewed is the EDRM derived version of emails and attachments. It was processed and hosted by Kroll Ontrack on two different accounts. Both reviews used Kroll Ontrack’s Inview software, although the second review used a slightly upgraded version. Both reviews had the same goal to find all documents related to involuntary employee termination, not voluntary. A simple classification scheme was used where all documents were either coded as irrelevant, relevant, or relevant and hot (highly relevant).

The review work was performed by a single subject matter expert (SME) on employee termination, namely the author, a partner in the Jackson Lewis law firm, which specializes in employment litigation. The author is in charge of the firm’s electronic discovery and has thirty-three years of experience with legal document reviews.

The first review was done in May and June 2012 over eight days. The second was done in January and February 2013 over approximately twelve days. Both reviews were done solo by the same SME without outside help or assistance. In both reviews the SME expended a total of approximately 52 hours on each project, for a total of 104 hours. That was 52 hours of review and analysis time, but did not include time to write-up the search reports or wait on computer processing.

The original purpose of the first review was to improve the author’s familiarity with the predictive coding features of Inview and provide a narrative for instructional purposes of his use of the bottom line driven hybrid multimodal approach to review that he endorses. The author prepared a detailed narrative describing this first review project published on his e-Discovery Team blog. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron (2012).

The purpose of the second review was to perform an experiment to evaluate the impact of using a different methodology to do the same review. In the second review the author used a bottom line driven hybrid monomodal approach. A series of videos and blogs describing the review have also been published on the author’s blog. Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (2013). The video reports include satirical segments based on the Star Trek Borg villains to try to convey the boring, stifling qualities of the Monomodal review method.

The review method used in the first review is called Multimodal in this report, and the second method is called Monomodal.  A nickname is also sometimes used for the second approach, where it is called the Borg method, more specifically the Hybrid Enlightened Borg approach. Losey, R., Three-Cylinder Multimodal Approach To Predictive Coding. The author does not endorse the Monomodal method, but wanted to know how effective it was compared to the Multimodal method. The author has discussed these two methods of search and review at length in many articles. See CAR page of e-Discovery Team blog for a complete listing.

Since these contrasting review methods are described in detail elsewhere only a simple summary is provided now. The two methods both use predictive coding analysis and document ranking, and both use human (SME) judgment to select documents for training in an active machine learning process. The primary difference is that the Monomodal method only uses the predictive coding search and review techniques, whereas the Multimodal used predictive coding methods, plus a variety of other search methods, including especially keyword search and similarity search. The Multimodal method used multiple modes of search to find training documents for active machine learning.

Attempt to Emulate Two Separate Reviews

An attempt was made to keep each of the reviews as separate and independent as possible. The goal was to keep the SME’s memory of coding a document one way in the first review from influencing his coding of the same document in the second review. For that reason, and others, the SME never consulted the classifications made in the first review as part of the second. In fact, the Kroll Ontrack review platform for the first review project was never opened after the first project completed until just recently to make this comparative analysis. Further, the SME intentionally did not review notes of the first project to try to refresh his memory for the second. To the contrary, the SME tried as far as possible to forget his first reviews and approach the second project as a tabula rasa. That is one reason there was a seven-month delay between the two reviews.

In general the SME self-reports a good but not exceptional memory for document recollection. Moreover, in the seven-month interim between the two reviews (May-June 2012 to January-February 2013), the SME had done many other review projects. He had literally read tens of thousands of other documents during that time period, none of which were part of this Enron database.

For those reasons the SME self-reports that his attempt to start afresh was largely successful. He did not recognize most of the documents he saw for the second time in the second review, but he did recognize a few, and recalled his prior classifications for some of them. It was not possible for him to completely forget all of the classifications he had made in the first review during the course of the second review. The ones he recognized tended to be the more memorable documents (such as the irrelevant photos of naked women that he stumbled upon in both reviews, and the Ken Lay emails). He did recall those documents and his previous classifications of those documents. But this involved a very small number of documents. The SME estimates that he recognized fewer than 100 unique documents (not including duplicates and near duplicates of the same documents, of which there are many in the 699,082 EDRM Enron dataset).

Also, the SME recognized between 10 and 20 grey area type documents where the relevancy determinations were difficult and to a certain extent arbitrary. He knew that he had seen them before, but could not recall how he had previously coded these documents. As mentioned, the SME made no effort to do so. His analysis and internal debate on these and all other documents reviewed concerned whether they were relevant, or not. The classifications were made entirely anew on all documents, especially including these ambiguous documents, rather than trying to rely on the SME’s uncertain memory of how they were previously classified.

Caveats and Speculations

In spite of these efforts to emulate two separate reviews, the recollection of the SME on some documents should be taken into consideration and the metrics on inconsistent reviews taken as a floor. If there had been a longer delay in time between the two reviews, say two years instead of seven months, it is reasonable to assume the inconsistencies would increase. The author would, however, expect any such increase to be relatively minor.

It is also important to note the SME’s impression (admittedly subjective, but based on over thirty years of experience with document review and relevancy determinations), that if he had studied his prior reviews before beginning the second review, and if he had otherwise taken some minimal efforts to refresh his memory, then he would have significantly reduced the number of inconsistencies. Further, the author believes that a shorter delay in time between the reviews (for instance, 10 days instead of seven months) would also have lessened the inconsistency rate with no additional efforts on his part.

The imposition of quality control procedures designed for consistencies between the two reviews would, in the author’s view, have drastically reduced the inconsistency rate. Again, any such procedures were intentionally omitted here to try to emulate, as far as possible, two completely separate and independent reviews.

Summary of Metrics of the Two Reviews

In the first review, which used the Multimodal method, 146,714 documents were coded as follows:

  • 1,507 random sample generated at the beginning of the project, plus
  • 1,605 from null-set random sample at the end of the project, plus
  • 1,000 machine selected from five rounds of training (sub-total 4,112), plus
  • 142,602 human judgmental selected.

The coding classifications were 661 relevant and 146,053 irrelevant. (This count does not include the approximate 30 additional relevant documents found in the post-hoc analysis of the project.)

This 661 total includes 18 documents considered Highly Relevant or Hot.

Further, it should be noted that the remaining 552,368 documents (699,082-146,714) not classified were considered by the SME to be irrelevant due to low predictive coding ranking and other reasons. They were treated as irrelevant even though not classified by the SME through bulk coding.

Of the 146,714 total documents categorized only approximately 2,500 were actually read and individually reviewed by the SME. This study of inconsistent classifications only considers these documents. (Note that only 1,981 documents are recorded in Inview software as reviewed, but the SME read many documents without categorization and used bulk coding instead, and thus this count is artificially low. The SME believes the upward adjustment to 2,500 is approximately correct). This means the SME categorized or coded 144,214 documents by using Inview software’s mass categorization features, which allows for categorization without actually reviewing each individual document. This is common for duplicative documents or document types.

In the first review of the 661 documents classified as relevant only 333 were specified for training. The 333 training documents include all documents identified as relevant in the machine selected document sets. The 328 documents classified as relevant by the SME but not specified for training were all derived from the 142,602 human judgmental selected documents. These documents were not specified for use as training documents because the SME thought it might skew or confuse the machine learning to include them. Documents were sometimes excluded because of unusual traits and characteristics of the relevant document, or to avoid excessive weighting of particular document types that might bias the training, for example, where other duplicates or near duplicates had already been used several times for training. (As mentioned, this was one of my first predictive coding projects, and I am not sure this strategy of mass withholding of documents from training to mitigate against bias was correct. If I had a do-over I would probably train on more documents and trust the software more to sort it out.) Some documents specified for training by the SME were not in fact used for training, but were instead only used by the Inview software as part of the initial control set for testing purposes. Documents in a control set for testing purposes are not also used for machine training. Only 1 of the 333 relevant documents here specified for training by the SME in the first review was so removed from training and instead used in the control set.

In the first review of the 146,053 documents classified as irrelevant only 2,586 were specified for training. The 2,586 training documents include all documents identified as irrelevant in the machine selected document sets. The 143,467 documents classified as irrelevant by the SME but not specified for training were all derived from the 142,602 human judgmental selected documents or the random samples. These documents were not specified for use as training documents because the SME thought it might skew or confuse the machine learning to include them. Documents were sometimes excluded because of unusual traits and characteristics of the irrelevant document, or to avoid excessive weighting of particular document types that might bias the training. For example, where other duplicates or near duplicates had already been used several times for training. 1,063 of the 2,586 documents specified for training by the SME were not, in fact, used for training by the Inview software. They were instead used by the Inview software as part of the initial control set for testing purposes. Therefore after removal of the control set of 1,063 irrelevant documents used for testing, only 1,523 irrelevant documents were used for machine training.

In the second review, which used the Monomodal method, 48,959 documents were coded as follows:

  • 10,000 machine selected, not random, with exactly 200 documents in each of the 50 rounds, plus
  • 2,366 random selected by two 1,183 random samples, one at the beginning and another at the end of the project, plus
  • 36,593 human judgmental selected.

The coding classifications were 579 documents relevant and 48,380 irrelevant.

This 579 total includes 13 documents considered Highly Relevant or Hot.

Again, it should be noted that the remaining 650,123 documents (699,082-48,959) were considered by the SME to be irrelevant due to low predictive coding ranking and other reasons. They were treated as irrelevant even though not classified by the SME through bulk coding.

Of the 48,959 total documents categorized only approximately 12,000 were actually read and individually reviewed by the SME. This study of inconsistent classifications only considers these documents.  (Again note this is a best estimate as explained above. Inview records 11,601 as physically reviewed.) This means the SME categorized or coded 36,959 documents by using Inview software’s mass categorization features.
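For readers who want to check the tallies in this summary, here is a minimal sketch in Python. The counts are taken from the text above; the dictionary layout and the function name `summarize` are illustrative assumptions and have nothing to do with how Inview reports its data.

```python
# Counts come from the summary above; the structure is an illustrative assumption.
TOTAL_COLLECTION = 699_082

reviews = {
    "Multimodal": {
        # random sample, null-set sample, machine selected, human judgmental
        "sources": [1_507, 1_605, 1_000, 142_602],
        "relevant": 661,
        "irrelevant": 146_053,
        "read_estimate": 2_500,   # SME's estimate of documents actually read
    },
    "Monomodal": {
        # machine selected, two random samples, human judgmental
        "sources": [10_000, 2_366, 36_593],
        "relevant": 579,
        "irrelevant": 48_380,
        "read_estimate": 12_000,
    },
}

def summarize(review):
    """Derive coded, uncoded, and bulk coded totals for one review."""
    coded = sum(review["sources"])
    # Every coded document is either relevant or irrelevant.
    assert coded == review["relevant"] + review["irrelevant"]
    return {
        "coded": coded,
        "uncoded": TOTAL_COLLECTION - coded,
        "bulk_coded": coded - review["read_estimate"],
    }

summaries = {name: summarize(review) for name, review in reviews.items()}
for name, summary in summaries.items():
    print(name, summary)
```

The bulk coded figures depend on the SME's estimates of 2,500 and 12,000 documents read, so they are only as precise as those estimates.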

The first Multimodal review identified 18 highly relevant or Hot Documents. The second Monomodal review found only 13 of these 18 Hot documents. No Hot documents were found in the Monomodal review that had not been found in the Multimodal review. Five Hot documents were found in the Multimodal Review that were not also found in the Monomodal review. All were individually reviewed by the SME in both projects.

The Monomodal review thus found only 72% of the Hot documents found by the earlier Multimodal review. Put another way, the Multimodal method did 38% better in finding the total Hot documents than Monomodal.

In the second review of the 579 documents classified as relevant only 577 were specified for training. The 2 documents categorized as relevant and not specified for training were from the 36,593 human judgmental selected documents for the same reasons mentioned in the first review. Further, 1 relevant document specified for training was not in fact used to train the system, but was instead used by the Inview software as part of the control set. Therefore only 576 relevant documents were used for machine training.

In the second review of the 48,380 documents classified as irrelevant only 10,948 were used for training. All 10,000 documents identified as irrelevant in the machine selected document sets were used for training. The 37,432 documents classified as irrelevant by the SME and not used for training were all derived from the 36,593 human judgmental selected documents and the random samples. These documents were not used for training for the reasons previously described, but primarily to avoid confusing cumulative training that might bias the training. In addition, of the 10,948 irrelevant documents specified for training, 1,063 were diverted by the Inview software for use in the control set, and thus used for testing and not machine training. Therefore only 9,885 irrelevant documents were used for machine training.

A comparison of the relevant documents found by each method showed the following:

  • The 661 relevant found by Multimodal included 376 documents not found in Monomodal, which means 57% were unique. The 661 relevant included 18 Hot documents, 5 of which were not found by Monomodal, which means 28% were unique.
  • The 579 relevant found by Monomodal included 294 documents not found in Multimodal, which means 51% were unique. The 579 relevant included 13 Hot documents, all of which were also found by Multimodal, which means 0% were unique.
  • There were a total of 955 relevant documents found by one or both of the Multimodal and Monomodal methods.
  • There were 285 relevant documents found by both the Multimodal and Monomodal methods, which is 30% of the total 955 found.

The comparisons between the two reviews of relevant document classifications are shown in the Venn diagram below.

two_methods_compare copy

The 285 relevant documents found in both reviews represent an Overlap or Jaccard index of 29.8% (285/(376+579)). Ellen M. Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt 697, 700 (2000) (“Overlap is defined as the size of the intersection of the relevant document sets divided by the size of the union of the relevant document sets.”); Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011), pgs 10-11.
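The Overlap calculation can be sketched in a few lines of Python. The function name `jaccard_from_counts` is my own illustrative choice; the three counts are the ones reported above.

```python
def jaccard_from_counts(only_a: int, only_b: int, both: int) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    union = only_a + only_b + both
    return both / union

only_multimodal = 376   # relevant in Multimodal only
only_monomodal = 294    # relevant in Monomodal only
both = 285              # relevant in both reviews

union = only_multimodal + only_monomodal + both
overlap = jaccard_from_counts(only_multimodal, only_monomodal, both)

print(union)               # 955
print(f"{overlap:.1%}")    # 29.8%

# Share of each review's relevant documents that the other review did not find.
print(f"{only_multimodal / (only_multimodal + both):.0%}")  # 57%
print(f"{only_monomodal / (only_monomodal + both):.0%}")    # 51%
```

Note that the union of 955 can be reached either as 376 + 294 + 285 or as 376 + 579, since the 579 Monomodal relevant documents already include the 285 found by both.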

32 Different Documents Reviewed and Coded Relevant by Multimodal and Irrelevant by Monomodal

A study was made for this report of the content of the 376 documents that were only marked as relevant in the Multimodal review performed in March 2012, and not marked as relevant in the later January 2013 Monomodal (Borg) review. The Inview software shows that the SME had in fact individually reviewed 118 of these 376 documents in the second Monomodal review and determined them to be irrelevant.

A study of the content of these 118 documents shows that 86 of the 118 documents were duplicates, or near duplicates, leaving a total of 32 unique documents with inconsistent SME review classifications. When the SME found or was presented these same 32 documents in the earlier March 2012 Multimodal review he had marked them as relevant.

This is evidence of inconsistent reviews by the SME showing concept drift. All of these documents were grey-area types where the SME changed his view of relevance to be more constrictive. The SME had narrowed his concept of relevance.

A study of these 32 documents shows that there were no obvious errors made in the coding. It is therefore reasonable to attribute all of the inconsistent classifications to concept shift on these documents, not pure human error, such as where the SME intended to mark a document relevant, but accidentally clicked on the irrelevant coding button instead. (This kind of error did happen in the course of the review but quality control efforts easily detected these errors.)

31 Different Documents Reviewed and Coded Relevant by Monomodal and Irrelevant by Multimodal

A study was also made for this report of the content of the 294 documents that were only marked as relevant in the Monomodal (Borg) review performed in January 2013, and not marked as relevant in the earlier March 2012 Multimodal review. The Inview software shows that the SME had in fact individually reviewed 38 of these 294 documents in the first Multimodal review and determined them to be irrelevant.

A study of the content of these 38 documents shows that 7 of the 38 documents were duplicates, or near duplicates, leaving a total of 31 unique documents with inconsistent SME review classifications. When the SME found or was presented with these same 31 documents in the later January 2013 Monomodal review he had marked them as relevant.

This is again evidence of inconsistent reviews by the SME showing concept drift. All of these documents were grey area types where the SME changed his view of relevance, but this time to be more inclusive. The SME had expanded his concept of relevance.

A study of these 31 documents shows that there were no obvious errors made in the coding.  It is therefore reasonable to attribute all of the inconsistent classifications to concept shift on these documents, not pure human error.

211 Different Documents Reviewed and Coded Relevant by Both Multimodal and Monomodal 

A study was also made for this report of the content of the 285 relevant documents found by both the Multimodal and Monomodal methods. In both projects all documents coded as relevant by the SME had been individually reviewed by him before final classification. A study of the content of these 285 documents shows that 74 of them were duplicates, or near duplicates, leaving a total of 211 unique documents with consistent SME review classifications. When the SME found or was presented with these same 211 documents in both projects he had marked them as relevant.

274 Different Documents Reviewed and Coded Relevant by One or Both Methods

To summarize the prior unique total relevant document counts, after removal of all duplicates or near duplicates there were a total of 274 different documents coded relevant by one or both methods. This compares to the earlier 955 total relevant document count before deduplication.

11 Different Documents Reviewed and Coded as Hot By Both Multimodal and Monomodal

A study was also made of the content of the 18 documents coded as Hot. In both projects all documents coded as Hot by the SME had been individually reviewed by him before final classification. A study of the content of these 18 documents shows that 7 of them were duplicates, or near duplicates, leaving a total of 11 unique documents. There was only 1 duplicate in the 5 Hot documents that the Multimodal review located and the Monomodal review did not. There were 6 more duplicates found in the 13 other Hot documents discovered in both reviews. Therefore, after removing a total of 7 duplicate documents there were a total of 11 unique Hot documents. (These 11 unique Hot documents are also included within the total 274 unique Relevant documents count.) Monomodal found 7 and missed 4. Multimodal found all 11. Monomodal review thus missed 36% of the Hot documents. Put another way, the Multimodal method did 57% better in finding the unique Hot documents than Monomodal.

This differential between the unique Hot documents discovered in both reviews is shown in this Venn diagram. The Jaccard Index for Hot document classification was 64% (7/(7+4)).

hot_Vinn_unique
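The Hot document percentages, both before and after deduplication, can be reproduced with the short Python sketch below. The function `hot_comparison` is a hypothetical helper written for this illustration. It relies on the fact, reported above, that Monomodal found no Hot documents that Multimodal missed, so the Multimodal total is also the union of the two reviews.

```python
def hot_comparison(found_by_both: int, multimodal_only: int) -> dict:
    """Compare Hot counts when Monomodal found nothing that Multimodal missed."""
    multimodal_total = found_by_both + multimodal_only  # also the union
    return {
        "monomodal_share": found_by_both / multimodal_total,
        "monomodal_missed": multimodal_only / multimodal_total,
        "multimodal_advantage": multimodal_only / found_by_both,
        "jaccard": found_by_both / multimodal_total,
    }

raw = hot_comparison(found_by_both=13, multimodal_only=5)      # 18 Hot documents
unique = hot_comparison(found_by_both=7, multimodal_only=4)    # 11 unique Hot documents

for label, result in (("raw", raw), ("deduplicated", unique)):
    print(label, {name: f"{value:.0%}" for name, value in result.items()})
```

With the raw counts Monomodal found 72% of the Hot documents and Multimodal did 38% better. With the deduplicated counts Monomodal found 64%, missed 36%, and Multimodal did 57% better.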

Documents Categorized as Irrelevant by Both Multimodal and Monomodal 

The Multimodal method review categorized 146,053 documents as irrelevant. Of that total, 1,517 were categorized after review of each document, and 144,536 were bulk coded without the SME reviewing each individual document.

The Monomodal method review categorized 48,380 documents as irrelevant. Of that total, 11,083 were categorized after review of each document, and 37,297 were bulk coded without the SME reviewing each individual document.

The Agreement in coding the same documents irrelevant in both reviews was 31,109.

Of the 31,109 total documents categorized as irrelevant in both projects only approximately 3,000 were actually read and individually reviewed by the SME in both projects. This study of inconsistent classifications only considers these documents. (Note that only 2,500 documents are recorded in Inview software as reviewed, but the SME read many documents without categorization and used bulk coding instead, and thus this count is artificially low. The SME believes a 500 document upward adjustment to 3,000 is approximately correct.) This means the SME categorized or coded 28,109 of the 31,109 overlapping irrelevant documents by using Inview software’s mass categorization features.

Concept Drift Analysis

First, it is interesting to see that the change in concept drift from the first project to the second was approximately equal in both directions. Although the total counts were different due to duplicate documents, the SME changed his opinion in the second review from irrelevant to relevant on 31 different documents, and from relevant to irrelevant on 32 different documents.

The overall metrics of inconsistent coding of 274 unique relevant documents are as follows:

  • 211 different documents were coded relevant consistently in both reviews;
  • An additional 63 different documents were coded inconsistently, of which,
    • 49% (31) were first coded irrelevant in Multimodal and then coded relevant in Monomodal (false positives).
    • 51% (32) were first coded relevant in Multimodal and then coded irrelevant in Monomodal (false negatives).

An inconsistency of coding of 63 out of 274 relevant documents represents an inconsistency rate of 23%. Put another way, the coding on documents determined to be relevant was consistent 77% of the time. Again, this latter calculation is known as the Jaccard measure. See Voorhees’ Variations, supra, and Grossman & Cormack, Technology Assisted Review, supra. Also see William Webber, How accurate can manual review be? Again, the Jaccard index is formally defined as the size of the intersection, here 211, divided by the size of the union of the sample sets, here 274 (211+32+31). Therefore the Jaccard index for the individual review of relevant documents in the two projects is 77% (211/274). This is shown by the Venn diagram below.

Unique_Docs_Venn
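The concept drift metrics can be worked through as follows. This is a minimal Python sketch using the three deduplicated counts reported above; the variable names are mine.

```python
consistent = 211                  # coded relevant in both reviews
relevant_then_irrelevant = 32     # SME narrowed his view of relevance
irrelevant_then_relevant = 31     # SME expanded his view of relevance

inconsistent = relevant_then_irrelevant + irrelevant_then_relevant
union = consistent + inconsistent

inconsistency_rate = inconsistent / union
jaccard = consistent / union
narrowed_share = relevant_then_irrelevant / inconsistent
expanded_share = irrelevant_then_relevant / inconsistent

print(inconsistent, union)              # 63 274
print(f"{inconsistency_rate:.0%}")      # 23%
print(f"{jaccard:.0%}")                 # 77%
print(f"{narrowed_share:.0%}")          # 51%
print(f"{expanded_share:.0%}")          # 49%
```

The near even split between the two directions of change is what makes the drift look symmetrical rather than a steady narrowing or widening of the SME's concept of relevance.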

Several prior studies have been made of reviews for relevant documents that employed the Jaccard measure. The best known is the study of Ellen Voorhees that analyzed agreement among professional analysts (SMEs) in the course of a TREC study. It was found that two SMEs (retired intelligence officers) agreed on responsiveness on only 45% of the documents. When three SMEs were considered they agreed on only about 30% of the documents. Voorhees, Variations, supra. Also see: Grossman & Cormack, Technology Assisted Review, supra. It appears from the Voorhees report that the SMEs in this study were examining different documents that did not include duplicates. For that reason the Jaccard measure of different documents in the instant study of 77% would be the appropriate comparison, not the measure of 30% when duplicate documents were included.

A more recent study of a legal project using contract lawyers had Jaccard measures of 16% between the first review and follow-up reviews based on samples of the first. Roitblat, Kershaw, and Oot (2010, Journal of the American Society for Information Science and Technology). The Jaccard index numbers were extrapolated by Grossman and Cormack in Technology Assisted Review, supra at pgs. 13-14. Also see Grossman Cormack Glossary, Ver. 1.3 (2012) that defines the Jaccard index and goes on to state that expert reviewers commonly achieve Jaccard Index scores of about 50%, and scores exceeding 60% are very rare.

Analysis of Agreement in Coding Irrelevant Documents

The author is aware that comparisons of coding of irrelevant documents are not typically considered important in information retrieval studies for a variety of reasons, including the different prevalence rates in review projects. For that reason studies typically include the Jaccard measure only for comparison of relevant classifications. Still, in view of the legal debate concerning the disclosure of irrelevant documents, this paper includes a brief examination of the total Agreement rates, including irrelevancy determinations. Further, Agreement rates are interesting and appropriate here since both studies consider a review of the exact same Enron dataset of 699,082 documents, and thus the same prevalence, and they are not relying on random samples, but on two full reviews.

The high Agreement rates on irrelevant classifications in the two reviews are of special significance in the author’s opinion because of the current debate in the legal community concerning procedures for predictive coding review. Several courts have already adopted the position that all relevant and all irrelevant documents used in training should be disclosed to a requesting party, even though the legal rules of procedure only require disclosure of relevant documents. Da Silva Moore et al. v. Publicis Groupe SA, 2012 WL 607412 (S.D.N.Y. Feb. 24, 2012) (Peck, M.J.), aff’d, 2012 WL 1446534 (S.D.N.Y. April 26, 2012) (Carter, J.); Global Aerospace Inc., et al. v. Landow Aviation, L.P., et al., 2012 WL 1431215 (Va. Cir. Ct. April 23, 2012); In re Actos (Pioglitazone) Products, MDL No. 6-11-md-2299 (W.D. La. July 27, 2012). Many attorneys and litigants take the contrary position that irrelevant documents should never be disclosed, even in the context of active machine learning. See Solomon, R., Are Corporations Ready To Be Transparent And Share Irrelevant Documents With Opposing Counsel To Obtain Substantial Cost Savings Through The Use of Predictive Coding, Metropolitan Corporate Counsel 20:11 (Nov. 2012).

Although the author has been flexible on this issue in some cases, before these results were studied the author had been advocating a do-not-disclose irrelevant documents position. Losey, R., Keywords and Search Methods Should Be Disclosed, But Not Irrelevant Documents (May 26, 2013). The author now contends that the Agreement and Jaccard index data shown in this study support a compromise position where limited disclosure may sometimes be appropriate, but only of borderline documents where irrelevancy is uncertain or likely subject to debate.

In the author’s opinion the inclusion of analysis of irrelevant coding by the SME in these two reviews allows for a more complete analysis and understanding of the types of documents and document classifications that cause inconsistent reviews. Again, to do this fairly the universe of classifications has been limited to those where the SME actually reviewed the documents, and also duplicate document counts have been eliminated. This seems to be the best measure to provide a clear indication of the types of documents that are inconsistently coded.

The inclusion of all review determinations in a consistency analysis, not just review decisions where a document is classified as relevant, provides critical information to understand the reasonability of disclosure positions in litigation. This is discussed in the conclusions below. This also seems appropriate when analyzing active machine learning where the training on irrelevance is just as important as the training on relevance.

In both projects the SME coded 31,109 identical unique documents as irrelevant. Of the 31,109 total overlapping documents coded, the SME actually read and reviewed approximately 3,000 of these documents and bulk coded the rest (28,109).

Thus in both projects the SME read and individually reviewed 3,274 unique documents: 3,000 documents were marked irrelevant and 274 marked relevant. This is shown in the Venn diagram below. Of the 3,274 identical documents reviewed there were only 63 inconsistencies. This represents an overall inconsistency error rate of 1.9%. Thus the Agreement rate for review of both relevant and irrelevant documents is 98.1% (3,211/3,274).
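To see why the Agreement rate is so much higher than the Jaccard measure on the same documents, here is a rough Python sketch. It assumes that Agreement means the consistently coded documents (relevant in both reviews plus irrelevant in both) divided by all documents read in both projects. The function name is hypothetical.

```python
def agreement_and_jaccard(both_relevant: int, both_irrelevant: int, disagreements: int):
    """Agreement counts matching irrelevant calls; Jaccard ignores them."""
    total = both_relevant + both_irrelevant + disagreements
    agreement = (both_relevant + both_irrelevant) / total
    jaccard = both_relevant / (both_relevant + disagreements)
    return total, agreement, jaccard

total, agreement, jaccard = agreement_and_jaccard(
    both_relevant=211,      # unique documents coded relevant in both reviews
    both_irrelevant=3_000,  # SME's estimate of irrelevant documents read in both
    disagreements=63,       # unique documents coded inconsistently
)

print(total)                    # 3274
print(f"{agreement:.1%}")       # 98.1%
print(f"{1 - agreement:.1%}")   # 1.9%
print(f"{jaccard:.0%}")         # 77%
```

The same 63 disagreements produce a 23% inconsistency rate when measured against the 274 relevant documents, but only 1.9% when the roughly 3,000 consistently irrelevant documents are included in the denominator.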

Inconsistency_compare

Conclusions Regarding Inconsistent Reviews

These results suggest that when only one human reviewer is involved, and that reviewer is a highly motivated SME, the overall consistency rates in review are much higher than when multiple non-SME reviewers of questionable motivation are involved (contract reviewers) (77% v. 16%), or multiple SMEs of unknown motivation and knowledge (retired intelligence officers in the Voorhees study) (77% v. 45% with two SMEs, and 30% with three SMEs). These comparisons are shown visually in this graph.

Review_Consistency_Rates

These results also suggest that with one SME reviewer the classification of irrelevant documents is nearly uniform (98%+ Agreement), and that the inconsistencies primarily lie in relevant categorizations (77% Jaccard) of borderline relevant documents. (A caveat should be made that this observation is based on unfiltered data, and not a keyword collection or data otherwise distorted with artificially high prevalence rates.)

The 77% Jaccard measure is consistent with the test reported by Grossman and Cormack of an SME (Topic Authority in TREC language) reviewing her own prior adjudications of ten documents and disagreeing with herself on three of the ten classifications, and classifying another two as borderline. Grossman & Cormack, Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012) at pgs. 17-20.

The overall Agreement rate of 98% of all relevancy determinations, including irrelevant classifications where almost all classifications are easy and obvious, strongly suggests that the very low Jaccard index rates measured in previous studies of 16% to 45% were more likely caused by human error, not document relevance ambiguity or genuine disagreement on the scope of relevance. A secondary explanation for the low scores is lack of significant subject matter expertise such that the reviewers were not capable of recognizing a clearly relevant document. Even if you only consider the determinations of relevancy, and exclude determinations of irrelevancy, the 77% Jaccard index is still significantly greater than the prior 16% to 45% consistency rates.

The findings in this study thus generally support the conclusions of Cormack and Grossman that most inconsistencies in document classifications are due to human error, not the presence of borderline documents or the inherent ambiguity of all relevancy determinations. Id. Of the 3,274 different documents the SME read in both projects in the instant study only 63 were seen to be borderline grey area types, which is less than 2%. There are certainly more grey area relevant documents than that in the 3,274 documents reviewed (excluding the duplication and near duplication issue), but they did not come to the author’s attention in this post-hoc analysis because the SME was consistent in review of these other borderline documents. Still, the findings in this study support the conclusions of Grossman and Cormack that only approximately 5% of documents in a typical unfiltered predictive coding review project are of a borderline grey area type.

The findings and conclusions support the use of SMEs with in-depth knowledge of the legal subject, and the use of as few SMEs to do the review as possible. The study also strongly suggests that the greatest consistency in document review arises from the use of one SME only.

These findings and conclusions also reinforce the need for strong quality control measures in large reviews where multiple reviewers must be used, especially when the reviewers are relatively low-paid, non-SMEs.

The inconsistencies shown in this study of determinations of relevance, excluding the classifications of irrelevant, were relatively small: the consistency rate was 77%, as compared to 45%, 30% and 16% in prior studies. Moreover, as mentioned, they were all derived from grey area or borderline type documents, where relevancy was a matter of interpretation. In the author’s experience documents such as this tend to have low probative value. If they were significant to litigation discovery, then they usually would not be of a grey area, subjective type. They would instead be obviously relevant. The author says usually because he has seen rare exceptions, typically in situations where one borderline document leads to other documents with strong probative value. Still, this is unusual. In most situations the omission of borderline ambiguous documents, and others like them, would have little or no impact on the case.

These observations, especially the high consistency of irrelevance classifications (98%+), support the strict limitation of disclosure of irrelevant documents as part of a cooperative litigation discovery process. Instead, only documents that a reviewer knows are of a grey area type or likely to be subject to debate should be disclosed. (The SME in this study was personally aware of the ambiguous type grey area documents when originally classifying these documents. They were obvious because it was difficult to decide if they were within the border of relevance, or not. The ambiguity would trigger an internal debate where a close question decision would ultimately be made.)

Even when limiting disclosure of irrelevant documents to those that are known to be borderline, disclosure of the actual documents themselves may frequently not be necessary. A summary of the documents with explanation of the rationale as to the ultimate determination of irrelevance should often suffice. The disclosure of a description of the borderline documents will at least begin a relevancy dialogue with the requesting party. Only if the abstract debate fails to reach agreement would disclosure of the actual documents be required. Even then it could be done in camera to a neutral third-party, such as a judge or special master. Alternatively, disclosure could be made with additional confidentiality restrictions pending a ruling by the court.

I am interested in what conclusions others may draw from these metrics regarding concept drift from one review project to the next, inconsistencies of single human reviewers, and other issues here discussed. The author welcomes public and private comments. Private comments may be made by email to Ralph.Losey@gmail.com and public remarks in the comment section below. Marketing type comments will be deleted.


Guest Blog: Quick Peek at the Math Behind the Black Box of Predictive Coding

June 3, 2013

Editor’s Introduction: This guest blog by Jason R. Baron and Jesse B. Freeman is a republication of the original paper they submitted for the DESI V workshop in Rome on June 14, 2013. Their paper is entitled Cooperation, Transparency, and the Rise of Support Vector Machines in E-Discovery: Issues Raised by the Need to Classify Documents as Either Responsive or Nonresponsive. I urge everyone to read the Baron and Freeman article because it is the first I have ever seen to introduce legal readers to the higher geometric concepts that underlie predictive coding. The background on predictive coding law and ethics is good too.

The stated goal of the DESI workshop for which this paper was prepared, a workshop that Jason helped organize and promote, is to focus on best practices and standards for using predictive coding, machine learning, and other advanced search and review methods in e-discovery. The e-Discovery Team is pleased to publish this paper to ensure the widest possible readership. The original article in PDF form may also be found at the DESI V papers webpage. All of the submissions for this workshop are worthy of the attention of serious students of legal search. DESI stands for Discovery of Electronically Stored Information, and, as you might expect, DESI V is the fifth international workshop. The first workshop, DESI I, was in 2007 in Palo Alto. DESI II was in London in 2008. DESI III was in Barcelona in 2009. DESI IV was in Pittsburgh in 2011.

Jason is not the mathematician behind Cooperation, Transparency, and the Rise of Support Vector Machines in E-Discovery, but provides the legal content and background on cooperation and transparency in typical, clear Baron fashion. The math, vector, and coordinate geometric explanations in the paper come from Jason’s intern, Jesse Freeman, a college student shown right. With Jason’s help, Jesse does an excellent job explaining the concepts of Support Vector Machines, which, as the article explains, provide the basic structure for supervised machine learning.

In this fairly easy half-hour read you will not only get a good summary of the law, but will get a look at the black box behind predictive coding type software. You will see that it is not really as dense and impenetrable as you might have imagined. Although on another level it may be much more far-out than you expected, as my introduction will now try to explain.

To understand the geometric concepts in the article all that you need to do is stretch your mind to include more than the three dimensions that make up your everyday world: the line, plane and solid. The first extra spatial dimension beyond the usual three Cartesian coordinates is the hardest to imagine. After that, it is not too hard to imagine more. In fact, there is no limit to the number of spatial dimensions you can process algebraically, although, unless you are a mystic or quantum physicist, you may well have difficulty visualizing or intuiting infinite dimensions.

By the way, it is no fair to just add time and call that a fourth dimension like Einstein did. We are talking imaginary abstract spatial dimensions here, not space-time reality. It is also no fair to just add a point and call that a fourth dimension like Euclid and Pythagoras did. A point, the infinite succession of which makes up a line, is the zero dimension, which again, is not really a spatial dimension at all. It is a placeholder, much like the present in the space-time continuum. A point is also a placeholder for a document on a multidimensional grid, but you will have to read the article, or already know something about SVMs, to understand that reference.

The Baron and Freeman article states that visualizing these hyper dimensions is not important, that what is important is to have an intuition that hyperplanes perform the same function as separating lines in two dimensions. That is certainly the consensus view, but I disagree with the authors on this one point. I think that the attempt to visualize hyper-dimensions is important precisely because it makes deep intuition possible. Anyway, that is how I approach it. For any others who are like me and highly visual I suggest you try staring at the animation below depicting a fourth dimension. Then try to think about adding dimensions to a coordinate system or grid. Watching this visualization, and thinking about higher spatial dimensions, may help to open your mind. It may facilitate your understanding of this article, before or after you read it. Plus it is relaxing. But please, for safety’s sake, don’t drive right away or operate power machinery.

4-Dimensions_animation

Once you open the door to infinite dimensional space, you can see how multiple dimension coordinates make it possible to map, sort, and rank documents using hyperplanes, vectors, and probabilities. It makes extremely complex category separation possible where you can separate relevant from irrelevant documents by positing thousands, if not millions of document piles, each existing in a different dimension (but beware of the curse of dimensionality!). As Jesse Freeman puts it in the article at section 4.3:

By projecting points representing documents into higher dimensional space, it is always theoretically possible to linearly separate relevant from irrelevant documents using a non-curved hyperplane. Then, from the set of separating hyperplanes, an SVM could choose the one that maintains the maximum distance between both clusters of data.

This article explains the basics of the mapping and spatial divisions from one category, here relevant, to another, irrelevant. It opens the black box for a quick peek at the predictive coding search engines inside.


Good luck puzzling through the Freeman portions of the article explaining Support Vector Machines (SVMs). It is a journey beyond the Cartesian world that you are used to into an imaginary world of hyper-dimensions. Higher dimensions are not just a mathematical fantasy anymore. They are used in a very practical fashion by SVMs and other technology. They are an essential part of what makes active machine learning software work, software that we now use every day, including predictive coding legal search software, spam filters, Amazon book recommendations, Pandora music selections, Internet advertisement placements, etc. Technological applications of higher dimensions are now driving our entire techno-culture forward. We lawyers had better get some grasp of how the math and science work if we are to remain relevant. This article can help, and will, I hope, become the first of many of its kind. Jesse Freeman, please go on to law school and, even if you do not, keep on writing for us lawyer Flatlanders.


_______________

Cooperation, Transparency, and the Rise of Support Vector Machines in E-Discovery: Issues Raised by the Need to Classify Documents as Either Responsive or Nonresponsive 

Jason R. Baron, Esq. University of Maryland, College Park, Maryland and Jesse B. Freeman Williams College, Williamstown, Massachusetts¹

Abstract: Exponential increases in the volume of electronically stored information are necessitating new thinking on the part of the greater legal community, including a movement away from linear or manual review, as well as away from reliance on keyword searching as the sole automated means to handle e-discovery search and document review requirements. Increasingly, lawyers are becoming more familiar with certain advanced forms of search techniques, including those utilizing machine learning. The landmark US opinion in da Silva Moore v. Publicis Groupe SA, issued in February 2012, giving a judicial imprimatur to use of “predictive coding” and other sophisticated iterative sampling techniques in satisfaction of discovery obligations, should assist in paving the way toward greater acceptance of these new methods. Almost all of these machine learning processes are based on support vector machines or related algorithms, which at first glance seem unapproachably complex. The basic intuitions behind their functionality are not nearly as daunting. After providing relevant background on traditional notions of the discovery process and the emergence of a need for more sophisticated forms of artificial intelligence to solve e-discovery challenges, this paper will explain the mathematical intuition behind support vector machines, so that lawyers can more fully grasp the implications of this new technology. In particular, this paper suggests that support vector machine technology necessarily requires lawyers to pay heightened attention to notions of cooperation and transparency, in light of the collaborative, iterative interaction with coding software, and the need for sharing sets of non-responsive documents in order that use of the technology is optimized.

_____________

¹ Jason R. Baron is an Adjunct Faculty member of the University of Maryland College of Information Studies, and also serves as Director of Litigation at the National Archives and Records Administration. B.A., Wesleyan University; J.D., Boston University; Member of the District of Columbia and Massachusetts Bars. Jesse B. Freeman is a candidate for B.A. degrees in mathematics and economics, Class of ’15, at Williams College, and interned in the Office of General Counsel at the National Archives during the Summer of 2012. The authors wish to kindly thank William Webber, Doug Oard, and Ralph Losey for their comments on earlier drafts. The views expressed here are the authors’ alone and do not necessarily represent the views of any institution, public or private, with which they are affiliated.

___________

1. Introduction 

Since enactment of the 2006 US Federal Rules of Civil Procedure, lawyers in the United States increasingly have confronted the need to learn about a brave new world of “electronically stored information” (ESI), including the need to be aware of tools and techniques borrowed from the realm of artificial intelligence that previously were unheard of in civil discovery practice prior to trial. The 2006 Rules anticipated that the profession would undergo a sea-change in practice, by requiring increased attention to preservation of and access to electronic evidence at the outset of litigation, in the form of increased awareness of the necessity of legal preservation holds [1], and the desirability of performing more advanced and efficient searches for relevant documents – beyond anything necessitated in an era of paper documents [2, 3]. Given the need to pay attention at the beginning of litigation to such highly technical issues, lawyers are beginning to embrace the notion of being more cooperative and transparent in their legal practice to conform to e-discovery demands [4].

Nevertheless, the legal profession as a whole is by no means aware of the latest, profound changes in discovery practice brought on by the emerging use of machine learning technologies in the cause of making document review more efficient. In particular, support vector machines (SVMs) have the potential to dramatically increase both the quality and efficiency of the search and document review functions in e-discovery. Unfortunately, the mathematical formulas used to describe SVMs are both technical and intimidating. This paper has two modest aims: first, we will show that the intimidating formulas that keep many from fully understanding how SVMs work are based on the much simpler mathematical notions of distance and separation. Hopefully, readers of this paper will develop greater understanding of SVMs, in order that they consider incorporating such promising new technologies in their everyday e-discovery practice. While SVMs are not the only predictive coding technology available, this paper focuses on SVMs for two reasons. First, SVMs are a highly popular form of predictive coding. Second, all predictive coding software maps documents based on specified characteristics and looks for those characteristics in unread documents in order to make similar classifications without the need for hands-on review. We focus on SVMs because the theoretical background on predictive coding involved in the explanation should de-mystify the process for all users, and the specific mechanism of the SVM should be directly relevant information to many.

A second aim is to preliminarily explore how growing and eventually widespread use of SVMs holds the potential to upset traditional notions of what it means to practice civil discovery. The paper will argue that optimum use of these technologies necessitates practicing a heightened level of cooperation and transparency between or among adversaries, at least with respect to the sharing of “nonresponsive” documents during the discovery process. The authors are well aware of how provocative these issues are; however, as described in detail below, starting with the da Silva Moore v. Publicis Groupe SA litigation in a US federal court in Manhattan, and in a select number of other cases, the parties are already largely on record as having embraced just such a level of cooperation – thus making the positions taken in this paper somewhat easier to maintain, as at least not entirely speculative [5].

2. Traditional Means of “Cooperation” in US Discovery and E-Discovery Practice 

Since 1938, with the adoption of the US Federal Rules of Civil Procedure, civil discovery practice, as ideally realized, has been grounded on notions of cooperation, transparency and fairness [6, 7]. The rules traditionally have assumed that lawyers will carry out their obligations on behalf of clients without need of active court supervision; however, in the age of ESI, judicial norms with regard to how active a court should be on the front end of litigation are, in many places, rapidly changing. Regardless, lawyers’ obligations have been bounded by at least one limiting condition that represents a fundamental aspect of practice, universally followed to date, namely: that due diligence involves the search for and production of any and all nonprivileged, relevant evidence requested by an opposing party. Thus, as early as 1946, the US Supreme Court held in the case of Hickman v. Taylor [8], that “[m]utual knowledge of all the relevant facts gathered by both parties is essential to proper litigation” (emphasis added). To that end, Rule 26(b)(1) states that “Parties may obtain discovery regarding any nonprivileged matter that is relevant to any party’s claim or defense,” and that “[f]or good cause, the court may order discovery of any matter relevant to the subject matter involved in the action” (emphasis added). The Rule goes on to add that “relevant information” need not be admissible at trial if discovery appears reasonably calculated to lead to the discovery of admissible evidence.

In the decades prior to the 2006 rules changes, for the most part the legal community met its obligations under the federal rules by performing reasonable searches for relevant documents in traditional folders, filing cabinets, and warehouses filled with records. The task at hand was to straightforwardly have one or more lawyers – sometimes in teams – work through a review of boxes of documents to cull out potentially relevant pieces of evidence, for a further decision on both relevance and privilege. Irrelevant or nonresponsive documents were left behind, and only in rare cases were there quality checks to determine if documents had been missed in the review. To the extent controversy existed with respect to the basic discovery protocol, it involved occasional albeit sometimes notorious cases where counsel (and their client) failed to make reasonably diligent efforts to comply with a legally proper discovery request by opposing counsel, resulting in sanctions in the most egregious cases of suppressed (i.e., known but not disclosed) evidence [9].

The past decade has seen the growing volume and complexity of evidence in the form of ESI. This in turn has led to a spotlight placed on the efficacy of keyword searching in lieu of wholesale reliance on manual or linear review, i.e., “eyes-on” review of every document by a team of attorneys [2]. In the paradigmatic case, counsel’s initiation of search protocols centered around coming up with a limited number of keywords, with or without employment of Boolean operators, has been for some time the de facto standard for meeting legal requirements to perform reasonable searches for relevant documents. The Sedona Search Commentary went on to point out at length the known limitations of keyword searching based on the inherent ambiguities in written texts, citing to the important early work of Blair & Maron [10], and challenged the legal profession to recognize that more advanced means to perform searches of ESI held out the potential to increase both “recall” (the ratio of relevant documents obtained in a given search to the overall number of relevant documents in the repository subject to search), and “precision” (the ratio of relevant documents obtained in a given search to all documents obtained in that search). Accordingly, as Practice Pointer 1, the Commentary emphasized that

In many settings involving [ESI], reliance solely on a manual search process for the purpose of finding responsive documents may be infeasible or unwarranted. In such cases, the use of automated search methods should be viewed as reasonable, valuable, and even necessary.

The Commentary went on to discuss alternative search methods, including use of techniques grounded in fuzzy search, concept search, latent semantic indexing, Bayesian belief networks, clustering and categorization techniques, and machine learning methods of various types [2]. The Commentary concluded with a call for research, to better evaluate known search methods in a legal context, and explicitly referenced the TREC Legal Track, run out of the US National Institute for Standards and Technology, as one such research effort underway [11].
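The recall and precision measures discussed above can be made concrete with a small sketch. The document identifiers and counts below are invented for illustration, and precision is computed in the standard information retrieval sense (the share of retrieved documents that are relevant); nothing here comes from the Commentary itself.

```python
def recall(retrieved, relevant):
    """Share of all relevant documents in the repository that the search found."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Share of the documents the search returned that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical repository: documents 1-4 are relevant; a keyword search returns 3-8.
relevant_docs = {1, 2, 3, 4}
retrieved_docs = {3, 4, 5, 6, 7, 8}

search_recall = recall(retrieved_docs, relevant_docs)        # 2 of 4 relevant found
search_precision = precision(retrieved_docs, relevant_docs)  # 2 of 6 returned are relevant
```

In this made-up search half of the relevant documents are missed and two thirds of what comes back is noise, which is the kind of shortfall the Commentary attributes to keyword searching used alone.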

In the years since the 2006 rules amendments, an explosion of case law and commentaries ensued, with increasing attention being paid to the importance of quality control, project management, and iterative sampling, to optimize completeness and accuracy in finding “relevant” documents in particular productions. (For a summary of cases and commentaries, see [12].) As part of this collective movement toward more sophisticated ways to perform quality control (QC) checks of results obtained, notions of how transparent the process should be to the “requesting” as opposed to “responding” party have come to be highlighted. Given the inherent asymmetry present in responding parties having unequal rights of access to and knowledge of their own data universe, in the pre-ESI era responding parties were comfortable in the expectation that they could perform reasonable searches of their client’s records, without any a priori requirement imposed that the interim results of a given document production would be shared with opposing parties. The 2006 rules amendments, with an emphasis on early meet and confer conferences amongst parties to work through issues of preservation and access, somewhat undermined settled expectations. Against the backdrop of near-universal acceptance of the principle that lawyers should be more cooperative in negotiations involving their scope of ESI obligations, it was natural for the judiciary’s expectations to be heightened with respect to the sophistication of would-be search protocols, including taking into account whether sufficient sampling of the “non-hit” population of documents had occurred to confidently say that all relevant documents had been found [3], [13].

3. A New Era: “Predictive Coding” Approved By Courts 

Notwithstanding the growing sophistication in the legal space in the use of advanced search methods, not until the year 2012 had any reported judicial decision affirmatively ruled on whether the use of “predictive coding,” as one form of software-assisted advanced search method, was justified. Everything has changed, however, with reported decisions out of New York [5], Virginia [13], and Louisiana [28], and a further high-profile evidentiary proceeding pending in Illinois [14] – all of which have involved various federal and state courts opining on the use of “predictive coding” in litigation to find relevant documents.

The term “predictive coding,” as one of many labels describing partially automated software assisted review processes using support vector machines or related algorithms, involves (i) a set of preserved data, representing the entirety of what has been captured during a legal hold or culled down using filters for date ranges, custodians, or general subject areas; (ii) use of a random sample of seed documents, and/or a judgmental sample of documents obtained through prior coding, keyword searching, or known documents of particular high relevance to a particular discovery, coupled with a human-in-the-loop strategy of manually coding whatever seed set exists for relevance or privilege; (iii) employing machine learning software, including most notably support vector machines, to categorize similar documents; and (iv) using some kind of QC process to check for coding consistency [12].

In the much-cited case of da Silva Moore, a US federal magistrate judge held that the state-of-the-art in advanced search techniques had progressed to the point where the Court could “bless” the use of a predictive coding protocol in the litigation as submitted by one or both parties [5]. In his February 24, 2012 watershed opinion, Magistrate Judge Andrew Peck writes:

In this case, the Court determined that the use of predictive coding was appropriate considering (1) the parties’ agreement, (2) the vast amount of ESI to be reviewed (over three million documents), (3) the superiority of computer-assisted review to the available alternatives (i.e., linear manual review or keyword searches), (4) the need for cost effectiveness and proportionality . . .; (5) the transparent process proposed by [defendants].

This Court was one of the early signatories to The Sedona Conference Cooperation Proclamation, and has stated that ‘the best solution in the entire area of electronic discovery is cooperation among counsel. . . .’ An important aspect of cooperation is transparency in the discovery process. [Defendants’] transparency in its proposed ESI search protocol made it easier for the Court to approve the use of predictive coding. . . . [Defendants] confirmed that all of the documents that are reviewed as a function of the seed set, whether they are ultimately coded relevant or irrelevant, aside from privilege, will be turned over to plaintiffs. … If necessary, counsel will meet and confer to attempt to resolve any disagreements regarding the coding applied to the documents in the seed set. While not all experienced ESI counsel believe it necessary to be as transparent as [defendant] was willing to be, such transparency allows the opposing counsel (and the Court) to be more comfortable with computer-assisted review, reducing fears about the so called ‘black-box’ of the technology. This court highly recommends that counsel in future cases be willing to at least discuss, if not agree to, such transparency in the computer-assisted review process.

The magistrate judge’s opinion allowing the use of “predictive coding” was subsequently affirmed by a federal district court judge [5]. An Order to the same effect also has been rendered in a state court proceeding in Virginia, where the Court issued a protective order allowing a responding party in discovery to use predictive coding over the objections of the requesting party [14]. In still another case in Illinois, multiple days of evidentiary hearings were held with expert testimony describing the pros and cons of using predictive coding, where the requesting party had moved to compel essentially “starting over” using such method – even after over a million documents had been located by a responding party using keyword searching and other traditional means [15]. The parties settled their search methods dispute in that case before an opinion was rendered.

The extraordinarily detailed protocol in Moore, attached as an appendix to the February 24, 2012 opinion [5], contains provisions for seed sets of documents generated through a combination of random and judgmental sampling, followed by up to seven iterative rounds of “training” the system, through a commitment by counsel to share both responsive and nonresponsive documents by “issue tag” categories. The protocol further provides for sampling at the back end of the initial training period to function as a QC check on excluded or irrelevant documents, to determine how well the trained system has done in accurately making those exclusions. (A similarly detailed joint protocol on predictive coding subsequently has been adopted in the In re Actos case out of Louisiana [28].)

What the Moore protocol does not purport to explain, however, is the “black box” mathematical algorithms used in predictive coding or software-assisted methods, which the judge in Moore more or less took on faith. It may be useful, therefore, to have an explanation at hand on what the mathematics of predictive coding entails, and why the protocol adopted by the Court in Moore does in fact represent best practice when using this technology, especially with respect to the issue of classifying documents as responsive or not.

4. Support Vector Machines: A Look Under The Hood 

In order to develop a sense of how support vector machines (SVMs) and similar algorithms operate, one must at least consider the following questions. First, how do computers represent a lawyer’s annotations of relevance on documents in a seed set? Second, how can annotations distinguishing relevant and irrelevant documents in the seed set enable the SVM to make the same distinction in a body of unread documents? Third, what are some complications that could arise in attempting to perform classifications between relevance and nonrelevance? After an elementary tutorial in section 4, we will go on in section 5 to ask whether there are ways in which legal professionals should alter traditional practices to achieve the full benefits of SVM-type technologies.

4.1 Separating Relevant and Irrelevant Data Using a Computer Algorithm 

When a lawyer reviews potential documents in discovery, she is expected to have a good idea whether the document will be meaningful to the litigation or not – based on past legal experience and specific training on the issues arising in a particular case. For computers the process of determining relevance is less obvious. But, as shown by a growing number of studies, if trained by a lawyer and equipped with an SVM, a computer can estimate with remarkable accuracy whether or not a document will be relevant to a particular case, potentially saving legal professionals’ valuable time [16]. To better understand how SVMs do this, we will start from a notion of documents as points in space, analyze how a computer could separate such points with a line, determine which separating line the computer could choose, and generalize our simple model to more complex searches.

SVMs can use the word content of documents to map each document within a corpus or seed set to a point in a coordinate space [17]. SVMs can also map documents using metadata [18] and relevant features derived from probabilistic latent semantic indexing [19].

For the sake of simplicity, suppose one is painting a house blue and only cares about the keywords “blue paint” and “maintenance.” Place the frequency (representation as a percentage of total words) of the phrase “blue paint” on the X-axis and the frequency of “maintenance” on the Y-axis, such that both are increasing as one moves out from (0,0). Unless two documents are lexically equivalent up to the order of words, each document will correspond to a unique point in space. Figure 1 demonstrates this simplified model.

Fig 1. Documents mapped to points using word content
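The mapping from documents to points described above can be sketched in a few lines. This is only an illustration under stated assumptions: the tokenizer, the helper names keyword_frequency and document_to_point, and the sample text are all hypothetical, and frequency is taken to mean occurrences of the phrase per 100 words of the document.

```python
import re


def keyword_frequency(text, phrase):
    """Occurrences of `phrase` per 100 words of `text` (an assumed definition)."""
    words = re.findall(r"[a-z']+", text.lower())
    phrase_words = phrase.lower().split()
    n = len(phrase_words)
    if not words:
        return 0.0
    hits = sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase_words
    )
    return 100.0 * hits / len(words)


def document_to_point(text):
    """Map a document to (x, y): x for "blue paint", y for "maintenance"."""
    return (
        keyword_frequency(text, "blue paint"),
        keyword_frequency(text, "maintenance"),
    )


# Hypothetical eight-word document used only to exercise the mapping.
sample = "Blue paint needs maintenance. Use blue paint twice."
point = document_to_point(sample)
```

Two documents with the same word counts land on the same point, which is why the text notes that only lexically equivalent documents share a position.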


Now, one can understand how a lawyer would train an SVM. Out of potentially millions of articles, the SVM might give a lawyer seed sets of as few as fifty and as many as a few hundred at a time to analyze for relevance, up to some designated cumulative cap of several thousand documents to be judged overall. These documents are called the “seed set” [19]. Seed sets are often selected in one of two ways. The SVM might draw a random sample of documents from the entire body of documents. Or, the seed set could be selected from the results of a judgmental search performed within the corpus (e.g. using keywords). Either way, or with some combination of both, once a seed set is determined, the lawyer identifies or codes documents as either relevant or irrelevant. SVMs incorporate these annotations of relevance into their spatial representation of documents. In figure 2, we imitate this coding process by using clear squares to denote irrelevant documents and black diamonds to denote relevant documents.

Fig 2. Figure 1 modified to incorporate relevance
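The two ways of selecting a seed set described above, random sampling and judgmental (keyword) search, might be sketched as follows. The corpus, the function names, and the fixed random seed are hypothetical choices made for illustration; real review platforms select and batch documents in their own ways.

```python
import random

# Hypothetical six-document corpus, invented for illustration.
corpus = [
    "Blue paint peeling after one winter",
    "Quarterly maintenance schedule for the furnace",
    "Which primer works under blue paint",
    "Lawn maintenance tips",
    "Touching up blue paint on shutters",
    "Gutter cleaning and roof maintenance",
]


def random_seed_set(documents, size, rng):
    """Draw a simple random sample of documents for the lawyer to code."""
    return rng.sample(documents, min(size, len(documents)))


def judgmental_seed_set(documents, keywords, size):
    """Take the first `size` documents that hit any of the chosen keywords."""
    hits = [d for d in documents if any(k in d.lower() for k in keywords)]
    return hits[:size]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_seeds = random_seed_set(corpus, 3, rng)
keyword_seeds = judgmental_seed_set(corpus, ["blue paint"], 2)

# The lawyer's coding step, imitated here by a hand-written label per document:
# +1 stands for relevant, -1 for irrelevant.
coded = {doc: (+1 if "blue paint" in doc.lower() else -1) for doc in random_seeds}
```

Either list, or a mix of both, would then be coded by the reviewer and handed to the learning software as training examples.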


Now we will develop an algorithmic notion of separation of articles based on relevance. In this case, the relevant documents and the irrelevant documents each form their own cluster. Documents that disproportionately feature the word “maintenance” turn out to be about general home maintenance, and do not pertain to our research about maintaining the quality of a paint job. All other articles were helpful in some way. As figure 3 shows, there is more than one way to spatially divide these documents based on relevance. The divisions in figure 3 are clear because the data are nicely clustered. But, in fact, there is always more than one way to spatially divide coded documents no matter how entangled relevant and irrelevant documents are in the graphical space [20]. That process is explained later.

Fig 3. A subset of possible divisions of relevant and irrelevant documents


The non-uniqueness of the separating line presents a potential problem: which line should the computer choose? It should choose the line that preserves the maximum distance between both bodies of data. To see why, suppose it does not do so. Then, the computer’s line is fairly close to at least one of our clusters. For convenience, suppose it is closer to relevant documents. Now consider what happens when one uses the data on the opposite side of this line – data deemed irrelevant by the SVM. Note that under the specified mapping system, documents that are graphically proximate have similar lexical content. So, one might expect a document that is spatially “close” to a relevant document to also be relevant. Therefore, a separating line unnecessarily close to the relevant cluster is more likely to place a potentially relevant document on the irrelevant side of the separating line. In this circumstance, the SVM might dismiss a relevant result as irrelevant, which neither counsel wants. To abate this problem, the SVM selects the line that maintains a maximum distance between both clusters of data [21]. The maximum distance criterion specifies a unique separation line. Figure 4 provides an example of a maximum margin solution.

Fig 4. The maximum margin solution is least prone to error of all possible separating lines.
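The maximum distance criterion can be illustrated by scoring a handful of candidate lines and keeping the one whose nearest coded document is farthest away. This is a sketch of the selection criterion only, not of how SVM software actually searches for the line (real implementations solve an optimization problem over all possible lines). The data points and the three candidate lines are invented.

```python
import math

# Hypothetical coded seed set: ((x, y), label), where x is the frequency of
# "blue paint", y the frequency of "maintenance", +1 relevant, -1 irrelevant.
seed_set = [
    ((4.0, 1.0), +1), ((5.0, 2.0), +1), ((4.5, 0.5), +1),
    ((1.0, 4.0), -1), ((0.5, 5.0), -1), ((2.0, 4.5), -1),
]


def margin(weights, bias, points):
    """Distance from the line w.x + b = 0 to the nearest coded point.

    The value is negative if the line puts any point on the wrong side.
    """
    norm = math.hypot(weights[0], weights[1])
    return min(
        label * (weights[0] * x + weights[1] * y + bias) / norm
        for (x, y), label in points
    )


# Three candidate separating lines, each written as (weights, bias).
candidates = {
    "x - y = 0": ((1.0, -1.0), 0.0),
    "x = 3": ((1.0, 0.0), -3.0),
    "y = 3": ((0.0, -1.0), 3.0),
}

margins = {name: margin(w, b, seed_set) for name, (w, b) in candidates.items()}
best_line = max(margins, key=margins.get)
```

All three candidates separate the two clusters, but the diagonal line leaves the most room on both sides, so it is the one a maximum margin rule would prefer among these three.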


Models are rarely as simple as the artificial example provided above. There are three major generalizations of which one should be aware.

First, if one cares about more than two search terms, each point gains more coordinates and is thus positioned in a higher dimensional space. Suddenly, drawing a line is no longer an adequate way to separate two sets of points. For example, if one wants to separate points in three dimensions, one uses a plane. Think of an umbrella as a small plane that separates points that are raindrops from points that are a person’s skin, clothes, and hair. If the umbrella had no width, like a line, or no dimension, like a point, it would not adequately separate the two sets of points in the three dimensional universe. It needs to be at least two dimensional or the person carrying it will get soaked. So, the plane is the higher dimensional analogue of the line in terms of its ability to separate data in three dimensions. Yet, most searches will deal with more than three search terms and thus the input space for those searches will be higher than three-dimensional. At this point, one loses the ability to easily visualize the space in which points representing documents lie. Moreover, as the space increases in dimension, one needs higher dimensional analogues of planes to separate points within the space. Mathematicians call these structures “hyperplanes” [20]. Visualizing hyperplanes is not important; having the intuition that hyperplanes perform the same function as separating lines in two dimensions is.

The second generalization is that sometimes the structure that separates clusters while maintaining maximum distance is a curve, rather than a straight line. In these cases, SVMs use so-called “kernel functions” to derive a curve that separates the sets of points [20]. This process will be explained infra. Figure 5 gives an example of a separating curve.

Fig 5. Using a curve to separate more entangled data


Third, in the case of both hyperplanes and separating curves, one still wants to maintain maximum distance from both clusters of data. Failure to do so has the same negative consequences in high and low dimensions: a high risk of obscuring desired results or signaling false positives.²

² Placing the separating hyperplane too close to the irrelevant cluster creates a risk of falsely identifying irrelevant documents as relevant.

4.2 Using Separating Spatial Constructs to Filter Future Results 

SVMs are powerful because they can predict whether a document will be relevant even if no lawyer has performed “eyes-on” manual review of that document. This section explains how SVMs predict the relevance of unobserved documents.

An SVM can quickly map an unread document to a point in space by counting the keywords present in that document as a proportion of total words. This point will either lie on the relevant side of the line or the irrelevant side of the line. If the document falls on the relevant side of the line, the SVM will keep the document and notify the lawyer that it is relevant. If the document falls on the irrelevant side outside of the range of potential ambiguity, the SVM will discard it, reducing the lawyer’s potential workload.

In higher dimensions, the position of the point with respect to the hyperplane might not be as obvious. So, SVMs use more general distance formulas. These give the distance between an unobserved document and the hyperplane a positive or negative sign. The sign corresponds to which side of the hyperplane the document lies on, and that side informs the SVM about whether or not the document is likely to be relevant. So, even in higher dimensions SVMs can discern the relevance of a document using distance formulas. Distance formulas can even generate signed values of distance if the dividing surface is curved.
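For a flat (non-curved) hyperplane, the standard signed distance from a point x to the hyperplane w . x + b = 0 is (w . x + b) / ||w||. The short Python sketch below computes it for any number of dimensions. The weights and bias are assumed to come from an already trained SVM; the values in the usage lines are made up for illustration.

```python
import math

def signed_distance(point, weights, bias):
    """Signed distance from point to the hyperplane weights . x + bias = 0.

    Positive values lie on the side the weight vector points toward
    (treated here as the relevant side); negative values lie on the other side.
    """
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0:
        raise ValueError("weights must not all be zero")
    return (sum(w * x for w, x in zip(weights, point)) + bias) / norm

# Hypothetical hyperplane in two dimensions.
print(signed_distance((3.0, 4.0), (3.0, 4.0), -5.0))  # on the positive side
print(signed_distance((0.0, 0.0), (3.0, 4.0), -5.0))  # on the negative side
```

The same function works unchanged for three, ten, or ten thousand keywords, which is the point of the passage: nothing needs to be visualized for the sign to be computed.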

4.3 Potential Complications and Their Solutions 

Five potential complications arise in the use of SVMs to classify documents with a separating hyperplane: seemingly inseparable data; statistical outliers; data points that are close to or are contained in the separating hyperplane that divides relevant and irrelevant documents; the necessity of sorting documents into more than two categories; and the introduction of new documents.

Dealing with seemingly inseparable data. Sometimes, data will appear to be inseparable. These cases are best illustrated through an example. Suppose one is interested in a new tax law and that one only seeks to use the keyword “tax”. After parsing a set of seed documents, a lawyer finds that documents that contain “tax” as 0 – 3% of the total words are only tangentially related to his research and tend to be irrelevant. In contrast, documents in which “tax” represents 4 – 6% of the total words tend to be relevant. However, documents in which “tax” represents 7% or more of the total word count tend to be merely descriptive and do not provide the deep analysis the lawyer seeks. Figure 6 is a graphical representation of this apparent dilemma.

Fig 6. No single point can separate relevant data from irrelevant data well.

That there are two clusters of irrelevant documents on either side of the relevant documents makes it unclear where one should draw the separating line, which in this one-dimensional case would just be a point.

To solve this problem, SVMs use kernel functions. Kernel functions project data into higher dimensional spaces. Surprisingly, given a data set in which no two identical objects have opposite labels, there is always a kernel function that will allow the data to be linearly separated. In fact, this projection into higher dimensional space is equivalent to curving the separating hyperplane [20]. So, separation using a curved hyperplane is never necessary as a non-curved hyperplane can always separate the data in some dimension.[3]

Consider the previous example. Suppose we projected our one-dimensional set of data into two dimensions. If f is the proportion of a document’s words that are “tax” (so that 5% is written .05), then create a two dimensional graph mapping each document to f and (f – .05)². Graph the first dimension on the X-axis, the second on the Y-axis. Now, instead of lying along a line, the points lie along a parabola. Also, the model has become two dimensional. So, the separating geometric construct becomes a line instead of a point. Figure 7 shows that this new set of data can easily be separated with a line.

Fig 7. A solution to the problem posed by figure 6
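The “tax” example can be checked numerically. The sketch below assumes the second coordinate is the squared deviation (f - 0.05)^2, which is what produces a parabola, and it uses invented seed documents and a hand-picked horizontal line at height 0.00025. A real SVM would choose the maximum-margin line itself.

```python
def project(f):
    """Lift a one-dimensional frequency f to the two-dimensional point (f, (f - 0.05)**2)."""
    return (f, (f - 0.05) ** 2)

# Hypothetical seed set: (proportion of words that are "tax", coded relevant?)
SEED = [
    (0.01, False), (0.02, False), (0.03, False),  # tangential documents
    (0.04, True), (0.05, True), (0.06, True),     # relevant documents
    (0.07, False), (0.09, False),                 # merely descriptive documents
]

# Assumed separating line in the projected space: y = 0.00025.
THRESHOLD = 0.00025

def is_relevant(f):
    """A document is relevant if its projected point lies below the line."""
    _, y = project(f)
    return y < THRESHOLD

# Every seed document lands on the correct side of one straight line.
print(all(is_relevant(f) == label for f, label in SEED))
```

In one dimension no single cut point classifies this seed set correctly, but after the projection a single straight line does.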

By projecting points representing documents into higher dimensional space, it is always theoretically possible to linearly separate relevant from irrelevant documents using a non-curved hyperplane. Then, from the set of separating hyperplanes, an SVM could choose the one that maintains the maximum distance between both clusters of data. Although there is always a function that can separate relevant from irrelevant documents, some such functions are so complex that they are computationally intractable. In fact, most SVMs are only packaged with a few kernel algorithms to create kernel functions. In cases where these packages fail to find a perfect separation function, the SVM will use a computationally feasible kernel that separates most of the data with the maximum margin but accepts a “soft margin” of error.

Dealing with Outliers. There might be a few relevant documents that are surrounded by irrelevant documents or vice versa. This might happen for two reasons. The documents might be genuinely relevant (or irrelevant) even though their proportions of keywords do not match up with other documents of their type. Or, the documents could have been coded incorrectly; not even expert lawyers can separate relevant documents from irrelevant documents with anything approaching 100% accuracy [16, 22, 23].

To solve this problem, SVMs have a “soft margin” built into their algorithmic structure. This margin dictates how many outliers are allowed to lie on the opposite side of the hyperplane and how far they have to be from the hyperplane to be considered outliers [20].
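One common way to formalize a soft margin, offered here as the standard textbook formulation rather than as something taken from [20], is to give each document a “slack” equal to how far it falls short of the margin and to minimize half the squared length of the weight vector plus C times the total slack. The penalty C is the knob that trades margin width against tolerated outliers. In the Python sketch below, labels are +1 (relevant) and -1 (irrelevant), and all names and numbers are illustrative assumptions.

```python
def hinge_slack(point, label, weights, bias):
    """Slack for one document: 0 if it is on the correct side and outside the margin,
    growing linearly as it moves into the margin or onto the wrong side."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return max(0.0, 1.0 - label * score)

def soft_margin_objective(data, weights, bias, penalty_c):
    """Quantity a soft-margin SVM minimizes: 0.5 * ||w||^2 + C * (sum of slacks)."""
    margin_term = 0.5 * sum(w * w for w in weights)
    slack_term = sum(hinge_slack(p, y, weights, bias) for p, y in data)
    return margin_term + penalty_c * slack_term

# Hypothetical coded documents: (point, label) with label +1 or -1.
DATA = [
    ((2.0, 0.0), +1),   # clearly relevant, no slack
    ((0.5, 0.0), +1),   # relevant but inside the margin
    ((1.0, 0.0), -1),   # outlier: coded irrelevant, sits on the relevant side
]
print(soft_margin_objective(DATA, (1.0, 0.0), 0.0, penalty_c=1.0))
```

A large C punishes outliers heavily and pushes toward a narrow, strict margin; a small C tolerates more outliers in exchange for a wider margin.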

Dealing with Documents that Lie Close to the Separating Hyperplane. Although most documents can be easily classified based on a lawyer’s coding annotations of the seed set, some classifications are not obvious. In particular, documents that lie close to or on the separating hyperplane are of ambiguous relevance. They are fairly close to both the cluster of relevant data and the cluster of irrelevant data. Thus, irrelevant documents that approach this hyperplane are more likely to be relevant than irrelevant documents that are farther away. The reverse is true for relevant documents. Therefore, this set of documents is most likely to be incorrectly classified by the SVM. A relevant document might be discarded or an irrelevant document might be labeled relevant, harming precision, recall, or both. To reduce the risk of false classification, an “active learning” SVM creates another seed set for the lawyer out of the documents that were left ambiguous by the previous filtering. After each seed set classification, the SVM uses the new inputs provided by the lawyer to create a more precise separation between the two classes of data [24]. In contrast, a “batch learning” SVM creates a new seed set out of random documents that were omitted from both the previous filtering and the previous seed set [24]. The SVM ends either of these iterative processes once it determines that the error that may result from automatic classification will be sufficiently small. In other words, the system “stabilizes” to an acceptable margin of error.
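The active learning step, choosing the next seed set from the documents nearest the hyperplane, can be sketched as follows. This is a simplified illustration and not the procedure of any particular product: the hyperplane is assumed to be given, the documents are already mapped to points, and the batch size k is an arbitrary choice.

```python
def most_ambiguous(points, weights, bias, k):
    """Return the k unlabeled points whose scores are closest to zero,
    i.e. the documents lying nearest the separating hyperplane."""
    def abs_score(point):
        return abs(sum(w * x for w, x in zip(weights, point)) + bias)
    return sorted(points, key=abs_score)[:k]

# Hypothetical unlabeled documents and hyperplane.
UNLABELED = [(0.5, 0.5), (1.0, 1.0), (0.0, 0.2), (0.6, 0.5)]
next_seed_set = most_ambiguous(UNLABELED, (1.0, 1.0), -1.0, k=2)
print(next_seed_set)  # these go back to the lawyer for manual coding
```

A batch learning variant would replace the sort with a random draw from the same pool of unreviewed documents.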

Relegating the task of classifying ambiguous documents to the lawyer means that the lawyer has to sift through more documents than are present in the initial seed set. However, on net, a lawyer who uses an SVM personally classifies significantly fewer documents than one who uses traditional review. In fact, lawyers do not even have to classify all of the documents of ambiguous relevance. If lawyers find more error acceptable, they can sift through smaller seeds of these documents, allow the SVM to record patterns in their classifications, and have the SVM classify the rest of the ambiguous documents.

Adapting SVMs to Sort Documents Into More than Two Categories. Standard SVMs are binary linear classifiers; they use lines (or their n-dimensional analogues, i.e., hyperplanes) to separate data into two categories. Yet, documents might need to be sorted by more than one criterion and divided into more than two sets. For example, lawyers may be interested in whether a document contains Personally Identifiable Information (PII) in addition to whether that document is relevant. To solve this problem, the SVM would simply make two binary classifications. One would separate the relevant documents from the irrelevant ones. The other would discern which documents are likely to have PII and which probably do not contain PII. Then, each document has two labels (or “issue tags,” in the vernacular used in the Moore protocol), and the documents can be separated into four categories: PII relevant, PII irrelevant, non-PII relevant, and non-PII irrelevant. If there are n potentially important features a document can have, an SVM would do n binary classifications and use the results to create 2^n (two to the nth power) categories of documents [25].
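The count of categories follows from simple combinatorics: each of n yes/no tags doubles the number of possible label combinations, giving 2 x 2 = 4 for two tags and 2^n in general. The Python sketch below enumerates the combinations and turns one document's per-tag scores into its labels. The tag names and scores are hypothetical, and each score is assumed to be the signed output of a separate binary classifier.

```python
from itertools import product

def label_combinations(tag_names):
    """Enumerate every combination of yes/no outcomes for the given tags."""
    return [dict(zip(tag_names, flags))
            for flags in product([True, False], repeat=len(tag_names))]

def categorize(scores):
    """Convert per-tag signed scores into yes/no labels (positive means yes)."""
    return tuple(score > 0 for score in scores)

TAGS = ["relevant", "PII"]
for combo in label_combinations(TAGS):
    print(combo)

# One document: positive relevance score, negative PII score.
print(categorize((0.7, -0.2)))
```

With the two tags above this yields the four categories named in the text; adding a third tag, such as privilege, would yield eight.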

Consider the following SVM: documents are mapped according to two keywords and then classified based on: (i) whether they are relevant; and (ii) whether they contain PII. Relevant documents are shaded black; irrelevant documents are clear. These two categories are separated by a vertical line. Documents containing PII are squares; documents without PII are diamonds. These two categories are separated by a horizontal line. Figure 8 depicts this dual division.

Fig 8. A basic example of division based on multiple criteria

Introduction of new documents. Finally, suppose new unlabeled documents are introduced. Then, cooperating counsel may agree to feed these new documents to an SVM, which has two benefits. First, after the SVM classifies these new documents, lawyers may program it to look for new “issue tags,” that are highly correlated with relevance or irrelevance. Incorporating these tags as an additional proxy for relevance can improve both the current model and future filtering efforts.

This would allow both parties to channel the accuracy and efficiency of an SVM as new facts emerge to ensure the SVM best suits their needs. Second, independent of the chance of discovering a new, relevant issue tag, electronically sorting new documents will be faster and potentially more accurate than manual review [16].

5. Optimizing The Benefits of SVMs in Search Protocols 

SVMs are useful because they hold out the potential to be more efficient and effective than other review methods. As the comprehensive RAND Study, “Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery,” concludes, given the lack of present data points it is “not entirely clear” what magnitude of savings can be achieved by using predictive coding methods as compared with other hybrid forms of automated and manual review. [16] at p. 66. However, as the RAND report also emphasizes, “predictive coding in large-scale discovery review has the potential to yield significant cost savings without compromising quality as compared with that provided by a human review.” [16] at p. 71.

This potential will, in our view, be more rapidly fulfilled as lawyers consider the benefits of greater cooperation and transparency, as Judge Peck and his colleagues have urged. [26] To this end, we make the following observations about process and protocols.

First, lawyers need to conceptualize the e-discovery process as involving multiple iterative feedback loops, where input from an opposing party is desirable in order to fine-tune the production of relevant documents. As first noted in [6], this process involves multiple meet and confers, in which sample sets are provided of the results of an automated search, with opportunity given for choices being made by opposing counsel on what constitutes the documents of greatest interest returned in the first, second, or subsequent sample.

Second, as set out in the Moore protocol, the SVM algorithm fairly demands that good exemplar candidate documents from both the “relevant” and “irrelevant” universes be agreed to, in order that the sophisticated machine learning techniques described above in section 4 can take place. Importantly, it turns out the computer achieves the greatest gains in learning through active learning processes such as re-seeding documents that are “closer” to the classifier hyperplane [27]. This represents a challenge, one that the parties in Moore may not have fully anticipated, when nominally agreeing to discuss the classification of documents into responsive and nonresponsive piles.

Unquestionably, the idea that a protocol would require the turning over of nonprivileged, irrelevant documents, in order to optimize training of a machine learning algorithm, is fairly unprecedented outside of the Moore and In re Actos protocols. However, absent building in that specification, it is not difficult to imagine many situations where counsel for one party who may have insisted on using predictive coding (as in the case of the responding party in Global Aerospace) ends up over-training the system to fit a one-sided conception of “relevance” in the litigation. In other words, absent agreement on what is considered irrelevant, especially in hard cases, there is much greater potential for going off course. However, as Judge Peck anticipated, there will be participants in litigation that strongly object to the intentional turning over of any irrelevant documents, and/or a greater number of documents than absolutely required, regardless of circumstances. Over time, however, as more judges adopt similar protocols urging cooperation between parties, resistance in the profession (and among clients) may lessen. A recent article in Metropolitan Corporate Counsel [29] observed:

It remains to be seen whether corporations will embrace predictive coding with the levels of transparency involved in Da Silva [Moore], Actos and [Global Aerospace]. Some corporations will clearly be motivated by the potential cost savings. They may limit the matters they are willing to be transparent to those that they know are unlikely to involve the production of sensitive documents. Others may embrace transparency because they figure that the volume of irrelevant documents to be produced during the predictive coding training process will be relatively small and thus the risk low or they figure the problem of producing irrelevant documents can be controlled with a protective order or confidentiality agreement.

Given how novel the propositions discussed in this paper are, it is perfectly understandable that many lawyers will attempt to avoid any obligation that arises to engage with the other side in negotiations that include reaching agreement on the sharing of nonrelevant documents in connection with a protocol on advanced search techniques. See [30] for a further discussion of “forced” disclosure vs. voluntary disclosure of irrelevant documents when engaging in a predictive coding process. One day, however, courts may more routinely be in a position to rule that the failure to adopt such methods and protocols is unreasonable, i.e., that a process that goes so far as to transparently reveal both relevant and nonrelevant documents in the seed and training sets represents a benchmark of some kind for what is considered an “adequate” or “reasonable” response to a party’s discovery obligations. If more lawyers take the time to understand the underlying mathematics, as well as the sophisticated joint protocols that have been proposed, they arguably will benefit from the realization that classification is a two-sided proposition, demanding appropriate attention to all documents in a given repository or data set in order that machine learning technologies can be fine-tuned or optimized appropriately.

___________

References 

1. Pension Comm. of Univ. of Montreal Pension Plan v. Banc of Am. Sec., LLC, 685 F. Supp. 2d 456 (S.D.N.Y. 2010).

2. The Sedona Conference, The Sedona Conference Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (“Sedona Search Commentary”), Sedona Conf. J. 8:189 (2007). https://thesedonaconference.org/publications

3. The Sedona Conference, The Sedona Conference Commentary on Achieving Quality in the E-Discovery Process, Sedona Conf. J. 10:299 (2009). https://thesedonaconference.org/publications

4. The Sedona Conference, The Sedona Conference Cooperation Proclamation, Sedona Conf. J. 10:331 (2009). https://thesedonaconference.org/publications

5. Moore et al. v. Publicis Groupe SA, 2012 WL 607412 (S.D.N.Y. Feb. 24, 2012) (Peck, M.J.), aff’d, 2012 WL 1446534 (S.D.N.Y. April 26, 2012) (Carter, J.).

6. Paul, G.L., Baron, J.R., “Information Inflation: Can the Legal System Adapt?,” Richmond J. Law & Tech., 13:10 (2007) http://jolt.richmond.edu/v13i3/article10.pdf

7. Beckerman, J.S., “Confronting Civil Discovery’s Fatal Flaws,” Minnesota L. Rev., 84:505 (2000). http://www.vallexfund.com/download/Confronting_Civil_Discovery_Fatal_Flaws_2000.pdf

8. Hickman v. Taylor, 329 U.S. 495 (1947).

9. Metropolitan Opera Ass’n, Inc. v. Local 100, Hotel Employees and Restaurant Employees Internat’l Union, 212 F.R.D. 178 (S.D.N.Y. 2003).

10. Blair, D.C., Maron, M.E., “An evaluation of retrieval effectiveness for a full-text document-retrieval system,” Communications of the ACM 289 (1985). http://opim-sun.wharton.upenn.edu/~sok/papers/b/blair-maron.pdf

11. Oard, D.W., Baron, J.R., Hedin, B., Lewis, D.D., Tomlinson, S., “Evaluation of information retrieval for E-discovery,” Artificial Intelligence and Law, 18:347 (Springer 2010) (citing to research from the TREC Legal Track, http://trec-legal.umiacs.umd.edu/).

12. Baron, J.R., “Law in the Age of Exabytes: Some Further Thoughts on ‘Information Inflation’ And Current Issues in E-Discovery Search,” Richmond J. Law & Tech., 17:9 (2011). http://jolt.richmond.edu/v17i3/article9.pdf

13. Mt. Hawley Ins. Co. v. Felman Prod., Inc., 271 F.R.D. 125 (S.D. W.Va. 2010).

14. Global Aerospace Inc., et al. v. Landow Aviation, L.P., et al., 2012 WL 1431215 (Va. Cir. Ct. April 23, 2012) (order approving use of predictive coding in discovery).

15. Kleen Prods, LLC v. Packaging Corp. of Am., Docket 1:10-cv-05711 (N.D. Ill.) (plaintiffs’ motion pending to compel use of predictive coding in discovery).

16. RAND Corporation, “Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery” (2012), http://www.rand.org/pubs/monographs/MG1208.html.

17. Joachims, T., “Text Categorization with Support Vector Machines: Learning with Many Relevant Features.” Universitat Dortmund, 1998. Web. 07 Apr. 2012. http://www.cs.iastate.edu/~jtian/cs573/Papers/Joachims-ECML-98.pdf

18. Han, H., Manavoglu, E., Zha, H., Tsioutsiouliklis, K., Giles, C., and Zhang, X., “Rule-based Word Clustering for Document Metadata Extraction.” Proc. of ACM Symposium on Applied Computing, Santa Fe, New Mexico. (2005). 1049-053. http://clgiles.ist.psu.edu/papers/SAC-2005-Document-Metadata-Extraction.pdf

19. Yan, H., “Support Vector Machines for Text Categorization Based on Latent Semantic Indexing.” (2001). http://www.isn.ucsd.edu/courses/774/2001/lsa.pdf

20. Noble, William. “What is a Support Vector Machine.” Nature Biotechnology 24.12 (2006): 1565-567. http://www.broadinstitute.org/annotation/winter_course_2006/index_files/Noble%202006%20SVM%20tutorial%20Nat%20Biotech.pdf

21. Orbanz, Peter. “Support Vector Machines.” Cambridge University, Cambridge. 09 Apr. 2012. Lecture. http://mlg.eng.cam.ac.uk/porbanz/teaching/slides_ml__svm.pdf

22. Roitblat, H.L., Oot, P., Kershaw, A., “Document Categorization in Legal Electronic Discovery: Computer Classification v. Manual Review,” J. Am. Soc’y for Info. Sci. & Tech., 61:70 (2010). http://www.clearwellsystems.com/e-discovery-blog/wp-content/uploads/2010/12/man-v-comp-doc-review.pdf

23. Baron, J.R., Oard, D., Lewis, D., TREC Legal Track 2007 Overview, Proceeding of the 15th Annual Text Retrieval Conference, National Institute of Standards and Technology, http://trec-legal.umiacs.umd.edu

24. Burl, M.C., Wang, E., “Active Learning for Directed Exploration of Complex Systems.” Proceedings of the 26th International Conference on Machine Learning. International Conference on Machine Learning, Montreal, Canada. 2009. 4. Web. http://cubs.buffalo.edu/govind/CSE705-SeminarPapers/2.pdf

25. Nayak, P., Raghavan, P., and Mooney, R., “Information Retrieval.” Computer Science 276: Introduction to Information Retrieval. Stanford University, Palo Alto, CA. 06 Apr. 2012. Lecture. http://jolt.richmond.edu/v18i3/article8.pdf

26. Waxse, D.E., “Cooperation–What Is It and Why Do It,” Richmond J. Law & Tech (2012), 18:3.

27. Tong, S., Koller, D., “Support Vector Machine Active Learning with Applications to Text Classification.” Journal of Machine Learning Research (2001): 45-66. Print. http://www.ai.mit.edu/projects/jmlr/papers/volume2/tong01a/tong01a/pdf

28. In re Actos (Pioglitazone) Products, MDL No. 6-11-md-2299 (W.D. La. July 27, 2012)

29. Solomon, R., “Are Corporations Ready To Be Transparent And Share Irrelevant Documents With Opposing Counsel To Obtain Substantial Cost Savings Through The Use of Predictive Coding,” Metropolitan Corporate Counsel 20:11 (Nov. 2012). Print. http://www.metrocorpcounsel.com/articles/21076/are-corporations-ready-be-transparent-and-share-irrelevant-documents-opposing-counsel

30. Losey, R., “Keywords and Search Methods Should Be Disclosed, But Not Irrelevant Documents” (May 26, 2013), http://e-discoveryteam.com/2013/05/26/keywords-and-search-methods-should-be-disclosed-but-not-irrelevant-documents/.


Electronic Discovery Best Practices Update

May 30, 2013

I found time recently to make several updates to EDBP.com, the collection of electronic discovery best practices for legal services. There are several additions and revisions, but the primary ones were made to step-one, Litigation Readiness, and step-seven, C.A.R. (computer assisted review).

I received input on Litigation Readiness from a practicing attorney who prefers to remain anonymous. He correctly pointed out that the statements on this page pertaining to third-party certification of ESI destruction went too far, especially considering how most large organizations necessarily have to delete data every day to keep functioning. The revisions make clear that an outside expert audit of ESI destructions to solidify protection under Rule 37(e) only applies to exceptional large-scale data purges.

More input on this collection of pre-suit best practices is welcome. This page is still in its early formative stage. You can contribute anonymously, as done here, or can receive credit for any significant contributions. Links to excellent articles on the subject are also appreciated. Please send me suggestions. Remember, EDBP is focused solely on attorney work, the practice of law, and not on technology per se. The non-legal e-discovery work performed by other members of an e-discovery team is not included in EDBP. This differentiates this work-flow model from its much older, big brother.

Portions of the CAR best practices pages are now fairly well articulated, with several recent additions made. But your input on computer assisted review issues is also invited, especially on the Review Quality Controls section of CAR. That still has a long way to go.

The next step of Protections has not been written up at all. When completed it will include best practices for the legal tasks of Redaction, Privilege Logs, Confidentiality Agreements and Orders, and Clawback Agreements and Orders. By the way, at a recent CLE I attended in Pittsburgh both Judge Frank Maas and Judge John Facciola said it was borderline malpractice not to have a clawback order entered in a case with serious ESI.

I divide the CAR best practices into a primary page, that just has a short introduction, and the sub-pages where most of the content resides:

I recently added several revisions and citations in the Predictive Coding page, including a summary of a recent article by Warwick Sharp, Ten Essential Best Practices in Predictive Coding (Today’s General Counsel, May 2013). Warwick, who I don’t think I’ve ever met, is a co-founder of Equivio and VP. His suggestions were all good and warranted inclusion on EDBP. I also added to this page a discussion of the difference between a control set and a training set, something that I touched upon in my most recent robot animation, Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search.

I also added citations to two excellent white papers from KPMG. They were added to both the Predictive Coding and Review Quality Control pages, which overlap somewhat in any case. The best and most recent KPMG white paper was by Manfred Gabriel, entitled Quality Control For Predictive Coding In e-Discovery (2013). The predecessor paper, The Case For Statistical Sampling In e-Discovery (2012), by multiple KPMG authors, Chris Paskach, Michael Carter, and Phil Strauss, was also very good.

I know Manfred, who is now a principal at KPMG, from presenting together at Legal Tech a few years back with Jason R. Baron on advanced review techniques. Unlike many lawyers who claim expertise in CAR, including especially predictive coding, Manfred has far more than just theoretical knowledge. Like Maura Grossman and myself, Manfred is hands-on in the digital world of document review. He is not only a review expert, he is an SME in the field of anti-trust law. Manfred Gabriel actually drives the CAR, and supervises many big projects. That is the only way to really understand these complex processes. Manfred and I seem to agree on all things predictive coding (although we agree to disagree on the proper role of EDBP’s big red-square!), so I was pleased to see he has now made a written contribution to the field with his Quality Control paper.


Keywords and Search Methods Should Be Disclosed, But Not Irrelevant Documents

May 26, 2013

A common question these days between most lawyers discussing e-discovery is: What Keywords Did You Use? This is often followed by I’ll show you mine if you show me yours. Often this latter statement is made out of a bona fide spirit of cooperation, typically in cases where:

  1. both sides had too much ESI to search manually;
  2. they culled using simple keyword technology either because that is the only search they knew how to do, or they did not deem more advanced predictive coding technology to be appropriate for that case; and,
  3. the attorneys knew how to cooperate to get discovery done without spending too much money.

In these cases attorneys freely exchange the final keywords they used. There is no wasted breath or valuable client dollars spilled over the question.

Only rarely would attorneys in this symmetrical position not only want to know the keywords finally chosen, but also the keywords, parametrics, Boolean logic, etc., tested and rejected along the way. If they asked for a list of all the keywords ever tested, the proper response is no, or in my case, no such list exists, and I can’t recall, but there were quite a few. They might want to ask whether you tried this, that, or the other keyword. That’s fair, and an expert searcher would probably say, yes, I tried all of those early on, and they were all rejected because (fill in the blank). Alternatively, they might say to one or more of the suggestions: No, I didn’t think of that one, but I’ll check on it later today. It’ll just take a minute to try it out, and then I’ll get back to you on that.

A Keyword Search Hiding Kimono Made of Work Product is a Kimono Made of Whole Cloth

But what about another scenario, the asymmetrical one? You know, where one side has tons of ESI (actually ESI is weightless, but this sounds good), and the other side has virtually none, aside these days from a pesky Facebook page or two. In these cases the kind of cooperation described for symmetrical cases is often lacking. One side, typically the plaintiff, has nothing to disclose. The conversation is more like, you show me yours, but I can’t show you mine because, well, I don’t have one. So sad. All too often this conversation serves as a prelude to a real waste of client money fighting over the question.

The well-endowed defense counsel is often too shy to show theirs, keywords of course. So they hide theirs in a kimono made of a fabric called work product. This legal doctrine is designed to protect an attorney’s mental impressions, conclusions, opinions, or legal theories. It also protects from discovery documents and tangible things that are prepared in anticipation of litigation. Hickman v. Taylor, 329 U.S. 495 (1947); Rule 26(b)(3), FRCP. Therefore, you cannot send an interrogatory asking for the other side’s strategy to win the case; or more correctly stated, you can ask, but the attorney does not have to answer.

Many lawyers have long considered the particular methods they used to find documents responsive to a request for production to be work product. It was, after all, their own thought processes and legal techniques that created the keywords. They object to disclosing the keywords they used. They argue such disclosure would unfairly require them to disclose their theory of the case, their mental impressions of how to find relevant information.

This seems like a stretch to many attorneys, and judges, some of whom have rejected this argument outright. They do not think that any significant attorney ideas are revealed by something as mechanical as keywords, especially the final keywords used to cull a large dataset. They think keywords relate to the underlying facts of what documents are responsive to a document request, not mental impressions. They do not see any work product in keywords.

Still, many lawyers cling to this very broad interpretation of work product, especially when in asymmetrical litigation. In those cases defense counsel may respond to the question of what keywords did you use by saying something like: You cannot see my keywords, they are mine, all mine, and mine alone. No one may see my magic words. They are secret. They are protected by privilege. I would rather die than let you peek under my kimono. Well, ok, maybe the last phrase is not uttered too often, but the others essentially are. Passion by lawyers to protect their secrets can run high. This is usually displaced passion. It should be directed to protecting client confidences instead. Often lawyers grappling with e-discovery forget and confuse an attorney-client privilege with a work product privilege.

Attorney-client and work product are two completely different kinds of privilege. The AC privilege is owned by the client, not the lawyer. The lawyer has a strong ethical duty to protect the client’s AC privilege. This is dramatically contrasted with a work product privilege that is owned by the attorney and is given far less protection under the law. There is no ethical duty whatsoever for an attorney to keep his work product secret, except for the duty of competent representation. Often competence requires an attorney not to reveal his or her mental impressions of a case to opposing counsel. They reasonably construe the scope of the privilege and determine that:

  1. disclosure would not be in their client’s best interests, and
  2. the rules do not otherwise require them to share this particular aspect of their mind-set about the case.

But having said all of that, it is important to understand that attorneys often deem it to be in their client’s best interests to share some of their mental impressions of a case. Moreover, like it or not, the rules often require an attorney to make some disclosure of their mind-set, theories, etc., or material prepared for the case. So they do it. They move the case along. They do not get bogged down with a question of open or closed kimono, which is often just an ego-trip where a lawyer is over-valuing their own mental impressions at the client’s expense. Yes, kimono-closing motion play can be a very expensive process.

All actual trial lawyers, and not mere paper-pushers as we used to say, or now, maybe better said, mere electron-pushers, know full well that good lawyers share mental impressions with opposing counsel all of the time. Indeed, is that not what legal briefs are all about?

Some disclosure of work product is required for any attorney to comply with the rules of civil procedure, including the almighty Rule One, and especially the discovery rules. Discovery is built on the premise of cooperation, and that in turn requires some rudimentary sharing of mental impressions, such as what do you think is relevant, what documents do you want us to try to find to respond to this or that category in a Request For Production?

How could you possibly comply with Rule 26(f), for instance, without some select waiver of work product? Remember in subsection (2) it mandates attorneys to discuss the “nature and basis of their claims and defenses and the possibilities for promptly settling or resolving the case” and to develop a joint discovery plan. In subsection (3) the rules require lawyers to talk to each other and “state the parties’ views and proposals” on topics A-F. Subsection (C) of Rule 26(f)(3) in turn requires discussion of the views and proposals concerning “any issues about disclosure or discovery of electronically stored information…” All of these mandated discussions require disclosure of an attorney’s mental impressions, conclusions, opinions, and legal theories.

Trial lawyers have always disclosed some work product to each other to prepare for trial (of which discovery is a part), conduct trials, and settle cases. A lawyer can share his mental impressions with the other side, if he wants, and can do so without fear of opening the door of a complete waiver. Again, this happens all of the time, especially in any settlement discussions, where both sides will try to persuade the other of the strength of their case. They will explain why and how they will win and the other side will lose. The same kind of discussion is inherent in any proportionality issue.

Lawyers usually love to argue about their opinions, so why this recalcitrance about keywords? Could it be because they know or suspect that their keywords suck? Do they fear ridicule and reversal because they just dreamed up keywords without testing? Or worse, did they use bad keywords on purpose to try to hide the truth?

Bottom line: to represent a client’s best interests and comply with the Rules, a lawyer has to share mental impressions to a certain extent. If lawyers refuse to talk to each other, refuse to cooperate, all on some misguided notion that they have a right to remain silent because of the work product doctrine, discovery will never get done. The case will go off track and may never be resolved on the merits. The hide-your-keywords under a kimono doctrine that seems to be in fashion among many e-discovery lawyers these days is misguided at best, and at worst, may be illegal.

Kimono closing lawyers, get over yourself and how valuable your mental impressions are. Tell the other side what your keywords are. Or are you hiding them because your keywords are so poor? Are you embarrassed by what you have to show? Then get a keyword search expert to help you out. Unlike predictive coding experts, there are plenty of power users and professional keyword searchers around.

Cases Supporting Disclosure of Keywords

Like it or not, more and more judges are growing tired of obstructionism and expensive discovery side-shows. They are requiring lawyers to show their keywords, at least the ones used in final culling. They are compelling lawyers to open their kimonos. Consider the ruling in a recent trade-secret theft case in California that cites to the law of several other jurisdictions.

To the extent Plaintiff argues that disclosure of search terms would reveal privileged information, the Court rejects that argument. Such information is not subject to any work product protection because it goes to the underlying facts of what documents are responsive to Defendants’ document request, rather than the thought processes of Plaintiff’s counsel. See Romero v. Allstate Ins. Co., 271 F.R.D. 96, 109-10 (E.D. Pa. 2010) (finding that document production information, including search terms, did not fall under work product protection because such information related to facts) (citing Upjohn Co. v. United States, 449 U.S. 383, 395–96 (1981) (“Protection of the privilege extends only to communications and not to facts. The fact is one thing and a communication concerning that fact is entirely different.”)); see also Doe v. District of Columbia, 230 F.R.D. 47, 55-56 (D.D.C. 2005) (holding that Rule 26(b)(1) of the Federal Rules of Civil Procedure can be read to allow for discovery of document production policies and procedures and such information is not protected under the work product doctrine or attorney-client privilege). Moreover, Defendants’ substantial need for this information is apparent. See In re Enforcement of Subpoena Issued by F.D.I.C., 2011 WL 2559546, at *1 (N.D. Cal. June 28, 2011) (LaPorte, J.) (“Fact work product consists of factual material and is subject to a qualified protection that a showing of substantial need can overcome.”). There is simply no way to determine whether Plaintiff did an adequate search without production of the search terms used.

Formfactor, Inc. v. Micro-Probe, Inc., Case No. C-10-03095 PJH (JCS), 2012 WL 1575093, at *7 n.4 (N.D. Cal. May 3, 2012).

The holding in Formfactor was recently followed in the well-known Apple v. Samsung case involving a third-party subpoena of Google. Apple Inc. v. Samsung Electronics Co. Ltd. The court compelled Google to disclose the keywords it used to respond to the subpoena and also to disclose the names of the custodians whose computer records were searched. Google’s arguments that a third party under Rule 45 did not have to make such disclosure were rejected. The court instead noted that discovery cooperation, including transparency of search methods, was required of anyone in litigation, both parties and non-parties.

Keyword Search Alone is Good Enough for Most Cases

Keywords are here to stay. Sure, it is old technology, but it is still an effective means of search. It should not be abandoned entirely. Predictive coding, by which I mean search using near-infinite-dimensional vector space probability analysis of all documents searched, is far more advanced than mere keyword search. But this kind of advanced-math search is hard to do correctly, and anyway, is not needed for all search projects.

Forgive the primitive image, but you do not need to use an elephant gun to kill a mouse. Since most cases today, even in federal court, involve less than $100,000 at issue, predictive coding is not needed in most suits. Keyword search alone, without including advanced analytics, is proportionally sound for most of these small value cases. Indeed, it is proportionally sound today for any case that does not involve high volumes of ESI or otherwise have complex search challenges.

Moreover, even in the big cases involving complex search problems, you would never use predictive coding search alone. That is about as silly as relying on random chance alone to train your predictive coding robots. You would use all kinds of search, what I call the multimodal approach. That includes keyword search using modern-day parametric Boolean features.

Anytime keywords are used to screen out files for review you should be prepared to disclose those keywords. I personally do not like to use keywords as an independent filter in a predictive coding process. But sometimes it happens, such as to limit the initial documents collected and thereafter searched with predictive coding. If that happens, and if the question is asked – What keywords did you use? – you should be prepared to answer. You should not try to hide that under your kimono. More and more courts consider that work product argument to be made of whole cloth.

Disclosure in Predictive Coding Search

Assuming that a predictive coding process is done properly, and keywords are not used to select what documents get searched, then the question of what keywords you used as part of the multimodal search becomes moot. A predictive coding CAR is not driven by keywords. It is driven by infinite dimensional probability math. Keywords are not in the black box anymore, hyper-dimensions are.

The keywords used in a predictive coding project are just one of many types of search used to find the documents that fuel the predictive coding engine. They help an expert searcher find documents to train the machine in an active learning process. It is the training documents, not particular keywords, that drive the culling, because the machine uses them to rank every other document by its probability of relevance. A predictive coding CAR runs on whole documents, usually thousands of documents, not a few keywords. Whether these training documents should be disclosed is the issue in predictive coding search projects. Despite what some may say, there is no one set answer to that question that applies to all cases. Da Silva Moore does not purport to provide the only possible answer for all cases. None of the orders in the field do that. The judges involved know better.
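For readers who want to see the idea of training documents driving a probability ranking in concrete terms, here is a minimal sketch in Python. It is a toy only. It assumes a simple Naive Bayes word-count model, which is my stand-in for illustration and not the method of any particular vendor's software, and the function names and sample documents are all hypothetical. The point is merely that the human codes whole documents, and the machine then ranks the unreviewed documents by estimated probability of relevance.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a document into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Build a toy model from (text, is_relevant) pairs coded by a human reviewer."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_relevant in labeled_docs:
        doc_counts[is_relevant] += 1
        word_counts[is_relevant].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def probability_relevant(model, text):
    """Estimate the probability that a document is relevant (Naive Bayes, add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in (True, False):
        score = math.log((doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Turn the two log scores into a probability without numeric underflow.
    top = max(log_scores.values())
    rel = math.exp(log_scores[True] - top)
    irr = math.exp(log_scores[False] - top)
    return rel / (rel + irr)


def rank(model, docs):
    """Order unreviewed documents from most to least likely relevant."""
    return sorted(docs, key=lambda d: probability_relevant(model, d), reverse=True)


# Hypothetical training set: whole documents coded relevant (True) or irrelevant (False).
training = [
    ("terminate the employee after the complaint", True),
    ("severance package for the terminated employee", True),
    ("lunch menu for friday", False),
    ("fantasy football league picks", False),
]
unreviewed = [
    "friday football lunch",
    "employee complaint about severance",
    "quarterly budget",
]

model = train(training)
ranked = rank(model, unreviewed)
for doc in ranked:
    print(f"{probability_relevant(model, doc):.2f}  {doc}")
```

Notice that no keyword list appears anywhere in the ranking step. Change the coded training documents and the ranking changes with them, which is why the disclosure question in predictive coding centers on the training documents and the process, not on keywords.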

For now it is still an open question as to how far the work product doctrine applies to predictive coding search processes. It may not apply at all. You may have to share your mental impressions, your basic search plan. You may have to disclose what you did and why. You may have to explain what predictive coding search methods you used, but not disclose your entire training set, not disclose your irrelevant documents. Even if courts hold to the contrary that your search methods are protected by work product, you may want to share the methods anyway to save time and client money. It may be in your client’s best interests to explain what you did. You will not lose any competitive advantage by doing so.

Sell Proportionality by Explaining How Good Your Search Is

I for one enjoy sharing how I did a complex, advanced analytics, iterative, multimodal search project. I do not enjoy this kind of show and tell because I like the sound of my own voice (well, ok that may be part of it), but because it helps persuade the requesting party that they are getting the most bang possible for the buck. It supports my proportionality argument. The abilities of predictive coding are truly mind-blowing. When done right, with good software (and, the truth is, most of the software out there is not good, is not bona fide active machine learning, and is not used properly), it can accomplish miracles. At least from our three-dimensional, keyword conditioned perspective, the search results seem miraculous.

If the requesting party, or the judge, is still not convinced, and insists on an explanation of how the software black box really works, then you can bring in an information scientist familiar with the software you used to explain it all. They like to talk almost as much as lawyers. They will go on and on. The other side may be sorry they asked, and most judges will be sorry they allowed a Daubert hearing for discovery. To me it is fascinating to hear how near-infinite-dimensional vector space probability analysis really works. So fascinating, in fact, that I have lined up a guest blog by Jason R. Baron and Jesse B. Freeman, a whiz-kid math genius he found, that introduces multidimensional support vector machines to lawyers. It is coming soon.

Predictive coding, when done right, is the best thing that ever happened to a requesting party in a complex ESI case. So if you are proud of what you have got, open up your predictive coding kimono for the other side to see. They should be impressed by the advanced methods, by the not just reasonable, but stellar, multidimensional efforts. After all, did they use hyper-dimensional based probability algorithms in their search? Did they use an iterative multimodal approach with both statistical quality control and quality assurance methods? Your best practices in search justify a low-cost proportional approach.

You Should Not Have to Share Irrelevant Documents Unless and Until a Showing is Made that Your Predictive Coding Search Efforts Were Unreasonable

Does this mean that you must share all of the actual documents used in the machine training? Absolutely not. I have been talking about sharing process, not documents. The attorney work product has nothing to do with sharing, or not sharing, the client’s documents. We are not talking about documents prepared in connection with litigation. We are talking about documents prepared in the ordinary course of business, of life. These documents have their own protection from disclosure, for instance, the privilege held by the client, not the lawyer, the attorney-client privilege. A lawyer cannot waive that. Only a client can. There are many other protections that apply to ESI, such as trade-secrets, or personal privacy laws. But the most central protection is that built into the rules, where only relevant information must be disclosed. Irrelevant ESI is not discoverable.

That means that unless there is some dispute as to the adequacy of the search efforts, only the relevant documents in a training set used in predictive coding need be produced. There is no legal basis in the initial stage to require production of all documents, including irrelevant documents.

Still, the client may choose to do so in certain cases, with certain sets of documents, and with certain protections in place. But that has nothing to do with work product analysis. It typically has to do with building confidence and trust between litigation parties and avoiding expensive disputes. It has to do with concerns that a reasonable effort to find relevant ESI is being made. It has to do with mitigating risk by cooperation and participation. It has to do with avoiding motions for sanctions, motions predicated upon allegations of an unreasonable search effort.

If a requesting party is kept in the dark, and the producing party does not reveal their search kung fu, and if the requesting party later makes a good cause showing that the producing party’s search effort was unreasonable, then you are facing possible sanctions and the dreaded redo. All a requesting party will have to do to show good cause is provide proof that the producing party missed certain key hot documents. Then in a sanctions motion the reasonability of the producing party’s search efforts becomes relevant. This would usually happen in the context of a post-production motion.

Then, and only then, would there be legal authority to require production of the irrelevant documents used in the training sets. That is because these formerly irrelevant documents would then become relevant. They would then be relevant to the issue of reasonable efforts. They would then be discoverable, assuming the producing party claims its efforts were reasonable.

I contend that this magical transformation from irrelevant to relevant requires a good cause showing. There must at least be a justiciable issue of fact that the search efforts were unreasonable before the irrelevant documents in a training set become discoverable. Based on my CLE efforts, and listening to judges from around the country, I am confident that the judges, when they hear these arguments, will agree with this logic. They will, in most cases, only require disclosure of process, and not also require disclosure of irrelevant documents.

A recent case out of New York confirms this belief. Hinterberger v. Catholic Health Systems, Inc., 1:08-cv-00380-WMS-LGF, U.S. Dist. Crt., W. District of N.Y., Order dated 5/21/13. In this complicated case with tons of ESI the defendant finally gave up on using keyword search alone to try to find which of its millions of emails were likely relevant and needed to be reviewed for possible production. After Magistrate Judge Leslie Foschio pointed out the Da Silva Moore case to the parties, the defendant decided to drive a more advanced CAR, one that included a predictive coding search engine. They hoped that would allow them to accomplish their search task within budget.

Before the predictive coding work began, however, plaintiff demanded that the exact same Da Silva Moore search protocol be followed, and that they be allowed to participate in the seed set generation, including a quick peek at all unprivileged training documents. Defendant objected, as well it should. No good cause had been shown to force such disclosure of irrelevant documents. Defendant argued that plaintiff’s motion was premature, and that plaintiff misread the intent of Judge Peck’s order in Da Silva Moore. Defendant asserted that plaintiff had no right of access to Defendant’s seed-set documents at this time. Judge Foschio essentially agreed with defendant, and denied plaintiff’s motion to compel, but did so largely upon defense counsel’s representation that they would cooperate with plaintiff. Judge Foschio got it right, so did the defendant: cooperation is the key, not disclosure of irrelevant documents or particular protocols agreed to in other cases.

Conclusion

As the predictive coding landscape matures, and as counsel learn to cooperate and make disclosure of methods used, there will be no need to build trust by disclosing irrelevant documents in training sets. Judges will not have to go there. Counsel will only need to disclose the multimodal search processes used, including details of the predictive coding methods. I do this all the time, although in much greater detail than required. See, e.g.: summary of CAR and list of over thirty articles on predictive coding theory and methods; Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron (detailed description of a multimodal search of 699,082 ENRON documents); Borg Challenge (description of same search using a semi-automated monomodal method). Worst case scenario, counsel may have to explain the black boxes of the predictive coding software they used. All they need do for that is pull a science rabbit out of their hat who will explain hyper-dimensional probability vectors, regression analysis and the like. There are experts for that too. Every good software company has at least one. I know several.

In simple keyword search cases a similar logic will prevail. Counsel will have to disclose the keywords used in final culling, but not the documents deemed irrelevant by keywords or second-pass relevance attorney review teams. The instruction books prepared for these human review teams, if any, will also be kept secret, but not the general methods used, such as the quality controls. Case specific reviewer instruction manuals are documents prepared for litigation. That is classic work product. Moreover, they typically include far more information than keyword disclosure, or search method disclosure. They often explain an attorney’s strategies and theories of a case. Here a clear line still exists to protect a lawyer’s work product. Yes, the kimono still lives, so too does the concept of relevance.

