Reinventing the Wheel: My Discovery of Scientific Support for “Hybrid Multimodal” Search

April 21, 2013

reinventing the wheel

Getting predictive coding software is just part of the answer to the high cost of legal review. Much more important is how you use it, which in turn depends, at least in part, on which software you get. That is why I have been focusing on methods for using the new technologies. I have been advocating for what I call the hybrid multimodal method. I created this method on my own over many years of legal discovery. As it turns out, I was merely reinventing the wheel. These methods are already well-established in the scientific information retrieval community. (Thanks to information scientist Jeremy Pickens, an expert in collaborative search, who helped me to find the prior art.)

In this blog I will share some of the classic information science research that supports hybrid multimodal. It includes the work of Gary Marchionini, Professor and Dean of the School of Information and Library Science of U.N.C. at Chapel Hill, and UCLA Professor Marcia J. Bates, who has advocated for a multimodal approach to search since 1989. Study of their writings has enabled me to better understand and refine my methods. I hope you will also explore with me the literature in this field. I provide links to some of the books and literature in this area for your further study.

Advanced CARs Require Completely New Driving Methods

First I need to set the stage for this discussion by use of the eight-step diagram shown below. This is one of the charts I created to teach the workflow I use in a typical computer assisted review (CAR) project. You have seen it here many times before. For a full description of the eight steps see the Electronic Discovery Best Practices page on predictive coding.

predictive coding work flow

The iterated steps four and five in this work-flow are unique to predictive coding review. They are where active learning takes place. The Grossman-Cormack Glossary defines active learning as:

An Iterative Training regimen in which the Training Set is repeatedly augmented by additional Documents chosen by the Machine Learning Algorithm, and coded by one or more Subject Matter Expert(s).

The Grossman-Cormack Glossary of Technology-Assisted Review, 2013 Fed. Cts. L. Rev. 7 (2013).

Beware of any so-called advanced review software that does not include these steps; it is not a bona fide predictive coding search engine. My preferred active learning process is threefold (a rough sketch in code follows the list):

1.  The computer selects documents for review where the software classifier is uncertain of the correct classification. This helps the classifier algorithms to learn by adding diversity to the documents presented for review. This in turn helps to locate outliers of a type your initial judgmental searches in step two (and  five) of the above diagram have missed. This is machine-selected sampling, and, according to a basic text in information retrieval engineering, a process is not a bona fide active learning search without this ability. Manning, Raghavan and Schutze, Introduction to Information Retrieval, (Cambridge, 2008) at pg. 309.

2.  Some reasonable percentage of the documents presented for human review in step five are selected at random. This again helps maximize recall and guard against premature focus on the types of relevant documents initially retrieved.

3.  Other relevant documents are found by a skilled reviewer using a variety of search techniques. This is called judgmental sampling. After the first round of training, a/k/a the seed set, judgmental sampling by a variety of search methods is used based on the machine-selected or randomly selected documents presented for review. Sometimes the subject matter expert (“SME”) human reviewer may follow a new search idea unrelated to the documents presented. Any kind of search can be used for judgmental sampling, which is why I call it a multimodal search. This may include some linear review of selected custodians or dates, parametric Boolean keyword searches, similarity searches of all kinds, concept searches, as well as several unique predictive coding probability searches.
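To make the machine-selected and random portions of this loop concrete, here is a minimal sketch in Python (using NumPy). The function name, the batch proportions, and the inputs are my own illustrative assumptions, not any vendor's actual interface; the judgmental, multimodal portion is deliberately left to the human searcher.

```python
import numpy as np

def select_training_batch(probs, already_reviewed, batch_size=100,
                          uncertainty_share=0.45, random_share=0.10, rng=None):
    """Pick the next batch of documents for human review and training.

    probs: each document's current predicted probability of relevance,
           from whatever classifier is in use.
    already_reviewed: boolean mask of documents a human has already coded.
    The remainder of the batch is reserved for judgmental (attorney-driven)
    selections made outside this function.
    """
    if rng is None:
        rng = np.random.default_rng()
    candidates = np.flatnonzero(~already_reviewed)

    # 1. Machine-selected uncertainty sampling: documents closest to 50%.
    n_uncertain = int(batch_size * uncertainty_share)
    distance_from_coin_flip = np.abs(probs[candidates] - 0.5)
    uncertain_ids = candidates[np.argsort(distance_from_coin_flip)[:n_uncertain]]

    # 2. Pure random sampling, to guard against premature focus on what
    #    the judgmental searches have already found.
    n_random = int(batch_size * random_share)
    remaining = np.setdiff1d(candidates, uncertain_ids)
    random_ids = rng.choice(remaining, size=min(n_random, remaining.size),
                            replace=False)

    # 3. Judgmental (multimodal) picks -- keyword, similarity, concept
    #    searches, etc. -- are added by the reviewer, not by this function.
    return np.concatenate([uncertain_ids, random_ids])
```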

The initial seed set generation, step two in the chart, should also use some random samples, plus judgmental multimodal searches. Steps three and six in the chart always use pure random samples and rely on statistical analysis. For more on the three types of sampling see my blog, Three-Cylinder Multimodal Approach To Predictive Coding.

My insistence on the use of multimodal judgmental sampling in steps two and five to locate relevant documents follows the consensus view of information scientists specializing in information retrieval, but is not followed by several prominent predictive coding vendors. They instead rely entirely on machine-selected documents for training, or even worse, rely entirely on randomly selected documents to train the software. In my writings I call these processes the Borg approach, after the infamous villains in Star Trek, the Borg, an alien race that assimilates people. (I further differentiate between three types of Borg in Three-Cylinder Multimodal Approach To Predictive Coding.) Like the Borg, these approaches unnecessarily minimize the role of individuals, the SMEs. They exclude other types of search to supplement an active learning process. I advocate the use of all types of search, not just predictive coding.

Hybrid Human Computer Information Retrieval

human-and-robots

In contradistinction to Borg approaches, where the machine controls the learning process, I advocate a hybrid approach where Man and Machine work together. In my hybrid CARs the expert reviewer remains in control of the process, and their expertise is leveraged for greater accuracy and speed. The human intelligence of the SME is a key part of the search process. In the scholarly literature of information science this hybrid approach is known as Human–computer information retrieval (HCIR).

The classic text in the area of HCIR, which I endorse, is Information Seeking in Electronic Environments (Cambridge 1995) by Gary Marchionini, Professor and Dean of the School of Information and Library Science of U.N.C. at Chapel Hill. Professor Marchionini speaks of three types of expertise needed for a successful information seeker:

1.  Domain Expertise. This is equivalent to what we now call SME, subject matter expertise. It refers to a domain of knowledge. In the context of law the domain would refer to particular types of lawsuits or legal investigations, such as antitrust, patent, ERISA, discrimination, trade-secrets, breach of contract, Qui Tam, etc. The knowledge of the SME on the particular search goal is extrapolated by the software algorithms to guide the search. If the SME also has System Expertise, and Information Seeking Expertise, they can drive the CAR themselves.   Otherwise, they will need a chauffeur with such expertise, one who is capable of learning enough from the SME to recognize the relevant documents.

2.  System Expertise. This refers to expertise in the technology system used for the search. A system expert in predictive coding would have a deep and detailed knowledge of the software they are using, including the ability to customize the software and use all of its features. In computer circles a person with such skills is often called a power-user. Ideally a power-user would have expertise in several different software systems.

3.  Information Seeking Expertise. This is a skill that is often overlooked in legal search. It refers to a general cognitive skill related to information seeking. It is based on both experience and innate talents. For instance, “capabilities such as superior memory and visual scanning abilities interact to support broader and more purposive examination of text.” Professor Marchionini goes on to say that: “One goal of human-computer interaction research is to apply computing power to amplify and augment these human abilities.” Some lawyers seem to have a gift for search, which they refine with experience, broaden with knowledge of different tools, and enhance with technologies. Others do not, or the gift is limited to interviews and depositions.

Id. at pgs.66-69, with the quotes from pg. 69.

All three of these skills are required for an attorney to attain expertise in legal search today, which is one reason I find this new area of legal practice so challenging. It is difficult, but, unlike this Penrose triangle, not impossible.

Penrose_triangle_Expertise

It is not enough to be an SME, or a power-user, or have a special knack for search. You have to be able to do it all. However, studies have shown that of the three skill-sets, System Expertise, which in legal search primarily means mastery of the particular software used, is the least important. Id. at 67. More important are the SMEs, those who have mastered a domain of knowledge. In Professor Marchionini’s words:

Thus, experts in a domain have greater facility and experience related to information-seeking factors specific to the domain and are able to execute the subprocesses of information seeking with speed, confidence, and accuracy.

Id. That is one reason that the Grossman Cormack glossary builds in the role of SMEs as part of their base definition of computer assisted review:

A process for Prioritizing or Coding a Collection of electronic Documents using a computerized system that harnesses human judgments of one or more Subject Matter Expert(s) on a smaller set of Documents and then extrapolates those judgments to the remaining Document Collection.

Grossman-Cormack Glossary at pg. 21 defining TAR.

According to Marchionini, Information Seeking Expertise, much like Subject Matter Expertise, is also more important than specific software mastery. Id. This may seem counter-intuitive in the age of Google, where an illusion of simplicity is created by typing in words to find websites. But legal search of user-created data is a completely different type of search task than looking for information from popular websites. In the search for evidence in a litigation, or as part of a legal investigation, special expertise in information seeking is critical, including especially knowledge of multiple search techniques and methods. Again quoting Professor Marchionini:

Expert information seekers possess substantial knowledge related to the factors of information seeking, have developed distinct patterns of searching, and use a variety of strategies, tactics and moves.

Id. at 70.

In the field of law this kind of information seeking expertise includes the ability to understand and clarify what the information need is, in other words, to know what you are looking for, and articulate the need into specific search topics. This important step precedes the actual search, but should thereafter continue as an integral part of the process. As one of the basic texts on information retrieval written by Gordon Cormack, et al, explains:

Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic …

Buttcher, Clarke & Cormack, Information Retrieval: Implementation and Evaluation of Search Engines (MIT Press, 2010) at pg. 5.

The importance of pre-search refining of the information need is stressed in the first step of the above diagram of my methods, ESI Discovery Communications. It seems very basic, but is often under appreciated, or overlooked entirely in the litigation context. In legal discovery information needs are often vague and ill-defined, lost in overly long requests for production and adversarial hostility. In addition to concerted activity up front to define relevance, the issue of information need should be kept in mind throughout the project. Typically our understanding of relevance evolves as our understanding of what really happened in a dispute emerges and grows.

At the start of an e-discovery project we are almost never searching for specific known documents. We never know for sure what information we will discover. That is why the phrase information seeking is actually more appropriate for legal search than information retrieval. Retrieval implies that particular facts exist and are already known; we just need to look them up. Legal search is not like that at all. It is a process of seeking and discovery. Again quoting Professor Marchionini:

The term information seeking is preferred to information retrieval because it is more human oriented and open ended. Retrieval implies that the object must have been “known” at some point; most often, those people who “knew” it organized it for later “knowing” by themselves or someone else. Seeking connotes the process of acquiring knowledge; it is more problem oriented as the solution may or may not be found.

Information Seeking in Electronic Environments, supra at 5-6.

Legal search is a process of seeking information, not retrieving information. It is a process of discovery, not simple look-up of known facts. More often than not in legal search you find the unexpected, and your search evolves as it progresses. Concept shift happens. Or you find nothing at all. You discover that the requesting party has sent you hunting for Unicorns, for evidence that simply does not exist. For example, the plaintiff alleges discrimination, but a search through tens of thousands of defendant’s emails shows no signs of it.

Information scientists have been talking about the distinction between machine oriented retrieval and human oriented seeking for decades. The type of discovery search that lawyers do is referred to in the literature (without any specific mention of law or legal search) as exploratory search. See: White & Roth, Exploratory Search: Beyond the Query-Response Paradigm (Morgan & Claypool, 2009). Ryen W. White, Ph.D., a senior researcher at Microsoft Research, builds on the work of Marchionini and gives this formal definition of exploratory search:

Exploratory search can be used to describe an information-seeking problem context that is open-ended, persistent, and multi-faceted; and to describe information-seeking processes that are opportunistic, iterative, and multi-tactical. In the first sense, exploratory search is commonly used in scientific discovery, learning, and decision-making contexts. In the second sense, exploratory tactics are used in all manner of information seeking and reflect seeker preferences and experience as much as the goal.

Id. at 6. He could easily have added legal discovery to this list, but like most information scientists, seems unacquainted with the law and legal search.

White and Roth point out that exploratory search typically uses a multimodal (berrypicking) approach to information needs that begin as vague notions. A many-methods-approach helps the information need to evolve and become more distinct and meaningful over time. They contend that the information-seeking strategies need to be supported by system features and user interface designs, bringing humans more actively into the search process. Id. at 15. That is exactly what I mean by a hybrid process where lawyers are actively involved in the search process.

The fully Borg approach gets it all wrong. Its proponents use a look-up approach to legal search that relies as much as possible on fully automated systems. The user interface for this type of information retrieval software is designed to keep humans out of the search, all in the name of ease of use and impartiality. The software designers of these programs, typically engineers working without adequate input from lawyers, erroneously assume that e-discovery is just a retrieval task. They erroneously assume that predictive coding always starts with well-defined information needs that do not evolve with time. Some engineers and lit-support techs may fall for this myth, but all practicing lawyers know better. They know that legal discovery is an open-ended, persistent, and multi-faceted process of seeking.

Hybrid Multimodal Computer Assisted Review

Professor Marchionini notes that information seeking experts develop their own search strategies, tactics and moves. The descriptive name for the strategies, tactics and moves that I have developed for legal search is Hybrid Multimodal Computer Assisted Review Bottom Line Driven Proportional Strategy. See, e.g., Bottom Line Driven Proportional Review (2013). For a recent federal opinion approving this type of hybrid multimodal search and review see In Re: Biomet M2a Magnum Hip Implant Products Liability Litigation (MDL 2391), Case No. 3:12-MD-2391 (N.D. Ind., April 18, 2013); see also Indiana District Court Approves Multimodal Computer Assisted Review.

I refer to this method as multimodal because, although the predictive coding type of searches predominate (shown on the diagram below as Intelligent Review or IR), other modes of search are also employed. As described, I do not rely entirely on random documents, or computer-selected documents. The other types of methods used in a multimodal process are shown in this search pyramid.

Pyramid Search diagram

Most information scientists I have spoken to agree that it makes sense to use multiple methods in legal search and not just rely on any single method. UCLA Professor Marcia J. Bates first advocated for using multiple search methods back in 1989, which she called berrypicking. Bates, Marcia J., The Design of Browsing and Berrypicking Techniques for the Online Search Interface, Online Review 13 (October 1989): 407-424. As Professor Bates explained in 2011 on Quora:

An important thing we learned early on is that successful searching requires what I called “berrypicking.” … Berrypicking involves 1) searching many different places/sources, 2) using different search techniques in different places, and 3) changing your search goal as you go along and learn things along the way. This may seem fairly obvious when stated this way, but, in fact, many searchers erroneously think they will find everything they want in just one place, and second, many information systems have been designed to permit only one kind of searching, and inhibit the searcher from using the more effective berrypicking technique.

This berrypicking approach, combined with HCIR exploratory search, is what I have found from practical experience works best with legal search. They are the Hybrid Multimodal aspects of my Computer Assisted Review Bottom Line Driven Method.

Conclusion

Predictive_coding_triangles

Now that we have shown that courts are very open to predictive coding, we need to move on to a different, more sophisticated discussion. We need to focus on analysis of different predictive coding search methods, the strategies, tactics and moves. We also need to understand and discuss what skill-sets and personnel are required to do it properly. Finally, we need to begin to discuss the different types of predictive coding software.

There is much more to discuss concerning the use of predictive coding than whether or not to make disclosure of seed sets or irrelevant training documents, although that, and court approval, are the only things most expert panels have talked about so far. The discussion on disclosure and work-product should continue, but let us also discuss the methods and skills, and, yes, even the competing software.

We cannot look to vendors alone for the discussion and analysis of predictive coding software and competing methods of use. Obviously they must focus on their own software. This is where independent practitioners have an important role to play in the advancement of this powerful new technology.

Join with me in this discussion by your comments below or send me ideas for proposed guest blogs. Vendors are of course welcome to join in the discussion, and they make great hosts for search SME forums. Vendors are an important part of any successful e-discovery team. You cannot do predictive coding review without their predictive coding software, and, as with any other IT product, some software is much better than others.


An Elusive Dialogue on Legal Search: Part Two – Hunger Games and Hybrid Multimodal Quality Controls

September 3, 2012

This is a continuation of last week’s blog, An Elusive Dialogue on Legal Search: Part One where the Search Quadrant is Explained. The quadrant and random sampling are not as elusive as Peeta Mellark in The Hunger Games shown right, but almost. Indeed, as most of us lawyers did not major in math or information science, these new techniques can be hard to grasp. Still, to survive in the vicious games often played these days in litigation, we need to  find a way. If we do, we can not only survive, we can win, even if we are from District 12 and the whole world is watching our every motion.

The emphasis in the second part of this essay is on quality controls and how such efforts, like search itself, must be multimodal and hybrid. We must use a variety of quality assurance methods – we must be multimodal. To use the Hunger Games analogy, we must use both bow and rope, and camouflage too. And we must employ both our skilled human legal intelligence and our computer intelligence – we must be hybrid; Man and machine, working together in perfect harmony, but with Man in charge. That is the only way to survive the Hunger Games of litigation in the 21st Century. The only way the odds will be ever in your favor.

Recall and Elusion

But enough fun with Hunger Games, Search Quadrant terminology, nothingness, and math, and back to Herb Roitblat’s long comment on my earlier blog, Day Nine of a Predictive Coding Narrative.

Recall and Precision are the two most commonly used measures, but they are not the only ones. The right measure to use is determined by the question that you are trying to answer and by the ease of asking that question.

Recall and Elusion are both designed to answer the question of how complete we were at retrieving all of the responsive documents. Recall explicitly asks “of all of the responsive documents in the collection, what proportion (percentage) did we retrieve?” Elusion explicitly asks “What proportion (percentage) of the rejected documents were truly responsive?” As recall goes up, we find more of the responsive documents, elusion, then, necessarily goes down; there are fewer responsive documents to find in the reject pile. For a given prevalence or richness as the YY count goes up (raising Recall), the YN count has to go down (lowering Elusion). As the conversation around Ralph’s report of his efforts shows, it is often a challenge to measure recall.

This last comment was referring to prior comments made in my same Day Nine Narrative blog by two other information scientists William Webber and Gordon Cormack. I am flattered that they all seem to read my blog, and make so many comments, although I suspect they may be master game-makers of sorts like we saw in Hunger Games.

The earlier comments of Webber and Cormack pertained to point projection of yield and the lower and upper intervals derived from random samples. All things I was discussing in Day Nine. Gordon’s comments focused on the high-end of possible interval error and said you cannot know anything for sure about recall unless you assume the worst case scenario high-end of the confidence interval. This is true mathematically and scientifically, I suppose (to be honest, I do not really know if it is true or not, but I learned long ago not to argue science with a scientist, and they do not seem to be quibbling amongst themselves, yet.) But it certainly is not true legally, where reasonability and acceptable doubt (a kind of level of confidence), such as a preponderance of the evidence, are always the standard, not perfection and certainty. It is not true in manufacturing quality controls either.

But back to Herb’s comment, where he picks up on their math points and elaborates concerning the Elusion test that I used for quality control.

Measuring recall requires you to know or estimate the total number of responsive documents. In the situation that Ralph describes, responsive documents were quite rare, estimated at around 0.13% prevalence. One method that Ralph used was to relate the number of documents his process retrieved with his estimated prevalence. He would take as his estimate of Recall, the proportion of the estimated number of responsive documents in the collection as determined by an initial random sample.

Unfortunately, there is considerable variability around that prevalence estimate. I’ll return to that in a minute. He also used Elusion when he examined the frequency of responsive documents among those rejected by his process. As I argued above, Elusion and Recall are closely related, so knowing one tells us a lot about the other.

One way to use Elusion is as an accept-on-zero quality assurance test. You specify the maximum acceptable level of Elusion, as perhaps some reasonable proportion of prevalence. Then you feed that value into a simple formula to calculate the sample size you need (published in my article the Sedona Conference Journal, 2007). If none of the documents in that sample comes up responsive, then you can say with a specified level of confidence that responsive documents did not occur in the reject set at a higher rate than was specified. As Gordon noted, the absence of a responsive document does not prove the absence of responsive documents in the collection.

The Sedona Conference Journal article Herb referenced here is called Search & Information Retrieval Science. Also, please recall that my narrative states, without using the exact same language, that my accept-on-zero quality assurance test pertained to Highly Relevant documents, not relevant documents. I decided in advance that if my random sample of excluded documents included any Highly Relevant documents, then I would consider the test a failure and initiate another round of predictive coding. My standard for merely relevant documents was secondary and more malleable, depending on the probative value and uniqueness of any such false negatives. False negatives are what Herb calls YN, and what we now know are called D in the Search Quadrant, with totals shown again below.
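The sample-size arithmetic behind an accept-on-zero test is simple enough to sketch. The function below uses the standard zero-acceptance sampling formula; Herb's 2007 Sedona Conference Journal article may state it somewhat differently, so treat this as an illustration rather than his exact method.

```python
import math

def accept_on_zero_sample_size(max_elusion, confidence=0.95):
    """Sample size n such that, if a random sample of n documents from the
    discard pile contains zero responsive documents, we can say with the
    stated confidence that the true elusion rate is at most max_elusion.
    Zero-acceptance sampling requires (1 - max_elusion) ** n <= 1 - confidence.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_elusion))

# To assert elusion is no more than 0.1% with 95% confidence, sample 2,995
# discarded documents; relaxing the bound to 0.5% drops the sample to 598.
print(accept_on_zero_sample_size(0.001))   # 2995
print(accept_on_zero_sample_size(0.005))   # 598
```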

Back to Herb’s comment. Herb, by the way, looks a bit like President Snow, don’t you think? He is now going to start talking about Recall, which, as we now know, is A/G, a measure of accuracy that I did not directly calculate or claim.

If you want to directly calculate the recall rate after your process, then you need to draw a large enough random sample of documents to get a statistically useful sample of responsive documents. Recall is the proportion of responsive documents that have been identified by the process. The 95% confidence range around an estimate is determined by the size of the sample set. For example, you need about 400 responsive documents to know that you have measured recall with a 95% confidence level and a 5% confidence interval. If only 1% of the documents are responsive, then you need to work pretty hard to find the required number of responsive documents. The difficulty of doing consistent review only adds to the problem. You can avoid that problem by using Elusion to indirectly estimate Recall.
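To put a number on how hard “pretty hard” can be, here is a quick back-of-the-envelope sketch. The 400-responsive-document figure comes from the passage above; the prevalence values are simply illustrative.

```python
needed_responsive = 400   # responsive documents needed in the sample (per the quote above)
for prevalence in (0.50, 0.10, 0.01):
    docs_to_review = round(needed_responsive / prevalence)
    print(f"{prevalence:.0%} prevalence -> sample and review about {docs_to_review:,} documents")
# 50% prevalence -> sample and review about 800 documents
# 10% prevalence -> sample and review about 4,000 documents
# 1% prevalence -> sample and review about 40,000 documents
```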

The Fuzzy Lens Problem Again

The reference to the difficulty of doing consistent review points to the well-documented inconsistency of classification among human reviewers. That is what I called the fuzzy lens problem in Secrets of Search, Part One, and it is what makes recall such an ambiguous measure in legal search. It is ambiguous because when large data sets are involved the value for G (total relevant) is dependent upon human reviewers. The inconsistency studies show that the gold standard of measurement by human review is actually just dull lead.

Let me explain again in shorthand, and please feel free to refer to the Secrets of Search trilogy and the original studies for the full story. Roitblat’s own well-known study of a large-scale document review showed that human reviewers only agreed with each other on average 28% of the time. Roitblat, Kershaw, and Oot, Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010. An earlier study by one of the leading information scientists in the world, Ellen M. Voorhees, found a 40% agreement rate between human reviewers. Variations in relevance judgments and the measurement of retrieval effectiveness, 36:5 Information Processing & Management 697, 701 (2000). Voorhees concluded that with 40% agreement rates it was not possible to measure recall any higher than 65%. Information scientist William Webber calculated that with a 28% agreement rate a recall rate cannot be reliably measured above 44%. Herb Roitblat and I have dialogued about this issue before, most recently in Reply to an Information Scientist’s Critique of My “Secrets of Search” Article.

I prepared the graphics below to illustrate this problem of measurement and the futility of recall calculations when the measurements are made by inconsistent reviewers.

Until we can crack the inconsistent reviewer problem, we can only measure recall vaguely, as we see on the left, or at best the center, and can only make educated guesses as to the reality on the right. The existence of the error has been proven, but as Maura Grossman and Gordon Cormack point out, there is a dispute as to the cause of the error. In one analysis that they did of TREC results they concluded that the inconsistencies were caused by human error, not a difference of opinion on what was relevant or not. Inconsistent Responsiveness Determination in Document Review: Difference of Opinion or Human Error? But, regardless of the cause, the error remains.

Back to Herb’s Comment.

One way to assess what Ralph did is to compare the prevalence of responsive documents in the set before doing predictive coding with their prevalence after using predictive coding to remove as many of the responsive documents as possible. Is there a difference? An ideal process will have removed all of the responsive documents, so there will be none left to find in the reject pile.

That question of whether there is a difference leads me to my second point. When we use a sample to estimate a value, the size of the sample dictates the size of the confidence interval. We can say with 95% confidence that the true score lies within the range specified by the confidence interval, but not all values are equally likely. A casual reader might be led to believe that there is complete uncertainty about scores within the range, but values very near to the observed score are much more likely than values near the end of the confidence interval. The most likely value, in fact, is the center of that range, the value we estimated in the first place. The likelihood of scores within the confidence interval corresponds to a bell-shaped curve.

This is a critical point. It means that the point projections, a/k/a, the spot projections, can be reliably used. It means  that even though you must always qualify any findings that are based upon random sampling by stating the applicable confidence interval, the possible range of error, you may still reliably use the observed score of the sample in most data sets, if a large enough sample size is used to create low confidence interval ranges. Back to Herb’s Comment.

Moreover, we have two proportions to compare, which affects how we use the confidence interval. We have the proportion of responsive documents before doing predictive coding. The confidence interval around that score depends on the sample size (1507) from which it was estimated. We have the proportion of responsive documents after predictive coding. The confidence interval around that score depends on its sample size (1065). Assuming that these are independent random samples, we can combine the confidence intervals (consult a basic statistics book for a two sample z or t test or http://facstaff.unca.edu/dohse/Online/Stat185e/Unit3/St3_7_TestTwoP_L.htm), and determine whether these two proportions are different from one another (0.133% vs. 0.095%). When we do this test, even with the improved confidence interval, we find that the two scores are not significantly different at the 95% confidence level. (try it for yourself here: http://www.mccallum-layton.co.uk/stats/ZTestTwoTailSampleValues.aspx.). In other words, the predictive coding done here did not significantly reduce the number of responsive documents remaining in the collection. The initial proportion 2/1507 was not significantly higher than 1/1065. The number of responsive documents we are dealing with in our estimates is so small, however, that a failure to find a significant difference is hardly surprising.
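For readers who would rather reproduce the comparison than take the online calculator's word for it, here is a minimal two-proportion z-test sketch in Python. The counts are the ones discussed in the narrative: 2 responsive documents in the initial 1,507-document sample and 1 in the final 1,065-document sample.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sample z-test for a difference in proportions, pooled variance."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-tailed
    return z, p_value

# Prevalence before predictive coding (2 of 1,507) vs. after (1 of 1,065):
z, p = two_proportion_z_test(2, 1507, 1, 1065)
print(round(z, 2), round(p, 2))   # roughly z = 0.28, p = 0.78 -- not a significant difference
```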

Herb’s paragraph appears to me to have assumed that my final quality control test was a test for Recall, and it uses the upper limit, the worst case scenario, as the defining measurement. Again, as I said in the narrative and in replies to other comments, I was testing for Elusion, not Recall. Further, the Elusion test (D/F) here was for Highly Relevant documents, not merely relevant documents, and none were found, 0%. None were found in the first random sample at the beginning of the project, and none were found in the second random sample at the end. The yields referred to by Herb are for relevant documents, not Highly Relevant. The value of D, False Negatives, in the elusion test was thus zero. As we have discussed, when that happens, where the numerator in a fraction is zero, the result of the division is also always zero, which, in an Elusion test, is exactly what you are looking for. You are looking for nothing and happy to find it.

The final sentence in Herb’s last paragraph is key to understanding his comment: The number of responsive documents we are dealing with in our estimates is so small, however, that a failure to find a significant difference is hardly surprising. It points to the inherent difficulty of using random sampling measurements of recall in low yield document sets where the prevalence is low. But there is still some usefulness for random sampling in these situations as the conclusion of his Comment shows.

Still, there is other information that we can glean from this result. The difference in the two proportions is approximately 28%. Predictive coding reduced by 28% the number of responsive documents unidentified in the collection. Recall, therefore, is also estimated to be 28%. Further, we can use the information we have to compute the precision of this process as approximately 22%. We can use the total number of documents in the collection, prevalence estimates, and elusion to estimate the entire 2 x 2 decision matrix.

For eDiscovery to be considered successful we do not have to guarantee that there are no unidentified responsive documents, only that we have done a reasonable job searching for them. The observed proportions do have some confidence interval around them, but they remain as our best estimate of the true percentage of responsive documents both before predictive coding and after. We can use this information and a little basic algebra to estimate Precision and Recall without the huge burden of measuring Recall directly.
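Herb's closing suggestion, estimating the whole 2 x 2 matrix from prevalence, elusion, and the size of the retrieved set, can be written out as a small function. Only the before and after sample proportions below come from the narrative; the retrieved-set size is a purely hypothetical stand-in, so the printed recall and precision illustrate the method, not Herb's exact figures.

```python
def estimate_quadrant(total_docs, prevalence, retrieved, elusion):
    """Estimate the 2x2 decision matrix from measurable quantities:
    collection size, sampled prevalence before review, the number of
    documents the process marked responsive, and sampled elusion
    (the rate of responsive documents left in the discard pile)."""
    G = prevalence * total_docs   # estimated total responsive documents
    F = total_docs - retrieved    # size of the discard pile
    D = elusion * F               # estimated false negatives left behind
    A = max(G - D, 0.0)           # estimated true positives
    return {"true_positives": A,
            "false_negatives": D,
            "recall": A / G if G else 0.0,
            "precision": A / retrieved if retrieved else 0.0}

# 2/1507 and 1/1065 are the sampled proportions from the narrative; the
# retrieved count of 1,200 documents is hypothetical.
print(estimate_quadrant(699_082, 2 / 1507, 1_200, 1 / 1065))
```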

These are great points made by Herb Roitblat in his concluding paragraph regarding reasonability. It shows how lawyer-like he has become after working with our kind for so many years, rather than professor types like my brother in the first half of his career. Herb now well understands the difference between law and science and what this means for legal search.

Law Is Not a Science, and Neither Is Legal Search

To understand the numbers and the need for reasonable efforts that accept high margins of error, we must understand the futility of increasing sample sizes to try to cure the upper limit of confidence. William Webber in his Comment of August 6, 2012 at 10:28 pm said that “it is, unfortunately, very difficult to place a reassuring upper bound on a very rare event using random sampling.” (emphasis added) Dr. Webber goes on to explain that to attain even a 50% confidence interval would require a final quality control sample of 100,000 documents. Remember, there were only 699,082 documents to begin with, so that is obviously no solution at all. It is about as reassuring as the Hunger Games slogan, may the odds be ever in your favor, when we all know that all but 1 of the 24 gamers must die.

Aside from the practical cost and time issues, the fuzzy lens problem of poor human judgments also makes the quest for reassuring bounds of error a fool’s errand. The perfection is illusory. It cannot be attained, or more correctly put, if you do attain high recall in a large data set, you will never be able to prove it. Do not be fooled by the slogans and the flashy, facile analysis.

Fortunately, the law has long recognized the frailty of all human endeavors. The law necessarily has different standards for acceptable error and risks than does math and science. The less-than-divine standards also apply to manufacturing quality control where small sample sizes have long been employed for acceptable risks. There too, like in a legal search for relevance, the prevalence of defective items sampled for is typically very low.

Math and science demand perfection. But the law does not. We demand reasonability and good faith, not perfection. Some scientists may think that we are settling, but it is more like practical realism, and, is certainly far better than unreasonable and bad faith. Unlike science and math, the law is used to uncertainties. Lawyers and judges are comfortable with that. For example, we are reassured enough  to allow civil convictions when a judge or jury decides that it is more likely than not that the defendant is at fault, a 51% standard of doubt. Law and justice demand reasonable efforts, not perfection.

I know Herb Roitblat agrees with me because this is the fundamental thesis of the fine paper he wrote with two lawyers, Patrick Oot and Anne Kershaw, entitled: Mandating Reasonableness in a Reasonable Inquiry. At pages 557-558 they sum up saying (footnote omitted):

We do not suggest limiting the court system’s ability to discover truth. We simply anticipate that judges will deploy more reasonable and efficient standards to determine whether a litigant met his Rule 26(g) reasonable inquiry obligations. Indeed, both the Victor Stanley and William A. Gross Construction decisions provide a primer for the multi-factor analysis that litigants should invoke to determine the reasonableness of a selected search and review process to meet the reasonable inquiry standard of Rule 26(f): 1. Explain how what was done was sufficient; 2. Show that it was reasonable and why; 3. Set forth the qualifications of the persons selected to design the search; 4. Carefully craft the appropriate keywords with input from the ESI’s custodians as to the words and abbreviations they use; and 5. Use quality control tests on the methodology to assure accuracy in retrieval and the elimination of false positives.

As to the fifth criterion we are discussing here, quality control tests, Roitblat, Oot and Kershaw assert in their article at page 551 that: “A litigant should sample at least 400 results of both responsive and non-responsive data.” This is the approximate sample size when using a 95% confidence level and a 5% confidence interval. (Note in my sampling I used less than a 3% confidence interval with a much larger sample size of 1,065 documents.) To support this assertion that a sample size of 400 documents is reasonable, the authors in footnote 77 refer to an email they have on file from Maura Grossman regarding legal search of data sets in excess of 100,000 documents, which concluded with the statement:

Therefore, it seemed to me that, for the average matter with a large amount of ESI, and one which did not warrant hiring a statistician for a more careful analysis, a sample size of 400 to 600 documents should give you a reasonable view into your data collection, assuming the sample is truly randomly drawn.

Personally, I think a larger sample size than 400-600 documents is needed for quality control tests in large cases. The efficacy of this small calculated sample size using a 5% confidence interval assumes a prevalence of 50%, in other words, that half of the documents sampled are relevant. This is an obvious fiction in all legal search, just as it is in all sampling for defective manufacturing goods. That is why I sampled 1,065 documents using 3%. Still, in smaller cases, it may be very appropriate to just sample 400-600 documents using a 5% interval. It all depends, as I will elaborate further in the conclusion.
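The sample sizes being traded off here all come from the classic formula for estimating a proportion. A short sketch, with the worst-case 50% prevalence assumption built in and a finite population correction added mostly to show how little it changes at this collection size:

```python
import math

def sample_size(margin, z=1.96, prevalence=0.5, population=None):
    """Sample size to estimate a proportion within +/- margin at the
    confidence level implied by z (1.96 for 95%), assuming the stated
    prevalence; optional finite population correction."""
    n = (z ** 2) * prevalence * (1 - prevalence) / margin ** 2
    if population:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

print(sample_size(0.05))                      # 385  -- the ~400-document rule of thumb
print(sample_size(0.03, population=699_082))  # 1066 -- close to the 1,065 used here
```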

But regardless, all of these scholars of legal search make the valid point that only reasonable efforts are required in quality control sampling, not perfection. We have to accept the limited usefulness of random sampling alone as a quality assurance tool because of the margins of error inherent in sampling of the low prevalence data sets common in legal search. Fortunately, random sampling is not our only quality assurance tool. We have many other methods to show reasonable search efforts.

Going Beyond Reliance on Random Sampling Alone to a Multimodal Approach

Random sampling is not a magic cure-all that guarantees quality, or definitively establishes the reasonability of a search, but it helps. In low yield datasets, where there is a low percentage of relevant documents in the total collection, the value of random sampling for Recall is especially suspect. The comments of our scientist friends have shown that. There are inherent limitations to random sampling.

Ever increasing sample sizes are not the solution, even if that was affordable and proportionate. Confidence intervals in sampling of less than two or three percent are generally a waste of time and money. (Remember the sampling statistics rule of thumb of 2=4 that I have explained before wherein a halving of confidence interval error rate, say from 3% to 1.5%, requires a quadrupling of sample size.) Three or four percent confidence interval levels are more appropriate in most legal search projects, perhaps even the 5% interval used in the Mandating Reasonableness article by Roitblat, Oot and Kershaw. Depending on the data set itself, prevalence, other quality control measures, complexity of the case, and the amount at issue, say less than $1,000,000, the five percent based small sample size of approximately 400 documents could well be adequate and reasonable. As usual in the law, it all depends on many circumstances and variables.

The issue of inconsistent reviews between reviewers, the fuzzy lens problem, necessarily limits the effectiveness of all large-scale human reviews. The sample sizes required to make a difference are extremely large. No such reviews can be practically done without multiple reviewers and thus low agreement rates. The gold standard for review of large samples like this is made of lead, not gold. Therefore, even if cost was not a factor, large sample sizes would still be a waste of time.

Moreover, in the real world of legal review projects, there is always a strong component of vagary in relevance. Maybe that was not true in the 2009 TREC experiment, as Grossman and Cormack’s study suggests, but it has been true in the thousands of messy real-world lawsuits that I have handled in the past 32 years. All trial lawyers I have spoken with on the subject agree.

Relevance can be, and usually is, a fluid and variable target depending on a host of factors, including changing legal theories, changing strategies, changing demands, new data, and court rulings. The only real gold standard in law is a judge ruling on specific documents. Even then, they can change their mind, or make mistakes. A single person, even a judge, can be inconsistent from one document to another. See Grossman & Cormack, Inconsistent Responsiveness Determination at pgs. 17-18 where a 2009 TREC Topic Authority contradicted herself 50% of the time when re-examining the same ten documents.

We must realize that random sampling is just one tool among many. We must also realize that even when random sampling is used, Recall is just one measure of accuracy among many. We must utilize the entire 2 x 2 decision matrix.

We must consider the possible applicability of all of the measurements that the search quadrant makes possible, not just recall.

  • Recall = A/G
  • Precision = A/C
  • Elusion = D/F
  • Fallout = B/H
  • Agreement = (A+E)/(D+B)
  • Prevalence = G/I
  • Miss Rate = D/G
  • False Alarm Rate = B/C
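Here is a compact sketch of the quadrant and the measures listed above, using the same letter scheme (A = true positives, B = false positives, D = false negatives, E = true negatives, with C, F, G, H and I as derived totals). In practice these counts are only ever estimated from samples, never known exactly.

```python
from dataclasses import dataclass

@dataclass
class SearchQuadrant:
    A: int   # relevant and retrieved (true positives)
    B: int   # irrelevant but retrieved (false positives)
    D: int   # relevant but not retrieved (false negatives)
    E: int   # irrelevant and not retrieved (true negatives)

    @property
    def C(self): return self.A + self.B   # all retrieved
    @property
    def F(self): return self.D + self.E   # all not retrieved
    @property
    def G(self): return self.A + self.D   # all relevant
    @property
    def H(self): return self.B + self.E   # all irrelevant
    @property
    def I(self): return self.C + self.F   # the whole collection

    def measures(self):
        return {
            "recall": self.A / self.G,
            "precision": self.A / self.C,
            "elusion": self.D / self.F,
            "fallout": self.B / self.H,
            "agreement": (self.A + self.E) / (self.D + self.B),  # as listed above
            "prevalence": self.G / self.I,
            "miss rate": self.D / self.G,
            "false alarm rate": self.B / self.C,
        }
```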

No doubt we will develop other quality control tests, for instance using Prevalence as a guide or target for relevant search as I described in my seven part Search Narrative. Just as we must use multimodal search efforts for effective search of large-scale data sets, so too must we use multiple quality control methods when evaluating the reasonability of search efforts. Random sampling is just one tool among many, and, based on the math, maybe not the best method at that, regardless of whether it is for recall, or elusion, or any other binary search quadrant measure.

Just as keyword search must be supplemented by the computer intelligence of predictive coding, so too must random based quality analysis be supplemented by skilled legal intelligence. That is what I call a Hybrid approach. The best measure of quality is to be found in the process itself, coupled with the people and software involved. A judge called upon to review reasonability of search should look at a variety of factors, such as:

  • What was done and by whom?
  • What were their qualifications?
  • What rules and disciplined procedures were followed?
  • What measures were taken to avoid inconsistent calls?
  • What training was involved?
  • What happened during the review?
  • Which search methods were used?
  • Was it multimodal?
  • Was it hybrid, using both human and artificial intelligence?
  • How long did it take?
  • What did it cost?
  • What software was used?
  • Who developed the software?
  • How long has the software been used?

Conclusion

These are just a few questions that occur to me off the top of my head. There are surely more. Last year in Part Two of Secrets of Search I suggested nine characteristics of what I hope would become an accepted best practice for legal review. I invited peer review and comments on what I may have left out, or any challenges to what I put in, but so far this list of nine remains unchallenged. We need to build on this to create standards so that quality control is not subject to so many uncertainties.

Jason R. Baron, William Webber, myself, and others keep saying this over and over, and yet the Hunger Games of standardless discovery goes on. Without these standards we may all fall prey at any time to a vicious sneak attack by another contestant in the litigation games. A contest that all too often feels like a fight to the death, rather than a cooperative pursuit of truth and justice. It has become so bad now that many lawyers snicker just to read such a phrase.

The point here is, you have to look at the entire process, and not just focus on taking random samples, especially ones that claim to measure recall in low yield collections. By the way, I submit that almost all legal search is of low yield collections, not just employment-law-related searches as some have suggested. Those who think the contrary have too broad a concept of relevance, and little or no understanding of actual trials, cumulative evidence, and the modern big data koan that “relevant is irrelevant.” Even though random sampling is not The Answer we once thought, it should be part of the process. For instance, a random sample elusion test that finds no Highly Relevant documents should remain an important component of that process.

The no-holds-barred Hunger Games approach to litigation must end now. If we all join together, this will end in victory, not defeat. It will end with alliances and standards. Whatever district you hail from, join us in this noble quest. Turn away from the commercial greed of winning-at-all-costs. Keep your integrity. Keep the faith. Renounce the vicious games; both hide-the-ball and extortion. The world is watching. But we are up for it. We are prepared. We are trained. The odds are ever in our favor. Salute all your colleagues who turn from the games and the leadership of greed and oppression. Salute all who join with us in the rebellion for truth and justice.

__________________


Days Seven and Eight of a Predictive Coding Narrative: Where I have another hybrid mind-meld and discover that the computer does not know God

July 29, 2012

This is my fifth in a series of narrative descriptions of  a search of 699,082 Enron emails to find evidence on involuntary employee terminations. The preceding narratives are:

In this fifth installment I will continue my description, this time covering days seven and eight of the project. As the title indicates, progress continues and I have another hybrid mind-meld moment. I also discover that the computer does not recognize the significance of references to God in an email. This makes sense logically, but is unexpected and kind of funny when encountered in a document review.

Seventh Day of Review (7 Hours)

On this seventh day I followed Joe White’s advice as described at the end of the last narrative. It was essentially a three-step process:

One: I ran another learning session for the dozen or so I’d marked since the last one to be sure I was caught up, and then made sure all of the prior Training documents were checked back in. This only took a few minutes.

Two: I ran two more focus document trainings of 100 docs each, total 200. The focus documents are generated automatically by the computer. It only took about an hour to review these 200 documents because most were obviously irrelevant to me, even if the computer was somewhat confused.

I received more of an explanation from Joe White on the focus documents, as Inview calls them. He explains that, at the current time at least (KO is coming out with a new version of the Inview software soon, and they are in a state of constant analysis and improvement), 90% of each focus group consists of grey-area-type documents, and 10% are pure random under IRT ranking. For documents drawn via workflow (in the demo database they are drawn from the System Trainers group in the Document Training Stage) they are selected as 90% focus and 10% random, where the 90% focus selection is drawn evenly across each category set for iC training.

The focus documents come from the areas of least certainty for the algorithm. A similar effect can be achieved by searching for a given iC category for documents between 49 – 51%, etc., as I had done before for relevance. But the automated focus document system makes it a little easier because it knows when you do not have enough documents in the 49 – 51% probability range and then increases the draw to reach your specified number, here 100,  to the next least-certain documents. This reduces the manual work in finding the grey area documents for review and training.

Three: I looked for more documents to evaluate/train the system. I had noticed that “severance” was a key word in relevant documents, and so went back and ran a search for this term for the first time. There were 3,222 hits, so, as per my standard procedure, I added this document count to the name of the folder that automatically saved the search.

I found many more relevant documents that way. Some were of a new type I had not seen before (having to do with the mass lay-offs when Enron was going under), so I knew I was expanding the scope of relevancy training, as was my intent. I did the judgmental review by using various sort-type judgment searches in that folder, i.e., by ordering the documents by subject line, file type, search term hits (the star symbols), etc., and did not review all 3,222 docs. I did not find that necessary. Instead, I honed in on the relevant docs, but also marked some irrelevant ones here that were close. Below is a screen shot of the first page of the documents sorted by putting those selected for training at the top.

I had also noticed that “lay off” “lay offs” and “laid off” were common terms found in relevant docs, and I had not searched for those particular terms before either. There were 973 documents with hits with one of these search terms. I did the same kind of judgmental search of the folder I created with these documents and found more relevant documents to train. Again, I was finding new documents and knew that I was expanding the scope of relevancy. Below is one new relevant document found in this selection; note how the search terms are highlighted for easy location.

I also took the time to mark some irrelevant documents in these new search folders, especially the documents in the last folder, and told them to train too, since they were otherwise close from a similar keywords perspective. So I thought I should go ahead and train them to try to teach the fine distinctions.

The above third step took another five hours (six hours total). I knew I had added hundreds of new docs for training in the past five hours, both relevant and irrelevant.

Fourth Round

I decided it was time to run a training session again and force the software to analyze and rank all of the documents again. This was essentially the Fourth Round (not counting the little training I did at the beginning today to make sure I was served with the right (updated) Focus documents).

After the Training session completed, I asked for a report. It showed that 2,663 total documents (19,731 pages) have now been categorized and marked for Training in this last session. There were now 1,156 Trainer (me) identified documents, plus the original 1,507 System ID’ed docs. (Previously, in Round 3, there were the same 1,507 System ID’ed docs, and only 534 Trainer ID’ed docs.)

Then I ran a report to see how many docs had been categorized by me as Relevant (whether also marked for Training or not). Note I could have done this before the training session too, and it would not make any difference in results. All the training session does is change the predictions on coding, not the actual prior human coding. This relevancy search was saved in another search folder called “All Docs Marked Relevant after 4th Round – 355 Docs.” After the third round I had only ID’ed 137 relevant documents. So progress in recall was being made.

Prevalence Quality Control Check

As explained in detail in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane, my first random sample search allowed me to determine prevalence and get an idea of the total number of relevant document likely contained in the database. The number was 928 documents. That was the spot or point projection of the total yield in the corpus. (Yield is another information science and statistics term that is useful to know. It means in this context the expected number of relevant documents in the total database. See eg. Webber, W., Approximate Recall Confidence Intervals, ACM Transactions on Information Systems, Vol. V, No. N, Article A (2012 draft) at A2.)

My yield calculation here of 928 is based on my earlier finding of 2 relevant documents in the initial 1,507 random sample. (2/1507 = .00132714) (.00132714 x 699,082 ≈ 928 relevant documents). So based on this I knew that I was correct to have gone ahead with the fourth round, and would next check to see how many documents the IRT now predicted would be relevant. My hope was the number would now be closer to the 928 goal of the projected yield of the 699,082 document corpus.

This last part had taken another hour, so I’ll end Day Seven with a total of 7 hours of search and review work.

Eighth Day of Review (9 Hours)

First I ran a probability search as before for all 51%+ probable relevant docs and saved them in a folder by that name. After the fourth round the IRT now predicted a total of 423 relevant documents. Remember I had already actually reviewed and categorized 355 docs as relevant, so this was only a potential max net gain of 68 docs. As it turned out, I disagreed with 8 of the predictions, so the actual net gain was only 60 docs, for a total of 415 confirmed relevant documents.

I had hoped for more after broadening the scope of documents marked relevant in the last seeding. So I was a little disappointed that my last seed set had not led to more predicted relevant documents. Since the “recall goal” for this project was 928 documents, I knew I still had some work to do to expand the scope. Either that, or the confidence interval was at work, and there were actually fewer relevant documents in this collection than the random sample predicted as a point projection. The probability statistics showed that the actual range was between 112 documents and 3,345 documents, due to the 95% confidence level and +/-3% confidence interval.
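The spot projection and that wide range can be reproduced with an exact (Clopper-Pearson) binomial interval on the initial random sample. Here is a sketch using SciPy; the narrative's 112-to-3,345 figures may have come from a calculator that rounds a little differently.

```python
from scipy.stats import beta

def projected_yield(hits, sample_size, population, confidence=0.95):
    """Point projection and exact (Clopper-Pearson) range for the number of
    relevant documents in the collection, from a simple random sample."""
    alpha = 1 - confidence
    point = hits / sample_size
    lower = beta.ppf(alpha / 2, hits, sample_size - hits + 1) if hits else 0.0
    upper = beta.ppf(1 - alpha / 2, hits + 1, sample_size - hits)
    return (round(point * population), round(lower * population), round(upper * population))

# 2 relevant documents found in a 1,507-document sample of the 699,082-document corpus:
print(projected_yield(2, 1507, 699_082))   # point projection ~928, range of roughly 112 to 3,350
```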

51%+ Probable Relevant Documents

Next I looked at the 51%+ probable relevant docs folder and sorted by whether the documents had been categorized or not. You do that by clicking on the symbol for categorization, a check, which is by default located in the upper left. That puts all of the categorized docs together, either on top or bottom. Then I reviewed the 68 new documents, the ones the computer predicted to be relevant that I had not previously marked relevant.

This is always the part of the review that is the most informative for me as to whether the computer is actually “getting-it” or not. You look to see what documents it gets wrong, in other words, makes a wrong prediction of probable relevance, and try to determine why. In this way you can be alert for additional documents to try to correct the error in future seeds. You learn from the computer’s mistakes where additional training is required.

I then had some moderately good news in my review. I only disagreed with eight of the 68 new predictions. One of these documents only had a 52.6% probability for relevance, another 53.6%, another 54.5%, another 54%, another 57.9%, and another only 61%. Another two, at 79.2% and 76.7%, had to do with “voluntary” severance again, a mistake I had seen before. So even when the computer and I disagreed, it was not by much.

Computer Finds New Hard-to-Detect Relevant Documents

A couple of the documents that Inview predicted to be relevant were long, many pages, so my study and analysis of them took a while. Even though these long documents at first seemed irrelevant to me, as I kept reading and analyzing them, I ultimately agreed with the computer on all of them. A careful reading of the documents showed that they did in fact include discussion related to termination and terminated employees. I was surprised to see that, but pleased, as it showed the software mojo was kicking in. The predictive coding training was allowing the computer to find documents I would likely never have caught on my own. The mind-meld was working and hybrid power was again manifest.

These hard to detect issues (for me) mainly arose from the unusual situation of the mass terminations that came at the end of Enron, especially at the time of its bankruptcy. To be honest, I had forgotten about those events. My recollection of Enron history was pretty rusty when I started this project. I had not been searching for bankruptcy related terminations before. That was entirely the computer’s contribution and it was a good one.

From this study of the 68 new docs I realized that although there were still some issues with the software making an accurate distinction between voluntary and involuntary severance, overall I felt confident that Inview was now pretty well-trained. I based that on the 60 other predictions that were spot on.

Note that I marked most of the newly confirmed relevant documents for training, but not all. I did not want to excessively weight the training with some that were redundant, or odd for one reason or another, and thus not particularly instructive.

This work was fairly time-consuming. It took three long hours on a Sunday to complete.

Fifth Round

Returning to work in the evening I started another training session, the Fifth. This would allow the new teaching (document training instructions) to take effect.

My plan was to then have the computer serve me up the 100 close calls (Focus Documents) by using the document training Checkout feature. Remember this feature selects and serves up for review the grey area docs designed to improve the IRT training, plus random samples.

But before I reviewed the next training set, I did a quick search to see how many new relevant documents (51%+) the last training (fifth round) had predicted. I found a total of 545 documents predicted 51%+ relevant. Remember I left the last session with 415 relevant docs (the goal was 928). So progress was still being made. The computer had added 130 documents.

Review of Focus Documents

Before I looked at these new ones to see how many I agreed with, I stuck to my plan, and took a Checkout feed of 100 Focus documents. My guess was that most of the newly predicted 51%+ relevant docs would be in the grey area anyway, and so I would be reviewing some of them when I reviewed the Focus documents.

First, I noticed right away that it served up 35 junk files that were obviously irrelevant and had previously been marked as such, such as PST placeholder files and a few others like that, which clutter this ENRON dataset. Obviously, they were part of the random selection portion of the Focus document selections. I told them all to train in one bulk command, hit the completed review button for them, and then focused on the remaining 65 documents. None had been reviewed before. Next I found some more obviously irrelevant docs, which were not close at all, i.e. 91% likely irrelevant and only 1% likely relevant. I suspect this is part of the general database random selection that makes up 10% of the Focus documents (the other 90% are close calls).

Next I did a file type sort to see if any more of the unreviewed documents in this batch of 100 were obviously irrelevant based on file type. I found 8 more such files, mass categorized them, mass trained them and quickly completed review for these 8.

Now there were 57 docs left, 9 of which were Word docs, and the rest emails. So I checked the 9 Word docs next. Six of these were essentially the same document called “11 15 01 CALL.doc.” The computer gave each approximately a 32.3% probability of irrelevance and a 33.7% probability of relevance. Very close indeed. Some of the other docs had very slight prediction numbers (less than 1%). The documents proved to be very close calls. Most of them I found to be irrelevant. But in one document I found a comment about mass employee layoffs, so I decided to call it relevant to our issue of employee terminations. I trained those eight and checked them back in. I then reviewed the remaining Word docs, found that they were also very close, but marked these as irrelevant and checked them in, leaving 48 docs to review in the Training set of 100.

Next I noticed a junk kind of mass email from a sender called “Black.” I sorted by “From,” found six by Black, and a quick look showed they were all irrelevant, as the computer had predicted for each. Not sure why they were picked as focus docs, but regardless, I trained them and checked them back in, now leaving 42 docs to review.

Next I sorted the remaining by “Subject” to look for some more that I might be able to quickly bulk code (mass categorize). It did not help much as there were only a couple of strings with the same subject. But I kept that subject order and slogged through the remaining 42 docs.

I found most of the remaining docs were very close calls, all in the 30% range for both relevant and irrelevant. So they were all uncertain, i.e., a split choice, but none were actually predicted relevant, that is, none were in the over 50% likely relevant range. I found that most of them were indeed irrelevant, but not all. A few in this uncertain range were relevant. They were barely relevant, but of the new type recently marked having to do with the bankruptcy. Others that I found relevant were of a type I had seen before, yet the computer was still unsure, with basically an even split of prediction in the 30% range. They were apparently different from the obviously relevant documents, but in a subtle way. I was not sure why. See, e.g., control number 12509498.

It was 32.8% relevant and 30.9% irrelevant, even though I had marked an identical version of this email before as relevant in the last training. The computer was apparently suspicious of my prior call and was making sure. I know I’m anthropomorphizing a machine, but I don’t know how else to describe it.

Computer’s Focus Was Too Myopic To See God

One of the focus documents that the computer found to be a close call, in the 30% range, was an email with control number 10910388. It was obviously just an inspirational message being forwarded around about God. You know the type I’m sure.

It was kind of funny to see that this email confused the computer, whereas any human could immediately recognize that this was a message about God, not employee terminations. It was obvious that the computer did not know God.

Suddenly My Prayers Are Answered

Right after the funny God mistake email, I reviewed another email with control number 6004505. It was about wanting to fire a particular employee. Although the computer was uncertain about the relevancy of this document, I knew right away that it rocked. It was just the kind of evidence I had been looking for. I marked it as Highly Relevant, the first hot document found in several sessions. Here is the email.

I took this discovery of a hot doc as a good sign. I was finding both the original documents I had been looking for and the new outliers. It looked to me like I had succeeded in training and in broadening the scope of relevancy to its proper breadth. I might not be travelling a divine road to redemption, but it was clearly leading to better recall.

Since most of these last 42 documents were close questions (some were part of the 10% random selection and were obvious), the review took longer than usual. The above tasks took over 1.5 hours in all (not including machine search time or time to write this memo).

Good Job Robot!

My next task was to review the 51% predicted relevant set of 545 docs. One document was particularly interesting, control number 12004849, which was predicted to be 54.7% likely relevant. I had previously marked it Irrelevant based on my close call decision that it only pertained to voluntary terminations, not involuntary terminations. It was an ERISA document, a Summary Plan Description of the Enron Metals Voluntary Separation Program.

Since the document on its face obviously pertained to voluntary separations, it was not relevant. That was my original thinking and why I at first called it Irrelevant. But my views on document characterizations on that fuzzy line between voluntary and involuntary employee terminations had changed somewhat over the course of the review project. I now had a better understanding of the underlying facts. The document necessarily defined both eligibility for this benefit (money paid when an employee left) and ineligibility. It specifically stated that employees of certain Enron entities were ineligible for this benefit. It stated that acceptance of an application was strictly within the company’s discretion. What happened if even an eligible employee decided not to voluntarily quit and take this money? Would they not then be terminated involuntarily? What happened if they applied for this severance, and the company said no? For all these reasons, and more, I decided that this document was in fact relevant to both voluntary and involuntary terminations. The relevance to involuntary terminations was indirect, and perhaps a bit of a stretch, but in my mind it was in the scope of a relevant document.

Bottom line, I had changed my mind and I now agreed with the computer and considered it Relevant. So I changed the coding to relevant and trained on it. Good call Inview. It had noticed an inconsistency with some of my other document codings and suggested a correction. I agreed. That was impressive. Good robot!

Looking at the New 51%+

Another one of the new documents that was in the 51%+ predicted relevant group was a document with 42 versions of itself. It was the Ken Lay email where he announced that he was not accepting his sixty-million dollar golden parachute. (Can you imagine how many lawsuits would have ensued if he had taken that money?) Here is one of the many copies of this email.

I had previously marked a version of this email as relevant in past rounds. Obviously the corpus (the 699,082 Enron emails) had more copies of that particular email that I had not found before. It was widely circulated. I confirmed the predictions of Relevance.  (Remember that this database was deduplicated only on the individual custodian basis, vertical deduplication. It was not globally deduplicated against all custodians, horizontal deduplication. I recommend full horizontal deduplication as a default protocol.)

I disagreed with many of the other predicted relevant docs, but did not consider any of them important. The documents now presenting as possibly relevant were, in my view, cumulative and not really new, not really important. All were fetched by the outer limits of relevance, triggered by my earlier decisions to allow in as barely relevant the final-day comments on Ken Lay not taking his sixty-million dollar payment, and to allow in as relevant general bankruptcy talk that might mention layoffs.

Also, I was allowing in as relevant new documents and emails that concerned the ERISA plan revisions that were related to general severance. The SPD of the Enron Metals Voluntary Separation Program was an example of that. These were all fairly far afield of my original concept of relevance, which had grown as I saw all of the final-days emails regarding layoffs, and better understood the bankruptcy and ERISA setup, etc.

Bottom line, I did not see much training value in these newly added docs, both predicted and confirmed. The new documents were not really new. They were very close to documents already found in the prior rounds. I was thinking it might be time to bring this search to an end.

Latest Relevancy Metrics

I ran one final search to determine my total relevant coded documents. The count was 659. That was a good increase over the last measured count of 545 relevant, but still short of my initial goal of 928, the point projection of yield. That is a 71% recall (659/928) of my target, which is pretty good, especially if the remaining relevant were just cumulative or otherwise not important. Considering the 3% confidence interval, and the range inherent in the 928 yield point projection because of that, from between 112 and 3,345 documents, it could in fact already be 100% recall, although I doubted that based on the process to date. See references to point projection, intervals, and William Webber’s work on confidence intervals in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane and in Webber, W., Approximate Recall Confidence Intervals, ACM Transactions on Information Systems, Vol. V, No. N, Article A (2012 draft).
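
The recall arithmetic in that paragraph reduces to a few lines. The sketch below uses the yield numbers from the earlier random sample; the worst-case figure is just the same arithmetic applied to the other end of the range, not a number reported above:

```python
confirmed_relevant = 659             # documents coded relevant by the end of the review
yield_point = 928                    # point projection of total relevant documents
yield_low, yield_high = 112, 3_345   # range implied by the confidence interval

recall_vs_point = confirmed_relevant / yield_point           # ~0.71, i.e. 71%
recall_best_case = min(confirmed_relevant / yield_low, 1.0)  # capped at 100%
recall_worst_case = confirmed_relevant / yield_high          # ~0.20

print(f"Recall vs. point projection: {recall_vs_point:.0%}")
print(f"Possible recall range given the interval: "
      f"{recall_worst_case:.0%} to {recall_best_case:.0%}")
```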

Enough Is Enough

I was pretty sure that further rounds of search would lead to the discovery of more relevant documents, but thought it very unlikely that any more significant relevant documents would be found. Although I had found one hot doc in this round, the quality of the rest of the documents found convinced me that was unlikely to occur again. I had the same reaction to the grey area documents. The quality had changed. Based on what I had been seeing in the last two rounds, the relevant documents left were, in my opinion, likely cumulative and of no real probative value to the case.

In other words, I did not see value in continuing the search and review process further, except for a final null-set quality control check. I decided to bring the search to an end. Enough is enough already. Reasonable efforts are required, not perfection. Besides, I knew there was a final quality control test to be passed, and that it would likely reveal any serious mistakes on my part.

Moving On to the Perhaps-Final Quality Control Check

After declaring the search to be over, the next step in the project was to take a random sample of the documents not reviewed or categorized, to see if any significant false-negatives turned up. If none did, then I would  consider the project a success, and conclude that more rounds of search were not required. If some did turn up, then I would have to keep the project going for at least another round, maybe more, depending on exactly what false-negatives were found. That would have to wait for the next day.

But before ending this long day I ran a quick search to see the size of this null set. There were 698,423 docs not categorized as relevant and I saved them in a Null Set Folder for easy reference. Now I could exit the program.
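
For those who want to see the mechanics of that last quality control step, here is a minimal sketch, assuming a simple random sample of the null set; the sample size and the review function are placeholders for illustration, not values or features from the actual project or the Inview software:

```python
import random

def reviewed_as_relevant(doc_id):
    # Placeholder for the human review call on one sampled document.
    # In a real project an SME or reviewer codes each sampled document;
    # returning False here just keeps the sketch self-contained and runnable.
    return False

null_set_size = 698_423      # docs not categorized as relevant (from the narrative)
sample_size = 1_000          # hypothetical sample size, chosen only for illustration

elusion_sample = random.sample(range(null_set_size), sample_size)
false_negatives = [d for d in elusion_sample if reviewed_as_relevant(d)]

elusion_rate = len(false_negatives) / sample_size
projected_missed = elusion_rate * null_set_size
print(f"False negatives found in sample: {len(false_negatives)}")
print(f"Projected relevant documents remaining in the null set: {projected_missed:.0f}")
```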

Total time for this night’s work was 4.5 hours, not including report preparation time and wait time on the computer for the training.

To be continued . . . .           


TAR Course: 1st Class

June 17, 2017

First Class: Background and History of Predictive Coding

Welcome to the first of seventeen classes. We begin with an introduction to the philosophy of the e-Discovery Team, which we share with Facebook, Google and most other high-tech companies. Then we go into the background and history of predictive coding, including some of the patents in this area. This first class is somewhat difficult, but worry not, most of the classes are easier. Also, you do not need to understand the patents discussed here, just the general ideas behind the evolution of predictive coding methods.

We Follow the ‘Hacker Way’ Philosophy


As the history section that follows will show, many e-discovery vendors are still stuck in the past. They have not upgraded their software to include the latest active machine learning methods. They lag behind because they do not follow the dominant Silicon Valley philosophy, which the e-Discovery Team fully endorses, called the Hacker Way. To quote the explanation of this philosophy given by Facebook’s founder, Mark Zuckerberg, in his Letter to Investors for the initial public offering in 2012:

The Hacker Way is an approach to building that involves continuous improvement and iteration. . .  Hackers try to build the best services over the long term by quickly releasing and learning from smaller iterations rather than trying to get everything right all at once. . . .  We have the words ‘Done is better than perfect’ painted on our walls to remind ourselves to always keep shipping.  . . .

Hacking is also an inherently hands-on and active discipline. Instead of debating for days whether a new idea is possible or what the best way to build something is, hackers would rather just prototype something and see what works. There’s a hacker mantra that you’ll hear a lot around Facebook offices: ‘Code wins arguments.’

Hacker culture is also extremely open and meritocratic. Hackers believe that the best idea and implementation should always win — not the person who is best at lobbying for an idea or the person who manages the most people.

Zuckerberg, Letter to Investors (1/31/12). The e-Discovery Team has long endorsed the nine basic principles of the Hacker Way set forth in the diagrams below.

See: Losey, “The Hacker Way” – What the e-Discovery Industry Can Learn From Facebook’s Management Ethic (8/18/13); The Solution to Empty-Suits in the Board Room: The “Hacker Way” of Management, Part One and Part Two (8/22/13).

The problem with legal technology is the debates often go on for years, not days. We are against that. We just do it. We have broken many things to get to this point and fixed many more. Our processes are still not perfect, but they keep improving. In the meantime, we keep shipping, we keep making phased productions. Iteration is the essence of both machine learning and creative work. Openness is also part of our core values. Thus we share most of what we learn in the TAR Course and e-Discovery Team Training.  On the e-Discovery Team blog we announce from time to time the continuous improvements we make to these programs.

First Generation Predictive Coding, Version 1.0

We begin our instruction with history because to understand the current state of the art of any field, you need to understand what came before. This is especially true in the legal technology field. Also, if you do not know history, you are doomed to repeat the mistakes of the past.

The first generation of Predictive Coding, version 1.0, entered the market in 2009. This document review software for lawyers used active machine learning, which allowed it to predict the relevance or irrelevance of a large collection of documents based on manual review of only a portion of them. Code some of the documents relevant and it would predict the relevance of all of them and rank them. The ranking was the new “special power” of document search that applied active machine learning algorithms, a type of AI. We will go over the practical details of how this works later in this course.
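
As a generic illustration of what that predict-and-rank step looks like under the hood (not any vendor's actual algorithm), here is a minimal scikit-learn sketch that trains a text classifier on a few coded examples and then ranks uncoded documents by predicted probability of relevance; the toy documents are invented for the example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A few human-coded training documents (1 = relevant, 0 = irrelevant).
train_docs = [
    "severance agreement for terminated employees",
    "notice of involuntary termination and layoff",
    "fantasy football league standings",
    "lunch menu for the cafeteria",
]
train_labels = [1, 1, 0, 0]

# The uncoded remainder of the collection, to be ranked.
uncoded_docs = [
    "mass layoffs expected after the bankruptcy filing",
    "quarterly natural gas trading report",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_uncoded = vectorizer.transform(uncoded_docs)

model = LogisticRegression()
model.fit(X_train, train_labels)

# Rank the uncoded documents by predicted probability of relevance.
probs = model.predict_proba(X_uncoded)[:, 1]
for doc, p in sorted(zip(uncoded_docs, probs), key=lambda t: t[1], reverse=True):
    print(f"{p:.1%}  {doc}")
```

Real predictive coding engines use far more sophisticated feature engineering and classifiers, but the train, predict, and rank structure is the same.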

The methods for use of predictive coding software have always been built into the software. The first version 1.0 software required the review to begin with a Subject Matter Expert (SME), usually a senior-level lawyer in charge of the case, reviewing a random selection of several thousand documents. The random documents they reviewed included a secret set of documents not identified to the SME, or anyone else, called a control set.

The secret control set supposedly allowed you to objectively monitor your progress in Recall and Precision of the relevant documents from the total set. It also supposedly prevented lawyers from gaming the system. As you will see in this class, we think the use of control sets was a big mistake.

Version 1.0 software set up the review project into two distinct stages. The first stage was for training the software on relevance so that it could predict the relevance of all of the documents. The second was for actual review of the documents that the software had predicted would be relevant. You would do your training, then stop and do all of your review.

Second Generation Predictive Coding, Version 2.0

The next generation, the version 2.0 methodology, continued to use secret control sets, but combined the two stages of review into one. It was no longer train then review; instead, the training continued throughout the review project. This continuous training improvement was popularized by Maura Grossman and Gordon Cormack, who called the method continuous active learning, or CAL for short. They later trademarked CAL, so here we will just call it continuous training, or CT for short. Under the CT method, which again was built into the software, the training continued throughout the document review. There was not one stage to train and another to review the predicted relevant documents. The training and review continued together.

The main problem with version 2.0 predictive coding is that the use of a secret control set continued. Please note that Grossman and Cormack’s method of review, which they call CAL, has never used control sets.

Third and Fourth Generation Predictive Coding, Versions 3.0 and 4.0

The next method of Predictive Coding, version 3.0, again combined the two stages into one, the CT technique, but eliminated the use of secret control sets. Random sampling itself remained. It is the third step in the eight-step process of both versions 3.0 and 4.0 that will be explained in the TAR Course, but the secret set of random documents, the control set, was eliminated.

The Problem With Control Sets

Although the use of a control set is basic to all scientific research and statistical analysis, it does not work in legal search. The EDRM, which apparently still promotes the use of a methodology with control sets, explains that the control set:

… is a random sample of documents drawn from the entire collection of documents, usually prior to starting Assisted Review training rounds. … The control set is coded by domain experts for responsiveness and key issues. … [T]he coded control set is now considered the human-selected ground truth set and used as a benchmark for further statistical measurements we may want to calculate later in the project. As a result, there is only one active control set in Assisted Review for any given project. … [C]ontrol set documents are never provided to the analytics engine as example documents. Because of this approach, we are able to see how the analytics engine categorizes the control set documents based on its learning, and calculate how well the engine is performing at the end of a particular round. The control set, regardless of size or type, will always be evaluated at the end of every round—a pop quiz for Assisted Review. This gives the Assisted Review team a great deal of flexibility in training the engine, while still using statistics to report on the efficacy of the Assisted Review process.

Control Sets: Introducing Precision, Recall, and F1 into Relativity Assisted Review (a kCura white paper adopted by EDRM).

The original white paper written by David Grossman, entitled Measuring and Validating the Effectiveness of Relativity Assisted Review, is cited by EDRM as support for their position on the validity and necessity of control sets. In fact, the paper does not support this proposition. The author of this Relativity White Paper, David Grossman, is a Ph.D. now serving as the associate director of the Georgetown Information Retrieval Laboratory, a faculty affiliate at Georgetown University, and an adjunct professor at IIT in Chicago. He is a leading expert in text retrieval and has no connections with Relativity except to write this one small paper. I spoke with David Grossman on October 30, 2015. He confirmed that the validity, or not, of control sets in legal search was not the subject of his investigation. His paper does not address this issue. In fact, he has no opinion of the validity of control sets in the context of legal search. Even though control sets were mentioned, it was never his intent to measure their effectiveness per se.

David Grossman was unaware of the controversies in legal search when he made that passing reference, including the effectiveness of using control sets. He was unaware of my view, and that of many others in the field of legal search, that the ground truth at the beginning of a search project was more like quicksand. Although David has never done a legal search project, he has done many other types of real-world searches. He volunteered that he has frequently had that same quicksand type of experience where the understanding of relevance evolves as the search progresses.

The main problem with the use of the control set in legal search is that the SMEs, what EDRM here refers to as the domain experts, never know the full truth of document responsiveness at the beginning of a project. This is something that evolves over time. The understanding of relevance changes over time; it changes as particular documents are examined. The control set fails and creates false results because “the human-selected ground truth set and used as a benchmark for further statistical measurements” is never correct, especially at the beginning of a large review project. Only at the end of a project are we in a position to determine a “ground truth” and “benchmark” for statistical measurements.

This problem was recognized by another information retrieval expert, William Webber, PhD. William does have experience with legal search and has been kind enough to help me through technical issues involving sampling many times. Here is how Dr. Webber puts it in his blog Confidence intervals on recall and eRecall:

Using the control set for the final estimate is also open to the objection that the control set coding decisions, having been made before the subject-matter expert (SME) was familiar with the collection and the case, may be unreliable.

Having done many reviews, where Losey has frequently served as the SME, we are much more emphatic than William. We do not couch our opinion with “may be unreliable.” To us there is no question that at least some of the SME control set decisions at the start of a review are almost certainly unreliable.

Another reason control sets fail in legal search is the very low prevalence typical of the ESI collections searched. We only see high prevalence when the document collection is keyword filtered. The original collections are always low, usually less than 5%, and often less than 1%. About the highest prevalence collection we have ever searched was the Oracle collection in the EDI search contest and it had obviously been heavily filtered by a variety of methods. That is not a best practice because the filtering often removes the relevant documents from the collection, making it impossible for predictive coding to ever find them. See, e.g., William Webber’s analysis of the Biomet case where this kind of keyword filtering was used before predictive coding began. What is the maximum recall in re Biomet?, Evaluating e-Discovery (4/24/13).

The control set approach cannot work in legal search because the size of the random sample, much less the portion of the sample allocated to the control set, is never even close to large enough to include a representative document from each type of relevant documents in the corpus, much less the outliers. So even if the benchmark were not on such shifting grounds, and it is, it would still fail because it is incomplete. The result is likely to be overtraining of the document types to those that happened to hit in the control set, which is exactly what the control set is supposed to prevent. This kind of overfitting can and does happen even without exact knowledge of the documents in the control set. That is an additional problem separate and apart from relevance shift. It is a problem solved by the multimodal search aspects of predictive coding in versions 3.0 and 4.0 taught here.
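
The arithmetic behind that incompleteness objection is easy to check. Here is a short sketch with purely illustrative numbers (a hypothetical 1% prevalence collection of 700,000 documents and a 1,000 document control set, not figures from any particular project):

```python
def prob_at_least_one(subtype_count, corpus_size, sample_size):
    """Chance a simple random sample contains at least one doc of a given sub-type."""
    p_zero = 1.0
    for i in range(sample_size):
        # Probability the i-th sampled document also misses the sub-type.
        p_zero *= (corpus_size - subtype_count - i) / (corpus_size - i)
    return 1 - p_zero

corpus_size = 700_000     # illustrative collection size
control_set = 1_000       # illustrative control set size
prevalence = 0.01         # hypothetical 1% prevalence

expected_relevant = prevalence * control_set
print(f"Expected relevant docs in the control set: {expected_relevant:.0f}")

# A rare but important sub-type with only 50 examples in the whole corpus:
print(f"Chance the control set sees that sub-type at all: "
      f"{prob_at_least_one(50, corpus_size, control_set):.0%}")
```

On these assumed numbers the control set holds only about ten relevant examples, and a rare sub-type with 50 instances in the whole corpus has roughly a 7% chance of appearing in the control set at all.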

Again William Webber has addressed this issue in his typical understated manner. He points out in Why training and review (partly) break control sets the futility of using control sets to measure effectiveness because the sets are incomplete:

Direct measures of process effectiveness on the control set will fail to take account of the relevant and irrelevant documents already found through human assessment.

A naïve solution to this problem to exclude the already-reviewed documents from the collection; to use the control set to estimate effectiveness only on the remaining documents (the remnant); and then to combine estimated remnant effectiveness with what has been found by manual means. This approach, however, is incorrect: as documents are non-randomly removed from the collection, the control set ceases to be randomly representative of the remnant. In particular, if training (through active learning) or review is prioritized towards easily-found relevant documents, then easily-found relevant documents will become rare in the remnant; the control set will overstate effectiveness on the remnant, and hence will overstate the recall of the TAR process overall. …

In particular, practitioners should be wary about the use of control sets to certify the completeness of a production—besides the sequential testing bias inherent in repeated testing against the one control set, and the fact that control set relevance judgments are made in the relative ignorance of the beginning of the TAR process. A separate certification sample should be preferred for making final assessments of production completeness.

Control sets are a good idea in general, and the basis of most scientific research, but they simply do not work in legal search. They were built into the version 1.0 and 2.0 software by engineers and scientists who had little understanding of legal search. They apparently had, and some still have, no real grasp at all as to how relevance is refined and evolves during the course of any large document review, nor of the typical low prevalence of relevance. The normal distribution in probability statistics is just never found in legal search.

The whole theory behind the secret control set myth in legal search is that the initial relevance coding of these documents was correct, immutable and complete; that it should be used to objectively judge the rest of the coding in the project. That is not true. In point of fact, many documents determined to be relevant or irrelevant at the beginning of a project may be considered the reverse by the end. The target shifts. The understanding of relevance evolves. That is not because of bad luck or a weak SME (a subject we will discuss later in the TAR Course), but because of the natural progression of the understanding of the probative value of various types of documents over the course of a review.

Not only that, many types of relevant documents are never even included in the control set because they did not happen to be included in the random sample. The natural rarity of relevant evidence in unfiltered document collections, aka low prevalence, makes this more likely than not.

All experienced lawyers know how relevance shifts during a case. But the scientists and engineers who designed the first generation software did not know this, and anyway, it contravened their dogma of the necessity of control sets. They could not bend their minds to the reality of indeterminate, rare legal relevance. In legal search the target is always moving and always small. Also, the data itself can often change as new documents are added to the collection. In other areas of information retrieval, the target is solid granite, simple Newtonian, and big, or at least bigger than just a few percent. Outside of legal search it may make sense to talk of an immutable ground truth. In legal search the ground truth of relevance is discovered. It emerges as part of the process, often including surprise court rulings and amended causes of action. It is in flux. The truth is rare. The truth is relative.

The parallels of legal search with quantum mechanics are interesting. The documents have to be observed before they will manifest as either relevant or irrelevant. Uncertainty is inherent to information retrieval in legal search. Get used to it. That is reality on many levels, including the law.

The control set based procedures were not only over-complicated, they were inherently defective. They were based on an illusion of certainty, an illusion of a ground truth benchmark magically found at the beginning of a project before document review even began. There were supposedly SME wizards capable of such prodigious feats. I have been an SME in many, many topics of legal relevance since I started practicing law in 1980. I can assure you that SMEs are human, all too human. There is no magic wizard behind the curtain.

Moreover, the understanding of any good SME naturally evolves over time as previously unknown, unseen documents are unearthed and analyzed. Legal understanding is not static. The theory of a case is not static. All experienced trial lawyers know this. The case you start out with is never the one you end up with. You never really know if Schrodinger’s cat is alive or dead. You get used to that after a while. Certainty comes from the final rulings of the last court of appeals.

The use of magical control sets doomed many a predictive coding project to failure. Project team leaders thought they had high recall, because the secret control set said they did, yet they still missed key documents. They still had poor recall and poor precision, or at least far less than their control set analysis led them to believe. See: Webber, The bias of sequential testing in predictive coding, June 25, 2013, (“a control sample used to guide the producing party’s process cannot also be used to provide a statistically valid estimate of that process’s result.”) I still hear stories from reviewers who find precision of less than 50% using Predictive Coding 1.0 and 2.0 methods, sometimes far less. Our goal is to use Predictive Coding 4.0 methods to increase precision to the 80% or higher level. This allows for the reduction of cost without sacrifice of recall.

Many attorneys who worked with predictive coding software versions 1.0 or 2.0, where they did not see their projects overtly crash and burn, as when missed smoking gun documents later turn up, or where reviewers see embarrassingly low precision, were nonetheless suspicious of the results. Even if not suspicious, they were discouraged by the complexity and the arcane control set process from ever trying predictive coding again. As attorney and search expert J. William (Bill) Speros likes to say, they could smell the junk science in the air. They were right. I do not blame them for rejecting predictive coding 1.0 and 2.0. I did too, eventually. But unlike many, I followed the Hacker Way and created my own method, called version 3.0, and then, in late 2016, version 4.0. We will explain the changes made from version 3.0 to 4.0 later in the course.

The control set fiction put an unnecessarily heavy burden upon SMEs. They were supposed to review thousands of random documents at the beginning of a project, sometimes tens of thousands, and successfully classify them, not only for relevance, but sometimes also for a host of sub-issues. Some gamely tried, and went along with the pretense of omnipotence. After all, the documents in the control set were kept secret, so no one would ever know if any particular document they coded was correct or not. But most SMEs simply refused to spend days and days coding random documents. They refused to play the pretend wizard game. They correctly intuited that they had better things to do with their time, plus many clients did not want to spend over $500 per hour to have their senior trial lawyers reading random emails, most of which would be irrelevant.

I have heard many complaints from lawyers that predictive coding is too complicated and did not work for them. These complaints were justified. The control set and two-step review process were the culprits, not the active machine learning process. The control set has done great harm to the legal profession. As one of the few writers in e-discovery free from vendor influence, much less control, I am here to blow the whistle, to put an end to the vendor hype. No more secret control sets. Let us simplify and get real. Lawyers who have tried predictive coding before and given up, come back and try Predictive Coding 4.0.

Recap of the Evolution of Predictive Coding Methods

Version 1.0 type software uses strong active machine learning algorithms. This early version is still being manufactured and sold by many vendors today. It has a two-step process of train and then review. It also uses secret control sets to guide training. This usually requires an SME to review a certain total of ranked documents as guided by the control set recall calculations.

Version 2.0 of Predictive Coding eliminated the two-step process, and made the training continuous. For that reason version 2.0 is also called continuous  training, CT. It did not, however,  reject the random sample step and its control set nonsense.

Predictive Coding 3.0 was a major change. It built on the continuous training improvements in 2.0, but also eliminated the secret control set and mandatory initial review of a random sample. This and other process improvements in Predictive Coding 3.0 significantly reduced the burden on busy SMEs, and significantly improved the recall estimates. This in turn improved the overall quality of the reviews.

Predictive Coding 4.0 is the latest method and the one taught in this course. It includes some variations in the ideal work flow, and refinements on the continuous active training to facilitate double-loop feedback. We call this Intelligently Spaced Training (IST), and it is all part of our Hybrid Multimodal IST method. All of this will be explained in detail in this course.

In Predictive Coding 3.0 and 4.0 the secret control set basis of recall calculation is replaced with a prevalence based random sample guide, plus elusion based quality control samples and other QC techniques. These can now be done with contract lawyers and only minimal involvement by the SME. See Zero Error Numerics. This will all be explained in the TAR Course. The final elusion type recall calculation is done at the end of the project, when final relevance has been determined. See: EI-Recall. Moreover, in the 3.0 and 4.0 process the sample documents are not secret. They are known and adjusted as the definitions of relevance change over time to better control your recall range estimates. That is a major improvement.

The method of predictive coding taught here has been purged of vendor hype and bad science and proven effective many times. The secret control set has never worked, and it is high time it be expressly abandoned. Here are the main reasons why: (1) relevance is never static, it changes over the course of the review; (2) the random selection size was typically too small for statistically meaningful calculations; (3) the random selection was typically too small in low prevalence collections (the vast majority in legal search) for complete training selections; and (4) it supposedly required a senior SME’s personal attention for days of document review work, a mission impossible for most e-discovery teams.

Here is Ralph Losey talking about control sets in June 2017. He is expressing his frustration about vendors still delaying upgrades to their software to eliminate the control set hooey. Are they afraid of losing business in the eastern plains of the Smoky Mountains?

Every day that vendors keep these phony control set procedures is another day that lawyers are misled by recall calculations based on them; another day lawyers are frustrated by wasting their time on overly large random samples; another day everyone has a false sense of protection from the very few unethical lawyers out there, and the very many not fully competent lawyers; and another day clients pay too much for document review. The e-Discovery Team calls on all vendors to stop using control sets and phase them out of their software.

The First Patents

When predictive coding first entered the legal marketplace in 2009 the legal methodology used by lawyers for predictive coding was dictated by the software manufacturers, mainly the engineers who designed the software. See eg. Leading End-to-End eDiscovery Platform Combines Unique Predictive Coding Technology with Random Sampling to Revolutionize Document Review (2009 Press Release). Recommind was an early leader, which is one reason I selected them for the Da Silva Moore v. Publicis Groupe case back in 2011. On April 26, 2011, Recommind was granted a patent for predictive coding: Patent No. 7,933,859, entitled Systems and methods for predictive coding. The search algorithms in the patent used Probabilistic Latent Semantic Analysis, an already well-established statistical technique for data analysis. (Recommind obtained two more patents with the same name in 2013: Patent No. 8,489,538 on July 16, 2013; and Patent No. 8,554,716 on October 8, 2013.)

As the title of all of these patents indicate, the methods of use of the text analytics technology in the software were key to the patent claims. As is typical for patents, many different method variables were described to try to obtain as wide a protection as possible. The core method was shown in Figure Four of the 2011 patent.

This essentially describes the method that I now refer to as Predictive Coding Version 1.0. It is the work flow I had in mind when I first designed procedures for the Da Silva Moore case. In spite of the Recommind patent, this basic method was followed by all vendors who added predictive coding features to their software in 2011, 2012 and thereafter. It is still going on today. Many of the other vendors also received patents for their predictive coding technology and methods, or applications are pending. See eg. Equivio, patent applied for on June 15, 2011 and granted on September 10, 2013, patent number 8,533,194; Kroll Ontrack, application 20120278266, April 28, 2011.

To my knowledge there has been no litigation between vendors. My guess is they all fear invalidation on the basis of lack of innovation and prior art.

The engineers, statisticians and scientists who designed the first predictive coding software are the people who dictated to lawyers how the software should be used in document review. None of the vendors seemed to have consulted practicing lawyers in creating these version 1.0 methods. I know I was not involved.

Ralph Losey in 2011, when he first argued against the methods of version 1.0

I also remember getting into many arguments with these technical experts from several companies back in 2011. That was when the predictive coding 1.0 methods hardwired into their software were first explained to me. I objected right away to the secret control set. I wanted total control of my search and review projects. I resented the secrecy aspects. There were enough black boxes in the new technology already. I was also very dubious of the statistical projections. In my arguments with them, sometimes heated, I found that they had little real grasp of how legal search was actually conducted or the practice of law. My arguments were of no avail. And to be honest, I had a lot to learn. I was not confident of my positions, nor knowledgeable enough of statistics. All I knew for sure was that I resented their trying to control my well-established, pre-predictive coding search methods. Who were they to dictate how I should practice law, what procedures I should follow? These scientists did not understand legal relevance, nor how it changes over time during the course of any large-scale review. They did not understand the whole notion of the probative value of evidence and the function of e-discovery as trial preparation. They did not understand weighted relevance, and the 7±2 rule of judge and jury persuasion. I gave up trying, and just had the software modified to suit my needs. They would at least agree to do that to placate me.

Part of the reason I gave up trying back in 2011 is that I ran into a familiar prejudice from this expert group. It was a prejudice against lawyers common to most academics and engineers. As a high-tech lawyer since 1980 I have faced this prejudice from non-lawyer techies my whole career. They assumed we were all just a bunch of weasels, not to be trusted, and with little or no knowledge of technology and search. They have no idea at all about legal ethics or professionalism, nor of our experience with the search for evidence. They fail to understand the central role of lawyers in e-discovery, and how our whole legal system, not just discovery, is based on the honesty and integrity of lawyers. We need good software from them, not methods to use the software, but they knew better. It was frustrating, believe me. So I gave up on the control set arguments and moved on. Until today.

In the arrogance of the first designers of predictive coding, an arrogance born of advanced degrees in entirely different fields, these information scientists and engineers presumed they knew enough to tell all lawyers how to use predictive coding software. They were blind to their own ignorance. The serious flaws inherent in Predictive Coding Version 1.0 are the result.

Predictive Coding Version 2.0 Adopts Continuous Active Training

The first major advance in predictive coding methodology was to eliminate the dual task phases present in Predictive Coding 1.0. The first phase of the two-fold version 1.0 procedure was to use active learning to train the classifier. This would take several rounds of training and eventually the software would seem to understand what you were looking for. Your concept of relevance would be learned by the machine. Then the second phase would begin. In phase two you actually reviewed the documents that met the ranking criteria. In other words, you would use predictive coding in phase one to cull out the probable irrelevant documents, and then you would be done with predictive coding. (In some applications you might continue to use predictive coding for reviewer batch assignment purposes only, but not for training.) Phase two was all about review to confirm the predicted classification, usually relevance. In phase two you would just review, and not also train.

In my two ENRON experiments in 2012 I did not follow this two-step procedure. I just kept on training until I could not find any more relevant documents. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. (Part One); Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents. (Part Two); Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron (in PDF form and the blog introducing this 82-page narrative, with second blog regarding an update); Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (a five-part written and video series comparing two different kinds of predictive coding search methods).

I did not think much about it at the time, but by continuing to train I made what was, to me, a perfectly reasonable departure from the version 1.0 method. I was using what is now promoted as the new and improved Predictive Coding 2.0. In this 2.0 version you combine training and review. The training is continuous. The first round of document training might be called the seed set, if you wish, but it is nothing particularly special. All rounds of training are important and the training should continue as the review proceeds, unless there are some logistical reasons not to. After all, training and review are both part of the same review software, or should be. It just makes good common sense to do that, if your software allows you to. If you review a document, then you might as well at least have the option to include it in the training. There is no logical reason for a cut-off point in the review process where training stops. I really just came up with that notion in Da Silva for simplicity’s sake.

In predictive coding 2.0 you do Continuous Training, or CT for short. It just makes much more sense to keep training as long as you can, if your software allows you to do that.
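
Stripped of any particular vendor's features, continuous training is just a loop in which every review decision feeds back into the next ranking. Here is a toy, self-contained sketch of that loop; the corpus, batch size, and stopping rule are invented for illustration and this is not any vendor's implementation:

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy collection: 30 "termination" documents (relevant) mixed into 300 total.
corpus = ([f"layoff and severance notice {i}" for i in range(30)] +
          [f"gas trading report {i}" for i in range(270)])
truth = {doc: "layoff" in doc for doc in corpus}   # stands in for the human reviewer

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)                             # features for the whole collection

# Judgmental seed set: one relevant and one irrelevant document found by hand.
coded = {corpus[0]: True, corpus[50]: False}

# Continuous training: retrain, re-rank, review, and feed every decision back in.
while sum(coded.values()) < 25 and len(coded) < len(corpus):
    docs = list(coded)
    model = LogisticRegression().fit(vectorizer.transform(docs),
                                     [coded[d] for d in docs])
    uncoded = [d for d in corpus if d not in coded]
    probs = model.predict_proba(vectorizer.transform(uncoded))[:, 1]
    # Review the top-ranked uncoded documents plus one random document.
    batch = [d for _, d in sorted(zip(probs, uncoded), reverse=True)[:5]]
    batch.append(random.choice(uncoded))
    for d in batch:
        coded[d] = truth[d]                        # the review decision becomes training

print(f"Reviewed {len(coded)} of {len(corpus)} docs; "
      f"found {sum(coded.values())} relevant.")
```

The point of the sketch is only the shape of the loop: there is no separate training phase that ends before review begins.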

There are now several vendors that promote the capacity of continuous training and have it built into their review software, including Kroll.  Apparently many vendors still use the old dual task, stop training approach of version 1.0. And, most vendors still use, or at least give lip service to, the previously sacrosanct random secret control set features of version 1.0 and 2.0.

John Tredennick

The well-known Denver law technology sage, John Tredennick, CEO of Catalyst, often writes about predictive coding methods. Here is just one of the many good explanations John has made about continuous training (he calls it CAL), this one from his article with the catchy name “A TAR is Born: Continuous Active Learning Brings Increased Savings While Solving Real-World Review Problems” (note these diagrams are his, not mine, and he here calls predictive coding TAR):

How Does CAL Work?

CAL turns out to be much easier to understand and implement than the more complicated protocols associated with traditional TAR reviews.

[Diagram: Catalyst TAR 1.0 workflow]

A TAR 1.0 review is typically built around the following steps:

1. A subject matter expert (SME), often a senior lawyer, reviews and tags a sample of randomly selected documents to use as a “control set” for training.
2. The SME then begins a training process using Simple Passive Learning or Simple Active Learning. In either case, the SME reviews documents and tags them relevant or non-relevant.
3. The TAR engine uses these judgments to build a classification/ranking algorithm that will find other relevant documents. It tests the algorithm against the control set to gauge its accuracy.
4. Depending on the testing results, the SME may be asked to do more training to help improve the classification/ranking algorithm.
5. This training and testing process continues until the classifier is “stable.” That means its search algorithm is no longer getting better at identifying relevant documents in the control set.

Even though training is iterative, the process is finite. Once the TAR engine has learned what it can about the control set, that’s it. You turn it loose to rank the larger document population (which can take hours to complete) and then divide the documents into categories to review or not. There is no opportunity to feed reviewer judgments back to the TAR engine to make it smarter.

TAR 2.0: Continuous Active Learning

In contrast, the CAL protocol merges training with review in a continuous process. Start by finding as many good documents as you can through keyword search, interviews, or any other means at your disposal. Then let your TAR 2.0 engine rank the documents and get the review team going.

[Diagram: Catalyst TAR 2.0 continuous active learning workflow]

As the review progresses, judgments from the review team are submitted back to the TAR 2.0 engine as seeds for further training. Each time reviewers ask for a new batch of documents, they are presented based on the latest ranking. To the extent the ranking has improved through the additional review judgments, reviewers receive better documents than they otherwise would have.

John has explained to us that his software has never had a control set, and it allows you to control the timing of continuous training, so in this sense his Catalyst software is already fully Predictive Coding 3.0 and 4.0 compliant. Even if your software has control set features, you can probably still disable them. That is what I do with the Kroll software that I typically use (see eg MrEDR.com). I am talking about a method of use here, not a specific algorithm, nor a patentable invention. So unless the software you use forces you to do a two-step process, or makes you use a control set, you can use these version 3.0 and 4.0 methods with it. Still, some modifications of the software would be advantageous to streamline and simplify the whole process that is inherent in Predictive Coding 3.0 and 4.0. For this reason I call on all software vendors to eliminate the secret control set and the dual-step process now.

Version 3.0 Patents Reject the Use of Control and Seed Sets

The main problem for us with the 1.0 work-flow methodology for Predictive Coding was not the two-fold nature of train then review, which is what 2.0 addressed, but its dependence on creation of a secret control set and seed set at the beginning of a project. That is the box labeled 430 in Figure Four of the Recommind patent. It is shown in Tredennick’s Version 1.0 diagram on the left as control set and seed set. The need for a random secret control set and seed set became an article of faith based on black letter statistics rules. Lawyers just accepted it without question as part of version 1.0 predictive coding. It is also one reason that the two-fold method of train then review, instead of CAL 2.0, is taking so long for some vendors to abandon.

Based on my experience and experiments with predictive coding methods since 2011, the random control set and seed set are both unnecessary. The secret control set is especially suspect. It does not work in real-world legal review projects, or worse, provides statistical misinformation as to recall. As mentioned, that is primarily because in the real world of legal practice relevance is a continually evolving concept. It is never the same at the beginning of a project, when the control set is created, as at the end. The engineers who designed version 1.0 simply did not understand that. They were not lawyers and did not appreciate the flexibility of relevance. They did not know about concept drift. They did not understand the inherent vagaries and changing nature of the search target in a large document review project. They also did not understand how human SMEs are, and how often they disagree with themselves on the classification of the same document, even without concept drift. As mentioned, they were also blinded by their own arrogance, tinged with antipathy against lawyers.

They did understand statistics. I am not saying their math was wrong. But they did not understand evidence, did not understand relevance, did not understand relevance drift (or, as I prefer to call it, relevance evolution), and did not understand efficient legal practice. Many I have talked to did not have any real understanding of how lawyers worked at all, much less document review. Most were just scientists or statisticians. They meant well, but they did harm nonetheless. These scientists did not have any legal training. If there were any lawyers on the version 1.0 software development team, they were not heard, or had never really practiced law. (As a customer, I know I was brushed off.) Things have gotten much better in this regard since 2008 and 2009, but still, many vendors have not gotten the message. They still manufacture version 1.0 type predictive coding software.

Jeremy Pickens, Ph.D., Catalyst’s in-house information scientist, seems to agree with my assessment of control sets. See Pickens, An Exploratory Analysis of Control Sets for Measuring E-Discovery Progress, DESI VI 2015, where he reports on his investigation of the effectiveness of control sets to measure recall and precision. Jeremy used the Grossman and Cormack TAR Evaluation Toolkit for his data and gold standards. Here is his conclusion:

A popular approach in measuring e-discovery progress involves the creation of a control set, holding out randomly selected documents from training and using the quality of the classification on that set as an indication of progress on or quality of the whole. In this paper we do an exploratory data analysis of this approach and visually examine the strength of this correlation. We found that the maximum-F1 control set approach does not necessarily always correlate well with overall task progress, calling into question the use of such approaches. Larger control sets performed better, but the human judgment effort to create these sets have a significant impact on the total cost of the process as a whole.
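To see concretely what kind of measurement Pickens is questioning, here is a minimal Python sketch, of my own devising, of the control-set progress metric: a random hold-out set is coded once at the start of the project, and the classifier’s precision, recall and F1 on that set serve as a proxy for progress on the whole collection. The function and figures are illustrative assumptions, not anything from his paper or from any vendor’s product.

```python
def control_set_scores(control_labels, control_predictions):
    """Precision, recall and F1 measured against a held-out control set.

    control_labels:      human codings made once, at the START of the project
    control_predictions: the classifier's current calls on those same documents
    Both are sequences of booleans (True = relevant).
    """
    tp = sum(1 for truth, pred in zip(control_labels, control_predictions) if truth and pred)
    fp = sum(1 for truth, pred in zip(control_labels, control_predictions) if not truth and pred)
    fn = sum(1 for truth, pred in zip(control_labels, control_predictions) if truth and not pred)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# At one percent prevalence a 1,000-document control set holds only about ten
# relevant documents, so each one swings the recall estimate by roughly ten
# percentage points. And if the relevance standard has drifted since the set
# was coded, the frozen "gold" labels no longer describe the real target.
```

The two weaknesses this exposes, noisy estimates built on a handful of relevant documents and stale labels once relevance evolves, are exactly the ones discussed above.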

A secret control set is not a part of the Predictive Coding 4.0 method. As will be explained in this course, we still have random selection reviews for prevalence and quality control purposes – Steps Three and Seven – but the documents are not secret and they are typically used for training (although they do not have to be). Moreover, after version 3.0 we eliminated any kind of special first round of training seed set, random based or otherwise. The first time the machine training begins is simply the first round. Sometimes it is big, sometimes it is not. It all depends on our technical and legal analysis of the data presented and the circumstances of the project. It also depends on our legal analysis of the disputed issues of fact in the lawsuit or other legal investigation. That is the kind of thing that lawyers do every day. No magic required, not even high intelligence; only the background and experience of a practicing lawyer.
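For readers who want to see the arithmetic behind the kind of random-sample prevalence check used in Steps Three and Seven, here is a short Python sketch. It is not anyone’s product code, just a standard binomial confidence interval (normal approximation) applied to a simple random sample; the sample numbers are made up for illustration.

```python
import math

def estimate_prevalence(sample_size, relevant_in_sample, z=1.96):
    """Point estimate and ~95% confidence interval for prevalence,
    using the normal approximation to the binomial distribution."""
    p = relevant_in_sample / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Illustration: 1,500 randomly selected documents reviewed, 120 coded relevant.
point, low, high = estimate_prevalence(1500, 120)
print(f"prevalence about {point:.1%}, 95% interval roughly {low:.1%} to {high:.1%}")
# Against a 1,000,000-document collection that projects to very roughly
# 66,000 to 94,000 relevant documents -- a planning range, not an exact count.
```

Note that nothing in this calculation requires the sampled documents to be kept secret or withheld from training; they are simply reviewed and counted.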

The seed set is dead. So too is the control set. Other statistical methods must be used to calculate recall ranges and other numeric parameters, beyond the ineffective control set method. And methods beyond just statistics must be used to evaluate the quality and success of a review project. See, e.g., EI-Recall and Zero Error Numerics (which includes statistics, but is not limited to them).
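Since the text refers to recall ranges, here is a simplified Python sketch of an elusion-based recall interval in the general spirit of EI-Recall. To be clear, this is my own illustrative arithmetic, not the published EI-Recall formula: it takes a random sample of the null set (the documents you do not intend to produce), puts a binomial interval around the relevant documents found in that sample, and converts it into low and high recall bounds.

```python
import math

def recall_range(true_positives_found, null_set_size,
                 null_sample_size, relevant_in_null_sample, z=1.96):
    """Rough recall interval from an elusion sample of the null set.
    All names and numbers are illustrative; this is not an official
    implementation of EI-Recall."""
    p = relevant_in_null_sample / null_sample_size
    margin = z * math.sqrt(p * (1 - p) / null_sample_size)
    fn_low = max(0.0, p - margin) * null_set_size    # fewest projected false negatives
    fn_high = min(1.0, p + margin) * null_set_size   # most projected false negatives

    recall_high = true_positives_found / (true_positives_found + fn_low)
    recall_low = true_positives_found / (true_positives_found + fn_high)
    return recall_low, recall_high

# Illustration: 9,000 relevant documents found and verified; 900,000 documents
# in the null set; a random elusion sample of 1,500 of them turns up 8 relevant.
low, high = recall_range(9000, 900_000, 1500, 8)
print(f"recall roughly between {low:.0%} and {high:.0%}")   # about 53% to 86%
```

The published EI-Recall articles address refinements this sketch ignores, but even a crude version like this shows how a recall range can be produced at the end of a project without any secret control set.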

Grossman and Cormack Patents

We do not claim any patents or other intellectual property rights to Predictive Coding 4.0, aside from copyrights to Losey’s writings and certain trade secrets that we use but have not published or disclosed outside of our circle of trust. But our friends Gordon Cormack and Maura Grossman, who are both now professors, do claim patent rights to their methods. The methods are apparently embodied in software somewhere, even though the software is not sold. In fact, we have never seen it, nor, as far as I know, has anyone else, except perhaps their students. Their patents are all entitled Systems and Methods for Classifying Electronic Information Using Advanced Active Learning Techniques: December 31, 2013, 8,620,842, Cormack; April 29, 2014, 8,713,023, Grossman and Cormack; and September 16, 2014, 8,838,606, Grossman and Cormack.

The Grossman and Cormack patents and patent applications are interesting for a number of reasons. For instance, they all contain the following paragraph in the Background section explaining why their invention is needed. As you can see, it criticizes all of the existing version 1.0 software on the market at the time of their applications (2013) (emphasis added):

Generally, these e-discovery tools require significant setup and maintenance by their respective vendors, as well as large infrastructure and interconnection across many different computer systems in different locations. Additionally, they have a relatively high learning curve with complex interfaces, and rely on multi-phased approaches to active learning. The operational complexity of these tools inhibits their acceptance in legal matters, as it is difficult to demonstrate that they have been applied correctly, and that the decisions of how to create the seed set and when to halt training have been appropriate. These issues have prompted adversaries and courts to demand onerous levels of validation, including the disclosure of otherwise non-relevant seed documents and the manual review of large control sets and post-hoc document samples. Moreover, despite their complexity, many such tools either fail to achieve acceptable levels of performance (i.e., with respect to precision and recall) or fail to deliver the performance levels that their vendors claim to achieve, particularly when the set of potentially relevant documents to be found constitutes a small fraction of a large collection.

They then indicate that their invention overcomes these problems and is thus a significant improvement over prior art. In Figure Eleven of their patent (shown below) they describe one such improvement, “an exemplary method 1100 for eliminating the use of seed sets in an active learning system in accordance with certain embodiments.”

[Figure Eleven from the Grossman & Cormack patent]

These are basically the same kind of complaints that I have made here against Predictive Coding 1.0 and 2.0. I understand the criticism regarding complex interfaces that rely on multi-phased approaches to active learning. If the software forces use of control set and seed set nonsense, then it is an overly complex interface. (It is not overly complex if it allows other types of search, such as keyword, similarity or concept, for this degree of complexity is necessary for a multimodal approach.) I also understand their criticism of the multi-phased approaches to active learning, which version 2.0 fixed by the use of continuous training instead of train and then review.

The Grossman & Cormack criticism about low-prevalence document collections, which are the rule, not the exception, in legal search, is also correct. It is another reason the control set approach cannot work in legal search. The number of relevant documents to be found constitutes a small fraction of a large collection, so a random control set sample is very unlikely to be representative, much less complete. That is an additional problem, separate and apart from relevance shift.
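A quick back-of-the-envelope Python calculation (my own illustration, not anything from the patent) shows the arithmetic problem: at one percent prevalence, even a 1,000-document random control set is expected to contain only about ten relevant documents, and there is a real chance it contains five or fewer.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for a Binomial(n, p) draw -- here, the chance that a random
    control set of n documents contains k or fewer relevant documents."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n, prevalence = 1000, 0.01            # 1,000-document control set, 1% prevalence
expected = n * prevalence             # about 10 relevant documents expected
few = binom_cdf(5, n, prevalence)     # chance of 5 or fewer relevant documents
print(f"expected relevant documents: {expected:.0f}")
print(f"chance of 5 or fewer:        {few:.1%}")
# Comes out to roughly a 6-7% chance of drawing 5 or fewer relevant documents --
# far too thin a basis for measuring recall, even before relevance drift.
```

Ten or so documents cannot anchor a stable recall measurement, which is the practical point behind the low-prevalence criticism.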

About the only complaint the Grossman & Cormack patent makes that I do not understand is the gripe about large infrastructure and interconnection across many different computer systems in different locations. For the Kroll software at least, and also Catalyst, that is the vendor’s problem, not the attorney’s. All the user does is sign on to a secure cloud server.

Notice that there is no seed set or control set in the Grossman & Cormack patent diagram, such as you see in the old Recommind patent. Much of the rest of the patent, insofar as I am able to understand the arcane patent language used, consists of applications of continuous training techniques that have been tested and explained in their writings, including many additional variables and techniques not mentioned in their articles. See, e.g., Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014. Their patent includes the continuous training methods, of course, but also eliminates the use of seed sets. I assumed this also meant the elimination of the use of control sets, a fact that Professor Cormack has since confirmed. Their CAL methods do not use secret control sets. Their software patents are thus like our own 3.0 and 4.0 innovations, although they do not use IST.

Go on to Class Two.

Or pause to do this suggested “homework” assignment for further study and analysis.

SUPPLEMENTAL READING: Review all of the patents cited, especially the Grossman and Cormack patents (almost identical language was used in each, as you will see, so you only need to look at one). Just read the sections in the patents that are understandable and skip the arcane jargon. We also suggest you read all of the articles cited in this course. That is a standing homework assignment in all classes. Again, some of it may be too technical. Just skip or skim those sections. Also see if you can access Losey’s LTN editorial, Vendor CEOs: Stop Being Empty Suits & Embrace the Hacker Way. We suggest you also check out HackerLaw.org.

EXERCISES: What does TTR stand for in door number one of the above graphic? Take a guess. I did not use the acronym in this class, but if you have understood this material, you should be able to guess what it means. In later classes we will add more challenging exercises at the end, but this first class is hard enough, so we will let it go with that.

Students are invited to leave a public comment below. Insights that might help other students are especially welcome. Let’s collaborate!

_

e-Discovery Team LLC COPYRIGHT 2017

ALL RIGHTS RESERVED

_

 

