This is a continuation of my earlier blog with the same title: Latest Grossman and Cormack Study Proves Folly of Using Random Search for Machine Training – Part One.
Latest Grossman Cormack Study
The information scientist behind this study is Gordon V. Cormack, Professor, University of Waterloo. He has a long history as a search expert outside of legal search, including special expertise in spam searches. The lawyer who worked with Gordon on this study is Maura R. Grossman, Of Counsel, Wachtell, Lipton, Rosen & Katz. In addition to her J.D., she has a PhD in psychology, and has been a tireless advocate for effective legal search for many years. Their work is well known to everyone in the field.
The primary purpose of their latest study was not to test the effectiveness of training based on random samples. That was a secondary issue. The primary focus of the study was to test the relative effectiveness of three different training protocols for machine learning: Continuous Active Learning, Simple Active Learning, and Simple Passive Learning.
Our primary experiments evaluated the specific formulations of CAL, SAL, and SPL described in Section 2; secondary experiments explored the effect of using keyword-selected versus randomly selected documents for the seed and training sets.
Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pg. 4.
They picked three different approaches, CAL, SAL and SPL, which I will explain next. First, however, I want to make clear that the three protocols they tested are not the only approaches used in machine learning. There are many more. My own approach uses all three mentioned, plus others, many others. That is why I call it multimodal. Cormack and Grossman used three fairly simplistic methods in this experiment because they were easy to reproduce in an experimental setting. Moreover, all three protocols tested had some connection with real world legal practice.
The Cormack Grossman protocol test was a scientific experiment, not a demonstration of the latest and greatest machine-learning protocols. Despite what some have said, Cormack and Grossman are not vendors. Unlike me, they are not even advocates of any particular type of machine learning protocol, nor of any particular type of vendor software. They are scientists and educators. They do not sell software. They do not sell any particular secret-sauce legal search methods, although they no doubt have such methods of their own; they do not even promote them.
I am in essentially the same position as Maura Grossman in some respects. I am not a vendor either, but when I sell my time as a lawyer to lead predictive coding projects, I am, in effect, selling my expertise in particular machine training protocols. I have my own methods, and I often create special methods to suit a particular project. Although I have written about fifty articles on the subject, I have only disclosed the basic outlines of the protocols I use, not all of the details. Plus, I continue to learn and invent new methods. When you retain me as a lawyer, I use these methods to quickly and effectively find the information needed.
Maura Grossman is also an attorney. In that capacity she no doubt privately advocates for some methods over others, and like I do, sells her secret sauce, her methods, as part of her work for her law firm’s clients. But that is not what she is doing here in her work with Gordon. They are testing a few of the most basic training protocols used for scientific purposes. They do not use any of these simplistic methods in their practice any more than I do. If they are selling anything here, it is the truth, the facts from experiments. Even that they give away and invite peer review. No, they are not vendors, and this is true science, not marketing, not legal advocacy.
Three Machine Learning Protocols Tested
Cormack and Grossman set up an ingenious experiment to test the effectiveness of three machine learning protocols. It is ingenious for several reasons, not the least of which is that they created what they call an “evaluation toolkit” to perform the experiment. They have even made this same toolkit, this same software, freely available for use by any other qualified researchers. They invite other scientists to run the experiment for themselves. They invite open testing of their experiment. They invite vendors to do so too, but so far there have been no takers.
That is the true essence of the scientific approach. Find truth from empirical evidence, not dogma. (That is, by the way, also the approach of all enlightened legal systems, where justice is based on facts, on evidence, and not on social opinions or religious dictates.) Cormack and Grossman have opened up their experiment to full public view and re-testing in an extraordinary way. It is my strong desire, and I am sure theirs as well, that other scientists will take them up on this offer and run their own tests to get a clearer and bigger view of the facts, and thus a clearer and better view of the truth of the relative efficacy of these three training methods. One information scientist, William Webber, has already tested their study using his own approach. His findings generally confirmed and expanded upon those of Cormack and Grossman as to the relative ineffectiveness of random-only training, especially on low-prevalence data sets (which is what lawyers typically work with in legal search). Webber, Random vs active selection of training examples in e-discovery (Evaluating e-Discovery blog, 7/14/14).
I am not a big fan of acronyms, and so I personally do not like how Cormack and Grossman use three acronyms throughout their study to label the three machine training protocols. But to understand their report you need to learn and remember these acronyms — CAL, SAL, SPL — just do not look for me to ever use them again in my blog, except to discuss this experiment. I will instead invent what I hope are clever summary words that will be much easier for me to remember.
The rest of this Part Two of the blog will be devoted to explaining, in my own words, the CAL, SAL, and SPL machine training protocols examined in this experiment.
CAL – Continuous Active Learning.
This protocol uses one method for the first round of training, and another for all subsequent rounds. In the first round, documents are selected by human judgmental sampling using keyword search; in the second and subsequent rounds, a learning algorithm that classifies documents and ranks them according to probability of relevance is used. In the first round, 1,000 of the keyword search results are selected at random. In the second and subsequent rounds, the top 1,000 ranked documents are selected, reviewed, and added to the training set. These are the documents that the machine predicts are most likely to be relevant, the ones in which it has the highest degree of certainty as to its prediction of relevance. The rankings change as the training evolves. This process continues until adequate recall is achieved. I call this training protocol a modified keyword and high probability machine selected method.
I call it modified because keyword selected documents are only used in the first round of training, and thereafter, the keyword search method is dropped. All subsequent training relies exclusively on high probability machine selected documents for review.
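For readers who like to see the moving parts, here is a minimal sketch in Python of how such a CAL loop could be wired up. It is my own illustration, not the authors' evaluation toolkit; the corpus, keyword_seed_ids, human_review, and stop_rule names are hypothetical stand-ins for whatever your review platform actually provides, and the logistic regression classifier is just one plausible choice of learning algorithm.

```python
# Hypothetical sketch of a CAL-style loop; not the Cormack-Grossman toolkit.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

BATCH = 1000  # documents coded per round in the experiment

def cal_review(corpus, keyword_seed_ids, human_review, stop_rule):
    ids = list(corpus)                         # corpus: doc_id -> text
    row = {doc_id: n for n, doc_id in enumerate(ids)}
    X = TfidfVectorizer().fit_transform(corpus[i] for i in ids)

    # Round one: a batch of keyword-selected documents, coded by a human.
    labels = {i: human_review(i) for i in keyword_seed_ids[:BATCH]}

    while not stop_rule(labels):  # e.g., adequate recall has been achieved
        # Retrain on every document coded so far (this assumes the coded
        # set contains both relevant and irrelevant examples).
        coded = list(labels)
        model = LogisticRegression(max_iter=1000)
        model.fit(X[[row[i] for i in coded]], [labels[i] for i in coded])

        # Rank the uncoded documents, then review the highest-ranked batch
        # and add it to the training set; that is the "continuous" part.
        pool = [i for i in ids if i not in labels]
        scores = model.predict_proba(X[[row[i] for i in pool]])[:, 1]
        for k in np.argsort(scores)[::-1][:BATCH]:
            labels[pool[k]] = human_review(pool[k])
    return labels
```

The only thing that changes from round to round is the training set; the ranking, and therefore the next batch, is recomputed each time.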
Cormack and Grossman call the first round of machine training documents the seed set. They do so out of tradition and common usage in the legal search community, but frankly, I see no reason to continue to use this dated language. I prefer to call the first round of machine training just that, the first round. Otherwise many people are confused by thinking that the first round of training is somehow special. In fact, machine training iterations are just like baseball innings. The first inning may be important psychologically, but it is just another inning, like all of the others, and the game is never over until at least the ninth. The score before that is irrelevant.
We do not know from this study the impact of using keyword search in every round. They did not test that exactly, but in their secondary experiments Cormack and Grossman did explore the effect of adding keyword-selected training documents to the SPL (pure random) method. They found that adding keyword-selected documents to the first round of training generally improves the performance of SPL, and that using keyword-selected documents for all training sets usually, but not always, improves SPL. They also found that replacing keyword search with random selection harms the performance of the CAL and SAL methods.
In my work I use every possible type of search in every round of training, the first round, second, third, whatever. The only type of search I do not use in the first round, the so-called seed set, is machine selected uncertainty documents. That is because the classification probability ranking has not yet begun. That particular kind of search method can only begin after the first round of training (although if a presuit review had already generated probability rankings, those ranked documents could be used in the first round too). I pick and choose which methods to use in a particular training round depending on the case, the data, and what I see and learn from studying the documents as the review progresses. It often depends on what I learned in prior rounds. Maybe I will use 100 machine selected documents in the second round, maybe more, maybe none. Maybe I will focus on the highest rankings, as done here with CAL, or maybe I will focus on the middle rankings, the so-called uncertainty zone used in the next protocol tested here, SAL. I may even check out some of the lowest rankings. I may also switch the type of classification ranked, say move from relevant to irrelevant, or, more likely, from relevant to highly relevant (hot). Maybe I will use some concept search, some similarity search, or more keywords. It all depends on the circumstances and the new documents that I may have seen. It all flows, it changes, and it is never the same from one project to the next, or even from one round to the next, although the basic parameters remain the same.
From what Maura and other search experts have told me, they are all doing something similar to this. But they do not share the exact details, the secret sauce, with me, any more than I do with them. Since we use different software tools, that would in any event be very hard to do. The many different things that you can do with legal search in general, and machine training in particular, depend in part on the software tools that you use. The latest Kroll Ontrack EDR software that I use provides me with tremendous flexibility. But still, even with Kroll Ontrack’s software, I usually have my friends at Kroll customize the predictive coding interface somewhat for me. Also, I almost always follow my own methods, not the default software settings, especially when it comes to the use of random document selections.
SAL – Simple Active Learning.
This protocol again uses one method for the first round of training, and another for all subsequent rounds. In SAL, documents for the initial round of training are again selected, just as in CAL, by random selection from the keyword search results. Documents for the following rounds are again selected by a learning algorithm, only this time the highest-ranked documents are not the ones used for training. Instead, only the middle-ranked documents are reviewed, the ones in which the machine has the highest degree of uncertainty. In my writings I refer to this latter type of uncertainty-based selection as machine selected documents. The machine selected documents are typically in the 40% to 60% probability range. Under this system, after the initial round of training, the only documents reviewed for training purposes are those about which the machine is most uncertain in its prediction of coding. Once training is complete, the top ranked documents are reviewed for production, but not to further train the system, until adequate recall is achieved. So my summary of this training protocol is that it is a modified keyword and uncertainty machine selected method.
Again I say modified because random keyword samples are only used to select documents in the first round of training. In all subsequent rounds, keyword-selected documents are not used; instead, documents are selected solely on the basis of automated uncertainty calculations, what I call machine selection.
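To show the difference in one small piece of code, here is a hedged sketch of the selection step that sets SAL apart from CAL: instead of taking the top of the ranking, it takes the documents whose predicted probability of relevance sits closest to 0.5. The scores and pool names are the same hypothetical objects used in the CAL sketch above.

```python
# Hypothetical SAL-style selection step: pick the least certain documents.
import numpy as np

def uncertainty_batch(scores, pool, batch=1000):
    # Smaller distance from 0.5 means the learner is less certain.
    order = np.argsort(np.abs(np.asarray(scores) - 0.5))[:batch]
    return [pool[k] for k in order]
```

Because the batch size, not the probability range, is what is fixed, the band of probabilities actually swept in will widen or narrow with the data.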
Note that in my fifty-hour Borg protocol test I used something close to this SAL protocol. Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (a five-part written and video series comparing two different kinds of predictive coding search methods). I used a totally random sample of 1,183 documents for the first round. In every round thereafter I used a combination of random and uncertainty machine selected documents. I called this the Enlightened Borg Approach. Three-Cylinder Multimodal Approach To Predictive Coding. In this experiment I used this protocol in 50 rounds of training, plus a final quality assurance round with another random sample. In the 49 rounds after the first I reviewed 200 documents in each round; 20% of the documents, 40, were randomly selected, and 80%, 160 documents, were uncertainty machine selected.
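For illustration only, here is a rough sketch of how one such 200-document round could be composed under those proportions. The scores and pool names are hypothetical, as in the sketches above; this is not the actual Borg Challenge code.

```python
# Hypothetical composition of one "Enlightened Borg" round: 20% random,
# 80% uncertainty-selected from the remaining uncoded documents.
import random

def borg_round(scores, pool, round_size=200, random_share=0.2):
    # 20% of the round (40 documents) picked purely at random.
    n_random = int(round_size * random_share)
    random_picks = set(random.sample(pool, n_random))
    # The other 80% (160 documents) picked from the uncertainty zone,
    # i.e. predicted probabilities closest to 0.5, as in the SAL sketch.
    rest = [(s, d) for s, d in zip(scores, pool) if d not in random_picks]
    rest.sort(key=lambda sd: abs(sd[0] - 0.5))
    machine_picks = [d for s, d in rest[:round_size - n_random]]
    return list(random_picks) + machine_picks
```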
I compared my results with a prior review of the same 699,082 documents for the same issue (employee terminations, excluding voluntary departures). I spent about the same amount of time in the earlier review, fifty hours. But in the earlier review I used a fully multimodal approach, where I used random, machine selected, and human judgment selected documents, primarily keyword. That is what I call the full three-cylinder search engine approach. Three-Cylinder Multimodal Approach To Predictive Coding. The first document review experiment is described in Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron (PDF), plus the blog introducing this 82-page narrative, with a second blog regarding an update.
The random and machine selected protocol, which I called monomodal, did surprisingly well in my comparison experiment, but was still surpassed by the full multimodal approach, especially in the all-important search for hot documents. Full multimodal did 57% better with that classification. Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents, Part One and Part Two.
The SAL protocol tested by Cormack and Grossman uses human input in the form of keyword selection for the first round. Then, in all following rounds of SAL, documents are selected for review based on the machine’s uncertainty ranking. The uncertainty rankings change as the training evolves. The exact range of probability uncertainty used in this experiment would depend on the number of documents falling within the 40%–60% zone, because the number of training documents used was fixed at 1,000 for each round. As Cormack and Grossman put it, machine selected documents are those “about which the learning algorithm is least certain.”
I advocate for a three-cylinder search engine approach because this is part of my multimodal method. I use every known search method, as appropriate to the circumstances, to try to find the target documents. I use random methods; human judgmental methods, including, but certainly not limited to, keyword search; and machine selected methods, which rely on a predictive coding ranking system. The machine selected methods that I use include the highest-ranking documents and the middle uncertainty rankings, both of which were considered by Cormack and Grossman in this experiment. But in my work I may examine a variety of different ranges of probability ranking. The movement of these rankings can also provide valuable insights. This may be the most detailed description that I have ever provided of the finer points of my predictive coding training methods.
Again, please remember that the few methods selected for study by Cormack and Grossman in this experiment are just a few out of hundreds of different possible search methods. Indeed, the possible combinations would reach into the tens of thousands, or higher, depending on the duration and complexity of the review. Very few, if any, legal search experts would use only a couple of training protocols, much less just one.
SPL – Simple Passive Learning.
The SPL protocol uses simple random selection of documents throughout the project, in the initial round and all following rounds. SPL uses no other search methods to find training documents. It uses neither machine selected nor human selected documents. Once training is complete, the top ranked documents are reviewed for production, but not to further train the system, until adequate recall is achieved. My summary of this training protocol is that it is a pure random selection method.
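For contrast with the two sketches above, here is a minimal sketch of the SPL training selection. Again, this is my own illustration, not the authors' toolkit; human_review and all_ids are hypothetical stand-ins for your own coding function and document list.

```python
# Hypothetical SPL-style training: every batch is a simple random sample.
import random

def spl_training(all_ids, human_review, rounds=100, batch=1000):
    labels = {}
    for _ in range(rounds):
        pool = [i for i in all_ids if i not in labels]
        for doc_id in random.sample(pool, min(batch, len(pool))):
            labels[doc_id] = human_review(doc_id)
    # Only after training stops is a classifier trained on these labels
    # and the top-ranked remainder reviewed for production.
    return labels
```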
In my prior writings on machine training I have called this the Lucky Borg Approach. Three-Cylinder Multimodal Approach To Predictive Coding (“Some types of predictive coding software rely entirely on random chance to select documents for machine training. They are, so to speak, a one-cylinder predictive coding search engine. They run on chance alone.”)
Cormack and Grossman used this monomodal, pure random approach to search the same document sets as the other two protocols. They again used 100 rounds, 1,000 documents per round, and simulated actual human reviewer results by reference to prior TREC studies of the same documents. Aside from the machine training protocols, the details of how the simulated reviews were conducted were identical for all three processes. That is how they obtained comparable results and measured the relative efficacy of the three methods tested.
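One concrete, if simplified, way to think about that comparison is to compute recall, the fraction of all truly relevant documents found, at a given level of review effort for each protocol. The reviewed_in_order and relevant names below are hypothetical; in the actual study the relevance judgments came from the prior TREC assessments.

```python
# Hypothetical recall-at-effort calculation, one point on a gain curve.
def recall_at(reviewed_in_order, relevant, effort):
    found = sum(1 for d in reviewed_in_order[:effort] if d in relevant)
    return found / len(relevant)

# Example: compare recall_at(cal_sequence, relevant_ids, 20000) with the
# same call for the SAL and SPL review sequences to see which protocol
# finds more of the relevant documents, sooner.
```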
To be continued . . . in Part Three of this blog I will elaborate on the results of the Cormack Grossman experiment. In Part Four I will conclude with my final opinions and analysis, and friendly recommendations for any vendors still using random-only training protocols. Hint – it has something to do with the famous Borg motto.