This is the conclusion of my four part blog: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One and Part Two and Part Three.
Cormack and Grossman’s Conclusions
Gordon Cormack and Maura Grossman have obviously put a tremendous amount of time and effort into this study. In their well-written conclusion they explain why they did it, and provide a good summary of their findings:
Because SPL can be ineffective and inefficient, particularly with the low-prevalence collections that are common in ediscovery, disappointment with such tools may lead lawyers to be reluctant to embrace the use of all TAR. Moreover, a number of myths and misconceptions about TAR appear to be closely associated with SPL; notably, that seed and training sets must be randomly selected to avoid “biasing” the learning algorithm.
This study lends no support to the proposition that seed or training sets must be random; to the contrary, keyword seeding, uncertainty sampling, and, in particular, relevance feedback – all non-random methods – improve significantly (P < 0.01) upon random sampling.
While active-learning protocols employing uncertainty sampling are clearly more effective than passive-learning protocols, they tend to focus the reviewer’s attention on marginal rather than legally significant documents. In addition, uncertainty sampling shares a fundamental weakness with passive learning: the need to define and detect when stabilization has occurred, so as to know when to stop training. In the legal context, this decision is fraught with risk, as premature stabilization could result in insufficient recall and undermine an attorney’s certification of having conducted a reasonable search under (U.S.) Federal Rule of Civil Procedure 26(g)(1)(B).
This study highlights an alternative approach – continuous active learning with relevance feedback – that demonstrates superior performance, while avoiding certain problems associated with uncertainty sampling and passive learning. CAL also offers the reviewer the opportunity to quickly identify legally significant documents that can guide litigation strategy, and can readily adapt when new documents are added to the collection, or new issues or interpretations of relevance arise.
Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pg. 9.
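To make the difference between the three protocols concrete, here is a minimal sketch of how each one might choose the next batch of documents for review. It is an illustration only, not code from the study: the function name `select_batch`, the score dictionary, and the use of 0.5 as the "most uncertain" score are all assumptions made for this example.

```python
import random

def select_batch(scores, labeled, batch_size, method, rng=None):
    """Pick the next documents to review from the unlabeled pool.

    scores  -- dict mapping doc id to the model's current relevance score in [0, 1]
               (hypothetical; a real tool would produce these from its own classifier)
    labeled -- set of doc ids that have already been reviewed
    method  -- "random" (SPL-style), "uncertainty" (SAL-style), or "relevance" (CAL-style)
    """
    pool = [d for d in scores if d not in labeled]
    if method == "random":
        rng = rng or random.Random()
        return rng.sample(pool, min(batch_size, len(pool)))
    if method == "uncertainty":
        # Assumption: scores near 0.5 mark the documents the model is least sure about.
        ranked = sorted(pool, key=lambda d: abs(scores[d] - 0.5))
    elif method == "relevance":
        # Relevance feedback: review the highest-scoring documents first.
        ranked = sorted(pool, key=lambda d: scores[d], reverse=True)
    else:
        raise ValueError(f"unknown method: {method}")
    return ranked[:batch_size]
```

In a continuous protocol the reviewer's coding of each batch would be fed back to retrain the model before the next call; that retraining loop is left out of the sketch.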
The insights and conclusions of Cormack and Grossman are perfectly in accord with my own experience and practice with predictive coding search efforts, both with messy real world projects, and the four controlled scientific tests I have done over the last several years (only two of which have to date been reported, and the fourth is still in progress). I agree that a relevancy approach that emphasizes high ranked documents for training is one of the most powerful search tools we now have. So too is uncertainty training (mid ranked) when used judiciously, as well as keywords, and a number of other methods. All the many tools we have to find both relevant and irrelevant documents for training should be used, depending on the circumstances, including even some random searches.
In my view, we should never use just one method to select documents for machine training, and ignore the rest, even when it is a good method like Cormack and Grossman have shown CAL to be. When the one method selected is the worst of all possible methods, as random search has now been shown to be, then the monomodal approach is a recipe for ineffective, over-priced review.
Why All the Foolishness with Random Search?
As shown in Part One of this article, it is only common sense to use what you know to find training documents, and not rely on a so-called easy way of rolling dice. A random chance approach is essentially a fool’s method of search. The search for evidence to do justice is too important to leave to chance. Cormack and Grossman did the legal profession a favor by taking the time to prove the obvious in their study. They showed that even very simplistic multimodal search protocols, CAL and SAL, do better at machine training than monomodal random only.
Information scientists already knew this rather obvious truism, that multimodal is better, that the roulette wheel is not an effective search tool, that random chance just slows things down and is ineffective as a machine training tool. Yet Cormack and Grossman took the time to prove the obvious because the legal profession is being led astray. Many are actually using chance as if it were a valid search method, although perhaps not in the way they describe. As Cormack and Grossman explained in their report:
While it is perhaps no surprise to the information retrieval community that active learning generally outperforms random training, this result has not previously been demonstrated for the TAR Problem, and is neither well known nor well accepted within the legal community.
Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014 at pg. 8.
As this quoted comment suggests, everyone in the information science search community knew this already, that the random only approach to search is inartful. So do most lawyers, especially the ones with years of hands-on experience in search for relevant ESI. So why in the world is random-only search still promoted by some software companies and their customers? Is it really to address the so-called problem of “not knowing what you don’t know”? That is the alleged inherent bias of using knowledge to program the AI. The total-random approach is also supposed to prevent overt, intentional bias, where lawyers might try to mis-train the AI searcher algorithm on purpose. These may be the stated reasons by vendors, but there are other reasons. There must be, because these excuses do not hold water. This was addressed in Part One of this article.
This bias-avoidance claim must just be an excuse because there are many better ways to counter myopic effects of search driven too narrowly. There are many methods and software enhancements that can be used to avoid overlooking important, not yet discovered types of relevant documents. For instance, allow machine selection of uncertain documents, as was done here with the SAL protocol. You could also include some random document selection into the mix, and not just make the whole thing random. It is not all or nothing, not logically at least, but perhaps it is as a practical matter for some software.
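Here is a rough sketch of what "some random in the mix" could look like in practice. It is my own illustration, not any vendor's design: the function name `mixed_batch`, the 30 percent and 20 percent shares, and the 0.5 uncertainty midpoint are all assumed values picked for the example.

```python
import random

def mixed_batch(scores, labeled, batch_size,
                uncertain_share=0.3, random_share=0.2, seed=None):
    """Build one review batch from three sources: top-ranked, uncertain, and random.

    scores  -- dict mapping doc id to a relevance score in [0, 1] (hypothetical input)
    labeled -- set of doc ids already reviewed
    The share parameters are illustrative defaults, not recommended settings.
    """
    rng = random.Random(seed)
    pool = [d for d in scores if d not in labeled]
    n_uncertain = int(batch_size * uncertain_share)
    n_random = int(batch_size * random_share)
    n_top = batch_size - n_uncertain - n_random

    chosen = []

    def take(candidates, n):
        # Add up to n documents from candidates, skipping any already chosen.
        for d in candidates:
            if n <= 0:
                break
            if d not in chosen:
                chosen.append(d)
                n -= 1

    take(sorted(pool, key=lambda d: scores[d], reverse=True), n_top)      # high ranked
    take(sorted(pool, key=lambda d: abs(scores[d] - 0.5)), n_uncertain)   # mid ranked
    take(rng.sample(pool, len(pool)), n_random)                           # random slice
    return chosen
```

The random slice is there to give the reviewer a look at documents the model would never surface on its own, while most of the review effort still goes where the current knowledge points.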
My preferred solution to the problem of “not knowing what you don’t know” is to use a combination of all those methods, buttressed by a human searcher that is aware of the limits of knowledge. I mean, really! The whole premise behind using random as the only way to avoid a self-looping trap of “not knowing what you don’t know” assumes that the lawyer searcher is a naive boob or dishonest scoundrel. It assumes lawyers are unaware that they don’t know what they don’t know. Please, we know that perfectly well. All experienced searchers know that. This insight is not just the exclusive knowledge of engineers and scientists. Very few attorneys are that arrogant and self absorbed, or that naive and simplistic in their approach to search.
No, this whole “you must use random-only search to avoid prejudice” argument is just a smoke screen to hide the real reason a vendor sells software that only works that way. The real reason is that poor software design decisions were made in a rush to get predictive coding software to market. Software was designed to only use random search because it was easy and quick to build software like that. It allowed for quick implementation of machine training. Such simplistic types of AI software may work better than poorly designed keyword searches, but they are still far inferior to more complex machine training systems, as Cormack and Grossman have now proven. They are inferior to a multimodal approach.
The software vendors with random only training need to move on. They need to invest in their software to adopt a multimodal approach. In fact, it appears that many have already done so, or are in the process. Yes, such software enhancements take time and money to implement. But we need software search tools for adults. Stop all of the talk about easy buttons. Lawyers are not simpletons. We embrace hard work. We are masters of complexity. Give us choices. Empower the software so that more than one method can be used. Do not force us to use only random selection.
We need software tools that respect the ability of attorneys to perform effective searches for evidence. This is our sand box. That is what we attorneys do, we search for evidence. The software companies are just here to give us tools, not to tell us how to search. Let us stop the arguments and move on to discuss more sophisticated search methods and tools that empower complex methods.
Attorneys want software with the capacity to integrate all search functions, including random, into a multimodal search process. We do not want software with only one type of machine training ability, be it CAL, SAL or SPL. We do not want software that can only do one thing, and then have the vendor build a false ideology around their one capacity that says their method is the best and only way. These are legal issues, not software issues.
Attorneys do not just want one search tool, we want a whole tool chest. The marketplace will sort out whose tools are best, so will science. For vendors to remain competitive they need to sell the biggest tool chest possible, and make sure the tools are well built and perform as advertised. Do not just sell us a screwdriver and tell us we do not need a hammer and pliers too.
Leave the legal arguments as to reasonability and rules to lawyers. Just give us the tools and we lawyers will find the evidence we need. We are experts at evidence detection. It is in our blood. It is part of our proud heritage, our tradition.
Finding evidence is what lawyers do. The law has been doing this for millennia. Think back to the story of the judicial decision of King Solomon. He decided to award the child to the woman he saw cry in response to his sham decision to cut the baby in half. He based his decision on the facts, not ideology. He found the truth in clever ways built around facts, around evidence.
Lawyers always search to find evidence so that justice can be done. The facts matter. It has always been an essential part of what we do. Lawyers always adapt with the times. We always demand and use the best tools available to do our job. Just think of Abraham Lincoln who readily used telegraphs, the great new high-tech invention of his day. When you want to know the truth of what happened in an event that took place in the recent past, you hire a lawyer, not an engineer nor scientist. That is what we are trained to do. We separate the truth from the lies. With great tools we can and will do an even better job.
Many multimodal based software vendors already understand all of this. They build software that empowers attorneys to leverage their knowledge and skills. That is why we use their tools. Empowerment of attorneys with the latest AI tools empowers our entire system of justice. That is why the latest Cormack Grossman study is so important. That is why I am so passionate about this. Join with us in this. Demand diversity and many capacities in your search software, not just one.
Vendor Wake Up Call and Plea for Change
My basic message to all manufacturers of predictive coding software who use only one type of machine training protocol is to change your ways. I mean no animosity at all. Many of you have great software already; it is just the monomodal method built into your predictive coding features that I challenge. This is a plea for change, for diversity. Sell us a whole tool chest, not just a single, super-simple tool.
Yes, upgrading software takes time and money. But all software companies need to do that anyway to continue to supply tools to lawyers in the Twenty-First Century. Take this message as both a wake up call and a respectful plea for change.
Dear software designers: please stop trying to make the legal profession look only under the random lamp. Treat your attorney customers like mature professionals who are capable of complex analysis and skills. Do not just assume that we do not know how to perform sophisticated searches. I am not the only attorney with multimodal search skills. I am just the only one with a blog who is passionate about it. There are many out there with very sophisticated skills and knowledge. They may not be as old (I prefer to say experienced) and loud mouthed (I prefer to say outspoken) as I am, but they are just as skilled. They are just as talented. More importantly, their numbers are growing rapidly. It is a generation thing too, you know. Your next generation of lawyer customers are just as comfortable with computers and big data as I am, maybe more so. Do you really doubt that Adam Losey and his generation will surpass our accomplishments with legal search? I don’t.
Dear software designers: please upgrade your software and get with the multi-feature program. Then you will have many new customers, and they will be empowered customers. Do not have the money to do that? Show your CEO this article. Lawyers are not stupid. They are catching on, and they are catching on fast. Moreover, these scientific experiments and reports will keep on too. The truth will come out. Do you want to survive the inevitable vendor closures and consolidation? Then you need to invest in more sophisticated, fully featured software. Your competitors are.
Dear software designers: please abandon the single feature approach, then you will be welcome in the legal search sandbox. I know that the limited functionality software that some of you have created is really very good. It already has many other search capacities. It just needs to be better integrated with predictive coding. Apparently some single feature software already produces decent results, even with the handicap of random-only. Continue to enhance and build upon your software. Invest in the improvements needed to allow for full multimodal, active, judgmental search.
A random only search method for predictive coding training documents is ineffective. The same applies to any other training method if it is applied to the exclusion of all others. Any experienced searcher knows this. Software that relies solely on a random only method should be enhanced and modified to allow attorneys to search where they know. All types of training techniques should be built into AI based software, not just random. Random may be easy, but it is foolish to only search under the lamp post. It is foolish to turn a blind eye to what you know. Attorneys, insist on having your own flashlight that empowers you to look wherever you want. Shine your light wherever you think appropriate. Use your knowledge. Equip yourself with a full tool chest that allows you to do that.
To a certain extent I am not surprised by these results and I will explain why in a moment. You may want to rethink your pleas to vendors, however, at least to the type of change you are suggesting.
The predictive coding technology offers two benefits for litigators. The first is the efficiency and effectiveness of machine coding for document review. The second is defensibility. Let me address the second first.
The defensibility of predictive coding is grounded in its use of statistical theory. When processes, including random selection of training set documents, are properly employed predictive coding enables users to make certain assertions about the larger population with quantifiable confidence. When random selection and laws of probability are forsaken for judgmental selection there is no quantifiable confidence. Instead, all bets are off and one is left with only unscientific judgment.
I would be concerned that anyone forsaking random selection for judgmental selection would be pulling a “Nixon” if the outcome was challenged by an adversary. I think that Nixon said, “I gave them the knife and they twisted it with relish.”
To a certain extent, though, I am not surprised by these results. Simple random sampling has many challenges with highly variable populations. To overcome these limitations larger sample sizes are often needed under simple random sampling versus other techniques like cluster sampling or stratified sampling.
I could see that if one were to use cluster sampling or stratified sampling plans, one could use keyword search or other techniques to develop the clusters or strata prior to the actual selection of documents from each stratum using random sampling. In that fashion I think that one could enjoy the greater economies and accuracies indicated by the Cormack and Grossman results while not forsaking the benefits of statistical theory. Perhaps that is what you should advocate with your vendor suggestions: a more complex toolbox would not just involve simple random sampling but provide cluster or stratified sampling plans as well.
Think of it this way by considering this example. Every quantum claim has three parts: liability, causation and quantum. It would be better to use predictive coding to answer those different issues if a stratified sampling plan was used than if a single, simple random sample covered them all.
Similarly, think of a construction dispute having numerous change orders. A single simple random sample is likely to omit documents from one or many of the change orders. A user could very well be blind in some area. More specifically, if the training set drawn using simple random sampling had omitted documents of a particular change order then the training set would not have the vector data necessary to score those documents with a high relevancy score. Instead they could have a low relevancy score and could even be omitted even though they were highly relevant. A cluster or stratified sample could help to resolve this deficiency and if random sampling principles were used when drawing the training documents the results could be stated with quantifiable confidence.
Perhaps in a future post you can put these concepts to an expert in survey sampling (someone with an advanced mathematics degree and specializing or at least highly proficient in the narrower application of statistical theory to survey sampling). I would be interested in learning their thoughts.
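The stratified plan described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `stratified_sample` is invented for the example, and the stratum labels (for instance one per change order) are assumed to have been assigned already by keyword search or clustering, a step that is not shown.

```python
import random
from collections import defaultdict

def stratified_sample(doc_strata, per_stratum, seed=None):
    """Draw a simple random sample of up to per_stratum documents from each stratum.

    doc_strata  -- dict mapping doc id to a stratum label (hypothetical input, e.g. a
                   change-order label assigned beforehand by keyword search)
    per_stratum -- number of documents to draw from every stratum
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for doc, stratum in doc_strata.items():
        by_stratum[stratum].append(doc)

    sample = {}
    for stratum, docs in sorted(by_stratum.items()):
        # Small strata are taken whole instead of being skipped.
        sample[stratum] = rng.sample(docs, min(per_stratum, len(docs)))
    return sample
```

Every stratum is guaranteed some representation, which is the property a single simple random sample lacks. Because the strata are sampled at different rates, any estimate about the whole population drawn from such a sample would need to be weighted by stratum size.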
There is no relationship between the method of selecting training documents for a learning algorithm and the method of selecting a sample for validation purposes.
There is no question that, for validation purposes, you can only make statistical claims of accuracy (e.g. an estimate of recall) if you use a statistically valid sample.
However, for training a learning algorithm, a statistical sample offers no assurance whatsoever that the learning algorithm will achieve any particular level of accuracy. This misconception is common, as manifested by the use of training set sizes of 2,399 or similar numbers alleged to guarantee in advance that a particular level of accuracy and confidence in the result will be achieved.
Confidence in the result is achieved by using training and learning methods that have been empirically shown to work well. Confidence in the end result of the TAR process can be enhanced after the fact by calculating a statistical estimate of the accuracy of the result. Only the latter requires random selection.
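For readers who want to see the arithmetic, the sketch below shows the two validation-side calculations. It rests on assumptions not stated above: that figures near 2,399 come from the textbook sample-size formula for estimating a proportion at 95% confidence with a 2% margin of error, and that a Wilson score interval is an acceptable way to put a range around a recall estimate. The function names are made up for the example.

```python
import math

def sample_size(margin, z=1.96, p=0.5, population=None):
    """Textbook sample size for estimating a proportion to within +/- margin.

    z = 1.96 corresponds to 95% confidence; p = 0.5 is the worst case.
    If population is given, the finite population correction is applied.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    # The tiny tolerance guards against floating-point noise before rounding up.
    return math.ceil(n0 - 1e-9)

def recall_estimate(relevant_in_sample, retrieved_relevant_in_sample, z=1.96):
    """Point estimate and Wilson score interval for recall.

    relevant_in_sample           -- relevant documents found in a random validation sample
    retrieved_relevant_in_sample -- how many of those the TAR process had retrieved
    """
    n = relevant_in_sample
    p = retrieved_relevant_in_sample / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half, centre + half
```

With these defaults `sample_size(0.02)` gives 2401, and the finite population correction lowers it slightly for any finite collection. Both functions describe how precisely a random sample measures a proportion. Neither says anything about how accurate a classifier trained on that many documents will be.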
Thanks for your comment. I understand that training is not the same as validation but I am not yet ready to concede to the folly of random selection for the training set.
The machine is being trained to accurately disposition all members of the document population. How can it do that if the training set is not representative of the population? The machine is using the smaller training set to develop a model of the larger population’s vector space. If the training set is not representative of the population will that increase the number of unclassified documents that will need subsequent resolution or could it cause the machine to improperly calculate the relevancy weighting of the documents that it does classify?
You are assuming a classification method different from the CAL (continuous active learning) system we evaluated. I know that one popular system that is advertised as “TAR” leaves some documents unclassified, but it employs SPL (simple passive learning), not CAL. Our study does not find SPL to be effective, whether trained with random or keyword-selected documents, and in fact cautions against relying exclusively on keyword-selected documents for SPL.
The distinction, and a more detailed response to your point, is elaborated further in our Federal Courts Law Review article,
“Comments on ‘The implications of Rule 26(g) on the use of technology-assisted review'” http://www.fclr.org/fclr/articles/pdf/comments-implications-rule26g-tar-62314.pdf
I, too have a Nasrudin story:
Nasrudin was known far and wide around his village as a very wise man when it came to donkeys. There were not many questions about donkeys for which he did not have an answer.
One day Nasrudin met a Dervish on the road and the Dervish asked him, “Tell me Mullah, how is it that you know so much about donkeys?” Nasrudin replied: “I have been blessed by God with deep understanding of donkeys, I have many years of experience, and I had an excellent teacher who imparted much of the knowledge that I have today. There is very little about donkeys that I do not know.” If Nasrudin lived in America, he might have been known as a donkey whisperer.
Villagers and even the Sultan came to ask Nasrudin questions about donkeys. And he always had an answer.
One day, Nasrudin’s own donkey was missing. At first, Nasrudin thought that his donkey must surely have been stolen, but later it was found out that it had simply wandered away from Nasrudin’s barn, but still they could not find it. Repeatedly, Nasrudin said, “I know practically everything about donkeys. We will find the donkey in such and such a place.” Unfortunately Nasrudin’s predictions did not bear out. No matter how certain Nasrudin was that the donkey would be in a particular place, it was not to be found. The donkey was in none of those places.
Finally, Nasrudin decided to try another way. He picked up a handful of straw from the donkey’s stall and took it outside. He let the straw fall out of his hand and went to look for the donkey in the direction indicated by the straw. After that, it did not take long for Nasrudin to find his donkey and all of the people in the village were impressed at how much he knew about donkeys and how to find them.
The moral of this story: It isn’t what we don’t know that gives us trouble, it’s what we know that ain’t so.
Ralph, your example of looking where the light is bright is exactly backward. Random sampling ensures that you spread your search throughout the space rather than focusing on the documents that are easiest to find. Looking in the wrong place can also be costly and without assessing that possibility, you have no way of knowing that it is wrong.
Finally, in my experience, most attorneys are not particularly distinguished at using search to find responsive documents. You are an exception. Rather than be insulted by the random sampling, they are typically grateful for the value that it brings. If everyone were as good at it as you appear to be, we might not need predictive coding at all. But we do.
There are so many wonderful Nasrudin stories, many involving donkeys. Here is one from Nasrudin Wikiquote:
Here’s another one from Nasrudin Stories:
Here is another one quoted word for word from, of all places, the National Catholic Reporter:
Here’s another one:
If you would like to read more Nasrudin stories, Idries Shah has collected many of them in three books: The Pleasantries of the Incredible Mulla Nasrudin, The Exploits of the Incomparable Mulla Nasrudin, and The Subtleties of the Inimitable Mulla Nasrudin. Shah has others as well.