There is a well-known joke found in most cultures of the world about a fool looking for something. This anecdote has been told for thousands of years because it illustrates a basic trait of human psychology, now commonly named after the joke itself: the Streetlight Effect. This is a type of observational bias in which people search for something only where it is easiest to look. This human frailty, when pointed out in the right way, can be funny. One of the oldest known forms of pedagogic humor illustrating the Streetlight Effect comes from the famous stories of Nasrudin, aka Nasreddin, an archetypal wise fool from 13th-century Sufi traditions. Here is one version of this joke attributed to Nasreddin:
One late evening Nasreddin is walking home. It is only a very short way, and upon arrival he can be seen to be upset about something. Just then a young man comes along and sees the Mullah’s distress.
“Mullah, pray tell me: what is wrong?”
“Ah, my friend, I seem to have lost my keys. Would you help me search for them? I know I had them when I left the tea house.”
So, he helps Nasreddin with the search for the keys. For quite a while the man is searching here and there but no keys are to be found. He looks over to Nasreddin and finds him searching only a small area around a street lamp.
“Mullah, why are you only searching there?”
“Why would I search where there is no light?”
Using Only Random Selection to Find Predictive Coding Training Documents Is Easy, But Foolish
The easiest way to select training documents for predictive coding is simply to use random samples. It may be easy, but, as far as I am concerned, it also defies common sense. In fact, like the Nasrudin story, it is so stupid as to be funny. You know you dropped your keys near your front door, but you do not look there because it is dark and hard to search there. You take the easy way out. You search by the street lamp.
The morals here are many. The easy way is not necessarily the right way. This is true in search, as it is in many other things. The search for truth is often difficult. You need to follow your own knowledge, what you know, and what you do not. What do you know about where you lost your keys? Think about that and use your analysis to guide your search. You must avoid the easy way, the lazy way. You must not be tempted to only look under the lamp post. To do so is to ignore your own knowledge. It is foolish in the extreme. It is laughable, as this 1942 Mutt and Jeff comic strip shows:
Random search for predictive coding training documents is laughable too. It may be easy to simply pick training documents at random, but it is ineffective. It ignores an attorney’s knowledge of the case and the documents. It is equivalent to just rolling dice to decide where to look for something, instead of using your own judgment, your own skills and insights. It purports to replace the legal expertise of an attorney with a roll of dice. It would have you ignore an attorney’s knowledge of relevance and evidence, their skills, expertise, and long experience with search.
If you know you left your keys near the front door, why let random chance tell you where to search? You should instead let your knowledge guide your search. It defies common sense to ignore what you know. Yet, this is exactly what some methods of predictive coding tell you to do. These random-only methods are tied to particular software vendors, the ones whose software is designed to run only on random training.
These vendors tell you to rely entirely on random selection of documents to use in training. They do so because that requires no thought, as if lawyers were not capable of thought, as if lawyers have not long been the masters of discovery of legal evidence. It is insulting to the intelligence of any lawyer, and yet several software vendors actually prescribe this as the only way to do predictive coding search. This has already been criticized as predictive coding junk science by search expert and attorney Bill Speros, who used the same classic streetlight analogy. See Predictive Coding’s Erroneous Zones Are Emerging Junk Science (Pulling a random sample of documents to train the initial seed set … is erroneous because it looks for relevance in all the wrong places. It turns a blind eye to what is staring you in the eye.) Still, the practice continues.
The continuing success of a few vendors still using this approach is, I suspect, one reason that the new study by Gordon Cormack and Maura R. Grossman is designed to answer the question:
Should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning?
Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014 (quote from the Abstract).
Although the answer seems obvious and commonsensical, in a deep archetypal way, sometimes common sense and history can be wrong. The only way to know for sure is by scientific experiment. That is exactly what Cormack and Grossman have done.
Since several influential vendors answer this question by saying that training documents should be selected at random, and tell their customers that they should only look under the lamp post and use one-light-only random search software, Grossman and Cormack had to give this seemingly funny assertion serious attention. They put the joke to the test. To no one’s surprise, except a few vendors, the experiments they performed showed that it was more effective to select training documents using non-random methods and active learning (a process that I call multimodal search). I will discuss their ingenious experiments and report in some detail in Part Two of this blog.
Some Vendors Add Insult to Injury to Try to Justify their Random-Only Approach
To add insult to injury, some vendors try to justify their method by arguing that random selection avoids the prejudice of lawyer bias. It keeps the whole search process open. They seem to think lawyers know nothing. That they dropped their keys and have absolutely no idea where. If the lawyers think they know, they are just biased and should be ignored. They are not to be trusted.
This is not only insulting, but ignores the obvious reality that lawyers are always making the final call on relevance, not computers, not software engineers. Lawyers say what is relevant and what is not, even with random selection.
Some engineers who design predictive coding software that is limited to randomly selected training documents justify the limitation on the basis of assumed lawyer dishonesty. They think that if lawyers are allowed to pick samples for training, and not just have them selected for them at random, lawyers may rig the system and hide the truth by intentionally poor selections. This is the way a lot of computer experts think when it comes to law and lawyers. I know this from over thirty years of experience.
If a lawyer is really so dishonest that they will deliberately mis-train a predictive coding system to try to hide the truth, then that lawyer can easily find other, more effective ways to hide the ball than that. Hiding evidence is unethical. It is dishonest. It is not what we are paid to do. Argue what the facts mean? Yes, most definitely. Change the facts? No. Despite what you may think is true about law and lawyers, this is not the kind of thing that 98% of lawyers do. It will not be tolerated by courts. Such lawyer misconduct could not only lead to loss of a case, but also loss of a license to practice law. Can you say that about engineering?
My message to software vendors is simple: leave it to us, to attorneys and the Bar, to police legal search. Do not attempt to do so by software design. That is way beyond your purview. It is also foolish because the people you are insulting with this kind of mistrust are your customers!
I have talked to some of the engineers who believe in random reliance as a way to protect their code from lawyer manipulation. I know perfectly well that this is what some (not all) of them are trying to do. Frankly, the arrogant engineers who think like that do not know what they are talking about. It is just typical engineer bias against lawyers, plain and simple. Get over it and stop trying to sell us tools designed for dishonest children. We need full functionality. The latest Grossman-Cormack study proves this.
Protect Us from Bias by Better Code, Not Random Selection
Some software designers with whom I have debated this topic will, at this point, try to placate me with statements about unintentional bias. They will point out that even though a lawyer may be acting in good faith, they may still have an unconscious, subjective bias. They will argue that without even knowing it, without realizing it, a lawyer may pick documents that only favor their clients. Oh please. The broad application of this so-called insight into subjectivity to justify randomness is insulting to the intelligence of all lawyers. We understand better than most professions the inherent limitations of reason. See Scientific Proof of Law’s Overreliance On Reason: The “Reasonable Man” is Dead, Long Live the Whole Man, Part Two. Also see The Psychology of Law and Discovery. We are really not so dimwitted as to be unable to do legal search without our finger on the scale, and, this is important, neither is the best predictive coding software.
Precautions can be taken against inherent, subjective bias. The solution is not to throw the baby out with the bath water, which is exactly what random-only search amounts to. The solution to bias is better search algorithms, plus quality controls. Code can be made to work so that it is not so sensitive to and dependent on lawyer-selected documents. It can tolerate and correct errors. It can reach out and broaden initial search parameters. It is not constrained by the lawyer-selected documents.
Dear software designers: do not try to fix lawyers. We do not need the help of engineers for that. We will fix ourselves, thank you! Fix your code instead. Get real with your methods. Overcome your anti-lawyer bias and read the science.
Compete With Better Code, Not False Doctrine
Many software companies have already fixed their code. They have succeeded in addressing the inherent limitations in all active machine learning, driven as it must be by inconsistent humans. In their software the lawyer trainers are not the only ones selecting documents for training. The computer selects documents too. Smart computer selection is far different, and far better, than stupid random selection.
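To make the difference concrete, here is a minimal sketch in Python of three ways a program might pick the next batch of training documents. This is my own illustration, not any vendor's actual algorithm: the function name, the `scores` input, and the method labels are all hypothetical. It assumes the software has already given each unreviewed document a score between 0 and 1 estimating how likely it is to be relevant.

```python
import random

def select_training_docs(scores, k, method, seed=0):
    """Pick k document ids for the next round of attorney review.

    scores: dict mapping a document id to the model's current estimate
            (0.0 to 1.0) that the document is relevant. Hypothetical input.
    method: "random", "uncertain", or "top_ranked" (illustrative labels).
    """
    doc_ids = sorted(scores)
    if method == "random":
        # Roll the dice: ignores everything the model has learned so far.
        return random.Random(seed).sample(doc_ids, k)
    if method == "uncertain":
        # Pick the documents the model is least sure about (score near 0.5).
        return sorted(doc_ids, key=lambda d: abs(scores[d] - 0.5))[:k]
    if method == "top_ranked":
        # Pick the documents the model currently ranks as most likely relevant.
        return sorted(doc_ids, key=lambda d: scores[d], reverse=True)[:k]
    raise ValueError(f"unknown method: {method}")

# Made-up scores for five documents, purely for illustration.
example_scores = {"doc_a": 0.95, "doc_b": 0.52, "doc_c": 0.10,
                  "doc_d": 0.48, "doc_e": 0.80}
print(select_training_docs(example_scores, 2, "top_ranked"))
print(select_training_docs(example_scores, 2, "uncertain"))
print(select_training_docs(example_scores, 2, "random"))
```

The two non-random methods use what the system has learned from the attorney's prior coding. The random method does not. That is the whole point.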
I know that the software I use, Kroll Ontrack’s EDR (eDiscovery Review), is frequently correcting my errors, broadening my initial conception of relevance. It is helping me to find new documents that are relevant, documents that I would never have thought of or found on my own. The computer selects as many documents as I decide are appropriate to enhance the training. Random has only a small place at the beginning to calculate prevalence. Concept searches, similarity searches, keyword, even linear, are far, far better than random alone. When they are all put together in a multimodal predictive coding package, the results can be extremely good.
The notion that you should just turn search over to chance means you should search everywhere and anywhere. That is the essence of random. It means you have no idea of where the relevant documents might be located, and what they might say. That is again completely contrary to what happens in legal discovery. No lawyer is that dimwitted. There is always at least some knowledge as to the type or kind of documents that might be relevant. There is always some knowledge as to who is most likely to have them, and when, and what they might say, what names would be used, what metadata, etc.
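For readers who want to see what that small place for random looks like, here is a minimal Python sketch of a prevalence estimate from a random sample. It is an illustration under stated assumptions, not the method of any particular software: the function name is hypothetical, the sample numbers are made up, and the interval uses the simple normal approximation, which is only one of several ways to compute a confidence interval.

```python
import math

def estimate_prevalence(sample_labels, z=1.96):
    """Estimate the fraction of relevant documents in a collection.

    sample_labels: list of booleans, one per randomly sampled document,
                   True if the reviewing attorney coded it relevant.
    z: 1.96 corresponds to a 95% confidence level (normal approximation).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    n = len(sample_labels)
    if n == 0:
        raise ValueError("sample is empty")
    p = sum(sample_labels) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Made-up example: 1,500 randomly sampled documents, 30 coded relevant.
labels = [True] * 30 + [False] * 1470
p, low, high = estimate_prevalence(labels)
print(f"prevalence {p:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

In this made-up example the estimate is 2%, with an interval of roughly 1.3% to 2.7%. Notice also that only 30 of the 1,500 sampled documents were relevant. A random sample is a fine yardstick, but when prevalence is low it hands you very few relevant documents to train with.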
A Joke at the Expense of Our System of Justice is Not Funny
I would be laughing at all of this random-only search propaganda like a Nasreddin joke, but for the fact that many lawyers do not get the joke. They are buying software and methods that rely exclusively on random search for training documents. Many are falling for the streetlight effect gimmicks and marketing. It is not funny because we are talking about truth and justice here, not just a fool’s house keys. I care about these pursuits and best practices for predictive coding. The future of legal search is harmed by this naive foolishness. That is why I have reacted before to vendor propaganda promoting random search. That is why I spent over fifty hours doing a predictive coding experiment based in part on random search, an approach I call the Random Borg approach. Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents. (Part Two); A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. (Part One). I have also written several articles on this subject to try to debunk this method, and yet this method lives on. See, e.g., The Many Types of Legal Search Software in the CAR Market Today; Three-Cylinder Multimodal Approach To Predictive Coding.
So too have others. See, e.g., Speros, W., Predictive Coding’s Erroneous Zones Are Emerging Junk Science (e-Discovery Team Blog (Guest Entry), 28th April 2013). As Bill Speros puts it:
Some attorneys employ random samples to populate seed sets apparently because they:
- Don’t know how to form the seed set in a better way, or
- Want to delegate responsibility to the computer “which said ‘so’,” or
- Are emboldened by a statistical rationale premised on the claim that no one knows anything so random is as good a place to start as anywhere.
In spite of the many criticisms, on my blog at least, the random seed set approach continues, and even seems to be increasing in popularity.
Fortunately, Gordon Cormack and Maura R. Grossman have now entered this arena. They have done scientific research on the random-only training method. Not surprisingly, they concluded, as Speros and I did, that random selection of training documents is not nearly as effective as multimodal, judgmental selection. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, Gold Coast, Queensland, Australia, ACM 978-1-4503-2257-7/14/07.
To be continued . . . . where I will review the new Grossman-Cormack study and conclude with my recommendations to vendors who still use random-only training. I will offer a kind of olive branch to the Borg where I respectfully invite them to join the federation of search, a search universe where all capacities are used, not just random. They have a good start with their existing predictive coding software. All they need do is break with the false doctrine and add new search capacities.