There is a well-known joke, found in most cultures of the world, about a fool looking for something. This anecdote has been told for thousands of years because it illustrates a basic trait of human psychology, now commonly named after the joke itself: the Streetlight Effect. This is a type of observational bias where people search for whatever they have lost only where it is easiest to look. This human frailty, when pointed out in the right way, can be funny. One of the oldest known forms of pedagogic humor illustrating the Streetlight Effect comes from the famous stories of Nasrudin, aka Nasreddin, an archetypal wise fool from 13th Century Sufi traditions. Here is one version of this joke attributed to Nasreddin:
One late evening Nasreddin found himself walking home. It was only a very short way, and upon arrival he was visibly upset about something. Just then a young man came along and saw the Mullah’s distress.
“Mullah, pray tell me: what is wrong?”
“Ah, my friend, I seem to have lost my keys. Would you help me search for them? I know I had them when I left the tea house.”
So the young man helped Nasreddin search for the keys. For quite a while he searched here and there, but no keys were to be found. He looked over and saw Nasreddin searching only a small area around a street lamp.
“Mullah, why are you only searching there?”
“Why would I search where there is no light?”
Using Only Random Selection to Find Predictive Coding Training Documents Is Easy, But Foolish
The easiest way to select training documents for predictive coding is simply to use random samples. It may be easy, but, as far as I am concerned, it also defies common sense. In fact, like the Nasrudin story, it is so stupid as to be funny. You know you dropped your keys near your front door, but you do not look there because it is dark and hard to search there. You take the easy way out. You search by the street lamp.
The morals here are many. The easy way is not necessarily the right way. This is true in search, as it is in many other things. The search for truth is often hard and difficult. You need to follow your own knowledge, what you know, and what you do not. What do you know about where you lost your keys? Think about that and use your analysis to guide your search. You must avoid the easy way, the lazy way. You must not be tempted to only look under the lamp post. To do so is to ignore your own knowledge. It is foolish in the extreme. It is laughable, as this 1942 Mutt and Jeff comic strip shows:
Random search for predictive coding training documents is laughable too. It may be easy to simply pick training documents at random, but it is ineffective. It ignores an attorney’s knowledge of the case and the documents. It is equivalent to just rolling dice to decide where to look for something, instead of using your own judgment, your own skills and insights. It purports to replace the legal expertise of an attorney with a roll of dice. It would have you ignore an attorney’s knowledge of relevance and evidence, their skills, expertise, and long experience with search.
If you know you left your keys near the front door, why let random chance tell you where to search? You should instead let your knowledge guide your search. It defies common sense to ignore what you know. Yet, this is exactly what some methods of predictive coding tell you to do. These random-only methods are tied to particular software vendors: the ones whose software is designed to run only on random training.
These vendors tell you to rely entirely on random selection of the documents used in training. They do so because that requires no thought, as if lawyers were not capable of thought, as if lawyers have not long been the masters of discovery of legal evidence. It is insulting to the intelligence of any lawyer, and yet several software vendors actually prescribe this as the only way to do predictive coding search. This has already been criticized as predictive coding junk science by search expert and attorney Bill Speros, who used the same classic streetlight analogy. Predictive Coding’s Erroneous Zones Are Emerging Junk Science (“Pulling a random sample of documents to train the initial seed set … is erroneous because it looks for relevance in all the wrong places. It turns a blind eye to what is staring you in the eye.”) Still, the practice continues.
The continuing success of a few vendors still using this approach is, I suspect, one reason that the new study by Gordon Cormack and Maura R. Grossman is designed to answer the question:
Should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning?
Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014 (quote from the Abstract).
Although the answer seems commonsensical, in a deep archetypal way, and obvious, sometimes common sense and history can be wrong. The only way to know for sure is by scientific experiment. That is exactly what Cormack and Grossman have done.
Since several influential vendors answer yes to the question raised in the study, and tell their customers to look only under the lamp post using one-light-only random search software, Grossman and Cormack had to give this seemingly funny assertion serious attention. They put the joke to the test. To no one’s surprise, except a few vendors, the experiments they performed showed that it is more effective to select training documents using non-random methods and active learning (a process that I call multimodal search). I will discuss their ingenious experiments and report in some detail in Part Two of this blog.
Some Vendors Add Insult to Injury to Try to Justify their Random-Only Approach
To add insult to injury, some vendors try to justify their method by arguing that random selection avoids the prejudice of lawyer bias. It keeps the whole search process open. They seem to think lawyers know nothing; that they dropped their keys and have absolutely no idea where. If the lawyers think they know, they are just biased and should be ignored. They are not to be trusted.
This is not only insulting, but ignores the obvious reality that lawyers are always making the final call on relevance, not computers, not software engineers. Lawyers say what is relevant and what is not, even with random selection.
Some engineers who design random-only training software for predictive coding justify the limitation on the basis of assumed lawyer dishonesty. They think that if lawyers are allowed to pick samples for training, rather than just have samples selected for them at random, the lawyers may rig the system and hide the truth through intentionally poor selections. This is the way a lot of computer experts think when it comes to law and lawyers. I know this from over thirty years of experience.
If a lawyer is really so dishonest as to deliberately mis-train a predictive coding system to try to hide the truth, then that lawyer can easily find other, more effective ways to hide the ball. Hiding evidence is unethical. It is dishonest. It is not what we are paid to do. Argue what the facts mean? Yes, most definitely. Change the facts? No. Despite what you may think is true about law and lawyers, this is not the kind of thing that 98% of lawyers do. It will not be tolerated by courts. Such lawyer misconduct could lead not only to loss of a case, but also to loss of a license to practice law. Can you say that about engineering?
My message to software vendors is simple: leave it to us, to attorneys and the Bar, to police legal search. Do not attempt to do so by software design. That is way beyond your purview. It is also foolish, because the people you are insulting with this kind of mistrust are your customers!
I have talked to some of the engineers who believe in random reliance as a way to protect their code from lawyer manipulation. I know perfectly well that this is what some (not all) of them are trying to do. Frankly, the arrogant engineers who think like that do not know what they are talking about. It is just typical engineer bias against lawyers, plain and simple. Get over it and stop trying to sell us tools designed for dishonest children. We need full functionality. The latest Grossman Cormack study proves this.
Protect Us from Bias by Better Code, Not Random Selection
Some software designers with whom I have debated this topic will, at this point, try to placate me with statements about unintentional bias. They will point out that even though a lawyer may be acting in good faith, they may still have an unconscious, subjective bias. They will argue that without even knowing it, without realizing it, a lawyer may pick documents that only favor their clients. Oh please. The broad application of this so-called insight into subjectivity to justify randomness is insulting to the intelligence of all lawyers. We understand better than most professions the inherent limitations of reason. Scientific Proof of Law’s Overreliance On Reason: The “Reasonable Man” is Dead, Long Live the Whole Man, Part Two. Also see The Psychology of Law and Discovery. We are really not so dimwitted as to be unable to do legal search without our finger on the scale, and, this is important, neither is the best predictive coding software.
Precautions can be taken against inherent, subjective bias. The solution is not to throw the baby out with the bath water, which is exactly what random-only search amounts to. The solution to bias is better search algorithms, plus quality controls. Code can be made to work so that it is not overly sensitive to, and dependent on, lawyer-selected documents. It can tolerate and correct errors. It can reach out and broaden initial search parameters. It is not constrained by the lawyer-selected documents.
Dear software designers: do not try to fix lawyers. We do not need the help of engineers for that. We will fix ourselves, thank you! Fix your code instead. Get real with your methods. Overcome your anti-lawyer bias and read the science.
Compete With Better Code, Not False Doctrine
Many software companies have already fixed their code. They have succeeded in addressing the inherent limitations in all active machine learning, driven as it must be by inconsistent humans. In their software the lawyer trainers are not the only ones selecting documents for training. The computer selects documents too. Smart computer selection is far different from, and far better than, stupid random selection.
I know that the software I use, Kroll Ontrack’s EDR (eDiscovery Review), is frequently correcting my errors, broadening my initial conception of relevance. It is helping me to find new relevant documents, documents that I would never have thought of or found on my own. The computer selects as many documents as I decide are appropriate to enhance the training. Random sampling has only a small place at the beginning, to calculate prevalence. Concept searches, similarity searches, keyword searches, even linear review, are far, far better than random alone. When they are all put together in a multimodal predictive coding package, the results can be extremely good.
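To make the multimodal idea concrete, here is a minimal sketch of combining just two modes, keyword search and similarity to a known relevant seed document, to select training candidates. Everything here is a hypothetical illustration (the documents, keywords, and Jaccard threshold are invented for this example), not the actual method of any vendor’s software:

```python
# Toy multimodal selection of training documents: combine keyword hits
# and similarity to a known relevant seed, instead of random draws.

def jaccard(a, b):
    """Similarity between two word sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

def multimodal_pick(docs, keywords, seed, sim_threshold=0.25):
    """Select training candidates found by EITHER keyword search OR
    similarity search -- two modes, each catching docs the other misses."""
    picked = {}
    for name, words in docs.items():
        if words & keywords:
            picked[name] = "keyword"
        elif jaccard(words, seed) >= sim_threshold:
            picked[name] = "similar"
    return picked

docs = {
    "doc1": {"merger", "announcement", "press"},
    "doc2": {"deal", "price", "closing"},     # no keyword hit, but like the seed
    "doc3": {"golf", "lunch", "weekend"},     # caught by neither mode
}
keywords = {"merger"}
seed = {"deal", "price", "terms", "closing"}  # a known relevant document

print(multimodal_pick(docs, keywords, seed))  # {'doc1': 'keyword', 'doc2': 'similar'}
```

Note that doc2 contains none of the keywords and would be invisible to keyword search alone; the second mode catches it. That is the point of using multiple search modes together.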
The notion that you should just turn search over to chance means you should search anywhere and everywhere. That is the essence of random. It means you have no idea where the relevant documents might be located, or what they might say. That again is completely contrary to what happens in legal discovery. No lawyer is that dimwitted. There is always at least some knowledge as to the type or kind of documents that might be relevant. There is always some knowledge as to who is most likely to have them, and when, and what they might say, what names would be used, what metadata, etc.
A Joke at the Expense of Our System of Justice is Not Funny
I would be laughing at all of this random-only search propaganda like a Nasreddin joke, but for the fact that many lawyers do not get the joke. They are buying software and methods that rely exclusively on random search for training documents. Many are falling for the streetlight effect gimmicks and marketing. It is not funny because we are talking about truth and justice here, not just a fool’s house keys. I care about these pursuits and best practices for predictive coding. The future of legal search is harmed by this naive foolishness. That is why I have reacted before to vendor propaganda promoting random search. That is why I spent over fifty hours doing a predictive coding experiment based in part on random search, an approach I call the Random Borg approach. Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents (Part Two); A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (Part One). I have also written several articles on this subject to try to debunk this method, and yet the method lives on. See, e.g., The Many Types of Legal Search Software in the CAR Market Today; Three-Cylinder Multimodal Approach To Predictive Coding.
So too have others. See, e.g., Speros, W., Predictive Coding’s Erroneous Zones Are Emerging Junk Science (e-Discovery Team Blog (Guest Entry), April 28, 2013). As Bill Speros puts it:
Some attorneys employ random samples to populate seed sets apparently because they:
- Don’t know how to form the seed set in a better way, or
- Want to delegate responsibility to the computer “which said ‘so’,” or
- Are emboldened by a statistical rationale premised on the claim that no one knows anything, so random is as good a place to start as anywhere.
In spite of the many criticisms, on my blog at least, the random seed set approach continues, and even seems to be increasing in popularity.
Fortunately, Gordon Cormack and Maura R. Grossman have now entered this arena. They have done scientific research on the random-only training method. Not surprisingly, they concluded, as Speros and I did, that random selection of training documents is not nearly as effective as multimodal, judgmental selection. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, Gold Coast, Queensland, Australia, ACM 978-1-4503-2257-7/14/07.
To be continued . . . . In Part Two I will review the new Grossman Cormack study and conclude with my recommendations to vendors who still use random-only training. I will offer a kind of olive branch to the Borg, respectfully inviting them to join the federation of search, a search universe where all capacities are used, not just random. They have a good start with their existing predictive coding software. All they need do is break with the false doctrine and add new search capacities.
The Street Lamp called Text
Interesting article and analysis. What most lawyers and e-discovery tools do is search, sample, and analyze under the street lamp called text, using only text-restricted tools. They don’t “see” non-textual documents, so they can’t search them; and even if they did find responsive non-textual documents by sampling, they would have a hard time finding similar ones, so they may not even sample them.
There are measures to determine whether this shortcoming is essentially “no harm, no foul” in a given case or application, or is in fact a serious issue. The MTV (“Maximum Text Vision”) ratio measures the extent to which a text-restricted tool can “see” all the documents, and there are ways to sample to measure it and assess the impact in any given case or application. See BeyondRecognition’s Document U Blog entries for May 29 and June 17: “Measuring Text Bias/Tunnel Vision in Content Search and ECM Systems” and “Sampling Resolves Conjecture on Significance of Non-Textual Documents.”
In specific litigation there may not be many significant non-textual documents, but for general information governance purposes, text-restricted systems will generally be far more problematic, as in some industries non-textual documents are prevalent and important.
Good point. Thanks for the comment.
the experiments they performed showed that it was more effective to select training documents using non-random methods and active learning (a process that I call multimodal search)
Let me urge a little caution here about your terminology. Active learning, at least as defined in the machine learning community, is not the same thing as multimodal search.
Active learning is the notion that the machine itself can choose documents that it wants to have labeled, and then present those documents to the human reviewer for labeling. There are many ways of doing active learning, many statistics and heuristics for guiding how the unlabeled document is actually chosen. But at the end of the day, that’s all active learning is: machine selection of potentially informative examples.
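The machine-selection loop described above can be sketched in a few lines of Python. This is a hypothetical illustration of one common active learning heuristic, uncertainty sampling; the word-overlap scoring rule is a toy stand-in for a real classifier’s probability estimate, and all the documents are invented:

```python
# Toy active learning step: uncertainty sampling over a tiny document set.

def score(doc_words, relevant_words, irrelevant_words):
    """Crude relevance score in [0, 1] from word overlap with labeled
    examples; a stand-in for a real classifier's probability output."""
    r = len(doc_words & relevant_words)
    i = len(doc_words & irrelevant_words)
    return 0.5 if r + i == 0 else r / (r + i)

def pick_most_uncertain(unlabeled, relevant_words, irrelevant_words):
    """Active learning step: the machine chooses the unlabeled document
    whose score is closest to 0.5 -- the one it is least sure about --
    and presents it to the human reviewer for labeling."""
    return min(unlabeled, key=lambda d: abs(
        score(unlabeled[d], relevant_words, irrelevant_words) - 0.5))

# Words from labeled seed documents, plus an unlabeled pool (word sets).
relevant_words = {"merger", "price", "deal"}
irrelevant_words = {"lunch", "golf"}
unlabeled = {
    "doc1": {"merger", "golf"},   # mixed signals: most uncertain
    "doc2": {"merger", "price"},  # looks clearly relevant
    "doc3": {"lunch", "golf"},    # looks clearly irrelevant
}

print(pick_most_uncertain(unlabeled, relevant_words, irrelevant_words))  # doc1
```

The statistic guiding the choice can vary (uncertainty, disagreement among models, expected error reduction, and so on), but the shape of the loop is the same: the machine nominates, the human labels.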
Multimodal search, on the other hand, is the idea of attacking the problem from as many different angles as possible, not limiting yourself to just one approach. Unimodal doesn’t just mean only random. You can indeed be unimodally random (“borg”). But it is also possible to be unimodally judgmental, as in only using keywords to find all the responsive docs and not using any machine learning at all. That’s also unimodal. Or you could even be unimodal on active learning, i.e., only do active learning and nothing else.
Multimodality, then, is doing all of that. A mixture of judgmental, active, supervised (machine learning), and perhaps even a tiny bit of random, if desired. Natch?
So, just trying to make sure that terminology stays sorted. I completely agree with you about the need for something “active”. I also completely agree about the need for multimodality. So it’s not that I’m disagreeing on the essence. I just want to make sure that we all understand that active isn’t the same thing as multimodal.
Yup, that’s what I’m saying and that’s what I mean when I write active learning and multimodal. Thanks for clarifying.
The bigger question I have for you is this: In the Cormack-Grossman paper, they advocate continuous learning. All training is review, all review is training. In other words, the judgments made by your contract reviewers don’t just confirm the SME-trained machine output; they actually are allowed to alter/affect the classifications or rankings of as-yet unseen documents, in a dynamically updating, ever-changing way.
You’ve railed quite strongly against that in the past. You’ve said that you prefer a simple learning approach. The main difference between simple learning and continuous learning is that in simple learning, you train (presumably with your expert) for a finite number of iterations, then use the machine to label the remainder of the collection, and only use humans from that point forward to confirm what the machine is giving you. You don’t actually allow those judgments to dynamically alter as-yet unseen docs.
But the Cormack-Grossman paper shows not only that CAL (continuous active learning) is more effective than SPL (simple passive learning, aka random/borg), but also more effective than SAL (simple active learning). Simple active learning is also multimodal, is it not? You can also bring judgmental seeds to bear in SAL. SAL is your preferred method of working, am I correct in saying?
And yet according to this research, CAL is better not only than SPL, but also than SAL.
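The control-flow difference between simple and continuous learning that this exchange turns on can be sketched schematically. The stand-in model (documents as numbers, scored by closeness to the mean labeled-relevant document) and the toy labels are purely illustrative assumptions, not the protocols actually tested in the study:

```python
# Schematic contrast of simple vs. continuous learning on a fake collection.

def model_score(doc, labeled):
    """Stand-in model: score a doc by closeness to the mean of the
    docs labeled relevant so far (higher = more relevant-looking)."""
    relevant = [d for d, lab in labeled if lab]
    if not relevant:
        return 0.0
    center = sum(relevant) / len(relevant)
    return -abs(doc - center)

def simple_learning(collection, label, budget):
    """Simple learning: label a fixed budget of docs, then FREEZE the
    model and rank the remainder once. Later review cannot change it."""
    labeled = [(d, label(d)) for d in collection[:budget]]
    rest = collection[budget:]
    return sorted(rest, key=lambda d: model_score(d, labeled), reverse=True)

def continuous_learning(collection, label):
    """Continuous learning: after each judgment the unseen docs are
    re-ranked with the updated model -- every review is also training."""
    labeled, unseen, order = [], list(collection), []
    while unseen:
        unseen.sort(key=lambda d: model_score(d, labeled), reverse=True)
        doc = unseen.pop(0)            # review the current top-ranked doc
        labeled.append((doc, label(doc)))  # ... and feed it back as training
        order.append(doc)
    return order

collection = list(range(10))
is_relevant = lambda d: d % 2 == 0     # toy label: even docs are "relevant"
print(simple_learning(collection, is_relevant, budget=3))  # [3, 4, 5, 6, 7, 8, 9]
print(continuous_learning(collection, is_relevant))
```

The substance of the two dimensions discussed below (active vs. passive, simple vs. continuous) is in where the labeling effort stops feeding back into the model: after a finite training phase in the simple case, never in the continuous case.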
My methods are not quite as you presume and would not fit into the SAL classification, nor the CAL either. It is yet another thing, another way, a more complex way. But I can understand why you might think that. The details of my methods are too personal and idiosyncratic for pedantic blogging purposes. There is more to it than you would surmise from the few experiments that I have written up in the blog. Indeed, most of my best work is secret and tied up via NDAs, A/C privilege, and the like. You know that. The distinctions in your classification system, SAL and CAL, sometimes apply and sometimes don’t. It all depends on a multitude of circumstances. I may try to share this in another blog, but it is hard to articulate, plus some things are best kept private, aside from general structures of course.
Well, I also would not exactly fit what we do into CAL vs SAL. No classification structure is ever a hundred percent perfect.
But in terms of broad, general structure, the distinction between simple and continuous *is* pretty accurate, isn’t it? I mean, you do talk openly on your blog about only having the expert judge a few thousand documents, then having the machine “finalize” its judgments in some way (whether it does so by binary classification vs ranking doesn’t matter; the point is that either the class or the ranking is finalized or fixed, not to be altered by the contract reviewers), followed by either sampling and production, or else contract reviewer confirmation of the positive set (or high-ranked set, if you’re ranking rather than classifying).
That *is* the way that either you work, or you would prefer things to work, is it not? These are indeed the discussions we’ve been having.
Another way of saying this is that the Cormack-Grossman CAL vs SAL distinction actually comprises two distinct, orthogonal dimensions: (1) the active vs passive dimension, and (2) the simple vs continuous dimension.
And I very much believe you when you say that the way you work in dimension #1 doesn’t really fit the active vs passive distinction. Again, we don’t work exactly that way either, and at the risk of putting words into Maura or Gordon’s mouths, I would think that they would also say that they don’t necessarily work strictly that way either, but that in order to do scientific comparisons you have to hold certain things constant.
But when it comes to dimension #2, you very much *do* work that way (simple, not continuous, that is), don’t you? You explicitly judge a relatively small number of documents (e.g., out of a five million doc collection, you probably explicitly read and code no more than 1% of that), and then you wave a broad brush over the rest, trusting the machine intelligence and whatever it has done to “fix” the rest of the docs you’re looking at. So you *are* doing simple learning, are you not?
So without going into all your exact details, my question remains as to what you think about the Grossman-Cormack results showing that the continuous approach beats both of the simple approaches?
I mean, unless you’re doing some sort of interactive clustering and then manually going through and knocking off things as non-responsive, based on a few descriptive keywords of the cluster. But that’s not really judging hundreds of thousands of documents, as in actually judging them. It is a semi-supervised approach, rather than a fully supervised approach.
But even there, you only semi-supervise for a finite period of time (simple learning) rather than continuously (continuous learning). Non?
I think the key here is to rely on all the information you have: from the client, the attorney, a subject matter expert, the terms that show up in the database index, and maybe some random selection as well. I think people buy into the random selection thing because it has a sort of scientific sound and because of how it is pitched by sales folks. It seems objective.
I have always asked them to define random, and they often fall back on “it is like political polling.” At that point I become even more skeptical, because it is not like polling; it is more like testing whether an item is judged to be relevant enough. Another thing I like to know is whether you can create a structured sample that will give me documents from each custodian and of the various types, including non-text documents. This is often met with blank stares.
Our collections are targeted and therefore biased toward relevance, and I have not spoken to a salesperson who understands how this might affect the idea of randomness.
Last thing. This notion that attorneys do not want to see documents that will hurt their client strikes me as wrong. Attorneys want to be able to make the best case given the facts and evidence, which they can only do by seeing the good, the bad, and the ugly. Doing this for 30 years, I have yet to work with an attorney who did not want to see as many relevant documents as could be located, no matter the content.
I think what folks tend to forget is that random sampling is designed to give you an estimate of how many responsive docs there are in a collection. It is not designed to tell you *where* all those responsive docs actually are (which is another way of saying that the docs in the random sample aren’t necessarily representative of all the subtopics in your collection).
For the sake of argument, imagine a collection of 100 docs, each of which is in a separate language. 100 documents, 100 languages. Do a random sample of 10 of those documents and judge each for relevance, and you might find that, say, 30% of the collection is estimated to be responsive.
And let’s suppose furthermore that the sample is correct: 30% of the collection (30 total docs in this example) is indeed responsive.
Now the problem is that the 3 responsive docs that you’ve hit in your sample are in a completely different language than the other 27 responsive docs in the remainder of the collection. Therefore, those 3 responsive docs are useless for training, because they share no terms in common with any of your training docs.
So does randomness guarantee representativeness? Absolutely not. Representativeness relates to topic, whereas random samples relate to estimating relative frequencies. Relative frequencies aren’t the same as topical coverage, a point which tends to be forgotten.
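The 100-language thought experiment above can be simulated directly. This toy script (all numbers are illustrative assumptions from the example, not real data) shows a random sample estimating prevalence well, on average, while covering only a small fraction of the responsive "topics":

```python
import random

random.seed(42)

# 100 documents, each in its own language; docs 0-29 are responsive.
docs = [{"language": i, "responsive": i < 30} for i in range(100)]

def one_sample(k=10):
    """Draw a random sample; report (prevalence estimate, topic coverage)."""
    sample = random.sample(docs, k)
    hits = [d for d in sample if d["responsive"]]
    prevalence = len(hits) / k
    # Each responsive doc is its own language/topic, so coverage is the
    # fraction of the 30 responsive languages the sample actually saw.
    coverage = len({d["language"] for d in hits}) / 30
    return prevalence, coverage

trials = [one_sample() for _ in range(10_000)]
avg_prevalence = sum(p for p, _ in trials) / len(trials)
avg_coverage = sum(c for _, c in trials) / len(trials)

print(f"average prevalence estimate: {avg_prevalence:.2f}")  # close to the true 0.30
print(f"average topical coverage:    {avg_coverage:.2f}")    # only about 0.10
```

The estimate converges on the true richness, exactly as sampling theory promises, yet roughly 90% of the responsive topics never appear in any given sample. Frequency estimation and topical coverage are simply different jobs.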
For a discussion of the points raised in these comments, please see my article with Maura in Federal Courts Law Review: