Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part Three

July 27, 2014

This is part three of what has now become a four part blog: Latest Grossman and Cormack Study Proves Folly of Using Random Search for Machine Training – Part One and Part Two.

Professor Gordon Cormack

Yes, my article on this experiment and report by Professor Gordon Cormack and attorney Maura Grossman is rapidly becoming as long as the report itself, and, believe it or not, I am not even going into all of the aspects in this very deep, multifaceted study. I urge you to read the report. It is a difficult read for most, but worth the effort. Serious students will read it several times. I know I have. This is an important scientific work presenting unique experiments that tested common legal search methods.

The Cormack Grossman paper was peer reviewed by other scientists and presented at the major event for information retrieval scientists, called the annual ACM SIGIR conference. ACM is the Association for Computing Machinery, the world’s largest educational and scientific computing society. SIGIR is the Special Interest Group On Information Retrieval section of ACM. Hundreds of scientists and academics served on organizing committees for the 2014 SIGIR conference in Australia. They came from universities and large corporate research labs from all over the world, including Google, Yahoo, and IBM. Here is a list with links to all of the papers presented.

All attorneys who do legal search should at least have a rudimentary understanding of the findings of Cormack and Grossman on the predictive coding training methods analyzed in this report. That is why I am making this sustained effort to provide my take on it, and make their work a little more accessible. Maura and Gordon have, by the way, generously given of their time to try to ensure that my explanations are accurate. Still, any mistakes made on that account are solely my own.

Findings of Cormack Grossman Study

Here is how Cormack and Grossman summarize their findings:

The results presented here do not support the commonly advanced position that seed sets, or entire training sets, must be randomly selected [19, 28] [contra 11]. Our primary implementation of SPL, in which all training documents were randomly selected, yielded dramatically inferior results to our primary implementations of CAL and SAL, in which none of the training documents were randomly selected.

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pgs. 7-8.


Now for the details of the results comparing the previously described methods of CAL, SAL and SPL. First, let us examine the comparison between the CAL and SPL machine training methods. To refresh your memory, CAL is a simplistic type of multimodal training method wherein two methods are used. Keyword search results are used in the first round of training. In all following rounds, high probability ranked search results are used. SPL is a pure random method, a monomodal method. With SPL all documents are selected by random sampling for training in all rounds.


Cormack and Grossman found that the “CAL protocol achieves higher recall than SPL, for less effort, for all of the representative training-set sizes.” Id. at pg. 4. This means you can find more relevant documents using CAL than a random method, and you can do so faster and thus with less expense.

To drill down even deeper into their findings it is necessary to look at the graphs in the report that show how the search progressed through all one-hundred rounds of training and review for various document collections. This is shown for CAL v. SPL in Figure 1 of the report. Id. at pg. 5. The line with circle dots at the top of each graph plots the retrieval rate of CAL, the clear winner on each of the eight search tasks tested. The other three lines show the random approach, SPL, using three different training-set sizes.

Cormack and Grossman summarize the CAL v. SPL findings as follows:

After the first 1,000 documents (i.e., the seed set), the CAL curve shows a high slope that is sustained until the majority of relevant documents have been identified. At about 70% recall, the slope begins to fall off noticeably, and effectively plateaus between 80% and 100% recall. The SPL curve exhibits a low slope for the training phase, followed by a high slope, falloff, and then a plateau for the review phase. In general, the slope immediately following training is comparable to that of CAL, but the falloff and plateau occur at substantially lower recall levels. While the initial slope of the curve for the SPL review phase is similar for all training-set sizes, the falloff and plateau occur at higher recall levels for larger training sets. This advantage of larger training sets is offset by the greater effort required to review the training set: In general, the curves for different training sets cross, indicating that a larger training set is advantageous when high recall is desired.
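For readers who want to see what a gain curve like the ones in Figure 1 actually measures, here is a minimal Python sketch. It is my own toy illustration, not code from the study: the function names gain_curve and effort_to_reach are hypothetical, and the two review orders are made-up eight-document examples. The curve is simply recall (relevant documents found so far, divided by all relevant documents) plotted against effort (documents reviewed so far).

```python
from typing import List, Optional


def gain_curve(labels_in_review_order: List[bool], total_relevant: int) -> List[float]:
    """Return recall after each reviewed document (True means relevant)."""
    found = 0
    curve = []
    for is_relevant in labels_in_review_order:
        if is_relevant:
            found += 1
        curve.append(found / total_relevant)
    return curve


def effort_to_reach(curve: List[float], target_recall: float) -> Optional[int]:
    """Return how many documents had to be reviewed to hit the target recall."""
    for reviewed, recall in enumerate(curve, start=1):
        if recall >= target_recall:
            return reviewed
    return None  # target never reached in this review order


# Hypothetical toy data: the same 8 documents (4 relevant) in two review orders.
front_loaded = [True, True, True, False, True, False, False, False]
scattered = [False, True, False, False, True, False, True, True]

for name, order in (("front-loaded", front_loaded), ("scattered", scattered)):
    curve = gain_curve(order, total_relevant=4)
    print(name, curve, "docs to reach 75% recall:", effort_to_reach(curve, 0.75))
```

A review order that puts relevant documents up front produces the high early slope described in the quote. An order that scatters them produces a flatter curve and more effort for the same recall.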


The Cormack Grossman experiment also compared the CAL and SAL methods. Recall that the SAL method is another simple multimodal method where only two methods are used to select training documents. Keywords are again used in the first round only, just like the CAL protocol. Thereafter, in all subsequent rounds of training, machine selected documents are used based on the machine’s uncertainty of classification. That means the search is focused on the midrange ranked documents about which the machine is most uncertain.
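The difference between the three ways of picking the next training batch can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the implementation used in the study: the function names are hypothetical, the scores are invented, and I assume the classifier reports a probability of relevance between 0 and 1, so that "most uncertain" means closest to 0.5.

```python
import random
from typing import Dict, List


def select_relevance(scores: Dict[str, float], batch_size: int) -> List[str]:
    """CAL-style selection: take the highest ranked documents."""
    return sorted(scores, key=scores.get, reverse=True)[:batch_size]


def select_uncertainty(scores: Dict[str, float], batch_size: int) -> List[str]:
    """SAL-style selection: take the documents closest to the midpoint (0.5 assumed)."""
    return sorted(scores, key=lambda doc: abs(scores[doc] - 0.5))[:batch_size]


def select_random(scores: Dict[str, float], batch_size: int, seed: int = 0) -> List[str]:
    """SPL-style selection: ignore the scores and draw documents at random."""
    rng = random.Random(seed)
    return rng.sample(sorted(scores), batch_size)


# Hypothetical relevance probabilities for six unreviewed documents.
scores = {
    "doc_a": 0.97, "doc_b": 0.91, "doc_c": 0.55,
    "doc_d": 0.48, "doc_e": 0.08, "doc_f": 0.02,
}

print("relevance:  ", select_relevance(scores, 2))
print("uncertainty:", select_uncertainty(scores, 2))
print("random:     ", select_random(scores, 2))
```

With these made-up scores the relevance rule picks doc_a and doc_b, the uncertainty rule picks doc_d and doc_c, and the random rule picks whatever the draw happens to return. In a real project the scores would be recomputed after each round of training before the next batch is selected.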


Cormack and Grossman found that “the CAL protocol generally achieves higher recall than SAL,” but the results were closer and more complex. Id. At one point in the training SAL became as good as CAL; it achieved a specific recall value with nearly the same effort as CAL from that point forward. The authors found that this was due to the fact that many high probability documents began to be used by the machine as uncertainty selected documents. This happened after all of the mid-scoring documents had been used up. In other words, at some point the distinction between the two methods was decreased, and more high probability documents were used in SAL, in almost the same way they were used in CAL. That allowed SAL to catch up with CAL and, in effect, become almost as good.

This catch up point is different in each project. As Cormack and Grossman explain:

Once stabilization occurs, the review set will include few documents with intermediate scores, because they will have previously been selected for training. Instead, the review set will include primarily high-scoring and low-scoring documents. The high-scoring documents account for the high slope before the inflection point; the low-scoring documents account for the low slope after the inflection point; the absence of documents with intermediate scores accounts for the sharp transition. The net effect is that SAL achieves effort as low as CAL only for a specific recall value, which is easy to see in hindsight, but difficult to predict at the time of stabilization.

This inflection point and other comparisons can be easily seen in Figure 2 of the report (shown below). Id. at pg. 6. Again the line with circle dots at the top of each graph, the one that always starts off fastest, plots the retrieval rate of CAL. Again, it does better in each of the eight search tasks tested. The other three lines show the uncertainty approach, SAL, using three different training-set sizes. CAL does better than SAL in all eight of the matters, but the differences are not nearly as great as the comparison between CAL and SPL.

Cormack and Grossman summarize the CAL v. SAL findings as follows:

Figure 2 shows that the CAL protocol generally achieves higher recall than SAL. However, the SAL gain curves, unlike the SPL gain curves, often touch the CAL curves at one specific inflection point. The strong inflection of the SAL curve at this point is explained by the nature of uncertainty sampling: Once stabilization occurs, the review set … (see quote above for the rest of this sentence.)

This experiment compared one type of simple multimodal machine training method with another. It found that with the data sets tested, and other standard procedures set forth in the experiment, the method which used high ranking documents for training, what William Webber calls the Relevance method, performed somewhat better than the method that used mid-ranked documents, what Webber calls the Uncertainty method.

This does not mean that the uncertainty method should be excluded from a full multimodal approach in real world applications. It just means that here, in this one experiment, albeit a very complex and multifaceted experiment, the relevance method outperformed the uncertainty method.

I have found that in the real world of very complex (messy even) legal searches, it is good to use both high and mid-ranked documents for training, what Cormack and Grossman call CAL and SAL, and what Webber calls Relevance, and Uncertainty training. It all depends on the circumstances, including the all important cost component. In the real world you use every method you can think of to help you to find what you are looking for, not just one or two, but dozens.

Grossman and Cormack know this very well too, which I know from private conversations with them on this, and also from the conclusion to their report:

There is no reason to presume that the CAL results described here represent the best that can be achieved. Any number of feature engineering methods, learning algorithms, training protocols, and search strategies might yield substantive improvements in the future. The effect of review order and other human factors on training accuracy, and thus overall review effectiveness, may also be substantial.

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pg. 9.

Ralph Losey with some of his many computer tools

My practical takeaway from the Cormack Grossman experiment is that focusing on high ranking documents is a powerful search method. It should be given significant weight in any multimodal approach, especially when the goal is to quickly find as many relevant documents as possible. The “continuous” training aspects of the CAL approach are also intriguing, that is you keep doing machine training throughout the review project and batch reviews accordingly. This could become a project management issue, but, if you can pull it off within proportionality and requesting party constraints, it just makes common sense to do so. You might as well get as much help from the machine as possible and keep getting its probability predictions for as long as you are still doing reviews and can make last minute batch assignments accordingly.

I have done several reviews in such a continuous training manner without really thinking about the fact the machine input was continuous, including my first Enron experiment. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. But this causes me to rethink the flow chart shown below that I usually use to explain the predictive coding process. The work flow shown is not a CAL approach, but rather a SAL type of approach where there is a distinct stop in training after step five, and the review work in step seven is based on the last rankings established in step five. The continuous work flow will be more difficult to show in a diagram, and to implement, but it does make good common sense if you are in a position to pull it off.


The findings in this experiment as to the strengths of using Relevancy training confirm what I have seen in most of my search projects. I usually start with the high end documents to quickly help me to teach the machine what I am looking for. I find that this is a good way to start training. Again, it just makes common sense to do so. It is somewhat like teaching a human, or a dog for that matter. You teach the machine relevance classification by telling it when it is right (positive reinforcement), and when it is wrong. This kind of feedback is critical in all learning. In most projects this kind of feedback on predictions of highly probable relevance is the fastest way to get to the most important documents. For those reasons I agree with Cormack and Grossman’s conclusion that CAL is a superior method to quickly find the most relevant documents:

CAL also offers the reviewer the opportunity to quickly identify legally significant documents that can guide litigation strategy, and can readily adapt when new documents are added to the collection, or new issues or interpretations of relevance arise.

Id. But then again, I would never rely on just Relevancy CAL type searches alone. It gets results fast, but also tends to lead to a somewhat myopic focus on the high end where you may miss new, different types of relevant documents. For that reason, I also use SAL types of searches to include the mid range documents from the Uncertainty method. That is an important method to help the machine to better understand what documents I am looking for. As Cormack and Grossman put it:

The underlying objective of CAL is to find and review as many of the responsive documents as possible, as quickly as possible. The underlying objective of SAL, on the other hand, is to induce the best classifier possible, considering the level of training effort. Generally, the classifier is applied to the collection to produce a review set, which is then subject to manual review.

Id. at 8.

Similarity and other concept type search methods are also a good way to quickly find as many responsive documents as possible. So too are keyword searches, and not just in the first round, but for any round. Further, this experiment, which is already very complex (to me at least), does not include the important real world component of highly relevant versus merely relevant documents. I never just train on relevancy alone, but always include a hunt for the hot documents. I want to try to train the machine to understand the difference between the two classifications. Cormack and Grossman do not disagree. As they put it, “any number of feature engineering methods, learning algorithms, training protocols, and search strategies” could improve upon a CAL only approach.

There are also ways to improve the classifier in addition to focusing on mid range probability documents, although I have found that the uncertainty method is the best way to improve relevance classifications. But it also helps to be sure your training on the low end is right, meaning review of some of the high probability irrelevant documents. Both relevant and irrelevant training are helpful. Personally, I also like to include some random aspects, especially at first, to be sure I did not miss any outlier type documents, and to be sure I have a good feel for the irrelevant documents of these custodians too. Yes, chance has its place too, so long as it does not take over and become the whole show.

Supplemental Findings on Random Search

In addition to comparing CAL with SAL and SPL, Cormack and Grossman experimented with what would happen to the effectiveness of both the CAL and SAL protocols if more random elements were added to the methods. They experimented with a number of different variables, including substituting random selection, instead of keyword, for the initial round of training (seed set).

As you would expect, the general results were to decrease the effectiveness of every search method wherein random was substituted, either for keyword, high ranking relevance, or mid ranking relevance (uncertainty). The negative impact was strongest in datasets where prevalence was low, which is typical in litigation. Cormack and Grossman tested eight datasets where the prevalence of responsive documents varied from 0.25% to 3.92%, which, as they put it: “is typical for the legal matters with which we have been involved.” The size of the sets tested ranged from 293,000 documents to just over 1.1 million. The random based search of the lowest prevalence dataset tested, matter 203, the one with a 0.25% prevalence rate, was, in their words, a spectacular failure. Conversely, the negative impact was lessened with higher prevalence datasets. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pg. 7.
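Some back-of-the-envelope arithmetic shows why low prevalence is so hard on random selection. The Python sketch below is my own illustration, not a calculation from the report. It uses the two prevalence figures quoted above (0.25% and 3.92%), while the sample sizes of 100 and 1,000 documents are assumptions chosen for illustration, and each draw is treated as independent (a binomial approximation that is reasonable when the sample is small relative to the collection).

```python
def expected_relevant(prevalence: float, sample_size: int) -> float:
    """Expected number of relevant documents in a random sample."""
    return prevalence * sample_size


def prob_no_relevant(prevalence: float, sample_size: int) -> float:
    """Chance that a random sample contains no relevant documents at all,
    assuming independent draws."""
    return (1.0 - prevalence) ** sample_size


# Prevalence values are the low and high ends reported for the eight datasets.
# Sample sizes are illustrative assumptions, not figures from the study.
for prevalence in (0.0025, 0.0392):
    for sample_size in (100, 1000):
        print(
            f"prevalence {prevalence:.2%}, sample {sample_size}: "
            f"expected relevant = {expected_relevant(prevalence, sample_size):.1f}, "
            f"chance of zero relevant = {prob_no_relevant(prevalence, sample_size):.1%}"
        )
```

At 0.25% prevalence a random sample of 1,000 documents can be expected to contain only about two or three relevant examples, and a sample of 100 documents will contain none at all roughly three times out of four. At 3.92% prevalence the same 1,000 documents yield about 39 relevant examples. The machine has far less positive training material to learn from in the low prevalence case.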

Cormack and Grossman responded to the popular misconception that predictive coding does not work in such low prevalence datasets.

Others assert that these are examples of “low-prevalence” or “low-richness” collections, for which TAR is unsuitable [19]. We suggest that such assertions may presuppose an SPL protocol [11], which is not as effective on low-prevalence datasets. It may be that SPL methods can achieve better results on higher-prevalence collections (i.e., 10% or more responsive documents).

Id. at 9.

In fact, information scientists have been working with low prevalence datasets for decades, which is one reason Professor Cormack had a ready collection of pre-coded documents by which to measure recall, a so-called gold standard of assessments from prior studies. Cormack and Grossman explain that the lack of pre-tested datasets with high prevalence is the reason they did not use such collections for testing. They also speculate that if such high prevalence datasets are tested, then the random only (SPL) method would do much better than it did in the low prevalence datasets they used in their experiment.

However, no such collections were included in this study because, for the few matters with which we have been involved where the prevalence exceeded 10%, the necessary training and gold-standard assessments were not available. We conjecture that the comparative advantage of CAL over SPL would be decreased, but not eliminated, for high-prevalence collections.


They are probably right. If the datasets have a higher prevalence, then the chances are that random samples will find more relevant documents for training. But that still does not make the blind draw a better way to find things than looking with your eyes wide open. Plus, the typical way to attain high yield datasets is by keyword filtering out large segments of the raw data before beginning a predictive coding search. When you keyword filter like that before beginning machine training, the chances are you will leave behind a significant portion, if not most, of the relevant documents. Keyword filtering often has low recall, or, when broad enough to include most of the relevant documents, it is very imprecise. Then you are back to the same low prevalence situation.
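The trade-off is easy to see with numbers. The Python sketch below is purely illustrative: the helper name filter_effect and every count in it are made up by me and do not come from the study or from any real case. It shows how a keyword filter can raise prevalence in the set that remains while leaving most of the relevant documents behind.

```python
from typing import Dict


def filter_effect(
    collection_size: int,
    relevant_total: int,
    kept_size: int,
    kept_relevant: int,
) -> Dict[str, float]:
    """Summarize what a pre-filter does to prevalence and to recall."""
    return {
        "prevalence_before": relevant_total / collection_size,
        "prevalence_after": kept_relevant / kept_size,
        "filter_recall": kept_relevant / relevant_total,
        "relevant_left_behind": relevant_total - kept_relevant,
    }


# Hypothetical numbers: 1,000,000 documents with 10,000 relevant (1% prevalence).
# A keyword filter keeps 50,000 documents, of which 4,000 are relevant.
result = filter_effect(
    collection_size=1_000_000,
    relevant_total=10_000,
    kept_size=50_000,
    kept_relevant=4_000,
)

for name, value in result.items():
    print(f"{name}: {value}")
```

In this invented example the filter lifts prevalence from 1% to 8%, which looks like a win for a random-only tool. But the filter's own recall is only 40%, so 6,000 relevant documents never reach the machine training at all, no matter how well the predictive coding performs afterwards.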

Better to limit filtering before machine training to the obviously irrelevant, or to ESI not appropriate for training, such as non-text documents like photos, music and voice mail. Use other methods to search for those types of ESI. But do not use keyword filtering on text documents simply to create an artificially high prevalence just because the random based software you use will only work that way. That is the tail wagging the dog.

For more analysis and criticism on using keywords to create artificially high prevalence, a practice Cormack and Grossman call Collection Enrichment, see another excellent article they wrote: Comments on “The Implications of Rule 26(g) on the Use of Technology-Assisted Review,” 7 Federal Courts Law Review 286 (2014) at pgs. 293-295, 300-301. This article also contains good explanations of the instant study with CAL, SAL and SPL. See especially Table 1 at pg. 297.

The negative impact of random elements on machine training protocols is a no duh to experienced searchers. See, e.g., the excellent series of articles by John Tredennick, including his review of the Cormack Grossman study: Pioneering Cormack/Grossman Study Validates Continuous Learning, Judgmental Seeds and Review Team Training for Technology Assisted Review.

It never helps to turn to lady luck, to random chance, to improve search. Once you start relying on dice to decide what to do, you are just spinning your wheels.

Supplemental Findings on Keywords and Random Search

Cormack and Grossman also tested what would happen if keywords were used instead of random selections, even when the keywords were not tested first against the actual data. This poor practice of using unverified keywords is what I call the Go Fish approach to keyword search. Child’s Game of “Go Fish” is a Poor Model for e-Discovery Search (October 2009). Under this naive approach attorneys simply guess what keywords might be contained in relevant documents without testing how accurate their guesses are. It is a very simplistic approach to keyword search, yet, nevertheless, is still widely employed in the legal profession. This approach has been criticized by many, including Judge Andrew Peck in his excellent Gross Construction opinion, the so-called wake-up call for NY attorneys on search. William A. Gross Construction Associates, Inc. v. American Manufacturers Mutual Insurance Co., 256 F.R.D. 134 (S.D.N.Y. 2009).

Cormack and Grossman also tested what would happen if such naive keyword selections were used instead of the high or mid probability methods (CAL and SAL) for machine training. The naive keywords used in these supplemental comparison tests did fairly well. This is consistent with my multimodal approach, where all kinds of search methods are used in all rounds of training.

The success of naive keyword selection for machine training is discussed by Cormack and Grossman as an unexpected finding (italics and parens added):

Perhaps more surprising is the fact that a simple keyword search, composed without prior knowledge of the collection, almost always yields a more effective seed set than random selection, whether for CAL, SAL, or SPL. Even when keyword search is used to select all training documents, the result is generally superior to that achieved when random selection is used. That said, even if (random) passive learning is enhanced using a keyword-selected seed or training set, it (passive learning) is still dramatically inferior to active learning. It is possible, in theory, that a party could devise keywords that would render passive learning competitive with active learning, but until a formal protocol for constructing such a search can be established, it is impossible to subject the approach to a controlled scientific evaluation. Pending the establishment and scientific validation of such a protocol, reliance on keywords and (random) passive learning remains a questionable practice. On the other hand, the results reported here indicate that it is quite easy for either party (or for the parties together) to construct a keyword search that yields an effective seed set for active learning.

Id. at 8.

Cormack and Grossman summarize their findings on the impact of keywords in the first round of training (seed set) on CAL, SAL and SPL:

In summary, the use of a seed set selected using a simple keyword search, composed prior to the review, contributes to the effectiveness of all of the TAR protocols investigated in this study.

Keywords still have an important place in any multimodal, active, predictive coding protocol. This is, however, completely different from using keywords, especially untested naive keywords, to filter out the raw data in a misguided attempt to create high prevalence collections, all so that the random method (passive) might have some chance of success.

To be continued . . . in Part Four I will conclude with final opinions and analysis and my friendly recommendations for any vendors still using random-only training protocols. 

Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One

July 6, 2014

There is a well-known joke found in most cultures of the world about a fool looking for something. This anecdote has been told for thousands of years because it illustrates a basic trait of human psychology, now commonly called after the joke itself, the Streetlight Effect. This is a type of observational bias where people only look for whatever they are searching by looking where it is easiest. This human frailty, when pointed out in the right way, can be funny. One of the oldest known forms of pedagogic humor illustrating the Streetlight Effect comes from the famous stories of Nasrudin, aka, Nasreddin, an archetypal wise fool from 13th Century Sufi traditions. Here is one version of this joke attributed to Nasreddin:

One late evening Nasreddin found himself walking home. It was only a very short way and upon arrival he can be seen to be upset about something. Alas, just then a young man comes along and sees the Mullah’s distress.

“Mullah, pray tell me: what is wrong?”

“Ah, my friend, I seem to have lost my keys. Would you help me search them? I know I had them when I left the tea house.”

So, he helps Nasreddin with the search for the keys. For quite a while the man is searching here and there but no keys are to be found. He looks over to Nasreddin and finds him searching only a small area around a street lamp.

“Mullah, why are you only searching there?”

“Why would I search where there is no light?”

Using Only Random Selection to Find Predictive Coding Training Documents Is Easy, But Foolish

The easiest way to train documents for predictive coding is simply to use random samples. It may be easy, but, as far as I am concerned, it also defies common sense. In fact, like the Nasrudin story, it is so stupid as to be funny. You know you dropped your keys near your front door, but you do not look there because it is dark, it is hard to search there. You take the easy way out. You search by the street lamp.

The morals here are many. The easy way is not necessarily the right way. This is true in search, as it is in many other things. The search for truth is often hard and difficult. You need to follow your own knowledge, what you know, and what you do not. What do you know about where you lost your keys? Think about that and use your analysis to guide your search. You must avoid the easy way, the lazy way. You must not be tempted to only look under the lamp post. To do so is to ignore your own knowledge. It is foolish to the extreme. It is laughable, as this 1942 Mutt and Jeff comic strip shows:


Random search for predictive coding training documents is laughable too. It may be easy to simply pick training documents at random, but it is ineffective. It ignores an attorney’s knowledge of the case and the documents. It is equivalent to just rolling dice to decide where to look for something, instead of using your own judgment, your own skills and insights. It purports to replace the legal expertise of an attorney with a roll of dice. It would have you ignore an attorney’s knowledge of relevance and evidence, their skills, expertise, and long experience with search.

If you know you left your keys near the front door, why let random chance tell you where to search? You should instead let your knowledge guide your search. It defies common sense to ignore what you know. Yet, this is exactly what some methods of predictive coding tell you to do. These random only methods are tied to particular software vendors; the ones whose software is designed to run only on random training.

These vendors tell you to rely entirely on random selection of documents to use in training. They do so because that requires no thought, as if lawyers were not capable of thought, as if lawyers have not long been the masters of discovery of legal evidence. It is insulting to the intelligence of any lawyer, and yet several software vendors actually prescribe this as the only way to do predictive coding search. This has already been criticized as predictive coding junk science by search expert and attorney Bill Speros, who used the same classic street light analogy. Predictive Coding’s Erroneous Zones Are Emerging Junk Science  (Pulling a random sample of documents to train the initial seed set … is erroneous because it looks for relevance in all the wrong places. It turns a blind eye to what is staring you in the eye.) Still, the practice continues.

The continuing success of a few vendors still using this approach is, I suspect, one reason that the new study by Gordon Cormack and Maura R. Grossman, is designed to answer the question:

Should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning? 

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014 (quote from the Abstract).

Although the answer seems common sensical, in a deep archetypal way, and obvious; sometimes common sense and history can be wrong. The only way to know for sure is by scientific experiment. That is exactly what Cormack and Grossman have done.

Since several influential vendors say yes to the question raised in the study, and tell their customers that they should only look under the lamp post, and use one-light-only random search software, Grossman and Cormack had to give this seemingly funny assertion serious attention. They put the joke to the test. To no one’s surprise, except a few vendors, the experiments they performed showed that it was more effective to select training documents using non-random methods and active learning (a process that I call multimodal search). I will discuss their ingenious experiments and report in some detail in Part-Two of this blog.

Some Vendors Add Insult to Injury to Try to Justify their Random-Only Approach

To add insult to injury, some vendors try to justify their method by arguing that random selection avoids the prejudice of lawyer bias. It keeps the whole search process open. They seem to think lawyers know nothing. That they dropped their keys and have absolutely no idea where. If the lawyers think they know, they are just biased and should be ignored. They are not to be trusted.

This is not only insulting, but ignores the obvious reality that lawyers are always making the final call on relevance, not computers, not software engineers. Lawyers say what is relevant and what is not, even with random selection.

Some engineers who design random-only selected training software for predictive coding justify the limitation on the basis of assumed lawyer dishonesty. They think that if lawyers are allowed to pick samples for training, and not just have them selected for them at random, that lawyers may rig the system and hide the truth by intentionally poor selections. This is the way a lot of computer experts think when it comes to law and lawyers. I know this from over thirty years of experience.

If a lawyer is really so dishonest that they will deliberately mis-train a predictive coding system to try to hide the truth, then that lawyer can easily find other, more effective ways to hide the ball than that. Hiding evidence is unethical. It is dishonest. It is not what we are paid to do. Argue what the facts mean? Yes, most definitely. Change the facts? No. Despite what you may think is true about law and lawyers, this is not the kind of thing that 98% of lawyers do. It will not be tolerated by courts. Such lawyer misconduct could not only lead to loss of a case, but also loss of a license to practice law. Can you say that about engineering?

My message to software vendors is simple, leave it to us, to attorneys and the Bar, to police legal search. Do not attempt to do so by software design. That is way beyond your purview. It is also foolish because the people you are insulting with this kind of mistrust are your customers!

I have talked to some of the engineers who believe in random reliance as a way to protect their code from lawyer manipulation. I know perfectly well that this is what some (not all) of them are trying to do. Frankly, the arrogant engineers who think like that do not know what they are talking about. It is just typical engineer bias against lawyers, plain and simple. Get over it and stop trying to sell us tools designed for dishonest children. We need full functionality. The latest Grossman Cormack study proves this.

Protect Us from Bias by Better Code, Not Random Selection

Some software designers with whom I have debated this topic will, at this point, try to placate me with statements about unintentional bias. They will point out that even though a lawyer may be acting in good faith, they may still have an unconscious, subjective bias. They will argue that without even knowing it, without realizing it, a lawyer may pick documents that only favor their clients. Oh please. The broad application of this so-called insight into subjectivity to justify randomness is insulting to the intelligence of all lawyers. We understand better than most professions the inherent limitations of reason. See Scientific Proof of Law’s Overreliance On Reason: The “Reasonable Man” is Dead, Long Live the Whole Man, Part Two. Also see The Psychology of Law and Discovery. We are really not so dimwitted as to be unable to do legal search without our finger on the scale, and, this is important, neither is the best predictive coding software.

Precautions can be taken against inherent, subjective bias. The solution is not to throw the baby out with the bath water, which is exactly what random-only search amounts to. The solution to bias is better search algorithms, plus quality controls. Code can be made to work so that it is not so sensitive to, and dependent on, lawyer-selected documents. It can tolerate and correct errors. It can reach out and broaden initial search parameters. It is not constrained by the lawyer-selected documents.

Dear software designers: do not try to fix lawyers. We do not need the help of engineers for that. We will fix ourselves, thank you! Fix your code instead. Get real with your methods. Overcome your anti-lawyer bias and read the science.

Compete With Better Code, Not False Doctrine

Many software companies have already fixed their code. They have succeeded in addressing the inherent limitations in all active machine learning, driven as it must be by inconsistent humans. In their software the lawyer trainers are not the only ones selecting documents for training. The computer selects documents too. Smart computer selection is far different, and far better, than stupid random selection.
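
To make the difference concrete, here is a minimal Python sketch, offered as an illustration only and not as any vendor's actual code. It assumes a hypothetical dictionary of classifier scores (document ID mapped to an estimated probability of relevance) and shows three ways software might pick the next training documents: at random, by highest rank, and by greatest uncertainty. The function names and the scores are invented for this example.

```python
import random


def select_random(scores, k, seed=0):
    """Pick k documents by chance alone, ignoring the scores entirely."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(scores), k))


def select_top_ranked(scores, k):
    """Pick the k documents the classifier currently rates most likely relevant."""
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:k]


def select_most_uncertain(scores, k):
    """Pick the k documents whose scores sit closest to 0.5 (the toss-ups)."""
    return sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))[:k]


# Hypothetical scores, assumed for illustration only.
example_scores = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.80, "e": 0.45}

top_two = select_top_ranked(example_scores, 2)
uncertain_two = select_most_uncertain(example_scores, 2)
random_two = select_random(example_scores, 2)
```

The two score-driven selectors use what the machine has learned so far. The random selector throws that knowledge away on every round.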

I know that the software I use, Kroll Ontrack’s EDR (eDiscovery Review), is frequently correcting my errors, broadening my initial conception of relevance. It is helping me to find new documents that are relevant, documents that I would never have thought of or found on my own. The computer selects as many documents as I decide are appropriate to enhance the training. Random has only a small place at the beginning to calculate prevalence. Concept searches, similarity searches, keyword, even linear, are far, far better than random alone. When they are all put together in a multimodal predictive coding package, the results can be extremely good.
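
As a rough illustration of that one legitimate job for random sampling, the following Python sketch estimates prevalence from a randomly drawn, lawyer-coded sample. It is a sketch under textbook assumptions (a simple random sample and the normal approximation to the binomial), not a description of how EDR or any other product computes prevalence. The function name and the sample figures are hypothetical.

```python
import math


def estimate_prevalence(sample_labels, z=1.96):
    """Return (point_estimate, low, high) for the share of relevant documents.

    sample_labels: True/False relevance calls on a simple random sample.
    z: z-score for the confidence level (1.96 is roughly 95%).
    """
    n = len(sample_labels)
    if n == 0:
        raise ValueError("sample must not be empty")
    p = sum(1 for label in sample_labels if label) / n
    # Normal-approximation margin of error for a proportion.
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical sample: 300 documents reviewed, 30 coded relevant.
labels = [True] * 30 + [False] * 270
point, low, high = estimate_prevalence(labels)
```

For this assumed sample the point estimate is 10%, with an interval of roughly 6.6% to 13.4%. That is useful for planning and quality control, but it says nothing about where the relevant documents are, which is the job of the other search methods.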

The notion that you should just turn search over to chance means you should search everywhere and anywhere. That is the essence of random. It means you have no idea of where the relevant documents might be located, and what they might say. That is again completely contrary to what happens in legal discovery. No lawyer is that dimwitted. There is always at least some knowledge as to the type or kind of documents that might be relevant. There is always some knowledge as to who is most likely to have them, and when, and what they might say, what names would be used, what metadata, etc.

A Joke at the Expense of Our System of Justice is Not Funny

I would be laughing at all of this random-only search propaganda like a Nasreddin joke, but for the fact that many lawyers do not get the joke. They are buying software and methods that rely exclusively on random search for training documents. Many are falling for the streetlight effect gimmicks and marketing. It is not funny because we are talking about truth and justice here, not just a fool’s house keys. I care about these pursuits and best practices for predictive coding. The future of legal search is harmed by this naive foolishness. That is why I have reacted before to vendor propaganda promoting random search. That is why I spent over fifty hours doing a predictive coding experiment based in part on random search, an approach I call the Random Borg approach. Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents (Part Two); A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (Part One). I have also written several articles on this subject to try to debunk this method, and yet this method lives on. See, e.g., The Many Types of Legal Search Software in the CAR Market Today; Three-Cylinder Multimodal Approach To Predictive Coding.

So too have others. See, e.g., Speros, W., Predictive Coding’s Erroneous Zones Are Emerging Junk Science (e-Discovery Team Blog (Guest Entry), 28th April 2013). As Bill Speros puts it:

Some attorneys employ random samples to populate seed sets apparently because they:

    • Don’t know how to form the seed set in a better way, or
    • Want to delegate responsibility to the computer “which said ‘so’,” or
    • Are emboldened by a statistical rationale premised on the claim that no one knows anything so random is as good a place to start as anywhere.

In spite of the many criticisms, on my blog at least, the random seed set approach continues, and even seems to be increasing in popularity.

Fortunately, Gordon Cormack and Maura R. Grossman have now entered this arena. They have done scientific research on the random-only training method. Not surprisingly, they concluded, as Speros and I did, that random selection of training documents is not nearly as effective as multimodal, judgmental selection. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, Gold Coast, Queensland, Australia, ACM 978-1-4503-2257-7/14/07.

To be continued . . . . where I will review the new Grossman Cormack Study and conclude with my recommendations to vendors who still use random-only training. I will offer a kind of olive branch to the Borg where I respectfully invite them to join the federation of search, a search universe where all capacities are used, not just random. They have a good start with their existing predictive coding software. All they need do is break with the false doctrine and add new search capacities.

