Should Lawyers Be Big Data Cops?

September 1, 2014

Many police departments are using big data analytics to predict where crime is likely to take place and to prevent it. Should lawyers do the same to predict and stop illegal, non-criminal activities? That is not the job of the police, but should it be the job of lawyers? We already have the technology to do this, but should we use it? Should lawyers be big data cops? Does anyone even want that?

Crime Prevention by Data Analytics is Already in Use by Many Police Departments

The NY Times reported on this back in 2011, when it was relatively new: Sending the Police Before There’s a Crime. The Times described how the Santa Cruz, California, police were using data analysis to predict where burglaries and other crimes might take place, and to deploy police officers accordingly:

The arrests were routine. Two women were taken into custody after they were discovered peering into cars in a downtown parking garage in Santa Cruz, Calif. One woman was found to have outstanding warrants; the other was carrying illegal drugs.

But the presence of the police officers in the garage that Friday afternoon in July was anything but ordinary: They were directed to the parking structure by a computer program that had predicted that car burglaries were especially likely there that day.

The Times reported that several cities were already using data analysis to try to systematically anticipate when and where crimes would occur, including Chicago, whose police department created a predictive analytics unit back in 2010.

This trend is growing, and precrime detection technologies are now used by many police departments around the world, as well as by agencies such as the Department of Homeland Security, not to mention the NSA’s analytics of metadata. See, e.g., The Minority Report: Using Predictive Analytics to prevent the crime from happening in the first place! (IBM); In Hot Pursuit of Numbers to Ward Off Crime (NY Times); Police embracing tech that predicts crimes (CNN); U.S. Cities Relying on Precog Software to Predict Murder (Wired). The analytics are already pretty good at predicting the places and times where cars will be stolen, houses burglarized, and people mugged.

Although these programs improve the efficiency of crime fighting, they are not without serious privacy and due process critics. Imagine the potential abuses if an evil Big Brother government were not only watching you, but could arrest you based on computer predictions of what you might do. Although no one is yet arresting people for what they might do, as in the Minority Report, police are subjecting people to significantly increased scrutiny, even home visits. See, e.g., Professor Elizabeth Joh, Policing by Numbers: Big Data and the Fourth Amendment; Professor Brandon Garrett, Big Data and Due Process; The minority report: Chicago’s new police computer predicts crimes, but is it racist? (The Verge, 2014); Eric Holder Warns About America’s Disturbing Attempts at Precrime. Do we really want to give computers, and the people who operate them, that much power? Does the Constitution as now written even allow it?

Should Lawyers Detect and Stop Lawsuits Before They Happen?

Should lawyers follow our police departments and use data analytics to predict and stop illegal, but non-criminal, activities? The police will not do it. It is beyond their jurisdiction. Their job is to fight crime, not torts, not breaches of contract, nor the tens of thousands of other civil wrongs that people and corporations sue each other over every day. Should lawyers do it? Is that the next step for the plaintiff’s bar? For corporate defense lawyers? For corporate compliance lawyers? For the Civil Division of the Department of Justice? How serious is the potential loss of privacy and other rights if we go that route? What other risks do we take in using our newfound predictive coding skills in this way?

There are millions of civil wrongs committed each year that are beyond the purview of the criminal justice system. Many of them cause disputes, and many of these disputes in turn lead to state and federal litigation. Evidence of these illegal activities is present in both public and private data. Should lawyers mine this data to look for civil wrongs? Should the civil justice system include prevention? Should lawyers not only bring and defend lawsuits, but also prevent them?

We are not talking about the future here. The necessary software and search skills already exist. Lawyers with big data skills can already detect and prevent breaches of contract, torts, and statutory violations, if they have access to the data. It is already possible for skilled lawyers, using artificial-intelligence-enhanced evidence search, to detect and stop these illegal activities before damages are caused, before disputes arise, before lawsuits are filed.

I have written about this several times before and even coined a word for this legal service. I call it “PreSuit.” It is a play on the term PreCrime from the Minority Report movie. I have built a website that provides an overview of how these services can be performed. Some lawyers have even begun rendering such services. But should they? Some lawyers, myself included, know how to use existing predictive coding software to mine data and predict where illegal activities are likely to take place. We know how to use this predictive technology to intervene and prevent such illegal activity. But should we?


Just because new technology empowers us to do new things does not mean we should do them. Perhaps we should refrain from becoming big data cops? We do not need the extra work. No one is clamoring for this new service. Should we build a new bomb just because we can?

Do we really want to empower an elite group of technology-enhanced lawyers in this way? After all, society has gotten along just fine for centuries using traditional civil dispute resolution procedures: a court system that imposes after-the-fact damages and injunctions to provide redress for civil wrongs. Should we really turn the civil justice system on its head by detecting wrongs in advance and avoiding them?

Is it really in the best interest of society for lawyers to be big data cops? Or anyone else, for that matter? Is it in the best interests of the corporate world to have this kind of private police action? Is it in the best interest of lawyers? The public? What are the privacy and due process ramifications?

Some Preliminary Thoughts

I do not have any answers on this yet. It is too early in my own analysis to say for sure. These kinds of complex constitutional issues require a great deal of thought and discussion. All sides should be heard. I would like to hear what others have to say before I start reaching any conclusions. I look forward to your public and private comments. I do, however, have a few preliminary thoughts and predictions to start the discussion. Some are serious; some are just designed to be thought-provoking. You figure out which are which. If you quote me, please remember to include this disclaimer. None of these thoughts are yet firm convictions, nor certain predictions. I may change my mind on all of this as my understanding improves. As a better Ralph than I once said: “A foolish consistency is the hobgoblin of little minds.”

First of all, there is no current demand for this service from the people who need it the most, large corporations. They may never want it, even though such reluctance is irrational. It would, after all, reduce litigation costs and make their companies more profitable. I am not sure why, and I do not think it is as simple as some would say, that they just want to hide their illegal activities. Let me tell you about an experience from my 34 years as a litigator that may shed some light on this, one I know is common among litigators. It has to do with the relationship between lawyers and management in most large companies.

Occasionally during a case I would become aware of a business practice in my client corporation that should obviously be changed. Typically it was a practice that created, or at least contributed to, the lawsuit I had just defended. The practice was not blatantly illegal; it was a grey area. The case had shown that it was stupid and should be changed, if for no other reason than to prevent another case like it from happening. Since I had just seen the train wreck in slow motion, and knew full well how much it had cost the company, mostly in my fees, I thought I would help the company prevent it from happening again. I would make a recommendation as to what should be changed and why. Sometimes I would explain in detail how the change would have prevented the litigation I had just finished. I would explain how a change in the business practice would save the company money.

I did this several times as a litigator at other firms before going to my current firm, where I only do e-discovery. Do you know what kind of reaction I got? Nothing. No response at all, except perhaps a bored, polite thanks. I doubt my lessons-learned memos were even read. I was, after all, just an unknown, young partner in a Floriduh law firm. I was not pointing out an illegal practice, nor one that had to be changed to avoid illegal activities. I was just pointing out a very ill-advised practice. I have had occasion to point out illegal activities too, in fact more frequently, and there the response was much different. I was not ignored. I was told it would be changed. Sometimes I was asked to assist in that change. But when it came to recommendations to change something not outright illegal, suggestions to improve business practices, the response was totally different. Crickets. Just crickets. And big yawns. When will lawyers learn their place?

A couple of times I talked to in-house counsel about this, and tried to enlist their support to get the legal, but stupid, business practice changed. They would usually agree with me wholeheartedly on the stupid part; after all, they had seen the train wreck too. But they were cynical. They would explain that no one in upper management would listen to them. I am speaking about large corporations, ones with big bureaucracies. It may be better in small companies. In large companies, in-house counsel would express frustration. They knew the law department had far less juice than most others in the company. (Only the poor records department, or the compliance department, if there is one, typically gets less respect than legal.) Many other parts of a company actually generate revenue, or at least provide cool toys that management wants, such as IT. All Legal does is spend money and aggravate everyone. The department that usually has the most juice in a company is sales, and they are the ones with most of the questionable practices. They are focused on money-making, not abstractions like legal compliance and dispute avoidance. Bottom line: in my experience, upper management is not interested in hearing the opinions of lawyers, especially outside counsel, on what they should do differently.

Based on this experience, I do not think the idea of lawyers as analytic cops preventing illegal activities will get much traction with upper management. They do not want a lawyer in the room. It would stifle their creativity, their independent management acumen. They see all lawyers as naysayers, deal breakers. Listen to lawyers and you’ll get paralysis by analysis. No, I do not see any welcome sign appearing for lawyers as big data cops, even if you present chart after chart showing how much time, money and frustration you will save the company through litigation avoidance. Of course, I never was much of a salesman. I’m just a lawyer who follows the hacker way of management (an iterative, pragmatic, action-based approach, the polar opposite of paralysis by analysis). So maybe some vendor salesmen out there will be able to sell the PreSuit concept, but not lawyers, at least not me.


I have tried all year. I have talked about this idea at several events. I have written about it, and created the PreSuit website with details. Do you know how many companies have responded? How many have expressed even some interest in the possibility of reducing litigation costs through data analytics? Build it and they will come, they say. Not in my experience. I’ve built it and no one has come. There has been no response at all. Weeds are starting to grow on this field of dreams. Oh well. I’m a golfer. I’m used to disappointment.

This is probably just as well, because reduction of litigation is not really in the best interests of the legal profession. After all, most law firms make most of their money in litigation. Lawyers should refuse to be big data cops and should let the CEOs carry on in ignorant bliss. Let them continue to function with eyes closed and spawn expensive litigation for corporate counsel to defend and for plaintiff’s counsel to get rich on. The litigation system works fine for the lawyers, and for the courts and judges too. Why muck up a big money-generating machine by avoiding the disputes that keep the whole thing running? Especially when no one wants that.

All of the established powers want to leave things just the way they are. Can you imagine the devastating economic impact a fifty percent reduction in litigation would have on the legal system? On lawyers everywhere? Both the plaintiff’s and defendant’s bars? Hundreds of thousands of lawyers and support staff would be out of work. No. This will be ignored, and if not ignored, attacked as radical, new, unproven, and, perhaps most effective of all, as dangerous to privacy rights and due process. The privacy anti-big-brother groups will, for once, join forces with corporate America. Protect the workers, they will say. Unions everywhere will oppose PreSuit. Labor and management will finally have an issue they can agree upon. Only a few high-tech lawyers will oppose them, and they are way outnumbered, especially in the legal profession.

No, I predict this will never be adopted voluntarily, nor will it ever be required by legislation. The politicians of today do not lead; they follow. The only thing I see now that will cause people to want to avoid litigation, to use data analytics to detect and prevent disputes, is the collapse, or near-collapse, of our current system of civil litigation. Lawyers as big data cops will only come out of desperation. That might happen sooner than you think.

There is another way, of course. True leadership could come from the new ranks of corporate America. They could see the enlightened self-interest of PreSuit litigation avoidance. They could understand the value of data analytics and of compliance. This may not come from our current generation of old-school leaders; they barely know what data analytics is anyway. But maybe it will come from the next wave of leaders. There is always hope that the necessary changes will be made out of intelligence, not crisis. If history is any guide, this is unlikely, but not impossible.

On the other hand, maybe this is benevolent neglect. Maybe the refusal to adopt these new technologies is for the best. Maybe the power to predict civil wrongs would be abused by a small technical elite of e-discovery lawyer cops. Maybe it would go to their heads, and before you know it, their heavy hands would descend to rob all employees of their last fragments of privacy. Maybe innovation would be stifled by the fear that new creative actions might be seen as a precursor to illegal activities. This chilling effect could cause everyone to just play it safe.

The next generation of Steve Jobs would never arise in conditions such as this. They would instead come from the last remaining countries that still maintain a heavy litigation load. They would arise in cultures that still allow the workforce to do as it damn well pleases, and just let the courts sort it all out later. Legal smegal, just get the job done. Maybe expensive chaos is the best incubator we have for creative genius? Maybe it is best to keep lawyers out of the boardroom, much less give them a badge and let them police anything. It is better to keep data analytics in Sales where it belongs. Let us know what our customers are doing and thinking, but turn a blind eye to ourselves. That way we can do what we want.


I always end my blogs with a conclusion. But not this time. I have no conclusions yet. This could go either way. The game is too close to call. We are still in the early innings. Who knows? A few star CEOs may come out of the cornfields yet. Then we could find out fast whether PreSuit is a good thing. A few test cases should flush out the facts, good and bad.

Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part Four

August 3, 2014

This is the conclusion of my four part blog: Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part One and Part Two and Part Three.

Cormack and Grossman’s Conclusions

Gordon Cormack and Maura Grossman have obviously put a tremendous amount of time and effort into this study. In their well-written conclusion they explain why they did it, and provide a good summary of their findings:

Because SPL can be ineffective and inefficient, particularly with the low-prevalence collections that are common in ediscovery, disappointment with such tools may lead lawyers to be reluctant to embrace the use of all TAR. Moreover, a number of myths and misconceptions about TAR appear to be closely associated with SPL; notably, that seed and training sets must be randomly selected to avoid “biasing” the learning algorithm.

This study lends no support to the proposition that seed or training sets must be random; to the contrary, keyword seeding, uncertainty sampling, and, in particular, relevance feedback – all non-random methods – improve significantly (P < 0.01) upon random sampling.

While active-learning protocols employing uncertainty sampling are clearly more effective than passive-learning protocols, they tend to focus the reviewer’s attention on marginal rather than legally significant documents. In addition, uncertainty sampling shares a fundamental weakness with passive learning: the need to define and detect when stabilization has occurred, so as to know when to stop training. In the legal context, this decision is fraught with risk, as premature stabilization could result in insufficient recall and undermine an attorney’s certification of having conducted a reasonable search under (U.S.) Federal Rule of Civil Procedure 26(g)(1)(B).

This study highlights an alternative approach – continuous active learning with relevance feedback – that demonstrates superior performance, while avoiding certain problems associated with uncertainty sampling and passive learning. CAL also offers the reviewer the opportunity to quickly identify legally significant documents that can guide litigation strategy, and can readily adapt when new documents are added to the collection, or new issues or interpretations of relevance arise.

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pg. 9.

The insights and conclusions of Cormack and Grossman are perfectly in accord with my own experience and practice with predictive coding search efforts, both in messy real-world projects and in the four controlled scientific tests I have done over the last several years (only two of which have been reported to date; the fourth is still in progress). I agree that a relevancy approach that emphasizes high-ranked documents for training is one of the most powerful search tools we now have. So too is uncertainty training (mid-ranked) when used judiciously, as well as keywords and a number of other methods. All the many tools we have to find both relevant and irrelevant documents for training should be used, depending on the circumstances, including even some random searches.

In my view, we should never use just one method to select documents for machine training and ignore the rest, even when it is a good method, as Cormack and Grossman have shown CAL to be. When the one method selected is the worst of all possible methods, as random search has now been shown to be, the monomodal approach is a recipe for ineffective, overpriced review.
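To make the differences between the three protocols concrete, here is a minimal sketch of how each one might pick the next batch of documents for attorney review, given a classifier's predicted relevance scores. This is my own illustration, not code from the study; the function name and data layout are assumptions made for clarity.

```python
import random

def select_batch(scores, k, method, rng=None):
    """Pick k documents for attorney review under one of the three
    training protocols compared in the Cormack-Grossman study.

    scores -- dict mapping doc_id to the classifier's predicted
              probability of relevance (0.0 to 1.0)
    method -- 'CAL': relevance feedback, take the highest-ranked docs
              'SAL': uncertainty sampling, take docs nearest 0.5
              'SPL': passive learning, take a simple random sample
    """
    rng = rng or random.Random(0)
    ids = list(scores)
    if method == "CAL":
        return sorted(ids, key=scores.get, reverse=True)[:k]
    if method == "SAL":
        return sorted(ids, key=lambda d: abs(scores[d] - 0.5))[:k]
    if method == "SPL":
        return rng.sample(ids, k)
    raise ValueError("unknown method: " + method)

# Example: four documents with predicted relevance probabilities.
scores = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.7}
print(select_batch(scores, 2, "CAL"))  # highest-ranked: ['a', 'd']
print(select_batch(scores, 2, "SAL"))  # most uncertain: ['b', 'd']
```

Note how CAL surfaces the documents most likely to be relevant (and thus legally significant), while SAL surfaces the marginal ones near the decision boundary, and SPL ignores the scores entirely.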

Why All the Foolishness with Random Search?

As shown in Part One of this article, it is only common sense to use what you know to find training documents, and not to rely on the so-called easy way of rolling dice. A random chance approach is essentially a fool’s method of search. The search for evidence to do justice is too important to leave to chance. Cormack and Grossman did the legal profession a favor by taking the time to prove the obvious in their study. They showed that even very simplistic multimodal search protocols, CAL and SAL, do better at machine training than monomodal random-only.

Information scientists already knew this rather obvious truism: that multimodal is better, that the roulette wheel is not an effective search tool, that random chance just slows things down and is ineffective as a machine training tool. Yet Cormack and Grossman took the time to prove the obvious because the legal profession is being led astray. Many are actually using chance as if it were a valid search method, although perhaps not in the way they describe. As Cormack and Grossman explained in their report:

While it is perhaps no surprise to the information retrieval community that active learning generally outperforms random training [22], this result has not previously been demonstrated for the TAR Problem, and is neither well known nor well accepted within the legal community.

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pg. 8.

As this quoted comment suggests, everyone in the information science search community already knew that the random-only approach to search is inartful. So do most lawyers, especially the ones with years of hands-on experience in searching for relevant ESI. So why in the world is random-only search still promoted by some software companies and their customers? Is it really to address the so-called problem of “not knowing what you don’t know”? That is the alleged inherent bias of using knowledge to program the AI. The all-random approach is also supposed to prevent overt, intentional bias, where lawyers might try to mistrain the AI search algorithm on purpose. These may be the reasons vendors state, but there must be others, because these excuses do not hold water. This was addressed in Part One of this article.

The bias-avoidance claim must be just an excuse, because there are many better ways to counter the myopic effects of search driven too narrowly. There are many methods and software enhancements that can be used to avoid overlooking important, not-yet-discovered types of relevant documents. For instance, allow machine selection of uncertain documents, as was done here with the SAL protocol. You could also include some random document selection in the mix, rather than making the whole thing random. It is not all or nothing, not logically at least, though perhaps it is as a practical matter for some software.
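One way to picture this blended approach is to allocate each training batch across several selection methods at once, so that no single strategy dominates and some random picks remain as a safety net against the unknown. The sketch below is my own toy illustration, not any vendor's implementation; the method names and the example weights are arbitrary assumptions.

```python
import random

def mixed_training_batch(scores, k, weights, seed=0):
    """Draw one training batch of about k documents, split across
    several selection methods rather than relying on any one alone.

    scores  -- dict of doc_id -> predicted relevance probability
    weights -- dict of method name -> fraction of the batch, e.g.
               {"relevance": 0.5, "uncertainty": 0.25, "random": 0.25}
    """
    rng = random.Random(seed)
    pool = dict(scores)  # documents not yet chosen for this batch
    batch = []
    for method, frac in weights.items():
        n = min(round(k * frac), len(pool))
        if method == "relevance":      # highest-ranked documents
            picks = sorted(pool, key=pool.get, reverse=True)[:n]
        elif method == "uncertainty":  # closest to the 0.5 boundary
            picks = sorted(pool, key=lambda d: abs(pool[d] - 0.5))[:n]
        elif method == "random":       # pure chance, as a safety net
            picks = rng.sample(list(pool), n)
        else:
            raise ValueError("unknown method: " + method)
        batch.extend(picks)
        for d in picks:                # no document selected twice
            del pool[d]
    return batch
```

With weights of 50/25/25 on a batch of four, two documents come from the top of the ranking, one from the uncertain middle, and one at random, so the reviewer's knowledge drives the search while chance still gets a small vote.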

My preferred solution to the problem of “not knowing what you don’t know” is to use a combination of all those methods, buttressed by a human searcher who is aware of the limits of knowledge. I mean, really! The whole premise behind using random-only search to avoid the self-looping trap of “not knowing what you don’t know” assumes that the lawyer searcher is a naive boob or a dishonest scoundrel. It assumes lawyers are unaware that they don’t know what they don’t know. Please, we know that perfectly well. All experienced searchers do. This insight is not the exclusive knowledge of engineers and scientists. Very few attorneys are that arrogant and self-absorbed, or that naive and simplistic in their approach to search.

No, this whole you-must-use-random-only-search-to-avoid-prejudice argument is just a smoke screen to hide the real reason a vendor sells software that only works that way. The real reason is that poor software design decisions were made in a rush to get predictive coding software to market. Software was designed to use only random search because such software was easy and quick to build. It allowed for quick implementation of machine training. Such simplistic types of AI software may work better than poorly designed keyword searches, but they are still far inferior to more complex machine training systems, as Cormack and Grossman have now proven. They are inferior to a multimodal approach.

The software vendors with random-only training need to move on. They need to invest in their software and adopt a multimodal approach. In fact, it appears that many have already done so, or are in the process. Yes, such software enhancements take time and money to implement. But we need software search tools for adults. Stop all of the talk about easy buttons. Lawyers are not simpletons. We embrace hard work. We are masters of complexity. Give us choices. Empower the software so that more than one method can be used. Do not force us to use only random selection.

We need software tools that respect the ability of attorneys to perform effective searches for evidence. This is our sandbox. That is what we attorneys do: we search for evidence. The software companies are here to give us tools, not to tell us how to search. Let us stop the arguments and move on to discussing more sophisticated search methods, and tools that empower them.

Attorneys want software with the capacity to integrate all search functions, including random, into a multimodal search process. We do not want software with only one type of machine training ability, be it CAL, SAL or SPL. We do not want software that can only do one thing, whose vendor then builds a false ideology around that one capacity, claiming its method is the best and only way. These are legal issues, not software issues.

Attorneys do not just want one search tool, we want a whole tool chest. The marketplace will sort out whose tools are best, so will science. For vendors to remain competitive they need to sell the biggest tool chest possible, and make sure the tools are well built and perform as advertised. Do not just sell us a screwdriver and tell us we do not need a hammer and pliers too.

Leave the legal arguments as to reasonability and rules to lawyers. Just give us the tools and we lawyers will find the evidence we need. We are experts at evidence detection. It is in our blood. It is part of our proud heritage, our tradition.

Finding evidence is what lawyers do. The law has been doing this for millennia. Think back to the story of the judicial decision of King Solomon. He awarded the child to the woman he saw cry in response to his sham decision to cut the baby in half. He based his decision on the facts, not ideology. He found the truth in clever ways built around facts, around evidence.

Lawyers always search to find evidence so that justice can be done. The facts matter. It has always been an essential part of what we do. Lawyers always adapt with the times. We always demand and use the best tools available to do our job. Just think of Abraham Lincoln, who readily used the telegraph, the great new high-tech invention of his day. When you want to know the truth of what happened in a recent event, you hire a lawyer, not an engineer or a scientist. That is what we are trained to do. We separate the truth from the lies. With great tools we can and will do an even better job.

Many vendors of multimodal software already understand all of this. They build software that empowers attorneys to leverage their knowledge and skills. That is why we use their tools. Empowerment of attorneys with the latest AI tools empowers our entire system of justice. That is why the latest Cormack Grossman study is so important. That is why I am so passionate about this. Join with us in this. Demand diversity and many capacities in your search software, not just one.

Vendor Wake Up Call and Plea for Change

My basic message to all manufacturers of predictive coding software who use only one type of machine training protocol is to change your ways. I mean no animosity at all. Many of you have great software already; it is just the monomodal method built into your predictive coding features that I challenge. This is a plea for change, for diversity. Sell us a whole tool chest, not just a single, super-simple tool.

Yes, upgrading software takes time and money. But all software companies need to do that anyway to continue supplying tools to lawyers in the Twenty-First Century. Take this message as both a wake-up call and a respectful plea for change.

Dear software designers: please stop trying to make the legal profession look only under the random lamp. Treat your attorney customers like mature professionals who are capable of complex analysis and skills. Do not just assume that we do not know how to perform sophisticated searches. I am not the only attorney with multimodal search skills. I am just the only one with a blog who is passionate about it. There are many out there with very sophisticated skills and knowledge. They may not be as old (I prefer to say experienced) and loud-mouthed (I prefer to say outspoken) as I am, but they are just as skilled. They are just as talented. More importantly, their numbers are growing rapidly. It is a generational thing too, you know. Your next generation of lawyer customers is just as comfortable with computers and big data as I am, maybe more so. Do you really doubt that Adam Losey and his generation will surpass our accomplishments with legal search? I don’t.

Dear software designers: please upgrade your software and get with the multi-feature program. Then you will have many new customers, and they will be empowered customers. Do not have the money to do that? Show your CEO this article. Lawyers are not stupid. They are catching on, and they are catching on fast. Moreover, these scientific experiments and reports will keep coming, too. The truth will come out. Do you want to survive the inevitable vendor closures and consolidation? Then you need to invest in more sophisticated, fully featured software. Your competitors are.

Dear software designers: please abandon the single-feature approach; then you will be welcome in the legal search sandbox. I know that the limited-functionality software that some of you have created is really very good. It already has many other search capacities. It just needs to be better integrated with predictive coding. Apparently some single-feature software already produces decent results, even with the handicap of random-only training. Continue to enhance and build upon your software. Invest in the improvements needed to allow for full multimodal, active, judgmental search.


A random-only search method for selecting predictive coding training documents is ineffective. The same applies to any other training method applied to the exclusion of all others. Any experienced searcher knows this. Software that relies solely on a random-only method should be enhanced and modified to allow attorneys to search where they know. All types of training techniques should be built into AI-based software, not just random. Random may be easy, but it is foolish to search only under the lamp post. It is foolish to turn a blind eye to what you know. Attorneys, insist on having your own flashlight that empowers you to look wherever you want. Shine your light wherever you think appropriate. Use your knowledge. Equip yourself with a full tool chest that allows you to do that.

Latest Grossman and Cormack Study Proves Folly of Using Random Search For Machine Training – Part Three

July 27, 2014

This is part three of what has now become a four part blog: Latest Grossman and Cormack Study Proves Folly of Using Random Search for Machine Training – Part One and Part Two.

Professor Gordon Cormack

Yes, my article on this experiment and report by Professor Gordon Cormack and attorney Maura Grossman is rapidly becoming as long as the report itself, and, believe it or not, I am not even going into all of the aspects in this very deep, multifaceted study. I urge you to read the report. It is a difficult read for most, but worth the effort. Serious students will read it several times. I know I have. This is an important scientific work presenting unique experiments that tested common legal search methods.

The Cormack Grossman paper was peer reviewed by other scientists and presented at the major event for information retrieval scientists, the annual ACM SIGIR conference. ACM is the Association for Computing Machinery, the world’s largest educational and scientific computing society. SIGIR is the Special Interest Group on Information Retrieval section of ACM. Hundreds of scientists and academics served on organizing committees for the 2014 SIGIR conference in Australia. They came from universities and large corporate research labs from all over the world, including Google, Yahoo, and IBM. Here is a list with links to all of the papers presented.

All attorneys who do legal search should have at least a rudimentary understanding of the findings of Cormack and Grossman on the predictive coding training methods analyzed in this report. That is why I am making this sustained effort to provide my take on it, and make their work a little more accessible. Maura and Gordon have, by the way, generously given of their time to try to ensure that my explanations are accurate. Still, any mistakes made on that account are solely my own.

Findings of Cormack Grossman Study

Here is how Cormack and Grossman summarize their findings:

The results presented here do not support the commonly advanced position that seed sets, or entire training sets, must be randomly selected [19, 28] [contra 11]. Our primary implementation of SPL, in which all training documents were randomly selected, yielded dramatically inferior results to our primary implementations of CAL and SAL, in which none of the training documents were randomly selected.

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pgs. 7-8.


Now for the details of the results comparing the previously described methods of CAL, SAL and SPL. First, let us examine the comparison between the CAL and SPL machine training methods. To refresh your memory, CAL is a simple type of multimodal training method wherein two methods are used: keyword search results are used in the first round of training, and in all following rounds, high-probability ranked search results are used. SPL is a pure random, monomodal method. With SPL all documents are selected for training by random sampling in all rounds.
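To make the contrast concrete, here is a minimal sketch, in Python, of how the two protocols pick each training batch. This is purely my own illustration, not the authors' code; the function and variable names are hypothetical, and it assumes some classifier has already assigned every document a probability-of-relevance score:

```python
import random

def select_batch(scores, reviewed, batch_size, method):
    """Pick the next training batch from unreviewed documents.

    scores:   dict mapping doc_id -> predicted probability of relevance
    reviewed: set of doc_ids the attorney has already coded
    method:   'CAL' takes the top-ranked (highest probability) documents;
              'SPL' draws a simple random sample.
    """
    candidates = [d for d in scores if d not in reviewed]
    if method == 'CAL':
        # Relevance feedback: train on the documents the machine
        # currently thinks are most likely to be relevant.
        return sorted(candidates, key=lambda d: scores[d], reverse=True)[:batch_size]
    if method == 'SPL':
        # Passive learning: train on a blind random draw.
        return random.sample(candidates, min(batch_size, len(candidates)))
    raise ValueError(f"unknown method: {method}")
```

The point to notice is that CAL looks where the machine already sees relevance, while SPL draws blind, under the lamp post or not.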


Cormack and Grossman found that the “CAL protocol achieves higher recall than SPL, for less effort, for all of the representative training-set sizes.” Id. at pg. 4. This means you can find more relevant documents using CAL than with a random method, and you can do so faster and thus at less expense.

To drill down even deeper into their findings it is necessary to look at the graphs in the report that show how the search progressed through all one hundred rounds of training and review for various document collections. This is shown for CAL v. SPL in Figure 1 of the report. Id. at pg. 5. The line with circle dots at the top of each graph plots the retrieval rate of CAL, the clear winner on each of the eight search tasks tested. The other three lines show the random approach, SPL, using three different training-set sizes. Cormack and Grossman summarize the CAL v. SPL findings as follows:

After the first 1,000 documents (i.e., the seed set), the CAL curve shows a high slope that is sustained until the majority of relevant documents have been identified. At about 70% recall, the slope begins to fall off noticeably, and effectively plateaus between 80% and 100% recall. The SPL curve exhibits a low slope for the training phase, followed by a high slope, falloff, and then a plateau for the review phase. In general, the slope immediately following training is comparable to that of CAL, but the falloff and plateau occur at substantially lower recall levels. While the initial slope of the curve for the SPL review phase is similar for all training-set sizes, the falloff and plateau occur at higher recall levels for larger training sets. This advantage of larger training sets is offset by the greater effort required to review the training set: In general, the curves for different training sets cross, indicating that a larger training set is advantageous when high recall is desired.


The Cormack Grossman experiment also compared the CAL and SAL methods. Recall that the SAL method is another simple multimodal method where only two methods are used to select training documents. Keywords are again used in the first round only, just like the CAL protocol. Thereafter, in all subsequent rounds of training, machine-selected documents are used, based on the machine’s uncertainty of classification. That means the search focuses on the mid-range ranked documents about which the machine is most uncertain.
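Again as a purely illustrative sketch of my own (hypothetical names, same assumed classifier scores as in the earlier sketch), uncertainty selection just means picking the unreviewed documents whose predicted probabilities sit closest to the 50% mark:

```python
def select_uncertain(scores, reviewed, batch_size):
    """SAL-style selection: the documents with scores nearest 0.5
    (maximum classifier uncertainty) are sent for attorney review.

    scores:   dict mapping doc_id -> predicted probability of relevance
    reviewed: set of doc_ids the attorney has already coded
    """
    candidates = [d for d in scores if d not in reviewed]
    # Distance from 0.5 measures how sure the machine is either way;
    # the smallest distances are the documents it is least sure about.
    return sorted(candidates, key=lambda d: abs(scores[d] - 0.5))[:batch_size]
```

Note that once all the mid-scoring documents are used up, the "nearest 0.5" documents left are the high and low scorers, which is exactly the catch-up effect the authors describe below.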


Cormack and Grossman found that “the CAL protocol generally achieves higher recall than SAL,” but the results were closer and more complex. Id. At one point in the training SAL became as good as CAL; it achieved a specific recall value with nearly the same effort as CAL from that point forward. The authors found that was because many high-probability documents began to be selected by the machine as uncertainty-selected documents. This happened after all of the mid-scoring documents had been used up. In other words, at some point the distinction between the two methods decreased, and more high-probability documents were used in SAL, in almost the same way they were used in CAL. That allowed SAL to catch up with CAL and, in effect, become almost as good.

This catch up point is different in each project. As Cormack and Grossman explain:

Once stabilization occurs, the review set will include few documents with intermediate scores, because they will have previously been selected for training. Instead, the review set will include primarily high-scoring and low-scoring documents. The high-scoring documents account for the high slope before the inflection point; the low-scoring documents account for the low slope after the inflection point; the absence of documents with intermediate scores accounts for the sharp transition. The net effect is that SAL achieves effort as low as CAL only for a specific recall value, which is easy to see in hindsight, but difficult to predict at the time of stabilization.

This inflection point and other comparisons can easily be seen in Figure 2 of the report (shown below). Id. at pg. 6. Again the line with circle dots at the top of each graph, the one that always starts off fastest, plots the retrieval rate of CAL. Again, it does better in each of the eight search tasks tested. The other three lines show the uncertainty approach, SAL, using three different training-set sizes. CAL does better than SAL in all eight of the matters, but the differences are not nearly as great as in the comparison between CAL and SPL.

Cormack and Grossman summarize the CAL v. SAL findings as follows:

Figure 2 shows that the CAL protocol generally achieves higher recall than SAL. However, the SAL gain curves, unlike the SPL gain curves, often touch the CAL curves at one specific inflection point. The strong inflection of the SAL curve at this point is explained by the nature of uncertainty sampling: Once stabilization occurs, the review set … (see quote above for the rest of this sentence.)

This experiment compared one type of simple multimodal machine training method with another. It found that with the data sets tested, and other standard procedures set forth in the experiment, the method which used high ranking documents for training, what William Webber calls the Relevance method, performed somewhat better than the method that used mid-ranked documents, what Webber calls the Uncertainty method.

This does not mean that the uncertainty method should be excluded from a full multimodal approach in real world applications. It just means that here, in this one experiment, albeit a very complex and multifaceted experiment, the relevance method outperformed the uncertainty method.

I have found that in the real world of very complex (messy, even) legal searches, it is good to use both high- and mid-ranked documents for training: what Cormack and Grossman call CAL and SAL, and what Webber calls Relevance and Uncertainty training. It all depends on the circumstances, including the all-important cost component. In the real world you use every method you can think of to help you find what you are looking for, not just one or two, but dozens.

Grossman and Cormack know this very well too, which I know from private conversations with them on this, and also from the conclusion to their report:

There is no reason to presume that the CAL results described here represent the best that can be achieved. Any number of feature engineering methods, learning algorithms, training protocols, and search strategies might yield substantive improvements in the future. The effect of review order and other human factors on training accuracy, and thus overall review effectiveness, may also be substantial.

Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pg. 9.

Ralph Losey with some of his many computer tools

My practical takeaway from the Cormack Grossman experiment is that focusing on high-ranking documents is a powerful search method. It should be given significant weight in any multimodal approach, especially when the goal is to quickly find as many relevant documents as possible. The “continuous” training aspect of the CAL approach is also intriguing: you keep doing machine training throughout the review project and batch reviews accordingly. This could become a project management issue, but if you can pull it off within proportionality and requesting-party constraints, it just makes common sense to do so. You might as well get as much help from the machine as possible, and keep getting its probability predictions for as long as you are still doing reviews and can make last-minute batch assignments accordingly.

I have done several reviews in such a continuous training manner without really thinking about the fact that the machine input was continuous, including my first Enron experiment. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. But this causes me to rethink the flow chart shown below that I usually use to explain the predictive coding process. The work flow shown is not a CAL approach, but rather a SAL type of approach, where there is a distinct stop in training after step five, and the review work in step seven is based on the last rankings established in step five.


The continuous work flow is slightly more difficult to show in a diagram, and to implement, but it does make good common sense if you are in a position to pull it off. Below is the revised workflow to update the language and show how the training continues throughout the review.


Machine training is still done in steps four and five, but then continues through steps four, five and seven. There are other ways it could be implemented, of course, but this is the CAL approach I would use in a review project where such complex batching and continuous training otherwise makes sense. Of course, it is not necessary in any project where the review in steps four and five effectively finds all of the relevant documents required. This is what happened in my Enron experiment. Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. There was no need to do a proportional final review, step seven, because all the relevant documents had already been reviewed as part of the machine training review in steps four and five. In the Enron experiment I skipped step seven and went right from step six to step eight, production. I have been able to do this in other projects as well.

Strengths of a Relevancy Weighted Type of CAL

The findings in this experiment as to the strengths of using Relevancy training confirm what I have seen in most of my search projects. I usually start with the high end documents to quickly help me to teach the machine what I am looking for. I find that this is a good way to start training. Again, it just makes common sense to do so. It is somewhat like teaching a human, or a dog for that matter. You teach the machine relevance classification by telling it when it is right (positive reinforcement), and when it is wrong. This kind of feedback is critical in all learning. In most projects this kind of feedback on predictions of highly probable relevance is the fastest way to get to the most important documents. For those reasons I agree with Cormack and Grossman’s conclusion that CAL is a superior method to quickly find the most relevant documents:

CAL also offers the reviewer the opportunity to quickly identify legally significant documents that can guide litigation strategy, and can readily adapt when new documents are added to the collection, or new issues or interpretations of relevance arise.

Id. But then again, I would never rely on just Relevancy CAL type searches alone. It gets results fast, but also tends to lead to a somewhat myopic focus on the high end where you may miss new, different types of relevant documents. For that reason, I also use SAL types of searches to include the mid range documents from the Uncertainty method. That is an important method to help the machine to better understand what documents I am looking for. As Cormack and Grossman put it:

The underlying objective of CAL is to find and review as many of the responsive documents as possible, as quickly as possible. The underlying objective of SAL, on the other hand, is to induce the best classifier possible, considering the level of training effort. Generally, the classifier is applied to the collection to produce a review set, which is then subject to manual review.

Id. at 8.

Similarity and other concept type search methods are also a good way to quickly find as many responsive documents as possible. So too are keyword searches, and not just in the first round, but for any round. Further, this experiment, which is already very complex (to me at least), does not include the important real world component of highly relevant versus merely relevant documents. I never just train on relevancy alone, but always include a hunt for the hot documents. I want to try to train the machine to understand the difference between the two classifications. Cormack and Grossman do not disagree. As they put it, “any number of feature engineering methods, learning algorithms, training protocols, and search strategies” could improve upon a CAL only approach.

There are also ways to improve the classifier in addition to focusing on mid-range probability documents, although I have found that the uncertainty method is the best way to improve relevance classifications. But it also helps to be sure your training on the low end is right, meaning review of some of the high-probability irrelevant documents. Both relevant and irrelevant training documents are helpful. Personally, I also like to include some random sampling, especially at first, to be sure I did not miss any outlier-type documents, and to be sure I have a good feel for the irrelevant documents of these custodians too. Yes, chance has its place too, so long as it does not take over and become the whole show.
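To illustrate what such a blended batch might look like in practice, here is a hypothetical sketch of my own, not any vendor's protocol or the study's method; the 60/30/10 split between relevance, uncertainty, and random selection is an arbitrary assumption chosen for illustration only:

```python
import random

def multimodal_batch(scores, reviewed, batch_size, weights=(0.6, 0.3, 0.1)):
    """Blend relevance, uncertainty, and random selection in one batch.

    scores:   dict mapping doc_id -> predicted probability of relevance
    reviewed: set of doc_ids the attorney has already coded
    weights:  fractions of the batch drawn by each method
              (hypothetical 60/30/10 split, for illustration only)
    """
    candidates = [d for d in scores if d not in reviewed]
    n_rel = int(batch_size * weights[0])
    n_unc = int(batch_size * weights[1])
    # Relevance: top-ranked documents, as in CAL.
    batch = sorted(candidates, key=lambda d: scores[d], reverse=True)[:n_rel]
    # Uncertainty: mid-range documents, as in SAL.
    rest = [d for d in candidates if d not in batch]
    batch += sorted(rest, key=lambda d: abs(scores[d] - 0.5))[:n_unc]
    # Random: a small blind draw to catch outliers.
    rest = [d for d in candidates if d not in batch]
    batch += random.sample(rest, min(batch_size - len(batch), len(rest)))
    return batch
```

In a real project the proportions would shift with circumstances and cost; the sketch only shows chance getting a seat at the table without becoming the whole show.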

Supplemental Findings on Random Search

In addition to comparing CAL with SAL and SPL, Cormack and Grossman experimented with what would happen to the effectiveness of both the CAL and SAL protocols if more random elements were added to the methods. They experimented with a number of different variables, including substituting random selection for keyword search in the initial round of training (the seed set).

As you would expect, the general result was to decrease the effectiveness of every search method wherein random was substituted, whether for keyword, high-ranking relevance, or mid-ranking relevance (uncertainty) selection. The negative impact was strongest in datasets where prevalence was low, which is typical in litigation. Cormack and Grossman tested eight datasets where the prevalence of responsive documents varied from 0.25% to 3.92%, which, as they put it, “is typical for the legal matters with which we have been involved.” The size of the sets tested ranged from 293,000 documents to just over 1.1 million. The random-based search of the lowest-prevalence dataset tested, matter 203, the one with a 0.25% prevalence rate, was, in their words, a spectacular failure. Conversely, the negative impact was lessened with higher-prevalence datasets. Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pg. 7.
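The arithmetic behind that spectacular failure is easy to check: the expected number of relevant documents in a simple random sample is just the prevalence rate times the sample size. A quick back-of-the-envelope illustration using the study's reported prevalence figures (the 1,000-document sample size here is my own round number, not the study's):

```python
def expected_relevant(prevalence, sample_size):
    """Expected count of relevant documents in a simple random sample."""
    return prevalence * sample_size

# At the study's lowest reported prevalence (0.25%), a 1,000-document
# random sample is expected to contain only about 2 or 3 relevant
# documents -- far too few positive examples to teach a classifier much.
low = expected_relevant(0.0025, 1000)    # ~2.5 relevant documents

# Even at the study's highest reported prevalence (3.92%), a random
# sample of the same size yields only a modest number of positives.
high = expected_relevant(0.0392, 1000)   # ~39 relevant documents
```

This is why the random-only (SPL) protocol struggles most exactly where litigation collections usually sit: at the low end of the prevalence range.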

Cormack and Grossman responded to the popular misconception that predictive coding does not work in such low prevalence datasets.

Others assert that these are examples of “low-prevalence” or “low-richness” collections, for which TAR is unsuitable [19]. We suggest that such assertions may presuppose an SPL protocol [11], which is not as effective on low-prevalence datasets. It may be that SPL methods can achieve better results on higher-prevalence collections (i.e., 10% or more responsive documents).

Id. at 9.

In fact, information scientists have been working with low prevalence datasets for decades, which is one reason Professor Cormack had a ready collection of pre-coded documents by which to measure recall, a so-called gold standard of assessments from prior studies. Cormack and Grossman explain that the lack of pre-tested datasets with high prevalence is the reason they did not use such collections for testing. They also speculate that if such high prevalence datasets are tested, then the random only (SPL) method would do much better than it did in the low prevalence datasets they used in their experiment.

However, no such collections were included in this study because, for the few matters with which we have been involved where the prevalence exceeded 10%, the necessary training and gold-standard assessments were not available. We conjecture that the comparative advantage of CAL over SPL would be decreased, but not eliminated, for high-prevalence collections.


They are probably right: if the datasets have a higher prevalence, then the chances are that random samples will find more relevant documents for training. But that still does not make the blind draw a better way to find things than looking with your eyes wide open. Plus, the typical way to attain high-yield datasets is by keyword filtering out large segments of the raw data before beginning a predictive coding search. When you keyword filter like that before beginning machine training, the chances are you will leave behind a significant portion, if not most, of the relevant documents. Keyword filtering often has low recall, or, when broad enough to include most of the relevant documents, it is very imprecise. Then you are back to the same low-prevalence situation.

Better to limit filtering before machine training to obviously irrelevant documents, or ESI not appropriate for training, such as non-text documents like photos, music and voice mail. Use other methods to search for those types of ESI. But do not use keyword filtering on text documents simply to create an artificially high prevalence just because the random-based software you use will only work that way. That is the tail wagging the dog.

For more analysis and criticism of using keywords to create artificially high prevalence, a practice Cormack and Grossman call Collection Enrichment, see another excellent article they wrote: Comments on “The Implications of Rule 26(g) on the Use of Technology-Assisted Review,” 7 Federal Courts Law Review 286 (2014) at pgs. 293-295, 300-301. This article also contains good explanations of the instant study with CAL, SAL and SPL. See especially Table 1 at pg. 297.

The negative impact of random elements on machine training protocols is a no duh to experienced searchers. See, e.g., the excellent series of articles by John Tredennick, including his review of the Cormack Grossman study: Pioneering Cormack/Grossman Study Validates Continuous Learning, Judgmental Seeds and Review Team Training for Technology Assisted Review.

It never helps to turn to lady luck, to random chance, to improve search. Once you start relying on dice to decide what to do, you are just spinning your wheels.

Supplemental Findings on Keywords and Random Search

Cormack and Grossman also tested what would happen if keywords were used instead of random selections, even when the keywords were not tested first against the actual data. This poor practice of using unverified keywords is what I call the Go Fish approach to keyword search. Child’s Game of “Go Fish” is a Poor Model for e-Discovery Search (October 2009). Under this naive approach attorneys simply guess what keywords might be contained in relevant documents without testing how accurate their guesses are. It is a very simplistic approach to keyword search, yet it is still widely employed in the legal profession. This approach has been criticized by many, including Judge Andrew Peck in his excellent Gross Construction opinion, the so-called wake-up call for NY attorneys on search. William A. Gross Construction Associates, Inc. v. American Manufacturers Mutual Insurance Co., 256 F.R.D. 134 (S.D.N.Y. 2009).

Cormack and Grossman also tested what would happen if such naive keyword selections were used instead of the high or mid probability methods (CAL and SAL) for machine training. The naive keywords used in these supplemental comparison tests did fairly well. This is consistent with my multimodal approach, where all kinds of search methods are used in all rounds of training.

The success of naive keyword selection for machine training is discussed by Cormack and Grossman as an unexpected finding (italics and parens added):

Perhaps more surprising is the fact that a simple keyword search, composed without prior knowledge of the collection, almost always yields a more effective seed set than random selection, whether for CAL, SAL, or SPL. Even when keyword search is used to select all training documents, the result is generally superior to that achieved when random selection is used. That said, even if (random) passive learning is enhanced using a keyword-selected seed or training set, it (passive learning) is still dramatically inferior to active learning. It is possible, in theory, that a party could devise keywords that would render passive learning competitive with active learning, but until a formal protocol for constructing such a search can be established, it is impossible to subject the approach to a controlled scientific evaluation. Pending the establishment and scientific validation of such a protocol, reliance on keywords and (random) passive learning remains a questionable practice. On the other hand, the results reported here indicate that it is quite easy for either party (or for the parties together) to construct a keyword search that yields an effective seed set for active learning.

Id. at 8.

Cormack and Grossman summarize their findings on the impact of keywords in the first round of training (seed set) on CAL, SAL and SPL:

In summary, the use of a seed set selected using a simple keyword search, composed prior to the review, contributes to the effectiveness of all of the TAR protocols investigated in this study.

Keywords still have an important place in any multimodal, active, predictive coding protocol. This is, however, completely different from using keywords, especially untested naive keywords, to filter out the raw data in a misguided attempt to create high prevalence collections, all so that the random method (passive) might have some chance of success.

To be continued . . . in Part Four I will conclude with final opinions and analysis and my friendly recommendations for any vendors still using random-only training protocols. 

