Predictive Coding’s Erroneous Zones Are Emerging Junk Science

GUEST BLOG by J. William (Bill) Speros. Editor’s Preface. Attorney Bill Speros here answers my call for critical papers on predictive coding, a call I made just last week in Reinventing the Wheel: My Discovery of Scientific Support for “Hybrid Multimodal” Search. Although Bill Speros keeps a lower profile than other experts in the field, most insiders know him as one of the country’s leading independent consultants on electronic evidence, with over 24 years of experience. Most recently, just weeks after the Madoff Ponzi scheme collapsed, Bill was brought in as the technical guru for the Madoff investigation. The Trustee for the bankruptcy proceedings of Bernard L. Madoff Investment Securities hired Bill to kick off the e-discovery aspects of the investigation. Speros spent 4,000 hours as the interim Director of Litigation Support and E-Discovery.

When it comes to legal search, Bill knows what he is talking about from direct personal experience with thousands of complex ESI search and review projects. This essay by William Speros brings much-needed critical light on some of the poor methods of search employed by many software vendors and attorneys in the field of predictive coding.


Predictive Coding’s Erroneous Zones Are Emerging Junk Science

by: J. William Speros

To a carpenter with a hammer, everything is a nail.

To a bite mark analyst, fire cause analyst, or fingerprint analyst, their conclusions are conclusive.

It is convenient to think that our intellectual profession wouldn’t suffer fools gladly or sustain lies for long. It is convenient, but wrong.

PBS’ Frontline’s Forensic Tools: What’s Reliable and What’s Not-So-Scientific dispelled the infallibility, and in some instances, the validity, of analytical techniques long relied upon by our legal profession. Even if those techniques were not botched or biased, their validity ranges from bought-and-paid-for infomercials to, at best, an approximation.

How did our intellectual and intelligent legal profession grant so much authority to the junk science that Frontline debunked? I suspect that back then attorneys and judges (and experts and vendors) did with those junk sciences just what we are doing now with respect to predictive coding: allowing claims, however unjustified and erroneous, to form the basis of our practices, to influence our precedent and to accrue authority.

This article discusses four erroneous claims about predictive coding that oftentimes the trade press announces in breathless terms and legal arguments describe in breathtakingly inaccurate terms:

  1. Using a full-text search to identify prospectively responsive documents and then employing predictive coding to eliminate those that are not responsive. This is erroneous because it over-relies and under-delivers. It arbitrarily places documents out-of-sight and, therefore, out-of-mind.
  2. Pulling a random sample of documents to train the initial seed set. This is erroneous because it looks for relevance in all the wrong places. It turns a blind eye to what is staring you in the eye.
  3. Identifying “magic numbers” of necessary predictive coding assessment “iterations” and of the number of responsive documents within a randomly accumulated population. This is erroneous because you may not be able to get there from here. You don’t know what isn’t yet known.
  4. Asserting that “Predictive Coding software is the gold standard for document retrieval in complex matters.” This is erroneous because it is thinking as though predictive coding is a box.


1. Using a full-text search to identify prospectively responsive documents even if followed by using predictive coding to eliminate those that are not responsive

We see an erroneous claim that predictive coding “is employed” even though the population of documents subjected to it is constrained.

A. Over-Rely and Under-Deliver

Full-text searching is a technique and a technology, but not an ideology. Its rate of success depends on the mechanism and also on the content against which it is applied: Does the search “Losey*” retrieve “Losey” and also “LoseyRalph”? In technical terms, full-text searches’ rate of success may be tested and verified.
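That kind of verification is easy to sketch. The toy search below assumes one particular wildcard convention (a trailing `*` matching any word continuation); it is not any vendor’s actual engine, and real engines differ in tokenization, stemming, and wildcard rules.

```python
import re

def wildcard_search(term, documents):
    """Match a trailing-* wildcard term against tokens in each document.

    A toy model of one full-text search convention, for illustration only.
    """
    pattern = re.compile(r"\b" + re.escape(term.rstrip("*")) + r"\w*",
                         re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]

docs = [
    "Memo from Losey re: discovery",
    "Forwarded by LoseyRalph on Friday",
    "Unrelated budget spreadsheet",
]

hits = wildcard_search("Losey*", docs)
# Under this convention "Losey*" matches both "Losey" and "LoseyRalph" --
# exactly the sort of behavior that can be tested and verified in advance.
```

Whether that behavior is what the researchers intended is, of course, the ideological question the technical test cannot answer.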

Stopping with technical measures of success, however, stops short; it focuses on process rather than on results. While the technical process enables, the researchers’ ideology constrains.

In one current matter, for example, the producing party employed a full-text search to constrain the production to focus on particular text that the researchers felt was published within responsive documents. But according to the requesting party the search spoke only about a limited number of relevant concepts. And even with respect to those subsets, the requesting party says, the search terms were incomplete.

All of this invites statistical analysis of full-text searches’ recall and response rates. But those discussions are not only obtuse, they are irrelevantly abstract: whether the results were successful is measured against a good faith standard based on the intelligence, experience, tasks, and testing applied in the current matter.

And in the current matter, the producing party seems to have excluded not only the requesting parties from the search term identification process, but also the search terms the producing party employed in prior, similar matters. If true, some would call that arrogance.

More to the point here, the producing party’s constraining the population of potentially responsive documents via a single search, using search terms conceived by attorneys without their considering natural language, jargon, terms-of-art or other real-world realities, is a demonstration of their presumed clairvoyance.

Whether as a measure of presumed arrogance or presumed clairvoyance, using single-pass full-text review reflects an ideology that ought to be abandoned.

Or re-abandoned.

More than a decade ago, for example, a client asked for help in reviewing a collection of 950,000+ documents: find all that describe “a natural person who has not reached the age of majority.” Naturally, attorneys insisted that we search for the word “minor.”

What we found, of course, was no surprise. Of the 40% of the documents that contained the word “minor” or “minors,” virtually none used the word to mean “someone under the age of majority,” a.k.a. “a kid.”
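The arithmetic behind that failure is worth spelling out. The collection size and hit rate come from the anecdote above; the count of truly “kid”-related hits is hypothetical, standing in for “virtually none.”

```python
# Illustrative arithmetic only: 950,000+ documents and a 40% hit rate are
# from the anecdote; the relevant-hit count is a hypothetical stand-in.
collection_size = 950_000
hit_rate = 0.40                                  # docs containing "minor(s)"
keyword_hits = int(collection_size * hit_rate)   # 380,000 documents to review

truly_about_kids = 500                           # hypothetical: "virtually none"
precision = truly_about_kids / keyword_hits

print(f"{keyword_hits:,} keyword hits, precision = {precision:.4%}")
# A 40% hit rate with near-zero precision maximizes review cost while
# contributing almost nothing toward finding documents about kids.
```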

Instead, we “linguistic bird-dogs” became aware of indicia of documents about “minors” that were not necessarily synonyms of the word “minor” but instead were associated with minors’ activities, relationships, and status including, for example:

[Bar chart of example terms omitted.]

Parenthetically, we performed this analysis before predictive coding systems were developed. Since then, however, we tested the same collection of documents using an early predictive coding system. Interestingly, we found that it failed to find those linguistic associations and, therefore, failed to identify “kid”-related documents.

To be clear, those search terms didn’t spontaneously come to mind. And they didn’t simply prove meaningful. We derived those terms because we were motivated by ideological humility: we could not anticipate — no, anticipate contemplates preexisting experience and insight — we could not guess what search terms worked.

Even back then thoughtful researchers knew that single-cycle, fire-and-forget searches could not succeed. Yet increasingly such searches are being employed by attorneys, who perhaps hope that predictive coding will protect them. Sometimes, though, it is too late to be saved.

B. Out of sight is out of mind

Once documents are excluded from the review collection, the documents are out of mind. In this context, out-of-mind means that they may never be produced. After all, once other documents have been searched and read and studied and reviewed for privilege and then produced, it takes researcher discipline — nearly to the point of heroics — to go back upstream to recover and reconsider previously discarded documents.

In a current matter, that was not the producing party’s intent. The producing party merely offered to compile a random sample from the previously discarded collection. As will be discussed below, that itself is “looking for relevance in all the wrong places” and “turning a blind eye to what is staring them in the eye.”

Worse, excluding responsive documents from the predictive coding space stunts its intelligence. That happens because words present in the responsive but excluded documents are not available to develop and grow the predictive coding engine’s insights. By improperly withholding responsive documents from the predictive coding engine’s analysis, attorneys are not only denying the likelihood that responsive documents will ever be produced but also dumbing down the predictive coding intelligence and, thereby, driving down its value.

This is as much about starving the predictive coding engine of relevant documents as it is about failing to provide adequate informational nutrients.

2. Pulling a random sample of documents to train the initial seed set

Some attorneys and vendors recommend teaching the predictive coding system what is relevant by assessing a randomly accumulated set of documents. That is erroneous for at least two reasons: it looks for relevance in all the wrong places, and it turns a blind eye to what is staring you in the eye.

A. Looking for Relevance in All the Wrong Places.

Predictive coding finds target documents that are “like” particular exemplar documents. As to how the prospective documents are selected, here is what seems to be an unfortunate emerging standard:

  1. “Counsel said he selected…” or
  2. “This Predictive Coding workflow begins with the identification of a ‘seed set’ or initial group of relevant documents that is developed…” or
  3. “The system presents a series of randomly chosen documents for the reviewer to indicate which documents are responsive from which is built the seed set.”

Do you catch the problems?

The first two methods are vague as to how the documents were gathered: “selected” or “identified” by thoughtful consideration, meaningful search, or random selection?

If an otherwise precise disclosure employs the passive voice and vague verbs, be careful.

The third example is more forthcoming, but more alarming. Here is what it means: the predictive coding system pulls from the document collection a random sample of documents to serve as the predictive coding project’s initial and key seed set.

Yes, we have heard it said that an appropriately selected sample of several thousand documents from a much larger population may provide, with some statistical certainty, accurate insights into the larger population.

After all, they continue, pulling a random set of documents for the seed set is akin to pulling a random sample in a presidential election: “If the election were held today would you vote for Mr. Romney or Mr. Obama?” seems an apt analogy to, “If you had to decide right now, is this document irrelevant or relevant?”

But researchers don’t start with that question.

Here is the presidential poll question applied to document search: “In how many ways do people decide for whom to vote and what words do they use to express that process?”

Thoughtful researchers don’t try to answer that sort of question by talking to a large number of people who aren’t inclined to think about the issue.  Nor do they hope to learn about relevant documents by examining irrelevant ones.

Yet, regrettably, some attorneys are forming their predictive coding seed sets from randomly pulled documents.

If our profession continues to develop seed sets based on random searches, here is the natural implication: dilution. This approach encourages producing parties who wish to hide the truth to accumulate as many documents as possible to reduce the chance that the random pull will select responsive documents for prospective inclusion into the seed set.
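The dilution incentive can be quantified with hypothetical numbers. The expected count of responsive documents landing in a random seed set is simply the sample size times the prevalence, so padding the collection with junk shrinks it proportionally.

```python
# A sketch of the dilution incentive; all collection sizes are hypothetical.
# Expected responsive docs in a random sample = sample_size * prevalence.

def expected_responsive_in_sample(responsive, population, sample_size):
    return sample_size * responsive / population

responsive = 2_000      # responsive documents (fixed)
seed_sample = 1_500     # size of the random seed set

honest = expected_responsive_in_sample(responsive, 100_000, seed_sample)
padded = expected_responsive_in_sample(responsive, 1_000_000, seed_sample)

print(f"honest collection (100k docs): ~{honest:.0f} responsive in seed set")
print(f"padded collection (1M docs):   ~{padded:.0f} responsive in seed set")
```

Tenfold dilution cuts the responsive documents in the seed set tenfold, without the producing party withholding a single document outright.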

The underlying business problem is that relevant documents are hidden among the clutter and the noise. That problem is not ameliorated by forming seed sets via randomly compiled sets that say little of interest but are so easily manipulated. Worse, this practice is erroneous because it seems to authorize producing parties’ attorneys to turn a blind eye to what they know or should know.

B. Turn a blind eye to what is staring you in the eye. 

Some attorneys employ random samples to populate seed sets, apparently because they:

  • Don’t know how to form the seed set in a better way or
  • Want to delegate responsibility to the computer “which said ‘so’,” or
  • Are emboldened by a statistical rationale premised on the claim that no one knows anything, so random is as good a place to start as anywhere.

But that random-start strategy—“pure Borg” Ralph Losey calls it—is particularly problematic because it denies that attorneys know what they are paid to know: where to look and what to find.

It is a well-known joke: at night a guy looks for his keys not where he dropped them but beneath a street light, where the ground is illuminated.

The random-start/pure-Borg reality is less funny and much worse: researchers are looking beneath the street light, finding little of value and then concluding that there is little of value elsewhere, either.

Certainly it is possible that the random-start/pure-Borg approach retrieves things of value. If so, the diligent researcher seeks out other such things throughout the entire document universe. For example, if the randomly generated starting document set finds “kickbacks” and “bribes” among the review set the predictive coding system — subject to the linguistic realities and technical constraints — may enable the researcher to find other documents containing those concepts.

Has the researcher found everything of value?

Many researchers may perceive that they “looked everywhere and found everything.”


That misperception emerges from confirmation bias. The researcher found valuable documents, no doubt. The researcher looked everywhere, no doubt. But the researcher’s starting position was a fatally weak one. The randomly generated set didn’t offer examples of all relevant concepts — “frequent flier miles” are a form of kickback, too — but only some of them. Chances are that the random sample generator (which pulls documents, not concepts) is more likely to provide the most common concepts and less likely to provide the less common ones.
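That likelihood can be put in rough numbers. The prevalence figures below are hypothetical, but the arithmetic is standard: the chance that a random sample of n documents contains at least one document expressing a concept of prevalence p is 1 - (1 - p)^n.

```python
# Chance a random seed sample surfaces a concept at least once; the concept
# names and prevalence figures are hypothetical illustrations.

def prob_concept_in_sample(prevalence, sample_size):
    return 1 - (1 - prevalence) ** sample_size

n = 1_000  # documents in the random seed sample
for concept, p in [("kickbacks", 0.05),
                   ("bribes", 0.02),
                   ("frequent flier miles", 0.0005)]:
    chance = prob_concept_in_sample(p, n)
    print(f"{concept:22s} prevalence {p:.2%}  chance in sample: {chance:.1%}")
```

Common concepts are all but guaranteed to appear; a concept expressed in one document in two thousand has well under an even chance of appearing at all, and so the seed set never learns it exists.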

Does that pose a problem?

Statisticians dismiss the problem: “Nothing is perfect and this doesn’t need to be…” “The process confirmed that there is a 95% probability that no other relevant documents exist…”

This analysis, however statistically defensible:

  • Compares statistically significant but legally unimportant numbers: The definition of “relevant” is constrained to those concepts presented in the sample and derived by the predictive coding process thereafter. That is interesting. But what is more important is the number of responsive concepts identified relative to the total number of relevant concepts that exist.
  • Uses a circular definition of responsiveness: The statistical approach improperly constrains the focus. It constrains analysis by considering only that which the researcher found within the random set of documents and within document sets the predictive coding system “recognizes to be like” them. In so doing, the statistical approach improperly assumes that the:
    • Predictive coding systems’ recognition is perfect. By contrast, as a technical matter, it is unreasonable to premise any analysis with the assumption that predictive coding can find all different words by which any concept may be expressed.
    • Researcher’s understanding is limited to the point of ignorance. By contrast, as a legal matter, it is expected that attorneys be knowledgeable about the disputed facts and how concepts about the matter may be expressed.
  • Confuses prevalence with probativeness and persuasiveness: While common relevant concepts may be probative and persuasive, oftentimes they are simply redundant — what Ralph Losey calls “irrelevantly relevant.” Normally, or at least frequently, uncommon relevant documents are the most persuasive. “Smoking guns” tend to be scarce.

The random-start/pure-Borg strategy assumes that attorneys suffer from ignorance or amnesia and encourages them to stumble forward from a random spot, confirming that they found what they were looking for.

Again, attorneys cannot be expected and should not presume to guess what particular language may be employed by parties to express concepts. Nevertheless, attorneys are paid to know what concepts are important to particular matters.

Those concepts include, for example:

  • What: Disputed facts and how ideas are expressed in them, including terms of art and jargon.
  • When: Time lines and life-cycle states.
  • Who: Custodians and cast of characters (business groups, roles, key custodians, etc.).
  • Where: Technical environment, file types, etc.

Those are the concepts the expression of which can serve to capably identify and assess prospectively responsive documents. They — not randomly identified and conspicuously irrelevant documents — are the stuff of which competent seed sets are made.

3. There are “magic numbers” of predictive coding assessment “iterations” and of the minimum number of responsive documents within a randomly accumulated population

Like sailors seeking “any port in a storm,” attorneys who fear ambiguous document-production duties seek precision from higher authorities. Hoping to satisfy good faith document production duties, attorneys want to know:

  • What is the magic number of iterations necessary?
  • When assessing random samples to confirm that prior document assessments are complete and accurate, what is the minimum number of relevant documents that must be considered?

While it is natural for attorneys to seek that clarification, it is erroneous to specify it as an absolute value.

A. May not be able to get there from here. 

Requiring a minimum number of iterations or, conversely, boasting about performing a particular number of iterations, is as erroneous as specifying:

  • How many times must the carpenter strike the nail?
  • How many steps are necessary to get me from here to where I’ll be safe?
  • How many edits until this document is done?  (Yes, I can imagine your answer: more; many more.)

Whether the question relates to strikes, steps, edits, or iterations, the level of effort it takes to finish a project depends upon many factors, including something they all have in common: a starting point. And in complicated projects like evidence management projects, starting points are not likely known and may not be knowable because the status at any time is proprietary, vague, tribal, secret… anything but conclusive. If the starting status isn’t conclusively known, then how can the steps — in this context, iterations — that lead to a fair end point be conclusively prescribed or even ordered?

While that may disappoint attorneys who bear the duty to make a good faith effort to meet abstract duties, they must learn to accept that there is no “magic number” of iterations.

B. Don’t know what isn’t yet known.

There are various fascinating claims that a “magic number” of responsive documents must be found within a random set to serve as the basis to affirm the predictive coding approach. It is enticing to seek protection in that number. And statisticians claim it is absolutely statistical.

But however statistically interesting it may be, it isn’t operable because:

  • The test itself fails when within the document population there are fewer relevant documents than the “magic number” would require.
  • The value of the test fails if the sampling never pulls in (lower probability but) critically important documents.

Consider this question from a different context: “What is the magic number of places in my house in which I should look to find all my footwear?” Auto-magically, someone randomly searches and brings me back a collection of stuff that includes the (they say) statistically significant “magic number” of 384 socks.

Does that mean I don’t need to look in my dryer (where, particularly in busy times, I keep most of my clothes)? Or in my closet (because shoes are footwear, too)? Or in my garage (because muck boots and roller blades are footwear, too)?
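The 384 in the joke is no accident: it is the familiar worst-case sample size for a 95% confidence level and a ±5% margin of error. The sketch below shows where it comes from, and what it does, and does not, promise.

```python
# The classic "magic number" is the worst-case sample size for estimating a
# proportion at 95% confidence with a +/-5% margin of error:
#     n = z^2 * p * (1 - p) / e^2
z = 1.96    # z-score for a two-sided 95% confidence level
p = 0.5     # worst-case (maximum-variance) assumed prevalence
e = 0.05    # margin of error

n = z**2 * p * (1 - p) / e**2
print(f"required sample size: {n:.1f} (commonly quoted as ~384)")

# The formula presumes the sample was drawn from the whole population of
# interest. It says nothing about footwear in rooms that were never sampled,
# just as it says nothing about documents culled before sampling began.
```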

It is natural for researchers to want a “magic number” against which to gauge their progress. They, too, must learn to accept that there is no “magic number” of responsive documents in samples to prove that the document review is complete.

4. “Predictive Coding software is the gold standard for document retrieval in complex matters”

Think as though it’s a box.  Regularly we read reports that court rulings “affirmed,” “approved,” or “ordered” predictive coding.

Nevertheless, we recognize that predictive coding serves as a collective term of art describing various techniques and technologies which:

  • Share some commonly understood characteristics but no precise attributes;
  • Involve some general methodologies but no clear rules; and,
  • Are associated with general aspirations but no comprehensively defined operations.

Consequently, “ordering” predictive coding is akin to noting:

  • “The house construction contract requires carpenters to use a hammer.”
  • “The recipe calls for use of a spoon.”
  • “Surgeons’ minimum care standards include using a scalpel.”

Obviously, using a hammer, spoon, or scalpel doesn’t necessarily make a good house, a good meal or a good operation any more than using predictive coding necessarily makes a defensible process. Consequently, reports and the rulings they summarize are unhelpfully vague or improperly asserted.

Now, much as when the analytical techniques Frontline later debunked were first being promoted to our legal profession, we accept techniques and assertions of defensible process bolstered by claims of recall. We accept those assertions and claims because they appear to focus, to minimize, and to protect our work. Yet to the extent that those claims are based on erroneous practices, and are embedded in erroneous precedent, they distract, enlarge, and imperil our work. As discussed in the Frontline episode, it has taken decades to confront junk science — debunking it is an ongoing process — and the harm it causes is measured in imperiled justice and wasted lives.


I appreciate Ralph’s inviting me to offer these ideas about how we can better understand what predictive coding is. And what it isn’t.

After all, those of us who trust the scientific and adversarial process recognize that erroneous claims don’t naturally defeat truth. They suppress truth, distract from truth, and sometimes persist so long that we forget to inquire into the truth. Oftentimes, weak interests seek to dispel erroneous claims which are promoted by strong commercial interests. With respect to predictive coding, my sense is that we are neither deluded nor deceptive — well, not too much anyway — but we just have not yet thought it through.

We need to think through the implications of how:

  • Clients’ zero-sum game pushes attorneys into roles outside their trained area of competence by asking them to serve as information system analysts.
  • Courts’ discovery management procedures exacerbate disputes or let them fester.
  • Rules imposing nearly clairvoyant preservation duties and nearly unbounded scope enable requesting parties to extort through discovery.
  • Vendors promise extraordinary (and, as discussed above, oftentimes impossible) capabilities but deliver overly broad document sweeps, indiscriminate processing, and loss-leader pricing that prohibits full use of technical tools.
  • Attorneys trust but do not verify claims about predictive coding; verification would help them understand that predictive coding does not stand alone but is a tool in the shed or, as Ralph Losey has previously asserted, a component on top of the “search pyramid.”

Now is the time for our industry to confront erroneous predictive coding practices that will otherwise encumber our profession with junk science.

9 Responses to Predictive Coding’s Erroneous Zones Are Emerging Junk Science

  1. Mr. Speros,

    While I agree with your four major claims, I disagree almost entirely with the arguments you present in support of your claims. In this comment I will address one particular strawman argument:

    “Statisticians dismiss the problem: ‘Nothing is perfect and this doesn’t need to be…’ ‘The process confirmed that there is a 95% probability that no other relevant documents exist…’”

    The first claim is outside the domain of statistics, and is more aptly attributed to the civil rules and the courts. The second claim would not be advanced by any competent statistician; it is simply false.

    A statistician is able to estimate the number of “other relevant documents” that exist within some well-specified population of documents. The “confidence level” (typically 95%) is a measure of the reliability of the estimation technique, not the probability that “no other relevant documents exist.” That probability, for all intents and purposes, is 0%.

  2. The Frontline article notes that DNA evidence has been shown to be reliable, and has been used to refute other forms of unreliable contradictory evidence.

    The TREC Legal Track, held at the National Institute of Standards and Technology, has shown that certain “predictive coding” approaches work well — much better than the alternatives — under reasonably realistic laboratory conditions. None of the techniques that have been shown to work well at TREC use exclusively random training. None of the techniques that have been shown to work well at TREC use exclusively hand-picked training sets. None of the techniques that have been shown to work well at TREC use keyword culling prior to the application of “predictive coding.”

    Further scientific studies will reinforce what TREC and other studies already indicate quite clearly: That “active learning” — the technology behind the predictive coding methods that have been shown to work well at TREC — is fundamentally superior for electronic discovery, as DNA evidence is superior for forensics. It is not “just another tool to throw at the problem.”

  3. Bill Speros says:

    Thank you, Mr. Cormack, for your thoughtful comments with which I agree almost entirely.

    Certainly we agree–but in retrospect I should have made it explicit–that:

    (1) Predictive coding can work better than pre-existing techniques and technologies and
    (2) People like you and others at TREC and elsewhere are performing scientific studies to help develop thoughtful and thorough predictive coding related standards.

    In fact, the only point of conflict between our respective positions is your use of a single word: “strawman.”

    Sadly, I did not concoct those four predictive coding related claims in order to dispel them. I tried to dispel them because I noticed those four erroneous claims in recent or active cases I studied or in which I rendered consulting services.

    All this reinforces the point: unless we confront erroneous yet emerging predictive coding practices (via TREC-like and other scientific studies) then predictive coding’s potential will be encumbered.

    This reminds me of the first draft’s first (edited-out) sentence: “Years ago I heard conservative commentator George Will say, ‘Lies travel around the world while truth is still tying its shoes.'”

    Now in common, real-world litigation matters erroneous practices are employed and erroneous precedents are formed while scientific studies are still developing thoughtful standards.

  4. […] process employed is by well-known attorney and consultant William Speros whose guest column “Predictive Coding’s Erroneous Zones Are Emerging Junk Science”  appeared on Ralph Loseys eDiscovery Team […]

  5. By way of clarification, I would say that TREC didn’t show that “predictive coding” as a whole performs better than the alternatives so much as it showed that, when used in a sound process and guided by individuals with scientific information retrieval expertise, predictive coding is capable of performing well. TREC also showed that a rule-based system, again when used in a sound process and guided by similarly qualified individuals, is equally capable of performing well. As Bill points out, it is not so much the nature of the tool as the method and expertise that guides the tool that makes a meaningful difference. Any approach, when practiced by non-experts, is very unlikely to yield a high quality result.

    • I think that TREC has shown that certain active learning methods, and also a certain rule-based method can work well when properly deployed. Sorry if I didn’t make that clear in my earlier comment.

      To win at Formula 1 racing, you need a car, a driver, a strategy, and a pit crew. If any of these is sub-par, you won’t even qualify. I would not say that one is more important than the others; they are all essential.

      Many teams appeal to the TREC results in their marketing, although their cars, drivers, strategies, and pit crews bear little resemblance to those that worked well at TREC.

  6. […] and eDiscovery consultant Bill Speros recently wrote about search strategies on Ralph Losey’s blog. As a self-described “linguistic bird-dog,” he is concerned that […]

  7. […] Speros, W., 2013. Predictive Coding’s Erroneous Zones Are Emerging Junk Science. E-discovery Team Blog (Guest Entry), 28th April 2013. Available online at: […]

  8. […] by search expert and attorney Bill Speros, who used the same classic street light analogy. Predictive Coding’s Erroneous Zones Are Emerging Junk Science  (Pulling a random sample of documents to train the initial seed set … is erroneous because […]
