Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search

May 19, 2013

Byte and Switch, my future-law robots, star here in another video animation, this time on random sampling. They explain how sampling is used in machine-learning-based evidence review. In this first segment of a two-part video taking place sometime in the near future, we watch Switch help Byte get ready to give expert testimony in a Daubert hearing. The presiding Judge, David J. Waxse, in the future routinely insists on that sort of thing. See: Waxse & Yoakum-Kriz, Experts on Computer Assisted Review: Why Federal Rule of Evidence 702 Should Apply To Their Use, 52 WBJLJ 207, (Spring 2013).

Byte, who is an expert by virtue of his knowledge-base, programming, and search experience, makes the perfect witness. Verified programming establishes that he is incapable of lies or evasion. Not only that, he has total recall of everything that happened in every search project he has been involved with. Still, Switch needs to help Byte get ready to testify. Byte, like the scientists and programmers who created him, needs to learn how to talk simply enough for non-expert humans to comprehend. This animation shows Byte practicing for his testimony.

In this video Byte explains how and why random samples are taken at the start of a project, before the active learning training begins. Byte also explains that random sampling is used again, in a limited fashion, during the training. (The Borg-type predictive coding software that relies entirely on random chance has in this near-future scenario been discredited and abandoned long ago.) In part two Byte and Switch will go on to explain final quality assurance sampling at or near the end of a robot-enhanced search project.

As usual, pause to let the streaming video get ahead, especially if your connection is slow, and increase the video screen to full size for best effect.

Special thanks to William Webber, Information Scientist, for his background information and help. William has endured hours of my Switch-like questioning on random sampling in active machine learning search projects. His explanations of sampling have been invaluable, including such esoteric topics as Gaussian and Binomial calculations, Simple Random and Stratified Random sampling (William’s speciality), quality control sampling for testing, as opposed to training, prevalence, concept shift, and recall testing. All credit goes to William for what I get right in this future-scenario of random sampling. Any mistakes in the explanation, or errors in predictions, are entirely my own.

For the earlier adventures of Byte and Switch, see:


Robot Games: the Gamification of Legal Review

May 14, 2013

The last blog, Robots With A Story To Tell, illustrated the use of narrative to improve legal search and review. This blog goes a bit further into the not-too-distant future and illustrates the gamification of legal review. Gamification per Wikipedia is the use of game thinking and game mechanics in a non-game context in order to engage users and solve problems. As a life-long computer gamer myself, I appreciate the power of well-designed games. They can engage a player in a timeless flow-state of enhanced concentration. This can go on for hours, days, weeks. Players get better as the game goes on. The power of their minds overcomes their physical fatigue. Why not add this quality control enhancement to legal review? It is sorely needed.

I know that Jon Canty, Partner and Co-Founder of Contact Discovery Services, agrees with me on this topic. We have talked about it a fair amount. Perhaps other review companies and software companies are also interested in this idea. Maybe someone is in a position to take action? If so, please contact me as I have several ideas on how to do it. A few of the more obvious gamification applications are shown in this animation with our robot friends, BYTE and SWITCH. (Thanks to Kip Comack of CACI International, the winner of the name-the-robots contest, for coming up with these clever names. Your book is in the mail.) As usual, this story is told from the perspective of the robots. For best effect open the video to full screen and pause to let the streaming video get ahead.

Gamification can make it easier for legal reviewers to attain an enhanced state of repetitive concentration, a timeless state of flow. That makes them better reviewers, better machine trainers. If you are a serious software developer looking to improve your predictive coding kung fu 功夫武術, let’s talk about possible collaboration. Lawyers working with robots to find the truth, with the help of story and gamified software. These are the next big things in future law.


Confessions of a Trekkie

May 10, 2013

My name is Ralph and I’m a Star Trek addict. Yes, a true nerd. I have loved Star Trek since I was a kid in the 60s watching the TV show with my parents. We all loved the show, even if it was sometimes challengingly liberal for my Republican parents. I have seen every Star Trek show ever made, multiple times. I cannot get enough of it. I’ve even bought several Star Trek video games, just so I could have the personal thrill of firing Phasers (on stun of course), and a full volley of Photon torpedoes (not on stun). Make it so. Fight the Borg. Save the Universe.

I share all of this with you so you will understand why my new favorite judge is Otis D. Wright, II. Otis is a U.S. District Judge in California who appears to be a Star Trek addict too. Look at the great opinion he wrote on May 6, 2013, Ingenuity 13 LLC v. John Doe. It arises out of a discovery violation in a strange copyright case involving copyright trolls (Ferengi might be the better word for them). Judge Wright’s opinion begins with this famous quote:

“The needs of the many outweigh the needs of the few.”

— Spock, Star Trek II: The Wrath of Khan (1982)

Thinking of the scene in Wrath of Khan where Spock utters these fateful words nearly brings a tear to my eye; no doubt it did for the plaintiffs here too. It was a warning shot that they were about to be phaser blasted, or as lawyers say these days, bench slapped. The first paragraph of the opinion gives a great summary of the plaintiffs, Ferengi all, and includes a reference to my favorite Star Trek villains, The Borg:

Plaintiffs have outmaneuvered the legal system. They’ve discovered the nexus of antiquated copyright laws, paralyzing social stigma, and unaffordable defense costs. And they exploit this anomaly by accusing individuals of illegally downloading a single pornographic video. Then they offer to settle—for a sum calculated to be just below the cost of a bare-bones defense. For these individuals, resistance is futile; most reluctantly pay rather than have their names associated with illegally downloading porn. So now, copyright laws originally designed to compensate starving artists allow, starving attorneys in this electronic-media era to plunder the citizenry.

Is that a terrific beginning to an opinion, or what? But wait, there’s more. Judge Wright goes on to say:

Plaintiffs do have a right to assert their intellectual-property rights, so long as they do it right. But Plaintiffs’ filing of cases using the same boilerplate complaint against dozens of defendants raised the Court’s alert. It was when the Court realized Plaintiffs engaged their cloak of shell companies and fraud that the Court went to battlestations.

Battlestations, battlestations! Can you not hear the classic Star Trek alarms in your head? The judge then goes on with another first by using a Google Earth photo to expose a plaintiff’s lawyer’s lie. He actually includes the color photo in the opinion. Plaintiffs had stated that a defendant lived in a large mansion with a big gate out front, whereas the Google Earth photo showed it to be a typical small suburban tract home. No gate, no mansion. This was just one example of Judge Wright’s exposure of a pattern of lies by plaintiff’s counsel. It led to his dismissal of the case, award of fees to defendants, and, declaring that these particular plaintiff’s counsel suffer from a form of moral turpitude unbecoming of an officer of the court, referring them all to state and federal bar associations for ethics investigations. He even included a color picture of these lawyers and their complex web of Ferengi-like corporate shells. But wait, there are still more torpedoes left in the Captain’s, I mean, Judge’s arsenal. Judge Wright concludes his sanctions with an awesome flurry of weapons fire reminiscent of Kirk himself:

Third, though Plaintiffs boldly probe the outskirts of law, the only enterprise they resemble is RICO. The federal agency eleven decks up is familiar with their prime directive and will gladly refit them for their next voyage. The Court will refer this matter to the United States Attorney for the Central District of California. The [Court] will also refer this matter to the Criminal Investigation Division of the Internal Revenue Service and will notify all judges before whom these attorneys have pending cases. For the sake of completeness, the Court requests Pietz to assist by filing a report, within 14 days, containing contact information for: (1) every bar (state and federal) where these attorneys are admitted to practice; and (2) every judge before whom these attorneys have pending cases.

Judge Otis Wright, you are a true Trekkie and my new hero. Thanks for a great order. I cannot wait to cite it against certain Klingon-like opposing counsel I know.



Robots With A Story To Tell

May 8, 2013

Robot Stories: How storytelling narratives will be part of machine learning in the not-too-distant future as told from the perspective of the robots. This is the second in a series of instructional cartoons on predictive coding; what it is now, and what it could be. The first was Bad Robot! A Story of Ethics and Predictive Coding in the Not-Too-Distant Future. The cute robots have now been named by readers in Vote For Your Favorite Robot Names where the winning names were: BYTE and SWITCH. These are much funnier names than the old Star Wars storytellers, C3PO and R2D2, or the senior partners at the law firm of Robot, Robot & Hwang, Apollo Cluster and Daria XR-1029.

For background on the storytelling approach to document review in general, not just predictive coding based review, see the prior guest blog by Bill Hamilton and Larry Chapin: Storytelling: The Shared Quest For Excellence in Document Review. This is one of several methods that can and should be used to enhance the quality and consistency of document review. I have heard that a few document review teams are already using some narrative techniques. This animation considers more advanced applications in the context of machine learning. For best effect open the video to full screen and pause to allow the streaming video to download.

Perhaps some review companies and software companies are interested in these ideas and are ready to put them into action in a machine learning context? If so, please contact me as I have several more ideas on how to do that.


Vote For Your Favorite Robot Names

May 7, 2013

I received many great suggestions for names for the two robots who star in my new predictive coding animation series. The new cartoons kicked off this Sunday with Bad Robot! Thanks to all who participated. All of the suggestions were very clever, with some more esoteric than others. After great effort I was able to narrow them down to five names. But now I need your help to decide who wins the robot naming contest. Please vote for your favorite robot names. You will decide the winner. The poll is only open for 24 hours, so vote now. Only one vote per human or robot please.

By the way, did you hear about the sort-of law firm that opened up in California in 2010, Robot, Robot & Hwang? The only human in the firm is the junior partner, Tim Hwang. His senior partners are Apollo Cluster, who specializes in mergers and acquisitions, and Daria XR-1029, who specializes in intellectual property issues. I’m thinking Tim is more prankster than real lawyer, but I like his style nonetheless. Get yourself a law degree Tim, and I may have a job for you with my two robots, names yet to be determined.


Bad Robot!

May 5, 2013

A STORY OF ETHICS AND PREDICTIVE CODING IN THE NOT-TOO-DISTANT FUTURE. Yes, by popular demand of younger readers, my e-discovery short animations are back. (I can almost hear the groaning of the literati readers!) This is the first in a series of quickie-fun videos to teach predictive coding and related topics. This first one also includes ethics. These lessons will all be told from the point of view of the Robots. And funky comic retro Robots at that! They embody the machine learning algorithms in the not-too-distant future of interactive document review. Literati, please put aside any prejudices you may have against videos and cartoons and give this new style of teaching a try. It may even become a new secret guilty pleasure. I know I had a blast creating them. For best effect open the video to full screen.

Come up with a name for the two robots, and win a prize. I’m thinking Click and Clack, but that’s not too original, I know. Please send me an email or leave your suggestion in the comment box below. Winner gets a free copy of any one of my e-discovery books; your choice.


Predictive Coding’s Erroneous Zones Are Emerging Junk Science

April 28, 2013

GUEST BLOG by J. William (Bill) Speros. Editor’s Preface. Attorney Bill Speros here answers my call for critical papers on predictive coding, a call I made just last week in Reinventing the Wheel: My Discovery of Scientific Support for “Hybrid Multimodal” Search. Although Bill Speros keeps a lower profile than other experts in the field, most insiders know him as one of the country’s leading independent consultants on electronic evidence, with over 24 years of experience. Most recently, just weeks after the Madoff Ponzi scheme collapsed, Bill was brought in as the technical guru for the Madoff investigation. The Trustee of the Bernard L. Madoff Investment Securities bankruptcy proceedings hired Bill to kick off the e-discovery aspects of the investigation. Speros spent 4,000 hours as the interim Director of Litigation Support and E-Discovery.

When it comes to legal search Bill knows what he is talking about from direct personal experience with thousands of complex ESI search and review projects. This essay by William Speros brings much-needed critical light on some of the poor methods of search employed by many software vendors and attorneys in the field of predictive coding.

______________

Predictive Coding’s Erroneous Zones Are Emerging Junk Science

by: J. William Speros

To a carpenter with a hammer, everything is a nail.

To a bite mark analyst, fire cause analyst, or fingerprint analyst, their conclusions are conclusive.

It is convenient to think that our intellectual profession wouldn’t suffer fools gladly or sustain lies for long. It is convenient, but wrong.

PBS’ Frontline’s Forensic Tools: What’s Reliable and What’s Not-So-Scientific dispelled the infallibility, and in some instances, the validity, of analytical techniques long relied upon by our legal profession. Even if those techniques were not botched or biased, their validity ranges from bought-and-paid-for infomercials to, at best, an approximation.

How did our intellectual and intelligent legal profession grant so much authority to the junk science that Frontline debunked? I suspect that back then attorneys and judges (and experts and vendors) did with those junk sciences just what we are doing now with respect to predictive coding: allowing claims, however unjustified and erroneous, to form the basis of our practices, to influence our precedent and to accrue authority.

This article discusses four erroneous claims about predictive coding that oftentimes the trade press announces in breathless terms and legal arguments describe in breathtakingly inaccurate terms:

  1. Using a full-text search to identify prospectively responsive documents and then employing predictive coding to eliminate those that are not responsive. This is erroneous because it over-relies and under-delivers. It arbitrarily places documents out-of-sight and, therefore, out-of-mind.
  2. Pulling a random sample of documents to train the initial seed set. This is erroneous because it looks for relevance in all the wrong places. It turns a blind eye to what is staring you in the eye.
  3. Identifying “magic numbers” of necessary predictive coding assessment “iterations” and of the number of responsive documents within a randomly accumulated population. This is erroneous because you may not be able to get to there from here. You don’t know what isn’t yet known.
  4. Asserting that “Predictive Coding software is the gold standard for document retrieval in complex matters.” This is erroneous because it is thinking as though predictive coding is a box.

__________

1. Using a full-text search to identify prospectively responsive documents even if followed by using predictive coding to eliminate those that are not responsive

We see an erroneous claim that predictive coding “is employed” even though the population of documents subjected to it is constrained.

A. Over-Rely and Under-Deliver

Full-text searching is a technique and a technology, but not an ideology. Its rate of success depends on the mechanism and also the content against which it is applied: Does this search “Losey*” retrieve “Losey” and also “LoseyRalph?” In technical terms full-text searches’ rate of success may be tested and verified.
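As a rough illustration of how such a test might look, the sketch below applies a wildcard term to a sentence and lists the tokens it retrieves. It is only a sketch under stated assumptions: the `matching_tokens` helper, the whitespace tokenizer, and the sample sentence are all hypothetical, and real search engines tokenize and normalize case in their own ways, which is exactly why the result has to be verified rather than presumed.

```python
from fnmatch import fnmatchcase

# Hypothetical punctuation set stripped from the edges of each token.
PUNCTUATION = ".,;:!?\"'()[]"

def matching_tokens(pattern: str, text: str) -> list[str]:
    """Return the whitespace-delimited tokens in text that match a wildcard pattern.

    Matching is case-sensitive; a real engine may behave differently.
    """
    tokens = [raw.strip(PUNCTUATION) for raw in text.split()]
    return [tok for tok in tokens if fnmatchcase(tok, pattern)]

# Hypothetical sample content, not taken from any real matter.
sample = "Email from Losey, forwarded by LoseyRalph to rlosey@example.com."

print(matching_tokens("Losey*", sample))   # trailing wildcard only
print(matching_tokens("*osey*", sample))   # wildcard on both sides
```

Under these assumptions the trailing-wildcard term picks up both “Losey” and “LoseyRalph” but misses the lower-case e-mail address, which only the broader pattern catches.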

Stopping with technical measures of success, however, stops short; it focuses on process rather than on results. While the technical process enables, the researchers’ ideology constrains.

In one current matter, for example, the producing party employed a full-text search to constrain the production to focus on particular text that the researchers felt was published within responsive documents. But according to the requesting party the search spoke only about a limited number of relevant concepts. And even with respect to those subsets, the requesting party says, the search terms were incomplete.

All of this invites statistical analysis of full-text searches’ recall and response rates. But those discussions are not only obtuse, they are irrelevantly abstract: whether the results were successful is measured against a good faith standard based on the intelligence, experience, tasks, and testing applied in the current matter.

And in the current matter, the producing party seems to have excluded not only the requesting parties from the search term identification process, but excluded search terms the producing party employed in prior, similar matters. If true, some would say that presented arrogance.

More to the point here, the producing party’s constraining the population of potentially responsive documents via a single search, using search terms conceived by attorneys without their considering natural language, jargon, terms-of-art or other real-world realities, is a demonstration of their presumed clairvoyance.

Whether as a measure of presumed arrogance or presumed clairvoyance, using single-pass full-text review reflects an ideology that ought to be abandoned.

Or re-abandoned.

More than a decade ago, for example, a client asked for help in reviewing a collection of 950,000+ documents: find all that describe “a natural person who has not reached the age of majority.” Naturally, attorneys insisted that we search for the word “minor.”

What we found, of course, was no surprise. Of the 40% of the documents which contained the word “minor” or “minors” virtually none used the word “minor” to mean “someone under the age of majority” aka “a kid.”

Instead, we “linguistic bird-dogs” became aware of indicia of documents about “minors” that were not necessarily synonyms of the word “minor” but instead were associated with “minor’s” activities, relationships, and status including, for example:

AMNIO
BABIES
BABY
BASEBALL
BALL
BICYCLE
CHILD
CHILDREN
DAD
DAUGHTER
FATHER
FETUS
FOOTBALL
GUARDIAN
MOM
MOMMY
MOTHER
PEDIATRICIAN
PLAY
PROBATE
SCHOOL
STUDENT
TODDLER
UTER

Parenthetically, we performed this analysis prior to the predictive coding systems’ development. In the meantime, however, we tested the same collection of documents using an early predictive coding system. Interestingly, we found that it failed to find those linguistic associations and, therefore, failed to identify “kid” related documents.

To be clear, those search terms didn’t spontaneously come to mind. And they didn’t simply prove meaningful. We derived those terms because we were motivated by ideological humility: we could not anticipate — no, “anticipate” contemplates preexisting experience and insight — we could not guess what search terms would work.

Even back then thoughtful researchers knew that single-cycle, fire-and-forget searches could not succeed. Yet increasingly such searches are being employed by attorneys, who perhaps hope that predictive coding will protect them. Sometimes, though, it is too late to be saved.

B. Out of sight is out of the mind

Once documents are excluded from the review collection, the documents are out of mind. In this context, out-of-mind means that they may never be produced. After all, once other documents have been searched and read and studied and reviewed for privilege and then produced, it takes researcher discipline — nearly to the point of heroics — to go back upstream to recover and reconsider previously discarded documents.

In a current matter, that was not the producing party’s intent. The producing party merely offered to compile a random search from the previously discarded collection. As will be discussed below, that itself is “looking for relevance in all the wrong places” and “turning a blind eye to what is staring them in the eye.”

Worse, excluding responsive documents from the predictive coding space stunts its intelligence. That happens because words present in the responsive but excluded documents are not available to develop and grow predictive coding engine’s insights. By improperly withholding responsive documents from predictive coding engine’s analysis, attorneys are not only denying the likelihood that responsive documents will ever be produced but dumbing-down the predictive coding intelligence and, thereby, driving-down its value.

This is as much about starving the predictive coding engine of relevant documents as it is about failing to provide adequate informational nutrients.

2. Pulling a random sample of documents to train the initial seed set

Some attorneys and vendors recommend teaching the predictive coding system what is relevant by assessing a set of documents randomly accumulated. That is erroneous for at least two reasons: it is looking for relevance in all the wrong places, and turns a blind eye to what is staring you in the eye.

A. Looking for Relevance in All the Wrong Places.

Predictive coding finds target documents that are “like” particular exemplar documents. As to how the prospective documents are selected, here seems to be an unfortunate emerging standard:

  1. “Counsel said he selected…” or
  2. “This Predictive Coding workflow begins with the identification of a ‘seed set’ or initial group of relevant documents that is developed…” or
  3. “The system presents a series of randomly chosen documents for the reviewer to indicate which documents are responsive from which is built the seed set.”

Do you catch the problems?

The first two methods are vague as to how the documents were gathered: “Selected” or “identified” by thoughtful consideration, meaningful search or random selection?

If an otherwise precise disclosure employs passive voice and vague verbs, be careful.

The third example is more forthcoming, but more alarming. Here is what it means: the predictive coding system pulls from the document collection a random sample of documents to serve as the predictive coding project’s initial and key seed set.

Yes, we have heard it said that an appropriately selected sample of several thousand documents from a much larger population may provide, with some statistical certainty, accurate insights into the larger population.

After all, they continue, pulling a random set of documents for the seed set is akin to pulling a random sample in a presidential election: “If the election were held today would you vote for Mr. Romney or Mr. Obama?” seems an apt analogy to, “If you had to decide right now, is this document irrelevant or relevant?”

But researchers don’t start with that question.

Here is the presidential poll question applied to document search: “In how many ways do people decide for whom to vote and what words do they use to express that process?”

Thoughtful researchers don’t try to answer that sort of question by talking to a large number of people who aren’t inclined to think about the issue. Nor do they hope to learn about relevant documents by examining irrelevant ones.

Yet, regrettably, some attorneys are forming their predictive coding seed sets from randomly pulled documents.

If our profession continues to develop seed sets based on random searches, here is the natural implication: dilution. This approach encourages producing parties who wish to hide the truth to accumulate as many documents as possible to reduce the chance that the random pull will select responsive documents for prospective inclusion into the seed set.
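The dilution effect can be put in numbers. The sketch below is a minimal illustration, assuming a simple random sample drawn without replacement and a fixed number of responsive documents; the function name and the collection sizes are hypothetical and are not drawn from any actual matter.

```python
def prob_sample_hits(total_docs: int, responsive_docs: int, sample_size: int) -> float:
    """Probability that a simple random sample (drawn without replacement)
    contains at least one responsive document."""
    if not 0 <= responsive_docs <= total_docs:
        raise ValueError("responsive_docs must be between 0 and total_docs")
    if not 0 <= sample_size <= total_docs:
        raise ValueError("sample_size must be between 0 and total_docs")
    miss = 1.0  # probability that every draw so far was non-responsive
    for i in range(sample_size):
        remaining_non_responsive = total_docs - responsive_docs - i
        if remaining_non_responsive <= 0:
            return 1.0  # the sample is too large to avoid a responsive document
        miss *= remaining_non_responsive / (total_docs - i)
    return 1.0 - miss

# Hypothetical figures: the same 50 responsive documents and the same
# 500-document random pull, before and after padding the collection.
original = prob_sample_hits(total_docs=100_000, responsive_docs=50, sample_size=500)
padded = prob_sample_hits(total_docs=1_000_000, responsive_docs=50, sample_size=500)

print(f"original collection: {original:.1%} chance the pull includes a responsive document")
print(f"padded collection:   {padded:.1%} chance the pull includes a responsive document")
```

With the responsive documents held constant, every document added to the collection lowers the chance that a fixed-size random pull ever sees one of them.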

The underlying business problem is that relevant documents are hidden among the clutter and the noise. That problem is not ameliorated by forming seed sets via randomly compiled sets that say little of interest, but yet are so easily manipulated. Worse, this practice is erroneous because it seems to authorize producing parties’ attorneys to turn a blind eye to what they know or should know.

B. Turn a blind eye to what is staring you in the eye. 

Some attorneys employ random samples to populate seed sets, apparently because they:

  • Don’t know how to form the seed set in a better way or
  • Want to delegate responsibility to the computer “which said ‘so’,” or
  • Are emboldened by a statistical rationale premised on the claim that no one knows anything so random is as good a place to start as anywhere.

But that random-start strategy—“pure Borg” Ralph Losey calls it—is particularly problematic because it denies that attorneys know what they are paid to know: where to look and what to find.

It is a well-known joke: at night a guy is looking for his keys not where he dropped them but beneath a street light where it is illuminated.

The random-start/pure-Borg reality is less funny and much worse: researchers are looking beneath the street light, finding little of value and then concluding that there is little of value elsewhere, either.

Certainly it is possible that the random-start/pure-Borg approach retrieves things of value. If so, the diligent researcher seeks out other such things throughout the entire document universe. For example, if the randomly generated starting document set finds “kickbacks” and “bribes” among the review set the predictive coding system — subject to the linguistic realities and technical constraints — may enable the researcher to find other documents containing those concepts.

Has the researcher found everything of value?

Many researchers may perceive that they “looked everywhere and found everything.”


That misperception emerges from confirmation bias. The researcher found valuable documents, no doubt. The researcher looked everywhere, no doubt. But the researcher’s starting position was a fatally weak one. The randomly generated set didn’t offer examples of all relevant concepts — “frequent flier miles” are a form of kickback, too — but only some of them. Chances are that the random sample generator (which pulls documents, not concepts) is more likely to provide the most common concepts and less likely to provide the less common ones.

Does that pose a problem?

Statisticians dismiss the problem: “Nothing is perfect and this doesn’t need to be…” “The process confirmed that there is a 95% probability that no other relevant documents exist…”

This analysis, however statistically defensible:

  • Compares statistically significant but legally unimportant numbers: The definition of “relevant” is constrained to those concepts presented in the sample and derived by the predictive coding process thereafter. That is interesting. But what is more important is the number of responsive concepts identified relative to the total number of relevant concepts that exist.
  • Uses a circular definition of responsiveness: The statistical approach improperly constrains the focus. It constrains analysis by considering only that which the researcher found within the random set of documents and within document sets the predictive coding system “recognizes to be like” them. In so doing, the statistical approach improperly assumes that the:
    • Predictive coding systems’ recognition is perfect. By contrast, as a technical matter, it is unreasonable to premise any analysis on the assumption that predictive coding can find all the different words by which any concept may be expressed.
    • Researcher’s understanding is limited to the point of ignorance. By contrast, as a legal matter, it is expected that attorneys be knowledgeable about the disputed facts and how concepts about the matter may be expressed.
  • Confuses prevalence with probativeness and persuasiveness: While common relevant concepts may be probative and persuasive, oftentimes they are simply redundant — what Ralph Losey calls “irrelevantly relevant.” Normally, or at least frequently, uncommon relevant documents are the most persuasive. “Smoking guns” tend to be scarce.

The random-start/pure-Borg strategy assumes that attorneys suffer from ignorance or amnesia and encourages them to stumble forward from a random spot, confirming that they found what they were looking for.

Again, attorneys cannot be expected and should not presume to guess what particular language may be employed by parties to express concepts. Nevertheless, attorneys are paid to know what concepts are important to particular matters.

Those concepts include, for example:

  • What: Disputed facts and how ideas are expressed in it including terms-of-art, jargon.
  • When: Time lines and life-cycle states.
  • Who: Custodians and cast of characters (business groups, roles, key custodians, etc.).
  • Where: Technical environment, file types, etc.

Those are the concepts the expression of which can serve to capably identify and assess prospectively responsive documents. They — not randomly identified and conspicuously irrelevant documents — are the stuff of which competent seed sets are made.

3. There are “magic numbers” of predictive coding assessment “iterations” and of the minimum number of responsive documents within a randomly accumulated population

Like seeking “any port in the storm,” attorneys who fear the ambiguous document production related duties seek precision from higher authorities. Hoping to satisfy good faith document production duties, attorneys want to know:

  • What is the magic number of iterations necessary?
  • When assessing random samples to confirm that prior document assessments are complete and accurate, what is the minimum number of relevant documents that must be considered?

While it is natural for attorneys to seek that clarification, it is erroneous to specify it as an absolute value.

A. May not be able to get there from here. 

Requiring a minimum number of iterations or, conversely, boasting about performing a particular number of iterations, is as erroneous as specifying:

  • How many times must the carpenter strike the nail?
  • How many steps are necessary to get me from here to where I’ll be safe?
  • How many edits until this document is done?  (Yes, I can imagine your answer: more; many more.)

Whether the question relates to strikes, steps, edits, or iterations, the level of effort it takes to finish a project depends upon many factors, including something they all have in common: a starting point. And in complicated projects like evidence management projects, starting points are not likely known and may not be knowable because the status at any time is proprietary, vague, tribal, secret… Anything but conclusive. If the starting status isn’t conclusively known, then how can the steps (in this context, iterations) that lead to a fair end point be conclusively prescribed or even ordered?

While that seems to disappoint attorneys who bear the duty to make a good faith effort to meet abstract duties, they must learn to regret that there is no “magic number” of iterations.

B. Don’t know what isn’t yet known.

There are various fascinating claims that a “magic number” of responsive documents must be found within a random set to serve as the basis to affirm the predictive coding approach. It is enticing to seek to find protection in that number. And statisticians claim it is absolutely statistical.

But however statistically interesting it may be, it isn’t operable because:

  • The test itself fails when within the document population there are fewer relevant documents than the “magic number” would require.
  • The value of the test fails if the sampling never pulls in (lower probability but) critically important documents.
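
The second point can be made concrete with a little arithmetic. The following is a minimal Python sketch under assumptions that are not in the original: a hypothetical collection of 100,000 documents, 10 of which are critically important, and a simple random sample of 384 drawn without replacement. The function name and all figures are illustrative only.

```python
def prob_sample_misses_all(population: int, rare_docs: int, sample_size: int) -> float:
    """Probability that a simple random sample, drawn without replacement,
    contains none of the rare documents (hypergeometric, zero hits)."""
    if sample_size > population - rare_docs:
        return 0.0  # the sample is too large to avoid every rare document
    prob = 1.0
    for i in range(sample_size):
        # Chance that the next draw is also not one of the rare documents.
        prob *= (population - rare_docs - i) / (population - i)
    return prob


# Hypothetical figures, chosen only for illustration.
miss = prob_sample_misses_all(population=100_000, rare_docs=10, sample_size=384)
print(f"Chance the sample contains none of the 10 key documents: {miss:.1%}")
```

Under these assumed figures the sample misses every one of the ten key documents about 96 percent of the time, which is why a sample that looks statistically respectable can still say nothing about the scarce “smoking guns.”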

Consider this question from a different context: “What is the magic number of places in my house where I should look to find all my footwear?” Auto-magically, someone randomly searches and brings me back a collection of stuff that includes the (they say) statistically significant “magic number” of 384 socks.

Does that mean I don’t need to look in my dryer (where, particularly in busy times, I keep most of my clothes)? Or in my closet (because shoes are footwear, too)? Or in my garage (because muck boots and roller blades are footwear, too)?

It is natural for researchers to want a “magic number” against which to gauge their progress. They, too, must learn to regret that there is no “magic number” of responsive documents in samples to prove that the document review is complete.
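
The figure 384 most likely traces to the familiar sample size for estimating a proportion at a 95 percent confidence level with a 5 percent margin of error, assuming a worst-case prevalence of 50 percent and a very large population. The original does not say so, so treat that reading as an assumption. A short Python sketch of the arithmetic, with a hypothetical function name:

```python
from statistics import NormalDist


def sample_size(confidence: float = 0.95, margin: float = 0.05,
                prevalence: float = 0.5) -> float:
    """Sample size for estimating a proportion in a very large population
    (normal approximation, no finite-population correction)."""
    # Two-sided z-score for the requested confidence level (about 1.96 at 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z ** 2 * prevalence * (1 - prevalence) / margin ** 2


n = sample_size()
print(f"Sample size at 95% confidence, +/-5% margin: {n:.2f}")
```

The result falls between 384 and 385. Note that the calculation takes only a confidence level, a margin of error, and an assumed prevalence as inputs. It describes how many documents to sample, not how many responsive documents must turn up, and it knows nothing about where in the collection the important documents sit.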

4. “Predictive Coding software is the gold standard for document retrieval in complex matters”

Think as though it’s a box.  Regularly we read reports that court rulings “affirmed,” “approved,” or “ordered” predictive coding.

Nevertheless, we recognize that predictive coding serves as a collective term of art describing various techniques and technologies which:

  • Share some commonly understood characteristics but no precise attributes;
  • Involve some general methodologies but no clear rules; and,
  • Are associated with general aspirations but no comprehensively defined operations.

Consequently, “ordering” predictive coding is akin to noting:

  • “House construction contract requires carpenters to use hammer.”
  • “Recipe calls for use of spoon.”
  • “Surgeons’ minimum care standards include using a scalpel.”

Obviously, using a hammer, spoon, or scalpel doesn’t necessarily make a good house, a good meal or a good operation any more than using predictive coding necessarily makes a defensible process. Consequently, reports and the rulings they summarize are unhelpfully vague or improperly asserted.

Now, much as when the analytical techniques that Frontline debunked were being promoted to our legal profession, we accept the techniques and assertions of defensible process bolstered by claims of recall. We accept those assertions and claims because they appear to focus, to minimize, and to protect our work. Yet to the extent that those claims are based on erroneous practices, and are embedded in erroneous precedent, they distract, enlarge, and imperil our work. As discussed in the Frontline episode, it has taken decades to confront junk science (debunking it is an ongoing process), and the harm it causes is measured in imperiled justice and wasted lives.

Conclusion

I appreciate Ralph’s inviting me to offer these ideas about how we can better understand what predictive coding is. And what it isn’t.

After all, those of us who trust the scientific and adversarial process recognize that erroneous claims don’t naturally defeat truth. They suppress truth, distract from truth and sometimes persist so long that we forget to inquire into the truth. Oftentimes, weak interests seek to dispel erroneous claims which are promoted by strong commercial interests. With respect to predictive coding my sense is that we are neither deluded nor deceptive (well, not too much anyway) but we just have not yet thought it through.

We need to think through the implications of how:

  • Clients’ zero-sum game pushes attorneys into roles outside their trained area of competence by asking them to serve as information system analysts.
  • Courts’ discovery management procedures exacerbate disputes or let them fester.
  • Rules imposing nearly clairvoyant preservation duties and nearly unbounded scope enable requesting parties to extort through discovery.
  • Vendors promise extraordinary (and, as discussed above, oftentimes impossible) capabilities but deliver overly broad document sweeps, indiscriminate processing, and loss-leader pricing that prohibits full use of technical tools.
  • Attorneys trust but do not verify claims about predictive coding, when verification would help them understand that predictive coding does not stand alone but is a tool in the shed or, as Ralph Losey has previously asserted, a component on top of the “search pyramid.”

Now is the time for our industry to confront erroneous predictive coding practices that will otherwise encumber our profession with junk science.

