Concept Drift and Consistency: Two Keys To Document Review Quality – Part Two

January 24, 2016

This is Part Two of this blog. Please read Part One first.

Concept Freeze

In most complex review projects the understanding of relevance evolves over time, especially at the beginning of a project. This is concept drift. It evolves as the lawyers’ understanding evolves. It evolves as the facts unfold in the documents reviewed and other sources, including depositions. The concept of relevance shifts as the case unfolds with new orders and pleadings. This is a good thing. Its opposite, concept freeze, is not.

The natural shift in relevance understanding is well-known in the field of text retrieval. Consider for instance the prior cited classic study by Ellen M. Voorhees, the computer scientist at the National Institute of Standards and Technology in charge of TREC, where she noted:

Test collections represent a user’s interest as a static set of (usually binary) decisions regarding the relevance of each document, making no provision for the fact that a real user’s perception of relevance will change as he or she interacts with the retrieved documents, or for the fact that “relevance” is idiosyncratic.

Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt 697 (2000) at page 714 (emphasis added). (The somewhat related term query drift in information science refers to a different phenomenon in machine learning. In query drift the concept of document relevance unintentionally changes from the use of indiscriminate pseudorelevance feedback. Cormack, Buttcher & Clarke, Information Retrieval: Implementing and Evaluating Search Engines (MIT Press 2010) at pg. 277. This can lead to severe negative relevance feedback loops.)

In concept drift the concept of what is relevant changes as a result of:

  1. Trying to apply the abstract concepts of relevance to the particular documents reviewed, and
  2. Changes in the case itself over time from new evidence, stipulations and court orders.

The word drift is somewhat inappropriate here. It suggests inadvertence, a boat at the mercy of a river’s current, drifting out of control. That is misleading. The kind of concept drift here intended is an intentional drift. The change is under the full conscious control of the legal team. The change must also be implemented in a consistent manner by all reviewers, not just one or two. As discussed, this includes retroactive corrections to prior document classifications. Concept drift is more like a racing car’s controlled drift around a corner. That is the more appropriate image.

In legal search relevance should change, should evolve, as the full facts unfold. Although concept drift is derived from a scientific term, it is a phenomenon well-known to trial lawyers. If a lawyer’s concept of relevance does not change at all, if it stays frozen, then they are either in a rare black swan type of case, or the document review project is being mismanaged. It is usually the latter. The concept of relevance has stratified. It has not evolved or been refined. It is instead static, dead. Sometimes this is entirely the fault of the SME for a variety of reasons. But typically the poor project management is a group effort. Proper execution of the first step in the eight-step workflow for document review, the communication step, will usually prevent concept freeze. Although this is naturally the first step in a workflow, communication should continue throughout a project.

The problem of concept freeze is, however, inherent in all large document review projects, not just ones accelerated by predictive coding. In fact, projects using predictive coding are somewhat protected from this problem. Good machine learning software that makes suggestions, including suggestions that disagree with prior human coding, can sometimes prevent relevance stagnancy by forcing human re-conceptions.

No matter what the cause or type of search methods used, a concept freeze at the beginning of a review project, the most intense time for relevance development, is a big red flag. It should trigger a quality control audit. An early concept freeze suggests that the reviewers, the people who manage and supervise them, and SMEs, may not be communicating well, or may not be studying the documents closely enough. It is a sign of a project that has never gotten off the ground, an apathetic enterprise composed of people just going through the motions. It suggests a project dying at the time it should be busy being born. It is a time of silence about relevance when there should be many talks between team members, especially with the reviewers. Good projects have many, many emails circulating with questions, analysis, debate, decisions and instructions.

All of this reminds me of Bob Dylan’s great song, It’s Alright, Ma (I’m Only Bleeding):

To understand you know too soon
There is no sense in trying …

The hollow horn plays wasted words,
Proves to warn
That he not busy being born
Is busy dying. …

An’ though the rules of the road have been lodged
It’s only people’s games that you got to dodge
And it’s alright, Ma, I can make it.

This observation of the need for relevance refinement at the beginning of a project is based on long experience. I have been involved with searching document collections for evidence for possible use at trial for thirty-six years. This includes both the paper world and electronically stored information. I have seen this in action thousands of times. Since I like Dylan so much, here is my feeble attempt to paraphrase:

Relevance is rarely simple or static,
Drift is expected,
Complexities of law and fact arise and
Are work product protected.

An’ though the SME’s rules of relevance have been lodged
They must surely evolve, improve or be dodged
And it’s alright, Shira, I can make it.

My message here is that the absence of concept shift – concept freeze – is a warning sign. It is an indicator of poor project management, typically derived from inadequate communication or dereliction of duty by one or more of the project team members. There are exceptions to this general rule, of course, especially in simple cases, or ones where the corpus is well known. Plus, sometimes you do get it right the first time, just not very often.

The Wikipedia article on concept shift noted that such change is inherent in all complex phenomena not governed by fixed laws of nature, but rather by human activity …. Therefore periodic retraining, also known as refreshing, of any model is necessary. I agree.

Determination of relevance in the law is a very human activity. In most litigation this is a very complex phenomenon. As the relevance concept changes, the classifications need to be refreshed and the documents recoded according to the latest relevance model. This means that reviewers need to go back and change the prior classifications of documents. The classifications need to be corrected for uniformity. Here the quality factor of consistency comes into play. It is time-consuming to go back and make corrections, but important. Without these corrections and consistency efforts, the impact of concept drift can be very disruptive, and can result in decreased recall and precision. Important documents can be missed, documents that you need to defend or prosecute, or ones that the other side needs. In egregious situations that last error can be sanctionable.

Here is a quick example of the retroactive correction work in action. Assume that one type of document, say Spreadsheet X, has been found to be irrelevant for the first several days, such that there are now hundreds, perhaps thousands of various documents coded irrelevant with information pertaining to Spreadsheet X. Assume that a change is made, and the SME now determines that a new type of this document is relevant. The SME realizes, or is told, that there are many other documents on Spreadsheet X that will be impacted by the decision on this new form. A conscious, proportional decision is then made to change the coding on all of the previously coded documents impacted by this decision. In this hypothetical the scope of relevance expanded. In other cases the scope of relevance might tighten. It takes time to go back and make such corrections in prior coding, but it is well worth it as a quality control effort. Concept drift should not be allowed to breed inconsistency.
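For readers who like to see the mechanics, here is a minimal Python sketch of the retroactive correction just described. Everything in it is hypothetical: the Doc record, the "spreadsheet_x" type label and the recode_by_type function are stand-ins for illustration, not features of any particular review platform. It assumes each document already carries a type label that the team can filter on.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    doc_type: str   # hypothetical label, e.g. "spreadsheet_x"
    relevant: bool  # current relevance coding


def recode_by_type(docs, doc_type, new_relevant):
    """Apply the SME's new call to every document of one type.

    Returns the IDs whose coding actually changed, so the correction
    can be logged and audited.
    """
    changed = []
    for doc in docs:
        if doc.doc_type == doc_type and doc.relevant != new_relevant:
            doc.relevant = new_relevant
            changed.append(doc.doc_id)
    return changed


docs = [
    Doc("XLS-001", "spreadsheet_x", False),
    Doc("XLS-002", "spreadsheet_x", False),
    Doc("EML-001", "email", False),
    Doc("XLS-003", "spreadsheet_x", True),  # already coded relevant
]

# The SME now decides Spreadsheet X documents are relevant.
changed = recode_by_type(docs, "spreadsheet_x", True)
print(changed)
```

Returning the list of changed IDs, rather than recoding silently, is a deliberate choice in this sketch: it leaves a record of what the concept drift touched.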

A static understanding by document reviewers of relevance, especially at the beginning of a project, is a red flag of mismanagement. It suggests that the subject matter expert (“SME”), who is the lawyer(s) in charge of determining what is relevant to the particular issues in the case, is not properly supervising the attorneys who are actually looking at the documents, the reviewers. If SMEs are not properly supervising the review, if they do not do their job, then the net result is loss of quality. This is the kind of quality loss where key documents could be overlooked. In this situation reviewers are forced to make their own decisions on relevance when new kinds of documents are encountered. This exacerbates the natural inconsistencies of human reviewers (more on that later). Moreover, it forces the reviewers to try to guess what the expert in charge of the project might consider to be relevant. When in doubt the tendency of reviewers is to guess on the broad side. Over-extended notions of relevance are often the result.

A review project of any complexity that does not run into some change in relevance at the beginning of a project is probably poorly managed and making many other mistakes. The cause may not be from the SME at all. It may be the fault of the document reviewers or mid-realm management. The reviewers may not be asking questions when they should, they may not be sharing their analysis of grey area documents. They may not care or talk at all. The target may be vague and elusive. No one may have a good idea of relevance, much less a common understanding.

This must be a team effort. If audits show that any reviewers or management are deficient, they should be quickly re-educated or replaced. If there are other quality control measures in place, then the potential damage from such mismanagement may be limited. In other review projects, however, this kind of mistake can go undetected and be disastrous. It can lead to an expensive redo of the project and even court sanctions for failure to find and produce key documents.

SMEs must closely follow the document review progress. They must supervise the reviewers, at least indirectly. Both the law and legal ethics require that. SMEs should not only instruct reviewers at the beginning of a project on relevancy, they should be consulted whenever new document types are seen. This should ideally happen in near real time, but at least on a daily basis with coding on that document type suspended until the SME decisions are made.

With a proper surrogate SME agency system in place, this need not be too burdensome for the senior attorneys in charge. I have worked out a number of different solutions for that SME burdensomeness problem. One way or another, SME approval must be obtained during the course of a project, not at the end. You simply cannot afford to wait until the end to verify relevance concepts. Then the job can become overwhelming, and the risks of errors and inefficiencies too high.

Even if consistency of reviewers is assisted, as it should be, by using similarity search methods, the consistent classification may be wrong. The production may well reflect what the SME thought months earlier, before the review started, whereas what matters is what the SME thinks at the time of production. A relevance concept that does not evolve over time, that does not drift to the truth, is usually wrong. A document review project that ties all document classification to the SME’s initial ideas of relevance is usually doomed to failure. These initial SME concepts are typically made at the beginning of the case and after only a few relevant documents have been reviewed. Sometimes they are made completely in the abstract, with the SME having seen no documents. These initial ideas are only very rarely one hundred percent right. Moreover, even if the ideas, the concepts, are completely right from the beginning, and do not change, the application of these concepts to the documents seen will change. Modifications and shifts of some sort, and to some degree, are almost always required as the documents reveal what really happened and how. Modifications can also be driven by demands of the requesting party, and most importantly, by rulings of the court.

Consistency

Consistency as described before refers to the coding of the same or similar type documents in the same manner. This means that:

  1. A single reviewer determines relevance in a consistent manner throughout the course of a review project.
  2. Multiple reviewers determine relevance in a consistent manner with each other.

As mentioned, the best software now makes it possible to identify many of these inconsistencies, at least the easy ones involving near duplicates. Actual, exact duplicates are rarely a problem, as they are so easy to detect, but not all software is good at detecting near duplicates, threads, etc. Consistency in adjudications of relevance is a quality control feature that I consider indispensable. Ask your vendor how their software can help you to find and correct all obvious inconsistencies, and mitigate the others. The real challenge, of course, is not in near duplicates, but in documents that have the same meaning, but very different form.
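To make the easy case concrete, here is a rough Python sketch of a consistency check over duplicate or near-duplicate groups. It assumes the review software has already assigned each document a group label; the group IDs, document IDs and the find_inconsistent_groups function are all hypothetical. It does nothing for the hard case of documents with the same meaning but very different form.

```python
from collections import defaultdict


def find_inconsistent_groups(codings):
    """Return the near-duplicate groups whose members were coded differently.

    codings: iterable of (doc_id, group_id, is_relevant) tuples, where
    group_id is a hypothetical near-duplicate cluster label.
    """
    groups = defaultdict(list)
    for doc_id, group_id, is_relevant in codings:
        groups[group_id].append((doc_id, is_relevant))

    inconsistent = {}
    for group_id, members in groups.items():
        labels = {is_relevant for _, is_relevant in members}
        if len(labels) > 1:  # both relevant and irrelevant calls present
            inconsistent[group_id] = members
    return inconsistent


sample = [
    ("DOC-001", "G1", True),
    ("DOC-002", "G1", False),  # conflicts with DOC-001
    ("DOC-003", "G2", False),
    ("DOC-004", "G2", False),
]

flagged = find_inconsistent_groups(sample)
print(sorted(flagged))
```

Each flagged group would then go back to a reviewer or the SME for a single, uniform call.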

Scientific research has shown that inconsistency of relevance adjudications is inherent in all human review, at least in large document review projects requiring complex analysis. For authority I refer again to the prior cited study by Ellen M. Voorhees, the computer scientist at the National Institute of Standards and Technology in charge of TREC. Voorhees found that the average rate of agreement by two human experts on documents determined to be relevant was only 43%. She called that overlap. This means that two manual reviewers disagreed with each other as to document relevance 57% of the time. Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, supra at pages 700-701.

Note that the reviewers in this study were all experts, all retired intelligence officers skilled in document analysis. Like litigation lawyers they all had similar backgrounds and training. When the relevance determinations of a third reviewer were considered in this study, the average overlap rate dropped down to 30%. That means the three experts disagreed in their independent analysis of document relevance 70% of the time. The 43% and 30% overlap they attained was higher than earlier TREC studies on inconsistency. The overlap rate is shown in Table 1 of her paper at page 701.
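For those who want to see how an overlap figure is computed, here is a small Python sketch. It treats overlap as the number of documents every reviewer coded relevant divided by the number any reviewer coded relevant (intersection over union). The document numbers are made up for illustration and are not data from the Voorhees study; the choice to return 1.0 when nobody coded anything relevant is my own assumption.

```python
def overlap(*relevant_sets):
    """Intersection of the relevant sets divided by their union."""
    union = set().union(*relevant_sets)
    if not union:
        return 1.0  # assumption: no relevant calls at all counts as full agreement
    intersection = set(relevant_sets[0]).intersection(*relevant_sets[1:])
    return len(intersection) / len(union)


# Made-up relevant sets for three reviewers (document numbers).
reviewer_a = {1, 2, 3, 4, 5}
reviewer_b = {3, 4, 5, 6, 7}
reviewer_c = {4, 5, 7, 8}

two_way = overlap(reviewer_a, reviewer_b)                # 3 shared of 7 total
three_way = overlap(reviewer_a, reviewer_b, reviewer_c)  # 2 shared of 8 total

print(f"two reviewers:   {two_way:.0%} overlap, {1 - two_way:.0%} inconsistent")
print(f"three reviewers: {three_way:.0%} overlap, {1 - three_way:.0%} inconsistent")
```

Note how adding a third reviewer can only hold the overlap steady or push it down, since the intersection cannot grow and the union cannot shrink.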

Voorhees concluded that this data was evidence for the variability of relevance judgments. Id.

A 70% inconsistency rate on relevance classifications among three experts is troubling; hence the need to check and correct for human errors, especially when expert decisions are required as is the case with all legal search. I assume that agreement rates would be much higher in a simple search matter, such as finding all articles in a newspaper collection relevant to a particular news event. That does not require expert legal analysis. It requires very little analysis at all. For that reason I would expect human reviewer consistency rates to be much higher with such simple search. But that is not the world of legal search, where complex analysis of legal issues requiring special training is the norm. So for us, where document reviews are usually done with teams of lawyers, consistency by human reviewers is a real quality control problem that must be carefully addressed.

The Voorhees study was borne out by a later study on a legal search project by Herbert L. Roitblat, PhD, Anne Kershaw and Patrick Oot. Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, Journal of the American Society for Information Science and Technology, 61 (2010). Here a total of 1,600,047 documents were reviewed by contract attorneys in a real-world linear second review. A total of 225 attorneys participated in the review. The attorneys spent about 4 months, working 7 days a week, and 16 hours per day on this review.

A few years after the Verizon review, two re-review teams of professional reviewers (Team A and Team B) were retained by the Electronic Discovery Institute (EDI), which sponsored the study. The study found that the overlap (agreement in relevance coding) between Team A and the original production was 16.3%; and the overlap between Team B and the original production was 15.8%. This means an inconsistency rate on relevance of 84%. The overlap between the two re-review Teams A and B was a little better at 28.1%, meaning an inconsistency rate of 72%. Better, but still terrible, and once again demonstrating how unreliable human review alone is without the assistance of computers, especially without active machine learning and the latest quality controls. The study reaffirmed an important point about inconsistency in manual linear review, especially when the review requires complex legal analysis. It also showed the incredible cost savings readily available with using advanced search techniques to filter documents, instead of linear review of everything.

The total cost of the original Verizon merger review was $13,598,872.61 or about $8.50 per document. Apparently M&A has bigger budgets than litigation. Note the cost comparison to the 2015 e-Discovery Team effort at TREC reviewing seventeen million documents at an average review speed of 47,261 files per hour. The Team’s average cost per document was very low, but this cost is not yet possible in the real world for a variety of reasons. Still, it is illustrative of the state of the art. It shows what’s next in legal practice. Examining what we did at TREC: if you assume a billing rate of $500 per hour for the e-Discovery Team attorneys, then the cost per document for first pass attorney review would have been a penny a document. Compare that to $8.50 per document doing linear review without active machine learning, concept search, and parametric Boolean keyword searches.
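The arithmetic behind those two per-document figures is easy to check. Here is a short Python sketch using only the numbers stated above; the variable names are mine, and the TREC figure assumes, as the text does, a $500 hourly rate with attorney time as the only cost counted.

```python
# Verizon merger review: total cost and document count as reported above.
verizon_total_cost = 13_598_872.61
verizon_docs = 1_600_047
linear_cost_per_doc = verizon_total_cost / verizon_docs

# 2015 TREC effort: assumed $500/hour billing rate and reported review speed.
billing_rate_per_hour = 500.0
files_per_hour = 47_261
trec_cost_per_doc = billing_rate_per_hour / files_per_hour

print(f"Linear review: ${linear_cost_per_doc:.2f} per document")
print(f"TREC review:   ${trec_cost_per_doc:.4f} per document")
print(f"Ratio:         about {linear_cost_per_doc / trec_cost_per_doc:.0f} to 1")
```

The linear figure rounds to $8.50 and the TREC figure to about one cent, which is where the "penny a document" comparison comes from.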

The conclusions are obvious, and yet, there are many still ill-informed corporate clients that sanction the use of horse and buggy linear reviews, along with their rich drivers, just like in the old days of 2008. Many in-house counsel still forgo the latest CARs with AI-enhanced drivers. Most do not know any better. They have not read the studies, even the widely publicized EDI studies. Too bad, but that does spell opportunity for the corporate legal counsel who do keep up. More and more of the younger ones do get it, and have the authority to make sweeping changes. The next generation will be all about active machine learning, lawyer augmentation, and super-fast smart robots, with and without mobility.

Clients still paying for large linear review projects are not only wasting good money, and getting poor results in the process, but no one is having any fun in such slow, boring reviews. I will not do it, no matter what the law firm profit potential from such price gouging. It is a matter of both professional pride and ethics, plus work enjoyment. Why would anyone other than the hopelessly greedy, or incompetent, mosey along at a snail’s pace when you could fly, when you could get there much faster, and overall do a better job, find more relevant documents?

The gullibility of some in-house counsel in continuing to pay for large-scale linear reviews by armies of lawyers is truly astounding. Insurance companies are waking up to this fact. I am helping some of them to clamp down on the rip-offs. It is only a matter of time before everyone leaves the horse behind and gets a robot-driven CAR. You can delay such progress, we are seeing that, but you can never stop it.

By the way, since my search method is Hybrid Multimodal, it follows that my Google CAR has a steering wheel to allow a human to drive. That is the Hybrid part. The Multimodal means the car has a stick shift, with many gears and search methods, not just AI alone. All of my robots, including the car, will have an on-off switch and manufacturer certifications of compliance with Isaac Asimov’s “Three Laws of Robotics.”

Back to the research on consistency, the next study that I know about was by Gordon Cormack and Maura Grossman: Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012). It considered data from the TREC 2009 Legal Track Interactive Task. It attempts to rebut the conclusion by Voorhees that the inconsistencies she noted are the result of inherently subjective relevance judgments, as opposed to human error.

As to the seven topics considered at TREC in 2009, Cormack and Grossman found that the average agreement for documents coded responsive by the first-pass reviewers was 71.2 percent (28.8% inconsistent), while the average agreement for documents coded non-responsive by the first-pass reviewer was 97.4 percent (2.6% inconsistent). Id. at 274 (parentheticals added). Over the seven topics studied in 2009 there was a total overlap of relevance determinations of 71.2%. Id. at 281. This is a big improvement, but it still means inconsistent calls on relevance occurred 29% of the time, and this was using the latest circa 2009 predictive coding methods. Also, these scores are in the context of a TREC protocol that allowed for participants to appeal TREC relevance calls that they disagreed with. The overlap for two reviewers’ relevance calls was 71% in the Grossman Cormack study only if you assume all unappealed decisions were correct. But if you were to only consider the appealed decisions, the agreement rate was only 11%.

Grossman and Cormack concluded in this study that only 5% of the inconsistencies in determinations of document relevance were attributable to differences in opinion, that 95% were attributable to human error. They concluded that most inconsistent reviewer categorizations were caused by carelessness, such as not following instructions, and were not caused by differences in subjective evaluations. I would point out that carelessness also impacts analysis. So I do not see a bright line, like they apparently do, between “differences of opinion” and “human error.” Additional research into this area should be undertaken. But regardless of the primary cause, the inconsistencies noted by Cormack and Grossman highlight once again the need for quality controls to guard against such human errors.

The final study with new data on reviewer inconsistencies was mine. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (2013). In this experiment I reviewed 699,082 Enron documents by myself, twice, on two review projects about six months apart. The projects were exactly the same, same issues, same relevance standards. The documents were also the same. The only difference between the two projects was in the type of predictive coding method used. Because of the gap between the two projects, I had little or no recollection of the documents from one review to the next.

In a post hoc analysis of these two reviews I discovered that I had made 63 inconsistent relevance determinations of the same documents. See Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Two (12/2/13). Yes, human error at work with no quality controls at play to try to contain such inconsistency errors. I think it was an error in analysis, not simply checking the wrong box by accident, or something like that.

In the first multimodal review project I read approximately 2,500 individual documents to categorize the entire set of 699,082 ENRON emails. I found 597 relevant documents. In the second monomodal project, the one I called the Borg experiment, I read 12,000 documents to find 376 relevant documents. After removal of duplicate documents, which were all coded consistently thanks to simple quality controls employed in both projects, there were a total of 274 different documents coded relevant by one or both methods.

Of the 274 overlapping relevant categorizations, 63 of them were inconsistent. In the first (multimodal) project I found 31 documents to be irrelevant that I determined to be relevant in the second project. In the second (monomodal) project I found 32 documents to be irrelevant that I had determined to be relevant in the first project. An inconsistency of coding of 63 out of 274 relevant documents represents an inconsistency rate of 23%. This was using the same predictive coding software by Kroll Ontrack and the quality control similarity features included in software back in 2012. The software has improved since then, and I have added more quality controls, but I am still the same reviewer with the same all too human reading comprehension and analysis skills. I am, however, happy to report that even without my latest quality controls all of my inconsistent calls on relevance pertained to unimportant relevant documents, what I consider “more of the same” grey area types. No important document was miscoded.

My re-review of the 274 documents, where I made the 63 errors, creates an overlap or Jaccard index of 77% (211/274), which, while embarrassing, as most reports of error are, is still the best on record. See Grossman Cormack Glossary, Ver. 1.3 (2012) (defines the Jaccard index and goes on to state that expert reviewers commonly achieve Jaccard Index scores of about 50%, and scores exceeding 60% are very rare.) This overlap or Jaccard index for my two Enron reviews is shown by the Venn diagram below.

By comparison the Jaccard index in the Voorhees studies was only 43% (two reviewers) and 30% (three reviewers). The Jaccard index of the Roitblat, Kershaw and Oot study was only 16% (multiple reviewers).
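The Jaccard figure for my two Enron reviews can be reproduced from the counts reported above. Here is a brief Python sketch; the variable names are mine, and the counts are the ones already stated in this post (274 documents coded relevant in one or both projects, 31 relevant only in the second project, 32 relevant only in the first).

```python
# Counts reported above for the two Enron reviews.
relevant_in_either = 274   # union: coded relevant in one or both projects
only_second_project = 31   # irrelevant in first project, relevant in second
only_first_project = 32    # relevant in first project, irrelevant in second

inconsistent = only_first_project + only_second_project
relevant_in_both = relevant_in_either - inconsistent  # intersection

jaccard = relevant_in_both / relevant_in_either
inconsistency_rate = inconsistent / relevant_in_either

print(f"Consistent calls:   {relevant_in_both} of {relevant_in_either}")
print(f"Jaccard index:      {jaccard:.0%}")
print(f"Inconsistency rate: {inconsistency_rate:.0%}")
```

The Jaccard index and the inconsistency rate always sum to 100% here, because every document in the union is either in the intersection or it is not.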

This is the basis for my less is more postulate and why I always use as few contract review attorneys as possible in a review project. Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Three. This helps pursue the quality goal of perfect consistency. Sorry, contract lawyers, your days are numbered. Most of you can and will be replaced. You will not be replaced by robots exactly, but by other AI-enhanced human reviewers. See Why I Love Predictive Coding (The Empowerment of AI Augmented Search).

To be continued …


IT-Lex Discovers a Previously Unknown Predictive Coding Case: “FHFA v. JP Morgan, et al”

March 2, 2014

The researchers at IT-Lex have uncovered a previously unknown predictive coding case out of the SDNY, Federal Housing Finance Agency v. JP Morgan Chase & Co., Inc. et al. The et al here includes just about every other major bank in the world, each represented by one of the top 25 mega law firms in the country. The interesting orders approving predictive coding were entered in 2012, yet, until now, no one has ever talked about FHFA v. JP Morgan. That is amazing considering the many players involved.

The two main orders in the case pertaining to predictive coding are here (order dated July 24, 2012), and here (order dated July 31, 2012). I have highlighted the main passages in these long transcripts. These are ore tenus orders, but orders nonetheless. The Pacer file is huge, so IT-Lex may have missed others, but we doubt it. The two key memorandums underlying the orders are by the defendant JP Morgan’s attorneys, Sullivan & Cromwell, dated July 20, 2012, and by the plaintiff FHFA’s lawyers, Quinn Emanuel Urquhart & Sullivan, dated July 23, 2012.

The fact that these are ore tenus rulings on predictive coding explains how they have remained under the radar for so long. The orders show the mastery, finesse, and wisdom of the presiding District Court Judge Denise Cote. She was hearing her first predictive coding issue and handled it beautifully. Unfortunately, judging from the transcript, the trial lawyers arguing pro and con did not hold up as well. Still, they appear to have been supported by good e-discovery lawyer experts behind the scenes. It seems to have all turned out relatively well in the end, as a recent Order dated February 14, 2014 suggests. Predictive coding was approved and court ordered cooperation resulted in a predictive coding project that appears to have gone pretty well.

Defense Wanted To Use Predictive Coding

The case starts with the defense, primarily JP Morgan, wanting to use predictive coding and the plaintiff, FHFA, objecting. The FHFA wanted the defendant banks to review everything. Good old tried and true linear review. The plaintiff also had fallback objections on the way the defense proposed to conduct the predictive coding.

The letter memorandum by Sullivan & Cromwell for JP Morgan is only three pages in length, but has 63 pages of exhibits attached. The letter relies heavily on the then new Da Silva Moore opinion by Judge Peck. The exhibits include the now famous 2011 Grossman and Cormack law review article on TAR, a letter from plaintiff’s counsel objecting to predictive coding, and a proposed stipulation and order. Here are key segments of Sullivan and Cromwell’s arguments:

According to Plaintiff, it will not agree to JPMC’s use of any Predictive Coding unless JPMC agrees to manually review each and every one of the millions of documents that JPMC anticipates collecting. As Plaintiff stated: “FHFA’s position is straightforward. In reviewing the documents identified by the agreed-upon search terms, the JPM Defendants should not deem a document nonresponsive unless that document has been reviewed by an attorney.”

Plaintiff’s stated position, and its focus on “non-responsive” documents, necessitates this request for prompt judicial guidance. Predictive Coding has been recognized widely as a useful, efficient and reliable tool precisely because it can help determine whether there is some subset of documents that need not be manually reviewed, without sacrificing the benefit, if any, gained from manual review. Predictive Coding can also aid in the prioritization of documents that are most likely to be responsive. As a leading judicial opinion as well as commentators have warned, the assumption that manual review of every document is superior to Predictive Coding is “a myth” because “statistics clearly show that computerized searches are at least as accurate, if not more so, than manual review.” Da Silva Moore v. Publicis Groupe, 2012 U.S. Dist. LEXIS 23350, at *28 (S.D.N.Y. Feb. 24, 2012) (Peck, Mag. J.) …

JPMC respectfully submits that this is an ideal case for Predictive Coding or “machine learning” to be deployed in aid of a massive, expedited document production. Plaintiff’s claims in this case against JPMC concern more than 100 distinct securitizations, issued over a several year period by three institutions that were entirely separate until the end of that period, in 2008 (i.e., JPMorgan Chase, Bear Stearns & Co., and Washington Mutual). JPMC conservatively predicts that it will have to review over 2.5 million documents collected from over 100 individual custodians. Plaintiff has called upon JPMC to add large numbers of custodians, expand date ranges, and otherwise augment this population, which could only expand the time and expense required. Computer assisted review has been approved for use on comparable volumes of material. See, e.g., DaSilva Moore, 2012 U.S. Dist. LEXIS 23350, at *40 (noting that the manual review of 3 million emails is “simply too expensive.”).

Plaintiff’s Objections

FHFA

The plaintiff federal government agency, FHFA, filed its own three-page response letter with 11 pages of exhibits. The response objects to the use of predictive coding and to the defendant’s proposed methodology. Here is the core of their argument:

First, JPMC’s proposal is the worst of both worlds, in that the set of documents to which predictive coding is to be applied is already narrowed through the use of search terms designed to collect relevant documents, and predictive coding would further narrow that set of documents without attorney review, thereby eliminating potentially responsive documents. …

Finally, because training a predictive coding program takes a considerable amount of time, the truncated timeframe for production of documents actually renders these Actions far from “ideal” for the use of predictive coding.

The first objection on keyword search screening is good, but the second, that training would take too long, shows that the FHFA needed better experts. The machine learning training time is usually far less than the document review time, especially in a case like this, and the overall time savings from using predictive coding are dramatic. So the second objection was a real dog.

Still, FHFA made one more objection to the method that was well placed, namely that there had been virtually no disclosure as to how Sullivan and Cromwell intended to conduct the process. (My guess is, they had not really worked that all out yet. This was all new then, remember.)

[I]t has similarly failed to provide this Court with any details explaining (i) how it intends to use predictive coding, (ii) the methodology or computer program that will be used to determine responsiveness, or (iii) any safeguards that will ensure that responsive documents are not excluded by the computer model. Without such details, neither FHFA nor this Court can meaningfully assess JPMC’s proposal. See Da Silva Moore v. Publicis Groupe SA, 2012 U.S. Dist. LEXIS 23350, at *23 (S.D.N.Y. Feb. 24, 2012) (“[Defendant’s] transparency in its proposed ESI search protocol made it easier for the Court to approve the use of predictive coding.”). JPMC’s proposed order sets forth an amorphous proposal that lacks any details. In the absence of such information, this Court’s authorization of JPMC’s use of predictive coding would effectively give JPMC carte blanche to implement predictive coding as it sees fit.

Hearing of July 24, 2012

Judge Denise Cote came into the hearing having read the briefs and Judge Peck’s then-recent landmark ruling in Da Silva Moore. It was obvious from her initial comments that her mind was made up that predictive coding should be used. She understood that this mega-size case needed predictive coding to meet the time deadlines and not waste a fortune on e-document review. Here are Judge Cote’s words at pages 8-9 of the transcript:

It seems to me that predictive coding should be given careful consideration in a case like this, and I am absolutely happy to endorse the use of predictive coding and to require that it be used as part of the discovery tools available to the parties. But it seems to me that the reliability and utility of predictive coding depends upon the process that takes place in the initial phases in which there is a pool of materials identified to run tests against, and I think that some of the documents refer to this as the seed — S-E-E-D — set of documents, and then there are various rounds of further testing to make sure that the code becomes smart with respect to the issues in this case and is sufficiently focused on what needs to be defined as a responsive document. And for this entire process to work, I think it needs transparency and cooperation of counsel.

I think ultimately the use of predictive coding is a benefit to both the plaintiff and the defendants in this case. I think there’s every reason to believe that, if it’s done correctly, it may be more reliable — not just as reliable but more reliable than manual review, and certainly more cost effective — cost effective for the plaintiff and the defendants.

To plaintiff’s counsel’s credit, she quickly shifted her arguments from whether to how. Defense counsel also fell all over herself about how cooperative she had been and would continue to be, all the while implying that the other side was a closet non-cooperator.

As it turns out, very little actual conversation had occurred between the two lead counsel before the hearing, as both had preferred snarly emails and paper letters. At the hearing Judge Cote ordered the attorneys to talk first, rather than shoot off more letters, and to call her if they could not agree.

I strongly suggest you read the whole transcript of the first order to see the effect a strong judge can have on trial lawyers. Page 24 is especially instructive as to just how active a bench can be. For the second hearing of July 31, 2012, I suggest you read the transcript at pages 110-111 to get an idea as to just how difficult those attorney meetings proved to be.

As a person obsessed with predictive coding I find the transcripts of the two hearings to be kind of funny in a perverse sort of way. The best way for me to share my insights is by using the format of a lawyer joke.

Two Lawyers Walked Into A Bar

One e-discovery lawyer walks into a Bar and nothing much happens. Two e-discovery lawyers walk into a Bar and an interesting discussion ensues about predictive coding. One trial lawyer walks into a Bar and the volume of the whole place increases. Two trial lawyers walk into a Bar and an argument starts.

The 37 lawyers who filed appearances in the FHFA case walk into a Bar and all hell breaks loose. There are arguments everywhere. Memos are written, motions are filed, and the big bank clients are billed a million or more just talking about predictive coding.

Then United States District Court Judge Denise Cote walks into the Bar. All the trial lawyers immediately shut up, stand up, and start acting real agreeable, nice, and polite. Judge Cote says she has read all of the letters and they should all talk less, and listen more to the two e-discovery specialists still sitting in the bar bemused. Everything becomes a cooperative love-fest thereafter, at least as far as predictive coding and Judge Cote are concerned. The trial lawyers move on to fight and bill about other issues more within their ken.

Substantive Disputes in FHFA v. JP Morgan

The biggest substantive issues in the first hearing of July 24, 2012 had to do with disclosure and keyword filtering before machine training. Judge Cote was prepared on the disclosure issue from having read the Da Silva Moore protocol, and so were the lawyers. The judge easily pressured defense counsel to disclose both relevant and irrelevant training documents to plaintiff’s counsel, with the exception of privileged documents.

As to the second issue of keyword filtering, the defense lawyers had been told by the experts behind the scenes that JP Morgan should be allowed to keyword filter the custodians’ ESI before running predictive coding. Judge Peck had not addressed that issue in Da Silva Moore, since the defense had not asked for that, so Judge Cote was not prepared to rule on that then-new and esoteric issue. The trial lawyers were not able to articulate much on the issue either.

Judge Cote asked trial counsel if they had previously discussed this issue, not just traded memos, and they admitted they had not. So she ordered them to talk about it. It is amazing how much easier it is to cooperate and reach agreement when you actually speak, and have experts with you guiding the process. As it turns out from the second order of July 31, 2012, they reached agreement. There would be no keyword filtering.

Although we do not know all of the issues discussed by attorneys, we do know they managed to reach agreement, and we know from the first hearing what a few of the issues were. They were outlined by plaintiff’s counsel who complained that they had no idea as to how defense counsel was going to handle the following issues at page 19 of the first hearing transcript:

What is the methodology for creating the seed set? How will that seed set be pulled together? What will be the number of documents in the seed set? Who will conduct the review of the seed set documents? Will it be senior attorneys or will it be junior attorneys? Whether the relevant determination is a binary determination, a yes or no for relevance, or if there’s a relevance score or scale in terms of 1 to 100. And the number of rounds, as your Honor noted, in terms of determining whether the system is well trained and stable.

So it seems likely all these issues and more were later discussed and accommodations reached. At the second hearing of July 31, 2012, we get a pretty good idea as to how difficult the attorney meetings must have been. At pages 110-111 of the second hearing transcript we see how counsel for JP Morgan depicted these meetings and the quality of input received from plaintiff’s counsel and experts:

We meet every day with the plaintiff to have a status report, get input, and do the best we can to integrate that input. It isn’t always easy, not just to carry out those functions but to work with the plaintiff.

The suggestions we have had so far have been unworkable and by and large would have swamped the project from the outset and each day that a new suggestion gets made. But we do our best to explain that and keep moving forward.

Defense counsel then goes into what most lawyers would call “suck-up” mode to the judge and says:

We very much appreciate that your Honor has offered to make herself available, and we would not be surprised if we need to come to you with a dispute that hasn’t been resolved by moving forward or that seems sufficiently serious to put the project at risk. But that has not happened yet and we hope it will not.

After that, plaintiff’s counsel complains that defense counsel has not agreed to allow deposition transcripts and witness statements to be used as training documents. That’s right. The plaintiff wanted to include congressional testimony, depositions and other witness statements that they found favorable to their position as part of the training documents used to find relevant information in the store of custodian information.

Judge Cote was not about to be tricked into making a ruling on the spot, but instead wisely told them to go back and talk some more and get real expert input on the advisability of this approach. She is a very quick study as the following exchange at page 114 of the transcript with defense counsel after hearing the argument of plaintiff’s counsel illustrates:

THE COURT: Good. We will put those over for another day. I’m learning about predictive coding as we go. But a layperson’s expectation, which may be very wrong, would be that you should train your algorithm from the kinds of relevant documents that you might actually uncover in a search. Maybe that’s wrong and you will all educate me at some other time. I expect, Ms. Shane, if a deposition was just shot out of this e-discovery search, you would produce it. Am I right?

MS. SHANE: Absolutely, your Honor. But your instinct that what they are trying to train the system with are the kinds of documents that would be found within the custodian files as opposed to a batch of alien documents that will only confuse the computer is exactly right.

It is indeed a very interesting issue, but we cannot find a report in the case on PACER that shows how the issue was resolved. I suspect the transcripts were all excluded, unless they were within a custodian’s account.

2014 Valentine’s Day Hearing

The only other order we found in the case mentioning predictive coding is dated February 14, 2014. Most of the Valentine’s Day transcript pertains to trial lawyers arguing about perjury, and complaining that some key documents were missed in the predictive coding production by JP Morgan. But the fault appears due to the failure to include a particular custodian in the search, an easy mistake to make. That has nothing to do with whether the predictive coding succeeded.

Judge Cote handled that well, stating that no review is “perfect” and she was not about to have a redo at this late date. Her explanation at pages 5-6 of the February 14, 2014 transcript provides a good wrap up for FHFA v. JP Morgan:

Parties in litigation are required to be diligent and to act in good faith in producing documents in discovery. The production of documents in litigation such as this is a herculean undertaking, requiring an army of personnel and the production of an extraordinary volume of documents. Clients pay counsel vast sums of money in the course of this undertaking, both to produce documents and to review documents received from others. Despite the commitment of these resources, no one could or should expect perfection from this process. All that can be legitimately expected is a good faith, diligent commitment to produce all responsive documents uncovered when following the protocols to which the parties have agreed, or which a court has ordered.

Indeed, at the earliest stages of this discovery process, JP Morgan Chase was permitted, over the objection of FHFA, to produce its documents through the use of predictive coding. The literature that the Court reviewed at that time indicated that predictive coding had a better track record in the production of responsive documents than human review, but that both processes fell well short of identifying for production all of the documents the parties in litigation might wish to see.

Conclusion

There are many unpublished decisions out there approving and discussing predictive coding. I know of several more. Many of them, especially the ones that first came out and pretty much blindly followed our work in Da Silva Moore, call for complete transparency, including disclosure of irrelevant documents used in training. That is what happened in FHFA v. JP Morgan and the world did not come to an end. Indeed, the process seemed to go pretty well, even with a plaintiff’s counsel who, in the words of Sullivan and Cromwell, made suggestions every day that were unworkable and by and large would have swamped the project … but we do our best to explain that and keep moving forward. Pages 110-111 of the second hearing transcript. So it seems cooperation can happen, even when one side is clueless, and even if full disclosure has been ordered.

Since the days of 2011 and 2012, when our Da Silva Moore protocol was developed, we have had much more experience with predictive coding. We have more information on how the training actually functions with a variety of chaotic email datasets, including the new Oracle ESI collection, and even more testing with the Enron dataset.

Based on what we know now, I do not think it is necessary to make disclosure of all irrelevant documents used in training. The only documents that have a significant impact on machine learning are the borderline, grey area documents. These are the ones whose relevancy is close, and often a matter of opinion, of how you view the case. Only these grey area irrelevant documents need to be disclosed to protect the integrity of the process.

The science and other data behind that has to do with Jaccard Index classification inconsistencies, as well as the importance of mid-range ranked documents to most predictive coding algorithmic analysis. See, e.g., Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Three at the subheadings Disclosure of Irrelevant Training Documents and Conclusions Regarding Inconsistent Reviews. When you limit disclosure to grey area training documents, and relevant documents, the process can become even more efficient without any compromise in quality or integrity. This of course assumes honest evaluations of grey area documents and forthright communications between counsel. But then so does all discovery in our system of justice. So this is really nothing new, nor out of the ordinary.

All discovery depends on the integrity and trustworthiness of the attorneys for the parties. Fortunately, almost all attorneys honorably fulfill these duties, except perhaps for the duty of technology competence. That is the greatest ethical challenge of the day for all litigators.


Beware of the TAR Pits! – Part Two

February 23, 2014

This is the conclusion of a two part blog. For this to make sense please read Part One first.

Quality of Subject Matter Experts

The quality of Subject Matter Experts in a TAR project is another key factor in predictive coding. It is one that many would prefer to sweep under the rug. Vendors especially do not like to talk about this (and they sponsor most panel discussions) because it is beyond their control. SMEs come from law firms. Law firms hire vendors. What dog will bite the hand that feeds him? Yet, we all know full well that not all subject matter experts are alike. Some are better than others. Some are far more experienced and knowledgeable than others. Some know exactly what documents they need at trial to win a case. They know what they are looking for. Some do not. Some have done trials, lots of them. Some do not know where the courthouse is. Some have done many large search projects, first paper, now digital. Some are great lawyers; and some, well, you’d be better off with my dog.

The SMEs are the navigators. They tell the drivers where to go. They make the final decisions on what is relevant and what is not. They determine what is hot, and what is not. They determine what is marginally relevant, what is grey area, what is not. They determine what is just unimportant, more of the same. They know full well that some relevant is irrelevant. They have heard and understand the frequent mantra at trials: Objection, Cumulative. Rule 403 of the Federal Rules of Evidence. Also see The Fourth Secret of Search: Relevant Is Irrelevant found in Secrets of Search – Part III.

Quality of SMEs is important because the quality of input in active machine learning is important. A fundamental law of predictive coding as we now know it is GIGO, garbage in, garbage out. Your active machine learning depends on correct instruction. Although good software can mitigate this somewhat, it can never be eliminated. See: Webber & Pickens, Assessor Disagreement and Text Classifier Accuracy, SIGIR 2013 (24% more ranking depth needed to reach equivalent recall when not using SMEs, even in a small data search of news articles with rather simple issues).

Information scientists like Jeremy Pickens are, however, working hard on ways to minimize the effect of errors in SME document classifications on overall corpus rankings. Good thing too, because even one good SME will not be consistent in ranking the same documents. That is the Jaccard Index scientists like to measure. See Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Two, and search for Jaccard in my blog.

In my Enron experiments I was inconsistent in determining the relevance of the same document 23% of the time. That’s right, I contradicted myself on relevancy 23% of the time. (If you included irrelevancy coding the inconsistencies were only 2%.) Lest you think I’m a complete idiot (which, by the way, I sometimes am), the 23% rate is actually the best on record for an experiment. It is the best ever measured, by far. Other experimentally measured inconsistency rates range from 50% to 90% (with multiple reviewers). Pathetic, huh? Now you know why AI is so promising and why it is so important to enhance our human intelligence with artificial intelligence. When it comes to consistency of document identifications in large scale data reviews, we are all idiots!
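For readers who want to see the arithmetic, the Jaccard index is simply the number of documents coded relevant in both passes divided by the number coded relevant in either pass. Here is a minimal Python sketch. The document IDs and the function name are made up for illustration; they are not my Enron data and the output is not my 23% figure.

```python
def jaccard_index(first_pass, second_pass):
    """Overlap of two sets of relevant-coded document IDs: |A & B| / |A | B|."""
    a, b = set(first_pass), set(second_pass)
    union = a | b
    if not union:
        # Nothing coded relevant in either pass: treat as full agreement.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical IDs of documents coded relevant in two passes over one collection.
review_one = {"doc01", "doc02", "doc03", "doc04"}
review_two = {"doc02", "doc03", "doc04", "doc05"}

agreement = jaccard_index(review_one, review_two)
print(f"Jaccard index: {agreement:.2f}")      # 3 shared / 5 in the union = 0.60
print(f"Inconsistency: {1 - agreement:.2f}")  # 0.40
```

Note that this measure only looks at documents coded relevant at least once. Counting the large mass of documents coded irrelevant both times as agreement is what pulls the inconsistency number down so sharply.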

With these human frailty facts in mind, not only variable quality in expertise of subject matter, but also human inconsistencies, it is obvious why scientists like Pickens and Webber are looking for techniques to minimize the impact of errors and, get this, even use these inevitable errors to improve search. Jeremy Pickens and I have been corresponding about this issue at length lately. Here is Jeremy’s later response to this blog: In TAR, Wrong Decisions Can Lead to the Right Documents (A Response to Ralph Losey). Jeremy does at least concede that coding quality is indeed important. He goes on to argue that his study shows that wrong decisions, typically on grey area documents, can indeed be useful.

I do not doubt Dr. Pickens’ findings, but am skeptical of the search methods and conclusions derived therefrom. In other words, how the training was accomplished, the supervision of the learning. This is what I call here the driver’s role, shown on the triangle as the Power User and Experienced Searcher. In my experience as a driver/SME, much depends on where you are in the training cycle. As the training continues the algorithms eventually do become able to detect and respond to subtle document distinctions. Yes, it takes a while, and you have to know what and when to train on, which is the driver’s skill (for instance you never train with giant documents), but it does eventually happen. Thus, while it may not matter if you code grey area documents wrong at first, it eventually will, that is unless you do not really care about the distinctions. (The TREC overturn documents Jeremy tested on, the ones he called wrong documents, were in fact grey area documents, that is, close questions. Attorneys disagreed on whether they were relevant, which is why they were overturned on appeal.) The lack of precision in training, which is inevitable anyway even when one SME is used, may not matter much in early stages of training, and may not matter at all when testing simplistic issues using easy databases, such as news articles. In fact, I have used semi-supervised training myself, as Jeremy describes from old experiments in Pseudo Relevance Feedback. I have seen it work myself, especially in early training.

Still, the fact that some errors do not matter in early training does not mean you should not care about consistency and accuracy of training during the whole ride. In my experience, as training progresses and the machine gets smarter, it does matter. But let’s test that, shall we? All I can do is report on what I see, i.e., the anecdotal.

Outside of TREC and science experiments, in the messy real world of legal search, the issues are typically maddeningly difficult. Moreover, the difference in cost of review of hundreds of thousands of irrelevant documents can mean millions of dollars. The fine points of differentiation in matured training are needed for precision in results to reduce costs of final review. In other words, both precision and recall matter in legal search, and both are governed by the overarching legal principle of proportionality. That is not part of information science of course, but we lawyers must govern our search efforts by proportionality.
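To make the two measures concrete, here is a short Python sketch of how precision and recall are computed for a single production. All of the numbers and names are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: 8 truly relevant documents; the search retrieves 10.
relevant_docs = {1, 2, 3, 4, 5, 6, 7, 8}
retrieved_docs = {1, 2, 3, 4, 5, 6, 20, 21, 22, 23}

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

Low precision is what drives up the cost of final review, because every irrelevant document retrieved still has to be looked at by someone. Low recall means relevant documents were left behind.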

Also see William Webber’s response: Can you train a useful model with incorrect labels? I believe that William’s closing statement may be correct, either that or software differences:

It may also be, though this is speculation on my part, that a trainer who is not only a subject-matter expert, but an expert in training itself (an expert CAR driver, to adopt Ralph Losey’s terminology) may be better at selecting training examples; for instance, in recognizing when a document, though responsive (or non-responsive), is not a good training example.

I hope Pickens and Webber get there some day. In truth, I am a big supporter of their efforts and experiments. We need more scientific research. But for now, I still do not believe we can turn lead into gold. It is even worse if you have a bunch of SMEs arguing with each other about where they should be going, about what is relevant and what is not. That is a separate issue they do not address, which points to the downside of all trainers, both amateurs and SMEs alike. See: Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three.

For additional support on the importance of SMEs, see again Monica’s article, EDI-Oracle Study, where she summarizes the conclusion of Patrick Oot from the study that:

Technology providers using similar underlying technology, but different human resources, performed in both the top-tier and bottom-tier of all categories. Conclusion: Software is only as good as its operators. Human contribution is the most significant element. (emphasis in original)

Also see the recent Xerox blog, Who Prevails in the E-Discovery War of Man vs. Machine? by Gabriela Baron.

Teams that participated in Oracle without a bona fide SME, much less a good driver, well, they were doomed. The software was secondary. How could you possibly replicate the work of the original SME trial lawyers that did the first search without having an SME yourself, one with at least a similar experience and knowledge level?

This means that even with a good driver, and good software, if you do not also have a good SME, you can still end up driving in circles. It is even worse when you try to do a project with no SME at all. Remember, the SME in the automobile analogy is the navigation system, or to use the pre-digital reality, the passenger with the map. We have all seen what happens when the navigation system screws up, or the map is wrong, or more typically, out of date (like many old SMEs). You do not get to the right place. You can have a great driver, and go quite fast, but if you have a poor navigator, you will not like the results.

The Oracle study showed this, but it is hardly new or surprising. In fact, it would be shocking if the contrary were true. How can incorrect information ever create correct information? The best you can hope for is to have enough correct information to smooth out the errors. Put another way, without signal, noise is just noise. Still, Jeremy Pickens claims there is a way. I will be watching and hope he succeeds where the alchemists of old always failed.

Tabula Rasa

There is one way out of the SME frailty conundrum that I have high hopes for and can already understand. It has to do with teaching the machine about relevance for all projects, not just one. The way predictive coding works now the machine is a tabula rasa, a blank slate. The machine knows nothing to begin with. It only knows what you teach it as the search begins. No matter how good the AI software is at learning, it still does not know anything on its own. It is just good at learning.

That approach is obviously not too bright. Yet, it is all we can manage now in legal search at the beginning of the Second Machine Age. Someday soon it will change. The machine will not have its memory wiped after every project. It will remember. The training from one search project will carry over to the next one like it. The machine will remember the training of past SMEs.

That is the essential core of my PreSuit proposal: to retain the key components of the past SME training so that you do not have to start afresh on each search project. PreSuit: How Corporate Counsel Could Use “Smart Data” to Predict and Prevent Litigation. When that happens (I don’t say if, because this will start happening soon, some say it already has) the machine could start smart.

That is what we all want. That is the holy grail of AI-enhanced search: a smart machine. (For the ultimate implications of this, see the movie Her, which is about an AI-enhanced future that is still quite a few years down the road.) But do not kid yourself, that is not what we have now. Now we only have baby robots, ones that are eager and ready to learn, but do not know anything. It is kind of like 1-Ls in law school, except that when they finish a class they do not retain a thing!

When my PreSuit idea is implemented, the next SME will not have to start afresh. The machine will not be a tabula rasa. It will be able to see litigation brewing. It will help general counsel to stop lawsuits before they are filed. The SMEs will then build on the work of prior SMEs, or maybe build on their own previous work in another similar project. Then the GIGO principle will be much easier to mitigate. Then the computer will not be completely dumb; it will have some intelligence from the last guy. There will be some smart data, not just big dumb data. The software will know stuff, know the law and relevance, not just know how to learn stuff.

When that happens, then the SME in a particular project will not be as important, but for now, when working from scratch with dumb data, the SME is still critical. The smarter and more consistent the better. See Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three.

Professor Marchionini, like all other search experts, recognizes the importance of SMEs to successful search. As he puts it:

Thus, experts in a domain have greater facility and experience related to information-seeking factors specific to the domain and are able to execute the subprocesses of information seeking with speed, confidence, and accuracy.

That is one reason that the Grossman Cormack glossary builds in the role of SMEs as part of their base definition of computer assisted review:

A process for Prioritizing or Coding a Collection of electronic Documents using a computerized system that harnesses human judgments of one or more Subject Matter Expert(s) on a smaller set of Documents and then extrapolates those judgments to the remaining Document Collection.

Glossary at pg. 21 defining TAR.

Most SMEs Today Hate CARs
(And They Don’t Much Like High-Tech Drivers Either)

This is an inconvenient truth for vendors. Predictive coding is defined by SMEs. Yet vendors cannot make good SMEs step up to the plate and work with the trainers, the drivers, to teach the machine. All the vendors can do is supply the car and maybe help with the driver. The driver and navigator have to be supplied by the law firm or corporate clients. There is no shortage of good SMEs, but almost all of them have never even seen a CAR. They do not like them. They can barely even speak the language of the driver. They don’t much like most of the drivers either. They are damn straight not going to spend two weeks of their lives riding around in one of those newfangled horseless carriages.

That is the reality of where we are now. Also see: Does Technology Leap While Law Creeps? by Brian Dalton, Above the Law. Of course this will change with the generations. But for now, that is the way it is. So vendors work on error minimization. They try to minimize the role of SMEs. That is a good idea anyway, because, as mentioned, all human SMEs are inconsistent. I was lucky to only be inconsistent 23% of the time on relevance. But still, there is another obvious solution.

There is another way to deal today with the reluctant SME problem, a way that works right now with today’s predictive coding software. It is a kind of non-robotic surrogate system that I have developed, and I’m sure several other professional drivers have as well. See my CAR page for more information on this. But, in reality it is one of those things I would just have to show you in a driver education school type setting. I do it frequently. It involves acting on behalf of an SME, and dealing with the driver for them. It places them in their comfort zone, where they just make yes-or-no decisions on the close question documents, although there is obviously more to it than that. It is not nearly as good as the surrogate system in the movie Her, and of course, I’m no movie star, but it works.

My own legal subject matter expertise is, like that of most lawyers, fairly limited. I know a lot about a few things, and am a standalone SME in those fields. I know a fair amount about many more legal fields, enough to understand real experts, enough to serve as their surrogate or right hand. Those are the CAR trips I will take.

If I do not know enough about a field of law to understand what the experts are saying, then I cannot serve as a surrogate. I could still drive of course, but I would refuse to do that out of principle, unless I had a navigator, an SME, who knew what they were doing and where they wanted to go. I would need an SME willing to spend the time in the CAR needed to tell me where to go. I hate a TAR pit as much as the next guy. Plus at my age and experience I can drive anywhere I want, in pretty much any CAR I want. That brings us to the final corner of the triangle, the variance in the quality of predictive coding software.

Quality of the CAR Software

I am not going to spend a lot of time on this. No lawyer could be naive enough to think that all of the software is equally good. That is never how it works. It takes time and money to make sophisticated software like this. Anybody can simply add open source machine learning code to their review platforms. That does not take much, but that is a Model-T.

To make active machine learning work really well, to take it to the next level, requires thousands of programming hours. It takes large teams of programmers. It takes years. It takes money. It takes scientists. It takes engineers. It takes legal experts too. It takes many versions and continuous improvements of search and review software. That is how you tell the difference between ok, good, and great software. I am not going to name names, but I will say that Gartner’s so-called Magic Quadrant evaluation of e-discovery software is not too bad. Still, be aware that evaluation of predictive coding is not really their thing, or even a primary factor for rating review software.

It is kind of funny how pretty much everybody wins in the Gartner evaluation. Do you think that’s an accident? I am privately much more critical. Many well known programs are very late to the predictive coding party. They are way behind. Time will tell if they are ever able to catch up.

Still, these things do change from year to year, as new versions of software are continually released. For some companies you can see real improvements, real investments being made. For others, not so much, and what you do see is often just skin deep. Always be skeptical. And remember, the software CAR is only as good as your driver and navigator.

When it comes to software evaluation, what counts is whether the algorithms can find the documents needed or not. Even the best driver-navigator team in the world can only go so far in a clunker. But give them a great CAR, and they will fly. The software will more than pay for itself in saved reviewer time and the added security of a job well done.

Deja Vu All Over Again. 

Predictive coding is a great leap forward in search technology. In the long term, predictive coding and other AI-based software will have a bigger impact on the legal profession than did the original introduction of computers into the law office. No large changes like this are without problems. When computers were first brought into law offices they too caused all sorts of problems and had their pitfalls and naysayers. It was a rocky road at first.

I was there and remember it all very well. The Fonz was cool. Disco was still in. I can remember the secretaries yelling many times a day that they needed to reboot. Reboot! Better save. It became a joke, a maddening one. The network was especially problematic. The partner in charge threw up his hands in frustration. The other partners turned the whole project over to me, even though I was a young associate fresh out of law school. They had no choice. I was the only one who could make the damn systems work.

It was a big investment for the firm at the time. Failure was not an option. So I worked late and led my firm’s transition from electric typewriters and carbon paper to personal computers, IBM System 36 minicomputers, word processing, printers, hardwired networks, and incredibly elaborate time and billing software. Remember Manac time and billing in Canada? Remember Displaywriter? How about the eight inch floppy? It was all new and exciting. Computers in a law office! We were written up in IBM’s small business magazine.

For years I knew what every DOS operating file was on every computer in the firm. The IBM repair man became a good friend. Yes, it was a lot simpler then. An attorney could practice law and run his firm’s IT department at the same time.

Hey, I was the firm’s IT department for the first decade. Computers, especially word processing and time and billing software, eventually made a huge difference in efficiency and productivity. But at first there were many pitfalls. It took us years to create new systems that worked smoothly in law offices. Business methods always lag way behind new technology. This is clearly shown by MIT’s Erik Brynjolfsson and Andrew McAfee in their bestseller, The Second Machine Age. It typically takes a generation to adjust to major technology breakthroughs. Also see the TED Talk by Brynjolfsson, with video.

I see parallels between the 1980s and now. The main difference is that legal tech pioneers were very isolated then. The world is much more connected now. We can observe together how, like in the eighties, a whole new level of technology is starting to make its way into the law office. AI-enhanced software, starting with legal search and predictive coding, is something new and revolutionary. It is like the first computers and word processing software of the late 1970s and early 80s.

It will not stop there. Predictive coding will soon expand into information governance. This is the PreSuit project idea that I, and others, are starting to talk about. See, e.g., Information Governance Initiative. Moreover, many think AI software will soon revolutionize legal practice in a number of other ways, including contract generation and other types of repetitive legal work and analysis. See, e.g., Rohit Talwar, Rethinking Law Firm Strategies for an Era of Smart Technology (ABA LPT, 2014). The potential impact of supervised learning and other cognitive analytics tools on all industries is vast. See, e.g., Deloitte’s 2014 paper: Cognitive Analytics (“For the first time in computing history, it’s possible for machines to learn from experience and penetrate the complexity of data to identify associations.”); also see Digital Reasoning software and Paragon Science software. Who knows where it will lead the world, much less the legal profession? Back in the 1980s I could never have imagined the online, Internet-based legal practice that most of us have now.

The only thing we know for sure is that it will not come easy. There will be problems, and the problems will be overcome. It will take creativity and hard work, but it will be done. Easy buttons have always been a myth, especially when dealing with the latest advancements of technology. The benefits are great. The improvements from predictive coding in document review quality and speed are truly astonishing. And it lowers cost too, especially if you avoid the pits. Of course there are issues. Of course there are TAR pits. But they can be avoided and the results are well worth the effort. The truth is we have no choice.

Conclusion

If you want to remain relevant and continue to practice law in the coming decades, then you will have to learn how to use the new AI-enhanced technologies. There is really no choice, other than retirement. Keep up, learn the new ways, or move on. Many lawyers my age are retiring now for just this reason. They have no desire to learn e-discovery, much less predictive coding. That’s fine. That is the honest thing to do. The next generation will learn to do it, just like a few lawyers learned to use computers in the 1980s and 1990s. Stagnation and more of the same is not an option in today’s world. Constant change and education is the new normal. I think that is a good thing. Do you?

Leave a comment. Especially feel free to point out a TAR pit not mentioned here. There are many, I know, and you cannot avoid something you cannot see.


Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Three

December 8, 2013

This is part three of a three-part blog, so please read Part One and Part Two first.

The Losey Study on Inconsistencies Suggests a Promising Future for Active Machine Learning

The data from my Enron review experiment shows that relatively highly consistent relevance determinations are possible. The comparatively high overlap results achieved in this study suggest that the problem of inconsistent human relevance determinations can be overcome. All it takes is hybrid multimodal search methods, good software with features that facilitate consistent coding, good SME(s), and systematic quality control efforts, including compliance with the less is more rule.

I am not saying good results cannot be achieved with multiple reviewers too. I am just saying it is more difficult that way. It is hard to be of one mind on something as tricky as some document relevance decisions with just one reviewer. It is even more challenging to attain that level of attunement with many reviewers.

The results of my study are especially promising for reviews using active machine learning processes. Consistency of coding training documents is very important to avoid GIGO errors. That is because of the cascading effects of sensitivity to initial conditions that are inherent in machine learning. As mentioned, good software can smooth out inconsistency errors somewhat, but if the Jaccard index is too low, the artificial intelligence will be impacted, perhaps severely so. You will not find the right documents, not because there is anything wrong with the software, or anything wrong with your conception of relevance, but because you did not provide coherent instructions. You instead sent mixed messages that did not track your right conceptions. (But see the research reports of John Tredennick, CEO of Catalyst, whose chief scientist, Jeremy Pickens, is investigating the ability of their software to attain good rankings in spite of inconsistent machine training.)

The same thing can happen, of course, if your conceptions of relevance are wrong to begin with, as when you fail to use bona fide, objective SMEs to do the training. Even if their message is consistent, it may be the consistently wrong message. The trainers do not understand what the real target is, do not know what it looks like, so of course they cannot find it.

The inexperienced reviewers lack the broad knowledge of the subject matter and the evidence required to prove the case, and they lack the necessary deep understanding to have a correct conception of relevance. In situations like that, despite all of the quality control efforts for consistency, you will still be consistently wrong in your training. (Again, but see the research of Catalyst, where what they admit are very preliminary test results seem to suggest that their software can fulfill the alchemist’s dream of turning lead into gold, of taking intentionally wrong input for training and still getting better results than manual review, and even some predictive coding. Tredennick, J., Subject Matter Experts: What Role Should They Play in TAR 2.0 Training? (November 17, 2013). I will continue to monitor their research with interest, as data must trump theories, but for now remain skeptical. I am at a loss to understand how the fundamental principle of GIGO could be overcome. Does anyone else who has read the Catalyst reports have any insights or comments on their analysis?)

One information scientist I spoke with on the principle of GIGO and machine training, William Webber, explained that it might not matter too much if your trainer makes some mistakes, or even quite a few mistakes, if the documents they mistakenly mark as relevant nevertheless happen to contain vocabulary similar to that of the relevant documents. In that case the errors might not hurt the model of “a relevant vocabulary” too much. The errors will dilute the relevance model somewhat, but there may still be sufficient weight on the “relevant terms” for the overall ranking to work.

William further explained that the training errors would seriously hurt the classification system in three situations (which he admits are a bit speculative). First, errors would be fatal in situations where there is a specialized vocabulary that identifies relevant documents, and the trainer is not aware of this language. In that case key language would never make it into the relevance model. The software classification system could not predict that these documents were relevant. Second, errors would be serious if the trainers have a systematically wrong idea of relevance (rather than just being inattentive or misreading borderline cases). In that case the model will be systematically biased (but this is presumably the easiest case to QC, assuming you have an SME available to do so). Third, errors would be serious if the trainers flip too many relevant documents into the irrelevant class, so that the software classifier thinks that the “relevant vocabulary” is not really that strong an indicator of relevance after all. That is a situation where there is too much wrong information, where the training is too diluted by errors to work.
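The dilution effect in the third situation can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not Webber's analysis and not how any particular predictive coding product works: it scores a single term by a smoothed log-ratio of how often it appears in documents labeled relevant versus irrelevant, using a hypothetical 100-document collection. The function names (`term_weight`, `flip_relevant`) and the data are invented for the example.

```python
import math

def term_weight(docs, labels, term):
    """Smoothed log-ratio of how often `term` appears in documents labeled
    relevant versus documents labeled irrelevant (a toy relevance model)."""
    rel = [d for d, is_rel in zip(docs, labels) if is_rel]
    irr = [d for d, is_rel in zip(docs, labels) if not is_rel]
    # Add-one smoothing keeps the ratio finite when a count is zero.
    p_rel = (sum(term in d for d in rel) + 1) / (len(rel) + 2)
    p_irr = (sum(term in d for d in irr) + 1) / (len(irr) + 2)
    return math.log(p_rel / p_irr)

def flip_relevant(labels, n):
    """Simulate a trainer who miscodes the first n relevant documents as irrelevant."""
    flipped = list(labels)
    for i in range(n):
        flipped[i] = False
    return flipped

# Hypothetical collection: 20 truly relevant documents that use the word
# "severance" and 80 irrelevant documents that do not.
docs = [{"severance", "termination"}] * 20 + [{"lunch", "meeting"}] * 80
true_labels = [True] * 20 + [False] * 80

clean = term_weight(docs, true_labels, "severance")
some_errors = term_weight(docs, flip_relevant(true_labels, 10), "severance")
many_errors = term_weight(docs, flip_relevant(true_labels, 18), "severance")

print(f"weight with no training errors:      {clean:.2f}")
print(f"weight with 10 of 20 flipped:        {some_errors:.2f}")
print(f"weight with 18 of 20 flipped:        {many_errors:.2f}")
```

In this toy setup the weight on the relevant term stays positive but keeps shrinking as more relevant documents are flipped into the irrelevant class, which is the dilution described above. Real classifiers use many terms and more sophisticated models, so the point at which training becomes "too diluted" cannot be read off this sketch.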

Consistency Between Reviews Even Without Horizontal Quality Control Efforts

In my Enron experiment with two separate reviews I intentionally used only internal, or vertical, quality control procedures. That is one reason that the comparatively low 23% relevance inconsistency rate is so encouraging. There may have been some inconsistencies in coding in the same project, but not of the same document. That is because the methods and software I used (Kroll Ontrack’s Inview) made such errors easy to detect and correct. I made efforts to make my document coding consistent within the confines of both projects. But no efforts were made to try to make the coding consistent between the two review projects. In other words, I made no attempt in the second review to compare the decisions made in the first review nine months earlier. In fact, just the opposite was true. I avoided horizontal quality control procedures on purpose in the second project to protect the integrity of my experiment to compare the two types of search methods used. That was, after all, the purpose of my experiment, not reviewer consistency.

I tried to eliminate carryover of any kind from one project to the next, even simple carryover like consulting notes or re-reading my first review report. I am confident that if I had employed quality controls between projects the Jaccard index would have been even higher, that I would have reduced the single reviewer error rate.

Another artificial reason the error rates between the two reviews might have been so high was the fact that I used a different, inferior methodology in the second review. Again, that was inherent in the experiment to compare methods. But the second method, a monomodal review method that I called a modified Borg approach, was a foreign method to me, and one that I found quite boring. Further, the Borg method was not conducive to consistent document reviews because it involved skimming a high number of irrelevant documents. I read 12,000 Enron documents in the Borg review and only 2,500 in the first, multimodal review. When using my normal methods in the first review I found 597 relevant documents in the 2,500 documents read. That is a prevalence rate of 24%. In the Borg review I found 376 relevant documents in the 12,000 documents read. That is a prevalence of only 3.1%. That kind of low prevalence review is, I suspect, more likely to lead to careless errors.
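The prevalence figures above are simple ratios of relevant documents found to documents read. A minimal Python sketch of the arithmetic, using only the counts stated in the text (the function name `prevalence` is just an illustrative choice):

```python
def prevalence(relevant_found: int, documents_read: int) -> float:
    """Fraction of the documents read that were coded relevant."""
    if documents_read <= 0:
        raise ValueError("documents_read must be positive")
    return relevant_found / documents_read

# Counts as reported for the two Enron reviews.
multimodal = prevalence(597, 2_500)   # first review, hybrid multimodal method
borg = prevalence(376, 12_000)        # second review, modified Borg method

print(f"Multimodal review prevalence: {multimodal:.1%}")
print(f"Borg review prevalence:       {borg:.1%}")
print(f"Documents read per relevant document found: "
      f"{2_500 / 597:.1f} versus {12_000 / 376:.1f}")
```

The last line expresses the same gap another way: the number of documents a reviewer had to read, on average, for each relevant document found under each method.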

I am confident that if I had employed my same preferred hybrid multimodal methods in both reviews, that the consistency rate would have been even higher, even without additional quality control efforts. If I had done both, consistent methods and horizontal quality controls, the best results would have been attained.

In addition to improving consistency rates for a single reviewer, quality controls should also be able to improve consistency rates between multiple reviewers, at least in so far as the SME expertise can be transmitted between multiple reviewers. That in turn depends in no small part on whether the Grossman Cormack theory of review error causation is true, that inconsistencies are due to mere human error, carelessness and the like, as opposed to prior theories that relevance is always inherently subjective. If the subjective relevance theories are true, then everyone will have no choice but to just use one SME, who had better be well tuned to the judge. But, as mentioned, I do not believe in the theory that relevance is inherently subjective, so I do think multiple reviewers can be used, so long as there are multiple safeguards and quality controls in place. It will just be more difficult that way, and probably take longer.

How much more difficult, and how much longer, depends in part on the degree of subjectivity involved in the particular search project. I do not see the choice of competing theories as being all or nothing. Grossman and Cormack in their study concluded that only five percent of the relevance calls they made were subjective. It may well be higher than that on average, but there is no way it is all subjective. I think it varies according to the case and the issues. The more subjectivity involved in a project, the more that strong, consistent, SME input is needed for machine training to work successfully.

Crowd Sourcing Does Not Apply to Most Predictive Coding Work

Some think that most relevance determinations are just subjective, so SMEs are not really needed. They think that contract review lawyers will work just as well. After all, they are usually intelligent generalists. They think that more is better, and do not like the results of the studies I have discussed in this article, especially my own success as a Less is More Army of One type predictive coder. They hang their theories on crowd sourcing, and the wisdom of the crowd.

Crowd sourcing does work with some things, but not document review, and certainly not predictive coding. We are not looking for lost dogs here, where crowd sourcing does work. We are looking for evidence in what are often very complex questions. These questions, especially in large cases where predictive coding is common, are usually subject to many arcane rules and principles of which the crowd has no knowledge, or worse, has wrong knowledge. Multiple wrongs do not make a right.

Here is a key point to remember on the crowd sourcing issue: the judge makes the final decisions on relevance, not the jury. Crowd sourcing might help you to predict the final outcome of a jury trial; juries are, after all, like small crowds with no particular expertise, just instructions from the judge. Crowd sourcing will not, however, help you to predict how a judge will rule on legal issues. Study of the judge’s prior rulings is a much better guide (perhaps along with, as some contend, what the judge had for breakfast). The non-skilled reviewers, the crowd, have little or nothing to offer in predicting an expert ruling. To put this mathematically, no matter how many zeros you add together, the total sum is always still zero.

Bottom line, you cannot crowd-source highly specialized skills. When it comes to specialized knowledge, the many are not always smarter than the few.

We all know this on a common sense level. Think about it. Would you want a crowd of nurses to perform surgery on you? Or would you insist on one skilled doctor? Of course you would want to have an SME surgeon operate on you, not a crowd. You would want a doctor who specializes in the kind of surgery you needed. One who had done it many times before. You cannot crowd source specialized skills.

The current facile fascination with crowd sourcing is trendy to be sure, but misplaced when it comes to most of the predictive coding work I see. Some documents, often critical ones, are too tricky, too subtle, for all but an experienced expert to recognize their probative value. Even documents that are potentially critical to the outcome of a case can be missed by non-experts. Most researchers critiquing the SME theory of predictive coding do not seem to understand this. I think that is because most are not legal experts, not experienced trial attorneys. They fail to appreciate the complexity and subtle nuances of the law in general, and evidence in particular.

They also fail to apprehend the enormous differences in skill levels and knowledge between attorneys. The law, like society, is so complex now that lawyers are becoming almost as specialized as doctors. We can only know a few fields of law. Thus, for example, just as you would not want a podiatrist to perform surgery on your eye, you would not want a criminal lawyer to handle your breach of contract suit.

To provide another example, if it were an area of law in which I have no knowledge, such as immigration law, I could read a hot document and not even know it. I might even think it was irrelevant. I would lack the knowledge and frame of reference to grasp its significance. The kind of quick training that passes muster in most contract lawyer reviews would not make much of a difference. That is because of complexity, and because the best documents are often the unexpected ones, the ones that only an expert would realize are important when they see one.

In the course of my 35 years of document review I have seen many inexperienced lawyers fail to recognize, or misunderstand, key documents on numerous occasions, including myself in the early days, and, to be honest, sometimes even now (especially when I am not the first-level SME, but just a surrogate). That is why partners supervise and train young lawyers, day in and day out for years. Although contract review lawyers may well have the search skills, and be power-users with great software skills, and otherwise be very smart and competent people, they lack the all-important specialized subject matter expertise. As mentioned before, other experiments have shown that subject matter expertise is the most important of the three skill-sets needed for a good legal searcher. That is why you should not use contract lawyers to do machine training, at least in most projects. You should use SMEs. At the very least you should use an SME for quality control.

I will, however, concede that there may be some review projects where an SME is not needed at all, where multiple reviewers would work just fine. A divorce case for instance, where all of the reviewers might have an equally keen insight into sexy emails, or sexting, and no SMEs are needed. Alas, I never see cases like that, but I concede they are possible. It could also work in simplistic topics and non-real-world hypotheticals. That may explain some of the seemingly contra research results from Catalyst that rely on TREC data, not real world, complex, litigation data.

Conclusions Regarding Inconsistent Reviews

The data from the experiments on inconsistent reviews suggest that when only one human reviewer is involved, a reviewer who is also an experienced SME, the overall consistency rates in review are much higher than when multiple non-SME reviewers are involved (contract reviewers in the Roitblat, Kershaw and Oot study) (77% v 16%), or even when multiple SMEs are involved (retired intelligence officers in the Voorhees study) (77% v 45% with two SMEs and 30% with three SMEs). These comparisons are shown visually in this graph.

[Graph: review consistency rates]

These results also suggest that with one SME reviewer the classification of irrelevant documents is nearly uniform (99%), and that the inconsistencies primarily lie in relevant categorizations (77% Jaccard) of borderline relevant documents. (A caveat should be made that this observation is based on unfiltered data, and not a keyword collection or data otherwise distorted with artificially high prevalence rates.)

The overall Agreement rate of 98%+ of all relevancy determinations, including irrelevant classifications where almost all classifications are easy and obvious, suggests that the very low Jaccard index rates measured in previous studies of 16% to 45% were more likely caused by human error, not document relevance ambiguity or genuine disagreement on the scope of relevance. A secondary explanation for the low scores is lack of significant subject matter expertise, such that the reviewers were not capable of recognizing a clearly relevant document when they saw one. Half of the TREC reviews were done by volunteer law students where such mistakes could easily happen. As I understand the analysis of Grossman and Cormack, they would consider this to be mere error, as opposed to a difference of opinion.

Even if you only consider the determinations of relevancy, and exclude determinations of irrelevancy, the 77% Jaccard index for one reviewer is still significantly greater than the prior 16% to 45% consistency rates. The data on inconsistencies from my experiment thus generally support the conclusions of Cormack and Grossman that most inconsistencies in document classifications are due to human error, not the presence of borderline documents or the inherent ambiguity of all relevancy determinations. Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012). Of the 3,274 different documents that I read in both projects during my experiment, only 63 were seen to be borderline, grey area types, which is less than 2%. The rest, 3,211 documents, were consistently coded. This is shown in the graph below.

[Graph: consistently coded versus borderline documents in the Enron reviews]

There were almost certainly more grey area relevant documents than 63 in the 3,274 documents reviewed. But they did not come to my attention in the post hoc analysis because my determinations in both projects were consistent in review of the other borderline documents. Still, the findings support the conclusions of Grossman and Cormack that less than 5% of documents in a typical unfiltered predictive coding review project are of a borderline grey area type. In fact, the data from my study supports the conclusion that only 2% of the total documents subject to relevance were grey area types, that 98% of the judgment calls were not subjective. I think this is a fair assessment for the unfiltered Enron data that I was studying, and the relatively simple relevance issue (involuntary employment termination) involved.
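The difference between the Jaccard index (overlap of relevant calls only) and the overall agreement rate (all calls, relevant and irrelevant) is easy to miss, so here is a minimal Python sketch of both measures. The document IDs and the two sets of relevant calls are hypothetical, chosen only to show how a collection with few relevant documents can produce very high overall agreement alongside a noticeably lower Jaccard index. The last lines use the actual counts stated above (63 borderline documents out of 3,274 read in both projects).

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: documents both reviews coded relevant,
    divided by documents either review coded relevant."""
    union = a | b
    if not union:
        return 1.0  # neither review found anything relevant: trivially consistent
    return len(a & b) / len(union)

def overall_agreement(a: set, b: set, all_docs) -> float:
    """Share of all documents given the same call (relevant or irrelevant) in both reviews."""
    all_docs = list(all_docs)
    same = sum((d in a) == (d in b) for d in all_docs)
    return same / len(all_docs)

# Hypothetical example: 1,000 documents, two reviews with overlapping relevant calls.
all_docs = range(1000)
review_one = set(range(0, 50))    # IDs coded relevant in the first review
review_two = set(range(10, 55))   # IDs coded relevant in the second review

print(f"Jaccard index:     {jaccard(review_one, review_two):.1%}")
print(f"Overall agreement: {overall_agreement(review_one, review_two, all_docs):.1%}")

# Counts from the text: documents read in both projects, and those found borderline.
read_in_both, borderline = 3_274, 63
print(f"Grey area share:   {borderline / read_in_both:.1%}")
print(f"Consistent share:  {(read_in_both - borderline) / read_in_both:.1%}")
```

Because irrelevant documents dominate an unfiltered collection and are mostly easy calls, overall agreement will usually sit well above the Jaccard index, which is why both figures are reported separately in the discussion above.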

The percentage of grey area documents where the relevance determinations are subjective and arguable may well be higher than 5%. More experiments are needed and nothing is proven by only a few tests. Still, my estimate, based on general experience and the Enron tests, is that when you are only considering relevant documents, subjective calls could average as high as 20%. (When considering all judgments, relevant and irrelevant, it is under 5% subjective.) Certainly subjectivity is a minority cause of inconsistent relevance determinations.

The data does not support the conclusion that relevance adjudications are inherently subjective, or mere idiosyncratic decisions. I am therefore confident that our legal traditions rest on solid relevance ground, not quicksand.

But I also understand that this solid ground in turn depends on competence, legal expertise, and a clear objective understanding of the rules of law and equity, not to mention the rules of reason and common sense. That is what legal training is all about. It always seems to come back to that, does it not?

Disclosure of Irrelevant Training Documents

These observations, especially the high consistency of review of irrelevance classifications (99%), support the strict limitation of disclosure of irrelevant documents as part of a cooperative litigation discovery process. Instead, only documents that a reviewer knows are of a grey area type or likely to be subject to debate should be disclosed. Even then the disclosure need not include the actual documents, but rather a summary and dialogue on the issues raised.

During my experimental review projects of the Enron documents, much like my reviews in real-world legal practice that I cannot speak of, I was personally aware of the ambiguous type grey area documents when originally classifying these documents. They were obvious because it was difficult to decide if they were within the border of relevance, or not. I was not sure how a judge would rule on the issue. The ambiguity would trigger an internal debate where a close question decision would ultimately be made. It could also trigger quality control efforts, such as consultations with other SMEs about those documents, although that did not happen in my Enron review experiment. In practice it does happen.

Even when limiting disclosure of irrelevant documents to those that are known to be borderline, disclosure of the actual documents themselves may often be unnecessary. Instead, a summary of the documents with explanation of the rationale as to the ultimate determination of irrelevance may suffice. The disclosure of a description of the borderline documents will at least begin a relevancy dialogue with the requesting party. Only if the abstract debate fails to reach agreement should disclosure of the actual documents be required. Even then it could be done in camera to a neutral third-party, such as a judge or special master. Alternatively, disclosure could be made with additional confidentiality restrictions, such as redactions, pending a ruling by the court.

Conclusion

Some relevance determinations certainly do include an element of subjectivity, of flexibility, and the law is used to that. But not all. Only a small minority. Some relevance determinations are more opinion than fact. But not all. Only a small minority. Some relevance determinations are more art than science. But not all. Only a small minority. Therefore, consistent and reliable relevance determinations by trained legal experts are possible, especially when good hybrid multimodal methods are used, along with good quality controls. (Good software is also important, and, as I have said many times before, some software on the market today is far better than others.)

The fact that it is possible to attain consistent coding is good news for legal search in general and especially good news for predictive coding, with its inherent sensitivity to initial conditions and cascading effects. It means that it is possible to attain the kind of consistent training needed for active machine learning to work accurately and efficiently, even in complex real-world litigation.

The findings of the studies reviewed in this article also support the use of SMEs with in-depth knowledge of the legal subject, and the use of as few SMEs to do the review as possible – Less Is More. These studies also strongly support that the greatest consistency in document review arises from the use of one SME only. By the way, despite the byline in Monica Bay’s article, EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov., 2013), that “Phase I of the study shows that older lawyers still have e-discovery chops and you don’t want to turn EDD over to robots,” the age of the lawyers is irrelevant. The best predictive coding trainers do not have to be old, they just have to be SMEs and have good search skills. In fact, not all SMEs are old, although many may be. It is the expertise and skills that matter, not age per se.

The findings and conclusions of the studies reviewed in this article also reinforce the need for strong quality control measures in large reviews where multiple reviewers must be used, such as second-pass reviews, or reviews led by traditionalists. This is especially true when the reviewers are relatively low-paid, non-SMEs. Quality controls detecting inconsistencies in coding and other possible human errors should be a part of all state-of-the-art software, and all legal search and review methodologies.

Finally, it is important to remember that good project management skills are important to the success of any project, including legal search. That is true even if you are talking about an Army of One, which is my thing. Skilled project management is even more important when hundreds of reviewers are involved. The effectiveness of any large-scale document review, including its quality controls, always depends on the project management.

