Concept Drift and Consistency: Two Keys To Document Review Quality – Part Two

This is Part Two of this blog. Please read Part One first.

Concept Freeze

In most complex review projects the understanding of relevance evolves over time, especially at the beginning of a project. This is concept drift. It evolves as the lawyers’ understanding evolves. It evolves as the facts unfold in the documents reviewed and other sources, including depositions. The concept of relevance shifts as the case unfolds with new orders and pleadings. This is a good thing. Its opposite, concept freeze, is not.

The natural shift in relevance understanding is well-known in the field of text retrieval. Consider, for instance, the previously cited classic study by Ellen M. Voorhees, the computer scientist at the National Institute of Standards and Technology in charge of TREC, where she noted:

Test collections represent a user’s interest as a static set of (usually binary) decisions regarding the relevance of each document, making no provision for the fact that a real user’s perception of relevance will change as he or she interacts with the retrieved documents, or for the fact that “relevance” is idiosyncratic.

Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt 697 (2000) at page 714 (emphasis added). (The somewhat related term query drift in information science refers to a different phenomenon in machine learning. In query drift the concept of document relevance unintentionally changes from the use of indiscriminate pseudo-relevance feedback. Büttcher, Clarke & Cormack, Information Retrieval: Implementing and Evaluating Search Engines (MIT Press 2010) at pg. 277. This can lead to severe negative relevance feedback loops.)

In concept drift the concept of what is relevant changes as a result of:

  1. Trying to apply the abstract concepts of relevance to the particular documents reviewed, and
  2. Changes in the case itself over time from new evidence, stipulations and court orders.

The word drift is somewhat inappropriate here. It suggests inadvertence, a boat at the mercy of a river’s current, drifting out of control. That is misleading. The kind of concept drift here intended is an intentional drift. The change is under the full conscious control of the legal team. The change must also be implemented in a consistent manner by all reviewers, not just one or two. As discussed, this includes retroactive corrections to prior document classifications. Concept drift is more like a racing car’s controlled drift around a corner. That is the more appropriate image.

In legal search relevance should change, should evolve, as the full facts unfold. Although concept drift is derived from a scientific term, it is a phenomenon well-known to trial lawyers. If a lawyer’s concept of relevance does not change at all, if it stays frozen, then they are either in a rare black swan type of case, or the document review project is being mismanaged. It is usually the latter. The concept of relevance has stagnated. It has not evolved or been refined. It is instead static, dead. Sometimes this is entirely the fault of the SME, for a variety of reasons. But typically the poor project management is a group effort. Proper execution of the first step in the eight-step workflow for document review, the communication step, will usually prevent concept freeze. Although communication is naturally the first step in the workflow, it should continue throughout a project.

[Diagram: the eight-step Predictive Coding 3.0 document review workflow]

The risk of concept freeze is, however, inherent in all large document review projects, not just ones accelerated by predictive coding. In fact, projects using predictive coding are somewhat protected from this problem. Good machine learning software that makes suggestions, including suggestions that disagree with prior human coding, can sometimes prevent relevance stagnation by forcing humans to reconsider their conceptions.

No matter the cause or the type of search method used, a concept freeze at the beginning of a review project, the most intense time for relevance development, is a big red flag. It should trigger a quality control audit. An early concept freeze suggests that the reviewers, the people who manage and supervise them, and the SMEs may not be communicating well, or may not be studying the documents closely enough. It is a sign of a project that has never gotten off the ground, an apathetic enterprise composed of people just going through the motions. It suggests a project dying at the time it should be busy being born. It is a time of silence about relevance when there should be constant discussion among team members, especially with the reviewers. Good projects have many, many emails circulating with questions, analysis, debate, decisions and instructions.

All of this reminds me of Bob Dylan’s great song, It’s Alright, Ma (I’m Only Bleeding):

To understand you know too soon
There is no sense in trying …

The hollow horn plays wasted words,
Proves to warn
That he not busy being born
Is busy dying. …

An’ though the rules of the road have been lodged
It’s only people’s games that you got to dodge
And it’s alright, Ma, I can make it.

This observation of the need for relevance refinement at the beginning of a project is based on long experience. I have been involved with searching document collections for evidence for possible use at trial for thirty-six years. This includes both the paper world and electronically stored information. I have seen this in action thousands of times. Since I like Dylan so much, here is my feeble attempt to paraphrase:

Relevance is rarely simple or static,
Drift is expected,
Complexities of law and fact arise and
Are work product protected.

An’ though the SME’s rules of relevance have been lodged
They must surely evolve, improve or be dodged
And it’s alright, Shira, I can make it.

My message here is that the absence of concept drift – concept freeze – is a warning sign. It is an indicator of poor project management, typically derived from inadequate communication or dereliction of duty by one or more of the project team members. There are exceptions to this general rule, of course, especially in simple cases, or ones where the corpus is well known. Plus, sometimes you do get it right the first time, just not very often.

The Wikipedia article on concept drift notes that such change is inherent in complex phenomena that are governed not by fixed laws of nature but by human activity, and that periodic retraining, also known as refreshing, of any model is therefore necessary. I agree.

Determination of relevance in the law is a very human activity. In most litigation it is also a very complex phenomenon. As the relevance concept changes, the classifications need to be refreshed and documents retrained according to the latest relevance model. This means that reviewers need to go back and change the prior classifications of documents. The classifications need to be corrected for uniformity. Here the quality factor of consistency comes into play. It is time-consuming to go back and make corrections, but important. Without these corrections and consistency efforts, the impact of concept drift can be very disruptive, and can result in decreased recall and precision. Important documents can be missed: documents that you need to defend or prosecute your case, or ones that the other side needs. In egregious situations that last type of error can be sanctionable.
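To make the refreshing idea concrete, here is a minimal sketch, in Python with scikit-learn, of what retraining a relevance model against corrected labels might look like. The data structures and helper names are hypothetical illustrations, not the workflow of any particular review platform.

```python
# Minimal sketch of "refreshing" a relevance model after retroactive label
# corrections. The pipeline and data layout are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def retrain_relevance_model(training_docs):
    """training_docs: list of (text, label) pairs, where label is 1 for
    relevant and 0 for irrelevant, reflecting the CURRENT, corrected
    relevance concept rather than the original calls."""
    texts, labels = zip(*training_docs)
    model = make_pipeline(TfidfVectorizer(min_df=2),
                          LogisticRegression(max_iter=1000))
    model.fit(list(texts), list(labels))
    return model

# After each round of SME-driven corrections, rebuild the model from the
# corrected labels and re-rank the unreviewed documents, for example:
#   model = retrain_relevance_model(corrected_training_set)
#   scores = model.predict_proba(unreviewed_texts)[:, 1]  # probability of relevance
```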

Here is a quick example of the retroactive correction work in action. Assume that one type of document, say a Spreadsheet X type, has been found to be irrelevant for the first several days, such that there are now hundreds, perhaps thousands, of various documents coded irrelevant with information pertaining to Spreadsheet X. Assume that a change is made, and the SME now determines that a new type of this document is relevant. The SME realizes, or is told, that there are many other documents on Spreadsheet X that will be impacted by the decision on this new form. A conscious, proportional decision is then made to change the coding on all of the previously coded documents impacted by this decision. In this hypothetical the scope of relevance expanded. In other cases the scope of relevance might tighten. It takes time to go back and make such corrections in prior coding, but it is well worth it as a quality control effort. Concept drift should not be allowed to breed inconsistency.
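The mechanical side of that correction can be sketched as a simple query. The sketch below assumes, purely for illustration, that the coding database is available as a pandas DataFrame with hypothetical doc_id, text and coding columns; in practice the review platform’s own search and batch-tagging tools do this work.

```python
# Hypothetical sketch: flag previously coded-irrelevant documents that
# mention the newly relevant Spreadsheet X material for re-review.
import pandas as pd

def flag_for_recoding(df: pd.DataFrame, term: str = "Spreadsheet X") -> pd.DataFrame:
    """df is assumed to have columns: doc_id, text, coding ('relevant'/'irrelevant')."""
    hits = df[(df["coding"] == "irrelevant") &
              df["text"].str.contains(term, case=False, na=False)]
    # Route the hits back to reviewers (or the SME) rather than bulk
    # re-coding them blindly; the relevance call is still a human decision.
    return hits[["doc_id"]].assign(reason=f"re-review: relevance of {term} expanded")
```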

A static understanding of relevance by document reviewers, especially at the beginning of a project, is a red flag of mismanagement. It suggests that the subject matter expert (“SME”), the lawyer or lawyers in charge of determining what is relevant to the particular issues in the case, is not properly supervising the attorneys who are actually looking at the documents, the reviewers. If SMEs are not properly supervising the review, if they do not do their job, then the net result is a loss of quality. This is the kind of quality loss where key documents can be overlooked. In this situation reviewers are forced to make their own decisions on relevance when new kinds of documents are encountered. This exacerbates the natural inconsistencies of human reviewers (more on that later). Moreover, it forces the reviewers to try to guess what the expert in charge of the project might consider to be relevant. When in doubt, the tendency of reviewers is to guess on the broad side. Over-extended notions of relevance are often the result.

A review project of any complexity that does not run into some change in relevance at the beginning is probably poorly managed and making many other mistakes. The cause may not lie with the SME at all. It may be the fault of the document reviewers or mid-level management. The reviewers may not be asking questions when they should, and they may not be sharing their analysis of grey area documents. They may not care, or talk at all. The target may be vague and elusive. No one may have a good idea of relevance, much less a common understanding.

This must be a team effort. If audits show that any reviewers or management are deficient, they should be quickly re-educated or replaced. If there are other quality control measures in place, then the potential damage from such mismanagement may be limited. In other review projects, however, this kind of mistake can go undetected and be disastrous. It can lead to an expensive redo of the project and even court sanctions for failure to find and produce key documents.

SMEs must closely follow the progress of the document review. They must supervise the reviewers, at least indirectly. Both the law and legal ethics require that. SMEs should not only instruct reviewers at the beginning of a project on relevancy, they should also be consulted whenever new document types are seen. This should ideally happen in near real time, but at least on a daily basis, with coding on that document type suspended until the SME’s decisions are made.

With a proper surrogate SME agency system in place, this need not be too burdensome for the senior attorneys in charge. I have worked out a number of different solutions for that SME burden problem. One way or another, SME approval must be obtained during the course of a project, not at the end. You simply cannot afford to wait until the end to verify relevance concepts. By then the job can become overwhelming, and the risks of errors and inefficiencies too high.

Even if consistency of reviewers is assisted, as it should be, by similarity search methods, the consistent classification may still be wrong. The production may well reflect what the SME thought months earlier, before the review started, whereas what matters is what the SME thinks at the time of production. A relevance concept that does not evolve over time, that does not drift toward the truth, is usually wrong. A document review project that ties all document classification to the SME’s initial ideas of relevance is usually doomed to failure. These initial SME concepts are typically formed at the beginning of the case, after only a few relevant documents have been reviewed. Sometimes they are formed completely in the abstract, with the SME having seen no documents at all. These initial ideas are only very rarely one hundred percent right. Moreover, even if the ideas, the concepts, are completely right from the beginning and do not change, the application of these concepts to the documents seen will change. Modifications and shifts of some sort, and to some degree, are almost always required as the documents reveal what really happened and how. Modifications can also be driven by demands of the requesting party and, most importantly, by rulings of the court.

Consistency

Consistency, as described before, refers to the coding of the same or similar types of documents in the same manner. This means that:

  1. A single reviewer determines relevance in a consistent manner throughout the course of a review project.
  2. Multiple reviewers determine relevance in a consistent manner with each other.

As mentioned, the best software now makes it possible to identify many of these inconsistencies, at least the easy ones involving near duplicates. Actual, exact duplicates are rarely a problem, as they are so easy to detect, but not all software is good at detecting near duplicates, email threads, and the like. Consistency in adjudications of relevance is a quality control feature that I consider indispensable. Ask your vendor how their software can help you find and correct all obvious inconsistencies, and mitigate the others. The real challenge, of course, is not near duplicates, but documents that have the same meaning in very different forms.
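As an illustration of the kind of consistency check good software automates, here is a rough sketch in Python using difflib to surface near-duplicate pairs that received conflicting coding. The brute-force comparison and the 0.9 similarity threshold are assumptions for illustration; commercial tools use far more scalable techniques such as hashing, shingling and email threading.

```python
# Rough sketch: surface near-duplicate document pairs with conflicting coding.
# Brute-force O(n^2) comparison, suitable only as an illustration.
from difflib import SequenceMatcher

def inconsistent_near_duplicates(docs, threshold=0.9):
    """docs: list of (doc_id, text, coding) tuples; coding is 'relevant' or 'irrelevant'."""
    flagged = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            id_a, text_a, code_a = docs[i]
            id_b, text_b, code_b = docs[j]
            if code_a == code_b:
                continue  # only conflicting codings need human reconciliation
            similarity = SequenceMatcher(None, text_a, text_b).ratio()
            if similarity >= threshold:
                flagged.append((id_a, id_b, round(similarity, 3)))
    return flagged  # pairs a reviewer or the SME should reconcile
```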


Scientific research has shown that inconsistency of relevance adjudications is inherent in all human review, at least in large document review projects requiring complex analysis. For authority I refer again to the previously cited study by Ellen M. Voorhees, the computer scientist at the National Institute of Standards and Technology in charge of TREC. Voorhees found that the average rate of agreement between two human experts on documents determined to be relevant was only 43%. She called that overlap. This means that two manual reviewers disagreed with each other as to document relevance 57% of the time. Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, supra at pages 700-701.

Note that the reviewers in this study were all experts, all retired intelligence officers skilled in document analysis. Like litigation lawyers, they all had similar backgrounds and training. When the relevance determinations of a third reviewer were considered in this study, the average overlap rate dropped to 30%. That means the three experts disagreed in their independent analysis of document relevance 70% of the time. The 43% and 30% overlap they attained was higher than in earlier TREC studies of inconsistency. The overlap rate is shown in Table 1 of her paper at page 701.

[Screenshot: Table 1 from the Voorhees paper, showing the overlap rates]

Voorhees concluded that this data was evidence of the variability of relevance judgments. Id.
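Voorhees’ overlap measure is simply the size of the intersection of the reviewers’ relevant sets divided by the size of their union, the same ratio referred to later in this post as the Jaccard index. A small sketch, with made-up document IDs, shows how quickly it drops as reviewers are added:

```python
# Overlap (Jaccard index) between sets of documents judged relevant.
# Document IDs are invented for illustration only.
def overlap(*relevant_sets):
    """Size of the intersection divided by size of the union."""
    sets = [set(s) for s in relevant_sets]
    return len(set.intersection(*sets)) / len(set.union(*sets))

reviewer_a = {"D1", "D2", "D3", "D5", "D8"}
reviewer_b = {"D2", "D3", "D5", "D9"}
reviewer_c = {"D3", "D5", "D7"}

print(overlap(reviewer_a, reviewer_b))              # 0.5  (two reviewers)
print(overlap(reviewer_a, reviewer_b, reviewer_c))  # ~0.29 (three reviewers, lower)
```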

A 70% inconsistency rate on relevance classifications among three experts is troubling, and it underscores the need to check and correct for human error, especially when expert decisions are required, as is the case with all legal search. I assume that agreement rates would be much higher in a simple search matter, such as finding all articles in a newspaper collection relevant to a particular news event. That does not require expert legal analysis. It requires very little analysis at all. For that reason I would expect human reviewer consistency rates to be much higher in such simple searches. But that is not the world of legal search, where complex analysis of legal issues requiring special training is the norm. So for us, where document reviews are usually done with teams of lawyers, consistency among human reviewers is a real quality control problem that must be carefully addressed.

The Voorhees study was borne out by a later study of a legal search project by Herbert L. Roitblat, PhD, Anne Kershaw and Patrick Oot. Document Categorization in Legal Electronic Discovery: Computer Classification vs. Manual Review, Journal of the American Society for Information Science and Technology, 61 (2010). There, a total of 1,600,047 documents were reviewed by contract attorneys in a real-world linear review responding to a Second Request (the original Verizon merger review discussed below). A total of 225 attorneys participated in the review. The attorneys spent about 4 months, working 7 days a week and 16 hours per day, on this review.

A few years after the Verizon review, two re-review teams of professional reviewers (Team A and Team B) were retained by the Electronic Discovery Institute (EDI), which sponsored the study. The authors found that the overlap (agreement in relevance coding) between Team A and the original production was 16.3%, and the overlap between Team B and the original production was 15.8%. This means an inconsistency rate on relevance of about 84%. The overlap between the two re-review Teams A and B was a little better at 28.1%, meaning an inconsistency rate of 72%. Better, but still terrible, and once again demonstrating how unreliable human review alone is without the assistance of computers, especially without active machine learning and the latest quality controls. Their study reaffirmed an important point about inconsistency in manual linear review, especially when the review requires complex legal analysis. It also showed the incredible cost savings readily available from using advanced search techniques to filter documents, instead of linear review of everything.

The total cost of the original Verizon merger review was $13,598,872.61, or about $8.50 per document. Apparently M&A has bigger budgets than litigation. Note the cost comparison to the 2015 e-Discovery Team effort at TREC, reviewing seventeen million documents at an average review speed of 47,261 files per hour. The Team’s average cost per document was very low, but this cost is not yet possible in the real world for a variety of reasons. Still, it is illustrative of the state of the art. It shows what is next in legal practice. Examining what we did at TREC: if you assume a billing rate of $500 per hour for the e-Discovery Team attorneys, then the cost per document for first-pass attorney review would have been about a penny a document. Compare that to $8.50 per document for linear review without active machine learning, concept search, and parametric Boolean keyword searches.
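The per-document arithmetic behind that comparison, using only the figures quoted above (the $500 hourly rate is the assumption stated in the text):

```python
# Cost-per-document arithmetic from the figures cited above.
verizon_total_cost = 13_598_872.61
verizon_docs = 1_600_047
print(verizon_total_cost / verizon_docs)          # ~8.50 dollars per document

trec_docs_per_hour = 47_261
assumed_hourly_rate = 500                         # assumed attorney billing rate
print(assumed_hourly_rate / trec_docs_per_hour)   # ~0.0106, about a penny per document
```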

The conclusions are obvious, and yet there are many still ill-informed corporate clients that sanction the use of horse-and-buggy linear reviews, along with their rich drivers, just like in the old days of 2008. Many in-house counsel still forgo the latest CARs with AI-enhanced drivers. Most do not know any better. They have not read the studies, even the widely publicized EDI studies. Too bad, but that does spell opportunity for the corporate legal counsel who do keep up. More and more of the younger ones do get it, and have the authority to make sweeping changes. The next generation will be all about active machine learning, lawyer augmentation, and super-fast smart robots, with and without mobility.

Clients still paying for large linear review projects are not only wasting good money and getting poor results in the process, but no one is having any fun in such slow, boring reviews. I will not do it, no matter what the law firm profit potential from such price gouging. It is a matter of both professional pride and ethics, plus work enjoyment. Why would anyone other than the hopelessly greedy, or incompetent, mosey along at a snail’s pace when you could fly, when you could get there much faster, do a better job overall, and find more relevant documents?

The gullibility of some in-house counsel to keep paying for large-scale linear reviews by armies of lawyers is truly astounding. Insurance companies are waking up to this fact. I am helping some of them to clamp down on the rip offs. It is only a matter of time before everyone leaves the horse behind and gets a robot driven CAR. You can delay such progress, we are seeing that, but you can never stop it.


By the way, since my search method is Hybrid Multimodal, it follows that my Google CAR has a steering wheel to allow a human to drive. That is the Hybrid part. The Multimodal part means the car has a stick shift, with many gears and search methods, not just AI alone. All of my robots, including the car, will have an on-off switch and manufacturer certifications of compliance with Isaac Asimov’s “Three Laws of Robotics.”

Back to the research on consistency, the next study that I know about was by Gordon Cormack and Maura Grossman: Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012). It considered data from the TREC 2009 Legal Track Interactive Task. It attempts to rebut the conclusion by Voorhees that the inconsistencies she noted are the result of inherently subjective relevance judgments, as opposed to human error.

As to the seven topics considered at TREC in 2009, Cormack and Grossman found that the average agreement for documents coded responsive by the first-pass reviewers was 71.2 percent (28.8% inconsistent), while the average agreement for documents coded non-responsive by the first-pass reviewer was 97.4 percent (2.6% inconsistent). Id. at 274 (parentheticals added). Over the seven topics studied in 2009 there was a total overlap of relevance determinations of 71.2%. Id. at 281. This is a big improvement, but it still means inconsistent calls on relevance occurred about 29% of the time, and this was using the latest circa-2009 predictive coding methods. Also, these scores are in the context of a TREC protocol that allowed participants to appeal TREC relevance calls that they disagreed with. The 71% overlap for two reviewers’ relevance calls in the Grossman and Cormack study holds only if you assume all unappealed decisions were correct. If you consider only the appealed decisions, the agreement rate was just 11%.

Grossman and Cormack concluded in this study that only 5% of the inconsistencies in determinations of document relevance were attributable to differences in opinion, and that 95% were attributable to human error. They concluded that most inconsistent reviewer categorizations were caused by carelessness, such as not following instructions, and were not caused by differences in subjective evaluations. I would point out that carelessness also impacts analysis. So I do not see a bright line, as they apparently do, between “differences of opinion” and “human error.” Additional research into this area should be undertaken. But regardless of the primary cause, the inconsistencies noted by Cormack and Grossman highlight once again the need for quality controls to guard against such human errors.

The final study with new data on reviewer inconsistencies was mine. A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents (2013). In this experiment I reviewed 699,082 Enron documents by myself, twice, in two review projects about six months apart. The projects were exactly the same: same issues, same relevance standards. The documents were also the same. The only difference between the two projects was in the type of predictive coding method used. Because the projects were over six months apart, I had little or no recollection of the documents from one review to the next.

In a post hoc analysis of these two reviews I discovered that I had made 63 inconsistent relevance determinations of the same documents. See Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Two (12/2/13). Yes, human error at work, with no quality controls in place to try to contain such inconsistency errors. I think these were errors in analysis, not simply checking the wrong box by accident, or something like that.

In the first, multimodal, review project I read approximately 2,500 individual documents to categorize the entire set of 699,082 Enron emails. I found 597 relevant documents. In the second, monomodal, project, the one I called the Borg experiment, I read 12,000 documents and found 376 relevant documents. After removal of duplicate documents, which were all coded consistently thanks to simple quality controls employed in both projects, there were a total of 274 different documents coded relevant by one or both methods.

Of the 274 overlapping relevant categorizations, 63 were inconsistent. In the first (multimodal) project I found 31 documents to be irrelevant that I determined to be relevant in the second project. In the second (monomodal) project I found 32 documents to be irrelevant that I had determined to be relevant in the first project. An inconsistency of coding of 63 out of 274 relevant documents represents an inconsistency rate of 23%. This was using the same predictive coding software by Kroll Ontrack and the quality control similarity features included in that software back in 2012. The software has improved since then, and I have added more quality controls, but I am still the same reviewer with the same all too human reading comprehension and analysis skills. I am, however, happy to report that even without my latest quality controls, all of my inconsistent calls on relevance pertained to unimportant relevant documents, what I consider “more of the same” grey area types. No important document was miscoded.

My re-review of the 274 documents, where I made the 63 errors, yields an overlap or Jaccard index of 77% (211/274), which, while embarrassing, as most reports of error are, is still the best on record. See Grossman Cormack Glossary, Ver. 1.3 (2012) (defining the Jaccard index and going on to state that expert reviewers commonly achieve Jaccard index scores of about 50%, and that scores exceeding 60% are very rare). This overlap or Jaccard index for my two Enron reviews is shown in the Venn diagram below.

[Venn diagram: overlap of documents coded relevant in the two Enron reviews]

By comparison, the Jaccard index in the Voorhees study was only 43% (two reviewers) and 30% (three reviewers). The Jaccard index in the Roitblat, Kershaw and Oot study was only 16% (multiple reviewers).

[Chart: review consistency (Jaccard) rates compared across the studies]
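For the record, here is the arithmetic behind that comparison, using the counts reported above; the 43%, 30% and 16% figures are taken directly from the cited studies rather than recomputed here.

```python
# Jaccard (overlap) comparison across the studies discussed above.
losey_consistent, losey_relevant_total = 211, 274
losey_jaccard = losey_consistent / losey_relevant_total   # ~0.77
losey_inconsistency = 63 / losey_relevant_total           # ~0.23
print(f"Losey inconsistency rate: {losey_inconsistency:.0%}")

reported = {
    "Losey Enron reviews (one reviewer, twice)": round(losey_jaccard, 2),
    "Voorhees (two reviewers)": 0.43,
    "Voorhees (three reviewers)": 0.30,
    "Roitblat, Kershaw & Oot (re-review teams)": 0.16,
}
for study, jaccard in reported.items():
    print(f"{study}: {jaccard:.0%}")
```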

This is the basis for my less-is-more postulate, and why I always use as few contract review attorneys as possible in a review project. See Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Three. This helps pursue the quality goal of perfect consistency. Sorry, contract lawyers, your days are numbered. Most of you can and will be replaced. You will not be replaced by robots exactly, but by other AI-enhanced human reviewers. See Why I Love Predictive Coding (The Empowerment of AI Augmented Search).

To be continued …

6 Responses to Concept Drift and Consistency: Two Keys To Document Review Quality – Part Two

  1. Joshua Rubin says:

    Thanks for another great post, Ralph. One observation:

    You candidly concede that SMEs, including yourself, are fallible and will on occasion make inconsistent calls. Then, again with your typical candor, you say:

    I am, however, happy to report that even
    without my latest quality controls all of my
    inconsistent calls on relevance pertained
    to unimportant relevant documents, what
    I consider “more of the same” grey area
    types. No important document was miscoded.

    Since your earlier study was a merits review of an incoming production, you didn’t need to be too cumulative. For clarity, I think it’s important to distinguish reviews for outgoing productions, where coding inconsistencies should be QC’ed out as early as possible, to avoid confusing the prediction engine, to make the review as efficient as possible, and to maximize recall.

    Best,
    Josh

    • Ralph Losey says:

Good comment, and I agree that all humans make mistakes, that’s for sure, even lawyers! This again proves, contrary to popular belief, that lawyers are all too human, although the whole question of whether they have a heart or not is still up for debate.

One correction to your comment, however: my earlier study, actually studies, were NOT reviews of incoming productions. Not sure where you got that? Both Enron reviews were of the entire dataset of almost 800,000 docs. Both looked for ESI relevant to employee terminations (one used my multimodal techniques, the other used only random and machine-selected documents). Both reviews excluded ESI pertaining to voluntary employee departures, as opposed to involuntary terminations. That is a grey area, to be sure, in this Enron collection, since the corporation was essentially going out of business.

      Final comment, most active machine learning algorithms can handle some inconsistency, but not too much. Still, I agree the goal is perfect consistency, especially when training.

      • > One correction to your comment,
        > however, my earlier study, actually
        > studies, were NOT reviews of
        > incoming productions. Not sure
        > where you got that?

        Hmmm. Maybe I was subconsciously trying to prove that I’m human. (And maybe I need to teach my subconscious about invalid syllogisms.)

  2. Tony says:

    Hello Ralph! I am very pleased someone is writing about and describing this issue. As someone who has dealt with document review management at a project level for years, I can assure you that everything you mention above does indeed happen on a regular basis and can be very problematic on many reviews.

    The best reviews I have seen are the ones that anticipate these issues. They may describe relevancy a little broader at first, knowing that they can eliminate documents later with more precision once the larger set has had the more obvious irrelevant documents removed. This minimizes the need to adjust relevancy because of “issue creep” or as you describe “concept drift” on the more marginal sets early in a review (since they are still included initially as relevant, with a SME making a determination subsequent). This is also a benefit to TAR, as the documents are found using a relatively “broad brush” at first so to speak to remove the likely irrelevant set, and then focus can be narrowed down to carve out the false positives on a more specific review.

    As in any situation, there are always things that arise that you just can’t predict; however, by planning ahead at the beginning of the project to accommodate some of these issues, people can save a lot of time and effort during the actual review process.

Thanks again for shining a light on e-discovery and the nuances of TAR/document review!


