Concept Drift and Consistency: Two Keys To Document Review Quality

January 20, 2016

High quality, effective legal search, by which I mean a document review project that is high in recall, precision and efficiency, and proportionally low in cost, is the holy grail of e-discovery. Like any worthy goal it is not easy to attain, but unlike the legendary grail, there is no secret to finding it. As most experts already well know, it can be attained by:

  1. Following proven document search and review protocols;
  2. Using skilled personnel;
  3. Using good multimodal software with active machine learning features; and,
  4. Following proven methods for quality control and quality assurance.

Effective legal search is the perfect blend of recall and proportionate precision. See: Rule 26(b)(1), FRCP (creating a nexus between relevance and six proportionality criteria). The proportionate aspect keeps the cost down, or at least at a spend level appropriate to the case. The quality control aspects are there to guarantee that effective legal review is attained in every project.

The Importance of Quality Control was a Lesson of TREC 2015

This need for quality measures was one of the many lessons we re-learned in the 2015 TREC experiments. These scientific experiments (they are not a competition) were sponsored by the National Institute of Standards and Technology. They are designed to test information retrieval technology, which at this point means the latest active machine learning software and methods. My e-Discovery Team participated in the TREC Total Recall Track in 2015. We had to dispense with most of our usual quality methods to save time and to fit into the TREC experiment format. We had to skip steps one, three, and seven, where most of our quality control and quality assurance methods are deployed. These methods take time, but they are key to consistent quality, and we would not do a large commercial project without them.

Predictive Coding Search diagram by Ralph Losey

By skipping step one, which we had to do because of the TREC experiment format, and skipping steps three and seven, where most of the quality control measures are situated, to save time, we were able to do mission impossible. A couple of attorneys working alone were able to complete thirty review projects in just forty-five days, and on a part-time after hours basis at that. It was a lot of work, approximately 360 hours, but it was exciting work, much like an Easter egg hunt with race cars. It is fun to see how fast you can find and classify relevant documents and still stay on-track. Indeed, I could never have done it without the full support and help of the software and top experts at Kroll Ontrack. At this point they know these eight-step 3.0 methods pretty well.

In all we classified over seventeen million documents as relevant or irrelevant. We did so at a truly thrilling average review speed of 47,261 files per hour! Think about that the next time your document review company brags that it can review 50 to 100 files per hour. (If that were miles per hour, not files per hour, it would be almost twice as fast as Man has ever gone (Apollo 10 command module reentry).) Reviewers augmented with the latest AI, the latest CARs (computer assisted review), might as well be in a different Universe. Although 47,261 files per hour might be a record speed for multiple projects, it is still almost a thousand times faster than humans can go alone. Moreover, any AI-enhanced review project these days is able to review documents at speeds undreamed of just a few years ago.
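
For anyone who wants to check my math, here is a quick back-of-the-envelope calculation, a sketch in Python using the approximate document and hour totals reported above:

```python
# Rough check of the review-speed figures quoted above.
documents_classified = 17_014_085   # total documents classified across the thirty projects
hours_worked = 360                  # approximate total attorney hours

files_per_hour = documents_classified / hours_worked
print(f"{files_per_hour:,.0f} files per hour")   # roughly 47,261

# Compare with the low end of traditional linear review, about 50 files per hour.
print(f"about {files_per_hour / 50:,.0f} times faster than unaided human review")
```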

In most of the thirty review projects we were able to go that fast and still attain extraordinarily high precision and recall. In fact, we did so at levels never before seen in past TREC Legal Tracks, but we had a few problem projects too. In only twelve of the thirty projects were we able to attain record-setting high F1 scores, where both recall and precision are high. This TREC, like others in the past, had some challenging aspects, especially the search for target posts in the ten BlackHat World Forum review projects.
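
For readers who are not statisticians, the F1 score mentioned here is simply the harmonic mean of recall and precision, so it is only high when both are high at once. A minimal illustration in Python (the example numbers are made up for the demonstration, not TREC results):

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; high only when both are high."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Made-up numbers for illustration, not actual TREC results.
print(round(f1_score(0.95, 0.95), 2))   # 0.95: both high, so F1 is high
print(round(f1_score(0.99, 0.50), 2))   # 0.66: high recall cannot rescue low precision
```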

To get an idea of how well we did in 2015, as compared to prior legal teams at TREC, I did extensive research into the TREC Legal Tracks of old, as well as the original Blair and Maron study. Here are the primary texts I consulted:

  • Grossman and Cormack, Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review, CoRR abs/1504.06868 at pgs. 2-3 (estimating the Blair and Maron precision score at 20% and listing the top scores (without attribution) in most TREC years);
  • Grossman and Cormack, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, at pgs. 24-27;
  • Hedin, Tomlinson, Baron, and Oard, Overview of the TREC 2009 Legal Track;
  • Cormack, Grossman, Hedin, and Oard, Overview of the TREC 2010 Legal Track;
  • Grossman, Cormack, Hedin, and Oard, Overview of the TREC 2011 Legal Track;
  • Losey, The Legal Implications of What Science Says About Recall (1/29/12).

Based on this research I prepared the following chart showing the highest F1 scores attained during these scientific tests. (Note that my original blog post also identified the names of the participants who attained these scores, information gained from my analysis of public information, namely the five publications cited above. Unidentified persons, whom I must assume to be among the entities named, complained about my disclosure. They did not complain to me, but to TREC. Out of respect to NIST the chart below has been amended to omit these names. My attitude towards the whole endeavor has, however, been significantly changed as a result.)

[Chart: highest F1 scores attained in each past TREC Legal Track year, participant names omitted]

This is not a listing of the average score per year; such scores would be far, far lower. Rather, it shows the very best result attained by any participant in that year on any topic. These are the high, high scores. Now compare that with not only our top score, which was 100%, but our top twelve scores. (Of course, the TREC events each year have varying experiments and test conditions, so direct comparisons between TREC studies are never valid, but general comparisons are instructive and frequently made in the cited literature.)

On twelve of the topics in 2015 the e-Discovery Team attained F1 scores of 100%, 99%, 97%, 96%, 96%, 95%, 95%, 93%, 87%, 85%, 84% and 82%. One high score, as we have seen in past TRECs, might just be chance, but not twelve. The chart below identifies our top twelve results and the topic numbers where they were attained. For more information on how we did, see the e-Discovery Team’s 2015 Preliminary TREC Report. Also come hear us speak at Legal Tech in New York on February 3, 2016, 10:30-11:45 am. I will answer all the questions I can within the framework of my mandatory NDA with TREC. Joining me on the panel will be my teammate at TREC, Jim Sullivan, as well as Jason R. Baron of Drinker Biddle & Reath, and Emily A. Cobb of Ropes & Gray. I am not sure if Mr. EDR will be able to make it or not.

[Chart: the e-Discovery Team's top twelve TREC 2015 F1 scores and their topic numbers]

The numbers and graphs speak for themselves, but still, not all of our thirty projects attained such stellar results. In eighteen of the projects our F1 score was less than 80%, even though recall alone, or in some topics precision alone, was higher than that. (Full discussion and disclosure will be made in the as yet unpublished e-Discovery Team Final Report.) Our mixed results at TREC were due to a variety of factors, some inherent in the experiments themselves (mainly the omission of Step 1, the difficulty of some topics, and the debatable gold standards for some of the topics), but also, to some extent, the omission of our usual quality control methods. Skipping Steps 3 and 7 was no doubt at least a factor in the below-par performance, by our standards, in some of the eighteen projects that disappointed us. Thus one of the take-away lessons from our TREC research was the continued importance of a variety of quality control methods. See eg: ZeroErrorNumerics.com. It is an extra expense, and it takes time, but it is well worth it.

Consistency and Concept Drift

The rest of this article will discuss two of the most important quality control considerations, consistency and concept drift. They both have to do with human review of document classification. This is step five in the eight-step standard workflow for predictive coding. On the surface the goals of consistency and drift in document review might seem to be opposites, but they are not. This article will explain what they are, why they are complementary rather than opposite, and why they are important to quality control in document review.

Consistency here refers to the coding of the same or similar documents, and document types, in the same manner. This means that a single reviewer determines relevance in a consistent manner throughout the course of a review project. It also means that multiple reviewers determine relevance in a consistent manner with each other. This is a very difficult challenge, especially when dealing with grey area documents and large projects.

The problem of inconsistent classifications of documents by human reviewers, even very expert reviewers, has been well documented in multiple information retrieval experiments. See eg: Voorhees, Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Processing & Mgmt 697 (2000); Losey, Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Two (12/2/13). Fortunately, the best document review and search software now has multiple features that you can use to help reduce inconsistency, including the software I now use. See eg: MrEDR.com.
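
Review platforms differ in how they surface inconsistencies, but the underlying consistency check is simple to describe: compare coding decisions on documents that more than one reviewer has seen, using raw agreement and a chance-corrected statistic such as Cohen's kappa. The sketch below is a generic illustration only, not a description of any particular product's feature, and the overlap set is hypothetical:

```python
def agreement_and_kappa(coder_a, coder_b):
    """Raw agreement and Cohen's kappa for two reviewers' relevant/irrelevant
    calls on the same documents (True = relevant, False = irrelevant)."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each reviewer's own relevance rate.
    p_a, p_b = sum(coder_a) / n, sum(coder_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Hypothetical overlap set of ten documents coded by two reviewers.
reviewer_1 = [True, True, False, False, True, False, True, True, False, False]
reviewer_2 = [True, True, False, True,  True, False, True, False, False, False]
print(agreement_and_kappa(reviewer_1, reviewer_2))   # (0.8, ~0.6)
```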

Concept drift is a scientific term from the fields of machine learning and predictive analytics. (Legal Search Science is primarily informed by these fields, as well as by information retrieval. See eg: LegalSearchScience.com.) As Wikipedia puts it, concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. In Legal Search the target variable we are trying to predict is legal relevance. See eg: Rule 26(b)(1), FRCP.

The target is all of the relevant documents in a large collection, a corpus. The documents themselves do not change, of course, but whether they are relevant or not does change. The statistical properties are the contents of individual documents, including their metadata, that make them relevant or not. The relevant properties change, and thus the documents that are predicted to be relevant change, as the concept of relevance evolves during the course of a document review.

In Legal Search, concept drift emerges from lawyers' changing understanding of the relevance of documents. In the Law this may also be referred to as relevance shift, or concept shift. In some cases the change is the result of changes in an individual lawyer's analysis. In others it is the result of formalized judicial processes, such as new orders or amended complaints. Most large cases have elements of both. Quality control requires that concept drift be done intentionally and with retroactive corrections for consistency. Concept drift, as used in this article, is an intentional deviation, a pivot in coding to match an improved understanding of relevance.
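
In practice, an intentional pivot like this has to be paired with a sweep back through the documents already coded under the old conception, so the training data stays internally consistent. Here is a minimal sketch of that bookkeeping in Python; the document structure, the version field, and the example relevance change are all hypothetical, not features of any review platform:

```python
from dataclasses import dataclass

@dataclass
class CodedDoc:
    doc_id: str
    text: str
    relevant: bool
    relevance_version: int = 1   # which conception of relevance was in force

def pivot_relevance(coded_docs, new_version, needs_recheck):
    """After an intentional concept drift (say, an amended complaint narrows
    the issues), flag documents coded under an older conception of relevance
    that should be re-reviewed so the training set stays consistent."""
    return [d for d in coded_docs
            if d.relevance_version < new_version and needs_recheck(d)]

# Hypothetical pivot: relevance narrowed to exclude purely personal email.
docs = [
    CodedDoc("D1", "quarterly earnings discussion", relevant=True),
    CodedDoc("D2", "fantasy football trash talk", relevant=True),
    CodedDoc("D3", "vendor pricing dispute", relevant=True),
]
to_recheck = pivot_relevance(docs, new_version=2,
                             needs_recheck=lambda d: "football" in d.text)
print([d.doc_id for d in to_recheck])   # ['D2'] gets re-coded under the new definition
```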

Conversely, from a quality control perspective, you are trying to avoid two common project management errors. You are trying to avoid concept freeze, where your initial relevance instructions never shift, never drift, during the course of a review. You are also trying to avoid inconsistencies, typically by reviewers, but really from any source.

[Diagram: the two project management errors to avoid, concept freeze and inconsistency]

To be continued ….


Why I Love Predictive Coding

December 6, 2015

Making document review fun with Mr. EDR and Predictive Coding 3.0

Many lawyers and technologists like predictive coding and recommend it to their colleagues. They have good reasons to do so. It has worked for them. It has allowed them to do e-discovery reviews in an effective, cost efficient manner. That is true for me too, but that is not why I love predictive coding. My feelings come from the excitement, fun, and amazement that often arise from seeing it in action. I love watching my predictive coding software find documents that I could never have found on my own. I love the way the AI in the software helps me do the impossible. I love how it makes me far smarter and more skilled than I really am.

I have been getting those kinds of positive feelings a lot lately using the new Predictive Coding 3.0 methodology and Kroll Ontrack’s latest eDiscovery.com Review software (“EDR”). So too have my e-Discovery Team members who helped me to participate in this year’s TREC (the great annual science experiment for the latest text search techniques sponsored by the National Institute of Standards and Technology). During our grueling forty-five days of experiments we came to admire the intelligence of the new EDR software so much that we decided to personalize the AI as a robot. We named him Mr. EDR out of respect. He even has his own website now, MrEDR.com, where he explains how he helped my e-Discovery Team in the 2015 TREC Total Recall Track experiments. With Mr. EDR at your side document review need never be boring again.

How and Why Predictive Coding is Fun

Step Six of the eight-step workflow for Predictive Coding 3.0 is called Hybrid Active Training. That is where we work with the active machine-learning features of Mr. EDR, the predictive coding features, which are a type of artificial intelligence. We train the computer on our conception of relevance by showing it relevant and irrelevant documents that we have found. The software is designed to then go out and find all other relevant documents in the total dataset.

We use a multimodal approach to find training documents, meaning we use all of the other search features of Mr. EDR to find relevant ESI, such as keyword, similarity, and concept searches. We iterate the training with sample documents, both relevant and irrelevant, until the computer starts to understand the scope of relevance we have in mind. It is a training exercise to make our AI smart, to get it to understand the basic ideas of relevance for that case. It usually takes multiple rounds of training for Mr. EDR to understand what we have in mind. But he is a fast learner, and by using the latest hybrid multimodal continuous active learning techniques, we can usually complete his training in a day or two.
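
Predictive Coding 3.0 and Mr. EDR are proprietary, but the basic train-rank-review-retrain mechanic behind any continuous active learning tool can be sketched with open-source libraries. The toy example below uses scikit-learn and is only an illustration of the general technique under assumed toy data, not Kroll Ontrack's actual implementation:

```python
# A generic continuous-active-learning round with open-source tools: train on
# what the lawyers have coded so far, rank the uncoded documents, have humans
# review the top-ranked ones, and retrain. Illustrative only; this is not the
# EDR / Predictive Coding 3.0 implementation itself.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus; in practice this would be the full document collection.
corpus = [
    "board meeting notes on accounting fraud",   # coded relevant
    "memo concealing the revenue restatement",   # coded relevant
    "office party planning and catering order",  # coded irrelevant
    "lunch menu for the team offsite",           # coded irrelevant
    "auditor questions about revenue figures",   # not yet coded
    "weekend softball league schedule",          # not yet coded
]
coded = {0: True, 1: True, 2: False, 3: False}   # index -> relevant?

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

model = LogisticRegression()
model.fit(X[list(coded)], [coded[i] for i in coded])

uncoded = [i for i in range(len(corpus)) if i not in coded]
probabilities = model.predict_proba(X[uncoded])[:, 1]   # P(relevant)
ranking = sorted(zip(uncoded, probabilities), key=lambda pair: -pair[1])
for idx, p in ranking:
    print(f"{corpus[idx]!r}: predicted relevance {p:.2f}")
# Reviewers code the top-ranked documents next, and the loop repeats.
```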

After a while Mr. EDR starts to “get it,” he starts to really understand what we are after, what we think is relevant in the case. That is when a happy shock and awe type moment can happen. That is when Mr. EDR’s intelligence and search abilities start to exceed our own. Yes. It happens. The pupil then starts to evolve beyond his teachers. The smart algorithms start to see patterns and find evidence invisible to us. At that point we let him teach himself by automatically accepting his top-ranked predicted relevant documents without even looking at them. Our main role then is to determine a good range for the automatic acceptance and do some spot-checking. We are, in effect, allowing Mr. EDR to take over the review. Oh what a feeling to then watch what happens, to see him keep finding new relevant documents and keep getting smarter and smarter by his own self-programming. That is the special AI-high that makes it so much fun to work with Predictive Coding 3.0 and Mr. EDR.
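
The "automatic acceptance" I describe amounts to choosing a probability cutoff above which the machine's predictions are accepted without eyes-on review, plus a random spot-check sample drawn from that accepted band. A generic sketch, where the threshold, sample rate, and document scores are all hypothetical:

```python
import random

def auto_accept(ranked_docs, threshold=0.9, spot_check_rate=0.02, seed=42):
    """Accept machine-predicted relevant documents above a probability cutoff,
    but pull a random spot-check sample from that accepted band for human review.
    ranked_docs is a list of (doc_id, predicted_probability_of_relevance)."""
    accepted = [doc_id for doc_id, p in ranked_docs if p >= threshold]
    rng = random.Random(seed)
    sample_size = min(len(accepted), max(1, round(len(accepted) * spot_check_rate)))
    spot_check = rng.sample(accepted, sample_size) if accepted else []
    return accepted, spot_check

# Hypothetical scores from a trained model.
ranked = [("D101", 0.99), ("D102", 0.95), ("D103", 0.91), ("D104", 0.62)]
accepted, spot_check = auto_accept(ranked)
print(accepted)    # ['D101', 'D102', 'D103'] accepted without eyes-on review
print(spot_check)  # one randomly chosen accepted document to verify by hand
```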

It does not happen in every project, but with the new Predictive Coding 3.0 methods and the latest Mr. EDR, we are seeing this kind of transformation happen more and more often. It is a tipping point in the review when we see Mr. EDR go beyond us. He starts to unearth relevant documents that my team would never even have thought to look for. The relevant documents he finds are sometimes completely dissimilar to any others we found before. They do not have the same keywords, or even the same known concepts. Still, Mr. EDR sees patterns in these documents that we do not. He can find the hidden gems of relevance, even outliers and black swans, if they exist. When he starts to train himself, that is the point in the review when we think of Mr. EDR as going into superhero mode. At least, that is the way my young e-Discovery team likes to talk about him.

By the end of many projects the algorithmic functions of Mr. EDR have attained a higher intelligence and skill level than our own (at least on the task of finding the relevant evidence in the document collection). He is always lightning fast and inexhaustible, even untrained, but by the end of his training he becomes a search genius. Watching Mr. EDR in that kind of superhero mode is one of the things that make Predictive Coding 3.0 a pleasure.

The Empowerment of AI Augmented Search

It is hard to describe the combination of pride and excitement you feel when Mr. EDR, your student, takes your training and then goes beyond you. More than that, the super-AI you created then empowers you to do things that would have been impossible before, absurd even. That feels pretty good too. You may not be Iron Man, or look like Robert Downey, but you will be capable of remarkable feats of legal search strength.

For instance, using Mr. EDR as our Iron Man-like suit, my e-discovery team of three attorneys was able to complete thirty different review projects and classify 17,014,085 documents in 45 days. See the TREC experiment summary at MrEDR.com. We did these projects mostly at night and on weekends, while holding down our regular jobs. What makes this seem crazy, impossible even, is that we accomplished it by personally reviewing only 32,916 documents. That is less than 0.2% of the total collection. That means we relied on predictive coding to do 99.8% of our review work. Incredible, but true. Using traditional linear review methods it would have taken us 45 years to review that many documents! Instead, we did it in 45 days. Plus our recall and precision rates were insanely good. We even scored 100% precision and 100% recall in one TREC project. You read that right. Perfection. Many of our other projects attained scores in the high and mid nineties. We are not saying you will get results like that. Every project is different, and some are much more difficult than others. But we are saying that this kind of AI-enhanced review is not only fast and efficient, it is effective.
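
The percentages check out, as the short calculation below shows. The linear-review comparison depends on assumptions the post does not spell out, so the reviewer rate and yearly hours used here are hypothetical figures chosen only to show the order of magnitude:

```python
# Checking the proportions reported above.
total_documents = 17_014_085
personally_reviewed = 32_916

print(f"{personally_reviewed / total_documents:.2%} reviewed by hand")                  # about 0.19%
print(f"{1 - personally_reviewed / total_documents:.1%} handled by predictive coding")  # about 99.8%

# Hypothetical linear-review comparison: 70 documents per reviewer-hour and
# 1,800 review hours per year for each of three reviewers.
hours_needed = total_documents / 70
years_needed = hours_needed / (1_800 * 3)
print(f"roughly {years_needed:.0f} years of traditional linear review")                 # roughly 45
```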

Yes, it’s pretty cool when your little AI creation does all the work for you and makes you look good. Still, no robot could do this without your training and supervision. We are a team, which is why we call it hybrid multimodal, man and machine.

Having Fun with Scientific Research at TREC 2015

During the 2015 TREC Total Recall Track experiments my team would sometimes get totally lost on a few of the really hard Topics. Unlike our usual work, we were not given legal issues to search. They were arcane technical hacker issues, political issues, or local news stories. Not only were we in new fields, but the scope of relevance of the thirty Topics was never really explained (we were given only one- to three-word descriptions). We had to figure out the intended relevance during the project based on feedback from the automated TREC document adjudication system. We would have some limited understanding of relevance based on our suppositions about the initial keyword hints, and so we could begin to train Mr. EDR with that. But in several Topics we never had any real understanding of exactly what TREC thought was relevant.

This was a very frustrating situation at first, but, and here is the cool thing, even though we did not know, Mr. EDR knew. That’s right. He saw the TREC patterns of relevance hidden to us mere mortals. In many of the thirty Topics we would just sit back and let him do all of the driving, like a Google car. We would often just cheer him on (and each other) as the TREC systems kept saying Mr. EDR was right, the documents he selected were relevant. The truth is, during much of the 45 days of TREC we were all like kids in a candy store having a great time. That is when we decided to give Mr. EDR a cape and superhero status. He never let us down. It is a great feeling to create an AI with greater intelligence than your own and then see it augment and improve your legal work. It is truly a hybrid human-machine partnership at its best.

I hope you get the opportunity to experience this for yourself someday. This year’s TREC experiments are over, but the search for truth and justice goes on in lawsuits across the country. Try it on your next document review project.

Do What You Love and Love What You Do

Mr. EDR, and other good predictive coding software like it, can augment our own abilities and make us incredibly productive. This is why I love predictive coding and would not trade it for any other legal activity I have ever done (although I have had similar highs from oral arguments that went great, or the rush that comes from winning a big case).

The excitement of predictive coding comes through clearly when Mr. EDR is fully trained and able to carry on without you. It is a kind of Kurzweilian mini-singularity event. It usually happens near the end of the project, but can happen earlier when your computer catches on to what you want and starts to find the hidden gems you missed. I suggest you give Predictive Coding 3.0 and Mr. EDR a try. Then you too can have fun with evidence search. You too can love what you do. Document review need never be boring again.

Caution

One note of caution: most e-discovery vendors, including several prominent software makers, still do not follow the hybrid multimodal Predictive Coding 3.0 approach that we use to attain these results. They instead rely entirely on machine-selected documents for training, or even worse, rely entirely on randomly selected documents to train the software, or use elaborate, unnecessary secret control sets. On the other end of the spectrum, some vendors use all search methods except for predictive coding, to keep it simple, they say. It may be simple, but the power, speed, quality control, and just plain fun you give up for that simplicity make it a poor trade. The old ways are more costly because they take so much lawyer time to complete, they are less effective, and they are boring. The use of AI data analytics is clearly the way of the future. It is what makes document review enjoyable and why I love to do big projects. It turns scary into fun.

I have also heard that the algorithms used by some vendors for predictive coding are not very good. Scientists tell me that some are only dressed-up concept search or unsupervised document clustering. Only bona fide active machine learning algorithms create the kind of AI experience that I am talking about. So, if it does not work for you, it could well be the software’s fault, not yours. The new 3.0 methods are not very hard to follow, and they certainly will work. We have proven that at TREC, but only if you have good software. With just a little training, and some help at first from consultants (most vendors will have good ones to help), you can have the kind of success and excitement that I am talking about.

Do not give up if it does not work for you the first time, especially in a complex project. Try another vendor instead, one that may have better software and better consultants. Also, be sure that your consultants are Predictive Coding 3.0 experts, and that you follow their advice. Finally, remember that the cheapest is almost never the best, and, in the long run will cost you a small fortune in wasted time and frustration.

Conclusion

Love what you do. It is a great feeling and a sure-fire way to job satisfaction and success. With these new predictive coding technologies it is easier than ever to love e-discovery. Try them out. Treat yourself to the AI high that comes from using smart machine learning software and fast computers. There is nothing else like it. If you switch to the 3.0 methods and software, you too can know that thrill. You can watch an advanced intelligence, which you helped create, exceed your own abilities, exceed anyone’s abilities. You can sit back and watch Mr. EDR complete your search for you. You can watch him do so in record time and with record results. It is amazing to see good software find documents that you know you would never have found on your own.

Predictive coding AI in superhero mode can be exciting to watch. Why deprive yourself of that? Who says document review has to be slow and boring? Start making the practice of law fun again.


