The gears in this diagram show the mechanics of a complex document review project circa 2018. The role that I, and other search specialists like me, play in such projects is that of the AI Trainer (the dark blue gear with my picture in it).
AI Trainers Need Only Look at Metadata
As you can see from the diagram, in the role of AI Trainer I do not actually look at any documents myself, just the ESI metadata, such as the number of documents containing certain keywords or the relevance classifications already applied. The metadata is all the information I need to search, make machine-training selection decisions, and select documents for reviewers to actually look at and classify (multimodal review). These are steps four, five and six of my open-sourced eight-step process for document review using predictive coding. (Classes 10-16 of the TARcourse.com.) This new Key Players diagram is another way of describing the iterative process that makes up the core of AI-enhanced document review today.
I could not complete my role in this process with my own limited human capacities and intelligence. I rely heavily on the input of the machine intelligence, the ranking metadata created by the AI (step five in the iterated steps four, five and six shown above). This is a Hybrid process, Man and Machine working together using a variety of search techniques. See Predictive Coding 4.0 Hybrid Multimodal IST Method (Classes 10-16 of the TARcourse.com). The entire eight-step workflow for a predictive coding review project is shown in the diagram below.
The highest ranking documents are almost always included in the documents that I select to batch out to reviewers to examine and code. (Step One in the Key Players work flow.) This ranking is part of the metadata that the AI adds to the ESI. (Step Four in the Key Players work flow.)
The iterative training workflow when described in terms of the Key Players forms a figure-eight, an infinity loop, as shown in blue in the diagram below. Step One in the Key Players work flow is my work of multimodal search and choosing ESI for the Review Attorneys to read and code. Step Two is the work of the Review Attorneys to code the documents. This changes the metadata by adding classifications to the documents they review. In Step Three I study the classification metadata created by the Reviewers, and other metadata, and use this information to choose the ESI to Train the AI. In Step Four the AI re-ranks the ESI again, changing the metadata again. That brings us back to Step One again and my work to study the new metadata, create searches and batch out more documents for the Document Reviewers.
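For readers who think in code, the figure-eight loop described above is, in machine learning terms, one round of continuous active learning. The following is a minimal, hypothetical sketch only, using scikit-learn as a stand-in for a review platform's ranking engine; the function and variable names are illustrative and not the author's actual tools.

```python
# Hypothetical sketch of one pass of the figure-eight training loop,
# using scikit-learn as a stand-in for the review platform's AI.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def one_training_round(documents, labels, batch_size=100):
    """Train on coded docs, re-rank all ESI, and select the next review batch.

    documents: list of document texts (the ESI)
    labels:    dict mapping doc index -> True/False relevance code,
               as entered by the Review Attorneys (Step Two)
    """
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(documents)

    coded = sorted(labels)                   # indices already reviewed and coded
    model = LogisticRegression(max_iter=1000)
    model.fit(X[coded], [labels[i] for i in coded])  # Steps Three/Four: train the AI

    scores = model.predict_proba(X)[:, 1]    # new ranking metadata added to the ESI
    unreviewed = [i for i in range(len(documents)) if i not in labels]
    unreviewed.sort(key=lambda i: scores[i], reverse=True)
    return unreviewed[:batch_size]           # Step One: batch top-ranked docs out
```

Each call returns the highest-ranked unreviewed documents; the reviewers' new classifications feed back into `labels`, and the loop repeats.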
This figure-eight workflow keeps repeating until all of the responsive documents (ESI) required by the project have been found. That should include all Highly Relevant documents and most, if not close to all, of the merely relevant. The iterative infinity loop comes to an end when I determine that reasonable, proportional efforts have been made and make a Stop Decision. Then we move on to testing that decision with a random sample, which is part of Step Seven, Zero Error Numerics (ZEN) Quality Control.
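The random-sample test of a Stop Decision can be illustrated with a simple point estimate: sample the documents left behind as non-relevant and project how many relevant documents remain. This is a hedged sketch of the general idea only, not the ZEN protocol itself; all names here are illustrative.

```python
# Illustrative sketch: sample the "null set" (documents ranked or coded
# non-relevant) to estimate how many relevant documents were left behind.
import random

def elusion_estimate(null_set_ids, sample_size, review_fn, seed=42):
    """Point estimate of relevant documents remaining in the null set.

    review_fn(doc_id) -> True if a human reviewer judges the doc relevant.
    Returns (relevant_hits_in_sample, estimated_relevant_in_null_set).
    """
    rng = random.Random(seed)                      # fixed seed for reproducibility
    sample = rng.sample(null_set_ids, min(sample_size, len(null_set_ids)))
    hits = sum(1 for doc_id in sample if review_fn(doc_id))
    estimate = hits / len(sample) * len(null_set_ids)  # point estimate only
    return hits, estimate
```

A real quality-control test would also report a confidence interval around the estimate, not just the point value, before accepting the Stop Decision as reasonable.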
Limiting the Time of High Billing Rate Attorneys
The metadata, and my monitoring of all communications in a project, provide all of the information that I need to help supervise first-pass quality control. (I rarely get involved in second-pass, final production work. The project managers have that well under control.) Since I do not spend my time looking at documents, my time on a project, which is at a relatively high billing rate, is very limited. Clients like that. So do I, since I get to do only what I enjoy the most.
With the help of my document review attorneys, most of whom are contract attorneys who specialize in review, I can complete an entire project without ever reading a single document! They do it for me and do a great job at it. (I only use the best.) The bulk of the work in complex projects like this is now performed by these document reviewers, either contract review attorneys or Junior SME (Subject Matter Expert) attorneys. They are the ones that put in the hours. The senior SMEs do not have to spend much time on the review at all, which is good, because most of them do not like this kind of work.
My time on the high-level meta-functions is also constrained and very limited compared to the reviewers'. But I like what I do a lot; I find it challenging, even fun. See WHY I LOVE PREDICTIVE CODING: Making Document Review Fun Again with Mr. EDR and Predictive Coding 4.0. After having done this since 2012, when our Da Silva Moore case with Judge Peck kicked off the predictive coding frenzy, I am able to do this pretty fast.
Better Recall and Precision
If I do my job right, and the AI probability ranking algorithm in the software works correctly, then the amount of time needed by the reviewers to do theirs will be far less. That is primarily because they will have to look at far fewer irrelevant documents. Moreover, we will be able to find far more of the relevant documents by using the AI-enhanced, iterated figure-eight methods.
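The two measures at work here are worth stating plainly: precision is the fraction of retrieved documents that are relevant (fewer wasted reviewer hours), and recall is the fraction of all relevant documents that were found. A minimal illustration, with hypothetical document-ID sets:

```python
# Plain definitions of precision and recall over sets of document ids.
def precision_recall(retrieved, relevant):
    """retrieved: set of doc ids batched out by the review;
    relevant: the set of truly relevant doc ids."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

Higher precision means reviewers look at fewer irrelevant documents; higher recall means fewer relevant documents are missed.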
Our tests, both formal and ad hoc, have shown that the number of relevant documents we miss is far smaller in complex projects when we have the help of machine learning. In other words, our recall is higher when we use machine learning search features to help us find documents. See e-Discovery Team's 2016 TREC Report: Once Again Proving the Effectiveness of Our Standard Method of Predictive Coding; and the official TREC report for 2015, published on February 20, 2016, found on the NIST website at http://trec.nist.gov/pubs/trec24/papers/eDiscoveryTeam-TR.pdf. We can do without it in simple and small projects, but when the going gets tough, we need the help of machine learning. It is a game changer. Our team, and many others before us, proved this at TREC in both 2015 and again in 2016. We proved it in 2015 when we followed TREC into searches of the dark web and BlackHat World. We proved it again in 2016 when we searched the considerably different world of Jeb Bush email. We prove it again every day in our legal practice.
Better, faster, cheaper. This is not a myth or idle vendor promise. If you tried it before and it did not work for you, your software might have been poor or your methods wrong. If you used a method with large random samples at the start and secret control sets, your method was wrong. If you used a method where you trained and then reviewed, instead of continuously training, your method was wrong too. Try it again with the latest methods and software with actual active machine learning features, not just passive analytics. Active machine learning is what makes the difference. Software quality is important. So too are the proper methods of using the technology.
The Use of Active Machine Learning to Find Documents is Not New
There are two sponsors of TREC: the U.S. Department of Commerce, through the National Institute of Standards and Technology (NIST), and the Department of Defense. The power of this kind of specialized artificial intelligence to help find documents is an everyday reality for most document search experts. All text retrieval scientists in the field, many of whom we met at TREC, know this from decades of work in academia and for government agencies whose work they cannot talk about. Active machine learning for text retrieval has been used by libraries and information-gathering government agencies around the world for decades. It is a classic signal-in-the-noise problem: find the important information among trillions of bits of useless noise. (Why do you think the NSA collects email metadata, and does not care so much about actual contents?)
Although the term Predictive Coding is somewhat new, and the methods discussed here for efficiently using this technology in legal document review projects are brand new, the machine learning technology itself is not new. It is well established. The 2016 TREC we participated in was the twenty-fifth anniversary of TREC. Active machine learning for purposes of text retrieval is far more powerful than mere passive analytics. It is truly a breakthrough for legal technology, and a win-win for the legal profession.
Using Active Machine Learning to Lower Costs and Improve Recall
With this technology and these methods it is now possible to have both improved precision and better recall. It is possible to find more relevant documents at a lower cost, even considering the relatively high billing rates of the AI-Trainers and Senior SMEs.
The system shown in the Key Player diagrams allows us to limit the time of these two key players. The Senior SMEs just supervise the work of their junior counterparts and the skilled legal searchers, people like me, just look at metadata, not the documents. Using this method I can supervise and serve as AI-Trainer in multiple projects at the same time.
AI Trainers Need Not Be Involved in SME Issues
As you can see from the gears workflow, as an AI Trainer I do not get involved in subject matter expert issues, such as the scope of relevance/responsiveness. Although I occasionally still do so, and, unlike most search experts, have several decades of experience as a Senior SME under my belt, it is not necessary. Most of the time I am not involved. The senior trial attorney and their number two know a lot more about the subject and the case than I do. I am not concerned about the quality of their expertise, nor the good faith of its execution. See TARcourse.com, Ninth Class: 7th, 8th and 9th Insights – GIGO, QC, SME, Method, Software (discusses GIGO training issues and the problem of the negligent or corrupt SME).
The SME Team
On the question of competent, good-faith activities, the SMEs I work with are usually my law partners. I have no reason to be concerned about their skills or about bad-faith malfeasance. Even if they were not my partners, any negligent or intentional twisting of relevance to hide evidence would be revealed by the metadata. It would also be exposed by the AI and the Review Attorneys. In other words, the AI would notice, the AI Trainer would notice and so would the Review Attorneys. If an unscrupulous attorney were to attempt to hide evidence, it would have to be done before the review, by excluding the ESI from the collection. (That is one reason I always like to be involved in collection and insist on bulk collections.) If the ESI is not in the review database to begin with, then it will not be found. But even then, tell-tale traces of the omitted documents may be noticed, such as gaps in email chains. Bottom line: the team approach described here makes the kind of corrupt practices described in Waymo v. Uber far more difficult. See Waymo v. Uber, Hide-the-Ball Ethics and the Special Master Report of December 15, 2017.
It would be much easier for a corrupt attorney to get away with hiding evidence in an older system, without AI Trainers and document review specialists. Unethical behavior thrives in the dark. It is nearly impossible to pull off in open group teamwork. Everyone would have to be in on it. For that reason, I am quite comfortable leaving most relevance decisions to the SMEs, even when they are not my attorneys, and focusing my time on multimodal search, machine training and quality controls.
SME in e-Discovery
At this point in my career, I am an e-discovery specialist. I do not attempt to stay current in other substantive areas of the law. It is hard enough to stay current with e-discovery, both case law and new technology. Assuming a project has good communications (and I help out with that), there is no reason for me to know much more than the basics about a case. Also, as discussed in the Ninth Class of the TARcourse.com, we have multiple built-in safeguards for quality control. They catch and help correct mistakes and inconsistencies in relevance judgment. Such mistakes are inevitable in any complex project. The understanding of relevance naturally evolves as more ESI is reviewed. That is the main reason the first methods of predictive coding often worked poorly: they used large, random, secret control sets that incorrectly assumed that relevance was fixed. We stopped using control sets long ago. See TARcourse.com – First Class: Background and History of Predictive Coding.
Under our current methods, the document reviewers themselves (and the AI that I train) are the ones who have to closely understand and follow the SMEs, not me. That means I can work on almost any type of document review project without personal expertise in the type of law involved or the relevance rules that develop. It is more important that I know the Reviewers and the type of ESI involved. It is especially important for me to understand the metadata created by the reviewers as they code the documents and the metadata created by the software algorithm, the AI, as it ranks the documents. It is more important for me to know the software than the subject matter of the case.
In a forthcoming blog I will provide a video with a more detailed explanation of how the Key Players flow-chart works. I will also add this new Key Players explanation to the TARcourse.com. In addition, look for me to be speaking about this at forthcoming webinars, including one for Bloomberg on January 18, 2018, 12:00-1:00 EST, on Building Effective Discovery Teams through Relationship Management, and at Legal Tech in 2018. I also plan to talk about this on Twitter. If you are not already following me on Twitter, and do not mind my sometimes ill-controlled anti-Trump rage, I suggest you follow me there at @ralphLosey.
Happy New Year!