The only types of Computer Assisted Review (CAR) software that I endorse for the search of large ESI collections include active machine learning algorithms, which provide full featured predictive coding capacities. Active machine learning is a type of artificial intelligence (AI). When used in legal search these AI algorithms significantly improve the search, review, and classification of electronically stored information (ESI). For this reason I prefer to call predictive coding AI-enhanced review or AI-enhanced search. For more background on the science involved see LegalSearchScience.com.
In CARs with AI-enhanced review and search engines, attorneys train a computer to find documents identified by the attorney as a target. The target is typically relevance to a particular lawsuit or legal issue, or some other legal classification, such as privilege. This kind of AI-enhanced review and general e-discovery training are now my primary interests as a lawyer.
Personal Legal Search Background
In 2006 I dropped my civil litigation practice and limited my work to e-discovery. That is also when I started this blog. At that time I could not even imagine specializing more than that. In 2006 I was interested in all aspects of electronic discovery, including computer assisted review. AI-enhanced CARs were still just a dream that I hoped would someday come true.
The use of software in legal practice has always been a compelling interest for me. I have been an avid user of computer software of all kinds since the late 1970s, both legal and entertainment. I even did some game software design and programming work in the early 1980s. My now-grown kids still remember the computer games I made for them.
I carefully followed the legal search and review software scene my whole career, but especially since 2006. It was not until 2011 that I began to be impressed by the new types of predictive coding CAR software entering the market. After I got my hands on the new software, I began to do what had once been unimaginable. I started to limit my legal practice even further. I began to spend more and more of my time on predictive coding types of review work. Since 2012 my work as an e-discovery lawyer and researcher has focused almost exclusively on using predictive coding driven CARs in large document production projects, and on e-discovery training, another passion of mine. In that year one of my cases produced a landmark decision by Judge Andrew Peck that first approved the use of predictive coding, Da Silva Moore. (I do not write about it because it is still ongoing.)
Attorney Maura R. Grossman and I are among the first attorneys in the world to specialize in predictive coding as an e-discovery sub-niche. Maura is a colleague who is both a practicing attorney and an expert in the new field of Legal Search Science. We often present on CLE panels as technology evangelists for these new methods of legal review. Maura, and her information scientist partner, Gordon Cormack, wrote the seminal scholarly paper on the subject, and more recently an excellent glossary of terms used in CAR (they prefer to call it TAR). Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Richmond Journal of Law and Technology, Vol. XVII, Issue 3, Article 11 (2011); The Grossman-Cormack Glossary of Technology-Assisted Review, with Foreword by John M. Facciola, U.S. Magistrate Judge, 2013 Fed. Cts. L. Rev. 7 (January 2013). I recommend that you read all of their works. I also recommend that you study the LegalSearchScience.com website that I put together, and the many references and citations included at Legal Search Science, including the writings of other pioneers in the field, such as the founders of TREC Legal Track, Jason R. Baron, Doug Oard, and David Lewis, and other key figures in the field, such as information scientists William Webber and EDI’s Herb Roitblat. Also see Baron and Grossman, The Sedona Conference® Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (December 2013).
Advanced CARs Require Completely New Driving Methods
CAR or TAR is more than just new software. It entails a whole new legal method, a new approach to large document reviews. Below is the diagram that I created to show the new workflow I use in a typical CAR project.
For a basic description of the eight steps see the Electronic Discovery Best Practices page on predictive coding.
I have found that driving a CAR properly requires the highest skill levels and is, for me at least, the most challenging activity in electronic discovery. It also shows the promise of being the new tool that we have all been waiting for. When used properly, good predictive coding type software allows attorneys to find the information they need in vast stores of ESI, and to do so in an effective and affordable manner.
In my experience the best software and training methods use what is known as an active learning process in steps four and five in the chart above. My preferred active learning process in the iterative machine learning steps is threefold:
- The computer selects documents for review where the software classifier is uncertain of the correct classification. This helps the classifier algorithms to learn by adding diversity to the documents presented for review. This in turn helps to locate outliers of a type your initial judgmental searches in steps two and five have missed. This is machine selected sampling, and, according to a basic text in information retrieval engineering, a process is not a bona fide active learning search without this ability. Manning, Raghavan and Schutze, Introduction to Information Retrieval, (Cambridge, 2008) at pg. 309.
- Some reasonable percentage of the documents presented for human review in step five are selected at random. This again helps maximize recall and avoid premature focus on the relevant documents initially retrieved.
- The rest are other relevant documents that a skilled reviewer can find using a variety of search techniques. This is called judgmental sampling. After the first round of training, aka the seed set, judgmental sampling by a variety of search methods is usually based on the machine selected or randomly selected documents presented for review, but sometimes the subject matter expert (“SME”) human reviewer follows a new search idea unrelated to the new documents seen. Any kind of search can be used for judgmental sampling, which is why I call it a multimodal search. This may include some linear review of selected custodians or dates, parametric Boolean keyword searches, similarity searches of all kinds, concept searches, as well as several unique predictive coding probability searches.
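As a rough illustration of how the three kinds of sampling can feed a single training round, here is a minimal Python sketch. It is not taken from any vendor's software or from my own projects. The function name `three_cylinder_batch`, its parameters, and the idea of measuring uncertainty as closeness of a predicted relevance probability to 0.5 are all assumptions made for the example.

```python
import random


def three_cylinder_batch(scores, judgmental_ids, n_uncertain=3, n_random=2, seed=0):
    """Assemble one training batch from three sources (hypothetical sketch).

    scores: dict mapping doc id -> classifier's predicted probability of relevance.
    judgmental_ids: doc ids the reviewer found with keyword, concept, or other searches.
    """
    rng = random.Random(seed)
    chosen = set()

    # Judgmental sampling: documents located by the human reviewer.
    judgmental = [d for d in judgmental_ids if d in scores]
    chosen.update(judgmental)

    # Machine selected sampling: documents the classifier is least sure about.
    # Assumption: "uncertain" means a probability closest to 0.5.
    candidates = [d for d in scores if d not in chosen]
    candidates.sort(key=lambda d: (abs(scores[d] - 0.5), d))
    uncertain = candidates[:n_uncertain]
    chosen.update(uncertain)

    # Random sampling: drawn from everything not already picked.
    pool = sorted(d for d in scores if d not in chosen)
    randoms = rng.sample(pool, min(n_random, len(pool)))

    return {"judgmental": judgmental, "uncertain": uncertain, "random": randoms}
```

The relative size of each cylinder is a judgment call that this sketch leaves to the caller; real software exposes (or hides) these choices in its own way.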
The initial seed set generation, step two in the chart, should also use some random samples, plus judgmental multimodal searches. Steps three and six in the chart always use pure random samples and rely on statistical analysis. For more on the three types of sampling see my blog, Three-Cylinder Multimodal Approach To Predictive Coding.
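The statistical analysis behind a pure random sample can be sketched in a few lines of Python. This is only a textbook illustration, assuming a simple random sample, a large collection, and the normal approximation for a proportion; the function names are hypothetical and the formulas are not a description of any particular software's calculations.

```python
import math


def sample_size(margin_of_error, z=1.96, p=0.5):
    """Documents to sample to estimate a proportion within +/- margin_of_error.

    Assumes a large collection and the normal approximation; z=1.96
    corresponds to 95% confidence, and p=0.5 is the most conservative guess.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


def proportion_interval(hits, sample_n, z=1.96):
    """Approximate confidence interval for the proportion of relevant documents."""
    p = hits / sample_n
    half_width = z * math.sqrt(p * (1 - p) / sample_n)
    return max(0.0, p - half_width), min(1.0, p + half_width)
```

For example, under these assumptions a 95% confidence level with a two percent margin of error calls for a sample of about 2,400 documents, regardless of how large the collection is. If a hypothetical sample of 1,000 documents contained 100 relevant ones, `proportion_interval(100, 1000)` would give the estimated range for prevalence in the whole collection.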
My insistence on the use of multimodal judgmental sampling in steps two and five to locate relevant documents follows the consensus view of information scientists specializing in information retrieval, but is not followed by several prominent predictive coding vendors. They instead rely entirely on machine selected documents for training, or even worse, rely entirely on randomly selected documents to train the software. In my writings I call these processes the Borg approach, after the infamous villains in Star Trek, the Borg, a race of half-human robots that assimilates people into machines. (I further differentiate between three types of Borg in Three-Cylinder Multimodal Approach To Predictive Coding.) Like the Borg, these approaches unnecessarily minimize the role of individuals, the SMEs. They exclude other types of search to supplement an active learning process. I advocate the use of all types of search, not just predictive coding.
Hybrid Human Computer Information Retrieval
Further, in contradistinction to Borg approaches, where the machine controls the learning process, I advocate a hybrid approach where Man and Machine work together. In my hybrid CARs the expert reviewer remains in control of the process, and their expertise is leveraged for greater accuracy and speed. The human intelligence of the SME is a key part of the search process. In the scholarly literature of information science this hybrid approach is known as Human–computer information retrieval (HCIR).
The classic text in the area of HCIR, which I endorse, is Information Seeking in Electronic Environments (Cambridge 1995) by Gary Marchionini, Professor and Dean of the School of Information and Library Sciences of U.N.C. at Chapel Hill. Professor Marchionini speaks of three types of expertise needed for a successful information seeker:
- Domain Expertise. This is equivalent to what we now call SME, subject matter expertise. It refers to a domain of knowledge. In the context of law the domain would refer to particular types of lawsuits or legal investigations, such as antitrust, patent, ERISA, discrimination, trade-secrets, breach of contract, Qui Tam, etc. The knowledge of the SME on the particular search goal is extrapolated by the software algorithms to guide the search. If the SME also has System Expertise, and Information Seeking Expertise, they can drive the CAR themselves. Otherwise, they will need a chauffeur with such expertise, one who is capable of learning enough from the SME to recognize the relevant documents.
- System Expertise. This refers to expertise in the technology system used for the search. A system expert in predictive coding would have a deep and detailed knowledge of the software they are using, including the ability to customize the software and use all of its features. In computer circles a person with such skills is often called a power-user. Ideally a power-user would have expertise in several different software systems. They would also be an expert in a particular method of search.
- Information Seeking Expertise. This is a skill that is often overlooked in legal search. It refers to a general cognitive skill related to information seeking. It is based on both experience and innate talents. For instance, “capabilities such as superior memory and visual scanning abilities interact to support broader and more purposive examination of text.” Professor Marchionini goes on to say that: “One goal of human-computer interaction research is to apply computing power to amplify and augment these human abilities.” Some lawyers seem to have a gift for search, which they refine with experience, broaden with knowledge of different tools, and enhance with technologies. Others do not.
Id. at pgs. 66-69, with the quotes from pg. 69.
All three of these skills are required for an attorney to attain expertise in legal search today, which is one reason I find this new area of legal practice so challenging. It is difficult, but not impossible like this Penrose triangle.
It is not enough to be an SME, or a power-user, or have a special knack for search. You have to be able to do it all, and so does your software. However, studies have shown that of the three skill-sets, System Expertise, which in legal search primarily means mastery of the particular software used, is the least important. Id. at 67. The SMEs are more important, those who have mastered a domain of knowledge. In Professor Marchionini’s words:
Thus, experts in a domain have greater facility and experience related to information-seeking factors specific to the domain and are able to execute the subprocesses of information seeking with speed, confidence, and accuracy.
Id. That is one reason that the Grossman Cormack glossary builds in the role of SMEs as part of their base definition of computer assisted review:
A process for Prioritizing or Coding a Collection of electronic Documents using a computerized system that harnesses human judgments of one or more Subject Matter Expert(s) on a smaller set of Documents and then extrapolates those judgments to the remaining Document Collection.
Glossary at pg. 21 defining TAR.
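To make the idea of extrapolating judgments from a smaller set to the rest of the collection concrete, here is a toy Python sketch. It uses a simple naive-Bayes-style word weighting that I chose for brevity; it is an assumption for illustration only and not the algorithm used by any commercial predictive coding product. The names `train` and `rank` are hypothetical.

```python
import math
from collections import Counter


def train(labeled):
    """Learn per-word weights from SME-coded documents.

    labeled: list of (text, is_relevant) pairs coded by the human reviewer.
    Returns a dict of word -> log-odds weight (positive leans relevant).
    """
    rel, irr = Counter(), Counter()
    for text, is_relevant in labeled:
        (rel if is_relevant else irr).update(set(text.lower().split()))
    n_rel = sum(1 for _, is_relevant in labeled if is_relevant)
    n_irr = len(labeled) - n_rel
    # Add-one smoothing keeps unseen words from producing log(0).
    return {
        w: math.log((rel[w] + 1) / (n_rel + 2)) - math.log((irr[w] + 1) / (n_irr + 2))
        for w in set(rel) | set(irr)
    }


def rank(weights, collection):
    """Order uncoded documents from most to least likely relevant.

    collection: dict of doc id -> text for the documents no human has coded.
    """
    def score(text):
        return sum(weights.get(w, 0.0) for w in set(text.lower().split()))

    return sorted(collection, key=lambda d: score(collection[d]), reverse=True)
```

The human judgments go in through `train`, and `rank` applies what was learned to documents nobody has reviewed yet, which is the extrapolation step the Glossary definition describes.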
According to Marchionini, Information Seeking Expertise, much like Subject Matter Expertise, is also more important than specific software mastery. Id. This may seem counterintuitive in the age of Google, where an illusion of simplicity is created by typing in words to find websites. But legal search of user-created data is a completely different type of search task than looking for information from popular websites. In the search for evidence in a litigation, or as part of a legal investigation, special expertise in information seeking is critical, including especially knowledge of multiple search techniques and methods. Again quoting Professor Marchionini:
Expert information seekers possess substantial knowledge related to the factors of information seeking, have developed distinct patterns of searching, and use a variety of strategies, tactics and moves.
Id. at 70.
In the field of law this kind of information seeking expertise includes the ability to understand and clarify what the information need is, in other words, to know what you are looking for, and articulate the need into specific search topics. This important step precedes the actual search, but is an integral part of the process. As one of the basic texts on information retrieval written by Gordon Cormack, et al, explains:
Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic …
Buttcher, Clarke & Cormack, Information Retrieval: Implementation and Evaluation of Search Engines (MIT Press, 2010) at pg. 5. The importance of pre-search refining of the information need is stressed in the first step of the above diagram of my methods, ESI Discovery Communications. It seems very basic, but is often underappreciated, or overlooked entirely in the litigation context where information needs are often vague and ill-defined, lost in overly long requests for production and adversarial hostility.
Hybrid Multimodal Bottom Line Driven Review
I have a long descriptive name for what Marchionini calls the variety of strategies, tactics and moves that I have developed for legal search: Hybrid Multimodal AI-Enhanced Review using a Bottom Line Driven Proportional Strategy. See eg. Bottom Line Driven Proportional Review (2013). I refer to it as a multimodal method because, although the predictive coding type of searches predominate (shown on the below diagram as AI-enhanced review - AI), I also use the other modes of search, including Unsupervised Learning Algorithms (explained in LegalSearchScience.com) (often called clustering or near-duplication searches), keyword search, and even some traditional linear review (although usually very limited). As described, I do not rely entirely on random documents, or computer selected documents for the AI-enhanced searches, but use a three-cylinder approach that includes human judgment sampling and AI document ranking. The various types of legal search methods used in a multimodal process are shown in this search pyramid.
Most information scientists I have spoken to agree that it makes sense to use multiple methods in legal search and not just rely on any single method. UCLA Professor Marcia J. Bates first advocated for using multiple search methods back in 1989, which she called berrypicking. Bates, Marcia J. The Design of Browsing and Berrypicking Techniques for the Online Search Interface, Online Review 13 (October 1989): 407-424. As Professor Bates explained in 2011 in Quora:
An important thing we learned early on is that successful searching requires what I called “berrypicking.” … Berrypicking involves 1) searching many different places/sources, 2) using different search techniques in different places, and 3) changing your search goal as you go along and learn things along the way. This may seem fairly obvious when stated this way, but, in fact, many searchers erroneously think they will find everything they want in just one place, and second, many information systems have been designed to permit only one kind of searching, and inhibit the searcher from using the more effective berrypicking technique.
This berrypicking approach, combined with HCIR, is what I have found from practical experience works best with legal search. They are the Hybrid Multimodal aspects of my AI-Enhanced Bottom Line Driven Review method.
My Battles in Court Over Predictive Coding
In 2012 my case became the first in the country where the use of predictive coding was approved. See Judge Peck’s landmark decision Da Silva Moore v. Publicis, 11 Civ. 1279, _ FRD _, 2012 WL 607412 (SDNY Feb. 24, 2012). In that case my methods of using Recommind’s Axcelerate software were approved. Later in 2012, in another first, an AAA arbitration approved our use of predictive coding in a large document production. In that case I used Kroll Ontrack’s Inview software over the vigorous objections of the plaintiff, which, after hearings, were all rejected. These and other decisions have helped pave the way for the use of predictive coding search methods in litigation.
In addition to these activities in court I have focused on scientific research on legal search, especially machine learning. I have, for instance, become one of the primary outside reporters on the legal search experiments conducted by TREC Legal Track of the National Institute of Standards and Technology. See eg. Analysis of the Official Report on the 2011 TREC Legal Track – Part One, Part Two and Part Three; Secrets of Search: Parts One, Two, and Three. Also see Jason Baron, DESI, Sedona and Barcelona.
After the TREC Legal Track closed down in 2011 the only group participant scientific study to test the efficacy of various predictive coding software, and search methods, is the one sponsored by Oracle, the Electronic Discovery Institute and Stanford. This search of a 1,639,311 document database was conducted in early 2013, with the results reported in Monica Bay’s article, EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov., 2013). Below is the chart published by LTN that summarizes the results.
Monica Bay summarizes the findings of the research as follows:
Phase I of the study shows that older lawyers still have e-discovery chops and you don’t want to turn EDD over to robots.
With respect to my dear friend Monica, I must disagree with her conclusion. The age of the lawyers is irrelevant. The best predictive coding trainers do not have to be old, they just have to be SMEs, power users of good software, and have good search skills. In fact, not all SMEs are old, although many may be. It is the expertise and skills that matter, not age per se. It is true as Monica reports that the lawyer, a team of one, who did better in this experiment than all of the other much larger participant groups, was chronologically old. But that fact is irrelevant. The skill set and small group size, namely one, is what made the difference. See: Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” - Parts One, Two, and Three.
Moreover, although Monica is correct to say we do not want to “turn over” review to robots, this assertion misses the point. We certainly do want to turn over review to robot-human teams. We want our predictive coding software, our robots, to hook up with our experienced lawyers. We want our lawyers to enhance their own limited intelligence with artificial intelligence – the Hybrid approach. Robots are the future, but only if and as they work hand-in-hand with our top human trainers. Then they are unbeatable, as the EDI-Oracle study shows.
For the time being the details of the EDI-Oracle scientific study are still closed, and even though Monica Bay was permitted to publicize the results, and make her own summary and conclusions, participants are prohibited from discussion and public disclosures. For this reason I can say no more on this study, and only assert without facts that Monica’s conclusions are in some respects incorrect, that age is not critical, that the hybrid multimodal method is what is important. I hope and expect that someday soon the gag order for participants will be lifted, the full findings of this most interesting scientific experiment will be released, and a free dialogue will commence. Truth only thrives in the open, and science concealed is merely occult.
Why Predictive Coding Driven CARs Are Important
I continue to focus on this sub-niche area of e-discovery as I am convinced that it is critical to advancement of the law in the 21st Century. Our own intelligence and search skills must be enhanced by the latest AI software. The new search and review methods I have developed allow a skilled attorney using readily available predictive coding type software to review at remarkable rates of speed and cost. The CAR review rates are more than 250-times faster than traditional linear review, and the costs less than a tenth as much. See eg Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron; EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov., 2013).
My Life as a Limo Driver and Trainer
I have spoken on this subject at many CLEs around the country since 2011. I explain the theory and practice of this new breakthrough technology. I also consult on a hands-on basis to help others learn the new methods. As an old software lover who has been doing legal document reviews since 1980, I also continue to like to do these review projects myself. I like to drive the CARs myself, not just teach others how to drive. I enjoy the interaction and enhancements from the hybrid, human-robot approach. Certainly I need and appreciate the artificial intelligence boosts to my own limited capacities.
I also like to serve as a kind of limo driver for trial lawyers from time to time. The top SMEs in the world (I prefer to work with the best), are almost never also software power-users, nor do they have special skills or talents for information seeking outside of depositions. For that reason they need me to drive the CAR for them. To switch to the robot analogy again, I like and can work with the bots, they cannot.
I can only do my job as a limo driver – robot friend in an effective manner if the SME first teaches me enough of their domain to know where I am going; to know what documents would be relevant or hot or not. That is where decades of legal experience handling a variety of cases is quite helpful. It makes it easier to get a download of the SME’s concept of relevance into my head, and then into the machine. Then I can act as a surrogate SME and do the machine training for them in an accurate and consistent manner.
Working as a driver for an SME presents many special communication challenges. I have had to devise a number of techniques to facilitate a new kind of SME surrogate agency process. Of course, it is easier to do the search when you are also the SME. For instance, in one project I reviewed almost two million documents, by myself, in only two weeks. That’s right. By myself. (There was no redaction or privilege logging, which are tasks that I always delegate anyway.) A quality assurance test at the end of the review based on random sampling showed that a very high accuracy rate was attained. There is no question that it met the reasonability standards required by law and rules of procedure.
It was only possible to do a project of this size so quickly because I happened to be an SME on the legal issues under review, and, just as important, I was a power-user of the software, and have, at this point, mastered my own search and review methods. I also like to think I have a certain knack for information seeking.
Thanks to the new software and methods, what was considered impossible, even absurd, just a few short years ago, namely one attorney accurately reviewing two million documents by him or herself in 14 days, is attainable by many experts. My story is not unique. Maura tells me that she once did a seven-million document review by herself. That is why Maura and Gordon were correct to refer to TAR as a disruptive technology in the Preface to their Glossary. Technology that can empower one skilled lawyer to do the work of hundreds of unskilled attorneys is certainly a big deal, one for which we have Legal Search Science to thank. It is also why I urge you to study this subject more carefully and learn to drive these new CARs yourself. Either that, or hire a limo driver.
My Writings on CAR
A good way to continue your study in this area is to read the articles by Grossman and Cormack, and the forty or so articles on the subject that I have written since mid-2011. They are listed here in rough chronological order, with the most recent on top. Also see the CAR procedures described on Electronic Discovery Best Practices.
I am especially proud of the legal search experiments I have done using AI-enhanced search software provided to me by Kroll Ontrack to review the 699,082 public Enron documents and my reports on these reviews. Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents. (Part Two); A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. (Part One). I have been told by scientists that my over 100 hours of search, comprised of two fifty-hour search projects using different methods, is the largest search project by a single reviewer that has ever been undertaken, not only in Legal Search, but in any kind of search. I do not expect this record will last for long, as others begin to understand the importance of Information Science in general, and Legal Search Science in particular. But for now I will enjoy both the record and lessons learned from the hard work involved.
Articles by Ralph Losey on Legal Search
- The “If-Only” Vegas Blues: Predictive Coding Rejected in Las Vegas, But Only Because It Was Chosen Too Late. Part One and Part Two.
- IT-Lex Discovers a Previously Unknown Predictive Coding Case: “FHFA v. JP Morgan, et al”
- Beware of the TAR Pits! Part One and Part Two.
- PreSuit: How Corporate Counsel Could Use “Smart Data” to Predict and Prevent Litigation. Also see PreSuit.com.
- Predictive Coding and the Proportionality Doctrine: a Marriage Made in Big Data, 26 Regent U. Law Review 1 (2013-2014).
- Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three.
- My Basic Plan for Document Reviews: The “Bottom Line Driven” Approach, PDF version suitable for print, or HTML version that combines the blogs published in four parts.
- Relevancy Ranking is the Key Feature of Predictive Coding Software.
- Why a Receiving Party Would Want to Use Predictive Coding?
- Vendor CEOs: Stop Being Empty Suits & Embrace the Hacker Way
- Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents. (Part Two).
- A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. (Part One).
- Introduction to Guest Blog: Quick Peek at the Math Behind the Black Box of Predictive Coding that pertains to the higher-dimensional geometry that makes predictive coding support vector machines possible.
- Keywords and Search Methods Should Be Disclosed, But Not Irrelevant Documents.
- Reinventing the Wheel: My Discovery of Scientific Support for “Hybrid Multimodal” Search.
- There Can Be No Justice Without Truth, And No Truth Without Search (statement of my core values as a lawyer explaining why I think predictive coding is important).
- Three-Cylinder Multimodal Approach To Predictive Coding.
- Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search. Video Animation.
- Borg Challenge: Report of my experimental review of 699,082 Enron documents using a semi-automated monomodal methodology (a five-part written and video series comparing two different kinds of predictive coding search methods).
- Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron in PDF form for easy distribution and the blog introducing this 82-page narrative, with second blog regarding an update.
- Journey into the Borg Hive: a Predictive Coding Narrative in science fiction form.
- The Many Types of Legal Search Software in the CAR Market Today.
- Georgetown Part One: Most Advanced Students of e-Discovery Want a New CAR for Christmas.
- Escape From Babel: The Grossman-Cormack Glossary.
- NEWS FLASH: Surprise Ruling by Delaware Judge Orders Both Sides To Use Predictive Coding.
- Does Your CAR (“Computer Assisted Review”) Have a Full Tank of Gas? (and you can also click here for the alternate PDF version for easy distribution).
- Analysis of the Official Report on the 2011 TREC Legal Track – Part One.
- Analysis of the Official Report on the 2011 TREC Legal Track – Part Two.
- Analysis of the Official Report on the 2011 TREC Legal Track – Part Three
- An Elusive Dialogue on Legal Search: Part One where the Search Quadrant is Explained.
- An Elusive Dialogue on Legal Search: Part Two – Hunger Games and Hybrid Multimodal Quality Controls.
- Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022.
- Second Ever Order Entered Approving Predictive Coding.
- Predictive Coding Based Legal Methods for Search and Review.
- New Methods for Legal Search and Review.
- Perspective on Legal Search and Document Review.
- LegalTech Interview of Dean Gonsowski on Predictive Coding and My Mission to Make Predictive Coding Software More Affordable.
- My Impromptu Video Interview at NY LegalTech on Predictive Coding and Some Hopeful Thoughts for the Future.
- The Legal Implications of What Science Says About Recall.
- Reply to an Information Scientist’s Critique of My “Secrets of Search” Article.
- Secrets of Search – Part I.
- Secrets of Search – Part II.
- Secrets of Search – Part III. (All three parts consolidated into one PDF document.)
- Information Scientist William Webber Posts Good Comment on the Secrets of Search Blog.
- Judge Peck Calls Upon Lawyers to Use Artificial Intelligence and Jason Baron Warns of a Dark Future of Information Burn-Out If We Don’t.
- The Information Explosion and a Great Article by Grossman and Cormack on Legal Search.
Please contact me at Ralph.Losey@gmail.com if you have any questions.