Legal Search Science

December 15, 2016

Gold_Lexie_robotLegal Search Science is an interdisciplinary field concerned with the search, review, and classification of large collections of electronic documents to find information for use as evidence in legal proceedings, for compliance to avoid litigation, or for general business intelligence. See Computer Assisted ReviewLegal Search Science as practiced today uses software with artificial intelligence features to help lawyers to find electronic evidence in a systematic, repeatable, and verifiable manner. See:  TAR Training Course (sixteen class course teaches our latest insights and methods of Predictive Coding 4.0) and the over sixty or so articles on the subject that I have written since mid-2011. The hybrid search method of AI human computer interaction developed in this field that will inevitably have a dramatic impact on the future practice of law. Lawyers will never be replaced entirely by robots embodying AI search algorithms, but the productivity of some lawyers using AI will allow them to do the work of dozens, if not hundreds of lawyers.

My own experience (Ralph Losey) provides an example. I participated in a study in 2013 where I searched and reviewed over 1.6 Millions documents by myself, with only the assistance of one computer – one robot, so to speak – running AI-enhanced software by Kroll Ontrack. I was able to do so more accurately and faster than large teams of lawyers working without artificial intelligence software. I was even able to work faster and more accurately than all other teams of lawyers and vendors that used AI-enhanced software, but did not use the science-based search methods described here. I do not attribute my success to my own intelligence, or any special gifts or talents. (They are very moderate.) I was able to succeed by applying the established scientific methods described here and in more detail in our TAR Training Course. They allowed me to augment my own small intelligence with that of the machine. If I have any special skills, it is in human-computer interaction, and legal search intuition. They are based on my long experience in the law with evidence (over 35 years), and in my experience in the last few years using predictive coding software.

Team_Triangle_2Legal Search Science as I understand it is a combination and subset of three fields of study: Information Science, the legal field of Electronic Discovery, and the engineering field concerned with the design and creation of Search Software. Its primary concern is with information retrieval and the unique problems faced by lawyers in the discovery of relevant evidence.

Most specialists in legal search science use a variety of search methods when searching large datasets. The use of multiple methods of search is referred to here as a multimodal approach. Although many search methods are used at the same time, the primary, or controlling search method in large projects is typically what is known as supervised or semi-supervised machine learning.  Semi-supervised learning is a type of artificial intelligence (AI) that uses an active learning approach. I refer to this as AI-enhanced review or AI-enhanced search. In information science it is often referred to as active machine learningand in legal circles as Predictive Coding.

For reliable introductory information on Legal Search Science see the works of attorney, Maura Grossman, and her information scientist partner, Professor Gordon Cormack, including:

The Grossman-Cormack Glossary explains that in machine learning:

Supervised Learning Algorithms (e.g., Support Vector Machines, Logistic Regression, Nearest Neighbor, and Bayesian Classifiers) are used to infer Relevance or Non-Relevance of Documents based on the Coding of Documents in a Training Set. In Electronic Discovery generally, Unsupervised Learning Algorithms are used for Clustering, Near-Duplicate Detection, and Concept Search.

For another perspective see the over sixty or so articles on the subject that I have written since mid-2011. They are listed in rough chronological order, with the most recent on top. The most important of these articles is Predictive Coding 4.0.

Legal Search ScienceMultimodal search uses both machine learning algorithms and unsupervised learning search tools (clustering, near-duplicates and concept), as well as keyword search and even some limited use of traditional linear search. This is further explained here in the section below entitled, Hybrid Multimodal Bottom Line Driven Review. The hybrid multimodal aspects described represent the consensus view among information search scientists. The bottom line driven aspects represent my legal overlay on the search methods. All of these components together make up what I call Legal Search Science. It represents a synthesis of knowledge and search methods from science, law, and software engineering.

The key definition of the Glossary is for Technology Assisted Review, their term for AI-enhanced review.

Technology-Assisted Review (TAR): A process for Prioritizing or Coding a Collection of Documents using a computerized system that harnesses human judgments of one or more Subject Matter Expert(s) on a smaller set of Documents and then extrapolates those judgments to the remaining Document Collection. Some TAR methods use Machine Learning Algorithms to distinguish Relevant from Non-Relevant Documents, based on Training Examples Coded as Relevant or Non-Relevant by the Subject Matter Experts(s), …. TAR processes generally incorporate Statistical Models and/or Sampling techniques to guide the process and to measure overall system effectiveness.

The Grossman-Cormack Glossary makes clear the importance of Subject Matter Experts (SMEs) by including their use as the document trainer into the very definition of TAR. Nevertheless, experts agree that good predictive coding software is able to tolerate some errors made in the training documents. For this reason experiments are being done on ways to minimize the central role of the SMEs, to see if lesser-qualified persons could also be used in document training, at least to some degree. See Webber & Pickens, Assessor Disagreement and Text Classifier Accuracy (SIGIR, 2013); John Tredennick, Subject Matter Experts: What Role Should They Play in TAR 2.0 Training? (2013). These experiments are of special concern to software developers and others who would like to increase the utilization of AI-enhanced software because, at the current time, very few SMEs in the law have the skills or time necessary to conduct AI-enhanced searches. This is one reason that predictive coding is still not widely used, even though it has been proven effective in multiple experiments and adopted by several courts.

Professor Oard

Professor Oard

For in-depth information on key experiments already performed in the field of Legal Search Science, see the TREC Legal Track reports whose home page is maintained by a leader in the field, information scientist, Doug Oard. Professor Oard is a co-founder of the TREC Legal track. Also see the research and reports of Herb Rotiblat and the Electronic Discovery Institute, and my papers on TREC (and otherwise as listed below): Analysis of the Official Report on the 2011 TREC Legal Track – Part OnePart Two and Part Three; and Secrets of Search: Parts OneTwo, and Three.

For general legal background on the field of Legal Search Science see the works of the attorney co-founder of TREC Legal Track, Jason R. Baron, including:

Baron_at_blackboardAs explained in Baron and Freeman’s Quick Peek at the Math, and my blog introduction thereto, the supervised learning algorithms behind predictive coding utilize a hyper-dimensional space. Each each document in the dataset, including its metadata, represent a different dimension mapped in trans-Cartesian space, called hyper-planes. Each document is placed according to a multi-dimensional dividing line of relevant and irrelevant. The important document ranking feature of predictive coding is performed by measure as to how far from the dividing line a particular document lies. Each time a training session is run the line moves and the ranking fluctuates in accordance with the new information provided. The below diagram attempts to portray this hyperplane division and document placement. The points shown in red designate irrelevant documents and the blue points relevant documents. The dividing line would run through multiple dimensions, not just the usual two of a Cartesian graph. This is depicted in this diagram by folding fields. For more read the entire Quick Peek article.hyperplanes3d_2

For a scientific and statistical view of Legal Search Science that is often at least somewhat intelligible to lawyers and other non-scientists, see the blog of information scientist and consultant, William Webber, Evaluating e-DiscoveryAlso see the many judicial opinions approving and encouraging the use of predictive coding.

AI-Enhanced Search Methods

AI-enhanced search represents an entirely new method of legal search, which requires a completely new approach to large document reviews. Below is the diagram of the latest Predictive Coding 4.0 workflow I use in a typical predictive coding project.

predictive_coding_4-0_web

For a full description of the eight steps take our free sixteen class online training program. See: TAR Training Course

I have found that proper AI-enhanced review is the most interesting and exciting activity in electronic discovery law. Predictive Coding version 3.0 is The tool that we have all been waiting for. When used properly, i.w. Version 4.0, good AI-enhanced software such as Mr.EDR, allows attorneys to find the information they need in vast stores of ESI, and to do so in an effective and affordable manner.

Hybrid Human Computer Information Retrieval

human-and-robots

Further, in contradistinction to Borg approaches, where the machine controls the learning process, I advocate a hybrid approach where Man and Machine work together. In my hybrid search and review projects the expert reviewer remains in control of the process, and their expertise is leveraged for greater accuracy and speed. The human intelligence of the SME is a key part of the search process. In the scholarly literature of information science this hybrid approach is known as Human–computer information retrieval (HCIR). (My thanks to information scientist Jeremy Pickens for pointing out this literature to me.)

The classic text in the area of HCIR, which I endorse, is Information Seeking in Electronic Environments (Cambridge 1995) by Gary Marchionini, Professor and Dean of the School of Information and Library Sciences of U.N.C. at Chapel Hill. Professor Marchionini speaks of three types of expertise needed for a successful information seeker:

  1. Domain Expertise. This is equivalent to what we now call SME, subject matter expertise. It refers to a domain of knowledge. In the context of law the domain would refer to the particular type of lawsuit or legal investigation, such as antitrust, patent, ERISA, discrimination, trade-secrets, breach of contract, Qui Tam, etc. The knowledge of the SME on the particular search goal is extrapolated by the software algorithms to guide the search. If the SME also has the next described System Expertise and Information Seeking Expertise, they can run the search project themselves. That is what I like to call the Army of One approach. Otherwise, they will need a chauffeur or surrogate with such expertise, one who is capable of learning enough from the SME to recognize the relevant documents.
  2. System Expertise. This refers to expertise in the technology system used for the search. A system expert in predictive coding would have a deep and detailed knowledge of the software they are using, including the ability to customize the software and use all of its features. In computer circles a person with such skills is often called a power-user. Ideally a power-user would have expertise in several different software systems. They would also be an expert in one or more particular method of search.
  3. Information Seeking Expertise. This is a skill that is often overlooked in legal search. It refers to a general cognitive skills related to information seeking. It is based on both experience and innate talents. For instance, “capabilities such as superior memory and visual scanning abilities interact to support broader and more purposive examination of text.” Professor Marchionini goes on to say that: “One goal of human-computer interaction research is to apply computing power to amplify and augment these human abilities.” Some lawyers seem to have a gift for search, which they refine with experience, broaden with knowledge of different tools, and enhance with technologies. Others do not.

Id. at pgs.66-69, with the quotes from pg. 69.

All three of these skills are required for an legal team to have the expertise in legal search today, which is one reason I find this new area of legal practice so interesting and exciting. See:  TAR Training Course

Predictive_coding_triangles

It is not enough to be an SME, or a power-user, or have a special knack for search. You need a team that has it all, and great software. However, studies have shown that of the three skill-sets, System Expertise, which in legal search primarily means mastery of the particular software used (Power User), is the least important. Id. at 67. The SMEs are more important, those who have mastered a domain of knowledge. In Professor Marchionini’s words:

Thus, experts in a domain have greater facility and experience related to information-seeking factors specific to the domain and are able to execute the subprocesses of information seeking with speed, confidence, and accuracy.

Id. That is one reason that the Grossman Cormack glossary quoted before builds in the role of SMEs as part of their base definition of technology assisted review. Glossary at pg. 21 defining TAR.

According to Marchionini, Information Seeking Expertise, much like Subject Matter Expertise, is also more important than specific software mastery. Id. This may seem counter-intuitive in the age of Google, where an illusion of simplicity is created by typing in words to find websites. But legal search of user-created data is a completely different type of search task than looking for information from popular websites. In the search for evidence in a litigation, or as part of a legal investigation, special expertise in information seeking is critical, including especially knowledge of multiple search techniques and methods. Again quoting Professor Marchionini:

Expert information seekers possess substantial knowledge related to the factors of information seeking, have developed distinct patterns of searching, and use a variety of strategies, tactics and moves.

Id. at 70.

In the field of law this kind of information seeking expertise includes the ability to understand and clarify what the information need is, in other words, to know what you are looking for, and articulate the need into specific search topics. This important step precedes the actual search, but is an integral part of the process. As one of the basic texts on information retrieval written by Gordon Cormack, et al, explains:

Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic …

Buttcher, Clarke & Cormack, Information Retrieval: Implementation and Evaluation of Search Engines (MIT Press, 2010) at pg. 5. The importance of pre-search refining of the information need is stressed in the first step of the above diagram of my methods, ESI Discovery Communications. It seems very basic, but is often under appreciated, or overlooked entirely in the litigation context where information needs are often vague and ill-defined, lost in overly long requests for production and adversarial hostility.

Hybrid Multimodal Bottom Line Driven Review

I have a long name for what Marchionini calls the variety of strategies, tactics and moves that I have developed for legal search: Hybrid Multimodal. See: TAR Training Course. This sixteen class course teaches our latest insights and methods of Predictive Coding 4.0.. I refer to it as a multimodal method because, although the predictive coding type of searches predominate (shown on the below diagram as AI-enhanced review – AI), I also  use the other modes of search, including the mentioned Unsupervised Learning Algorithms (clustering and concept), keyword search, and even some traditional linear review (although usually very limited). As described, I do not rely entirely on random documents, or computer selected documents for the AI-enhanced searches, but use a three-cylinder approach that includes human judgment sampling and AI document ranking. The various types of legal search methods used in a multimodal process are shown in this search pyramid.

search_pyramid_revised

Most information scientists I have spoken to agree that it makes sense to use multiple methods in legal search and not just rely on any single method, even the best AI method. UCLA Professor Marcia J. Bates first advocated for using multiple search methods back in 1989, which she called it berrypicking. Bates, Marcia J. The Design of Browsing and Berrypicking Techniques for the Online Search Interface, Online Review 13 (October 1989): 407-424. As Professor Bates explained in 2011 in Quora:

An important thing we learned early on is that successful searching requires what I called “berrypicking.” … Berrypicking involves 1) searching many different places/sources, 2) using different search techniques in different places, and 3) changing your search goal as you go along and learn things along the way. This may seem fairly obvious when stated this way, but, in fact, many searchers erroneously think they will find everything they want in just one place, and second, many information systems have been designed to permit only one kind of searching, and inhibit the searcher from using the more effective berrypicking technique.

This berrypicking approach, combined with HCIR, is what I have found from practical experience works best with legal search. They are the Hybrid Multimodal aspects of my AI-Enhanced Review Bottom Line Driven Review method.

Why AI-Enhanced Search and Review Is Important

I focus on this sub-niche area of e-discovery because I am convinced that it is critical to advancement of the law in the 21st Century. The new search and review methods that I have developed from my studies and experiments in legal search science allow a skilled attorney using readily available predictive coding type software to review at remarkable rates of speed and cost. Review rates are more than 250-times faster than traditional linear review, and costs less than a tenth as much. See eg Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron, and the report by the Rand Corporation,  Where The Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery.

Thanks to the new software and methods, what was considered impossible, even absurd, just a few short years ago, namely one attorney accurately reviewing over a million documents by him or herself in 14-days, is attainable by many experts. I have done it. That is when I came up with the Army of One motto and realized that we were at a John Henry moment in Legal Search. Maura tells me that she once did a seven-million document review by herself. Maura and Gordon were correct to refer to TAR as a disruptive technology in the Preface to their Glossary. Technology that can empower one skilled lawyer to do the work of hundreds of unskilled attorneys is certainly a big deal, one for which we have Legal Search Science to thank.

Ralph and some of his computers at one of his law offices

More Information On Legal Search Science

For further information on Legal Search Science see all of the articles cited above, along with the over sixty or so articles on the subject that I have written since mid-2011. Also enroll in our free 16 class TAR Training Course. This course teaches our latest insights and methods of Predictive Coding 4.0. Most of my articles were written for the general reader, some are highly technical but still accessible with study. All have been peer-reviewed in my blog by most of the founders of this field who are regular readers and thousands of other readers.

I am especially proud of the legal search experiments I have done using AI-enhanced search software provided to me by Kroll Ontrack to review the 699,083 public Enron documents and my reports on these reviews. Comparative Efficacy of Two Predictive Coding Reviews of 699,082 Enron Documents(Part Two); A Modest Contribution to the Science of Search: Report and Analysis of Inconsistent Classifications in Two Predictive Coding Reviews of 699,082 Enron Documents. (Part One). I have been told by scientists in the field that my over 100 hours of search, consisting of two fifty-hour search projects using different methods, is the largest search project by a single reviewer that has ever been undertaken, not only in Legal Search, but in any kind of search. I do not expect this record will last for long, as others begin to understand the importance of Information Science in general, and Legal Search Science in particular.


Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part Two

November 2, 2014

recordsThis is the second part of a two-part blog, please read part one first.

AI-Enhanced Big Data Search Will Greatly Simplify Information Governance

Information Governance is, or should be, all about finding the information you need, when you need it, and doing so in a cheap and efficient manner. Information needs are determined by both law and personal preferences, including business operation needs. In order to find information, you must first have it. Not only that, you must keep it until you need it. To do that, you need to preserve the information. If you have already destroyed information, really destroyed it I mean, not just deleted it, then obviously you will not be able to find it. You cannot find what does not exist, as all Unicorn chasers eventually find out.

Too_Many_RecordsThis creates a basic problem for Information Governance because the whole system is based on a notion that the best way to find valuable information is to destroy worthless information. Much of Information Governance is devoted to trying to determine what information is a valuable needle, and what is worthless chaff. This is because everyone knows that the more information you have, the harder it is for you to find the information you need. The idea is that too much information will cut you off. These maxims were true in the pre-AI-Enhanced Search days, but are, IMO, no longer true today, or, at least, will not be true in the next five to ten years, maybe sooner.

In order to meet the basic goal of finding information, Information Governance focuses its efforts on the proper classification of information. Again, the idea was to make it simpler to find information by preserving some of it, the information you might need to access, and destroying the rest. That is where records classification comes in.

The question of what information you need has a time element to it. The time requirements are again based on personal and business operations needs, and on thousand of federal, state and local laws. Information governance thus became a very complicated legal analysis problem. There are literally thousands of laws requiring certain types of information to be preserved for various lengths of time. Of course, you could comply with most of these laws by simply saving everything forever, but, in the past, that was not a realistic solution. There were severe limits on the ability to save information, and the ability to find it. Also, it was presumed that the older information was, the less value it had. Almost all information was thus treated like news.

These ideas were all firmly entrenched before the advent of Big Data and AI-enhanced data mining. In fact, in today’s world there is good reason for Google to save every search, ever done, forever. Some patterns and knowledge only emerge in time and history. New information is sometimes better information, but not necessarily so. In the world of Big Data all information has value, not just the latest.

paper records management warehouseThis records life-cycle ideas all made perfect sense in the world of paper information. It cost a lot of money to save and store paper records. Everyone with a monthly Iron Mountain paper records storage bill knows that. Even after the computer age began, it still cost a fair amount of money to save and store ESI. The computers needed to buy and maintain digital storage used to be very expensive. Finding the ESI you needed quickly on a computer was still very difficult and unreliable. All we had at first was keyword search, and that was very ineffective.

Due to the costs of storage, and the limitations of search, tremendous efforts were made by record managers to try to figure out what information was important, or needed, either from a legal perspective, or a business necessity perspective, and to save that information, and only that information. The idea behind Information Management was to destroy the ESI you did not need or were not required by law to preserve. This destruction saved you money, and, it also made possible the whole point of Information Governance, to find the information you wanted, when you wanted it.

Back in the pre-AI search days, the more information you had, the harder it was to find the information you needed. That still seems like common sense. Useless information was destroyed so that you could find valuable information. In reality, with the new and better algorithms we now have for AI-enhanced search, it is just the reverse. The more information you have, the easier it becomes to find what you want. You now have more information to draw upon.

That is the new reality of Big Data. It is a hard intellectual paradigm to jump, and seems counter-intuitive. It took me a long time to get it. The new ability to save and search everything cheaply and efficiently is what is driving the explosion of Big Data services and products. As the save everything, find anything way of thinking takes over, the classification and deletion aspects of Information Governance will naturally dissipate. The records lifecycle will transform into virtual immortality. There is no reason to classify and delete, if you can save everything and find anything at low cost. The issues simplify; they change to how to save and  search, although new collateral issues of security and privacy grow in importance.

Save and Search v. Classify and Delete

The current clash in basic ideas concerning Big Data and Information Governance is confusing to many business executives. According to Gregory Bufithis who attended a recent event in Washington D.C. on Big Data sponsored by EMC, one senior presenter explained:

The C Suite is bedeviled by IG and regulatory complexity. … 

The solution is not to eliminate Information Governance entirely. The reports of its complete demise, here or elsewhere, are exaggerated. The solution is to simplify IG. To pare it down to save and search. Even this will take some time, like I said, from five to ten years, although there is some chance this transformation of IG will go even faster than that. This move away from complex regulatory classification schemes, to simpler save and search everything, is already being adopted by many in the high-tech world. To quote Greg again from the private EMC event in D.C. in October, 2014:

Why data lakes? Because regulatory complexity and the changes can kill you. And are unpredictable in relationship to information governance. …

So what’s better? Data lakes coupled with archiving. Yes, archiving seems emblematic of “old” IT. But archiving and data lifecycle management (DLM) have evolved from a storage focus, to a focus on business value and data loss prevention. DLM recognizes that as data gets older, its value diminishes, but it never becomes worthless. And nobody is throwing out anything and yes, there are negative impacts (unnecessary storage costs, litigation, regulatory sanctions) if not retained or deleted when it should be.

But … companies want to mine their data for operational and competitive advantage. So data lakes and archiving their data allows for ingesting and retain all information types, structured or unstructured. And that’s better.

Because then all you need is a good search platform or search system … like Hadoop which allows you to sift through the data and extract the chunks that answer the questions at hand. In essence, this is a step up from OLAP (online analytical processing). And you can use “tag sift sort” programs like Data Rush. Or ThingWorx which is an approach that monitors the stream of data arriving in the lake for specific events. Complex event processing (CEP) engines can also sift through data as it enters storage, or later when it’s needed for analysis.

Because it is all about search.

Recent Breakthroughs in Artificial Intelligence
Make Possible Save Everything, Find Anything

AIThe New York Times in an opinion editorial this week discussed recent breakthroughs in Artificial Intelligence and speculated on alternative futures this could create. Our Machine Masters, NT Times Op-Ed, by David Brooks (October 31, 2014). The Times article quoted extensively another article in the current issue of Wired by technology blogger Kevin Kelly: The Three Breakthroughs That Have Finally Unleashed AI on the World. Kelly argues, as do I, that artificial intelligence has now reached a breakthrough level. This artificial intelligence breakthrough, Kevin Kelly argues, and David Brook’s agrees, is driven by three things: cheap parallel computation technologies, big data collection, and better algorithms. The upshot is clear in the opinion of both Wired and the New York Times: “The business plans of the next 10,000 start-ups are easy to forecast: Take X and add A.I. This is a big deal, and now it’s here.

These three new technology advances change everything. The Wired article goes into the technology and financial aspects of the new AI; it is where the big money is going and will be made in the next few decades. If Wired is right, then this means in our world of e-discovery, companies and law firms will succeed if, and only if, they add AI to their products and services. The firms and vendors who add AI to document review, and project management, will grow fast. The non-AI enhanced vendors, non-AI enhanced software, will go out of business. The law firms that do not use AI tools will shrink and die.

David_BrooksThe Times article by David Brooks goes into the sociological and philosophical aspects of the recent breakthroughs in Artificial Intelligence:

Two big implications flow from this. The first is sociological. If knowledge is power, we’re about to see an even greater concentration of power.  … [E]ngineers at a few gigantic companies will have vast-though-hidden power to shape how data are collected and framed, to harvest huge amounts of information, to build the frameworks through which the rest of us make decisions and to steer our choices. If you think this power will be used for entirely benign ends, then you have not read enough history.

The second implication is philosophical. A.I. will redefine what it means to be human. Our identity as humans is shaped by what machines and other animals can’t do. For the last few centuries, reason was seen as the ultimate human faculty. But now machines are better at many of the tasks we associate with thinking — like playing chess, winning at Jeopardy, and doing math. [RCL – and, you might add, better at finding relevant evidence.]

On the other hand, machines cannot beat us at the things we do without conscious thinking: developing tastes and affections, mimicking each other and building emotional attachments, experiencing imaginative breakthroughs, forming moral sentiments. [RCL – and, you might add, better at equitable notions of justice and at legal imagination.]

In this future, there is increasing emphasis on personal and moral faculties: being likable, industrious, trustworthy and affectionate. People are evaluated more on these traits, which supplement machine thinking, and not the rote ones that duplicate it.

In the cold, utilitarian future, on the other hand, people become less idiosyncratic. If the choice architecture behind many decisions is based on big data from vast crowds, everybody follows the prompts and chooses to be like each other. The machine prompts us to consume what is popular, the things that are easy and mentally undemanding.

I’m happy Pandora can help me find what I like. I’m a little nervous if it so pervasively shapes my listening that it ends up determining what I like. [RCL – and, you might add, determining what is relevant, what is fair.]

I think we all want to master these machines, not have them master us.

ralph_wrongAlthough I share the concerns of the NY Times about mastering machines and alternative future scenarios, my analysis of the impact of the new AI is focused and limited to the Law. Lawyers must master the AI-search for evidence processes. We must master and use the better algorithms, the better AI-enhanced software, not visa versa. The software does not, nor should it, run itself. Easy buttons in legal search are a trap for the unwary, a first step down a slippery slope to legal dystopia. Human lawyers must never over-delegate our uniquely human insights and abilities. We must train the machines. We must stay in charge and assert our human insights on law, relevance, equity, fairness and justice, and our human abilities to imagine and create new realities of justice for all. I want lawyers and judges to use AI-enhanced machines, but I never want to be judged by a machine alone, nor have a computer alone as a lawyer.

The three big new advances that are allowing better and better AI are nowhere near to threatening the jobs of human judges or lawyers, although they will likely reduce their numbers, and certainly will change their jobs. We are already seeing these changes in Legal Search and Information Governance. Thanks to cheap parallel computation, we now have Big Data Lakes stored in thousands of inexpensive, cloud computers that are operating together. This is where open-sourced software like Hadoop comes in. They make the big clusters of computers possible. The better algorithms is where better AI-enhanced Software comes in. This makes it possible to use predictive coding effectively and inexpensively to find the information needed to resolve law suits. The days of vast numbers of document reviewer attorneys doing linear review are numbered. Instead, we will see a few SMEs, working with small teams of reviewers, search experts, and software experts.

The role of Information Managers will also change drastically. Because of Big Data, cheap parallel computing, and better algorithms, it is now possible to save everything, forever, at a small cost, and to quickly search and find what you need. The new reality of Save Everything, Find Anything undercuts most of the rationale of Information Governance. It is all about search now.

Conclusion

Ralph_Losey_2013_abaNow that storage costs are negligible, and search far more efficient, the twin motivators of Information Science to classify and destroy are gone, or soon will be. The key remaining tasks of Information Governance are now preservation and search, plus relatively new ones of security and privacy. I recognize that the demise of the importance of destruction of ESI could change if more governments enact laws that require the destruction of ESI, like the EU has done with Facebook posts and the so-called “right to be forgotten law.” But for now, most laws are about saving data for various times, and do not require data be destroyed. Note that the new Delaware law on data destruction still keeps it discretionary on whether to destroy personal data or not. House Bill No. 295 – The Safe Destruction of Documents Containing Personal Identifying Information. It only places legal burdens and liability for failures to properly destroy data. This liability for mistakes in destruction serves to discourage data destruction, not encourage it.

Preservation is not too difficult when you can economically save everything forever, so the challenging task remaining is really just one of search. That is why I say that Information Governance will become a sub-set of search. The save everything forever model will, however, create new legal work for lawyers. The cybersecurity protection and privacy aspects of Big Data Lakes are already creating many new legal challenges and issues. More legal issues are sure to arise with the expansion of AI.

Automation, including this latest Second Machine Age of mental process automation, does not eliminate the need for human labor. It just makes our work more interesting and opens up more time for leisure. Automation has always created new jobs as fast as it has eliminated old ones. The challenge for existing workers like ourselves is to learn the new skills necessary to do the new jobs. For us e-discovery lawyers and techs, this means, among other things, acquiring new skills to use AI-enhanced tools. One such skill, the ability for HCIR, human computer information retrieval, is mentioned in most of my articles on predictive coding. It involves new skill sets in active machine learning to train a computer to find the evidence you want from large collections of data sets, typically emails. When I was a law student in the late 1970s, I could never have dreamed that this would be part of my job as a lawyer in 2014.

The new jobs do not rely on physical or mental drudgery and repetition. Instead, they put a premium on what makes up distinctly human, our deep knowledge, understanding, wisdom, and intuition; our empathy, caring, love and compassion; our morality, honesty, and trustworthiness; our sense of justice and fairness; our ability to change and adapt quickly to new conditions; our likability, good will, and friendliness; our imagination, art, wisdom, and creativity. Yes, even our individual eccentricities, and our all important sense of humor. No matter how far we progress, let us never lose that! Please be governed accordingly.



Reinventing the Wheel: My Discovery of Scientific Support for “Hybrid Multimodal” Search

April 21, 2013

reinventing the wheelGetting predictive coding software is just part of the answer to the high-cost of legal review. Much more important is how you use it, which in turn depends, at least in part, on which software you get. That is why I have been focusing on methods for using the new technologies. I have been advocating for what I call the hybrid multimodal method. I created this method on my own over many years of legal discovery. As it turns out, I was merely reinventing the wheel. These methods are already well-established in the scientific information retrieval community. (Thanks to information scientist Jeremy Pickens, an expert in collaborative search, who helped me to find the prior art.)

In this blog I will share some of the classic information science research that supports hybrid multimodal. It includes the work of  Gary Marchionini, Professor and Dean of the School of Information and Library Sciences of U.N.C. at Chapel Hill, and UCLA Professor Marcia J. Bates who has advocated for a multimodal approach to search since 1989. Study of their writings has enabled me to better understand and refine my methods. I hope you will also explore with me the literature in this field. I provide links to some of the books and literature in this area for your further study.

Advanced CARs Require Completely New Driving Methods

First I need to set the stage for this discussion by use of the eight-step diagram show below. This is one of the charts I created to teach the workflow I use in a typical computer assisted review (CAR) project. You have seen it here many times before. For a full description of the eight steps see the Electronic Discovery Best Practices page on predictive coding.

predictive coding work flow

The iterated steps four and five in this work-flow are unique to predictive coding review. They are where active learning takes place. The Grossman-Cormack Glossary defines active learning as:

An Iterative Training regimen in which the Training Set is repeatedly augmented by additional Documents chosen by the Machine Learning Algorithm, and coded by one or more Subject Matter Expert(s).

The Grossman-Cormack Glossary of Technology-Assisted Review,  2013 Fed. Cts. L. Rev. 7 (2013). at pg.

Beware of any co-called advanced review software that does not include these steps; they are not bona-fide predictive coding search engines. My preferred active learning process is threefold:

1.  The computer selects documents for review where the software classifier is uncertain of the correct classification. This helps the classifier algorithms to learn by adding diversity to the documents presented for review. This in turn helps to locate outliers of a type your initial judgmental searches in step two (and  five) of the above diagram have missed. This is machine-selected sampling, and, according to a basic text in information retrieval engineering, a process is not a bona fide active learning search without this ability. Manning, Raghavan and Schutze, Introduction to Information Retrieval, (Cambridge, 2008) at pg. 309.

2.  Some reasonable percentage of the documents presented for human review in step five are selected at random. This again helps maximize recall and premature focus on the relevant documents initially retrieved.

3.  Other relevant documents that a skilled reviewer can find using a variety of search techniques. This is called judgmental sampling. After the first round of training, a/k/a the seed set, judgmental sampling by a variety of search methods is used based on the machine selected or random selected documents presented for review. Sometimes the subject matter expert (“SME”) human reviewer may follow a new search idea unrelated to the documents presented.  Any kind of searches can be used for judgmental sampling, which is why I call it a multimodal search. This may include some linear review of selected custodians or dates, parametric Boolean keyword searches, similarity searches of all kinds, concept searches, as well as several unique predictive coding probability searches.

The initial seed set generation, step two in the chart, should also use some random samples, plus judgmental multimodal searches. Steps three and six in the chart always use pure random samples and rely on statistical analysis. For more on the three types of sampling see my blog, Three-Cylinder Multimodal Approach To Predictive Coding.

My insistence on the use of multimodal judgmental sampling in steps two and five to locate relevant documents follows the consensus view of information scientists specializing in information retrieval, but is not followed by several prominent predictive coding vendors. They instead rely entirely on machine selected documents for training, or even worse, rely entirely on random selected documents to train the software. In my writings I call these processes the Borg approach, after the infamous villans in Star Trek, the Borg, an alien race that assimilates people. (I further differentiate between three types of Borg in Three-Cylinder Multimodal Approach To Predictive Coding.) Like the Borg, these approaches unnecessarily minimize the role of individuals, the SMEs. They exclude other types of search to supplement an active learning process. I advocate the use of all types of search, not just predictive coding.

Hybrid Human Computer Information Retrieval

human-and-robots

In contradistinction to Borg approaches, where the machine controls the learning process, I advocate a hybrid approach where Man and Machine work together. In my hybrid CARs the expert reviewer remains in control of the process, and their expertise is leveraged for greater accuracy and speed. The human intelligence of the SME is a key part of the search process. In the scholarly literature of information science this hybrid approach is known as Human–computer information retrieval (HCIR).

The classic text in the area of HCIR, which I endorse, is Information Seeking in Electronic Environments (Cambridge 1995) by Gary Marchionini, Professor and Dean of the School of Information and Library Sciences of U.N.C. at Chapel Hill. Professor Marchionini speaks of three types of expertise needed for a successful information seeker:

1.  Domain Expertise. This is equivalent to what we now call SME, subject matter expertise. It refers to a domain of knowledge. In the context of law the domain would refer to particular types of lawsuits or legal investigations, such as antitrust, patent, ERISA, discrimination, trade-secrets, breach of contract, Qui Tam, etc. The knowledge of the SME on the particular search goal is extrapolated by the software algorithms to guide the search. If the SME also has System Expertise, and Information Seeking Expertise, they can drive the CAR themselves.   Otherwise, they will need a chauffeur with such expertise, one who is capable of learning enough from the SME to recognize the relevant documents.

2.  System Expertise. This refers to expertise in the technology system used for the search. A system expert in predictive coding would have a deep and detailed knowledge of the software they are using, including the ability to customize the software and use all of its features. In computer circles a person with such skills is often called a power-user. Ideally a power-user would have expertise in several different software systems.

3.  Information Seeking Expertise. This is a skill that is often overlooked in legal search. It refers to a general cognitive skill related to information seeking. It is based on both experience and innate talents. For instance, “capabilities such as superior memory and visual scanning abilities interact to support broader and more purposive examination of text.” Professor Marchionini goes on to say that: “One goal of human-computer interaction research is to apply computing power to amplify and augment these human abilities.” Some lawyers seem to have a gift for search, which they refine with experience, broaden with knowledge of different tools, and enhance with technologies. Others do not, or the gift is limited to interviews and depositions.

Id. at pgs.66-69, with the quotes from pg. 69.

All three of these skills are required for an attorney to attain expertise in legal search today, which is one reason I find this new area of legal practice so challenging. It is difficult, but not impossible like this Penrose triangle.

Penrose_triangle_Expertise

It is not enough to be an SME, or a power-user, or have a special knack for search. You have to be able to do it all. However, studies have shown that of the three skill-sets, System Expertise, which in legal search primarily means mastery of the particular software used, is the least important. Id. at 67. The SMEs are more important, those  who have mastered a domain of knowledge. In Professor Marchionini’s words:

Thus, experts in a domain have greater facility and experience related to information-seeking factors specific to the domain and are able to execute the subprocesses of information seeking with speed, confidence, and accuracy.

Id. That is one reason that the Grossman Cormack glossary builds in the role of SMEs as part of their base definition of computer assisted review:

A process for Prioritizing or Coding a Collection of electronic Documents using a computerized system that harnesses human judgments of one or more Subject Matter Expert(s) on a smaller set of Documents and then extrapolates those judgments to the remaining Document Collection.

Grossman-Cormack Glossary at pg. 21 defining TAR.

According to Marchionini, Information Seeking Expertise, much like Subject Matter Expertise, is also more important than specific software mastery. Id. This may seem counter-intuitive in the age of Google, where an illusion of simplicity is created by typing in words to find websites. But legal search of user-created data is a completely different type of search task than looking for information from popular websites. In the search for evidence in a litigation, or as part of a legal investigation, special expertise in information seeking is critical, including especially knowledge of multiple search techniques and methods. Again quoting Professor Marchionini:

Expert information seekers possess substantial knowledge related to the factors of information seeking, have developed distinct patterns of searching, and use a variety of strategies, tactics and moves.

Id. at 70.

In the field of law this kind of information seeking expertise includes the ability to understand and clarify what the information need is, in other words, to know what you are looking for, and articulate the need into specific search topics. This important step precedes the actual search, but should thereafter continue as an integral part of the process. As one of the basic texts on information retrieval written by Gordon Cormack, et al, explains:

Before conducting a search, a user has an information need, which underlies and drives the search process. We sometimes refer to this information need as a topic …

Buttcher, Clarke & Cormack, Information Retrieval: Implementation and Evaluation of Search Engines (MIT Press, 2010) at pg. 5.

The importance of pre-search refining of the information need is stressed in the first step of the above diagram of my methods, ESI Discovery Communications. It seems very basic, but is often under appreciated, or overlooked entirely in the litigation context. In legal discovery information needs are often vague and ill-defined, lost in overly long requests for production and adversarial hostility. In addition to concerted activity up front to define relevance, the issue of information need should be kept in mind throughout the project. Typically our understanding of relevance evolves as our understanding of what really happened in a dispute emerges and grows.

At the start of an e-discovery project we are almost never searching for specific known documents. We never know for sure what information we will discover. That is why the phrase information seeking is actually more appropriate for legal search than information retrieval. Retrieval implies that particular facts exist and are already known; we just need to look them up. Legal search is not like that at all. It is a process of seeking and discovery. Again quoting Professor Marchionini:

The term information seeking is preferred to information retrieval because it is more human oriented and open ended. Retrieval implies that the object must have been “known” at some point; most often, those people who “knew” it organized it for later “knowing” by themselves or someone else. Seeking connotes the process of acquiring knowledge; it is more problem oriented as the solution may or may not be found.

Information Seeking in Electronic Environments, supra at 5-6.

Legal search is a process of seeking information, not retrieving information. It is a process of discovery, not simple look-up of known facts. More often than not in legal search you find the unexpected, and your search evolves as it progresses. Concept shift happens. Or you find nothing at all. You discover that the requesting party has sent you hunting for Unicorns, for evidence that simply does not exist. For example, the plaintiff alleges discrimination, but a search through tens of thousands of defendant’s emails shows no signs of it.

Information scientists have been talking about the distinction between machine oriented retrieval and human oriented seeking for decades. The type of discovery search that lawyers do is referred to in the literature (without any specific mention of law or legal search) as exploratory search. See: White & Roth, Exploratory Search: Beyond the Query-Response Paradigm (Morgan & Claypool, 2009). Ryen W. White, Ph.D., a senior researcher at Microsoft Research, builds on the work of Marchionini and gives this formal definition of exploratory search:

Exploratory search can be used to describe an information-seeking problem context that is open-ended, persistent, and multi-faceted; and to describe information-seeking processes that are opportunistic, iterative, and multi-tactical. In the first sense, exploratory search is commonly used in scientific discovery, learning, and decision-making contexts. In the second sense, exploratory tactics are used in all manner of information seeking and reflect seeker preferences and experience as much as the goal.

Id. at 6. He could easily have added legal discovery to this list, but like most information scientists, seems unacquainted with the law and legal search.

White and Roth point out that exploratory search typically uses a multimodal (berrypicking) approach to information needs that begin as vague notions. A many-methods-approach helps the information need to evolve and become more distinct and meaningful over time. They contend that the information-seeking strategies need to be supported by system features and user interface designs, bringing humans more actively into the search process. Id. at 15. That is exactly what I mean by a hybrid process where lawyers are actively involved in the search process.

The fully Borg approach has it all wrong. They use a look-up approach to legal search that relies as much as possible on fully automated systems. The user interface for this type of information retrieval software is designed to keep humans out of the search, all in the name of ease of use and impartiality. The software designers of these programs, typically engineers working without adequate input from lawyers, erroneously assume that e-discovery is just a retrieval task. They erroneously assume that predictive coding always starts with well-defined information needs that do not evolve with time. Some engineers and lit-support techs may fall for this myth, but all practicing lawyers know better. They know that legal discovery is an open-ended, persistent, and multi-faceted process of seeking.

Hybrid Multimodal Computer Assisted Review

Professor Marchionini notes that information seeking experts develop their own search strategies, tactics and moves. The descriptive name for the strategies, tactics and moves that I have developed for legal search is Hybrid Multimodal Computer Assisted Review Bottom Line Driven Proportional Strategy. See eg. Bottom Line Driven Proportional Review (2013). For a recent federal opinion approving this type of hybrid multimodal search and review seeIn Re: Biomet M2a Maagnum Hip Implant Products Liability Litigation (MDL 2391), Case No. 3:12-MD-2391, (N.D. Ind., April 18, 2013); also seeIndiana District Court Approves Multimodal Computer Assisted Review.

I refer to this method as a multimodal because, although the predictive coding type of searches predominate (shown on the below diagram as Intelligent Review or IR), other modes of search are also employed. As described, I do not rely entirely on random documents, or computer selected documents. The other types of methods used in a multimodal process are shown in this search pyramid.

Pyramid Search diagram

Most information scientists I have spoken to agree that it makes sense to use multiple methods in legal search and not just rely on any single method. UCLA Professor Marcia J. Bates first advocated for using multiple search methods back in 1989, which she called berrypicking. Bates, Marcia J., The Design of Browsing and Berrypicking Techniques for the Online Search Interface, Online Review 13 (October 1989): 407-424. As Professor Bates explained in 2011 in Quora:

An important thing we learned early on is that successful searching requires what I called “berrypicking.” … Berrypicking involves 1) searching many different places/sources, 2) using different search techniques in different places, and 3) changing your search goal as you go along and learn things along the way. This may seem fairly obvious when stated this way, but, in fact, many searchers erroneously think they will find everything they want in just one place, and second, many information systems have been designed to permit only one kind of searching, and inhibit the searcher from using the more effective berrypicking technique.

This berrypicking approach, combined with HCIR exploratory search, is what I have found from practical experience works best with legal search. They are the Hybrid Multimodal aspects of my Computer Assisted Review Bottom Line Driven Method.

Conclusion

Predictive_coding_trianglesNow that we have shown that courts are very open to predictive coding, we need to move on to a different, more sophisticated discussion. We need to focus on analysis of different predictive coding search methods, the strategies, tactics and moves. We also need to understand and discuss what skill-sets and personnel are required to do it properly. Finally, we need to begin to discuss the different types of predictive coding software.

There is much more to discuss concerning the use predictive coding than whether or not to make disclosure of seed sets or irrelevant training documents. Although that, and court approval, are the only things most expert panels have talked about so far. The discussion on disclosure and work-product should continue, but let us also discuss the methods and skills, and, yes, even the competing software.

We cannot look to vendors alone for the discussion and analysis of predictive coding software and competing methods of use. Obviously they must focus on their own software. This is where independent practitioners have an important role to play in the advancement of this powerful new technology.

Join with me in this discussion by your comments below or send me ideas for proposed guest blogs. Vendors are of course welcome to join in the discussion, and they make great hosts for search SME forums. Vendors are an important part of any successful e-discovery team. You cannot do predictive coding review without their predictive coding software, and, as with any other IT product, some software is much better than others.


%d bloggers like this: