Computer-Assisted Review (aka Technology-Assisted Review) is a process where expert input on the classification of a subset of documents is extrapolated to classify and rank the complete collection. There are two different types of CARs: predictive coding and rule-based. See The Grossman-Cormack Glossary of Technology-Assisted Review, 2013 Fed. Cts. L. Rev. 7 (January 2013).
Only Drive CARs with Predictive Coding Engines
Rule-based CARs seem to run pretty well, but they require a team of linguists to design the complex rules, and in this respect are less automated. These CARs do not empower me, the attorney expert doing a computer-assisted review. They instead make me dependent on non-legal experts. They are also far more expensive to drive. Gabriel Techs. Corp. v. Qualcomm, Inc., 2013 WL 410103 (S.D. Cal. Feb. 1, 2013) ($2,829,349.10 to first-pass classify a mere million documents). That is why I only use and endorse CARs with predictive coding search engines.
Supervised Machine Learning
Predictive coding is a CAR process where supervised machine learning is used to extrapolate a legal expert’s input by analysis of the features of the documents. Grossman-Cormack Glossary of Technology-Assisted Review. This is an active learning process. It uses iterated cycles in which the expert’s intent is clarified and applied by repeated selections of new document subsets for expert review. Id. Information retrieval experts Doug Oard and William Webber call this iterative process learning by example. The Many Types of Legal Search Software in the CAR Market Today, quoting Oard and Webber’s manuscript Information Retrieval for E-Discovery.
Technically, software can only claim to have active learning features when it has the capacity to select documents for training, at least in part, by its own machine learning algorithms, and not just select them by random sampling or human expert judgmental sampling. As explained in the classic textbook on information retrieval, active learning is a
system that decides which documents a human should label … Usually these are the ones on which a classifier is uncertain of the correct classification. This can be effective in reducing annotation costs by a factor of 2 to 4, but has the problem that the good documents to label to train one type of classifier often are not the good documents to train a different type of classifier.
Manning, Raghavan and Schütze, Introduction to Information Retrieval, (Cambridge, 2008) at pg. 309. Do not just focus on the 2 to 4 times cost savings observation. Remember the warning at the end regarding a common problem with active learning. It supports my own findings that active learning alone is inadequate, that it should be supplemented by judgmental and random sampling. It also supports my general argument for multimodal search.
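To make the textbook's point concrete, here is a minimal sketch of the kind of uncertainty sampling it describes: the system offers up for human labeling the documents whose predicted probability of relevance sits closest to the 50-50 line. The document scores below are hypothetical classifier outputs, purely for illustration; real CAR software computes them from document features.

```python
# Uncertainty sampling sketch: pick the documents the classifier is
# least sure about, i.e., those with predicted relevance nearest 0.5.

def uncertainty_sample(doc_scores, batch_size):
    """Return the doc ids the classifier is most uncertain about.

    doc_scores: dict mapping doc id -> predicted probability of relevance.
    """
    # Uncertainty is highest when the probability is nearest 0.5.
    ranked = sorted(doc_scores, key=lambda d: abs(doc_scores[d] - 0.5))
    return ranked[:batch_size]

# Hypothetical scores: d1 is almost surely relevant, d3 almost surely not,
# so neither teaches the classifier much; d2 and d4 are the borderline cases.
scores = {"d1": 0.97, "d2": 0.51, "d3": 0.08, "d4": 0.46, "d5": 0.72}
print(uncertainty_sample(scores, 2))  # ['d2', 'd4']
```

Note the design choice the textbook warns about: documents chosen because one classifier is uncertain about them are not necessarily good training examples for a different classifier, which is one reason to supplement this with random and judgmental sampling.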
Monomodal Search v. Multimodal Search
I have previously indicated that I favor a multimodal approach to information retrieval in general, one that utilizes all types of search methods, including predictive coding. Software that only uses one type of search method to find things is, in my terminology, monomodal. The most ubiquitous type of search, namely keyword-only search, is monomodal. For example, the search feature on this blog on the upper right column only uses keywords to search. The same is true for the search feature in Outlook and other Office applications.
Another less obvious monomodal type of search is software that uses predictive coding methods only, and no other. I frequently refer to that as the Borg approach because it relies exclusively on machine learning. I advocate for multimodal CARs that use all types of searches: expert judgmental linear, keyword, similarity, concept and predictive coding. I want to empower attorneys with all known search tools, not just predictive coding.
One, Two, and Three Cylinder Predictive Coding Search Engines
This distinction between search methods that rely upon a single approach, instead of a variety of methods, also applies to the predictive coding search method itself. It also helps explain how different types of predictive coding search engines operate.
Lucky Borg Approach
Some types of predictive coding software rely entirely on random chance to select documents for machine training. They are, so to speak, a one-cylinder predictive coding search engine. They run on chance alone. I call that the lucky Borg approach.
The supposed justification for this simplistic, chance-only approach is that it avoids human bias and will result in the broadest scope of search. The bias argument ignores the fact that these same humans supposedly infected with natural bias, the attorneys who are subject matter experts (SMEs), are the ones making all of the input to be extrapolated to begin with. They are both the instructors and ultimate judges of relevancy. Plus, they are highly trained experts with experience in the evaluation of evidence. They are trained in discovery and disciplined to avoid bias. After all, a document bad for their case, a subject they know more about than anyone, is just as important to them as a good document.
Some software programmers seem to think that the SMEs who will use their software are untrustworthy types, bent on persuasion. They are not. SME attorneys doing search are more like researchers, just seeking the facts so that they can then decide what to argue. They do not just shop for facts to support their position. That is unethical and illegal. Besides, if an SME really was corrupt, there is a much simpler way to cheat. Just withhold the documents you want to hide, but look out for the judge and your license. You may never practice law again.
The SME attorneys know better than anyone what types of documents to look for. Based on their long experience in the law, and other similar cases, they know what kinds of documents may still be missing in a search as it progresses. They can compare what documents have been found in the current search to other cases of this type that they have handled. For example, they may be used to seeing a particular type of spreadsheet in these kinds of cases. The failure of the machine learning to uncover any like that so far would make them suspicious, and cause them to run specialty searches specifically for those spreadsheets.
No software now existing can even begin to know that. The computer comes into any search as a tabula rasa, ready to be trained, whereas the SME has a whole lifetime of specialized legal knowledge to draw upon.
Although I reject the bias argument as circular, and based on a misunderstanding of attorney SMEs, I concede there is some merit to the openness observation.
Introverted Borg Approach
There is another type of one-cylinder predictive coding search engine that only uses machine learning processes to select documents for training. It is a pure active learning system where the algorithms alone decide which documents a human should label. Id.
As discussed, the software is usually designed to select documents where it is uncertain as to classification, or ones it knows the least about. Information retrieval expert Jeremy Pickens states that this provides what he calls “diverse topical coverage” to the search, and is thus superior to random sampling alone to find outlier type documents and reduce false negatives. Pickens, Predictive Ranking: Technology Assisted Review Designed for the Real World (2/1/13) at pg. 3. See Pickens’s comment below and also see The Many Types of Legal Search Software in the CAR Market Today.
I agree that machine selection of documents is important, especially for diversity, and will make any CAR run faster, truer, and with less expense. But it is still just a one-cylinder approach. It does not include random selections, which can broaden a search even further. Nor does it provide any type of search based on human judgment, aside from the simplistic yes-no of relevant or not. I suppose you could call that the introverted Borg approach.
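For readers curious how a machine might select for diversity rather than uncertainty, here is one hedged sketch of the general idea (not Pickens's actual method, which I do not claim to reproduce): greedily pick the candidate document least similar, by simple word-overlap (Jaccard) similarity, to anything already labeled, so training examples spread across topics. The toy document strings are hypothetical.

```python
# Diversity-oriented selection sketch: choose the candidate document
# least similar to every already-labeled document, to broaden coverage.

def jaccard(a, b):
    """Word-overlap similarity between two document strings (0 to 1)."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def diversity_pick(labeled, candidates):
    """Return the candidate least similar to any labeled document."""
    # For each candidate, its worst case is its closest labeled neighbor;
    # pick the candidate whose closest neighbor is still the most distant.
    return min(candidates,
               key=lambda c: max(jaccard(c, l) for l in labeled))

labeled = ["quarterly sales report", "annual sales forecast"]
candidates = ["sales report draft", "employee stock option grant"]
print(diversity_pick(labeled, candidates))  # 'employee stock option grant'
```

The point of the sketch is the contrast: uncertainty sampling hugs the classifier's decision boundary, while diversity selection deliberately reaches for what the training set has not yet seen, which is why it helps find outliers and reduce false negatives.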
Enlightened Borg Approach
Other types of software are two-cylinder, relying on both machine-selected and randomly selected documents for training. I call that the enlightened Borg approach. It is the best of the Borg approaches, both lucky and introverted, but, in my opinion, is still defective. It does not allow for other types of searches to find documents for machine training. It still excludes the full potential of human participation in the search. Instead, SMEs are relegated to a mere passive role of marking yes or no.
I tried this approach myself for over a week to test it out. I found it to be dehumanizing and incredibly boring. No SME worth their salt would ever do it twice. I know I will never do it again.
Three-Cylinder Multimodal Approach
The best predictive coding software takes a three-cylinder approach. One for chance, a second for machine analysis, and a third cylinder powered by human input. The third cylinder should be fully multimodal, allowing the attorney to add documents for machine training based on all types of search: expert linear, keyword, similarity, concept, and even predictive coding based searches. There is a kind of positive feedback loop possible where you can use predictive coding based ranking searches to find more documents for machine training. This is somewhat difficult to explain in the abstract, but not too difficult for a hands-on demonstration. A new narrative will be coming soon that includes a description of this process in action. I have found it to be very effective. It is also effective for quality control purposes, but I digress.
The three-cylinder approach empowers an attorney to use any other type of search they deem appropriate to find example documents for training. They could even make up fictitious documents to use as examples. This third cylinder is frequently called judgmental sampling, as opposed to random sampling, or machine selection sampling.
Judgmental sampling brings the attorney into the machine learning process, or at least allows them to participate, if and when they have some good ideas. This makes predictive coding a hybrid process, with man and machine working together on all cylinders. Some documents are selected for machine training by random sampling, others by judgmental sampling of SMEs, and still others by computer analysis. This three-cylinder search engine is the most powerful available today and the only kind I will use or endorse.
The Art of Driving High Powered CARs
The exact mixture of the three types of cylinders – random, analytic, and judgmental – is where the art of predictive coding search comes in. How much gas do you give to the random chance cylinder to look for outliers? How much do you give to judgmental? Or maybe the situation is such that you should let the computer analytics provide all of the input for a while, to allow it to sort things out. The answer as to the exact amount of gas to give to each cylinder depends on many factors, including the types of documents under review, the type of classifications desired, the goals of the search, and the need for mid-stream adjustments based on unexpected occurrences. You should be able to change focus in the middle of the race, especially if the collection of documents under review changes, as it frequently does.
I am still learning about all of these many variables, what mixtures work best with different data sets and different circumstances. This is the chief reason predictive coding is such a complex challenge. It requires skill and experience. Like law itself, it is more of an art, than a science.
Art is also required in the SME judgmental sampling, in determining what other multimodal searches are used. The skills that you may have learned from years of keyword search, for instance, will not be lost. They will be supplemented. Some kinds of documents are still easiest to find by Boolean keyword search, or by specific metadata searches. The skill of knowing when and where to use them is still important.
There is also a kind of art to the interaction with the machine. You provide input in your coding of the examples, and, in the next round, with certain techniques, you can instantly see the impact this has had. You can see how it has impacted the overall ranking of the entire collection. After multiple rounds, and days and days of interaction like this with the software, you can develop a kind of deep-level intuition with the data analytics. You begin to learn that certain kinds of documents are more useful for training than others. You also get a pretty good sense of when the training has reached maximum effectiveness, and further rounds are unnecessary. In fact, this sense of near completion is fairly obvious, and can be measured as well.
I am urging software designers to go further with this hybrid aspect, and exploit the human intuitive abilities. As a life-long computer gamer I would also like to see more crossover from certain game software features to serious legal review software. The use of levels for instance, and of rewards to keep SMEs fully engaged and attentive. The search for truth is a serious business, but it can still be a flow-generative experience, indeed, it must be for the iterative SME feedback to maintain a high level of quality.
From the point of view of software selection, I want search software to provide for as much flexibility and customization as possible. Variable features will allow an SME to tailor the predictive coding process to particular projects. It also allows an SME to switch gears in the middle of a project, to go from one cylinder to another, depending on what happens next. Sometimes you may want to fire equally on all cylinders at once, or other times to follow a personal favorite formula like 20-60-20 (20% random selected, 60% machine selected, 20% SME selected).
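To make the cylinder mixture concrete, here is a minimal sketch of how a training batch might be assembled under my example 20-60-20 formula. The function names and the lists standing in for machine-selected and SME-selected documents are hypothetical, for illustration only; real software would draw them from its active learning algorithms and from the attorney's own multimodal searches.

```python
import random

# Hypothetical sketch: build one training batch from three "cylinders" --
# 20% randomly selected, 60% machine selected, 20% SME (judgmental) selected.

def build_training_batch(collection, machine_picks, sme_picks,
                         batch_size=10, mix=(0.2, 0.6, 0.2), seed=42):
    n_random = int(batch_size * mix[0])
    n_machine = int(batch_size * mix[1])
    n_sme = batch_size - n_random - n_machine  # remainder goes to the SME
    rng = random.Random(seed)                  # seeded for repeatability
    batch = rng.sample(collection, n_random)   # chance cylinder
    batch += machine_picks[:n_machine]         # analytic cylinder
    batch += sme_picks[:n_sme]                 # judgmental cylinder
    return batch

collection = [f"doc{i}" for i in range(100)]
batch = build_training_batch(
    collection,
    machine_picks=[f"doc{i}" for i in range(50, 60)],  # active learning picks
    sme_picks=["doc7", "doc8"])                        # attorney's own finds
print(len(batch))  # 10
```

Because the mix is just a parameter, the same routine illustrates the flexibility argued for above: change `mix` to (1.0, 0.0, 0.0) and you have the lucky Borg; (0.0, 1.0, 0.0) is the introverted Borg; anything engaging all three is the hybrid approach.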
The ability to change on the fly and be flexible in your search is, in my experience, what makes it possible to attain high accuracy rates, and super-high review speeds of 10,000, 20,000, even 30,000 files per hour. You have got to be able to hot rod the engines, to customize your CAR for maximum effectiveness and savings. Look for that when selecting your next CAR, either that or hire a chauffeur with a souped-up limo.