Three-Cylinder Multimodal Approach To Predictive Coding

Computer-Assisted Review (aka Technology-Assisted Review) is a process where expert input on the classification of a subset of documents is extrapolated to classify and rank the complete collection. There are two different types of CARs: predictive coding and rule-based. The Grossman-Cormack Glossary of Technology-Assisted Review, 2013 Fed. Cts. L. Rev. 7 (January 2013).

Only Drive CARs with Predictive Coding Engines

Rule-based CARs seem to run pretty well, but they require a team of linguists to design the complex rules, and in this respect they are less automated. These CARs do not empower me, the attorney expert doing a computer-assisted review. They instead make me dependent on non-legal experts. They are also far more expensive to drive. Gabriel Techs. Corp. v. Qualcomm, Inc., 2013 WL 410103 (S.D. Cal. Feb. 1, 2013) ($2,829,349.10 to first-pass classify a mere million documents). That is why I only use and endorse CARs with predictive coding search engines.

Supervised Machine Learning

Predictive coding is a CAR process where supervised machine learning is used to extrapolate a legal expert's input by analysis of the features of the documents. Grossman-Cormack Glossary of Technology-Assisted Review. This is an active learning process. It uses iterated cycles in which the expert's intent is clarified and applied by repeated selections of new document subsets for expert review. Id. Information retrieval experts Doug Oard and William Webber call this iterative process learning by example. The Many Types of Legal Search Software in the CAR Market Today, quoting Oard and Webber's manuscript Information Retrieval for E-Discovery.

Technically, software can only claim to have active learning features when it has the capacity to select documents for training, at least in part, by its own machine learning algorithms, and not just by random sampling or human expert judgmental sampling. As explained in the classic textbook on information retrieval, active learning is a


system that decides which documents a human should label … Usually these are the ones on which a classifier is uncertain of the correct classification. This can be effective in reducing annotation costs by a factor of 2 to 4, but has the problem that the good documents to label to train one type of classifier often are not the good documents to train a different type of classifier.

Manning, Raghavan and Schutze, Introduction to Information Retrieval (Cambridge, 2008) at pg. 309. Do not just focus on the 2 to 4 times cost savings observation. Remember the warning at the end regarding a common problem with active learning. It supports my own findings that active learning alone is inadequate, and that it should be supplemented by judgmental and random sampling. It also supports my general argument for multimodal search.
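For readers who want to see the mechanics, here is a minimal sketch in Python of the kind of uncertainty-based selection the textbook describes. The logistic regression classifier, the feature matrices, and the label_documents() step standing in for SME review are all illustrative assumptions on my part, not a description of any vendor's actual software.

import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_batch(model, X_unlabeled, batch_size=100):
    # Return indices of the unlabeled documents the classifier is least
    # certain about (predicted probability of relevance closest to 0.5).
    probs = model.predict_proba(X_unlabeled)[:, 1]
    uncertainty = np.abs(probs - 0.5)
    return np.argsort(uncertainty)[:batch_size]

# One round of active learning might look like this:
# model = LogisticRegression().fit(X_labeled, y_labeled)   # train on SME coding so far
# next_batch = uncertainty_batch(model, X_unlabeled)       # machine picks what to ask about
# new_labels = label_documents(next_batch)                 # hypothetical SME review step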

Monomodal Search v. Multimodal Search

I have previously indicated that I favor a multimodal approach to information retrieval in general, one that utilizes all types of search methods, including predictive coding. Software that only uses one type of search method to find things is, in my terminology, monomodal. The most ubiquitous type of search, namely keyword-only search, is monomodal. For example, the search feature in the upper right column of this blog only uses keywords to search. The same is true for the search feature in Outlook and other Office applications.

Another, less obvious, monomodal type of search is software that uses predictive coding methods only, and nothing else. I frequently refer to that as the Borg approach because it relies exclusively on machine learning. I advocate for multimodal CARs that use all types of searches: expert judgmental linear, keyword, similarity, concept and predictive coding. I want to empower attorneys with all known search tools, not just predictive coding.

One, Two, and Three Cylinder Predictive Coding Search Engines

This distinction between search methods that rely upon a single approach and those that use a variety of methods also applies to the predictive coding search method itself. It also helps explain how different types of predictive coding search engines operate.

Lucky Borg Approach

Some types of predictive coding software rely entirely on random chance to select documents for machine training. They are, so to speak, one-cylinder predictive coding search engines. They run on chance alone. I call that the lucky Borg approach.

The supposed justification for this simplistic, chance-only approach is that it avoids human bias and will result in the broadest scope of search. The bias argument ignores the fact that these same humans supposedly infected with natural bias, the attorneys who are subject matter experts (SMEs), are the ones providing all of the input to be extrapolated in the first place. They are both the instructors and ultimate judges of relevancy. Plus, they are highly trained experts with experience in the evaluation of evidence. They are trained in discovery and disciplined to avoid bias. After all, a document bad for their case, a subject they know more about than anyone, is just as important to them as a good document.

Some software programmers seem to think that the SMEs who will use their software are untrustworthy types, bent on persuasion. They are not. SME attorneys doing search are more like researchers, just seeking the facts so that they can then decide what to argue. They do not just shop for facts to support their position. That is unethical and illegal. Besides, if an SME really were corrupt, there is a much simpler way to cheat: just withhold the documents you want to hide. But look out for the judge and your license; you may never practice law again.

The SME attorneys know better than anyone what types of documents to look for. Based on their long experience in the law, and in other similar cases, they know what kinds of documents may still be missing as a search progresses. They can compare what documents have been found in the current search to other cases of this type that they have handled. For example, they may be used to seeing a particular type of spreadsheet in these kinds of cases. The failure of the machine learning to uncover any like that so far would make them suspicious and cause them to run specialty searches specifically for those spreadsheets.

No software now existing can even begin to know that. The computer comes into any search as a tabula rasa, ready to be trained, whereas the SME has a whole lifetime of specialized legal knowledge to draw upon.

Although I reject the bias argument as circular and based on a misunderstanding of attorney SMEs, I concede there is some merit to the openness observation.

Introverted Borg Approach

There is another type of one-cylinder predictive coding search engine that only uses machine learning processes to select documents for training. It is a pure active learning system where the algorithms alone decide which documents a human should label. Id.

As discussed, the software is usually designed to select documents where it is uncertain as to classification, or ones it knows the least about. Information retrieval expert Jeremy Pickens states that this provides what he calls “diverse topical coverage” to the search, and is thus superior to random sampling alone to find outlier-type documents and reduce false negatives. Pickens, Predictive Ranking: Technology Assisted Review Designed for the Real World (2/1/13) at pg. 3. See Pickens' comment below and also see The Many Types of Legal Search Software in the CAR Market Today.

I agree that machine selection of documents is important, especially for diversity, and will make any CAR run faster, truer, and with less expense. But it is still just a one-cylinder approach. It does not include random selections, which can broaden a search even further. Nor does it provide any type of search based on human judgment, aside from the simplistic yes-no of relevant or not. I suppose you could call that the introverted Borg approach.
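To make the diversity idea concrete, here is a rough sketch of one way a machine could select documents that look least like anything already judged. It uses cosine similarity over document vectors purely as an illustration; it is not Pickens' contextual diversity algorithm or any vendor's actual method.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def diversity_batch(X_labeled, X_unlabeled, batch_size=50):
    # For each unlabeled document, measure its similarity to the closest
    # already-judged document, then select the documents whose closest
    # judged neighbor is most dissimilar -- the least-covered pockets.
    sims = cosine_similarity(X_unlabeled, X_labeled)
    nearest_judged = sims.max(axis=1)
    return np.argsort(nearest_judged)[:batch_size]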

Enlightened Borg Approach

Other types of software are two-cylinder, relying on both machine-selected and randomly selected documents for training. I call that the enlightened Borg approach. It is the best of the Borg approaches, both lucky and introverted, but, in my opinion, it is still defective. It does not allow for other types of searches to find documents for machine training. It still excludes the full potential of human participation in the search. Instead, SMEs are relegated to a mere passive role of marking yes or no.

I tried this approach myself for over a week to test it out. I found it to be dehumanizing and incredibly boring. No SME worth their salt would ever do it twice. I know I will never do it again.

Three-Cylinder Multimodal Approach

The best predictive coding software takes a three-cylinder approach: one for chance, a second for machine analysis, and a third cylinder powered by human input. The third cylinder should be fully multimodal, allowing the attorney to add documents for machine training based on all types of search: expert linear, keyword, similarity, concept, and even predictive coding based searches. There is a kind of positive feedback loop possible where you can use predictive coding based ranking searches to find more documents for machine training. This is somewhat difficult to explain in the abstract, but not too difficult for a hands-on demonstration. A new narrative will be coming soon that includes a description of this process in action. I have found it to be very effective. It is also effective for quality control purposes, but I digress.

The three-cylinder approach empowers an attorney to use any other type of search they deem appropriate to find example documents for training. They could even make up fictitious documents to use as examples. This third cylinder is frequently called judgmental sampling, as opposed to random sampling, or machine selection sampling.

Judgmental sampling brings the attorney into the machine learning process, or at least allows them to participate, if and when they have some good ideas. This makes predictive coding a hybrid process, with man and machine working together on all cylinders. Some documents are selected for machine training by random sampling, others by judgmental sampling of SMEs, and still others by computer analysis. This three-cylinder search engine is the most powerful available today and the only kind I will use or endorse.
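Here is a minimal sketch of what a single three-cylinder training round could look like in code. The mix of cylinders is a parameter, the judgmental cylinder is simply whatever document IDs the attorney has found by keyword, similarity, concept, or ranking searches, and the whole thing is an illustration of the concept under my own assumptions, not any vendor's workflow.

import numpy as np

def three_cylinder_batch(model, X_unlabeled, judgmental_ids,
                         batch_size=500, mix=(0.2, 0.6, 0.2), seed=None):
    # Assemble one training round from three cylinders: random chance,
    # machine analysis (uncertainty here), and SME judgmental selections.
    rng = np.random.default_rng(seed)
    n_random = int(batch_size * mix[0])
    n_machine = int(batch_size * mix[1])

    random_ids = rng.choice(X_unlabeled.shape[0], size=n_random, replace=False)

    probs = model.predict_proba(X_unlabeled)[:, 1]
    machine_ids = np.argsort(np.abs(probs - 0.5))[:n_machine]

    # The third cylinder: whatever the attorney found by multimodal search.
    return np.unique(np.concatenate([random_ids, machine_ids,
                                     np.asarray(judgmental_ids)]))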

The Art of Driving High Powered CARs

The exact mixture of the three types of cylinders – random, analytic, and judgmental – is where the art of predictive coding search comes in. How much gas do you give to the random chance cylinder to look for outliers? How much do you give to judgmental? Or maybe the situation is such that you should let the computer analytics provide all of the input for a while, to allow it to sort things out. The answer as to the exact amount of gas to give each cylinder depends on many factors, including the types of documents under review, the type of classifications desired, the goals of the search, and the need for mid-stream adjustments based on unexpected occurrences. You should be able to change focus in the middle of the race, especially if the collection of documents under review changes, as it frequently does.

I am still learning about all of these many variables, and what mixtures work best with different data sets and different circumstances. This is the chief reason predictive coding is such a complex challenge. It requires skill and experience. Like law itself, it is more of an art than a science.

Art is also required in the SME judgmental sampling, in determining what other multimodal searches to use. The skills that you may have learned from years of keyword search, for instance, will not be lost. They will be supplemented. Some kinds of documents are still easiest to find by Boolean keyword or by specific metadata searches. The skill of knowing when and where is still important.

There is also a kind of art to the interaction with the machine. You provide input in your coding of the examples, and, in the next round, with certain techniques, you can instantly see the impact this has had. You can see how it has affected the overall ranking of the entire collection. After multiple rounds, and days and days of interaction like this with the software, you can develop a kind of deep-level intuition about the data analytics. You begin to learn that certain kinds of documents are more useful for training than others. You also get a pretty good sense of when the training has reached maximum effectiveness, and further rounds are unnecessary. In fact, this sense of near completion is fairly obvious, and can be measured as well.
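For those who want a feel for how that round-over-round feedback can be measured, here is a small sketch. It assumes you keep the model's relevance scores for the whole collection from the prior round; when few documents change sides of the cutoff and the overall ranking stops shifting, training is probably near maximum effectiveness. Again, this is an illustration only, not any vendor's completion metric.

import numpy as np
from scipy.stats import spearmanr

def round_impact(prev_scores, new_scores, cutoff=0.5):
    # How many documents crossed the relevance cutoff this round, and how
    # strongly does the new collection-wide ranking agree with the old one?
    crossed = int(np.sum((prev_scores >= cutoff) != (new_scores >= cutoff)))
    rank_agreement, _ = spearmanr(prev_scores, new_scores)
    return {"docs_crossing_cutoff": crossed, "rank_agreement": rank_agreement}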

I am urging software designers to go further with this hybrid aspect and exploit human intuitive abilities. As a life-long computer gamer, I would also like to see more crossover from certain game software features to serious legal review software: the use of levels, for instance, and of rewards to keep SMEs fully engaged and attentive. The search for truth is a serious business, but it can still be a flow-generative experience. Indeed, it must be for the iterative SME feedback to maintain a high level of quality.

Conclusion

From the point of view of software selection, I want search software to provide for as much flexibility and customization as possible. Variable features allow an SME to tailor the predictive coding process to particular projects. They also allow an SME to switch gears in the middle of a project, to go from one cylinder to another, depending on what happens next. Sometimes you may want to fire equally on all cylinders at once; other times you may want to follow a personal favorite formula like 20-60-20 (20% random selected, 60% machine selected, 20% SME selected).
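In code terms, using the hypothetical three_cylinder_batch() sketch above, the mix is just a per-round setting rather than a fixed formula. The ratios below are illustrative only.

round_mixes = {
    1: (0.34, 0.33, 0.33),  # fire roughly equally on all cylinders at first
    2: (0.20, 0.60, 0.20),  # settle into a 20-60-20 formula
    3: (0.40, 0.40, 0.20),  # give more gas to random if the collection changes
}
# batch = three_cylinder_batch(model, X_unlabeled, sme_ids, mix=round_mixes[round_no])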

The ability to change on the fly and be flexible in your search is, in my experience, what makes it possible to attain high accuracy rates, and super-high review speeds of 10,000, 20,000, even 30,000 files per hour. You have got to be able to hot rod the engines, to customize your CAR for maximum effectiveness and savings. Look for that when selecting your next CAR, either that or hire a chauffeur with a souped-up limo.


21 Responses to Three-Cylinder Multimodal Approach To Predictive Coding

  1. Jeremy Pickens says:

    Hi Ralph, Jeremy Pickens here. Thanks for the link above. I do have to correct something with respect to the contextual diversity sampling (aka the diverse topical coverage) that you mention from our paper. The Catalyst TAR approach is not a one-cylinder approach, using only contextual diversity. Rather, the core of our approach uses the three main cylinders that you recommend: (1) straightforward (simple) randomness, (2) human knowledge and intuition that comes from judgmental sampling, and (3) “Relevance feedback” in the form of machine-assisted selection or prioritization of the most likely responsive documents, based on everything that has been judged up to that point.

    The contextual diversity / topical coverage sampling is the *fourth* cylinder on top of those existing three. Actually, let's be clear here: we officially call it “contextual diversity” sampling, not “topical coverage” sampling, and call it that for a reason. The “context” is all the documents you've judged for any reason, no matter how they were selected, whether via (1) random, (2) human judgmental, or (3) machine-selected relevance feedback. The “diversity” means the biggest pockets of information that look nothing like anything that has been found or seen, with respect to this context.

    What this contextual diversity sampling ends up doing is giving more of a topical coverage than you otherwise would have gotten. But we don’t call it “topical coverage sampling”, because it doesn’t exist in isolation, in a vacuum. Rather, it’s topical coverage with respect to some context, i.e. with respect to the other cylinders that you’re already firing on.

    I’ve also called it “mop up” sampling from time to time, because it mops up and sweeps out that which was missed by the other approaches.

    So I just had to clarify that. It’s not a solitary, isolated cylinder by itself. Rather, contextual diversity sampling is an augmentation, a fourth cylinder, on what we also already recommend, which is to fire on all three other cylinders (random, judgmental, relevance feedback) as well. Contextual diversity is a fourth cylinder, and rounds the whole set out.

    And the experiments in the writeup to which you link show what happens to your responsive document yield curves (how many responsive documents you are able to find) when you add this fourth cylinder of topical diversity to the existing random/relevance feedback/judgmental cylinders. They don't show the tool in isolation.

    • Ralph Losey says:

      Thanks. I did not mean to suggest in any way by quoting you that the Catalyst software was one-cylinder. I have not seen a demo of the latest version, but it is clear from your paper that it is firing on all cylinders!

  2. Jeremy Pickens says:

    Oh, and I completely agree with wanting to have the ability to choose the power ratios between the cylinders you're firing on. Sometimes you might want it to be equal. And sometimes you might want some other mixture. My recommendation for this 4th, contextual diversity cylinder is that it shouldn't comprise the majority of your sampling and judging efforts. Rather, it's something that happens (say) 5-10% of the time, to make sure that you know what you don't know. That's what contextual diversity is, after all: it is an explicit modeling of the known unknowns, i.e. those documents that are the most “about” what you know the least about. As such, you should make sure you have a little bit of it in the mixture, but shouldn't spend the majority of your time on it.

    I suppose another name for it, in addition to the “contextual diversity” sampling and “mop up” sampling names, would be “CYA” sampling 🙂 You can basically guarantee that you haven't missed anything that you didn't even know you weren't aware of. And while, like you, I do place more trust in human expert ability to select and sample, and believe that the majority of your reward (responsive document finding) is going to come via that method, it's still a basic fact of human nature that we all have unconscious blind spots. A little bit of contextual diversity “CYA” sampling not only theoretically shines light on those blind spots, but it demonstrably increases yield (aka decreases total overall effort), as is shown in the paper.

  3. Ralph,

    Great blog post on the in-depth mechanics behind Technology-assisted Review!

    Judgmental sampling versus random sampling is a concept Kroll Ontrack will cover in its new “TAR Learning Lab” educational event series. http://www.theediscoveryblog.com/2013/03/18/dissecting-technology-assisted-review/

    Training, sampling, effectiveness metrics, adapting to new documents mid-review — Our speakers are looking forward to digging in on these topics and more. Looking forward to these events this spring!

    Michele

  4. Laura Pearle says:

    For all the technical discussion behind the CAR engines, has anyone concluded how a client and/or attorney determines that the process is complete? In other words, how many iterations will one need to test/QC before one is comfortable with the CAR results?

