TAR Course Updated to Add Video on Step Seven and the All Important “Stop Decision”

June 11, 2017

We added to the TAR Course again this weekend with a video introducing Class Fourteen on Step Seven, ZEN Quality Assurance Tests. ZEN stands for Zero Error Numerics, with the double entendre on purpose, but this video does not go into the math, concentration or reviewer-focus aspects. Ralph’s video instead provides an introduction to the main purpose of Step Seven from a workflow perspective: to test and validate the decision to stop the Training Cycle, Steps 4, 5 and 6.

The Training Cycle shown in the diagram continues until the expert in charge of the training decides to stop. This is a decision to complete the first-pass document review. The stop decision is a legal, statistical decision requiring a holistic approach, including metrics, sampling and overall project assessment. You decide to stop the review after weighing a multitude of considerations, including when the software has attained a highly stratified distribution of documents. See License to Kull: Two-Filter Document Culling; Visualizing Data in a Predictive Coding Project, Part One, Part Two and Part Three; and Introducing a New Website, a New Legal Service, and a New Way of Life / Work; Plus a Postscript on Software Visualization. Then you test your decision with a random sample in Step Seven.
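For readers who like to see the idea in code, here is a minimal sketch of how a Step Seven random sample test might work in principle: an elusion-style check on the null set, the documents predicted irrelevant. This is an illustration only, not the Team’s actual procedure; the function names, sample size and toy data are all hypothetical.

```python
import random

def elusion_test(null_set_ids, review_fn, sample_size=1500):
    """Draw a simple random sample from the null set (documents
    predicted irrelevant) and count relevant documents found.
    review_fn(doc_id) -> True if a human reviewer codes it relevant."""
    sample = random.sample(null_set_ids, min(sample_size, len(null_set_ids)))
    false_negatives = sum(1 for doc in sample if review_fn(doc))
    elusion_rate = false_negatives / len(sample)
    # Projected number of relevant documents missed across the whole null set
    projected_misses = elusion_rate * len(null_set_ids)
    return elusion_rate, projected_misses

# Hypothetical example: 90,000 predicted-irrelevant documents,
# with a small set of hidden relevant ones the reviewer would catch.
null_set = list(range(90_000))
secretly_relevant = set(range(0, 90_000, 3000))
rate, misses = elusion_test(null_set, lambda d: d in secretly_relevant)
```

If the projected number of missed documents is acceptably low in light of proportionality considerations, the stop decision is validated; if not, the Training Cycle of Steps 4, 5 and 6 resumes.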


Team Methods in TREC Skipped Steps 1, 3 & 7


By the way, I am using the phrase “accept on zero error” in the video in the general quality control sense, not in the specialized usage of the phrase contained in The Grossman-Cormack Glossary of Technology Assisted Review. I forgot that phrase was in their glossary until recently. I have been using the term in the more general sense for several years. I do not advocate use of the accept on zero error method as defined in their glossary. I am not sure anyone does, but it is in their dictionary, so I felt this clarification was in order.

Stop Decision

The stop decision is the most difficult decision in predictive coding. The decision must be made in all types of predictive coding methods, not just our Predictive Coding 4.0. Many of the scientists attending TREC 2015 were discussing this decision process. There was no agreement on criteria for the stop decision, except that all seemed to agree it is a complex issue that cannot be resolved by random sampling alone. The prevalence of most projects is too low for that.

The e-Discovery Team grapples with the stop decision in every project, although in most it is a fairly simple decision because no more relevant documents have surfaced to the higher rankings. Still, in some projects it can be tricky. That is where experience is especially helpful. We do not want to quit too soon and miss important relevant information. On the other hand, we do not want to waste time looking at uninteresting documents.

Still, in most projects we know it is about time to stop when the stratification of the document ranking has stabilized. The training has stabilized when you see very few newly predicted relevant documents that have not already been reviewed and coded as relevant by a human. You essentially run out of documents for Step Six review. Put another way, your Step Six no longer uncovers new relevant documents.

This exhaustion marker may, in many projects, mean that the rate of newly found documents has slowed, but not stopped entirely. I have written about this quite a bit, primarily in Visualizing Data in a Predictive Coding Project, Part One, Part Two and Part Three. The distribution ranking of documents in a mature project, one that has likely found all relevant documents of interest, will typically look something like the diagram below. We call this the upside-down champagne glass, with red relevant documents on top and irrelevant on the bottom. Also see Postscript on Software Visualization, where even more dramatic stratifications are encountered and shown.
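The exhaustion signal can be sketched in a few lines of code. This is an illustrative simplification, not the Team’s actual method; the probability threshold, tolerance and function names are all hypothetical.

```python
def new_predicted_relevant(ranking, reviewed_relevant, threshold=0.5):
    """Count documents the model now predicts relevant that have not
    already been human reviewed and coded relevant.
    ranking: dict of doc_id -> predicted probability of relevance."""
    return sum(1 for doc, score in ranking.items()
               if score >= threshold and doc not in reviewed_relevant)

def ranking_has_stabilized(history, rounds=3, tolerance=5):
    """True when the count of newly surfaced predicted-relevant documents
    has stayed at or below `tolerance` for the last `rounds` cycles."""
    return len(history) >= rounds and all(n <= tolerance for n in history[-rounds:])
```

A stabilized ranking supports, but does not by itself justify, the stop decision; cost, proportionality and whether the needed information has been found still weigh in.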

Another key determinant of when to stop is the cost of further review. Is it worth it to continue with more iterations of Steps Four, Five and Six? See Predictive Coding and the Proportionality Doctrine: a Marriage Made in Big Data, 26 Regent U. Law Review 1 (2013-2014) (note the article was based on the earlier version 2.0 of our methods, where the training was not necessarily continuous). Another criterion in the stop decision is whether you have found the information needed. If so, what is the purpose of continuing the search? Again, the law never requires finding all relevant documents, only reasonable efforts to find the relevant documents needed to decide the important fact issues in the case. Rules 1 and 26(b)(1) must be considered.

The stop decision is state of the art in difficulty and creativity. We often provide custom solutions for testing the decision depending upon project contours and other unique circumstances. I wish Duke would have a conference on that, instead of one to reinvent old wheels. But as George Bernard Shaw said, those who can, do. You know the rest.

Conclusion

We continue with our work improving our document review methods and improving the free TAR Course. We want to make information on best practices in this area as accessible and as easy to understand as possible. We have figured out our processes over thousands of projects since the Da Silva Moore days (2011-2012). Our method has come out of legal practice, trial and error. We learn by doing, but we also teach this stuff, just not for a living. We also run scientific experiments in TREC and on our own, again, just not for a living. Our Predictive Coding 4.0 Hybrid Multimodal IST method has not come out of conferences and debates. It is a legal practice, not an academic study or an exercise in group consensus.

Try it yourself and see. Just do not use the first version methods of predictive coding that we used back in Da Silva Moore. See Another TAR Course Update and a Mea Culpa for the Negative Consequences of ‘Da Silva Moore’. Use the latest version 4.0 methods.

The old methods, versions 1.0 and 2.0, that most of the industry still follows, must be abandoned. Predictive Coding 1.0 did not use continuous active training; it used Train Then Review (TTR). That invited needless disclosure debates and other poor practices. Version 1.0 also used control sets. In version 2.0, continuous active training (CAT) replaced TTR, but control sets were still used. In version 3.0, CAT is used and control sets are abandoned. In our version 3.0 we replaced the secret control set basis of recall calculation with a prevalence-based random sample guide in Step Three and an elusion-based quality control sample in Step Seven. See: Predictive Coding 3.0 (October 2015).

In version 4.0, our current version, we further refined the continuous training aspects of our method with the technique we call Intelligently Spaced Training, IST.

Our new eight-step Predictive Coding 4.0 is easier to use than ever before and is now battle tested in both legal and scientific arenas. Take the TAR Course and try using our new methods of document review, instead of the old Da Silva Moore methods. If you do, we think you will be as excited about predictive coding as we are. Why I Love Predictive Coding: Making document review fun with Mr. EDR and Predictive Coding.
Protected: Another TAR Course Update and a Mea Culpa for the Negative Consequences of ‘Da SIlva Moore’

June 4, 2017

This content is password-protected. To view it, please enter the password below.


More Enhancements to the TAR Course with New Videos on the Importance of Keyword Search, Blair and Maron, the Search Quadrant and a Similarity Search Tip

May 28, 2017

Many new enhancements were made to the TAR Course this weekend, including additions and revisions to the written materials, new graphics, new homework (for the first time) for the Twelfth Class (Random Prevalence), along with two new videos, one for the Sixth Class (Similarity Searches) and a longer one for the Seventh Class on the Search Quadrant and the classic Blair and Maron research. The videos are reproduced below for the convenience of those who have already gone through the course or are otherwise curious about my latest thoughts on legal search.

The Seventh Class is entitled Keyword and Linear Review. The new video gives background on legal search in general, and keyword search in particular, including its known limitations. It is shown in two parts. I start off simply, explaining the basic terminology, but eventually get to some more nuanced points, including a discussion of the Search Quadrant and the Blair and Maron study.


In spite of the limits of keyword search, we still use a sophisticated form of keyword search in every project, especially at the beginning of a project. We use tested, Boolean, parametric keyword search to find the low-hanging fruit. That is part of Step Two of our eight-part method. It is also part of Step Six. We feed the documents we find by this, and all other methods, into our training matrix for our machine learning. That is part of Step Four. The eight steps in our Predictive Coding 4.0 method are covered in Classes Nine through Fifteen of the sixteen-class TAR Course.

One of the things we learned in our 2016 TREC experiments was that keyword search is more valuable than we had originally thought, when done right and in a relatively simple search project. But when keyword search is done in a naive Go Fish manner, it is very poor at recall and precision, even in simple cases. In complex projects even sophisticated keyword search needs to be supplemented with the more powerful machine learning algorithms. Even the best forms of keyword search can only work well alone in projects with simple data, a clear target and a good SME. The war story in part two of my video above demonstrates that.

The second new video is a short one providing a search tip on one way to use Similarity Searches. It was added to the Sixth Class.


Here is one of the new graphics I added. It uses a photo of the Compact Muon Solenoid (CMS) detector in the Large Hadron Collider. That is the famous seventeen-mile-long particle accelerator that straddles the border of Switzerland and France. It is the largest machine in the world and was built by the European Organization for Nuclear Research (CERN).

This photo of a key component of the world’s most sophisticated electronic tool is shown with a lift in place. The lift allows engineers to step in and keep the technology in good working order. (Stepping-in is discussed in Davenport and Kirby, Only Humans Need Apply, and by Dean Gonsowski, A Clear View or a Short Distance? AI and the Legal Industry, and A Changing World: Ralph Losey on “Stepping In” for e-Discovery. Also see: Losey, Lawyers’ Job Security in a Near Future World of AI, Part Two.) The lift in the Hadron photo illustrates the importance of humans to maintain and operate all of the new technologies we are creating. It is truly a man-machine hybrid relationship, just like predictive coding, where we lawyers need to step in and enhance our evidence finding by working with our own new technology tools.

I chose the CERN CMS because it is the ultimate technology tool now existing to enhance human capabilities, in this case to see elementary particles. The tool makes and records forty million measurements per second of high energy particle collisions. To understand my enthusiasm for the Compact Muon Solenoid in the Large Hadron Collider, the beauty of the design and the boldness of the experiments, check out a few instructional videos. Start with this one by the BBC, then, if you are interested, watch a few more. The one below allows for a 360-degree view that you control.


Back to stepping-in and double-loop IST training, which is taught in the Fifth Class of the TAR Course. That class is called Balanced Hybrid and Intelligently Spaced Training. We use IST, Intelligently Spaced Training, a form of continuous active learning, as part of our process to select documents to use for machine training. This allows us to set up a Double Feedback Loop, where we both teach and learn to better understand the machine’s training needs. IST and double-loop training are advanced concepts and techniques taught throughout the TAR Course, but featured in the Fifth Class. The writing in this class was also slightly improved and expanded. Here is one of the new graphics for that class. The class now explains that the extra control provided by the IST method provides more wiggle room for human creativity and innovation. (This next graphic is not a GIF animation. It is an optical illusion based on the work of the Japanese experimental psychologist Akiyoshi Kitaoka. The image itself is static.)

Another photo of the CERN collider without the lift is shown below. This graphic was added to the Second Class, on TREC Total Recall Track, 2015 and 2016. It illustrates the importance of experiments and research to the e-Discovery Team’s current understanding of the three primary quality controls in TAR: (1) Method, (2) Software and (3) SME.

These three QC process factors are explained in the Eighth Class, SME, Method, Software; the Three Pillars of Quality Control. In this class we discuss the debate between AI leading to automation, versus, IA, intelligence augmentation. We advocate for enhancement and empowerment of attorneys by technology, including quality controls and fraud detection. We oppose delegation of control to the machine for document review. See Why the ‘Google Car’ Has No Place in Legal Search.

This delegation to automated methods will not stop fraud, as the full-automation side argues. The SMEs are still programming the relevance input. But it will decrease precision and so drive up the costs of review. It will also result in too many lost black swans when a bad stop decision is made. There are other more effective ways to guard against a crooked attorney than trying to remove the human attorney from the equation. Experienced lawyers can already detect omissions, especially when using ranking-based searches.

Finally, I also added new writings and some challenging homework assignments for the Twelfth Class. This class covers Step Three, Random Prevalence, of the Team’s standard eight-step workflow. In this step a little math is required, so I added more explanations and detailed exercises, which should make this new material easier to learn. Now only the Fourteenth, Fifteenth and Sixteenth Classes do not have homework assignments. They will be added soon enough. Consider this a rolling production.
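For the curious, the kind of math involved in a random prevalence sample can be sketched as follows. This is a generic illustration of a point estimate with a normal-approximation confidence interval, not the specific formulas or exercises taught in the class; the function name and sample numbers are hypothetical.

```python
import math

def prevalence_estimate(sample_size, relevant_in_sample, z=1.96):
    """Estimate prevalence (the fraction of relevant documents) from a
    simple random sample, with a normal-approximation 95% confidence
    interval (z = 1.96)."""
    p = relevant_in_sample / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical sample: 1,535 documents reviewed, 23 coded relevant,
# giving a point estimate of roughly 1.5% prevalence
point, low, high = prevalence_estimate(1535, 23)
```

Multiplying the interval endpoints by the total collection size gives a rough range for the number of relevant documents, which also shows why low-prevalence projects make purely sample-based recall estimates so unreliable.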

Protected: Team’s TAR Course has been Updated and Expanded

May 21, 2017

This content is password-protected. To view it, please enter the password below.