Maura R. Grossman and Gordon V. Cormack understand the importance of language, the importance of a common tongue to bring order and clarity to any field of knowledge, especially a new one like legal search. That, I suspect, is why they labored so hard to build a definitive guide to our speech in this realm: Maura R. Grossman and Gordon V. Cormack, The Grossman-Cormack Glossary of Technology-Assisted Review, with Foreword by John M. Facciola, U.S. Magistrate Judge, 2013 Fed. Cts. L. Rev. 7 (January 2013). They understood one of the fundamental images of Western Civilization, the Tower of Babel and the Old Testament vision of what God saw:
1 And the whole earth was of one language and of one speech. …
6 Behold, they are one people and have one language, and nothing will be withheld from them which they purpose to do.
God, Book of Genesis, Chapter 11.
If the whole earth was of one language and of one speech, then, in the language of the King James version, nothing that they propose to do will now be impossible for them. As you may recall, God did not like that situation (perhaps we were not ready) and so He smote the tower of knowledge and confused their language, that they may not understand one another’s speech.
The vendors and commentators, myself included, have done a fine job, almost divine, of confounding the language of legal search. We now have as many dialects to speak about it as there are vendors. It is a swamp of babel, near gibberish. Commercial speech exploits the confusion. You think you just bought predictive coding enhanced software, but did you really? Scientific terms are thrown about like candy to children as software vendors make promises to consumers that scientists know to be fabrications.
Those of us dedicated to improving electronic discovery, especially the art of finding the electronic needle in the haystack, have been held back by the multiplicities of language where truth is lost and real knowledge is elusive. We needed a new Tower of Non-Babel to gain clarity and understanding. Maura Grossman and Gordon Cormack have built such a tower for us.
Maura and Gordon are uniquely qualified for this role. They bring the perfect mix of science and law. Gordon is a noted information scientist and professor, and Maura is a noted lawyer at a major law firm. They not only research and write together, but Gordon also assists Maura with her TAR projects. They know from firsthand experience the language needed for this kind of work to succeed.
Now that we have a solid glossary to give us one language to speak about TAR, nothing that we propose to do now will be impossible for us. We can now say what we mean and mean what we say. As a writer I am especially grateful for that and recommend you read and use their new work: The Grossman-Cormack Glossary of Technology-Assisted Review, with Foreword by John M. Facciola, U.S. Magistrate Judge, 2013 Fed. Cts. L. Rev. 7 (January 2013).
A glossary, which I was surprised to learn when researching for this blog is also called an idioticon, provides an alphabetical list of terms in a particular domain of knowledge with definitions for those terms. This is exactly what Maura and Gordon have done in the field of technology assisted review (“TAR”), which, as I have noted before, I prefer to call CAR (computer assisted review). There is room for more than one term to signify the same thing in any domain of knowledge, and so both terms, TAR and CAR, are included in The Grossman-Cormack Glossary. Escape from Babel may mean the end of linguistic confusion, but it does not mean the end of controversy, nor the arrogant imposition of a Borg-like, droll uniformity of words. We will still disagree, but at least now we will have a better understanding as to what we really disagree about. We will get down to the core issues. That is a huge step forward.
Grossman and Cormack Explain Why They Did It
Maura Grossman and Gordon Cormack take a less biblical approach to explaining their work, one more appropriate to their PhDs. In the Preamble to their Glossary, the dynamic duo explain that TAR is a disruptive technology, a term coined by Harvard Business School professor Clayton M. Christensen in his 1997 book The Innovator’s Dilemma. (I use their term, TAR, instead of my preference, CAR, for this essay in deference to the fact I am reviewing their work.) Disruptive technology is used to describe innovations that improve a product or service in ways that the market did not expect. The new technologies are disruptive to the marketplace because they change consumer demand and lower prices. In the words of Cormack and Grossman:
Products based on disruptive technologies are typically cheaper to produce, simpler, smaller, better performing, more reliable, and often more convenient to use. Technology assisted review (“TAR”) is such a disruptive technology. Because disruptive technologies differ from sustaining technologies – ones that rely on incremental improvements to established technologies – they bring with them new features, new vernaculars, and other challenges.
The introduction of disruptive technologies from the latest offerings of vendor search and review software has triggered a wealth of linguistic confusion. Not only is there a multiplicity of terms meaning essentially the same thing (e.g., TAR, CAR and predictive coding), but often the same terms are used to refer to different things (e.g., seed sets and control samples). Moreover, as G&C explain:
[T]he introduction of complex statistical concepts, and terms-of-art from the science of information retrieval, have resulted in widespread misunderstanding and sometimes perversion of their actual meanings.
This glossary is written in an effort to bring order to chaos by introducing a common framework and set of definitions for use by the bar, the bench, and service providers. The glossary endeavors to be comprehensive, but its definitions are necessarily brief. Interested readers may look elsewhere for detailed information concerning any of these topics.
Like my idea for a clearinghouse of attorney best practices in electronic discovery, the EDBP, Cormack and Grossman envision their glossary as changing and growing with the times. They also invite reader participation, as I do with EDBP. Towards that end I have carefully studied their definitions, and submitted suggestions. They encourage you to do the same as explained in the Preamble.
In the future, we plan to create an electronic version of this glossary that will contain live links, cross references, and annotations. We also envision this glossary to be a living, breathing work that will evolve over time. Towards that end, we invite our colleagues in the industry to send us their comments on our definitions, as well as any additional terms they would like to see included in the glossary, so that we can reach a consensus on a consistent, common language relating to technology assisted review.
I have found their glossary very useful and intend to cite to it in the future. It is not a mere icon of idiots; it is a well-thought-out and challenging work. You can learn a lot by close study of many of the definitions. The annotations will be a terrific addition when they are added. In my view as a wordsmith, this latest work of Grossman and Cormack is as important as their first landmark article that put TAR on the map in the first place, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, Richmond Journal of Law and Technology, Vol. XVII, Issue 3, Article 11 (2011). Read the glossary through once and then refer to it time and again: The Grossman-Cormack Glossary of Technology-Assisted Review. To see how their glossary fits into the bigger picture of electronic discovery vocabulary, see The Sedona Conference® Glossary: E-Discovery & Digital Information Management (Third Edition) (Sept. 2010).
A Few Examples
Here are a few of my favorite TAR words to give you some examples of how the glossary works. The first two describe two types of estimation formulas and show why the binomial method should be used in low yield situations. As I have mentioned before, low prevalence is the norm in legal search, so methods that rely solely on Gaussian estimation are out of touch. Not only that, they give a distorted picture that over-emphasizes the higher end of the confidence interval. To the unwary this can then lead to unnecessarily large sample sizes, which in turn push up the cost of the quality controls. That is why my blog has links to both types of calculators in the box on the right column labeled Math Tools for Quality Control.
Binomial Calculator / Binomial Estimation: A statistical method used to calculate Confidence Intervals, based on the Binomial Distribution, that models the random selection of Documents from a large Population. Binomial Estimation is generally more accurate, but less well known, than Gaussian Estimation. A Binomial Estimate is substantially better than a Gaussian Estimate (which, in contrast, relies on the Gaussian or Normal Distribution) when there are few (or no) Relevant Documents in the Sample. When there are many Relevant and many Non-Relevant Documents in the Sample, Binomial and Gaussian Estimates are nearly identical.
Classical, Gaussian, or Normal Calculator / Classical or Gaussian Estimation: A method of calculating Confidence Intervals based on the assumption that the quantities to be measured follow a Gaussian (Normal) Distribution. This method is most commonly taught in introductory statistics courses, but yields unreasonably large Confidence Intervals when the Prevalence of items with the characteristic being measured is low. (C.f. Binomial Calculator / Binomial Estimation.)
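To make the contrast between the two definitions concrete, here is a minimal Python sketch. It is not taken from the glossary or from any particular calculator. It assumes the Gaussian method is the usual normal-approximation (Wald) interval and the binomial method is the exact (Clopper-Pearson) interval, and the sample figures (1,500 documents sampled, 3 found relevant) are hypothetical. The function names are my own.

```python
import math


def binom_pmf(i, n, p):
    """P(X == i) for X ~ Binomial(n, p), computed in log space for stability."""
    if p <= 0.0:
        return 1.0 if i == 0 else 0.0
    if p >= 1.0:
        return 1.0 if i == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
               + i * math.log(p) + (n - i) * math.log1p(-p))
    return math.exp(log_pmf)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


def gaussian_interval(k, n, z=1.96):
    """Normal-approximation (Wald) interval; z=1.96 corresponds to 95% confidence."""
    p_hat = k / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def _solve_decreasing(f, target, steps=60):
    """Bisection on [0, 1] for a function f that decreases as p grows."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def binomial_interval(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) interval for k relevant documents in a sample of n."""
    # Lower bound: the p at which seeing k or more relevant has probability alpha/2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which seeing k or fewer relevant has probability alpha/2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Hypothetical low-prevalence sample: 3 relevant documents out of 1,500 reviewed.
sample_size, relevant_found = 1500, 3
print("Gaussian:", gaussian_interval(relevant_found, sample_size))
print("Binomial:", binomial_interval(relevant_found, sample_size))
```

With so few relevant documents in the sample, the Gaussian lower bound falls below zero, which is impossible for a prevalence, while the binomial interval stays between 0 and 1. With a balanced sample (say 200 relevant out of 400) the two methods give nearly the same answer, just as the glossary says.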
For the last example of a good TAR term and definition, check out this longer than usual definition of the confusion matrix. This is a very good word to learn because it includes so much. Plus it sounds so cool. Study up on this or you could well be lost in the confusion matrix.
Confusion Matrix: A two-by-two table listing values for the number of True Negatives (“TN”), False Negatives (“FN”), True Positives (“TP”), and False Positives (“FP”) resulting from a search or review effort. As shown below, all of the standard evaluation measures are algebraic combinations of the four values in the Confusion Matrix. Also referred to as a Contingency Table. An example of a Confusion Matrix (or Contingency Table) is provided immediately below.
| | Truly Non-Relevant | Truly Relevant |
| Coded Non-Relevant | True Negatives (“TN”) | False Negatives (“FN”) |
| Coded Relevant | False Positives (“FP”) | True Positives (“TP”) |
Accuracy = 100% – Error = (TP + TN) / (TP + TN + FP + FN)
Error = 100% – Accuracy = (FP + FN) / (TP + TN + FP + FN)
Elusion = 100% – Negative Predictive Value = FN / (FN + TN)
Fallout = False Positive Rate = 100% – True Negative Rate = FP / (FP + TN)
Negative Predictive Value = 100% – Elusion = TN / (TN + FN)
Precision = Positive Predictive Value = TP / (TP + FP)
Prevalence = Yield = Richness = (TP + FN) / (TP + TN + FP + FN)
Recall = True Positive Rate = 100% – False Negative Rate = TP / (TP + FN)
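The formulas above are easy to check with a few lines of code. What follows is a minimal Python sketch, assuming nothing beyond the definitions just quoted. The class name, property names, and the example counts are hypothetical and chosen only for illustration. It also assumes every denominator is nonzero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """The four cells of the two-by-two table, as document counts."""
    tp: int  # coded relevant, truly relevant
    fp: int  # coded relevant, truly non-relevant
    tn: int  # coded non-relevant, truly non-relevant
    fn: int  # coded non-relevant, truly relevant

    @property
    def total(self):
        return self.tp + self.tn + self.fp + self.fn

    @property
    def accuracy(self):
        return (self.tp + self.tn) / self.total

    @property
    def error(self):
        return (self.fp + self.fn) / self.total

    @property
    def elusion(self):
        return self.fn / (self.fn + self.tn)

    @property
    def fallout(self):
        return self.fp / (self.fp + self.tn)

    @property
    def negative_predictive_value(self):
        return self.tn / (self.tn + self.fn)

    @property
    def precision(self):
        return self.tp / (self.tp + self.fp)

    @property
    def prevalence(self):
        return (self.tp + self.fn) / self.total

    @property
    def recall(self):
        return self.tp / (self.tp + self.fn)


# Hypothetical review of 1,000 documents, 100 of them truly relevant.
cm = ConfusionMatrix(tp=80, fp=20, tn=880, fn=20)
for name in ("accuracy", "error", "elusion", "fallout",
             "negative_predictive_value", "precision", "prevalence", "recall"):
    print(f"{name}: {getattr(cm, name):.2%}")
```

Note how, in this made-up low-prevalence example, accuracy looks high even though one in five relevant documents was missed. That is one reason recall, precision, and elusion tend to be the more informative measures in legal search.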
The Grossman-Cormack Glossary does not, by itself, make the whole world of legal search of one language and of one speech. It does not, by itself, make it possible to accomplish everything we propose. But it establishes a literary framework by which we can attain greater clarity and understanding in advanced legal search. Using my preferred image, this glossary should be in the glove compartment of every CAR. If nothing else, should you get lost, the glossary will make it easier for you to look at a map or, shudder, even ask for directions.
The Grossman-Cormack Glossary provides a solid foundation for a Twenty-First Century tower of TAR. Unlike the Bible story, however, this tower will not be stricken down by higher powers. Verily, nay, I say unto you, it will even be cited by the appellate courts.