The Sedona Conference has just released its Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (August 2007) for public comment. A copy may be downloaded for personal use. This Best Practices Commentary, like all Sedona publications, was written by a committee of expert members of The Sedona Conference, who agreed upon the content and wording. This particular group is called the “Search and Retrieval Sciences Project Team.” Writing by committee is usually an invitation for disaster, but Sedona consistently manages to pull it off and do a first-rate job, primarily, I think, because of the quality of its editors. The Editor-in-Chief for the Search Team is Jason Baron, about whom I have written several times previously, along with Executive Editors Richard Braman and Kenneth Withers, and Senior Editors Thomas Allman, James Daley and George Paul.
The Search Commentary begins by concisely stating the problems faced today in searching high volumes of ESI. It then offers three general solutions, followed by eight specific “Practice Points.” The comments combine intellectual depth with good practical advice for all those struggling with the problems of search.
The Search Commentary is carefully considered and well written. Although I have a couple of suggestions on the comments, I fully agree with the committee’s observations and solutions. Many will not. In fact, I suspect that this publication will be quite challenging to many in the legal profession because it contradicts several well-established myths. For instance, the Search Team acknowledges that most people consider:
manual review by humans of large amounts of information is as accurate and complete as possible – perhaps even perfect – and constitutes the gold standard by which all searches should be measured.
But the committee states that this is a myth! Manual review may be perfect for a few hundred pages of documents, but it fails miserably for a few hundred thousand, much less a few million or billion. So much for the gold standard.
The Search Team also makes the point, which is not controversial, that the large volumes of ESI in many lawsuits today have made the “venerated process of ‘eyes only’ review” both impractical and cost-prohibitive. They contend that a new consensus is forming in the legal community:
that human review of documents in discovery is expensive, time consuming, and error-prone. There is growing consensus that the application of linguistic and mathematic-based content analysis, embodied in new forms of search and retrieval technologies, tools, techniques and process in support of the review function can effectively reduce litigation cost, time, and error rates.
This leads to Practice Point 1 (of 8):
In many settings involving electronically stored information, reliance solely on manual search process for the purpose of finding responsive documents may be infeasible or unwarranted. In such cases, the use of automated search methods should be viewed as reasonable, valuable, and even necessary.
The automated search method of choice today is the almost-as-venerated process of keyword search review. It involves the use of select keywords that you think the documents you are looking for will contain. Keyword searches also frequently include Boolean logic, and can be expanded further with fuzzy logic and stemming. You then manually review the documents located by the keyword search to determine relevance. The manual review frequently leads to adjustments in the query terms and a repeat of the keyword search. Most lawyers think that with this kind of iterative process, and skilled researchers, you can find most of the documents you are looking for.
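To make the mechanics concrete, here is a minimal sketch of that kind of Boolean keyword search with crude suffix-stripping stemming. The documents, terms, and stemmer are my own invented illustration, not anything from the Commentary or from any real search product:

```python
import re

# Tiny invented document collection, keyed by document number.
DOCS = {
    1: "The train derailed near the subway platform.",
    2: "Quarterly earnings report, nothing about accidents.",
    3: "Witnesses described the derailment and the injured passengers.",
}

def stem(word):
    """Crude suffix-stripping stemmer (illustration only)."""
    for suffix in ("ment", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokens(text):
    """Lowercase, split into words, and stem each one."""
    return {stem(w) for w in re.findall(r"[a-z]+", text.lower())}

def search(query_all=(), query_any=()):
    """Boolean search: every term in query_all AND at least one in query_any."""
    hits = []
    for doc_id, text in DOCS.items():
        toks = tokens(text)
        if all(stem(t) in toks for t in query_all) and (
            not query_any or any(stem(t) in toks for t in query_any)
        ):
            hits.append(doc_id)
    return hits

# Stemming lets "derailed" OR "derailment" match both word forms.
print(search(query_any=("derailed", "derailment")))  # → [1, 3]
```

Note the built-in limitation the Commentary will address below: document 3 is found only because its vocabulary happens to share a stem with the query; a document that said “the cars left the tracks” would be missed entirely.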
In fact, in a study published in 1985, lawyers and paralegals with special skills in this area searched a discovery database of 40,000 documents, totaling 350,000 pages, in a case involving a subway accident. David Blair & M.E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System, 28 Comm. ACM 289 (1985). At the end of the lengthy process, the legal team was confident that they had located about 75% of the relevant documents. In my experience, most attorneys think they have a similar, if not better, success rate.
Lawyers have been using keyword searches since the ’70s with Lexis and Westlaw to find case law. I was first trained in this in 1978. At that time, Westlaw and Lexis each had mandatory video (VHS) training programs leading to certification. Once certified, you could use “dumb terminals” to access mainframes over modems. It was a tremendous innovation in its day.
It was a natural extension in the ’80s and ’90s to use the same keyword search technology to locate relevant documents in large sets of ESI. Lawyers and judges quickly endorsed this legal research method to search for documents as well. As one judge put it, “the glory of electronic information is not merely that it saves space but that it permits the computer to search for words or ‘strings’ of text in seconds.” In re Lorazepam & Clorazepate, 300 F.Supp.2d 43, 46 (D.D.C. 2004). Keyword searching appeared to solve the problem of large volumes of electronic documents where the gold standard of “eyes only” review was not practical. It might not be perfect like manual searches, but it got at least 75% of the documents, and so was an acceptable alternative.
The profession today is very familiar and comfortable with keyword searching. Keyword search is the method employed by almost all lawyers when they use an automated search process. In fact, I suspect that most lawyers are not even aware that there are alternatives to keyword searches.
That is why the committee’s next contention may prove very controversial: the supposed accuracy of keyword searches is just another myth! The Blair and Maron study in 1985 showed that, while the lawyers thought they had found at least 75% of the relevant documents, in fact they had only located 20%.
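In information-retrieval terms, what Blair and Maron measured is recall: the fraction of all relevant documents that the search actually retrieved. A toy calculation with hypothetical round numbers (my own, chosen only to match the study's percentages) shows how wide the gap between perceived and actual performance was:

```python
# Hypothetical counts illustrating the Blair and Maron gap:
# searchers believed they had ~75% recall but achieved ~20%.
total_relevant = 1000       # relevant documents actually in the collection
retrieved_relevant = 200    # relevant documents the keyword searches found

# Recall: share of everything relevant that was actually retrieved.
recall = retrieved_relevant / total_relevant
print(f"actual recall   = {recall:.0%}")    # → actual recall   = 20%
print(f"believed recall = {0.75:.0%}")      # → believed recall = 75%
```

The trap is that recall cannot be felt from inside the review: the 800 missed documents, by definition, never appear on anyone's screen.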
Can justice really be served with only 20% of the picture? Has the exploding cornucopia of ESI cursed the legal system with the pretense of real knowledge?
The Blair and Maron study, which is still the only one of its kind, led one commentator, Daniel Dabney, a lawyer and information scientist who now works for Westlaw, to equate the false confidence of computer searchers with the Curse of Thamus. Daniel P. Dabney, The Curse of Thamus: An Analysis of Full-Text Legal Document Retrieval, 78 Law Libr. J. 5 (1986). Thamus was an Egyptian pharaoh reported by Plato in his Phaedrus dialogue to have criticized the invention of writing as a false substitute for real learning. Thamus condemned writing, said to be a gift from the god Theuth (a.k.a. Hermes), as a curse in disguise. The pharaoh predicted that writing would lead only to a delusionary “semblance of truth” and “conceit of wisdom.” As Dabney put it in his article:
Since the mere possession of writings does not give knowledge, how are we to extract from this almost incomprehensibly large collection of written records the knowledge that we need?
Dabney argued that the Blair and Maron study proved that full-text computer-assisted retrieval was not a valid cure for the Pharaoh’s curse. The Sedona Search Team agrees:
. . . the experience of many litigators is that simple keyword searching alone is inadequate in at least some discovery contexts. This is because simple keyword searches end up being both over- and under-inclusive in light of the inherent malleability and ambiguity of spoken and written English (as well as all other languages). . . .
The problem of the relative percentage of “false positive” hits or noise in the data is potentially huge, amounting in some cases to huge numbers of files which must be searched to find responsive documents. On the other hand, keyword searches have the potential to miss documents that contain a word that has the same meaning as the term used in the query, but is not specified. . . .
Finally, using keywords alone results in a return set of potentially responsive documents that are not weighted and ranked based upon their potential importance or relevance. In other words, each document is considered to have an equal probability of being responsive upon further manual review.
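That last point, unranked and unweighted results, is easiest to see by contrast with a scored scheme. Here is a short sketch of TF-IDF ranking, one common weighting technique (my own illustration with invented documents; the Commentary does not prescribe any particular formula): each document gets a relevance score, so a reviewer can start with the most promising hits instead of treating every hit as equally likely to be responsive.

```python
import math
import re
from collections import Counter

# Tiny invented collection for the ranking demonstration.
DOCS = [
    "subway accident accident report",
    "subway schedule",
    "annual report",
]

def tf_idf_ranking(query):
    """Score each document by term frequency x inverse document frequency."""
    terms = query.lower().split()
    n = len(DOCS)
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in DOCS]
    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in terms}
    scores = []
    for i, toks in enumerate(tokenized):
        counts = Counter(toks)
        # Rare terms (low df) contribute more; repeated terms (high tf) too.
        score = sum(
            counts[t] * math.log(n / df[t]) for t in terms if df.get(t)
        )
        scores.append((score, i))
    return sorted(scores, reverse=True)

for score, i in tf_idf_ranking("subway accident"):
    print(f"{score:.2f}  {DOCS[i]}")
```

The first document wins because it contains the rarer term “accident” twice; a plain keyword hit list would simply return documents 1 and 2 side by side with no ordering at all.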
The Sedona Search Team notes that currently most e-discovery vendors and software providers continue to rely on outdated keyword searching. This is also what I am seeing. So, obviously this message may come as an unwelcome challenge to many e-discovery providers, and is therefore likely to be controversial.
But the Sedona Search Commentary does not end on a negative note; instead it points to new search technologies that promise to significantly improve upon the dismal recall and precision of keyword searches. Here is how they herald the good news to come:
Alternative search tools are available to supplement simple keyword searching and Boolean search techniques. These include using fuzzy logic to capture variations on words; using conceptual searching, which makes use of taxonomies and ontologies assembled by linguists; and using other machine learning and text mining tools that employ mathematical probabilities.
This part of the new Commentary is really interesting, albeit challenging, as the Team talks about alternative search tools and methods, and describes many of them in detail in the Appendix.
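Of the alternatives the Commentary lists, fuzzy matching is the simplest to demonstrate. The sketch below uses Python's standard-library `difflib` similarity matching as a stand-in for the fuzzy-logic tools the Commentary describes; the word list and misspellings are my own invented examples, not the Commentary's:

```python
import difflib

# Invented vocabulary drawn from terms appearing earlier in this post.
vocabulary = ["lorazepam", "clorazepate", "diazepam", "subpoena"]

# Misspellings that an exact keyword search would miss entirely,
# but that similarity matching captures.
for typo in ("lorazapam", "subpena"):
    match = difflib.get_close_matches(typo, vocabulary, n=1)
    print(typo, "->", match)
# → lorazapam -> ['lorazepam']
# → subpena -> ['subpoena']
```

Exact keyword matching returns nothing for either misspelling; the similarity scorer still finds the intended term. Commercial fuzzy-search tools use more sophisticated methods, but the underlying idea, matching on closeness rather than identity, is the same.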
The many incredible advances in technology over the last twenty years have created the legal morass we are in now. In our present cursed state, it is impossible to find all relevant evidence, and a mere 20% capture rate seems pretty good. The only viable solution is to fight fire with fire, and find a high-tech answer. This requires a new kind of team synergy that I often talk about in this blog, a combination of Science, Technology and the Law. The Sedona search group concludes with a similar recommendation:
The legal community should support collaborative research with the scientific and academic sectors aimed at establishing the efficacy of a range of automated search and information retrieval methods.
The problems created by the information explosion impact all of society, not just the law. There is strong demand for new, improved search technologies, and this is becoming big business. Billions of dollars are now pouring into search technology research. For instance, in 2006 Google spent $1.23 billion, Yahoo spent $833 million, and eBay spent $495 million on core research and development activities. With this kind of commercial activity, there is good reason to hope that the Pharaoh’s curse may soon be lifted.
For more information on this subject, check out the West Legalworks CLE webinar I did with Jason R. Baron – Director of Litigation, U.S. National Archives and Records Administration, College Park, Maryland; Doug Oard, Ph.D. – Associate Dean for Research, College of Information Studies, University of Maryland; and my law partner at Akerman in Los Angeles, Michael S. Simon. The 1.5-hour audio CLE is entitled The e-Discovery Search Quagmire: New Approaches to the Problem of Finding Relevant Needles in the Electronic Haystack.