Sedona’s New Commentary on Search, and the Myth of the Pharaoh’s Curse

[Image: Thoth brings the gift of writing, but Thamus sees it as a curse]

The Sedona Conference has just released its Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (August 2007) for public comments.  A copy may be downloaded for personal use.  This Best Practices Commentary, like all of the Sedona publications, was written by a committee of expert members of The Sedona Conference, who agreed upon the content and wording.  This particular group is called the “Search and Retrieval Sciences Project Team.”  Writings by committee are usually an invitation for disaster, but Sedona consistently manages to pull it off, and do a first rate job, primarily, I think, because of the quality of their editors. The Editor-in-Chief for the Search Team is Jason Baron, about whom I have written several times previously, along with Executive Editors Richard Braman and Kenneth Withers, and Senior Editors Thomas Allman, James Daley and George Paul.

The Search Commentary begins by concisely stating the problems faced today in searching high volumes of ESI.  It then offers three general solutions, followed by eight specific “Practice Points.” The comments contain both intellectual depth and good practical advice for all those struggling with the problems of search.

The Search Commentary is carefully considered and well written. Although I have a couple of suggestions on the comments, I fully agree with the committee’s observations and solutions.  Many will not.  In fact, I suspect that this publication will be quite challenging to many in the legal profession because it contradicts several well-established myths.  For instance, the Search Team acknowledges that most people consider:

manual review by humans of large amounts of information is as accurate and complete as possible – perhaps even perfect – and constitutes the gold standard by which all searches should be measured.

But the committee states that this is a myth!  Manual review may be perfect for a few hundred pages of documents, but fails miserably for a few hundred thousand, much less a million or a billion. So much for the gold standard.

The Search Team also makes the point, which is not controversial, that the large amounts of ESI in many lawsuits today have made the “venerated process of ‘eyes only’ review” both impractical and cost-prohibitive.  They contend that a new consensus is forming in the legal community:

that human review of documents in discovery is expensive, time consuming, and error-prone. There is growing consensus that the application of linguistic and mathematic-based content analysis, embodied in new forms of search and retrieval technologies, tools, techniques and process in support of the review function can effectively reduce litigation cost, time, and error rates.

This leads to Practice Point 1 (of 8):

In many settings involving electronically stored information, reliance solely on manual search process for the purpose of finding responsive documents may be infeasible or unwarranted. In such cases, the use of automated search methods should be viewed as reasonable, valuable, and even necessary.

The automated search method of choice today is the almost-as-venerated process of keyword search review. It involves the use of select keywords that you think the documents you are looking for will contain.  Keyword searches also frequently include “boolean” logic, and can be expanded further with fuzzy logic, and stemming.  You then manually search the documents located by keyword search to determine relevance.  The manual review then frequently leads to adjustments in the query terms and repeat of the keyword search.  Most lawyers think that with this kind of iterative process, and skilled researchers, you can find most of the documents you are looking for. 
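For readers who have never looked under the hood, here is a minimal sketch of what a keyword search with Boolean logic and stemming does. It is an illustration only, not any vendor's actual engine: the `stem` function is a deliberately crude, hypothetical suffix-stripper, and the three memos are invented.

```python
import re

def stem(word):
    """Crude suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text):
    """Lowercase the text, split it into words, and stem each word."""
    return {stem(w) for w in re.findall(r"[a-z]+", text.lower())}

def search(docs, all_of=(), any_of=()):
    """Return ids of docs that contain every all_of term (AND) and,
    if any_of is given, at least one any_of term (OR)."""
    must = {stem(t.lower()) for t in all_of}
    may = {stem(t.lower()) for t in any_of}
    hits = []
    for doc_id, text in docs.items():
        found = terms(text)
        if must <= found and (not may or may & found):
            hits.append(doc_id)
    return hits

# Invented sample documents (hypothetical).
docs = {
    "memo-1": "The brakes failed during testing of the subway car.",
    "memo-2": "Brake inspection was delayed again.",
    "memo-3": "The unfortunate incident is under review.",
}

# Query: brake AND (fail OR inspection)
hits = search(docs, all_of=["brake"], any_of=["fail", "inspection"])
```

Stemming lets "brake" match "brakes" and "fail" match "failed", so the first two memos are returned. The third memo, which discusses the same event without using any of the chosen keywords, is never seen by the reviewer.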

In fact, in a study done in 1985, lawyers and paralegals having special skills in this area searched a discovery database of 40,000 documents and 350,000 pages in a case involving a subway accident. David Blair & M.E. Maron, An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System, 28 Com. A.C.M. 289 (1985). At the end of the lengthy process, the legal team was confident that they had located about 75% of the relevant documents. In my experience, most attorneys think they have a similar, if not better, success rate.

Lawyers have been using keyword searches since the ’70s with Lexis and Westlaw to find case law.  I was first trained in this in 1978. At that time, Westlaw and Lexis each had mandatory video (VHS) training programs leading to certification. Once certified, you could use “dumb terminals” to access mainframes over modems.  It was a tremendous innovation in its day. 

It was a natural extension in the ’80s and ’90s to use the same keyword search technology to locate relevant documents in large sets of ESI. Lawyers and judges quickly endorsed this legal research method to also search for documents. As one judge put it, “the glory of electronic information is not merely that it saves space but that it permits the computer to search for words or ‘strings’ of text in seconds.” In re Lorazepam & Clorazepate, 300 F.Supp.2d 43, 46 (D.D.C. 2004). Keyword searching appeared to solve the problem of large volumes of electronic documents where the gold standard of “eyes only” review was not practical.  It might not be perfect like manual searches, but it got at least 75% of the documents, and so was an acceptable alternative.

The profession today is very familiar and comfortable with keyword searching.  Keyword search is the method employed by almost all lawyers when they use an automated search process.  In fact, I suspect that most lawyers are not even aware that there are alternatives to keyword searches.

That is why the committee’s next contention may prove very controversial: the supposed accuracy of keyword searches is just another myth! The Blair and Maron study in 1985 showed that, while the lawyers thought they had found at least 75% of the relevant documents, in fact they had only located 20%.
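The percentage at issue here is what information scientists call recall: the share of all relevant documents that the search actually found. Its companion measure, precision, is the share of retrieved documents that turn out to be relevant. A minimal sketch of both calculations follows; the document counts are invented for illustration and are not the figures from the study.

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that the search retrieved."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are actually relevant."""
    return len(retrieved & relevant) / len(retrieved)

# Hypothetical collection: documents 0-99 are the relevant ones.
relevant = set(range(100))

# Hypothetical search result: 20 relevant documents plus 5 irrelevant ones.
retrieved = set(range(20)) | set(range(1000, 1005))

search_recall = recall(retrieved, relevant)        # 20 / 100
search_precision = precision(retrieved, relevant)  # 20 / 25
```

Note that a searcher sees only the retrieved set. High precision there (most of what came back looks relevant) says nothing about the 80 relevant documents that never came back, which is one way confidence and actual recall can drift so far apart.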

Can justice really be served with only 20% of the picture? Has the exploding cornucopia of ESI cursed the legal system with the pretence of real knowledge?

The Blair and Maron study, which is still the only one of its kind, led one commentator, Daniel Dabney, a lawyer and information scientist who now works for Westlaw, to equate the false confidence of computer searchers to the Curse of Thamus.  Daniel P. Dabney, The Curse of Thamus: An Analysis of Full-Text Legal Document Retrieval, 78 LawLibr. J. 5 (1986). Thamus was an Egyptian Pharaoh reported by Plato in his Phaedrus Dialogue to have criticized the invention of writing as a false substitute for real learning.  Thamus condemned writing, said to be a gift from the god Theuth (aka Hermes), as a curse in disguise. The Pharaoh predicted that writing would only lead to a delusionary “semblance of truth” and “conceit of wisdom.”  As Dabney put it in his article:

Since the mere possession of writings does not give knowledge, how are we to extract from this almost incomprehensibly large collection of written records the knowledge that we need?

Dabney argued that the Blair and Maron study proved that full-text computer assisted retrieval was not a valid cure to the Pharaoh’s curse.  The Sedona Search Team agrees:

. . . the experience of many litigators is that simple keyword searching alone is inadequate in at least some discovery contexts.  This is because simple keyword searches end up being both over- and under-inclusive in light of the inherent malleability and ambiguity of spoken and written English (as well as all other languages). . . .

The problem of the relative percentage of “false positive” hits or noise in the data is potentially huge, amounting in some cases to huge numbers of files which must be searched to find responsive documents. On the other hand, keyword searches have the potential to miss documents that contain a word that has the same meaning as the term used in the query, but is not specified. . . .

Finally, using keywords alone results in a return set of potentially responsive documents that are not weighted and ranked based upon their potential importance or relevance. In other words, each document is considered to have an equal probability of being responsive upon further manual review.
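To make the point about weighting and ranking concrete, here is a minimal sketch that orders hits by how often the query terms appear. Counting occurrences is only one simple, assumed scoring scheme chosen for illustration; it is not the method the Commentary recommends, and the emails are invented.

```python
import re
from collections import Counter

def score(text, query_terms):
    """Count how many times the (lowercase) query terms occur in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(counts[t] for t in query_terms)

def ranked_search(docs, query_terms):
    """Return ids of matching docs, highest score first (ties by id)."""
    scored = [(score(text, query_terms), doc_id) for doc_id, text in docs.items()]
    hits = [(s, d) for s, d in scored if s > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for s, d in hits]

# Invented sample documents (hypothetical).
docs = {
    "email-1": "Lunch plans: the subway is running late again.",
    "email-2": "Subway brake failure report: brake pads worn, brake line leaking.",
    "email-3": "Quarterly budget attached.",
}

ranking = ranked_search(docs, ["subway", "brake"])
```

A plain keyword search would hand the reviewer both matching emails with equal standing. Even this crude score puts the brake failure report ahead of the lunch plans.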

The Sedona Search Team notes that currently most e-discovery vendors and software providers continue to rely on outdated keyword searching. This is also what I am seeing. So, obviously this message may come as an unwelcome challenge to many e-discovery providers, and is therefore likely to be controversial.

But the Sedona Search Commentary does not end on a negative note; instead it points to new search technologies that will significantly improve upon the dismal recall and precision ratios of keyword searches. Here is how they summarize the herald of coming good:

Alternative search tools are available to supplement simple keyword searching and Boolean search techniques. These include using fuzzy logic to capture variations on words; using conceptual searching, which makes use of taxonomies and ontologies assembled by linguists; and using other machine learning and text mining tools that employ mathematical probabilities.
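As a rough illustration of the first of these tools, the sketch below expands a keyword to similar-looking words found in the collection, using the string similarity matching in Python's standard `difflib` module. The `expand_term` helper, the 0.85 cutoff, and the vocabulary are all assumptions made for the example; commercial fuzzy search tools work differently and far more carefully.

```python
from difflib import get_close_matches

def expand_term(term, vocabulary, cutoff=0.85):
    """Return vocabulary words whose spelling is close to the query term."""
    return get_close_matches(term.lower(), vocabulary, n=5, cutoff=cutoff)

# Hypothetical list of words appearing in a document collection,
# including a typo ("brakse") that an exact keyword match would miss.
vocabulary = ["brake", "brakes", "brakse", "braking", "break", "subway"]

variants = expand_term("brake", vocabulary)
```

With this cutoff the expansion picks up "brakes" and the misspelled "brakse" but leaves out "break" and "braking". Lowering the cutoff catches more variants at the price of more noise, which is the same over- and under-inclusiveness trade-off in a different form.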

This part of the new Commentary is really interesting, albeit challenging, as the Team talks about alternative search tools and methods, and describes many of them in detail in the Appendix. 

The many incredible advances in technology over the last twenty years have created the legal morass we are in now.  In our present cursed state, it is impossible to find all relevant evidence, and a mere 20% capture rate seems pretty good.  The only viable solution is to fight fire with fire, and find a high-tech answer.  This requires a new kind of team synergy that I often talk about in this blog, a combination of Science, Technology and the Law. The Sedona search group concludes with a similar recommendation:

The legal community should support collaborative research with the scientific and academic sectors aimed at establishing the efficacy of a range of automated search and information retrieval methods.

The problems created by the information explosion impact all of society, not just the law. There is strong demand for new, improved search technologies, and this is becoming big business. Billions of dollars are now pouring into search technology research. For instance, in 2006 Google spent $1.23 billion, Yahoo spent $833 million, and eBay spent $495 million in core research and development activities. With this kind of commercial activity, there is good reason to hope that the Pharaoh’s curse may soon be lifted.

For more information on this subject check out the West Legalworks CLE webinar I did with Jason R. Baron – Director of Litigation, U.S. National Archives and Records Administration, College Park, Maryland; Doug Oard, Ph.D. – Associate Dean for Research, College of Information Studies, University of Maryland; and my law partner at Akerman in Los Angeles, Michael S. Simon. The 1.5 hour audio CLE is entitled The e-Discovery Search Quagmire: New Approaches to the Problem of Finding Relevant Needles in the Electronic Haystack.

11 Responses to Sedona’s New Commentary on Search, and the Myth of the Pharaoh’s Curse

  1. Lee Barrett says:

    The problems of keyword search are even more pronounced in smaller and medium sized firms across the country, many of whom are most assuredly not terribly familiar with automated electronic searches through documents. In order to be truly effective, a technology-based answer will also have to be affordable and largely “turn-key”.
    Even more importantly, as has been pointed out in this blog, and in others, the attorneys MUST know what their clients know about the day to day business implementation of the digital enterprise. Many keyword searches, while seemingly intuitive and rational, can be easily defeated by the machinations of the parties. The shorthand that people often use in email or other forms of truncated digital communication, especially when used to refer to offbeat phrases such as “Project Look What the Cat Dragged In”, can result in numerous variations of keywords for one independent area of interest. Whether the reliability of keyword search is pegged at 20% simply due to the reliability of the underlying technology, or because of other external factors, clearly lawyers cannot be lulled into relying on a single “magic bullet” to meet their obligations in e-discovery. This is clearly one area where one must keep sight of the entire forest, and not focus on just the trees.

  2. rjbiii says:

    This is a brilliant post. There is no doubt that many (most? all?) searches executed against a data universe today in lit support fail in some ways. I wrote a paper for law school that discussed the need for a “feedback loop” in the e-discovery process to test the validity of the “initial assumptions” used to formulate the search criteria.

    In its simplest form, this feedback loop consists of indexing and building a terms list of those documents classified by counsel as “relevant” or “privileged,” and reviewing that list of terms to see if modification to the search criteria is necessary. Additionally, metadata could be scanned to search for data custodians that might have been overlooked. I don’t know any vendors that do this, or consultants that advise clients to incorporate such a process, but it only makes sense. It strengthens the discovery process and at the very least allows a greater measure of defensibility of it in the courtroom. After all, the criteria used to filter through the data universe determine what the reviewers see, and if there is a flaw in that process, then how can you be sure of the final result?

  3. […] here). An excellent view of the recommendations contained in the white paper may be found at e-Discovery Team. What is certain is that technology associated with e-discovery still has a ways to come (although […]

  4. Ken Withers says:

    Thank you, Ralph, for resurrecting Dabney’s “Curse of Thamus” from the tomb of academic history – I always loved the article and still have a copy in my file of important reading that I downloaded on October 24, 2000, as one of the few works exploring the intersection of law and information science and the implications of the Blair and Maron study (and as an avid reader of Biblical Archaeology Review, I applaud your choice of graphic illustration). I appreciate your comments and hope that your readers will submit more. All Sedona Conference papers are a “work in progress,” even though we might call them “final” for the purposes of publication.

    Your readers should also be aware that the research behind this most recent commentary on search and retrieval technology is very much in progress. We actively support the new “legal track” of the National Institute of Standards and Technology’s annual Text REtrieval Conference (TREC) at http://trec.nist.gov/. Jason Baron of the National Archives and Records Administration captains the effort, under which members of The Sedona Conference’s Working Group 1 supply sample discovery-like queries and human reviewers to help evaluate the performance of a variety of cutting-edge automated text retrieval methods in development. Membership in Working Group 1, and participation in developing these Commentaries, is open to all, and I encourage your readers to join up at http://www.thesedonaconference.org.

    Keep up the good work on your excellent blog, and I hope you allow us to post URLs occasionally.

  5. […] Blog entry posted by Ralph Losey on e-Discovery Team, September 16, 2007: The Sedona Conference has just released its Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery (August 2007) for public comments. A copy may be downloaded for personal use. This Best Practices Commentary, like all of the Sedona publications, was written by a committee of expert members of The Sedona Conference, who agreed upon the content and wording. This particular group is called the “Search and Retrieval Sciences Project Team.” Writings by committee are usually an invitation for disaster, but Sedona consistently manages to pull it off, and do a first rate job, primarily, I think, because of the quality of their editors. The Editor-in-Chief for the Search Team is Jason Baron, about whom I have written several times previously, along with Executive Editors Richard Braman and Kenneth Withers, and Senior Editors Thomas Allman, James Daley and George Paul…. […]

  6. […] For more on the law review article, Information Inflation, by Paul & Baron cited above, see my prior blog Information Explosion and the Future of Litigation. I have also previously written on the above cited Sedona Commentary on Search in The Myth of the Pharaoh’s Curse. […]

  7. […] more sophisticated concept-type search alternatives to keyword search should be considered because keyword searches alone may not […]

  8. […] Information Retrieval, 8 The Sedona Conf. J. 189 (2007), which I have previously written about in Sedona’s New Commentary on Search, and the Myth of the Pharaoh’s Curse, and the Text Retrieval Conference (TREC) sponsored by the National Institute of Standards and […]

  9. […] Ralph explains the study and comments on the Sedona Search & Retrieval paper in one of his prior blogs.] Technologies are available to minimize the number of relevancy and privilege decisions a review […]

  10. […] Most litigation lawyers today do not understand just how hard it is to search large data-sets. They think that when they request production of “all” relevant documents (and now ESI), that “all or substantially all” will in fact be retrieved by existing manual or automated search methods. This is a myth. The corollary of this myth is that the use of  “keywords” alone in automated searches will reliably produce all or substantially all documents from a large document collection. Again, most litigators think this is true, but it is not. That is not just Jason’s opinion, or my opinion, it is what scientific, peer-reviewed research has shown to be true.   A study by information scientists David Blair and M.E. Maron in 1985 revealed a significant gap or disconnect between lawyers’ perceptions of their ability to ferret out relevant documents and their actual ability to do so. The study involved a 40,000 document case (350,000 pages). The lawyers estimated that a keyword search process uncovered 75% of the relevant documents, when in fact it had only found 20%! Blair, David C., & Maron, M. E., An evaluation of retrieval effectiveness for a full-text document-retrieval system; Communications of the ACM Volume 28, Issue 3 (March 1985); Also see: Dabney, The Curse of Thamus: An Analysis of Full-Text Legal Document Retrieval, 78 LawLibr. J. 5 (1986); Losey, Sedona’s New Commentary on Search, and the Myth of the Pharaoh’s Curse. […]

  11. […] keep experimenting in our practice. Aside from my own humble books, and the books and articles of The Sedona Conference, especially Jason R. Baron, you should also check out the new 78 page Search Guide prepared by the […]
