An opinion this month by Judge Facciola distinguishes between keyword searching and concept searching. Disability Rights Council of Greater Wash. v. Wash. Metro. Area Transit Auth., 2007 WL 1585452 (D.D.C. June 1, 2007). The plaintiff had proposed simple keyword searching of email by people’s names, but Judge Facciola suggested the parties instead consider concept searching. This is the first opinion to recognize the distinction between the two types of searches, according to Jason R. Baron, Director of Litigation of the National Archives and Records Administration. He wrote to me earlier today to bring this to my attention. Jason should know, as he is an expert and strong proponent of concept searching. Indeed, Judge Facciola cites his article in the opinion. Here is the operative language from Disability Rights Council at *9:
I bring to the parties’ attention recent scholarship that argues that concept searching, as opposed to keyword searching, is more efficient and more likely to produce the most comprehensive results. See George L. Paul & Jason R. Baron, Information Inflation: Can the Legal System Adapt? 13 Rich. J.L. & Tech. 10 (2007).
My blog of April 9, 2007, reviews this article in depth, and I mentioned it again yesterday when discussing the need for cooperation between counsel in my blog on Intellectual Foundations. Concept searching is just one of many cutting-edge ideas discussed in Paul & Baron’s article, and one that I have not discussed before. It pertains to promising new software technology that may allow for far better searching than simple keyword matching. But before I go into that, a little more about the interesting Disability Rights Council case itself.
The defendant Transit Authority configured its GroupWise email system so that all emails were automatically deleted after only 60 days. The only exception was when a user went to the trouble to archive a particular email. These archived emails were not deleted. In practice, few Transit Authority users ever bothered to archive any of their emails, and so after 60 days almost all were deleted. There is nothing wrong with such a system in principle, but the problem here is that it was not suspended when suit was filed. In fact, in what the court called “remarkable” and “indefensible,” the defendant continued to destroy all emails for over two years after the suit was filed.
The opinion begins by noting that the “safe harbor” of new Rule 37(f) was not intended to apply to this situation, at least insofar as the emails destroyed after the suit was filed are concerned. The rule requires “routine” and “good faith” operation of a system. Although it was routine destruction, the court did not consider it to have been carried out in good faith after suit was filed. That is primarily because we are talking about the destruction of live ESI, namely email still on the system, and not on back-up tapes. A preservation hold should have prevented this. After the live emails are so destroyed, the only place to find them is on the backup tapes. For that reason, among others, even though the court agreed with defendant that the backup tapes were not reasonably accessible under Rule 26(b)(2)(B), it nevertheless found good cause to order that they be restored and searched at defendant’s expense. To hold otherwise would reward defendant for destroying relevant emails, leaving the backup tapes as the only remaining source of the evidence. The court rejected this at *8 with a humorous touch:
While the newly amended Federal Rules of Civil Procedure initially relieve a party from producing electronically stored information that is not reasonably accessible because of undue burden and cost, I am anything but certain that I should permit a party who has failed to preserve accessible information without cause to then complain about the inaccessibility of the only electronically stored information that remains. It reminds me too much of Leo Rosten’s definition of chutzpah: “that quality enshrined in a man who, having killed his mother and his father, throws himself on the mercy of the court because he is an orphan.”
Judge Facciola rejected Defendant’s undue burden and expense inaccessibility arguments, and granted Plaintiff’s motion to compel. He then ordered the parties to meet and discuss how the backup tapes would be restored and, as mentioned, how to search the restored emails, either through keyword searching as plaintiff had proposed, or via concept searching, which the judge suggested might be more efficient. (As a postscript, I understand the parties met, but rather than agreeing on a search method, they settled the case.)
Of course, keyword searches have been around for decades and are familiar to any lawyer who has ever done computer research. You can, for instance, run a computer search of hundreds of thousands of emails to find all emails that include one or more of a list of names, as plaintiff here proposed. This takes just seconds, but can produce a high percentage of irrelevant emails, ones that include the names but have nothing to do with the case. It can also omit many relevant emails that just do not happen to include the keywords you guessed a relevant email would have (or that included them, but misspelled, a problem not often found with computerized legal research).
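To make both failure modes concrete, here is a minimal sketch of a name-based keyword search in Python. The four sample emails and the name “Smith” are invented for illustration and have nothing to do with the actual case; real review platforms index far larger collections, but the matching logic is the same in spirit.

```python
import re

# Hypothetical sample emails; none of this text comes from the case.
emails = {
    1: "Smith approved the paratransit schedule change for riders.",
    2: "Lunch with Smith on Friday? The new place near the office.",
    3: "Smtih signed off on the rider complaint procedure.",  # name misspelled
    4: "The scheduling complaints from disabled riders are piling up.",
}

def keyword_search(docs, keywords):
    """Return the ids of documents containing any keyword as a whole word."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return sorted(doc_id for doc_id, text in docs.items() if pattern.search(text))

hits = keyword_search(emails, ["Smith"])
print(hits)  # [1, 2]
```

Email 2 is a hit even though it is only about lunch. Emails 3 and 4 are missed, one because the name is misspelled and the other because it discusses the subject without naming anyone.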
The use of complex Boolean connectors (such as directives that one term be within the same paragraph as another, or that an otherwise included email be excluded if it contains certain terms, i.e. “but not”) can sometimes improve on the search. So too can the use of so-called “fuzzy logic.” But even with the use of Boolean and fuzzy logic, these keyword searches, also known as “set-theoretic” searches, are still largely a guessing game. In practice, they often miss many relevant emails without significantly reducing the irrelevant ones.
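A rough sketch of these refinements follows, again in Python with invented sample emails. It assumes that “fuzzy logic” here means approximate string matching, which I approximate with the standard library’s `difflib.SequenceMatcher`, and it uses a simple token-distance test as a stand-in for a “within the same paragraph” connector. The function names and the 0.75 similarity threshold are my own choices, not any vendor’s.

```python
import re
from difflib import SequenceMatcher

# Hypothetical sample emails; none of this text comes from the case.
emails = {
    1: "Smith approved the paratransit schedule change for riders.",
    2: "Lunch with Smith on Friday? The new place near the office.",
    3: "Smtih signed off on the rider complaint procedure.",  # name misspelled
    4: "The scheduling complaints from disabled riders are piling up.",
}

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def has_term(text, term, threshold=1.0):
    """True if some token is similar enough to term (1.0 means exact match)."""
    term = term.lower()
    return any(
        SequenceMatcher(None, tok, term).ratio() >= threshold
        for tok in tokens(text)
    )

def within(text, a, b, distance):
    """True if terms a and b occur within `distance` tokens of each other."""
    toks = tokens(text)
    pos_a = [i for i, t in enumerate(toks) if t == a.lower()]
    pos_b = [i for i, t in enumerate(toks) if t == b.lower()]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def boolean_search(docs, must, must_not, threshold=1.0):
    """All `must` terms present (optionally fuzzy) BUT NOT any `must_not` term."""
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if all(has_term(text, t, threshold) for t in must)
        and not any(has_term(text, t) for t in must_not)
    )

exact_hits = boolean_search(emails, must=["smith"], must_not=["lunch"])
fuzzy_hits = boolean_search(emails, must=["smith"], must_not=["lunch"], threshold=0.75)
print(exact_hits)  # [1]
print(fuzzy_hits)  # [1, 3]
```

The “but not” clause drops the lunch email and the fuzzy match recovers the misspelled name, yet email 4 is still missed because the searcher has to guess the right words in advance.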
A search that creates a lot of noise, that is, one that produces too many irrelevant emails, can create very significant time and expense burdens on all the parties, but especially on the producing party. If for instance the search creates a list of 100,000 emails, the producing party will have to review all of these emails for possibly privileged communications before production. This is a very expensive undertaking, and although clawback agreements provide some comfort, they cannot obviate the need for, and expense of, the privilege review. It is also expensive for the receiving party, who has to spend time and money to review the irrelevant emails.
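Information retrieval researchers measure noise with two standard ratios: precision (the share of retrieved documents that are relevant) and recall (the share of relevant documents that were retrieved). A noisy search is one with low precision. The short Python sketch below computes both. The document ids and the relevance judgments are hypothetical and chosen only to illustrate the arithmetic.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: the search returned emails 1 and 2,
# but a human reviewer would judge emails 1, 3 and 4 to be relevant.
precision, recall = precision_recall(retrieved=[1, 2], relevant=[1, 3, 4])
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.33
```

In this toy case half of what must be reviewed for privilege is irrelevant, and two thirds of the relevant emails were never found.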
Therefore, if there is a better search method than keyword that can produce a high percentage of relevant hits, and thus less noise and less wasted time for privilege review, it is to the advantage of all parties to use it. Moreover, it is a potentially very valuable product. There are several software vendors who have created alternative search algorithms to keyword searches. All are sometimes lumped together as “concept searches.” They use a variety of methods, involving such things as contextual usages, algebraic modeling and probabilistic categories. The exact formulas are usually kept secret by the software vendors for obvious reasons, but most are prepared to provide expert testimony in court if necessary to justify the legitimacy of their search methods.
Paul & Baron’s article at pages 26-27 summarizes the existing state of information retrieval science in a “mind boggling,” but eloquent manner:
However, broadly speaking, information retrieval methods fall into three broad classes: set-theoretic (Boolean strings, supplemented by fuzzy search capabilities), algebraic (premised on the mathematical idea that the meaning of a document can be derived from the constituent terms in a document, and thus weighting retrieval by the proximity of a document’s terms in the form of two or higher dimensional maps, as in vector space modeling), and probabilistic (using language models and Bayesian belief networks, the latter of which involves making educated inferences about the relevance of future documents based on prior experience in reviewing documents in a given collection).
In thinking about retrieval problems, one can also supplement all of these methods by focusing on the language used by the creators of the records, which will include using taxonomies and ontologies, essentially synonyms of words and relevant classes of related words to be developed and built in at the front end of a search process to better refine the search, and to maximize both recall and precision. In contrast to strict set-based Boolean techniques, the above algebraic and probabilistic categories of search methods are often broadly termed under various forms of the heading “concept searching.”
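For readers who want a feel for the “algebraic” class, here is a minimal vector space sketch in Python. It is my own illustration under simplifying assumptions, not the method of Paul & Baron or of any vendor: each email becomes a vector of TF-IDF term weights, and emails are ranked by cosine similarity to a plain-language query. The sample emails, the query, and the particular smoothed IDF formula are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical sample emails; none of this text comes from the case.
emails = {
    1: "Smith approved the paratransit schedule change for riders.",
    2: "Lunch with Smith on Friday? The new place near the office.",
    3: "Smtih signed off on the rider complaint procedure.",
    4: "The scheduling complaints from disabled riders are piling up.",
}

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_vectors(docs):
    """Return ({doc_id: {term: weight}}, {term: idf}) for the collection."""
    counts = {doc_id: Counter(tokens(text)) for doc_id, text in docs.items()}
    n = len(docs)
    # Document frequency: in how many emails does each term appear?
    df = Counter(term for c in counts.values() for term in c)
    # Smoothed inverse document frequency (one common variant, assumed here).
    idf = {term: math.log((1 + n) / (1 + k)) + 1 for term, k in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in c.items()}
        for doc_id, c in counts.items()
    }
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(docs, query):
    """Return (doc_id, score) pairs, most similar to the query first."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokens(query)).items()}
    scores = {doc_id: cosine(q, vec) for doc_id, vec in vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = rank(emails, "rider complaints about paratransit scheduling")
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Email 4, which names no one and so would be missed by a names-only keyword search, ranks first, and the lunch email scores zero. This bare version still depends on shared words. As I understand it, the commercial tools go further with thesauri, dimension reduction, or probabilistic models so that related terms can match even when the exact words differ.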
A review of the various vendors who offer such marvels will have to await another time, as there are several of them now, and it is a growing field. Still, this is definitely something that you should look into before agreeing to simple keyword searches, especially if the volume of ESI to be reviewed is high. The concept search software fees are probably too expensive for small volume or low dollar cases, but they could be a huge money saver for a larger case. Most vendors will provide you with an idea of the price break point based on the byte size of the ESI. Of course, it is more than the number of megabytes involved. You also need to consider the case subject matter, the amount of money involved, and the importance and complexity of issues.
For more information on this subject check out the West Legalworks CLE webinar I did with Jason R. Baron – Director of Litigation, U.S. National Archives and Records Administration, College Park, Maryland; Doug Oard, Ph.D. – Associate Dean for Research, College of Information Studies, University of Maryland; and my law partner at Akerman in Los Angeles, Michael S. Simon. The 1.5 hour audio CLE is entitled The e-Discovery Search Quagmire: New Approaches to the Problem of Finding Relevant Needles in the Electronic Haystack.