You think you’ve got email problems? Jason R. Baron just received 200 million emails from the outgoing Bush administration! Jason is the Director of Litigation for the National Archives and Records Administration (“NARA”), the government agency responsible for maintaining and searching all of these emails, and more, in response to never-ending information requests. NARA keeps the permanent records of the U.S. government, including the emails and other records of the White House and Presidential Libraries. As difficult as it is to search for one relevant email in a universe of 200,000,000, the situation is getting worse all of the time. Jason expects the Obama administration, if it goes for two terms, to generate over a billion emails by 2017. These kinds of Carl Sagan type numbers (“Billions and Billions!”) help motivate Jason to think long and hard about the future of search and explain why he has reached out to information scientists for help. See, e.g., National Institute of Standards and Technology TREC Legal Track, the general TREC conference, and the DESI III at ICAIL 2009 workshop in Barcelona.
Jason shared his thoughts and science outreach efforts with about 60 law students at the University of Florida last week in a class that Bill Hamilton and I usually teach. Jason spoke for two hours before students who had previously read Jason’s scholarly writings on the subject and the landmark cases. Paul & Baron, Information Inflation: Can The Legal System Adapt? 13 Rich. J.L. & Tech. 10 (2007); Baron, Jason, Editor, The Sedona Conference® Best Practices Commentary on Search & Retrieval Methods (Aug. 2007); Baron, Jason, E-discovery and the Problem of Asymmetric Knowledge (Presentation at the Mercer Law School Ethics in the Digital Age Symposium, Nov. 2008); Disability Rights Council of Greater Wash. v. Wash. Metro. Area Transit Auth., 2007 WL 1585452 (D.D.C. June 1, 2007); United States v. O’Keefe, 2008 WL 449729 (D.D.C. Feb. 18, 2008); Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md., May 29, 2008). A two-hour class like this is the perfect format to give Jason sufficient time to present a full overview of his ideas and projects in this area. It also led to excellent questions and discussions, which are hard to come by in a typical attorney CLE program where few, if any, of the attendees actually study the material in advance.
I say Jason Baron had time to provide a full overview of his ideas because it would take a day or more to flesh out all of the details of his work on this subject. Typically, e-discovery CLEs include search as part of a curriculum, and, at best, you only hear Jason Baron as part of a panel with limited time. I know because I’ve been on two search panels with him. That is better than nothing, but not really adequate for a full airing of his views or mine. See, e.g., a webinar on March 19, 2009 where Jason is a panelist: Buyer Beware: How TREC Can Help You Evaluate Your E-Discovery Investments. This free Webinar promises to be better than most and I suggest you attend. Still, search is a critical issue for e-discovery and deserves a full day in-person seminar of its own, at least.
I call on the vendors out there to sponsor a two-day, ad-free, seminar devoted entirely to search. Then, Jason and others could take the time needed to really get into the meat of these issues. For instance, given a couple of hours, I could lay out my current thinking and practice on search and cost, a topic that I can only sketch in very broad outline in a 20-minute share of a panel discussion. I know that Anne Kershaw and Patrick Oot, among others, also have important insights to share with the e-discovery community on the topic of search.
Tobacco Litigation e-Discovery
Jason began his presentation with a story of his experience assisting trial attorneys in the Department of Justice on the tobacco litigation team. In the early 2000s, the team at DOJ worked with various agency counsel (including Jason representing NARA) on the task of responding to discovery requests from the tobacco industry in U.S. v. Philip Morris. This was a mammoth e-discovery project. There were 1,726 Requests to Produce propounded by tobacco companies against 30 federal agencies for tobacco related records.
The hardest part of the project was the search of 32 million Clinton era email records. It started with Jason and his team studying the requests and “dreaming up” 12 keyword combinations to search/cull the 32 million emails. They ran some tests on samples and then had the good sense to do something that was then new and daring: they told the tobacco company requesting parties what the search terms were and invited them to participate. The tobacco company lawyers responded favorably and suggested some new terms that were then explored. This was followed by more sampling to find “noisy” terms, that is, keyword terms that generated too many false positives (Marlboro, PMI, TI, etc.). The results were reported back to the opposing counsel and a consensus was reached as to additional terms to be used in the search protocol. Then and only then was the full search run against the 32 million emails. Here is an example that Jason gave of one of the Boolean search strings that were used in the search:
(((master settlement agreement OR msa) AND NOT (medical savings account OR metropolitan standard area)) OR s. 1415 OR (ets AND NOT educational testing service) OR (liggett AND NOT sharon a. liggett) OR atco OR lorillard OR (pmi AND NOT presidential management intern) OR pm usa OR rjr OR (b&w AND NOT photo*) OR phillip morris OR batco OR ftc test method OR star scientific OR vector group OR joe camel OR (marlboro AND NOT upper marlboro)) AND NOT (tobacco* OR cigarette* OR smoking OR tar OR nicotine OR smokeless OR synar amendment OR philip morris OR r.j. reynolds OR (“brown and williamson”) OR (“brown & williamson”) OR bat industries OR liggett group)
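To make the role of the AND NOT clauses concrete, here is a minimal Python sketch of how a single “term AND NOT noisy phrase” pair filters out false positives. It is only an illustration: the four sample emails, the helper names `contains` and `and_not`, and the whole-word matching rule are my own assumptions, not the search engine or data actually used in the tobacco litigation.

```python
import re

def contains(text, phrase):
    """Case-insensitive whole-word (or whole-phrase) match. Assumed matching rule."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def and_not(text, term, exclusion):
    """True when the text has the term but not the noisy phrase."""
    return contains(text, term) and not contains(text, exclusion)

# Hypothetical emails, invented for illustration only.
emails = {
    "a": "PMI asked about the master settlement agreement timeline.",
    "b": "She starts as a Presidential Management Intern (PMI) in June.",
    "c": "Directions to the Upper Marlboro courthouse are attached.",
    "d": "Marlboro advertising budget for next year.",
}

# Two of the (term AND NOT noise) pairs from the search string above.
rules = [
    ("pmi", "presidential management intern"),
    ("marlboro", "upper marlboro"),
]

selected = sorted(
    key for key, text in emails.items()
    if any(and_not(text, term, noise) for term, noise in rules)
)
print(selected)
```

Emails “b” and “c” contain the keywords but are dropped because they also contain the noisy phrases, which is exactly the kind of false positive the sampling rounds were meant to expose.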
As a result of the search, 99% of the documents were culled out. But that still left 320,000 emails, plus attachments. About half of those were found to be relevant, which, in my experience, is a high precision ratio. Of the relevant emails and attachments, about 20% were found to be privileged. They were logged and withheld, and the 80% balance of relevant files were produced. Although I am sure the documents uncovered were of some help to both sides, the sad truth is, none were ever used as an exhibit at trial.
The One Percent Solution Does Not Scale
The parties in the tobacco litigation were, under Jason’s leadership, able to cooperate and agree upon Boolean search parameters that reduced the total universe to be reviewed for production by 99%. That is, in my experience, a very high cull rate. The use of keyword-based culling alone can rarely, if ever, go beyond the one percent barrier. That is especially true in a negotiated term setting. In the tobacco case the government was willing to review the one percent remaining after culling, here 320,000 emails. The case was big enough (billions of dollars were at stake) and the U.S. government could afford the millions of dollars required for the review and production.
Jason then explained that the core problem is that the one percent solution does not scale. The government could afford to review and produce one percent of the Clinton era email, but cannot afford to review and produce one percent of Bush’s email, which equals 2 million emails (1% of 200,000,000 = 2,000,000), much less the expected email of Obama (1% of 1,000,000,000 = 10,000,000). What would it cost and how long would it take to review ten million emails (1% of 1 billion)? Jason estimates it would cost at least $20 Million and take a team of 100 lawyers working 10-hour days, seven days a week, over 28 weeks. I personally think that is an underestimate in time and cost. But regardless, it is far more than the federal government can afford or is willing to pay for a discovery request (even if, in my opinion, not Jason’s, some of the judges on the D.C. Circuit Court of Appeals do not appear to care how much discovery costs as the decision In Re Fannie Mae Litigation suggests).
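The arithmetic behind the scaling problem can be worked through in a few lines of Python. This is just a back-of-the-envelope sketch using the figures quoted above; the variable names are mine, and the per-hour and per-email rates are simply what those figures imply, not numbers Jason gave.

```python
# Collection sizes quoted in the text; a 99% cull leaves 1 email in 100.
collection_sizes = {
    "Clinton": 32_000_000,
    "Bush": 200_000_000,
    "Obama (projected)": 1_000_000_000,
}
emails_to_review = {name: total // 100 for name, total in collection_sizes.items()}

# Jason's estimate for the projected Obama review.
lawyers = 100
hours_per_day = 10
days_per_week = 7
weeks = 28
estimated_cost_dollars = 20_000_000

lawyer_hours = lawyers * hours_per_day * days_per_week * weeks
obama_review = emails_to_review["Obama (projected)"]

# Rates implied by the estimate (derived here, not stated in the talk).
emails_per_lawyer_hour = obama_review / lawyer_hours
cost_per_email = estimated_cost_dollars / obama_review
cost_per_lawyer_hour = estimated_cost_dollars / lawyer_hours

for name, count in emails_to_review.items():
    print(f"{name}: {count:,} emails left after a 99% cull")
print(f"Total lawyer hours: {lawyer_hours:,}")
print(f"Implied pace: {emails_per_lawyer_hour:.1f} emails per lawyer hour")
print(f"Implied cost: ${cost_per_email:.2f} per email, "
      f"${cost_per_lawyer_hour:.2f} per lawyer hour")
```

On these figures each lawyer would have to clear about 51 emails an hour, every hour, for 196,000 lawyer hours, at roughly $2 per email.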
Here is how Jason summed up the problem of scale in his talk to U.F. law students:
One percent of a billion after a keyword search is too much. Something has got to change… You have to take that huge volume and somehow cut down the haystack as much as possible that’s reasonable to do searches against, and then those searches need to be more efficient than what they are today. But that problem is a hard one; doing efficient searches is very hard.
Jason then explained some of the many reasons that search of large, heterogeneous data collections is so hard to do. They include such things as “Polysemy,” which means ambiguous terms (e.g., “George Bush,” “strike”), “Synonymy,” which means variation in describing the same person or thing in a multiplicity of ways (e.g., “diplomat,” “consul,” “official,” “ambassador,” etc.), and “Pace of Change,” which refers to the never-ending development of new communication media and languages (e.g., Twitter, text messaging, and computer gaming, with terms such as “POS” and “1337”).
The Myth of Search & Retrieval
Most litigation lawyers today do not understand just how hard it is to search large data-sets. They think that when they request production of “all” relevant documents (and now ESI), “all or substantially all” will in fact be retrieved by existing manual or automated search methods. This is a myth. The corollary of this myth is that the use of “keywords” alone in automated searches will reliably produce all or substantially all relevant documents from a large document collection. Again, most litigators think this is true, but it is not. That is not just Jason’s opinion, or my opinion; it is what scientific, peer-reviewed research has shown to be true.
Electronic documents that are relevant to a request for information, and are retrieved by a search process, are referred to as “True Positives.” These are the files we want. We do not want a search to retrieve irrelevant files. The irrelevant files that are not retrieved are called “True Negatives.” In an ideal, perfect world, our automated search would find all relevant files, and only relevant files. We would have 100% True Positives and 100% True Negatives. But in reality, it never works that way, at least not in large sets of data. In reality, a search retrieves both relevant files and irrelevant files. The irrelevant files retrieved are called False Positives.
The proportion of retrieved files that are relevant, that is, True Positives divided by the sum of True Positives and False Positives, is referred to in information science as “Precision.” High Precision is good; it means you spend less time reviewing irrelevant files. That saves money and thus is very important to real world e-discovery. In the Blair and Maron study, for instance, the Precision was 79%, while the Recall was only 20%. That means that 79% of the documents retrieved by the search were relevant, a high rate of Precision in my experience, but 80% of the relevant documents were not retrieved.
The relevant documents that are not found by a search are called “False Negatives.” The proportion of all relevant documents that the search actually retrieves, that is, True Positives divided by the sum of True Positives and False Negatives, is the “Recall” rate. Thus, in the Blair and Maron study, which was again confirmed in the TREC study, for every 100 relevant files in the collection, the keyword search identified only 20, the True Positives, and failed to see 80, the False Negatives. In an ideal, perfect search, which again is impossible for large data-sets, you would find all relevant documents and achieve 100% Recall. Information science research has discovered that in the search of large data-sets there is a typical trade-off between Recall and Precision, such that the higher your Precision, the lower your Recall, and vice versa. This is shown in the graph below that I have taken from Jason’s PowerPoint.
Thus, for example, if your search only uncovered five documents, and you were lucky enough that all five were relevant, then you would have 100% Precision. There would be no False Positives. But in that circumstance, you would likely have attained a very low Recall rate. You may have found five relevant files, but left behind another five hundred. Thus, in that example, your Recall would be 5/505, or slightly less than one percent (0.99%). That is the basic stuff of search analysis. The next instructional step after that, in my opinion, requires venturing into the world of sampling and thus is one of those things that requires a full day seminar, and (horrors) more math.
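For readers who like to see the definitions as formulas, here is a small Python sketch of Precision and Recall applied to the five-document example and to the Blair and Maron percentages. The function names are my own, and the Blair and Maron counts are simply the reported percentages expressed per 100 documents, not the study’s raw data.

```python
def precision(true_positives, false_positives):
    """Share of retrieved documents that are relevant."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives, false_negatives):
    """Share of all relevant documents that were retrieved."""
    return true_positives / (true_positives + false_negatives)

# The five-document example: 5 retrieved, all relevant, 500 relevant left behind.
five_doc_precision = precision(true_positives=5, false_positives=0)
five_doc_recall = recall(true_positives=5, false_negatives=500)
print(f"Five-document search: precision {five_doc_precision:.0%}, "
      f"recall {five_doc_recall:.2%}")

# Blair and Maron percentages, restated per 100 documents (illustrative counts).
bm_precision = precision(true_positives=79, false_positives=21)
bm_recall = recall(true_positives=20, false_negatives=80)
print(f"Blair and Maron: precision {bm_precision:.0%}, recall {bm_recall:.0%}")
```

Note that the two measures use different denominators: Precision looks only at what was retrieved, while Recall looks at everything relevant in the collection, including what the search never saw.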
TREC Legal Track
The Recall Precision trade-off is a problem well known to all of the participants in the Legal Track of the TREC conferences. The Legal Track supervises an open data search experiment and sponsors an annual meeting where the results are discussed and debated in academic fashion. The participants are primarily professors and their students from information science departments, plus a few attorneys like Jason, and recently a few e-discovery vendors as well. In addition to Jason Baron, the coordinators for the 2008 TREC Legal Track were Bruce Hedin, Ph.D., Douglas W. Oard, Ph.D., and Stephen Tomlinson.
I look forward to the latest Legal Track Overview article to be published later this month detailing the results of the experiments and findings in 2008. In the meantime, I will try to explain some of the basics here. Any over-simplifications and errors are solely my own, not Jason’s or anyone else’s. Go here for the official, lengthy report on the 2007 TREC Legal Track. Also see Sedona Conference Open Letter on the 2008 TREC Legal Track.
The TREC Conference series is sponsored by the National Institute of Standards and Technology (NIST). It is designed to promote research into the science of information retrieval in general and has a number of different fields of study, or “Tracks.” The first TREC conference was in 1992. At the 15th conference, held in 2006, Jason Baron and his colleague Doug Oard at the University of Maryland convinced TREC to begin a new Legal Track for the study of problems faced by attorneys searching large data sets to respond to discovery requests. The TREC Legal Track was thus born in 2006 and has continued every year thereafter. This is the first time this kind of study has been performed using non-proprietary data since the Blair and Maron research in 1985.
TREC Legal Track sets up a search problem using hypothetical legal complaints and “requests to produce” with over 100 categories created to date. The requests are drafted by members of The Sedona Conference with litigation experience. “Boolean negotiations” were then conducted by a control group of expert attorneys simulating real-world conditions. They agreed upon baseline keyword search terms with Boolean operators and wildcards to retrieve data relevant to the requested categories. These categories varied tremendously from the dry and serious shown in the example below, to the slightly whimsical, such as a category requesting all documents making a connection between the music and songs of Peter, Paul, and Mary, Joan Baez, or Bob Dylan, and the sale of cigarettes. Here is the example provided as to how the negotiations went for one of the 100 topics:
Request Number: 52
Request Text: Please produce any and all documents that discuss the use or introduction of high-phosphate fertilizers (HPF) for the specific purpose of boosting crop yield in commercial agriculture.
Proposal by Defendant (recipient of discovery): “high-phosphate fertilizer!” AND (boost! w/5 “crop yield”) AND (commercial w/5 agricultur!)
Rejoinder by Plaintiff (requestor of discovery): (phosphat! OR hpf OR phosphorus OR fertiliz!) AND (yield! OR output OR produc! OR crop OR crops)
Final Query (as agreed to by the parties): ((“high-phosphat! fertiliz!” OR hpf) OR ((phosphat! OR phosphorus) w/15 (fertiliz! OR soil))) AND (boost! OR increas! OR rais! OR augment! OR affect! OR effect! OR multipl! OR doubl! OR tripl! OR high! OR greater) AND (yield! OR output OR produc! OR crop OR crops)
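For readers unfamiliar with the syntax, here is a rough Python sketch of how the two special operators in these queries behave: “!” for truncation (so “fertiliz!” matches fertilizer, fertilizing, and so on) and “w/15” for proximity. It is a simplified illustration under my own assumptions: that w/N means within N words in either order, that hyphens split words, and that the quoted “high-phosphat! fertiliz!” phrase can be folded into the proximity clause. It is not the query engine TREC used, and the sample sentences are invented.

```python
import re

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_matches(token, pattern):
    """A trailing '!' means truncation: match any word with that prefix."""
    if pattern.endswith("!"):
        return token.startswith(pattern[:-1])
    return token == pattern

def positions(tokens, patterns):
    """Indexes of tokens matching any of the patterns (an OR group)."""
    return [i for i, tok in enumerate(tokens)
            if any(term_matches(tok, p) for p in patterns)]

def within(tokens, left, right, n):
    """Assumed meaning of 'left w/n right': within n words, either order."""
    return any(abs(i - j) <= n
               for i in positions(tokens, left)
               for j in positions(tokens, right))

def final_query_simplified(text):
    """Simplified version of the agreed Final Query for Request 52."""
    tokens = tokenize(text)
    subject = bool(positions(tokens, ["hpf"])) or within(
        tokens, ["phosphat!", "phosphorus"], ["fertiliz!", "soil"], 15)
    change = bool(positions(tokens, [
        "boost!", "increas!", "rais!", "augment!", "affect!", "effect!",
        "multipl!", "doubl!", "tripl!", "high!", "greater"]))
    outcome = bool(positions(tokens, ["yield!", "output", "produc!", "crop", "crops"]))
    return subject and change and outcome

# Hypothetical documents, invented for illustration only.
doc_hit = "Trials showed that high-phosphate fertilizers increased crop yield on commercial farms."
doc_no_change = "The HPF budget memo is attached."
doc_no_subject = "Phosphorus levels in the river rose after the storm; crop reports follow."

for doc in (doc_hit, doc_no_change, doc_no_subject):
    print(final_query_simplified(doc), "-", doc)
```

Only the first document satisfies all three AND-ed groups. The second mentions HPF but nothing about a change or an outcome, and the third mentions phosphorus but not within reach of any fertilizer or soil term.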
A search was then made of the chosen public document database using the agreed protocols. In 2006, 2007, and 2008 TREC used the nearly 7 million document database from the tobacco litigation. These documents are a set of OCR-scanned TIFF files. The next study, in the summer and fall of 2009, will use the Enron litigation public data-set. I expect this collection will have no OCR scanning errors and thus, in my opinion, be more reflective of modern practice. See Text REtrieval Conference (TREC) Call to TREC 2009 (more information on the Legal Track will be available soon).
The various search teams participating then ran their own searches of the same database. Up until 2008 most of the participating teams were information scientists from universities, but in 2008 two e-discovery vendors joined the project, H5 and Clearwell Systems. The public database is, of course, totally unstructured and disorganized, and, like real life, it is filled with spelling errors, scanning errors, and language idiosyncrasies. The search teams used various automated methods and protocols to try to locate documents in the database responsive to various categories.
The experiment was, among other things, designed to evaluate the Precision and Recall of the various search methods used by the teams and to compare their results with the arm’s-length, expert attorney negotiated search terms. The negotiated keyword search method did about the same as the original Blair and Maron study with an approximate average 22% Recall rate based on sampling. This means that once again approximately 78% of the relevant documents were not found by the approach now most commonly employed by attorneys. Some of the automated search methods used by the various teams beat this 22% Recall rate, but usually not by much, and not consistently over all categories. The degree of success depended upon the particular category. But, I am pleased to report that higher Recall, up to 81%, was achieved for at least one individual topic in the so-called “Interactive task,” which more closely models e-discovery practice than TREC’s set piece “ad hoc” task. The Interactive task used “Topic Authorities” drawn from the ranks of the Sedona Conference who acted in the role of senior litigators giving advice and feedback about the topics to the participating teams, and the participating teams spent far greater overall resources in attempting to respond to only one, two, or three topics. I am not sure if the 81% figure was for the Bob Dylan topic or what, but it is a far cry from the 22% average of keyword searches and shows great hope for the future of search.
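Since Recall figures like these rest on sampling rather than a full review, a short Python sketch may help show the basic idea. To be clear, this is not TREC’s actual estimation method, which is considerably more sophisticated. It is one simple approach, sampling the documents the search did not retrieve, and the function name and all of the numbers are hypothetical.

```python
def estimate_recall(true_positives, unretrieved_total, sample_size, sample_relevant):
    """Estimate Recall from a random sample of the unretrieved documents.

    true_positives:    relevant documents found in the retrieved set
    unretrieved_total: number of documents the search did not retrieve
    sample_size:       how many unretrieved documents were randomly reviewed
    sample_relevant:   how many of those sampled documents were relevant
    """
    if not 0 < sample_size <= unretrieved_total:
        raise ValueError("sample_size must be between 1 and unretrieved_total")
    # Project the sample's miss rate onto everything the search left behind.
    estimated_false_negatives = (sample_relevant / sample_size) * unretrieved_total
    denominator = true_positives + estimated_false_negatives
    if denominator == 0:
        raise ValueError("no relevant documents found or estimated")
    return true_positives / denominator

# Hypothetical numbers: 220 relevant documents retrieved, 99,500 documents
# not retrieved, and 39 relevant documents in a random sample of 5,000 of them.
estimated = estimate_recall(
    true_positives=220,
    unretrieved_total=99_500,
    sample_size=5_000,
    sample_relevant=39,
)
print(f"Estimated recall: {estimated:.1%}")
```

With these made-up inputs the estimate comes out at about 22%. The point of the exercise is that a reviewer never has to read all 99,500 unretrieved documents to get a defensible estimate, although a real study would also report a margin of error.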
The chart below from Jason illustrates the estimated Recall rates attained by search teams on various categories (topics) in the 2006 experiment. The green part of the bar represents the baseline keyword search results. The yellow and red parts of the bar represent additional relevant documents captured by alternative methods employed by the search teams. As you can see, keyword search did just fine, comparatively speaking, for a couple of topics, but for most it was outperformed by newer methods.
The TREC Legal Track is grappling with three fundamental issues. In Jason’s words:
(1) How can one go about improving rates of Recall and Precision (so as to find a greater number of relevant documents, while spending less overall time, cost, etc., sifting through noise?)
(2) What alternatives to keyword searching exist?
(3) Are there ways in which to benchmark alternative search methodologies so as to evaluate their efficacy?
This is a work in progress and there are now far more questions than answers. That is what research is all about. The importance of TREC to the legal community has already been recognized by many scholars and at least two leading jurists, Judge Grimm in Victor Stanley and Judge Scheindlin in Securities and Exchange Commission v. Collins & Aikman Corp., 2009 WL 94311 (S.D.N.Y., Jan. 13, 2009).
Jason ended his presentation at U.F. by inviting all of the law students there to be a part of this scientific exploration. TREC Legal Track is seeking volunteers to review files in the next experiment and make a determination of relevance. Participants will receive a detailed explanation of what files should be considered relevant and will then review thousands of files and classify them as either irrelevant, relevant, or highly relevant. See: last year’s Call for Participation by Relevance Assessors.
In view of the high number of electronic documents that must be manually reviewed for relevance, Jason and the Legal Track need hundreds of volunteers, all willing to donate substantial time to this worthy, scientific endeavor. It is open to all law students, paralegals, and attorneys.
Jason has given me permission to invite all of my student, paralegal, and attorney readers to join in the experiment and become a reviewer for the 2009 experiment. The review time will be needed in August and September of 2009. You can control the amount of review work you take on and do the work at home, or wherever you want, at any times you want, 24/7. Have trouble sleeping, have free time, tired of golf? Don’t waste your time watching tv or cruising the Internet. Instead, contribute some of your time and expertise to the advancement of science. Law students can receive pro bono credit in schools where that is required and we can all pad our resume with a really cool line item. For more information on how you can help, email Jason Baron at firstname.lastname@example.org.
The old days of simple keyword search for relevant documents are coming to an end. We can no longer afford its gross inefficiencies and its outrageous expense. There is simply too much data in lawsuits today to continue using this method of search from the 1980s. It was only able to recall 20% of the relevant information when it first started in the 1980s, and still does little better than that today, even in the hands of experts. My guess is that average lawyers with no special expertise in keyword search are only achieving Recall of 10% to 15%, but like the attorneys in the Blair and Maron study, think they are getting most of it. The power of myth is strong.
There has got to be a better way than negotiated keyword search. Many people are working on this problem right now, myself included, and breakthroughs are imminent. As Jason Baron put it at the end of his session at U.F.:
We are just at the beginning, sort of the dawn of some new paradigm in the law. There is something happening out there, something different – and you can feel it.
As one vendor I know likes to say, “catch the wave.” Be a part of the solution and make yourself relevant; contact Jason Baron today and volunteer to be a relevancy reviewer. This is a rare chance to be a part of science, a part of history. Don’t let it pass you by.