Eighth Class: Keyword and Linear Review
This class covers Tested, Parametric Boolean Keyword Search and Linear Review.
Introduction to Keyword Search
The e-Discovery Team has extensive knowledge of how electronic document review is conducted by lawyers in the U.S. It comes with the territory of Losey’s work as an attorney in private practice with a large national law firm. He works with over fifty offices around the country dealing with e-discovery. He has seen it all. So too have the search experts that he works with at KrolLDiscovery. They handle even more cases. Bottom line, for most lawyers today keyword search is still king. They think that multimodal search means linear and keyword. Sad but true. It is as if the profession was stuck in the nineties. By taking this course you are joining an elite group.
The average lawyer in the U.S. knows only a little about legal search. The paucity of knowledge and skills is especially prevalent in lawyers in small to medium size cases and in lawyers who specialize in the representation of plaintiffs. Our knowledge of the practice of lawyers outside of the U.S. shows that things are pretty much the same world-wide.
We often must deal with opposing counsel who are mired in keywords, thinking they are the end-all and be-all of legal search. Moreover, they usually want to go about doing it without any testing. (We will go over some of the testing that they should be doing in this class.) Instead, they think they are geniuses who can just dream up good searches out of thin air. They cannot.
No one can, no matter what their intelligence. We know we cannot. Not unless we are already very familiar with the data-set in question through many prior reviews of that set. That kind of experience can give you the linguistic insight needed, at least for simple search projects. But even then, we know the limitations of keywrods (yes, intentional).
The inexperienced lawyers think they can guess right in every case, even without study of the data. They think they can guess right simply because they know what their legal complaint is about. They assume this knowledge somehow gives them special insights into what keywords were used by the witnesses in all relevant documents. This is delusional. They think they are state-of-the-art, but in fact they are using old search tools.
Knowledge of the case and law is not the same thing as knowledge of the documents. Moreover, inexperienced search lawyers have no idea as to the many limitations of keyword search. They are unacquainted with the scientific studies showing the poor recall using keyword search alone. Blair, David C., & Maron, M. E., An evaluation of retrieval effectiveness for a full-text document-retrieval system; Communications of the ACM Volume 28, Issue 3 (March 1985) (The study involved a 40,000 document case (350,000 pages). The lawyers, who were experts in keyword search, estimated that the Boolean searches they ran uncovered 75% of the relevant documents. In fact, they had only found 20%.). Also see: Grossman and Cormack, Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review, CoRR abs/1504.06868 (2015) at pgs. 2-3.
To put that into the standard scientific language shown in the Search Quadrant below, most of the relevant documents were never found by the lawyers in the Blair & Maron study; they were False Negatives. The lawyers thought they had found most of the relevant documents, but they had not. Their Recall was only 20%. That necessarily means that their False Negative rate was 80%. False Negatives are the relevant documents that they never found in their keyword searches.
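As a rough illustration of the Search Quadrant arithmetic, the sketch below computes Recall, Precision, F1 and the False Negative rate from the four quadrant counts. The counts are hypothetical, chosen only so that Recall comes out at the 20% reported for Blair & Maron in a 40,000 document collection; they are not the study's actual figures, and the `SearchQuadrant` class is an illustrative name, not part of any review platform.

```python
from dataclasses import dataclass


@dataclass
class SearchQuadrant:
    """Counts for the four cells of the Search Quadrant (a confusion matrix)."""
    true_positives: int   # relevant documents the search found
    false_positives: int  # irrelevant documents the search returned
    false_negatives: int  # relevant documents the search missed
    true_negatives: int   # irrelevant documents correctly left out

    def recall(self) -> float:
        relevant = self.true_positives + self.false_negatives
        return self.true_positives / relevant if relevant else 0.0

    def precision(self) -> float:
        retrieved = self.true_positives + self.false_positives
        return self.true_positives / retrieved if retrieved else 0.0

    def f1(self) -> float:
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0

    def false_negative_rate(self) -> float:
        relevant = self.true_positives + self.false_negatives
        return self.false_negatives / relevant if relevant else 0.0


# Hypothetical counts (assumption, for illustration only); they sum to 40,000.
quadrant = SearchQuadrant(true_positives=200, false_positives=200,
                          false_negatives=800, true_negatives=38_800)

print(f"Recall:              {quadrant.recall():.0%}")
print(f"Precision:           {quadrant.precision():.0%}")
print(f"F1:                  {quadrant.f1():.2f}")
print(f"False Negative rate: {quadrant.false_negative_rate():.0%}")
```

Note that Recall and the False Negative rate depend only on the relevant documents (found versus missed), while Precision depends only on what the search returned. A search can therefore look clean to the lawyer running it and still leave most of the relevant documents behind.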
Few keyword-search-obsessed attorneys have considered the substantial problem of false positives, meaning documents with the keywords that are not relevant. I cannot tell you how many times I see the word “complaint” in their keyword list. They also underestimate the problem of misspellings, odd language, special acronyms, nicknames and slang; not to mention intentional obfuscation. Here is a quick explanation of the Search Quadrant along with a war story of a document review project Losey was involved with.
The guessing involved in blind negotiated keyword legal search has always reminded me of the child’s game of Go Fish. I wrote about this in 2009 and the Go Fish phrase caught on after Judge Peck and others started citing to that article, which later became a chapter in my book, Adventures in Electronic Discovery, 209-211 (West 2011). The Go Fish analogy appears to be the third most popular reference in predictive coding case law, after the huge Da Silva Moore case in 2012 that Judge Peck and I are best known for.
From our experience with thousands of lawyers in real world cases there is no doubt in our minds that keyword search is still the dominant method used by most attorneys. It is especially true in small to medium-sized firms, but also in larger firms that have no bona fide e-discovery search expertise. Many attorneys and paralegals who use sophisticated, full-featured document review platforms such as KrolLDiscovery’s EDR still only use keyword search. They do not use the many other powerful search techniques of EDR, even though they are readily available to them. The Search Pyramid to them looks more like this, which I call a Dunce Hat.
The AI at the top, standing for Predictive Coding, is, for average lawyers today, still just a far off remote mountain top; something they have heard about, but never tried. Or if they have tried, it was the early poorly designed methods, Predictive Coding 1.0 or 2.0. Those methods were flawed in many ways as I have detailed. See: Predictive Coding 3.0 article, part one. The use of a control set, which required SME review of thousands of irrelevant documents, was a big waste of time that did not work. The required disclosure of irrelevant documents was also flawed. We have now fixed these early mistakes and others. For that reason, even though AI-enhanced legal search is my specialty, I am not worried about the slow development. I am confident that this will all change soon. Our new, easier to use methods will help, so too will ever improving software by the few vendors left standing. I continue to try to push them, but it is like steering a battleship.
The judges are already doing their part. No judge has ever disapproved the use of predictive coding, although they do refuse to require it (so far). Hyles v. New York City, No. 10 Civ. 3119 (AT) (AJP), 2016 WL 4077114 (S.D.N.Y. Aug. 1, 2016). So far they refuse to force predictive coding largely because of an old Principle of the Sedona Conference, Principle Six:
Responding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.
Principle Six presumes that the responding party always knows best. So the producing party can refuse to use modern tools and effective techniques for search and review if they want to. They can instead just use linear review, or use guessed keywords. This Principle is on shaky ground these days, to say the least, especially when applied to legal search. See: Protecting the Fourteen Crown Jewels of the Sedona Conference in the Third Revision of its Principles; and Sedona Principle Six: Overdue for an Overhaul (Ball in Your Court, October 2014). Also see: Ross-Williams, derivatively, on behalf of Sprint Nextel Corp. v. Sprint Nextel, Civil Action No. 11-cv-00890 (D.C., Kansas, 11/22/16) (Plaintiff’s counsel apparently chose linear review to run up a bill. One contract lawyer, Alexander Silow, spent 6,905 hours reviewing 48,443 documents at a charge of $1.5 million. The presiding Judge James Vano called the bill “Unbelievable!”).
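The figures reported in Ross-Williams can be put in per-document terms with simple arithmetic. The sketch below uses only the three numbers quoted above; the $1.5 million charge is treated as a round figure, so the results are approximations.

```python
hours_billed = 6_905
documents_reviewed = 48_443
total_charge = 1_500_000  # dollars; rounded figure as quoted above

docs_per_hour = documents_reviewed / hours_billed
cost_per_document = total_charge / documents_reviewed
hourly_rate = total_charge / hours_billed

print(f"Review speed:      {docs_per_hour:.1f} documents per hour")
print(f"Cost per document: ${cost_per_document:.2f}")
print(f"Effective rate:    ${hourly_rate:.2f} per hour")
```

On those numbers the linear review ran at about seven documents per hour, at roughly $31 per document.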
In spite of these obstacles, we are nonetheless confident that change will come. Soon the profession’s unhealthy obsession with keyword search will end. The profession will eventually embrace the higher levels of the search pyramid, analytics and active machine learning. High-tech propagation is an inevitable result of the next generation of lawyers assuming leadership positions in law firms and legal departments. The old-timey paper lawyers around the world are finally retiring in droves. The aging out of current leadership is a good thing. Their over-reliance on untested keyword search to find evidence is holding back our whole justice system. The law must keep up with technology and lawyers must not fear math, science and AI. They must learn to keep up with technology. This is what will allow the legal profession to remain a bedrock of contemporary culture. It will happen. Positive disruptive change is just under the horizon and will soon rise.
In the meantime we encounter opposing counsel every day who think e-discovery means to dream up keywords and demand that every document that contains their keywords be produced. The more sophisticated of this confederacy of dunces understand that we do not have to produce them, that they might not all be per se relevant, but they demand that we review them all and produce the relevant ones. Fortunately we have the revised rules to protect our clients from these kinds of disproportionate, unskilled demands. All too often this is nothing more than discovery as abuse.
This still dominant approach to litigation is really nothing more than an artifact of the old-timey paper lawyers’ use of discovery as a weapon. Let me speak plainly. This is nothing more than adversarial bullshit discovery with no real intent by the requesting party to find out what really happened. They just want to make the process as expensive and difficult as possible for the responding party because, well, that’s what they were trained to do. That is what they think smart, adversarial discovery is all about. Just another tool in their negotiate and settle, extortion approach to litigation. It is the opposite of the modern cooperative approach. That is one reason why so many lawyers still support Principle Six, even though it seems irrational as applied to legal search.
I cannot wait until these dinosaurs retire so we can get back to the original intent of discovery, a cooperative pursuit of the facts. Fortunately, a growing number of our opposing counsel do get it. We are able to work very well with them to get things done quickly and effectively. That is what discovery is all about. Both sides save their powder for when it really matters, for arguments over the meaning of the facts, the governing law, and how the facts apply to this law for the result desired.
Tested, Parametric Boolean Keyword Search
The biggest surprise from our 2016 TREC research was just how well sophisticated, test-based keyword search can perform under the right circumstances. We are talking about hands-on, tested keyword search. This is not naive, Go Fish keyword guessing in the blind, although it can start that way. It is based on looking at documents and tests of a variety of keywords. It is based on human document review and human file scanning, by which we mean very quick review of portions of files, for instance, just of subject lines. It is based on sampling, usually judgment based sampling, not random. It is also based on strong keyword search software that has parametric and Boolean features.
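To make the terms concrete, here is a minimal sketch of what a parametric Boolean search does: Boolean logic (AND, OR, NOT) over the words in a document, combined with metadata parameters such as custodian and date range. The `Document` fields, the sample documents and the `parametric_boolean_search` helper are all hypothetical; real review platforms have their own query syntax and far richer features.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    doc_id: int
    custodian: str
    sent: date
    subject: str
    body: str


def words(doc):
    """Lower-cased set of words in the subject and body."""
    return set(re.findall(r"[a-z0-9']+", f"{doc.subject} {doc.body}".lower()))


def parametric_boolean_search(docs, all_of=(), any_of=(), none_of=(),
                              custodians=None, start=None, end=None):
    """Return documents matching the Boolean terms and the metadata parameters."""
    hits = []
    for doc in docs:
        # Parametric (metadata) filters: custodian and date range.
        if custodians is not None and doc.custodian not in custodians:
            continue
        if start is not None and doc.sent < start:
            continue
        if end is not None and doc.sent > end:
            continue
        # Boolean filters: AND (all_of), OR (any_of), NOT (none_of).
        w = words(doc)
        if not all(term in w for term in all_of):
            continue
        if any_of and not any(term in w for term in any_of):
            continue
        if any(term in w for term in none_of):
            continue
        hits.append(doc)
    return hits


# Hypothetical documents, invented for illustration.
docs = [
    Document(1, "smith", date(2015, 3, 2), "Pricing call",
             "We agreed to raise the price with Acme."),
    Document(2, "smith", date(2015, 3, 2), "Lunch",
             "The price of lunch keeps going up."),
    Document(3, "jones", date(2015, 3, 9), "Re: Acme",
             "Do not put the price agreement in writing."),
    Document(4, "smith", date(2014, 1, 5), "Acme price list",
             "Attached is the public price list."),
]

hits = parametric_boolean_search(
    docs,
    all_of=["price"], any_of=["acme", "agreement"], none_of=["lunch"],
    custodians={"smith", "jones"},
    start=date(2015, 1, 1), end=date(2015, 12, 31),
)
print([d.doc_id for d in hits])  # documents 1 and 3
```

The bare keyword "price" alone would hit all four documents. The Boolean terms drop the lunch email and the date parameter drops the old price list, which is the kind of refinement that comes from looking at the documents rather than guessing.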
When keyword search is done with skill and is based on the evidence seen, typically in a refined series of keyword searches, very high levels of Precision, Recall and F1 are sometimes attainable. Again, the dataset and other conditions must be just right for it to be that effective, as explained in the diagram: simple data, clear target and good SME. Sometimes keywords are the best way to find clear targets like names and dates.
In those circumstances the other search forms may not be needed to find the relevant documents, or at least to find almost all of the relevant documents. These are cases where the hybrid balance is tipped heavily towards the human searchers. All the AI does in these circumstances, when the human using keyword search is on a roll, is double-check and verify that it agrees that all relevant documents have been located. It is always nice to get a free second opinion from Mr. EDR. This is an excellent quality control and quality assurance application from our legal robot friends.
We are not going to try to go through all of the ins and outs of tested keyword search in this TAR Course. There are many variables and features available in most document review platforms today that make it easy to construct effective keyword searches and otherwise find similar documents. This is the kind of thing that Kroll and Losey teach to the e-discovery liaisons in Losey’s firm and other attorneys and paralegals handling electronic document reviews. The passive learning software features can be especially helpful, so too can simple indexing and clustering. Most software programs have important features to improve keyword search and make it more effective. All lawyers should learn the basic tested, keyword search skills.
There is far more to effective keyword search than a simple Google approach. (Google is concerned with finding websites, not recall of relevant evidence.) Still, in the right case, with the right data and easy targets, keywords can open the door to both high recall and precision. But even then, for keyword search to work, even in those simple projects, it must be tested and must use metadata parameters and Boolean logic. Naive keyword search, the untested Go Fish variety, does not work, even with simple projects. That is one of the things we tested and proved in our post-hoc analysis of TREC 2016. See MrEDR.com for the TREC reports.
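One hedged way to picture the testing step: run each candidate keyword against a small sample of documents that a human has already reviewed and coded, then compare the hits to the coding. The sample, its coding and the `evaluate_keyword` helper below are invented for illustration; they are not taken from the TREC reports, and a real test would use a much larger judgmental sample.

```python
import re

# Hypothetical judgmental sample: (document text, coded relevant by a human reviewer).
coded_sample = [
    ("Customer complaint about late delivery of order 1182", False),
    ("Please file the complaint with the clerk by Friday", False),
    ("The brakes failed again on the Model 7, same defect as before", True),
    ("Engineering says the brake defect was known in March", True),
    ("Complaint from dealer: Model 7 brakes defective", True),
    ("Lunch menu for the week", False),
]


def evaluate_keyword(pattern, sample):
    """Return (precision, recall) of a regex keyword search on a coded sample."""
    regex = re.compile(pattern, re.IGNORECASE)
    tp = fp = fn = 0
    for text, relevant in sample:
        hit = regex.search(text) is not None
        if hit and relevant:
            tp += 1      # found and relevant
        elif hit:
            fp += 1      # found but not relevant
        elif relevant:
            fn += 1      # relevant but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A guessed keyword versus one refined after looking at the documents.
for pattern in (r"\bcomplaint\b", r"\bbrakes?\b"):
    precision, recall = evaluate_keyword(pattern, coded_sample)
    print(f"{pattern!r:20} precision={precision:.2f} recall={recall:.2f}")
```

In this toy sample the guessed keyword "complaint" finds only one of the three relevant documents and pulls in two irrelevant ones, while the keyword chosen after reading the documents finds all three. Real data is rarely this tidy, which is the reason to test rather than guess.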
Moreover, we found in 2015 and 2016 TREC that keyword search, even tested and sophisticated, does not work well in complex cases or with dirty data. It certainly has its limits and there is a significant danger in over-reliance on keyword search. It is typically very imprecise and can all too easily miss unexpected word usage and misspellings. That is one reason that the e-Discovery Team always supplements keyword search with a variety of other search methods, including predictive coding.
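The misspelling problem is easy to demonstrate. In the sketch below an exact keyword match misses a document with a transposed letter, and neither exact nor approximate matching catches an unexpected abbreviation. The documents are invented, and fuzzy matching from Python's standard `difflib` module is used purely as an illustration; it is not one of the search methods the e-Discovery Team describes.

```python
import difflib
import re

# Hypothetical documents, invented for illustration.
documents = [
    "Please review the attached contract before Monday.",
    "The contarct was signed last week.",     # misspelling
    "Send me the K when it is executed.",     # lawyer shorthand for contract
]

keyword = "contract"


def tokens(text):
    """Lower-cased alphabetic words in the text."""
    return re.findall(r"[a-z]+", text.lower())


# Exact keyword match: only the correctly spelled document is found.
exact_hits = [i for i, text in enumerate(documents) if keyword in tokens(text)]

# Approximate match: any word at least 80% similar to the keyword counts as a hit.
fuzzy_hits = [
    i for i, text in enumerate(documents)
    if difflib.get_close_matches(keyword, tokens(text), n=1, cutoff=0.8)
]

print("exact:", exact_hits)
print("fuzzy:", fuzzy_hits)
```

The third document is missed either way. No amount of spelling tolerance will find word usage that nobody thought to look for.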
Focused Linear Search – Key Dates & People
In Abraham Lincoln’s day all a lawyer had to do to prepare for a trial was talk to some witnesses, talk to his client and review all of the documents the client had that could possibly be relevant. All of them. One right after the other. In a big case that might take an hour. Flash forward one hundred years to the post-photocopier era of the 1960s and linear-style document review, reviewing them all, might take a day. By the 1990s it might take weeks. With the data volume of today such a review would take years.
All document review was linear up until the 1990s. Until that time almost all documents and evidence were paper, not electronic. The records were filed in accordance with an organization wide filing system. They were combinations of chronological files and alphabetical ordering. If the filing was by subject then the linear review conducted by the attorney would be by subject, usually in alphabetical order. Otherwise, without subject files, you would probably take the data and read it in chronological order. You would certainly do this with the correspondence file. This was done by lawyers for centuries to look for a coherent story for the case. If you found no evidence of value in the papers, then you would smile knowing that your client’s testimony could not be contradicted by letters, contracts and other paperwork.
This kind of investigative, linear review still goes on today. But with today’s electronic document volumes the task is carried out in warehouses by relatively low paid, document review contract lawyers. By itself it is a fool’s errand, but it is still an important part of a multimodal approach.
There is nothing wrong with Focused Linear Search when used in moderation. And there is nothing wrong with document review contract-lawyers, except that they are underpaid for their services, especially the really good ones. I am a big fan of document review specialists.
Large linear review projects can be expensive and difficult to manage. Moreover, linear review typically has only limited use. It breaks down entirely when large teams are used because human review is so inconsistent in document analysis. Losey, R., Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” (parts One, Two and Three) (December 8, 2013, e-Discovery Team). When review of large numbers of documents is involved, the consistency rate among multiple human reviewers is dismal. Also see: Roitblat, Predictive Coding with Multiple Reviewers Can Be Problematic: And What You Can Do About It (4/12/16).
Still, linear review can be very helpful in limited time spans and in reconstruction of a quick series of events, especially communications. Knowing what happened one day in the life of a key custodian can sometimes give you a great defense or a great problem. Either is rare. Most of the time Expert Manual Review is helpful, but not critical. That is why Expert Manual Review is at the base of the Search Pyramid that illustrates our multimodal approach.
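The "one day in the life of a key custodian" review mentioned above amounts to a narrow metadata filter followed by reading in time order. A minimal sketch, assuming a hypothetical `Email` record and invented sample messages:

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Email:
    sent: datetime
    sender: str
    recipients: tuple
    subject: str


def day_in_the_life(emails, custodian, day):
    """All email to or from `custodian` on `day`, in chronological order."""
    selected = [
        e for e in emails
        if e.sent.date() == day
        and (e.sender == custodian or custodian in e.recipients)
    ]
    return sorted(selected, key=lambda e: e.sent)


# Hypothetical messages, invented for illustration.
emails = [
    Email(datetime(2015, 3, 2, 16, 45), "smith", ("jones",), "Re: Acme call"),
    Email(datetime(2015, 3, 2, 9, 5), "jones", ("smith", "lee"), "Acme call at 10?"),
    Email(datetime(2015, 3, 3, 8, 0), "smith", ("lee",), "Follow up"),
    Email(datetime(2015, 3, 2, 11, 30), "lee", ("jones",), "Budget"),
]

for e in day_in_the_life(emails, "smith", date(2015, 3, 2)):
    print(e.sent.strftime("%H:%M"), e.sender, "->", ", ".join(e.recipients), "|", e.subject)
```

The filter does no relevance judging at all. It only narrows the set to something a person can read straight through, in order, which is the point of focused linear review.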
An attorney’s knowledge, wisdom and skill are the foundation of all that we do, with or without AI. The information that an attorney holds is also of value, especially information about the latest technology, but the human information roles are diminishing. Instead the trend is to delegate mere information level services to automated systems. The legal robots would not be permitted to go beyond information fulfillment roles and provide legal advice based on human knowledge and wisdom. Their function would be constrained to information processing and reports. The metrics and technology tools they provide can make it easier for the human attorneys to build a solid evidentiary foundation for trial.
Pause here to do this suggested “homework” assignment for further study and analysis.
SUPPLEMENTAL READING: Read the Go Fish article and then look for other articles and cases that mention it. Can you find any defenses at all to this still very common approach to legal search? Consider what other games might apply as a good analogy for the untested, guessing based approach to locating evidence. Also read the articles cited in this class on the Sedona Principle Six. See if you can find contra articles that defend Principle Six as it pertains to legal search. Read them and come to your own conclusion on this controversy. Also review the cited articles on the limits of keyword search and Blair and Maron’s work in the 1980s. Finally, if you have not already done so, read the latest, revised version of the e-Discovery Team’s Final Report for 2016 TREC. Study the findings and discussion on keyword search.
EXERCISES: Speculate as to why the guessing approach still seems so popular in the legal profession as a method to find evidence. How does the success of Google search play a part in lawyer preoccupation with keyword guessing? Why do you think the vast majority of lawyers still prefer the Dunce Hat approach to legal search where keyword and linear search are king?
On the issue of the Sedona Principles, consider your own position on Six. Speculate on why this Principle, unlike the others, has never changed. Try to understand both sides of this issue. Consider especially the inherent tension between Rule 26(b)(1) proportionality, which makes review costs central, and the responding party’s decision to use expensive, ineffective methods to conduct the review. Check out the details of the post-settlement hearings in Ross-Williams, derivatively, on behalf of Sprint Nextel Corp. v. Sprint Nextel, Civil Action No. 11-cv-00890 (D.C., Kansas, 11/22/16) (Plaintiff’s counsel apparently chose linear review to run up a bill).
Finally, on linear review, consider why we keep this ancient method on the search pyramid and still consider it a useful method to keep in a multimodal tool belt. Have you ever tried looking at the day in the life of a custodian’s email? That means looking at all of their email in and out on a particular day. What lessons did you learn about the issues and the custodian? It can be a good way to get into a custodian’s head, to see what they are like and how they operate.
Students are invited to leave a public comment below. Insights that might help other students are especially welcome. Let’s collaborate!
e-Discovery Team LLC COPYRIGHT 2017
ALL RIGHTS RESERVED
I think tailoring this to the audience of the small to mid-size firms would do well, as they are less likely to have the benefit of having vendor insight into the more typical keyword search process. Would there be any motion practice off of PACER that could be pulled setting forth some best practices in negotiations of e-discovery regarding the process in small cases, say employment or fraud?
Thanks for the comment.
Negotiations are basically the same regardless of size. There are several cases on keyword search under both the old (2006) and new rules (2015). You don’t have to use PACER to find them. They are published decisions and multiple commentators have written about them, including me.