Now finally we come to the conclusion of this series on the Secrets of Search where all will be revealed. Secrets of Search: Parts One, Two, and Three. (Well, to be entirely honest, not all will be revealed. I’m still going to keep a few trade secrets up my sleeve for law partners and family.) As you can see by the photo, junior here was quite astonished by the latest revelations. I hope you will be too.
Recap of the First Three Secrets
Before I get to the fourth secret of search, I need to review the first three again and connect a few more dots. The first secret was already known to many. (Craig Ball said it was about as much of a secret as the square root of 256.) It was that keyword search, done alone, as part of a blind Go Fish game of dueling attorneys, is remarkably ineffective. Keyword search only works when performed as part of an interactive, multi-modal process, one that uses constant sampling and review. Still, keyword search is yesterday’s (1960s) technology. No matter how many Boolean bells and whistles, interactive refinements, and quality controls you may add to keyword search, its only real strengths are familiarity and quick peeks. The future of legal search, the best promise for adequate precision and recall, lies in artificial intelligence software. By this I mean the so-called predictive coding algorithms, where expert humans train computer agents, plus ever improving legal methods.
The second secret really was a secret, kind of like knowledge of the square root of two was in ancient Greece. This secret was little known outside of information science circles, whose members, in speech at least, tend to emulate the Pythagoreans in enigmaticalities. This secret is that the gold standard used to test precision and recall is, like keyword search, remarkably ineffective. That so-called gold standard is human review. This is a very imprecise, very fuzzy standard. The few studies we have on big data projects, ones where humans reviewed thousands of documents for days on end, reveal terribly inconsistent relevancy calls. (Not surprising when you consider how bleary-eyed and underpaid they were.) For instance, in the $14 Million Verizon project, human reviewers only agreed 28% of the time. This means that our yardstick for recall measurements has nothing smaller on it than a foot. All claims of precision within a few inches are bull. We really have no way of knowing that.
As information scientist William Webber notes, our maximum possible mean precision and recall rate (“F1”), measured objectively, is only 44%, and other studies suggest a somewhat higher F1 rate of 66%. This is very significant because it means there is no objective basis ever to demand a recall rate of better than 66%. A requesting party that asks for recall better than that is asking for something that cannot be reliably measured.
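For readers who want the arithmetic behind the F1 measure, here is a minimal sketch in plain Python (no e-discovery tooling assumed). F1 is simply the harmonic mean of precision and recall:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision: what fraction of the documents retrieved are relevant.
# Recall: what fraction of all relevant documents were retrieved.
# A review that finds 80% of the relevant documents (recall = 0.8)
# but with only 30% precision scores:
print(round(f1_score(0.3, 0.8), 3))  # 0.436
```

Because F1 is a harmonic mean, it is dragged down sharply by whichever of the two numbers is worse, which is why the 44% and 66% ceilings discussed above are so constraining.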
Logically, this also means random samples with a 95% confidence level and a ±2% confidence interval are also unrealistically demanding. Plus or minus 5% might be more realistic considering the vagaries of our measurements and subjective determinations. I favor random sample buttons on software, but I want our use of them to be realistic and not budget busting. What is the point of such accuracy when the underlying data is so fuzzy? Demands for a 99% confidence level, or a plus-or-minus-one confidence interval, are completely misplaced and illogical. Our measuring stick is too imprecise to justify such large sample sizes. The experts who ask for that kind of delusional certainty have not understood the second secret. Either that, or they are just trying to drive up the costs of the other side’s quality control efforts.
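To see why tighter confidence intervals are so expensive, consider the standard worst-case sample size formula for estimating a proportion. The sketch below uses textbook z-scores and the maximum-variance assumption (p = 0.5); it is an illustration of the statistics, not a recommendation for any particular matter:

```python
import math

# Worst-case sample size for estimating a proportion at a given
# confidence level and confidence interval (margin of error).
Z = {0.95: 1.96, 0.99: 2.576}  # standard two-tailed z-scores

def sample_size(confidence: float, interval: float, p: float = 0.5) -> int:
    z = Z[confidence]
    # Tiny epsilon guards against floating-point round-up when the
    # true value lands exactly on an integer.
    return math.ceil(z * z * p * (1 - p) / (interval * interval) - 1e-9)

print(sample_size(0.95, 0.05))  # ±5% at 95% confidence: 385 documents
print(sample_size(0.95, 0.02))  # ±2% at 95% confidence: 2,401 documents
print(sample_size(0.99, 0.01))  # ±1% at 99% confidence: 16,590 documents
```

Note how the sample size balloons as the interval narrows: tightening from ±5% at 95% to ±1% at 99% multiplies the review burden more than fortyfold, which is the budget busting effect described above.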
Still, sampling is a powerful tool if used right, and if you understand what it can, and cannot, do. For instance, it cannot by itself improve the accuracy of a search at all. It is just a tool to get an idea of how you are doing in your search processes. Since I am a strong proponent, and have been urging all software providers to add random sample generators to their programs for years, I decided to practice what I preach and figured out a way to add one to this blog. It can now always be found on the blog sidebar on the right, identified as a Math Tool for Quality Control.
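In case it helps to see what such a random sample generator actually does, here is a minimal sketch, assuming documents are identified by simple numeric IDs (the corpus and sample sizes here are hypothetical):

```python
import random

def draw_qc_sample(doc_ids, sample_size, seed=None):
    """Draw a simple random sample of document IDs for quality control.

    Remember: sampling only tells you roughly how the search is doing;
    it does not, by itself, make the search any more accurate.
    """
    rng = random.Random(seed)  # seeded for a repeatable audit trail
    return rng.sample(list(doc_ids), sample_size)

# Example: pull 385 documents at random from a 100,000-document
# candidate set for a second-pass relevance check.
corpus = range(1, 100_001)  # stand-in for real document identifiers
sample = draw_qc_sample(corpus, 385, seed=42)
print(len(sample))  # 385
```

Sampling without replacement, as `random.sample` does, is what you want here: no document should be checked twice at the expense of another.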
The third secret is that even though humans are terrible at large-scale reviews, it is a completely different story when dealing with small-scale reviews. When reviewing small sets of data, in the 500-1,000 document range (this is the number of documents reviewed by the individual TREC reviewers), there were several professional reviewers in TREC who were more precise and had better recall than the best computer systems, even though they were not subject matter experts and had no access to such experts. Even a couple of the law students won a few times. Webber’s analysis showed that the complete demise of human reviewers has been grossly exaggerated. Re-examining the Effectiveness of Manual Review.
Although pure manual review is good for a few hours, it is poor and inaccurate over large scales, as the second secret revealed. Even if it were not, manual review is far too expensive and slow for large-scale review projects. We cannot go it alone. We need the machines. But we also need to keep the arts alive, the special skills of persuasion and evidence evaluation that we lawyers have refined over centuries. (More on that in the fifth secret at the end of this blog.)
Requesters who demand production with only machine review, and any responders foolish enough to comply, have not understood the third secret. It is way too risky to turn it all over to the machines. They are not that good! The reports of their excellence have been grossly over-stated. Humans, there is need for you yet. The Borg be damned! Jobs may have passed away, but his work continues. Technology is here to empower art, not replace it. (For more on this see the blog comments at the end.)
Webber’s research, and the common experience of our best law firms and vendor review teams nationwide, suggest that a hybrid multi-modal combination of both manual and machine review is the best approach. The new emerging gold standard uses the talents of both and a variety of automated tools. It also uses extensive interactivity between humans, and between humans and machines. In Part Two of Secrets of Search I suggested nine characteristics of what I hope may become an accepted best practice for legal review worldwide. I invited peer review and comments on what I may have left out, or any challenges to what I put in, but so far this list of nine remains unchallenged:
- Bottom Line Driven Proportional Review where the projected costs of review are estimated at the beginning of a project (more on this in the next blog);
- High quality tech assisted review, with predictive coding type software, and multiple expert review of key seed-set training documents using both subject matter experts (attorneys) and AI experts (technologists);
- Direct supervision and feedback by the responsible lawyer(s) (merits counsel) signing under 26(g);
- Extensive quality control methods, including training and more training, sampling, positive feedback loops, clever batching, and sometimes, quick reassignment or firing of reviewers who are not working well on the project;
- Experienced, well motivated human reviewers who know and like the AI agents (software tools) they work with;
- New tools and psychological techniques (e.g. game theory, story telling) to facilitate prolonged concentration (beyond just coffee, $, and fear) to keep attorney reviewers engaged and motivated to perform the complex legal judgment tasks required to correctly review thousands of usually boring documents for days on end (voyeurism will only take you so far);
- Highly skilled project managers who know and understand their team, both human and computer, and the new tools and techniques under development to help coach the team;
- Strategic cooperation between opposing counsel with adequate disclosures to build trust and mutually acceptable relevancy standards; and,
- Final, last-chance review of a production set before going out the door by spot checking, judgmental sampling (i.e. search for those attorney domains one more time), and random sampling.
I have probably missed a few key factors. This is a group effort and I cannot talk to everyone, nor read all of the literature. If you think I have missed something key here, please let me know. I will be at Legal Tech New York for three days with four presentations. Seek me out and let’s talk. You can reach me at ralph.losey@gmail.com.
You may note that I am herewith joining the call of other leaders in the field, notably Jason Baron, to develop best practice standards, having overcome my initial reluctance to go there. See Jason R. Baron, Law in the Age of Exabytes: Some Further Thoughts on ‘Information Inflation’ and Current Issues in E-Discovery Search, XVII RICH. J.L. & TECH. 9, at 29-33. My concerns about arbitrary standards and unfounded malpractice claims remain, but I think we have no choice but to develop some basic industry standards. The nine characteristics of good document review outlined above constitute a first modest step in that direction.
The Fourth Secret of Search:
Relevant Is Irrelevant
Sorry to sound like one of Steve Jobs’ Zen Masters, but a contradiction like Relevant Is Irrelevant has more impact than the technically more accurate statement, which is: merely relevant documents in big data reviews are irrelevant as compared to highly relevant documents. In other words, all that counts in litigation are the hot documents, the highly relevant ones with strong probative value, not the documents that are just relevant, not to mention just responsive. In fact, in big data collections, I could not care less about merely relevant documents. Their only purpose is to lead me to highly relevant documents. Moreover, as we will see in the fifth and final secret, I only care about a handful of those.
In a case involving tens of thousands of documents, much less hundreds of thousands, or millions, almost all of the documents that are merely relevant will not be admissible into evidence. (I’ll explain why in a minute.) For that reason alone their discovery should be subject to very close scrutiny. The gathering of evidence for admission at trial is, after all, the only valid purpose of discovery. Discovery is never an end in itself, although many litigators (as opposed to true trial lawyers) and vendors often lose track of that basic truth. Discovery is only permitted for purposes of preparation for trial. It is never permitted to extort one side into a settlement to avoid the costs of a document review, or to at least gain a strategic edge, although we all know this happens all of the time.
Why won’t most merely relevant evidence be admissible, you may wonder? For the same reason that most of the even highly relevant evidence won’t be admissible. Even though relevant, this evidence is a cumulative waste of time, and for that reason is inadmissible under Rule 403 of the Federal Rules of Evidence and its state law equivalents. To refresh your memory on the rule:
Rule 403. Excluding Relevant Evidence for Prejudice, Confusion, Waste of Time, or Other Reasons.
The court may exclude relevant evidence if its probative value is substantially outweighed by a danger of one or more of the following: unfair prejudice, confusing the issues, misleading the jury, undue delay, wasting time, or needlessly presenting cumulative evidence.
Also see Rule 611. (“The court should exercise reasonable control over … presenting evidence so as to … (2) avoid wasting time”)
The typical fact scenario used in law school to exemplify the principle of cumulative evidence is a situation where 100 witnesses see the same accident. Each would give roughly the same description of the event, and the testimony of each would be equally relevant. Still, the testimony of 100 witnesses would never be allowed because it would be a waste of time, and/or a needless presentation of cumulative evidence, to have all 100 repeat the same facts at trial. The same principle applies to documentary evidence. If there are 100 emails that show essentially the same relevant fact, you cannot admit all 100 of them. That would be a cumulative waste of time.
The question of admissibility presented in Federal Rule of Evidence 403 requires a balancing of the costs and benefits of logically relevant evidence. This is sometimes referred to as the Rule 403 balancing test. This is similar to the balancing tests in Rule 26(b)(2)(C)(i) and (iii) of the Federal Rules of Civil Procedure between the benefits and burdens of discovery.
26(b)(2)(C) The frequency or extent of use of the discovery methods otherwise permitted under these rules and by any local rule shall be limited by the court if it determines that:
(i) the discovery sought is unreasonably cumulative or duplicative, or is obtainable from some other source that is more convenient, less burdensome, or less expensive; … or
(iii) the burden or expense of the proposed discovery outweighs its likely benefit, taking into account the needs of the case, the amount in controversy, the parties’ resources, the importance of the issues at stake in the litigation, and the importance of the proposed discovery in resolving the issues.
New e-discovery Rule 26(b)(2)(B) has a similar balancing test for hard-to-access ESI. So too does Rule 26(g) that requires only a reasonable inquiry of completeness in a response to discovery. Perhaps more importantly, Rule 26(g)(1)(B) also prohibits any request for discovery made “for any improper purpose, such as to harass, cause unnecessary delay, or needlessly increase the cost of litigation” and prohibits any request that is unreasonable or unduly burdensome or expensive “considering the needs of the case, prior discovery in the case, the amount in controversy, and the importance of the issues at stake in the action.” All the rules point to reasonability in discovery, and yet in e-discovery we routinely engage in unreasonable, cumulative overkill. See Patrick Oot, Anne Kershaw and Herbert L. Roitblat, Mandating Reasonableness in a Reasonable Inquiry, Denver University Law Review, 87:2, 522-559, at 537-538 (2010).
The rules clearly state that cumulative evidence is not, or at least should not be, subject to discovery. It would be a waste of time and money. Thus, even though documents might be relevant, if they are unreasonably cumulative, repetitive, or duplicative, such that the burden outweighs the benefit, they are not only inadmissible as evidence, but they are, or should be, outside the scope of discovery.
This is buttressed by the prime directive of the Federal Rules of Civil Procedure, Rule 1. It requires all of the other rules of procedure to be interpreted and applied so as to make litigation just, speedy and inexpensive.
In spite of the clear law against cumulative, over burdensome discovery, lawyers and judges faced with big data cases today still routinely engage in discovery overkill. A 2010 survey of large cases that went to trial in 2008 showed that on average, 4,980,441 pages of documents were produced in discovery, but only 4,772 exhibit pages were entered into evidence. Duke Litigation Cost Survey of Major Companies (2010) at pg. 3. That is a ratio of over one thousand to one! Also see DCG Sys., Inc. v. Checkpoint Techs., LLC, No. C-11-03792 PSG, 2011 WL 5244356 (N.D. Cal. Nov. 2, 2011) (little benefit to justify burden of large scale email production because on average only “.0074% of the documents produced actually made their way onto the trial exhibit list” and in appeals “email appears more rarely as relevant evidence”).
These are absurd numbers for a variety of reasons. The 4,772 admitted into evidence is ridiculous over-kill, as will be shown further in the fifth secret, and so is the number of documents produced. The producing parties, acting in concert and cooperation with the requesting parties, should do a better job of culling down the irrelevant documents and marginally relevant documents. They are not needed for trial preparation.
This so-called Duke Survey, which was commissioned by the Lawyers for Civil Justice, not Duke, also offered an opinion convergent with my own that such discovery is excessive (although we disagree on causation):
Whatever marginal utility may exist in undertaking such broad discovery pales in light of the costs. … Reform is clearly needed. A discovery system that requires the production of a field full of “haystacks” of information merely on the hope that the proverbial “needle” might exist and without any requirement for any showing that it actually does exist, creates a suffocating burden on the producing party. Despite this, courts almost never allocate costs to equalize the burden of discovery.
The Fifth Secret of Search: 7±2
Should Control All e-Discovery (But Doesn’t)
We have already established that the purpose of discovery is to prepare for trial. But what is the purpose of a trial? We have to understand that to be able to grasp the fifth secret: 7±2. The purpose of all trials is to persuade. A trial is a time and place, a level playing field, where lawyers try to persuade a judge and/or jury as to what happened and what should be done about it.
In this place of trial of humans, by humans, the rule of 7±2 reigns supreme. It always has and, unless we allow robots as jurors, always will. Unfortunately, most litigators are unaware of this rule of information transmission, or, if they do know of it, fail to see its connection to discovery and search. The rule of 7±2 now has almost no place in e-discovery analysis.
It is a secret, and because it is unknown, we have gone astray in e-discovery. Because this secret is unknown, vast sums of money are routinely wasted in the production of fields full of “haystacks” of information. Because the secret has not yet been heard, and its clear implications have not yet been understood, trial lawyers everywhere still scratch their heads at the mere mention of e-discovery. Yes, this secret is also the key to the seventh of the insights into widespread lawyer resistance to e-discovery analyzed in Tell Me Why?
I have alluded to this rule of seven in a few past blogs, and discussed it at a few late night dinners. But this is the first time I have written at length on the magic power of seven, plus or minus two. I hesitate to go to this deep place of information transmission and cognitive limitations, but, in order to keep the search for truth and justice on track, we really have no choice. We must, like the Pythagoreans of old, consider the significance of the number seven and its impact on our work, especially on our conceptions of proportionality.
The fifth secret of search is based on the legal art of persuasion and the limitations of information transmission. The truth is, no jury can possibly hold more than five to nine documents in their head at a time.
It is a waste of time to build a jury case around more documents than that. Judges who are trained in the law, and are quite comfortable with documents, can do a little better, but not that much. In a bench trial you might be able to use eight to twelve documents to persuade the skilled judge. But even then, you may be pushing your luck. Judges, after all, have a lot on their mind, and your particular case is just one among hundreds (in state court make that thousands).
Computers Expand Document Counts, Not Minds
Even though the computerization of society has exploded the number of documents we retain a trillion-fold, the ability of the human mind to remember and process has remained the same. We can still only be persuaded by a handful of writings. That is all of the information we can retain. Presenting dozens of documents is a waste of time. The only reason to present more than five to nine documents at trial is to provide context and an evidentiary foundation. The few dozen other documents that you may need at trial are merely window dressing, a frame for the real art.
A computer can easily process and recall millions of documents, and can do so in minutes, but we cannot. Even fast readers are limited to about 500 words per minute, or a skim-review rate of about 1,000. No matter how much time we may have, and in legal proceedings time is always constrained, our ability to read, understand, and comprehend relevant writings is limited. This is especially true in the high pressure and expedited schedule of a trial and the formal presentation of evidence in court. That is why all experienced trial lawyers I have talked to agree that the average juror is likely to remember and be influenced by only a handful of documents. By the way, this rule of seven in persuasion is a corollary to the KISS principle (“keep it simple, stupid”), well known to all persuaders, along with “tell-tell-and-tell.”
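To put the reading-speed limit in perspective, here is a back-of-the-envelope sketch. The per-document word count and the eight-hour day are illustrative assumptions, not figures from any actual review:

```python
# How long would one reviewer need just to read a collection?
WORDS_PER_MINUTE = 500   # fast reading rate noted above
WORDS_PER_DOC = 500      # assumed average document length
HOURS_PER_DAY = 8        # assumed working day

def reading_days(num_docs: int) -> float:
    """Working days for one person to read num_docs documents."""
    minutes = num_docs * WORDS_PER_DOC / WORDS_PER_MINUTE
    return round(minutes / 60 / HOURS_PER_DAY, 1)

print(reading_days(100_000))    # 208.3 working days
print(reading_days(1_000_000))  # 2083.3 working days
```

Under these assumptions, a single reviewer would need the better part of a year just to read a 100,000-document collection once, with no time left for judgment, coding, or quality control.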
Although most trial lawyers learn this just from hard experience, there is good theoretical support in psychology for such memory limitations. It is sometimes called Miller’s Law, after cognitive psychologist George A. Miller, a professor at Princeton University. Professor Miller first described this limitation of human cognition in his 1956 article: The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information, Psychological Review 63 (2): 81–97. This is supposedly the most widely quoted psychology paper of all time. According to Wikipedia, Miller’s paper suggests that seven (plus or minus two) is the magic number that characterizes people’s memory performance on random lists of letters, words, numbers, or almost any kind of meaningful familiar item. He essentially found that human beings were only capable of receiving, processing and remembering seven (plus or minus two) variables at any one time.
Professor Miller ends his famous paper on the limits of our capacity to process information with this somewhat odd remark, especially considering his reputation as a scientist:
What about the magical number seven? What about the seven wonders of the world, the seven seas, the seven deadly sins, the seven daughters of Atlas in the Pleiades, the seven ages of man, the seven levels of hell, the seven primary colors, the seven notes of the musical scale, and the seven days of the week? What about the seven-point rating scale, the seven categories for absolute judgment, the seven objects in the span of attention, and the seven digits in the span of immediate memory? For the present I propose to withhold judgment. Perhaps there is something deep and profound behind all these sevens, something just calling out for us to discover it. But I suspect that it is only a pernicious, Pythagorean coincidence.
George A. Miller, The Magical Number Seven, Plus or Minus Two (1956), 42-3.
Apparently some psychologists think Professor Miller overestimated the average human capacity when he said it was between five and nine. They think the limit is more likely to be from two to six, and that the magic number is four, not seven. Farrington, Jeanne, Seven plus or minus two, Performance Improvement Quarterly 23 (4): 113–6, doi:10.1002/piq.20099 (2011).
In any event, it is not hundreds of documents, much less thousands or millions. Yet in an average large case today 4,980,441 pages of documents are produced and 4,772 pages allowed into evidence. What is wrong with this picture? The discovery chase has lost track of the goal.
An experienced trial lawyer, who may use hundreds of exhibits in a very large trial for context and technical reasons, will still only focus on five to nine documents. They know jurors cannot handle more information than that. They know the rest of the documents that go into evidence will have little or no real persuasive value.
The limitations of the human mind thus provide a consistency and continuity with the trials and systems of justice of our past pre-computer civilizations. No matter how many more documents may exist today within the technical scope of legal relevance, our jurors’ capacities are the same; the art of legal persuasion remains the same.
These mental persuasion limits provide a governor on the number of documents useful to a trial lawyer, judge, and jury. The human mind has its limits. Computer discovery must start to realize these limits and take them into consideration. This is a basic truth that we e-discoverers have lost sight of.
It is the core of why most old-time trial lawyers think the whole business of e-discovery is ridiculous. It is high time for the secret of seven to be outed and, more importantly, to be followed. The rule of seven should have significant consequences on our legal practice and scientific research.
Uneducated Searchers Will Never Find the Top 7±2
Locating these few highly relevant documents has always been a problem in the law. But in the low volume paper world it was never an overwhelming one. The paper document search and retrieval process was a relatively simple problem traditionally assigned to the youngest, most inexperienced lawyers. Today the search for the smoking e-guns is much more difficult than ever before, yet untrained young associates are still commonly given this task. Many are simply told to go do e-discovery. They are provided with little more training than attendance at a few CLEs, which, you should know by now, don’t really teach you that much.
That is one compelling reason I took the time to make my law school training program available online to law firms, attorneys, paralegals and students everywhere. e-DiscoveryTeamTraining.com. It provides over 75 hours of instruction, which is what it takes to really learn something. Just don’t try to learn more than seven things at a time. Take your time and study online whenever it is convenient to you.
Lack of real education is the primary impediment to further progress in all e-discovery issues, including search. Patrick Oot, Anne Kershaw, and Herbert Roitblat explained it well in their excellent Mandating Reasonableness article:
The problem is not technology; it is attorneys’ lack of education and the judicial system’s inattentiveness to ensure that attorneys have the proper education and training necessary for a proportional and efficient discovery process. Lack of attorney education aggravates the problem because uneducated litigators are unable to make informed judgments as to where to draw the line on discovery, thereby creating unrealistic expectations from the courts—particularly as to costs and burdens. For example, failing to understand how different methods of search methodology work, some judges will unnecessarily mandate traditional and expensive “brute force” attorney review. …
Simply put, the legal system has a crisis of education. Both attorneys and judges need to better understand technology as it applies to the reasonable inquiry.
Mandating Reasonableness, supra at pg. 545, 547.
Just Give Me the Smoking Guns
Since only a few documents are needed for analysis of a case, and even fewer for persuasion at trial, paper-only search has, until recently, sufficed for most trial lawyers. They have found the few they needed in printouts. But those days are now all but gone. Paper searches, and even ESI searches driven by old paper-based systems, are not likely to uncover the best documents. The smoking guns will remain hidden in the data deluge. Lawyers will not find the top seven needed for the judge and jury.
As the nature of documents changes, and the previously noted habit of witnesses to print out key documents disappears, this problem will worsen. No one today says incriminating things in paper letters. Very few still even write paper letters. They say it in emails, text messages, instant messages, Facebook posts, blogs, tweets, etc., and almost no one prints these out and puts them in filing cabinets.
There is a key lesson for e-discovery in the trial lawyer wisdom of seven. To be useful, discovery must drastically cull down from the millions of ESI files that may be relevant, to the few hundred that are useful, and the five to nine really needed for persuasion. Culling down from millions to only tens of thousands is not serving the needs of the law. It is a pointless waste of resources, a waste of client money. A production of tens of thousands of documents, not to mention hundreds of thousands, is unjust, slow, and inefficient.
Many vendors today brag about how their smart culling was able to eliminate up to 80% of the corpus. They will tell you this is an excellent cull rate before you begin review. It is not. They may also tell you that it is unreasonable for you to try to cull out more than that. They are wrong. They have a financial motivation to take such conservative positions. The more documents you review, the more money they make. Some law firms see it that way too. But they won’t last; their clients will eventually catch on and move their work away from the haystack builders.
Even if well-intentioned, many vendors (and lawyers) don’t understand that the law requires only reasonable, proportional efforts, not perfect or exhaustive efforts. They don’t understand the basic limitations of a trial, or the rules against cumulative evidence. Many have never even seen a trial, much less tried one. Vendors are not supposed to give legal advice, yet I hear them do it all of the time when, for instance, they talk about how much you should review to meet your obligations under the law. Or they may say it would be very risky to try to cull out more than that. As if they could ever really eliminate risk, much less quantify it. The only way to eliminate risk is by cooperation or court order, not by following vendor best practice suggestions.
When you understand the fourth and fifth search secrets, you realize that a cull rate of at least 90% is proportional. It does not matter if you weed out a few merely relevant documents. If you have a million files, you should be able to weed out at least 90%, 900,000 documents, before you begin review. In fact, you should aim for elimination of 98%+ by using relevancy ranking, and only do a human hybrid review of the remaining 20,000 documents.
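To make the proportionality arithmetic concrete, here is a rough sketch. The per-document review cost is purely hypothetical, picked only to show how the numbers scale:

```python
def docs_remaining(corpus_size: int, cull_rate: float) -> int:
    """Documents left for human review after automated culling."""
    return round(corpus_size * (1 - cull_rate))

corpus = 1_000_000
cost_per_doc = 1.00  # hypothetical all-in review cost per document

for rate in (0.80, 0.90, 0.98):
    remaining = docs_remaining(corpus, rate)
    print(f"{rate:.0%} cull leaves {remaining:,} docs "
          f"(~${remaining * cost_per_doc:,.0f} to review)")
```

The jump from an 80% cull to a 98% cull is the difference between reviewing 200,000 documents and reviewing 20,000, a tenfold reduction in the human review burden before a single reviewer opens a file.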
New e-discovery search and culling methods need to be perfected that limit the quantity of documents to a size that the human mind can deal with and comprehend. The processes should try to find all, or nearly all, of the highly relevant documents, even if a significant percentage of marginally relevant documents are missed. Who cares about these technically relevant documents? No one, except maybe those dazzled by recall stats who do not understand the natural speed limits of the mind. All that really matters are the hot documents. That is the lesson of the fourth secret of search, that Relevant Is Irrelevant.
The lesson of the fifth secret, 7±2, is that the true goal of e-discovery should be the five to nine of the hot documents that the triers of fact can understand. If your search finds those magic seven, and no others, it is a great success, regardless of all of its other misses. If your search finds a million relevant documents, and attains a precision and recall rate of 99%, but misses the top seven key documents, it is a complete failure. We have to change our search methods to focus on the top seven.
Change the Scientific Testing
We also have to redesign our scientific testing to measure what really counts, the 7±2, plus time and money. I suggest that the TREC Legal Track have a seeded test set next year where all searchers look to find seven planted Easter eggs. Whoever finds them all, or finds the most, and does so the fastest, and at the least expense, gets the highest score. In fact, for the tests to be fair and realistic, they should be time limited and cost limited. Participants should no longer be allowed to keep their time and expense figures secret. In the law, time and money matter. A search process that costs too much, or takes too long, is worthless.
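One way such a seeded test might be scored is sketched below. To be clear, the weights, caps, and scoring function are entirely my own invention for illustration; nothing like this is part of any actual TREC protocol:

```python
def egg_test_score(eggs_found, hours, dollars,
                   total_eggs=7, max_hours=40.0, max_dollars=10_000.0):
    """Illustrative score for a seeded ('Easter egg') search test.

    Finding the planted hot documents dominates; time and cost act
    as efficiency tie-breakers. All weights and caps here are
    invented assumptions, not part of any real testing protocol.
    """
    find_rate = eggs_found / total_eggs
    time_factor = max(0.0, 1 - hours / max_hours)
    cost_factor = max(0.0, 1 - dollars / max_dollars)
    return round(0.8 * find_rate + 0.1 * time_factor + 0.1 * cost_factor, 3)

# Team A finds all 7 eggs, but slowly and expensively.
# Team B finds only 6, but quickly and cheaply, and scores higher.
print(egg_test_score(7, hours=38, dollars=9_500))  # 0.81
print(egg_test_score(6, hours=8, dollars=2_000))   # 0.846
```

Under a scoring rule like this, a slightly less complete but far more efficient search can win, which is exactly the point: in the law, time and money matter.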
So far, all of the scientific experiments I have heard about in e-discovery have measured effectiveness, meaning how well or poorly a search performs, by only looking at Relevance measures, primarily precision and recall (or the harmonic mean thereof, F1). But in information science, Relevance is just one of the four basic measures of search effectiveness. The other three are Efficiency, Utility, and User Satisfaction. Sándor Dominich, The Modern Algebra of Information, pp. 87-88 (Springer-Verlag, 2008). According to Dominich, the Efficiency measures are the costs of search and the time it takes. We need to start including Efficiency measures in our tests, as well as weight our Relevance measures heavily by ranking.
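For readers unfamiliar with the F1 measure mentioned above, it is simply the harmonic mean of precision and recall. A minimal sketch, where the document counts are purely hypothetical:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: a search retrieves 10,000 documents, 6,000 of
# which are relevant, out of 8,000 relevant documents in the corpus.
precision = 6_000 / 10_000   # 0.60 of what was retrieved is relevant
recall = 6_000 / 8_000       # 0.75 of what is relevant was retrieved
print(round(f1_score(precision, recall), 3))  # prints 0.667
```

Note that F1, like precision and recall themselves, treats every relevant document as equally valuable, which is exactly the assumption this post argues against.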
In Law, One Key Document Is Worth a Million Relevant Documents
Too few experts in e-discovery today understand the fifth secret of search, namely the magic limiting power of seven. On the other hand, all experienced trial lawyers seem to know it well, even if they have never heard of Professor Miller. As a result of 7±2 being such a secret to many of my friends in e-discovery, they have erroneously focused on an effort to recall as many relevant documents as possible. They pride themselves on amassing large volumes of relevant documents, when in fact that is the last thing real trial lawyers want. They don’t want ten thousand relevant documents; they want ten. They want just a handful of killer documents that will help persuade the jury, that will make their story clear and convincing. The failure of e-discovery proponents to focus on this is another reason, the 7th in fact, why so many lawyers think e-discovery is stupid.
Electronic discovery search is not an academic game to be played. It is all about finding evidence for trial. Statistics and methods are worthless unless they properly weight recall statistics by persuasive impact. One highly relevant document can, and usually does, counteract ten million merely relevant ones. It is like one grand master playing a thousand amateurs. The amateurs don’t have a chance. Because of this, if your search is not designed to find the five to nine most persuasive documents, then your search is flawed, no matter what your precision and recall rates are.
High recall rates are only imperative for highly relevant documents, the hot documents. Nothing else matters, except for the costs involved, the time and money it takes to find evidence. If you don’t focus your search on the 7±2 hottest documents, you may never find them.
I know that some will argue you have to find all of the relevant documents in order to be able to find the top 7±2. That was true in the paper world of linear review of hundreds of documents, but it is not true in large-scale electronic review. You can now use software that focuses its search on the highly ranked relevant documents. But you have to adapt your methods accordingly.
New methods for ESI review should be used that focus on retrieval of ranked relevancy, not just relevancy. The methods should focus on finding the hot documents with the understanding that merely responsive documents are, due to their extreme number, of little importance. Relevant is irrelevant. The same ranking applies to identification of privileged and confidential ESI. If one hot privileged document is missed in a privilege review, it can be far more damaging than the inadvertent production of hundreds of marginally privileged ones.
Bottom line, to follow the fourth and fifth secrets we have examined in this blog, the key feature you should look for in search software is the ability to accurately rank the probably relevant documents. Ranking must be a far more sophisticated function than simply counting the number of times a keyword, or pattern, appears in a document. It should incorporate all of the criteria and indices used by the software black box – latent semantic, four-dimensional geometric, or otherwise.
The ideal e-discovery Watson computer must not only search and find, he must rank. Put the highest on top, please. Watson may not be able to place the five documents you will actually use at the very top of the list, but it is not too much to expect that the 7±2 will be in the top 5,000. The humans working with Watson will narrow them down, and the trial lawyers making the pitch will make the final selections.
Recap of All Five Secrets
To recap, in Part I we discussed the first two secrets. The first is that keyword search sucks, and so most attorneys still using this old method are searching for ESI the wrong way. The second secret is that large scale linear manual review also sucks, and this means we do not have a reliable gold standard by which to make precision and recall measurements. We do, however, know that a hybrid approach of man and machine, using keyword, predictive coding and other automated methods, is at least as accurate as manual review and far faster and less expensive.
In Part II we discussed the third secret that in small scale reviews of 500-1,000 documents professional reviewers are still better than our best automated methods, and it is foolhardy to take human review out of the final computer proposed production set. We need human review not only to instruct the computer, but for quality control and confidentiality protection. We also discussed the parameters for a new gold standard of hybrid, multimodal search and review.
In this Part III we discussed the fourth secret that relevant is irrelevant, meaning that smart culling that follows best practices is required by the rules to keep the time and cost of review proportional. The fifth secret gleaned from our friends the trial lawyers, 7±2, reminds us of the true goal of e-discovery and the need to heavily weight and constrain our relevancy searches.
The following graphic summarizes these thoughts using the symbol of the Pythagoreans, the five-sided polygon, or pentagon. The Pythagoreans were, by the way, famous among the ancient Greeks for secret keeping and a relentless search for truth.
As you have no doubt guessed by now, my real goal here was not to give away secrets, but to lay the foundation for new standards of search and review. The pentagon shows the first five steps, but there is still one more. In the next blog I will discuss that step and use the six-sided figure, a hexagon, to show my current understanding of best practices.
Conclusion
Way back in 1947 the Supreme Court in Hickman v. Taylor, the landmark case on discovery, stated that “[m]utual knowledge of all the relevant facts gathered by both parties is essential to proper litigation.” 329 U.S. 495, 507 (1947). The opinion was written by Justice Frank Murphy (1890-1949), shown right. Today his statement is obsolete insofar as it says ALL the relevant facts gathered should be shared. This statement was reasonable when written in 1947, but not today. In those days, the forties, all of the relevant facts could be found in a few dozen documents. In the sixties that became at most a few hundred. In the seventies and eighties, a few thousand.
Today, sixty-five years after Hickman v. Taylor, we live in a completely different world. Today written words proliferate and multiply with the help of computers in a way that our ancestors could never have imagined. Now you can gather hundreds of thousands or millions of relevant documents in even small cases. Now we write all of the time, and our writings multiply and remain, albeit in electronic form only.
The sharing of marginally important knowledge is no longer essential to proper litigation. In fact, as we have seen, it is contrary to the rules, especially Rule 26, Federal Rules of Civil Procedure. Most merely relevant documents today are inadmissible. Rule 403, Federal Rules of Evidence. They are a cumulative waste of time. It is unreasonable to gather them, much less disclose them. Rule 1 prohibits such a waste of time and money. Moreover, it is unjust. For it is easy to bury the truth in mountains of technically relevant haystacks. Document dumps are a way to hide the truth essential to proper litigation.
We need to design our e-discovery to be reasonably calculated to lead to admissible evidence, which means non-cumulative. We need to focus on the hot documents. We need to remember that all that really matters are the five to nine hottest documents. This is what the trial lawyers need to tell their story of prosecution or defense. The few other documents that you may want to put into evidence are just window dressing. The millions of other technically relevant documents are of little or no use in the preparation for trial, and of no use whatsoever in the conduct of a trial.
This means we need smart AI enhanced software tools. Software that we can teach to find the hottest documents. Software that has ranking built in as a core function. It also means that we need informed e-discovery attorneys who understand the secrets of search. They can then bridge the gap that now exists with trial lawyers. Then maybe the current e-discovery strategy used by most lawyers today of avoidance will be abandoned. Then maybe all lawyers will adopt proportional e-discovery designed for trial. There is a new year coming. Let’s all resolve to work together as a team to make it happen! Let’s focus our efforts. As Pythagoras supposedly said: Do not talk a little on many subjects, but much on a few.
Can one identify what documents are hot documents without knowing first what a side’s legal strategy is going to be? And if not, how does the producing party know what the requesting party will regard as “hot”?
Also, do large scale productions really drive up costs in the presence of automated text analysis tools, particularly if we assume that not every produced document needs to be manually reviewed by either the producing or the requesting party? The producing party hands over a large document set; then the requesting party loads it into their own text analysis software for condensation and extraction. With a competent search and analysis tool, finding the seven smoking guns from four million documents is not harder than from four thousand; indeed, it may be easier, because you have better statistical information to perform your inference on.
I think you’re very right, though, that our evaluation needs to measure how many of the hot documents a retrieval is catching, rather than merely what proportion of technically responsive documents (by a highly variable pyrite standard). The easter egg approach is an interesting but generally disfavoured one (first, it is artificial; and second, who knows what unintended hot documents might also exist in the corpus). But could we take existing collections with their retrievals, and do a further (sampled) analysis of documents marked relevant to determine which of them are hot? How big, and how well-defined, a task is this?
The way to get TREC to do this systematically, I think, is to run a trial experiment on an existing collection, and demonstrate the difference that a hotness-based analysis makes.
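As a rough answer to the “how big a task” question raised in the comment above, the standard sample-size formula for estimating a proportion gives a sense of scale. The 95% confidence level and the margins of error below are illustrative assumptions, not figures from the post or the comment:

```python
import math

def sample_size(z: float = 1.96, margin: float = 0.05, p: float = 0.5) -> int:
    """Documents to sample to estimate a proportion (e.g. the fraction of
    relevant documents that are 'hot') to within +/- margin, at the
    confidence level implied by z (1.96 for 95%). p = 0.5 is worst case."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size())             # 385 documents for a +/-5% margin
print(sample_size(margin=0.03))  # 1068 documents for a +/-3% margin
```

In other words, judging hotness over a few hundred to a couple of thousand sampled relevant documents would be enough for a rough estimate, regardless of collection size.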
Thanks for the comment, William. The law, and what you must prove under the cause of action pled, is a far more certain goal than mere strategy. It usually is not hard to know if a document is hot. If, for instance, the case requires proof of racial discrimination, and an email is found from an accused that contains a racial slur, it is obviously highly relevant. Deciding which of several hot documents to put in your top seven for trial may involve courtroom persuasion strategy, assuming you have more than that. But the other side’s basic strategy is always to prove, or disprove, the basic elements of the cause of action or affirmative defense. There is a lot of structure to the law and evidence.
The assumption in your comment that you save money by skipping human review before production is not a best practice in the law. In fact, no one does this, or it is only very rarely done, because of the issues of confidentiality and privilege. Remember, this is not an academic exercise. You are turning your private documents over to people who are suing you. They want to harm you. Frequently they are the worst of enemies. The rights to confidentiality are very significant and legitimate concerns of all persons involved in the judicial process. The contrary position suggests what I refer to in the blog as a Borg position of abdication, sometimes also called a quick peek. Here you trust your enemy to look through your secrets for you and extract the ones it wants. No way. The Borg position is usually adopted by requesting parties for ulterior motives, and/or because they have no trust at all in the producing party’s processes or final human review.
I advocate a hybrid approach, where we trust the software, but still verify. We rely on the computer to review the entire corpus to cull down to a manageable size, then we do a human review of the documents selected for production only. Of course, there is also human review in the seed set instruction process. We address the issue of mistrust by the requesting party by being open about the processes, but we do not open the doors to all our client’s information and hope for the best. This is litigation.
I would like to talk to you re the “Easter Egg” approach. Perhaps together, and with the help of others, we can “think different” and figure out an acceptable way to do this. Let’s go offline for this, but one thought to consider, and perhaps a reader could help here, is to start with a case that actually went to trial, and work backwards from the five to nine documents that the winning side used to persuade the jury.
I can’t address the needs of civil litigants, but I don’t believe that the rule of seven applies to white collar criminal cases. The problem is that criminal conspiracies are often spoken of in code and only by viewing a chain of emails will the true nature of a transaction become apparent.
Putting aside best practices for an internal investigation, consider a subpoena from a federal grand jury. Such subpoenas are primarily an investigative device. At this point what documents will be used at trial is secondary. Indeed, an individual document may not be worth much, but a series of documents may lead to the discovery of a target or a witness. And the witness’s testimony will be what is introduced at trial. Consequently, prosecutors often want the production of what is “merely” relevant.
Now consider the needs of an internal investigation. In one respect counsel is in the same position as a prosecutor. Except company (or outside) counsel will know even less to begin with because employees have every reason to lie to counsel, fearing loss of their employment or being reported to the authorities. By contrast, at the point employees are talking to the prosecutor they have every incentive to spill their guts. The bottom line is that counsel will have to cast an even wider net to figure out what has gone wrong.
Only after an individual or an entity is under indictment will counsel know enough about the charges to search for the 7 or so items that will be of most use to the defense. The problem is that the government is not required to disclose the 7 or so hot documents they will introduce out of the 100,000 they have disclosed to the defense.
Part of the solution lies in discovery reform under the rules of criminal procedure. Not likely to happen. So I see no easy solution to extremely expensive searches.
Say hi to my good friend Mike Wolf.
Ralph,
For anyone wondering how your systematics hold together, this essay is a must read. It’s also a must for anyone who has lost sight of the crucial point that the real purpose of document search is to enable lawyers to tell their client’s story. Everyone and everything in discovery needs to work together towards that goal.
Larry Chapin
My most immediate reaction to the article is the need for TIME to do everything right. And, as we know, time costs the client money. There is no way that the best of doc teams that I could put together, could figure out what the “7 plus/minus 2” true relevant documents are out of a universe of 500,000 without the time to really analyze the corpus. Yes — we could de-dupe across custodians, and employ tools so that we only look at the most complete version of an email string rather than each iteration . . . And we could use smart key-word/TAR review and sampling to narrow down early on what seems to be the most responsive and truly relevant documents . . . But, from there we would need to start doing some sophisticated review to determine if what seemed “hot” on the first or second go around proved to be really meaningful when put together in a time line.
Just as an example — assume that some “hot” issue generated a bunch of email traffic between five people and ultimately a meeting was set up to discuss the topic. Ah-ha! . . . . Then there are electronic notes taken at the meeting — cryptic as they are — and some follow-up emails between three of the five people. What about custodians four and five? . . . If the initial culling that was done three weeks ago, in a non-linear review, got rid of as “non-responsive” things such as electronic calendar entries showing appointments or emails that simply read — “Folks — something came up. Won’t be able to make it today.” — AND if it were important to the case to place custodian four at the meeting, I may never have discovered that custodian four actually showed a completely different event on his or her calendar on the date and time of the meeting, or the email that would have clued me into the fact that a few hours before the meeting custodian four cancelled? . . . Let me say this — I would like to think that I, personally, having seen that the topic and the meeting were “hot”, would have at that point gone back and searched among “all” docs, including the non-responsive ones, to try to figure out what happened to custodians four and five and whether they attended or not, etc. However, because that would be a subjective decision (and one demonstrating my e-discovery geekiness!!), there is no way that I can be sure that my “senior level review” colleagues — as good as they are — would have made the same subjective decision to go hunting. If they did not and we never produced the benign-on-its-face calendar entry and the email saying that the person was not going to attend — a year later we will be at deposition, with the deponent being pummeled with questions about the meeting, what happened at the meeting, who said what, what did the person think . . . Etc. Etc. Etc. 
And, given that the meeting occurred 3 and a half years ago, the deponent really doesn’t recall very much, the best he/she can answer is “I do not recall” — which as we all know is a safe answer but one that allows the other side to argue to a judge/jury about the person’s participation because by having no memory about it, the person cannot deny what is being argued and is stuck with “I don’t recall.” Meanwhile, immediately after the deposition, we all are scrambling around looking to see what we can find out about this meeting and the person’s participation — racing back through the document corpus . . . And now for the first time we realize that there are docs showing that the person cancelled and went off and did something else that day. “Damn it!! Why didn’t we know that in the first place???” . . . “Why didn’t we produce these docs to have spared the deponent the entire line of questioning, leaving us open on this topic???” . . . “How do we answer the in-house general counsel about why we didn’t figure all this out a year and a half ago when we were doing doc discovery so that we could have properly prepared the witness???” . . . “Should we go ahead and do a supplemental production now and suggest a second deposition of the person — and face the possibility that opposing counsel will file a motion for sanctions for failure to produce responsive documents causing them to waste time and money on a theory of the case that placed the custodian at that meeting and now have to go through the time and expense of a second deposition and maybe related supplemental depositions of three other custodians — a motion that they probably would not win, but we all will spend time — and the client will spend money!!! — defending against???” And, even if the motion for sanctions is denied for the most part — there is a good chance the judge will nonetheless make us have to pay for the supplemental depositions . . . .
I assume that you get my point. So — wouldn’t the answer have been to just produce everything even if at first glance it only appeared remotely responsive? . . . And, wouldn’t a linear chron review have let us see at the get-go of discovery that custodian four never attended the meeting in the first place and that, therefore, that calendar entry and the email saying he/she wasn’t going to be able to attend, which were non-responsive on their face in isolation, actually were quite relevant and responsive?
Assume all of the above — but now assume that we did not even realize or understand that the topic or the meeting were “hot” from the other side’s perspective — that they know something about the topic and the meeting that we do not and, therefore, we never even knew to produce anything about it from the start but treated this entire subject matter as “not really responsive/relevant” and so ignored it and did not produce it?
In short, by allowing the review team to make much tighter decisions about what is really relevant and responsive versus the “technically responsive but utterly meaningless”, don’t we open ourselves up to tremendous vulnerability in terms of being second guessed down the road? Again, isn’t the safest thing to do — albeit the stupidest and very expensive thing to do — just to produce it “all” and let the other side worry about finding what it needs to find — and if it doesn’t, then at least we can go back through the produced universe of docs and say “Hey — you all need to look at docs X, Y, Z as they counter your theory. We produced them you know — didn’t you bother to look??!!!”
I think what all of these comments are pointing to is a very basic issue. Yes, all I really want/need to get from you when you are responding to a discovery request are the key documents. But I would much rather be the one evaluating, for myself, whether a particular document is critical to my case than have my opposing counsel do it for me.
That’s not to say that your point about filtering down to the 5-9 key documents (assuming that’s the right number in a particular case) is wrong. Just that the filtering pretty much has to occur on the receiving party’s end, not the producing party’s end.
Agreed to a point. Of course the producing party produces more than the hottest documents. Which documents those are is not for them to say, and anyway, the producing party would never want to produce only those. My point is to remember the goal. So think in terms of producing the top thousands, not millions. I am suggesting we must cull more and smarter, and that both sides have to do that in order to contain costs, and to further justice and speed. But the culling itself must be a transparent process, and must be smart, i.e., use the latest technologies.
The problem is that we often don’t know what the hot documents will look like. It may work in an experiment with planted Easter Eggs, but that would inject an objective definition of what “hot” is, which is to be matched up against the subjective judgment of the testers.
In real life we only have the latter, no matter how much the obvious hotness excites us. Often times, however, the hotness of a document is derived from a pool of simply relevant documents.
I am not sure that the intrinsic hotness of “cook the books” would appear with great frequency. Instead, it takes the form of a contextually based “let’s do it.” In other words, the testers would have to look not only for the Easter Egg, but for the little girl mixing the egg paint and her mom’s going to the grocery store to buy the eggs. Maybe the chicken that laid them?
Second, I hope you would agree that the 7±2 should not be tied only to key documents admitted into evidence at trial. Simply relevant documents can be used to impeach a testifying witness, or draw out additional evidence at a prior deposition. And that portion of the testimony would then become key, even though the underlying document only created a question to be further explored.
The problem with eDiscovery is that its very name is a misnomer. It should have been called eProduction. It only applies to document sharing, not interrogatories, depositions, or other discovery vehicles. THAT is discovery. It also doesn’t need the “e” anymore. Maybe it did in 2005, but not in 2012. It’s just an element of discovery that, as you pointed out, no litigator can run away from. Tradition, I suppose.
I should also add that the inadmissibility of evidence at trial becomes irrelevant when a party is faced with a subpoena. A lot of massive litigation today is driven in response to federal and state issued subpoenas, often non-public and extra-judicial, where the respondents take a “cheerfully cooperative” posture that would be different if it were an all-out litigation. Here the cost of production is weighed against the likelihood of being sued and costs incurred if you do get sued. It’s a paradoxical reasonableness.
At the risk of sounding like a quibbler, I have to challenge your representation of Miller’s work. Miller did not find that humans can remember 7 ± 2 documents – he found that we can manage 7 ± 2 concepts simultaneously. Those concepts can be anything from a single digit of a number to a complex emotion like love or a philosophy like liberty. Miller’s finding is entirely applicable to the problems of modern discovery but it does not translate to a useful rule of thumb about document counts. Or rather, it does but not in the way that some of your comments in the post imply.
A document is not the “thing” that a juror must remember. A document may contain multiple facts or it could require multiple documents to establish a single fact. The juror cares about the concepts that make up the story of the case and the facts that prove or disprove each step in the story. The juror’s cognitive ability limits the number of disparate facts that you can realistically use to make your case regardless of how many documents they are distributed across.
Say that you need to prove that Sally deserved to be fired for submitting fraudulent expense reports. A single bad expense report does not prove or even suggest fraud. It could too easily be an innocent error. A pattern of expense reports showing the same behavior over time, however, locks in the one concept that Sally intentionally committed the fraud. The juror will not remember the details of 25 individual expense reports but will remember the pattern that “Sally is a cheat.” The same case might need to establish the concept that “the company was the good guy.” A single harsh email undercuts but does not disprove that concept. Even supervisors have a bad day. A pattern of emails over time locks in the concept that the manager had it in for poor Sally.
At the other end of the spectrum, even a single document may be indigestible if it covers more disparate concepts than the juror can absorb. Attempting to understand that one over-large document will inevitably drive other content out of his/her span of attention. The corollary to your argument is that even single documents may have to be excerpted wisely.
I agree strongly with your argument that it doesn’t take thousands of documents to make your point and that it is counter-productive even to try. I agree completely with your point that Discovery should start with the ends in mind and that it should be subordinate to the goal of justice, not an end to itself. You just can’t jump from Miller’s research on cognitive concepts all the way to the word “documents”.
(As a side note, judges are statistically no more capable than the rest of the population in Miller’s 7 ± 2 analysis. The judge’s sole advantage is that through context and training, he/she may be able to aggregate components into a single, larger concept which can then be managed as a cognitive unit. Specialists in every field demonstrate this competence. Notably, it is context-specific. An engineer who can manage the thousands of data elements necessary to design a bridge is no more capable with competing legal concepts than the average judge would be with the calculus of engineering design. It is identical to the short-hand you do when you memorize phone numbers by blocking the area code as a single cognitive unit.)
My apologies if this comes across as pedantic. You have an excellent and important point to this post. I don’t want that critical point to get lost or get taken out of context because of a misunderstanding of Miller’s research.
I don’t disagree with you. I was over-simplifying so that I could get my main point across. Thanks for the learned comments.
[…] Secrets of Search: Parts One, Two, and Three […]
[…] Many of our leading jurists, information scientists, academics, scholars, writers, and legal practitioners recognize that the old methods and attitudes that worked for paper no longer work for ESI. Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008) (Judge Grimm); Securities and Exchange Commission v. Collins & Aikman Corp., 2009 WL 94311 (S.D.N.Y., 2009) (Judge Scheindlin); Disability Rights Council of Greater Wash. v. Wash. Metro. Area Transit Auth., 2007 WL 1585452 (D.D.C. 2007) (Judge Facciola); United States v. O’Keefe, 2008 WL 449729 (D.D.C. 2008) (Judge Facciola); William A. Gross Const. Associates, Inc. v. American Mfrs. Mut. Ins. Co., _F.R.D._, 2009 WL 724954 (S.D.N.Y. 2009) (Judge Peck); Digicel (St. Lucia) Ltd & Ors v. Cable & Wireless & Ors, [2008] EWHC 2522 (Ch) (Justice Morgan) (UK decision). Moreover, scientific research has shown that keyword search alone is ineffective and multi-modal approaches that use keyword and other methods work far better. See: Child’s Game of “Go Fish” is a Poor Model for e-Discovery Search; The Multi-Modal “Where’s Waldo?” Approach to Search and My Mock Debate with Jason Baron; Secrets of Search: Parts One, Two, and Three. […]
[…] Secrets of Search Parts One, Two and Three, I outlined the five key characteristics of effective search today, using the rubric of secrets. In […]
[…] Secrets of Search: Parts One, Two, and Three; […]
[…] less ever get used to make a difference in a case. That is why in my Secrets of Search article, Part Three, I say Relevant Is Irrelevant and point out the old trial psychology rule of 7±2, to argue for […]
[…] Better, Best: a Tale of Three Proportionality Cases – Part Two; and, Secrets of Search article, Part Three (Relevant Is […]
[…] is different from other kinds of search. The goal of relevant evidence is inherently fuzzy. The 7±2 Rule reigns supreme in the court room, a place where most such computer geeks have never even been, much […]
[…] That is the bottom line: four cents per document versus six dollars and nine cents per document. That is the power of predictive culling and precision. It is the difference between a hybrid, predictive coding, targeted approach with high precision, and a keyword search, gas-guzzler, shotgun approach with very low precision. The recall rates are also, I suggest, at least as good, and probably better, when using far more precise predictive coding, instead of keywords. Hopefully my lengthy narrative here of a multimodal approach, including predictive coding, has helped to show that. Also see the studies cited above and my prior trilogy Secrets of Search: Parts One, Two, and Three. […]
[…] thus not worth the extra time, money and effort required to unearth them. See my Secrets of Search, Part III, where I expound on the two underlying principles at play here: Relevant Is Irrelevant […]
[…] no understanding of actual trials, cumulative evidence, and the modern data koan of big data “relevant is irrelevant.” Even though random sampling is not The Answer we once thought, it should be part of the […]
[…] of trial practice would never change as long as the jury trial remained a fundamental right: 7 +/-2. The number of electronic communications was ever-increasing, but not discussions of topics […]
[…] ESI? I examined this question at length in my Secrets of Search series, volumes one, two and three. Still, people find it hard to accept, especially in view of the unregulated clamor of the […]
[…] Secrets of Search – Part III. […]
[…] of legal search should always be on retrieval of Hot documents, not relevant documents. Losey, R. Secrets of Search – Part III (2011) (the 4th secret). This is based in part on the well-known rule of 7 +/- 2 that is often […]
[…] my fourth secret of search, relevant is irrelevant? See Secrets of Search – Part III. This Zen koan means that merely relevant documents are not important to a case. What really […]
[…] This incorporates the search principles of Relevant Is Irrelevant and 7±2 that I addressed in Secrets of Search, Part III. My own work has been driven by this hacker focus on impact and led to my development of Bottom […]
[…] I explained in my series Secrets of Search, Parts One, Two and Three, the latest AI enhanced software is far better than keyword search, but not yet good enough to […]
[…] Legal Track – Part One, Part Two and Part Three; Secrets of Search: Parts One, Two, and Three. Also see Jason Baron, DESI, Sedona and […]
[…] Requesters who demand production with only machine review, and any responders foolish enough to comply, have not understood the third secret. It is way too risky to turn it all over to the machines. They are not that good! The reports of their excellence have been grossly overstated. Humans, there is need for you yet. The Borg be damned! Jobs may have passed away, but his work continues. Technology is here to empower art, not replace it. (For more on this see the blog comments at the end.) […]
[…] The SMEs are the navigators. They tell the drivers where to go. They make the final decisions on what is relevant and what is not. They determine what is hot, and what is not. They determine what is marginally relevant, what is grey area, what is not. They determine what is just unimportant more of the same. They know full well that some relevant is irrelevant. They have heard and understand the frequent mantra at trials: Objection, Cumulative. Rule 403 of the Federal Evidence Code. Also see The Fourth Secret of Search: Relevant Is Irrelevant found in Secrets of Search – Part III. […]
[…] with Exhibit “A.” Maybe even Exhibits “A” though “G.” See: Secrets of Search – Part III, Fifth Secret: 7±2 Should Control All e-Discovery (But Doesn’t). I would have the few smoking […]