48 Responses to Secrets of Search – Part III

  1. Can one identify what documents are hot documents without knowing first what a side’s legal strategy is going to be? And if not, how does the producing party know what the requesting party will regard as “hot”?

    Also, do large scale productions really drive up costs in the presence of automated text analysis tools, particularly if we assume that not every produced document needs to be manually reviewed by either the producing or the requesting party? The producing party hands over a large document set; then the requesting party loads it into their own text analysis software for condensation and extraction. With a competent search and analysis tool, finding the seven smoking guns from four million documents is not harder than from four thousand; indeed, it may be easier, because you have better statistical information to perform your inference on.

    I think you’re very right, though, that our evaluation needs to measure how many of the hot documents a retrieval is catching, rather than merely what proportion of technically responsive documents (by a highly variable pyrite standard). The easter egg approach is an interesting but generally disfavoured one (first, it is artificial; and second, who knows what unintended hot documents might also exist in the corpus). But could we take existing collections with their retrievals, and do a further (sampled) analysis of documents marked relevant to determine which of them are hot? How big, and how well-defined, a task is this?

    The way to get TREC to do this systematically, I think, is to run a trial experiment on an existing collection, and demonstrate the difference that a hotness-based analysis makes.

    • Ralph Losey says:

      Thanks for the comment, William. The law and what you must prove under the cause of action pleaded is a far more certain goal than mere strategy. It usually is not hard to know if a document is hot. If, for instance, the case requires proof of racial discrimination, and an email is found from an accused that contains a racial slur, it is obviously highly relevant. Deciding which of several hot documents to put in your top seven for trial may involve courtroom persuasion strategy, assuming you have more than that. But the other side’s basic strategy is always to prove, or disprove, the basic elements of the cause of action or affirmative defense. There is a lot of structure to the law and evidence.

      Your comment assumes that you can save money by skipping human review before production, but that is not a best practice in the law. In fact, no one does this, or it is only very rarely done, because of the issues of confidentiality and privilege. Remember, this is not an academic exercise. You are turning your private documents over to people who are suing you. They want to harm you. Frequently they are the worst of enemies. The rights to confidentiality are very significant and legitimate concerns of all persons involved in the judicial process. The contrary position suggests what I refer to in the blog as a Borg position of abdication, sometimes also called a quick peek. Here you trust your enemy to look through your secrets for you and extract the ones it wants. No way. The Borg position is usually adopted by requesting parties for ulterior motives, and/or because they have no trust at all in the producing party’s processes or final human review.

      I advocate a hybrid approach, where we trust the software, but still verify. We rely on the computer to review the entire corpus to cull down to a manageable size, then we do a human review of the documents selected for production only. Of course, there is also human review in the seed set instruction process. We address the issue of mistrust by the requesting party by being open about the processes, but we do not open the doors to all our client’s information and hope for the best. This is litigation.

      I would like to talk to you re the “Easter Egg” approach. Perhaps together, and with the help of others, we can “think different” and figure out an acceptable way to do this. Let’s go offline for this, but one thought to consider, and perhaps a reader could help here, is to start with a case that actually went to trial, and work backwards from the five to nine documents that the winning side used to persuade the jury.

  2. […] Requesters who demand production with only machine review, and any responders foolish enough to comply, have not understood the third secret. It is way too risky to turn it all over to the machines. They are not that good! The reports of their excellence have been grossly over-stated. Humans, there is need for you yet. The Borg be damned! Jobs may have passed away, but his work continues. Technology is here to empower art, not replace it. (For more on this see the blog comments at the end.) […]

  3. Jon May says:

    I can’t address the needs of civil litigants, but I don’t believe that the rule of seven applies to white collar criminal cases. The problem is that criminal conspiracies are often spoken of in code and only by viewing a chain of emails will the true nature of a transaction become apparent.

    Putting aside best practices for an internal investigation, consider a subpoena from a federal grand jury. Such subpoenas are primarily an investigative device. At this point what documents will be used at trial is secondary. Indeed, an individual document may not be worth much, but a series of documents may lead to the discovery of a target or a witness. And the witness’s testimony will be what is introduced at trial. Consequently prosecutors often want the production of what is “merely” relevant.

    Now consider the needs of an internal investigation. In one respect counsel is in the same position as a prosecutor, except that company (or outside) counsel will know even less to begin with, because employees have every reason to lie to counsel, fearing loss of their employment or being reported to the authorities. By contrast, at the point employees are talking to the prosecutor they have every incentive to spill their guts. The bottom line is that counsel will have to cast an even wider net to figure out what has gone wrong.

    Only after an individual or an entity is under indictment will counsel know enough about the charges to search for the 7 or so items that will be of most use to the defense. The problem is that the government is not required to disclose the 7 or so hot documents they will introduce out of the 100,000 they have disclosed to the defense.

    Part of the solution lies in discovery reform under the rules of criminal procedure. Not likely to happen. So I see no easy solution to extremely expensive searches.

    Say hi to my good friend Mike Wolf.

  4. Larry Chapin says:


    For anyone wondering how your systematics hold together, this essay is a must read. It’s also a must for anyone who has lost sight of the crucial point that the real purpose of document search is to enable lawyers to tell their client’s story. Everyone and everything in discovery needs to work together towards that goal.

    Larry Chapin

  5. […] Secrets of Search – Part III […]

  6. […] Secrets of Search – Part III […]

  7. Melinda Levitt says:

    My most immediate reaction to the article is the need for TIME to do everything right. And, as we know, time costs the client money. There is no way that the best doc team I could put together could figure out what the “7 plus/minus 2” truly relevant documents are out of a universe of 500,000 without the time to really analyze the corpus. Yes, we could de-dupe across custodians, and employ tools so that we only look at the most complete version of an email string rather than each iteration . . . And we could use smart keyword/TAR review and sampling to narrow down early on what seem to be the most responsive and truly relevant documents . . . But from there we would need to start doing some sophisticated review to determine if what seemed “hot” on the first or second go-around proved to be really meaningful when put together in a timeline.

    Just as an example: assume that some “hot” issue generated a bunch of email traffic between five people and ultimately a meeting was set up to discuss the topic. Ah-ha! . . . Then there are electronic notes taken at the meeting, cryptic as they are, and some follow-up emails between three of the five people. What about custodians four and five? . . . Suppose the initial culling that was done three weeks ago, in a non-linear review, discarded as “non-responsive” things such as electronic calendar entries showing appointments, or emails that simply read “Folks, something came up. Won’t be able to make it today.” If it were important to the case to place custodian four at the meeting, I might never have discovered that custodian four actually showed a completely different event on his or her calendar on the date and time of the meeting, or found the email that would have clued me in to the fact that custodian four cancelled a few hours before the meeting.

    Let me say this: I would like to think that I, personally, having seen that the topic and the meeting were “hot”, would have at that point gone back and searched among “all” docs, including the non-responsive ones, to try to figure out what happened to custodians four and five and whether they attended or not, etc. However, because that would be a subjective decision (and one demonstrating my e-discovery geekiness!!), there is no way that I can be sure that my “senior level review” colleagues, as good as they are, would have made the same subjective decision to go hunting. If they did not, and we never produced the benign-on-its-face calendar entry and the email saying that the person was not going to attend, then a year later we will be at deposition, with the deponent being pummeled with questions about the meeting: what happened at the meeting, who said what, what did the person think . . . Etc. Etc. Etc.

    And, given that the meeting occurred three and a half years ago, the deponent really doesn’t recall very much. The best he/she can answer is “I do not recall”, which as we all know is a safe answer, but one that allows the other side to argue to a judge/jury about the person’s participation: having no memory of it, the person cannot deny what is being argued and is stuck with “I don’t recall.” Meanwhile, immediately after the deposition, we all are scrambling around looking to see what we can find out about this meeting and the person’s participation, racing back through the document corpus . . . And now for the first time we realize that there are docs showing that the person cancelled and went off and did something else that day.

    “Damn it!! Why didn’t we know that in the first place???” . . . “Why didn’t we produce these docs to have spared the deponent the entire line of questioning, leaving us open on this topic???” . . . “How do we answer the in-house general counsel about why we didn’t figure all this out a year and a half ago when we were doing doc discovery so that we could have properly prepared the witness???” . . . “Should we go ahead and do a supplemental production now and suggest a second deposition of the person, and face the possibility that opposing counsel will file a motion for sanctions for failure to produce responsive documents, causing them to waste time and money on a theory of the case that placed the custodian at that meeting and now have to go through the time and expense of a second deposition and maybe related supplemental depositions of three other custodians? That is a motion they probably would not win, but we all will spend time (and the client will spend money!!!) defending against it.” And, even if the motion for sanctions is denied for the most part, there is a good chance the judge will nonetheless make us pay for the supplemental depositions . . . .

    I assume that you get my point. So — wouldn’t the answer have been to just produce everything even if at first glance it only appeared remotely responsive? . . . And, wouldn’t a linear chron review have let us see at the get-go of discovery that custodian four never attended the meeting in the first place and that, therefore, that calendar entry and the email saying he/she wasn’t going to be able to attend, which were non-responsive on their face in isolation, actually were quite relevant and responsive?

    Assume all of the above — but now assume that we did not even realize or understand that the topic or the meeting were “hot” from the other side’s perspective — that they know something about the topic and the meeting that we do not and, therefore, we never even knew to produce anything about it from the start but treated this entire subject matter as “not really responsive/relevant” and so ignored it and did not produce it?

    In short, by allowing the review team to make much tighter decisions about what is really relevant and responsive versus the “technically responsive but utterly meaningless”, don’t we open ourselves up to tremendous vulnerability in terms of being second guessed down the road? Again, isn’t the safest thing to do (albeit the stupidest and a very expensive thing to do) just to produce it “all” and let the other side worry about finding what it needs to find? And if it doesn’t, then at least we can go back through the produced universe of docs and say “Hey, you all need to look at docs X, Y, Z as they counter your theory. We produced them, you know. Didn’t you bother to look??!!!”

  8. I think what all of these comments are pointing to is a very basic issue. Yes, all I really want/need to get from you when you are responding to a discovery request are the key documents. But I would much rather be the one evaluating, for myself, whether a particular document is critical to my case than have my opposing counsel do it for me.

    That’s not to say that your point about filtering down to the 5-9 key documents (assuming that’s the right number in a particular case) is wrong. Just that the filtering pretty much has to occur on the receiving party’s end, not the producing party’s end.

  9. Ralph Losey says:

    Agreed to a point. Of course the producing party produces more than the hottest documents. That is not for them to say, and anyway, the producing party would never want to do that. My point is to remember the goal. So think in terms of producing the top thousands, not millions. I am suggesting we must cull more and smarter, and that both sides have to do that in order to contain costs, and to further justice and speed. But the culling itself must be a transparent process, and must be smart, i.e., use the latest technologies.

  10. […] Part III of Secrets of Search I listed a nine-point checklist for quality reviews. Point number six was: “New tools and […]

  11. […] Ralph Losey,  “Secrets of Search Part 3” […]

  12. […] topic including a recent excellent three-part post called Secrets of Search: Part I, Part 2 and Part 3 by Ralph Losey’s electronic discovery […]

  13. […] with final manual review. As I explained in my series Secrets of Search, Parts One, Two and Three, we are not going to turn that over to the Borg anytime soon. I’ve asked around and no law […]

  14. […] of legal search, Herbert L. Roitblat, Ph.D, was kind enough to write a detailed critique of my Secrets of Search series. I tried to post a response on his blog, Information Discovery where it appeared. But the […]

  15. […] ESI? I examined this question at length in my Secrets of Search series, volumes one, two and three. Still, people find it hard to accept, especially in view of the unregulated clamor of the […]

  16. […] This incorporates the search principles of Relevant Is Irrelevant and 7±2 that I addressed in Secrets of Search, Part III. My own work has been driven by this hacker focus on impact and led to my development of Bottom […]

  17. T says:

    The problem is that we often don’t know what the hot documents will look like. It may work in an experiment with planted Easter Eggs, but that would inject an objective definition of what “hot” is, which is to be matched up against the subjective judgment of the testers.

    In real life we only have the latter, no matter how much the obvious hotness excites us. Oftentimes, however, the hotness of a document is derived from a pool of simply relevant documents.

    I am not sure that the intrinsic hotness of “cook the books” would appear with great frequency. Instead, it takes the form of a contextually based “let’s do it.” In other words, the testers would have to look not only for the Easter Egg, but for the little girl mixing the egg paint and her mom’s going to the grocery store to buy the eggs. Maybe the chicken that laid them?

    Second, I hope you would agree that the 7±2 should not be tied only to key documents admitted into evidence at trial. Simply relevant documents can be used to impeach a testifying witness, or draw out additional evidence at a prior deposition. And that portion of the testimony would then become key, even though the underlying document only created a question to be further explored.

    The problem with eDiscovery is that its very name is a misnomer. It should have been called eProduction. It only applies to document sharing, not interrogatories, depositions, or other discovery vehicles. THAT is discovery. It also doesn’t need the “e” anymore. Maybe it did in 2005, but not in 2012. It’s just an element of discovery that, as you pointed out, no litigator can run away from. Tradition, I suppose.

    • T says:

      I should also add that the inadmissibility of evidence at trial becomes irrelevant when a party is faced with a subpoena. A lot of massive litigation today is driven in response to federal and state issued subpoenas, often non-public and extra-judicial, where the respondents take a “cheerfully cooperative” posture that would be different if it were an all-out litigation. Here the cost of production is weighed against the likelihood of being sued and costs incurred if you do get sued. It’s a paradoxical reasonableness.

  18. […] part of the 7±2 documents that the trial lawyers of all parties will build their arguments around. See: Secrets of Search, Part III.  In addition to the “nightmare well” email that will be the centerpiece of every […]

  19. Mike Rossander says:

    At the risk of sounding like a quibbler, I have to challenge your representation of Miller’s work. Miller did not find that humans can remember 7 ± 2 documents – he found that we can manage 7 ± 2 concepts simultaneously. Those concepts can be anything from a single digit of a number to a complex emotion like love or a philosophy like liberty. Miller’s finding is entirely applicable to the problems of modern discovery but it does not translate to a useful rule of thumb about document counts. Or rather, it does but not in the way that some of your comments in the post imply.

    A document is not the “thing” that a juror must remember. A document may contain multiple facts or it could require multiple documents to establish a single fact. The juror cares about the concepts that make up the story of the case and the facts that prove or disprove each step in the story. The juror’s cognitive ability limits the number of disparate facts that you can realistically use to make your case regardless of how many documents they are distributed across.

    Say that you need to prove that Sally deserved to be fired for submitting fraudulent expense reports. A single bad expense report does not prove or even suggest fraud. It could too easily be an innocent error. A pattern of expense reports showing the same behavior over time, however, locks in the one concept that Sally intentionally committed the fraud. The juror will not remember the details of 25 individual expense reports but will remember the pattern that “Sally is a cheat.” The same case might need to establish the concept that “the company was the good guy.” A single harsh email undercuts but does not disprove that concept. Even supervisors have a bad day. A pattern of emails over time locks in the concept that the manager had it in for poor Sally.

    At the other end of the spectrum, even a single document may be indigestible if it covers more disparate concepts than the juror can absorb. Attempting to understand that one over-large document will inevitably drive other content out of his/her span of attention. The corollary to your argument is that even single documents may have to be excerpted wisely.

    I agree strongly with your argument that it doesn’t take thousands of documents to make your point and that it is counter-productive even to try. I agree completely with your point that Discovery should start with the ends in mind and that it should be subordinate to the goal of justice, not an end in itself. You just can’t jump from Miller’s research on cognitive concepts all the way to the word “documents”.

    (As a side note, judges are statistically no more capable than the rest of the population in Miller’s 7 ± 2 analysis. The judge’s sole advantage is that through context and training, he/she may be able to aggregate components into a single, larger concept which can then be managed as a cognitive unit. Specialists in every field demonstrate this competence. Notably, it is context-specific. An engineer who can manage the thousands of data elements necessary to design a bridge is no more capable with competing legal concepts than the average judge would be with the calculus of engineering design. It is identical to the short-hand you do when you memorize phone numbers by blocking the area code as a single cognitive unit.)

    My apologies if this comes across as pedantic. You have an excellent and important point to this post. I don’t want that critical point to get lost or get taken out of context because of a misunderstanding of Miller’s research.

  20. […] Secrets of Search: Parts One, Two, and Three […]

  21. […] Many of our leading jurists, information scientists, academics, scholars, writers, and legal practitioners recognize that the old methods and attitudes that worked for paper no longer work for ESI. Victor Stanley, Inc. v. Creative Pipe, Inc., 250 F.R.D. 251 (D. Md. 2008) (Judge Grimm); Securities and Exchange Commission v. Collins & Aikman Corp., 2009 WL 94311 (S.D.N.Y., 2009) (Judge Scheindlin); Disability Rights Council of Greater Wash. v. Wash. Metro. Area Transit Auth., 2007 WL 1585452 (D.D.C. 2007) (Judge Facciola); United States v. O’Keefe, 2008 WL 449729 (D.D.C. 2008) (Judge Facciola); William A. Gross Const. Associates, Inc. v. American Mfrs. Mut. Ins. Co., _F.R.D._, 2009 WL 724954 (S.D.N.Y. 2009) (Judge Peck); Digicel (St. Lucia) Ltd & Ors v. Cable & Wireless & Ors, [2008] EWHC 2522 (Ch) (Justice Morgan) (UK decision). Moreover, scientific research has shown that keyword search alone is ineffective and multi-modal approaches that use keyword and other methods work far better. See:  Child’s Game of “Go Fish” is a Poor Model for e-Discovery Search;  The Multi-Modal “Where’s Waldo?” Approach to Search and My Mock Debate with Jason Baron; Secrets of Search: Parts One, Two, and Three. […]

  22. […] Secrets of Search Parts One, Two and Three, I outlined the five key characteristics of effective search today, using the rubric of secrets. In […]

  23. […] less ever get used to make a difference in a case. That is why in my Secrets of Search article, Part Three, I say Relevant Is Irrelevant and point out the old trial psychology rule of 7±2, to argue for […]

  24. […] Better, Best: a Tale of Three Proportionality Cases – Part Two; and, Secrets of Search article, Part Three (Relevant Is […]

  25. […] Secrets of Search: Parts One, Two, and Three; […]

  26. […] is different from other kinds of search. The goal of relevant evidence is inherently fuzzy. The 7±2 Rule reigns supreme in the court room, a place where most such computer geeks have never even been, much […]

  27. […] That is the bottom line: four cents per document versus six dollars and nine cents per document. That is the power of predictive culling and precision. It is the difference between a hybrid, predictive coding, targeted approach with high precision, and a keyword search, gas-guzzler, shotgun approach with very low precision. The recall rates are also, I suggest, at least as good, and probably better, when using far more precise predictive coding, instead of keywords. Hopefully my lengthy narrative here of a multimodal approach, including predictive coding, has helped to show that. Also see the studies cited above and my prior trilogy Secrets of Search: Parts One, Two, and Three. […]

  28. […] thus not worth the extra time, money and effort required to unearth them. See my Secrets of Search, Part III, where I expound on the two underlying principles at play here: Relevant Is Irrelevant […]

  29. […] no understanding of actual trials, cumulative evidence, and the modern data koan of big data “relevant is irrelevant.” Even though random sampling is not The Answer we once thought, it should be part of the […]

  30. […] of trial practice would never change as long as the jury trial remained a fundamental right: 7 +/-2. The number of electronic communications was ever-increasing, but not discussions of topics […]

  31. […] ESI? I examined this question at length in my Secrets of Search series, volumes one, two and three. Still, people find it hard to accept, especially in view of the unregulated clamor of the […]

  32. […] of legal search should always be on retrieval of Hot documents, not relevant documents. Losey, R. Secrets of Search – Part III (2011) (the 4th secret). This is based in part on the well-known rule of 7 +/- 2 that is often […]

  33. […] my fourth secret of search, relevant is irrelevant? See Secrets of Search – Part III. This Zen koan means that  merely relevant documents are not important to a case. What really […]

  34. […] This incorporates the search principles of Relevant Is Irrelevant and 7±2 that I addressed in Secrets of Search, Part III. My own work has been driven by this hacker focus on impact and led to my development of Bottom […]

  35. […] I explained in my series Secrets of Search, Parts One, Two and Three, the latest AI enhanced software is far better than keyword search, but not yet good enough to […]

  36. […] I explained in my series Secrets of Search, Parts One, Two and Three, the latest AI enhanced software is far better than keyword search, but not yet good enough to […]

  37. […] Legal Track – Part One, Part Two and Part Three; Secrets of Search: Parts One, Two, and Three. Also see Jason Baron, DESI, Sedona and […]

  38. […] This incorporates the search principles of Relevant Is Irrelevant and 7±2 that I addressed in Secrets of Search, Part III. My own work has been driven by this hacker focus on impact and led to my development of Bottom Line […]

  39. […] Requesters who demand production with only machine review, and any responders foolish enough to comply, have not understood the third secret. It is way too risky to turn it all over to the machines. They are not that good! The reports of their excellence have been grossly overstated. Humans, there is need for you yet. The Borg be damned! Jobs may have passed away, but his work continues. Technology is here to empower art, not replace it. (For more on this see the blog comments at the end.) […]

  40. […] The SMEs are the navigators. They tell the drivers where to go. They make the final decisions on what is relevant and what is not. They determine what is hot, and what is not. They determine what is marginally relevant, what is grey area, what is not. They determine what is just unimportant more of the same. They know full well that some relevant is irrelevant. They have heard and understand the frequent mantra at trials: Objection, Cumulative. Rule 403 of the Federal Rules of Evidence. Also see The Fourth Secret of Search: Relevant Is Irrelevant found in Secrets of Search – Part III. […]

  41. […] with Exhibit “A.” Maybe even Exhibits “A” through “G.” See: Secrets of Search – Part III, Fifth Secret: 7±2 Should Control All e-Discovery (But Doesn’t). I would have the few smoking […]

  42. Hacker Law says:

    […] This incorporates the search principles of Relevant Is Irrelevant and 7±2 that I addressed in Secrets of Search, Part III. My own work has been driven by this hacker focus on impact and led to my development of Bottom […]
