38 Responses to Relevancy Ranking is the Key Feature of Predictive Coding Software

  1. […] e-Discovery Team: Relevancy Ranking is the Key Feature of Predictive Coding Software […]

  2. Ralph, I disagree with your claim that predictive coding without ranking is worthless. Ranking, or more properly scoring, documents adds noise to the process, invites controversy over the cutoff score, and is based on some dubious assumptions.

    Training predictive coding requires a set of documents that have been categorized as responsive and a set that has been categorized as non-responsive. The ultimate product of the system is another set of documents that have been categorized by the system as responsive. Adding scores in the middle of this process merely smears the discrete categories of the training set and adds noise that is basically unnecessary.

    Scores add to the controversy of eDiscovery. The two sides tend to argue about where to put the cutoff score that re-assigns the scores to categories. At what score do you say: documents above this score are responsive, documents below this score are not? For example, see http://ediscoveryjournal.com/2013/07/predictive-coding-cooperation-experiment-gets-contentious/ .

    Is it logical to say that a given document is 10% responsive? Do we produce a 10% responsive document?

    The exact score that is assigned to a document depends precisely on the categorization algorithm. It represents the algorithm’s view of how well the document matches its representation of the category, and that may or may not correspond with the user’s view of more or less responsiveness.

    Over a broad range, the score can be useful to distinguish between responsive and non-responsive, but the factors that go into computing the details of a score may not, as you point out, correspond to the details of responsiveness. As in keyword searching, a document that scores 0.40 is not necessarily more responsive than one that scores 0.39. The factors that affect the score in keyword search, which you correctly deride, are also at play in most predictive scoring algorithms. A document that mentions “debenture” twice would usually score higher than one that mentions the word once, but would not necessarily be more responsive.

    Scores may be useful as a rule of thumb for prioritizing review, but they are far from an essential feature of predictive coding. We start with categories and our goal is to end with categories; the detour through scores primarily adds to the complexity and uncertainty of predictive coding. If the subject matter expert is generous, willing to accept more documents as responsive, then an effective predictive coding algorithm will learn to be generous. This is a legal decision that lawyers are used to making, rather than an information science judgment about cutoff scores, which they are generally not well skilled at making.

  3. Ralph Losey says:

    Herb – you might want to reread my blog. I said ranking was a “key feature” for me. Re “worthless,” what I actually said was:

    “The software is worthless without good users – the SMEs – and good user methods, and good software.”

    This is a point that I think we agree on.

  4. Jeremy Pickens says:

    How many times have you seen the hundred page spreadsheets ranked as the highest “relevant,” the spreadsheets with tens of thousands of names in them? The same applies to long reports, sometimes thousands of pages long. Those were the documents ranked the highest under old-fashioned keyword count ranking.

    Ralph, with all respect, you’re fighting against a strawman, a type of relevance ranking that has not been used in serious Information Retrieval circles in at least 20 years, if not more. Nowadays, all keyword-based relevance ranking algorithms (including those of the major web search engines) normalize the raw frequency counts by the number of terms in the document. It has been recognized for decades that longer documents have a higher probability of having more matching terms, and so the score is “tamped down” by the length of the document.

    Citation: http://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html#sec:inner
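
    For readers who want to see what that length normalization looks like, here is a toy sketch of the general idea (my own illustration, not any particular vendor’s or search engine’s actual formula; cosine-style normalization is just one of several standard schemes):

    import math

    def raw_count_score(query_terms, doc_tokens):
        # Old-style ranking: just count keyword hits, so sheer length wins.
        return sum(doc_tokens.count(t) for t in query_terms)

    def normalized_score(query_terms, doc_tokens):
        # Length-normalized (cosine-style) score: divide by the document's
        # vector length, which "tamps down" long documents.
        norm = math.sqrt(sum(doc_tokens.count(t) ** 2 for t in set(doc_tokens)))
        return raw_count_score(query_terms, doc_tokens) / norm if norm else 0.0

    short_doc = "debenture fraud memo".split()
    long_doc = ("quarterly report " * 500 + "debenture debenture").split()

    # Raw counts rank the long document first; the normalized score does not.
    print(raw_count_score(["debenture"], short_doc), raw_count_score(["debenture"], long_doc))
    print(round(normalized_score(["debenture"], short_doc), 3),
          round(normalized_score(["debenture"], long_doc), 3))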

    So I’m not sure if you’re opposed to relevance ranking applied to keyword searches in general, or just to rankings that don’t normalize by document length.

    Next you write: “Relevancy based solely on keyword counts is just a coincidence. It is also quite likely that a document with only one keyword, or no keywords, and thus ranked last, is actually the hottest document in the collection. There was no way you could rely on that kind of arbitrary, pseudo-ranking to justify reduced review. But now with predictive coding, which uses a complex analytic system that looks at the entire document to rank documents, including metadata, and how every document relates to other documents, the relevancy ranking becomes real. It becomes testable. It can even be the basis for new types of ranking searches… The use of complex multidimensional vectoring and probability math has made the ranking reliable.”

    Again, I have to mention the fact that you don’t have to have full-on predictive coding in order to rank documents by relevancy using more than just keywords. Again, here is a formula (one of many possible!) from almost 10 (!) years ago that makes use of metadata in its relevance ranking:

    http://trec.nist.gov/pubs/trec13/papers/microsoft-cambridge.web.hard.pdf

    Full-on predictive coding should indeed use both keywords and metadata. But just normal ad hoc search relevance ranking can use keywords and metadata, too. It is entirely possible for your SME to construct a query that consists of both keywords and dates and whatever else, and do a relevancy ranking that takes all that information into account.

    But that’s not CAR. That’s just normal, ad hoc search. It’s just that more intelligence is built into the search than you get from many systems. But the algorithms for doing multimodal ad hoc search have been around for a decade. And the ones for doing term frequency document length normalization have been around for two decades.

    So please, criticize algorithms that don’t use these kinds of formulae. But please don’t make the assumption that all algorithms are incapable of these things. To those skilled in the art, these problems have been studied, and solutions have existed, for decades.

    • Ralph Losey says:

      Jeremy – Thanks for the comments. I always like to hear from PhDs in information science. I learn something new each time. This post was no exception.

      Re your point, alas, I wish it were a “straw man” re the failure of many e-discovery software programs to adjust the keyword counts, like you say is standard in your elite world. I of course believe you, but it is not standard in my world as an e-discovery lawyer. It is not a “straw man” in e-discovery review software. (I don’t have to engage in “straw man trickery” for my arguments. No need. There are so many things wrong, why would I invent any? Really Jeremy!) Alternatively, if the vendors who have been writing this software for the past decade did try to correct for large documents, or as you put it “tamp down,” then they failed miserably. Either that or I just never happened to use the software that did get it right. I have plenty of personal experience with this in the legal trenches of document review. Moreover, whenever I mention this at CLEs, I get many nods from others in the audience who have encountered the same thing.

      Unfortunately for us lawyer and paralegal users, most e-discovery vendors do not have scientists like you on staff! Or if they do, they don’t listen or do not know how to code it right. Too bad what you call “old hat” has not been included in any of the software I have used over the last decade up until the latest predictive coding versions. The only ranking that was any good that I’ve personally seen has been in review software with new AI-enhanced, aka “predictive coding” features. If you would like to give me some other specific examples that you have personally seen and used in real world tests in legal review software, please do so by private email, or maybe we could have another one of our Google video chats, complete with funny hats et al!

      • Jeremy Pickens says:

        Apologies for the delay.. I just saw your response now (didn’t get an email notification that a response had happened).

        Perhaps using the word “strawman” was too strong. And I certainly didn’t mean it in the spirit of “trickery”. It’s just.. I’m pretty sure most modern algorithms, even those from my competitors, do account for this length in some way.

        But perhaps it doesn’t go far enough, as you suggest. Perhaps tamping down is indeed occurring, and things would be even worse if it weren’t. Hard to know.

        Here’s a question: Do a lot of those long documents end up being responsive, or non-responsive? They’re often (though of course not always) responsive, correct? I know I’ve read your writings about this in the past.. how you tag a long document as responsive, but also tag it as “don’t use this for training, because I’ve already seen too many of these”.

        If so, if these long documents (a lot of them.. of course not all of them) are responsive, then it seems what you really want is a way of saying, “show me something relatively different, something (ahem) contextually diverse”, because I already know about these kinds of long documents.. even if they are responsive.

        If that’s the kind of workflow augmentation that you’d like to see, then that’s the sort of thing that should be modeled explicitly by the vendors. But if a long document is responsive, then it’s going to be ranked highly by the CAR algorithms.

  5. Ralph,

    Having a header that promises a ranking that will sort documents to show the “most relevant” documents first is plain misleading.

    As Herb points out, as Jeremy has stated in my Linkedin discussion http://goo.gl/YHgPVx, and indeed as your post itself states at one point (“Ranking orders all documents in a collection according to likely relevance as you define it”), the score tells a user how likely it is that a document is in fact responsive, in a sense how sure the system is of the prediction. It says nothing intrinsically about how relevant a document is. In this respect, Herb’s remark is on all fours: “It represents the algorithm’s view of how well the document matches its representation of the category, and that may or may not correspond with the user’s view of more or less responsiveness.” Using the likelihood scores to percolate more of the responsive material to the top is fine; it obviously has value.

    However, a message that conflates degree of relevance and likelihood of responsiveness is very troublesome. As an example, your post leads with “more relevant” in the lead-in picture, swings back to “likely relevance” and then swings back again to focusing on the “most relevant documents.” So which is it? Are you making the claim that there is some property of predictive coding that makes the two so highly correlated that distinguishing between them is unnecessary? I have heard such claims; I would not care to walk on ice as thin as the support for those claims. In any case, if it’s your position that they can be interchanged, logic dictates that the critical difference between a probability-of-responsiveness score and a level of relevance must at least be acknowledged. Then make the case why the former also predicts the latter.

    Absent such a claim, the conflation leads (and this I continually observe) those end users and judges who are too busy to drill down into this stuff to think that the current models are actually scoring “how relevant” a document is. And invariably, the belief is that producing the highest ranked documents is assuring that the corpus’ most important documents are being retrieved. Hence, all the nattering about cut-offs, with users mistakenly believing that they are deciding the point at which documents become of such relative minor importance that proportionality dictates that they can remain behind. This is the dangerous outcome of such conflation.

    Moreover, the truth is that under current protocols neither users nor the courts receive information sufficient to rule on proportionality (at least rationally). The documents that from a sophisticated mathematical perspective look most akin to the training sets are chosen, and then through the magic of conflation they are deemed “most relevant.”

    With the low Biomet-like recall levels now seemingly acceptable, one must wonder at the import of information in the part of the figurative worm that lies on the wrong side of the predictive ranking plow. Unlike the worm, the client, if notified of the loss of information crucial to the case, is unlikely to forgive the slight.

    It may be that, because of the nature of the process (unreviewed, predicted non-relevant documents that don’t make the cut are effectively a black box), invalid “relevance ranking” based protocols will catch on.

    But that doesn’t make them any less invalid.

    Also, to Herb’s point, I believe that the basis of his objection was your statement, “If your vendor pushes so-called predictive coding-type software that cannot rank by relevancy, then get a new vendor, get new software.” That’s pretty much the same as saying it’s worthless, no?

    • Ralph Losey says:

      I have never seen software with a heading for rankings that says “most relevant,” have you? The ones I’ve seen have probability numbers. Sophisticated users understand the distinctions you made about relevance. The ranking correlation accuracy depends on both the SME training and the software. Obviously some software is better at it than others, just as some SMEs are better at machine training than others.

      As a trainer, an SME or extension thereof, I want the ranking information to assist in the training process. It allows for improved HCIR. As the active machine learning process kicks in after several iterations of training, then, and only then, a viable correlation develops with the SME’s idea of relevance, or at least it should with good software. You can see this by review of the documents by ranking categories. Of course you can and should test all of this with sampling, including the low ranked. This is not a perfect process, nothing is. But the law demands reasonable efforts, not perfection, and this additional tool of feedback helps me to make better efforts to find relevant evidence at a proportionate price. It must all be understood as part of the doctrine of proportionality, Rules 1, 26(b)(2)(B), 26(b)(2)(C), 26(g), etc. This is a legal search method.

      The availability of ranking the probability of all documents is, for me at least, a very valuable tool and makes it possible for me to be more effective as a machine trainer. I would not want to use software that did not have this capacity. It really helps me in HCIR, in getting the feedback I need from the computer. But it is just a tool, and, like any tool, it requires a knowledgeable user to be of any real value. There is no “easy button” and I did not mean to suggest ranking provides such a button.

  6. […] [13] Losey, R., Relevancy Ranking is the Key Feature of Predictive Coding Software found at http://e-discoveryteam.com/2013/08/25/relevancy-ranking-is-the-key-feature-of-predictive-coding-soft…. […]

  7. Ralph,

    Thanks for responding.
    My responses.
    Regardless of how vendors label the ranking score (the label is usually vague; the messaging that comes with the proposals and training is not, and it references “most relevant” ubiquitously), the message being pushed onto clients, potential clients, and judges is that the ranking provides a per se measure of relevance level. This is very troubling because it has facilitated the increasingly common mistaken belief that a protocol that skims off documents using score thresholds inherently assures that the most relevant documents in the corpus are being produced, and that where recall is even significantly less than 100%, the score-based selection process ensures that missing relevant documents must be of lesser importance. It should be noted that both well documented judicial decisions related to predictive coding protocols, Da Silva Moore and Biomet, evidence an acceptance of the conflation without noting any distinction.
    It may be that sophisticated users understand the distinction and don’t make this mistake. However, it’s my experience that the set of people who are sophisticated enough to actually “get” this fundamental and important distinction isn’t large enough to get a good pickup game off the ground at St. Tommy’s. And to the extent some do get it, they seem to be willing to join the chorus of advocates who conflate the two measures in promoting predictive coding. End users, almost universally not sophisticated in the nuances of predictive coding, clearly make no distinction.

    The rampant misuse of probability scores as measures of document importance is evident from an examination of just the first pages of a Google search for “ ’most relevant’ ‘predictive coding’ ranking”:
    • “MSL’s vendor used software that ranked documents on a score of 100 (most relevant) to zero (least relevant)” [referring to Da Silva Moore].
    • “It allows you to focus on the most relevant documents first”
    • “enabling your review team to start with the most relevant documents first”
    • “ your review time by focusing on the most relevant documents first”
    • “predictive coding to determine which documents were most relevant”
    • “re-ranking the remaining documents to bring the most relevant ones to the top”
    • “assign those that are likely to be most relevant to be reviewed first”
    • “As a result, the most relevant, responsive documents are ranked”
    (And I’d point out again that your graphic above also terms the ranking as least to most relevant.)

    You point out that there CAN BE a correlation where the training is done correctly. There are fatal problems that arise with the acceptance of this as a rationale for conflating the two measures.
    First, even under optimal training conditions, a predictive coding system will assign very high ranks to any documents similar to any training documents whether they are important or not. This is not true of a real relevance ranking system.
    Also, there is the problem of reliance on an assertion. In audit, compliance and monitoring, there is a rule: when you are measuring whether an operation has achieved compliance, e.g. with a control, you can’t use a measure that requires you to rely on an unproven assertion of some type of conduct in that operation. Everything (at least the important things) must be measured, or the reliability of the resulting measure is shot. Probability ranking used as a measure of relevance level fails where it relies upon an unproven claim of “good workmanship”. To validly use probability as a substitute for relevance level, there has to be an additional set of measures, including measures related to the seed and training sets – and the collection of these measures has consequences in how reviews are performed. Until that happens, using probability scores as any general reliable measure of document relevance level is a fallacy, a fallacy that may go on for a while because mistakes are hidden in the predictive coding process. For a while.

  8. Jeremy Pickens says:

    And to the extent some do get it, they seem to be willing to join the chorus of advocates who conflate the two measures in promoting predictive coding.

    Let me publicly state that I, personally, do not conflate the two. And while I still of course advocate TAR/CAR, I see no need to conflate the two in order to still successfully tell the story of why the approach is useful.

    That is, what you are saying boils down (I think) to this:

    People are too casual in their language. They say that TAR brings the most relevant docs to the fore. In reality, TAR brings most of the relevant docs to the fore.

    Is that a fair thing to say?

    If so, then yes, perhaps we as a community do need to change our language, to be a little more precise. But even if TAR doesn’t yield the most relevant docs, rather most of the relevant docs, it’s still extremely useful, and still saves time and money, by focusing limited reviewer resources on most of the relevant docs, first. Just not necessarily on the most relevant docs.

    But consider this:

    If you are able to find most of the relevant docs, first, then even if the most relevant docs were at the tail end of most of the relevant docs, you will still have gotten to those most relevant docs long before you otherwise would have, in a standard manual linear review. D’ya see what I’m saying? So many of these vendor claims might just turn out to be true after all.. just indirectly so, if you see my twisted logic.

    Still, I agree with you that we all should be more careful in exactly what is being claimed.

  9. Jeremy Pickens says:

    ..in other words, getting most of the relevant docs immediately will also get you the most relevant docs immediately, for the same reason that even a stopped clock is right twice a day 🙂

  10. Jeremy,

    “most of the relevant docs” will also get you “the most relevant docs”

    And there’s the rub.

    To the extent users employ ranking scores to percolate relevant items to the top early, I’m all for it. However, predictive coding’s allure is really not about triage but truncation … that is, reducing the set of documents that have to be reviewed. Recall rates will rarely include “most of the relevant documents.” (And that is putting aside the significant issue of the wide range of possible values for the estimated number of expected relevant documents in a corpus where prevalence is small.)

    And now that courts are beginning, without much thought, to use recall rates to gauge predictive coding performance (and I have issues with that too – but that’s another story) it is almost a certainty that you will see parties, pundits and unfortunately courts making the kooky, unsupported and unsupportable claim that “although recall was only 40% [by the way this percentage is not far off from recall accepted by the court in Biomet, and of course not nearly “most of the relevant documents” in the corpus] the selection of documents with the highest relevance scores ensures that they are those documents most relevant to the case.”

    Mark my words.

    While this will be in the short term a boon for corporate defendants who have things that are best left undisclosed, it’s not a valid way to conduct the redress of grievances.

    And when courts start parroting this bunkum, I will feel, as Will Ferrell’s character in Zoolander said so eloquently, “like I’m taking crazy pills!”

  11. Jeremy Pickens says:

    However, predictive coding’s allure is really not about triage but truncation … that is, reducing the set of documents that have to be reviewed.

    I don’t completely follow. If most of the relevant docs are at the top, then you can indeed do truncation, can you not?

    Recall rates will rarely include “most of the relevant documents”

    I’m not sure I follow what you mean by this statement. A rate of 20% recall will include 20% of the relevant docs. A rate of 63% recall will include 63% of the relevant docs. A rate of 80.487% will include 80.487% of the relevant docs. So what do you mean by recall rates rarely including most of the relevant docs?

    Oh, do you mean *sampled* recall rates rarely being correct? Is that what you mean?

    (And that is putting aside the significant issue of the wide range of possible values for the estimated number of expected relevant documents in a corpus where prevalence is small).

    Yes, I think you’re talking about the issue of sampled recall vs. actual recall. Um, I think we should take this off line, but I can show you data, especially with some low prevalence matters *for which I have 100% ground truth, meaning that I’m working with actual recall rates and not sampled recall rates*, which might change your mind about some of this. But we shouldn’t get into that here.

    And now that courts are beginning, without much thought, to use recall rates to gauge predictive coding performance (and I have issues with that too – but that’s another story)

    Oh, I also have issue with it, too. And yes, that is a different story. But again, are you talking about recall rates, or are you talking about sampled recall rates?

    “although recall was only 40% [by the way this percentage is not far off from recall accepted by the court in Biomet, and of course not nearly “most of the relevant documents” in the corpus] the selection of documents with the highest relevance scores ensures that they are those documents most relevant to the case.” Mark my words.

    I am marking your words, and I agree with you that these 40% aren’t necessarily the ones most relevant to the case.

    But let’s turn it around, and let’s suppose you could get 90% recall.. and provably so (what that means, let’s leave for a separate discussion.. but just assume for the moment that you’re able to truncate at a point above which 90% of the total number of available responsive documents are found.) Would you then not agree that there is a pretty good chance that this 90% contains the “most responsive” docs? Probabilistically, there is a low chance of the most relevant docs NOT being in this set.

    Again, what you are saying is that within that ranking, the ones at the top are not necessarily “more” relevant than the ones further down. True, true, true. But the more of the relevant docs you can get above that truncation point, the greater chance you will have of getting the ones most relevant to the case into the production set. N’est-ce pas? That’s really all I’m saying.

    Otherwise, I agree with you that courts, clients, and vendors should not be talking about the “more relevant” docs being at the top, if that’s not something that they’ve explicitly modeled.

    But mark my words also: It is possible to explicitly model that, if one desires to do so. And to measure whether or not that model (system) is getting the “most relevant docs” at the top. And that’s by using a metric known as NDCG rather than recall.

    I frankly don’t think that it’s the right metric for eDiscovery. But if people want to make that claim about the “more relevant” docs being at the top, and try to convince the industry that this is the way that it should be done, then they can indeed use NDCG to make that claim. They just, as we both agree, cannot use recall to make that claim directly.
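
    For anyone curious what that metric looks like, here is a minimal sketch of NDCG in its standard textbook form (the graded gain values below are hypothetical):

    import math

    def dcg(gains):
        # Discounted cumulative gain: graded relevance, discounted by log of rank.
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    def ndcg(gains):
        # Normalize by the DCG of the ideal ordering (best docs ranked first).
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal else 0.0

    # Graded relevance down a ranked list: 2 = hot, 1 = merely responsive, 0 = not.
    hot_docs_first = [2, 2, 1, 0, 1]
    hot_docs_buried = [0, 1, 1, 2, 2]
    print(round(ndcg(hot_docs_first), 2), round(ndcg(hot_docs_buried), 2))  # 0.99 0.66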

  12. Jeremy,

    Thanks again,

    Your last point first. Yes, I’ve read about some of the information gain metrics, and I think (based upon my somewhat limited review) the one you propose is a distinct improvement. If I understand it correctly though, it still does not fully measure what attorneys need to have measured. Three issues there: relevance here is not relevance there; relevance comes in sizes; and marginal relevance is what counts. That’s a whole ‘nother discussion that I’d love to get your thoughts on, and one that implicates where predictive coding needs to go to approach human review performance (when performance is measured validly). And I think it’s great that you’re actively working on this.

    To your main points:

    1. Predictive coding recall rates in the lab may reach 80%, but it may well be that in real cases they are lower. My point in relation to that was that there is no way to determine (and therefore to assert) that any specific proportion of important documents is contained in that subset. It seems obvious, as you state, that as recall approaches the 100% mark, that concern is lessened.

    2. I am taking issue with sample-based recall estimates and would really appreciate hearing about how you avoid it. I also have some non-metric compliance ideas about how to defensibly demonstrate mitigation of missing documents in the very small world of high relevance documents.

    I’ll ping you offline.

  13. Jeremy Pickens says:

    relevance here is not relevance there

    I’m not sure what you mean by that. Do you mean that what the plaintiff means by relevance is not always what the defendant means by relevance? If so, yeah.. uh.. that’s just a tough nut to crack, no matter what your metric is. That’s really beyond the scope of these issues.

    relevance comes in sizes

    Am also not completely sure what you mean by that. Do you mean that there are levels, or gradations, in relevance? If so, that’s exactly what NDCG was designed to measure.

    and marginal relevance is what counts.

    Yes, sure. But until we figure out how to measure marginal relevance.. I mean really measure it in a semantic rather than information theoretic manner.. wouldn’t you agree that being able to get to 90% recall does indeed capture most of that marginal relevance? I can always think of edge cases.. For example suppose there are 11 unique, semantically diverse responsive documents, but one of those 11 has 89 exact duplicates (for a total of 100 docs, but only 11 “unique” docs). Then you could turn over that doc and its 89 companions and get to 90% recall.. and only be at 1/11 = 9.1% marginal relevance. Which of course is not the spirit of what we’re after here.

    My point in relation to that was that there is no way to determine (and therefore to assert) that any specific proportion of important documents is contained in that subset.

    There is no way to 100% prove that any specific proportion of important documents is contained in that subset. But you can make probabilistic claims. For example, suppose only 5% of the responsive documents really matter, are really the “most responsive” ones. And I’ve given you 90% of the responsive docs from the collection. If you make the assumption that the “more responsive” docs are no more (but also no less!) likely to appear in that produced 90%, then you can actually, and with confidence, make statements about the probability of all 5% of those most responsive docs appearing in the 90%, the probability of 4% of those most responsive docs appearing in the 90%, the probability of 3%, etc.

    What is “good enough”? Not for me to say. But one can make assertions about these sorts of things, even if they’re probabilistic assertions.
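
    As a rough sketch of the kind of probabilistic assertion I mean (the numbers are hypothetical, and the whole calculation rests on the uniform-likelihood assumption stated above):

    from math import comb

    def prob_at_least(total_responsive, hot, produced, k):
        # P(at least k of the "hot" docs land in the produced set), hypergeometric,
        # assuming hot docs are no more and no less likely than other responsive
        # docs to make it into the production.
        denom = comb(total_responsive, produced)
        return sum(comb(hot, j) * comb(total_responsive - hot, produced - j)
                   for j in range(k, hot + 1)) / denom

    # Hypothetical: 1,000 responsive docs, 50 of them hot (5%), 900 produced (90% recall).
    for k in (50, 45, 40):
        print(k, round(prob_at_least(1000, 50, 900, k), 3))
    # Prints roughly 0.004, 0.6, and 0.99+ for k = 50, 45, 40 under these assumptions.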

    I am taking issue with sample-based recall estimates and would really appreciate hearing about how you avoid it. I also have some non-metric compliance ideas about how to defensibly demonstrate mitigation of missing documents in the very small world of high relevance documents.

    Yup, would be happy to share offline. Best.

    • Bill Dimm says:

      If you make the assumption that the “more responsive” docs are no more (but also no less!) likely to appear in that produced 90%, then you can actually, and with confidence, make statements about the probability…

      That’s a whopper of an assumption, don’t you think? Consider this analogy: You ask 100 students a true-false question where the correct answer is true. If you assume each student flips a coin to pick an answer, you can compute the probability that 50, or 40, or 30 will answer true. It sounds very precise, very scientific, and … very wrong. Lots of precise calculation piled on top of an unjustified assumption. If the question is easy, the number of students that will get it right will be much higher than your calculation indicates is likely. If it is a trick question, the number will be much lower.
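
      To put rough numbers on that analogy (a quick sketch; the 0.9 and 0.1 values are just illustrative stand-ins for an easy question and a trick question):

      from math import comb

      def binom_pmf(n, k, p):
          # Probability that exactly k of n students answer "true" when each
          # answers "true" independently with probability p.
          return comb(n, k) * p**k * (1 - p)**(n - k)

      def prob_between(n, lo, hi, p):
          return sum(binom_pmf(n, k, p) for k in range(lo, hi + 1))

      # Under the coin-flip assumption the calculation looks precise: about a 73%
      # chance that 45-55 of 100 students answer "true"...
      print(round(prob_between(100, 45, 55, 0.5), 2))
      # ...but if the question is actually easy (p = 0.9) or a trick (p = 0.1),
      # that same "precise" prediction is almost certainly wrong.
      print(round(prob_between(100, 45, 55, 0.9), 6), round(prob_between(100, 45, 55, 0.1), 6))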

      Turning back to predictive coding, if you randomly selected 90% of the responsive documents you could calculate the probabilities of finding various quantities of hot documents, but there is nothing random about how a predictive coding algorithm selects the 90% — it is completely systematic. The algorithm will either favor the hot documents (the students were asked an easy question) or discriminate against them (the students were asked a trick question), depending on the nature of the hot documents and the algorithm.

      To take a pessimistic example, suppose your document set contains a small number of hot documents and none of them are in the training set. The predictive coding algorithm might very well give the hot documents medium scores rather than very high or very low scores because the hot documents are not highly “similar” (the meaning of that word depends on the algorithm) to any of the documents in the training set, i.e. the algorithm has little confidence about them because they are different from what it has seen during training. Which side of the cutoff will the hot documents fall on? If responsive documents talk about “Project X” and a hot document says “I shredded all the documents from Project X” the hot document may make the relevance score cutoff. On the other hand, if the hot document says “I shredded all the documents” (no mention of Project X) it may fall below the cutoff because some emails talking about shredded beef were marked as non-responsive, causing “shredded” to be seen as a negative indicator (look for emails with subject “Chili contest” in the Enron dataset).

      If all of your hot documents are near-dupes of each other their relevance scores will be nearly equal, so all of them will either fall above the cutoff or below it (unless you have very weird luck). You won’t get (approximately) 90% of the hot documents above the cutoff just because 90% of the responsive documents are above the cutoff.

      • Jeremy Pickens says:

        Excellent points, Bill. And I don’t necessarily disagree with any of them. But let me expand on some of the implications of some of these points.

        In summary: You note that the algorithm will favor (rank) certain documents above others, because of the nature of the algorithm. And that certain docs will get ranked above others due to the nature of the hot docs themselves. And then you expand upon this by noting that it’s not just the vagaries of the nature of hot docs and/or algorithm, but which documents have been used for training the algorithm.

        Make no mistake: I agree with all these points. Actually, I would carry one of your points even a bit further. You write: “The algorithm will either favor the hot documents (the students were asked an easy question) or discriminate against them (the students were asked a trick question), depending on the nature of the hot documents and the algorithm.” And I would expand this by saying that, depending on the nature of the hot docs and the algorithm, not only will the hot docs sometimes be favored and sometimes be discriminated against, but sometimes some of the hot docs will be favored and other hot docs will be discriminated against (because not all hot docs are necessarily on the same “topic”, or hot for the same reason). And sometimes the hot docs will neither be favored nor discriminated against.. they’ll just sorta receive middling scores (middling, relative of course to all the other plain jane responsive docs), i.e. it is just as possible that what makes a hot doc hot is orthogonal to what the CAR algorithm is modeling, which results in neither favoring nor discriminating.

        Right? I mean, wouldn’t you agree that all these outcomes are possible, for systems that are not explicitly designed to model the more responsive docs, but instead are only explicitly modeling more of the responsive docs?

        And if so, then all that I am saying is that, in expectation across the last hundred cases that you’ve dealt with, “the ‘more responsive’ docs are no more (but also no less!) likely to appear in that produced 90%.”

        On one case, the 90% will get all the hot docs. In another case, the 90% will get none of the hot docs. In the other 98 cases, the 90% will get a little, some, or a lot, but not all. Every case is going to be different. But, as you say, it can change from case to case: where those hot docs fall depends on the biases of the algorithm, the subjective and matter-dependent nature of what “hotness” means from case to case, and which documents actually got selected for training the system.

        What that means is that in expectation you can assume a uniformly random distribution, even if you cannot assume it for one particular matter.

        So perhaps that’s the only real point that we’re disagreeing on: Can you talk about this in expectation, or do you have to treat every single case separately? Since we’re talking about the CAR software here (Losey’s blog post title is: “Relevancy Ranking is the Key Feature of Predictive Coding Software”), I’m assuming that we can take the expected value across the hundreds of matters that software deals with, rather than just across a single matter.

        But if you want to make the counter point that you can’t take the expected value of a six-sided die (i.e. that it makes no sense to talk about the expected value of your die roll being 3.5, because 3.5 isn’t a real, possible outcome), then I would not be able to say anything against that.

        Do know, however, that on the big points you raise, I agree.

  14. Jeremy Pickens says:

    ..sorry, didn’t finish this thought:

    Then you could turn over that doc and its 89 companions and get to 90% recall.. and only be at 1/11 = 9.1% marginal relevance. Which of course is not the spirit of what we’re after here.

    What I meant to say was that yes, this would not be the spirit of what we’re after in eDiscovery. But what are the chances of this sort of thing happening.. of getting 90% of the total available responsive documents, but only getting less than 10% of the meaning? I find that to be a very small chance. So I understand the concern, but I think we can still alleviate it if we focus on getting as close to full recall (rather than to 40% recall) as we can. The latter solves the former, even if (I agree with you) it is an indirect solution.

  15. Bill,
    Great examples. It is heartening to see agreement on the nature of the measurements used and on the need to inform as to document importance and to present metrics that can actually provide adequate information to support rational proportionality assessments.

    Jeremy, I’d point out that by its nature metrics in predictive coding must be case specific. In statistics and probability you can use the term of art “expectation”. However, use it too cavalierly in litigation and a legal term of art appears: “malpractice” :).

    Jeremy, your repeated point about the effect of near duplicates is one we have discussed, and I think you are right on the money here. I agree that as recall approaches 1, the concern lessens. But there is a practical hitch here, based on two limiting questions: what is the practical method that you propose that will allow you to estimate within say 10% of the true relevant yield value (at 95% CL) so that when you say you have 90% recall, the possible range of the actual recall rate doesn’t include 60% through 80%; and, are there solutions that can cost-effectively work on the entire 30 million Biomet documents and thereby avoid the 50% recall loss from initial keyword usage?
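
    To put rough numbers on that first question (a purely hypothetical sketch; the Wilson score interval is just one common way to compute such a bound):

    import math

    def wilson_interval(hits, n, z=1.96):
        # 95% Wilson score interval for a sampled proportion (prevalence/richness).
        p = hits / n
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
        return center - half, center + half

    # Hypothetical numbers: a 1,000,000-doc corpus, a 2,400-doc random richness sample
    # with 48 relevant hits (~2% prevalence), and 18,000 relevant docs actually produced.
    corpus_size, sample_hits, sample_size, produced = 1_000_000, 48, 2400, 18_000

    lo, hi = wilson_interval(sample_hits, sample_size)
    yield_lo, yield_hi = lo * corpus_size, hi * corpus_size
    print(f"estimated yield: {yield_lo:,.0f} to {yield_hi:,.0f} relevant documents")
    print(f"nominal recall: {produced / (sample_hits / sample_size * corpus_size):.0%}")
    print(f"recall consistent with the sample: {produced / yield_hi:.0%} "
          f"to {min(1.0, produced / yield_lo):.0%}")
    # A nominal "90% recall" here is consistent with an actual recall in the high 60s.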

    I’d also point out that both of you have established the “bulk” nature of predictive coding when it comes to highly similar documents. And Jeremy, you’ve laid out the persuasive conceptual argument about measuring discovery performance as a function of novel relevant information. I think there are irrefutable corollaries that arise: where predictive coding operates anywhere under 90%, the redundant, error-embedded human review alternative provides a much more robust production; and the reports of comparative performance are certainly fatally flawed, so reliance upon them as part of a judicial proportionality assessment is not rationally justified.

    • Jeremy Pickens says:

      I agree that as recall approaches 1, the concern lessens. But there is a practical hitch here, based on two limiting questions: what is the practical method that you propose that will allow you to estimate within say 10% of the true relevant yield value (at 95% CL) so that when you say you have 90% recall, the possible range of the actual recall rate doesn’t include 60% through 80%

      That’s an important question.. whether your recall point estimate is correct, or whether you really can only establish a range. But I see that question as orthogonal to the “more relevant vs more of the relevant” issue. That is, finding 90% of the responsive docs is separate from proving that you’ve found 90% of the responsive docs. And so whether or not you can prove it, if you’ve indeed found 90%, the fear of not having gotten the “more responsive” docs is indeed allayed, is it not?

      Let me turn this around, and ask either you or Bill or Ralph or all of you:

      If someone were producing documents to you, and gave you the following choice between two ways of going about the process, which would you choose, which would you agree to in court? This is a thought experiment more than anything, so of course you can poke holes in it. I’m looking simply for a general sense of what it is folks want.

      (1) The defendant has CAR software that doesn’t just rank by relevancy (i.e. gets more of the responsive docs at the top), but ranks by degree of relevance (i.e. gets the more responsive docs at the top). And so the defendant says to you that because he does degree-of-relevancy ranking and that it’s more difficult (proportionality and all that), you’re only going to get 40% of the available responsive docs. But those 40% will be the “more responsive” ones.

      (2) The defendant has CAR software that only ranks by relevancy, but makes no distinction between degrees of relevancy. Oh, it still produces a relevancy ranking, as Losey desires, but as we’ve discussed above, we have no way of telling in this particular matter, due to the vagaries of hotness, of explicit and implicit biases in the algo itself, etc., whether the hot docs are at the top, middle, or bottom of this relevancy ranking, even if the relevancy ranking is doing a great job of ranking the relevant above the nonrelevant docs. Because of this, the client will agree to produce 90% of the responsive docs to you.

      So do you want the 40% production, but with the “most responsive docs” in that 40%? Or do you instead prefer the 90% production, but with no guarantee that the “most responsive docs” will be in that 90%?

      If you’re still stuck on the proof issue, then let me rephrase that question as: (1) The defendant can get you 40-60% of the “more responsive” docs, vs (2) The defendant can get you 70-90% of the responsive docs, with no guarantee that you’ll get the more responsive ones.

      Which would you prefer?

      • Jeremy Pickens says:

        I’d point out that by its nature metrics in predictive coding must be case specific. In statistics and probability you can use the term of art “expectation”. However, use it too cavalierly in litigation and a legal term of art appears: “malpractice” 🙂

        And yet.. and yet.. legal folks seem to be focused on what sort of algorithms every vendor is using. Time after time, I see surveys and polls and questions about whether folks are using support vector machines or naive bayesian classifiers or decision trees or whatever. Because folks want to know what the best algorithms are.

        In machine learning, you establish that one algorithm is better than another only via expectation, via averaging across dozens or hundreds of situations. For any one situation, one case, this or that algorithm might not be the best.

        So are you telling me that the whole industry is basically guilty of malpractice, because they’re trying to find out what the best algorithms are, in expectation?

        Or is it just a fact of life that it is reasonable to talk about things “in expectation”?

        I know this response might sound slightly antagonistic, but it’s not. Read it, if I may ask, while imagining me with a playful smile on my face.

  16. Bill Dimm says:

    So are you telling me that the whole industry is basically guilty of malpractice, because they’re trying to find out what the best algorithms are, in expectation?

    If you applied SVM to hundreds of cases and collected statistics about the number of responsive documents found, and then walked into court and said: “I have produced 90% of the responsive documents (with 95% confidence, no less!) for this case because I used SVM on this case (not because I actually tested the results for this specific case and measured the number of responsive documents, but based on the assumption that this specific case conforms to ‘expectations’)” then, yes, that sounds like malpractice to me (although I am not a lawyer). That would be like claiming that the husband is guilty of murdering his wife, no evidence needed to convict, because in X% of previous cases the husbands were indeed the killers.

    On the other hand, if you went into court and claimed that you found 90% of the responsive documents with 95% confidence because you actually did valid statistical testing on the data for this case, then the algorithm you used shouldn’t matter to the court.

    So, why fuss about what algorithm is “best” if it’s not relevant to whether or not you’ve performed your obligation to the court? Because the better algorithm produces an adequate result with less human review (e.g. fewer false positives to review), so it is cheaper. If you are going to buy a single piece of software that offers only one algorithm and use it on all of your cases, the one that is “best” on average will cost you the least on human review overall. If you have the option of picking different software/algorithms for different cases, then certainly optimize for each case individually instead of optimizing for the average.

    Sidenote: I think it is extremely misguided to choose software based on “Oh, it uses SVM (or whatever the sexy toy of the day is) so it must be great.” The classification algorithm is only one of many factors in how well the system will work (feature selection, training set choice, avoidance of overfitting, etc.). It’s like picking a car (not a CAR) based only on the engine (look, the wheels fell off!).

    • Jeremy Pickens says:

      If you applied SVM to hundreds of cases and collected statistics about the number of responsive documents found, and then walked into court and said: “I have produced 90% of the responsive documents (with 95% confidence, no less!) for this case because I used SVM on this case (not because I actually tested the results for this specific case and measured the number of responsive documents, but based on the assumption that this specific case conforms to ‘expectations’)” then, yes, that sounds like malpractice to me

      I agree, that sounds like malpractice to me, too. But that’s not what I am saying here. Rather, what I am saying is this:

      (1) There is a difference between relevancy rankings that rank documents based on their likelihood of responsiveness, and rankings that rank documents based on their degree of responsiveness. Actually, this point is Gerard’s point, not mine, but I agree with it.

      (2) However, even though there is a difference in the two types of relevance ranking, the former is strongly correlated with the latter. In other words, even if you have an algorithm that ranks by likelihood of responsiveness, rather than by degree of responsiveness, the more responsive documents you get via that ranking, the higher the chance that the higher-degree relevancy documents are among them.

      So what I am saying is that if you have got a hundred cases under your belt, and have found in every single case that this correlation actually is a positive one.. that as you find more responsive documents, you also find the higher degree responsive documents, then I see nothing wrong with walking in to court and saying that as you’ve produced more and more responsive documents, you’ve likely also produced more and more of the responsive ones. That’s all I’m claiming.

      Unless there has been a deliberate attempt in this specific case to hide the more responsive documents, I absolutely feel that we should be able to argue that this correlation is a positive one in a current case, and furthermore to do so based on the fact that we’ve seen the correlation to be a positive one on the past hundred cases.

      I’ll give you another example that drives this point home. Losey prefers CAR over boolean keyword search, am I correct? But why? Why would he, or anyone for that matter, argue that boolean (set based, non-ranked) retrieval needs to go away, and that CAR and relevancy ranking is the future? Boolean keyword search is capable of being 100% perfect, is it not? Of course it is. Suppose you have n responsive documents, each with m terms in them. All you need to do is construct a boolean query of the following form:

      q = ((term1 AND term2 AND term3) OR (term2 AND term5 AND term8) OR (term17 AND term1 AND term9) OR …)

      ..where the first conjunct contains all the terms from the first responsive document, the second conjunct contains all the terms from the second responsive document, etc. for all n responsive docs in the collection.

      I guarantee you that this one boolean keyword query will get you 100% precision at 100% recall. Absolutely guarantee it. (Aside: Let me give credit to Doug Oard for this particular example.. he made this point to me a few years ago, and I want to give credit where credit is due.)
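
      Here is a toy sketch of that query construction (purely illustrative; it ignores phrases, stemming, and tokenization details):

      def perfect_boolean_query(responsive_docs):
          # Build the disjunction-of-conjunctions described above: one conjunct
          # per responsive document, ANDing together every distinct term in it.
          conjuncts = []
          for doc in responsive_docs:
              terms = sorted({w.strip(".,!?").lower() for w in doc.split()})
              conjuncts.append("(" + " AND ".join(terms) + ")")
          return "(" + " OR ".join(conjuncts) + ")"

      docs = ["Shred the documents and put the chicken away.",
              "The debenture closes Friday."]
      print(perfect_boolean_query(docs))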

      So why would anyone argue against boolean keywords, as so many practicing lawyers do? The reason they argue against it is that, in expectation, people are not very good at constructing that “perfect” boolean query above. Experience, historical evidence, has shown, across hundreds of cases, that what humans are capable of doing with boolean keyword search falls far short of what boolean keyword search is capable of. So given this historical evidence, lawyers argue against boolean keyword search.

      So does this make them guilty of malpractice? They are basing their recommendation to not allow boolean keyword search on past cases, not on this case. Who is to say that somehow the people running the boolean searches on this case won’t do a fantabulous job? Who is to say that this case won’t be different than historical cases?

      But they do recommend against it, and I believe they are correct in so doing. And they are using expected values to do this argumentation. And really should not be considered guilty of malpractice by so doing, in my (also nonlawyer) opinion.

      • Jeremy Pickens says:

        Ugh, I type too fast. Let me correct one sentence here:

        So what I am saying is that if you have got a hundred cases under your belt, and have found in every single case that this correlation actually is a positive one.. that as you find more responsive documents, you also find the higher degree responsive documents, then I see nothing wrong with walking in to court and saying that as you’ve produced more and more responsive documents, you’ve likely also produced more and more of the MORE responsive ones. That’s all I’m claiming.

      • Bill Dimm says:

        I guarantee you that this one boolean keyword query will get you 100% precision at 100% recall. Absolutely guarantee it.

        I really don’t want to spend more time on this since I have work to do, but if you’re going to goad me on with words like “absolutely guarantee it”…

        doc1: Shred the documents and put the chicken away.
        doc2: Shred the chicken and put the documents away.

        Assuming that doc1 is responsive and doc2 isn’t, your query construction gets you 100% recall, but not 100% precision (both documents are returned), unless I’ve completely misunderstood it.

        I’ll address the other issues in a response to your other post later.

      • Jeremy Pickens says:

        Heh, that’s great. Love it. Fun example.

        So let me modify my boolean query, but still keep it a boolean query, which does not alter my main point about boolean searching. One feature of many boolean systems is the “ordered distance” operator. It takes one integer and two terms as parameters, and returns “true” if the two terms are found in a document in the same order as they appear in the function, within a window of the size of the integer. And false otherwise.

        Let’s use the notation #od(n, x, y) for this, where n is the window size, and x and y are the terms.

        Thus, my boolean query becomes:

        q = (
        (#od(1, term1, term2) AND #od(1, term2, term3)) OR
        (#od(1, term2, term5) AND #od(1, term5, term8)) OR
        (#od(1, term17, term1) AND #od(1, term1, term9)) OR
        …)

        Schachmatt? 😉

      • Jeremy Pickens says:

        Wait, wait, I already have your counter example. Suppose document #1 is:

        [Woman: without her, man is nothing.]

        And document #2 is:

        [Woman, without her man, is nothing.]

        And suppose doc1 is responsive and doc2 is not.

        Then my #od operator will still not get you 100% precision. Ok, ok. Am guilty of hyperbole.

        Let me say this: Even if you get 100% recall at 99.996% precision, the point is that lawyers still, I believe correctly, are moving away from keyword-only approaches. And they’re doing it not because keyword-only approaches are incapable. You can get extremely high precision at extremely high recall, higher than people are currently getting in CAR systems, if you have the right query.

        But what we’ve learned, through not just 10 years of eDiscovery, but 50 years of Information Retrieval research, is that, in expectation, people are not good at finding that right query.

        Expectations still guide us.

  17. Bill, Jeremy,
    I think two separate and valid things are being argued here.

    In any competitive industry, competitors should and do try to find a way to slice bread “better”. For example, in predictive coding, that might mean an algorithm that on average gets a user to a valid completion metric in fewer turns (thereby reducing the dear costs of additional training document review). Being able to state that tests demonstrate, over a large enough set of trials, that a particular solution “lifts” faster and/or produces a more contextually robust production set is valuable, and valid, because it offers a client the verifiable chance that the solution will save them money and reduce risk of discovery failure.

    I agree with both of you that there’s a lot of noise around algorithms. As someone from the outside who has spent at least 25% of the proverbial 10,000 hours to understand what lies beneath the hood of predictive coding techniques, it’s clear to me that any predictive coding solution that could boast recall in an amount equal to the percentage of people talking about predictive coding details who don’t really understand it would be eligible for text mining’s version of the Heinlein award.

    Over time, better solutions should allow the vendors that provide more efficient solutions to reduce prices; those solutions should become more attractive. There may be a day when the track records of solutions permit a party to be able to argue, “X vendor can reliably get to validation threshold in under $__MM (and will contract to do so sight unseen) because they use their patented DeLorean flux capacitor, so your honor, opponent’s unreasonable burden and cost argument fails.”

    However, because that warranty isn’t currently available, the only issue that really matters currently is: did the technique hit the threshold that was set? And that distinct individual analysis is indifferent to algorithm performance over a distribution.

    • Jeremy Pickens says:

      I think two separate and valid things are being argued here.

      Actually, I think I count at least seven separate and valid things 🙂

      There may be a day when the track records of solutions permit a party to be able to argue, “X vendor can reliably get to validation threshold in under $__MM (and will contract to do so sight unseen)

      Yeah, I am not recommending going that far, just quite yet, either. I’m not saying anything about being able to get to this or that validation threshold. Rather, I’m making the softer claim that as the number of responsive documents found increases, the number of high degree responsive documents also increases, even if your relevancy ranking algorithm is ranking by likelihood of relevance (more of the responsive docs) rather than degree of relevance (the more responsive docs).

      So I probably confused things earlier by giving a specific example with a hard boundary: 90% or 40% or whatever. Forget about any specific x% recall, and forget about proving that you’ve gotten to x% recall. All I am saying is that as you go from x% recall to (x+k)% recall (whether or not you can prove it), where k is a non-negative number, you will go from getting y% of the highly responsive documents to getting (y+j)% of the highly responsive documents, where j is a non-negative number.
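
      Here is a quick, purely synthetic simulation of that claim: the ranker scores documents only on a noisy responsive-vs-not signal (likelihood, not degree), yet recall of the highly responsive subset still rises together with overall recall:

      import random
      random.seed(0)

      # 0 = non-responsive, 1 = responsive, 2 = highly responsive (synthetic mix)
      docs = []
      for _ in range(10_000):
          degree = random.choice([0, 0, 0, 1, 1, 2])
          # the ranker sees only a noisy responsive/non-responsive signal, not the degree
          score = (1.0 if degree > 0 else 0.0) + random.gauss(0, 0.5)
          docs.append((score, degree))

      docs.sort(key=lambda d: d[0], reverse=True)
      total_resp = sum(1 for _, deg in docs if deg > 0)
      total_high = sum(1 for _, deg in docs if deg == 2)

      for cutoff in (1000, 2000, 3000, 4000, 5000):
          top = docs[:cutoff]
          recall = sum(1 for _, deg in top if deg > 0) / total_resp
          high_recall = sum(1 for _, deg in top if deg == 2) / total_high
          print(f"top {cutoff}: recall={recall:.2f}, highly-responsive recall={high_recall:.2f}")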

      I don’t understand why that claim is so contentious.

  18. Jeremy Pickens says:

    So, why fuss about what algorithm is “best” if it’s not relevant to whether or not you’ve performed your obligation to the court? Because the better algorithm produces an adequate result with less human review (e.g. fewer false positives to review), so it is cheaper. If you are going to buy a single piece of software that offers only one algorithm and use it on all of your cases, the one that is “best” on average will cost you the least on human review overall.

    Yes, yes, I see what you’re saying. Again, don’t necessarily disagree. But what is that court obligation? Is it to produce the responsive documents with the highest probative value? Or is it to produce x% of all available responsive documents?

    It’s typically the latter, correct?

    And yet there is an assumption — the same correlative assumption that I keep defending — that by getting x% of the responsive documents in a relevancy ranking approach, you’ll also be getting the ones with highest probative value. Losey even makes this assumption above in his original blog post (aside: Hi Ralph… hope we’re not clogging up your blog too much here 🙂), where he writes:

    Ranking also makes proportionality a doctrine that is palatable to both producing and receiving parties. It facilitates efficiency, and provides everyone the most bang for the buck, or to be precise, the most documents with the highest probative value for the buck.

    The claim is being made here that the documents at the top of a relevancy-ranked list are also the ones with the highest probative value.

    And yet when was the last time any defendant was asked to prove that these were the docs with the highest probative value? Even if the client can prove that they’ve achieved 92% recall, how do we know that in that 8% remaining the highest probativity isn’t lurking? As far as I’m aware, this concern isn’t even on the industry’s radar, no one is even thinking to ask the question. In terms of statistical testing of the quality of the outcome, people are only doing testing to make sure that they’re close to x% recall. They’re not doing the testing to make sure that they haven’t missed k% probativity.
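
    For concreteness, a rough sketch of what such a test might look like: a “probative recall” measure next to ordinary binary recall, computed over an invented document set with made-up probative weights:

    def binary_recall(produced, responsive):
        return len(produced & responsive) / len(responsive)

    def probative_recall(produced, weights):
        """Share of total probative weight captured by the produced set."""
        total = sum(weights.values())
        return sum(w for doc, w in weights.items() if doc in produced) / total

    responsive = {"d1", "d2", "d3", "d4", "d5"}
    weights = {"d1": 1, "d2": 1, "d3": 1, "d4": 2, "d5": 10}  # d5 is the lurking hot document
    produced = {"d1", "d2", "d3", "d4"}                       # 80% binary recall

    print(f"binary recall:    {binary_recall(produced, responsive):.2f}")  # 0.80
    print(f"probative recall: {probative_recall(produced, weights):.2f}")  # 0.33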

    So why is this happening? It’s happening, I think, because people intuitively understand, even if they haven’t consciously thought about it, that there is a positive correlation between increasing recall and increasing probative recall. The closer one can get to getting everything that is responsive, the closer one can get to getting everything that is highly responsive.

    But again, the only reason we currently accept that line of reasoning, and it seems like at least some non-malpracticing lawyers like Losey do, is because in expectation we’ve found that to be the case.

  19. “Even if the client can prove that they’ve achieved 92% recall, how do we know that in that 8% remaining the highest probativity isn’t lurking? As far as I’m aware, this concern isn’t even on the industry’s radar, no one is even thinking to ask the question. ”

    Ahem…actually, Jeremy, I have been — for a while — in part as a result of the 2006 Advisory Committee’s discussion of the importance of discovery as a key aspect of proportionality analysis (my, how things have changed). In addition, there is literature that addresses document importance and the intersection of litigation relevance, proportionality, and the tying of “burdensomeness” to discovery importance.

    The fact that litigators and judges aren’t speaking of it more – yet at least – is more a testament to the novel nature of analytics, the prevalence (excuse) of Luddism, and aggressive marketing.

    To bring this full circle (at least for me) and return to Ralph’s blog: I can accept that at indisputably high retrieval levels (again reserving my concern about estimate ranges), the distinction between relevance level and relevance prediction confidence loses significance because most documents have in fact been retrieved. However, in the other, much more likely case of lower recall, I would argue that a rational proportionality assessment should in fact require that the false equivalence be recognized and rejected as potentially misleading, and should require both content diversity measures and the demonstration of additional efforts, roughly speaking inversely proportionate to recall level, to specifically assure that “lurking” important documents have not been overlooked, and that very likely would not have been overlooked in a human review.

    Thanks Ralph, and thanks Jeremy and Bill for the dialectic.

  20. Jeremy Pickens says:

    Ahem…actually, Jeremy, I have been — for a while

    Twice in the same 30-minute period I am (correctly) called out for hyperbole. I again publicly stand corrected.

    The fact that litigators and judges aren’t speaking of it more – yet at least – is more a testament to the novel nature of analytics, the prevalence (excuse) of Luddism, and aggressive marketing.

    For what it’s worth, while Boolean keyword searching is 50+ years old, some of these degree-of-relevance metrics only go back about 10 years, and some only 2–3 years (here is an interesting paper you might want to read: http://www.ccs.neu.edu/home/ekanou/research/papers/mypapers/sigir10a.pdf).

    Aside: Degree of relevance has been acknowledged for over half a century in the IR community — see Tefko Saracevic, page 327 of this article from 1975, where he cites a 1958 conference at which it was suggested that relevance be a matter of degree: http://comminfo.rutgers.edu/~tefko/Saracevic_relevance_75.pdf

    But metrics to help with assessing the quality of outcome in relation to degree of relevance are, in my understanding, still relatively new. So you’re probably right about novelty being at least one factor.
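
    To make that concrete, one long-standing graded-relevance measure from the IR literature is nDCG (normalized discounted cumulative gain). The sketch below is only an illustration of evaluating by degree of relevance rather than by a binary responsive flag; it is not necessarily the metric used in the papers cited above:

    import math

    def dcg(grades):
        """Discounted cumulative gain for a ranked list of relevance grades."""
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

    def ndcg(ranked_grades):
        ideal = sorted(ranked_grades, reverse=True)
        return dcg(ranked_grades) / dcg(ideal)

    # grades: 0 = non-responsive, 1 = responsive, 2 = highly responsive (invented ranking)
    system_ranking = [1, 2, 0, 1, 0, 2, 0]
    print(f"nDCG = {ndcg(system_ranking):.3f}")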

    However, in the other, much more likely case of lower recall, I would argue that a rational proportionality assessment should in fact require that the false equivalence be recognized and rejected as potentially misleading, and should require both content diversity measures and the demonstration of additional efforts, roughly speaking inversely proportionate to recall level, to specifically assure that “lurking” important documents have not been overlooked, and that very likely would not have been overlooked in a human review.

    I am opposed to none of what you suggest here.

    Thank you as well for the spirited discussion.

  21. […] Losey’s recent article “Relevancy Ranking is the Key Feature of Predictive Coding Software” generated some debate and controversy reflected in the readers’ comments.  To […]

  22. […] Losey, R. Relevancy Ranking is the Key Feature of Predictive Coding Software found at http://e-discoveryteam.com/2013/08/25/relevancy-ranking-is-the-key-feature-of-predictive-coding-soft…. Relevancy ranking only works well now with the best software on the market and requires proper, […]

  23. […] [13] Losey, R., Relevancy Ranking is the Key Feature of Predictive Coding Software found at http://e-discoveryteam.com/2013/08/25/relevancy-ranking-is-the-key-feature-of-predictive-coding-soft…. […]

  24. […] Losey’s recent article “Relevancy Ranking is the Key Feature of Predictive Coding Software” generated some debate and controversy reflected in the readers’ comments. To appreciate the […]

  25. […] Relevancy Ranking is the Key Feature of Predictive Coding Software. […]
