17 Responses to Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Two

  1. […] When it comes to predictive coding training, the “fewer reviewers the better” - Parts One, Two, and […]

  2. Jeremy Pickens says:

    Ok, I’ve got about a hundred comments I could make, but I’m going to try to be good and keep it short and small.

    First thing that struck me. You wrote:

    “Recent data obtained by the Electronic Discovery Institute in their Oracle project, may, however, make it possible for scientists to make such evaluations in the future. Bay, M., EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov., 2013). As reported in the Monica Bay article the data on inconsistencies and number of reviewers used by each participating team may make it possible for scientists to now prove, or disprove, my theory that less is more, that consistent input by bona fide experts, SMEs, is critical to attaining comparatively high performance in real-world legal search projects.”

    When I read the LTN report, it said this:

    The study “considered multiple evaluation systems using litigation data from real high-stakes litigation where the producing party was confident that it conducted a meticulous attorney-based document review to respond to the document request,” he explained. Said Oracle’s Chakraborty: “The document review originally conducted by outside counsel in the study matter was a rigorous undertaking that included a thorough review with multiple quality checks. The review team was comprised of both law firm associates and contract attorneys.”

    Perhaps I am reading this wrong, but it sounds like the final ground truth in the Oracle study was established, in part, by contract attorneys.

    So look, I have no problem with using quality-checked contract reviewers to judge documents. But part of your argument is that non-SMEs are even more unreliable. So how is it that you think non-SME ground truth can be used to validate different training regimens? Isn’t that a paradox?

    • Ralph Losey says:

      I do not know any details of the Oracle production or review aside from the LTN quote, but just because contract reviewers were used in the review does not mean they were driving the CAR, nor even making any “final” relevance determinations at all. You make a big assumption there. What we do know is that this was a state-of-the-art, real world, litigation production. It was thus an excellent comparator. Moreover, to my knowledge we have never had a test collection like that before. TREC collections must by regulation be public and the test runs are all done with artificial hypotheticals and volunteer reviewers. The Verizon study was close, as it was real world, but it was not in a litigation context, just a document review for merger approval, and was, I think, done with high time pressures and hundreds of reviewers.

      If you study my review best practices you will see that I still use contract lawyers too, just not for AI training, and not for final relevancy calls. Under my methods contract lawyers are primarily used only in what I call the “second pass” reviews, usually performed after the AI training is complete. They can make relevancy judgments (more accurately irrelevancy judgments) but they are always double checked by an SME or SME delegate. They mainly do redaction and privilege logging, and other very time consuming tasks for which an expensive SME is not needed.

      By the way, anything you would share concerning your “inconsistency smoothing” algorithmic work would be of interest to readers I am sure.

      • Jeremy Pickens says:

        “Under my methods contract lawyers are primarily used only in what I call the ‘second pass’ reviews, usually performed after the AI training is complete. They can make relevancy judgments (more accurately irrelevancy judgments) but they are always double checked by an SME or SME delegate.”

        No, I never assumed they were driving the CAR (as in, used for training data) in this scenario. I assumed Oracle had used them the way you are describing: To provide testing data. To see what the outcome of the process actually was. To provide the judgments that get used to determine final precision, recall, F1, etc. scores.

        Where I am a little doubtful is when you say that an SME always double checks these judgments. That doesn’t make any sense to me. If every document that the contract reviewer is judging during this “second pass” (aka evaluation aka testing) phase is again judged by an SME, why bother with the contract reviewer in the first place? That’s just wasted time and effort, is it not?

        So I suspect that what Oracle has done, after having used contract reviewers in exactly the way you describe, for post-AI results evaluation, is to do some high level QC but let most of the judgments stand, as is. And if that’s indeed the case, then again, we’re in paradox land.

        It really would be nice if we could get someone from Oracle to comment on this, because otherwise it is very difficult to reach any conclusions if we don’t know what actually happened.
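
For readers less familiar with the precision, recall and F1 scores mentioned in this comment, here is a minimal Python sketch of how they are computed from counts of correct and incorrect calls. The function name and the counts are hypothetical illustrations, not figures from the Oracle study, and the sketch takes the testing judgments as given, which is exactly the assumption being questioned above.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) measured against a set of testing judgments."""
    predicted_relevant = true_pos + false_pos   # everything the process marked relevant
    actually_relevant = true_pos + false_neg    # everything the testing judgments call relevant
    precision = true_pos / predicted_relevant if predicted_relevant else 0.0
    recall = true_pos / actually_relevant if actually_relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(true_pos=80, false_pos=20, false_neg=20)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

If the testing judgments themselves are unreliable, all three numbers inherit that unreliability, since each count is defined relative to those judgments.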

      • Ralph Losey says:

        You are assuming that contract lawyers make many relevance changes. Not in my world. That should not happen if you’re training the system properly and your document ranking is working correctly. For example, in one project I was involved with, the contract reviewers were finding about 2% relevant. It would have cost a small fortune for them to review it all, no matter what the discounted rate. Then I was brought in and did my predictive coding thing for a few days, and marked the whole set either relevant or not. The documents with a predicted probability of 50% or more were marked relevant, and the contract lawyers then did second pass, where one of their jobs was to confirm my prediction of relevance.

        Then the contract lawyers found almost all were relevant at the very high end, and found 80% to 90% relevance in the top 20% where most of the predicted relevant were sorted to. They liked that. My surrogates and I then only had to double check their reversals from relevant to irrelevant. Then when the contract lawyers reached the relatively few docs in the 80% to 50% probable relevance range, the prevalence was lower. More reversals.

        We did not look at most of the docs less than probable relevant (49% and under). But we did sample the less than 50% majority of docs. That confirmed the predictions. Not perfectly mind you. I understand the limits of statistics and the remote, but real possibility of still finding relevant documents that you and other scientists are fond of pointing out (which you sometimes call the “long tail”), but it was confirmed within reason. That is, after all, what the law requires, reasonable, proportionate efforts, not perfection and mathematical certitude that many causal-type scientists are used to. I’m not saying you are one of “those” mind you. You appear to be more of a quantum relativity type to me! Law has always been there, dealing with probabilities and self-organization from chaos, not old-fashioned (and now disproven) Newtonian causality. We are used to probable relevance and not knowing for sure if the cat is dead.

        Anyway, that production I was talking about went very well. The documents needed for justice were all found, and then some, and the clients saved a lot of money in the process. Contract lawyers, myself, and a few of my surrogates worked hand in hand and the duplication of review was not too bad, and assured everyone of quality control.
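
The sampling of the under-50% documents described above can be illustrated with a small Python sketch that puts a confidence interval around the share of relevant documents found in such a sample. It uses the Wilson score interval, which is one common choice and not necessarily the method used on that project; the function name and the sample figures are hypothetical.

```python
import math


def wilson_interval(relevant_found, sample_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% confidence)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = relevant_found / sample_size
    denom = 1 + z ** 2 / sample_size
    center = (p + z ** 2 / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical sample: 1,500 low-ranked documents reviewed, 15 found relevant.
low, high = wilson_interval(relevant_found=15, sample_size=1500)
print(f"share of relevant documents left behind: between {low:.2%} and {high:.2%}")
```

The interval never collapses to a single point, which is the statistical face of the "within reason, not perfectly" standard described in the comment.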

      • Jeremy Pickens says:

        “By the way, anything you would share concerning your ‘inconsistency smoothing’ algorithmic work would be of interest to readers I am sure.”

        I will, but my goal is not to take over your blog :-) So, in another forum.

      • Jeremy Pickens says:

        “You are assuming that contract lawyers make many relevance changes.”

        Not.. quite. I’m.. well.. I think we’re talking at cross-purposes here. I mostly agree with the gist of what you’re saying (though I think there are still one or two hidden gotchas that you’re not considering), but that gist is not really what I’m talking about here. I take full responsibility for not being very good at explaining myself via comment text. I think this discussion would be better in person, with a whiteboard or napkin or something else that I could sketch on.

  3. Ralph,

    Two Desi-V papers, and a white paper by Jeremy, investigate the impact of training errors on predictive coding for document review. In a nutshell, the impact is “not much.”

    Jianlin Cheng, Amanda Jones, Caroline Privault and Jean-Michel Renders, Soft Labeling for Multi-Pass Document Review. http://www.umiacs.umd.edu/~oard/desi5/research/Cheng-final.pdf

    Johannes C. Scholtes, Tim van Cann, Mary Mack, The Impact of Incorrect Training Sets and Rolling Collections on Technology-Assisted Review. http://www.umiacs.umd.edu/~oard/desi5/additional/Scholtes.pdf

    (Sorry Jeremy, I don’t have a link to your work.)

    Gordon

    • Ralph Losey says:

      Thanks for the comment Gordon, who, in case any of my readers do not know, is another star scientist in the field of legal search. Sorry it took me so long to approve your comment, but I just noticed it now, even though you posted it several days ago. (Since it had links, it needed my approval as part of the spam filter.) I’ll check out the papers you mention. Already knew about Jeremy’s. Thanks again.

  4. Ralph,

    To date, I am aware of no study other than yours that has measured intra-assessor overlap on the same ediscovery review task. Certainly not Grossman & Cormack.

    In Grossman & Cormack, Maura did not re-review the several hundred documents she reviewed as topic authority from TREC 2009. Cormack reviewed a sample of 100, and of those 100, he disagreed with 10 of Maura’s judgements. Maura re-reviewed only these 10 documents. Of the 10, Maura held her ground on 5, coded 2 as “arguable,” and reversed herself on 3. Furthermore, Cormack held (prior to Maura’s re-review) that the 3 on which Maura reversed herself were “arguable.” (Note that “arguable” was not an option in the original TREC 2009 review so switching to arguable should not be scored as a reversal.)

    These ten documents comprise a tiny judgmentally sampled fraction of all the documents that Maura reviewed at TREC 2009, which themselves were a judgmentally sampled fraction of the review set. The documents that Maura reviewed at TREC were only those that were appealed and were hence controversial, and the ten of those that were selected by me were especially controversial. You simply cannot conclude that she would have reversed herself this frequently had she conducted a second review of a representative set of documents.

    regards,
    Gordon
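
For readers who want to see what an "overlap" figure measures, here is a minimal Python sketch. It assumes overlap means the Jaccard index (documents marked relevant in both reviews, divided by documents marked relevant in either), which is a common definition in the assessor-agreement literature; the function name and document IDs are hypothetical.

```python
def jaccard_overlap(first_review, second_review):
    """Jaccard index of two sets of document IDs that were coded relevant."""
    a, b = set(first_review), set(second_review)
    union = a | b
    if not union:
        # Neither review marked anything relevant: treat as full agreement.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical example: the same reviewer codes the same documents twice.
first = {"doc1", "doc2", "doc3"}
second = {"doc2", "doc3", "doc4"}
print(jaccard_overlap(first, second))  # 2 shared out of 4 distinct -> 0.5
```

A figure like this is only meaningful when both reviews cover the same representative set of documents, which is the point made above about the ten re-reviewed documents.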

    • Ralph Losey says:

      Thanks for that comment and explanation. I have just corrected my blog explanation on that point accordingly. I consider the errors an anomaly and of no statistical importance since the sample was so small, and, as you point out, not at all representative of the topic collection.

  5. Ralph,

    Hi! Thanks for a great summary post of research findings on inter-assessor agreement.

    Note that Jeremy and I had a short paper at this year’s SIGIR in which we took the Voorhees dataset you describe here, and examined what effect the use of alternative (non-authoritative) assessors had upon the reliability of machine classification. The paper can be found here:

    http://www.williamwebber.com/research/papers/wp13sigir.pdf

    The takeaway finding was that using non-authoritative trainers meant that on average 25% more documents had to be reviewed in order to achieve the same level of recall (on this particular dataset, which admittedly is not very representative of what is found in e-discovery). This might, though, work out as cheaper overall if the non-authoritative trainers themselves were cheaper than the authoritative one.

    William
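
The cost trade-off William describes can be sketched in a few lines of Python. This is a deliberately simplified model with hypothetical names and prices, not the method of the SIGIR paper: it assumes a fixed number of training documents, a per-document price for the trainer and for the reviewers, and a 25% increase in the documents that must be reviewed when the trainer is non-authoritative.

```python
def total_cost(train_docs, trainer_rate, review_docs, reviewer_rate,
               review_inflation=1.0):
    """Training cost plus review cost; review_inflation scales the review volume."""
    return train_docs * trainer_rate + review_docs * review_inflation * reviewer_rate


# Hypothetical figures: 2,000 training documents, 50,000 documents to review.
authoritative = total_cost(2000, trainer_rate=10.0,
                           review_docs=50000, reviewer_rate=1.0)
non_authoritative = total_cost(2000, trainer_rate=1.0,
                               review_docs=50000, reviewer_rate=1.0,
                               review_inflation=1.25)

print(f"authoritative trainer:     {authoritative:,.0f}")
print(f"non-authoritative trainer: {non_authoritative:,.0f}")
```

Under this model the cheaper trainer wins only when the savings on training exceed the cost of the extra 25% of review, so the answer depends on how large the training set is relative to the review set.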

  6. […] how inconsistent human reviewers are, even when using search experts. See Less Is More, parts One, Two and Three. They still try to fix the old methods, and try to use human reviewers to measure what […]

  7. […] This is part-three of a three-part blog, so please read Part One and Part Two first. […]

  8. […] When it comes to predictive coding training, the “fewer reviewers the better” – Parts One, Two, and Three; and, Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using […]

  9. […] Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part Two, and search of Jaccard in my […]
