This is Part Three of a three-part blog post, so please read Part One and Part Two first.
The Losey Study on Inconsistencies Suggests a Promising Future for Active Machine Learning
The data from my Enron review experiment shows that relatively high consistency in relevance determinations is possible. The comparatively high overlap results achieved in this study suggest that the problem of inconsistent human relevance determinations can be overcome. All it takes is hybrid multimodal search methods, good software with features that facilitate consistent coding, good SME(s), and systematic quality control efforts, including compliance with the less is more rule.
I am not saying good results cannot be achieved with multiple reviewers too. I am just saying it is more difficult that way. It is hard enough to be of one mind on something as tricky as some document relevance decisions with just one reviewer. It is even more challenging to attain that level of attunement with many reviewers.
The results of my study are especially promising for reviews using active machine learning processes. Consistency in coding training documents is very important to avoid GIGO errors. That is because of the cascading effects of sensitivity to initial conditions that are inherent in machine learning. As mentioned, good software can smooth out inconsistency errors somewhat, but if the Jaccard index is too low, the artificial intelligence will be impacted, perhaps severely so. You will not find the right documents, not because there is anything wrong with the software, or anything wrong with your conception of relevance, but because you did not provide coherent instructions. You instead sent mixed messages that did not track your right conceptions. (But see the research reports of John Tredennick, CEO of Catalyst, whose chief scientist, Jeremy Pickens, is investigating the ability of their software to attain good rankings in spite of inconsistent machine training.)
The same thing can happen, of course, if your conceptions of relevance are wrong to begin with, as when you fail to use bona fide, objective SMEs to do the training. Even if the trainers' message is consistent, it may be the consistently wrong message. The trainers do not understand what the real target is, and do not know what it looks like, so of course they cannot find it.
The inexperienced reviewers lack the broad knowledge of the subject matter and the evidence required to prove the case, and they lack the deep understanding necessary for a correct conception of relevance. In situations like that, despite all of the quality control efforts for consistency, you will still be consistently wrong in your training. (Again, but see the research of Catalyst, where what they admit are very preliminary test results seem to suggest that their software can fulfill the alchemist's dream of turning lead into gold: taking intentionally wrong input for training and still getting better results than manual review, and even some predictive coding. Tredennick, J., Subject Matter Experts: What Role Should They Play in TAR 2.0 Training? (November 17, 2013). I will continue to monitor their research with interest, as data must trump theories, but for now remain skeptical. I am at a loss to understand how the fundamental principle of GIGO could be overcome. Does anyone else who has read the Catalyst reports have any insights or comments on their analysis?)
One information scientist I spoke with on the principle of GIGO and machine training, William Webber, explained that it might not matter too much if your trainer makes some mistakes, or even quite a few mistakes, if the documents they mistakenly mark as relevant nevertheless happen to contain similar vocabulary as the relevant documents. In that case the errors might not hurt the model of “a relevant vocabulary” too much. The errors will dilute the relevance model somewhat, but there may still be sufficient weight on the “relevant terms” for the overall ranking to work.
William further explained that the training errors would seriously hurt the classification system in three situations (which he admits are a bit speculative). First, errors would be fatal in situations where there is a specialized vocabulary that identifies relevant documents, and the trainer is not aware of this language. In that case key language would never make it into the relevance model. The software classification system could not predict that these documents were relevant. Second, if the trainers have a systematically wrong idea of relevance (rather than just being inattentive or misreading borderline cases). In that case the model will be systematically biased (but this is presumably the easiest case to QC, assuming you have an SME available to do so). Third, if the trainers flip too many relevant documents into the irrelevant class, and so the software classifier thinks that the “relevant vocabulary” is not really that strong an indicator of relevance after all. That is a situation where there is too much wrong information, where the training is too diluted by errors to work.
Consistency Between Reviews Even Without Horizontal Quality Control Efforts
In my Enron experiment with two separate reviews I intentionally used only internal, or vertical, quality control procedures. That is one reason that the comparatively low 27% relevance inconsistency rate is so encouraging. There may have been some inconsistencies in coding similar documents within the same project, but not in coding the same document. That is because the methods and software I used (Kroll Ontrack's Inview) made such errors easy to detect and correct. I made efforts to make my document coding consistent within the confines of both projects. But no efforts were made to try to make the coding consistent between the two review projects. In other words, I made no attempt in the second review to compare the decisions made in the first review nine months earlier. In fact, just the opposite was true. I avoided horizontal quality control procedures on purpose in the second project to protect the integrity of my experiment to compare the two types of search methods used. That was, after all, the purpose of my experiment, not reviewer consistency.
I tried to eliminate carryover of any kind from one project to the next, even simple carryover like consulting notes or re-reading my first review report. I am confident that if I had employed quality controls between projects the Jaccard index would have been even higher, that I would have reduced the single reviewer error rate.
Another artificial factor inflating the error rate between the two reviews was my use of a different, inferior methodology in the second review. Again, that was inherent in the experiment to compare methods. But the second method, a monomodal review method that I called a modified Borg approach, was a foreign method to me, and one that I found quite boring. Further, the Borg method was not conducive to consistent document reviews because it involved skimming a high number of irrelevant documents. I read 12,000 Enron documents in the Borg review and only 2,500 in the first, multimodal review. When using my normal methods in the first review I found 597 relevant documents in the 2,500 documents read. That is a prevalence rate of 24%. In the Borg review I found 376 relevant documents in the 12,000 documents read. That is a prevalence of only 3.1%. That kind of low prevalence review is, I suspect, more likely to lead to careless errors.
I am confident that if I had employed my same preferred hybrid multimodal methods in both reviews, that the consistency rate would have been even higher, even without additional quality control efforts. If I had done both, consistent methods and horizontal quality controls, the best results would have been attained.
In addition to improving consistency rates for a single reviewer, quality controls should also be able to reduce inconsistencies between multiple reviewers, at least in so far as the SME's expertise can be transmitted to them. That in turn depends in no small part on whether the Grossman Cormack theory of review error causation is true, that inconsistencies are due to mere human error, carelessness and the like, as opposed to prior theories that relevance is always inherently subjective. If the subjective relevance theories are true, then everyone will have no choice but to use just one SME, who had better be well attuned to the judge. But, as mentioned, I do not believe the theory that relevance is inherently subjective, so I do think multiple reviewers can be used, so long as there are multiple safeguards and quality controls in place. It will just be more difficult that way, and probably take longer.
How much more difficult, and how much longer, depends in part on the degree of subjectivity involved in the particular search project. I do not see the choice of competing theories as being all or nothing. Grossman and Cormack in their study concluded that only five percent of the relevance calls they made were subjective. It may well be higher than that on average, but, there is no way it is all subjective. I think it varies according to the case and the issues. The more subjectivity involved in a project, the more that strong, consistent, SME input is needed for machine training to work successfully.
Crowd Sourcing Does Not Apply to Most Predictive Coding Work
Some think that most relevance determinations are just subjective, so SMEs are not really needed. They think that contract review lawyers will work just as well. After all, they are usually intelligent generalists. They think that more is better, and do not like the results of the studies I have discussed in this article, especially my own success as a Less is More Army of One type predictive coder. They hang their theories on crowd sourcing, and the wisdom of the crowd.
Crowd sourcing does work with some things, but not document review, and certainly not predictive coding. We are not looking for lost dogs here, where crowd sourcing does work. We are looking for evidence in what are often very complex questions. These questions, especially in large cases where predictive coding is common, are usually subject to many arcane rules and principles of which the crowd has no knowledge, or worse, has wrong knowledge. Multiple wrongs do not make a right.
Here is a key point to remember on the crowd sourcing issue: the judge makes the final decisions on relevance, not the jury. Crowd sourcing might help you to predict the final outcome of a jury trial; juries are, after all, like small crowds with no particular expertise, just instructions from the judge. Crowd sourcing will not, however, help you to predict how a judge will rule on legal issues. Study of the judge's prior rulings is a much better guide (perhaps along with, as some contend, what the judge had for breakfast). The non-skilled reviewers, the crowd, have little or nothing to offer in predicting an expert ruling. To put this mathematically, no matter how many zeros you add together, the total sum is always still zero.
Bottom line: you cannot crowd-source highly specialized skills. When it comes to specialized knowledge, the many are not always smarter than the few.
We all know this on a common sense level. Think about it. Would you want a crowd of nurses to perform surgery on you? Or would you insist on one skilled doctor? Of course you would want to have an SME surgeon operate on you, not a crowd. You would want a doctor who specializes in the kind of surgery you needed. One who had done it many times before. You cannot crowd source specialized skills.
The current facile fascination with crowd sourcing is trendy to be sure, but misplaced when it comes to most of the predictive coding work I see. Some documents, often critical ones, are too tricky, too subtle, for all but an experienced expert to recognize their probative value. Even documents that are potentially critical to the outcome of a case can be missed by non-experts. Most researchers critiquing the SME theory of predictive coding do not seem to understand this. I think that is because most are not legal experts, not experienced trial attorneys. They fail to appreciate the complexity and subtle nuances of the law in general, and evidence in particular.
They also fail to apprehend the enormous differences in skill levels and knowledge between attorneys. The law, like society, is so complex now that lawyers are becoming almost as specialized as doctors. We can only know a few fields of law. Thus, for example, just as you would not want a podiatrist to perform surgery on your eye, you would not want a criminal lawyer to handle your breach of contract suit.
To provide another example, if it were an area of law in which I have no knowledge, such as immigration law, I could read a hot document and not even know it. I might even think it was irrelevant. I would lack the knowledge and frame of reference to grasp its significance. The kind of quick training that passes muster in most contract lawyer reviews would not make much of a difference. That is because of complexity, and because the best documents are often the unexpected ones, the ones that only an expert would realize are important when they see one.
In the course of my 35 years of document review I have seen inexperienced lawyers fail to recognize, or misunderstand, key documents on numerous occasions, including myself in the early days, and, to be honest, sometimes even now (especially when I am not the first-level SME, but just a surrogate). That is why partners supervise and train young lawyers, day in and day out for years. Although contract review lawyers may well have the search skills, and be power-users with great software skills, and otherwise be very smart and competent people, they lack the all-important specialized subject matter expertise. As mentioned before, other experiments have shown that subject matter expertise is the most important of the three skill-sets needed for a good legal searcher. That is why you should not use contract lawyers to do machine training, at least in most projects. You should use SMEs. At the very least you should use an SME for quality control.
I will, however, concede that there may be some review projects where an SME is not needed at all, where multiple reviewers would work just fine. A divorce case for instance, where all of the reviewers might have an equally keen insight into sexy emails, or sexting, and no SMEs are needed. Alas, I never see cases like that, but I concede they are possible. It could also work with simplistic topics and non-real-world hypotheticals. That may explain some of the seemingly contra research results from Catalyst that rely on TREC data, not real-world, complex litigation data.
Conclusions Regarding Inconsistent Reviews
The data from the experiments on inconsistent reviews suggest that when only one human reviewer is involved, a reviewer who is also an experienced SME, the overall consistency rates in review are much higher than when multiple non-SME reviewers are involved (contract reviewers in the Roitblat, Kershaw and Oot study) (77% v 16%), or even when multiple SMEs are involved (retired intelligence officers in the Voorhees study) (77% v 45% with two SMEs and 30% with three SMEs). These comparisons are shown visually in this graph.
These results also suggest that with one SME reviewer the classification of irrelevant documents is nearly uniform (99%), and that the inconsistencies primarily lie in relevant categorizations (77% Jaccard) of borderline relevant documents. (A caveat should be made that this observation is based on unfiltered data, and not a keyword collection or data otherwise distorted with artificially high prevalence rates.)
The overall Agreement rate of 98%+ of all relevancy determinations, including irrelevant classifications where almost all classifications are easy and obvious, suggests that the very low Jaccard index rates measured in previous studies of 16% to 45% were more likely caused by human error, not document relevance ambiguity or genuine disagreement on the scope of relevance. A secondary explanation for the low scores is lack of significant subject matter expertise, such that the reviewers were not capable of recognizing a clearly relevant document when they saw one. Half of the TREC reviews were done by volunteer law students where such mistakes could easily happen. As I understand the analysis of Grossman and Cormack, they would consider this to be mere error, as opposed to a difference of opinion.
Even if you only consider the determinations of relevancy, and exclude determinations of irrelevancy, the 77% Jaccard index for one reviewer is still significantly greater than the prior 16% to 45% consistency rates. The data on inconsistencies from my experiment thus generally support the conclusions of Cormack and Grossman that most inconsistencies in document classifications are due to human error, not the presence of borderline documents or the inherent ambiguity of all relevancy determinations. Inconsistent Assessment of Responsiveness in E-Discovery: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012). Of the 3,274 different documents that I read in both projects during my experiment, only 63 were seen to be borderline, grey area types, which is less than 2%. The rest, 3,211 documents, were consistently coded. This is shown in the graph below.
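For readers who want the arithmetic behind these two measures, here is a minimal sketch in Python. The document sets and counts below are invented for illustration; they are not the actual Enron coding data. The Jaccard index counts only documents marked relevant in at least one review, while the Agreement rate also credits matching irrelevant calls, which is why the two numbers diverge so sharply in low-prevalence collections.

```python
# Hypothetical illustration of the Jaccard index versus overall Agreement
# for two reviews of the same collection. The numbers are invented.

def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|, the mutual-overlap measure."""
    return len(a & b) / len(a | b)

def agreement(a: set, b: set, total: int) -> float:
    """Fraction of all documents coded the same way in both reviews."""
    disagreements = len(a ^ b)  # relevant in one review but not the other
    return (total - disagreements) / total

total_docs = 1000
review_1 = set(range(0, 100))    # docs coded relevant in the first review
review_2 = set(range(13, 113))   # docs coded relevant in the second review

print(f"Jaccard:   {jaccard(review_1, review_2):.2f}")    # counts relevant calls only
print(f"Agreement: {agreement(review_1, review_2, total_docs):.2f}")
```

Note how a modest number of disagreements barely dents the Agreement rate, because the easy irrelevant calls dominate an unfiltered collection, yet the same disagreements pull the Jaccard index down substantially.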
There were almost certainly more than 63 grey area relevant documents among the 3,274 documents reviewed. But they did not come to my attention in the post hoc analysis because my determinations on the other borderline documents were consistent in both projects. Still, the findings support the conclusions of Grossman and Cormack that less than 5% of documents in a typical unfiltered predictive coding review project are of a borderline grey area type. In fact, the data from my study supports the conclusion that only 2% of the total documents subject to relevance were grey area types, that 98% of the judgment calls were not subjective. I think this is a fair assessment for the unfiltered Enron data that I was studying, and the relatively simple relevance issue (involuntary employment termination) involved.
The percentage of grey area documents where the relevance determinations are subjective and arguable may well be higher than 5%. More experiments are needed and nothing is proven by only a few tests. Still, my estimate, based on general experience and the Enron tests, is that when you are only considering relevant documents, it could be as high, on average, as 20% subjective calls. (When considering all judgments, relevant and irrelevant, it is under 5% subjective.) Certainly subjectivity is a minority cause of inconsistent relevance determinations.
The data does not support the conclusion that relevance adjudications are inherently subjective, or mere idiosyncratic decisions. I am therefore confident that our legal traditions rest on solid relevance ground, not quicksand.
But I also understand that this solid ground in turn depends on competence, legal expertise, and a clear objective understanding of the rules of law and equity, not to mention the rules of reason and common sense. That is what legal training is all about. It always seems to come back to that, does it not?
Disclosure of Irrelevant Training Documents
These observations, especially the high consistency of review of irrelevance classifications (99%), support the strict limitation of disclosure of irrelevant documents as part of a cooperative litigation discovery process. Instead, only documents that a reviewer knows are of a grey area type or likely to be subject to debate should be disclosed. Even then the disclosure need not include the actual documents, but rather a summary and dialogue on the issues raised.
During my experimental review projects of the Enron documents, much like my reviews in real-world legal practice that I cannot speak of, I was personally aware of the ambiguous grey area documents when originally classifying them. They were obvious because it was difficult to decide whether they were within the border of relevance or not. I was not sure how a judge would rule on the issue. The ambiguity would trigger an internal debate where a close question decision would ultimately be made. It could also trigger quality control efforts, such as consultations with other SMEs about those documents, although that did not happen in my Enron review experiment. In practice it does happen.
Even when limiting disclosure of irrelevant documents to those that are known to be borderline, disclosure of the actual documents themselves may often be unnecessary. Instead, a summary of the documents with explanation of the rationale as to the ultimate determination of irrelevance may suffice. The disclosure of a description of the borderline documents will at least begin a relevancy dialogue with the requesting party. Only if the abstract debate fails to reach agreement should disclosure of the actual documents be required. Even then it could be done in camera to a neutral third-party, such as a judge or special master. Alternatively, disclosure could be made with additional confidentiality restrictions, such as redactions, pending a ruling by the court.
Some relevance determinations certainly do include an element of subjectivity, of flexibility, and the law is used to that. But not all. Only a small minority. Some relevance determinations are more opinion than fact. But not all. Only a small minority. Some relevance determinations are more art than science. But not all. Only a small minority. Therefore, consistent and reliable relevance determinations by trained legal experts is possible, especially when good hybrid multimodal methods are used, along with good quality controls. (Good software is also important, and, as I have said many times before, some software on the market today is far better than others.)
The fact that it is possible to attain consistent coding is good news for legal search in general and especially good news for predictive coding, with its inherent sensitivity to initial conditions and cascading effects. It means that it is possible to attain the kind of consistent training needed for active machine learning to work accurately and efficiently, even in complex real-world litigation.
The findings of the studies reviewed in this article also support the use of SMEs with in-depth knowledge of the legal subject, and the use of as few SMEs to do the review as possible – Less Is More. These studies also strongly support that the greatest consistency in document review arises from the use of one SME only. By the way, despite the tagline in Monica Bay's article, EDI-Oracle Study: Humans Are Still Essential in E-Discovery (LTN Nov., 2013), that "Phase I of the study shows that older lawyers still have e-discovery chops and you don't want to turn EDD over to robots," the age of the lawyers is irrelevant. The best predictive coding trainers do not have to be old, they just have to be SMEs and have good search skills. In fact, not all SMEs are old, although many may be. It is the expertise and skills that matter, not age per se.
The findings and conclusions of the studies reviewed in this article also reinforce the need for strong quality control measures in large reviews where multiple reviewers must be used, such as second-pass reviews, or reviews led by traditionalists. This is especially true when the reviewers are relatively low-paid, non-SMEs. Quality controls detecting inconsistencies in coding and other possible human errors should be a part of all state-of-the-art software, and all legal search and review methodologies.
Finally, it is important to remember that good project management skills are important to the success of any project, including legal search. That is true even if you are talking about an Army of One, which is my thing. Skilled project management is even more important when hundreds of reviewers are involved. The effectiveness of any large-scale document review, including its quality controls, always depends on the project management.
I agree with Ralph, that generally, the fewer the number of subject matter experts, the better the outcome is likely to be. I don’t think that this means that one cannot achieve high levels of accuracy with a team of trainers, but it is likely to be more difficult.
In general, there are three factors that affect the success of any predictive coding system. We want training examples that are valid, consistent, and representative. Validity means that the training examples that are designated responsive are actually responsive, consistency means that the same evidence is treated in the same way whenever it is encountered, and representativeness means that we cover the range of the variables we are trying to predict.
Ultimately, any predictive coding system is an attempt to find the separator between documents that are responsive and documents that are not responsive. Where we put that separator depends on, among other things, the training examples. There are many algorithms that can be used to find a separator, but all of them depend on these properties of the training examples to achieve high levels of performance (e.g., Precision and Recall).
In this blog series Ralph concentrates on consistency and he is correct that the more consistency one brings to the training and measurement of predictive coding the better the results are likely to be, all other things being equal. That does not mean that a few inconsistently categorized training examples will devastate the predictive accuracy of the system. Most machine learning algorithms are robust to a few inconsistencies. This robustness does not imply, however, that consistency is irrelevant.
We don’t have to speculate on the effects of inconsistency. It can be measured for a particular data set and algorithm using simulation experiments, for example.
The specific training examples that are inconsistent can affect performance, as can the algorithm used to learn from those examples. In general, however, the more consistent the training set, the better the quality of the system output.
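A toy version of such a simulation can be run in a few lines of standard-library Python. Everything here is invented for illustration: the vocabulary, the documents, and the deliberately simple word-weight classifier, which is not how any real predictive coding product works. The point is only to show the pattern: flip an increasing fraction of the training labels and watch accuracy on a cleanly labeled test set degrade.

```python
import random

random.seed(42)

# Relevant documents always contain one word of specialized "relevant
# vocabulary"; all documents contain some neutral filler words.
RELEVANT_WORDS = ["fraud", "termination", "severance"]
NEUTRAL_WORDS = ["meeting", "lunch", "report", "schedule"]

def make_doc(relevant: bool) -> set:
    """A document is just a bag of words."""
    words = set(random.sample(NEUTRAL_WORDS, 2))
    if relevant:
        words.add(random.choice(RELEVANT_WORDS))
    return words

def train(docs, labels):
    """Weight each word by how much more often it appears in the
    (possibly mislabeled) relevant training documents."""
    rel = [d for d, y in zip(docs, labels) if y]
    irr = [d for d, y in zip(docs, labels) if not y]
    weights = {}
    for w in set().union(*docs):
        p_rel = sum(w in d for d in rel) / max(len(rel), 1)
        p_irr = sum(w in d for d in irr) / max(len(irr), 1)
        weights[w] = p_rel - p_irr
    return weights

def accuracy(weights, docs, labels):
    """Classify by summed word weights against a small fixed threshold."""
    preds = [sum(weights.get(w, 0.0) for w in d) > 0.1 for d in docs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

train_docs = [make_doc(i % 2 == 0) for i in range(400)]
true_labels = [i % 2 == 0 for i in range(400)]
test_docs = [make_doc(i % 2 == 0) for i in range(400)]
test_labels = [i % 2 == 0 for i in range(400)]

for noise in (0.0, 0.1, 0.3, 0.5):
    noisy = [(not y) if random.random() < noise else y for y in true_labels]
    acc = accuracy(train(train_docs, noisy), test_docs, test_labels)
    print(f"label noise {noise:.0%} -> test accuracy {acc:.2f}")
```

Even in this caricature the behavior Herb describes emerges: a little noise merely dilutes the weight on the relevant vocabulary, while heavy noise washes it out entirely and the classifier can no longer separate the classes.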
The worst inconsistency is where half of the documents in a particular class are called responsive and half are called non-responsive. Let’s simplify this down to its essence. Imagine that we have 10 training documents, each containing the word, “fraud.” If half of these training documents are called responsive and half are called non-responsive, then the word “fraud” in the examples conveys no information about how to classify these documents. If all ten are classified as responsive, then the word conveys a lot of information about how to classify these documents. Ratios in between 5:5 and 10:0 (or 0:10) convey intermediate amounts of information. The more information there is in the training documents, the easier it is to correctly categorize the documents. This is standard information theory.
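Those ratios map directly onto a textbook entropy calculation. The sketch below (standard-library Python) reproduces the 5:5 and 10:0 endpoints; the "information" figure is simply one bit minus the entropy, which assumes a uniform prior over the two classes, a simplification made here for illustration.

```python
from math import log2

def label_entropy(responsive: int, non_responsive: int) -> float:
    """Shannon entropy (in bits) of the responsiveness label among
    training documents containing a given word, e.g. "fraud"."""
    total = responsive + non_responsive
    h = 0.0
    for count in (responsive, non_responsive):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

# Ten training documents containing the word "fraud", split various ways
for split in [(5, 5), (7, 3), (9, 1), (10, 0)]:
    h = label_entropy(*split)
    print(f"{split[0]}:{split[1]} responsive -> entropy {h:.2f} bits, "
          f"information {1 - h:.2f} bits")
```

At 5:5 the entropy is a full bit and the word conveys zero information about the class; at 10:0 the entropy is zero and the word conveys a full bit; the intermediate ratios fall in between, exactly as described above.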
None of this means that we cannot achieve high levels of accuracy with groups of trainers as opposed to single trainers. Jeremy is not wrong. But it does mean that if you replaced your group of trainers with a single trainer, then you might expect even higher accuracy for a given level of effort. Generally, increasing the consistency of the training set will almost always improve the accuracy of the result or decrease the effort required to reach it, all other things being equal.
Ultimately, the more important factor is validity. In order to ensure validity, the training needs to be done by one or more subject matter experts, who can accurately distinguish responsive from non-responsive documents. If all of the documents that are classified as responsive are truly responsive (and vice versa), then we also achieve perfect consistency. To the degree that the training examples are valid, they must also be consistent. (They can be consistent without being valid, that is, consistently wrong.) If truly responsive training documents are always classified as responsive, and nonresponsive documents are always classified as non-responsive, then we achieve both validity and consistency. Typically, we come closer to this situation when a single subject matter expert classifies the training documents, but that does not mean that other approaches cannot also come close.
Thank you for your input. As one of the top SMEs in the field of search I value your opinions very much. I agree that all three of the factors you mention are important: validity, consistency, and representativeness. I address the validity issue in part three. It pertains to the importance of having SMEs that really know and understand the issue on which you are attempting to draw a dividing line between relevant and irrelevant. Representativeness is also important, which is where good methods of search come in. Thanks again for your comment.
When I was at both “DLD Tel Aviv” and “IBM Search” earlier this year I was surprised by (1) how many lawyers were in attendance (2) how many times I heard that tired old joke “but I went to law school to avoid math!” (3) how many lawyers were at these events to actually LEARN math and machine learning, and (4) how important SMEs are becoming across the entire legal ecosystem, especially e-discovery.
And although I have done the “deep dive” with respect to the application of statistics in discovery, I respect Herb and Ralph and always defer to their comments and analysis re: the key statistical concepts involved in predictive coding. But one note on SMEs which is a bit outside of Ralph’s immediate series although Ralph has discussed it in detail before:
My e-discovery world is the “human” document review room. Not the perfect world of tests and experiments using dead data, but the gritty world of “there is no toilet paper in the bathroom” and “there are too many of us in this room”, and “has anybody seen the associate supervising us? It’s been 2 weeks”.
I have worked as a coder, a reviewer, a project manager, and now through Project Counsel in Europe I staff numerous document reviews across the EMEA, albeit on an exclusive basis for only 3 clients: 3 U.S. corporations, 3 distinct industries, who always seem to be in a litigation/investigatory world of hurt. Of late, all 3 have embraced predictive coding/computer assisted review and brought it in-house. And I have helped them organize multilingual "data swat teams" comprised of contract attorneys who possess the tech skills + the language skills + the analysis ability, with an emphasis on becoming subject matter data search specialists who have the ability to conduct complex searches and analyze information.
The thought was best expressed by the Legal Director at one of my clients who said (more or less): "to me the key in all of this predictive coding stuff is always going to be the subject matter experts — your people — because time and time again I have seen the critical need of human reviewers, who need to be the true subject-matter experts in my industry, who can find stuff, recognize stuff, using a variety of search techniques. Yes, my in-house team is good, but you folks do this every day for a living. He/she is the person who is going to save me the big bucks, and allow me to control the process. And at the end of the day he/she is going to be able to tell me a relevancy story using the data, be it good, bad and/or ugly. Statistics are great and good but CAR cannot interpret just yet".
He was referring, indirectly, to a recent live case where several of our reviewers … using some linear review, some keyword searches, some concept searches, some CAR leads … found a thread unrelated to the original search. It's just Ralph's in-your-face multimodal search writ large. It's fun because despite the scare tactics employed by so many EDD vendors, lots of companies "get it" and have the vision and the cojones to make the right things happen.
A wee bit outside your validity, consistency, and representativeness discussion, but a point I wanted to address.
Thanks for the comment Greg. You are not one of the evil emperors of document review that I alluded to in Part One of Less is More, but rather one of the few, as I pointed out in Part Two, who get it and are changing with the times.
Fortunately for you, very few of your competitors are as bold! Instead, they embrace the dark side of Luddite resistance and good old boy ignorance. It will be interesting to see how long the Emperor traditionalists can hang on to their outdated tools and methods. Of course, while most clients are still asleep, and these traditional linear types are making money hand over fist, they are encouraged to do nothing but an occasional bit of window dressing, where predictive coding is claimed but misused. Yes, in the short term the dark side will often prevail, but not in the long term. And even in the short term people like Greg will do very well indeed.
I’ve been thinking long and hard before responding, and I realize that the scope of what has been said, both by Ralph as well as the excellent comments by Herb and Greg, is such that I cannot begin to address everything that I’d like to in a single comment. So I’m going to have to leave much of that discussion to further whitepapers, experiments, and LegalTech back-of-the-pub discussions.
I will try to summarize what I could say, however, with the following observation: when it comes to how much one does or doesn’t trust various reviewers, one’s assumption about whether a non-expert reviewer is more likely to make mistakes of overmarking (calling non-responsive documents responsive) or of undermarking (calling responsive documents non-responsive) is going to determine, in large part, one’s position on all the issues that arise. Naturally, both occur. The question is which of the two types of mistakes you or I or whoever believes is more prevalent.
Consequences and solutions depend in large part on the nature of that mistake prevalence.
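Jeremy’s overmarking/undermarking distinction can be made concrete with a small sketch. The labels below are entirely made up for illustration, and the helper `error_rates` is hypothetical, not part of any review tool:

```python
# Sketch: quantifying overmarking vs. undermarking from paired labels.
# truth = an SME's gold-standard call, review = a first-pass reviewer's call.
# True means "responsive". All data here is invented for illustration.

def error_rates(truth, review):
    """Return (overmarking_rate, undermarking_rate).

    Overmarking  = share of non-responsive docs marked responsive (false positives).
    Undermarking = share of responsive docs marked non-responsive (false negatives).
    """
    fp = sum(1 for t, r in zip(truth, review) if not t and r)
    fn = sum(1 for t, r in zip(truth, review) if t and not r)
    negatives = sum(1 for t in truth if not t)
    positives = sum(1 for t in truth if t)
    over = fp / negatives if negatives else 0.0
    under = fn / positives if positives else 0.0
    return over, under

# Ten documents: a reviewer who errs on the side of responsiveness
# ("when in doubt, mark it responsive"), per Greg's description above.
truth  = [True, True, True, False, False, False, False, False, False, False]
review = [True, True, True, True,  True,  False, False, False, False, False]

over, under = error_rates(truth, review)
print(over, under)  # 2 of 7 non-responsive overmarked; 0 of 3 undermarked
```

Tracking these two rates separately during QC, rather than a single overall error rate, is what lets a review manager see which failure mode a team actually exhibits.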
Jeremy: jumping in ahead of Ralph (oh, and we are still editing your video interview/John’s video interview from last month; more in a bit).
Bang, you hit it. Briefly, my contribution is this, having been in the trenches all these years and having done QC and trained QC teams: on first review we see more overmarking (calling non-responsive documents responsive) than undermarking. On my projects we train very, very well. But we tell reviewers if they have any doubt, make it responsive. This tapers off because we do intense QC the first 2 weeks.
From reports I receive on other projects from members on our listservs who write to me on a regular basis the result is similar but boils down to these 2 factors: (1) fear and intimidation in the doc review room (“when in doubt, responsive; we have people who know this stuff better than you can ever know and they will QC”), and (2) poor, poor orientation and training at the get-go – a quick overview of the case with maybe key names, key terms but no attempt to provide a holistic/”forest for the trees” explanation of what they are involved with. So reviewers flounder.
As a rule, we do 3 full days just on the case, the law. With quizzes. Then 1-2 days (depending) on the review tool.
Greg and Jeremy – I agree with what Greg says. (And by the way, that’s more training than I have ever heard of! Hats off to you.) I also agree it is important as you hint, Jeremy, but not all-important. It is a factor to be considered, to be sure, especially in QC. Everyone knows that you don’t get fired for marking an irrelevant document relevant, but God help the poor contract lawyer who marks a relevant document irrelevant, not to mention a hot doc. To the dungeons. So while we have such a class system in the legal profession, there are economic pressures at work.
But beyond all that, there is the expert component here too. It is the difference between knowing and guessing. Why do you think the Zen Master is able to move so fast? Don’t think, do. Ah grasshopper….still trapped in the prison of your thoughts….you would not know an elephant until you touched all of its sides, and even then, you would not be sure until you were buried by one of its giant poops.
Yes, that’s my sense, too … that overmarking is more common. And I think that’s relatively the better place to be, when it comes to what the algorithmic side of the TAR process can do to mitigate and normalize assessor disagreement, and make predictions across the remainder (unseen portion) of the collection.
And I agree with Ralph that there is a difference, to the human, between guessing and knowing. But the machine side of the HCIR equation can’t tell the difference. It just sees the judgment, the label. And so to me it becomes an empirical question of what you can get out of the other end by putting various combinations of assessor inputs and algorithms together in intelligent ways.
Again, most of this discussion is best saved for back-of-the-pub, fist pounding, arm waving, mutual eye rolling offline discussion. But I’m of the firm belief that wrapping some intelligent process around the whole training regimen, whether that training includes non-SMEs, only SMEs, or non-SMEs and SMEs working together, is going to make the entire outcome better than it otherwise could have been.
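The empirical question Jeremy raises can at least be gestured at with a toy simulation. Everything below is an assumption for illustration: synthetic one-dimensional “documents”, a midpoint-of-means threshold classifier standing in for a real TAR algorithm, and an invented 30% overmarking rate injected into the training labels:

```python
# Sketch: simulating the downstream effect of assessor disagreement
# (here, overmarking in the training set) on a toy classifier.
# Entirely synthetic; real TAR systems and data will behave differently.
import random

random.seed(7)

def make_docs(n, responsive):
    """Synthetic 1-D 'documents': responsive docs score higher on average."""
    mu = 2.0 if responsive else -2.0
    return [(random.gauss(mu, 1.0), responsive) for _ in range(n)]

def train_threshold(data):
    """Toy classifier: threshold = midpoint of the two class means."""
    pos = [x for x, y in data if y]
    neg = [x for x, y in data if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def recall(threshold, test):
    """Fraction of truly responsive test docs scored above the threshold."""
    pos = [x for x, y in test if y]
    return sum(1 for x in pos if x > threshold) / len(pos)

train = make_docs(200, True) + make_docs(200, False)
test = make_docs(500, True) + make_docs(500, False)

# Overmark 30% of the non-responsive training docs (flip them to responsive),
# mimicking a "when in doubt, mark it responsive" first-pass review.
noisy = [(x, True) if (not y and random.random() < 0.3) else (x, y)
         for x, y in train]

clean_t = train_threshold(train)
noisy_t = train_threshold(noisy)
print(recall(clean_t, test), recall(noisy_t, test))
```

In this toy setup overmarked training data drags the learned threshold downward, so recall on the test set holds up (or rises) while precision would suffer — consistent with Jeremy’s point that overmarking is the relatively safer failure mode for the algorithm to absorb. Whether real systems behave this way is exactly the empirical question left for the whitepapers.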
Jeremy … well, once again, you hit it. The machine side of the HCIR equation can’t tell the difference. It just sees the judgment, the label. As far as back-of-the-pub, fist pounding, arm waving, mutual eye rolling offline discussion, let me say this. As Ralph knows, I have been pursuing a neuroscience/informatics course these last 2+ years, courtesy of IBM, MIT and ETHZ. My paper is “Contract attorneys, predictive coding and chocolate cake: the neuroscience of document review”. A look at the tech, the mind, the reality of doc review. I just submitted the 2nd draft to my advisors. I am trying to get a reduced/redacted version out in a post by next month. I will forward it to you. Just send me your email: email@example.com.
Re: SMEs … AGREED! You must, you must, you must wrap some intelligent process around the whole doc review training regimen. When done … magic happens. We are running 2 reviews right now … one in Paris and one in Zurich … and it was not easy, getting the end client to agree to pay for 4 days of training, no coding. But man were they impressed with the initial results. They saw a well-oiled machine, an involved team. Not a bunch of mindless coders. And they saw the review would go faster (read: saving €€).
Besides the pathetic pay scale, my biggest issue is the marketing and PR bullshit from many law firms/staffing agencies (not all) who say they have “installed” predictive coding software (or are “experimenting”) to sucker in business when what they are really doing is continuing their old systems. Over time this will all shake out and the cream WILL rise to the surface. Oops. Too many metaphors, I think 🙂
FYI – Most of the time I’m only SME of the search process, and not SME of the issues. When the SME is online it’s easy peasy; when not, I act as surrogate, and it’s slightly more complex. When there is no SME, well, that’s a problem for me. I insist on having an answer man. Otherwise I’m a no-go. If no SME, then no AI-enhanced searches, at least not with my name attached.
Greg: Yes, I would love to read your paper. Thank you for offering! It sounds really quite interesting.
Ralph, you write: “Most of the time I’m only SME of the search process, and not SME of the issues.”
Actually, that touches on an area of research I’ve been doing since early 2006, which is the notion of factoring various aspects of the overall information retrieval / information seeking process, such that it’s not necessarily the same person that does all aspects of the process. One person might have the role of finding great documents to be judged. Another person (or people) might have the role of judging those documents. Other roles are possible as well.
If you’re willing to entertain that kind of role-based factorization of the process and people within the process, it opens up the design space in really interesting ways, and can be used to make the whole endeavor both much more effective as well as efficient.
Yes. That is the reality in most projects. Glad to hear of your research. Similar to my practical experiments. We should talk more off-line.