Mitigating Factors to Human Inconsistency
When you consider all of the classifications of documents, both relevant and irrelevant, my consistency rate in the two Enron reviews jumps to about 99% (1% inconsistent). Compare this with the Grossman Cormack study of the 2009 TREC experiments, where agreement on all non-relevant adjudications, assuming all non-appealed decisions were correct, was 97.4 percent (2.6% inconsistent). My guess is that most well-run CAR review projects today are in fact attaining high overall consistency rates. The existing technologies for duplication, similarity, concept and predictive ranking are very good, especially when they are all used together. When you consider both relevant and irrelevant coding, consistency should be in the 90s for sure, probably the high nineties. Hopefully, by using today’s improved software and the latest, fairly simple 8-step methods, we can reduce the relevance inconsistency problem even further. Further scientific research is, however, needed to test these hopes and suppositions. My results in the Enron studies could be a black swan, but I doubt it. I think my inconsistency is consistent.
Even though overall inconsistencies may be small, the much higher inconsistencies in relevance calls alone remain a continuing problem. They are a fact of life of all human document review, as Voorhees showed years ago. The inconsistency problem must continue to be addressed by a variety of ongoing quality controls, including the use of predictive ranking and post hoc quality assurance tests such as ei-Recall. The research to date shows that duplicate, similarity and predictive coding ranking searches can help mitigate the inconsistency problem (the overlap has increased from the 30% range to the 70% range), but cannot eliminate it entirely. By 2012 I was able to use these features to get the relevant-only disagreement rates down to 23%, and even then, the 63 inconsistently coded relevant documents were all unimportant. I suspect, but do not know, that my rates are now lower with improved quality controls. Again, further research is required before any blanket statements like that can be made authoritatively.
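The gap between a very high overall consistency rate and a much lower relevant-only overlap can be illustrated with a small calculation. The following is a minimal sketch, assuming that overall agreement is measured over every document coded in both reviews and that overlap is measured only over documents coded relevant in at least one review (a Jaccard-style ratio). The function name and the toy data are hypothetical and are not taken from the Enron reviews.

```python
def agreement_rates(first, second):
    """Compare two codings of the same documents.

    first, second: dicts mapping a document id to True (relevant)
    or False (irrelevant). Only ids present in both are compared.
    Returns (overall_agreement, relevant_only_overlap).
    """
    ids = first.keys() & second.keys()
    if not ids:
        raise ValueError("the two reviews share no documents")
    same = sum(1 for d in ids if first[d] == second[d])
    relevant_either = [d for d in ids if first[d] or second[d]]
    relevant_both = [d for d in relevant_either if first[d] and second[d]]
    overall = same / len(ids)
    # If neither review found anything relevant, treat overlap as perfect.
    overlap = len(relevant_both) / len(relevant_either) if relevant_either else 1.0
    return overall, overlap


# Toy data (hypothetical): 1,000 documents, 50 coded relevant in each review,
# 40 of them coded relevant in both.
first_review = {i: i < 50 for i in range(1000)}
second_review = {i: 10 <= i < 60 for i in range(1000)}

overall, overlap = agreement_rates(first_review, second_review)
print(f"overall agreement: {overall:.1%}")
print(f"relevant-only overlap: {overlap:.1%}")
```

In this toy collection only 20 of 1,000 documents are coded differently, so overall agreement is 98%, yet the relevant-only overlap is just 40 out of 60. Low prevalence is what lets the two numbers diverge so sharply.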
Our quest for quality legal search requires that we keep the natural human weakness of inconsistency front and center. Only computers are perfectly consistent. To help keep the human reviewers as consistent as possible, and so mitigate any damage that inconsistent coding may cause, a whole panoply of quality control and quality assurance methods should be used, not just improved search methods. See, e.g., ZeroErrorNumerics.com.
The Zero Error Numerics (ZEN) quality methods include:
- predictive coding analytics, a type of artificial intelligence, actively managed by skilled human analysts in a hybrid approach;
- data visualizations with metrics to monitor progress;
- flow-state of human reviewer concentration and interaction with AI processes;
- quiet, uninterrupted, single-minded focus (dual tasking during review is prohibited);
- disciplined adherence to a scientifically proven set of search and review methods including linear, keyword, similarity, concept, and predictive coding;
- repeated tests for errors, especially retrieval omissions;
- objective measurements of recall, precision and accuracy ranges;
- judgmental and random sampling and analysis such as ei-Recall;
- active project management and review-lawyer supervision;
- small team approach with AI leverage, instead of large numbers of reviewers;
- recognition that mere relevant is irrelevant;
- recognition of the importance of simplicity under the 7±2 rule;
- multiple fail-safe systems for error detection of all kinds, including reviewer inconsistencies;
- use of only the highest quality, tested e-discovery software and vendor teams, under close supervision and teamwork;
- use of only experienced, knowledgeable Subject Matter Experts for relevancy guidance, either directly or by close consultation;
- extreme care taken to protect client confidentiality; and,
- high ethics – our goal is to find and disclose the truth in compliance with local laws, not win a particular case.
That is my quality play book. No doubt others have come up with their own methods.
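To make the list items on objective recall ranges and random sampling more concrete, here is a minimal sketch of an interval-based recall estimate in the spirit of ei-Recall. It assumes the range is built by sampling the null set (the documents that will not be produced), projecting a low and high count of false negatives from that sample, and dividing the relevant documents found by the found-plus-missed totals. The Wilson score interval is swapped in here because it needs only the standard library; that choice, the function names, and the numbers in the example are all assumptions, not a statement of how the published method is specified.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = hits / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def recall_range(true_positives, null_set_size, sample_size, sample_relevant, z=1.96):
    """Estimate a low-to-high recall range from a random sample of the null set.

    true_positives:  relevant documents already found by the review
    null_set_size:   documents classified irrelevant and withheld
    sample_size:     documents randomly sampled from the null set
    sample_relevant: sampled documents that turned out to be relevant
    """
    if true_positives <= 0:
        raise ValueError("need at least one relevant document found")
    low_rate, high_rate = wilson_interval(sample_relevant, sample_size, z)
    fn_low = low_rate * null_set_size    # best case: few relevant documents missed
    fn_high = high_rate * null_set_size  # worst case: many relevant documents missed
    return (true_positives / (true_positives + fn_high),
            true_positives / (true_positives + fn_low))


# Hypothetical project: 9,000 relevant found, 90,000 in the null set,
# 15 relevant documents in a random sample of 1,500 from the null set.
low, high = recall_range(9000, 90000, 1500, 15)
print(f"estimated recall between {low:.1%} and {high:.1%}")
```

Reporting a range rather than a single point is the design choice that matters here: a point estimate of recall hides the sampling uncertainty, and the low end of the range is the number to defend.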
High-quality, effective legal search depends in part on recognition of the common document review phenomena of concept shift and inconsistent classifications. Although you want to avoid inconsistencies, concept drift is a good thing. It should appear in all complex review projects. Think Bob Dylan – He not busy being born is busy dying. Moreover, you should have a standard protocol in place to both encourage and efficiently deal with such changes in relevance conception. If coding does not evolve, if relevance conceptions do not shift through conversations and analysis, there could be a quality issue. A lack of drift is a warning flag, and you should at least investigate.
Very few projects go in a straight line known from the beginning. Most reviews are not like a simple drag race. There are many curves. If you do not see a curve in the road, and you keep going straight, a spectacular wreck can result. You could fly off the track. This can happen all too easily if the SME in charge of defining relevance has lost track of what the reviewers are doing. You have to keep your eyes on the road and your hands on the wheel.
Good drivers of CARs – Computer Assisted Reviews – can see the curves. They expect them, even when driving a new course. When they come to a curve, they are not surprised; they know how to speed through it. They can do a power drift through any corner. Change in relevance should not be a speed-bump. It should be an opportunity to do a controlled skid, an exciting drift with tires burning. Speed drifts help keep a document review interesting, even fun, much like a race track. If you are not having a good time with large-scale document review, then you are obviously doing something wrong. You may be driving an old car using the wrong methods. See: Why I Love Predictive Coding: Making document review fun with Mr. EDR and Predictive Coding 3.0.
Concept shift makes it harder than ever to maintain consistency. When the contours of relevance are changing, at least somewhat, as they should, then you have to be careful and be sure all of your prior codings are redone and made consistent with the latest understanding. Your third step of a baseline random sample should, for instance, be constantly revisited. All of the prior codings should be corrected to be consistent with the latest thinking. Otherwise your prevalence estimate could be way off, and with it all of your rough estimates of recall. The concern with consistency may slow you down a bit, and make the project cost a little more, but the benefits in quality are well worth it.
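The effect of stale sample coding on prevalence and recall can be shown with simple arithmetic. The following is a minimal sketch, assuming a point estimate only (no confidence interval) and a rough recall figure computed as relevant documents found divided by the projected relevant total. The function name and all of the numbers are hypothetical.

```python
def prevalence_estimates(sample_codes, collection_size, relevant_found):
    """Point estimates from a baseline random sample.

    sample_codes:    list of True/False relevance codes for the sample
    collection_size: total documents in the collection
    relevant_found:  relevant documents located so far by the review
    Returns (prevalence, projected_relevant, rough_recall).
    """
    if not sample_codes:
        raise ValueError("the sample is empty")
    prevalence = sum(sample_codes) / len(sample_codes)
    projected_relevant = prevalence * collection_size
    if projected_relevant == 0:
        return prevalence, projected_relevant, None
    rough_recall = min(1.0, relevant_found / projected_relevant)
    return prevalence, projected_relevant, rough_recall


# Hypothetical baseline sample of 1,500 documents, 30 originally coded relevant.
original_sample = [True] * 30 + [False] * 1470

# After relevance broadens, 15 of the sampled documents first coded irrelevant
# are recoded relevant to match the latest understanding.
corrected_sample = [True] * 45 + [False] * 1455

for label, sample in (("original", original_sample), ("corrected", corrected_sample)):
    prevalence, projected, recall = prevalence_estimates(sample, 100_000, 1_800)
    print(f"{label}: prevalence {prevalence:.1%}, "
          f"projected relevant {projected:,.0f}, rough recall {recall:.0%}")
```

In this toy example, recoding just 15 sample documents moves the prevalence estimate from 2% to 3%, and the rough recall estimate falls from 90% to 60%. A review that looked nearly finished under the old coding is not finished under the new one.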
If you are foolish enough to still use secret control sets, you will not be able to make these changes at all. When the drift hits, as it almost always does, your recall and precision reports based on this control set will be completely worthless. Worse, if the driver does not know this, they will be misled by the software reports of precision and recall based on the secret control set. That is one reason I am so adamantly opposed to the use of secret control sets and have called for all software manufacturers to remove them. See Predictive Coding 3.0 article, part one.
If you do not go back and correct for changes in conception, then you risk withholding a relevant document that you initially coded irrelevant. It could be an important document. There is also the chance that the inconsistent classifications will impair the active machine learning by confusing the algorithmic classifier. Good predictive coding software can handle some errors, but inconsistent coding may slow the training down or, if the errors are extreme, derail it entirely. Quality controls of all kinds are needed to prevent that.
All types of quality controls are needed to address the inevitability of errors in reviewer classifications. Humans, even lawyers, will make some mistakes from time to time. We should expect that and allow for it in the process. Use of duplicate and near-duplicate guides, email strings, and other similarity searches, concept searches and probability rankings can mitigate the fact that no human will ever attain perfect machine-like consistency. So too can a variety of additional quality control measures, primary among them being the use of as few human reviewers as possible. This is in accord with the general review principle that I call less is more. See: Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part One and Part Two. Using few reviewers is not a problem if you are driving a good CAR, one with the latest predictive coding search engines. More than a couple of reviewers in a CAR like that will just slow you down. But it’s alright, Ma, it’s life, and life only.
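As one illustration of how duplicate guides can catch reviewer inconsistency, here is a minimal sketch, assuming exact duplicates are identified by hashing whitespace-normalized, lowercased text. Real review platforms use their own deduplication and near-duplicate logic; the function name, the tuple layout, and the sample documents below are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_inconsistent_duplicates(documents):
    """Flag groups of duplicate documents that carry conflicting relevance codes.

    documents: iterable of (doc_id, text, coded_relevant) tuples.
    Returns a list of sorted doc_id lists, one per conflicting group.
    """
    groups = defaultdict(list)
    for doc_id, text, coded_relevant in documents:
        # Normalize whitespace and case so trivial differences do not hide duplicates.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append((doc_id, coded_relevant))

    conflicts = []
    for members in groups.values():
        codes = {coded for _, coded in members}
        if len(codes) > 1:  # the same content was coded both ways
            conflicts.append(sorted(doc_id for doc_id, _ in members))
    return conflicts


# Hypothetical coded documents.
coded = [
    ("DOC-001", "Please shred the trading memo.", True),
    ("DOC-002", "please  shred the trading memo.", False),  # duplicate, coded differently
    ("DOC-003", "Lunch on Friday?", False),
    ("DOC-004", "Lunch on Friday?", False),                 # duplicate, coded the same
]

for group in find_inconsistent_duplicates(coded):
    print("recheck:", ", ".join(group))
```

Each flagged group is a cheap, objective prompt for a second look: at least one of the conflicting codes has to be wrong under the current understanding of relevance.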
Since I invoked the great Bob Dylan and It’s Alright, Ma earlier in this blog, I thought I owed it to you to share the full lyrics, plus a video of young Bob’s performance. It could be his all-time best song-poem. What do you think? If you are feeling very creative, leave a poem below that paraphrases Dylan to make one of the points in this blog.
“It’s Alright, Ma (I’m Only Bleeding)”
Shadows even the silver spoon
The handmade blade, the child’s balloon
Eclipses both the sun and moon
To understand you know too soon
There is no sense in trying.
Suicide remarks are torn
From the fools gold mouthpiece
The hollow horn plays wasted words
Proved to warn
That he not busy being born
Is busy dying.
You follow, find yourself at war
Watch waterfalls of pity roar
You feel to moan but unlike before
That you’d just be
One more person crying.
A foreign sound to your ear
It’s alright, Ma, I’m only sighing.
Private reasons great or small
Can be seen in the eyes of those that call
To make all that should be killed to crawl
While others say don’t hate nothing at all
As human gods aim for their marks
Made everything from toy guns that sparks
To flesh-colored Christs that glow in the dark
It’s easy to see without looking too far
That not much
Is really sacred.
Teachers teach that knowledge waits
Can lead to hundred-dollar plates
Goodness hides behind its gates
But even the President of the United States
Sometimes must have
To stand naked.
It’s only people’s games that you got to dodge
And it’s alright, Ma, I can make it.
Into thinking you’re the one
That can do what’s never been done
That can win what’s never been won
Meantime life outside goes on
All around you.
You suddenly find you got nothing to fear
Alone you stand without nobody near
When a trembling distant voice, unclear
Startles your sleeping ears to hear
That somebody thinks
They really found you.
Yet you know there is no answer fit to satisfy
Insure you not to quit
To keep it in your mind and not forget
That it is not he or she or them or it
That you belong to.
For the wise men and the fools
I got nothing, Ma, to live up to.
That they do not respect in any degree
Who despite their jobs, their destinies
Speak jealously of them that are free
Cultivate their flowers to be
Nothing more than something
They invest in.
To strict party platforms ties
Social clubs in drag disguise
Outsiders they can freely criticize
Tell nothing except who to idolize
And then say God Bless him.
Gargles in the rat race choir
Bent out of shape from society’s pliers
Cares not to come up any higher
But rather get you down in the hole
That he’s in.
On anyone that lives in a vault
But it’s alright, Ma, if I can’t please him.
Limited in sex, they dare
To push fake morals, insult and stare
While money doesn’t talk, it swears
Obscenity, who really cares
Propaganda, all is phony.
With a killer’s pride, security
It blows the minds most bitterly
For them that think death’s honesty
Won’t fall upon them naturally
Must get lonely.
False gods, I scuff
At pettiness which plays so rough
Walk upside-down inside handcuffs
Kick my legs to crash it off
Say okay, I have had enough
What else can you show me?
They’d probably put my head in a guillotine
But it’s alright, Ma, it’s life, and life only.