Day One of the search project ended when I completed review of the initial 1,507 machine-selected documents and initiated the machine learning. I mentioned in the Day One narrative that I would explain why the sample size was that high. I will begin with that explanation and then, with the help of William Webber, go deeper into math and statistical sampling than ever before. I will also give you the big picture of my review plan and search philosophy: it’s hybrid and multimodal. Some search experts disagree with my philosophy. They think I do not go far enough to fully embrace machine coding. They are wrong. I will explain why and rant on in defense of humanity. Only then will I conclude with the Day Two narrative.
Why the 1,507 Random Sample Size to Start Inview’s Predictive Coding
A pure random sample using 95% +/-3% and a 50% prevalence (the most conservative prevalence estimate) would require a sample of 1,065 documents. But Inview generates a larger sample of 1,507. This is because it uses what KO calls a conservative approach to sampling that has been reviewed and approved by several experts, including KO’s outside consulting expert on predictive coding, David Lewis (an authority on information science and a co-founder of TREC Legal Track). In fact, this particular feature is under constant review and revisions are expected in future software releases.
Inview uses a so-called simple random sample method in which each member of the population has an equal chance of being observed and sampled. But KO uses a larger than required minimum sample size because it uses a kind of continuous stream sampling where data is sampled at the time of input. That and other technical reasons explain the approximate 40% over-sampling in Inview, i.e., the use of 1,507 samples, instead of 1,065 samples, for a 95% +/-3% probability calculation.
This is typical of KO’s conservative approach to predictive coding in general. The over-sampling adds slightly to the cost of review of the random samples (you must review 1,507 documents, instead of 1,065 documents). But this does not add that much to the cost. That is because the review of these sample sets goes fast, since almost all of them in most cases will be irrelevant. Review of irrelevant documents takes far less time on average than review of relevant documents. So I am convinced that this extra cost is really negligible, as compared to the increased defensibility of the sampling.
Since this approximate 40% larger than normal sample size is standard in Inview, even though the confidence interval is nominally 3%, you can argue that in most datasets it represents an even smaller margin of error. A random sample of 1,507 documents in a dataset of this size would normally represent a 95% confidence level with a margin of error (confidence interval) of only 2.52%, not 3%. See my prior blog on random sample calculations: Random Sample Calculations And My Prediction That 300,000 Lawyers Will Be Using Random Sampling By 2022.
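For readers who want to check these figures, here is a minimal sketch of the standard sample-size arithmetic. It is not Inview’s internal method; it assumes the usual normal approximation with a finite population correction, and the function names are my own invention for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_score(confidence=0.95):
    """Two-sided z value for a confidence level (about 1.96 for 95%)."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size(population, margin, confidence=0.95, prevalence=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    z = z_score(confidence)
    n0 = z ** 2 * prevalence * (1 - prevalence) / margin ** 2  # infinite-population size
    return n0 / (1 + (n0 - 1) / population)

def margin_of_error(sample, confidence=0.95, prevalence=0.5):
    """Worst-case margin of error (prevalence = 50%) for a given sample size."""
    return z_score(confidence) * sqrt(prevalence * (1 - prevalence) / sample)

needed = sample_size(699_082, 0.03)   # a little over 1,065 documents
moe = margin_of_error(1_507)          # about 2.52%

print(f"documents needed for 95% +/-3%: {needed:.1f}")
print(f"margin of error with 1,507 documents: {moe:.2%}")
print(f"over-sampling ratio: {1_507 / needed:.2f}")
```

The ratio in the last line is where the approximate 40% over-sampling comes from.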
Baseline Quality Control Sample Calculation
At the beginning of every predictive coding project I like to have an idea as to how many relevant documents there may be. For that reason I use the random sample that Inview generates for predictive coding training purposes for another purpose entirely: quality control. I use the random sample to calculate the probable number of relevant documents in the whole dataset. Only simple math is required for this standard baseline calculation. For this particular search, where I found 2 relevant documents in the sample of 1,507 documents, it is: 2/1507=.00132714. We’ll call that 0.13%. That is the percentage of relevant documents found in the whole, which is called the prevalence, a/k/a density rate or yield.
Based on this random sample percentage, my projection of the likely total number of relevant documents in the total database (aka yield) is 928 (.13%*699,082=928). So my general goal was to find 928 documents. That is called the spot projection or point projection. It represents a loose target or general goal for the search, a bullseye of sorts. It is not meant to be a recall calculation, or F1 measure, or anything like that. It is just a standard baseline for quality control purposes that many legal searchers use, not just me. It is, however, not part of the standard KO software or predictive coding design. I just use the random sample they generate for that secondary purpose.
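The baseline arithmetic is simple enough to put in a few lines. This is only a sketch with variable names of my own choosing. Note that the 928 figure comes from the unrounded prevalence (.00132714); if you multiply by the rounded .13% you get a slightly lower number.

```python
sample_size = 1_507
relevant_in_sample = 2
collection_size = 699_082

prevalence = relevant_in_sample / sample_size        # unrounded, about 0.1327%
projection = prevalence * collection_size            # spot (point) projection
rounded_projection = 0.0013 * collection_size        # same math with the rounded 0.13%

print(f"prevalence: {prevalence:.4%}")
print(f"point projection: {projection:.0f} documents")
print(f"using rounded 0.13%: {rounded_projection:.0f} documents")
```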
The KO random sampling is for an entirely different purpose of creating a machine training set for the predictive coding type algorithms to work. This is an important distinction to understand that many people miss. David Lewis had to explain that basic distinction to me many times before I finally got it. This distinction in the use of random samples is basic to all information science search, and is not at all unique to KO’s Inview.
You need to be aware that there may well be more or less than the spot projection number of relevant documents in the collection (here 928). This is because of the limitations inherent in all random sampling statistics: the confidence intervals and levels. Here we used a confidence level (95%) and a confidence interval (+/- 3%). With a 3% confidence interval, there could (or so I thought, see important correction below by William Webber) be as many as 21,881 relevant documents (699,082*3.13%), or there could be no more relevant documents at all (just the 2 already found, since you can’t have a minus percentage, i.e., you cannot have -2.87%). Those extreme numbers are, however, highly unlikely, especially considering the prevalence factors.
I presented the above to William Webber, an information scientist whose excellent work in the field of legal search and statistics I have described before. I asked Webber to evaluate my math and analysis. He was kind enough to provide the following comments he allowed me to include here:
On the width of the actual confidence interval, you can’t directly apply the +/- 3%, as it refers to the worst case width, that is, when estimated prevalence is 50%. For an estimate prevalence of 0.13% on a sample of 1,507, the exact 95% confidence interval is [0.016%, 0.479%]. Note that this is not simply a +/- interval; it is wider on the high side than on the low side. (Essentially, in a sample of 1,507, the chance that a true prevalence of 0.479% would produce a sample yield of 2 or fewer is 2.5%, and the chance that a true prevalence of 0.016% would produce a sample yield of 2 or more is 2.5%; thus, we have a (100 – 2.5 – 2.5) = 95% interval.) So the interval is between 112 and 3,345 relevant documents in the collection. (bold added)
I clarified with William that he is saying that, with the .13% prevalence we have here (a/k/a density of relevant documents) and the 95% confidence level we are using, the range of probable relevant documents is not from 2 to 21,881, as I had thought, but rather from 112 to 3,345 relevant documents (.479%*699,082=3,345; Webber is using exact numbers, not rounded off as shown here, which explains the small divergence, i.e. 3,345, and not 3,349).
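If you want to reproduce an exact interval of this kind yourself, here is a rough sketch. I am assuming the "exact" interval Webber refers to is the standard exact binomial (Clopper-Pearson) interval, which matches his description of the two 2.5% tails. The bisection solver and function names are mine, not his, and not part of Inview.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_increasing(func, lo=0.0, hi=1.0, iterations=100):
    """Bisection root of a function that rises from negative to positive on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(hits, n, confidence=0.95):
    """Exact binomial interval: each tail holds half of (1 - confidence)."""
    tail = (1 - confidence) / 2
    # Lower bound: prevalence at which seeing `hits` or more has probability `tail`.
    lower = 0.0 if hits == 0 else solve_increasing(
        lambda p: (1 - binom_cdf(hits - 1, n, p)) - tail)
    # Upper bound: prevalence at which seeing `hits` or fewer has probability `tail`.
    upper = 1.0 if hits == n else solve_increasing(
        lambda p: tail - binom_cdf(hits, n, p))
    return lower, upper

collection = 699_082
lower, upper = exact_interval(2, 1_507)
naive_high = 0.0313 * collection   # my mistaken 0.13% + 3% ceiling

print(f"exact interval: [{lower:.3%}, {upper:.3%}]")
print(f"in documents: about {lower * collection:.0f} to {upper * collection:.0f}")
print(f"naive +3% ceiling: {naive_high:.0f}")
```

The percentages should agree with Webber’s [0.016%, 0.479%]; the document counts should land close to his 112 and 3,345, with the last digit depending on how you round.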
The spot or point projection bullseye I made of 928 relevant documents remains unchanged (.13%*699,082=928). I had gotten that right. I just had not understood the variable target circles under the bell curve of probability, which Webber calculated for me as shown below assuming a sample size of 1,507.
The mistake I was making, and also made in my Random Sample Calculations essay, which I’m proud to say Webber complimented, was to simply add or subtract 3% to the scope of the spot target projection. I had assumed that the +/- 3% interval meant that you simply added or subtracted 3 to the prevalence rate. Thus, in this example, I added 3% to the .13% prevalence we have here to calculate the high end, and subtracted to determine the low end. That was a mistake.
Fortunately for us searchers, it does not work that way. You do not simply add or subtract 3% to the .13% prevalence rate to come up with a range. The target range is actually much tighter than that, providing us with more guidance on whether we are meeting our search goals. The actual range is from 0.016% to 0.479%, creating a full target of between 112 and 3,345 relevant documents. Again, this is a tempering down from .13% to .016% on the low-end, and a tempering up from .13% to .479% on the high-end. This is required because of the 95% confidence interval and the sharply dropping bell curve that cuts off these extreme numbers in the 95% probability. As Webber puts it:
Note that there’s a difference here between the absolute width of the interval, and the width of the interval as a proportion of the point (spot) estimate. The former decreases as sample prevalence falls below 50%; the latter increases. … I also attach a graph of the interval width as a proportion of the point estimate (shown above). Which of these is the correct way of looking at things depends on whether you want to say “Your Honour, we found 928 relevant documents, but another fifth of a percent of the collection might be relevant”, or “Your Honour, we found 928 relevant documents, but there could be three times that in the collection”.
0.016% is the lower end of the confidence interval; 0.13% is the point (spot) estimate of prevalence. The interval in percentages is [ 0.016%, 0.479% ]; multiply this by the collection size of 699,082, and we get the interval in documents (rounded to the nearest document) of [ 112 – 3,349 ].
Now in fact I’ve been slightly sloppy here; as you point out, we’ve already seen 1,507 documents, and found that 2 of them are relevant, so thinking that way, we should say the interval is: 2 + (699,082 – 1,507) * [ 0.016%, 0.479% ] = [ 114 – 3,343 ]
(In fact, even this is not entirely exact, because the finite population means the sampling probability changes very very slightly every time we remove a document from the collection — but let’s ignore that and not give ourselves a headache.)
Hopefully you can kind of follow that, and understand that his final parenthetical adjustment represents very minuscule numerical adjustments of no significance to our world of whole documents, and not sub-parts thereof. For an online calculator that Webber told me about wherein you can calculate the range for yourself, please see the Binomial Confidence Intervals calculator.
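Webber’s two versions of the document range are easy to check. This small sketch uses his rounded percentages (0.016% and 0.479%) exactly as quoted; the variable names are mine.

```python
collection = 699_082
sample = 1_507
found = 2
low_pct, high_pct = 0.00016, 0.00479   # Webber's rounded 0.016% and 0.479%

# Simple version: apply the interval to the whole collection.
simple = [round(collection * p) for p in (low_pct, high_pct)]

# Adjusted version: the 2 already found, plus the interval applied
# only to the documents not yet reviewed.
adjusted = [found + round((collection - sample) * p) for p in (low_pct, high_pct)]

print("simple:  ", simple)     # matches the quoted [ 112 – 3,349 ]
print("adjusted:", adjusted)   # matches the quoted [ 114 – 3,343 ]
```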
I also asked William for further clarification on the low-end, why it was 112 documents, and not just the 2 documents already found? Again, it is the tempering effect of the 95% confidence interval. Here is Webber’s interesting response to that question:
As to why the lower bound is not exactly 2 (that is, the 2 we’ve already seen). Well, if there were only 2 relevant documents in the entire collection of 699,082, then the chance we’d happen to sample both of them in a sample of 1,507 is (again ignoring the finite population):
[ (2 / 699,082) ^ 2 ] * [ (1 – (2 / 699,082)) ^ 1,505 ] * [ (1505!) / (1503! * 2!) ]
A slightly scary looking expression: the first line calculates the probability of sampling 2 relevant and 1,505 irrelevant documents in any particular permutation, and the second calculates the number of different permutations this can be done.
[Put another way:] that whole expression simplifies to 1505 * 1504 / 2. The number of different ways you can choose 2 elements from 1505. In this case, the 2 locations in the sequence of samples at which the relevant documents are found.
[Either way] [The] expression equates to a chance of 1 in 108,420. That’s not impossible (few things are impossible), but it’s so unlikely that we rule it out. And very small numbers of relevant documents are also implausible (by a related, but slightly elaborated formula). In fact, it is not until we get to 112 (or 114, if you prefer) relevant documents in the collection that the chance of a sample with 2 _or more_ relevant finally reaches 2.5%. We also rule out 2.5% at the upper end, and get a 95% confidence interval as a result.
So, it was possible that when I found 2 relevant documents in the random sample of 1,507 documents that I had in fact found all of the relevant documents. But the odds against that were 108,420 to 1. That is essentially why it is very reasonable to round out, or as I have said here, temper, the improbable range I had assumed before of between 2 and 21,881, down to between 112 and 3,345.
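Here is a short sketch that evaluates the quoted expression as written. One note of caution: the textbook binomial count of the ways to place 2 hits among 1,507 draws is C(1507, 2), while the quoted expression uses C(1505, 2). I compute both below, under my own variable names, so you can see that the difference is small and does not change the conclusion.

```python
from math import comb

collection = 699_082
sample = 1_507
p = 2 / collection   # chance any single draw is one of the 2 relevant documents

# The expression exactly as quoted above.
as_quoted = p ** 2 * (1 - p) ** 1_505 * comb(1_505, 2)

# The usual binomial form: 2 hits and 1,505 misses in 1,507 draws.
textbook = p ** 2 * (1 - p) ** (sample - 2) * comb(sample, 2)

print(f"as quoted: about 1 in {1 / as_quoted:,.0f}")
print(f"textbook:  about 1 in {1 / textbook:,.0f}")
```

Either way the odds are worse than 100,000 to 1, which is why that low end is ruled out.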
Generating the Seed Set for Next Predictive Coding Session Using a Hybrid Multimodal Approach
I began day two with a plan to use any reviewer’s most powerful tool, their brain, to find and identify additional documents to train Inview. My standard Search and Review plan is multimodal. By this I mean my standard is to use all kinds of search methods, in addition to predictive coding. The other methods include expert human review, the wetware of our own brains, and our unique knowledge of the case as lawyers who understand the legal issues, understand relevancy, and the parties, witnesses, custodian language, timeline, opposing counsel, deciding judge, appeals court, and all the rest of the many complexities that go into legal search.
I also include Parametric Boolean Keyword search, which is a standard type of search built into Inview and most other modern review software. This allows keyword search with Boolean logic, plus searches delimited to certain document fields and metadata.
I also include Similarity type searches using near duplication technology. For instance, if you find a relevant document, you can then search for documents similar to it. In Inview this is called Find Similar. You can even dial in a percentage of similarity. You can also do Add Associated type search methods, which find and include all associated documents, like email family members and email threads. Again, these Similarity type search features are found in most modern review software today, not just Inview, and can be very powerful tools.
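To give a feel for what a dialed-in percentage of similarity can mean, here is a toy sketch. I do not know how Inview’s Find Similar works internally; this assumes one common near-duplicate technique (Jaccard similarity over three-word shingles), and the function names and sample texts are hypothetical.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(a, b, k=3):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

def find_similar(seed, documents, threshold=0.4):
    """Documents whose similarity to the seed meets the chosen threshold."""
    return [doc for doc in documents if similarity(seed, doc) >= threshold]

seed = "please terminate the employment of the following employees effective friday"
near = "please terminate the employment of the following contractors effective friday"
far = "firing up the coal furnace at the plant"

print(f"near duplicate: {similarity(seed, near):.0%}")
print(f"unrelated:      {similarity(seed, far):.0%}")
print(find_similar(seed, [near, far]))
```

Raising or lowering the threshold is the toy equivalent of dialing the similarity percentage up or down.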
Finally, I use Concept search methods to locate good training documents. Concept searches used to be the most advanced feature for software review tools, and are present in many good review platforms by now. This is a great way to harness the ability of the computer to know about linguistic patterns in documents and related keywords that you would never think of on your own.
Under a multimodal approach all of the search methods are used between rounds to improve the seed set, and predictive coding is not used as a stand-alone feature.
My plan for this review project is to limit the input of each seed set, of course, but to be flexible on the numbers and search time required between rounds, depending upon what conditions I actually encounter. In the first few rounds I plan to use keyword searches, concept searches, and searches on high probability rank and mid-probability rank documents (the software’s grey area). I may use other methods depending again on how the search develops. My reviews will focus on the documents found by these searches. The data itself will dictate the exact methods and tools employed.
This multimodal, multi-search-methods approach to search is shown in the diagram below. Note IR stands for Intelligent Review, which is the KO language for predictive coding, a/k/a probabilistic coding. It stands at the top, but incorporates and includes all of the rest.
Some Vendors and Experts Disagree with Hybrid Multimodal
The multimodal approach is also encouraged by KO, which is one reason we selected KO as our preferred vendor. But not all software vendors and experts agree with the multimodal approach. Some advocate use of pure predictive coding methods alone, and do not espouse the need or desirability of using other search methods to generate seed sets. In fact, some experts and vendors even oppose the Hybrid approach, which means equal collaboration between Man and Machine. They do so because they favor the Machine! (Unlike some lawyers who go to the other extreme and distrust the machine entirely and want to manually review everything.)
The anti-hybrid, anti-multimodal type experts would, in this search scenario and others like it, proceed directly to another machine selected set of documents. They would rely entirely on the computer judgment and computer selection of documents. The human reviewers would only be used to decide on the coding of the documents that the computer finds and instructs them to review.
That is a mere random stroll down memory lane. It is not a bona fide Hybrid approach, any more than is linear review where the humans do not rely on the computers to do anything but serve as a display vehicle. That is the style of old-fashioned e-discovery where lawyers and paralegals simply do a manual linear review on a computer, but without any real computer assistance.
Hybrid for me means use of both the natural intelligence of humans, namely skilled attorneys with knowledge of the law, and the artificial intelligence of computers, namely advanced software with ability to learn from and leverage the human instructions and review tirelessly and consistently.
Fighting for the Rights of Human Lawyers
I was frankly surprised to find in my due diligence investigation of predictive coding type software that there are several experts who have, in my view at least, a distinct anti-human, anti-lawyer bent. They have an anti-hybrid prejudice in favor of the computer. As a result, they have designed software that minimizes the input of lawyers. By doing so they have, in their opinion, created a pure system with better quality controls and less likelihood of human error and prejudice. Humans, in their view, are weak-minded and tire easily. They are inconsistent and make mistakes. These experts go on and on about how their software prevents a lawyer from gaming the system, either intentionally or unintentionally. Usually they are careful in how they say that, but I have become sensitized after many such conversations and learned to read between the lines and call them on it.
These software designers want to take lawyers and other mere humans out of the picture as much as possible. They think in that way they will insulate their predictive model from bias. For instance, they want to prevent untrustworthy humans, especially tricky lawyer types, from causing the system to focus on one aspect of the relevancy topic to the detriment of another. They claim their software has no bias and will look for all aspects of relevancy in this manner. (They try to sweep under the carpet the fact, which they dislike, that it is the human lawyers who train the system to begin with in what is or is not relevant.) These software designers put a new spin on an old phrase, and say trust me, I’m a computer.
You usually run into this kind of attitude when talking to software designers and asking them questions about the software, and pressing for a real answer, instead of the bs they often throw out. They are pretty careful about what they put into writing, as they realize lawyers are their customers, and it is never a good idea to directly insult your customer, or their competence, and especially not their honesty. I happened upon an example of this in an otherwise good publication by the EDRM on search, a collaborative publication (so we do not know who wrote this particular paragraph among the thousands in the publication) EDRM Search Guide, May 7, 2009, DRAFT v. 1.17, at page 80 of 83:
In the realm of e-discovery, measurement bias could occur if the content of the sample is known before the sampling is done. As an example, if one were to sample for responsive documents and during the sampling stage, content is reviewed, there is potential for higher-level litigation strategy to impact the responsive documents. If a project manager has communicated the cost of reviewing responsive documents, and it is understood that responsive documents should somehow be as small as possible, that could impact your sample selection. To overcome this, the person implementing the sample selection should not be provided access to the content.
See what I am talking about? Yes, it is true lawyers could lie and cheat. But it is also true that the vast majority do not. They are honest. They are careful. They do not allow higher-level litigation strategy to impact the responsive documents. They do their best to find the evidence, not hide the evidence. Any software design built on the premise of the inherent dishonesty and frailty of mind of the users is inherently flawed. It takes human intelligence out of the picture based on an excessive disdain for human competence and honesty. It also ignores the undeniable fact that the few dishonest persons in any population, be it lawyers, scientists, techs, or software designers, will always find a way to lie, cheat, and steal. Barriers in software will not stop them.
In my experience with a few information scientists, and many technology experts, many of them distrust the abilities of all human reviewers, but especially lawyers, to contribute much to the search process. (David and William are, however, not among them.) I speculate they are like this because: (a) so many of the lawyers and lit-support people they interact with tend to be relatively unsophisticated when it comes to legal search and technology; or, (b) they are just crazy in love with computers and their own software and don’t particularly like people, especially lawyer people. I suppose they think the Borg Queen is quite attractive too. Whatever the reason, several of the predictive coding software programs on the market today that they have designed rely too much on computer guidance and random sampling to the neglect of lawyer expertise. (Yes. That is what I really think. And no, I will not name names.)
After enduring many such experts and their pitches, I find their anti-lawyer, anti-human intelligence attitude offensive. I for one will not be assimilated into the Borg hive-mind. I will fight for the rights of human lawyers. I will oppose the Borg-like software. Resistance is not futile!
The Borg-like experts design fully automated software for drones. Their software trivializes user expertise and judgment. The single-modal software search systems they promote underestimate the abilities (and honesty) of trained attorneys. They also underestimate the abilities of other kinds of search methods to find evidence, i.e., concept, similarity, and keyword searches.
I promote diversity of search methods and intelligence, but they do not. They rely too much on the computer, on random sampling, and on this one style of search. As a result, they do not properly leverage the skills of a trained attorney, nor take advantage of all types of programming.
In spite of their essentially hostile attitude to lawyers, I will try to keep an open mind. It is possible that a pure computer, pure probabilistic coding method may someday surpass my multimodal hybrid approach that still keeps humans in charge. Someday a random stroll down memory lane may be the way to go. But I doubt it.
In my opinion, legal search is different from other kinds of search. The goal of relevant evidence is inherently fuzzy. The 7±2 Rule reigns supreme in the court room, a place where most such computer geeks have never even been, much less understand. Legal search for possible evidence to use at trial will, in my opinion, always require trained attorneys to do correctly. It is a mistake to try to replace them entirely with machines. Hybrid is the only way to go.
So, after this long random introduction, and rant in favor of humanity, I finally come to the narrative itself about Day Two.
Second Day of Review (3.5 Hours)
I was disappointed at the end of the first day that I had not found more relevant documents in the first random sample. I knew this would make the search more difficult. But I wanted to stick with this hypothetical of involuntary terminations and run through multiple seed sets to see what happens. Still, when I do this again with this same data slice, and that is the current plan for the next set of trainees, I will use another hypothetical, one where I know I will find more hits (higher prevalence), namely a search for privileged documents.
I started my second day by reviewing all of the 711 documents containing the term “firing.” I had high hopes I would find emails about firing employees. I did find a couple of relevant emails, but not many. Turns out an energy company like Enron often used the term firing to refer to starting up coal furnaces and the like. Who knew? That was a good example of the flexibility of language and the limitations of keyword search.
I had better luck with “terminat*” within 10 words of “employment.” I sped through the search results by ignoring most of the irrelevant, and not taking time to mark them (although I did mark a few for training purposes). I found several relevant documents, and even found one I considered Highly Relevant. I marked them all and included them for training.
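For anyone unfamiliar with proximity searches, here is a bare-bones sketch of the idea behind “terminat*” within 10 words of “employment.” It is not Inview’s query engine or syntax; the tokenizing rule and function name are my own simplifying assumptions.

```python
import re

def within_n_words(text, prefix, word, n=10):
    """True if any word starting with `prefix` is within n words of `word`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    prefix_positions = [i for i, t in enumerate(tokens) if t.startswith(prefix)]
    word_positions = [i for i, t in enumerate(tokens) if t == word]
    return any(abs(i - j) <= n for i in prefix_positions for j in word_positions)

hit = "HR will terminate her employment effective Friday."
miss = "Crews are firing the furnace before the contract terminates."
too_far = "The contract was terminated " + " ".join(["filler"] * 12) + " employment"

for text in (hit, miss, too_far):
    print(within_n_words(text, "terminat", "employment"), "-", text[:50])
```

Only the first text qualifies: the second never mentions employment, and in the third the two terms are more than 10 words apart.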
Next I used the “find similar” searches to expand upon the documents already located and marked as relevant documents. This proved to be a successful strategy, but I still had only found 26 relevant documents. It was late, so I called it a night. (It is never good to do this kind of work without rest, unless absolutely required.) I estimate my time on this second day of the project at three and a half hours.
To be continued . . . .