- Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron.
- Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane.
- Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree.
- Days Five and Six of a Predictive Coding Narrative: Deep into the weeds and a computer mind-meld moment.
In this fifth installment I will continue my description, this time covering days seven and eight of the project. As the title indicates, progress continues and I have another hybrid mind-meld moment. I also discover that the computer does not recognize the significance of references to God in an email. This makes sense logically, but is unexpected and kind of funny when encountered in a document review.
Seventh Day of Review (7 Hours)
this seventh day I followed Joe White’s advice as described at the end of the last narrative. It was essentially a three-step process:
One: I ran another learning session for the dozen or so I’d marked since the last one to be sure I was caught up, and then made sure all of the prior Training documents were checked back in. This only took a few minutes.
Two: I ran two more focus document trainings of 100 docs each, total 200. The focus documents are generated automatically by the computer. It only took about an hour to review these 200 documents because most were obviously irrelevant to me, even if the computer was somewhat confused.
I received more of an explanation from Joe White on the focus documents, as Inview calls them. He explains that, at the current time at least (KO is coming out with a new version of the Inview software soon, and they are in a state of constant analysis and improvement), 90% of each focus group consists of grey area type documents, and 10% are pure random under IRT ranking. For documents drawn via workflow (in the demo database they are drawn from the System Trainers group in the Document Training Stage) they are selected as 90% focus and 10% random; where the 90% focus selection is drawn evenly across each category set for iC training.
The focus documents come from the areas of least certainty for the algorithm. A similar effect can be achieved by searching for a given iC category for documents between 49 – 51%, etc., as I had done before for relevance. But the automated focus document system makes it a little easier because it knows when you do not have enough documents in the 49 – 51% probability range and then increases the draw to reach your specified number, here 100, to the next least-certain documents. This reduces the manual work in finding the grey area documents for review and training.
Three: I looked for more documents to evaluate/train the system. I had noticed that “severance” was a key word in relevant documents, and so went back and ran a search for this term for the first time. There were 3,222 hits, so, as per my standard procedure, I added this document count to name of the folder that automatically saved the search.
I found many more relevant documents that way. Some were of a new type I had not seen before (having to do with the mass lay-offs when Enron was going under), so I knew I was expanding the scope of relevancy training, as was my intent. I did the judgmental review by using various sort-type judgment searches in that folder, i.e. by ordering the documents by subject line, file type, search terms hits (the star symbols), etc., and did not review all 3,222 docs. I did not find that necessary. Instead, I honed in on the relevant docs, but also marked some irrelevant ones here that were close. Below is a screen shot of the first page of the documents sorted by putting those selected for training at the top.
I had also noticed that “lay off” “lay offs” and “laid off” were common terms found in relevant docs, and I had not searched for those particular terms before either. There were 973 documents with hits with one of these search terms. I did the same kind of judgmental search of the folder I created with these documents and found more relevant documents to train. Again, I was finding new documents and knew that I was expanding the scope of relevancy. Below is one new relevant document found in this selection; note how the search terms are highlighted for easy location.
I also took the time to mark some irrelevant documents in these new search folders, especially the documents in the last folder, and told them to train too, since they were otherwise close from a similar keywords perspective. So I thought I should go ahead and train them to try to teach the fine distinctions.
The above third step took another five hours (six hours total). I knew I had added hundreds of new docs for training in the past five hours, both relevant and irrelevant.
I decided it was time to run a training session again and force the software to analyze and rank all of the documents again. This was essentially the Fourth Round (not counting the little training I did at the beginning today to make sure I was served with the right (updated) Focus documents).
After the Training session completed, I asked for a report. It showed that 2,663 total documents (19,731 pages) have now been categorized and marked for Training in this last session. There were now 1,156 Trainer (me) identified documents, plus the original 1,507 System ID’ed docs. (Previously, in Round 3, there were the same 1,507 System ID’ed docs, and only 534 Trainer ID’ed docs.)
Then I ran a report to see how many docs had been categorized by me as Relevant (whether also marked for Training or not). Note I could have done this before the training session too, and it would not make any difference in results. All the training session does is change the predictions on coding, not the actual prior human coding. This relevancy search was saved in another search folder called “All Docs Marked Relevant after 4th Round – 355 Docs.” After the third round I had only ID’ed 137 relevant documents. So progress in recall was being made.
Prevalence Quality Control Check
As explained in detail in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane, my first random sample search allowed me to determine prevalence and get an idea of the total number of relevant document likely contained in the database. The number was 928 documents. That was the spot or point projection of the total yield in the corpus. (Yield is another information science and statistics term that is useful to know. It means in this context the expected number of relevant documents in the total database. See eg. Webber, W., Approximate Recall Conﬁdence Intervals, ACM Transactions on Information Systems, Vol. V, No. N, Article A (2012 draft) at A2.)
My yield calculation here of 928 is based on my earlier finding of 2 relevant documents in the initial 1,507 random sample. (2/1507=.00132714) (.13*699,082=928 relevant documents). So based on this I knew that I was correct to have gone ahead with the fourth round, and would next check to see how many documents the IRT now predicted would be relevant. My hope was the number would now be closer to the 928 goal of the projected yield of the 699,082 document corpus.
This last part had taken another hour, so I’ll end Day Seven with a total of 7 hours of search and review work.
Eighth Day of Review (9 Hours)
First I ran a probability search as before for all 51%+ probable relevant docs and saved them in a folder by that name. After the fourth round the IRC now predicted a total of 423 relevant documents. Remember I had already actually reviewed and categorized 355 docs as relevant, so this was only a potential max net gain of 68 docs. As it turned out, I disagreed with 8 of the predictions, so the actual net gain was only 60 docs, for a total of 415 confirmed relevant documents.
I had hoped for more after broadening the scope of documents marked relevant in the last seeding. So I was a little disappointed that my last seed set had not led to more predicted relevant. Since the “recall goal” for this project was 928 documents, I knew I still had some work to do to expand the scope. Either that or the confidence interval was at work, and there were actually fewer relevant documents in this collection than the random sample predicted as a point projection. The probability statistics showed that the actual range was between 112 documents 3,345 documents, due to the 95% confidence level and +/-3% confidence interval.
51%+ Probable Relevant Documents
Next I looked at the 51%+ probable relevant docs folder and sorted by whether the documents had been categorized on not. You do that by clicking on the symbol for categorization, a check, which is by default located in the upper left. That puts all of the categorized docs together, either on top or bottom. Then I reviewed the 68 new documents, the ones the computer predicted to be relevant that I had not previously marked relevant.
This is always the part of the review that is the most informative for me as to whether the computer is actually “getting-it” or not. You look to see what documents it gets wrong, in other words, makes a wrong prediction of probable relevance, and try to determine why. In this way you can be alert for additional documents to try to correct the error in future seeds. You learn from the computer’s mistakes where additional training is required.
I then had some moderately good news in my review. I only disagreed with eight of the 68 new predictions. One of these documents only had a 52.6% probability for relevance, another 53.6%, another 54.5%, another 54%, another 57.9%, and another other only 61%. Another two were 79.2% and 76.7% having to do with “voluntary” severance again, a mistake I had seen before. So even when the computer and I disagreed, it was not by much.
Computer Finds New Hard-to-Detect Relevant Documents
A couple of the documents that Inview predicted to be relevant were long, many pages, so my study and analysis of them took a while. Even though these long documents at first seemed irrelevant to me, as I kept reading and analyzing them, I ultimately agreed with the computer on all of them. A careful reading of the documents showed that they did in fact include discussion related to termination and terminated employees. I was surprised to see that, but pleased, as it showed the software mojo was kicking in. The predictive coding training was allowing the computer to find documents I would likely never have caught on my own. The mind-meld was working and hybrid power was again manifest.
These hard to detect issues (for me) mainly arose from the unusual situation of the mass terminations that came at the end of Enron, especially at the time of its bankruptcy. To be honest, I had forgotten about those events. My recollection of Enron history was pretty rusty when I started this project. I had not been searching for bankruptcy related terminations before. That was entirely the computer’s contribution and it was a good one.
From this study of the 68 new docs I realized that although there were still some issues with the software making an accurate distinction between voluntary and involuntary severance, overall, I felt pretty confident that Inview was now pretty well-trained. I based that on the 60 other predictions that were spot on.
Note that I marked most of the newly confirmed relevant documents for training, but not all. I did not want to excessively weight the training with some that were redundant, or odd for one reason or another, and thus not particularly instructive.
This work was fairly time-consuming. It took three long hours on a Sunday to complete.
Returning to work in the evening I started another training session, the Fifth. This would allow the new teaching (document training instructions) to take effect.
My plan was to then have the computer serve me up the 100 close calls (Focus Documents) by using the document training Checkout feature. Remember this feature selects and serves up for review the grey area docs designed to improve the IRT training, plus random samples.
But before I reviewed the next training set, I did a quick search to see how many new relevant documents (51%+) the last training (fifth round) has predicted. I found a total of 545 documents 51%+ predicted relevant. Remember I left the last session with 415 relevant docs (goal is 928). So progress was still being made. The computer had added 130 documents.
Review of Focus Documents
Before I looked at these new ones to see how many I agreed with, I stuck to my plan, and took a Checkout feed of 100 Focus documents. My guess is that most of the newly predicted 51%+ relevant docs would be in the grey area anyway, and so I’ll be reviewing some of them when I reviewed the Focus documents.
First, I noticed right away that it served up 35 irrelevant junk files that were obviously irrelevant and previously marked as such, such as PST placeholder files, and a few others like that, which clutter this ENRON dataset. Obviously, they were part of the random selection part of the Focus document selections. I told them all to train in one bulk command, hit the completed review button for them, and then focused on the remaining 65 documents. None had been reviewed before. Next I found some more obviously irrelevant docs, which were not close at all, i.e. 91% irrelevant and only 1% likely relevant. I suspect this is part of the general database random selection that makes up 10% of the Focus documents (the other 90% are close calls).
Next I did a file type sort to see if any more of the unreviewed documents in this batch of 100 were obviously irrelevant based on file type. I found 8 more such files, mass categorized them, mass trained them and quickly completed review for these 8.
Now there were 57 docs left, 9 of which were Word docs, and the rest emails. So I checked the 9 word docs next. Six of these were essentially the same document called “11 15 01 CALL.doc.” The computer gave each approximately a 32.3% probability of irrelevance and a 33.7% probability of relevance. Very close indeed. Some of the other docs had very slight prediction numbers (less than 1%). The documents proved to be very close calls. Most of them I found to be irrelevant. But in one document I found a comment about mass employee layoffs, so I decided to call it relevant to our issue of employee terminations. I trained those eight and checked them back in. I then reviewed the remaining word docs, found that they were also very close, but marked these as irrelevant and checked them in, leaving 48 docs left to review in the Training set of 100.
Next I noticed a junk kind of mass email from a sender called “Black.” I sorted by “From” found six by Black, and a quick look showed they were all irrelevant, as the computer had predicted for each. Not sure why they were picked as focus docs, but regardless, I trained them and checked them back in, now leaving 42 docs to review.
Next I sorted the remaining by “Subject” to look for some more that I might be able to quickly bulk code (mass categorize). It did not help much as there were only a couple of strings with the same subject. But I kept that subject order and sloughed through the remaining 42 docs.
I found most of the remaining docs were very close calls, all in the 30% range for both relevant and irrelevant. So they were all uncertain, i.w. a split choice, but none were actually predicted relevant, that is, none were in the over 50% likely relevant range. I found that most of them were indeed irrelevant, but not all. A few in this uncertain range were relevant. They were barely relevant, but of the new type recently marked having to do with the bankruptcy. Others that I found relevant were of a type I had seen before, yet the computer was still unsure with basically an even split of prediction in the 30% range. They were apparently different from the obviously relevant documents, but in a subtle way. I was not sure why. See Eg: control number 12509498.
It was 32.8% relevant and 30.9% irrelevant, even though I had marked an identical version of this email before as relevant in the last training. The computer was apparently suspicious of my prior call and was making sure. I know I’m anthropomorphizing a machine, but I don’t know how else to describe it.
Computer’s Focus Was Too Myopic To See God
One of the focus documents that the computer found a close call in the 30% range was email with control number 10910388. It was obviously just an inspirational message being forwarded around about God. You know the type I’m sure.
It was kind of funny to see that this email confused the computer, whereas any human could immediately recognize that this was a message about God, not employee terminations. It was obvious that the computer did not know God.
Suddenly My Prayers Are Answered
Right after the funny God mistake email, I reviewed another email with control number 6004505. It was about wanting to fire a particular employee. Although the computer was uncertain about the relevancy of this document, I knew right away that it rocked. It was just the kind of evidence I had been looking for. I marked it as Highly Relevant, the first hot document found in several sessions. Here is the email.
I took this discovery of a hot doc as a good sign. I was finding both the original documents I had been looking for and the new outliers. It looked to me like I had succeeded in training and in broadening the scope of relevancy to its proper breadth. I might not be travelling a divine road to redemption, but it was clearly leading to better recall.
Since most of these last 42 documents were all close questions (some were part of the 10% random and were obvious), the review took longer than usual. The above tasks all took over 1.5 hours (not including machine search time or time to write this memo).
Good Job Robot!
My next task was to review the 51% predicted relevant set of 545 docs. One document was particularly interesting, control number 12004849, which was predicted to be 54.7% likely relevant. I had previously marked it Irrelevant based on my close call decision that it only pertained to voluntary terminations, not involuntary terminations. It was an ERISA document, a Summary Plan Description of the Enron Metals Voluntary Separation Program.
Since the document on its face obviously pertained to voluntary separations, it was not relevant. That was my original thinking and why I at first called it Irrelevant. But my views on document characterizations on that fuzzy line between voluntary and involuntary employee terminations had changed somewhat over the course of the review project. I now had a better understanding of the underlying facts. The document necessarily defined both eligibility for this benefit, money when an employee left, and ineligibility. It specifically stated that employees of certain Enron entities were ineligible for this benefit. It stated that acceptance of an application was strictly within the company’s discretion. What happened if even an eligible employee decided not to voluntarily quit and take this money? Would they not then be terminated involuntarily? What happened if they applied for this severance, and the company said no? For all these reasons, and more, I decided that this document was in fact relevant to both voluntary and involuntary terminations. The relevance to involuntary terminations was indirect, and perhaps a bit of a stretch, but in my mind it was in the scope of a relevant document.
Bottom line, I had changed my mind and I now agreed with the computer and considered it Relevant. So I changed the coding to relevant and trained on it. Good call Inview. It had noticed an inconsistency with some of my other document codings and suggested a correction. I agreed. That was impressive. Good robot!
Looking at the New 51%+
Another one of the new documents that was in the 51%+ predicted relevant group was a document with 42 versions of itself. It was the Ken Lay email where he announced that he was not accepting his sixty-million dollar golden parachute. (Can you imagine how many law suits would have ensued if he took that money?) Here is one of the many copies of this email.
I had previously marked a version of this email as relevant in past rounds. Obviously the corpus (the 699,082 Enron emails) had more copies of that particular email that I had not found before. It was widely circulated. I confirmed the predictions of Relevance. (Remember that this database was deduplicated only on the individual custodian basis, vertical deduplication. It was not globally deduplicated against all custodians, horizontal deduplication. I recommend full horizontal deduplication as a default protocol.)
I disagreed with many of the other predicted relevant docs, but did not consider any of them important. The documents now presenting as possibly relevant were, in my view, cumulative and not really new, not really important. All were fetched by the outer limits of relevance triggered by my previously allowing in as barely relevant the final day comments on Ken Lay’s not taking a sixty-million dollar payment, and also allowing in as relevant general talk during bankruptcy that might mention layoffs.
Also, I was allowing in as relevant new documents and emails that concerned the ERISA plan revisions that were related to general severance. The SPD of the Enron Metals Voluntary Separation Program was an example of that. These were all fairly far afield of my original concept of relevance, which had grown as I saw all of the final days emails regarding layoffs, and better understood the bankruptcy and ERISA set up, etc.
Bottom line, I did not see much training value in these newly added docs, both predicted and confirmed. The new documents were not really new. They were very close to documents already found in the prior rounds. I was thinking it might be time to bring this search to an end.
Latest Relevancy Metrics
I ran one final search to determine my total relevant coded documents. The count was 659. That was a good increase over the last measured count of 545 relevant, but still short of my initial goal of 928, the point projection of yield. That is a 71% recall (659/928) of my target, which is pretty good, especially if the remaining relevant were just cumulative or otherwise not important. Considering the 3% confidence interval, and the range inherent in the 928 yield point projection because of that, from between 112 and 3,345 documents, it could in fact already be 100% recall, although I doubted that based on the process to date. See references to point projection, intervals, and William Webber’s work on confidence intervals in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane and in Webber, W., Approximate Recall Conﬁdence Intervals, ACM Transactions on Information Systems, Vol. V, No. N, Article A (2012 draft).
Enough Is Enough
I was pretty sure that further rounds of search would lead to the discovery of more relevant documents, but thought it very unlikely that any more significant relevant documents would be found. Although I had found one hot doc in this round, the quality of the rest of the documents found convinced me that was unlikely to occur again. I had the same reaction to the grey area documents. The quality had changed. Based on what I had been seeing in the last two rounds, the relevant documents left were, in my opinion, likely cumulative and of no real probative value to the case.
In other words, I did not see value in continuing the search and review process further, except for a final null-set quality control check. I decided to bring the search to end. Enough is enough already. Reasonable efforts are required, not perfection. Besides, I knew there was a final quality control test to be passed, and that it would likely reveal any serious mistakes on my part.
Moving On to the Perhaps-Final Quality Control Check
After declaring the search to be over, the next step in the project was to take a random sample of the documents not reviewed or categorized, to see if any significant false-negatives turned up. If none did, then I would consider the project a success, and conclude that more rounds of search were not required. If some did turn up, then I would have to keep the project going for at least another round, maybe more, depending on exactly what false-negatives were found. That would have to wait for the next day.
But before ending this long day I ran a quick search to see the size of this null set. There were 698,423 docs not categorized as relevant and I saved them in a Null Set Folder for easy reference. Now I could exit the program.
Total time for this night’s work was 4.5 hours, not including report preparation time and wait time on the computer for the training.
To be continued . . . .