This is my fourth in a series of narrative descriptions of an academic search project of 699,082 Enron emails and attachments. It started as a predictive coding training exercise that I created for Jackson Lewis attorneys. The goal was to find evidence concerning involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane. The third and fourth days are described in Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree.
In this fourth installment I continue to describe what I did in days five and six of the project. In this narrative I go deep into the weeds and describe the details of multimodal search. Near the end of day six I have an affirming hybrid multimodal mind-meld moment, which I try to describe. I conclude by sharing some helpful advice I received from Joseph White, one of Kroll Ontrack’s (KO) experts on predictive coding and KO’s Inview software. Before I launch into the narrative, a brief word about vendor experts. Don’t worry, it is not going to be a commercial for my favorite vendors; more like a warning based on hard experience.
As part of your due diligence when selecting a vendor for any significant predictive coding project, I suggest that you interview the vendor experts that will be available to assist you, especially on the predictive coding aspects. They should have good knowledge of the software and the theory. They should also be able to explain everything to you clearly and patiently. They should not just be parrots of company white papers, or even worse, of sales materials and software manuals.
If a vendor expert truly understands, they can transcend the company jargon; they can rephrase so that you can understand. They can adapt to changing circumstances. The advice of a good vendor expert, one that not only understands the software, but also the law and the practical issues of lawyers, is invaluable. Periodic consults during a project can save you time and money, and improve the overall effectiveness of your search.
When talking to the experts, be sure that you understand what they say to you, and never just nod in agreement when you do not really get it. I have been learning and working with new computer software of all kinds for over thirty years, and am not at all afraid to say that I do not understand or follow something.
Often you cannot follow because the explanation is so poor. For instance, often the words I hear from vendor tech experts are too filled with company specific jargon. If what you are being told makes no sense to you, then say so. Keep asking questions until it does. Do not be afraid of looking foolish. You need to be able to explain this. Repeat back to them what you do understand in your own words until they agree that you have got it right. Do not just be a parrot. Take the time to understand. The vendor experts will respect you for the questions, and so will your clients. It is a great way to learn, especially when it is coupled with hands-on experience.
Fifth Day of Review (4 Hours)
I began the fifth day of work on the project by reviewing the 161 documents that I had found on my last day of working on this project. They were all predicted to be relevant (51% +). I had not finished reviewing them in the fourth day (which in reality was three weeks ago). This first task took about one hour. Note that I elected not to train on all of them. This is an important degree of flexibility that Inview software provides, which others that I have seen do not.
Third Round of Predictive Coding
Next, I ran the third iteration of predictive coding analysis by the software. That is called “initiating a Session” in Inview, a new learning session. The menu screen for this is found in Workflow / Manage iC (the three colored dots logo). Click on that Manage iC button and you open the Manage iC Learning page.
On this screen, below the opening splash shown above, you can initiate a training session. A partial screen shot showing the Initiate Session button is shown below.
After you click on the Initiate Session button, a message appears in red font saying: “Learning session currently in progress.” This learning session can take an hour or more, but that’s computer time. It only took five minutes of my actual, billable time. The computer during this time is analyzing every document based on the new matrix. Put another way, the learning session is based on the new information, the new coding of documents that I marked for Training, that I provided today and in Day Four. All other trained documents are also considered.
Basically the only new coding I had provided to Inview between rounds two and three was my coding of the 162 docs predicted to be relevant (51% +), 111 of which were new, and the 12 predicted to be Highly Relevant (Hot), only 1 of which was new. Again, I did not train on any grey areas, as I thought it was too early to look at that. Depending on what results we get from this third round, I may include that in the next training.
Please note that KO’s Inview gives the Trainers (me and/or any other attorney with authority to manage the IRT process) the ability to pick and choose how we train. Other software is much more rigid and controlled, i.e. – they require review of grey area documents before each training, plus top ranked documents. I like the flexibility in KO’s software. It gives some credit to the ability of lawyers as expert searchers, at least when it comes to evidence and legal classifications. For a fuller explanation of my preferred hybrid approach, where computers and lawyers work together, and my opposition to a total computer-controlled approach, which I have called Borg-like, see Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane, (subsections Some Vendors and Experts Disagree with Hybrid Multimodal and Fighting for the Rights of Human Lawyers).
The third training session completed and the report stated that 534 Trainer identified Docs were used in the training, and that again there were 1,507 System Identified Documents. Recall that after the second training session completed there were 1,507 documents selected at random by the computer (a/k/a system identified) and 493 more documents selected by me (trainer identified) and marked for training, for a total of exactly 2,000. This meant I had only added 41 new documents for training since the last session.
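The arithmetic behind these counts can be checked in a few lines. This is only a tally of the figures reported above; the variable names are mine for illustration and do not come from Inview.

```python
# Figures reported in the narrative; variable names are illustrative only.
system_identified = 1507      # documents selected at random by the software
trainer_round_two = 493       # trainer-identified documents after round two
trainer_round_three = 534     # trainer-identified documents after round three

total_round_two = system_identified + trainer_round_two
new_training_docs = trainer_round_three - trainer_round_two
total_round_three = system_identified + trainer_round_three

print(total_round_two)        # 2000
print(new_training_docs)      # 41
print(total_round_three)      # 2041
```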
More Searching After the Training
I then ran a search to find the 51% probable relevant documents after the third training. Below is a screen shot of how that search is run. You could plug in whatever probability range you wanted. You could also include a variety of other search components at the same time. Inview, like most state-of-the-art review platforms, has all kinds of very powerful search functions.
This search found 132 predicted relevant documents, instead of 161 docs classified as probable relevant in the prior round. I saved those in my search folder that I named “51% probable relevant after 3rd round – 132 documents” with date. Twenty-six (26) of the documents were new, but most of them were dupes and near dupes, so there were actually fewer than 10 brand new docs. Also, interestingly, I only disagreed with one of the predictions, whereas in the prior round I had disagreed with several.
Next I did the same kind of iCategory probability search, 51% plus, but for Highly Relevant only. Recall that last time Inview returned 12; this time it again returned 12. They were all the same documents. (Note, this was not really necessary because Highly Relevant documents are included in the Relevant category, but I wanted to make sure on the count.)
My initial analysis, more like speculation, is that I am either: (1) stuck in too narrow an instruction, and the training needs to be expanded so that our recall is better; or, (2) almost done.
With that in mind I did the first search of the mid-range, the grey area, and searched the 40%-50% probability. (I could have done this another way, and let the machine select the mid-range, but this method allowed greater control.) This returned 29 documents.
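For readers who think in code, a probability-range search amounts to a simple filter over the scores the software assigns. The sketch below is hypothetical: the `ScoredDoc` record and the `in_band` function are invented for illustration and are not part of Inview, and the sample scores are made up.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    control_number: str   # hypothetical identifier
    probability: float    # predicted probability of relevance, 0.0 to 1.0


def in_band(docs, low, high):
    """Return the documents whose predicted probability falls in [low, high]."""
    return [d for d in docs if low <= d.probability <= high]


# Invented sample scores, for illustration only.
docs = [
    ScoredDoc("A-1", 0.93),
    ScoredDoc("A-2", 0.47),
    ScoredDoc("A-3", 0.41),
    ScoredDoc("A-4", 0.12),
]

predicted_relevant = in_band(docs, 0.51, 1.00)   # the "51% +" search
grey_area = in_band(docs, 0.40, 0.50)            # the 40%-50% mid-range

print([d.control_number for d in grey_area])     # ['A-2', 'A-3']
```

Setting the band by hand, rather than letting the machine pick the mid-range, is what gives the reviewer the extra control mentioned above.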
I had reviewed 7 of these documents before and marked 5 of them as relevant, 1 as undetermined, and 1 as irrelevant. All 5 of the documents I had marked as relevant I had decided not to Train on. I had thought all 5 were irregular for some reason and it would not be good to use them for training. So now I marked all 5 of them for training because the computer was not certain about them. That is why they were in the 40-50% probability range.
It is interesting to note that I had marked one of these seven documents before as Undetermined. This is a complex legal document, an eight-page contract entitled Human Resources Agreement by and between Enron and several entities, including Georgia Pacific, dated October 2011. It is assigned control number 12009960. The first page of this document is shown below.
I was not sure when I first started the project whether it was relevant or not, so I sort of passed, and marked it as Undetermined. I like to do that at the beginning of projects. This document had many provisions on employment termination, but I was not sure if it really pertained to involuntary terminations or not, plus it looked like this was just a draft document, not a final contract. The computer was also unsure, like I used to be. But now with my greater experience of the relevancy border I was defining, and especially because of my now greater familiarity with the types of legal documents that Enron was using, I was able to make a decision. I considered the document to be irrelevant. The secondary references to involuntary termination were trumped by the primary intent to deal with voluntary departures as part of a merger. Plus, this was just a draft legal document. I was unsure of that before, but not now, not after I had seen dozens of documents by Enron lawyers. For all of these reasons, and more, I marked it as irrelevant and marked it for training.
I also marked for training the one document that I had marked before as irrelevant, but had not marked for training. I was hoping this would clear up some obvious confusion in my prior training.
I next reviewed the 22 docs in the grey area that I had not reviewed before. Most of them were dupes and near dupes. There were really only 10 new documents not reviewed before. I disagreed with about half of the predictions and only marked 8 out of the 22 as relevant (but they were somewhat close questions). This is all to be expected for grey area documents. This kind of close-call review is rather slow, and took almost three more hours for the post training tasks, for a total billable time today of four hours.
Sixth Day of Review (4 Hours)
I started with a search to confirm the total number of docs we have now marked as relevant and put them in a folder labeled “All docs Marked Relevant after third run – 137 docs” with date. That was just for housekeeping metrics. Careful labeling of the search folders that Inview generates automatically for each search is very important. It takes a little time to do, but can save you a lot of time later.
Add Associated Searches
Next I tried to expand on the 137 docs by using the Add Associated series of commands in the Home tab. This is a kind of similarity search function described before.
I started with “Duplicate.” This did not add any new files. My prior duplicate exercise had already caught them.
Then I used “Family.” This added one new email, which transmitted a relevant Q&A document as an attachment. According to our protocol both the attachment and email would be produced, so I marked the email relevant (although nothing on the face of the email alone would be relevant). That was document control number 3600805. But I did not mark the email itself for Training, as I assumed that would not be helpful. We are now up to 138 documents categorized as relevant.
Next I pressed the “Near-Duplicate” button and this added no new documents. So then I activated the “Contextual” duplicates command. Again, nothing new. I also tried the Core near duplicates, again nothing added.
Then I activated the “Thread” command and this time it expanded the folder to 306 documents, an increase of 168 documents (it was 138). So this add associated Thread function more than doubled the size of the folder. But I had to review all of the 168 new documents from Thread to see whether in fact I considered them to be relevant or not. I thought that they probably all would be, or at least might be, because they were part of an email chain that was relevant, but maybe not, at least on their own. This proved to be a very time-consuming task, which I here describe in some detail. In fact, I found 162 out of the 168 to be relevant for various reasons described below and only disallowed 6 thread documents.
I found that many of the new 168 documents were emails that were transmittals of attachments that I had marked relevant, so again in accord with my protocol to always produce email parents of relevant attachments, I marked them all as Relevant, but did not mark them to Train. If the email had some content that was in itself relevant, then I also marked the document to Train; see, e.g., control number 12010704 shown below.
I also found siblings that were not relevant, and so marked them as Irrelevant in accord with my standard protocol for this project. My standard existing protocol was only to produce relevant attachments (but I am having second thoughts on this, as I will explain below). If one email has two attachments, one relevant, and one irrelevant, under this protocol I would only produce the relevant attachment and the email (parent). I would not produce the irrelevant attachment (sibling). That is what email families are all about. I would love to hear readers’ thoughts about that.
I found the most efficient way to search these new Thread documents was to sort using the Family ID column, which I dragged to the left side for easy viewing. To sort you just click on the column in the Document View. Below is a screen shot of the columns only where I have ordered by Family ID. The beginning and ending control number columns are also shown in the screen shot of part of the list.
For a good example of the kind of parent-child emails I am speaking about, look at the parent email named Enron Metals North America Voluntary Severance, control number 12006578.
This email is the parent of Family ID # 283789. There are 7 docs in this big family. Three out of the six children had already been marked as Relevant, but the parent email had not been reviewed, and neither had the three siblings, the other attachments.
Per protocol I marked the parent email as relevant. But in fact, when I read it carefully, I saw that it was relevant on its own. I noted some language in the body of the email itself talking about termination of employees (“… will we be prevented from terminating the employee under the compulsory program?”), so I also marked the parent to Train.
One new sibling, a Word attachment, was reviewed and found to be relevant, so I marked it as Relevant and to Train. I had to give some thought to the two spreadsheets attached to this family, as they were lists of employees. But taking the email and other attachments into consideration, I decided these were likely employees identified for this “voluntary severance” program, which could in these circumstances amount to involuntary termination. The spreadsheet included ethnicity and age, and it is interesting to note that almost all of them were 50 years of age or older. Hmmm. I marked the two spreadsheets as relevant, but did not mark them for Training. This kind of analysis was fairly time-consuming.
For another close family question that I spent time analyzing, see Family ID # 274249.
I had previously reviewed and marked as relevant a Word document called 101lmp.doc, control number 12004847. Although much of the document talks about “voluntary separation,” some of it talks about termination if an employee does not elect to voluntarily quit. Thus again it looked relevant to me. (Remember I decided that bona fide resignations were not relevant, but forced terminations were.) Do you see a hint in the screen shot of the parent email that suggests this email and attachments may also have been privileged?
This family has two other word docs and an excel spreadsheet. The other word docs were just limited to voluntary separations and so I marked them as Irrelevant, but not for training, as they were a close call. The spreadsheet calculated a “separation payment” if you elected to quit, so I considered that irrelevant too, but again did not Train on it.
Sometimes it does not make sense to separate the children because they are all so close and interconnected that you could not fully understand the relevant attachment without also considering the other attachments, which, on their own, might not be considered relevant. The Family # 214065 is an example of this.
It consists of an email and eight attachments. It concerns Ken Lay’s announcement of the purchase of what is left of Enron by Dynegy. Two of the attachments talked about employee terminations, but the others talked about other aspects of the deal. I thought you needed to see them all to understand the ones mentioning layoffs, so I marked them all as relevant. I did the same for Family # 564604 concerning the same event. I did the same for Family 648122 concerning the Dynegy merger.
I also did the same thing for Family # 458836. This last family caused me to change one document that I had previously called Irrelevant and Trained on, and made a sticky note about. I changed the coding to Relevant, but said no for Training. See doc control number 10713054.
As you can see, it is a FAQ document about leaving Enron in the context of voluntary departure, but it was part of a larger package concerning the massive 50% layoffs in October 2001. I left a new sticky note on the document explaining my flip-flop. I started off not knowing if it was relevant or not and so marked it Undetermined (essentially put off for later determination). In round two the computer rated this FAQ document as 94.7% likely relevant. I was convinced at that time that the computer was wrong, that the document was irrelevant because it only pertained to voluntary terminations.
Now, in this third round, I changed my mind again and agreed with the computer, and thought that this FAQ document was relevant. But I thought it was relevant for a new reason, one that I had not even considered before, namely its email family context.
Hybrid Multimodal Mind-Meld Search
The computer decided that this FAQ document was 57.6% probable relevant, rather than giving it a higher relevancy prediction, as you might expect since I had marked it as relevant and told it to train. Although I did not like my own flip-flopping and agreeing with the computer, I was gratified by this low percentage. It was just 57.6% probable relevant. That indicated, as I thought it should, that Inview still considered the document something of a close call. So did I. To me this was yet another piece of evidence that the procedures were working, that the AI and human minds were melding. It was a hybrid computer-human process, yet I was still in control. That is what I mean by hybrid in my catch phrase hybrid multimodal search.
How Big Should Your Families Be?
This Family Analysis is a slow process and took me over three hours to complete. It might actually save substantial time to have a more expansive family protocol, one where all attachments are auto marked as relevant if the email is relevant or any of the attachments. But then you end up disclosing more, and possibly triggering more redaction work too. I wanted input on this, especially since exports from AdvanceView always include all families, like it or not. So I asked KO’s experts on this and they indicated that most people produce entire families without dropping any members, but there is some variation in this practice. I would welcome reader comments on this full family production issue.
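The difference between the two family protocols can be sketched in code. This is a minimal illustration under my own assumptions: the dictionary layout and the function names are hypothetical and have nothing to do with how Inview or any export tool represents families.

```python
def narrow_protocol(family):
    """Parent email plus only the attachments marked relevant."""
    relevant = [doc for doc, is_rel in family["attachments"].items() if is_rel]
    if not relevant and not family["parent_relevant"]:
        return []                      # nothing relevant: produce nothing
    return [family["parent"]] + relevant


def full_family_protocol(family):
    """Every family member, if the parent or any attachment is relevant."""
    if not (family["parent_relevant"] or any(family["attachments"].values())):
        return []
    return [family["parent"]] + list(family["attachments"])


# Invented example: one email with two attachments, one relevant, one not.
family = {
    "parent": "email",
    "parent_relevant": False,
    "attachments": {"attachment-1": True, "attachment-2": False},
}

print(narrow_protocol(family))        # ['email', 'attachment-1']
print(full_family_protocol(family))   # ['email', 'attachment-1', 'attachment-2']
```

The narrow protocol requires a relevance call on every attachment, which is where the review time goes. The full family protocol needs only one relevant member, at the cost of producing the irrelevant sibling too.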
Concern Regarding Scope of Relevance
At the end of this exercise to Add Associated documents based on the 137 previously categorized as Relevant, I had 289 Relevant documents, an increase of 162 documents (118%). See the screen shot below of the search folders where I stored these documents.
So this proved to be an effective way to increase my relevance count, my recall, and to do so with very good (96%) precision (162 out of the 168 added as Thread members were marked by me as relevant). But it was not that helpful in Training, I didn’t think, because very few of the 162 newly classified relevant documents were worthy of training status. Most were just technically relevant, for example, because they were an email parent transmitting a relevant child.
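The percentages quoted for the Thread expansion follow directly from the counts given in the narrative. A quick check, with variable names that are mine and purely illustrative:

```python
# Counts reported in the narrative; names are illustrative only.
thread_added = 168       # documents pulled in by the Thread command
marked_relevant = 162    # of those, documents marked relevant on review
prior_relevant = 137     # relevant documents before the Add Associated exercise

precision = marked_relevant / thread_added     # share of added docs that were relevant
increase = marked_relevant / prior_relevant    # growth relative to the starting folder

print(f"{precision:.0%}")   # 96%
print(f"{increase:.0%}")    # 118%
```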
For that reason, I still wanted to make at least one more effort to reach for outliers, relevant documents on employee termination that I had not yet discovered. I was concerned that there might be relevant documents in the collection of a completely different type that I had not found before. I was concerned that my training might have been too narrow. Either that, or perhaps I was near the end. The only way to know for sure was to make special efforts to broaden the search. I decided to broaden the scope of training documents before I ran another Training Session.
Input from KO’s Joe White
To double-check my analysis and plan, I consulted with the KO IRT search expert helping me, Joseph White. He basically agreed with my analysis of the results to date and provided several good suggestions. He agreed that we were at a tipping point between continuing to search for examples vs. considering the system trained. He also agreed with my decision to keep going and make more efforts to broaden the search for new training documents.
Joe recommended I run a new learning session at this time to be sure that my suggestion status was current. Then if I ran another Focus document training session this would, in his words, help the Inview classifiers. He described the process in shorthand as “Run Learning Session, optionally enable new suggestions, pull new Focus documents and train them, repeat.” That is the essence of the predictive coding part of multimodal search. Joe explained that these training sessions can happen many times across all categories, or just the ones you are most concerned about and want further clarity for the system. Joe observed that in my search project to date there were relatively few documents in gray areas, as opposed to other projects he had seen. That meant my project might not need much more iterative training.
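Joe’s shorthand describes an iterative loop, and a toy version can be sketched in code. Everything here is an assumption made for illustration: I assume Focus documents are the ones the classifier is least sure about, the one-feature threshold “classifier” is a stand-in that has nothing to do with Inview’s actual classifiers, and the corpus and reviewer rule are invented.

```python
import math


def train(labeled):
    """Toy learning session: threshold midway between the highest
    irrelevant example and the lowest relevant example."""
    relevant = [x for x, is_rel in labeled if is_rel]
    irrelevant = [x for x, is_rel in labeled if not is_rel]
    return (max(irrelevant) + min(relevant)) / 2


def probability(x, threshold, sharpness=20.0):
    """Logistic score: about 0.5 near the threshold, near 0 or 1 far from it."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - threshold)))


def pull_focus(unlabeled, threshold, k):
    """Pick the k unlabeled documents the toy classifier is least sure about."""
    return sorted(unlabeled, key=lambda x: abs(probability(x, threshold) - 0.5))[:k]


def reviewer(x):
    """Stand-in for the human reviewer; this rule is hidden from the classifier."""
    return x > 0.70


corpus = [i / 100 for i in range(100)]      # invented one-feature "documents"
labeled = [(0.05, False), (0.95, True)]     # two seed judgments
unlabeled = [x for x in corpus if x not in (0.05, 0.95)]

for round_number in range(1, 11):
    threshold = train(labeled)                         # run learning session
    grey = [x for x in unlabeled
            if 0.40 <= probability(x, threshold) <= 0.60]
    if not grey:                                       # no grey-area docs left: stop
        break
    for x in pull_focus(unlabeled, threshold, k=3):    # pull Focus documents
        labeled.append((x, reviewer(x)))               # reviewer codes them
        unlabeled.remove(x)

print(round_number, round(threshold, 3))
```

The loop stops once no unreviewed documents remain in the grey band, which mirrors the idea that a shrinking grey area is a sign the system is close to trained.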
In Joe’s experience in situations like this one, where fewer than 2-3% of the corpus is presenting as relevant, it is generally more difficult to determine how well the system is doing. He suggested that when facing such low prevalence rates, which I know from experience is typical in employment cases, I should continue to use other search techniques, as I had already been doing, to try to locate additional relevant documents to feed into training. In other words, he was recommending the multimodal approach.
For instance Joe suggested use of other Inview search features, including: Associated Documents; Topic Grouping; Concept Searching to look for terms that may help you find other terms/documents that will yield relevant content; Find Similar; and Keyword Searching using special Inview capacities such as the Data Dictionary function to view term variants/counts. He also suggested that I continue to engage in general analysis of date ranges, custodians/people, and metadata patterns related to documents I had found (to help expand on the story). Joe suggested I use the Analytics view to help do this. This graphics display of data and data relationships allows for visual navigation and selection of communication patterns, date ranges and subject lines. Below is a screen shot of one example of the many Analytic views possible.
Most good software today has similar visual representations, including the one shown above of email communication patterns between custodians.
As to the pure predictive coding search methods, which Joe refers to as Active Learning, he suggested I continue to use the Focus document system he described before. He noted that if the gray area count diminishes, and few new relevant documents turn up, then I will know that, in his words, I’m in a good place.
Joe applauded my efforts in nuanced selection of documents for training to date. He suggested that I continue to look for new types of relevant documents to include in the IRT training. (The KO people rarely say predictive coding, which is a habit I’m trying to break them of. (The term predictive coding is descriptive, has been around a long time, and cannot be trademarked.))
Joe said I was correct to not focus inwardly on the already-trained documents. But he pointed out that such an inward focus might be appropriate in other projects where there is a concern with prior training quality, such as where there is a change mid-course in relevancy or where mistakes were made in initial coding. Since this had not happened here, he said I was on the right track to focus my search instead on outliers, new types of relevant documents, using the multimodal approach, which, by the way, are my words, not Joe’s. Like most vendor experts, he tends to use proper corporate speak, and is slow to be indoctrinated into my vocabulary. Still, progress is being made on language, and Joe is never hesitant to respond to my questions. Joe’s near 24/7 access is also a treat.
My total time estimate for this sixth day of four hours did not include my time to study Joe White’s input or write up my work.
To be continued . . . .
Tremendous thanks for such an in-depth description of your process. Couple of questions:
Could you say generally how you chose which of the relevant documents KO should train on?
Were they simply the most relevant?
In order to select training documents did you need to make any assumptions about how KO was generalizing from them? If so, what assumptions did you make?
Glad you are enjoying the narratives. Hope it will inspire you to try it for yourself, if you have not already done so.
As to your question:
Generally I marked all relevant documents for training. I made a few exceptions when the relevancy would not have been helpful, such as with transmittal emails that were only relevant because of the attachments, or when it was very repetitive. A few examples are given throughout the narrative. I also marked many irrelevant documents for training too, and I also marked borderline grey area documents and training set documents. Again, this is all described in the narrative. Like all of Law, it is more of an art than a science and depends on the facts, the particular documents you encounter, and the issues under consideration.
Thanks for the insight into your processes for predictive coding. It’s much appreciated.
In regard to your question regarding families, my personal preference, and what I have seen in the Australian market, is for the entire family to be exchanged even if siblings are not relevant.
Where the sibling (or the host for that matter) is completely privileged or irrelevant and confidential, the data will be exchanged but not the image of the document itself. This saves from having to redact complete documents.