Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron

This is the first in a series of narrative descriptions of a legal search project using predictive coding. Follow along while I search for evidence of involuntary employee terminations in a haystack of 699,082 Enron emails and attachments.

Joys and Risks of Being First

To the best of my knowledge, this writing project is another first. I do not think anyone has ever previously written a blow-by-blow, detailed description of a large legal search and review project of any kind, much less a predictive coding project. Experts on predictive coding speak only from a mile-high perspective, never from the trenches (you can speculate why). That has been my practice here too, until now, and also when speaking about predictive coding on panels or at conferences, workshops, and classes.

There are many good reasons for this, the main one being that lawyers cannot talk about their clients' business or information. That is why, in order to do this, I had to run an academic project and search and review the Enron data. Many people could do the same. In fact, each year the TREC Legal Track participants conduct similar searches of Enron data. But still, no one has taken the time to describe the details of their search, not even the spacey TRECkies of the TREC Legal Track.

A search project like this takes an enormous amount of time. In fact, only the 2011 Legal Track TRECkies recorded and reported the time they put into the project, and even then only as totals. In my narrative I will report the time I put into the project on a day-by-day basis, and also, sometimes, on a per-task basis. I am a lawyer. I live by the clock and have done so for thirty-two years. Time is important to me, even non-money time like this. It also takes a not-insignificant amount of time to write up a narrative like this; I did not attempt to record that.

There is one final reason this has never been attempted before, and it is not trivial: the risks involved. Any narrator who publicly describes their search efforts assumes the risk of criticism from Monday morning quarterbacks about how the sausage was made. I get that. I think I can handle the inevitable criticism. A quote that Jason R. Baron turned me on to a couple of years ago helps, the famous Man in the Arena passage from Theodore Roosevelt's speech at the Sorbonne:

It is not the critic who counts: not the man who points out how the strong man stumbles or where the doer of deeds could have done better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood, who strives valiantly, who errs and comes up short again and again, because there is no effort without error or shortcoming, but who knows the great enthusiasms, the great devotions, who spends himself for a worthy cause; who, at the best, knows, in the end, the triumph of high achievement, and who, at the worst, if he fails, at least he fails while daring greatly, so that his place shall never be with those cold and timid souls who knew neither victory nor defeat.

I know this narrative is no high achievement, but we all do what we can, and this seems within my marginal capacities.

Desired Impact

I took the time to do this in the hope that such a narrative will encourage more attorneys and litigants to use predictive coding technology. Everyone who tries this new technology agrees it is the best way yet to find evidence in an economical manner. It is the best way to counter all those who would use discovery as a tool for abuse, not a tool for truth. See, e.g.:

Predictive coding can finally put an end to this abuse. We can use these methods to search large volumes of ESI in a fast, efficient, and economical manner. We have to do this. It is imperative because the volumes of ESI continue to grow, and, along with this flood, the costs of discovery continue to spiral out of control. Despite all of our efforts at cooperation and professionalism, there are still far too many attorneys out there who take advantage of this situation and use discovery as a weapon to try to force defendants to settle meritless cases. See, e.g.:

  • Bondi v. Capital & Fin. Asset Mgmt. S.A., 535 F.3d 87, 97 (2d Cir. 2008) (“This Court . . . has taken note of the pressures upon corporate defendants to settle securities fraud ‘strike suits’ when those settlements are driven, not by the merits of plaintiffs’ claims, but by defendants’ fears of potentially astronomical attorneys’ fees arising from lengthy discovery.”)
  • Spielman v. Merrill Lynch, Pierce, Fenner & Smith, Inc., 332 F.3d 116, 122-23 (2d Cir. 2003) (“The PSLRA afforded district courts the opportunity in the early stages of litigation to make an initial assessment of the legal sufficiency of any claims before defendants were forced to incur considerable legal fees or, worse, settle claims regardless of their merit in order to avoid the risk of expensive, protracted securities litigation.”)
  • Lander v. Hartford Life & Annuity Ins. Co., 251 F.3d 101, 107 (2d Cir. 2001) (“Because of the expense of defending such suits, issuers were often forced to settle, regardless of the merits of the action. PSLRA addressed these concerns by instituting . . . a mandatory stay of discovery so that district courts could first determine the legal sufficiency of the claims in all securities class actions.” (citations omitted))
  • Kassover v. UBS A.G., 08 Civ. 2753, 2008 WL 5395942 at *3 (S.D.N.Y. Dec. 19, 2008) (“PSLRA’s discovery stay provision was promulgated to prevent conduct such as: (a) filing frivolous securities fraud claims, with an expectation that the high cost of responding to discovery demands will coerce defendants to settle; and (b) embarking on a ‘fishing expedition’ or ‘abusive strike suit’ litigation.”)

Follow me now while I search for relevance in the ashes of Enron.

699,082 Enron Documents: the Ashes of a Once-Great Empire

My search was of the 699,082-document slice of the Enron database put together by EDRM. It is the V2 version, processed by ZL Labs for EDRM. It was deduplicated at the custodian level, in other words, vertical deduplication, not horizontal (a simplified sketch of that distinction follows the list below). Specifically, this EDRM Enron dataset includes:

  • EDRM Enron PST Data Set: Enron e-mail messages and attachments organized in 32 zipped files, each less than 700 MB in size, containing 168 .pst files. The total size of the compressed files is approximately 19 GB. The total size of the uncompressed files is approximately 43 GB.
  • EDRM File Formats Data Set: 381 files covering 200 file formats.
  • EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email. (Note: in this review project I arbitrarily called any foreign language documents irrelevant, and did not consult with translators.)
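For readers who have not run into the vertical-versus-horizontal distinction before, here is a minimal sketch of the idea (the field names and hashing choices are my own illustration, not ZL's actual processing): vertical deduplication removes duplicate messages only within each custodian's mailbox, while horizontal deduplication would remove them across the entire collection.

```python
import hashlib

def fingerprint(msg: dict) -> str:
    """Hash the fields that define a duplicate email (simplified illustration)."""
    key = "|".join([msg.get("from", ""), msg.get("to", ""),
                    msg.get("date", ""), msg.get("subject", ""),
                    msg.get("body", "")])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

def dedupe(messages: list, vertical: bool = True) -> list:
    """Vertical: drop repeats within each custodian's mailbox only.
    Horizontal: drop repeats across the whole collection."""
    seen, kept = set(), []
    for msg in messages:
        h = fingerprint(msg)
        key = (msg["custodian"], h) if vertical else h
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept
```

The practical consequence of vertical deduplication is that the same email can still appear more than once in the 699,082 documents, once for each custodian who kept a copy.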

I conducted the search using Kroll Ontrack (“KO”) Inview software. KO was kind enough to provide the EDRM Enron data, software, and hosting without charge. Any failures or mistakes are to be attributed to me, not them, and certainly not their software. KO left me alone to do as I pleased, and only provided input on one occasion, at my request, as documented later in Days Five and Six of a Predictive Coding Narrative: Deep into the weeds and a computer mind-meld moment.

This 699,082-document slice of the Enron database looks like a somewhat random selection of emails and attachments, not unlike what you would find by reviewing the PST files of a number of key custodians. It runs from the late 1990s, when the company was doing great and involuntary terminations were rare, to its eventual dissolution. In the early emails, life for the 20,000 Enron employees was good, and their email reflected that. Enron was growing and hiring. It was one of the hottest companies in America, with revenues of over $100 billion. But all of that changed rapidly near the end, when the company fell into bankruptcy in late 2001 and the emails slowly came to an end.

The dates in this Enron data collection, excluding a few outliers, ranged from January 1, 1997 to November 30, 2002.


My search for documents related to involuntary employee terminations led me to focus on the final sad months when the empire fell apart and the company itself was ultimately dissolved. The search involved fine relevancy distinctions between voluntary and involuntary terminations, with only involuntary terminations being relevant. Fine distinctions like that are common in search projects, and this one was no exception.

You would have to be a cold human indeed not to feel some of the pain of the thousands of people who ended up losing their jobs, their incomes, their retirement savings, because of the dastardly behavior of a few bad apples at the top. I have read their Enron email, which included many personal notes to family, friends, and even lovers. I have seen family photos, read their jokes and inspirational messages. I have even seen their porn and their cursing in anger. It is all there in the email, their life.

I have intruded into their privacy, uninvited, and unwelcome. For that I almost feel a need to apologize, but this is now public data, and my purpose was an academic study, not commercial or personal exploitation. Still, out of respect for the hundreds of people whose privacy I have necessarily invaded by the search of 699,082 emails and attachments, I will not include any specific information in this narrative about these people and their lives. I owe them that much, and anyway, it seems like the decent thing to do. I do not think the omission in any way detracts from the value of the narrative.

Learn By Watching, Then Doing

The original point of the exercise was to provide training to a group of my firm’s e-discovery liaisons who attended a KO training session in Minnesota. We trained through a demonstration of the Inview software being used to respond to a hypothetical RFP. The exercise emphasized the use and explanation of the predictive coding features.

The feedback from my liaisons was that this was a good way to learn. This is not surprising because lawyers typically learn best by doing, and before we do something for the first time, in an ideal situation at least, we usually observe someone else who already knows how to do it. Any good law firm will, for instance, have a new associate watch an experienced partner take a few depositions before letting them take a deposition on their own. It is part of the legal apprenticeship program and one reason we call it the practice of law.

Overview of Efforts

I conducted this search over the course of eight workdays in May and June 2012. At the end of each day I sent out a description of what I had done. All of the lawyers in the training were invited to log onto the database and follow along. I have since edited these daily reports into a single narrative, all for general instruction purposes.

This Search is Just One Example Among Many

This narrative shows one example of the use of predictive coding in a typical legal setting. It is just one illustration, and many alternative approaches could have been followed. Indeed, if I were to do this over again, I would do many things differently now that I have the benefit of 20/20 hindsight. Also, my knowledge of this particular software, Inview, has improved since this relatively early experiment, especially the ins and outs of how its predictive coding features work. It was, however, not my first such experiment with KO’s Inview, not to mention my work with other vendors’ software, each of which works slightly differently. As they say in Texas (and central Florida), this was not my first rodeo. Still, if I rode this particular bull again, I would do it differently. And I am sure there are any number of people who could do it better on any given day, including many of my readers.

Best Practice

I do not contend that the particular search efforts here described were the best possible way the search of this data for this purpose could have been performed. Moreover, I readily admit that it was not even close to a perfect process. Perfection in legal search is never possible by anyone with any software. Perfection is never required by the law, in search and review, or anything else. I do contend, however, that the efforts here described constituted a reasonable search effort. It should, therefore, withstand any legal challenge as to its adequacy, since the law only requires reasonable search efforts.

Having said all that, to be honest, I think the search here described was fairly well done. Otherwise I would not waste the reader’s time with the description, nor use this narrative for instruction. Since I specialize in this stuff, and am considered an expert in legal search, particularly predictive coding, I would go a bit further and claim that my efforts were more than just adequate. (A quick footnote on my qualifications: I have over 30 years of experience searching for ESI on computers, a pending patent on one legal search method, and I am a published author and frequent speaker on the subject.) Right now predictive coding, and related legal doctrines and methods such as proportionality and bottom-line-driven review, are my primary interest in e-discovery. You could say I am obsessed and talk about it all of the time.

Based on my background and experience, I think it is fair to contend that the search conducted was more than a mere legally adequate effort, more than just a reasonable effort. I would argue that it constitutes an example of a best practice of search and review. It qualifies as a best practice (as opposed to best possible) for two reasons: (1) advanced, predictive coding based software was used; and, (2) the search was conducted by a qualified expert. Still, the particular details and methods used in the search described in the narrative are just one example of a best practice; one among many possible approaches. Also, it is certainly not a standard to be followed in every search. It is important not to confuse those two things. Standards are more general. They are never single-case specific. They are never reviewer specific. So, after all of this long introduction, we finally come to the search narrative itself.

Come, Watson, come! The game is afoot.

First Day of Review (8.5 Hours)

The review began with judgmental sampling. I just looked around the 699,082 documents to get a general idea of the types of documents in the dataset, the people involved, the date ranges, and the kinds of subjects their email addressed. This could have been done with reports generated by Inview, but I chose to do so by displaying all of the documents and sorting the display in various ways, including one of my favorites, display by file type. That allowed very easy viewing of the underlying documents whenever I wanted.

Another good method I could have used was the Analytics view, with graphic displays providing visual information about the data. This includes a pull-down menu where you can review certain file types. Or you can review by custodian, with a visual display of who is emailing whom. Or you can view graphical displays of date ranges. These kinds of visual displays of ESI contours are now common in most advanced software.

I also ran a few obvious, and some not so obvious, keyword searches pertaining to employee termination and other things.

By just looking around as I did, and running a few easy keyword searches, I found some relevant documents, and many more irrelevant ones. Although I was just beginning to familiarize myself with the data, I went ahead and coded some documents where it was obvious they were either relevant or irrelevant; the lowest hanging fruit, if you will. I coded 412 documents in this manner.

I call this judgmental sampling because I was using my own judgment to select and review small samples of the overall data. Before I began the predictive coding search process, I would also include random sampling, as this is core to all predictive coding methods. But, as is usual for me, I started here with judgmental sampling.

By the way, although I will be very transparent here, I am not going to tell you everything I did. I am going to save a few trade secrets, a little bit of secret sauce, such as exactly what keyword searches I ran at the very beginning. As Maura Grossman likes to describe such disclosure, it’s translucent, not transparent. I hope you understand.

Category Coding

I designed only five coding choices in this exercise:

  • Irrelevant,
  • Undetermined (relevancy),
  • Relevant,
  • Highly relevant (a sub-category of Relevant),
  • Privileged

My categorization screen included these categories, plus a box to check to tell the computer to train on the document. This training box is always optional; its use will be explained further along in this narrative as we dive deeper into the predictive coding aspects of the search. The training box should only be checked on a category chosen for the document.

One kind of mistake is a cross-categorization error, where the Training box is checked on the wrong category: for example, the document is categorized as Irrelevant, but the Training box is checked on Relevant. The computer will not allow you to proceed with the error. You would either have to uncheck the Training box on Relevant and check it on Irrelevant instead, or on nothing at all; alternatively, you could change the category to Relevant and leave the Training box checked. This kind of consistency safeguard was present in all of the software systems I looked at. Ask your vendor to confirm that they have similar consistency safeguards in their coding.
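Inview’s internal workings are not public, so purely as an illustration, the safeguard boils down to a validation rule along these lines (the category names are from this project; the function and its signature are hypothetical):

```python
# Hypothetical sketch of the consistency safeguard, not Inview's actual code.
CATEGORIES = {"Irrelevant", "Undetermined", "Relevant", "Highly Relevant", "Privileged"}

def coding_errors(chosen: set, train_on: set) -> list:
    """Return a list of coding problems; an empty list means the coding may be saved."""
    errors = []
    unknown = (chosen | train_on) - CATEGORIES
    if unknown:
        errors.append("Unknown categories: " + ", ".join(sorted(unknown)))
    # The Training box may only be checked on a category actually chosen for the document.
    stray = train_on - chosen
    if stray:
        errors.append("Training checked on unchosen categories: " + ", ".join(sorted(stray)))
    return errors

# The cross-categorization error described above: coded Irrelevant, trained on Relevant.
print(coding_errors(chosen={"Irrelevant"}, train_on={"Relevant"}))
```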

Regarding the Privileged category, I only ran into a few privileged documents that were relevant. When this happened, I of course marked them as such. But they were so rare as to not be valuable to describe here. The narrative will instead focus entirely on my search for relevancy.

This initial orientation period lasted about three hours. (Whenever I report time herein, that is billable type time. I’m not including breaks or significant interruptions.)

First Predictive Coding Run

After this orientation I began the search and coding project in earnest by starting the predictive coding procedures of Inview. I began by generating the first random sample of the data. I used a 95% confidence level and a +/- 3% confidence interval. Based on these specifications the software randomly selected 1,507 documents. I’ll explain that number soon, as observant readers will note it seems too high.
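For readers who want to check the arithmetic behind that remark: the textbook Cochran formula, assuming a worst-case 50% prevalence and applying a finite population correction for the 699,082 documents, calls for a sample of roughly 1,066 documents at a 95% confidence level with a +/- 3% interval, noticeably smaller than 1,507. The sketch below is just that standard calculation, not a description of how Inview actually draws its sample; the reason for the larger number is explained later in the narrative.

```python
import math
from statistics import NormalDist

def sample_size(confidence: float, interval: float, population: int, p: float = 0.5) -> int:
    """Classic Cochran sample-size formula with a finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.96 for a 95% confidence level
    n0 = (z ** 2) * p * (1 - p) / (interval ** 2)       # infinite-population sample size
    return math.ceil(n0 / (1 + (n0 - 1) / population))  # corrected for the actual corpus size

print(sample_size(0.95, 0.03, 699_082))  # -> about 1,066
```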

My first actual systematic coding was done by review of each of the 1,507 randomly selected documents. In the language of KO and Inview, these are called the machine-selected documents that will be used to train the system; a/k/a, the first seed set. They also served as the initial baseline for my Quality Control calculations, as will be explained later also.

I completed this review of the 1,507 documents in 5.5 hours. After I completed that review, Inview’s Initiate Session button became active. At that point I could start a machine learning session, but not before.

I made an effort during the review to monitor my review speeds. I started at a speed of about 200 documents per hour. Gradually, as I got better with the controls on my MacPro (the first time I had used it for an Inview review, and I loved it), and as I gained closer familiarity with the stupid Enron documents, my speed went up to about 300 files per hour. I made liberal use of the bulk coding capabilities to attain these speeds, but was still careful. On a dual-screen monitor, knowing what I now know about the kind of random documents I am likely to see, and how to use the software and keyboard shortcuts, I think I could attain a speed of 400 files per hour, maybe even 500. Remember, this is only possible (for me at least) in review of null-set type collections, i.e., collections where almost all of the documents are irrelevant. It is much slower to review culled sets where there are 10% or more relevant documents. There you will be lucky to see 100 to 200 files per hour, even from top reviewers using clever sorting tricks and bulk coding.

Out of the 1,507 items I reviewed, only 2 documents were identified as relevant. None was identified as highly relevant. Remember, the goal is to find documents about involuntary employee termination (not contract termination, and not an employee’s voluntary termination, retirement, etc.). Moreover, the ultimate goal is to find the few highly relevant documents about involuntary employee termination that might be used at trial.
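Two relevant documents in a random sample of 1,507 also says something about how rare relevant documents are in the whole collection. As a rough back-of-the-envelope illustration (my own calculation, not anything Inview reports), an exact binomial confidence interval on that proportion looks like this:

```python
from scipy.stats import beta

def clopper_pearson(k: int, n: int, confidence: float = 0.95):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    alpha = 1 - confidence
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

k, n = 2, 1507
lo, hi = clopper_pearson(k, n)
print(f"point estimate {k / n:.3%}, 95% interval {lo:.3%} to {hi:.3%}")
# point estimate of about 0.13%, with an interval of very roughly 0.02% to 0.5%
```

In other words, the prevalence of relevant documents appears to be a small fraction of one percent, which is exactly the kind of low-richness collection that makes any search difficult.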

Based on this first review of the 1,507 random documents (also called by Inview System Identified documents), plus my casual review of 412 documents at the beginning of the day (called by Inview Trainer Identified documents), Inview went to work. I called it a night and let the computer take over.

The computers in the cloud (well, actually they are in Eden Prairie, Minnesota, at KO’s secure facility) then churned away for a couple of hours. (No, I do not record this as billable time!) The computer studied my hopefully expert input on relevance and irrelevance. While I slept, Inview analyzed the input, analyzed all 699,082 documents, and applied the input to them. It then ranked the likely relevance or irrelevance of all 699,082 documents, from nearly 0% to 100%. The first round of predictive coding training on the seed set was then complete. All documents were now ranked and ready for me to review when I next logged on.
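Kroll Ontrack’s learning algorithms are proprietary, so the following is only a conceptual sketch, in generic Python with scikit-learn, of what "train on the coded seed set, then score every document for likely relevance" can look like. It is not Inview’s actual method, and the function names are my own.

```python
# Conceptual sketch only; Inview's real training and ranking is proprietary
# and far more sophisticated than this generic text classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def rank_collection(seed_texts, seed_labels, all_texts):
    """Train on the coded seed set (1 = relevant, 0 = irrelevant),
    then score every document in the collection from roughly 0% to 100%."""
    vectorizer = TfidfVectorizer(max_features=50_000, stop_words="english")
    model = LogisticRegression(max_iter=1000)
    model.fit(vectorizer.fit_transform(seed_texts), seed_labels)
    scores = model.predict_proba(vectorizer.transform(all_texts))[:, 1]
    # Highest-scoring documents are the natural candidates for the next round of review.
    return sorted(zip(scores, range(len(all_texts))), reverse=True)
```

Whatever the vendor’s particular mathematics, the end result is the same: when I logged back on, every one of the 699,082 documents carried a predicted relevance ranking.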

To be continued . . .

Responses to Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron

  1. Ralph,

    Very interesting. I’ll eagerly await the next installment.

    In response to your comment about time reporting, the human time involved in the Waterloo effort to review four topics for TREC 2009 is given in “Technology Assisted Review Can Be More Effective and More Efficient than Exhaustive Manual Review” by Maura Grossman and myself, pages 31-34:

    “(14,396 of the 836,165 documents were reviewed, on average, per topic). Total review time for all phases was about 118 hours; 30 hours per topic, on average.”

    Gordon

    • Ralph Losey says:

      Thanks for posting Gordon. I appreciate an information scientist of your caliber taking the time to comment, especially one who has been a participant in several TREC Legal Tracks.

      Two followup questions, if you don’t mind:

      (1) Did you do all of the “human” work yourself, or were you assisted by anyone, and if so, who, and what were the respective times?

      (2) Have any other TREC Legal Track participants revealed the human times spent on a topic, or are you the only one? If others have shared the time information, where might I locate that?

      Thanks,

      Ralph

      • Ralph,

        In TREC 2009, I did all the document review for the Waterloo effort, using the two-phase approach outlined in Grossman & Cormack, and further detailed in Cormack & Mojdeh (http://trec.nist.gov/pubs/trec18/papers/uwaterloo-cormack.WEB.RF.LEGAL.pdf).

        I believe that some but not all of the other participants in TREC 2009 and TREC 2010 have reported their times in the TREC proceedings. See trec.nist.gov — unfortunately the NIST site appears to be down at this instant.

        TREC 2011 participants were required to provide an estimate of the human effort involved in their submissions. These estimates will be included in the TREC Overview Paper which is imminent.

        regards,
        Gordon

  2. The TREC web site is up and running, now. (NIST is located in one of the areas worst-hit by Friday’s storm and subsequent power outages.)

Each of the TREC 2011 participants’ estimates of the amount of time taken for configuring/loading, for searching, for reviewing, and for analysis is recorded in the run description. The run descriptions are available now in the Appendix to the TREC 2011 proceedings (which in turn is in the publications section of the TREC web site):
    http://trec.nist.gov/pubs/trec20/t20.proceedings.html

    Cheers,

    Ellen Voorhees
    NIST

    • Ralph Losey says:

      Thanks. Glad to see the NIST computers and TREC data made it through ok. Hope everyone, yourself included, has recovered from the storms.

      Also glad to know that some time recording was required in 2011.

  3. wewebber says:

    Ralph,

    Thanks for writing this up: this is invaluable not only for lawyers thinking of dipping their toes (or hurling their entire bodies headlong) into predictive coding, but also for researchers trying to understand the process and pragmatics of using an industrial predictive coding system.

    One quick question (to ask; perhaps not to answer): of the 412 documents you coded in your judgmental sample, how many were relevant or highly relevant?

    William

  4. Matt Toomey says:

    Ralph –

    I too am excited to be following this project.

    As an eDiscovery analyst with a background in Communication Science, I have worked — really, played — with some of the Enron data in my spare time.

    Some examples:
    http://bit.ly/MHm9yn
    http://bit.ly/LvAtGt
    http://bit.ly/N7crn6

    While I still call what I’m doing semantic and social network analysis, I find the intersections with predictive coding scintillating. Thanks for sharing this robust and enlightening work!

  5. Ralph Losey says:

    Thanks for the comment William. The answer is 2 relevant. No highly relevant. Only two relevant documents were found in that first judgmental sample of 412. I knew by that I was going to be in for a difficult search.

  6. Martha (Martie) Evans says:

    Mr. Losey,

    I always enjoy your writing and often have a smile on my face while reading. Your entertaining and educational style really contributes to my ability to learn and focus. Thank you for all of the time you devote to education! (I no longer practice law due to the wrangling of 3 wild children, but as an Account Manager in eDiscovery, I enthusiastically consume all I can about the topic. There is always so much to learn!)

    Best regards,

    Martha (Martie) Evans
    Member, NY Bar

  7. John Martin says:

    Ralph, can you publish a list of the Enron V2 document numbers (e.g., 3.818877.G3T4II30F0UK4YM4G2XQMIIKYS451SXUA.eml) of the documents that you found in each one of your five categories? I believe this would be of great benefit to the industry as a whole.

  8. John Martin says:

    Alternatively, just a list of the relevant ones would work too.
