The first e-Discovery Team science fiction saga appeared on June 10, 2012. That blog was called A Day in the Life of a Discovery Lawyer in the Year 2062: a Science Fiction Tribute to Ray Bradbury. The story involved a legal search project in the year 2062. It was a first-person narrative by the young attorney in charge. The future discovery lawyer described his admittedly very far-out usage of the latest multi-sensory search software. It did not really teach or say that much about legal search today, although I’d like to think it had some pedagogic value.
This second attempt at e-discovery science fiction, like the first, again combines future law, information science, and artificial-intelligence-based search software. But this time the search project is set just a few years into the future, and so the descriptions are much more down to earth. In fact, some may argue that the software needed to do the project described here already exists. I look forward to your opinions on that in the comment section at the end of the story.
Before I start the story I must apologize to all Star Trek fans. I am unworthy to write a bona fide Star Trek novel (all I can do are simple cartoons), so rest assured that this story will have nothing to do with the beloved genre. Instead, only some general aspects of the Borg theme will be borrowed, including the so-called hive mind. Otherwise, this not-so-subtle attempt to teach lawyers about current-day predictive coding will have nothing to do with our beloved Star Trek.
An e-Discovery Project in the Not-Too-Distant Future
My Saturday night was interrupted by a high-priority message from our law firm president. We had won the Google RFP. We had been retained to handle the big China Space case. He mentioned in passing that all of our proposed terms had been accepted, except one. I later found out that Google did not counter on price, as we had feared. Instead, they asked for a revision to our discovery specifications. The partners in charge of the bid were not concerned. Why should they be? The variance concerned only the software and search protocol, and required the use of a particular vendor. It only affected me, the discovery lawyer, not them. As they put it to me in the internal kick-off video conference on Monday, they knew I could handle it, especially since I was such an expert in those fields. I hate it when they do that.
Everyone in the firm was happy to get the case. I knew better than to rain on the parade during the video conference, so I smiled and said: Yeah sure, I’ll deal with it. Still, I wanted the whole team to understand how difficult and risky this new protocol could be. It could doom the whole case. I asked them to read a memo to file that I had already written about the possible impact of the change. I wanted to go on and explain why it was a problem, but knew that they were not that interested. After all, they hired me to do the e-discovery so that they would not have to deal with it. But, at least they said they would read the memorandum.
Google Picks a Borg Vendor
The change Google requested seemed minor to everyone but me. I understood right away that this was a dangerous Borg-type protocol. It used only predictive coding and, in my opinion, minimized the skill and input of the lawyer reviewers.
Why didn’t Google tell us about this protocol demand in advance? Seemed like bait and switch to me and I was pissed. I later found out that a vendor had gotten to the general counsel. The CEO herself personally met with Google’s GC. She had talked him into the special protocol at the last minute by giving him some sort of special deal, or something.
Protests and CYA
I channeled my anger by writing a memorandum to file that summarized the risks involved with the Borg protocol. I also insisted that my concerns be shared with the client. Although the Google client lawyers all objected, senior management overruled. A watered-down, very diplomatic version of my memo was sent to Google’s GC on Tuesday. The memo explained how the firm’s normal multimodal search protocol was commonly accepted in the industry and generally considered far superior to the monomodal Borg approach specified in the new protocol. It also raised questions about the software of the new vendor Google retained. They were hired to do the non-legal e-discovery work in the China Space suit that Google’s own IT couldn’t handle.
The memo led to a short call on Wednesday with Google’s litigation counsel, Linda. As luck would have it, she was a fan of my blog. There were about a dozen other attorneys on the phone too, for reasons unknown, from both my firm and Google. We quickly learned that Linda was surprised by her GC’s decision to go with the vendor’s proposal. You could tell that, like me, she was a tad miffed about not being included in the decision.
I pointed out that the Borg protocol was risky and could lead to sanctions if it missed key documents, especially in this court. I said that relying on predictive coding alone was like jumping out of an airplane without an emergency parachute; that we needed the safety of the other search protocols. I also complained that it might take much longer than the multimodal approach outlined in our proposal. As far as I knew, no one had ever used a pure Borg approach in a major case. I knew I hadn’t. The approach seemed irrational. Why not use all of the search tools at your disposal? Why minimize the creative input of lawyers to find relevant documents, at least for the first seed set? It seemed crazy to me to just rely on random selection for the first training set, all on the pretext of eliminating human bias, myopia, inconsistencies, and mistakes. Those were all straw men, while the inefficiencies of not jump-starting the machine training with known relevant documents were all too real.
I was very concerned about putting too much trust into the machine and machine learning. I was also suspicious of this vendor, and of any software that had only predictive coding features. Of course it was cheaper than a full-service search engine; all it could do was machine learning. What a sales job they must have done on the GC. I kept that last comment to myself, but it was implied by my overall tone.
I was preaching to the choir with Linda. She agreed with everything, but said her hands were tied by marching orders from above. She asked me to do the best I could and try to work with this vendor. She promised to work with us on expenses, but pointed out, correctly, that we had accepted the revision without any request to change the discovery budget. It looked like I was stuck with it, but at least I had a sympathetic ear with the client.
Still, I covered myself one more time with a follow-up memo to the case partners, copy to the head of litigation. I explained that the Borg search protocol could cost much more than the multimodal approach I had priced into the e-discovery budget. I was told not to worry, that the rest of the case budget was high enough to absorb some losses from e-discovery. That was comforting, but I did not want to see my department in red ink, while everyone else was in the black. I was used to e-Discovery carrying its own weight in the profit department. I hated to have that depend on whether a Borg approach with strange software would work or not.
China Space, Inc. v. Google, Inc.
This was a big case involving contract disputes between seven different departments of China Space, Inc., and Google, Inc. As usual, both sides also pled tort claims in the alternative. It had to do with some kind of technology they were both working on. Thank God neither side had any patent claims. I hate to deal with all of the special e-discovery rules those courts are always inventing. I still remember the first patent court rules that required you to use five keyword search terms. Seems really funny in retrospect.
One reason I still liked this case, even if it did have a Borg challenge, was our judge. We were in the hottest district in the country, and all discovery issues in the case had already been assigned by the District Court Judge to one of the country’s top magistrate judges. The Magistrate had already noticed a 16(b) hearing devoted to e-discovery topics. I had only two weeks to prepare. Oh well, far better than the thirty minutes advance notice I sometimes get when on helicopter duty. I’m sure you know what that is.
We had known about the Magistrate Judge when we bid the case, but I did not think we’d be trying out some experimental protocol in his court. His expertise could, of course, cut both ways. We would have to see. So much depended on whether the Borg approach worked, or, more likely, just how badly it failed. At least if it all went bad, the firm’s reputation, and my own, would remain intact. The judge would know full well that we were not using our regular vendor. He would know that the experiment was the client’s idea, or at least the client’s vendor’s idea. I would be sure that the judge understood the difference. In that way, if the worst did happen, we could try to deflect sanctions against our client (and us), and point the blame at the vendor’s software. I would make clear, if need be, that the only bad faith here was by the vendor’s salesmen.
Yes, even if their software warranty limits and disclaimers did hold up, the new vendor had a lot at stake here. They could lose big, or surprise everyone and win big. Maybe that’s why they lowballed their bid to get the contract. They wanted a chance to prove themselves, their method and software. They also knew I would have no choice but to go along and try my best to protect them in court. Like it or not, they were my experts now, and the Borg way is the search protocol that my client ordered me to follow. Resistance was futile, but still, I didn’t have to like it.
Google’s Bulk Self-Collection
The vendor was not involved with the collection. Google’s engineers had that well in hand. It was completed a week ago and had already been processed and loaded onto the vendor’s computers. Google IT did all of the collection, not the actual custodians. They copied everything that met the specified criteria, including time, file type and location.
Four hundred and eighty-five gigabytes (485 GB) of documents were collected by Google from a total of 45 custodians. These were the class-A custodians in the case, the ones we knew the opposing party and judge would want included in the first phase of discovery. Altogether, after deduplication, etc., there were over 4,000,000 documents from these custodians. There was an additional 1.5 terabytes of ESI collected from the one hundred or so class-B and C custodians, but their data was not loaded into the platform. We might look at their ESI in future discovery phases.
I would also usually load a few mock-docs onto the search and review platform at this point. By mock-docs I mean fictitious documents that might exist, or should exist, if either side’s story was absolutely correct. We would create these documents, carefully mark them so they were never produced, and then load them as machine training documents. But I was told that did not fit the protocol here. Too bad. I had found that these mock-docs were a great way to jumpstart machine learning.
I had also found that their creation was an effective way to get the trial team to focus on what they considered the key facts in dispute to be. One of my favorite questions to ask a trial lawyer is: what would a smoking gun look like in this case? What would an email or other document look like that would ruin your case? I would also ask them to think positive, not easy for trial lawyers, and imagine what a silver bullet in this case would look like. What document would make your case for you? What words would it use? What concepts would it convey? Who would likely have said it? Anyone else? When?
I can tell a really good lawyer by how quickly they catch on to the mock-doc game. The good ones already know what they need to prove and disprove in the case. They already know what documents they would like to show to a jury to win the case. In some cases I had even agreed to allow the requesting party to submit mock-docs and use them to help guide our searches.
My First Meetings with the Borg Vendor
My first meeting with the Borg vendor started with the usual pleasantries. There were four or five of them on the phone, although after introductions only two of them ever said anything. I later learned they always worked in big groups like that. Part of the collective mentality, I suppose. The CEO who had sold the project to the GC was not on the call. Naturally I had Googled all of them in advance of the meeting. They looked like they had not been outdoors in months, pale and pasty-faced, but they had good backgrounds in technology. Only one of them also had a law degree: the CEO, and so far she was a no-show. Too bad, her Googles were very interesting.
The vendor knew where I stood on the Borg approach from my blog. The CEO was a regular reader and sometime commentator. After a few minutes of initial pleasantries, I talked about how I usually add mock-docs for training at the beginning of a project. We talked about whether there was any place for that in their software. They talked it over, and after a few possibilities were discussed, they decided it could not be done. Oh well. At least I tried. I would have to figure out another way to get my trial lawyers to focus on relevance and closing arguments. I still had only a vague idea of relevance, and I had been talking to them for hours.
The vendor team explained their approach to predictive coding, which they called fully automated. They did not seem amused when I called it the Borg approach. When I brought out the old jump-without-a-backup-parachute argument, they asked me if I drive my car with a bicycle in the trunk. I had to smile at that comeback. If you have a car, they said, you don’t need a bike. Of course, I agreed with them to a point. The other forms of search – keyword, concept, similarity, etc. – were not nearly as effective as machine learning. Still, a skilled lawyer’s use of the other search methods, sometimes even including intuition and good luck, could help the machine learn by providing good examples of relevant documents to pattern and train on. It could vastly improve the initial seed set, if nothing else.
They said their CAR didn’t need a bicycle’s help. They kept telling me how much I was going to love their software, how easy the whole process was. They called it the lazy man’s approach to predictive coding. I remained unimpressed, but did my best to try to keep an open mind. Who knows? Maybe it would even work.
We ended the first meeting by setting up a series of software training sessions on the actual data. Then I insisted on more meetings right after that with their top experts. They had one consultant I especially wanted to meet. She had a PhD in information science from the top school in legal search, so I wanted to hear her views. I did not expect to be able to talk her into multimodal. I knew that most information scientists did not appreciate lawyer search skills. They think that lawyers only want to shape the truth, and will use any trick to ignore or even bury facts not favorable to their client. They do not appreciate how seriously we take ethics and keeping our reputation intact. They think we are only clever manipulators of facts, not bona fide discoverers of facts.
This anti-lawyer attitude, which at its core is an anti-human, Borg-like attitude aimed at more than just lawyers, is one reason some scientists and businessmen are anxious to find a fully automated process. They seem genuinely annoyed by the unavoidable fact that lawyers are needed as the final arbiter of what is relevant or not.
They do not understand that the best discovery lawyers do not hide the facts; they explain them. Their clever intents are trained not on the facts, but on the law. Lawyers are sworn to get the truth, the whole truth, and nothing but. And then to go on and win the case anyway. In theory the best scientists are the same way, only they are usually not as concerned with the expense and burden of search as the lawyers are.
Random Sampling Setup For SME
In my later meetings with their experts I learned more about the search methods built into the protocol. As to probability statistics and random sampling, they were using a 2% confidence interval and, of course, a 95% confidence level. That meant an initial sample size of 2,401 documents out of the 4,000,000. As the SME for the case I was going to have to review all of these singlehandedly, both in the initial sample and then again another 2,401 in the final quality assurance test sample.
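For readers who want to check the arithmetic, the 2,401 figure falls out of the standard sample-size formula for estimating a proportion, using the usual worst-case 50% prevalence assumption. This is a minimal sketch of that formula, not any vendor’s actual code; the function name and epsilon guard are my own:

```python
import math

def sample_size(z: float, interval: float, p: float = 0.5) -> int:
    """Sample size for estimating a proportion: n = z^2 * p * (1 - p) / E^2.
    Uses the worst-case prevalence p = 0.5 unless told otherwise. The tiny
    epsilon keeps floating-point noise from pushing an exact result (like
    2401.0) up to the next integer when rounding up."""
    n = (z ** 2) * p * (1 - p) / interval ** 2
    return math.ceil(n - 1e-9)

# 95% confidence level (z = 1.96) with a 2% confidence interval:
print(sample_size(1.96, 0.02))  # → 2401
```

Note that the sample size barely depends on the size of the collection; for a 4,000,000-document population the finite-population correction is negligible, which is why the same 2,401 works for both the initial sample and the final quality assurance sample.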
I had no problem with that seemingly big burden. Although many contract review companies will tell you to assume that only 50 files per hour can be reviewed safely, I knew better. They said it might take me 48 hours to do that random review by myself (2401/50), and suggested I use two of their best reviewers instead. I did not go for that. I knew from experience that their time estimate was way off. Plus, I knew that with multiple reviewers inconsistencies were bound to crop up. Also, I did not think their reviewers would have anywhere close to my expertise and experience on relevancy. After so many years as a litigator I could spot the unexpected and see many ways that a document could be relevant or irrelevant. I wanted to be sure that the initial baseline random sample was done correctly.
Besides, random samples are quick and easy to review. That’s because 98%, 99%, and sometimes even more than 99% of the documents sampled are usually irrelevant in a case like this. After a while it becomes very easy to quickly recognize and label irrelevant documents. (The hard part is in spotting the rare relevant document. That is where the legal skills and experience come in.) I could usually run through 200 documents in a random sample in a half hour. That is how I worked: in half-hour bursts. I found that was the best way to maintain maximum concentration and efficiency. Since my hourly rate was over ten times that of an average contract lawyer, I wanted to be sure the client got its money’s worth. I thought I could probably do the review part at a speed of 400 files per hour, and thus review all 2,401 sample documents in six hours.
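The two competing time estimates above are simple division; nothing here is assumed beyond the review rates the story states:

```python
sample = 2401       # documents in the initial random sample
vendor_rate = 50    # files/hour the contract review companies assume
my_rate = 400       # files/hour at 200 documents per half-hour burst

print(sample / vendor_rate)  # 48.02 -- the vendor's roughly 48-hour estimate
print(sample / my_rate)      # 6.0025 -- roughly six hours
```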
QC of the Borg Cube
I also wanted to be sure this new vendor knew how to properly set up and sample the database. I verified that they were going to exclude all documents that lacked text, or lacked enough text for predictive coding to work. The sampling works best when there are strict limits placed on the sample pool. The sample is only of meaningful documents for machine learning. I had a side team of trusted reviewers to search and code the non-text types of ESI. There were not really that many. Besides, if an image or some other non-searchable computer file was part of an email, or other document found to be relevant, then under our standard protocols it would automatically be dragged back into the final production.
After the initial random sample, portions of which would also be used by the software for testing and training, the computer code would kick in. It would start to feed my review team batches of 400 documents at a time. A percentage of each batch, which I understood varied but was always less than 20%, would be selected randomly. The rest would be selected by the computer. Like most other predictive coding software I had used, it was designed to select documents that its analysis showed would most benefit from human classification. That usually meant documents in the 25% to 60% probable-relevant range, but not always.
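A rough sketch of how such a batching step might work. To be clear, this is my own illustration, not the vendor’s software: the function name, the 15% random share, and the nearest-to-the-decision-boundary ordering are all hypothetical; only the 400-document batches, the under-20% random slice, and the 25% to 60% probable-relevance band come from the description above:

```python
import random

def build_batch(scored_docs, batch_size=400, random_share=0.15):
    """Assemble one review batch: a small random slice for ongoing quality
    control, plus the documents the classifier would most benefit from
    having a human classify -- here, those in the 25%-60% probable-relevance
    band, closest to the 50% decision boundary first.
    `scored_docs` is a list of (doc_id, probability) pairs."""
    n_random = int(batch_size * random_share)
    random_picks = random.sample(scored_docs, min(n_random, len(scored_docs)))
    chosen = {doc_id for doc_id, _ in random_picks}
    # remaining candidates in the uncertainty band, most uncertain first
    uncertain = sorted(
        (d for d in scored_docs
         if d[0] not in chosen and 0.25 <= d[1] <= 0.60),
        key=lambda d: abs(d[1] - 0.5),
    )
    return random_picks + uncertain[: batch_size - len(random_picks)]
```

The design intuition is the one the story gives: documents the machine already scores near 0% or 100% teach it little, so human attention is spent where the classifier is least certain, with the random slice preserving an unbiased check on the rest.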
They had a review team of eight lawyer review specialists set up and ready to begin study of the briefs and relevancy notebook that my attorneys were working on. I reminded the vendor that I was required, both by contract with the client and by legal ethics, to personally supervise the work of their contract lawyers. I insisted on daily video conferences with the reviewers. That was not too difficult in this case since all of them were together in the same room at the same time. Their first assignment would be to review the same 2,401 random sample documents that I was reviewing, but to look for confidentiality and privilege concerns, not relevancy.
There would be no need for this vendor work at all but for the fact that the Magistrate Judge was known to require disclosure of random tests and seed sets. His local rules also gave us the right to withhold and log any irrelevant and confidential or privileged files. We planned to do that very carefully, which is why we devoted the vendor’s entire review team to it. We knew we had an ethical duty to protect our client’s confidential data. This was one reason we never volunteered for such full disclosure. But with this case, in this court and with this judge, that was how it was going to be. So we would at least make sure that all privileged and confidential documents were spotted, stamped, redacted or withheld and logged. We were going to protect the confidentiality rights of our clients, their employees and customers. The last thing we wanted to do was have to rely on our clawback agreement and order.
The vendor agreed with phony vigor to all of my supervision demands of their attorneys. They had no choice. It is illegal for them to practice law. They are a commercial corporation, not a law firm. Plus, they figured I would probably be like most of the lawyers they worked with. I would start off talking a good game, and then slack off on supervision as the project commenced and deadlines loomed. They were in for a surprise. The last thing I would do on this project is take it easy. I did not trust the Borg.
We spent the rest of the meeting assigning agenda items to begin preparation for the first 16(b) hearing on e-discovery the following week. I delegated everything to other associates and partners on my e-discovery team except for the initial review of the sample documents. I hoped I was right about the low yield assumptions in the 2,401 random documents. Otherwise my six hours of review could easily turn into sixty.
To be continued …