## Visualizing Data in a Predictive Coding Project – Part Two

November 16, 2014

This is part two of my presentation of an idea for visualization of data in a predictive coding project. Please read part one first.

As most of you already know, the ranking of all documents according to their probable relevance, or other criteria, is the purpose of predictive coding. The ranking allows accurate predictions to be made as to how the documents should be coded. In part one I shared the idea by providing a series of images of a typical document ranking process. I only included a few brief verbal descriptions. This week I will spell it out and further develop the idea. Next week I hope to end on a high note with random sampling and math.

Vertical and Horizontal Axis of the Images

The visualizations here presented all represent a collection of documents. It is supposed to be a pointillist image, with one point for each document. At the beginning of a document review project, before any predictive coding training has been applied to the collection, the documents are all unranked. They are relatively unknown. This is shown by the fuzzy round cloud of unknown data.

Once the machine training begins all documents start to be ranked. In the most simplistic visualizations shown here the ranking is limited to predicted relevance or irrelevance. Of course, the predictions could be more complex, and include highly relevant and privilege, which is what I usually do. It could also include various other issue classifications, but I usually avoid this for a variety of reasons that would take us too far astray to explain.

Once the training and ranking begin the probability grid comes into play. This grid creates both a vertical and horizontal axis. (In the future, we could add a third dimension too, but let’s start simple.) The one public comment received so far stated that the vertical axis on the images showing percentages adjacent to the words “Probable Relevant” might give people the impression that it is the probability of a document being relevant. Well, I hope so, because that is exactly what I was trying to do!

The vertical axis shows how the documents are ranked. The horizontal axis shows the number of documents, roughly, at each ranking level. Remember, each point is supposed to represent a specific, individual document. (In the future we could add family overlays, but again, let’s start simple.) A single dot in the middle would represent one document. An empty space would represent zero documents. A wide expanse of horizontal dots would represent hundreds or thousands of documents, depending on the scale.

The diagram below visualizes a common situation where ranking has just begun and the computer is uncertain as to how to classify the documents. It classifies most in the 37.5% to 67.5% range of probable relevance. It is all about fifty-fifty at this point. This is the kind of spread you would expect to see if training began with only random sampling input. The diagram indicates that the computer does not really know much yet about the data. It does not yet have any real idea as to which documents are relevant, and which are not.

The vertical axis of the visualization is the key. It is intended to show a running grid from 99.9% probable relevant to 0.01% probable relevant. Note that 0.01% probable relevant is another way of saying 99.99% probable irrelevant, but remember, I am trying to keep this simple. More complex overlays may be more to the liking of some software users. Also note that the particular numbers I show on these diagrams are arbitrary: 0.01%, 12.5%, 25%, 37.5%, 50%, 67.5%, 75%, 87.5%, 99.9%. I would prefer to see more detail here, and perhaps add a grid showing a faint horizontal line at every 10% interval. Still, the fewer lines shown here do have a nice aesthetic appeal, plus it was easier for me to create on the fly for this blog.

Again, let me repeat to be very clear. The vertical grid on these diagrams represents the probable ranking from least likely to be relevant on the bottom, to most likely on the top. The horizontal grid shows the number of documents. It is really that simple.
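For readers who think in code, here is a minimal sketch of the counting behind the two axes. It assumes each document already carries a probable relevance score in percent, and it borrows the stratum boundaries from the labels on the diagrams. The function name, the treatment of scores that fall exactly on a boundary, and the ten sample scores are all illustrative assumptions, not features of any particular review software.

```python
from bisect import bisect_right

# Upper boundaries of the strata, in percent, taken from the diagram labels.
# The bottom stratum is assumed to start at 0 and the top one to end at 100.
STRATA_EDGES = [12.5, 25.0, 37.5, 50.0, 67.5, 75.0, 87.5]


def count_by_stratum(scores):
    """Count documents per stratum; index 0 is the bottom (least likely relevant)."""
    counts = [0] * (len(STRATA_EDGES) + 1)
    for score in scores:
        # bisect_right places a score equal to a boundary in the stratum above it,
        # so a score of exactly 50.0 is counted as probable relevant.
        counts[bisect_right(STRATA_EDGES, score)] += 1
    return counts


# Hypothetical scores (percent probable relevance) for ten documents.
example_scores = [0.01, 3.0, 8.0, 11.0, 30.0, 50.0, 55.0, 90.0, 95.0, 99.9]
print(count_by_stratum(example_scores))  # [4, 0, 1, 0, 2, 0, 0, 3]
```

Each count is the width of one horizontal band in the image, and the position in the list is its height on the vertical axis.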

Why Data Visualization Is Important

This kind of display of documents according to a vertical grid of probable relevance is very helpful because it allows you to see exactly how your documents are ranked at any one point in time. Just as important, it helps you to see how the alignment changes over time. This empowers you to see how your machine training impacts the distribution.

This kind of direct, immediate feedback greatly facilitates human-computer interaction (what I call in my approximately 50 articles on predictive coding the hybrid approach). It makes it easier for the natural human intelligence to connect with the artificial intelligence. It makes it easier for the human SMEs involved to train the computer. The humans, typically attorneys or their surrogates, are the ones with the expertise on the legal issues in the case. This visualization allows them to see immediately what impact particular training documents have upon the ranking of the whole collection. This helps them to select effective training documents. It helps them to attain the goal of separation of relevant from irrelevant documents. Ideally they would be clustered on both the bottom and top of the vertical axis.

For this process to work it is important for the feedback to be grounded in actual document review, and not be a mere intellectual exercise. Samples of documents in the various ranking strata must be inspected to verify, or not, whether the ranking is accurate. That can vary from stratum to stratum. Moreover, as everyone quickly finds out, each project is different, although certain patterns do tend to emerge. The diagrams used as an example in this blog represent one such typical pattern, although greatly compressed in time. In reality the changes shown here from one diagram to another would be more gradual and have a few unexpected bumps and bulges.

Visualizations like this will speed up the ranking and the review process. Ultimately the graphics will all be fully interactive. By clicking on any point in the graphic you will be taken to the particular document or documents that it represents. Click and drag, and you are taken to the whole set of documents selected. For instance, you may want to see all documents between 45% and 55%, so you would select that range in the graphic. Or you may want to see all documents in the top 5% probable relevance ranking, so you select that top edge of the graphic. These documents will instantly be shown in the review database. Most good software already has document visualizations with similar linking capacities. So we are not reinventing the wheel here, just applying these existing software capacities to new patterns, namely to document rankings.
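Behind a click-and-drag selection like this sits nothing more than a range filter on the ranking. The sketch below assumes the scores are available as a simple mapping from document ID to percent probable relevance. The IDs, the scores, and the function name are hypothetical, and reading "top 5%" as scores of 95% or higher is my assumption about what the selection would mean.

```python
def select_range(scores_by_doc, low, high):
    """Return the IDs of documents scored between low and high, inclusive."""
    return sorted(
        doc_id for doc_id, score in scores_by_doc.items() if low <= score <= high
    )


# Hypothetical document IDs and scores (percent probable relevance).
ranked = {
    "DOC-001": 97.2,
    "DOC-002": 51.0,
    "DOC-003": 46.5,
    "DOC-004": 12.0,
    "DOC-005": 95.0,
}

uncertain = select_range(ranked, 45.0, 55.0)  # the 45% to 55% band
top_edge = select_range(ranked, 95.0, 100.0)  # one reading of the "top 5%"
print(uncertain)  # ['DOC-002', 'DOC-003']
print(top_edge)   # ['DOC-001', 'DOC-005']
```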

These graphic features will allow you to easily search the ranking locations. This will in turn allow you to verify, or correct, the machine’s learning. Where you find that the documents clicked have a correct prediction of relevance, you verify by coding as relevant, or highly relevant. Where the documents clicked have an incorrect prediction, you correct by coding the document properly. That is how the computer learns. You tell it yes when it gets it right, and no when it gets it wrong.

At the beginning of a project many predictions of relevance and irrelevance will be incorrect. These errors will diminish as the training progresses, as the correct predictions are verified, and erroneous predictions are corrected. Fewer mistakes will be made as the machine starts to pick up the human intelligence. To me it seems like a mind to computer transference. More of the predictions will be verified, and the document distributions will start to gather on both ends of the vertical relevance axis. Since the volume of documents is represented by the horizontal axis, more documents will start to bunch together at both the top and bottom of the vertical axis. Since document collections in legal search usually contain many more irrelevant documents than relevant, you will typically see most documents on the bottom.

Visualizations of an Exemplar Predictive Coding Project

In the sample considered here we see unnaturally rapid training. It would normally take many more rounds of machine training than are shown in these four diagrams. In fact, with a continuous active training process, there could be hundreds of rounds per day. In that case the visualization would look more like an animation than a series of static images. But again, I have limited the process here for simplicity’s sake.

As explained previously, the first thing that happens to the fuzzy round cloud of unknown data before any training begins is that the data is processed, deduplicated, deNisted, and non-text and other documents unsuitable for analytics are removed. In addition, other documents that are necessarily irrelevant to this particular project are bulk-culled out. Examples include ESI such as music files, some types of photos, and many email domains, for instance emails from publications such as the NY Times. By good fortune in this example exactly one million documents remain for predictive coding.

We begin with some multimodal judgmental sampling, and with a random sample of 1,534 documents. (They are the yellow dots.) Assuming a 95% confidence level, do you know what confidence interval this creates? I asked this question before and repeat it again, as the answer will not come until the final math installment next week.

Next we assume that an SME, and/or his or her surrogates, reviewed the 1,534 sample and found that 384 were relevant and 1,150 were irrelevant. Do you know what prevalence rate this creates? Do you know the projected range of relevant documents within the confidence interval limits of this sample? That is the most important question of all.
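For anyone who would rather check their answers by running code than by hand, here is one hedged way to set the problem up. It is a sketch under stated assumptions: a simple random sample, the textbook normal approximation, and a worst-case 50% proportion for the interval. Other methods, such as an exact binomial interval, give somewhat different limits. The function names are mine, and I have left the printed results out on purpose so the quiz stays a quiz.

```python
import math

Z_95 = 1.96  # standard normal critical value for a 95% confidence level


def margin_of_error(sample_size, proportion=0.5, z=Z_95):
    """Normal-approximation margin of error for a simple random sample."""
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


def projected_range(relevant_in_sample, sample_size, collection_size):
    """Return (prevalence, low_count, high_count) using the worst-case margin."""
    prevalence = relevant_in_sample / sample_size
    margin = margin_of_error(sample_size)  # assumes a 50% proportion (widest case)
    low = max(0.0, prevalence - margin) * collection_size
    high = min(1.0, prevalence + margin) * collection_size
    return prevalence, round(low), round(high)


if __name__ == "__main__":
    # Figures from the example: 384 relevant out of 1,534 sampled, one million total.
    prevalence, low, high = projected_range(384, 1534, 1_000_000)
    print(f"margin of error: +/- {margin_of_error(1534):.2%}")
    print(f"prevalence: {prevalence:.2%}")
    print(f"projected relevant documents: {low:,} to {high:,}")
```

The point of the second function is that the single prevalence figure matters less than the low and high counts it returns.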

Next we do the first round of machine training proper. The first round of training is sometimes called the seed set. Now the document ranking according to probable relevance and irrelevance begins. Again for simplicity’s sake, we assume that the analytics is directed towards relevance alone. In fact, most projects would also include high-relevance and privilege.

In this project the data ball changed to the following distribution. Note the lighter colors represent less density of documents. Red documents represent documents coded or predicted as relevant, and blue as irrelevant. All predictive coding projects are different and the distributions shown here are just one among near countless possibilities. Here there are already more documents trained on irrelevance, than relevance. This is in spite of the fact that the active search was to find relevant documents, not irrelevant documents. This is typical in most review projects where you have many more irrelevant than relevant documents overall, and where it is easier to spot and find irrelevant than relevant.

Next we see the data after the second round of training. The division of the collection of documents into relevant and irrelevant is beginning to form. The largest collection of documents is the blue points at the bottom. They are the documents that the computer predicts are irrelevant based on the training to date. There is also a large collection of points shown in red at the top. They are the ones where the computer now thinks there is a high probability of relevance. Still, the computer is uncertain about the vast majority of the documents: the red in the third stratum from the top, the blue in the third stratum from the bottom, and the many in the grey, the 37.5% to 67.5% probable relevance range. Again we see an overall bottom heavy distribution. This is a typical pattern because it is usually easier to train on irrelevance than relevance.

As noted before, the training could be continuous. Many software programs offer that feature. But I want to keep the visualizations here simple, and not make an animation, and so I do not assume here a literally continuous active learning. Personally, although I do like to keep the training continuous throughout the review, I like the actual computer training to come in discrete stages that I control. That gives me a better understanding of the impact of my machine training. The SME human trains the machine, and, in an ideal situation, the machine also trains the SME. That is the kind of feedback that these visualizations enhance.

Next we see the data after the third round of training. Again, in reality it would typically take more rounds of training than three to reach this relatively mature state, but I am trying to keep this example simple. If a project did progress this fast, it would probably be because a large number of documents were used in the prior rounds. The set of documents about which the computer is now uncertain (the grey area and the middle two brackets) is now much smaller.

The computer now has a high probability ranking for most of the probable relevant and probable irrelevant documents. The largest number of documents are at the blue bottom, where the computer predicts they have a near zero chance of being classified relevant. Again, most of the probable predictions, those in the top and bottom three brackets, are located in the bottom three brackets. Those are the documents predicted to have less than a 37.5% chance of being relevant. Again, this kind of distribution is typical, but there can be many variances from project to project. We here see a top loading where most of the probable relevant documents are in the top 12.5% ranking. In other words, they have an 87.5% probable relevant ranking, or higher.

Next we see the data after the fourth round of training. It is an excellent distribution at this point. There are relatively few documents in the middle. This means there are relatively few documents about which the computer is uncertain as to its probable classification. This pattern is one factor among several to consider in deciding whether further training and document review are required to complete your production.

Another important metric to consider is the total number of documents found to be probable relevant, and comparison with the random sample prediction. Here is where math comes in, and understanding of what random sampling can and cannot tell you about the success of a project. You consider the spot projection of total relevance based on your initial prevalence calculation, but much more important, you consider the actual range of documents under the confidence interval. That is what really counts when dealing with prevalence projections and random sampling. That is where the plus or minus confidence interval comes into play, as I will explain in detail in the third and final installment to this blog.

In the meantime, here is the document count of the distribution roughly pictured in the final diagram above, which to me looks like an upside down, fragile champagne glass. We see that exactly 250,000 documents have a 50% or higher probable relevance ranking, and 750,000 documents have a 49.9% or less probable relevance ranking. Of the probable relevant documents, there are 15,000 documents that fall in the 50% to 67.5% range. There are another 10,000 documents that fall in the 37.5% to 49.9% probable relevance range. Again, this is also fairly common as we often see less on the barely irrelevant side than we do on the barely relevant side. As a general rule I review with humans all documents that are 50% or higher probable relevance, and do not review the rest. I do however sample and test the rest, the documents with less than a 50% probable relevance ranking. Also, in some projects I review far less than the top 50%. That all depends on proportionality constraints, and document ranking distribution, the kind of distributions that these visualizations will show.

In addition to this metrics analysis, another important factor to consider in whether our search and review efforts are now complete, is how much change in ranking there has been from one training round to the next. Sometimes there may be no change at all. Sometimes there may only be very slight changes. If the changes from the last round are large, that is an indication that more training should still be tried, even if the distribution already looks optimal, as we see here.
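How much change is "large" is a judgment call, but the change itself is easy to measure. The sketch below compares the scores from two rounds and reports two numbers: the average movement in percentage points, and how many documents crossed the 50% line. Both measures, the function name, and the sample data are my own illustrative choices, not a standard that any software is known to use.

```python
def ranking_change(previous, current, cutoff=50.0):
    """Compare two rounds of scores keyed by document ID.

    Returns (mean absolute change in points, count of documents that crossed the cutoff).
    """
    shared = previous.keys() & current.keys()
    if not shared:
        return 0.0, 0
    total_shift = sum(abs(current[doc] - previous[doc]) for doc in shared)
    # A document "crosses" when it is on different sides of the cutoff in the two rounds.
    crossed = sum(
        (previous[doc] >= cutoff) != (current[doc] >= cutoff) for doc in shared
    )
    return total_shift / len(shared), crossed


# Hypothetical scores (percent probable relevance) from two consecutive rounds.
round_three = {"A": 48.0, "B": 91.0, "C": 5.0, "D": 60.0}
round_four = {"A": 72.0, "B": 95.0, "C": 1.0, "D": 40.0}

mean_shift, crossed = ranking_change(round_three, round_four)
print(mean_shift, crossed)  # 13.0 2
```

Two of four documents switching sides, as in this toy data, would be a clear sign that the training has not yet settled.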

Another even more important quality control factor is how correct the computer has been in the last few rounds of its predictions. Ideally, you want to see the rate of error decreasing to a point where you see no errors in your judgmental samples. You want your testing of the computer’s prediction to show that it has attained a high degree of precision. That means there are few documents predicted relevant that actual review by human SMEs shows are in fact irrelevant. This kind of error is known as a False Positive. Much more important to quality evaluation is the discovery of documents predicted irrelevant that are actually relevant. This kind of error is known as a False Negative. The False Negatives are your real concern in most projects because legal search is usually focused on recall, not precision, at least within reason.
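The relationship between these two kinds of error and the two standard measures can be stated in a few lines. This is the ordinary textbook definition of precision and recall. The counts in the example are hypothetical and are not drawn from the project described in this post.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from counts of reviewed documents."""
    predicted_relevant = true_positives + false_positives
    actually_relevant = true_positives + false_negatives
    # Precision: of the documents predicted relevant, what share really are?
    precision = true_positives / predicted_relevant if predicted_relevant else 0.0
    # Recall: of the documents that really are relevant, what share were found?
    recall = true_positives / actually_relevant if actually_relevant else 0.0
    return precision, recall


# Hypothetical counts: 90 correct relevant predictions, 10 False Positives,
# and 30 False Negatives.
precision, recall = precision_recall(90, 10, 30)
print(f"precision {precision:.0%}, recall {recall:.0%}")  # precision 90%, recall 75%
```

Note that the False Negatives appear only in the recall calculation, which is why they are the ones to worry about when recall is the goal.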

The final distinction to note in quality control is one that might seem subtle, but really is not. You must also factor in relevance weight. You never want a False Negative to be a highly relevant document. If that happens to me, I always commence at least one more round of training. Even missing a document that is not highly relevant, not hot, but is a strong relevant document, and one of a type not seen before, is typically a cause for further training. This is, however, not an automatic rule as with the discovery of a hot document. It depends on a variety of factors having to do with relevance analysis of the particular case and document collection.

In our example we are going to assume that all of the quality control indicators are positive, and a decision has been made to stop training and move on to a final random sample test.

A second random sample comes next. That test and visualization will be provided next week, along with the promised math and sampling analysis.

Math Quiz

In part one, and again here, I asked some basic math questions on random sampling, prevalence, and recall. So far no one has attempted to answer the questions posed. Apparently, most readers here do not want to be tested. I do not blame them. This is also what I find in my online training program, e-DiscoveryTeamTraining.com, where only a small percentage of the students who take the program elect to be tested. That is fine with me as it means one less paper to grade, and most everyone passes anyway. I do not encourage testing. You know if you get it or not. Testing is not really necessary.

The same applies to answering math questions in a public blog. I understand the hesitancy. Still, I hope many privately tried, or will try, to solve the questions and come up with the correct answers. In part three of this blog I will provide the answers, and you will know for sure if you got it right. There is still plenty of time to try to figure it out on your own. The truly bold can post it online in the comments below. Of course, this is all pretty basic stuff to true experts of this craft. So, to my fellow experts out there, you have yet another week to take some time and strut your stuff by sharing the obvious answers. Surely I am not the only one in the e-discovery world bold enough to put their reputation on the line by sharing their opinions and analysis in public for all to see (and criticize). Come on. I do it every week.

Math and sampling are important tools for quality control, but as Professor Gordon Cormack, a true wizard in the area of search, math, and sampling likes to point out, sampling alone has many inherent limitations. Gordon insists, and I agree, that sampling should only be part of a total quality control program. You should never just rely on random sampling alone, especially in low prevalence collections. Still, when sampling, prevalence, and recall are included as part of an overall QC effort, the net effect is very reassuring. Unless I know that I have an expert like Gordon on the other side, and so far that has never happened, I want to see the math. I want to know about all of the quality control and quality assurance steps taken to try to find the information requested. If you are going to protect your client, you need to learn this too, or have someone at hand who already knows it.

This kind of math, sampling, and other process disclosures should convince even the most skeptical adversary or judge. That is why it is important for all attorneys involved with legal research to have a clear mathematical understanding of the basics. Visualizations alone are inadequate, but, for me at least, visualizations do help a lot. All kinds of data visualizations, not just the ones here presented, provide important tools to help lawyers to understand how a search project is progressing.

Challenge to Software Vendors

The simplicity of the design of the idea presented here is a key part of the power and strength of the visualization. It should not be too difficult to write code to implement this visualization. We need this. It will help users to better understand the process. It will not cost too much to implement, and what it does cost should be recouped soon in higher sales. Come on vendors, show me you are listening. Show me you get it. If you have a software demo that includes this feature, then I want to see it. Otherwise not.

All good predictive coding software already ranks the probable relevance of documents, so we are not talking about an enormous coding project. This feature would simply add a visual display to calculations already being made. I could hand make these calculations myself using an Excel spreadsheet, but that is time consuming and laborious. This kind of visualization lends itself to computer generation.
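To show how little is involved, here is a rough plain-text sketch of the rendering step. It takes one document count per ranking stratum and prints a centered row of dots for each, with the most likely relevant stratum on top. The two totals (750,000 below 50% and 250,000 at or above it) and the two middle counts (10,000 and 15,000) come from the example in this post. Every other count is invented so that the totals add up, and the function name and row width are arbitrary.

```python
def render_strata(counts, labels, width=40):
    """Return one text row per stratum, top stratum first.

    counts and labels are ordered from the bottom stratum up.
    """
    largest = max(counts) if counts and max(counts) > 0 else 1
    lines = []
    for label, count in reversed(list(zip(labels, counts))):
        # Scale to the widest row; any non-empty stratum gets at least one dot.
        dots = 0 if count == 0 else max(1, round(width * count / largest))
        lines.append(f"{label:>6} |{('.' * dots).center(width)}|")
    return lines


# Lower bound of each stratum, bottom to top, using the labels from the diagrams.
labels = ["0%", "12.5%", "25%", "37.5%", "50%", "67.5%", "75%", "87.5%"]
# Partly hypothetical counts (see the paragraph above for which are from the post).
counts = [700_000, 30_000, 10_000, 10_000, 15_000, 20_000, 35_000, 180_000]

for line in render_strata(counts, labels):
    print(line)
```

A real implementation would draw one point per document and link each point to the review database, but even this crude version produces the wide base and narrow stem of the upside down champagne glass.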

I have many other ideas for predictive coding features, including other visualizations, that are much more complex and challenging to implement. This simple grid explained here is an easy one to implement, and will show me, and the rest of our e-discovery community, who the real leaders are in software development.

Conclusion

The primary goal of the e-Discovery Team blog is educational, to help lawyers and other e-discovery professionals. In addition, I am trying to influence what services and products are provided in e-discovery, both legal and technical. In this blog I am offering an idea to improve the visualizations that most predictive software already provide. I hope that all vendors will include this feature in future releases of their software. I have a host of additional ideas to improve legal search and review software, especially the kind that employs active machine learning. They include other, much more elaborate visualization schemes, some of which have been alluded to here.

Someday I may have time to consult on all of the other, more complex ideas, but, in the meantime, I offer this basic idea for any vendor to try out. Until vendors start to implement even this basic idea, anyone can at least use their imagination, as I now do, to follow along. These kinds of visualizations can help you to understand the impact of document ranking on your predictive coding review projects. All it takes is some idea as to the number of documents in various probable relevance ranking strata. Try it on your next predictive coding project, even if it is just rough images from your own imagination (or Excel spreadsheet). I am sure you will see for yourself how helpful this can be to monitor and understand the progress of your work.

## Is the IRS’s Inability to Find Emails the Result of Unethical Behavior? New Opinion by U.S. Tax Court Provides Some Clues – Part 2

October 5, 2014

This is Part Two of the essay where I go into the specifics of the holding in Dynamo. Please read Part One first: Is the IRS’s Inability to Find Emails the Result of Unethical Behavior? New Opinion by U.S. Tax Court Provides Some Clues – Part One. There I pointed out that the IRS attitude towards email discovery, particularly predictive coding, shows that they belong to the unethical category I call The Clueless. Yes, the IRS is clueless, not in an affable Pink Panther Inspector Clouseau way, but in the arrogant, super-know-it-all way of egomaniac types. It is wonderfully personified in Ms. Lerner’s face during her Congressional non-testimony. Like Congress did to Lerner, the Tax Court in Dynamo properly cut down the IRS attorneys and rejected all of the IRS’s anti-predictive coding nonsense arguments.

Dynamo Holdings Opinion

Dynamo Holdings, Ltd. vs. Commissioner, 143 T.C. No. 9 (Sept. 17, 2014) is a very well written opinion of the United States Tax Court by Judge Ronald L. Buch. I highly recommend that you study and cite this opinion. It is so good that I have decided to devote the rest of this blog to quotation of the portions of it that pertain to predictive coding.

I cannot refrain from providing some comments too, of course, otherwise what would be the point of doing more than provide a link? But for the sake of clarity, and purity, although I will intermix my [side bar comments] along with the quotes, I will do so with blue font, and italics, so you will not mistake the court’s words with my own. Yes, I know, that is not how you do things in law review articles, that this is way too creative. So what? It will be a lot more interesting for you to read it that way, and quicker too. So damn the old rules of legal writing, here goes.

[P]etitioners request that the Court let them use predictive coding, a technique prevalent in the technological industry but not yet formally sanctioned by this Court, to efficiently and economically identify the nonprivileged information responsive to respondent’s discovery request. [The Petitioners are the defendants, and Respondents are the plaintiff, IRS. The IRS sued to collect tax on certain transfers between business entities alleging they were disguised gifts to the owners of Dynamo. Seems like a pretty clear cut issue to me, and I cannot see why it was necessary to look at millions of emails to find out what happened. The opinion does not explain that. The merits of the case are not addressed and a detailed proportionality analysis is not provided.]

Respondent [IRS] opposes petitioners’ request to use predictive coding because, he states, predictive coding is an “unproven technology”. Respondent adds that petitioners need not devote their claimed time or expense to this matter because they can simply give him access to all data on the two tapes and preserve the right (through a “clawback agreement”) to later claim that some or all of the data is privileged information not subject to discovery.2 [This is the disingenuous part I referred to previously.]

FN 2 – We understand respondent’s use of the term “clawback agreement” to mean that the disclosure of any privileged information on the tapes would not be a waiver of any privilege that would otherwise apply to that information.

The Court held an evidentiary hearing on respondent’s motion. [It looks like the Tax Court followed Judge David Waxse on this often debated issue as to whether an evidentiary hearing should be provided, but he only went part way. As you will see, a full scale Daubert type hearing was not provided. Instead, Judge Buch treated their testimony as informal input. Most judges agree that this is appropriate, even if they do not agree with Judge Waxse’s position that Daubert type rulings are appropriate in a mere discovery dispute. Most judges I have talked to think that Evidence Rule 702 does not apply, since there is no evidence or trial, and no presentation to the jury to protect; there is just a dispute as to discovery search methods.]

[W]e hold that petitioners must respond to respondent’s discovery request but that they may use predictive coding in doing so. [The defendants had argued they should not have to search two backup tapes for email at all, and the use of predictive coding was a fall back argument. The decision did not provide any detailed explanation as to necessity, and I get the impression that it was not really pushed, that the main focus of the briefs was on predictive coding.]

Petitioners ask the Court to let them use predictive coding to efficiently and economically help identify the nonprivileged information that is responsive to respondent’s discovery request. More specifically, petitioners want to implement the following procedure to respond to the request: [I have omitted the first four reasons as not terribly interesting.] … 5. Through the implementation of predictive coding, review the remaining data using search criteria that the parties agree upon to ascertain, on the one hand, information that is relevant to the matter, and on the other hand, potentially relevant information that should be withheld as privileged or confidential information.

[T]he Court is not normally in the business of dictating to parties the process that they should use when responding to discovery. [This is a very important point. See Sedona Principle Six. The defendants did not really need the plaintiff’s approval to use predictive coding. Judge Buch is suggesting that this whole permission motion is an unnecessary waste of time, but he will indulge them anyway and address it. I for one am glad that he did.] If our focus were on paper discovery, we would not (for example) be dictating to a party the manner in which it should review documents for responsiveness or privilege, such as whether that review should be done by a paralegal, a junior attorney, or a senior attorney. Yet that is, in essence, what the parties are asking the Court to consider–whether document review should be done by humans or with the assistance of computers. [These are all very good points.] Respondent fears an incomplete response to his discovery. [Parties in litigation always fear that. The U.S. employs a “trust based” system of discovery that relies on the honesty of the parties, and especially relies on the honesty and cooperativeness of the attorneys who conduct the discovery. There are alternatives, like having judges control discovery. Most of the world has such judge controlled discovery, but lawyers in the U.S. do not want that, and it is doubtful that taxpayers would want to fund an alternative court based approach.] If respondent believes that the ultimate discovery response is incomplete and can support that belief, he can file another motion to compel at that time. Nonetheless, because we have not previously addressed the issue of computer-assisted review tools, we will address it here.

Each party called a witness to testify at the evidentiary hearing as an expert. Petitioners’ witness was James R. Scarazzo. Respondent’s witness was Michael L. Wudke. [I added these links. Scarazzo is with the well known vendor, FTI, in Washington D.C., and Wudke is with another vendor in N.Y., Transperfect Legal Solutions. He used to be with Deloitte.] The Court recognized the witnesses as experts on the subject matter at hand. We may accept or reject the findings and conclusions of the experts, according to our own judgment.

Predictive coding is an expedited and efficient form of computer-assisted review that allows parties in litigation to avoid the time and costs associated with the traditional, manual review of large volumes of documents. Through the coding of a relatively small sample of documents, computers can predict the relevance of documents to a discovery request and then identify which documents are and are not responsive. The parties (typically through their counsel or experts) select a sample of documents from the universe of those documents to be searched by using search criteria that may, for example, consist of keywords, dates, custodians, and document types, and the selected documents become the primary data used to cause the predictive coding software to recognize patterns of relevance in the universe of documents under review. The software distinguishes what is relevant, and each iteration produces a smaller relevant subset and a larger set of irrelevant documents that can be used to verify the integrity of the results. [That is not technically correct, at least not in most cases. The relevance subset does not get smaller and smaller. The probability predictions do, however, get more accurate. True predictive coding as used by most vendors today is active machine learning. It ranks the probable relevance of all documents. See, e.g., AI-EnhancedReview.com.] Through the use of predictive coding, a party responding to discovery is left with a smaller set of documents to review for privileged information, resulting in a savings both in time and in expense. [Now the judge is back on track and this is an essential truth.] The party responding to the discovery request also is able to give the other party a log detailing the records that were withheld and the reasons they were withheld. [Judge Buch is referring to the privilege log, or in some cases, also a confidentiality log.]

Magistrate Judge Andrew Peck published a leading, oft-cited article on predictive coding which is helpful to our understanding of that method. [Of course Judge Peck’s photograph is not in the opinion.] See Andrew Peck, “Search, Forward: Will Manual Document Review and Keyboard Searches be Replaced by Computer-Assisted Coding?”, L. Tech. News (Oct. 2011). The article generally discusses the mechanics of predictive coding and the shortcomings of manual review and of keyword searches. The article explains that predictive coding is a form of “computer-assisted coding”, which in turn means “tools * * * that use sophisticated algorithms to enable the computer to determine relevance, based on interaction with (i.e., training by) a human reviewer.” Id. at 29. The article explains that:

Unlike manual review, where the review is done by the most junior staff, computer-assisted coding involves a senior partner (or team) who review and code a “seed set” of documents. [Judge Peck wrote this back in 2011. I believe his understanding of the “senior partner” level skill needed for training has since evolved. I can elaborate, but it would take us too far astray. Let’s just say what is needed is a single expert, or at least a very small team of real experts, on the relevance facts at issue in the case. See, e.g., Less Is More: When it comes to predictive coding training, the “fewer reviewers the better” – Part One, Part Two, Part Three.] The computer identifies properties of those documents that it uses to code other documents. As the senior reviewer continues to code more sample documents, the computer predicts the reviewer’s coding. (Or, the computer codes some documents and asks the senior reviewer for feedback.)

When the system’s predictions and the reviewer’s coding sufficiently coincide, the system has learned enough to make confident predictions for the remaining documents. Typically, the senior lawyer (or team) needs to review only a few thousand documents to train the computer. [The number depends, of course. For some projects, tens of thousands of documents may be needed over multiple iterations to adequately train the computer. Some projects are much harder than others, despite the skills of the search designers involved. Yes, it takes a great deal of skill and experience to properly design a large predictive coding search and review project. It also takes good predictive coding software that ranks all document probabilities.]

Some systems produce a simple yes/no as to relevance, while others give a relevance score (say, on a 0 to 100 basis) that counsel can use to prioritize review. For example, a score above 50 may produce 97% of the relevant documents, but constitutes only 20% of the entire document set. [All good software today ranks all documents, typically 0 to 100% probability, rather than give a simplistic yes/no ranking.]

Counsel may decide, after sampling and quality control tests, that documents with a score of below 15 are so highly likely to be irrelevant that no further human review is necessary. Counsel can also decide the cost-benefit of manual review of the documents with scores of 15-50. [Typically the cutoff point is way above 15% probability. I have no idea where that number came from. A more logical and frequent cutoff is 50%, since documents scoring below it are probably not relevant.]

Id.
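
The score bands described in the quoted passage can be sketched in a few lines of Python. This is only an illustration, assuming scores on a 0 to 100 scale and the cutoffs of 50 and 15 mentioned in the article. The function name `triage_by_score`, the bucket labels, and the sample document IDs are hypothetical and do not come from any particular review platform.

```python
from typing import Dict, List


def triage_by_score(
    scores: Dict[str, float],
    review_cutoff: float = 50.0,   # "a score above 50" in the article
    discard_cutoff: float = 15.0,  # "a score of below 15" in the article
) -> Dict[str, List[str]]:
    """Sort document IDs into three buckets by relevance score (0 to 100)."""
    buckets: Dict[str, List[str]] = {
        "review": [],        # score above review_cutoff
        "cost_benefit": [],  # score from discard_cutoff through review_cutoff
        "no_review": [],     # score below discard_cutoff
    }
    for doc_id, score in scores.items():
        if score > review_cutoff:
            buckets["review"].append(doc_id)
        elif score >= discard_cutoff:
            buckets["cost_benefit"].append(doc_id)
        else:
            buckets["no_review"].append(doc_id)
    return buckets


# Hypothetical scores for five made-up documents.
example_scores = {
    "DOC-001": 92.0,
    "DOC-002": 51.0,
    "DOC-003": 50.0,
    "DOC-004": 15.0,
    "DOC-005": 3.5,
}
buckets = triage_by_score(example_scores)
print(buckets["review"])        # ['DOC-001', 'DOC-002']
print(buckets["cost_benefit"])  # ['DOC-003', 'DOC-004']
print(buckets["no_review"])     # ['DOC-005']
```

In a real project the cutoffs would be chosen only after the sampling and quality control tests the article mentions, not fixed in advance.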

The substance of the article was eventually adopted in an opinion that states: “This judicial opinion now recognizes that computer-assisted review is an acceptable way to search for relevant ESI in appropriate cases.” Moore v. Publicis Groupe, 287 F.R.D. 182, 183 (S.D.N.Y. 2012), adopted sub nom. Moore v. Publicis Groupe SA, No. 11 Civ. 1279 (ALC)(AJP), 2012 WL 1446534 (S.D.N.Y. Apr. 26, 2012).

Respondent asserts that predictive coding should not be used in these cases because it is an “unproven technology”. We disagree. [The alternative methods, keyword search and linear human review, are the “unproven technologies,” not predictive coding. Indeed, the science proves that keyword and linear review are unreliable. See, e.g., LEGAL SEARCH SCIENCE. The new gold standard is active machine learning, aka predictive coding, not hundreds of low paid contract lawyers sitting in cubicles all day.] Although predictive coding is a relatively new technique, and a technique that has yet to be sanctioned (let alone mentioned) by this Court in a published Opinion, the understanding of e-discovery and electronic media has advanced significantly in the last few years, thus making predictive coding more acceptable in the technology industry than it may have previously been. In fact, we understand that the technology industry now considers predictive coding to be widely accepted for limiting e-discovery to relevant documents and effecting discovery of ESI without an undue burden.10 [Excellent point. Plus it is not really all that “new” by today’s standards. It has been around in academic circles since the 1990s.]

FN 10 – Predictive coding is so commonplace in the home and at work in that most (if not all) individuals with an email program use predictive coding to filter out spam email. See Moore v. Publicis Groupe, 287 F.R.D. 182, n.2 (S.D.N.Y. 2012), adopted sub nom. Moore v. Publicis Groupe SA, No. 11 Civ. 1279 (ALC)(AJP), 2012 WL 1446534 (S.D.N.Y. Apr. 26, 2012).

See Progressive Cas. Ins. Co. v. Delaney, No. 2:11-cv-00678-LRH-PAL, 2014 WL 3563467, at *8 (D. Nev. July 18, 2014) (stating with citations of articles that predictive coding has proved to be an accurate way to comply with a discovery request for ESI and that studies show it is more accurate than human review or keyword searches); F.D.I.C. v. Bowden, No. CV413-245, 2014 WL 2548137, at *13 (S.D. Ga. June 6, 2014) (directing that the parties consider the use of predictive coding). See generally Nicholas Barry, “Man Versus Machine Review:  The Showdown between Hordes of Discovery Lawyers and a Computer-Utilizing Predictive-Coding Technology”, 15 Vand. J. Ent. & Tech. L. 343 (2013); Lisa C. Wood, “Predictive Coding Has Arrived”, 28 ABA Antitrust J. 93 (2013). The use of predictive coding also is not unprecedented in Federal litigation. See, e.g., Hinterberger v. Catholic Health Sys., Inc., No. 08-CV-3805(F), 2013 WL 2250603 (W.D.N.Y. May 21, 2013); In Re Actos, No. 6:11-md-2299, 2012 WL 7861249 (W.D. La. July 27, 2012); Moore, 287 F.R.D. 182. Where, as here, petitioners reasonably request to use predictive coding to conserve time and expense, and represent to the Court that they will retain electronic discovery experts to meet with respondent’s counsel or his experts to conduct a search acceptable to respondent, we see no reason petitioners should not be allowed to use predictive coding to respond to respondent’s discovery request. Cf. Progressive Cas. Ins. Co., 2014 WL 3563467, at *10-*12 (declining to allow the use of predictive coding where the record lacked the necessary transparency and cooperation among counsel in the review and production of ESI responsive to the discovery request).

Mr. Scarazzo’s expert testimony supports our opinion. He testified that11 discovery of ESI essentially involves a two-step process.

FN 11 – Mr. Wudke did not persuasively say anything to erode or otherwise undercut Mr. Scarazzo’s testimony. [This is to the credit of Mr. Wudke, an honest expert.]

First, the universe of data is narrowed to data that is potentially responsive to a discovery request. Second, the potentially responsive data is narrowed down to what is in fact responsive. He also testified that he was familiar with both predictive coding and keyword searching, two of the techniques commonly employed in the first step of the two-step discovery process, and he compared those techniques by stating:

[K]ey word searching is, as the name implies, is a list of terms or terminologies that are used that are run against documents in a method of determining or identifying those documents to be reviewed. What predictive coding does is it takes the type of documents, the layout, maybe the whispets of the documents, the format of the documents, and it uses a computer model to predict which documents out of the whole set might contain relevant information to be reviewed.

So one of the things that it does is, by using technology, it eliminates or minimizes some of the human error that might be associated with it. [Note proper use of the word “some,” it eliminates some of the human error. It cannot be eliminated entirely.] Sometimes there’s inefficiencies with key word searching in that it may include or exclude documents, whereas training the model to go back and predict this, we can look at it and use statistics and other sampling information to pull back the information and feel more confident that the information that’s being reviewed is the universe of potentially responsive data.

He concluded that the trend was in favor of predictive coding because it eliminates human error and expedites review. [The modifier “some” to “eliminates human error” is not used here, and thus is a slight overstatement.]

In addition, Mr. Scarazzo opined credibly and without contradiction that petitioners’ approach to responding to respondent’s discovery request is the most reasonable way for petitioners to comply with that request. Petitioners asked Mr. Scarazzo to analyze and to compare the parties’ dueling approaches in the setting of the data to be restored from Dynamo’s backup tapes and to opine on which of the approaches is the most reasonable way for petitioners to comply with respondent’s request. Mr. Scarazzo assumed as to petitioners’ approach that the restored data would be searched using specific criteria, that the resulting information would be reviewed for privilege, and that petitioners would produce the nonprivileged information to respondent. He assumed as to respondent’s approach that the restored data would be searched for privileged information without using specific search criteria, that the resulting privileged information would be removed, and that petitioners would then produce the remaining data to respondent. As to both approaches, he examined certain details of Dynamo’s backup tapes, interviewed the person most knowledgeable on Dynamo’s backup process and the contents of its backup tapes (Dynamo’s director of information technology), and performed certain cost calculations.

Mr. Scarazzo concluded that petitioners’ approach would reduce the universe of information on the tapes using criteria set by the parties to minimize review time and expense and ultimately result in a focused set of information germane to the matter. He estimated that 200,000 to 400,000 documents would be subject to review under petitioners’ approach at a cost of \$80,000 to \$85,000, while 3.5 million to 7 million documents would be subject to review under respondent’s approach at a cost of \$500,000 to \$550,000. [This is a huge reduction, and shows the importance of predictive coding. It is a reduction of between 3.3 million and 6.6 million documents. That seems credible to me, but the actual cost saving quoted here seems off, or at least, seems incomplete. For instance, if you assume 300,000 documents, the mid-point of the estimated document count using predictive coding, and a projected cost of \$85,000, that is only \$0.28 per document. That is a valid number for the predictive coding culling process, but not for the actual review of the documents for confidentiality and privilege, and to confirm the privilege predictions.]
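
The arithmetic behind these estimates is easy to check. The short Python sketch below uses only the figures from Mr. Scarazzo’s testimony as quoted above. The variable and function names are illustrative labels, and pairing the low estimates with each other and the high estimates with each other is an assumption about how the ranges line up.

```python
def cost_per_document(total_cost: float, document_count: int) -> float:
    """Return the average cost of handling one document."""
    return total_cost / document_count


# Estimated ranges from the testimony: (low, high).
petitioners_docs = (200_000, 400_000)
petitioners_cost = (80_000, 85_000)
respondent_docs = (3_500_000, 7_000_000)

# Mid-point of the petitioners' document estimate, priced at the high cost figure.
midpoint_docs = sum(petitioners_docs) // 2
per_doc = cost_per_document(petitioners_cost[1], midpoint_docs)
print(f"{midpoint_docs:,} documents at ${per_doc:.2f} each")
# prints "300,000 documents at $0.28 each"

# Reduction in documents to review (assumption: low paired with low, high with high).
reduction_low = respondent_docs[0] - petitioners_docs[0]
reduction_high = respondent_docs[1] - petitioners_docs[1]
print(f"Reduction: {reduction_low:,} to {reduction_high:,} documents")
# prints "Reduction: 3,300,000 to 6,600,000 documents"
```

At roughly 28 cents per document, the estimate is plausible for machine-driven culling but not for attorney review of each document.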

Our Rules, including our discovery Rules, are to “be construed to secure the just, speedy, and inexpensive determination of every case.” Rule 1(d). Petitioners may use predictive coding in responding to respondent’s discovery request. If, after reviewing the results, respondent believes that the response to the discovery request is incomplete, he may file a motion to compel at that time. See Rule 104(b), (d).

## Should Lawyers Be Big Data Cops?

September 1, 2014

Many police departments are using big data analytics to predict where crime is likely to take place and prevent it. Should lawyers do the same to predict and stop illegal, non-criminal activities? This is not the job of police, but should it be the job of lawyers? We already have the technology to do this, but should we? Should lawyers be big data cops? Does anyone even want that?

Crime Prevention by Data Analytics is Already in Use by Many Police Departments

The NY Times reported on this back in 2011 when it was relatively new: Sending the Police Before There’s a Crime. The Times reported how the Santa Cruz, California police were using data analysis to predict where burglaries and other crimes might take place and to deploy police officers accordingly:

The arrests were routine. Two women were taken into custody after they were discovered peering into cars in a downtown parking garage in Santa Cruz, Calif. One woman was found to have outstanding warrants; the other was carrying illegal drugs.

But the presence of the police officers in the garage that Friday afternoon in July was anything but ordinary: They were directed to the parking structure by a computer program that had predicted that car burglaries were especially likely there that day.

The Times reported that several cities were already using data analysis to try to systematically anticipate when and where crimes will occur, including the Chicago Police Department. Chicago created a predictive analytics unit back in 2010.

This trend is growing and precrime detection technologies are now used by many police departments around the world, as well as by the Department of Homeland Security, not to mention the NSA analytics of metadata. See, e.g., The Minority Report: Using Predictive Analytics to prevent the crime from happening in the first place! (IBM); In Hot Pursuit of Numbers to Ward Off Crime (NY Times); Police embracing tech that predicts crimes (CNN); U.S. Cities Relying on Precog Software to Predict Murder (Wired). The analytics are already pretty good at predicting places and times where cars will be stolen, houses robbed and people mugged.

Although these programs help make crime fighting more efficient, they are not without serious critics on privacy and due process grounds. Imagine the potential abuses if an evil Big Brother government were not only watching you, but could arrest you based on computer predictions of what you might do. Although no one is arresting people yet for what they might do as in the Minority Report, they are subjecting people to significantly increased scrutiny, even home visits. See, e.g., Professor Elizabeth Joh, Policing by Numbers: Big Data and the Fourth Amendment; Professor Brandon Garrett, Big Data and Due Process; The minority report: Chicago’s new police computer predicts crimes, but is it racist? (The Verge, 2014); Eric Holder Warns About America’s Disturbing Attempts at Precrime. Do we really want to give computers, and the people who operate them, that much power? Does the Constitution as now written even allow that?

Should Lawyers Detect and Stop Law Suits Before They Happen?

Should lawyers follow our police departments and use data analytics to predict and stop illegal, but non-criminal activities? The police will not do it. It is beyond their jurisdiction. Their job is to fight crime, not torts, not breach of contract, nor the tens of thousands of other civil wrongs that people and corporations sue each other about every day. Should lawyers do it? Is that the next step for the plaintiff’s bar? Is that the next step for corporate defense lawyers? For corporate compliance lawyers? For the Civil Division of the Department of Justice? How serious is the potential loss in privacy and other rights to go that route? What other risks do we take in using our newfound predictive coding skills in this way?

There are millions of civil wrongs committed each year that are beyond the purview of the criminal justice system. Many of them cause disputes, and many of these disputes in turn lead to state and federal litigation. Evidence of these illegal activities is present in both public and private data. Should lawyers mine this data to look for civil wrongs? Should the civil justice system include prevention? Should lawyers not only bring and defend law suits, but also prevent them?

This is not the future we are talking about here. The necessary software and search skills already exist to do this. Lawyers with big data skills can already detect and prevent breach of contract, torts, and statutory violations, if they have access to the data. It is already possible for skilled lawyers to detect and stop these illegal activities before damages are caused, before disputes arise, before law suits are filed. Lawyers with artificial intelligence enhanced evidence search skills can already do this.

I have written about this several times before and even coined a word for this legal service. I call it “PreSuit.” It is a play off the term PreCrime from the Minority Report movie. I have built a website that provides an overview on how these services can be performed. Some lawyers have even begun rendering such services. But should they? Some lawyers, myself included, know how to use existing predictive coding software to mine data and make predictions as to where illegal activities are likely to take place. We know how to use this predictive technology to intervene to prevent such illegal activity. But should we?

Just because new technology empowers us to do new things, does not mean we should. Perhaps we should refrain from becoming big data cops? We do not need the extra work. No one is clamoring for this new service. Should we build a new bomb just because we can?

Do we really want to empower an elite group of technology enhanced lawyers in this way? After all, society has gotten along just fine for centuries using traditional civil dispute resolution procedures. We have gotten along just fine by using a court system that imposes after-the-fact damages and injunctions to provide redress for civil wrongs. Should we really turn the civil justice system on its head by detecting the wrongs in advance and avoiding them?

Is it really in the best interest of society for lawyers to be big data cops? Or anyone else for that matter? Is it in the best interest of the corporate world to have this kind of private police action? Is it in the best interest of lawyers? The public? What are the privacy and due process ramifications?

Some Preliminary Thoughts

I do not have any answers on this yet. It is too early in my own analysis to say for sure. These kinds of complex constitutional issues require a lot of thought and discussion. All sides should be heard. I would like to hear what others have to say about this before I start reaching any conclusions. I look forward to hearing your public and private comments. I do, however, have a few preliminary thoughts and predictions to start the discussion. Some are serious, some are just designed to be thought-provoking. You figure out which are which. If you quote me, please remember to include this disclaimer. None of these thoughts are yet firm convictions, nor certain predictions. I may change my mind on all of this as my understanding improves. As a better Ralph than I once said: “A foolish consistency is the hobgoblin of little minds.”

First of all, there is no current demand for this service by the people who need it the most, large corporations. They may never want this, even though such opposition is irrational. It would, after all, reduce litigation costs and make their company more profitable. I am not sure why, and do not think it is as simple as some would say, that they just want to hide their illegal activities. Let me tell you an experience from my 34 years as a litigator that may shed some light on this. This is an experience that I know is common with many litigators. It has to do with the relationship between lawyers and management in most large companies.

Occasionally during a case I would become aware of a business practice in my client corporation that should obviously be changed. Typically it was a business practice that created or at least contributed to the law suit I just defended. The practice was not blatantly illegal, but was a grey-area. The case had shown that it was stupid and should be changed, if for no other reason than to prevent another case like that from happening. Since I had just seen the train wreck in slow motion, and knew full well how much it had cost the company, mostly in my fees, I thought I would help the company to prevent it from happening again. I would make a recommendation as to what should be changed and why. Sometimes I would explain in detail how the change would have prevented the litigation I just finished. I would explain how a change in the business practice would save the company money.

I have done this several times as a litigator at other firms before going to my current firm where I only do e-discovery. Do you know what kind of reaction I got? Nothing. No response at all, except perhaps a bored, polite thanks. I doubt my lessons learned memos were even read. I was, after all, just an unknown, young partner in a Floriduh law firm. I was not pointing out an illegal practice, nor one that had to be changed to avoid illegal activities. I was just pointing out a very ill-advised practice. I have had occasions to point out illegal activities too, in fact this is a more frequent occurrence, and there the response is much different. I was not ignored. I was told this would be changed. Sometimes I was asked to assist in that change. But when it came to recommendations to change something not outright illegal, suggestions to improve business practices, the response was totally different. Crickets. Just crickets. And big yawns. When will lawyers learn their place?

A couple of times I talked to in-house counsel about this, and tried to enlist their support to get the legal, but stupid, business practice changed. They would usually agree with me, wholeheartedly, on the stupid part; after all, they had seen the train wreck too. But they were cynical. They would explain that no one in upper management would listen to them. I am speaking about large corporations, ones with big bureaucracies. It may be better in small companies. In large companies in-house counsel would express frustration. They knew the law department had far less juice than most others in the company. (Only the poor records department, or compliance department, if there is one, typically gets less respect than legal.) Many other parts of a company actually generate revenue, or at least provide cool toys that management wants, such as IT. All Legal does is spend money and aggravate everyone. The department that usually has the most juice in a company is sales, and they are the ones with most of the questionable practices. They are focused on money-making, not abstractions like legal compliance and dispute avoidance. Bottom line, in my experience upper management is not interested in hearing the opinions of lawyers, especially outside counsel, on what they should do differently.

Based on this experience I do not think the idea of lawyers as analytic cops to prevent illegal activities will get much traction with upper management. They do not want a lawyer in the room. It would stifle their creativity, their independent management acumen. They see all lawyers as naysayers, deal breakers. Listen to lawyers and you’ll get paralysis by analysis. No, I do not see any welcome sign appearing for lawyers as big data cops, even if you present chart after chart as to how much money, time and frustration you will save the company in litigation avoidance. Of course, I never was much of a salesman. I’m just a lawyer who follows the hacker way of management (an iterative, pragmatic, action-based approach, which is the polar opposite of paralysis by analysis). So maybe some vendor salesmen out there will be able to sell the PreSuit concept, but not lawyers, at least not me.

I have tried all year. I have talked about this idea at several events. I have written about it, and created the PreSuit website with details. Do you know how many companies have responded? How many have expressed at least some interest in the possibility of reducing litigation costs by data analytics? Build it and they will come, they say. Not in my experience. I’ve built it and no one has come. There has been no response at all. Weeds are starting to grow on this field of dreams. Oh well. I’m a golfer. I’m used to disappointment.

This is probably just as well because reduction of litigation is not really in the best interests of the legal profession. After all, most law firms make most of their money in litigation. Lawyers should refuse to be big data cops and should let the CEOs carry on in ignorant bliss. Let them continue to function with eyes closed and spawn expensive litigation for corporate counsel to defend and for plaintiff’s counsel to get rich on. The litigation system works fine for the lawyers, and for the courts and judges too. Why muck up a big money generating machine by avoiding the disputes that keep the whole thing running? Especially when no one wants that.

All of the established powers want to leave things just the way they are. Can you imagine the devastating economic impact a fifty percent reduction in litigation would have on the legal system? On lawyers everywhere? Both plaintiff’s and defendant’s bars? Hundreds of thousands of lawyers and support staff would be out of work. No. This will be ignored, and if not ignored, attacked as radical, new, unproven, and perhaps most effective of all, as dangerous to privacy rights and due process. The privacy anti-big-brother groups will, for once, join forces with corporate America. Protect the workers they will say. Unions everywhere will oppose PreSuit. Labor and management will finally have an issue they can agree upon. Only a few high-tech lawyers will oppose them, and they are way outnumbered, especially in the legal profession.

No, I predict this will never be adopted voluntarily, nor will it ever be required by legislation. The politicians of today do not lead, they follow. The only thing I see now that will cause people to want to avoid litigation, to use data analytics to detect and prevent disputes, is the collapse, or near-collapse, of our current system of civil litigation. Lawyers as big data cops will only come out of desperation. This might happen sooner than you think.

There is another way, of course. True leadership could come from the new ranks of corporate America. They could see the enlightened self-interest of PreSuit litigation avoidance. They could understand the value of data analytics and the value of compliance. This may not come from our current generation of old-school leaders; they barely know what data analytics is anyway. But maybe it will come from the next wave of leaders. There is always hope that the necessary changes will be made out of intelligence, not crises. If history is any guide, this is unlikely, but not impossible.

On the other hand, maybe this is benevolent neglect. Maybe the refusal to adopt these new technologies is for the best. Maybe the power to predict civil wrongs would be abused by a small technical elite of e-discovery lawyer cops. Maybe it would go to their head, and before you know it, their heavy hands would descend to rob all employees of their last fragments of privacy. Maybe innovation would be stifled by the fear that new creative actions might be seen as a precursor to illegal activities. This chilling effect could cause everyone to just play it safe.

The next generation of Steve Jobs would never arise in conditions such as this. They would instead come from the last remaining countries that still maintained a heavy litigation load. They would arise in cultures that still allow the workforce to do as it damn well pleases, and just let the courts sort it all out later. Legal smegal, just get the job done. Maybe expensive chaos is the best incubator we have for creative genius? Maybe it is best to keep lawyers out of the boardroom? Much less give them a badge and let them police anything. It is better to keep data analytics in Sales where it belongs. Let us know what our customers are doing and thinking, but turn a blind eye to ourselves. That way we can do what we want.

Conclusion

I always end my blogs with a conclusion. But not this time. I have no conclusions yet. This could go either way. This game is too close to call. We are still in the early innings. Who knows? A few star CEOs may come out of the cornfields yet. Then we could find out fast whether PreSuit is a good thing. A few test cases should flush out the facts, good and bad.