Happy Holidays!

December 7, 2014

Last week’s blog was so difficult, so long, that I think everyone is ready for a humorous break. So I offer you a bit of Festivus holiday spirit from the comedic mind of Seinfeld.


Apparently this type of humor is threatening the morals of my home state, Florida, as the following video explains. It does not take much to threaten our morals, believe me. On the other hand, how funny is a state that actually has a Festivus pole in the state Capitol? Well, not everyone is amused.


I hate to leave on a political note, so I offer you one more Festivus video. The last half of this one reminds us of the many grievances aired on Seinfeld.


No doubt someone will be offended by these videos, so, as I like to say from time to time: “Let the airing of grievances begin!” Just leave them in the comment box below. The funnier the better. Points for clever comments, even if not really funny. Even more points for grievances, clever or not, with some kind of tie-in to e-discovery.


Visualizing Data in a Predictive Coding Project – Part Three

November 30, 2014

This is part three of my presentation of an idea for visualization of data in a predictive coding project. Please read part one and part two first. This concluding blog in the visualization series also serves as a stand-alone lesson on the basics of math, sampling, probability, prevalence, recall and precision. It will summarize some of my current thoughts on quality control and quality assurance in large scale document reviews. Bottom line, there is far more to quality control than doing the math, but still, sampling and metric analysis are helpful. So too is creative visualization of the whole process.

Law, Science and Technology

This is the area in which scientists on e-discovery teams excel. I recommend that every law firm, corporate, and vendor e-discovery team have at least one scientist to help them. Technologists alone are not sufficient. E-discovery teams know this, and all have engineers working with lawyers, but very few yet have scientists working with engineers and lawyers. They are like two-legged stools.

Also, and this seems obvious, you need search-sophisticated lawyers on e-discovery teams too. I am seeing this error more and more lately, especially among vendors. Engineers may think they know the law, that is very common, but they are wrong. The same delusional thinking sometimes even affects scientists. Both engineers and scientists tend to over-simplify the law and do not really understand legal discovery. They do not understand the larger context and overall processes and policies.

John Tredennick

For legal search to be done properly, it must not only include lawyers, the lawyers must lead. Ideally, a lawyer will be in charge, not in a domineering way (my way or the highway), but in a cooperative multi-disciplinary team sort of way. That is one of the strong points I see at Catalyst. Their team includes tons of engineers/technologists, like any vendor, but also scientists, and lawyers. Plus, and here is the key part, the CEO is an experienced search lawyer. That means not only a law degree, but years of legal experience as a practicing attorney doing discovery and trials. A fully multidisciplinary team with an experienced search lawyer as leader is, in my opinion, the ideal e-discovery team. Not only for vendors, but for corporate e-discovery teams, and, of course, law firms.

Many disagree with me on this, as many laymen and non-practicing attorneys resent my law-first orientation. Technologists are now often in charge, especially on vendor teams. In my experience these technologists do not properly respect the complexity of legal knowledge and process. They often bad mouth lawyers and law firms behind their backs. Their products and services suffer as a result. It is a recipe for disaster.

On many vendor teams, the lawyers are not part of the leadership; if lawyers are on a team at all, they are low level and not respected. This is all wrong because the purpose of e-discovery teams is the search for evidence in a legal context, typically a lawsuit. Only one leg of the stool has ever studied evidence.

It takes all three disciplines for top quality legal search: scientists, technologists and lawyers. If you cannot afford a full-time scientist, then you should at least hire one as a consultant on the harder cases.

The scientists on a team may not like the kind of simplification I will present here on sampling, prevalence and recall. They typically want to go into far greater depth and provide multiple caveats on math and probability, which is fine, but it is important to start with a foundation of basics. This is what you will find here. The basics of math and probabilities, and applications of these principles from a lawyer’s point of view, not a scientist’s or engineer’s.

Still, the explanations here are informed by the input of several outstanding scientists. A special shout out and thanks goes to Gordon Cormack. He has been very generous with his time and patient with my incessant questions. Professor Cormack has been a preeminent voice in Information Science and search for decades now, since well before he started teaming with Maura Grossman to study predictive coding. I appreciate his assistance, and, of course, any errors and oversimplifications are solely my own.

Now let’s move on to the math part you have been waiting for, and begin by revisiting the hypothetical we set out in parts one and two of this visualization series.

Calculating and Visualizing Prevalence

Recall that we have exactly 1,000,000 documents remaining for predictive coding after culling. I previously explained that this particular project began with culling and multimodal judgmental sampling, and with a random sample of 1,534 documents. Please note this is not intended to describe all projects. This is just an example with data flows set up for visualization purposes. If you want to see my standard workflows, see LegalSearchScience.com and Electronic Discovery Best Practices, EDBP.com, on the Predictive Coding page. You will see, for instance, that another activity is always recommended, especially near the beginning of a project, namely Relevancy Dialogues (step 1).

Assuming a 95% confidence level, a sample of 1,534 documents creates a confidence interval of 2.5%. This means your sample is subject to a 2.5% error rate in both directions, high and low, for a total error range of 5%. This is 5% of the total One Million document corpus (50,000 documents), not just 5% of the 1,534 sample (77 documents).
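For readers who wonder where the 1,534 figure comes from, it can be reproduced with the standard sample-size formula, assuming worst-case 50% prevalence and a finite population correction for the one million document collection. This is a sketch of the math, not necessarily the exact calculator used:

```python
def sample_size(z, margin, population, p=0.5):
    """Simple random sample size, with finite population correction."""
    n0 = z**2 * p * (1 - p) / margin**2        # infinite-population size
    return n0 / (1 + (n0 - 1) / population)    # finite population correction

# 95% confidence (z about 1.96), +/-2.5% interval, 1,000,000 documents
print(round(sample_size(1.959964, 0.025, 1_000_000)))  # 1534
```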

In our hypothetical the SME, who had substantial help from a top contract reviewer, studied the 1,534 sampled documents. The SME found that 384 were relevant and 1,150 were irrelevant. By the way, when done properly this review of 1,534 documents should only cost between $1,000 and $2,000, with most of that going to the SME expense, not the contract reviewer expense.

The spot projection of prevalence here is 25%. This is simple division. Divide the 384 relevant documents by the 1,534 total sampled: 384/1,534. You get 25%. That means that one out of four of the documents sampled was found to be relevant. Random sampling tells us that this same ratio should apply, at least roughly, to the larger population. You could at this point simply project the sample percentage onto the entire document population. You would thus conclude that approximately 250,000 documents will likely be relevant. But this kind of projection alone is nearly meaningless in lower prevalence situations, which are common in legal search. It is also of questionable value in this hypothetical where there is a relatively high prevalence of 25%.

When doing probability analysis based on sampling you must always include both the confidence level, here 95%, and the confidence interval, here 2.5%. The Confidence Level means that 5 times out of 100 the projection will be in error. More specifically, the Confidence Level means that if you were to repeat the sampling 100 times, the resulting Confidence Interval (here 2.5%) would contain the true value (here 250,000 relevant documents) at least 95% of the time. Conversely, this means that it would miss the true value at most 5% of the time.

In our hypothetical the true value is 250,000 relevant documents. On one sample you might get a Confidence Interval of 225,000 – 275,000, as we did here. But with another sample you might get 215,000 – 265,000. On another you might get 240,000 – 290,000.  These all include the true value. Occasionally (but no more than 5 times in a hundred), you might get a Confidence Interval like 190,000 – 240,000, or 260,000 – 310,000, that excludes the true value. That is what a 95% Confidence Level means.
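This repeated-sampling idea is easy to check with a quick Monte Carlo simulation. The trial count and seed below are arbitrary choices for illustration:

```python
import random

random.seed(42)  # arbitrary seed so the run is repeatable

TRUE_PREVALENCE = 0.25   # 250,000 relevant out of 1,000,000
SAMPLE_SIZE = 1534
INTERVAL = 0.025         # the +/-2.5% confidence interval
TRIALS = 2000            # arbitrary number of repeated samples

hits = 0
for _ in range(TRIALS):
    # draw one random sample and count the relevant documents in it
    relevant = sum(random.random() < TRUE_PREVALENCE for _ in range(SAMPLE_SIZE))
    estimate = relevant / SAMPLE_SIZE
    # does the interval around this sample's estimate capture the true value?
    if estimate - INTERVAL <= TRUE_PREVALENCE <= estimate + INTERVAL:
        hits += 1

coverage = hits / TRIALS
# comfortably above 0.95: the fixed +/-2.5% interval is sized for the
# worst case of 50% prevalence, so at 25% prevalence it is a bit generous
print(coverage)
```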

The confidence interval range is simply calculated here by adding 2.5% to the 25%, and subtracting 2.5% from the 25%. This creates a percentage range of 22.5% to 27.5%. When you project this confidence interval onto the entire document collection you get a range of between 225,000 (22.5%*1,000,000) and 275,000 (27.5%*1,000,000) relevant documents.
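The Gaussian projection is plain arithmetic, and can be written out in a few lines:

```python
corpus = 1_000_000
spot = round(384 / 1534, 2)   # 25% spot projection, rounded as in the text
interval = 0.025              # +/-2.5% confidence interval

low, high = spot - interval, spot + interval
print(f"{low:.1%} to {high:.1%}")                  # 22.5% to 27.5%
print(round(low * corpus), round(high * corpus))   # 225000 275000
```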

This simple calculation, called a Classical or Gaussian Estimation, works well in high prevalence situations. But where the prevalence is low, say 3% or less, and even in this hypothetical where the prevalence is a relatively high 25%, the accuracy of the projected range can be improved by adjusting the 22.5% to 27.5% confidence interval range. The adjustment is performed by using what is called a Binomial calculation, instead of the Normal or Gaussian calculation. Ask a scientist for particulars on this, not me. I just know to use a standard Binomial Confidence Interval Calculator to determine the range in most legal search projects. For some immediate guidance, see the definitions of Binomial Estimation and Classical or Gaussian Estimation in The Grossman-Cormack Glossary of Technology Assisted Review.

With the Binomial Calculator you again enter the sample as a fraction, with the numerator being the relevant documents, and the denominator the total number of documents sampled. Again, just like before, you divide 384 by 1,534. The basic answer is the same 25% point or spot projection, but the range with a Binomial Calculator is now slightly different. Instead of a simple plus or minus 2.5%, which produces 22.5% to 27.5%, the binomial calculation creates a tighter range of 22.9% to 27.3%. The range in this hypothetical is thus a little tighter than 5%. The range here is 4.4% (from 22.9% to 27.3%). Therefore the projected range of relevant documents using the Binomial interval calculation is between 229,000 (22.9%*1,000,000) and 273,000 (27.3%*1,000,000) documents.
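The text does not name the particular binomial calculator used, but the Wilson score interval, one standard binomial method, happens to reproduce the 22.9% to 27.3% range here. A sketch:

```python
import math

def wilson_interval(relevant, sampled, z=1.959964):
    """Wilson score confidence interval for a binomial proportion."""
    p = relevant / sampled
    denom = 1 + z**2 / sampled
    center = (p + z**2 / (2 * sampled)) / denom
    margin = z * math.sqrt(p * (1 - p) / sampled
                           + z**2 / (4 * sampled**2)) / denom
    return center - margin, center + margin

low, high = wilson_interval(384, 1534)
print(f"{low:.1%} to {high:.1%}")   # 22.9% to 27.3%
# projected onto the corpus: roughly 229,000 to 273,000 documents
print(round(low * 1_000_000), round(high * 1_000_000))
```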

The simple random sample of 1,534 documents from the 1,000,000 document collection shows that 95 times out of 100 the true number of relevant documents will be between 229,000 and 273,000.

This also means that no more than five times out of 100 will the calculated interval, here between 22.9% and 27.3%, fail to capture the true value, the true number of relevant documents in the collection. Sometimes the true value, the true number of relevant documents, may be less than 229,000 or greater than 273,000. This is shown in part by the graphic below, another visualization that I like to use to see what is happening in a predictive coding project. Here the true value lies somewhere between 229,000 and 273,000, or at least 95 times out of 100 it does. When, 5 times out of 100, the true value lies outside the range, the divergence is usually small. Most of the time, when the confidence interval misses the true value, it is a near miss. Cases where the confidence interval is far below, or far above, the true value are exceedingly rare.


The Binomial adjustment to the interval calculation is required for low prevalence populations. For instance, if the prevalence was only 2%, and the interval was again 2.5%, the error range would create a negative number, -0.5% (2% - 2.5%). The range would run from -0.5% to 4.5%. That projection means between zero relevant documents and 45,000. (Obviously you cannot have negative relevant documents.) The zero is also known to be wrong, because you could not have performed the calculation unless there were some relevant documents in the sample. So in this situation of low prevalence the Binomial calculation method is required to produce anything close to accurate projections.

For example, assuming again a 1,000,000 document corpus, and a 95% +/-2.5% sample of 1,534 documents, a 2% prevalence results from finding 31 relevant documents. Using the binomial calculator you get a range of 1.4% to 2.9%, instead of -0.5% to 4.5%. The binomial based interval range results in a projection of between 14,000 relevant documents (instead of the absurd zero) and 29,000 relevant documents.
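The same comparison can be run in code for this low prevalence example. Again the Wilson score interval is an assumption, since the exact calculator is not named, but it matches the 1.4% to 2.9% range:

```python
import math

def wilson_interval(relevant, sampled, z=1.959964):
    """Wilson score confidence interval for a binomial proportion."""
    p = relevant / sampled
    denom = 1 + z**2 / sampled
    center = (p + z**2 / (2 * sampled)) / denom
    margin = z * math.sqrt(p * (1 - p) / sampled
                           + z**2 / (4 * sampled**2)) / denom
    return center - margin, center + margin

p = 31 / 1534                                  # 2% prevalence in the sample
# naive Gaussian range: 2% +/- 2.5% dips below zero
print(f"{p - 0.025:.1%} to {p + 0.025:.1%}")   # -0.5% to 4.5%
# binomial (Wilson) range stays sensible
low, high = wilson_interval(31, 1534)
print(f"{low:.1%} to {high:.1%}")              # 1.4% to 2.9%
```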

Even with the binomial calculation adjustment, the reliability of using probability projections to calculate prevalence is the subject of much controversy among information scientists and probability statisticians (most good information scientists doing search are also probability statisticians, but not vice versa). The reliability of such range projections is controversial in situations like this, where the sample size is low, here only 1,534 documents, and the likely percentage of relevant documents is also low, here only 2%. In this second scenario, where only 31 relevant documents were found in the sample, there are too few relevant documents for sampling to be as reliable as it is in higher prevalence collections. I still think you should do it. It does provide good information. But you should not rely completely on these calculations, especially when it comes to the step of trying to calculate recall. You should use all of the quality control procedures you know, including the others listed previously.

Calculating Recall Using Prevalence

Recall is another percentage, one that represents the proportion of the total relevant documents in a collection that have been found. So, if you happen to know that there are 10 relevant documents in a collection of 100 documents, and you correctly identify 9 relevant documents, then you have attained a 90% recall level. Referring to the hopefully familiar Search Quadrant shown right, this means that you would have one False Negative and nine True Positives. If you only found one out of the ten, you would have 10% recall (and would likely be fired for negligence). This would be nine False Negatives and one True Positive.

The calculation of Precision requires information on the total number of False Positives. Take the first example, where you found nine of the ten relevant documents. If you also found nine more documents that you thought were relevant, but were not, those nine were False Positives. What would your precision be? You have found a total of 18 documents that you thought were relevant, and it turns out that only half of them, 9 documents, were actually relevant. That means you had a precision rate of 50%. Simple. Precision could also easily be visualized by various kinds of standard graphs. I suggest that this be added to all search and review software. It is important to see, but, IMO, when it comes to legal search, the focus should be on Recall, not Precision.
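The recall and precision arithmetic from the ten-document example can be written out as:

```python
# toy numbers from the example above: 10 truly relevant documents,
# the reviewer marks 18 as relevant, 9 of them correctly
true_positives = 9    # relevant and found
false_negatives = 1   # relevant but missed
false_positives = 9   # found but not actually relevant

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)
print(f"recall={recall:.0%} precision={precision:.0%}")  # recall=90% precision=50%
```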

The problem with calculating Recall in legal search is that you never know the total number of relevant documents; that is the whole point of the search. If you knew, you would not have to search. In fact, no one ever knows. Moreover, in large document collections, there is no way to ever exactly know the total number of relevant documents. All you can ever do is calculate probable ranges. You might think that absolute knowledge could come from human review of all One Million documents in our hypothetical. But that would be wrong, because humans make too many mistakes, especially with legal judgments as fluid as relevancy determinations. So too do computers, dependent as they are on training by all too fallible humans.

Bottom line, we can never know for sure how many relevant documents are in the 1,000,000 collection, and so we can never know with certainty what our Recall rate is. But we can make a very educated guess, one that is almost certainly correct when a range of Recall percentages is used, instead of just one particular number. We can narrow down the grey area. All experienced lawyers are conceptually familiar with this problem. The law is made in a similar process. It arises case by case out of large grey areas of uncertainty.

The reliability of our sample based Recall guess decreases as prevalence lowers. It is a problem inherent to all random sampling. It is not unique to legal evidence search. What is unique to legal search is the importance of Recall to begin with. In many other types of search Recall is not that important. Google is the prime example of this. You do not need to find all websites with relevant information, just the more useful, generally the most popular web pages. Law is moving away from Recall focus, but slowly. And it is more of a move right now from Recall of simple relevance to Recall of the highly relevant. In that sense legal search will in the long run become more like mainstream Googlesque search. But for now the law is still obsessed with finding all of the evidence in the perhaps mistaken belief that justice requires the whole truth. But I digress.

In our initial hypothetical of a 25% prevalence, the accuracy of the recall guess is actually very high, subject primarily to the 95% confidence level limitation. Even in the lower 2% hypothetical, the recall calculation has value. Indeed, it is the basis of much scientific research concerning things like rare diseases and rare species. Again, we enter a hotly debated area of science that is beyond my expertise (although not my interest).

Getting back to our example where we have a 95% confidence level that there are between 229,000 and 273,000 relevant documents in the 1,000,000 document collection – as described before in part one of this series, we assume that after only four rounds of machine training we have reached a point in the project where we are not seeing a significant increase in relevant documents from one round of machine training to the next. The change in document probability ranking has slowed and the visualization of the ranking distribution looks something like this upside down champagne glass shown right.

At this point a count shows that we have now found 250,000 relevant documents. This is critical information that I have not shared in the first two blogs, information that for the first time allows for a Recall calculation. I held back this information until now for simplicity purposes, plus it allowed me to add a fun math test. (Well, the winner of the test, John Tredennick, CEO of Catalyst, thought it was fun.) In reality you would keep a running count of relevant documents found, and you would have a series of Recall visualizations. Still, the critical Recall calculation takes place when you have decided to stop the review and test.

Assuming we have found 250,000 relevant documents, this means that we have attained anywhere between 91.6% and 100% recall. At least it means we can have a 95% confidence level that we have attained a result somewhere in that range. Put another way, we can have a 95% confidence level that we have attained a 91.6% or higher recall rate. We cannot have 100% confidence in that result. Only 95%. That means that one time out of twenty (the 5% outside the 95% confidence level) there may be more than 273,000 relevant documents. That in turn means that one time in twenty we may have attained less than a 91.6% recall in this circumstance.


The low side Recall calculation of 91.6% is derived by dividing the 250,000 found by the high end of the confidence interval, 273,000 documents. If the spot projection happens to be exactly right, which is rare, and in this hypo is now looking less and less likely (we have, after all, now found 250,000 relevant documents, or at least think we have), then the math would be 100% recall (250,000/250,000). That is extremely unlikely. Indeed, information scientists love to say that the only way to attain 100% recall is with 0% precision, that is, to select all documents. This statement is, among other things, a hyperbole intended to make the point about the uncertainty inherent in sampling and confidence levels. The 95% Confidence Level uncertainty is shown by the long tails on either side of the standard bell curve pictured above.
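The recall range calculation above, with the impossible over-100% end capped, looks like this:

```python
found = 250_000
interval_low, interval_high = 229_000, 273_000   # binomial projection from the sample

# recall can never exceed 100%, so cap the optimistic end
recall_low = found / interval_high
recall_high = min(found / interval_low, 1.0)
print(f"{recall_low:.1%} to {recall_high:.1%}")  # 91.6% to 100.0%
```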

You can never have more than 100% recall, of course, so we do not say we have attained anywhere between 91.6% and 109% recall. The low-end estimate of 229,000 relevant documents has, at this point in the project, been shown to be wrong by the discovery and verification of 250,000 relevant documents. I say shown, not proven, because of the previously mentioned fluidity of relevance and the inability of humans to make consistent final judgments when, as here, vast numbers of documents are involved.

For a visualization of recall I like the image of a thermometer, like a fund-raising goal chart, but with a twist of two different measures. On the left side put the low-end measure, here the 22.9% confidence interval with 229,000 documents, and on the right side the high measure, the 27.3% confidence interval with 273,000 documents. You can thus chart your progress from the two perspectives at once, the low probability error rate and the high probability error rate. This is shown on the diagram to the right. It shows the metrics of our hypothetical where we have found and confirmed 250,000 relevant documents. That just happens to represent 100% recall on the low-end of the probability error range using the 22.9% confidence interval. But as explained before, the 250,000 relevant documents found also represents only 91.6% recall on the high-end using the 27.3% confidence interval. You will never really know which is accurate, except that it is safe to bet you have not in fact attained 100% recall.

Random Sample Quality Assurance Test

In any significant project, in addition to following the range of recall progress, I impose a quality assurance test at the end to look for False Negatives. Remember, this means relevant documents that have been miscoded as irrelevant. One way to do that is by running similarity searches and verifying that coding is consistent across near-duplicates. That can catch situations involving documents that are known to be relevant. It is a way to be sure that all variations of those documents, including similar but different documents, are coded consistently. There may be reasons to call one variant relevant, and another irrelevant, but usually not. I like to put a special emphasis on this at the end, but it is only one of many quality tests and searches that a skilled searcher can and should run throughout any large review project. Visualizations could also be used to assist in this search.

But what about the False Negatives that are not near duplicates or close cousins? The similarity and consistency searches will not find them. Of course you have been looking for these documents throughout the project, and at this point you think that you have found as many relevant documents as you can. You may not think you have found all relevant documents, total recall; no experienced searcher ever really believes that. But you should feel like you have found all highly relevant documents. You should have a well reasoned opinion that you have found all of the relevant documents needed to do justice. That opinion will be informed by legal principles of reasonability and proportionality.

That opinion will also be informed by your experience in the search through this document set. You will have seen for yourself that the probability rankings have divided the documents into two well defined segments, relevant and irrelevant. You will have seen that no documents, or very few, remain in the uncertainty area, the 40-60% range. You will have personally verified the machine’s predictions many times, such that you will have high confidence that the machine is properly implementing the SME’s relevance concept. You will have seen for yourself that few new relevant documents are found from one round of training to the next. You will also usually have seen that the new documents found are really just more of the same. They are essentially cumulative in nature. All of these observations, plus the governing legal principles, go into the decision to stop the training and review, and move on to final confidentiality protection review, and then production and privilege logging.

Still, in spite of all such quality control measures, I like to add one more, based again on random sampling. Again, I am looking for False Negatives, specifically any that are a new and different kind of relevant document not seen before, or a document that would be considered highly relevant, even if of a type seen before. Remember, I will not have stopped the review in most projects (proportionality constraints aside) unless I was confident that I had already found all of those types of documents: all types of strong relevant documents, and all highly relevant documents, even if they are cumulative. I want to find each and every instance of all hot (highly relevant) documents that exist in the entire collection. I will only stop (proportionality constraints aside) when I think the only relevant documents I have not recalled are of an unimportant, cumulative type; the merely relevant. The truth is, most documents found in e-discovery are of this type; they are merely relevant, and of little to no use to anybody except to find the strong relevant, new types of relevant evidence, or highly relevant evidence.

There are two types of random samples that I usually run for this final quality assurance test. I can sample the entire document set again, or I can limit my sample to the documents that will not be produced. In the hypothetical we have been working with, that would mean a sample of the 750,000 documents not identified as relevant. I do not do both samples, but rather one or the other. But you could do both in a very large, relatively unconstrained budget project. That would provide more information. Typically in a low prevalence situation, where for instance only a 2% relevance rate is shown by both the sample and the ensuing search project, I would do my final quality assurance test with a sample of the entire document collection. Since I am looking for False Negatives, my goal is not frustrated by including the 2% of the collection already identified as relevant.

There are benefits from running a full sample again, as it allows direct comparisons with the first sample, and can even be combined with the first sample for some analysis. You can, for instance, run a full confusion matrix analysis as explained, for instance, in The Grossman-Cormack Glossary of Technology Assisted Review; also see Escape From Babel: The Grossman-Cormack Glossary.


                     Truly Non-Relevant        Truly Relevant
Coded Non-Relevant   True Negatives (“TN”)     False Negatives (“FN”)
Coded Relevant       False Positives (“FP”)    True Positives (“TP”)

Accuracy = 100% – Error = (TP + TN) / (TP + TN + FP + FN)
Error = 100% – Accuracy = (FP + FN) / (TP + TN + FP + FN)
Elusion = 100% – Negative Predictive Value = FN / (FN + TN)
Fallout = False Positive Rate = 100% – True Negative Rate = FP / (FP + TN)
Negative Predictive Value = 100% – Elusion = TN / (TN + FN)
Precision = Positive Predictive Value = TP / (TP + FP)
Prevalence = Yield = Richness = (TP + FN) / (TP + TN + FP + FN)
Recall = True Positive Rate = 100% – False Negative Rate = TP / (TP + FN)
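These glossary formulas translate directly into a small function. The sample counts below are illustrative only, not from the hypothetical:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Grossman-Cormack glossary metrics from the four quadrant counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy":   (tp + tn) / total,
        "error":      (fp + fn) / total,
        "elusion":    fn / (fn + tn),
        "fallout":    fp / (fp + tn),
        "neg_pred":   tn / (tn + fn),
        "precision":  tp / (tp + fp),
        "prevalence": (tp + fn) / total,
        "recall":     tp / (tp + fn),
    }

m = confusion_metrics(tp=9, tn=81, fp=9, fn=1)
print(f"recall={m['recall']:.0%} precision={m['precision']:.0%} "
      f"elusion={m['elusion']:.1%}")  # recall=90% precision=50% elusion=1.2%
```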

Special code and visualizations built into review software could make it far easier to run this kind of Confusion Matrix analysis. It is really far easier than it looks, and should be automated in a user-friendly way. Software vendors should also offer basic instruction on this tool. Scientist members of an e-discovery team can help with this. Since the benefits of this kind of analysis outweigh the small loss of including the 2% already known to be relevant in the alternative low prevalence example, I typically go with a full random sample in low prevalence projects.

In our primary hypothetical we are not dealing with a low prevalence collection. It has a 25% rate. Here if I sampled the entire 1,000,000, I would in large part be wasting 25% of my sample. To me that detriment outweighs the benefits of bookend samples, but I know that some experts disagree. They love the classic confusion matrix analysis.

To complete this 25% prevalence visualization hypothetical, next assume that we take a simple random sample of the 750,000 documents only, which is sometimes called the null-set. This kind of sample is also sometimes called an Elusion test, as we are sampling the excluded documents to look for relevant documents that have so far eluded us. We again sample 1,534 documents, again allowing us a 95% confidence level and a confidence interval of plus or minus 2.5%.

Next assume in this hypothetical that we find that 1,519 documents have been correctly coded as irrelevant. (Note, most of the correct coding would have come from machine prediction, not actual human review, but some would have been by actual prior human review.) These 1,519 documents are True Negatives. That is 99% accurate. But the SME review of the random sample did uncover 15 mistakes, 15 False Negatives. The SME decided that 15 documents out of the 1,534 sampled had been incorrectly coded as irrelevant. That is about a 1% error rate. That is pretty good, but not dispositive. What really matters is the nature of the relevancy of the 15 False Negatives. Were these important documents, or just more of the same?
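The elusion arithmetic is short. The projection of the miss rate onto the 750,000 document null-set is an extra step I have added here for illustration; the hypothetical itself stops at the error rate:

```python
sampled = 1534
false_negatives = 15               # relevant documents miscoded as irrelevant
null_set = 750_000

elusion = false_negatives / sampled
print(f"{elusion:.1%}")            # 1.0%
# projecting the miss rate onto the whole null-set suggests how many
# relevant documents may remain unfound in it
print(round(elusion * null_set))   # 7334
```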

I always use what is called an accept-on-zero-error protocol for the elusion test when it comes to highly relevant documents. If any are highly relevant, then the quality assurance test automatically fails. In that case you must go back, search for more documents like the one that eluded you, and train the system some more. I have only had that happen once, and it was easy to see from the document found why it happened. It was a black swan type document. It used odd language. It qualified as highly relevant under the rules we had developed, but just barely, and it was cumulative. Still, we tried to find more like it and ran another round of training. No more were found, but still we did a third sample of the null set just to be sure. The second test passed, and so did the third.

In our hypothetical none of the 15 False Negative documents were highly relevant, not even close. None were of a new type of relevance. All were of a type seen before. Thus the test was passed.

The project then continued with the final confidentiality review, production and logging phases. Visualizations should be included in the software for these final phases as well, and I have several ideas, but this article is already far too long.

As I indicated in part one of this blog series, I am just giving away a few of my ideas here. For more information you will need to contact me for billable consultations, routed through my law firm, of course, and subject to my time availability with priority given to existing clients. Right now I am fully booked, but I may have time for these kind of interesting projects in a few months.


Ralph_FallsThe growth in general electronic discovery legal work (see EDBP for a full description) has been exploding this year; so too has multidisciplinary e-discovery team work. It will, I predict, continue to grow very fast from this point forward. But the adoption of predictive coding software and predictive coding review has, to date, been an exception to this high growth trend. In fact, the adoption of predictive coding has been relatively slow. It is still only infrequently used, if at all, by most law firms, even in big cases. I spoke with many attorneys at the recent Georgetown Institute event who specialize in this field. They are all seeing the same thing and, like me, are shaking their heads in frustration and dismay.

I predict this will change too over the next two to three years. The big hindrances to the adoption of predictive coding are law firms and their general lack of knowledge and skills in predictive coding. Most law firms, both big and small, know very little about the basic methods of predictive coding. They know even less about the best practices. The ignorance is widespread among attorneys my age, and they are the ones in law firm leadership positions. The hindrance to widespread adoption of predictive coding is not lack of judicial approval. There is now plenty of case law. The hindrance is lack of knowledge and skills.


Greedy Lawyers

There is also a greed component involved for some, shall we say, less than client-centric law firms. We have to talk about this elephant in the room. Clients already are. Some attorneys are quite satisfied with the status quo. They make a great deal of money from linear reviews, and so-called advanced keyword search driven reviews. The days of paid inefficiency are numbered. Technology will eventually win out, even over fat cat lawyers. It always does.

The answers I see to the resistance issues to predictive coding are threefold:

Continued Education. We have to continue the efforts to demystify AI and active machine learning. We need to move our instruction from theory to practice.

Improved Software. Some review software already has excellent machine training features. Some is just so-so, and some does not have this kind of document search and ranking capacity at all. My goal is to push the whole legal software industry to include active machine learning in most all of their options. Another goal is for software vendors to improve their software, and make it easier to work with, by adding much more in the way of creative visualizations. That has been the main point of this series and I hope to see a response soon from the industry. Help me to push the industry. Demand these features in your review software. Look beyond the smokescreens and choose the true leaders in the field.

Client Demand. Pressure on reluctant law firms from the companies that pay the bills will have a far stronger impact than anything else.  I am talking about both corporate clients and insurers. They will, I predict, start pushing law firms into greater utilization of AI-enhanced document review. The corporate clients and insurers have the economic motivation for this change that most law firms lack. Corporate clients are also much more comfortable with the use of AI for Big Data search. That kind of pressure by clients on law firms will motivate e-discovery teams to learn the necessary skills. That will in turn motivate the software vendors to spend the money necessary to improve their software with better AI search and better visualizations.

All of the legal software on the market today, especially review software, could be improved by adding more visualizations and graphic display tools. Pictures really can be worth a thousand words. They can especially help to make advanced AI techniques more accessible and easier to understand. The data visualization ideas set forth in this series are just the tip of the iceberg of what can be done to improve existing software.

Genius Bar at Georgetown

November 23, 2014

genius_bar_logoI interrupt my current series of blogs on predictive coding visualization to report on a recent experience with a Genius Bar event. I am not talking about the computer hipster type geniuses that work at the Apple Genius Bar, although there were a few of them at the CLE too. The Apple Genius Bar types can be smart, but, as we all know, they are not really geniuses, even if that is their title. True genius is rare, especially in the Legal Bar. Wikipedia says that a genius is a person who displays exceptional intellectual ability, creativity, or originality, typically to a degree that is associated with the achievement of new advances in a domain of knowledge.


All of us who attended the Georgetown Advanced e-Discovery Institute this week saw a true genius in action. He did not wear the tee-shirt uniform of the Apple genius employees. He wore a bow tie. His name is John M. Facciola. His speech at Georgetown was his last public event before he retires next week as a U.S. Magistrate Judge.


Judge Facciola’s one hour talk displayed exceptional intellectual ability, creativity and originality, just as the definition of genius requires. What else can you call a talk that features a judge channeling Socrates? An oration that uses Plato’s Apology to criticize and enlighten Twenty-First Century lawyers? …. sophists all. The intensity of John’s talk, to me at least, and I’m sure to most of the six hundred or so other lawyers in the room, also indicated a new advance in the making in the domain of knowledge of Law. Still, true genius requires that an advance in knowledge actually be achieved, not just talked about. It requires that the world itself be moved. It requires, as another genius of our day, Steve Jobs, liked to say, that a dent be made in the Universe.

Facciola_standing_thinGeniuses not only have intellectual ability, creativity and originality, they have it to such a degree that they are able to change the world. In the legal world, indeed any world, that is rare. Richard Braman was one such man. His Sedona Conference did make a dent in the legal universe. So did the Principles, and so did his crowning achievement, the Cooperation Proclamation. John Facciola is another such man, or may yet be, one who is trying to take Cooperation to the next level, to expand it to platonic heights. To be honest, the jury is still out on whether his ingenious ideas and proposals will in fact be adopted by the Bar, will in fact lead to the achievement of new advances in a domain of knowledge. That is the true test of a real genius.

Thus, whether future generations will see John Facciola as a genius depends in no small part on all of us, as well as on what John Facciola does next. For unlike the genius of Jobs and Braman, Facciola may be retired as a judge, but he is still very much alive. His legacy is still in the making. For that we should be very grateful. I for one cannot wait to see what he does next and will continue to support his genius in the making.

All of the other judges at Georgetown made it clear where they stand on the ideas of virtue and justice that Facciola promotes. In the final judges panel each wore a funny bow tie in his honor, and were all introduced by panel leader Maura Grossman with Facciola as their last name. It was a very touching and funny moment, all at the same time. I am really glad I was there.

Facciola’s last speech as a judge reflected his own life, his own genius. It was a very personal talk, a deep talk, where, to use his words, he shared his own strong religious and spiritual convictions. In this context he shared his critique of the law as we currently know it, and of legal ethics. It was damning and based on long experience. It was real. Some might say harsh. But he balanced this with his inspirational vision of what the law could and should be in the future. A law where morality, not profit, is the rule. Where the Golden Rule trumps all others. A profession where lawyers are not sophists who will say or do anything for their clients. He laments that in federal court today most of the litigants are big corporations, as only they can afford federal court.

Judge Facciola calls for a profession where lawyers are citizens who care, who try to do the right thing, the moral thing, not just the expedient or profitable thing for their clients. He calls for lawyers to cooperate. He calls for a complete rewrite of our codes of ethics to make them more humanistic, and at the same time, more spiritual, more Platonic, in the ancient philosophic sense of Truth and Goodness. This is the genius we saw shine at Georgetown.

It reminds me of some quotes from Plato’s Apology, a few excerpts of which Facciola also read during his last talk as a judge. Take a moment and remember with me the most famous closing argument of all time:

Men of Athens, I honor and love you; but I shall obey God rather than you, and while I have life and strength I shall never cease from the practice and teaching of philosophy, exhorting anyone whom I meet after my manner, and convincing him, saying: O my friend, why do you who are a citizen of the great and mighty and wise city of Athens, care so much about laying up the greatest amount of money and honor and reputation, and so little about wisdom and truth and the greatest improvement of the soul, which you never regard or heed at all? Are you not ashamed of this? And if the person with whom I am arguing says: Yes, but I do care; I do not depart or let him go at once; I interrogate and examine and cross-examine him, and if I think that he has no virtue, but only says that he has, I reproach him with undervaluing the greater, and overvaluing the less. And this I should say to everyone whom I meet, young and old, citizen and alien, but especially to the citizens, inasmuch as they are my brethren. For this is the command of God, as I would have you know; and I believe that to this day no greater good has ever happened in the state than my service to the God. For I do nothing but go about persuading you all, old and young alike, not to take thought for your persons and your properties, but first and chiefly to care about the greatest improvement of the soul. I tell you that virtue is not given by money, but that from virtue come money and every other good of man, public as well as private. This is my teaching, and if this is the doctrine which corrupts the youth, my influence is ruinous indeed. But if anyone says that this is not my teaching, he is speaking an untruth. Wherefore, O men of Athens, I say to you, do as Anytus bids or not as Anytus bids, and either acquit me or not; but whatever you do, know that I shall never alter my ways, not even if I have to die many times.


For the truth is that I have no regular disciples: but if anyone likes to come and hear me while I am pursuing my mission, whether he be young or old, he may freely come. Nor do I converse with those who pay only, and not with those who do not pay; but anyone, whether he be rich or poor, may ask and answer me and listen to my words; and whether he turns out to be a bad man or a good one, that cannot be justly laid to my charge, as I never taught him anything. And if anyone says that he has ever learned or heard anything from me in private which all the world has not heard, I should like you to know that he is speaking an untruth.

Facciola_standing_thin_shrugIf Facciola’s positive, Socratic inspired, moral vision for the Law is realized, and I for one think it is possible, then it would be a great new advance in the field of Law. The legal universe would be dented again. It would cement Facciola’s own place as a great Twenty-First Century genius, right up there with Jobs and Braman.

I am sure that Judge Facciola will continue his educational efforts in the field of law after the judge title becomes honorific. I hope he will give more specific form to his reform proposals. I cannot hope that his educational efforts will increase, because they are already incredibly prodigious, but I can hope they will now focus on his legacy, on his particular genius for legal ethics.

Many of our judges and attorneys work hard on e-discovery education. Many have great intellectual ability. But not many are capable of displaying the kind of genius we saw in Facciola’s swan-song as a judge at Georgetown. Georgetown is his alma mater, and the students at the Institute, whom we have taken to calling the audience these days, included many of John’s friends and admirers. It brought out the best in Fatch.

There were over 600 students, or fans, or audience, whatever you want to call them, who attended the Georgetown event held at the Ritz Carlton in Tysons Corner. That is a lot of people, almost all of them lawyers. To be honest, that was several hundred lawyers too many for any CLE event. Big may be better in data, but not in education.

I liked the Institute better in its early days when there were just a few dozen attendees. I was there near the beginning as a teacher, and considered my sessions to be classes. The people who paid to attend were considered students. That is the language we used then. Now that has all changed. Now I attend as a presenter, and the people who pay to attend are called an audience. It seems like a transition that Socrates would condemn.

The big crowd and entertainment aspects of this year’s Georgetown Institute reminded me of a big event in Canada last month where I was honored to make the keynote on the first day. I talked about Technology and the Future of the Law, and, as usual, had my razzle dazzle Keynote slides. (I don’t use PowerPoint.) On the second day they had a second keynote. I was surprised to learn he was a professional motivational speaker. Not even a lawyer. My honor faded quickly. The keynote was all salesman rah rah, with no mention of the law at all. That’s not right in my book. It also made me wonder why I was really asked to give the first day’s keynote. Oh well, it was otherwise a great event. But I am now starting to tone down my slides. If I could tone down my enthusiasm, I would too, but I’ve tried, and that’s not possible.

John FacciolaThe task of putting on a show for a large, 600-plus audience was too great a challenge for almost all of the presenters at Georgetown. Do not get me wrong, all of the attorneys tagged to present knew their stuff, but being an expert and being an educator are very different things. Being an expert and an entertainer are almost night and day. Very, very few experts have the skills of Facciola, who, by the way, used no slides at all. (I cannot, however, help but think how it might have been improved by the projection of a large holographic image of Socrates.)

Most of the sessions I attended at Georgetown were like any other CLE, fairly boring. We presenters (at least we were not called performers) were all told to engage our audience, to get them talking, but that almost never happened. The shows were no doubt educational, at least to those who had not seen them before. But entertaining? Even slightly amusing? No, not really. Oh, a few of the panels had their moments, and some were very interesting at times, even to me. A couple even made me laugh a few times. But only one was pure genius. The solo performance of Judge John Facciola.

Fatch_keyboardI found especially compelling his role-playing as Socrates, along with his quotes of Plato, where he read from the Greek original of his high school book from long ago. Judge Facciola presented with a light and witty hand both his dark condemnations of our profession’s failings, and his hope for a different, more virtuous future. His sense of humor about the human predicament made it all work. Humor is a quality possessed by most geniuses, and near geniuses. John radiates with it, and makes you smile, even if you cannot hear or understand all of his words. And even if many of his words anger you. I have no doubt some who heard this talk did not like his bluntness, nor his call for spirituality and for a complete rewrite, with non-lawyer participation, of our professional code of ethics. Well, they did not like Socrates either. It comes with the turf of know-nothing truth-tellers. That is what happens when you speak truth to power.

I thought of trying to share the contents of John’s Apology by consulting my notes and memory. But that could never do it justice. I am no Plato. And really, truth be told, I know Nothing. You have to see the full video of John’s talk for yourself. And you can. Yes! Unlike Socrates’ last talk, Georgetown filmed John’s talk. Not only that, they filmed the whole CLE event. I suspect Georgetown will profit handsomely from all of this. John, of course, was paid nothing, and he would have it no other way.

Dear Georgetown advisors, and Dean Center, good citizens and friends all, please make a special exception regarding payment for the video of John Facciola’s talk. In the spirit of Socrates and your mission as educators, I respectfully request that you publish it online, in full, free of charge. Not the whole event, mind you, but John’s talk, all of his talk. Everyone should see this, not just the bubble people, not just Georgetown graduates and insiders. Let anyone, whether they be rich or poor, listen to these words. Put it on YouTube. Circulate it as widely as you can. Let me know and I will help you to get the word out. Give it away. No charge. You know that is what Socrates would demand.

In the meantime, for all of my dear readers not lucky enough to have been there, here is a short fair use video that I made of Judge Facciola’s concluding words. Here he makes a humorous reference to the final passage he had previously quoted in full from Plato’s Apology. This is at the very end, where Socrates asks his friends to punish his sons, the way he has tormented them, should they fall from the way of virtue. Having a son myself, I will finish this blog with the full quote from Plato and make the same request of you all. And I do not mean the humorous reference to long hair in Facciola’s concluding joke; I mean the real Socratic reference to virtue over money and a puffed up sense of self-importance. A reference that we should all take to heart, not just Adam.

socrates3Do to my sons as I have done to you.

Still I have a favour to ask of them. When my sons are grown up, I would ask you, O my friends, to punish them; and I would have you trouble them, as I have troubled you, if they seem to care about riches, or anything, more than about virtue; or if they pretend to be something when they are really nothing,—then reprove them, as I have reproved you, for not caring about that for which they ought to care, and thinking that they are something when they are really nothing. And if you do this, both I and my sons will have received justice at your hands.

The hour of departure has arrived, and we go our ways—I to die, and you to live. Which is better God only knows.

Visualizing Data in a Predictive Coding Project – Part Two

November 16, 2014

visual-numbersThis is part two of my presentation of an idea for visualization of data in a predictive coding project. Please read part one first.

As most of you already know, the ranking of all documents according to their probable relevance, or other criteria, is the purpose of predictive coding. The ranking allows accurate predictions to be made as to how the documents should be coded. In part one I shared the idea by providing a series of images of a typical document ranking process. I only included a few brief verbal descriptions. This week I will spell it out and further develop the idea. Next week I hope to end on a high note with random sampling and math.

Vertical and Horizontal Axis of the Images

Raw_DataThe visualizations here presented all represent a collection of documents. It is supposed to be a pointillist image, with one point for each document. At the beginning of a document review project, before any predictive coding training has been applied to the collection, the documents are all unranked. They are relatively unknown. This is shown by the fuzzy round cloud of unknown data.

Once the machine training begins all documents start to be ranked. In the most simplistic visualizations shown here the ranking is limited to predicted relevance or irrelevance. Of course, the predictions could be more complex, and include highly relevant and privilege, which is what I usually do. It could also include various other issue classifications, but I usually avoid this for a variety of reasons that would take us too far astray to explain.

Once the training and ranking begin the probability grid comes into play. This grid creates both a vertical and horizontal axis. (In the future, we could add third dimensions too, but let’s start simple.)  The one public comment received so far stated that the vertical axis on the images showing percentages adjacent to the words “Probable Relevant” might give people the impression that it is the probability of a document being relevant. Well, I hope so, because that is exactly what I was trying to do!

The vertical axis shows how the documents are ranked. The horizontal axis shows the number of documents, roughly, at each ranking level. Remember, each point is supposed to represent a specific, individual document. (In the future we could add family overlays, but again, let’s start simple.) A single dot in the middle would represent one document. An empty space would represent zero documents. A wide expanse of horizontal dots would represent hundreds or thousands of documents, depending on the scale.

The diagram below visualizes a situation, common when ranking has just begun, where the computer is uncertain as to how to classify the documents. It classifies most in the 37.5% to 67.5% range of probable relevance. It is all about fifty-fifty at this point. This is the kind of spread you would expect to see if training began with only random sampling input. The diagram indicates that the computer does not really know much yet about the data. It does not yet have any real idea as to which documents are relevant, and which are not.


The vertical axis of the visualization is the key. It is intended to show a running grid from 99% probable relevant to 0.01% probable relevant. Note that 0.01% probable relevant is another way of saying 99.9% probable irrelevant, but remember, I am trying to keep this simple. More complex overlays may be more to the liking of some software users. Also note that the particular numbers I show on these diagrams are arbitrary: 0.01%, 12.5%, 25%, 37.5%, 50%, 67.5%, 75%, 87.5%, 99.9%. I would prefer to see more detail here, and perhaps add a grid showing a faint horizontal line at every 10% interval. Still, the smaller number of lines shown here does have a nice aesthetic appeal, plus it was easier for me to create on the fly for this blog.

Again, let me repeat to be very clear. The vertical grid on these diagrams represents the probable ranking from least likely to be relevant on the bottom, to most likely on the top. The horizontal grid shows the number of documents. It is really that simple.
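
As a rough sketch of the two axes, the underlying binning could look like this. The scores here are simulated, and the cut points simply mirror the arbitrary grid lines mentioned above:

```python
import random

# Toy sketch: bin 1,000 simulated probability-of-relevance scores into the
# strata used on the vertical axis of the diagrams. The count in each band
# is what the horizontal axis would spread out as points.
random.seed(42)
scores = [random.random() for _ in range(1000)]

cuts = [0.0, 0.125, 0.25, 0.375, 0.50, 0.675, 0.75, 0.875, 1.0]
strata = {f"{lo:.1%}-{hi:.1%}": 0 for lo, hi in zip(cuts, cuts[1:])}
for s in scores:
    for lo, hi in zip(cuts, cuts[1:]):
        if lo <= s < hi or (hi == 1.0 and s == 1.0):
            strata[f"{lo:.1%}-{hi:.1%}"] += 1
            break

for band, count in strata.items():
    # vertical axis: probability band; horizontal axis: document count
    print(f"{band:>14}: {count}")
```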

Why Data Visualization Is Important

visualize 2This kind of display of documents according to a vertical grid of probable relevance is very helpful because it allows you to see exactly how your documents are ranked at any one point in time. Just as important, it helps you to see how the alignment changes over time. This empowers you to see how your machine training impacts the distribution.

This kind of direct, immediate feedback greatly facilitates human computer interaction (what I call in my approximately 50 articles on predictive coding the hybrid approach). It makes it easier for the natural human intelligence to connect with the artificial intelligence. It makes it easier for the human SMEs involved to train the computer. The humans, typically attorneys or their surrogates, are the ones with the expertise on the legal issues in the case. This visualization allows them to see immediately what impact particular training documents have upon the ranking of the whole collection. This helps them to select effective training documents. It helps them to attain the goal of separation of relevant from irrelevant documents. Ideally they would be clustered on both the bottom and top of the vertical axis.

For this process to work it is important for the feedback to be grounded in actual document review, and not be a mere intellectual exercise. Samples of documents in the various ranking strata must be inspected to verify, or not, whether the ranking is accurate. That can vary from strata to strata. Moreover, as everyone quickly finds out, each project is different, although certain patterns do tend to emerge. The diagrams used as an example in this blog represent one such typical pattern, although greatly compressed in time. In reality the changes shown here from one diagram to another would be more gradual and have a few unexpected bumps and bulges.

Visualizations like this will speed up the ranking and the review process. Ultimately the graphics will all be fully interactive. By clicking on any point in the graphic you will be taken to the particular document or documents that it represents. You can also click and drag to select a whole range of documents. For instance, you may want to see all documents between 45% and 55%, so you would select that range in the graphic. Or you may want to see all documents in the top 5% probable relevance ranking, so you select that top edge of the graphic. These documents will instantly be shown in the review database. Most good software already has document visualizations with similar linking capacities. So we are not reinventing the wheel here, just applying these existing software capacities to new patterns, namely to document rankings.
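
A minimal sketch of that kind of range selection, with invented document ids and scores:

```python
# Sketch of the click-and-drag selection described above: given per-document
# probable-relevance scores, pull back the documents in a chosen ranking band.
docs = {"doc-001": 0.97, "doc-002": 0.52, "doc-003": 0.48,
        "doc-004": 0.03, "doc-005": 0.95, "doc-006": 0.51}

def select_range(scored_docs, low, high):
    """Return sorted document ids whose probable relevance falls in [low, high]."""
    return sorted(d for d, p in scored_docs.items() if low <= p <= high)

uncertain = select_range(docs, 0.45, 0.55)  # the 45%-55% band
top_edge = select_range(docs, 0.95, 1.00)   # the top 5% probable relevance
```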

These graphic features will allow you to easily search the ranking locations. This will in turn allow you to verify, or correct, the machine’s learning. Where you find that the documents clicked have a correct prediction of relevance, you verify by coding as relevant, or highly relevant. Where the documents clicked have an incorrect prediction, you correct by coding the document properly. That is how the computer learns. You tell it yes when it gets it right, and no when it gets it wrong.

At the beginning of a project many predictions of relevance and irrelevance will be incorrect. These errors will diminish as the training progresses, as the correct predictions are verified, and erroneous predictions are corrected. Fewer mistakes will be made as the machine starts to pick up the human intelligence. To me it seems like a mind to computer transference. More of the predictions will be verified, and the document distributions will start to gather at both ends of the vertical relevance axis. Since the volume of documents is represented by the horizontal axis, more documents will start to bunch together at both the top and bottom of the vertical axis. Since document collections in legal search usually contain many more irrelevant documents than relevant, you will typically see most documents on the bottom.

Visualizations of an Exemplar Predictive Coding Project

In the sample considered here we see unnaturally rapid training. It would normally take many more rounds of machine training than are shown in these four diagrams. In fact, with a continuous active training process, there could be hundreds of rounds per day. In that case the visualization would look more like an animation than a series of static images. But again, I have limited the process here for simplicity’s sake.

1000000_docsAs explained previously, the first thing that happens to the fuzzy round cloud of unknown data, before any training begins, is that the data is processed, deduplicated, deNisted, and non-text and other documents unsuitable for analytics are removed. In addition, other documents necessarily irrelevant to this particular project are bulk-culled out. For example, ESI such as music files, some types of photos, and many email domains, like, for instance, emails from publications such as the NY Times. By good fortune in this example exactly One Million documents remain for predictive coding.

RandomWe begin with some multimodal judgmental sampling, and with a random sample of 1,534 documents. (They are the yellow dots.) Assuming a 95% confidence level, do you know what confidence interval this creates? I asked this question before and repeat it again, as the answer will not come until the final math installment next week.

Next we assume that an SME, and/or his or her surrogates, reviewed the 1,534 sample and found that 384 were relevant and 1,150 were irrelevant. Do you know what prevalence rate this creates? Do you know the projected range of relevant documents within the confidence interval limits of this sample? That is the most important question of all.
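
For readers who want to check their answers before the final installment, here is one standard way to run these numbers, using the normal approximation with the worst-case proportion p = 0.5 for the interval. This is only a sketch; the author's own answers in part three may be framed differently:

```python
import math

n = 1534  # random sample size
z = 1.96  # z-score for a 95% confidence level

# Confidence interval at the worst case p = 0.5:
margin = z * math.sqrt(0.5 * 0.5 / n)  # ~0.025, i.e. roughly +/- 2.5%

# Prevalence point estimate from the SME's review of the sample:
relevant_in_sample = 384
prevalence = relevant_in_sample / n    # ~0.25, i.e. about 25% relevant

# Projected range of relevant documents in the 1,000,000-document collection,
# using a simple interval around the point estimate:
collection = 1_000_000
low = (prevalence - margin) * collection
high = (prevalence + margin) * collection
print(f"margin: {margin:.3f}, prevalence: {prevalence:.3f}")
print(f"projected relevant range: {low:,.0f} to {high:,.0f}")
```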

Next we do the first round of machine training proper. The first round of training is sometimes called the seed set. Now the document ranking according to probable relevance and irrelevance begins. Again for simplicity’s sake, we assume that the analytics is directed towards relevance alone. In fact, most projects would also include high-relevance and privilege.

data-visual_Round_2In this project the data ball changed to the following distribution. Note the lighter colors represent less density of documents. Red points represent documents coded or predicted as relevant, and blue points documents coded or predicted as irrelevant. All predictive coding projects are different and the distributions shown here are just one among near countless possibilities. Here there are already more documents trained on irrelevance than relevance. This is in spite of the fact that the active search was to find relevant documents, not irrelevant documents. This is typical in most review projects, where you have many more irrelevant than relevant documents overall, and where it is easier to spot and find irrelevant than relevant.

data-visual_Round_3Next we see the data after the second round of training. The division of the collection of documents into relevant and irrelevant is beginning to form. The largest collection of documents is the blue points at the bottom. They are the documents that the computer predicts are irrelevant based on the training to date. There is also a large collection of points shown in red at the top. They are the ones where the computer now thinks there is a high probability of relevance. Still, the computer is uncertain about the vast majority of the documents: the red in the third strata from the top, the blue in the third strata from the bottom, and the many in the grey, the 37.5% to 67.5% probable relevance range. Again we see an overall bottom heavy distribution. This is a typical pattern because it is usually easier to train on irrelevance than relevance.

As noted before, the training could be continuous. Many software programs offer that feature. But I want to keep the visualizations here simple, and not make an animation, and so I do not assume here a literally continuous active learning. Personally, although I do like to keep the training continuous throughout the review, I like the actual computer training to come in discrete stages that I control. That gives me a better understanding of the impact of my machine training. The SME human trains the machine, and, in an ideal situation, the machine also trains the SME. That is the kind of feedback that these visualizations enhance.

data-visual_Round_4Next we see the data after the third round of training. Again, in reality it would typically take more than three rounds of training to reach this relatively mature state, but I am trying to keep this example simple. If a project did progress this fast, it would probably be because a large number of documents were used in the prior rounds. The set of documents about which the computer is now uncertain — the grey area, and the middle two brackets — is now much smaller.

The computer now has a high probability ranking for most of the probable relevant and probable irrelevant documents. The largest number of documents is at the blue bottom, where the computer predicts they have a near zero chance of being classified relevant. Again, most of the probable predictions, those in the top and bottom three brackets, are located in the bottom three brackets: the documents predicted to have less than a 37.5% chance of being relevant. Again, this kind of distribution is typical, but there can be many variances from project to project. Here we see a top loading where most of the probable relevant documents are in the top 12.5% ranking. In other words, they have an 87.5% probable relevance ranking, or higher.

Next we see the data after the fourth round of training. It is an excellent distribution at this point. There are relatively few documents in the middle, which means there are relatively few documents about which the computer is uncertain as to their probable classification. This pattern is one factor among several to consider in deciding whether further training and document review are required to complete your production.

Another important metric to consider is the total number of documents found to be probable relevant, and how that compares with the random sample prediction. Here is where the math comes in, and an understanding of what random sampling can and cannot tell you about the success of a project. You consider the spot projection of total relevance based on your initial prevalence calculation, but, much more important, you consider the actual range of documents under the confidence interval. That is what really counts when dealing with prevalence projections and random sampling. That is where the plus or minus confidence interval comes into play, as I will explain in detail in this third and final installment of the blog series.

In the meantime, here is the document count of the distribution roughly pictured in the final diagram above, which to me looks like an upside-down, fragile champagne glass. We see that exactly 250,000 documents have a 50% or higher probable relevance ranking, and 750,000 documents have a 49.9% or less probable relevance ranking. Of the probable relevant documents, 15,000 fall in the 50% to 62.5% range. Another 10,000 documents fall in the 37.5% to 49.9% probable relevance range. This is also fairly common, as we often see less on the barely irrelevant side than we do on the barely relevant side. As a general rule I review with humans all documents that are 50% or higher probable relevance, and do not review the rest. I do, however, sample and test the rest, the documents with less than a 50% probable relevance ranking. Also, in some projects I review far less than the top 50%. That all depends on proportionality constraints, and on the document ranking distribution, the kind of distribution that these visualizations will show.
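The range arithmetic just described can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's method: it uses the simple normal-approximation (Wald) binomial interval and the sample numbers from this example project (1,534 documents sampled, 384 found relevant, a collection of one million).

```python
import math

def prevalence_range(sample_size, sample_relevant, collection_size, z=1.96):
    """Project a spot estimate and a high-low range of relevant documents.

    Uses the normal-approximation (Wald) binomial interval; z=1.96
    corresponds to a 95% confidence level. For very low prevalence a
    binomial calculator gives better results.
    """
    p = sample_relevant / sample_size                 # spot prevalence
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    spot = round(p * collection_size)
    low = round(max(p - margin, 0.0) * collection_size)
    high = round(min(p + margin, 1.0) * collection_size)
    return spot, low, high

spot, low, high = prevalence_range(1534, 384, 1_000_000)
print(f"spot {spot:,}, range {low:,} to {high:,}")
```

Note how wide the high-low range is compared to the spot projection; that spread, not the single number, is what the confidence interval really tells you.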

In addition to this metrics analysis, another important factor to consider in deciding whether our search and review efforts are now complete is how much change in ranking there has been from one training round to the next. Sometimes there may be no change at all. Sometimes there may be only very slight changes. If the changes from the last round are large, that is an indication that more training should still be tried, even if the distribution already looks optimal, as we see here.
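One way to put a number on round-to-round change is to count how many documents moved from one probability bracket to another between trainings. This is a hypothetical sketch of such a stability metric; the function name and the 12.5% bracket width are my own choices for illustration, not a feature of any particular tool.

```python
def ranking_churn(prev_scores, new_scores, bracket=0.125):
    """Fraction of documents whose probability bracket changed
    between two training rounds."""
    moved = sum(
        int(a / bracket) != int(b / bracket)
        for a, b in zip(prev_scores, new_scores)
    )
    return moved / len(prev_scores)

# Three documents; only the middle one changed brackets (40% -> 55%)
print(ranking_churn([0.10, 0.40, 0.90], [0.12, 0.55, 0.91]))
```

A churn near zero across consecutive rounds is one sign, alongside the other metrics discussed here, that the ranking has stabilized.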

Another even more important quality control factor is how correct the computer has been in the last few rounds of its predictions. Ideally, you want to see the rate of error decreasing to a point where you see no errors in your judgmental samples. You want your testing of the computer's predictions to show that it has attained a high degree of precision. That means there are few documents predicted relevant that actual review by human SMEs shows are in fact irrelevant. This kind of error is known as a False Positive. Much more important to quality evaluation is the discovery of documents predicted irrelevant that are actually relevant. This kind of error is known as a False Negative. The False Negatives are your real concern in most projects because legal search is usually focused on recall, not precision, at least within reason.
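In code, these two measures reduce to simple ratios of the error counts. The numbers below are hypothetical, chosen only to illustrate the definitions.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision: of the documents predicted relevant, the share that
    really are. Recall: of all truly relevant documents, the share found."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

# 90 correct relevant predictions, 10 False Positives, 30 False Negatives
p, r = precision_recall(90, 10, 30)
print(f"precision {p:.0%}, recall {r:.0%}")  # precision 90%, recall 75%
```

Note how the False Negatives pull down recall, which is why they are the real concern in legal search.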

The final distinction to note in quality control is one that might seem subtle, but really is not. You must also factor in relevance weight. You never want a False Negative to be a highly relevant document. If that happens to me, I always commence at least one more round of training. Even missing a document that is not highly relevant, not hot, but is a strong relevant document, and one of a type not seen before, is typically a cause for further training. This is, however, not an automatic rule as with the discovery of a hot document. It depends on a variety of factors having to do with relevance analysis of the particular case and document collection.

In our example we are going to assume that all of the quality control indicators are positive, and a decision has been made to stop training and move on to a final random sample test.

A second random sample comes next. That test and visualization will be provided next week, along with the promised math and sampling analysis.

Math Quiz

In part one, and again here, I asked some basic math questions on random sampling, prevalence, and recall. So far no one has attempted to answer the questions posed. Apparently, most readers here do not want to be tested. I do not blame them. This is also what I find in my online training program, e-DiscoveryTeamTraining.com, where only a small percentage of the students who take the program elect to be tested. That is fine with me, as it means one less paper to grade, and most everyone passes anyway. I do not encourage testing. You know if you get it or not. Testing is not really necessary.

The same applies to answering math questions in a public blog. I understand the hesitancy. Still, I hope many privately tried, or will try, to solve the questions and came up with the correct answers. In part three of this blog I will provide the answers, and you will know for sure if you got them right. There is still plenty of time to figure it out on your own. The truly bold can post their answers in the comments below. Of course, this is all pretty basic stuff to true experts of this craft. So, to my fellow experts out there, you have yet another week to take some time and strut your stuff by sharing the obvious answers. Surely I am not the only one in the e-discovery world bold enough to put my reputation on the line by sharing opinions and analysis in public for all to see (and criticize). Come on. I do it every week.

Math and sampling are important tools for quality control, but, as Professor Gordon Cormack, a true wizard in the area of search, math, and sampling, likes to point out, sampling alone has many inherent limitations. Gordon insists, and I agree, that sampling should only be part of a total quality control program. You should never rely on random sampling alone, especially in low prevalence collections. Still, when sampling, prevalence, and recall are included as part of an overall QC effort, the net effect is very reassuring. Unless I know that I have an expert like Gordon on the other side, and so far that has never happened, I want to see the math. I want to know about all of the quality control and quality assurance steps taken to try to find the information requested. If you are going to protect your client, you need to learn this too, or have someone at hand who already knows it.

This kind of math, sampling, and other process disclosures should convince even the most skeptical adversary or judge. That is why it is important for all attorneys involved with legal search to have a clear mathematical understanding of the basics. Visualizations alone are inadequate, but, for me at least, visualizations do help a lot. All kinds of data visualizations, not just the ones presented here, provide important tools to help lawyers understand how a search project is progressing.

Challenge to Software Vendors

The simplicity of the design presented here is a key part of the power and strength of the visualization. It should not be too difficult to write code to implement it. We need this. It will help users to better understand the process. It will not cost too much to implement, and what it does cost should be recouped soon in higher sales. Come on vendors, show me you are listening. Show me you get it. If you have a software demo that includes this feature, then I want to see it. Otherwise not.

All good predictive coding software already ranks the probable relevance of documents, so we are not talking about an enormous coding project. This feature would simply add a visual display to calculations already being made. I could hand make these calculations myself using an Excel spreadsheet, but that is time consuming and laborious. This kind of visualization lends itself to computer generation.
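As a rough measure of how little computation the display requires, here is a sketch that bins probability scores into the 12.5% brackets used in the diagrams and renders a text histogram. The scores are randomly generated stand-ins for a real ranking, and the function is my own illustration, not any vendor's feature.

```python
from collections import Counter
import random

def strata_histogram(scores, bracket=0.125, width=40):
    """Return one text-histogram line per probability bracket, top first."""
    counts = Counter(min(int(s / bracket), 7) for s in scores)
    top = max(counts.values(), default=1)
    lines = []
    for i in range(7, -1, -1):
        n = counts.get(i, 0)
        bar = "#" * round(width * n / top)
        lines.append(f"{i * bracket:5.1%}-{(i + 1) * bracket:6.1%} {n:8,d} {bar}")
    return lines

# A bottom-heavy distribution, invented for illustration
random.seed(1)
scores = [random.betavariate(0.3, 0.9) for _ in range(10_000)]
print("\n".join(strata_histogram(scores)))
```

A real implementation would simply feed the ranking engine's existing scores into a graphical version of this display.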

I have many other ideas for predictive coding features, including other visualizations, that are much more complex and challenging to implement. This simple grid explained here is an easy one to implement, and will show me, and the rest of our e-discovery community, who the real leaders are in software development.


The primary goal of the e-Discovery Team blog is educational, to help lawyers and other e-discovery professionals. In addition, I am trying to influence what services and products are provided in e-discovery, both legal and technical. In this blog I am offering an idea to improve the visualizations that most predictive coding software already provides. I hope that all vendors will include this feature in future releases of their software. I have a host of additional ideas to improve legal search and review software, especially the kind that employs active machine learning. They include other, much more elaborate visualization schemes, some of which have been alluded to here.

Someday I may have time to consult on all of the other, more complex ideas, but, in the meantime, I offer this basic idea for any vendor to try out. Until vendors start to implement even this basic idea, anyone can at least use their imagination, as I now do, to follow along. These kinds of visualizations can help you to understand the impact of document ranking on your predictive coding review projects. All it takes is some idea of the number of documents in the various probable relevance ranking strata. Try it on your next predictive coding project, even if it is just rough images from your own imagination (or an Excel spreadsheet). I am sure you will see for yourself how helpful this can be to monitor and understand the progress of your work.



Visualizing Data in a Predictive Coding Project

November 9, 2014

This blog will share a new way to visualize data in a predictive coding project. I only include a brief description this week. Next week I will add a full description of this project. Advanced students should be able to predict the full text from the images alone. Study the images and try to figure out the details of what is going on.

Soon all good predictive coding software will include visualizations like this to help searchers to understand the data. The images can be automatically created by computer to accurately visualize exactly how the data is being analyzed and ranked. Experienced searchers can use this kind of visual information to better understand what they should do next to efficiently meet their search and review goals.

For a game, try to figure out the high and low number of relevant documents that you must find in this review project to claim that you have a 95% confidence level of having found all relevant documents, the mythical total recall. This high-low range will be wrong one time out of twenty; that is what the 95% confidence level means, but still, this knowledge is helpful. The correct answer to questions of recall and prevalence is always a high-low range of documents, never just one number, and never a percentage. Also, there are always confidence level caveats. Still, with these limitations in mind, for extra points, state what the spot projection is for prevalence. These illustrations and short descriptions provide all of the information you need to calculate the answers.
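The "wrong one time out of twenty" point can be checked with a short simulation: draw many random samples from a collection with a known prevalence and count how often the resulting 95% interval actually captures the true value. The prevalence and sample size here are invented for illustration, and the interval is the simple normal approximation.

```python
import math
import random

random.seed(42)
TRUE_PREVALENCE = 0.25   # assumed known, for illustration only
SAMPLE_SIZE = 1534
TRIALS = 2000

hits = 0
for _ in range(TRIALS):
    # Draw one random sample and build its 95% confidence interval
    relevant = sum(random.random() < TRUE_PREVALENCE for _ in range(SAMPLE_SIZE))
    p = relevant / SAMPLE_SIZE
    margin = 1.96 * math.sqrt(p * (1 - p) / SAMPLE_SIZE)
    if p - margin <= TRUE_PREVALENCE <= p + margin:
        hits += 1

print(f"{hits / TRIALS:.1%} of intervals captured the true prevalence")
```

Run it and the coverage comes out close to 95%: roughly one interval in twenty misses, just as the confidence level promises.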

The project begins with a collection of documents here visualized by the fuzzy ball of unknown data.


Next the data is processed: deduplicated, DeNISTed, and culled of non-text and other documents unsuitable for analytics. By good fortune exactly one million documents remain.


We begin with some multimodal judgmental sampling, and with a random sample of 1,534 documents. Assuming a 95% confidence level, what confidence interval does this create?
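For readers checking their answer, the conventional worst-case margin of error for a simple random sample can be computed as follows. This is a sketch using the standard p = 0.5 assumption and the normal approximation.

```python
import math

def margin_of_error(sample_size, z=1.96, p=0.5):
    """Worst-case (p = 0.5) margin of error for a simple random sample;
    z = 1.96 corresponds to a 95% confidence level."""
    return z * math.sqrt(p * (1 - p) / sample_size)

print(f"{margin_of_error(1534):.2%}")  # prints 2.50%
```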


Assume that an SME reviewed the 1,534 sample and found that 384 were relevant and 1,150 were irrelevant.


Training Begins

Next we do the first round of machine training. The first round of training is sometimes called the seed set. Now the document ranking according to probable relevance and irrelevance begins. To keep it simple we only show the relevance ranking, and not the irrelevance metrics display. The top represents 99.9% probable relevance; the bottom the inverse, 00.1% probable relevance. Put another way, the bottom would represent 99.9% probable irrelevance. For simplicity's sake we also assume that the analytics is directed towards relevance alone, whereas most projects would also include high-relevance and privilege. In this project the data ball changed to the following distribution. Note that the lighter colors represent a lower density of documents. Red points represent documents coded or predicted as relevant, and blue as irrelevant. All predictive coding projects are different, and the distributions shown here are just one among nearly countless possibilities.


Next we see the data after the second round of training. Note that with most software the training could be continuous, but I like to control when the training happens in order to better understand the impact of my machine training. The SME human trains the machine, and, in an ideal situation, the machine also trains the SME. The human SME comes to understand how the machine is learning. The SME learns where the machine needs the most help to tune into their conception of relevance. This kind of cross-communication makes it easier for the artificial intelligence to properly boost the human intelligence.


Next we see the data after the third round of training. The machine is learning very quickly. In most projects it takes longer than this to attain this kind of ranking distribution. What does this tell us about the number of documents between rounds of training?


Now we see the data after the fourth round of training. It is an excellent distribution, and so we decide to stop and test. The second random sample comes next. That visualization, and a full description of the project, will be provided next week. In the meantime, leave your answers to the questions in the comments below. This is a chance to strut your stuff. If you prefer, send me your answers, and questions, by private email.


