Visualizing Data in a Predictive Coding Project – Part Three

November 30, 2014

This is part three of my presentation of an idea for visualization of data in a predictive coding project. Please read part one and part two first. This concluding blog in the visualization series also serves as a standalone lesson on the basics of math, sampling, probability, prevalence, recall and precision. It summarizes some of my current thoughts on quality control and quality assurance in large scale document reviews. Bottom line, there is far more to quality control than doing the math, but still, sampling and metric analysis are helpful. So too is creative visualization of the whole process.

Law, Science and Technology

This is the area in which scientists on e-discovery teams excel. I recommend that every law firm, corporate and vendor e-discovery team have at least one scientist to help them. Technologists alone are not sufficient. E-discovery teams know this, and all have engineers working with lawyers, but very few yet have scientists working with engineers and lawyers. Teams without scientists are like two-legged stools.

Also, and this seems obvious, you need search-sophisticated lawyers on e-discovery teams too. I am starting to see this error more and more lately, especially among vendors. Engineers may think they know the law, a very common delusion, but they are wrong. The same delusional thinking sometimes even affects scientists. Both engineers and scientists tend to oversimplify the law and do not really understand legal discovery. They do not understand the larger context and overall processes and policies.

John Tredennick

For legal search to be done properly, it must not only include lawyers, the lawyers must lead. Ideally, a lawyer will be in charge, not in a domineering way (my way or the highway), but in a cooperative multi-disciplinary team sort of way. That is one of the strong points I see at Catalyst. Their team includes tons of engineers/technologists, like any vendor, but also scientists, and lawyers. Plus, and here is the key part, the CEO is an experienced search lawyer. That means not only a law degree, but years of legal experience as a practicing attorney doing discovery and trials. A fully multidisciplinary team with an experienced search lawyer as leader is, in my opinion, the ideal e-discovery team. Not only for vendors, but for corporate e-discovery teams, and, of course, law firms.

Many disagree with me on this, as many laymen and non-practicing attorneys resent my law-first orientation. Technologists are now often in charge, especially on vendor teams. In my experience these technologists do not properly respect the complexity of legal knowledge and process. They often badmouth lawyers and law firms behind their backs. Their products and services suffer as a result. It is a recipe for disaster.

On many vendor teams, the lawyers are not part of the leadership; if lawyers are on the team at all, it is at a low level and they are not respected. This is all wrong because the purpose of e-discovery teams is the search for evidence in a legal context, typically a lawsuit. There is only one leg of the stool that has ever studied evidence.

It takes all three disciplines for top quality legal search: scientists, technologists and lawyers. If you cannot afford a full-time scientist, then you should at least hire one as a consultant on the harder cases.

The scientists on a team may not like the kind of simplification I will present here on sampling, prevalence and recall. They typically want to go into far greater depth and provide multiple caveats on math and probability, which is fine, but it is important to start with a foundation of basics. That is what you will find here: the basics of math and probabilities, and applications of these principles from a lawyer’s point of view, not a scientist’s or engineer’s.

Still, the explanations here are informed by the input of several outstanding scientists. A special shout out and thanks goes to Gordon Cormack. He has been very generous with his time and patient with my incessant questions. Professor Cormack has been a preeminent voice in Information Science and search for decades now, well before he started teaming with Maura Grossman to study predictive coding. I appreciate his assistance, and, of course, any errors and oversimplifications are solely my own.

Now let’s move on to the math part you have been waiting for, and begin by revisiting the hypothetical we set out in parts one and two of this visualization series.

Calculating and Visualizing Prevalence

Recall that we have exactly 1,000,000 documents remaining for predictive coding after culling. I previously explained that this particular project began with culling and multimodal judgmental sampling, and with a random sample of 1,534 documents. Please note this is not intended to refer to all projects. This is just an example to have data flows set up for visualization purposes. If you want to see my standard workflows, see LegalSearchScience.com and Electronic Discovery Best Practices, EDBP.com, on the Predictive Coding page. You will see, for instance, that another activity is always recommended, especially near the beginning of a project, namely Relevancy Dialogues (step 1).

Assuming a 95% confidence level, a sample of 1,534 documents creates a confidence interval of 2.5%. This means your sample is subject to a 2.5% error rate in both directions, high and low, for a total error range of 5%. This is 5% of the total one million document corpus (50,000 documents), not just 5% of the 1,534 document sample (77 documents).
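For readers who want to check where a figure like 1,534 comes from, one common way to derive it is the standard normal-approximation sample size formula with a finite population correction. The sketch below is only an illustration of that formula under those assumptions; it is not necessarily the exact calculator used for this project, though it lands on roughly the same number.

```python
import math

def sample_size(z=1.96, margin=0.025, population=1_000_000, p=0.5):
    """Approximate simple random sample size for estimating a proportion,
    using the normal approximation plus a finite population correction.
    p=0.5 is the most conservative (largest sample) prevalence assumption."""
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)   # infinite-population size
    return n0 / (1 + (n0 - 1) / population)       # finite population correction

print(round(sample_size()))  # roughly 1534 at 95% confidence, +/-2.5%, 1,000,000 docs
```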

In our hypothetical the SME, who had substantial help from a top contract reviewer, studied the 1,534 sampled documents. The SME found that 384 were relevant and 1,150 were irrelevant. By the way, when done properly this review of 1,534 documents should only cost between $1,000 and $2,000, with most of that going to the SME expense, not the contract reviewer expense.

The spot projection of prevalence here is 25%. This is simple division. Divide the 384 relevant by the total population: 384/1,534. You get 25%. That means that one out of four of the documents sampled was found to be relevant. Random sampling tells us that this same ratio should apply, at least roughly, to the larger population. You could at this point simply project the sample percentage onto the entire document population. You would thus conclude that approximately 250,000 documents will likely be relevant. But this kind of projection alone is nearly meaningless in lower prevalence situations, which are common in legal search. It is also of questionable value in this hypothetical where there is a relatively high prevalence of 25%.

When doing probability analysis based on sampling you must always include both the confidence level, here 95%, and the confidence interval, here 2.5%. The Confidence Level means that 5 times out of 100 the projection will be in error. More specifically, the Confidence Level means that if you were to repeat the sampling 100 times, the resulting Confidence Interval (here 2.5%) would contain the true value (here 250,000 relevant documents) at least 95% of the time. Conversely, this means that it would miss the true value at most 5% of the time.

In our hypothetical the true value is 250,000 relevant documents. On one sample you might get a Confidence Interval of 225,000 – 275,000, as we did here. But with another sample you might get 215,000 – 265,000. On another you might get 240,000 – 290,000.  These all include the true value. Occasionally (but no more than 5 times in a hundred), you might get a Confidence Interval like 190,000 – 240,000, or 260,000 – 310,000, that excludes the true value. That is what a 95% Confidence Level means.

The confidence interval range here is simply calculated by adding 2.5% to the 25%, and subtracting 2.5% from the 25%. This creates a percentage range of 22.5% to 27.5%. When you project this confidence interval onto the entire document collection you get a range of relevant documents of between 225,000 (22.5%*1,000,000) and 275,000 (27.5%*1,000,000).

This simple calculation, called a Classical or Gaussian Estimation, works well in high prevalence situations. But in situations where the prevalence is low, say 3% or less, and even in this hypothetical where the prevalence is a relatively high 25%, the accuracy of the projected range can be improved by adjusting the 22.5% to 27.5% confidence interval range. The adjustment is performed by using what is called a Binomial calculation, instead of the Classical or Gaussian calculation. Ask a scientist for the particulars on this, not me. I just know to use a standard Binomial Confidence Interval Calculator to determine the range in most legal search projects. For some immediate guidance, see the definitions of Binomial Estimation and Classical or Gaussian Estimation in The Grossman-Cormack Glossary of Technology Assisted Review.

With the Binomial Calculator you again enter the sample as a fraction, with the numerator being the number of relevant documents and the denominator the total number of documents sampled. Again, this is just like before: you divide 384 by 1,534. The basic answer is also the same 25% point or spot projection, but the range with a Binomial Calculator is now slightly different. Instead of a simple plus or minus 2.5%, which produces 22.5% to 27.5%, the binomial calculation creates a tighter range of 22.9% to 27.3%. The range in this hypothetical is thus a little tighter than 5%; it is 4.4% (from 22.9% to 27.3%). Therefore the projected range of relevant documents using the Binomial interval calculation is between 229,000 (22.9%*1,000,000) and 273,000 (27.3%*1,000,000) documents.
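For readers who want to check these numbers themselves, here is a minimal sketch of one common binomial-style interval, the Wilson score interval. It is only an illustration: an online binomial calculator may use a slightly different method (Clopper-Pearson, for example), so the endpoints can differ by a tenth of a percent or so, but on these inputs it reproduces roughly the 22.9% to 27.3% range.

```python
import math

def wilson_interval(relevant, sampled, z=1.96):
    """Wilson score confidence interval for a sample proportion (95% default)."""
    p = relevant / sampled
    denom = 1 + z**2 / sampled
    center = (p + z**2 / (2 * sampled)) / denom
    half = z * math.sqrt(p * (1 - p) / sampled + z**2 / (4 * sampled**2)) / denom
    return center - half, center + half

low, high = wilson_interval(384, 1534)
print(f"{low:.1%} to {high:.1%}")                      # roughly 22.9% to 27.3%
print(f"{low * 1_000_000:,.0f} to {high * 1_000_000:,.0f} documents")
```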

 

This simple random sample of 1,534 documents from the 1,000,000 document collection shows that 95 times out of 100 the correct number of relevant documents will be between 229,000 and 273,000.

This also means that no more than five times out of 100 will the calculated interval, here between 22.9% and 27.3%, fail to capture the true value, the true number of relevant documents in the collection. Sometimes the true value, the true number of relevant documents, may be less than 229,000 or greater than 273,000. This is shown in part by the graphic below, which is another visualization that I like to use to help me to visualize what is happening in a predictive coding project. Here the true value lies somewhere between 229,000 and 273,000, or at least 95 times out of 100 it does. When, 5 times out of 100, the true value lies outside the range, the divergence is usually small. Most of the time, when the confidence interval misses the true value, it is a near miss. Cases where the confidence interval is far below, or far above, the true value are exceedingly rare.

[Figure: corpus data and recall range diagram]

The Binomial adjustment to the interval calculation is required for low prevalence populations. For instance, if the prevalence was only 2%, and the interval was again 2.5%, the error range would include a negative number, -0.5% (2% - 2.5%). It would run from -0.5% to 4.5%. That projection means from between zero relevant documents and 45,000. (Obviously you cannot have a negative number of relevant documents.) The zero relevant documents figure is also known to be wrong, because you could not have performed the calculation unless there were some relevant documents in the sample. So in this low prevalence situation the Binomial calculation method is required to produce anything close to accurate projections.

For example, assuming again a 1,000,000 document corpus, and a 95% +/- 2.5% sample consisting of 1,534 documents, a 2% prevalence results from finding 31 relevant documents. Using the binomial calculator you get a range of 1.4% to 2.9%, instead of -0.5% to 4.5%. The binomial interval range results in a projection of between 14,000 relevant documents (instead of the absurd zero relevant documents) and 29,000 relevant documents.
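Reusing the hedged Wilson sketch from above on this low prevalence example gives roughly the same endpoints, and, unlike the Gaussian calculation, it can never go negative:

```python
low, high = wilson_interval(31, 1534)   # low prevalence example: 31 relevant found
print(f"{low:.1%} to {high:.1%}")       # roughly 1.4% to 2.9%, never below zero
```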

Even with the binomial calculation adjustment, the reliability of using probability projections to calculate prevalence is the subject of much controversy among information scientists and probability statisticians (most good information scientists doing search are also probability statisticians, but not vice versa). The reliability of such range projections is controversial in situations like this, where the sample size is low, here only 1,534 documents, and the likely percentage of relevant documents is also low, here only 2%. In this second scenario, where only 31 relevant documents were found in the sample, there are too few relevant documents for sampling to be as reliable as it is in higher prevalence collections. I still think you should do it. It does provide good information. But you should not rely completely on these calculations, especially when it comes to the step of trying to calculate recall. You should use all of the quality control procedures you know, including the others listed previously.

Calculating Recall Using Prevalence

Recall is another percentage. It represents the proportion of the total number of relevant documents in a collection that have been found. So, if you happen to know that there are 10 relevant documents in a collection of 100 documents, and you correctly identify 9 relevant documents, then you have attained a 90% recall level. Referring to the hopefully familiar Search Quadrant shown right, this means that you would have one False Negative and nine True Positives. If you only found one out of the ten, you would have 10% recall (and would likely be fired for negligence). This would be nine False Negatives and one True Positive.

The calculation of Precision requires information on the total number of False Positives. In the first example, where you found nine of the ten relevant documents, suppose you also found nine more that you thought were relevant but were not, that is, nine False Positives. What would your precision be? You have found a total of 18 documents that you thought were relevant, and it turns out that only half of them, 9 documents, were actually relevant. That means you had a precision rate of 50%. Simple. Precision could also easily be visualized by various kinds of standard graphs. I suggest that this be added to all search and review software. It is important to see, but, IMO, when it comes to legal search, the focus should be on Recall, not Precision.
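In code form this toy example is just two divisions. The counts below are hard-coded from the example above, nothing more:

```python
true_positives, false_negatives, false_positives = 9, 1, 9

recall = true_positives / (true_positives + false_negatives)      # 9 / 10 = 90%
precision = true_positives / (true_positives + false_positives)   # 9 / 18 = 50%
print(f"recall {recall:.0%}, precision {precision:.0%}")
```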

The problem with calculating Recall in legal search is that you never know the total number of relevant documents; that is the whole point of the search. If you knew, you would not have to search. But in fact no one ever knows. Moreover, in large document collections, there is no way to ever know exactly how many relevant documents there are. All you can ever do is calculate probable ranges. You might think that absolute knowledge could come from human review of all One Million documents in our hypothetical. But that would be wrong, because humans make too many mistakes, especially with legal judgments as fluid as relevancy determinations. So too do computers, dependent as they are on training by all too fallible humans.

Bottom line, we can never know for sure how many relevant documents are in the 1,000,000 document collection, and so we can never know with certainty what our Recall rate is. But we can make a very educated guess, one that is almost certainly correct, especially when a range of Recall percentages is used, instead of just one particular number. We can narrow down the grey area. All experienced lawyers are familiar conceptually with this problem. The law is made in a similar process. It arises case by case out of large grey areas of uncertainty.

The reliability of our sample based Recall guess decreases as prevalence lowers. It is a problem inherent to all random sampling. It is not unique to legal evidence search. What is unique to legal search is the importance of Recall to begin with. In many other types of search Recall is not that important. Google is the prime example of this. You do not need to find all websites with relevant information, just the more useful, generally the most popular web pages. Law is moving away from Recall focus, but slowly. And it is more of a move right now from Recall of simple relevance to Recall of the highly relevant. In that sense legal search will in the long run become more like mainstream Googlesque search. But for now the law is still obsessed with finding all of the evidence in the perhaps mistaken belief that justice requires the whole truth. But I digress.

In our initial hypothetical of a 25% prevalence, the accuracy of the recall guess is actually very high, subject primarily to the 95% confidence level limitation. Even in the lower 2% hypothetical, the recall calculation has value. Indeed, it is the basis of much scientific research concerning things like rare diseases and rare species. Again, we enter a hotly debated area of science that is beyond my expertise (although not my interest).

Getting back to our example, where we have a 95% confidence level that there are between 229,000 and 273,000 relevant documents in the 1,000,000 document collection: as described before in part one of this series, we assume that after only four rounds of machine training we have reached a point in the project where we are not seeing a significant increase in relevant documents from one round of machine training to the next. The change in document probability ranking has slowed, and the visualization of the ranking distribution looks something like the upside down champagne glass shown right.

At this point a count shows that we have now found 250,000 relevant documents. This is critical information that I have not shared in the first two blogs, information that for the first time allows for a Recall calculation. I held back this information until now for simplicity purposes, plus it allowed me to add a fun math test. (Well, the winner of the test, John Tredennick, CEO of Catalyst, thought it was fun.) In reality you would keep a running count of relevant documents found, and you would have a series of Recall visualizations. Still, the critical Recall calculation takes place when you have decided to stop the review and test.

Assuming we have found 250,000 relevant documents, this means that we have attained anywhere from 91.6% to 100% recall. At least it means we can have a 95% confidence level that we have attained a result somewhere in that range. Put another way, we can have a 95% confidence level that we have attained a 91.6% or higher recall rate. We cannot have 100% confidence in that result. Only 95%. That means that one time out of twenty (the 5% left over from the 95% confidence level) there may be more than 273,000 relevant documents. That in turn means that one time in twenty we may have attained less than a 91.6% recall in this circumstance.

[Figure: standard deviation bell curve]

The low side Recall calculation of 91.6% is derived by dividing the 250,000 found by the high end of the confidence interval, 273,000 documents. If the spot projection happens to be exactly right, which is rare, and in this hypo is now looking less and less likely (we have, after all, now found 250,000 relevant documents, or at least think we have), then the math would be 100% recall (250,000/250,000). That is extremely unlikely. Indeed, information scientists love to say that the only way to attain 100% recall is with 0% precision, that is, to select all documents. This statement is, among other things, a hyperbole intended to make the uncertainty point inherent in sampling and confidence levels. The 95% Confidence Level uncertainty is shown by the long tails on either side of the standard bell curve pictured above.
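A minimal sketch of this recall range arithmetic, using the hypothetical's numbers. The function name and the interval endpoints are just assumptions carried over from the binomial calculation above:

```python
def recall_range(found, low_estimate, high_estimate):
    """Recall range implied by a confidence interval on total relevant documents.
    The low-side recall divides by the high-end estimate; recall caps at 100%."""
    return found / high_estimate, min(found / low_estimate, 1.0)

low_recall, high_recall = recall_range(250_000, 229_000, 273_000)
print(f"{low_recall:.1%} to {high_recall:.0%}")   # roughly 91.6% to 100%
```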

You can never have more than 100% recall, of course, so we do not say we have attained anywhere between 91.6% and 109% recall. The low-end estimate of 229,000 relevant documents has, at this point in the project, been shown to be wrong by the discovery and verification of 250,000 relevant documents. I say shown, not proven, because of the previously mentioned fluidity of relevance and the inability of humans to make consistent final judgments when, as here, vast numbers of documents are involved.

For a visualization of recall I like the image of a thermometer, like a fund-raising goal chart, but with a twist of two different measures. On the left side put the low-end measure, here the 22.9% confidence interval with 229,000 documents, and on the right side the high-end measure, the 27.3% confidence interval with 273,000 documents. You can thus chart your progress from the two perspectives at once, the low probability error rate and the high probability error rate. This is shown on the diagram to the right. It shows the metrics of our hypothetical, where we have found and confirmed 250,000 relevant documents. That just happens to represent 100% recall on the low end of the probability error range, using the 22.9% confidence interval. But as explained before, the 250,000 relevant documents found also represent only 91.6% recall on the high end, using the 27.3% confidence interval. You will never really know which is accurate, except that it is safe to bet you have not in fact attained 100% recall.
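As one possible way to render this dual thermometer idea, here is a rough matplotlib sketch. The layout is only an illustration, the numbers are the hypothetical's, and any real implementation would of course live inside the review tool rather than a script:

```python
import matplotlib.pyplot as plt

found = 250_000
low_target, high_target = 229_000, 273_000   # 22.9% and 27.3% interval endpoints

fig, ax = plt.subplots(figsize=(4, 5))
ax.bar(["low end\n(22.9%)", "high end\n(27.3%)"], [found, found],
       color="tomato", label="relevant documents found")
ax.axhline(low_target, color="green", linestyle="--", label="low-end target (229,000)")
ax.axhline(high_target, color="blue", linestyle="--", label="high-end target (273,000)")
ax.set_ylabel("documents")
ax.set_title("Recall progress against both interval endpoints")
ax.legend(loc="lower right", fontsize=8)
plt.tight_layout()
plt.show()
```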

Random Sample Quality Assurance Test

In any significant project, in addition to following the range of recall progress, I impose a quality assurance test at the end to look for False Negatives. Remember, this means relevant documents that have been miscoded as irrelevant. One way to do that is by running similarity searches and verifying that the coding is synced across similar documents. That can catch situations involving documents that are known to be relevant. It is a way to be sure that all variations of those documents, including similar but different documents, are coded consistently. There may be reasons to call one variant relevant and another irrelevant, but usually not. I like to put a special emphasis on this at the end, but it is only one of many quality tests and searches that a skilled searcher can and should run throughout any large review project. Visualizations could also be used to assist in this search.

But what about the False Negatives that are not near duplicates or close cousins? The similarity and consistency searches will not find them. Of course you have been looking for these documents throughout the project, and at this point you think that you have found as many relevant documents as you can. You may not think you have found all relevant documents, total recall; no experienced searcher ever really believes that. But you should feel like you have found all highly relevant documents. You should have a well reasoned opinion that you have found all of the relevant documents needed to do justice. That opinion will be informed by legal principles of reasonability and proportionality.

That opinion will also be informed by your experience in searching through this document set. You will have seen for yourself that the probability rankings have divided the documents into two well defined segments, relevant and irrelevant. You will have seen that no documents, or very few, remain in the uncertainty area, the 40-60% range. You will have personally verified the machine’s predictions many times, such that you will have high confidence that the machine is properly implementing the SME’s relevance concept. You will have seen for yourself that few new relevant documents are found from one round of training to the next. You will also usually have seen that the new documents found are really just more of the same. That they are essentially cumulative in nature. All of these observations, plus the governing legal principles, go into the decision to stop the training and review, and move on to final confidentiality protection review, and then production and privilege logging.

Still, in spite of all such quality control measures, I like to add one more, one based again on random sampling. Again, I am looking for False Negatives, specifically any that are of a new and different kind of relevant document not seen before, or a document that would be considered highly relevant, even if of a type seen before. Remember, I will not have stopped the review in most projects (proportionality constraints aside) unless I was confident that I had already found all of those types of documents; already found all types of strong relevant documents, and already found all highly relevant documents, even if they are cumulative. I want to find each and every instance of all hot (highly relevant) documents that exist in the entire collection. I will only stop (proportionality constraints aside) when I think the only relevant documents I have not recalled are of an unimportant, cumulative type; the merely relevant. The truth is, most documents found in e-discovery are of this type; they are merely relevant, and of little to no use to anybody except to find the strong relevant, new types of relevant evidence, or highly relevant evidence.

There are two types of random samples that I usually run for this final quality assurance test. I can sample the entire document set again, or I can limit my sample to the documents that will not be produced. In the hypothetical we have been working with, that would mean a sample of the 750,000 documents not identified as relevant. I do not do both samples, but rather one or the other. But you could do both in a very large, relatively unconstrained budget project. That would provide more information. Typically in a low prevalence situation, where for instance there is only a 2% relevance shown from both the sample and the ensuing search project, I would do my final quality assurance test with a sample of the entire document collection. Since I am looking for False Negatives, my goal is not frustrated by including the 2% of the collection already identified as relevant.

There are benefits from running a full sample again, as it allows direct comparisons with the first sample, and can even be combined with the first sample for some analysis. You can, for instance, run a full confusion matrix analysis as explained, for instance, in The Grossman-Cormack Glossary of Technology Assisted Review; also see Escape From Babel: The Grossman-Cormack Glossary.

CONFUSION MATRIX

                      Truly Non-Relevant          Truly Relevant
Coded Non-Relevant    True Negatives (“TN”)       False Negatives (“FN”)
Coded Relevant        False Positives (“FP”)      True Positives (“TP”)

Accuracy = 100% – Error = (TP + TN) / (TP + TN + FP + FN)
Error = 100% – Accuracy = (FP + FN) / (TP + TN + FP + FN)
Elusion = 100% – Negative Predictive Value = FN / (FN + TN)
Fallout = False Positive Rate = 100% – True Negative Rate = FP / (FP + TN)
Negative Predictive Value = 100% – Elusion = TN / (TN + FN)
Precision = Positive Predictive Value = TP / (TP + FP)
Prevalence = Yield = Richness = (TP + FN) / (TP + TN + FP + FN)
Recall = True Positive Rate = 100% – False Negative Rate = TP / (TP + FN)
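As a minimal sketch of these formulas in code, assuming you have the four quadrant counts from a reviewed random sample (the function name and the example counts below are mine, for illustration only):

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard confusion matrix metrics, per the Grossman-Cormack Glossary."""
    total = tp + tn + fp + fn
    return {
        "accuracy":       (tp + tn) / total,
        "error":          (fp + fn) / total,
        "elusion":        fn / (fn + tn),
        "fallout":        fp / (fp + tn),
        "neg_pred_value": tn / (tn + fn),
        "precision":      tp / (tp + fp),
        "prevalence":     (tp + fn) / total,
        "recall":         tp / (tp + fn),
    }

# Hypothetical counts from a 1,534 document sample, for illustration only:
print(confusion_metrics(tp=370, tn=1140, fp=14, fn=10))
```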

Special code and visualizations built into review software could make it far easier to run this kind of Confusion Matrix analysis. It is really far easier than it looks and should be automated in a user-friendly way. Software vendors should also offer basic instruction on this tool. Scientist members of an e-discovery team can help with this. Since the benefits of this kind of analysis outweigh the small loss of including the 2% already known to be relevant in the alternative low prevalence example, I typically go with a full random sample in low prevalence projects.

In our primary hypothetical we are not dealing with a low prevalence collection. It has a 25% rate. Here if I sampled the entire 1,000,000, I would in large part be wasting 25% of my sample. To me that detriment outweighs the benefits of bookend samples, but I know that some experts disagree. They love the classic confusion matrix analysis.

To complete this 25% prevalence visualization hypothetical, next assume that we take a simple random sample of the 750,000 documents only, which is sometimes called the null set. This kind of sample is also sometimes called an Elusion test, as we are sampling the excluded documents to look for relevant documents that have so far eluded us. We again sample 1,534 documents, again allowing us a 95% confidence level and a confidence interval of plus or minus 2.5%.

Next assume in this hypothetical that we find that 1,519 documents have been correctly coded as irrelevant. (Note, most of the correct coding would have come from machine prediction, not actual human review, but some would have come from prior human review.) These 1,519 documents are True Negatives. That is 99% accurate. But the SME review of the random sample did uncover 15 mistakes, 15 False Negatives. The SME decided that 15 documents out of the 1,534 sampled had been incorrectly coded as irrelevant. That is about a 1% error rate. That is pretty good, but not dispositive. What really matters is the nature of the relevancy of the 15 False Negatives. Were these important documents, or just more of the same?

I always use what is called an accept on zero error protocol for the elusion test when it comes to highly relevant documents. If any are highly relevant, then the quality assurance test automatically fails. In that case you must go back and search for more documents like the one that eluded you and must train the system some more. I have only had that happen once, and it was easy to see from the document found why it happened. It was a black swan type document. It used odd language. It qualified as highly relevant under the rules we had developed, but just barely, and it was cumulative. Still, we tried to find more like it and ran another round of training. No more were found, but still we did a third sample of the null set just to be sure. The second time it passed.

In our hypothetical none of the 15 False Negative documents were highly relevant, not even close. None were of a new type of relevance. All were of a type seen before. Thus the test was passed.

The project then continued with the final confidential review, production and logging phases. Visualizations should be included in the software for these final phases as well, and I have several ideas, but this article is already far too long.

As I indicated in part one of this blog series, I am just giving away a few of my ideas here. For more information you will need to contact me for billable consultations, routed through my law firm, of course, and subject to my time availability with priority given to existing clients. Right now I am fully booked, but I may have time for these kind of interesting projects in a few months.

Conclusion

The growth in general electronic discovery legal work (see EDBP for a full description) has been exploding this year, and so too has multidisciplinary e-discovery team work. It will, I predict, continue to grow very fast from this point forward. But the adoption of predictive coding software and predictive coding review has, to date, been an exception to this high growth trend. In fact, the adoption of predictive coding has been relatively slow. It is still only infrequently used, if at all, by most law firms, even in big cases. I spoke with many attorneys at the recent Georgetown Institute event who specialize in this field. They are all seeing the same thing and, like me, are shaking their heads in frustration and dismay.

I predict this will change too over the next two to three years. The big hindrances to the adoption of predictive coding are law firms and their general lack of knowledge and skills in predictive coding. Most law firms, both big and small, know very little about the basic methods of predictive coding. They know even less about the best practices. The ignorance is widespread among attorneys my age, and they are the ones in law firm leadership positions. The hindrance to widespread adoption of predictive coding is not lack of judicial approval. There is now plenty of case law. The hindrance is lack of knowledge and skills.


Greedy Lawyers

There is also a greed component involved for some, shall we say, less than client-centric law firms. We have to talk about this elephant in the room. Clients already are. Some attorneys are quite satisfied with the status quo. They make a great deal of money from linear reviews, and from so-called advanced keyword search driven reviews. The days of paid inefficiency are numbered. Technology will eventually win out, even over fat cat lawyers. It always does.

The answers I see to the resistance issues to predictive coding are threefold:

Continued Education. We have to continue the efforts to demystify AI and active machine learning. We need to move our instruction from theory to practice.

Improved Software. Some review software already has excellent machine training features. Some is just so-so, and some does not have this kind of document search and ranking capacity at all. My goal is to push the whole legal software industry to include active machine learning in most all of their offerings. Another goal is for software vendors to improve their software, and make it easier to work with, by adding much more in the way of creative visualizations. That has been the main point of this series and I hope to see a response soon from the industry. Help me to push the industry. Demand these features in your review software. Look beyond the smokescreens and choose the true leaders in the field.

Client Demand. Pressure on reluctant law firms from the companies that pay the bills will have a far stronger impact than anything else.  I am talking about both corporate clients and insurers. They will, I predict, start pushing law firms into greater utilization of AI-enhanced document review. The corporate clients and insurers have the economic motivation for this change that most law firms lack. Corporate clients are also much more comfortable with the use of AI for Big Data search. That kind of pressure by clients on law firms will motivate e-discovery teams to learn the necessary skills. That will in turn motivate the software vendors to spend the money necessary to improve their software with better AI search and better visualizations.

All of the legal software on the market today, especially review software, could be improved by adding more visualizations and graphic display tools. Pictures really can be worth a thousand words. They can especially help to make advanced AI techniques more accessible and easier to understand. The data visualization ideas set forth in this series are just the tip of the iceberg of what can be done to improve existing software.


Visualizing Data in a Predictive Coding Project – Part Two

November 16, 2014

This is part two of my presentation of an idea for visualization of data in a predictive coding project. Please read part one first.

As most of you already know, the ranking of all documents according to their probable relevance, or other criteria, is the purpose of predictive coding. The ranking allows accurate predictions to be made as to how the documents should be coded. In part one I shared the idea by providing a series of images of a typical document ranking process. I only included a few brief verbal descriptions. This week I will spell it out and further develop the idea. Next week I hope to end on a high note with random sampling and math.

Vertical and Horizontal Axis of the Images

The visualizations presented here all represent a collection of documents. Each is supposed to be a pointillist image, with one point for each document. At the beginning of a document review project, before any predictive coding training has been applied to the collection, the documents are all unranked. They are relatively unknown. This is shown by the fuzzy round cloud of unknown data.

Once the machine training begins all documents start to be ranked. In the most simplistic visualizations shown here the ranking is limited to predicted relevance or irrelevance. Of course, the predictions could be more complex, and include highly relevant and privilege, which is what I usually do. It could also include various other issue classifications, but I usually avoid this for a variety of reasons that would take us too far astray to explain.

Once the training and ranking begin, the probability grid comes into play. This grid creates both a vertical and a horizontal axis. (In the future we could add a third dimension too, but let’s start simple.) The one public comment received so far stated that the vertical axis on the images showing percentages adjacent to the words “Probable Relevant” might give people the impression that it is the probability of a document being relevant. Well, I hope so, because that is exactly what I was trying to do!

The vertical axis shows how the documents are ranked. The horizontal axis shows the number of documents, roughly, at each ranking level. Remember, each point is supposed to represent a specific, individual document. (In the future we could add family overlays, but again, let’s start simple.) A single dot in the middle would represent one document. An empty space would represent zero documents. A wide expanse of horizontal dots would represent hundreds or thousands of documents, depending on the scale.

The diagram below visualizes a situation common when ranking has just begun and the computer is uncertain as to how to classify the documents. It classifies most documents in the 37.5% to 67.5% range of probable relevance. It is all about fifty-fifty at this point. This is the kind of spread you would expect to see if training began with only random sampling input. The diagram indicates that the computer does not really know much yet about the data. It does not yet have any real idea as to which documents are relevant, and which are not.

[Figure: vertical ranking overlay]

The vertical axis of the visualization is the key. It is intended to show a running grid from 99.9% probable relevant down to 0.01% probable relevant. Note that 0.01% probable relevant is another way of saying 99.99% probable irrelevant, but remember, I am trying to keep this simple. More complex overlays may be more to the liking of some software users. Also note that the particular numbers I show on these diagrams are arbitrary: 0.01%, 12.5%, 25%, 37.5%, 50%, 67.5%, 75%, 87.5%, 99.9%. I would prefer to see more detail here, and perhaps add a grid showing a faint horizontal line at every 10% interval. Still, the fewer lines shown here do have a nice aesthetic appeal, plus it was easier for me to create on the fly for this blog.

Again, let me repeat to be very clear. The vertical grid on these diagrams represents the probable relevance ranking, from least likely to be relevant on the bottom to most likely on the top. The horizontal grid shows the number of documents. It is really that simple.
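To make the idea concrete, here is a rough sketch of how such a ranking distribution could be drawn from a list of probability scores. The scores below are randomly generated stand-ins skewed the way a typical project looks (most documents near the bottom, a smaller cluster near the top); real scores would come from the predictive coding engine itself.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in scores: one probable-relevance score per document.
rng = np.random.default_rng(42)
scores = np.concatenate([rng.beta(1, 8, 750_000),    # mostly irrelevant cluster
                         rng.beta(8, 1, 250_000)])   # probable relevant cluster

counts, edges = np.histogram(scores, bins=20, range=(0, 1))
plt.barh((edges[:-1] + edges[1:]) / 2, counts, height=0.04, color="steelblue")
plt.ylabel("probable relevance ranking")
plt.xlabel("number of documents")
plt.title("Document count at each probable relevance level")
plt.show()
```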

Why Data Visualization Is Important

This kind of display of documents according to a vertical grid of probable relevance is very helpful because it allows you to see exactly how your documents are ranked at any one point in time. Just as important, it helps you to see how the alignment changes over time. This empowers you to see how your machine training impacts the distribution.

This kind of direct, immediate feedback greatly facilitates human computer interaction (what I call, in my approximately 50 articles on predictive coding, the hybrid approach). It makes it easier for the natural human intelligence to connect with the artificial intelligence. It makes it easier for the human SMEs involved to train the computer. The humans, typically attorneys or their surrogates, are the ones with the expertise on the legal issues in the case. This visualization allows them to see immediately what impact particular training documents have upon the ranking of the whole collection. This helps them to select effective training documents. It helps them to attain the goal of separating relevant from irrelevant documents. Ideally the documents would end up clustered at both the bottom and the top of the vertical axis.

For this process to work it is important for the feedback to be grounded in actual document review, and not be a mere intellectual exercise. Samples of documents in the various ranking strata must be inspected to verify, or not, whether the ranking is accurate. That can vary from stratum to stratum. Moreover, as everyone quickly finds out, each project is different, although certain patterns do tend to emerge. The diagrams used as an example in this blog represent one such typical pattern, although greatly compressed in time. In reality the changes shown here from one diagram to another would be more gradual and have a few unexpected bumps and bulges.

Visualizations like this will speed up the ranking and the review process. Ultimately the graphics will all be fully interactive. By clicking on any point in the graphic you will be taken to the particular document or documents that it represents. You will be able to click and drag to select a whole set of documents. For instance, you may want to see all documents between 45% and 55%, so you would select that range in the graphic. Or you may want to see all documents in the top 5% probable relevance ranking, so you select the top edge of the graphic. These documents will instantly be shown in the review database. Most good software already has document visualizations with similar linking capacities. So we are not reinventing the wheel here, just applying these existing software capacities to new patterns, namely to document rankings.

These graphic features will allow you to easily search the ranking locations. This will in turn allow you to verify, or correct, the machine’s learning. Where you find that the documents clicked have a correct prediction of relevance, you verify by coding as relevant, or highly relevant. Where the documents clicked have an incorrect prediction, you correct by coding the document properly. That is how the computer learns. You tell it yes when it gets it right, and no when it gets it wrong.

At the beginning of a project many predictions of relevance and irrelevance will be incorrect. These errors will diminish as the training progresses, as the correct predictions are verified and the erroneous predictions are corrected. Fewer mistakes will be made as the machine starts to pick up the human intelligence. To me it seems like a mind to computer transference. More of the predictions will be verified, and the document distributions will start to gather at both ends of the vertical relevance axis. Since the volume of documents is represented by the horizontal axis, more documents will start to bunch together at both the top and bottom of the vertical axis. Since document collections in legal search usually contain many more irrelevant documents than relevant, you will typically see most documents at the bottom.

Visualizations of an Exemplar Predictive Coding Project

In the sample considered here we see unnaturally rapid training. It would normally take many more rounds of machine training than are shown in these four diagrams. In fact, with a continuous active training process, there could be hundreds of rounds per day. In that case the visualization would look more like an animation than a series of static images. But again, I have limited the process here for simplicity’s sake.

As explained previously, the first thing that happens to the fuzzy round cloud of unknown data, before any training begins, is that the data is processed, deduplicated, DeNISTed, and non-text documents and other documents unsuitable for analytics are removed. In addition, other documents necessarily irrelevant to this particular project are bulk-culled out. For example, ESI such as music files, some types of photos, and many email domains, for instance emails from publications such as the NY Times. By good fortune in this example exactly One Million documents remain for predictive coding.

We begin with some multimodal judgmental sampling, and with a random sample of 1,534 documents. (They are the yellow dots.) Assuming a 95% confidence level, do you know what confidence interval this creates? I asked this question before and repeat it again, as the answer will not come until the final math installment next week.

Next we assume that an SME, and/or his or her surrogates, reviewed the 1,534 document sample and found that 384 were relevant and 1,150 were irrelevant. Do you know what prevalence rate this creates? Do you know the projected range of relevant documents within the confidence interval limits of this sample? That is the most important question of all.

Next we do the first round of machine training proper. The first round of training is sometimes called the seed set. Now the document ranking according to probable relevance and irrelevance begins. Again for simplicity sake, we assume that the analytics is directed towards relevance alone. In fact, most projects would also include high-relevance and privilege.

In this project the data ball changed to the following distribution. Note that the lighter colors represent less density of documents. Red points represent documents coded or predicted as relevant, and blue as irrelevant. All predictive coding projects are different, and the distributions shown here are just one among near countless possibilities. Here there are already more documents trained on irrelevance than relevance. This is in spite of the fact that the active search was to find relevant documents, not irrelevant documents. This is typical in most review projects, where you have many more irrelevant than relevant documents overall, and where it is easier to spot and find irrelevant than relevant.

Next we see the data after the second round of training. The division of the collection of documents into relevant and irrelevant is beginning to form. The largest collection of documents is the blue points at the bottom. They are the documents that the computer predicts are irrelevant based on the training to date. There is also a large collection of points shown in red at the top. They are the ones where the computer now thinks there is a high probability of relevance. Still, the computer is uncertain about the vast majority of the documents: the red in the third strata from the top, the blue in the third strata from the bottom, and the many in the grey, the 37.5% to 67.5% probable relevance range. Again we see an overall bottom heavy distribution. This is a typical pattern because it is usually easier to train on irrelevance than relevance.

As noted before, the training could be continuous. Many software programs offer that feature. But I want to keep the visualizations here simple, and not make an animation, and so I do not assume here a literally continuous active learning. Personally, although I do like to keep the training continuous throughout the review, I like the actual computer training to come in discrete stages that I control. That gives me a better understanding of the impact of my machine training. The SME human trains the machine, and, in an ideal situation, the machine also trains the SME. That is the kind of feedback that these visualizations enhance.

Next we see the data after the third round of training. Again, in reality it would typically take more than three rounds of training to reach this relatively mature state, but I am trying to keep this example simple. If a project did progress this fast, it would probably be because a large number of documents were used in the prior rounds. The set of documents about which the computer is now uncertain, the grey area and the middle two brackets, is now much smaller.

The computer now has a high probability ranking for most of the probable relevant and probable irrelevant documents. The largest number of documents are at the blue bottom, where the computer predicts they have a near zero chance of being classified relevant. Again, most of the probable predictions, those in the top and bottom three brackets, are located in the bottom three brackets. Those are the documents predicted to have less than a 37.5% chance of being relevant. Again, this kind of distribution is typical, but there can be many variances from project to project. We here see a top loading where most of the probable relevant documents are in the top 12.5% of the ranking. In other words, they have an 87.5% probable relevance ranking, or higher.

Next we see the data after the fourth round of training. It is an excellent distribution at this point. There are relatively few documents in the middle. This means there are relatively few documents about which the computer is uncertain as to its probable classification. This pattern is one factor among several to consider in deciding whether further training and document review are required to complete your production.

Another important metric to consider is the total number of documents found to be probable relevant, and a comparison with the random sample prediction. Here is where the math comes in, and an understanding of what random sampling can and cannot tell you about the success of a project. You consider the spot projection of total relevance based on your initial prevalence calculation, but much more important, you consider the actual range of documents under the confidence interval. That is what really counts when dealing with prevalence projections and random sampling. That is where the plus or minus confidence interval comes into play, as I will explain in detail in the third and final installment of this blog.

In the meantime, here is the document count of the distribution roughly pictured in the final diagram above, which to me looks like an upside down, fragile champagne glass. We see that exactly 250,000 documents have a 50% or higher probable relevance ranking, and 750,000 documents have a 49.9% or less probable relevance ranking. Of the probable relevant documents, there are 15,000 that fall in the 50% to 67.5% range. There are another 10,000 documents that fall in the 37.5% to 49.9% probable relevance range. Again, this is also fairly common, as we often see less on the barely irrelevant side than we do on the barely relevant side. As a general rule I review with humans all documents that are 50% or higher probable relevance, and do not review the rest. I do however sample and test the rest, the documents with less than a 50% probable relevance ranking. Also, in some projects I review far less than the top 50%. That all depends on proportionality constraints, and on the document ranking distribution, the kind of distribution that these visualizations will show.
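As a rough sketch of how such band counts and the 50% review cutoff could be tallied, the snippet below reuses the stand-in scores array (and numpy import) from the earlier histogram sketch; being random stand-ins, those scores will not reproduce the hypothetical's exact 250,000/15,000/10,000 numbers, only the shape of the tally.

```python
review_set = scores[scores >= 0.50]   # documents at 50%+ probable relevance
print(f"{review_set.size:,} documents in the human review set (50%+)")
print(f"{np.sum((scores >= 0.50) & (scores < 0.675)):,} barely relevant (50%-67.5%)")
print(f"{np.sum((scores >= 0.375) & (scores < 0.50)):,} barely irrelevant (37.5%-49.9%)")
```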

In addition to this metrics analysis, another important factor to consider in whether our search and review efforts are now complete, is how much change in ranking there has been from one training round to the next. Sometimes there may be no change at all. Sometimes there may only be very slight changes. If the changes from the last round are large, that is an indication that more training should still be tried, even if the distribution already looks optimal, as we see here.

Another even more important quality control factor is how correct the computer has been in the last few rounds of its predictions. Ideally, you want to see the rate of error decreasing to a point where you see no errors in your judgmental samples. You want your testing of the computer’s predictions to show that it has attained a high degree of precision. That means there are few documents predicted relevant that actual review by human SMEs shows are in fact irrelevant. This kind of error is known as a False Positive. Much more important to quality evaluation is the discovery of documents predicted irrelevant that are actually relevant. This kind of error is known as a False Negative. The False Negatives are your real concern in most projects because legal search is usually focused on recall, not precision, at least within reason.

The final distinction to note in quality control is one that might seem subtle, but really is not. You must also factor in relevance weight. You never want a False Negative to be a highly relevant document. If that happens to me, I always commence at least one more round of training. Even missing a document that is not highly relevant, not hot, but is a strong relevant document, and one of a type not seen before, is typically a cause for further training. This is, however, not an automatic rule as with the discovery of a hot document. It depends on a variety of factors having to do with relevance analysis of the particular case and document collection.

In our example we are going to assume that all of the quality control indicators are positive, and a decision has been made to stop training and move on to a final random sample test.

A second random sample comes next. That test and visualization will be provided next week, along with the promised math and sampling analysis.

Math Quiz

In part one, and again here, I asked some basic math questions on random sampling, prevalence, and recall. So far no one has attempted to answer the questions posed. Apparently, most readers here do not want to be tested. I do not blame them. This is also what I find in my online training program, e-DiscoveryTeamTraining.com, where only a small percentage of the students who take the program elect to be tested. That is fine with me as it means one less paper to grade, and most everyone passes anyway. I do not encourage testing. You know if you get it or not. Testing is not really necessary.

The same applies to answering math questions in a public blog. I understand the hesitancy. Still, I hope many privately tried, or will try, to solve the questions and came up with the correct answers. In part three of this blog I will provide the answers, and you will know for sure if you got it right. There is still plenty of time to try to figure it out on your own. The truly bold can post their answers in the comments below. Of course, this is all pretty basic stuff to true experts of this craft. So, to my fellow experts out there, you have yet another week to take some time and strut your stuff by sharing the obvious answers. Surely I am not the only one in the e-discovery world bold enough to put their reputation on the line by sharing their opinions and analysis in public for all to see (and criticize). Come on. I do it every week.

Math and sampling are important tools for quality control, but as Professor Gordon Cormack, a true wizard in the area of search, math, and sampling likes to point out, sampling alone has many inherent limitations. Gordon insists, and I agree, that sampling should only be part of a total quality control program. You should never just rely on random sampling alone, especially in low prevalence collections. Still, when sampling, prevalence, and recall are included as part of an overall QC effort, the net effect is very reassuring. Unless I know that I have an expert like Gordon on the other side, and so far that has never happened, I want to see the math. I want to know about all of the quality control and quality assurance steps taken to try to find the information requested. If you are going to protect your client, you need to learn this too, or have someone at hand who already knows it.

This kind of math, sampling, and other process disclosures should convince even the most skeptical adversary or judge. That is why it is important for all attorneys involved with legal research to have a clear mathematical understanding of the basics. Visualizations alone are inadequate, but, for me at least, visualizations do help a lot. All kinds of data visualizations, not just the ones here presented, provide important tools to help lawyers to understand how a search project is progressing.

Challenge to Software Vendors

The simplicity of the design of the idea presented here is a key part of the power and strength of the visualization. It should not be too difficult to write code to implement this visualization. We need this. It will help users to better understand the process. It will not cost too much to implement, and what it does cost should be recouped soon in higher sales. Come on vendors, show me you are listening. Show me you get it. If you have a software demo that includes this feature, then I want to see it. Otherwise not.

All good predictive coding software already ranks the probable relevance of documents, so we are not talking about an enormous coding project. This feature would simply add a visual display to calculations already being made. I could hand make these calculations myself using an Excel spreadsheet, but that is time consuming and laborious. This kind of visualization lends itself to computer generation.
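To make the vendor challenge concrete, here is a minimal sketch, in Python, of the kind of calculation and display I have in mind. It assumes you can export each document's probable relevance score (a number between 0 and 1) from your review tool; the random scores, the ten-percent strata, and the crude text bar chart are illustrative stand-ins for the graphical grid described in this series, not anyone's actual product.

from collections import Counter
import random

random.seed(1)
scores = [random.random() for _ in range(1_000_000)]   # stand-in scores, 0.0 to 1.0

def strata_counts(scores, n_strata=10):
    # Count documents falling in each probable relevance stratum (deciles by default).
    counts = Counter()
    for s in scores:
        counts[min(int(s * n_strata), n_strata - 1)] += 1
    return counts

counts = strata_counts(scores)
widest = max(counts.values())
for stratum in range(9, -1, -1):                        # highest ranked strata on top
    low, high = stratum * 10, stratum * 10 + 10
    n = counts.get(stratum, 0)
    bar = "#" * round(40 * n / widest)                  # crude density bar
    print(f"{low:3d}-{high:3d}% probable relevance: {n:9,} docs  {bar}")

An Excel pivot table over the same exported scores would produce the same counts; the point is simply that every tool that ranks documents already has the data this visualization needs.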

I have many other ideas for predictive coding features, including other visualizations, that are much more complex and challenging to implement. This simple grid explained here is an easy one to implement, and will show me, and the rest of our e-discovery community, who the real leaders are in software development.

Conclusion

The primary goal of the e-Discovery Team blog is educational, to help lawyers and other e-discovery professionals. In addition, I am trying to influence what services and products are provided in e-discovery, both legal and technical. In this blog I am offering an idea to improve the visualizations that most predictive coding software already provides. I hope that all vendors will include this feature in future releases of their software. I have a host of additional ideas to improve legal search and review software, especially the kind that employs active machine learning. They include other, much more elaborate visualization schemes, some of which have been alluded to here.

Someday I may have time to consult on all of the other, more complex ideas, but, in the meantime, I offer this basic idea for any vendor to try out. Until vendors start to implement even this basic idea, anyone can at least use their imagination, as I now do, to follow along. These kinds of visualizations can help you to understand the impact of document ranking on your predictive coding review projects. All it takes is some idea as to the number of documents in the various probable relevance ranking strata. Try it on your next predictive coding project, even if it is just rough images from your own imagination (or Excel spreadsheet). I am sure you will see for yourself how helpful this can be to monitor and understand the progress of your work.



Visualizing Data in a Predictive Coding Project

November 9, 2014

This blog will share a new way to visualize data in a predictive coding project. I only include a brief description this week; next week I will add a full description of the project. Advanced students should be able to predict the full text from the images alone. Study the images and the brief descriptions and try to figure out the details of what is going on.

Soon all good predictive coding software will include visualizations like this to help searchers to understand the data. The images can be automatically created by computer to accurately visualize exactly how the data is being analyzed and ranked. Experienced searchers can use this kind of visual information to better understand what they should do next to efficiently meet their search and review goals.

For a game, try to figure out the high and low number of relevant documents that you must find in this review project to claim, at a 95% confidence level, that you have found all relevant documents, the mythical total recall. This high-low range will be wrong one time out of twenty; that is what the 95% confidence level means, but still, this knowledge is helpful. The correct answer to questions of recall and prevalence is always a high-low range of documents, never just one number, and never a percentage. Also, there are always confidence level caveats. Still, with these limitations in mind, for extra points, state what the spot projection is for prevalence. These illustrations and short descriptions provide all of the information you need to calculate these answers.

The project begins with a collection of documents here visualized by the fuzzy ball of unknown data.

Raw_Data

Next the data is processed: deduplicated, de-NISTed, and culled of non-text files and other documents unsuitable for analytics. By good fortune exactly One Million documents remain. (A rough sketch of the culling step follows the figure below.)

1000000_docs
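For readers curious about what that culling step can look like under the hood, here is a minimal sketch of hash-based deduplication and DeNISTing, which drops known system files whose hashes appear in the NIST National Software Reference Library (NSRL) hash set. The directory path and the hash-list file name are hypothetical placeholders; real processing tools do all of this, plus near-duplicate detection, email threading, and text extraction, far more robustly.

import hashlib
from pathlib import Path

def file_hash(path, algo="md5"):
    # Hash a file in chunks so large files do not have to fit in memory.
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest().upper()

# Hypothetical input: one MD5 hash per line, exported from the NSRL hash set.
nsrl_hashes = set(Path("nsrl_md5.txt").read_text().split())

unique_docs, seen = [], set()
for path in Path("collection").rglob("*"):              # hypothetical collection folder
    if not path.is_file():
        continue
    digest = file_hash(path)
    if digest in nsrl_hashes:                           # known system file: DeNIST it out
        continue
    if digest in seen:                                  # exact duplicate: cull it
        continue
    seen.add(digest)
    unique_docs.append(path)

print(f"{len(unique_docs):,} documents remain for analytics")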

We begin with some multimodal judgmental sampling, and with a random sample of 1,534 documents. Assuming a 95% confidence level, what confidence interval does this create?

Random

Assume that an SME reviewed the 1,534-document sample and found that 384 were relevant and 1,150 were irrelevant.
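For anyone who wants to check their own answers to the sample-size and prevalence questions above, here is a rough sketch of the standard arithmetic, using the ordinary normal-approximation (Gaussian) formula with a finite population correction. The full answers promised for later in this series may be computed with an exact binomial calculator instead, so treat these numbers as back-of-the-envelope approximations, not the official solution.

import math

N = 1_000_000        # documents remaining after culling
n = 1_534            # size of the random sample
relevant = 384       # documents the SME coded relevant in the sample
z = 1.96             # z-score for a 95% confidence level
fpc = math.sqrt((N - n) / (N - 1))                      # finite population correction

# Margin of error implied by a 1,534-document sample (worst case p = 0.5).
moe = z * math.sqrt(0.25 / n) * fpc
print(f"Confidence interval: +/- {moe:.2%}")            # roughly +/- 2.5%

# Spot (point) projection of prevalence and its interval.
p = relevant / n
moe_p = z * math.sqrt(p * (1 - p) / n) * fpc
print(f"Prevalence point projection: {p:.2%}")          # about 25%

# Projected high-low range of relevant documents in the whole collection.
low, high = max(p - moe_p, 0.0), min(p + moe_p, 1.0)
print(f"Projected relevant documents: {low * N:,.0f} to {high * N:,.0f}")

Run as written, the sketch shows roughly a plus-or-minus 2.5% confidence interval for the 1,534-document sample, a spot prevalence of about 25%, and a projected range of relevant documents somewhere in the low-to-high two-hundred-thousands.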


Training Begins

Next we do the first round of machine training. The first round of training is sometimes called the seed set. Now the ranking of documents according to probable relevance and irrelevance begins. To keep it simple we only show the relevance ranking, and not the corresponding irrelevance display. The top represents 99.9% probable relevance. The bottom is the inverse, 0.1% probable relevance. Put another way, the bottom would represent 99.9% probable irrelevance. For simplicity's sake we also assume that the analytics is directed towards relevance alone, whereas most projects would also include high relevance and privilege. In this project the data ball changed to the following distribution. Note that the lighter colors represent a lower density of documents. Red represents documents coded or predicted as relevant, and blue irrelevant. All predictive coding projects are different, and the distribution shown here is just one among countless possibilities. (A rough sketch of how such probability scores might be generated follows the figure below.)

data-visual_Round_2
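As promised above, here is a rough sketch of where probable relevance scores like these can come from. The actual algorithms inside predictive coding tools are proprietary and vary by vendor; the snippet below simply uses scikit-learn's TF-IDF vectorizer and logistic regression as a generic stand-in, with made-up example texts, to show the basic train-then-score pattern.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: texts and SME coding decisions (1 = relevant, 0 = irrelevant)
# for the training documents, plus the texts of the still uncoded documents.
train_texts = [
    "letter regarding breach of the supply contract and damages",
    "office party menu and parking reminders",
]
train_labels = [1, 0]
uncoded_texts = [
    "notice of default under the master supply agreement",
    "fantasy football league standings",
]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
X_uncoded = vectorizer.transform(uncoded_texts)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)

# Probability of the "relevant" class for each uncoded document, highest first.
scores = model.predict_proba(X_uncoded)[:, 1]
for text, score in sorted(zip(uncoded_texts, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {text}")

In a real project the training examples come from the multimodal judgmental and random review rounds described above, and the resulting scores are what a grid visualization like the one proposed here would summarize.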

Next we see the data after the second round of training. Note that with most software the training could be continuous, but I like to control when the training happens in order to better understand the impact of my machine training. The human SME trains the machine and, in an ideal situation, the machine also trains the SME. The SME comes to understand how the machine is learning, and learns where the machine needs the most help to tune into the SME's conception of relevance. This kind of cross-communication makes it easier for the artificial intelligence to properly boost the human intelligence.

data-visual_Round_3

Next we see the data after the third round of training. The machine is learning very quickly. In most projects it takes longer than this to attain this kind of ranking distribution. What does this tell us about the number of documents between rounds of training?

data-visual_Round_4

Now we see the data after the fourth round of training. It is an excellent distribution, and so we decide to stop and test.

data-visual_Round_5

The second random sample comes next. That visualization, and a full description of the project, will be provided next week. In the meantime, leave your answers to the questions in the comments below. This is a chance to strut your stuff. If you prefer, send me your answers, and questions, by private email.



Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part Two

November 2, 2014

This is the second part of a two-part blog; please read part one first.

AI-Enhanced Big Data Search Will Greatly Simplify Information Governance

Information Governance is, or should be, all about finding the information you need, when you need it, and doing so in a cheap and efficient manner. Information needs are determined by both law and personal preferences, including business operation needs. In order to find information, you must first have it. Not only that, you must keep it until you need it. To do that, you need to preserve the information. If you have already destroyed information, really destroyed it I mean, not just deleted it, then obviously you will not be able to find it. You cannot find what does not exist, as all Unicorn chasers eventually find out.

This creates a basic problem for Information Governance, because the whole system is based on the notion that the best way to find valuable information is to destroy worthless information. Much of Information Governance is devoted to trying to determine what information is a valuable needle, and what is worthless chaff. This is because everyone knows that the more information you have, the harder it is for you to find the information you need. The idea is that too much information will cut you off. These maxims were true in the pre-AI-enhanced search days, but, in my opinion, they are no longer true today, or at least will not be true within the next five to ten years, maybe sooner.

In order to meet the basic goal of finding information, Information Governance focuses its efforts on the proper classification of information. Again, the idea was to make it simpler to find information by preserving some of it, the information you might need to access, and destroying the rest. That is where records classification comes in.

The question of what information you need has a time element to it. The time requirements are again based on personal and business operations needs, and on thousands of federal, state and local laws. Information Governance thus became a very complicated legal analysis problem. There are literally thousands of laws requiring certain types of information to be preserved for various lengths of time. Of course, you could comply with most of these laws by simply saving everything forever, but, in the past, that was not a realistic solution. There were severe limits on the ability to save information, and on the ability to find it. Also, it was presumed that the older information was, the less value it had. Almost all information was thus treated like news.

These ideas were all firmly entrenched before the advent of Big Data and AI-enhanced data mining. In fact, in today’s world there is good reason for Google to save every search, ever done, forever. Some patterns and knowledge only emerge in time and history. New information is sometimes better information, but not necessarily so. In the world of Big Data all information has value, not just the latest.

These records life-cycle ideas all made perfect sense in the world of paper information. It cost a lot of money to save and store paper records. Everyone with a monthly Iron Mountain paper records storage bill knows that. Even after the computer age began, it still cost a fair amount of money to save and store ESI. The computers needed to maintain digital storage used to be very expensive to buy and operate. Finding the ESI you needed quickly on a computer was still very difficult and unreliable. All we had at first was keyword search, and that was very ineffective.

Due to the costs of storage, and the limitations of search, tremendous efforts were made by record managers to try to figure out what information was important, or needed, either from a legal perspective, or a business necessity perspective, and to save that information, and only that information. The idea behind Information Management was to destroy the ESI you did not need or were not required by law to preserve. This destruction saved you money, and, it also made possible the whole point of Information Governance, to find the information you wanted, when you wanted it.

Back in the pre-AI search days, the more information you had, the harder it was to find the information you needed. That still seems like common sense. Useless information was destroyed so that you could find valuable information. In reality, with the new and better algorithms we now have for AI-enhanced search, it is just the reverse. The more information you have, the easier it becomes to find what you want. You now have more information to draw upon.

That is the new reality of Big Data. It is a hard intellectual paradigm shift to make, and it seems counter-intuitive. It took me a long time to get it. The new ability to save and search everything cheaply and efficiently is what is driving the explosion of Big Data services and products. As the save everything, find anything way of thinking takes over, the classification and deletion aspects of Information Governance will naturally dissipate. The records lifecycle will transform into virtual immortality. There is no reason to classify and delete if you can save everything and find anything at low cost. The issues simplify; they change to how to save and search, although new collateral issues of security and privacy grow in importance.

Save and Search v. Classify and Delete

The current clash in basic ideas concerning Big Data and Information Governance is confusing to many business executives. According to Gregory Bufithis, who attended a recent event in Washington, D.C., on Big Data sponsored by EMC, one senior presenter explained:

The C Suite is bedeviled by IG and regulatory complexity. … 

The solution is not to eliminate Information Governance entirely. The reports of its complete demise, here or elsewhere, are exaggerated. The solution is to simplify IG. To pare it down to save and search. Even this will take some time, like I said, from five to ten years, although there is some chance this transformation of IG will go even faster than that. This move away from complex regulatory classification schemes, to simpler save and search everything, is already being adopted by many in the high-tech world. To quote Greg again from the private EMC event in D.C. in October, 2014:

Why data lakes? Because regulatory complexity and the changes can kill you. And are unpredictable in relationship to information governance. …

So what’s better? Data lakes coupled with archiving. Yes, archiving seems emblematic of “old” IT. But archiving and data lifecycle management (DLM) have evolved from a storage focus, to a focus on business value and data loss prevention. DLM recognizes that as data gets older, its value diminishes, but it never becomes worthless. And nobody is throwing out anything and yes, there are negative impacts (unnecessary storage costs, litigation, regulatory sanctions) if not retained or deleted when it should be.

But … companies want to mine their data for operational and competitive advantage. So data lakes and archiving their data allows for ingesting and retain all information types, structured or unstructured. And that’s better.

Because then all you need is a good search platform or search system … like Hadoop which allows you to sift through the data and extract the chunks that answer the questions at hand. In essence, this is a step up from OLAP (online analytical processing). And you can use “tag sift sort” programs like Data Rush. Or ThingWorx which is an approach that monitors the stream of data arriving in the lake for specific events. Complex event processing (CEP) engines can also sift through data as it enters storage, or later when it’s needed for analysis.

Because it is all about search.

Recent Breakthroughs in Artificial Intelligence
Make Possible Save Everything, Find Anything

The New York Times in an opinion editorial this week discussed recent breakthroughs in Artificial Intelligence and speculated on the alternative futures they could create. Our Machine Masters, NY Times Op-Ed, by David Brooks (October 31, 2014). The Times article quoted extensively from another article in the current issue of Wired by technology blogger Kevin Kelly: The Three Breakthroughs That Have Finally Unleashed AI on the World. Kelly argues, as do I, that artificial intelligence has now reached a breakthrough level. This artificial intelligence breakthrough, Kevin Kelly argues, and David Brooks agrees, is driven by three things: cheap parallel computation technologies, big data collection, and better algorithms. The upshot is clear in the opinion of both Wired and the New York Times: “The business plans of the next 10,000 start-ups are easy to forecast: Take X and add A.I. This is a big deal, and now it’s here.”

These three new technology advances change everything. The Wired article goes into the technology and financial aspects of the new AI; it is where the big money is going and will be made in the next few decades. If Wired is right, then this means that in our world of e-discovery, companies and law firms will succeed if, and only if, they add AI to their products and services. The firms and vendors that add AI to document review and project management will grow fast. Vendors with non-AI-enhanced software will go out of business. The law firms that do not use AI tools will shrink and die.

The Times article by David Brooks goes into the sociological and philosophical aspects of the recent breakthroughs in Artificial Intelligence:

Two big implications flow from this. The first is sociological. If knowledge is power, we’re about to see an even greater concentration of power.  … [E]ngineers at a few gigantic companies will have vast-though-hidden power to shape how data are collected and framed, to harvest huge amounts of information, to build the frameworks through which the rest of us make decisions and to steer our choices. If you think this power will be used for entirely benign ends, then you have not read enough history.

The second implication is philosophical. A.I. will redefine what it means to be human. Our identity as humans is shaped by what machines and other animals can’t do. For the last few centuries, reason was seen as the ultimate human faculty. But now machines are better at many of the tasks we associate with thinking — like playing chess, winning at Jeopardy, and doing math. [RCL – and, you might add, better at finding relevant evidence.]

On the other hand, machines cannot beat us at the things we do without conscious thinking: developing tastes and affections, mimicking each other and building emotional attachments, experiencing imaginative breakthroughs, forming moral sentiments. [RCL – and, you might add, better at equitable notions of justice and at legal imagination.]

In this future, there is increasing emphasis on personal and moral faculties: being likable, industrious, trustworthy and affectionate. People are evaluated more on these traits, which supplement machine thinking, and not the rote ones that duplicate it.

In the cold, utilitarian future, on the other hand, people become less idiosyncratic. If the choice architecture behind many decisions is based on big data from vast crowds, everybody follows the prompts and chooses to be like each other. The machine prompts us to consume what is popular, the things that are easy and mentally undemanding.

I’m happy Pandora can help me find what I like. I’m a little nervous if it so pervasively shapes my listening that it ends up determining what I like. [RCL – and, you might add, determining what is relevant, what is fair.]

I think we all want to master these machines, not have them master us.

Although I share the concerns of the NY Times about mastering machines and alternative future scenarios, my analysis of the impact of the new AI is focused on, and limited to, the Law. Lawyers must master the AI-enhanced search for evidence. We must master and use the better algorithms, the better AI-enhanced software, not vice versa. The software does not, nor should it, run itself. Easy buttons in legal search are a trap for the unwary, a first step down a slippery slope to legal dystopia. Human lawyers must never over-delegate our uniquely human insights and abilities. We must train the machines. We must stay in charge and assert our human insights on law, relevance, equity, fairness and justice, and our human abilities to imagine and create new realities of justice for all. I want lawyers and judges to use AI-enhanced machines, but I never want to be judged by a machine alone, nor have a computer alone as a lawyer.

The three big new advances that are enabling better and better AI are nowhere near to threatening the jobs of human judges or lawyers, although they will likely reduce their numbers, and will certainly change their jobs. We are already seeing these changes in Legal Search and Information Governance. Thanks to cheap parallel computation, we now have Big Data Lakes stored on thousands of inexpensive cloud computers operating together. This is where open-source software like Hadoop comes in; it makes the big clusters of computers possible. Better algorithms are where better AI-enhanced software comes in. This makes it possible to use predictive coding effectively and inexpensively to find the information needed to resolve law suits. The days of vast numbers of document reviewer attorneys doing linear review are numbered. Instead, we will see a few SMEs working with small teams of reviewers, search experts, and software experts.

The role of Information Managers will also change drastically. Because of Big Data, cheap parallel computing, and better algorithms, it is now possible to save everything, forever, at a small cost, and to quickly search and find what you need. The new reality of Save Everything, Find Anything undercuts most of the rationale of Information Governance. It is all about search now.

Conclusion

Now that storage costs are negligible, and search far more efficient, the twin motivators of Information Science to classify and destroy are gone, or soon will be. The key remaining tasks of Information Governance are now preservation and search, plus the relatively new ones of security and privacy. I recognize that the demise of the importance of destruction of ESI could change if more governments enact laws that require the destruction of ESI, as the EU has done with Facebook posts and the so-called “right to be forgotten” law. But for now, most laws are about saving data for various periods of time, and do not require that data be destroyed. Note that the new Delaware law on data destruction still makes it discretionary whether or not to destroy personal data. House Bill No. 295 – The Safe Destruction of Documents Containing Personal Identifying Information. It only imposes legal burdens and liability for failures to properly destroy data. This liability for mistakes in destruction serves to discourage data destruction, not encourage it.

Preservation is not too difficult when you can economically save everything forever, so the challenging task remaining is really just one of search. That is why I say that Information Governance will become a sub-set of search. The save everything forever model will, however, create new legal work for lawyers. The cybersecurity protection and privacy aspects of Big Data Lakes are already creating many new legal challenges and issues. More legal issues are sure to arise with the expansion of AI.

Automation, including this latest Second Machine Age of mental process automation, does not eliminate the need for human labor. It just makes our work more interesting and opens up more time for leisure. Automation has always created new jobs as fast as it has eliminated old ones. The challenge for existing workers like ourselves is to learn the new skills necessary to do the new jobs. For us e-discovery lawyers and techs, this means, among other things, acquiring new skills to use AI-enhanced tools. One such skill, HCIR (human computer information retrieval), is mentioned in most of my articles on predictive coding. It involves new skill sets in active machine learning, training a computer to find the evidence you want from large collections of data, typically emails. When I was a law student in the late 1970s, I could never have dreamed that this would be part of my job as a lawyer in 2014.
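For readers new to HCIR, the core of the skill is an iterative loop: the lawyer codes a batch of documents, the machine re-ranks everything, and the lawyer then reviews the documents the machine most needs help with. The schematic Python below shows one common flavor of that loop, uncertainty sampling; it is a teaching illustration built on generic scikit-learn-style pieces, not a description of any particular review platform, and the sme_review callback is a hypothetical stand-in for the human reviewer.

import numpy as np

def active_learning_loop(model, X_all, seed_idx, seed_labels,
                         sme_review, rounds=4, batch_size=200):
    # Iteratively train, rank, and route the least certain documents to the SME.
    labeled_idx = list(seed_idx)
    labels = list(seed_labels)
    for _ in range(rounds):
        model.fit(X_all[labeled_idx], labels)
        probs = model.predict_proba(X_all)[:, 1]        # probable relevance scores
        distance = np.abs(probs - 0.5)                  # closeness to the 50% line
        already = set(labeled_idx)
        batch = [int(i) for i in np.argsort(distance)   # most uncertain documents first
                 if int(i) not in already][:batch_size]
        new_labels = sme_review(batch)                  # human SME codes the batch
        labeled_idx.extend(batch)
        labels.extend(new_labels)
    model.fit(X_all[labeled_idx], labels)               # final training pass
    return model

Other flavors, such as relevance feedback on the highest-ranked documents or continuous active learning, change which batch goes to the human, but the basic rhythm of lawyer-trains-machine and machine-ranks-for-lawyer stays the same.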

The new jobs do not rely on physical or mental drudgery and repetition. Instead, they put a premium on what makes us distinctly human: our deep knowledge, understanding, wisdom, and intuition; our empathy, caring, love and compassion; our morality, honesty, and trustworthiness; our sense of justice and fairness; our ability to change and adapt quickly to new conditions; our likability, good will, and friendliness; our imagination, art, and creativity. Yes, even our individual eccentricities, and our all-important sense of humor. No matter how far we progress, let us never lose that! Please be governed accordingly.


