Visualizing Data in a Predictive Coding Project – Part Two

November 16, 2014

This is part two of my presentation of an idea for visualization of data in a predictive coding project. Please read part one first.

As most of you already know, the purpose of predictive coding is the ranking of all documents according to their probable relevance, or other criteria. The ranking allows accurate predictions to be made as to how the documents should be coded. In part one I shared the idea by providing a series of images of a typical document ranking process. I only included a few brief verbal descriptions. This week I will spell it out and further develop the idea. Next week I hope to end on a high note with random sampling and math.

Vertical and Horizontal Axis of the Images

The visualizations presented here all represent a collection of documents. Each is supposed to be a pointillist image, with one point for each document. At the beginning of a document review project, before any predictive coding training has been applied to the collection, the documents are all unranked. They are relatively unknown. This is shown by the fuzzy round cloud of unknown data.

Once the machine training begins, all documents start to be ranked. In the most simplistic visualizations shown here the ranking is limited to predicted relevance or irrelevance. Of course, the predictions could be more complex, and include highly relevant and privileged, which is what I usually do. They could also include various other issue classifications, but I usually avoid this for a variety of reasons that would take us too far astray to explain.

Once the training and ranking begin, the probability grid comes into play. This grid creates both a vertical and horizontal axis. (In the future we could add a third dimension too, but let’s start simple.) The one public comment received so far stated that the vertical axis on the images, showing percentages adjacent to the words “Probable Relevant,” might give people the impression that it is the probability of a document being relevant. Well, I hope so, because that is exactly what I was trying to do!

The vertical axis shows how the documents are ranked. The horizontal axis shows the number of documents, roughly, at each ranking level. Remember, each point is supposed to represent a specific, individual document. (In the future we could add family overlays, but again, let’s start simple.) A single dot in the middle would represent one document. An empty space would represent zero documents. A wide expanse of horizontal dots would represent hundreds or thousands of documents, depending on the scale.

The diagram below visualizes a situation common when ranking has just begun and the computer is uncertain as to how to classify the documents. It classifies most in the 37.5% to 67.5% range of probable relevance. It is all about fifty-fifty at this point. This is the kind of spread you would expect to see if training began with only random sampling input. The diagram indicates that the computer does not really know much yet about the data. It does not yet have any real idea as to which documents are relevant, and which are not.


The vertical axis of the visualization is the key. It is intended to show a running grid from 99.9% probable relevant to 0.1% probable relevant. Note that 0.1% probable relevant is another way of saying 99.9% probable irrelevant, but remember, I am trying to keep this simple. More complex overlays may be more to the liking of some software users. Also note that the particular numbers I show on these diagrams are arbitrary: 0.1%, 12.5%, 25%, 37.5%, 50%, 67.5%, 75%, 87.5%, 99.9%. I would prefer to see more detail here, and perhaps a grid showing a faint horizontal line at every 10% interval. Still, the fewer lines shown here do have a nice aesthetic appeal, plus they were easier for me to create on the fly for this blog.

Again, let me repeat to be very clear. The vertical grid on these diagrams represents the probable ranking from least likely to be relevant on the bottom, to most likely on the top. The horizontal grid shows the number of documents. It is really that simple.
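As a rough sketch of what these two axes reduce to in data terms: the horizontal width of each band is simply a count of documents per probability stratum. The stratum edges below match the gridlines used in these diagrams, but the scores and the function name are hypothetical, for illustration only:

```python
from collections import Counter

def strata_counts(scores, edges=(0.125, 0.25, 0.375, 0.50, 0.675, 0.75, 0.875)):
    """Count documents per probability stratum; each count is the
    horizontal width of one band in the visualization."""
    def stratum(p):
        for i, edge in enumerate(edges):
            if p < edge:
                return i
        return len(edges)  # top stratum, 87.5% and above
    return Counter(stratum(p) for p in scores)

# Hypothetical probable-relevance scores for six documents:
scores = [0.02, 0.05, 0.48, 0.51, 0.93, 0.97]
counts = strata_counts(scores)
print(dict(counts))  # {0: 2, 3: 1, 4: 1, 7: 2}
```

Two documents sit near the bottom, two straddle the uncertain middle, and two sit near the top, which is exactly the shape the diagrams draw.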

Why Data Visualization Is Important

This kind of display of documents according to a vertical grid of probable relevance is very helpful because it allows you to see exactly how your documents are ranked at any one point in time. Just as important, it helps you to see how the alignment changes over time. This empowers you to see how your machine training impacts the distribution.

This kind of direct, immediate feedback greatly facilitates human-computer interaction (what I call, in my roughly 50 articles on predictive coding, the hybrid approach). It makes it easier for the natural human intelligence to connect with the artificial intelligence. It makes it easier for the human SMEs involved to train the computer. The humans, typically attorneys or their surrogates, are the ones with the expertise on the legal issues in the case. This visualization allows them to see immediately what impact particular training documents have upon the ranking of the whole collection. This helps them to select effective training documents. It helps them to attain the goal of separating relevant from irrelevant documents. Ideally the documents would cluster at both the bottom and top of the vertical axis.

For this process to work it is important for the feedback to be grounded in actual document review, and not be a mere intellectual exercise. Samples of documents in the various ranking strata must be inspected to verify, or not, whether the ranking is accurate. That can vary from stratum to stratum. Moreover, as everyone quickly finds out, each project is different, although certain patterns do tend to emerge. The diagrams used as an example in this blog represent one such typical pattern, although greatly compressed in time. In reality the changes shown here from one diagram to another would be more gradual and have a few unexpected bumps and bulges.

Visualizations like this will speed up the ranking and the review process. Ultimately the graphics will all be fully interactive. By clicking on any point in the graphic you will be taken to the particular document or documents it represents. Click and drag, and you are taken to the whole set of documents selected. For instance, you may want to see all documents between 45% and 55%, so you would select that range in the graphic. Or you may want to see all documents in the top 5% probable relevance ranking, so you select that top edge of the graphic. These documents will instantly be shown in the review database. Most good software already has document visualizations with similar linking capacities. So we are not reinventing the wheel here, just applying these existing software capacities to new patterns, namely to document rankings.
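A click-and-drag selection like this amounts to filtering the ranked collection by a score band. A minimal sketch, with hypothetical document ids and scores (no particular vendor's API is implied):

```python
def select_band(ranked_docs, low, high):
    """Return ids of documents whose probable-relevance score falls in
    [low, high], e.g. the uncertain 45%-55% band, or the top 5%."""
    return [doc_id for doc_id, score in ranked_docs.items()
            if low <= score <= high]

# Hypothetical probable-relevance scores keyed by document id:
ranked = {"doc1": 0.03, "doc2": 0.47, "doc3": 0.52, "doc4": 0.96}
print(select_band(ranked, 0.45, 0.55))  # ['doc2', 'doc3']
print(select_band(ranked, 0.95, 1.0))   # ['doc4']
```

The first call pulls the uncertain middle band for review; the second pulls the top edge of the graphic.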

These graphic features will allow you to easily search the ranking locations. This will in turn allow you to verify, or correct, the machine’s learning. Where you find that the documents clicked have a correct prediction of relevance, you verify by coding as relevant, or highly relevant. Where the documents clicked have an incorrect prediction, you correct by coding the document properly. That is how the computer learns. You tell it yes when it gets it right, and no when it gets it wrong.

At the beginning of a project many predictions of relevance and irrelevance will be incorrect. These errors will diminish as the training progresses, as the correct predictions are verified, and erroneous predictions are corrected. Fewer mistakes will be made as the machine starts to pick up the human intelligence. To me it seems like a mind-to-computer transference. More of the predictions will be verified, and the document distributions will start to gather at both ends of the vertical relevance axis. Since the volume of documents is represented by the horizontal axis, more documents will start to bunch together at both the top and bottom of the vertical axis. Since document collections in legal search usually contain many more irrelevant documents than relevant, you will typically see most documents on the bottom.

Visualizations of an Exemplar Predictive Coding Project

In the sample considered here we see unnaturally rapid training. It would normally take many more rounds of machine training than are shown in these four diagrams. In fact, with a continuous active training process, there could be hundreds of rounds per day. In that case the visualization would look more like an animation than a series of static images. But again, I have limited the process here for simplicity’s sake.

As explained previously, the first thing that happens to the fuzzy round cloud of unknown data, before any training begins, is that the data is processed, deduplicated, deNisted, and non-text and other documents unsuitable for analytics are removed. In addition, other documents necessarily irrelevant to this particular project are bulk-culled out: for example, ESI such as music files, some types of photos, and many email domains, like, for instance, emails from publications such as the NY Times. By good fortune, in this example exactly One Million documents remain for predictive coding.

We begin with some multimodal judgmental sampling, and with a random sample of 1,534 documents. (They are the yellow dots.) Assuming a 95% confidence level, do you know what confidence interval this creates? I asked this question before and repeat it again, as the answer will not come until the final math installment next week.

Next we assume that an SME, and/or his or her surrogates, reviewed the 1,534-document sample and found that 384 were relevant and 1,150 were irrelevant. Do you know what prevalence rate this creates? Do you know the projected range of relevant documents within the confidence interval limits of this sample? That is the most important question of all.
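For readers who want to check their own work before the answers arrive, here is one standard way such figures are computed: a normal-approximation sketch with a finite population correction, not necessarily the exact method or rounding the author will use, and the ±2.5% interval is my assumption:

```python
def sample_size(z=1.96, interval=0.025, p=0.5, population=1_000_000):
    """Classic sample-size formula with finite population correction,
    rounded to the nearest whole document. z=1.96 is the 95% confidence
    level; p=0.5 is the most conservative prevalence assumption."""
    n0 = z**2 * p * (1 - p) / interval**2
    return round(n0 / (1 + (n0 - 1) / population))

def prevalence_range(relevant, sample, interval, population):
    """Spot projection and the plus-or-minus interval range, in documents."""
    prevalence = relevant / sample
    spot = round(prevalence * population)
    low = round(max(prevalence - interval, 0.0) * population)
    high = round(min(prevalence + interval, 1.0) * population)
    return spot, low, high

print(sample_size())                                  # 1534
print(prevalence_range(384, 1534, 0.025, 1_000_000))  # (250326, 225326, 275326)
```

Note how the sample size of 1,534 falls out of the formula, and how the answer to the prevalence question is a range of documents, not a single number.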

Next we do the first round of machine training proper. The first round of training is sometimes called the seed set. Now the document ranking according to probable relevance and irrelevance begins. Again, for simplicity’s sake, we assume that the analytics is directed towards relevance alone. In fact, most projects would also include high-relevance and privilege.

In this project the data ball changed to the following distribution. Note that the lighter colors represent less density of documents. Red points represent documents coded or predicted as relevant, and blue as irrelevant. All predictive coding projects are different, and the distributions shown here are just one among near countless possibilities. Here there are already more documents trained on irrelevance than relevance. This is in spite of the fact that the active search was to find relevant documents, not irrelevant documents. This is typical of most review projects, where you have many more irrelevant than relevant documents overall, and where it is easier to spot and find irrelevant documents than relevant ones.

Next we see the data after the second round of training. The division of the collection into relevant and irrelevant is beginning to form. The largest collection of documents is the blue points at the bottom. They are the documents that the computer predicts are irrelevant based on the training to date. There is also a large collection of points shown in red at the top. They are the ones where the computer now thinks there is a high probability of relevance. Still, the computer is uncertain about the vast majority of the documents: the red in the third stratum from the top, the blue in the third stratum from the bottom, and the many in the grey, the 37.5% to 67.5% probable relevance range. Again we see an overall bottom-heavy distribution. This is a typical pattern because it is usually easier to train on irrelevance than relevance.

As noted before, the training could be continuous. Many software programs offer that feature. But I want to keep the visualizations here simple, and not make an animation, and so I do not assume here a literally continuous active learning. Personally, although I do like to keep the training continuous throughout the review, I like the actual computer training to come in discrete stages that I control. That gives me a better understanding of the impact of my machine training. The SME human trains the machine, and, in an ideal situation, the machine also trains the SME. That is the kind of feedback that these visualizations enhance.

Next we see the data after the third round of training. Again, in reality it would typically take more than three rounds of training to reach this relatively mature state, but I am trying to keep this example simple. If a project did progress this fast, it would probably be because a large number of documents were used in the prior rounds. The set of documents about which the computer is now uncertain, the grey area and the middle two brackets, is now much smaller.

The computer now has a high-probability ranking for most of the probable relevant and probable irrelevant documents. The largest number of documents are at the blue bottom, where the computer predicts they have a near zero chance of being classified relevant. Again, most of the probable predictions, those in the top and bottom three brackets, are located in the bottom three brackets. Those are the documents predicted to have less than a 37.5% chance of being relevant. Again, this kind of distribution is typical, but there can be many variances from project to project. We also see a top loading, where most of the probable relevant documents are in the top 12.5% ranking. In other words, they have an 87.5% probable relevance ranking, or higher.

Next we see the data after the fourth round of training. It is an excellent distribution at this point. There are relatively few documents in the middle. This means there are relatively few documents about which the computer is uncertain as to their probable classification. This pattern is one factor among several to consider in deciding whether further training and document review are required to complete your production.

Another important metric to consider is the total number of documents found to be probably relevant, and a comparison with the random sample prediction. Here is where math comes in, and an understanding of what random sampling can and cannot tell you about the success of a project. You consider the spot projection of total relevance based on your initial prevalence calculation, but, much more important, you consider the actual range of documents under the confidence interval. That is what really counts when dealing with prevalence projections and random sampling. That is where the plus-or-minus confidence interval comes into play, as I will explain in detail in the third and final installment of this blog.

In the meantime, here is the document count of the distribution roughly pictured in the final diagram above, which to me looks like an upside down, fragile champagne glass. We see that exactly 250,000 documents have a 50% or higher probable relevance ranking, and 750,000 documents have a 49.9% or less probable relevance ranking. Of the probable relevant documents, 15,000 fall in the 50% to 67.5% range. Another 10,000 documents fall in the 37.5% to 49.9% probable relevance range. Again, this is also fairly common, as we often see fewer on the barely irrelevant side than we do on the barely relevant side. As a general rule I review with humans all documents that are 50% or higher probable relevance, and do not review the rest. I do, however, sample and test the rest, the documents with less than a 50% probable relevance ranking. Also, in some projects I review far less than the top 50%. That all depends on proportionality constraints, and on document ranking distribution, the kind of distributions that these visualizations will show.

In addition to this metrics analysis, another important factor to consider in deciding whether our search and review efforts are now complete is how much change in ranking there has been from one training round to the next. Sometimes there may be no change at all. Sometimes there may only be very slight changes. If the changes from the last round are large, that is an indication that more training should still be tried, even if the distribution already looks optimal, as we see here.
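One simple way to quantify round-to-round change is to compare each document's score across rounds: the average score movement, plus a count of documents that crossed the 50% line. A sketch with hypothetical scores (the metric itself is my illustration, not a standard vendor feature):

```python
def ranking_drift(prev_scores, curr_scores, threshold=0.5):
    """Mean absolute score change between two training rounds, plus how
    many documents crossed the probable-relevance threshold."""
    deltas = [abs(curr_scores[d] - prev_scores[d]) for d in prev_scores]
    crossed = sum(
        1 for d in prev_scores
        if (prev_scores[d] >= threshold) != (curr_scores[d] >= threshold)
    )
    return sum(deltas) / len(deltas), crossed

# Hypothetical scores for four documents after rounds three and four:
round3 = {"a": 0.40, "b": 0.55, "c": 0.90, "d": 0.10}
round4 = {"a": 0.60, "b": 0.52, "c": 0.95, "d": 0.05}
drift, crossed = ranking_drift(round3, round4)
print(round(drift, 4), crossed)  # 0.0825 1  (only "a" crossed the 50% line)
```

Near-zero drift and no threshold crossings would support a decision to stop training; large values argue for another round.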

Another even more important quality control factor is how correct the computer has been in the last few rounds of its predictions. Ideally, you want to see the rate of error decreasing to a point where you see no errors in your judgmental samples. You want your testing of the computer’s predictions to show that it has attained a high degree of precision. That means there are few documents predicted relevant that actual review by human SMEs shows are in fact irrelevant. This kind of error is known as a False Positive. Much more important to quality evaluation is the discovery of documents predicted irrelevant that are actually relevant. This kind of error is known as a False Negative. The False Negatives are your real concern in most projects because legal search is usually focused on recall, not precision, at least within reason.
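The two error types map directly onto the standard precision and recall formulas. A minimal illustration with hypothetical counts:

```python
def precision_recall(tp, fp, fn):
    """Precision is lowered by False Positives; recall is lowered by
    False Negatives, the errors that matter most in legal search."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical QC sample: 90 correctly predicted relevant documents,
# 10 False Positives, and 30 False Negatives found in testing.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)  # 0.9 0.75
```

Here precision looks healthy at 90%, but recall of 75% means a quarter of the relevant documents were predicted irrelevant, exactly the kind of gap that would call for more training.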

The final distinction to note in quality control is one that might seem subtle, but really is not. You must also factor in relevance weight. You never want a False Negative to be a highly relevant document. If that happens to me, I always commence at least one more round of training. Even missing a document that is not highly relevant, not hot, but is a strongly relevant document, and one of a type not seen before, is typically a cause for further training. This is, however, not an automatic rule, as it is with the discovery of a hot document. It depends on a variety of factors having to do with relevance analysis of the particular case and document collection.

In our example we are going to assume that all of the quality control indicators are positive, and a decision has been made to stop training and move on to a final random sample test.

A second random sample comes next. That test and visualization will be provided next week, along with the promised math and sampling analysis.

Math Quiz

In part one, and again here, I asked some basic math questions on random sampling, prevalence, and recall. So far no one has attempted to answer the questions posed. Apparently, most readers here do not want to be tested. I do not blame them. This is also what I find in my online training program, where only a small percentage of the students who take the program elect to be tested. That is fine with me, as it means one less paper to grade, and most everyone passes anyway. I do not encourage testing. You know if you get it or not. Testing is not really necessary.

The same applies to answering math questions in a public blog. I understand the hesitancy. Still, I hope many privately tried, or will try, to solve the questions and came up with the correct answers. In part three of this blog I will provide the answers, and you will know for sure if you got it right. There is still plenty of time to try to figure it out on your own. The truly bold can post their answers in the comments below. Of course, this is all pretty basic stuff to true experts of this craft. So, to my fellow experts out there, you have yet another week to take some time and strut your stuff by sharing the obvious answers. Surely I am not the only one in the e-discovery world bold enough to put their reputation on the line by sharing their opinions and analysis in public for all to see (and criticize). Come on. I do it every week.

Math and sampling are important tools for quality control, but as Professor Gordon Cormack, a true wizard in the area of search, math, and sampling likes to point out, sampling alone has many inherent limitations. Gordon insists, and I agree, that sampling should only be part of a total quality control program. You should never just rely on random sampling alone, especially in low prevalence collections. Still, when sampling, prevalence, and recall are included as part of an overall QC effort, the net effect is very reassuring. Unless I know that I have an expert like Gordon on the other side, and so far that has never happened, I want to see the math. I want to know about all of the quality control and quality assurance steps taken to try to find the information requested. If you are going to protect your client, you need to learn this too, or have someone at hand who already knows it.

This kind of math, sampling, and other process disclosures should convince even the most skeptical adversary or judge. That is why it is important for all attorneys involved with legal search to have a clear mathematical understanding of the basics. Visualizations alone are inadequate, but, for me at least, visualizations do help a lot. All kinds of data visualizations, not just the ones presented here, provide important tools to help lawyers understand how a search project is progressing.

Challenge to Software Vendors

The simplicity of the design of the idea presented here is a key part of the power and strength of the visualization. It should not be too difficult to write code to implement this visualization. We need this. It will help users to better understand the process. It will not cost too much to implement, and what it does cost should be recouped soon in higher sales. Come on vendors, show me you are listening. Show me you get it. If you have a software demo that includes this feature, then I want to see it. Otherwise not.

All good predictive coding software already ranks the probable relevance of documents, so we are not talking about an enormous coding project. This feature would simply add a visual display to calculations already being made. I could make these calculations myself by hand using an Excel spreadsheet, but that is time-consuming and laborious. This kind of visualization lends itself to computer generation.

I have many other ideas for predictive coding features, including other visualizations, that are much more complex and challenging to implement. This simple grid explained here is an easy one to implement, and will show me, and the rest of our e-discovery community, who the real leaders are in software development.


The primary goal of the e-Discovery Team blog is educational, to help lawyers and other e-discovery professionals. In addition, I am trying to influence what services and products are provided in e-discovery, both legal and technical. In this blog I am offering an idea to improve the visualizations that most predictive coding software already provides. I hope that all vendors will include this feature in future releases of their software. I have a host of additional ideas to improve legal search and review software, especially the kind that employs active machine learning. They include other, much more elaborate visualization schemes, some of which have been alluded to here.

Someday I may have time to consult on all of the other, more complex ideas, but, in the meantime, I offer this basic idea for any vendor to try out. Until vendors start to implement even this basic idea, anyone can at least use their imagination, as I now do, to follow along. These kinds of visualizations can help you to understand the impact of document ranking on your predictive coding review projects. All it takes is some idea as to the number of documents in the various probable relevance ranking strata. Try it on your next predictive coding project, even if it is just rough images from your own imagination (or Excel spreadsheet). I am sure you will see for yourself how helpful this can be to monitor and understand the progress of your work.



Visualizing Data in a Predictive Coding Project

November 9, 2014

This blog will share a new way to visualize data in a predictive coding project. I only include a brief description this week. Next week I will add a full description of this project. Advanced students should be able to predict the full text from the images alone. Study the images and try to figure out the details of what is going on.

Soon all good predictive coding software will include visualizations like this to help searchers to understand the data. The images can be automatically created by computer to accurately visualize exactly how the data is being analyzed and ranked. Experienced searchers can use this kind of visual information to better understand what they should do next to efficiently meet their search and review goals.

For a game, try to figure out the high and low number of relevant documents that you must find in this review project to claim that you have a 95% confidence level of having found all relevant documents, the mythical total recall. This high-low range will be wrong one time out of twenty; that is what the 95% confidence level means, but still, this knowledge is helpful. The correct answer to questions of recall and prevalence is always a high-low range of documents, never just one number, and never a percentage. Also, there are always confidence level caveats. Still, with these limitations in mind, for extra points, state what the spot projection is for prevalence. These illustrations and short descriptions provide all of the information you need to calculate these answers.

The project begins with a collection of documents here visualized by the fuzzy ball of unknown data.


Next the data is processed, deduplicated, deNisted, and non-text and other documents unsuitable for analytics are removed. By good fortune exactly One Million documents remain.


We begin with some multimodal judgmental sampling, and with a random sample of 1,534 documents. Assuming a 95% confidence level, what confidence interval does this create?


Assume that an SME reviewed the 1,534 sample and found that 384 were relevant and 1,150 were irrelevant.


Training Begins

Next we do the first round of machine training. The first round of training is sometimes called the seed set. Now the document ranking according to probable relevance and irrelevance begins. To keep it simple we only show the relevance ranking, and not also the irrelevance metrics display. The top represents 99.9% probable relevance. The bottom represents the inverse, 0.1% probable relevance. Put another way, the bottom would represent 99.9% probable irrelevance. For simplicity’s sake we also assume that the analytics is directed towards relevance alone, whereas most projects would also include high-relevance and privilege. In this project the data ball changed to the following distribution. Note that the lighter colors represent less density of documents. Red points represent documents coded or predicted as relevant, and blue as irrelevant. All predictive coding projects are different, and the distributions shown here are just one among near countless possibilities.


Next we see the data after the second round of training. Note that the training could, with most software, be continuous. But I like to control when the training happens in order to better understand the impact of my machine training. The SME human trains the machine, and, in an ideal situation, the machine also trains the SME. The human SME understands how the machine is learning. The SME learns where the machine needs the most help to tune into their conception of relevance. This kind of cross-communication makes it easier for the artificial intelligence to properly boost the human intelligence.


Next we see the data after the third round of training. The machine is learning very quickly. In most projects it takes longer than this to attain this kind of ranking distribution. What does this tell us about the number of documents between rounds of training?


Now we see the data after the fourth round of training. It is an excellent distribution, and so we decide to stop; a second random sample comes next. That visualization, and a full description of the project, will be provided next week. In the meantime, leave your answers to the questions in the comments below. This is a chance to strut your stuff. If you prefer, send me your answers, and questions, by private email.


Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part Two

November 2, 2014

This is the second part of a two-part blog; please read part one first.

AI-Enhanced Big Data Search Will Greatly Simplify Information Governance

Information Governance is, or should be, all about finding the information you need, when you need it, and doing so in a cheap and efficient manner. Information needs are determined by both law and personal preferences, including business operation needs. In order to find information, you must first have it. Not only that, you must keep it until you need it. To do that, you need to preserve the information. If you have already destroyed information, really destroyed it I mean, not just deleted it, then obviously you will not be able to find it. You cannot find what does not exist, as all Unicorn chasers eventually find out.

This creates a basic problem for Information Governance, because the whole system is based on the notion that the best way to find valuable information is to destroy worthless information. Much of Information Governance is devoted to trying to determine what information is a valuable needle, and what is worthless chaff. This is because everyone knows that the more information you have, the harder it is for you to find the information you need. The idea is that too much information will cut you off. These maxims were true in the pre-AI-enhanced search days, but are, IMO, no longer true today, or, at least, will not be true in the next five to ten years, maybe sooner.

In order to meet the basic goal of finding information, Information Governance focuses its efforts on the proper classification of information. Again, the idea was to make it simpler to find information by preserving some of it, the information you might need to access, and destroying the rest. That is where records classification comes in.

The question of what information you need has a time element to it. The time requirements are again based on personal and business operations needs, and on thousands of federal, state and local laws. Information governance thus became a very complicated legal analysis problem. There are literally thousands of laws requiring certain types of information to be preserved for various lengths of time. Of course, you could comply with most of these laws by simply saving everything forever, but, in the past, that was not a realistic solution. There were severe limits on the ability to save information, and on the ability to find it. Also, it was presumed that the older information was, the less value it had. Almost all information was thus treated like news.

These ideas were all firmly entrenched before the advent of Big Data and AI-enhanced data mining. In fact, in today’s world there is good reason for Google to save every search, ever done, forever. Some patterns and knowledge only emerge in time and history. New information is sometimes better information, but not necessarily so. In the world of Big Data all information has value, not just the latest.

These records life-cycle ideas all made perfect sense in the world of paper information. It cost a lot of money to save and store paper records. Everyone with a monthly Iron Mountain paper records storage bill knows that. Even after the computer age began, it still cost a fair amount of money to save and store ESI. The computers needed to maintain digital storage used to be very expensive. Finding the ESI you needed quickly on a computer was still very difficult and unreliable. All we had at first was keyword search, and that was very ineffective.

Due to the costs of storage, and the limitations of search, tremendous efforts were made by record managers to try to figure out what information was important, or needed, either from a legal perspective, or a business necessity perspective, and to save that information, and only that information. The idea behind Information Management was to destroy the ESI you did not need or were not required by law to preserve. This destruction saved you money, and it also made possible the whole point of Information Governance: to find the information you wanted, when you wanted it.

Back in the pre-AI search days, the more information you had, the harder it was to find the information you needed. That still seems like common sense. Useless information was destroyed so that you could find valuable information. In reality, with the new and better algorithms we now have for AI-enhanced search, it is just the reverse. The more information you have, the easier it becomes to find what you want. You now have more information to draw upon.

That is the new reality of Big Data. It is a hard intellectual paradigm shift to make, and seems counter-intuitive. It took me a long time to get it. The new ability to save and search everything cheaply and efficiently is what is driving the explosion of Big Data services and products. As the save everything, find anything way of thinking takes over, the classification and deletion aspects of Information Governance will naturally dissipate. The records lifecycle will transform into virtual immortality. There is no reason to classify and delete, if you can save everything and find anything at low cost. The issues simplify; they change to how to save and search, although new collateral issues of security and privacy grow in importance.

Save and Search v. Classify and Delete

The current clash in basic ideas concerning Big Data and Information Governance is confusing to many business executives. According to Gregory Bufithis who attended a recent event in Washington D.C. on Big Data sponsored by EMC, one senior presenter explained:

The C Suite is bedeviled by IG and regulatory complexity. … 

The solution is not to eliminate Information Governance entirely. The reports of its complete demise, here or elsewhere, are exaggerated. The solution is to simplify IG. To pare it down to save and search. Even this will take some time, like I said, from five to ten years, although there is some chance this transformation of IG will go even faster than that. This move away from complex regulatory classification schemes, to simpler save and search everything, is already being adopted by many in the high-tech world. To quote Greg again from the private EMC event in D.C. in October, 2014:

Why data lakes? Because regulatory complexity and the changes can kill you. And are unpredictable in relationship to information governance. …

So what’s better? Data lakes coupled with archiving. Yes, archiving seems emblematic of “old” IT. But archiving and data lifecycle management (DLM) have evolved from a storage focus, to a focus on business value and data loss prevention. DLM recognizes that as data gets older, its value diminishes, but it never becomes worthless. And nobody is throwing out anything and yes, there are negative impacts (unnecessary storage costs, litigation, regulatory sanctions) if not retained or deleted when it should be.

But … companies want to mine their data for operational and competitive advantage. So data lakes and archiving their data allows for ingesting and retain all information types, structured or unstructured. And that’s better.

Because then all you need is a good search platform or search system … like Hadoop which allows you to sift through the data and extract the chunks that answer the questions at hand. In essence, this is a step up from OLAP (online analytical processing). And you can use “tag sift sort” programs like Data Rush. Or ThingWorx which is an approach that monitors the stream of data arriving in the lake for specific events. Complex event processing (CEP) engines can also sift through data as it enters storage, or later when it’s needed for analysis.

Because it is all about search.
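The stream-sifting idea in the quote above, where a CEP engine monitors data as it arrives in the lake and pulls out events of interest, can be sketched in miniature. This is only an illustrative toy; the event records and field names are invented, not taken from Hadoop, DataRush, or ThingWorx:

```python
# Toy sketch of complex event processing (CEP): sift a stream of
# arriving events, keeping only those that match a predicate.
# The event records and fields below are hypothetical examples.

def sift(stream, predicate):
    """Yield only the events of interest as they flow past."""
    for event in stream:
        if predicate(event):
            yield event

incoming = [
    {"type": "email", "subject": "quarterly report"},
    {"type": "sensor", "reading": 72},
    {"type": "email", "subject": "merger agreement"},
]

# Monitor the stream for one specific kind of event.
emails = list(sift(incoming, lambda e: e["type"] == "email"))
print(len(emails))  # 2
```

The same predicate-over-a-stream pattern works whether the events are filtered as they enter storage or later, when needed for analysis.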

Recent Breakthroughs in Artificial Intelligence
Make Possible Save Everything, Find Anything

The New York Times in an opinion editorial this week discussed recent breakthroughs in Artificial Intelligence and speculated on alternative futures this could create. Our Machine Masters, NY Times Op-Ed, by David Brooks (October 31, 2014). The Times article quoted extensively from another article in the current issue of Wired by technology blogger Kevin Kelly: The Three Breakthroughs That Have Finally Unleashed AI on the World. Kelly argues, as do I, that artificial intelligence has now reached a breakthrough level. This artificial intelligence breakthrough, Kevin Kelly argues, and David Brooks agrees, is driven by three things: cheap parallel computation technologies, big data collection, and better algorithms. The upshot is clear in the opinion of both Wired and the New York Times: “The business plans of the next 10,000 start-ups are easy to forecast: Take X and add A.I. This is a big deal, and now it’s here.”

These three new technology advances change everything. The Wired article goes into the technology and financial aspects of the new AI; it is where the big money is going and will be made in the next few decades. If Wired is right, then this means in our world of e-discovery, companies and law firms will succeed if, and only if, they add AI to their products and services. The firms and vendors who add AI to document review, and project management, will grow fast. The non-AI enhanced vendors, non-AI enhanced software, will go out of business. The law firms that do not use AI tools will shrink and die.

The Times article by David Brooks goes into the sociological and philosophical aspects of the recent breakthroughs in Artificial Intelligence:

Two big implications flow from this. The first is sociological. If knowledge is power, we’re about to see an even greater concentration of power.  … [E]ngineers at a few gigantic companies will have vast-though-hidden power to shape how data are collected and framed, to harvest huge amounts of information, to build the frameworks through which the rest of us make decisions and to steer our choices. If you think this power will be used for entirely benign ends, then you have not read enough history.

The second implication is philosophical. A.I. will redefine what it means to be human. Our identity as humans is shaped by what machines and other animals can’t do. For the last few centuries, reason was seen as the ultimate human faculty. But now machines are better at many of the tasks we associate with thinking — like playing chess, winning at Jeopardy, and doing math. [RCL – and, you might add, better at finding relevant evidence.]

On the other hand, machines cannot beat us at the things we do without conscious thinking: developing tastes and affections, mimicking each other and building emotional attachments, experiencing imaginative breakthroughs, forming moral sentiments. [RCL – and, you might add, better at equitable notions of justice and at legal imagination.]

In this future, there is increasing emphasis on personal and moral faculties: being likable, industrious, trustworthy and affectionate. People are evaluated more on these traits, which supplement machine thinking, and not the rote ones that duplicate it.

In the cold, utilitarian future, on the other hand, people become less idiosyncratic. If the choice architecture behind many decisions is based on big data from vast crowds, everybody follows the prompts and chooses to be like each other. The machine prompts us to consume what is popular, the things that are easy and mentally undemanding.

I’m happy Pandora can help me find what I like. I’m a little nervous if it so pervasively shapes my listening that it ends up determining what I like. [RCL – and, you might add, determining what is relevant, what is fair.]

I think we all want to master these machines, not have them master us.

Although I share the concerns of the NY Times about mastering machines and alternative future scenarios, my analysis of the impact of the new AI is focused and limited to the Law. Lawyers must master the AI-search for evidence processes. We must master and use the better algorithms, the better AI-enhanced software, not vice versa. The software does not, nor should it, run itself. Easy buttons in legal search are a trap for the unwary, a first step down a slippery slope to legal dystopia. Human lawyers must never over-delegate our uniquely human insights and abilities. We must train the machines. We must stay in charge and assert our human insights on law, relevance, equity, fairness and justice, and our human abilities to imagine and create new realities of justice for all. I want lawyers and judges to use AI-enhanced machines, but I never want to be judged by a machine alone, nor have a computer alone as a lawyer.

The three big new advances that are allowing better and better AI are nowhere near to threatening the jobs of human judges or lawyers, although they will likely reduce their numbers, and certainly will change their jobs. We are already seeing these changes in Legal Search and Information Governance. Thanks to cheap parallel computation, we now have Big Data Lakes stored in thousands of inexpensive, cloud computers that are operating together. This is where open-source software like Hadoop comes in. It makes the big clusters of computers possible. Better algorithms are where better AI-enhanced software comes in. This makes it possible to use predictive coding effectively and inexpensively to find the information needed to resolve lawsuits. The days of vast numbers of document reviewer attorneys doing linear review are numbered. Instead, we will see a few SMEs, working with small teams of reviewers, search experts, and software experts.

The role of Information Managers will also change drastically. Because of Big Data, cheap parallel computing, and better algorithms, it is now possible to save everything, forever, at a small cost, and to quickly search and find what you need. The new reality of Save Everything, Find Anything undercuts most of the rationale of Information Governance. It is all about search now.


Now that storage costs are negligible, and search far more efficient, the twin motivators of Information Science to classify and destroy are gone, or soon will be. The key remaining tasks of Information Governance are now preservation and search, plus relatively new ones of security and privacy. I recognize that the demise of the importance of destruction of ESI could change if more governments enact laws that require the destruction of ESI, like the EU has done with Facebook posts and the so-called “right to be forgotten law.” But for now, most laws are about saving data for various times, and do not require data be destroyed. Note that the new Delaware law on data destruction still keeps it discretionary on whether to destroy personal data or not. House Bill No. 295 – The Safe Destruction of Documents Containing Personal Identifying Information. It only places legal burdens and liability for failures to properly destroy data. This liability for mistakes in destruction serves to discourage data destruction, not encourage it.

Preservation is not too difficult when you can economically save everything forever, so the challenging task remaining is really just one of search. That is why I say that Information Governance will become a sub-set of search. The save everything forever model will, however, create new legal work for lawyers. The cybersecurity protection and privacy aspects of Big Data Lakes are already creating many new legal challenges and issues. More legal issues are sure to arise with the expansion of AI.

Automation, including this latest Second Machine Age of mental process automation, does not eliminate the need for human labor. It just makes our work more interesting and opens up more time for leisure. Automation has always created new jobs as fast as it has eliminated old ones. The challenge for existing workers like ourselves is to learn the new skills necessary to do the new jobs. For us e-discovery lawyers and techs, this means, among other things, acquiring new skills to use AI-enhanced tools. One such skill, the ability for HCIR, human computer information retrieval, is mentioned in most of my articles on predictive coding. It involves new skill sets in active machine learning to train a computer to find the evidence you want from large collections of data sets, typically emails. When I was a law student in the late 1970s, I could never have dreamed that this would be part of my job as a lawyer in 2014.
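The active machine learning skill described above, training a computer to recognize and rank probable relevance from a lawyer's example judgments, can be illustrated with a deliberately tiny sketch. Everything here is a hypothetical stand-in: the four documents, the crude word-weight scorer, and the two seed training calls bear no resemblance to any real predictive coding product, but they show the basic train, score, and rank loop:

```python
# Minimal sketch of an active machine learning (predictive coding)
# training loop. All names and documents here are invented examples.

from collections import defaultdict

def tokenize(text):
    return text.lower().split()

class RelevanceRanker:
    """Toy scorer: each word seen in a relevant training document
    gains weight, each word from an irrelevant one loses weight.
    A document's score is its average word weight."""

    def __init__(self):
        self.weights = defaultdict(float)

    def train(self, text, relevant):
        delta = 1.0 if relevant else -1.0
        for word in set(tokenize(text)):
            self.weights[word] += delta

    def score(self, text):
        words = tokenize(text)
        return sum(self.weights[w] for w in words) / max(len(words), 1)

    def rank(self, docs):
        # Highest probable-relevance documents first.
        return sorted(docs, key=self.score, reverse=True)

docs = [
    "merger agreement draft attached for review",
    "lunch menu for friday picnic",
    "revised merger terms and agreement schedule",
    "fantasy football picks this week",
]

ranker = RelevanceRanker()
# Round one: the human reviewer (SME) codes two seed examples,
# and the machine learns from those judgments.
ranker.train("merger agreement terms", relevant=True)
ranker.train("football picnic lunch", relevant=False)

ranking = ranker.rank(docs)
print(ranking[0])  # the document predicted most likely relevant
```

In real HCIR work this loop iterates: the human reviews the machine's top-ranked or most uncertain documents, corrects its predictions, and retrains, round after round, until the ranking is good enough to rely on.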

The new jobs do not rely on physical or mental drudgery and repetition. Instead, they put a premium on what makes us distinctly human: our deep knowledge, understanding, wisdom, and intuition; our empathy, caring, love and compassion; our morality, honesty, and trustworthiness; our sense of justice and fairness; our ability to change and adapt quickly to new conditions; our likability, good will, and friendliness; our imagination, art, wisdom, and creativity. Yes, even our individual eccentricities, and our all-important sense of humor. No matter how far we progress, let us never lose that! Please be governed accordingly.

e-Discovery Industry Reaction to Microsoft’s Offer to Purchase Equivio for $200 Million – Part Two

October 19, 2014

This is part two of an article providing an e-discovery industry insider’s view of the possible purchase of Equivio by Microsoft. Please read Part One first. So far the acquisition by Microsoft is still just a rumor, but do not be surprised if it is officially announced soon.

Another e-discovery insider has agreed to go public with his comments, and three more anonymous submissions were received. Let’s begin with these quotes, and then I will move onto some analysis and opinions on this deal and the likely impact on our industry.

More Industry Insider Comments


Jon Kerry-Tyerman (VP, Business Development, Everlaw): “If you think about this potential acquisition in the context of the EDRM, it makes a lot of sense. The technological issues on the left-hand side—from Information Governance through Preservation and Collection—are primarily search-related, rather than discovery-related.  And the technology behind search is largely a problem that’s been solved. That’s why we see these tasks being commoditized by the providers of the systems on which these data reside, entrenched players like Microsoft and Google. Microsoft has already shown a willingness to wade deeper here (see, e.g., Matter Center for Office 365), so the acquisition of Equivio’s expertise to improve document search and organization within the enterprise is a logical extension of that strategy.

I don’t think, however, that this heralds an expansion by Microsoft into the wider “ediscovery” space. The tasks on the right-hand side of the EDRM—particularly Review through Presentation—depend on expert legal judgment. While technology cannot supplant that judgment, it can be used to augment it. Doing so effectively, however, requires a nuanced understanding of the unique legal and technological problems underlying these tasks, and the resulting solutions are not easily applicable to other domains. For a big fish like Microsoft, that’s simply too small a pond in which to swim. It happens to be the perfect environment for a technology startup, however, which is why we’re focusing exclusively on applying cutting-edge computer science to the right-hand side of the EDRM—including our proprietary (read: non-Equivio!) predictive coding system.”

Anonymous One (a tech commentator not in the e-discovery world provides an interesting outsider view): “I read the commentary and found it to be fairly eDiscovery introspective.  What I think is:

  1. I don’t know the Equivio markets as well as I should. I thought Equivio was/is a classification engine that did a wonderful job of deduplication of email threads. They played in the eDiscovery markets and we don’t focus on these markets except for their relevance to information governance.
  2. Equivio lacked a coherent strategy to integrate to the Microsoft stack, at the level of managed metadata, content types, and site provisioning, which doomed them to bit player status unless someone acquired them or they committed to tight integration with the hybrid SharePoint/Office 365/Exchange/OneDrive/Yammer/Delve/File Share stack for unified content governance. Now someone has. Hats off to Warwick & Co. for $200MM for this.
  3. My expectation is that Equivio will be added into Office 365 and Delve to crawl through everything you own and classify it, launching whatever processes you want. This is not good news for Content Analyst, dataglobal, Nuix, HP Autonomy, or Google, except that Google and HP are able to stand on their own. It is also not good news, but less bad news for Concept Searching, Smart Logic and BA Insight, in that they leverage SharePoint and Office 365 Search and extend it with integration points and connectors to other systems.
  4. Microsoft is launching Matter Center at LegalTech in NYC in February after announcing it at ILTA. This is the first of the vertical solutions that begin the long journey of companies to adopt either the Microsoft or Google cloud solution stacks and abandon the isolated silos of information like Box, Dropbox, etc., for the corporate side of information management.”


Anonymous Two: “It’s an interesting move for Microsoft. $200M is a little high for tools in our industry, but is peanuts for them. They make dozens of these types of moves and spend billions each year acquiring various companies and technologies. I agree with Craig Ball regarding how many times have we seen formidable competitors go the way of the Dodo after they were purchased by a bigger company. I highly doubt they are planning to jump into our industry to lock horns with all of us. It is more likely that they may be developing some sort of Information Governance & Analysis offering for businesses, which could have some downstream effects on eDiscovery.”


Anonymous Three: “The acquisition of Equivio by Microsoft and the price paid are not a complete surprise. I agree with others who do not see this as a sign of Microsoft entering the ediscovery business. If Microsoft wanted to do that it could acquire any of the big ediscovery players out there. Rather, the Equivio acquisition allows Microsoft to offer a service that other big data companies cannot. Putting aside HP’s acquisition of Autonomy, I think Microsoft’s acquisition of Equivio is only the first of what will be a series of technology acquisitions by big data companies. These companies, that handle terabytes upon terabytes of data for major corporations around the world, can one day provide ediscovery as an additional service offering. That day isn’t today, but it is coming.”

What Microsoft Will Do With Equivio

The consensus view is that after the purchase Microsoft will essentially disband Equivio and absorb its technology, its software designs, and some of its experts. Then, as Craig Ball predicts, they will wander the halls of Redmond like the great cynic Diogenes. No one seems to think that Microsoft will continue Equivio’s business. For that reason it would make no sense for Microsoft to continue to license the Equivio search technologies to e-discovery companies. That in turn means a large part of the e-discovery industry that now depends on Equivio search components, and licenses with Equivio, will soon be out of luck. Zoom will go boom! More on that later.

If Microsoft did not buy Equivio to continue its business, why did it want its technology? As the scientists I talked to all told me, Microsoft already has plenty of artificial intelligence based text search capabilities, software, and patents. But maybe they are not designed for searching through large disorganized corporate datasets, such as email? Maybe their software in this area is not nearly as good as Equivio’s. As smart and talented as my scientist friends seem to think Microsoft is, the company seems to have a black hole of incompetence when it comes to email search and other aspects of information management.

The consensus view is that Microsoft wants Equivio to grab its technology and patents (at least one commentator thought they were also after Equivio’s customers). The Microsoft plan is probably to incorporate Equivio’s software code into various existing Microsoft products and new products under development. Almost no one expects those new products to be e-discovery specific. They might, however, help provide a discovery search overlay to existing software. Outlook, for instance, has pathetic search capacities that frustrate millions daily. Maybe they will add better e-discovery aspects to that. I personally expect (hope) they will do that.

Information Governance Is Now King

I also agree with the consensus view in our industry, a view that is now preoccupied with Information Governance, that Microsoft’s new products using Equivio technology will be information governance products. I expect Microsoft to once again follow IBM and focus on the left side of EDRM. I expect Microsoft to come out with new Governance type products and software module add-ons. I do not think that Microsoft will go into litigation support specific products, such as document review software, nor litigation search oriented products. Like IBM, they think it is still too small a market and too specialized a market.

Bottom line, Microsoft is not interested in entering the e-discovery specific market at this time, any more than IBM is. Instead, like most (but not all, especially Google) of the smart creatives of the technology world, Microsoft has bought into the belief that information is something that can be governed, can be managed. They think that Information Governance is like paper records management, just with more zeros after the number of records involved. The file-everything librarian mentality lives on, or tries to.

The Inherent Impossibility, in the Long Run, of Information Governance

Most of the e-discovery world now believes that Information Governance is not only possible, but that it is the savior to the information deluge that floods us all. I disagree, especially in the long run. I appear to be a lone dissenting voice on this in e-discovery. I think the establishment majority in our industry is deluding themselves into thinking that information is like paper, only there is more of it. They delude themselves into thinking that information is capable of being governed, just like so many little paper soldiers in an army. I say the Emperor has no clothes: information cannot be governed.


Electronic Information is a totally new kind of force, something Mankind has never seen before. Digital Information is a Genie out of the bottle. It cannot be captured. It cannot be managed. It certainly cannot be governed. It cannot even be killed. Forget about trying to put it back in the bottle. It is breeding faster than even Star Trek’s Tribbles could imagine. As Baron and Paul discussed in their important 2007 law review article, ESI is like a new Universe, and we are living just moments after the Big Bang. George L. Paul and Jason R. Baron, Information Inflation: Can the Legal System Adapt? 13 RICH. J.L. & TECH. 10 (2007).



What few outside of Google, Baron, and Paul seem to grasp is that Information has a life of its own. Id. at FN 30 (quoting Ludwig Wittgenstein (a 20th Century Austrian philosopher whom I was forced to study while in college in Vienna): “[T]o imagine a language is to imagine a form of life.”) Electronic information is a new and unique life form that defies all attempts at limitation, much less governance. As James Gleick observed in his book on information science, everything is a form of information. The Universe itself is a giant computer and we are all self-evolving algorithms. Gleick, The Information: A History, a Theory, a Flood.

Essentially information is free, and wants to be free. It does not want to be governed, or charged for. Information is more useful when free and when it is not subject to transitory restraints.


Regardless of the economic aspects, and whether information really wants to be free or not, as a practical matter Information cannot be governed, even if some of it can be commoditized. Information is moving and growing far too fast for governance.

Digitized information is like a nuclear reaction that has passed the point of no return. The chain reaction has been triggered. This is what exponential growth really means. In time such fission vision will be obvious. Even people without Google glasses will be able to see it.


In the meantime we have a new breed of information governance experts running around who serve like heroic bomb squads. Some know that it is just a noble quest, doomed to failure. Most do not. They helicopter into corporate worlds attempting to defuse ticking information bombs. They build walls around it. They confidently set policies and promulgate rules. They talk sternly about enforcement of rules. They automate filing. They automate deletion. Some are even starting to make robot file clerks.

Information governance experts, just like the records managers before them, are all working diligently to try to solve today’s problems of information management. But, all the while, ever new problems encroach upon their walls. They cannot keep up with this growth, the new forms of information. The next generation of exponential growth builds faster than anyone can possibly govern. Do they not know that the bomb has already exploded? That the tipping point has already passed?

Information governance policies that are being created today are like sand castles built at low tide. Can you hear the next wave of data generated by the Internet of Things? It will surely wash away all of today’s efforts. There will always be more data, more unexpected new forms of information. Governance of information is a dream, a Don Quixote quest.

Information cannot be governed. It can only be searched.

In my view we should focus on search technologies, and give up on governance. Or at least realize it is a mere stop-gap measure. In the world I see, search is king, not governance. Do not waste your valuable time and effort trying to file information. Just search for it, when and if you need it. You will not need most of it anyway.

I do not really think Microsoft has the fission vision, but I could be wrong. They may well see the world like I do, and like Google does, and realize that it is all search now. Microsoft may already understand that information governance is just a subset of search, not vice versa. Maybe Microsoft is already focused on creating great new search software that will help us transition from governance to search. Maybe they hope to remain relevant in the future and to compete with Google. No one knows for sure the full thinking behind Microsoft’s decision to buy Equivio.

The majority of experts are probably right: Microsoft probably does have information governance software in mind in buying Equivio. Microsoft probably still hangs onto the governance world view, and does not see it my way, or Google’s way, that it is all about search. Still, by buying good search code from Equivio, Microsoft cannot go wrong. Eventually, after the governance approach fails, which I predict will happen in ten years, or less, and Microsoft and the governance experts finally see the world like Google and me, it will help to have Equivio’s code as a foundation.

What Happens If Zoom Goes Boom?

In the short-term what companies may be adversely affected by the exit of Equivio from the e-discovery market? I had first thought that K-Cura would be adversely impacted, but apparently that’s wrong. You can see why I was confused: when you look at Equivio’s installed base web page, Equivio features K-Cura and its Relativity review platform. Equivio even includes a page on its website that promotes the Equivio Zoom tab on Relativity’s software. Nevertheless, K-Cura insists that it does not have anything built on Equivio’s technology. K-Cura states that its analytics engine is OEM from another company, Content Analyst. For that reason, it says its products will not be affected that much if Zoom is no longer a plug-in.

K-Cura says that its relationship with Equivio is simply that of a Relativity developer partner. It allows Equivio to develop an integration that allows users of Relativity to access Zoom from within the Relativity platform. Those users still need to license Zoom separately and get the plug-in from Equivio. Relativity itself has Content Analyst’s engine fully baked in for the same kind of text analytics, predictive coding, etc. K-Cura states that functionality will still all be there, no matter what happens with Equivio. So I stand corrected on my original comments to the contrary.

So what does happen if Zoom goes boom? Companies that depend on Equivio may be in trouble, or may simply move to Content Analyst or someone else. Are they as good as Equivio? I do not know. But I do know there are huge differences in analytics quality and how well one company’s predictive coding features work as compared to another. That is exactly why Equivio existed: to license technologies to fill the gap. Apparently Content Analyst and others do the same. They do the research and code development on search so that most other vendors in the industry do not have to. The trade-off is dependency and the chance they may close shop or be bought out.

Only a few vendors have taken the time, and very considerable expense, to develop their own active machine learning software features, instead of licensing them from Equivio or Content Analyst. These vendors will now reap the rewards of having the rug pulled out from under some of their competitors. Eventually even lawyers will realize that search quality does matter, that all predictive coding software programs are not alike.

There is a long list of other key users of Equivio products, some of whom may be concerned about losing Equivio’s products. They include, according to the list on Equivio’s own website:

  • Concordance by Lexis Nexis
  • DT Search
  • EDT
  • Xera iConnect
  • iPro
  • Law PreDiscovery by Lexis Nexis
  • Thomson Reuters

In addition, Equivio’s installed base web page lists the following companies and law firms as users of their technology. It is a very long list, including many prominent vendors in the space, and many small fries that I have never heard of. They may all be somewhat concerned about the Microsoft move, to one degree or another, according to how dependent they are on Equivio software or software components.

  • Altep
  • BDO Consulting
  • Bowman & Brooke
  • CACI
  • Catalyst
  • CDCI research
  • Commonwealth Legal
  • Crown Castle
  • D4
  • Deloitte
  • Dinsmore
  • Discover Ready
  • Discovia
  • doeLegal
  • DTI
  • e-Stet
  • e-Law
  • Envision Discovery
  • Epiq Systems
  • Ernst & Young
  • eTera Consulting
  • Evidence Exchange
  • Foley & Lardner
  • FTI Consulting
  • Gibson Dunn
  • Guidance Software
  • H&A
  • Huron
  • ILS Innovative Litigation Services
  • Inventus
  • IRIS
  • KPMG
  • USIS Labat
  • Law In Order
  • LDiscovery
  • Lightspeed
  • Lighthouse eDiscovery
  • Logic Force Consulting
  • Millnet
  • Morgan Lewis
  • Navigant
  • Night Owl Discovery
  • Nulegal
  • Nvidia
  • ProSearch Strategies
  • PWC
  • Qualcomm
  • Reed Smith
  • Renew Data
  • Responsive Data Solutions
  • Ricoh
  • RVM
  • Shepherd Data Services
  • Squire Sanders
  • Stroock
  • TechLaw Solutions
  • Winston & Strawn

This is Equivio’s list, and it may not be current, nor even accurate (some of the links were broken), but it is what is shown on Equivio’s website as of October 14, 2014. Do not blame me if Equivio has you on the list and you should not be, but feel free to leave a comment below to set the record straight. Hopefully, many of you have already moved on and no longer use Equivio anyway, like K-Cura. I happen to know that is true for a few of the other companies on that list. If not, if you still rely on Equivio, well, maybe Microsoft will still do business with you when it is time to renew, but most think that is very unlikely.


It is hubris to think that a force as mysterious and exponential as Information can be governed. Yet it appears that is why Microsoft wants to buy Equivio. Like most of establishment IT, including the vast majority of pundits in our own e-discovery world, Microsoft thinks that Information Governance is the next big thing. They think that predictive coding was just a passing fad that is now over. If these assumptions are correct, then we can expect to see fragments of Equivio’s code appear in Microsoft’s future software as part of general information governance functions. We will not see Microsoft come out with predictive coding software for e-discovery.

Once again, Microsoft is missing the big picture here. Like most IT experts today outside of Google, they do not understand that Search is king, and governance is just a jester. The last big thing, Search, especially AI-enhanced active machine learning, i.e. predictive coding, is still the next big thing. Information governance is just a reactive, Don Quixote thing. Not big at all, and certainly not long-lasting. If anything, it is the dying gasp of last century’s records managers and librarians. Nice people all, I’m sure, but then so was John Henry.

Microsoft’s absorption of Equivio is a setback for search and for legal e-discovery. But at the same time it is a boon for the few e-discovery vendors who chose not to rely on Equivio, and chose instead to build their own search. It is also a boon for Google, as, once again, Microsoft shows that it still does not get search. You will not see Google fall for a governance dream.

Search is and will remain the dominant problem of our age for generations. Information cannot be governed. It cannot be catalogued. It can only be searched. Everyone needs to get over the whole archaic notion of governance. King George died long ago.

Google has it right. We should focus our AI development on search, not governance. Spend your time learning to search; forget about filing. It is a hopeless waste of time, just like the little Dutch boy putting his finger in the dike. Learn to swim instead. Better yet, build a search boat like Noah and leave the governor behind.
