Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part Two

This is the second part of a two-part blog; please read part one first.

AI-Enhanced Big Data Search Will Greatly Simplify Information Governance

Information Governance is, or should be, all about finding the information you need, when you need it, and doing so in a cheap and efficient manner. Information needs are determined by both law and personal preferences, including business operation needs. In order to find information, you must first have it. Not only that, you must keep it until you need it. To do that, you need to preserve the information. If you have already destroyed information, really destroyed it I mean, not just deleted it, then obviously you will not be able to find it. You cannot find what does not exist, as all Unicorn chasers eventually find out.

This creates a basic problem for Information Governance, because the whole system is based on the notion that the best way to find valuable information is to destroy worthless information. Much of Information Governance is devoted to trying to determine which information is a valuable needle, and which is worthless chaff. This is because everyone knows that the more information you have, the harder it is to find the information you need. The idea is that too much information will cut you off. These maxims were true in the pre-AI-enhanced-search days, but are, IMO, no longer true today, or, at least, will not be true in the next five to ten years, maybe sooner.

In order to meet the basic goal of finding information, Information Governance focuses its efforts on the proper classification of information. Again, the idea was to make it simpler to find information by preserving some of it, the information you might need to access, and destroying the rest. That is where records classification comes in.

The question of what information you need has a time element to it. The time requirements are again based on personal and business operations needs, and on thousands of federal, state and local laws. Information Governance thus became a very complicated legal analysis problem. There are literally thousands of laws requiring certain types of information to be preserved for various lengths of time. Of course, you could comply with most of these laws by simply saving everything forever, but, in the past, that was not a realistic solution. There were severe limits on the ability to save information, and on the ability to find it. Also, it was presumed that the older information was, the less value it had. Almost all information was thus treated like news.

These ideas were all firmly entrenched before the advent of Big Data and AI-enhanced data mining. In fact, in today’s world there is good reason for Google to save every search, ever done, forever. Some patterns and knowledge only emerge in time and history. New information is sometimes better information, but not necessarily so. In the world of Big Data all information has value, not just the latest.

These records life-cycle ideas all made perfect sense in the world of paper information. It cost a lot of money to save and store paper records. Everyone with a monthly Iron Mountain paper records storage bill knows that. Even after the computer age began, it still cost a fair amount of money to save and store ESI. The computers needed for digital storage were expensive to buy and maintain. Finding the ESI you needed quickly on a computer was still very difficult and unreliable. All we had at first was keyword search, and that was very ineffective.

Due to the costs of storage, and the limitations of search, tremendous efforts were made by record managers to try to figure out what information was important, or needed, either from a legal perspective, or a business necessity perspective, and to save that information, and only that information. The idea behind Information Management was to destroy the ESI you did not need or were not required by law to preserve. This destruction saved you money, and, it also made possible the whole point of Information Governance, to find the information you wanted, when you wanted it.

Back in the pre-AI search days, the more information you had, the harder it was to find the information you needed. That still seems like common sense. Useless information was destroyed so that you could find valuable information. In reality, with the new and better algorithms we now have for AI-enhanced search, it is just the reverse. The more information you have, the easier it becomes to find what you want. You now have more information to draw upon.

That is the new reality of Big Data. It is a hard intellectual paradigm shift to make, and seems counter-intuitive. It took me a long time to get it. The new ability to save and search everything cheaply and efficiently is what is driving the explosion of Big Data services and products. As the save everything, find anything way of thinking takes over, the classification and deletion aspects of Information Governance will naturally dissipate. The records lifecycle will transform into virtual immortality. There is no reason to classify and delete if you can save everything and find anything at low cost. The issues simplify; they change to how to save and search, although new collateral issues of security and privacy grow in importance.

Save and Search v. Classify and Delete

The current clash in basic ideas concerning Big Data and Information Governance is confusing to many business executives. According to Gregory Bufithis who attended a recent event in Washington D.C. on Big Data sponsored by EMC, one senior presenter explained:

The C Suite is bedeviled by IG and regulatory complexity. … 

The solution is not to eliminate Information Governance entirely. The reports of its complete demise, here or elsewhere, are exaggerated. The solution is to simplify IG. To pare it down to save and search. Even this will take some time, like I said, from five to ten years, although there is some chance this transformation of IG will go even faster than that. This move away from complex regulatory classification schemes, to simpler save and search everything, is already being adopted by many in the high-tech world. To quote Greg again from the private EMC event in D.C. in October, 2014:

Why data lakes? Because regulatory complexity and the changes can kill you. And are unpredictable in relationship to information governance. …

So what’s better? Data lakes coupled with archiving. Yes, archiving seems emblematic of “old” IT. But archiving and data lifecycle management (DLM) have evolved from a storage focus, to a focus on business value and data loss prevention. DLM recognizes that as data gets older, its value diminishes, but it never becomes worthless. And nobody is throwing out anything and yes, there are negative impacts (unnecessary storage costs, litigation, regulatory sanctions) if not retained or deleted when it should be.

But … companies want to mine their data for operational and competitive advantage. So data lakes and archiving their data allow for ingesting and retaining all information types, structured or unstructured. And that’s better.

Because then all you need is a good search platform or search system … like Hadoop which allows you to sift through the data and extract the chunks that answer the questions at hand. In essence, this is a step up from OLAP (online analytical processing). And you can use “tag sift sort” programs like Data Rush. Or ThingWorx which is an approach that monitors the stream of data arriving in the lake for specific events. Complex event processing (CEP) engines can also sift through data as it enters storage, or later when it’s needed for analysis.

Because it is all about search.
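Greg’s description of sifting through a data lake and extracting “the chunks that answer the questions at hand” is essentially the MapReduce pattern that Hadoop popularized. A toy sketch in plain Python may help make it concrete; the records, sources, and query term are all invented for illustration, and a real Hadoop job would distribute these same two phases across a cluster:

```python
# A toy "sift and extract" over a data lake snapshot, in the spirit of
# MapReduce: map each record to (key, value) pairs, then reduce by key.
# All record fields and the query term below are hypothetical.
from collections import defaultdict

records = [
    {"id": 1, "source": "email", "text": "merger talks with Acme postponed"},
    {"id": 2, "source": "crm",   "text": "Acme renewal closed in Q3"},
    {"id": 3, "source": "email", "text": "lunch order for the team"},
]

def map_phase(record, query):
    """Emit a (source, record) pair for every record mentioning the query term."""
    if query.lower() in record["text"].lower():
        yield (record["source"], record)

def reduce_phase(pairs):
    """Group the matching records by source, like a reducer keyed on source."""
    grouped = defaultdict(list)
    for key, rec in pairs:
        grouped[key].append(rec)
    return dict(grouped)

pairs = [p for r in records for p in map_phase(r, "Acme")]
hits = reduce_phase(pairs)
print({k: [r["id"] for r in v] for k, v in hits.items()})
# → {'email': [1], 'crm': [2]}
```

The point of the pattern is that both phases parallelize trivially: mappers run independently over partitions of the lake, and reducers only need the pairs that share a key.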

Recent Breakthroughs in Artificial Intelligence
Make Possible Save Everything, Find Anything

The New York Times in an opinion editorial this week discussed recent breakthroughs in Artificial Intelligence and speculated on the alternative futures they could create. Our Machine Masters, NY Times Op-Ed, by David Brooks (October 31, 2014). The Times article quoted extensively from another article in the current issue of Wired by technology blogger Kevin Kelly: The Three Breakthroughs That Have Finally Unleashed AI on the World. Kelly argues, as do I, that artificial intelligence has now reached a breakthrough level. This breakthrough, Kevin Kelly argues, and David Brooks agrees, is driven by three things: cheap parallel computation technologies, big data collection, and better algorithms. The upshot is clear in the opinion of both Wired and the New York Times: “The business plans of the next 10,000 start-ups are easy to forecast: Take X and add A.I. This is a big deal, and now it’s here.”

These three new technology advances change everything. The Wired article goes into the technology and financial aspects of the new AI; it is where the big money is going and will be made in the next few decades. If Wired is right, then this means that in our world of e-discovery, companies and law firms will succeed if, and only if, they add AI to their products and services. The firms and vendors that add AI to document review and project management will grow fast. Vendors selling non-AI-enhanced software will go out of business. Law firms that do not use AI tools will shrink and die.

The Times article by David Brooks goes into the sociological and philosophical aspects of the recent breakthroughs in Artificial Intelligence:

Two big implications flow from this. The first is sociological. If knowledge is power, we’re about to see an even greater concentration of power.  … [E]ngineers at a few gigantic companies will have vast-though-hidden power to shape how data are collected and framed, to harvest huge amounts of information, to build the frameworks through which the rest of us make decisions and to steer our choices. If you think this power will be used for entirely benign ends, then you have not read enough history.

The second implication is philosophical. A.I. will redefine what it means to be human. Our identity as humans is shaped by what machines and other animals can’t do. For the last few centuries, reason was seen as the ultimate human faculty. But now machines are better at many of the tasks we associate with thinking — like playing chess, winning at Jeopardy, and doing math. [RCL – and, you might add, better at finding relevant evidence.]

On the other hand, machines cannot beat us at the things we do without conscious thinking: developing tastes and affections, mimicking each other and building emotional attachments, experiencing imaginative breakthroughs, forming moral sentiments. [RCL – and, you might add, better at equitable notions of justice and at legal imagination.]

In this future, there is increasing emphasis on personal and moral faculties: being likable, industrious, trustworthy and affectionate. People are evaluated more on these traits, which supplement machine thinking, and not the rote ones that duplicate it.

In the cold, utilitarian future, on the other hand, people become less idiosyncratic. If the choice architecture behind many decisions is based on big data from vast crowds, everybody follows the prompts and chooses to be like each other. The machine prompts us to consume what is popular, the things that are easy and mentally undemanding.

I’m happy Pandora can help me find what I like. I’m a little nervous if it so pervasively shapes my listening that it ends up determining what I like. [RCL – and, you might add, determining what is relevant, what is fair.]

I think we all want to master these machines, not have them master us.

Although I share the concerns of the NY Times about mastering machines and alternative future scenarios, my analysis of the impact of the new AI is focused on, and limited to, the Law. Lawyers must master the AI-search-for-evidence process. We must master and use the better algorithms, the better AI-enhanced software, not vice versa. The software does not, nor should it, run itself. Easy buttons in legal search are a trap for the unwary, a first step down a slippery slope to legal dystopia. Human lawyers must never over-delegate our uniquely human insights and abilities. We must train the machines. We must stay in charge and assert our human insights on law, relevance, equity, fairness and justice, and our human abilities to imagine and create new realities of justice for all. I want lawyers and judges to use AI-enhanced machines, but I never want to be judged by a machine alone, nor have a computer alone as a lawyer.

The three big new advances that are enabling better and better AI are nowhere near threatening the jobs of human judges or lawyers, although they will likely reduce their numbers, and will certainly change their jobs. We are already seeing these changes in Legal Search and Information Governance. Thanks to cheap parallel computation, we now have Big Data Lakes stored in thousands of inexpensive cloud computers operating together. This is where open-source software like Hadoop comes in; it makes the big clusters of computers possible. Better algorithms are where better AI-enhanced software comes in; they make it possible to use predictive coding effectively and inexpensively to find the information needed to resolve lawsuits. The days of vast numbers of document reviewer attorneys doing linear review are numbered. Instead, we will see a few SMEs working with small teams of reviewers, search experts, and software experts.

The role of Information Managers will also change drastically. Because of Big Data, cheap parallel computing, and better algorithms, it is now possible to save everything, forever, at a small cost, and to quickly search and find what you need. The new reality of Save Everything, Find Anything undercuts most of the rationale of Information Governance. It is all about search now.


Now that storage costs are negligible, and search far more efficient, the twin motivators of Information Science to classify and destroy are gone, or soon will be. The key remaining tasks of Information Governance are preservation and search, plus the relatively new ones of security and privacy. I recognize that the demise of the importance of destruction of ESI could change if more governments enact laws that require the destruction of ESI, as the EU has done with Facebook posts and the so-called “right to be forgotten” law. But for now, most laws are about saving data for various periods of time, and do not require that data be destroyed. Note that the new Delaware law on data destruction still makes it discretionary whether to destroy personal data or not. House Bill No. 295 – The Safe Destruction of Documents Containing Personal Identifying Information. It only places legal burdens and liability on failures to properly destroy data. This liability for mistakes in destruction serves to discourage data destruction, not encourage it.

Preservation is not too difficult when you can economically save everything forever, so the challenging task remaining is really just one of search. That is why I say that Information Governance will become a sub-set of search. The save everything forever model will, however, create new legal work for lawyers. The cybersecurity protection and privacy aspects of Big Data Lakes are already creating many new legal challenges and issues. More legal issues are sure to arise with the expansion of AI.

Automation, including this latest Second Machine Age of mental process automation, does not eliminate the need for human labor. It just makes our work more interesting and opens up more time for leisure. Automation has always created new jobs as fast as it has eliminated old ones. The challenge for existing workers like ourselves is to learn the new skills necessary to do the new jobs. For us e-discovery lawyers and techs, this means, among other things, acquiring new skills to use AI-enhanced tools. One such skill, the ability for HCIR, human computer information retrieval, is mentioned in most of my articles on predictive coding. It involves new skill sets in active machine learning to train a computer to find the evidence you want from large collections of data sets, typically emails. When I was a law student in the late 1970s, I could never have dreamed that this would be part of my job as a lawyer in 2014.
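The active machine learning loop at the heart of HCIR and predictive coding can be sketched in a few lines. This is a hypothetical illustration using scikit-learn, not any vendor’s actual review platform; the documents, labels, and relevance topic are all invented. A human reviewer labels a small seed set, the model ranks the unreviewed documents by likely relevance, and the least certain ones go back to the human for the next training round:

```python
# Hypothetical sketch of one round of active learning for document review.
# The documents, SME labels, and relevance topic are invented examples.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "board approved the merger agreement",    # reviewed by SME: relevant
    "minutes of merger negotiation meeting",  # reviewed by SME: relevant
    "company picnic scheduled for friday",    # reviewed by SME: not relevant
    "parking garage closed next week",        # reviewed by SME: not relevant
    "draft merger term sheet attached",       # not yet reviewed
    "new coffee machine in the break room",   # not yet reviewed
]
labels = [1, 1, 0, 0]  # human labels for the first four documents

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
model = LogisticRegression().fit(X[:4], labels)

# Rank the unreviewed documents by predicted probability of relevance.
probs = model.predict_proba(X[4:])[:, 1]
ranking = np.argsort(-probs)  # most-likely-relevant first
for i in ranking:
    print(round(float(probs[i]), 2), docs[4 + i])

# Uncertainty sampling: the document closest to p = 0.5 goes back to the
# human reviewer, whose label retrains the model in the next round.
next_to_review = 4 + int(np.argmin(np.abs(probs - 0.5)))
```

Each round of human labels retrains the model, so review effort concentrates on the documents the machine is least sure about — which is the “train the machines” role this article argues lawyers must keep.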

The new jobs do not rely on physical or mental drudgery and repetition. Instead, they put a premium on what makes us distinctly human: our deep knowledge, understanding, wisdom, and intuition; our empathy, caring, love and compassion; our morality, honesty, and trustworthiness; our sense of justice and fairness; our ability to change and adapt quickly to new conditions; our likability, good will, and friendliness; our imagination, art, and creativity. Yes, even our individual eccentricities, and our all-important sense of humor. No matter how far we progress, let us never lose that! Please be governed accordingly.

3 Responses to Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part Two

  1. Aaron T Schneider says:

    Hadoop is primarily being used to store customer data or to assist in solving problems related to a specific domain. I’m not sure many companies can even build a “Data Lake” to analyze their own business files even if they wanted to. Most companies are moving to third party data silos. For example, a company might store Exchange email using Microsoft Office 365 and CRM data with Salesforce.com. How does a company build a data lake when all of the data is hosted by different third party vendors? Remember ALL the data still needs to get onto Hadoop for analysis or you don’t have a complete picture.

    Let’s say that you want to analyze your Exchange data and Salesforce data using an analytics engine that takes advantage of MapReduce, and you have about a petabyte of data. First, you need to get all the data from the third parties in a format that you can store on Hadoop. Then you need to find the server space for Hadoop! Most likely you will go with cloud infrastructure like Amazon, because your company doesn’t have any IT infrastructure of its own. You are already paying to store your data at different vendors, and now you have to pay to host all of that information again for analysis.

    Analyzing each silo of data by itself is pointless if you want to understand “the big picture”. So where do we go from here?

  2. Edmund says:

    Thanks for another great article Mr. Losey. I tweeted these blog posts to my class, so I’m hoping we can have a good discussion of the implications of artificial intelligence on the need to preserve and archive data. Prof. Jenkins has been telling us all year how important the human element is in e-discovery. Advocacy is all about building narratives, and I can’t imagine a machine building a narrative or story that is as compelling as one designed by a human. But they can still help compile the information that goes into that narrative to give that narrative a richer factual background. It reminds me a bit of my undergraduate study of Jean-Francois Lyotard’s “The Postmodern Condition: A Report on Knowledge.” http://www.abdn.ac.uk/idav/documents/Lyotard_-_Postmodern_Condition.pdf.

