Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part One

There is a micro-battle brewing between Information Governance and Search in the legal world that reflects a larger conflict in the greater information technology world. I touched upon this in my blog last week, e-Discovery Industry Reaction to Microsoft’s Offer to Purchase Equivio for $200 Million – Part Two. I speculated that Microsoft was probably buying Equivio because it wants to improve its information governance products. That is what most in the e-discovery industry seems to think, and, after all, law is where Equivio now operates. I also speculated that it might instead be a pure search play on Microsoft’s part, that Microsoft might already understand, like Google clearly does, that search is now king, and Information Governance is a fading fad.

In my blog last week I broke ranks with almost all other specialists in e-discovery and turned against Information Governance. Instead, I took sides with Search, and opined that the preoccupation with classification, retention, and destruction of data would soon stop being a viable, efficient activity. Instead, I suggested that we all focus on the googlesque approach, one that I had previously disdained, of save everything and search instead of classify.

I still hold Information Governance in high respect, and think it has another five years or so of useful contributions. But ultimately I think the classify and control lock-down approach of IG is futile. It cannot withstand the continuing exponential growth of data, nor the basic entropy forces aligned against all attempts to govern, especially attempts based on all-too-human rules and compliance. My prediction is that within five to ten years, IG will no longer be worth the effort. This projection assumes continuing exponential growth and future improvements in search. Breakthroughs in search would be nice, but my projection does not depend on that. It assumes instead a slow, steady improvement of search technologies. For now IG helps search, and so is worth it, but not for much longer.

For the immediate future a dual approach, govern and search, is still viable. It may ultimately be a Don Quixote quest, but not just yet. Still, soon enough, and perhaps in even less than five years, it may no longer be worth the time and expense to try. At least this is how I see it, and, as I must in all honesty admit, this is still very much a minority report in the world of legal technology.

Since my blog last week I have learned much more about this conflict in the larger world of technology. Although I am a lonely voice in the legal technology world, which after all is not unexpected, since law itself is an attempt to govern, I have plenty of good company in the general technology world. There is not only Google, whom I expected, but also EMC, GE, and a host of others. The debate is part of larger issues surrounding Big Data. The outcome will impact everyone’s life for years to come. Govern or Search is not just a legal issue. It is a cultural issue.

Search or Governance:
What is the best approach to survive the information deluge?

In last week’s blog I played the role of the child pointing out that the Emperor has no clothes. I suggested that Information Governance is really little more than e-dressed-up records management. This is not a popular view, but the Information Governance world so far has not reacted too vehemently. No death threats like Gamergate. Even a few positive comments.

Perhaps most of the IG world hope that I will just go away and stop this conversation. Sorry. Not going to happen. In fact, I feel compelled to summarize and expand upon what I said last week. These are critical issues. They not only challenge the legal world, but everyone who uses a computer. We are all suffering from information overload. We are all looking for a solution. Will we cope by search, or by vertical forces of governance, man-made laws?

As a person who has devoted his life to law and rules, and been a lawyer focusing on technology for over 34 years, I know firsthand the limitations of law and governance. I am sure that rules are not the way to go. It is better to rely on search and technology. You may say there is no reason to choose, but really there is. Their methods ultimately diverge, and resources, like attention, are limited. The focus must be on search.

I hope this will become a full-scale dialogue, as the late great Richard Braman would have us do. But if not dialogue, then at least a public debate. Perhaps I will square off with an IG leader at Legal Tech? Time will tell. We are waiting for some vendors to step up to the plate and sponsor such discussions. In the meantime, if you did not read last week’s blog, I urge you to do so, Part Two at least. Here are a few excerpts that I would like to emphasize:

I think the establishment majority in our industry is deluding themselves into thinking that information is like paper, only there is more of it. They delude themselves into thinking that Information is capable of being governed, just like so many little paper soldiers in an army. I say the Emperor has no clothes. That information cannot be governed.

Electronic Information is a totally new kind of force, something Mankind has never seen before. Digital Information is a Genie out of the bottle. It cannot be captured. It cannot be managed. It certainly cannot be governed. It cannot even be killed. Forget about trying to put it back in the bottle. …

Essentially information is free, and wants to be free. It does not want to be governed, or charged for. Information is more useful when free and when it is not subject to transitory restraints. …

Regardless of the economic aspects, and whether information really wants to be free or not, as a practical matter Information cannot be governed, even if some of it can be commoditized. Information is moving and growing far too fast for governance.

I stated last week that information inflation is like a nuclear bomb, and we have reached the tipping point of no return in the atomic fission chain reaction. I invoked the fission vision to try to help convey what exponential information growth really means. For those reasons I called Information Governance a noble, but futile, Don Quixote quest. I asserted that it is impossible to file and classify an e-data world that doubles in size and complexity every couple of years. Yet, the Information Governance experts try to do just that. I know, I used to be one of them. Here is how I put it last week.

[W]e have a new breed of information governance experts running around who serve like heroic bomb squads. Some know that it is just a noble quest, doomed to failure. Most do not. They helicopter into corporate worlds attempting to defuse ticking information bombs. They build walls around it. They confidently set policies and promulgate rules. They talk sternly about enforcement of rules. They automate filing. They automate deletion. Some are even starting to make robot file clerks

Information governance experts, just like the records managers before them, are all working diligently to try to solve today’s problems of information management. But, all the while, ever new problems encroach upon their walls. They cannot keep up with this growth, the new forms of information. The next generation of exponential growth builds faster than anyone can possibly govern. Do they not know that the bomb has already exploded? The tipping point has already past?

Information governance policies that are being created today are like sand castles built at low tide. Can you hear the next wave of data generated by the Internet of Things? It will surely wash away all of today’s efforts. There will always be more data, more unexpected new forms of information. Governance of information is a dream, a Don Quixote quest.

I used yet another analogy last week of the little Dutch boy with his finger in the Dyke trying to stop the flood of data from overwhelming us all. I explained that this was impossible in my view because of the exponentially growing volume and complexity of data. I concluded that search was the only answer in tomorrow’s world, that governance was no longer possible. Instead of making and trying to enforce information rules based on archaic notions of governance, we should instead harness all of our efforts to improve search.

Do not waste your valuable time and effort trying to file information. Just search for it, when and if you need it. You will not need most of it anyway. …

It is hubris to think that a force as mysterious and exponential as Information can be governed. …

Search is and will remain the dominant problem of our age for generations. Information cannot be governed. It cannot be catalogued. It can only be searched. …

Google has it right. We should focus our AI development on search, not governance. Spend your time learning to search, forget about filing. It is a hopeless waste of time. It is just like the little Dutch boy putting his finger in the dyke. Learn to swim instead. Better yet, build a search boat like Noah and leave the governor behind.

In this two-part blog I will explore further aspects of this debate, with the emphasis in Part Two on the views in the larger technology world, not just the legal e-discovery world. But, as that is my core competence, I will once again start in our litigation sandbox and work my way out into the big beach of all technology. It turns out that this beach is on a very large Data Lake surrounded by open-source software with the funny name of Hadoop. More on that later.

Information Governance from an e-Discovery Lawyer’s Perspective

Everyone in the e-discovery world is already pretty familiar with the ideas and goals of Information Management. It is the big hit in e-disco conference circles, having replaced Predictive Coding sometime last year as the latest thing. Every CLE event now has one or more panels devoted to the topic. There are even several new organizations set up to promote Information Governance,

A few large cutting edge law firms even have Information Governance practice groups. They do such things as assisting clients to establish an information governance framework to link information creation and use to business objectives. They also help clients to categorize information during its useful life, and then delete it when no longer needed. These services require an review of the company information in five steps:

Information Inventory.
Formation of an Information Governance Steering Committee.
Information Policy and Protocol Review.
Policy and Protocol Creation and Modification.
Policy Implementation and Enforcement.

At the end of a typical IG Review a corporate client should have streamlined the creation, use and disposition of its information. This should, in turn, increase its business efficiency, reduce the risks involved in the over-or under-retention of information, and increase its ability to react to litigation or regulatory events.

Most lawyers specializing in e-discovery know the drill and have engaged in a few such IG projects. I have worked on several IG reviews, following similar steps. They typically involve many meetings, team assembly and team activities, and information work flow studies and discussions. Most of what a good IG lawyer does is educational. For instance, they teach everyone the basics of a records life cycle. They help the client to make their own decisions as to what is right for their company. They provide suggestions to consider, and set out the pros and cons to help the client’s team make good decisions. It can all become very complicated, especially with large IT infrastructures. The fees involved can be considerable. Six and seven figure projects are not unheard of.

There is no one-size-fits-all Information Governance plan. It is not a form driven practice. It all depends on so many things. For this reason a good IG lawyer acts as both a trainer and coach for the corporate team. The team makes the final decisions, not the outside legal counsel. Nevertheless, IG lawyers can have a big impact on those decisions. Most IG lawyers are inclined towards reducing data retention, as they know very well, or think they do, that the costs of search and review of data in the context of litigation is very high. They tend to steer the team in the direction of data destruction. It is heretic to say just save everything forever, and focus on search instead. Alas, that is why my ex-communication from the IG world is certain.

I used to be the way they are. ESI feared me. I was all about killing data as soon as you no longer had a business need for it. I was all in favor of short retention schedules. But, that was then. That was before I really mastered predictive coding, which in my version means active machine learning. That was before I understood much better than I used to, that we are living in a whole new world of Big Data Analytics. I now realize that is possible to dramatically reduce the costs of document review. I now realize the incredible power of AI enhanced search. I am starting to realize the potential value of large pools of seeming worthless data. These realizations change everything.

My understanding and experience with Big Data Analytics has led me to a different view of data retention and governance. I now understand that more data can mean more intelligence, that it does not necessarily mean more trouble and expense. I understand that more and bigger data has its own unique values, so long as it can be analyzed and searched effectively.

This change of position was reinforced by my observing many litigated cases where companies no longer had the documents they needed to prove their case. The documents had short retention spans. They had all been destroyed in the normal course of business before litigation was ever anticipated. I have seen first hand that yesterdays trash can be tomorrow’s treasure. I will not even go into the other kind of problems that very short retention policies can place upon a company to immediately implement a lit-hold. The time pressures to get a hold in place can be enormous and thus errors become more likely.

There is a definite dark side to data destruction that IG types do not like to face. No one knows for sure when data has lost its value. The meaningless email of yesterday about lunch at a certain restaurant could well have a surprise value in the future. For instance, a time-line of what happened when, and to whom, is sometimes an important issue in litigation. These stupid lunch emails could help prove where a witness was and when. They could show that a witness was at lunch, out of the office, and not at a meeting as someone else alleges.

Who knows what value such seemingly worthless data may someday have? Perhaps millions of emails of ten thousand employees about lunch could be used someday to prove or disprove certain class-action allegations. Outside of the little world of litigation, perhaps the information could help management make smarter business decisions. For instance, they could help a company to decide whether to open a company cafeteria, and if so, what kind of food its employees would really like to have served there. Information can prove what really happened in the past and can help you to make the right decisions. With smart search, there can be great hidden value in too much information. Businesses are starting to see this now where Big Data mining is all the buzz. We lawyers need to start doing the same.

The point is, with the never-ending uncertainties of tomorrow, you can never know for sure that information is valueless and should be destroyed, and what information has value and should be saved. There may be an unimaginably large haystack of information, and you may think it only has a few valuable needles. But, you never really know. Today’s irrelevant straw could be tomorrow’s relevant needle. With the AI based search capacities we already have, capacities that are surely to improve, when you need to find a needle in these near infinite stacks, you will be able to. The cost of storage itself has become so low as to become a negligible factor for most large corporations. Why destroy data when you can effectively search it and mine it for value?

Information Technology View on IG v. Search

The general IT world is also struggling between whether to go all-in with Search, or keep trying to solve the problem of too much information with Governance. Unlike the legal world, where my vote for search is still a new and small majority, in the IT world search is already a strong voice. Many in IT see attempts at Information Governance as misguided throwbacks to the pre-digital world. In the last year it seems to me that Search is gaining ground in the technology world. From what I see the retain and Search solution is surging ahead of the old-fashioned govern and destroy approach.

Consider, for instance, the policy of search stated by hot new companies like Pivotal, which is a joint venture between EMC, VMware, and GE. Pivotal’s public mantra is: Store Everything. Analyze Anything. Build the Right Thing.

Pivotal urges its customers to Store Everything, not just its organized databases, such as financial records. It provides the ability to store all types of data, including especially disorganized data, such as employee emails and texts, and do so in the same place. That is the new gold standard. Pivotal explains the value of store everything this way:

Store everything to create a rich data repository for business needs. With unlimited, supported Pivotal HD enterprises never have to worry about data growth constraints or runaway license costs.

Its suite of Big Data software is designed to allow a company to store all data types in the same place, which it, along with EMC, and others, have started calling a Data Lake. All types and formats of ESI become readable, searchable, in the Data Lake. They do not have to be stored separately, nor searched and analyzed separately. The Data Lakes are also infinitely expandable. Unlike real lakes, they cannot flood. They can instead grow unhindered in cyberspace. All they need are more servers.

These are major breakthroughs and mean the inevitable end of separate data silos by format type and size. This allows you to, in Pivotal’s words, leverage all your data, forever, and place it all in a centralized Business Data Lake. You can analyze multiple data sets and types that live in the Business Data Lake. This allows you to determine the integration value of multiple data sets and types. It also makes storage of Big Data much less expensive.

Bottom line, when all of your data is saved forever, and subject to advanced search analytics, you are empowered to build the right thing. In Pivotal’s words, building the right thing means to deliver a transformative solution to meet today’s demanding business needs. For business that means creation of new products, new advertising, new sales and business methods. For law it means building your case, finding evidence, and creating new legal methods. The promise of Big Data is changing everything.

Data Lakes Are Made Possible by Hadoop

Hadoop is an open sourced software program that underlies most Big Data repositories, including what Pivotal, EMC, and others call Data Lakes. Facebook, for example, uses Hadoop, and claims to have assembled the largest Hadoop cluster of computers in the world, housing 21 Petabytes of storage as of 2010. By March 2011, its Hadoop clusters had grown to 30 PB, which Facebook says is 3,000 times the size of the Library of Congress. Facebook’s Hadoop Data Lake supposedly grows at the rate of one half a petabyte a day. Although Yahoo! uses open sourced Hadoop, Google does not. Google instead uses its own proprietary version of MapReduce and Google File System, and keeps the details as closely guarded trade secrets.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son’s yellow stuffed elephant.

Hadoop is one of the new technologies driving the search versus governance debate. Before Hadoop, store-everything, scalable, central data repositories were not possible. The relatively small separate silos of data before then made it very difficult to store large amounts of data, and to search everything. First, you had to find it, then you had to employ a number of different search methods. This made it hard to mine the value of data. The separate locations also made it much more expensive to store and maintain Big Data.

Data silos were a big impediment to search as a solution to the data deluge. Conversely, they were somewhat of a help to governance. But now the days of silos are numbered. Thanks to open-source Hadoop, and the hundreds of established companies and start-ups that provide for-profit applications for Hadoop Data Lakes, the days of data silos are numbered. Now with Hadoop it is possible to put in all data in one place, forever, and, with some limitations, search it all, whenever you want.

To be continued and concluded next week in Part Two.

This entry was posted on Sunday, October 26th, 2014 at 7:59 pm and is filed under Lawyers Duties, Metadata, Review, Search, Technology. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

12 Responses to Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part One

realrecords says:

October 27, 2014 at 9:34 am

Mr. Losey: You have continued your excellent position in an extremely coherent and understandable manner. I cannot agree more with your prophecy…no, your accurate analysis and prediction.

Loading...

Reply
Gregory Bufithis (@GregBufithis) says:

October 27, 2014 at 11:43 am

Your post is well-timed. I attended an EMC workshop last week (a follow-up to a series of workshops that EMC runs across the globe on Big Data, storage, analytics and search) and one of the speakers noted “in the old days the VP of management information services was in charge of the problems of data growth, complexity and security. That has morphed into information governance”.

Some of the more salient points in the session:

– The C Suite is bedeviled by IG and regulatory complexity, leading to their unpredictable relationship.

– A solution: “data lakes” (a phrase Bryant Bell of EMC coined a few years ago) which can store practically unlimited amounts of data in any format, schema and type. Relatively inexpensive and massively scalable, a data lake enables data to be analyzed without being moved. It may also include connectors for content from legacy and production applications to maintain those applications until end of life. This ensures an efficient transition to the data lake.

– Data lakes need to be coupled with archiving because as data gets older, its value diminishes, but it never becomes worthless. As you note “yesterdays trash can be tomorrow’s treasure”. And nobody is throwing out anything despite the negative impacts (unnecessary storage costs, litigation, regulatory sanctions). Companies want to mine their data for operational and competitive advantage. So data lakes and archiving their data allows for ingesting and retain all information types, structured or unstructured. And that’s better

– But you need a good search platform or search system like Hadoop MapReduce which allows you to sift through the data and extract the chunks that answer the questions at hand. In essence, this is a step up from the old online analytical processing. And you can use “tag sift sort” programs like Data Rush. Or ThingWorx which is an approach that monitors the stream of data arriving in the lake for specific events. Complex event processing (CEP) engines can also sift through data as it enters storage, or later when it’s needed for analysis.

– When IG was mentioned it was “search powered IG”

– And as far as data silos “there better be a damn good, specific reason” because that method of storing data does more harm than good. Quoting: “common sense dictates the flow of data that is unimpeded and central operates much faster, more efficiently”

– , and are more agile than systems where needed information may be locked away.

When EMC did the spin-out of Pivotal Software out of its assets and the assets of its 80% owned subsidiary VMware, and then formed the EMC-VMware-Pivotal-GE joint venture, it emphasized that the combination of storage, computation and analytics capabilities would allow customers to scale-out their “data lakes”, coupled with the new enterprise search technologies noted above. Search technologies getting better all the time.

Note: this is similar (not exactly the same) to the cognitive application development that IBM is doing with Watson via IBM Bluemix. Using the power of Watson for search. But IBM being IBM, it is more detailed. If you have the stamina to work through “IBM Watson Content Analytics” (600+ pages) it is a treasure trove of search knowledge.

Oh, and I looked up the whole Stewart Brand quote, which he coined in 1984 as a reposte to Steve Wozniak at the first Hackers Conference which Brand helped organize.

According to a transcript of the event he said:

“On the one hand, information wants to be expensive, because it’s so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other.”

Wozniak’s reply: “Fine, information should be free, but your time should not.”

Loading...

Reply
- Gregory Bufithis (@GregBufithis) says:
  
  October 27, 2014 at 12:09 pm
  
  Oops. Sorry. Correction. James Dixon, the CTO of Pentaho, is more properly credited with coining the term “data lake”. From a Forbes article in 2011 quoting him on how big data and the massive explosion of sources of information would require a new IT architecture:
  
  “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
  
  He also wrote a detailed blog post.
  
  His point was CIOs would need to think of “data lakes”, the difference between a data lake and a data warehouse being that in a data warehouse, the data is pre-categorized at the point of entry, which can dictate how it’s going to be analyzed. This is especially true in online analytical processing, which stores the data in an optimal form to support specific types of analysis.
  
  The problem he said is that in the world of big data, we don’t necessarily know each time what value the data has when it’s initially accepted from the array of sources available to us. We might know some questions we want to answer, but not to the extent that it makes sense to close off the ability to answer questions that materialize later. And if that data is improperly “organized” then it is lost. Therefore, storing data in some “optimal” form for later analysis doesn’t make any sense. Instead, store the data in a massive, easily accessible repository based on the cheap storage that’s available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers.
  
  Loading...
  
  Reply
  - Ralph Losey says:
    
    October 28, 2014 at 7:09 am
    
    Thanks for your addition, and for helping serve as a muse for my recent insights on Big Data.
    
    Loading...
Andy Wilson (Logik) says:

October 27, 2014 at 3:21 pm

+ 1 for search. Storage is/has approached $0, so naturally people will store everything, which then makes smart search a high priority. NoSQL systems like Hadoop, CouchDB, Mongo, etc. are a natural fit for this ever-changing index, because they don’t require a schema.

Loading...

Reply
- Ralph Losey says:
  
  October 28, 2014 at 7:20 am
  
  NoSQL = Not Only SQL databases.
  Thanks for the comment.
  
  Loading...
  
  Reply
Dean Gonsowski says:

October 28, 2014 at 5:36 pm

Great post Ralph and while I’m loath to venture into the Lion’s den, I must. To start I won’t take issue with many of your core operating principles. My main issue with your search vs governance mantra is that is presumes that all data has value and that the value (potential or real) outweighs the risks and liability associated with the ESI. This concept is core to the definition promulgated by the IGI and others.

I wrote blog on this topic a while back (http://www.recommind.com/blog/2013/06/19/information-governance-gangnam-style-the-perils-of-dark-data) which attempted to combat the notion that big data will make all data valuable one day. In your parlance it’s the “yesterdays trash can be tomorrow’s treasure” which I think is the big fallacy. Information garbage weeks, months or years ago is highly unlikely to be mined in the future by big data initiatives to derive valuable business insights.

Instead, that same questionably valuable data is much more certain to harm the organization, either in terms of data breaches, ESI that must be preserved/collected/reviewed/produced for eDiscovery, etc.

Finally, all that data kept in the hopes that it might have big data-esque value creates more noise making it harder to find useful “signals” buried therein.

I do agree that governance needs to evolve as a practical and achievable concept, but the “keep it all forever” is similarly imperfect.

-Dean

Loading...

Reply
- Ralph Losey says:
  
  October 29, 2014 at 6:03 am
  
  Thanks for the comment Dean. We have been rubbing shoulders in the search field together for many a year. So I know you understand the value and importance of search. In the spirit of “dialogue” that Richard Braman wanted us all to practice in grappling with important new legal issues, I will respond by saying I hear you, and will process your message, and review your article, and try to fully grasp your perspective. I will respond later. Please venture back into the Lion’s Den any time. I will not bite, and if I roar, well, that’s just me. 🙂
  
  Loading...
  
  Reply
Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part Two | e-Discovery Team ® says:

November 2, 2014 at 9:55 pm

[…] This is the second part of a two-part blog, please read part one first. […]

Loading...

Reply
Hadoop Happenings: Hadoop Lingo; Focus on Growth | Qubole says:

November 4, 2014 at 4:20 pm

[…] E-discoveryteam.com- Ralph Losey, a practicing attorney, tackles the governance vs. search debate. Read More […]

Loading...

Reply
Does Search = the Demise of InfoGov? | robertcruz03 says:

November 24, 2014 at 3:17 pm

[…] expert Ralph Losey on the subject of How to best conquer out of control information growth http://e-discoveryteam.com/2014/10/26/hadoop-data-lakes-predictive-analytics-and-ultimate-demise-of-…, in which he argues that we have arrived at 2 alternative approaches, namely, to either Classify […]

Loading...

Reply
10 Predictions for 2015 E-Discovery says:

December 22, 2014 at 7:14 pm

[…] destruction purposes under the umbrella of an information governance strategy? Or should they keep Big Data and comb through it with advanced search technologies? Don’t expect these questions to be resolved in 2015. There is […]

Loading...

Reply