12 Responses to Hadoop, Data Lakes, Predictive Analytics and the Ultimate Demise of Information Governance – Part One

  1. realrecords says:

    Mr. Losey: You have continued your excellent position in an extremely coherent and understandable manner. I cannot agree more with your prophecy…no, your accurate analysis and prediction.

  2. Your post is well-timed. I attended an EMC workshop last week (a follow-up to a series of workshops that EMC runs across the globe on Big Data, storage, analytics and search) and one of the speakers noted “in the old days the VP of management information services was in charge of the problems of data growth, complexity and security. That has morphed into information governance”.

    Some of the more salient points in the session:

    – The C Suite is bedeviled by IG and regulatory complexity, leading to their unpredictable relationship.

    – A solution: “data lakes” (a phrase Bryant Bell of EMC coined a few years ago) which can store practically unlimited amounts of data in any format, schema and type. Relatively inexpensive and massively scalable, a data lake enables data to be analyzed without being moved. It may also include connectors for content from legacy and production applications to maintain those applications until end of life. This ensures an efficient transition to the data lake.

    – Data lakes need to be coupled with archiving because as data gets older, its value diminishes, but it never becomes worthless. As you note, “yesterdays trash can be tomorrow’s treasure”. And nobody is throwing out anything despite the negative impacts (unnecessary storage costs, litigation, regulatory sanctions). Companies want to mine their data for operational and competitive advantage. So data lakes, combined with archiving, allow companies to ingest and retain all information types, structured or unstructured. And that’s better

    – But you need a good search platform or search system like Hadoop MapReduce which allows you to sift through the data and extract the chunks that answer the questions at hand. In essence, this is a step up from the old online analytical processing. And you can use “tag sift sort” programs like Data Rush. Or ThingWorx which is an approach that monitors the stream of data arriving in the lake for specific events. Complex event processing (CEP) engines can also sift through data as it enters storage, or later when it’s needed for analysis.

    – When IG was mentioned it was “search powered IG”

    – And as far as data silos, “there better be a damn good, specific reason” because that method of storing data does more harm than good. Quoting: “common sense dictates the flow of data that is unimpeded and central operates much faster, more efficiently”, and such systems are more agile than systems where needed information may be locked away.
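The MapReduce-style sifting mentioned in the points above can be illustrated with a minimal sketch in plain Python. This is not the Hadoop API; the document names, the keywords, and the `map_phase`, `reduce_phase` and `search` helpers are all hypothetical, chosen only to show the map-then-group pattern on a tiny in-memory "lake".

```python
from collections import defaultdict

# Hypothetical documents sitting in the lake, keyed by an made-up ID.
DOCUMENTS = {
    "memo-1": "merger talks with acme continue",
    "memo-2": "lunch menu for friday",
    "memo-3": "acme merger approved by board",
}


def map_phase(doc_id, text, keywords):
    """Emit a (keyword, doc_id) pair for each keyword found in the text."""
    words = set(text.lower().split())
    for keyword in keywords:
        if keyword in words:
            yield keyword, doc_id


def reduce_phase(pairs):
    """Group the emitted pairs by keyword and sort the matching doc IDs."""
    grouped = defaultdict(list)
    for keyword, doc_id in pairs:
        grouped[keyword].append(doc_id)
    return {keyword: sorted(ids) for keyword, ids in grouped.items()}


def search(documents, keywords):
    """Run the map step over every document, then reduce the results."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text, keywords))
    return reduce_phase(pairs)
```

Calling `search(DOCUMENTS, ["merger", "acme"])` would return the memos that mention each term. In a real cluster the map step would run in parallel on the nodes that hold the data, which is the sense in which the data is analyzed without being moved.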

    When EMC did the spin-out of Pivotal Software out of its assets and the assets of its 80% owned subsidiary VMware, and then formed the EMC-VMware-Pivotal-GE joint venture, it emphasized that the combination of storage, computation and analytics capabilities would allow customers to scale-out their “data lakes”, coupled with the new enterprise search technologies noted above. Search technologies are getting better all the time.

    Note: this is similar (not exactly the same) to the cognitive application development that IBM is doing with Watson via IBM Bluemix. Using the power of Watson for search. But IBM being IBM, it is more detailed. If you have the stamina to work through “IBM Watson Content Analytics” (600+ pages) it is a treasure trove of search knowledge.

    Oh, and I looked up the whole Stewart Brand quote, which he coined in 1984 as a riposte to Steve Wozniak at the first Hackers Conference, which Brand helped organize.

    According to a transcript of the event he said:

    “On the one hand, information wants to be expensive, because it’s so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other.”

    Wozniak’s reply: “Fine, information should be free, but your time should not.”

    • Oops. Sorry. Correction. James Dixon, the CTO of Pentaho, is more properly credited with coining the term “data lake”. From a Forbes article in 2011 quoting him on how big data and the massive explosion of sources of information would require a new IT architecture:

      “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

      He also wrote a detailed blog post.

      His point was CIOs would need to think of “data lakes”, the difference between a data lake and a data warehouse being that in a data warehouse, the data is pre-categorized at the point of entry, which can dictate how it’s going to be analyzed. This is especially true in online analytical processing, which stores the data in an optimal form to support specific types of analysis.

      The problem, he said, is that in the world of big data, we don’t necessarily know each time what value the data has when it’s initially accepted from the array of sources available to us. We might know some questions we want to answer, but not to the extent that it makes sense to close off the ability to answer questions that materialize later. And if that data is improperly “organized” then it is lost. Therefore, storing data in some “optimal” form for later analysis doesn’t make any sense. Instead, store the data in a massive, easily accessible repository based on the cheap storage that’s available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers.
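The "store raw, organize at question time" idea in the comment above is often called schema-on-read. The following is a minimal sketch of it in Python, under stated assumptions: the records, field names and helper functions are hypothetical, and a real lake would hold files in distributed storage rather than a list in memory.

```python
import json

# Hypothetical raw events, kept exactly as they arrived, with no shared schema.
RAW_LAKE = [
    '{"source": "crm", "customer": "acme", "amount": 120.0}',
    '{"source": "web", "user": "acme", "page": "/pricing"}',
    '{"source": "crm", "customer": "globex", "amount": 75.5}',
    "not even json",
]


def read_records(raw_lines):
    """Parse each raw line at query time; skip lines that are not JSON objects."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def total_by_customer(raw_lines):
    """One question, asked today: how much did each customer spend?"""
    totals = {}
    for record in read_records(raw_lines):
        if "customer" in record and "amount" in record:
            name = record["customer"]
            totals[name] = totals.get(name, 0.0) + record["amount"]
    return totals


def pages_viewed(raw_lines):
    """A different question, asked later, against the same untouched data."""
    return sorted({r["page"] for r in read_records(raw_lines) if "page" in r})
```

Both questions impose their own structure on the same raw lines when they are asked. Nothing was discarded or reshaped at ingestion, so a question that only materializes later can still be answered.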

  3. + 1 for search. Storage costs are approaching (or have reached) $0, so naturally people will store everything, which then makes smart search a high priority. NoSQL systems like Hadoop, CouchDB, Mongo, etc. are a natural fit for this ever-changing index, because they don’t require a schema.

  4. Great post Ralph, and while I’m loath to venture into the Lion’s den, I must. To start, I won’t take issue with many of your core operating principles. My main issue with your search vs governance mantra is that it presumes that all data has value and that the value (potential or real) outweighs the risks and liability associated with the ESI. This concept is core to the definition promulgated by the IGI and others.

    I wrote a blog post on this topic a while back (http://www.recommind.com/blog/2013/06/19/information-governance-gangnam-style-the-perils-of-dark-data) which attempted to combat the notion that big data will make all data valuable one day. In your parlance it’s the “yesterdays trash can be tomorrow’s treasure” idea, which I think is the big fallacy. Information that was garbage weeks, months or years ago is highly unlikely to be mined in the future by big data initiatives to derive valuable business insights.

    Instead, that same questionably valuable data is much more certain to harm the organization, either in terms of data breaches, ESI that must be preserved/collected/reviewed/produced for eDiscovery, etc.

    Finally, all that data kept in the hopes that it might have big data-esque value creates more noise making it harder to find useful “signals” buried therein.

    I do agree that governance needs to evolve as a practical and achievable concept, but the “keep it all forever” approach is similarly imperfect.


    • Ralph Losey says:

      Thanks for the comment Dean. We have been rubbing shoulders in the search field together for many a year. So I know you understand the value and importance of search. In the spirit of “dialogue” that Richard Braman wanted us all to practice in grappling with important new legal issues, I will respond by saying I hear you, and will process your message, and review your article, and try to fully grasp your perspective. I will respond later. Please venture back into the Lion’s Den any time. I will not bite, and if I roar, well, that’s just me. 🙂

  5. […] This is the second part of a two-part blog, please read part one first. […]

  6. […] E-discoveryteam.com- Ralph Losey, a practicing attorney, tackles the governance vs. search debate. Read More […]

  7. […] expert Ralph Losey on the subject of How to best conquer out of control information growth http://e-discoveryteam.com/2014/10/26/hadoop-data-lakes-predictive-analytics-and-ultimate-demise-of-…, in which he argues that we have arrived at 2 alternative approaches, namely, to either Classify […]

  8. […] destruction purposes under the umbrella of an information governance strategy? Or should they keep Big Data and comb through it with advanced search technologies? Don’t expect these questions to be resolved in 2015. There is […]
