I stumbled upon the nice infographic below describing the hyper growth in data, which projects that the rate at which data is generated per year will be 4,300% higher. We are challenged with hyper growth in unstructured data (Volume), which is constantly ingested and needs to be processed at significantly faster rates (Velocity), not to mention the different sources and data types (Variety).
In order to deliver valuable insights and draw the right conclusions, all the data must be accessible regardless of its physical location. Data must be "always connected" yet secured. The current approach of preparing, staging, and copying the data into the application cluster before running any analysis just doesn't make sense when data keeps growing and changing, not to mention the need to draw real-time or interactive conclusions, which cannot wait for lengthy data preparation or ETL processes.
In-memory and stream-processing approaches are a good addition, but their focus is on reducing, cleaning, or transforming the data into semi-structured or indexed forms. We still have to store the enormous amounts of data cost-effectively and make it accessible to different consumers across the globe.
Can the current storage approaches like SAN, NAS, Hadoop (HDFS), or hyper-converged infrastructure address those challenges? Probably not, since they focus on managing data silos directly attached to, or co-located with, the applications. With these approaches, data is not "networked" and cannot be accessed from everywhere, adding capacity requires adding expensive compute resources, data is constantly copied, and security is not tightly enforced or granular enough.
So we need a radical change to address the 3 Vs: a shift from data silos to hyper-scale, high-performance, infinitely scalable, always-connected, and extremely secure data lakes. Microsoft's new Azure Data Lake offering seems to be a step in the right direction.