There seems to be a lot of buzz around hyper-converged infrastructure, but does it fit data-intensive applications? Is it aligned with the Big Data, Cloud, IoT, and Docker trends, or with the direction the cloud giants are taking?
In this post we will examine industry trends, and the right ways to solve storage challenges in the Big Data era as manifested by industry leaders.
Some history first…
Before the cloud and virtualization era started, we had desktops and application servers with local disks. Critical or shared data was stored in database systems and file servers (NAS). Databases and file servers grew to a size that required storage networks, which gave birth to the FC SAN. Once we had a SAN we started consolidating the disks from application servers as well, automated server provisioning, and created virtual machines. Then came VDI (virtual desktops) and we consolidated desktop disks onto the SAN too, only to learn about things like I/O blenders and boot storms, which mandate SSDs for VDI to really work. When that proved too expensive, we invented block-level deduplication and thin clones to make sure we don't store all those redundant Windows OS files multiple times on the same disks.
The complexity and cost of deploying virtual infrastructure in the enterprise grew too high, especially when compared with public cloud efficiencies. This led to hyper-converged systems such as Nutanix and VMware EVO:RAIL, which place the disks back in the server instead of using expensive and complex SANs, and implement replication and striping across servers to provide availability and load balancing. But hyper-converged infrastructure addresses applications that need only a relatively small amount of data dedicated to a workload, the ones that used local disks in the good old days: virtual desktops and application servers. We use it to store OS and application files, not shared and/or managed data. We are back where we started: today's equivalents of databases and file servers (Big Data and file/object storage) need a different solution, one that can scale capacity cost-effectively and provide complete data management services.
And what lies in the future?
If we fast-forward a few years, the majority of the data will be Big Data: unstructured content like images, videos, Internet of Things (IoT) data, social networks, medical records, etc. Such data is growing exponentially, generated by billions of devices (IoT sensors, mobile phones), and is processed through multiple stages of streaming, Hadoop, NoSQL, and analytics applications. The data must be shared globally, so it cannot be held captive in tightly coupled compute and storage clusters where an individual workload can only access an individual virtual disk. Data needs to be stored in extremely cost-effective, highly scalable, and fully accessible data repositories, and data security must be tightly enforced at a granular level.
If we examine the recent moves of the major cloud vendors, we see less focus and energy going into hyper-convergence (e.g. AWS EBS), and much more attention and effort going into object and data management technologies delivered as a service. Such solutions include AWS S3, DynamoDB, Redshift, Aurora, Google Cloud Datastore/Dataflow, Azure Blob, etc.
Counter to the hype, the amount of storage in desktops and application servers (the hyper-converged workloads) will go down. Today each application server uses a 10-100GB virtual disk image. Yes, we can deduplicate at the disk level, but some folks came up with a better idea called Docker: why not deduplicate at the application level, share the same kernel and application files across workloads, and install only the stuff you need? A Docker image is only 200-300MB, roughly 100 times less storage. This also simplifies application version management, and it is all file-based sharing, so there is no room for all the vDisk technology we perfected for so long. On the desktop side, with Office 365, Google Apps, tablets, and the fact that we store much of our content in cloud storage, we will need relatively little disk space in desktops, just as we see in tablets and Chromebooks.
But isn’t Hadoop hyper-converged?
In the beginning, Hadoop’s main innovation was forming a limited but low-cost distributed file store (HDFS), coupled with a Map/Reduce framework for processing data chunks in local proximity. Back then the focus was offline processing and batch workloads. Both HDFS and Map/Reduce are now being challenged, since data is ingested and processed in real time by multiple applications which cannot be collocated with the disks, and data feeds continue to grow in volume, velocity, and variety.
We see Hadoop Map/Reduce being replaced with in-memory and streaming technologies like Spark and Storm, while HDFS is being swapped with a variety of scale-out file and object storage solutions through HCFS (the Hadoop Compatible File System interface). Hadoop is becoming more of an application/OS framework that hosts and schedules multiple data processing tasks, rather than playing its original offline processing role.
Basing Hadoop on external object storage has many benefits in cloud and virtualized infrastructure, as articulated by Xplenty, which provides Hadoop as a service and uses Amazon S3 as its storage.
Hadoop is not the only emerging data management platform which started out converged and now “out-sources” its storage implementation: MongoDB recently introduced a pluggable storage engine, and Redis and MySQL have had one for a while.
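To make the “pluggable storage engine” idea concrete, here is a minimal sketch in Python. The class and method names are hypothetical, not MongoDB’s actual API; the point is that the database core codes against a small interface, so the persistence layer (local disk, LSM tree, or an object store) can be swapped without touching query logic:

```python
# Hypothetical sketch of a pluggable storage engine. The database core
# depends only on the StorageEngine interface; any backend that honors
# put/get can be plugged in, including one backed by object storage.
from abc import ABC, abstractmethod
from typing import Optional


class StorageEngine(ABC):
    @abstractmethod
    def put(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]: ...


class InMemoryEngine(StorageEngine):
    """Dict-backed engine; stands in for a real B-tree/LSM/object backend."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class Database:
    """Database core: query logic lives here, persistence is delegated."""
    def __init__(self, engine: StorageEngine):
        self.engine = engine

    def insert(self, doc_id: str, doc: bytes):
        self.engine.put(doc_id, doc)

    def find(self, doc_id: str) -> Optional[bytes]:
        return self.engine.get(doc_id)


db = Database(InMemoryEngine())  # swap in an object-store engine here
db.insert("user:1", b'{"name": "alice"}')
```

Swapping the backend is then a one-line change at construction time, which is exactly what makes the “diverged” architecture attractive.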
So there seems to be a growing consensus that “Big Data needs divergence,” as articulated by Enrico Signoretti in his recent blog post. Object storage is the preferred alternative, but higher-performance object implementations are needed.
So what is the right solution for data intensive applications?
Let’s first examine a few facts about Big Data and its trends:
- Data keeps growing, and at a faster rate than computation needs
- Storage systems have a lifespan of 5 years (with cold storage and SSDs it will be even longer), while servers become outdated after only 2-3 years
- We hardly delete data, but may need to dynamically move it across tiers over time
- The number of data sources can be endless (Mobile, IoT, Apps)
- Jobs/workloads processing the data can come and go, but the data stays
- Multiple jobs/workloads in different locations may need shared access to the same data (e.g. many write/ingest the data, some digest it, some analyze it, some visualize it)
- Data must have strong security and access control, governed across different users, applications, relations, and locations
Concluding from the above, data needs to be decoupled from application servers, accessed not as raw disk blocks but through abstract APIs such as object and key/value interfaces, to allow proper sharing, scaling, concurrency, tiering, global replication, and security. Data storage systems should be extremely cost-effective, with optimal layering: adding storage capacity should not force adding computation resources, and capacity should be built from different types of storage at different cost/performance points. The key focus should be on allowing many data producers and consumers simultaneous access to billions of data items, while maintaining SLAs and security.
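To illustrate what accessing data “through abstract APIs rather than raw disk blocks” looks like, here is a toy in-memory stand-in (not any vendor’s API): data is addressed by key, carries its own metadata for tiering and governance, and each operation is atomic so many producers and consumers can share the store concurrently:

```python
# Illustrative in-memory object store. Objects are addressed by key and
# carry metadata; put/get/list are atomic, so concurrent producers and
# consumers never see partial state. A sketch, not a real service API.
import threading


class ObjectStore:
    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()  # makes each operation atomic

    def put(self, key: str, data: bytes, metadata=None):
        with self._lock:
            self._objects[key] = (data, dict(metadata or {}))

    def get(self, key: str):
        with self._lock:
            data, metadata = self._objects[key]
            return data, metadata

    def list(self, prefix: str = ""):
        with self._lock:
            return sorted(k for k in self._objects if k.startswith(prefix))


store = ObjectStore()
# An ingest job writes sensor readings; metadata drives tiering/security.
store.put("sensors/device42/2015-06-01", b"\x01\x02",
          metadata={"tier": "hot", "owner": "ingest-job"})
```

Note that nothing here ties an object to a particular server or virtual disk: any analytics or visualization job that knows the key (and passes the access checks the metadata enables) can read it, which is the decoupling argued for above.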
Object storage is the best match for these requirements: it is low cost, endlessly scalable, and globally distributed, offers flexible metadata, and has simple atomic APIs which do not limit concurrency. Today’s object storage solutions have drawbacks such as lower performance and limited consistency or security, but those are implementation-specific issues that can be solved. In a sense, with future object storage we re-divide the responsibilities between applications and storage, moving just enough intelligence into the storage, but not too much, so that it stays generalized and scalable.
An alternative could be scale-out NAS solutions, but they cannot meet the requirements we outlined, and there are no examples of NAS at cloud scale. Facebook figured this out many years ago and introduced its Haystack object storage as the successor to a limited NFS-based solution. Haystack has evolved since, and Facebook recently introduced a tiered Blob (object) storage.
Another great example is Amazon Aurora, a scalable, high-performance, and consistent (ACID) SQL database which is layered on top of the Amazon S3 object storage service. In Aurora, layers traditionally handled by the database application moved into the object storage, which allowed much better performance, durability, and scalability. Other examples include Google’s latest internal database, Spanner (a “NewSQL” successor to Google BigTable, which was the first NoSQL database); Spanner is based on a new storage layer called Colossus and is fully consistent (ACID), unlike current NoSQL and object storage implementations.
Many of the latest IT trends, such as Cloud, Hadoop, and NoSQL, started incubating in the corridors of Amazon, Google, and Facebook, and it seems they are now setting the next ones. The only question is when the rest of the industry will follow.