The Next Gen Digital Transformation: Cloud-Native Data Platforms

The software world is rapidly transitioning to agile development using micro-services and cloud-native architectures.  And there’s no turning back for companies that want to be competitive in the new digital transformation.

Evolution in the application space has a significant impact on the way we build and manage infrastructure. Cloud-native applications in particular require shared cloud-native data services.

The Old Apps and Storage

To better understand this new approach, let’s take a quick look back.

With legacy applications, servers had disk volumes that held the application data. As the markets matured and services changed, things shifted to clouds and infrastructure-as-a-service (IaaS) which meant virtualized servers (virtual machines, or VMs) were mapped to disk partitions (vDisks) in a 1:1 relationship. Storage vendors took pools of disks from one or more nodes, added redundancy, and provisioned them as virtual logical unit numbers (LUNs).
Then came hyper-converged infrastructure. This technology wave refined the process by pooling disks from multiple nodes. Real security wasn’t required; instead, the solution relied on isolation, also known as zoning, to ensure that only the relevant server talked to its designated LUN.



The New Apps and Data

As the evolution continues, the new phase is platform-as-a-service (PaaS). Rather than managing virtual infrastructures such as virtual machines, apps are now managed. Rather than managing virtual disks, data is now managed. The applications don’t store any data or state internally because they are elastic and distributed.  The applications use a set of persistent and shared data services to store data objects, streams, logs, and records.

These new data services and NoSQL technologies do not require legacy storage stacks since the resiliency, compression, and data layouts are built into the data services. What that means is that, for example, traditional redundant arrays of independent disks (RAID) and deduplication features are useless and, in some cases, potentially harmful.


This new model must address the data sharing challenge since many apps access the same data – and from far-flung places, such as mobile devices or remote sensors.  In addressing this new reality, here are some important aspects for consideration:

  • Security must evolve from isolation to tighter management of who is allowed to access what and how; it must have the capability for guaranteeing the identity of remote devices or container applications; and it must include the means for automatically detecting breaches.
  • When different apps or devices access the same data, guaranteeing data consistency and integrity without significant performance degradation is critical.
  • Understanding how searched data will be found among potentially billions of items is crucial.
  • Different applications may have different access patterns to the data; today they use purpose-optimized data services. To support the case where different apps with different access requirements share the same data, we need to enable a broad set of APIs and access methods, or we will end up creating copies or running ETL jobs.

Today, the industry is seemingly focusing its efforts on cloud-native application platforms while using a fragmented set of data services or legacy storage approaches. That’s led some vendors to quickly rebrand their legacy storage technologies as “Cloud-Native”.

In reality, as we move from IaaS to PaaS we need to think about “data containers” rather than storage – similarly to how we now focus on application containers rather than the underlying VMs.

Interestingly enough, the large cloud providers took the opposite approach and are now focusing their efforts on building fully managed and self-service cloud-native data platforms to serve as the ideal home and storage for the new apps. Is it a ploy to lock in customers? Probably.  They realize data has gravity, and that once it lands in their cloud platform, the apps will follow and the customers will be captive for a long time.

But here’s something worth considering. In parallel to the many new and noteworthy commercial cloud-native application platforms from companies like Docker Inc., Mesosphere, and others, the industry also needs commercial-grade cloud-native data platforms that address the difficult challenges in managing and sharing huge amounts of diverse data items and deliver data-as-a-service on-premises.

There is no reason why, with a bit of innovation, on-prem data platforms cannot be faster, cheaper, and simpler to use than those in the public cloud. After all, the public cloud is a decade old and not much of its technology has changed. With some forward thinking and more modern technology, on-prem data platforms may be the next big thing. Think “modern architecture meets tomorrow’s challenges.”

It’s Time to Reinvent Data Services

For decades, the IT industry has used and cultivated the same storage and data management stack. The problem is, everything around those stacks changed from the ground up — including new storage media, distributed computing, NoSQL, and the cloud.

Combined, those changes make today’s stack exceedingly inefficient — slowing application performance, creating huge operational overhead, and hogging resources. Today’s stack also produces multiple data silos, each optimized for a single application usage model, and forces data duplication whenever multiple access models are needed.

With the application stack now adopting a cloud-native and containerized approach, what we also need are highly efficient cloud-native data services.

The Current Data Stack

Figure 1 shows, in red, the same functionality being repeated in various layers of the stack. This needless repetition leads to inefficiencies. Breaking the stack into many standard layers is not the solution, however, as the lower layers have no insight into what the application is trying to accomplish, leading to potentially even worse performance. The APIs are usually serialized and force the application to call multiple functions to update a single data item, leading to high overhead and data consistency problems. A functional approach to updating data elements is being adopted in cloud applications and can eliminate much of that chatter.
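As a rough illustration of the difference (the `store` object and its methods here are hypothetical stand-ins, not any specific product API), compare a chatty, serialized update with a single functional call that the data service applies atomically next to the data:

```python
# Chatty, serialized style: several round trips to update one item,
# with consistency left entirely to the caller.
def update_balance_chatty(store, account_id, delta):
    record = store.get(account_id)              # round trip 1
    record["balance"] += delta
    store.put(account_id, record)               # round trip 2
    store.update_index("balance", account_id)   # round trip 3

# Functional style: ship the update itself to the data service, which
# applies it atomically (one round trip, no partially updated state).
def update_balance_functional(store, account_id, delta):
    store.update(account_id, lambda rec: {**rec, "balance": rec["balance"] + delta})
```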


Figure 1: Current Data Stack

In the current model the application state is stored with the application (known as stateful or persistent services). This is in contrast to the cloud-native architecture and leads to deployment and maintenance complexity.

When developing cloud-native apps in the public cloud, data is stored in shared services such as Amazon S3, Kinesis, or DynamoDB. Meanwhile, the application container is stateless, leading to lower cost and easier-to-manage application environments. To save the extra overhead, new data services use direct attached storage and skip the lowest layer (external storage array or hyper-converged storage), but that process eliminates only a small part of the problem.
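For example, a stateless worker can hand its state to a shared service instead of a local disk. The sketch below uses the AWS SDK for Python against S3; the bucket and key names are hypothetical:

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-state"                  # hypothetical bucket name
KEY = "workers/worker-42/state.json"     # hypothetical key

def save_state(state: dict) -> None:
    # The container keeps nothing locally; every update goes to the shared service.
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(state).encode())

def load_state() -> dict:
    # A replacement container (or a scaled-out instance) picks up exactly
    # where the previous one left off.
    obj = s3.get_object(Bucket=BUCKET, Key=KEY)
    return json.loads(obj["Body"].read())
```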

In the last few years, new types of media – including SSDs, NV-memory solutions, key/value disks, and shingled drives – have emerged. Bolting a sector-based disk access API or even a traditional file system on top of those new media types may not be the correct approach; there are more optimal ways to store a data structure on a specific medium, or the medium may offload portions of the upper stack. The media API needs to be at a higher level than emulating disk sectors and tracks on elements with no spinning head. A preferred functional yet generic approach would be to store structures of variable size with an ID (key) that will be used to retrieve that data (also called key/value storage).
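A minimal sketch of such an abstraction (the names are illustrative only; a real implementation would sit on a media-specific driver) could look like this:

```python
from abc import ABC, abstractmethod

class MediaKV(ABC):
    """Key/value media abstraction: store variable-size values by key,
    letting each backend lay data out in a media-optimal way."""

    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes) -> bytes: ...

    @abstractmethod
    def delete(self, key: bytes) -> None: ...

class InMemoryKV(MediaKV):
    """Stand-in backend; a flash backend would map keys to erase blocks,
    an NV-memory backend to byte-addressable pages, a disk backend to sectors."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = bytes(value)

    def get(self, key):
        return self._data[key]

    def delete(self, key):
        del self._data[key]
```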

Time for a new stack

The requirements are simple:

  • Don’t implement the same functionality in multiple layers
  • Enable stateless application containers and a cloud-native approach
  • Avoid data silos; store the data in one place that efficiently supports a variety of access models
  • Provide secure and shared access to the data from multiple applications and users
  • Enable media-specific implementations and hardware offloads
  • Simplify deployment and management

An optimal stack has just three layers, as is illustrated in Figure 2: Elastic applications, elastic data services, and a media layer.


Figure 2: The New Data Stack

  • Elastic applications want to persist or retrieve data structures in the form of objects, files, records, data streams, and messages. All of these can be handled by existing APIs or protocols mapped to a common data model, and in a stateless way that always commits updates to the backend data services (this guarantees that apps can easily recover from failures and that multiple apps can share the same data in a consistent fashion).
  • The elastic data services expose “data containers” which store and organize data objects serving one or more applications and users. Applications can read, update, search, or manipulate data objects, with the data service providing guaranteed data consistency, durability, and availability (a minimal sketch follows this list). Security can be enforced at the data layer as well, making certain that only designated users or applications access the right data elements. Data services should scale out efficiently, and can potentially be distributed on a global scale.
  • The media layer should store data chunks in the most optimal way for the media, which might include a remote storage system or cloud. Data elements of variable sizes can be assigned unique keys to retrieve the data rather than accessing fixed-size disk sectors. By adding a key/value abstraction, we can implement a media-specific way to store the chunks. For example, when using non-volatile memory one would use pages; with hard drives, one would use disk sectors; and with flash one would use blocks and eliminate redundant flash logical-to-physical mappings and garbage cleaning. The media or remote storage may support certain higher-level features such as RAID, compression, deduplication, and more, in which case the data service can skip those features in software and offload them to the hardware.
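To make the consistency point concrete, here is one purely illustrative way (the `container` client and its methods are hypothetical) a data container could let many applications update shared objects safely, using versioned compare-and-set rather than locks:

```python
class ConflictError(Exception):
    """Raised when another application updated the object first."""

def update_object(container, key, mutate, retries=5):
    # `container` stands in for a data-container client: get() returns the
    # object plus its current version, and put() succeeds only if the version
    # is unchanged, so concurrent writers never silently overwrite each other.
    for _ in range(retries):
        value, version = container.get(key)
        try:
            container.put(key, mutate(value), if_version=version)
            return
        except ConflictError:
            continue  # someone else won the race; re-read and retry
    raise ConflictError(f"could not update {key!r} after {retries} attempts")
```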



DC/OS Enables Data Center “App Stores”

Containers are being widely adopted because they simplify the application development and deployment process. The transition to continuous integration with Microservices and Cloud-Native architectures will have a significant impact on business agility and competitiveness: they bring the notion of Smartphone Apps to the datacenter.

Unfortunately, today there are still gaps, and Cloud-Native implementations require quite a bit of integration and DevOps. A few commercial and proprietary offerings try to tackle the integration challenge, but without a real community behind them they may not gain enough momentum. This is why Mesosphere’s latest move to open source DC/OS can be a significant milestone in enabling the Enterprise “App Store” experience.


DC/OS Enabling Data Center “App Stores”

The Enterprise “App Store” Micro-Services Stack

When building a Micro-Services architecture, we need several key ingredients such as:

  • A trusted and searchable image repository (the Application Marketplace)
  • A way to monitor and manage the physical or virtual cluster (the devices)
  • A scheduler and orchestrator to automate deployment and resource management of apps
  • Cloud-Native Storage to host shared data and state

For a real enterprise deployment, you would need a bunch of additional components like Service discovery, Network isolation, Identity management, and the list goes on.

Integrating the above discrete components manually is resource consuming and wasteful, given that everyone needs similar components. This is why we see the emergence of commercial “Cloud-Native Stacks” and the formation of standards organizations like the CNCF (Cloud Native Computing Foundation). Commercial or “best of breed” stacks have limited impact since they do not attract collaboration from multiple vendors. This is where DC/OS fits in: now there is a holistic Cloud-Native stack to which people can contribute and with which vendors can integrate.

As background, DC/OS (Datacenter Operating System) is based on Mesos, a popular cluster resource manager and orchestrator used by some large cloud operators. Mesos has a unique two-level approach that allows it to manage a cluster of clusters and support a wide array of workloads such as Docker containers, VMs, Hadoop, Spark, etc. over physical and virtual infrastructure spanning on-prem and cloud. Mesos can work with different scheduling paradigms for batch processing, continuous services, and Chronos (scheduled) tasks over the same resource pool, making it very attractive for a broad set of applications.

DC/OS extends Mesos by providing a fully integrated stack with:

  • Different scheduling tools for Spark, Kafka, Cassandra, Docker, etc.
  • Integrated resource monitoring across nodes, CPUs, and memory
  • An application package and configuration repository, which already pre-packages some of the common development tools and apps
  • Service discovery (DNS), authentication, proxy and load-balancing services, and more

Some integrations are in the works or can be extended by partners, for example using Project Calico for network virtualization and isolation, and persistent volumes for external storage. The most important part is that, as an open framework, new components can be added or modified. In the long run it would be most valuable if the various micro-services platforms converge on APIs; hopefully this can be accomplished by the CNCF.

Storing “Apps” Data

In mobile “Apps” we store the application image in the Google Play Store or Apple App Store, but the data and its configuration are stored somewhere in the cloud, so when our device is stolen or breaks we load the App on a new phone, enter our credentials, and we are back in business.

Micro-Service “Apps” should be no different. Data should not be stored “locally” on stateless and elastic containers, but rather in shared and highly available repositories exposing stateless (atomic) APIs for accessing files, objects, records, and streams/message queues. Multiple micro-service instances may access the same data simultaneously without imposing locks or inconsistency, access should be location independent, and managing the data should be as simple as with mobile “Apps”, where users and developers work in a self-service environment.
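As one concrete (and hedged) example of lock-free shared access, several stateless instances can update the same record in a managed NoSQL service using a conditional write; the table and attribute names below are hypothetical:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

def mark_shipped(order_id: str, expected_version: int) -> bool:
    """Atomic, lock-free update: succeeds only if no other instance
    has changed the record since we read it (optimistic concurrency)."""
    try:
        table.update_item(
            Key={"order_id": order_id},
            UpdateExpression="SET order_status = :s, version = :next",
            ConditionExpression="version = :expected",
            ExpressionAttributeValues={
                ":s": "shipped",
                ":next": expected_version + 1,
                ":expected": expected_version,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another instance got there first; re-read and retry
        raise
```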

This is the case today with cloud-native apps deployed on Amazon: they use managed data services like S3, DynamoDB, Kinesis, Redshift, etc. We still don’t have the on-prem equivalents of “Cloud-Native Storage Services”, but stay tuned; those will arrive if we want to deliver the “App Store” experience for on-prem applications.

Re-Structure Ahead in Big Data & Spark

Big Data used to be about storing unstructured data in its raw form: “Forget about structures and schema, it will be defined when we read the data”. Big Data has evolved since. The need for real-time performance, data governance, and higher efficiency is forcing back some structure and context.

Traditional databases have well-defined schemas which describe the content and the strict relations between the data elements. This made things extremely complex and rigid. Big Data’s initial application was analyzing unstructured machine log files, so a rigid schema was impractical. It then expanded to CSV or JSON files with data extracted (ETL) from different data sources. All were processed in an offline batch manner where latency wasn’t critical.

Big Data is now taking place at the forefront of the business and is used in real-time decision support systems, online customer engagement, and interactive data analysis with users expecting immediate results. Reducing Time to Insight and moving from batch to real-time is becoming the most critical requirement. Unfortunately, when data is stored as inflated and unstructured text, queries take forever and consume significant  CPU, Network, and Storage resources.

Big Data today needs to serve many use cases, many users, and a large variety of content, so data must be accessible and organized in order to be used efficiently. Unfortunately, traditional “Data Preparation” processes are slow, manual, and don’t scale; data sets are partial and inaccurate and are dumped into the lake without context.

As the focus on data security is growing, we need to control who can access the data and when. When data is unorganized there is no way for us to know if files contain sensitive data, and we cannot block access to individual records or fields/columns.

Structured Data to the rescue

To address the performance and data wrangling challenges, new file formats like Parquet and ORC were developed. These are highly efficient, compressed, binary data structures with flexible schemas. It is now the norm to use Parquet with Hive or Spark, since it enables much faster data scanning and allows reading only the specific columns which are relevant to the query, as opposed to going over the entire file.

Using Parquet, one can save up to 80% of storage capacity compared to a text format while accelerating queries by 2-3x.
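A short PySpark sketch of the idea (paths and column names are hypothetical); once the data is in Parquet, Spark reads only the referenced columns instead of scanning whole text files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-pruning").getOrCreate()

# One-time conversion: ingest raw CSV and persist it as compressed, columnar Parquet.
raw = spark.read.csv("/data/raw/events.csv", header=True, inferSchema=True)
raw.write.mode("overwrite").parquet("/data/lake/events.parquet")

# Queries now touch only the columns they need (column pruning) and benefit
# from predicate pushdown against the Parquet metadata.
events = spark.read.parquet("/data/lake/events.parquet")
daily = (events
         .select("event_date", "user_id")
         .where(events.event_date >= "2016-01-01")
         .groupBy("event_date")
         .count())
daily.show()
```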

The new formats force us to define some structure up front, with the option to expand or modify the schema dynamically, unlike older legacy databases. Having such schema and metadata helps reduce data errors and makes it possible for different users to understand the content of the data and collaborate. With built-in metadata it becomes much simpler to secure and govern the data and to filter or anonymize parts of it.

One challenge with the current Hadoop file-based approach, regardless of whether the data is unstructured or structured, is that updating individual records is impossible; it is constrained to bulk data uploads. This means that dynamic and online applications are forced to rewrite an entire file just to modify a single field. When reading an individual record, we still need to run full scans instead of selective random reads or updates. This is also true for what may seem to be sequential data, e.g. delayed time-series data or historical data adjustments.

Spark moving to structured data

Apache Spark is the fastest-growing analytics platform and can replace many older Hadoop-based frameworks. It is constantly evolving and trying to address the demand for interactive queries on large datasets, real-time stream processing, graphs, and machine learning. Spark has changed dramatically with the introduction of “DataFrames”, in-memory table constructs that are manipulated in parallel using machine-optimized low-level processing (see project Tungsten). DataFrames are structured and can be mapped directly to a variety of data sources via a pluggable API, including:

  • Files such as Parquet, ORC, Avro, JSON, CSV, etc.
  • Databases such as MongoDB, Cassandra, MySQL, Oracle, HP Vertica, etc.
  • Cloud storage like Amazon S3 and DynamoDB

DataFrames can be loaded directly from external databases, or created from unstructured data by crawling and parsing the text (a long and CPU/disk-intensive task). DataFrames can also be written back to external data sources, and this can be done in a random and indexed fashion if the backend supports such operations (e.g. in the case of a database).

The Spark 2.0 release adds “Structured Streaming”, expanding the use of DataFrames from batch and SQL to streaming and real-time. This will greatly simplify data manipulation and speed up performance: now we can use streaming, SQL, machine learning, and graph processing semantics over the same data.
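A rough Structured Streaming sketch (the socket source, host, and port are placeholders; a real pipeline would read from Kafka or files), showing the same DataFrame operations applied to a live stream:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Text lines arriving on a socket; swap in Kafka or file sources for real jobs.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same DataFrame operations used for batch work on the live stream.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")   # keep the full running aggregation
         .format("console")        # console sink for the sketch
         .start())
query.awaitTermination()
```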


Spark is not the only streaming engine moving to structured data; Apache Druid delivers high performance and efficiency by working with structured data and columnar compression.


New applications are designed to process data as it gets ingested and react in seconds or less, instead of waiting for hours or days. The Internet of Things (IoT) will drive huge volumes of data, which in some cases may need to be processed immediately to save or improve our lives. The only way to process such high volumes of data while lowering the time to insight is to normalize, clean, and organize the data as it lands in the data lake and store it in highly efficient dynamic structures. When analyzing massive amounts of data, we are better off running over structured and pre-indexed data; it will be faster by orders of magnitude.

With SSDs and Flash at our disposal there is no reason to re-write an entire file just to update individual fields or records – We’d better harness structured data and only modify the impacted pages.

At the center of this revolution we have Spark and DataFrames. After years of investment in Hadoop, some of its projects are becoming superfluous and are being displaced by faster and simpler Spark-based applications. Spark’s engineers made the right choice and opened it up to a large variety of external data sources, instead of sticking to Hadoop’s approach and forcing us to copy all the data into a crippled and low-performing file system … yes, I’m talking about HDFS.

Cloud Data Services Force Awakens

If you’ve been reading the storage blogs and analyst reports you may conclude that storage growth is in flash arrays, hyper-converged, and maybe scale-out NAS or object. Most ignore the massive growth in self-served and fully integrated data services and their potential impact on the overall storage market.

It’s all part of the same trend of moving from an infrastructure focus (private clouds, IaaS, ..) to services and applications (PaaS, SaaS, Micro-Services, DevOps, ..). Cloud vendors deliver full turn-key solutions that are billed by the hour or by usage. Amazon has experienced exponential growth for a few years in a row across all of those data services (S3, Redshift, DynamoDB, Kinesis, Aurora, ..).

Did you know that the cost of higher-level services like DynamoDB or Redshift is 30-100x higher per GB compared to the simple S3 service? Why settle for 3 cents/GB when you can charge dollars? And yet, many customers opt to pay the extra bucks and the adoption of these services is soaring, since they are robust, easy to use, and still lower TCO than legacy databases.

To be competitive, such services are built on the most cost-effective and bottleneck-free architectures. None of these services leverage SAN, hyper-converged, or even NAS; those are not relevant. As I will explain below, they use storage servers (DAS) with new and more relevant abstractions that maximize scale, efficiency, and performance. Those technologies are starting to proliferate within the enterprise and will have an even bigger impact on the already struggling enterprise storage market, especially with the erosion in margins and the fact that the highest data growth is in unstructured, BI, and analytics data – not the VM images or legacy databases most storage vendors go after.

If you are still skeptical, check out Google Cloud Platform: 8 of the 14 main services are managed data services, and none of them uses a SAN, NAS, or traditional hyper-converged (file/block) layer underneath.

GCP Data Services

Microsoft is already fully vested in this with Azure Data Services. Larry Ellison is now trying to steer the Oracle ship at full speed in that direction, selling per-usage cloud services instead of licenses. IBM is throwing off much of its hardware business baggage and refocusing on data services and analytics – it remains to be seen whether they manage to take off. HP is (over)promising the future “Machine” as the best hardware for data services and analytics. Meanwhile the enterprise storage vendors fight over the shrinking pie, without internalizing the colossal changes that are about to come.

So what’s wrong with the current storage abstractions?

The most common enterprise storage architecture today is SAN or virtual SAN (hyper-converged), which pulls together a bunch of disks in some RAID scheme and creates virtual disk (LUN) abstractions. Striping, caching, indexing, data layout, and compression are all owned by the storage layer, which has no clue how the apps work or lay out their data. This is why we constantly tune storage and applications. It may be OK for general-purpose VM images which are not application specific, but new trends such as micro-services are making those images much smaller and more stateless, and data is stored in dedicated persistent data services (see my post on micro-services architecture).

Now imagine a database or NoSQL data query. It is forced to scan and transfer many redundant disk blocks over the fabric, and it must implement its own layer of caching and indexing which doesn’t benefit from the underlying layer. Many of the new databases today implement extremely efficient contextual columnar compression and dedup, so compression in the underlying layer is useless. A critical problem is that any update to a higher-level data service involves multiple changes to the disk which have to occur atomically. This means that with SAN we must use locking and journaling with redo and undo logs – an extremely high-overhead operation. If the storage is application aware, it can also combine multiple types of storage like NV-RAM, flash, and disk and use them most appropriately, rather than relying on statistics-based tiering. Why waste x86 CPU cycles on organizing, caching, and indexing fixed-size, randomly accessed blocks when we can do it in the context of an organized set of variable-size application records?

Oracle designed Exadata quite some time ago; it pioneered the notion of database-optimized storage and moved table indexing, scanning, and compression to the storage node. Most of the NoSQL and Hadoop components today are optimized for direct attached storage (DAS) or key/value abstractions for the same reasons; running over SAN or NAS is more expensive and usually delivers worse results.

Last year Amazon exposed the internal architecture of Aurora (a scale-out SQL DB), which performs 6x faster than alternatives, scales linearly, and is far simpler to operate. Oracle is in a panic, since Aurora is one of the fastest-growing AWS services. As you can see below, they moved some of the traditional functionality like journaling and caching to the storage nodes and use S3 for archiving. You can imagine how much cost this saves when comparing it with SAN and all the operational overhead around it.


OK, so why not scale-out NAS?

Indeed, many traditional data services run on top of local file systems. This is mainly to abstract the storage and avoid ties to the low-level layers. They usually open a few very big sparse files and work with them just as if they were block storage, but with higher overhead.

Add to that all the file system lookups and traversals/indirections, and since the app needs to update several files per transaction we now have two layers of journaling – one in the file system and one in the application. In the previous post we also discussed the fact that traditional file systems don’t know how to take advantage of non-volatile, byte-aligned memory like Intel 3D XPoint, or how to maximize the value of NVMe.

If we use remote shared files (scale-out NAS) there will be too much protocol chatter and potential locks that degrade performance even further. Given that we need to guarantee transaction isolation and atomicity, we must implement replication as part of the application anyway and can’t depend solely on storage replication. There is now a growing trend to use optimized key/value abstractions under the data service (e.g. LevelDB, RocksDB) which own the data organization, fragmentation, and lookup, and those work best with direct attached disks, flash, or non-volatile memory.

When it comes to static files like images, logs, or archiving, NAS still has some room. The challenge is that with the large growth in unstructured data, NAS and POSIX semantics are becoming a huge burden; we cannot use central metadata services and directory traversals. We’d rather use the fast hashing and sharding approaches of object storage, which enable practically unlimited scale at very low cost. As the number of data items keeps growing, we are also becoming very reliant on extensible object metadata to describe and quickly look up objects/files.
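To illustrate the hashing/sharding idea (a deliberately simplified sketch, not any vendor’s actual layout), an object key can be mapped straight to a shard without consulting a central metadata service or walking a directory tree:

```python
import hashlib

NUM_SHARDS = 4096  # illustrative; real systems use consistent hashing or ring topologies

def shard_for(object_key: str) -> int:
    """Deterministically map an object key to a shard: any node can compute
    the location locally, with no directory traversal or metadata lookup."""
    digest = hashlib.md5(object_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

print(shard_for("videos/2016/06/cam7/frame-000123.jpg"))
print(shard_for("logs/web/2016-06-21.gz"))
```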

A somewhat overlooked fact is that much of the unstructured file content is ingested or read as a whole. While NAS can be fast on reads and writes, it’s extremely slow when creating/opening new files or doing metadata operations, again a big win for object.

Note that many of the new entrants in scale-out storage claim to have object implementations. Those are usually emulated over their limited POSIX/NFSv3-optimized file systems and are not native object implementations. A simple test would be to check how many objects they can create per second, how well and how fast they can query metadata across billions of objects, and whether they can really match Amazon S3 prices.


Many organizations no longer want to consume fragmented hardware and software components and integrate or maintain them internally; they like the cloud self-service approach, where developers or users can just consume services and pay for what they use. IT is now a utility, and organizations would rather focus their most valuable assets (people) on business applications and their competitive edge.

Integrated Data Services means we are no longer bound to legacy SAN and NAS approaches. New layering is required between storage and applications. We need to implement critical processing closer to the data, while decoupling application logic from the data to allow better scaling and elasticity.

Will the Enterprise IT Federation be able to defend against the imminent attack by the Cloud Empire? Will the FORCE guide them to stop thinking infrastructure, and deliver optimized and self-served data platforms? Will someone come to the rescue?

Wanted! A Storage Stack at the speed of NVMe & 3D XPoint

Major changes are happening in storage media hardware – Intel announced storage media that is 100x faster, way faster than the current software stack can handle. To make sure the performance benefits are evident, Intel also provides a new block storage API kit (SPDK) that bypasses the traditional stack. So will the current stack become obsolete?

Some background on the current storage APIs and Stack

Linux has several storage layers: SCSI, block, and file. Those were invented when CPUs had a single core and disks were really slow. Things like IO scheduling, predictive pre-fetch, and page caches were added to minimize disk IO and seeks. As more CPU cores arrived, basic multi-threading was added, but in an inefficient way using locks. As the number of cores grew further and flash was introduced, this became a critical barrier. Benchmarks we carried out some time ago showed a limit of ~300K IOPS per logical SCSI unit (LUN) and high latency due to the IO serialization.


Linux Storage Stack

NVMe came to the rescue: borrowing from RDMA hardware APIs, it created multiple work queues per device which can be used concurrently by multiple CPUs. Linux added blk-mq, eliminating the locks in the block layer; with blk-mq we saw 10x more IOPS per LUN and lower CPU overhead. Unfortunately, locks and serialization are still there in many of the stack components like SCSI, file systems, and RAID/LVM drivers, and it will take years to eliminate them all.

NVMe drivers are quite an improvement and solve the case for NAND SSDs, but with faster storage like Intel’s new 3D XPoint™ the extra context switches and interrupts take more CPU cycles than the access to the data itself. To fix this, Intel (again) borrowed ideas from RDMA and DPDK (high-speed network APIs) and developed SPDK, a work-queue-based direct-access (kernel bypass) API for accessing storage with much better CPU efficiency and lower latency.

We fixed Block, but what about files?

So now storage vendors are happy: they can produce faster and more efficient flash boxes. BUT what about the applications? They don’t use block APIs directly; they use file or record access!

Anyone who has developed high-performance software knows the basic trick is to have an asynchronous, concurrent, and lock-free implementation (a work queue). Linux even has such an API (libaio), but it does not work well for files due to a bunch of practical limitations.

So what do people do? They emulate it using many CPUs for “IO worker threads”. Applications perform OK, but we spend a lot more CPU and end up with higher latency, which hinders transaction rates; this latency grows exponentially as the queue load increases (see my post: It’s the latency, Stupid!). It also adds quite a bit of complexity to apps.
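A minimal sketch of that worker-thread emulation (the file path and sizes are placeholders); each in-flight IO occupies a blocking thread, which is exactly the overhead that queue-based, lock-free APIs avoid:

```python
import os
from concurrent.futures import ThreadPoolExecutor

PATH = "/tmp/sample.dat"   # placeholder data file
BLOCK = 4096

def read_block(offset: int) -> bytes:
    # Each call blocks a whole worker thread for the duration of the IO.
    fd = os.open(PATH, os.O_RDONLY)
    try:
        return os.pread(fd, BLOCK, offset)
    finally:
        os.close(fd)

# Emulate "async" IO by throwing threads at it: simple, but it burns CPU on
# context switches, and latency climbs quickly as the queue fills up.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(read_block, range(0, 128 * BLOCK, BLOCK)))
```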

Now, with NVMe cards that produce a Million IOPs, not to mention NV-RAM solutions like 3D XPoint™ with micro-second level latency, the current architecture is becoming outdated.

A new approach for the new storage media is needed

The conclusion is that we need the equivalent of the SPDK, DPDK, and RDMA APIs (OS-bypass, lock-free, hardware work-queue-based approaches) for higher-level storage abstractions like file, object, and key/value storage. This will enable applications (not just storage vendors) to take full advantage of NVMe SSDs and NV-RAM, and in some cases be 10-1,000 times faster.

For local storage, a simple approach can be to run mini file systems, object stores, or simple key/value stores as a user-space library on top of the new SPDK APIs, avoiding context switches, interrupts, locks, and serialization. With the new NVMe over Fabrics (RDMA) protocol this can even be extended over the Ethernet wire. This is similar to the trends we see in high-performance networking, where high-level code and even full TCP/IP implementations run as part of the application process and use DPDK to communicate directly with the network. It may be a bit more complicated due to the need to maintain data and cache consistency across different processes on one or more machines.

A simpler solution can be to access a remote storage service: implement efficient direct-access file, object, or key/value APIs that make remote calls over a fast network using TCP/DPDK or RoCE (RDMA over Converged Ethernet) to those distributed storage resources. This way we can bypass the entire legacy SCSI and file-system stack. Contrary to common belief, non-serialized access to a remote service over a fast network is faster than serialized/blocking access to a local device. If it sounds a bit crazy, I suggest watching this two-minute section on how Microsoft Azure scales storage.

Unfortunately, there is still no organization or open-source community taking on this challenge and defining or implementing such highly efficient data-access APIs. Maybe Intel should pick up that effort too if it wants to see greater adoption of the new technologies.


The storage industry is going through some monumental changes: cloud, hyper-converged, object, and new technologies like NVMe and 3D XPoint™. But in the end we must serve the applications, and if the OS storage software layers are outdated we must raise awareness and fix them.

Anyone share my view?

EMC/Dell, IBM, HP – Wake Up!

The Dell/EMC merger, the decline in storage revenues at IBM and company… traditional IT vendors are under attack! What is going on? What’s the bigger picture? How can they recover? I’ll try to answer these questions here and suggest some less “simplistic” answers.

We see several major trends which shape our life:

  • Most data is consumed and generated by mobile devices
  • Applications move to a SaaS based data-consumption model
  • Everyone from retailers and banks to governments depends on data-based intelligence to stay competitive
  • We are on the verge of the IoT revolution with billions of sensors generating more data

New companies prefer using Office 365 or other cloud services and avoid installing and maintaining local Exchange servers, ERP, or CRM platforms, relieving them from investing capital in hardware or software. If you are a twenty- or even a hundred-employee firm, there’s no reason to own any servers – do it all in the cloud. But at the same time, many organizations depend on their (exponentially growing) data and home-grown apps and are not going to ship them to Amazon anytime soon.

Enterprise IT is becoming very critical to the business. Let’s assume you are the CIO of an insurance company. If you don’t provide a slick and responsive mobile app and tiered pricing based on machine learning of driver records, someone else will, and will disrupt your business, i.e. the “Uber” effect.  This forces enterprises to borrow cloud-native and SaaS concepts for on premise deployments, and design around agile and micro-services architecture. These new methodologies impose totally different requirements from the infrastructure, and have caught legacy enterprise software and hardware vendors unprepared.


The challenge for current IT vendors and managers is how to transform from providing hardware/system solutions for the old application world (Exchange, SharePoint, Oracle, SAP, ..), which is now migrating to SaaS models, into platform/solution providers for the new agile and data-intensive applications that matter most to the business. Some analysts attribute the EMC and IBM storage revenue decline to “Software Defined Storage” or hyper-converged infrastructure. Well… those may have had some impact, but the big strategic challenge they face is the data transformation.

The above trends impact data in three dimensions: locality, scalability, and data awareness. As we go mobile and globally distributed, data locality (silos) becomes a burden; with more data we must address scalability and complexity and lower the overall costs. To serve the business we must be able to find relevant data in huge data piles, draw insights faster, and make sure the data is secured, forcing us to build far more data-aware systems.

New solutions have been designed for the cloud and replace traditional enterprise storage, as illustrated in the diagram below.


We used to think of cloud as IaaS (infrastructure as a service), and had our private IaaS clouds using VMware or, now, OpenStack. But developers no longer care about infrastructure; they want data and computation services and APIs. Cloud providers acknowledge that and have been investing most of their energy in the last few years in technologies which are easier to use, more scalable and distributed, and which gain deeper data awareness.

Google developed new databases like Spanner, which can span globally and handle semi-structured data consistently (ACID), new data streaming technologies (Dataflow), object storage technologies, etc. Amazon is arming itself with a fully integrated scale-out data services stack including S3 (object), DynamoDB (NoSQL), Redshift (DW), Kinesis (streaming), Aurora (scale-out SQL), Lambda (notifications), and CloudWatch. And Microsoft is not standing still, with new services like Azure Data Lake, and is shifting developers from MS SQL to scale-out database engines.

What’s common to all of those new technologies is the design around commodity servers, direct attached high-density disk enclosures or SSDs, and fast networking. They implement highly efficient and data-aware low-level disk layouts to optimize IO and search performance, and don’t need a lot of the SAN, vSAN/HCI, or NAS baggage and features; resiliency, consistency, and virtualization are handled at higher levels.

Cloud data services have self-service portals and APIs and are used directly by the application developers, eliminating operational overhead. There is no need to wait for IT to provision infrastructure for new applications; we consume services and pay per use.

The latest stats show that the majority of Big Data projects are done in the cloud, mainly due to the simplicity, and the majority of on-premises deployments are not considered successful; some even have negative ROI (due to human resource drain, lots of professional services, and expensive software licenses). So as long as you’re not concerned about storing your data in the cloud, it is simpler and cheaper.

What EMC, IBM, Dell, and HP have been building is the same non-data-aware legacy storage model, just slightly better. Many of the new entrants in all-flash, scale-out vSAN, scale-out NAS, and hyper-converged (HCI) are basically going after the same old resource-intensive application world with somewhat improved efficiency and scalability. But they do not cater to the new world (read my post on cloud-native apps).

HP has Vertica, which is good but a point solution. EMC has Pivotal, with several elements of the stack. Many are partial and not too competitive, and they lack overall integration. IBM has the richest stack, including SoftLayer, BigInsights, and the latest Cleversafe acquisition, but not yet in an integrated fashion, and it needs to prove it can be as agile and cost effective as the cloud guys.

They can build a platform using open source projects. The challenge is that those are usually point solutions, not robust, need to be installed and managed individually, and require glue logic and sorting out of dependencies. Most lack the usability, security, compliance, or management features needed for the enterprise; those issues must be addressed to make a viable platform. We need things to be integrated just like Amazon does, using S3 to back up DynamoDB, Kinesis as a front end to S3, Redshift, or DynamoDB, and Lambda as an S3 event handler. Vendors may need to shift some focus from their IaaS and OpenStack efforts (IT orientation) to micro-services, PaaS, and SaaS (DevOps orientation) with modern, self-service, integrated, and managed data services.

There is still a huge market for on-premises IT. Many organizations like banks, healthcare, telcos, large retailers, etc. have lots of data and computation needs and will not trust Amazon with them. Many enterprise users are concerned about security and compliance, or are guided by national regulations, and would prefer to stay on premises. And not all Amazon users are happy with the bill they get at the end of the month, especially when they are heavy users of data and services.

But today cloud providers deliver significantly better efficiency than what IT can offer the business units. If that doesn’t change, we will see more users flowing to the cloud, or clouds coming to the users: Amazon and Azure are already talking to organizations about building on-premises clouds inside the enterprise, basically outsourcing the IT function altogether.

Enterprise IT vendors and managers had better wake up soon, take action, and stop whining about public clouds, or some of them will be left out in the cold.

Cloudera Kudu: Yet Another Data Silo?

Recently Cloudera launched a new Hadoop project called Kudu. It is quite aligned with the points I made in my Architecting BigData for Real Time Analytics post, i.e. the need to support random and faster transactions or ingestion alongside batch or scan processing (done by MapReduce or Spark) on a shared repository.

Today, users who need to ingest data from mobile, IoT sensors, or ETL, and at the same time run analytics, have to use multiple data storage solutions and copy the data, just because some tools are good for writing (ingestion), some are good for random access, and some are good for reading and analysis. Beyond being rather inefficient, this adds significant delay (copying terabytes of data), which hinders the ability to provide real-time results, and adds data consistency and security challenges. Not having true update capabilities in Hadoop also limits its use in the high-margin data warehouse market.

Cloudera wants to penetrate the enterprise faster by simplifying the stack, and at the same time differentiate itself and maintain leadership. It sees how many native Hadoop projects like Pig, Hive, and Mahout are becoming irrelevant next to Spark or other alternatives, and it must justify its multi-billion dollar valuation.

But isn’t Kudu adding more fragmentation, and yet another app-specific data silo? Are there better ways to address the current Hadoop challenge? More on that later…

What is Kudu?

You can read a detailed analysis here. In a nutshell, Kudu is a distributed key/value store with column awareness (i.e. a row value can be broken into columns, and column data can be compressed), somewhat like the lower layers of Cassandra or Amazon Redshift. Kudu also incorporates caching and can leverage memory or NVRAM. Kudu is implemented in C++ to allow better performance and memory management (vs. HDFS in Java).

It provides a faster alternative to HDFS with Parquet files, plus it allows updating records in place instead of rewriting the entire dataset (updates are not supported in HDFS or Parquet). Note that Hortonworks recently added update capabilities to Hive over HDFS, but in an extremely inefficient way (every update adds a new file which is linked to all the previous version files, due to HDFS limitations).


Source: Cloudera

What is it NOT?

Kudu is a point solution. It is not a full replacement for HDFS, since it doesn’t support file or object semantics and it is slower for pure sequential reads. It is not a full NoSQL/NewSQL tool either; other solutions like Impala or Spark need to be layered on top to provide SQL. Kudu is faster than HDFS, but still measures transaction latency in milliseconds, i.e. it can’t substitute for an in-memory DB. It also has higher insert (ingestion) overhead, so it is not better than HBase for write-intensive loads.

Kudu is not “unstructured” or “semi-structured”; you must explicitly define all your columns, as in a traditional RDBMS, somewhat against the NoSQL trend.

Kudu is a rather new project at alpha or beta-level stability, with very limited or no functionality when it comes to data security, compliance, backups, tiering, etc. It will take quite a bit of time until it matures to enterprise levels, and it’s not likely to be adopted by the other Hadoop distributions (MapR or Hortonworks), who are working on their own flavors.

What are the key challenges with this approach?

The APIs of Kudu do not map to any existing abstraction layer. This means Hadoop projects need to be modified quite a bit to make use of it; Cloudera will only support Spark and Impala initially.

It is not a direct HDFS or HBase replacement, as outlined on Cloudera’s web site:

“As Hadoop storage, and the platform as a whole, continues to evolve, we will see HDFS, HBase, and Kudu all shine for their respective use cases.

  • HDFS is the filesystem of choice in Hadoop, with the speed and economics ideal for building an active archive.
  • For online data serving applications, such as ad bidding platforms, HBase will continue to be ideal with its fast ability to handle updating data.
  • Kudu will handle the use cases that require a simultaneous combination of sequential and random reads and writes – such as for real-time fraud detection, online reporting of market data, or location-based targeting of loyalty offers”

This means the community has created yet another point solution for storing data. Now, in addition to allocating physical nodes with disks to HDFS, HBase, Kafka, MongoDB, etc., we need to add physical nodes with disks for Kudu, and each of those data management layers will have its own tools to deploy, provision, monitor, secure, and so on. Users who are already confused now need to decide which to use for which use case, and will spend most of their time integrating and debugging open-source projects rather than analyzing data.

What is the right solution IMHO?

The challenge with Hadoop projects is that there is no central governance, layering, or architecture like in the Linux kernel, OpenStack, or any of the other successful open source projects I have participated in. Every few weeks we hear about a new project, which in most cases overlaps with others. In many cases different vendors will support different packages (the ones they contributed). How do you do security in Hadoop? Well, that depends on whether you ask Hortonworks or Cloudera. Which tool has the best SQL? Again, it depends on the vendor; even the file formats are different.

I think it’s OK to have multiple solutions for the same problem. Linux has an endless number of file systems, as we know, and OpenStack has several overlay network and storage provider implementations, BUT they all adhere to the same interfaces and have the same expected semantics, behavior, and management. Where we see a lot of overlap (as with HDFS, Kudu, HBase) we create intermediate layers so components can be shared.

If there is a consensus that the persistent storage layers in Hadoop (HDFS, HBase) are limited or outdated, that we may need new abstractions to support record or column semantics and improve random access performance, or if we all understand that security is a mess, the best way is to first define and agree on the new layers and abstractions, and gradually modify the different projects to match them. If the layers are well defined, different vendors can provide different implementations of the same component while enjoying the rest of the ecosystem. Existing commercial products or already established open-source projects can be used with some adaptation and immediately deliver enterprise resiliency and usability. We can grow the ecosystem beyond the current three vendors, with better solutions that complement each other rather than compete over the entire stack, and we can add more analytics tools (like Spark) which may want to access the data layer directly without being tied to the entire Hadoop package.

If we seek better enterprise adoption of Hadoop, I believe the right way for Cloudera or the other vendors is to provide an open interface for partners to build solutions around, much like Linux, OpenStack, MongoDB, Redis, or MySQL did alongside their own reference implementations. A better way for them to build value may be to improve overall usability and focus on pre-integrating vertical analytics applications or recipes on top of the infrastructure.

Adding another point solution like Kudu with a singular data model, and working for a few years to make it enterprise grade, just makes Hadoop deployment more complex and makes the life of data analysts even more miserable. Well… on second thought, if most of your business is in professional services, keeping things complicated might be a good idea :)

Cloud-Native Will Shake Up Enterprise Storage !


Enterprise IT is on the verge of a revolution, adopting hyper-scale and cloud methodologies such as micro-services, DevOps, and Cloud-Native. As you might expect, the immediate reaction is to try to apply the same infrastructure, practices, and vendor solutions to the new world, but many solutions and practices are becoming irrelevant, SAN/VSAN and NAS among others.

Read my previous blog post for background on Cloud-Native, or this nice post from an eBay expert.


In the new paradigms we develop software the way cloud vendors do:

  • We assume everything can break
  • Services need to be elastic
  • Features are constantly added in an agile way
  • There is no notion of downtime

The way to achieve this nirvana is to use small, stateless, elastic, and versioned micro-services deployed in lightweight VMs or Docker containers. When we need to scale, we add more micro-service instances. When we need to upgrade, the DevOps guys replace the micro-service version on the fly and declare its dependencies. If things break, the overall service is not interrupted. The data and state of the application services are stored in a set of “persistent” services (which will be elaborated on later), and those have unique attributes such as atomicity, concurrency, and elasticity, specifically targeting the new model.

If we contrast this new model with current enterprise IT: today, application state is stored in virtual disks. This means we need complex and labor-intensive provisioning tools to build, snapshot, and back them up. Storage updates are not atomic, so we invented “consistent snapshots”, which don’t always work. We don’t distinguish between shared OS/application files and data, so we must dedup all the overlap. Today the storage layer is not aware of the data semantics, so we deploy complex caching solutions, or just go for an expensive all-flash or in-memory solution – why be bothered with app-specific performance tuning?

Managing Data in a Stateless World

Now that we understand the basic notion that everything can and will break, we have to adopt several key data storage paradigms:

  • All data updates must be atomic and to a shared persistency layer. We cannot have temporary dirty caches, cannot use local logs, cannot do partial updates to files, or maintain local journals in the micro-service. Micro-services are disposable!
  • Data access must be concurrent (asynchronous). Multiple micro-services can read/update the same data repository in parallel. Updates should be serialized, no blocking or locking or exclusivity is allowed. This allows us to adjust the number of service instances according to demand.
  • Data layer must be elastic and durable – we need to support constant data growth or model changes without any disruption to the service. Failures to data nodes should not lead to data loss.
  • Everything needs to be versioned to detect and avoid inconsistencies.

You may notice that enterprise NAS and POSIX semantics, not to mention SAN/VSAN solutions, do not comply with the above requirements, specifically with atomicity, concurrency, and versioning. This explains why hyper-scale cloud vendors don’t widely use SAN or NAS internally.

With Cloud-Native apps, services like object storage, key/value stores, message queues, and log streams are used to make the different types of data items persistent. Disk images may still exist to store small stateless application binaries (as Docker does); those are generated automatically by the build and CI/CD systems and don’t need to be backed up.


Data items and files are backed up in the object storage, which has built-in versioning, cloud tiering, and extensible, searchable metadata. There is no need for separate backup tools, processes, or complex integrations, and no need to decipher VMDK (virtual disk) image snapshots to get to a specific file version, since data is stored and indexed in its native and most granular form.

Unlike traditional file storage, Cloud-Native data services have atomic and stateless semantics such as put (to save an object/record), get (to retrieve an object or record by key and version), list/select (to retrieve a set of objects or records matching a query statement and the relevant version), and exec (to execute a DB-side procedure atomically).
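As a rough illustration of versioned, atomic object semantics (assuming an S3 bucket with versioning enabled; bucket and key names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "app-backups"                     # hypothetical, versioning-enabled bucket
KEY = "configs/service-a/config.json"

# Put: each write is atomic and creates a new immutable version of the object.
s3.put_object(Bucket=BUCKET, Key=KEY, Body=b'{"replicas": 3}')

# List: enumerate the object's versions; no disk-image snapshot to decipher.
history = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)
for v in history.get("Versions", []):
    print(v["VersionId"], v["LastModified"], v["IsLatest"])

# Get: retrieve any specific version directly by key + VersionId.
latest = history["Versions"][0]["VersionId"]
obj = s3.get_object(Bucket=BUCKET, Key=KEY, VersionId=latest)
print(obj["Body"].read())
```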

The Table below describes some of the key persistent services by category

| Category | Amazon AWS Service Name | OpenSource Alternatives | Focus |
| --- | --- | --- | --- |
| Object Storage | S3 | OpenStack Swift | Store mid–large objects cost effectively; extensible metadata & versioning; usually slow |
| NoSQL/NewSQL DB, Key/Value | DynamoDB, Aurora | Cassandra, MongoDB, etc. | Store small–mid size objects; data/column awareness; faster |
| Object Cache (in memory) | ElastiCache (Redis, Memcached) | Redis, Memcached | Store objects in memory (as shared cache); no/partial durability |
| Durable Message Queue | Kinesis | Kafka | Store and route message and task objects between services; fast |
| Log Streams | CloudWatch Logs | Elasticsearch (ELK), Solr | Store, map, and query semi-structured log streams |
| Time Series Streams | CloudWatch Monitoring | Graphite, InfluxDB | Store, compact, and query semi-structured time series data |

One may raise the possibility of deploying those persistent services over a SAN or VSAN. But that won’t work well, since they must be atomic and keep the data, metadata, and state consistent across multiple nodes, and they implement their own replication anyway. Using underlying storage RAID/virtualization is therefore not useful (and in many cases even harmful). The same applies to snapshots/versioning, which those tools handle at transaction boundaries vs. at non-consistent intervals. In most cases such tools will just use a bunch of local drives.

What to expect in the future?

The fact that each persistent service manages its own data pool, repeats similar functionality, is tied to local physical drives, and lacks data security, tiering, backups, or data reduction is challenging. One can also observe that there is a lot of overlap between the services, and most of the difference lies in the trade-off between volume, velocity, and data awareness (variety). In the future many of these tools will be able to use shared low-latency, atomic, and concurrent object storage APIs as an alternative (already supported by MongoDB, CouchDB, Redis, Hadoop, Spark, etc.). This will centralize the storage resources and management, disaggregate the services from the physical media, allow better governance and greater efficiency, and simplify deployment. All are key for broader enterprise adoption.


If you are about to deploy a micro-services and agile IT architecture, don’t be tempted to reuse your existing IT practices. Learn how cloud and SaaS vendors do it, and internalize that it may require a complete paradigm shift. Some of those brand-new SANs, VSANs, hyper-converged systems, AFAs, and even scale-out NAS solutions may not play very well in this new world.