Recently Cloudera launched a new Hadoop project called Kudu. It is quite aligned with the points I made in my Architecting BigData for Real-Time Analytics post, i.e. the need to support random access and faster transactions or ingestion alongside batch or scan processing (done by MapReduce or Spark) on a shared repository.
Today, users who need to ingest data from mobile, IoT sensors, or ETL, and at the same time run analytics, have to use multiple data storage solutions and copy the data, just because some tools are good for writing (ingestion), some are good for random access, and some are good for reading and analysis. Beyond being rather inefficient, this adds significant delay (copying terabytes of data), which hinders the ability to provide real-time results, and it adds data-consistency and security challenges. Also, the lack of true update capabilities in Hadoop limits its use in the high-margin data-warehouse market.
Cloudera wants to penetrate the enterprise faster by simplifying the stack, and at the same time differentiate itself and maintain leadership. It sees how many native Hadoop projects, like Pig, Hive, and Mahout, are becoming irrelevant next to Spark and other alternatives, and it must justify its multi-billion-dollar valuation.
But isn’t Kudu somewhat adding more fragmentation, and yet another app-specific data silo? Are there better ways to address the current Hadoop challenges? More on that later …
What is Kudu?
You can read a detailed analysis here. In a nutshell, Kudu is a distributed key/value store with column awareness (i.e. a row’s value can be broken into columns, and column data can be compressed), somewhat like the lower layers of Cassandra or Amazon Redshift. Kudu also incorporates caching and can leverage memory or NVRAM. Kudu is implemented in C++ to allow better performance and memory management (vs. HDFS, which is written in Java).
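To illustrate the column awareness mentioned above, here is a toy Python sketch (not Kudu code; the data and layout names are hypothetical) showing why storing each column contiguously, rather than interleaving columns row by row, helps compression and lets a scan read only the columns it needs:

```python
import zlib

# Toy dataset: 1000 rows with three columns (id, country, device).
rows = [(i, "US", "mobile") for i in range(1000)]

# Row-oriented layout: column values interleaved, one row after another.
row_layout = "|".join(",".join(map(str, r)) for r in rows).encode()

# Column-oriented layout: each column stored contiguously, so a query
# touching only "country" can skip the other columns entirely.
col_layout = b"|".join(
    ",".join(str(r[c]) for r in rows).encode() for c in range(3)
)

# Long runs of similar values ("US,US,US,...") compress far better
# than the interleaved mix of ids, countries, and devices.
print(len(zlib.compress(row_layout)), len(zlib.compress(col_layout)))
```

This is of course a simplification; columnar engines use per-column encodings (dictionary, run-length, etc.) rather than generic zlib, but the locality argument is the same.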
It provides a faster alternative to HDFS with Parquet files, plus it allows updating records in place instead of rewriting the entire dataset (updates are not supported in HDFS or Parquet). Note that Hortonworks recently added update capabilities to Hive over HDFS, but in an extremely inefficient way (due to HDFS limitations, every update adds a new file which is linked to all the previous version files).
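The difference between the two update strategies can be sketched in a few lines of toy Python (hypothetical classes, not real Kudu or Hive code): with immutable files, every update spawns a new delta file that readers must merge, while an updatable store simply overwrites the record.

```python
class InPlaceStore:
    """Kudu-style sketch: a record is updated where it lives."""
    def __init__(self):
        self.records = {}

    def upsert(self, key, value):
        self.records[key] = value          # one write, no extra files

    def read(self, key):
        return self.records[key]


class DeltaFileStore:
    """Hive-on-HDFS-style sketch: files are immutable, so every update
    appends a new delta "file" that readers must walk and merge."""
    def __init__(self):
        self.files = []                    # each dict stands in for a file

    def upsert(self, key, value):
        self.files.append({key: value})    # a new file per update

    def read(self, key):
        for f in reversed(self.files):     # newest delta wins
            if key in f:
                return f[key]
        raise KeyError(key)


delta = DeltaFileStore()
for i in range(3):
    delta.upsert("user1", f"v{i}")
print(len(delta.files))      # three files pile up for one logical record
print(delta.read("user1"))   # v2
```

In the real systems the deltas are eventually compacted, but until then every read pays the merge cost, which is exactly the inefficiency described above.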
What is it NOT?
Kudu is a point solution. It is not a full replacement for HDFS, since it doesn’t support file or object semantics and is slower for pure sequential reads. It is not a full NoSQL/NewSQL tool either; solutions like Impala or Spark need to be layered on top to provide SQL. Kudu is faster than HDFS but still measures transaction latency in milliseconds, i.e. it can’t substitute for an in-memory DB, and it has higher insert (ingestion) overhead, so it is no better than HBase for write-intensive loads.
Kudu is not “unstructured” or “semi-structured”: you must explicitly define all your columns, as in a traditional RDBMS, somewhat against the NoSQL trend.
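A toy sketch (hypothetical code, not the actual Kudu client API) of what that schema-on-write discipline means in practice: every column must be declared up front, and a row with an undeclared column is rejected, unlike in a schemaless document store.

```python
# Schema must be fully declared before any write (Kudu-style).
SCHEMA = {"id": int, "country": str}

def insert(table, row):
    """Reject rows that don't match the declared columns and types."""
    unknown = set(row) - set(SCHEMA)
    missing = set(SCHEMA) - set(row)
    if unknown or missing:
        raise ValueError(f"schema mismatch: unknown={unknown}, missing={missing}")
    for col, val in row.items():
        if not isinstance(val, SCHEMA[col]):
            raise TypeError(f"column {col!r} expects {SCHEMA[col].__name__}")
    table.append(row)

table = []
insert(table, {"id": 1, "country": "US"})         # fits the schema
try:
    insert(table, {"id": 2, "device": "mobile"})  # schemaless-style row
except ValueError as e:
    print("rejected:", e)
```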
Kudu is a rather new project at Alpha or Beta stability, with very limited or no functionality when it comes to data security, compliance, backups, tiering, etc. It will take quite a bit of time until it matures to enterprise level, and it’s not likely to be adopted by the other Hadoop distributions (MapR or Hortonworks), who work on their own flavors.
What are the key challenges with this approach?
Kudu’s APIs do not map to any existing abstraction layer, which means Hadoop projects need to be modified quite a bit to make use of it; Cloudera will only support Spark and Impala initially.
It is not a direct HDFS or HBase replacement, as outlined on Cloudera’s web site:
“As Hadoop storage, and the platform as a whole, continues to evolve, we will see HDFS, HBase, and Kudu all shine for their respective use cases.
- HDFS is the filesystem of choice in Hadoop, with the speed and economics ideal for building an active archive.
- For online data serving applications, such as ad bidding platforms, HBase will continue to be ideal with its fast ability to handle updating data.
- Kudu will handle the use cases that require a simultaneous combination of sequential and random reads and writes – such as for real-time fraud detection, online reporting of market data, or location-based targeting of loyalty offers”
This means the community has created yet another point solution for storing data. Now, in addition to allocating physical nodes with disks to HDFS, HBase, Kafka, MongoDB, etc., we need to add physical nodes with disks for Kudu, and each of those data management layers has its own tools to deploy, provision, monitor, secure, etc. Users who are already confused will need to decide which to use for which use case, and will spend most of their time integrating and debugging open-source projects rather than analyzing data.
What is the right solution IMHO?
The challenge with Hadoop projects is that there is no central governance, layering, or architecture like in the Linux kernel, OpenStack, or any other successful open-source project I have participated in. Every few weeks we hear about a new project which in most cases overlaps with others, and in many cases different vendors support different packages (the ones they contributed). How do you do security in Hadoop? Well, that depends on whether you ask Hortonworks or Cloudera. Which tool has the best SQL? Again, it depends on the vendor; even the file formats are different.
I think it’s OK to have multiple solutions for the same problem: Linux has an endless number of file systems, as we know, and OpenStack has several overlay-network and storage-provider implementations, BUT they all adhere to the same interfaces and have the same expected semantics, behavior, and management. Where we see much overlap (as with HDFS, Kudu, and HBase) we create intermediate layers so components can be shared.
If there is a consensus that the persistent storage layers in Hadoop (HDFS, HBase) are limited or outdated, and that we need new abstractions to support record or column semantics and improve random-access performance, or if we all understand that security is a mess, the best way is to first define and agree on the new layers and abstractions, and gradually modify the different projects to match them. If the layers are well defined, different vendors can provide different implementations of the same component while enjoying the rest of the ecosystem. Existing commercial products or already-established open-source projects can be used with some adaptation and immediately deliver enterprise resiliency and usability. We can grow the ecosystem beyond the current three vendors, with better solutions that complement each other rather than compete on the entire stack, and we can add more analytics tools (like Spark) which may want to access the data layer directly without being tied to the entire Hadoop package.
If we seek better enterprise adoption of Hadoop, I believe the right way for Cloudera or the other vendors is to provide an open interface for partners to build solutions around, much like Linux, OpenStack, MongoDB, Redis, or MySQL did alongside their own reference implementations. A better way for them to build value may be to improve overall usability and focus on pre-integrating vertical analytics applications or recipes on top of the infrastructure.
Adding another point solution like Kudu, with its own singular data model, and working for a few years to make it enterprise grade, just makes Hadoop deployments more complex and makes the lives of data analysts even more miserable. Well … on second thought, if most of your business is in professional services, keeping things complicated might be a good idea :)