How to Build the Lowest Cost Storage For BigData and Cloud!

With the emergence of Cloud and BigData, storage costs and expandability have become major concerns. Today most commercial and software-defined storage (SDS) systems are rather inefficient: the cost of a Terabyte of capacity storage can be 7-50 times the cost of the disk media. In this post we will review the cost breakdown and how it can be greatly reduced, with a focus on capacity storage (SATA HDD). The efficiency of SSD-based performance systems will be covered in a future post.

First, we need to understand the price breakdown in storage systems:

Traditional SDS systems trade the storage vendor's margins for higher redundancy costs (doing 2-3X replication vs. Raid 5/6) and add software license costs. Given that storage vendors' margins are usually quite high (systems are sold to the channel at about 3X the cost), users can still extract some cost savings from this exercise and reduce the Terabyte cost from $500-2,000 down to $300, when a Terabyte of SATA disk costs about $40. This cost saving doesn't come for free. It usually means higher operational costs and slower time to market. It also means lower performance, since commercial systems usually incorporate NV-RAM caches for the write log, hardware Raid, tiering, and more optimized and bug-free code paths. SDS systems may also have reduced data availability, since a storage node failure makes all the disks in that node inaccessible. With a 24 x 6TB disk system, a single CPU failure would block access to 144TB of storage, which would then need to be rebuilt on other nodes. If a single disk fails on a remaining replica node during that time, the entire file system can be corrupted, as data is usually striped across disks. This is why many server-based storage systems like Ceph and Hadoop HDFS use 3 replicas rather than just 2, to avoid critical failures during node rebuild times.
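To make the replication-vs-Raid overhead concrete, here is a minimal sketch of my own (not from the original post) comparing the raw capacity needed per usable Terabyte; treating Raid 6 as an 8+2 layout is an illustrative assumption.

```python
# Raw capacity needed per usable TB: N-way replication vs. a data+parity layout.
# The 8+2 Raid 6 layout below is an assumption for illustration only.

def replication_overhead(replicas: int) -> float:
    """Raw TB consumed per usable TB with N-way replication."""
    return float(replicas)

def parity_overhead(data_disks: int, parity_disks: int) -> float:
    """Raw TB consumed per usable TB with a data+parity (Raid-style) layout."""
    return (data_disks + parity_disks) / data_disks

print("Raid 6 (8+2):      %.2fx raw per usable TB" % parity_overhead(8, 2))    # 1.25x
print("2-way replication: %.2fx raw per usable TB" % replication_overhead(2))  # 2.00x
print("3-way replication: %.2fx raw per usable TB" % replication_overhead(3))  # 3.00x
```

This is the capacity gap SDS users pay for in exchange for dropping the vendor margin.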

So what more can we do to reduce costs?

  • Improve the ratio of platform cost to disks
  • Reduce the replication overhead by implementing erasure coding
  • Beg for a higher discount

Since I can’t help with the 3rd item, let’s look at the equation again and review a few alternatives, first without factoring in the cost of redundancy:

Evaluating A Few Storage Server Configurations

A common practice is to use up to 24 disks per server – a storage server like the SuperMicro 6047R-E1R24L can be bought fully configured with a decent dual-socket Xeon, 64GB RAM, a 10GbE NIC, and SATA adapters for around $5,000-6,000. Adding 24 x 4TB disks at $160 each gets us to a ~$9,300 system, which translates to ~$100 per Terabyte. With 3 replicas that becomes $300.

One of the most cost-effective designs for cold storage is the BackBlaze 4.0 Storage Pod; the link is worth reading as it covers the BOM and design down to the screw level. If you want to buy the box pre-built, you can go to an ODM vendor like 45drives.com, in which case a box with redundant power supplies and a 10GbE NIC would cost you $7,548; with 45 x 4TB drives that is $82 per Terabyte. This comes with a bunch of caveats: it is not an enterprise-grade box that comes with support like SuperMicro, Dell, HP, etc., and the design uses a fairly low-end CPU and a highly blocking SATA fabric, so it fits cold storage, not so much BigData or Cloud applications.

My preferred design uses high-density 60-bay (or even denser) JBODs; this is also a trend with some of the big cloud providers. A 2U server connects to 2 such JBODs (or 4 for redundancy). Such a server costs about $5,000 with a 40GbE adapter and a few SATA cards, and each JBOD is about $4,000; the bundle of a server with 2 fully populated JBODs costs $32,200, which translates to $67 per Terabyte. It has a few other notable advantages: each JBOD can be connected to 2 servers simultaneously, allowing continuous availability in the case of a node failure and requiring only 2 replicas instead of 3 to achieve the same resiliency. It has great density (120 drives in 10U), disks/JBODs can be scaled independently without adding servers, and using a 40GbE adapter leads to greater network cable consolidation. With RDMA 40GbE or InfiniBand NICs we can offload the communication overhead from the CPU and reduce IO latency. With this option we need to ensure the SDS software stack is implemented efficiently and doesn’t consume too much CPU and memory; unfortunately some of those options, like Ceph and Hadoop HDFS, are quite resource hungry (they recommend a CPU core for every 1-2 disks).
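The raw $/TB math for all three configurations can be summarized in a rough sketch of my own, using the ballpark prices quoted above (~$160 per 4TB SATA drive plus the approximate platform/JBOD prices); the numbers are illustrative, not vendor quotes.

```python
# Rough raw $/TB calculation for the three configurations discussed above.

DRIVE_TB = 4
DRIVE_COST = 160  # approximate street price of a 4TB SATA drive

def raw_cost_per_tb(platform_cost: float, drive_count: int) -> float:
    """Total system cost divided by raw capacity in Terabytes."""
    total_cost = platform_cost + drive_count * DRIVE_COST
    return total_cost / (drive_count * DRIVE_TB)

configs = {
    "24-disk server":            raw_cost_per_tb(5_500, 24),               # ~$97/TB
    "BackBlaze Pod (45 drives)": raw_cost_per_tb(7_548, 45),               # ~$82/TB
    "Server + 2 x 60-bay JBODs": raw_cost_per_tb(5_000 + 2 * 4_000, 120),  # ~$67/TB
}

for name, cost in configs.items():
    print(f"{name}: ~${cost:.0f} per raw Terabyte")
```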

Assuming you are not downloading and installing a free software version, you also need to add the software/support license from the SDS vendor of your choice.

So now we have an attractive cost of $67-100 per Terabyte, which adds a 70-150% overhead on the disk cost ($40), but to make the data resilient we need to add redundancy. If we use replication, the first 2 options mandate 3 replicas, which translates to $246-300 per Terabyte; with the 3rd option only two replicas are needed, getting us to an amazingly low cost of $134 for a resilient Terabyte, half the cost of the other two options.
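Continuing the sketch above, the resilient $/TB is simply the raw $/TB multiplied by the replica count each design needs (3 for the first two options, 2 for the dual-ported JBOD option, per the assumptions in this post).

```python
# Resilient $/TB = raw $/TB x replica count needed for equivalent resiliency.

raw_cost_per_tb = {"24-disk server": 100, "BackBlaze Pod": 82, "High-density JBODs": 67}
replicas        = {"24-disk server": 3,   "BackBlaze Pod": 3,  "High-density JBODs": 2}

for name, raw in raw_cost_per_tb.items():
    print(f"{name}: ${raw * replicas[name]} per resilient Terabyte")  # $300, $246, $134
```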

Configuration Summary Table

                                    24 Disk Servers        BackBlaze POD          High-Density JBODs
Raw $/Terabyte                      $100                   $82                    $67
Density per server                  96TB in 4U             180TB in 4U            480TB in 10U
$/Terabyte with redundancy          $300                   $246                   $134
Redundant systems for 700TB         24 servers in 96U,     12 servers in 48U,     3 servers + 6 JBODs in 30U,
(1 PB with 6TB disks)               48 x 10GbE uplinks     24 x 10GbE uplinks     6 x 40GbE uplinks
                                    (3 replicas)           (3 replicas)           (2 replicas)
Rebuild disks in case of server     Yes                    Yes                    No
failure or maintenance
$/Terabyte with Erasure coding      $140                   $115                   $94

Reduce Replica Overhead with Erasure Codes

The emerging solution to reduce the redundancy overhead is to use Erasure Codes; this is the way most web infrastructure companies store their cold data to make it both resilient and cost effective. With erasure codes you can add redundancy while reducing the overhead: in a 10+4 coding scheme you can tolerate up to 4 failures and only pay a 40% data overhead, i.e. we can get down to $94 for a resilient Terabyte. It also means that when there is a failure we may even want to wait before correcting it and replace a few drives at a time, or we can decide to migrate data chunks to other boxes while the system is live, leading to operational efficiency.
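Here is a minimal sketch of the 10+4 math: 10 data chunks plus 4 coding chunks survive any 4 failures at a 1.4x raw-to-usable ratio (40% overhead). The raw $/TB figures are the ones derived earlier in the post.

```python
# 10+4 erasure coding: overhead factor and resulting resilient $/TB.

DATA_CHUNKS, CODING_CHUNKS = 10, 4
overhead = (DATA_CHUNKS + CODING_CHUNKS) / DATA_CHUNKS  # 1.4x raw per usable TB

for name, raw in {"24-disk server": 100, "BackBlaze Pod": 82, "High-density JBODs": 67}.items():
    print(f"{name}: ~${raw * overhead:.0f} per resilient Terabyte with 10+4 coding")
    # -> ~$140, ~$115, ~$94
```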

The downside of erasure codes is the high CPU and memory bandwidth overhead involved in calculating and distributing the data, which is why they are mainly used in lower-performing storage tiers and in object storage solutions focused on backup/archiving. In the future I hope CPU and/or network/storage adapter vendors will incorporate this functionality in hardware, which would allow its use in higher data tiers.

If we do the cost math again, now with a 10+4 erasure code, we get to a cost of $94 to $140 per Terabyte, which is quite cool. Note that the 3rd option, which decouples the JBODs from the servers, is still better even without the extra replica saving, because we can avoid a rebuild of the relevant data chunks in case the server node fails or is in maintenance. This however requires the SDS software to be multi-path aware, which is not yet the case with many of the object-based SDS solutions today.

Summary

To summarize, we have walked through the storage cost model, starting with today’s inefficient solutions that carry a 7-50X cost overhead over the raw disk cost. We also examined ways to reduce the cost while understanding the different tradeoffs, eventually getting to a solution with an overhead as low as 2.35X ($94/$40), or a faster one with an overhead of 3.35X ($134/$40). To this number we need to add the SDS software/support cost, assuming we don’t use freeware. Obviously cost is not the only metric, and higher cost solutions may have advantages in availability, supportability, management, performance, etc. All of this needs to be factored into your decision – I will cover some of these aspects in upcoming posts.

8 thoughts on “How to Build the Lowest Cost Storage For BigData and Cloud!”

    • Andreas, the JBOD has two uplinks for multipath and one downlink to each disk (a common practice with arrays, given that a JBOD is stateless and has a high MTBF). You can look up 4TB SATA drives on Amazon; they are about $160 each.
      60-bay JBODs are manufactured by ODMs like Quanta, Sanmina, and Jabil, and can be bought through their resellers for around this price with a bit of bargaining.
      Another even cheaper alternative is to use two SuperMicro 45-bay JBODs, each sold for less than $2,000 (e.g. http://www.superbiiz.com/detail.php?name=CA-87E1RD1), so you get 90 disk bays for the same $4,000.

  1. The $300 price for usable storage can be achieved with entry-level storage, like the V3700 from IBM or the 3100 VNX from EMC.
    The problem with home-made storage is the unexpected behavior and maintenance:
    How do you upgrade the disk FW?
    How do you replace a component like a power supply?
    How do you upgrade the storage code when a new one is needed?

    • Aries,
      Those are valid points; indeed, DIY has downsides around OpEx and maintenance. I have made the same points in this and in the previous post (on OpenStack storage options).
      Not sure how you get to $300/TB with VNX; probably a heavily discounted price, Raid5, SAS FE? The prices I referred to are single-unit prices, with a 40GbE FE, and using replication, which has a clear benefit in rebuild times and in IOPs over Raid5/6. In any case we are talking about a 3x or more price difference.
      Note that SDS solutions, whether open source like Ceph or commercial like EMC ScaleIO, do take care of various failures and maintenance operations like the ones you mentioned. Some would argue they are even more resilient, since data is replicated across multiple boxes in different racks or even zones, rather than kept in one highly available enclosure as in traditional systems.

      Yes, SDS stacks are still not as robust or full featured as Enterprise storage systems, but they will get there.

  2. Hi Yaron,

    Interesting article. Have you heard of Maxta? Maxta is an SDS company delivering the simplicity and the enterprise-class features that were highlighted as key issues in the article and in the comments. Maxta makes storage management extremely simple by enabling customers to manage VMs and not storage. Enterprise-class features include data integrity, time/performance/capacity-efficient snaps and clones, capacity optimization features, rolling upgrades of the Maxta software, etc.

    To take away the guesswork around hardware interoperability and deliver predictability, we have developed the MaxDeploy reference architecture around most of the server platforms. This addresses some of the concerns that were highlighted in the comments.

    I will be happy to have a call to answer any questions.

    • Thanks, I don’t have hands-on experience with Maxta; I stumbled onto it at VMworld, I believe. It seems like you have a nice set of features.
      One challenge to look at is how you can coexist with traditional IT, not just greenfield deployments; you may want to provide some gateways to external storage or access from network clients (e.g. via iSCSI or NAS).
      I think hyper-converged is good for the portion of the market in which you can control the ratio of compute to storage, or where the simplicity of an integrated solution is more important than potential growth or scalability. For large amounts of data and constant scaling, I think disaggregation of compute and storage may be better.

      Yaron

  3. Yaron, these are very nice breakdowns, but they address engineers. I doubt any of the cost gains you point to have the power to sway decision makers. Decision makers prefer not to trust their own engineers to build home-grown solutions, especially in areas where corporate solutions exist.

    What is missing from your cost analysis is the cost of recovering from a fatal failure: controller failure, disaster, etc.

    • Dan,

      The goal of the blog is not to promote a commercial product, but rather to provide insight, challenge current assumptions/perceptions, and promote an architecture.
      Yes, an Enterprise customer would like a shrink-wrapped solution which addresses all the aspects of failure recovery, but there is nothing in this architecture which limits the best solution for failure recovery. In practice, a hashed mirror or erasure code is known to provide faster rebuilds than Raid5, but that is up to the relevant software stack to implement (and some already do).

      At the same time, there are many service providers who build their own storage solutions loaded with open source block/file/object software, home-grown software, or commercial SDS software, and users who build Hadoop, Cassandra, or MongoDB clusters which provide replication and fault recovery as part of the application stack. They have lots of storage and don’t want to buy commercial products sold at 3-5x the price/GB when fault recovery is handled at a higher layer.

      Yaron
