Commenters on the last post – Open source storage array – helped crystallize an idea that’s been lurking for years: comparing disk storage hardware on per-slot price. The Backblaze box, which costs about $50/slot, got a comment that said, in effect, “it doesn’t have the features of a $200/slot box.” Good!
But the comment raised an interesting point: since we all use the same disks from the same few – and soon to be fewer – manufacturers, isn’t the cost of the tin we wrap them in a key metric? Let’s call it PSC – Per Slot Cost.
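For concreteness, here is the arithmetic in a few lines of Python; the system prices, slot counts and disk price are illustrative assumptions, not quotes:

```python
# Per Slot Cost (PSC): what the vendor charges per disk slot, i.e. the
# price of the "tin" plus whatever value-add rides on it.
# All figures below are illustrative assumptions, not real quotes.

DISK_PRICE = 120  # assumed street price of a commodity SATA disk, in dollars

systems = {
    # name: (system price without disks in $, number of disk slots)
    "bulk pod": (2_250, 45),         # roughly the $50/slot class
    "mid-range array": (4_800, 24),  # roughly the $200/slot class
}

for name, (price, slots) in systems.items():
    psc = price / slots
    # PSC is the vendor's value-add per disk, since the disks themselves
    # cost the same no matter whose tin they sit in.
    print(f"{name}: ${psc:,.0f}/slot, "
          f"{psc / DISK_PRICE:.1f}x the cost of the disk it holds")
```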
Some advantages:
- Focus on value-add. We know how many disk slots there are in a storage system. We know how much disks cost. Therefore, the per-slot price tells us what the vendor’s value-add per disk is – or what we’re supposed to think it is.
- Increases pricing contrast. Disk costs are typically 10-15% of the price of a mid-to-high end array. The number of disk slots in those arrays varies, as do individual disk capacities. These variables obscure what the vendor is asking for their value-add.
- Cleaner comparisons. As a corollary to the previous point, PSC makes it easier to compare architecturally similar systems – SAS vs SAS, hybrid SSD/SATA systems, RAID 6 systems – whose hardware cost structures should be similar.
- Focus on software value. Since most storage systems – even high-end systems – run on commodity hardware, the biggest price variable is in software. Isn’t that where we should focus?
The cloud storage angle
PSC should be useful for market segmentation. Instead of dumping arrays into price buckets – such as $75-$100k entry-level systems – or ranking them on $/GB, PSC should track with the value of the stored data.
Expect to see segments range from Bulk (the Backblaze segment) to Heavy Transactional (traditional big iron) with yet-to-be-named segments between. But the most important use for PSC is in highly-scalable architectures in the public vs private cloud storage arena.
Cloud architectures are distinguished by the fact that the larger they scale, the lower their PSC. This is partly a function of economic necessity – who can afford 2 dozen PB of Symm? – and largely due to their use of software-based object replication instead of RAID.
When your storage is cheap, you can afford triple replication. And when you have massive numbers of boxes – and at least 2 data centers – you can have strong disaster tolerance. So large-scale cloud suppliers have motive and opportunity to reduce PSC.
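A back-of-the-envelope sketch shows the motive. The $50/slot figure is the bulk example above; the high-end figure is extrapolated from the disks-are-10-15%-of-price observation, and the disk price, capacity and RAID geometry are assumptions:

```python
# Rough cost per usable TB under two protection schemes.
# $50/slot is the post's bulk example. The high-end PSC is extrapolated:
# if a $120 disk is ~12% of its slot's all-in price, the slot costs
# ~$1,000, i.e. ~$880 of tin. Disk price, capacity and RAID geometry
# are assumptions for illustration.

DISK_PRICE = 120   # $ per commodity disk (assumed)
DISK_TB = 3.0      # TB per disk (assumed)

def cost_per_usable_tb(psc, raw_per_usable):
    """(tin + disk) cost of a slot, divided by the usable TB it contributes."""
    return (psc + DISK_PRICE) * raw_per_usable / DISK_TB

# Triple replication on cheap slots: 3 raw TB stored per usable TB.
cloud = cost_per_usable_tb(psc=50, raw_per_usable=3.0)

# RAID 6 (12 data + 2 parity) on high-end slots: 14/12 raw TB per usable TB.
big_iron = cost_per_usable_tb(psc=880, raw_per_usable=14 / 12)

print(f"triple replication at $50/slot:  ${cloud:,.0f} per usable TB")
print(f"RAID 6 at ~$880/slot big iron:   ${big_iron:,.0f} per usable TB")
```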
The private cloud space is where the calculus gets interesting. Many observers dismiss the private cloud concept because private clouds can't possibly compete with Amazon, Microsoft and Google on scale or cost, including PSC.
The StorageMojo take
There is a private cloud market because other issues, such as network latency, and the commercialization of high-scale software such as Hadoop, make it possible for any focused billion-dollar company to build a competitive cloud infrastructure. The hardware is already a commodity, and many of the improvements that Google first pushed, such as more efficient power supplies, are now widely available.
The bigger issue for competitive private clouds is the enterprise IT mindset that lacks the skills to specify and manage them. This is where PSC comes in: it allows CFOs to compare their costs to best-in-breed cloud providers in a simple way.
PSC is just a metric, not the metric. The big guys are optimizing things – like power distribution – that won’t move the needle for smaller players.
But if you use commodity hardware then you should focus on the software. And since every big player is already running on commodity hardware – a Good Thing, BTW – let’s focus on getting software that delivers business value. To the extent that PSC helps decision-makers do that, it will help the industry shift the focus from things like $/GB to a higher-level discussion.
Courteous comments welcome, of course. I just paid $250 per slot for an array with 1 controller, 1 fan and 1 Thunderbolt connection to my 1 desktop. Yes, I could have done better – if I didn’t want Thunderbolt. So PSC doesn’t trump all.
The problem I see with the IT mindset is not with IT, but with users. As long as it's "local," users have many, usually high, expectations for speed, availability, and reliability that are hard to control and difficult for cloud solutions to meet. CFOs also do not understand the "swarm" approach and are likely to shoehorn a cloud solution into an enterprise mentality and, trust me, bad things will happen.
There is a whole other discussion point that the OS, software, and filesystems have not nearly kept up with technology. Most protocols still push bits around in the filesystem with no native validation of the data and no easy way to scale or "backup". Multiple copies of files, both live and "backup" (once again with little native integrity checking), is a huge waste of space and IO, but necessary to ensure some level of data safety. IO speed is especially problematic, as the storage per slot is growing much faster than the usable bandwidth per slot (how long would it take Amazon to scan its dataset?).
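To put rough numbers on that last point (drive capacities and sustained throughput below are assumed, ballpark figures):

```python
# How long does it take just to read a full drive sequentially?
# Capacities and sustained throughput are rough assumptions to show the
# trend: capacity per slot grows much faster than bandwidth per slot.

drives = [
    # (description, capacity in GB, sustained MB/s) - assumed figures
    ("circa 2001, 40 GB @ ~40 MB/s", 40, 40),
    ("circa 2006, 500 GB @ ~70 MB/s", 500, 70),
    ("circa 2011, 3 TB @ ~120 MB/s", 3000, 120),
]

for desc, gb, mbps in drives:
    hours = (gb * 1000) / mbps / 3600
    print(f"{desc}: ~{hours:.1f} hours for one full sequential pass")
```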
I don’t know the answers, but there are a growing number of questions out there that desperately need answers so that technology can move past these hurdles.
Yes, software is the key; it's something I forgot to add to my last comment on your other post. Google and Amazon (Google especially) write a ton of storage software to link their hardware together and handle replication and such. Neither organization has released anything that I've seen, other than Google's white papers on their broad architecture, and as far as I know they don't actively contribute code to open public projects based on it, because, of course, they understand the value, and keeping it close to their chest gives them a competitive advantage (more power to 'em, I guess).
Rackspace (and others?) are working on some similar technology, though I suspect it will be several years before anyone serious would trust it; it takes time for that kind of stuff to mature.
So yes, from a hardware perspective you absolutely can afford to double- or triple- (or more) protect things with such cheap hardware, but doing it right in the software gets really complicated really fast. There was an interesting post by an EqualLogic user a while back –
http://www.tuxyturvy.com/blog/index.php?/archives/61-Three-Years-of-Equallogic.html
where one of the comparisons made was the granularity in replication/snapshot technologies between EqualLogic (at the time, maybe it's better now) and the person's existing EMC gear, and the massive effect that had on bandwidth requirements between sites for the same amount of changed data at the source.
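A toy calculation shows how much granularity matters; the unit sizes and workload are invented for illustration and aren't EqualLogic's or EMC's actual numbers:

```python
# Replication traffic for the same logical change under two granularities.
# Unit sizes and the workload are invented for illustration only.

changed_writes = 10_000   # random 4 KB writes scattered across the volume
write_size_kb = 4

def replication_traffic_gb(granularity_kb):
    """Worst case: each small write dirties one whole replication unit."""
    return changed_writes * granularity_kb / 1024 / 1024

fine = replication_traffic_gb(64)           # fine-grained, 64 KB units
coarse = replication_traffic_gb(16 * 1024)  # coarse-grained, 16 MB units

actual = changed_writes * write_size_kb / 1024 / 1024
print(f"actual changed data:   {actual:.2f} GB")
print(f"fine-grained (64 KB):  {fine:.2f} GB shipped to the remote site")
print(f"coarse-grained (16 MB): {coarse:.0f} GB shipped to the remote site")
```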
I came across this last year when I was doing research for a Hadoop cluster –
http://storageconference.org/2010/Papers/MSST/Shvachko.pdf
"Replication of data three times is a robust guard against loss of data due to uncorrelated node failures. It is unlikely Yahoo! has ever lost a block in this way; for a large cluster, the probability of losing a block during one year is less than .005. The key understanding is that about 0.8 percent of nodes fail each month. (Even if the node is eventually recovered, no effort is taken to recover data it may have hosted.) So for the sample large cluster as described above, a node or two is lost each day. That same cluster will re-create the 54,000 block replicas hosted on a failed node in about two minutes. (Re-replication is fast because it is a parallel problem that scales with the size of the cluster.) The probability of several nodes failing within two minutes such that all replicas of some block are lost is indeed small."
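A quick sanity check of that reasoning, using the paper's 0.8%-per-month failure rate and two-minute re-replication window; the cluster size and the assumption of independent failures are mine, not the paper's:

```python
# Rough sanity check of the quoted HDFS argument, using the paper's
# 0.8%/month node failure rate and ~2 minute re-replication window.
# Cluster size and the independence of failures are simplifying
# assumptions, not figures from the paper.

node_fail_per_month = 0.008
window_minutes = 2
minutes_per_month = 30 * 24 * 60

# Probability a given node fails inside one 2-minute window.
p_window = node_fail_per_month * window_minutes / minutes_per_month

nodes = 3500               # assumed "large cluster"
blocks_per_node = 54_000   # block replicas per node, from the quote

# When a node dies, a block is lost only if the 2 nodes holding its
# other replicas both die before re-replication finishes.
p_block_lost_given_failure = p_window ** 2

node_failures_per_year = nodes * node_fail_per_month * 12
expected_blocks_lost = (node_failures_per_year
                        * blocks_per_node
                        * p_block_lost_given_failure)

print(f"p(node fails in a 2-minute window) ~ {p_window:.1e}")
print(f"expected blocks lost per year      ~ {expected_blocks_lost:.1e}")
```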
HDFS, while a nice concept, is by no means the solution to all problems; it was built for a single purpose and it serves that purpose fairly well.
Unfortunately, many less technical folks don't realize this until it's too late.
Mirroring has the other advantage that you can read the same source data from multiple places, increasing performance, something you can't really do with parity-based RAID.
Vertica (and I imagine some others too) takes this mirroring one step further. Their product is an analytics product, so they let you mirror the data multiple times but store each copy sorted a different way. Since all of the data is there, if one copy fails any other copy can be used to reconstruct it, but it also gives them the advantage of being able to read data in different forms much faster, since it is stored in that form natively on disk already. I thought that was really creative; at least I had not heard of it before they mentioned it. (I haven't spent any time in that space, so I'm sure I'm quite ignorant of what goes on in it.)
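A toy sketch of the idea, just the concept of keeping each mirror sorted on a different key, not Vertica's actual implementation:

```python
# Concept sketch: keep two full copies of the same rows, each sorted on a
# different key. Either copy can rebuild the other, and a query that
# filters on a given key reads the copy already sorted for it.
# Purely illustrative; not Vertica's actual on-disk format.
import bisect

rows = [
    {"order_id": 7, "customer": "acme", "amount": 120},
    {"order_id": 3, "customer": "zeta", "amount": 45},
    {"order_id": 9, "customer": "moab", "amount": 300},
    {"order_id": 1, "customer": "acme", "amount": 80},
]

# Two mirrors of identical data, sorted differently.
by_order = sorted(rows, key=lambda r: r["order_id"])
by_customer = sorted(rows, key=lambda r: r["customer"])

def lookup_order(order_id):
    """Point lookups on order_id use the copy sorted by order_id."""
    keys = [r["order_id"] for r in by_order]
    i = bisect.bisect_left(keys, order_id)
    return by_order[i] if i < len(keys) and keys[i] == order_id else None

def lookup_customer(name):
    """Lookups on customer use the copy sorted by customer."""
    keys = [r["customer"] for r in by_customer]
    lo, hi = bisect.bisect_left(keys, name), bisect.bisect_right(keys, name)
    return by_customer[lo:hi]

print(lookup_order(9))
print(lookup_customer("acme"))

# If one copy is lost, the surviving copy has every row, so the failed
# copy can simply be rebuilt by re-sorting the survivor.
rebuilt_by_order = sorted(by_customer, key=lambda r: r["order_id"])
assert rebuilt_by_order == by_order
```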
Vertica’s replication (just another data point) by contrast isn’t quite so sophisticated, if you have a 50 node cluster, with say 10TB of storage, and you want to replicate the *data* to a remote site, you need 50 nodes at that remote site. You can’t replicate to a 1-node cluster with 10TB of storage (say in the event your mainly concerned about protecting the data, not providing HA to the cluster). This may be fixed by now, they said they were going to work on it. Just another tidbit of data on how replication can get complicated.. A workaround is you can run 50 VMs on a single host and do it that way, not ideal, but at least you don’t need to run 50 pieces of hardware.
The other topic I touched on, but that really needs to be brought up again: since they are using consumer drives, you can expect that each BB "pod" has at least one sector that will be unrecoverable due to the BER. Linux-based RAID-6 will catch an unrecoverable sector, but will not catch double-bit read errors since it does not do a full-stripe read. That is before you count all the silent corruption that is bound to occur when you throw together a bunch of consumer-grade hardware without ECC. In the end, Backblaze does not publish its block or file failure rates, so it's hard to compare. They also claim that the failures they do have don't matter, since they are just a backup and would have to fail at the same time as your computer to lose data, which is not exactly reassuring.
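Rough numbers on the unrecoverable-sector point; the 10^-14 and 10^-15 figures are the usual consumer vs enterprise spec-sheet URE rates, and the pod capacity is an assumption:

```python
# Expected unrecoverable read errors (UREs) in one full read of a pod.
# 10^-14 and 10^-15 errors/bit are the usual spec-sheet figures for
# consumer vs enterprise drives; the 67 TB pod capacity is an assumption
# roughly matching a 45-slot pod of 1.5 TB consumer drives.

POD_TB = 67
POD_BITS = POD_TB * 1e12 * 8

for label, ure_per_bit in [("consumer, 1e-14", 1e-14),
                           ("enterprise, 1e-15", 1e-15)]:
    expected_ures = POD_BITS * ure_per_bit
    print(f"{label}: ~{expected_ures:.1f} unrecoverable read errors "
          f"expected per full read of the pod")
```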
“Enterprise” hardware has much higher tolerances for BER and interconnect checksumming, so they will have to deal with this problem as well, but further down the line.
Obviously end-to-end checksums are the way to go, but how many people with petabyte dreams think that far ahead? I see a lot of people claiming to use the BB pod for tier 2 storage without a checksumming filesystem and I cringe.
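For what it's worth, the idea is simple enough to sketch; this is the concept only, not ZFS or any particular filesystem's implementation:

```python
# Minimal end-to-end checksum idea: store a checksum with every block and
# verify it on every read, so silent corruption anywhere in the path
# (disk, controller, cable, RAM without ECC) is at least detected.
# Concept sketch only, not any real filesystem's format.
import hashlib
import os

class ChecksummedStore:
    def __init__(self):
        self.blocks = {}  # block_id -> (checksum, data)

    def write(self, block_id, data: bytes):
        self.blocks[block_id] = (hashlib.sha256(data).hexdigest(), data)

    def read(self, block_id) -> bytes:
        checksum, data = self.blocks[block_id]
        if hashlib.sha256(data).hexdigest() != checksum:
            # A real system would repair from a replica or from parity
            # here instead of just raising.
            raise IOError(f"silent corruption detected in block {block_id}")
        return data

store = ChecksummedStore()
store.write("blk0", os.urandom(4096))
store.read("blk0")  # verifies cleanly

# Simulate a flipped bit that parity RAID alone would never notice on read.
checksum, data = store.blocks["blk0"]
store.blocks["blk0"] = (checksum, bytes([data[0] ^ 1]) + data[1:])
try:
    store.read("blk0")
except IOError as e:
    print(e)
```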