The post-RAID era begins

by Robin Harris | Monday, July 23, 2012 | Architecture, Cloud computing & storage, Clusters, Future Tech | 18 comments

The post-RAID (noRAID) era has begun. While RAID arrays aren’t going away, the growth is elsewhere, and corporate investment follows growth.

Why now?
There are now architecturally superior alternatives to RAID that are lower cost. But you could argue that the post-RAID milestone was passed years ago.

The authors of the original 1988 RAID paper (Patterson, Gibson and Katz) all moved on a decade or more ago: Patterson to OceanStore; Gibson to Panasas, a scale-out object storage company he co-founded; and Katz to Hadoop, among other projects.
What are probably the fastest growing large storage infrastructures in the world – Google’s and Amazon’s – aren’t based on RAID.
Major storage vendors including NetApp, HP, EMC and Hitachi have all invested in – and are selling – noRAID systems.
But the biggest reason? The math behind erasure codes improved after the RAID paper was written.

The math?
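RAID uses simple erasure codes (XOR parity in RAID5, commonly Reed-Solomon for the second parity in RAID6) to create parity information that protects an array from 1 (RAID5) or 2 (RAID6) drive failures or uncorrectable read errors (UREs). But with SATA drives, RAID5 stopped being dependable about 3 years ago: drives became large enough that a rebuild is likely to hit a URE.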

Erasure coding’s key advantage is that you can break up your data into n fragments, add m additional fragments, store the fragments across n+m devices, and then recover the original data from any n of the devices. Thus in an 8-drive RAID5 stripe, the original data is divided into 7 fragments, an 8th fragment is calculated – the parity data – and then any one of the 8 drives can fail without losing (theoretically) any data.
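
To make the 7+1 case concrete, here is a minimal Python sketch, assuming plain XOR parity and a single in-memory stripe. The function names (xor_blocks, make_stripe, rebuild) are made up for illustration and are not taken from any RAID implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data, n=7):
    """Split data into n equal fragments (zero-padded) and append 1 parity fragment."""
    size = -(-len(data) // n)  # ceiling division
    padded = data.ljust(size * n, b"\0")
    fragments = [padded[i * size:(i + 1) * size] for i in range(n)]
    return fragments + [xor_blocks(fragments)]

def rebuild(stripe, lost):
    """Recompute the fragment at index `lost` from the other fragments."""
    survivors = [frag for i, frag in enumerate(stripe) if i != lost]
    return xor_blocks(survivors)

# 7 data fragments + 1 parity fragment = the 8-drive stripe described above.
stripe = make_stripe(b"The post-RAID era begins with better erasure codes.")

# Any single "drive" can be lost and recomputed from the other seven.
for lost in range(len(stripe)):
    assert rebuild(stripe, lost) == stripe[lost]
```

XOR parity only covers the m=1 case. Surviving 2 or more losses needs a stronger code, which this sketch does not attempt.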

RAID5’s problems are twofold. As disks get larger, rebuild times get dangerously long, which increases the chance that another disk will fail before the rebuild completes and reduces performance all the while. Larger disks also make it more likely that a URE will be found on another disk, killing the rebuild. Surviving 2 failures is the minimal reasonable protection today.
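
The URE half of that argument is simple arithmetic. The sketch below is a rough estimate, not a measurement: it assumes an unrecoverable read error rate of 1 in 10^14 bits (a figure commonly quoted on SATA drive spec sheets), independent errors, and a hypothetical stripe of 2 TB drives. The function name ure_probability is illustrative.

```python
import math

def ure_probability(drives_read, bytes_per_drive, ure_per_bit=1e-14):
    """Chance of at least one URE while reading every bit of the surviving drives.

    Assumes independent bit errors at a fixed rate, which is a simplification.
    """
    bits = drives_read * bytes_per_drive * 8
    # 1 - (1 - p)**bits, computed with log1p/expm1 to avoid rounding error
    return -math.expm1(bits * math.log1p(-ure_per_bit))

# Hypothetical 8-drive RAID5 stripe of 2 TB drives: a rebuild reads the 7 survivors.
print(f"{ure_probability(7, 2e12):.0%}")
```

Under those assumptions a rebuild hits a URE about two times in three. Real drives may do better than their spec sheet, so treat the number as a worst-case illustration.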

In the ’90s a new form of erasure coding was developed that enabled developers to create codes with an arbitrary level of redundancy – survive 4 failures? 10? Pick a number! – called fountain or rateless erasure codes. Startups including Digital Fountain, Cleversafe and Amplidata have sprung up to take advantage of these new codes.

A new StorageMojo video explores the advantages of rateless codes, using the Amplidata example. One key advantage: the redundancy needed to survive 4 failures is, they tell me, down to 50-60% of the data. Much better than the 3x replication that Amazon and Google use in their infrastructures, and competitive with RAID6.
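
A back-of-envelope comparison of storage overhead, as a sketch only. It treats each scheme as an n+m layout where any n of the n+m pieces recover the data, and the specific layouts (a 6+2 RAID6 group, an 8+4 erasure code) are hypothetical examples chosen to land in the ranges mentioned above, not any vendor's actual settings.

```python
def overhead(n, m):
    """Extra storage, as a fraction of the data, for an n+m layout."""
    return m / n

# (n, m) pairs: n data pieces, m redundant pieces, survives m losses.
layouts = {
    "3x replication": (1, 2),      # 1 copy of the data plus 2 extra copies
    "RAID6 (6+2)": (6, 2),         # hypothetical 8-drive RAID6 group
    "erasure code (8+4)": (8, 4),  # hypothetical rateless-code policy
}

for name, (n, m) in layouts.items():
    print(f"{name}: survives {m} failures, {overhead(n, m):.0%} overhead")
```

With these example numbers, the 8+4 code survives twice as many failures as 3x replication at a quarter of the overhead.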

The StorageMojo take
Redundant Arrays of Inexpensive Disks shook up a complacent industry almost 25 years ago. But time and technology move on.

Despite the huge investment the industry has in RAID controller code, we now have better solutions. Properly priced and marketed, these solutions will drive the next big round of storage growth.

Courteous comments welcome, of course. I’ve been doing work for Amplidata. For a quick intro to erasure coding for storage developers, check out Prof. Jim Plank’s Erasure Codes for Storage Applications (pdf) presentation.

18 Comments

anon on Tuesday, 24 July, 2012 at 8:52 pm

What really terrifies me is that you’re citing examples that are more than a decade old. There has been that little forward motion.
anonymous on Wednesday, 25 July, 2012 at 12:16 pm

Are any of these post-raid solutions available on a Small Office Home Office scale? If so, which?
Robin Harris on Wednesday, 25 July, 2012 at 1:49 pm

I had great hopes a couple of years back that someone would do this for the SOHO market with a few plug computers and low-end flash drives, but no joy. I am expecting something similar, at a higher scale and cost, late this year or early next. Stay tuned.
Chris on Wednesday, 25 July, 2012 at 6:44 pm

Hi Robin

Is there anything coming that works similarly but at a lower level (block, NFS etc)? REST or API seems rather specific relative to the rest of the market. Not really as universally useful as RAID.
matt on Thursday, 26 July, 2012 at 6:09 am

The times I have had to emit the mantra:
RAID is not backup
Howard Marks on Thursday, 26 July, 2012 at 11:18 am

Robin,

Much as I love what Cleversafe and Amplidata are doing with erasure coding, I’m still looking for what comes after RAID for more active data. Small random writes to a system using a 16-of-20 erasure code will require something like 40 back-end disk IOs for each 4K write.

I guess this could be addressed by a log-structured data layout with a large log/cache device of SSDs, but we haven’t seen such a solution yet.
Taylor on Thursday, 26 July, 2012 at 2:39 pm

The other thing is that the place where such advanced coding schemes become valuable — you’ve got so much data that the difference in overhead v. RAID6 is saving you lots of money, and the data is important enough / needs to be available enough that RAID5 doesn’t cut it — is exactly the place where you *don’t* want to try out new, largely un-tested algorithms (and more importantly implementations).

And yes, these algorithms will kill performance for random writes.
Robin Harris on Thursday, 26 July, 2012 at 6:55 pm

Howard, while not directly comparable, as they don’t use advanced erasure codes, the Nimble Storage box does handle small writes nicely in a way that might be applicable to Amplidata and other scale-out object stores.

The field is young and there is much room for invention. Let’s see how it shapes up.
Richie on Friday, 27 July, 2012 at 4:11 am

Sounds interesting, but is it appropriate for a desktop (or does it only make sense for data centres)?
Anton Kolomyeytsev on Sunday, 5 August, 2012 at 12:45 pm

Is Nimble using a log-structured file system inside their appliances? That could explain why they handle small writes perfectly. Ideas?

Anton
Jamon Bowen on Monday, 6 August, 2012 at 11:11 am

Robin,

I’ve spent some time thinking about erasure coding in scale-out settings recently. There is an additional advantage that took some time for me to wrap my head around. Suppose you are at a large enough scale, with cheap enough disks, that the # of disks >> n+m (if this isn’t the case, then the benefits vs. RAID 6 are limited to higher failure tolerance), that the erasure-coded fragments are distributed randomly across the disks, and that there is some unused capacity in the disk cluster (spread evenly across the disks so that a large number of disks are involved in rebuilds). Then you can just leave failed disks in place. This makes sense if the cost of swapping a disk exceeds the cost of the drive. Rather than swapping a disk after each failure, just wait until enough disks in a server have failed and swap them all at once – something you just couldn’t consider doing with single-server RAID.

Regarding write performance – just assume that you have to use copy-on-write block management with erasure coding and the random write performance issues are resolved.

Jamon
Taylor on Wednesday, 8 August, 2012 at 11:35 am

Jamon – with COW, the data block writes themselves can be serialized. However the metadata writes are still essentially random, at least if you have a non-trivial file tree size.

You can put the metadata on different storage, e.g. RAID10 or SSDs, but of course that complicates things further.
Paul Hewitt on Friday, 10 August, 2012 at 8:37 am

Howard, regarding your question about small random I/Os, please contact me. We have the solution.
Jon Kuroda on Monday, 13 August, 2012 at 5:11 pm

RAID itself may be dead – and I think you’re right that it has been for a while (really, who wants to run RAID with 3+ parity drives?) – but I think some ideas from RAID will continue on for a good long time – Expect failure in individual components. Trade capacity/speed for data redundancy and integrity to mitigate the impact of that failure.

As a point of clarification – OceanStore was always Kubiatowicz’s thing – Patterson had already moved on to things like Network of Workstations, the more general theme of reliability/manageability in/of large computing systems – Recovery-oriented Computing and Design/Deployment of large scale “Internet Systems” (RAD Lab) – and design and use of novel architectures, most recently multi/many-core architectures (Par Lab).

Katz’s own post-RAID research interests are more along the lines of large-scale systems design with a recent interest in energy management (both in the datacenter and in large buildings) as an information management problem.
Tom Leyden on Tuesday, 14 August, 2012 at 5:06 pm

@Jamon

We do indeed offer our customers the option of replacing whole nodes once a number of disks have failed, rather than replacing individual disks.

Happy to follow up!

KR

Tom
Darren McBride on Thursday, 16 August, 2012 at 2:13 pm

Jon and Robin – regarding “RAID has been dead for a while”, I’ve been thinking about this a lot based on Robin’s writings.

1) Almost every HP, Dell or IBM server is purchased to this day with mirrored boot drives and either RAID5 or RAID6 data drives. The RAID controller is integrated into even the most lowly server, so it is hardly dead. Do you think a server purchased in 2017 will have local storage, and if so, will the controller do RAID? My guess is they may have hardware RAID-TP (aka RAID 7.3) by then.

2) While many SANs are moving to spread node storage and sophisticated (or proprietary) versions of RAID 6, many still use standard RAID.

3) I’m not convinced the failure math is giving us real world results. See my post http://www.high-rely.com/hr_66/blog/why-raid-5-stops-working-in-2009-not/

4) Yes. I believe triple-parity RAID (RAID 7.3 or RAID-TP) and perhaps beyond will be desired for reliability and speed, because I don’t see how scale-out storage nodes that spread data over Ethernet will be speed competitive with multi-channel DAS running at 6Gbps (assuming both use the same media – we have to do apples to apples on the drives to be fair). I suppose the scale-out guys will say “because we can use 10Gbps Ethernet and it’ll be cheap real soon now”. Maybe. But even using iSCSI, each storage packet is wrapped in a TCP segment, which is wrapped in an IP packet, which is wrapped in an Ethernet frame.
Lucas B. Cohen on Thursday, 27 September, 2012 at 1:40 pm

Is it so fundamentally different to imagine a redundant array of disks with an erasure code, just because that erasure code is new and improved? Or am I missing something?
HikingMike on Wednesday, 8 May, 2013 at 3:06 pm

It is unsettling to me that all of these larger storage systems with their own error codes and tiered data and extra feature X are all using proprietary methods. Don’t get me wrong, these things can definitely be a benefit and the new tech progress is great, but heaven help someone if they have a problem and actually don’t have a good backup. Specialist data recovery firms or software tools are out of the question because of the proprietary nature of the technology, and it’s almost certain that the storage company would not lift a finger to help or build any tools themselves. You can just look to the heat that Drobo is getting for this (yes, a more consumer-oriented company).