The post-RAID (noRAID) era has begun. While RAID arrays aren’t going away, the growth is elsewhere, and corporate investment follows growth.

Why now?
There are now architecturally superior alternatives to RAID that are lower cost. But you could argue that the post-RAID milestone was passed years ago.

  • The authors of the original 1988 RAID paper (Patterson, Gibson and Katz) all moved on a decade or more ago: Patterson to OceanStore; Gibson to Panasas, a scale-out object storage company he co-founded; and Katz to Hadoop, among other projects.
  • Probably the fastest-growing large storage infrastructures in the world, Google’s and Amazon’s, aren’t based on RAID.
  • Major storage vendors including NetApp, HP, EMC and Hitachi have all invested in – and are selling – noRAID systems.
  • But the biggest reason? The math behind erasure codes improved after the RAID paper was written.

The math?
RAID uses a simple form of erasure coding, XOR parity in RAID5 and typically a Reed-Solomon code in RAID6, to create parity information that protects an array from 1 (RAID5) or 2 (RAID6) drive failures or unrecoverable read errors (UREs). But with SATA drives, RAID5 stopped working 3 years ago.

Erasure coding’s key advantage is that you can break up your data into n fragments, add m additional fragments, store the fragments across n+m devices, and then recover the original data from any n of the devices. Thus in an 8-drive RAID5 stripe, the original data is divided into 7 fragments, an 8th fragment (the parity data) is calculated, and then any one of the 8 drives can fail without losing (theoretically) any data.
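For the simplest case, n = 7 and m = 1, the idea can be sketched in a few lines of Python. This is only an illustration of single XOR parity, not any vendor's controller code; the helper names (`xor_blocks`, `make_stripe`, `rebuild`) are made up for the example, and real arrays work on fixed-size blocks and rotate parity across drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data, n):
    """Split data into n equal fragments (zero-padded) and append one parity fragment."""
    frag_len = -(-len(data) // n)               # ceiling division
    padded = data.ljust(frag_len * n, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(n)]
    return frags + [xor_blocks(frags)]          # n data fragments + 1 parity

def rebuild(stripe, lost):
    """Recompute the fragment at index `lost` by XORing the survivors."""
    survivors = [frag for i, frag in enumerate(stripe) if i != lost]
    return xor_blocks(survivors)

data = b"The post-RAID era has begun."
stripe = make_stripe(data, n=7)                 # 7 data + 1 parity = 8 "drives"
assert rebuild(stripe, lost=3) == stripe[3]     # any single loss is recoverable
```

Lose two fragments, though, and XOR alone cannot bring them back. That is where codes with m > 1 come in.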

RAID5 has two problems as disks get larger. First, rebuild times get dangerously long, which increases the chance that another disk will fail before the rebuild completes, and performance suffers all the while. Second, the chance grows that a URE will be found on another disk, killing the rebuild. Surviving 2 failures is the minimum reasonable protection today.
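A back-of-the-envelope sketch shows why the URE problem bites. The numbers here are assumptions, not figures from this post: a spec-sheet URE rate of 1 per 10^14 bits read (a common rating for consumer SATA drives), 2 TB drives, and independent bit errors. The function name is hypothetical.

```python
import math

def p_ure_during_rebuild(surviving_drives, drive_tb, ure_per_bit=1e-14):
    """Chance of at least one URE while reading every surviving drive in full.

    Assumes independent bit errors at the spec-sheet rate (an assumption;
    real drives may behave better or worse).
    """
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))

# 8-drive RAID5 stripe: one drive failed, 7 survivors must be read end to end
p = p_ure_during_rebuild(surviving_drives=7, drive_tb=2)
print(f"Chance the rebuild hits a URE: {p:.0%}")
```

Under those assumptions the rebuild is more likely to hit a URE than not. Change `ure_per_bit` to 1e-15, a rating often quoted for enterprise drives, and the odds improve by roughly a factor of six.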

In the ’90s a new form of erasure coding, called fountain or rateless erasure codes, was developed that lets developers create codes with an arbitrary level of redundancy. Survive 4 failures? 10? Pick a number! Startups including Digital Fountain, Cleversafe and Amplidata have sprung up to take advantage of these new codes.

A new StorageMojo video explores the advantages of rateless codes, using the Amplidata example. One key advantage: the redundancy needed to survive 4 failures is, they tell me, down to 50-60% of the data. Much better than the 3x replication that Amazon and Google use in their infrastructures, and competitive with RAID6.
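To put those numbers side by side, here is a rough comparison of redundancy, expressed as extra capacity relative to user data, for n data fragments plus m redundant ones. The stripe widths are picked for illustration only: the 8+4 code is hypothetical and is not Amplidata's actual configuration, and the RAID widths are just common examples.

```python
def redundancy(n, m):
    """Extra capacity as a fraction of user data for n data + m redundant fragments."""
    return m / n

# (name, n data fragments, m redundant fragments); m is also the failures survived
schemes = [
    ("3x replication", 1, 2),
    ("RAID5, 7+1", 7, 1),
    ("RAID6, 6+2", 6, 2),
    ("erasure code, 8+4 (hypothetical)", 8, 4),
]

for name, n, m in schemes:
    print(f"{name:34s} survives {m} failures, redundancy {redundancy(n, m):.0%}")
```

The point of the comparison: replication buys 2 failures for 200% extra capacity, while a wider code buys 4 failures for a fraction of that.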

The StorageMojo take
Redundant Arrays of Inexpensive Disks shook up a complacent industry almost 25 years ago. But time and technology move on.

Despite the huge investment the industry has in RAID controller code, we now have better solutions. Properly priced and marketed, these solutions will drive the next big round of storage growth.

Courteous comments welcome, of course. I’ve been doing work for Amplidata. For a quick intro to erasure coding for storage developers, check out Prof. Jim Plank’s Erasure Codes for Storage Applications (pdf) presentation.