Late last year Sun engineer, DTrace co-inventor, flash architect and ZFS developer Adam Leventhal analyzed RAID 6 as a viable data protection strategy. He lays it out in the Association for Computing Machinery’s Queue magazine, in the article Triple-Parity RAID and Beyond, which I draw from for much of this post.

The good news: Mr. Leventhal found that until 2019, RAID 6 will protect data as well as RAID 5 used to.

The bad news: Mr. Leventhal focussed on enterprise drives, whose unrecoverable read error (URE) spec has improved faster than that of the more common SATA drives. SATA RAID 6 will stop being reliable sooner unless drive vendors get their game on. More good news: one of them already has.

The crux of the problem
SATA drives are commonly specified with an unrecoverable read error (URE) rate of 1 in 10^14 bits read. That means that about once every 24 billion 512-byte sectors, the disk will not be able to read a sector.

24 billion sectors is about 12 terabytes. When a drive fails in a 7 drive, 2 TB SATA disk RAID 5, you’ll have 6 remaining 2 TB drives. As the RAID controller reconstructs the data from those 12 TB, it is very likely to see a URE. At that point the RAID reconstruction stops.

Here’s the math:
(1 – 1 /(2.4 x 10^10)) ^ (2.3 x 10^10) = 0.3835

You have a 62% chance of data loss due to an uncorrectable read error on a 7 drive (2 TB each) RAID 5 with one failed disk, assuming a 10^14 read error rate and ~23 billion sectors in 12 TB. Feeling lucky?
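
For readers who want to rerun the arithmetic, here is a minimal Python sketch of the same calculation. It assumes 512-byte sectors and that read errors are independent and evenly spread at the spec'd rate, which are simplifications of mine, and the function name is my own invention.

```python
import math

SECTOR_BYTES = 512  # assumption: classic 512-byte sectors


def p_clean_read(bytes_read, ure_bits=1e14, sector_bytes=SECTOR_BYTES):
    """Probability that every sector in bytes_read is read without a URE.

    Assumes one unreadable sector per ure_bits bits read, with each
    sector failing independently of the others.
    """
    sectors_read = bytes_read / sector_bytes
    sectors_per_error = ure_bits / (8 * sector_bytes)
    # (1 - 1/N)^n, computed via log1p to keep precision for tiny 1/N
    return math.exp(sectors_read * math.log1p(-1.0 / sectors_per_error))


# Six surviving 2 TB drives = 12 TB that must be read to rebuild.
p_ok = p_clean_read(12e12)
print(f"clean rebuild: {p_ok:.3f}, data loss: {1 - p_ok:.3f}")
```

Without the rounding to 2.3 and 2.4, the result still lands within a fraction of a percent of the 0.3835 above.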

When 4 TB drives ship later this year, only 3 drives will equal 12 TB. If drive vendors don’t up the spec, this will be a mess.

RAID 6 creates enough parity data to handle 2 failures. You can lose a disk and have a URE and still reconstruct your data.

NetApp noted several years ago that you can have dual parity without increasing the percentage of disk devoted to parity. Doubling the size of a RAID 5 stripe gives you dual disk protection with the same share of capacity given over to parity.

Instead of a 7 drive RAID 5 stripe with 1 parity disk, build a 14 drive stripe with 2 parity disks: the same fraction of capacity goes to parity, and you are protected against 2 failures. Of course, every rebuild will require twice as many I/Os since each disk in the stripe must be read. Larger stripes aren’t cost free.
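
A quick check of that trade-off in Python. This is only a sketch: the helper names are mine, and counting "every surviving drive" as the rebuild read set is a simplification.

```python
from fractions import Fraction


def parity_fraction(total_drives, parity_drives):
    """Share of raw capacity spent on parity in one stripe."""
    return Fraction(parity_drives, total_drives)


def surviving_drives(total_drives, failed=1):
    """Drives left to read during a rebuild (simplified: all survivors)."""
    return total_drives - failed


raid5 = parity_fraction(7, 1)    # 7 drive stripe, single parity
raid6 = parity_fraction(14, 2)   # 14 drive stripe, dual parity

print(f"RAID 5 parity share: {raid5}, RAID 6 parity share: {raid6}")
print(f"drives read per rebuild: {surviving_drives(7)} vs {surviving_drives(14)}")
```

Both layouts spend 1/7 of their capacity on parity, but the wider stripe reads 13 drives per rebuild instead of 6, which is the "twice as many I/Os" cost.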

Grit in the gears
The chance that a single sector rebuild will encounter 2 read errors is tiny, so what is the problem?

Mr. Leventhal says a confluence of factors is leading to a time when even dual parity will not suffice to protect enterprise data.

These include:

  • Long rebuild times. As disk capacity grows, so do rebuild times. 7200 RPM full drive writes average about 115 MB/sec – they slow down as they fill up – which means about 2.5 hours per TB minimum to rebuild a failed drive. Most arrays can’t afford the overhead of a top speed rebuild, so rebuild times are usually 2-5x that.
  • More latent errors. Enterprise arrays employ background disk-scrubbing to find and correct disk errors before they bite. But as disk capacities increase scrubbing takes longer. In a large array a disk might go for months between scrubs, meaning more errors on rebuild.
  • Disk failure correlation. RAID proponents assumed that disk failures are independent events, but long experience has shown this is not the case: 1 drive failure means another is much more likely.

On the last point: in a corridor conversation at FAST ’10 I was told that a large HPC installation found that, with drives from the same manufacturing lot, 1 drive failure made a 2nd 10x more likely – while a 2nd made a 3rd 100x more likely. It is not clear how manufacturing or environmental issues – or interaction between the 2 – led to the result. YMMV.

Simplifying: bigger drives = longer rebuilds + more latent errors -> greater chance of RAID 6 failure.
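
To put numbers on the rebuild-time bullet above, here is a small Python sketch. It takes the 115 MB/sec average and the 2-5x slowdown at face value and uses decimal terabytes; the function and parameter names are my own.

```python
def rebuild_hours(capacity_tb, mb_per_sec=115.0, slowdown=1.0):
    """Hours to rewrite one full drive at a sustained average rate.

    capacity_tb uses decimal units (1 TB = 1e12 bytes); slowdown models
    an array that cannot spare the bandwidth for a top-speed rebuild.
    """
    seconds = capacity_tb * 1e12 / (mb_per_sec * 1e6)
    return slowdown * seconds / 3600.0


for tb in (1, 2, 4):
    best = rebuild_hours(tb)
    low = rebuild_hours(tb, slowdown=2)
    high = rebuild_hours(tb, slowdown=5)
    print(f"{tb} TB: {best:.1f} h flat out, {low:.1f} to {high:.1f} h at 2-5x")
```

The best case for 1 TB comes out a little under two and a half hours, and everything scales linearly with capacity from there.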

Mr. Leventhal graphs the outcome:

Courtesy ACM Queue

By 2019 RAID 6 will be no more reliable than RAID 5 is today. Mr. Leventhal’s solution: triple-parity protection.

The StorageMojo take
For enterprise users this conclusion is a Big Deal. While triple parity will solve the protection problem, there are significant trade-offs.

21 drive stripes? Week long rebuilds that mean arrays are always operating in a degraded rebuild mode? Wholesale move to 2.5″ drives to reduce drive and stripe capacities? Functional obsolescence of billions of dollars worth of current arrays?

What is scarier is that Mr. Leventhal assumes disk drive error rates of 1 in 10^16. That is true of the small, fast and costly enterprise drives, but most SATA drives are 2 orders of magnitude worse: 1 in 10^14.

With one exception: Western Digital’s Caviar Green, model WD20EADS, is spec’d at 1 in 10^15, unlike Seagate’s 2 TB ST32000542AS or Hitachi’s Deskstar 7K2000 (pdf).
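
How much do those orders of magnitude matter? A rough Python comparison of the three specs against the same 12 TB rebuild read used earlier. It assumes independent bit errors at exactly the spec'd rate, which real drives will not honor precisely, and the function name is mine.

```python
import math


def p_ure_during_read(bytes_read, ure_bits):
    """Chance of at least one URE while reading bytes_read bytes.

    Assumes independent errors at one per ure_bits bits read.
    """
    bits_read = bytes_read * 8
    # 1 - (1 - 1/ure_bits)^bits_read, kept accurate with log1p/expm1
    return -math.expm1(bits_read * math.log1p(-1.0 / ure_bits))


REBUILD_BYTES = 12e12  # six surviving 2 TB drives

for exponent in (14, 15, 16):
    p = p_ure_during_read(REBUILD_BYTES, 10.0 ** exponent)
    print(f"1 in 10^{exponent}: {p:.1%} chance of a URE during rebuild")
```

Under these assumptions the risk drops from roughly 62% at 10^14 to about 9% at 10^15 and about 1% at 10^16.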

Before entering full panic mode, though, it would be good to see more detailed modeling of RAID 6 data loss probabilities. Perhaps a reader would like to take a whack at it.

Comments welcome, of course. I worked at Sun years ago and admire what they’ve been doing with ZFS, flash, DTrace and the great marketing job the ZFS team did without any “help” from Sun marketing. An earlier version of this post appeared on Storage Bits. Looking for a scientific calculator program? PCalc – Mac & Windows – is the best I’ve found.