Over on Storage Bits I’ve ignited quite a bit of controversy with the post “Why RAID 5 stops working in 2009.”

My point in that post is that as SATA disk drive capacity continues to increase, and the unrecoverable read error (URE) rate remains constant, the time will come – 2009? – when the rebuild after any RAID 5 disk failure will be likely to hit a URE.

The arithmetic goes like this. Take a 7-drive RAID 5 stripe. Each drive is 2 TB (in a couple of years). One drive fails, leaving 12 TB on the six surviving drives to read to recreate the lost data. With a SATA URE rate of one error per 10^14 bits read, which is about 12.5 TB, just a little more than the 12 TB you have to read, you are likely to encounter a URE. At that point an honest RAID controller will inform you that it can’t complete the rebuild.
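To put a rough number on that, here is a back-of-the-envelope sketch in Python. It assumes every bit read fails independently at the spec’d rate, which is a simplification (real errors tend to cluster, and spec sheets quote a worst case), and it uses decimal terabytes. The function name and parameters are mine, for illustration only.

```python
import math

def rebuild_ure_probability(surviving_drives, drive_tb, bits_per_ure):
    """Chance of at least one URE while reading every surviving drive in full.

    Assumption: each bit read fails independently with probability
    1 / bits_per_ure. Capacities are decimal terabytes (10^12 bytes).
    """
    bits_to_read = surviving_drives * drive_tb * 1e12 * 8
    per_bit_error = 1.0 / bits_per_ure
    # 1 - (1 - p)^n, computed with log1p/expm1 so the tiny p is not lost to rounding.
    return -math.expm1(bits_to_read * math.log1p(-per_bit_error))

# The 7-drive stripe above: 6 surviving drives of 2 TB each, one URE per 10^14 bits.
p = rebuild_ure_probability(surviving_drives=6, drive_tb=2, bits_per_ure=1e14)
print(f"Chance of a URE during rebuild: {p:.1%}")
```

Under those assumptions the rebuild hits a URE a bit more often than not, so treat the result as an order-of-magnitude guide rather than a prediction for any particular drive.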

I *think* different controllers have different responses to this scenario, but I will bow to the more knowledgeable among my readers who might care to elucidate.

The real question is URE
SATA drives that I’ve looked at are spec’d at one URE per 10^14 bits read, while enterprise drives are spec’d at one per 10^15. My question is: why aren’t the drives spec’d at one per 10^16 or better?
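To see what each factor of ten buys, this Python sketch turns the question around: how much data can you read before the chance of a URE reaches a chosen ceiling? It again assumes independent per-bit errors and decimal terabytes. The 10% ceiling is an arbitrary illustration, not a number from any vendor, and the function name is hypothetical.

```python
import math

def tb_readable(bits_per_ure, max_ure_probability):
    """Decimal terabytes readable before P(at least one URE) reaches the ceiling.

    Assumption: each bit read fails independently with probability
    1 / bits_per_ure.
    """
    per_bit_error = 1.0 / bits_per_ure
    # Solve 1 - (1 - p)^n = ceiling for n, the number of bits read.
    bits = math.log1p(-max_ure_probability) / math.log1p(-per_bit_error)
    return bits / 8 / 1e12

# Compare the two specs quoted above with the hoped-for 10^16.
for exponent in (14, 15, 16):
    tb = tb_readable(10.0 ** exponent, max_ure_probability=0.10)
    print(f"1 URE per 10^{exponent} bits: about {tb:.1f} TB at a 10% URE risk")
```

Each step in the exponent stretches the safe read size by about ten times, which is why a 10^16 spec would push the RAID 5 rebuild problem out by a couple of drive generations.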

Essentially, drive reads are a statistical process, as the unfortunate hyping of PRML (partial response, maximum likelihood) a few years ago made all too clear. (It’s highly probable that the data we read is the data you wrote, and we have the statistics to prove it!)

If the drive vendors devoted more space to ECC it seems that they could build drives with much lower URE rates. That is what they already do with enterprise drives.

Obligatory conspiracy theory
Maybe the drive vendors don’t do so because they know that with the advent of RAID 6 they’ll be selling that many more drives. And the array vendors will be as well.

As I noted in the Storage Bits post, the net effect of drive failure + URE is to render RAID 6 the new RAID 5. That doesn’t address the problem of dual drive failures, which we already know are more common than standard theory expects. So you’ll be paying RAID 6 prices for what is, in effect, RAID 5 protection. W00t!

I don’t think there is any conspiracy. I feel for disk folks because they are in such a competitive, cut-throat industry with 6-12 month product cycles and brutal pricing. It is hard for them to do much more than react as fast as they can.

The StorageMojo take
I’ve noted before that disk folks seem to have a hard time with strategy, a thought that first occurred to me when Seagate bought Xiotech: “let’s get into a business we know nothing about AND compete with our best customers! It’s a twofer!” It would have been much smarter to buy EuroLogic or Xyratex and move up the value chain with something of value for existing customers.

Endlessly pushing capacity as the only metric guarantees an ever faster treadmill. Vendors should look at how they can subtly alter volume products, as WD has done with the 10k Raptors, to create new niches. Lots of people would like to have more reliable disk drives, so trading some capacity for lower URE rates to create RAID 5-friendly SATA drives could be lucrative.

I believe consumers are educable if the value can be simply and vividly articulated. Drive vendors need to take a fresh look at their marketing to break out of the high-volume, low-margin box they are trapped in now.

Comments welcome, as always.