Over on Storage Bits I’ve ignited quite a bit of controversy with the post Why RAID 5 stops working in 2009.
My point in that post is that as SATA disk drive capacity continues to increase, and the unrecoverable read error (URE) rate remains constant, the time will come – 2009? – when every RAID 5 disk failure will be likely to encounter a URE during rebuild.
The arithmetic goes like this. Take a 7 drive RAID 5 stripe. Each drive is 2 TB (in a couple of years). One drive fails, leaving 12 TB of capacity to read to recreate the lost data. With a SATA URE rate of 1 error in 10^14 bits read – which works out to about 12.5 TB – you are highly likely to encounter a URE during the rebuild. At that point an honest RAID controller will inform you that it can’t complete the rebuild.
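The odds can be made concrete with a back-of-the-envelope sketch, assuming independent bit errors at the quoted 1-in-10^14 rate (a simplification – real errors cluster):

```python
# Chance of hitting at least one URE while reading the surviving
# 12 TB of a degraded 7-drive RAID 5 stripe (2 TB drives, one dead).
# Assumes independent bit errors at the quoted SATA rate.

TB = 10**12  # decimal terabyte, in bytes

def p_ure(bytes_read, bit_error_rate=1e-14):
    """Probability of >= 1 unrecoverable read error over bytes_read."""
    bits = bytes_read * 8
    return 1 - (1 - bit_error_rate) ** bits

print(f"{p_ure(12 * TB):.0%}")  # roughly 62% -- more likely than not
```

With 1 TB drives the same stripe comes in around 38%; at 2 TB it crosses the coin-flip line, which is the whole point of the 2009 date.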
I *think* different controllers have different responses to this scenario, but I will bow to the more knowledgeable among my readers who might care to elucidate.
The real question is URE
SATA drives that I’ve looked at have a URE of 10^14 while enterprise drives are spec’d at 10^15. My question is: why aren’t the drives spec’d at 10^16 or more?
Essentially, drive reads are a statistical process, as the unfortunate hyping of PRML (partial response, maximum likelihood) a few years ago made all too clear. (It’s highly probable that the data we read is the data you wrote, and we have the statistics to prove it!)
If the drive vendors devoted more space to ECC it seems that they could build drives with much lower URE rates. That is what they already do with enterprise drives.
Obligatory conspiracy theory
Maybe the drive vendors don’t do so because they know that with the advent of RAID 6 they’ll be selling that many more drives. And the array vendors will be as well.
As I noted in the Storage Bits post, the net effect of drive failure + URE is to render RAID 6 the new RAID 5. That doesn’t address the problem of dual drive failures, which we already know are more common than standard theory expects. So you’ll be paying RAID 6 prices for what is, in effect, RAID 5 protection. W00t!
I don’t think there is any conspiracy. I feel for disk folks because they are in such a competitive, cut-throat industry with 6-12 month product cycles and brutal pricing. It is hard for them to do much more than react as fast as they can.
The StorageMojo take
I’ve noted before that disk folks seem to have a hard time with strategy, a thought that first occurred to me when Seagate bought Xiotech: “let’s get into a business we know nothing about AND compete with our best customers! It’s a twofer!” It would have been much smarter to buy EuroLogic or Xyratex and move up the value chain with something of value for existing customers.
Endlessly pushing capacity as the only metric only guarantees an ever faster treadmill. Vendors should look at how they can subtly alter volume products, as WD has done with the 10k Raptors, to create new niches. Lots of people would like to have more reliable disk drives, so reducing capacity in favor of lower URE rates to create RAID 5-friendly SATA drives could be lucrative.
I believe consumers are educable if the value can be simply and vividly articulated. Drive vendors need to take a fresh look at their marketing to break out of the high-volume, low-margin box they are trapped in now.
Comments welcome, as always.
It seems to me RAID 5 is obsolete anyway. RAID 10 seems to be the way to go these days with SATA drives, considering their cost. You get much better redundancy, in many cases better speed for block and random access, and it really doesn’t cost you much more. Even given current 500 GB disks, for most small/medium businesses it already makes sense in an 8 disk (or whatever) config to have 2TB available instead of 3.5TB to get the extra redundancy and performance of RAID 10. As disk costs drop, and sizes increase, this is just going to become more and more the case.
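The capacity figures in the comment above check out; a quick sketch, assuming an 8-disk array of 500 GB drives with RAID 5 laid out as a single 7+1 group:

```python
# Usable capacity of an 8-disk array of 500 GB drives under the two
# layouts discussed: RAID 10 mirrors pairs, RAID 5 gives up one
# disk's worth of capacity to parity.

def raid10_usable(n_disks, size_gb):
    return n_disks // 2 * size_gb   # half the disks hold mirror copies

def raid5_usable(n_disks, size_gb):
    return (n_disks - 1) * size_gb  # one disk's worth of parity

print(raid10_usable(8, 500))  # 2000 GB -> the 2 TB figure
print(raid5_usable(8, 500))   # 3500 GB -> the 3.5 TB figure
```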
In the 1997-1998 time frame, another support guy and myself tried to convince Management that we should invest in Storage profiling software from the Storage vendor. The purpose would be so we could monitor the “State of Health” of all disk drives in all Storage from that vendor.
The goal was to identify failing or “candidates for failing” disk drives before they failed and forced a rebuild. We ran all RAID5.
We proposed that doing this would give us data, over time, that would allow us to replace disk drives before they became a problem based on firmware and operating information. The example we used was 30% or 1/3 of the drives every 36 months. To be really safe we recommended that we buy drives in Mass Quantities and replace 1/3 of them every 12 months. This would vary depending on the drive specs. Each generation is different.
We both found other jobs and left after we were told our “Careers?” were over.
The Cost/Benefit Analysis I did showed the Benefits ROI was about 10 times the TCO to do this. The biggest problems were scheduling the down-time and the threat this “appeared” to present to the Backup Group.
There were other factors that made this scenario good for that environment. I only recommend it for environments where the ROI/TCO ratio is determined to be high enough.
Backups are becoming more obsolete.
The only scenario that makes sense, to me, for eCommerce or eBusiness is to make each write to three unique locations. I actually believe a fourth write to SSD, Flash or some solid state “removable media” is going to become necessary. Particularly if you do the “Pace Layering” analysis of your Managed Units of Information and integrate that with the “Long Tail” ROI/TCO.
You might be very surprised at what you learn.
Even RAIDVD would be good, if it were fast enough. The removable is only for Disaster Recovery. Most shops will need online Recovery for everyday Local Disasters.
>Endlessly pushing capacity as the only metric only guarantees an ever faster treadmill.
As you pointed out, however, this treadmill is about to come to an end; so to an extent pushing capacity was/is fine, in the category of “let’s solve problem A (capacity) before we move on to solving problems we don’t have yet (reliability).” The trick is, as you point out, to switch to other metrics once capacity is considered a ‘solved problem’.
The (personal) take of someone at Sun:
Of course RAID-Z doesn’t save you from a dual drive loss, but it can ensure that a URE encountered during rebuild can be recovered from.
Robert,
Backups are only obsolete if one of the unique locations you mentioned is offsite. This is often necessary for business continuity reasons.
There’s also archiving, which while technically different from backups, often uses the same infrastructure (e.g., Legato, NetBackup, tapes, etc.).
Actually, “most” enterprise-class disk vendors are consistently doing media scans of both the live volume blocks and the parity. This allows areas with the traditional URE to be corrected long before a disk failure. I do see your point though: as SATA disks continue to increase in size, even media scans will take too long to complete in a timely fashion. Controller manufacturers like LSI have this as a tunable within the array management software. I think the places where you are going to run into problems are the grow-it-at-home NAS solutions based on commodity hardware platforms that have no ability to do this level of proactive protection.
Reliability should increase when vendors move to 4KB sectors which use more efficient ECC.
ZFS helps with this problem in an additional way: selective resilvering in case of a drive failure. You don’t have to do a complete resync of the hard drive when it’s not completely filled – it only resyncs the used parts of the drive.
@Steven: The disk scans don’t help you. The data on the rotating rust may be correct, but something on the way from the rust to the SATA plug can corrupt the data. Most of the time that is the source of unrecoverable errors.
A 6+2 RAID 6 group requires the same number of spares as two 3+1 groups, so you really won’t have to buy more storage to get the same usable capacity – at least, not with most storage arrays that support 8 or more disk drives. Symmetrix supports R5 3+1 & 7+1, or R6 6+2 and 14+2; other systems are perhaps even more flexible. And in massively cached high-end arrays, the response times for R5 and R6 are virtually indistinguishable except under the heaviest paint-peeling workloads (which few systems run for any significant length of time).
So if your assertions come true, the answer is simple: RAID 6 everywhere.
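The parity-overhead point is easy to verify; a small sketch, with group sizes taken from the comment above:

```python
# A 6+2 RAID 6 group vs. two 3+1 RAID 5 groups: same 8 disks bought,
# same 6 disks of usable capacity -- RAID 6 costs no extra disks here.

def usable_and_total(data_disks, parity_disks, groups=1):
    total = groups * (data_disks + parity_disks)
    return groups * data_disks, total

raid6 = usable_and_total(6, 2)      # one 6+2 group
raid5 = usable_and_total(3, 1, 2)   # two 3+1 groups
assert raid6 == raid5 == (6, 8)
print("usable/total disks:", raid6)
```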
RE: “Backups are only obsolete if one of the unique locations you mentioned is offsite. This is often necessary for business continuity reasons.”
Thanks for the feedback, David.
I have really enjoyed your site since I discovered it here on StorageMojo.
In an effort to be brief, the real point I was referring to has to do with this.
I have been working on the iSCSI “Speed Limit of the Information Universe” numbers. A corollary to throughput is the “green” cost. Is it “greener” to write slower? How slow is too slow?
There might be a point in the not too distant future when the “green” cost of creating, replicating and de-storing Information is the dominant cost.
In an Energy World gone mad…
If I have an online copy for Local Disasters then a 4th online copy that is geographically dispersed may not be of much value relative to the “green” cost.
A removable copy, stored in a geographically dispersed, but considered highly safe from Disasters site, may have lower “green” cost, i.e. Flash or DVD versus tape, and be more effective by being totally flexible.
Have Complete Backups, Will Travel!
“My point in that post is that as SATA disk drive capacity continues to increase, and the unrecoverable read error (URE) rate remains constant, the time will come – 2009? – when every RAID 5 disk failure will be likely to encounter a URE during rebuild.”
As disks become denser, they will indeed become harder to manage because all the other non-capacity specifications (throughput, I/Os per second) stay about the same. Many people end up buying smaller, faster disks for the most important data. Even if a couple of 1 TB drives in a RAID would do the job, most businesses that are serious about their storage will buy smaller, faster drives to reduce the chance of a URE.
The next leap in drive technology will probably not be storage density, it will probably be access speed (sequentially and randomly speaking). Flash shows promise for this, as does holographic technology.
Holographics are interesting because the only theoretical limit would be the sensitivity of the laser receptors and the speed of the servos that have to move them.
I suspect that disk manufacturers understand the trade-off between density and reliability pretty well, and that at most only in marginal areas do they weigh capacity too heavily over reliability.
E.g., if you halved the capacity, you’d have to attain sufficiently greater reliability to make the RAID-5 array more reliable than a RAID-10 array (because otherwise you could leave things as they are and the user could make that decision). And that’s just considering reliability by itself: the RAID-10 array that the higher densities make economically feasible also provides considerably better performance.
Secondly, if you’re using RAID-5 you’re already essentially saying that performance takes a back-seat to economy, so moving to RAID-6 is no big deal (at least as long as RAID-6 is sufficiently commoditized to avoid costing significantly more, which it certainly can be in software-RAID situations).
Thirdly, many common configurations (specifically, those requiring off-site replication) can prop up RAID-5 in the URE department: if you replicate disk-to-disk (actually, it can be even more flexible than this as soon as someone creates the right product) then having duplicate RAID-5 arrays at two sites (or even just a plain data copy at the backup site) drives exposure to UREs back down into the negligible category.
So RAID-5 will continue to occupy a useful niche in the pantheon of replication strategies, and to the degree that this niche narrows RAID-6 will be the beneficiary. And disk manufacturers will continue to increase capacities (or build smaller-form-factor disks that trade capacity for increased performance) rather than complicate their product lines by adding a significant dimension of ‘reliability’ to the existing dimensions of capacity and performance (since reliability can – and in fact to some degree always must – be better addressed at the system level).
It’s even worse than just rebuilds… A ‘normal’ RAID 5 doesn’t check parity on host read operations. On a large array, that means there is a good chance of being handed bad data without knowing it.
There is at least one vendor (check who supplies many of the top500 supercomputers, including #1, for a clue) that does:
8+1 with parity consistency verification on all reads, as long as it is in 8+1 mode
8+2, where it will /correct/ bit errors on the fly as the correct data goes to the host, flag them, and write the correct bits back out to the drive.
They also do partial drive rebuilds for drives that ‘lag’ for a bit, vs. having to rebuild the full capacity of 1 TB drives when slow for just a moment; it lets you bring the drive up and only rebuild what has changed while the drive was ‘away’.
This has nothing to do with RAID-5. If disks were as unreliable as you claim we would have major problems every day. A lot of people transfer 12 TB of data from disks without any RAID protection and don’t experience any read errors. Why?
The answer is that the URE spec should be applied per sector, since a URE is what remains after ECC and multiple read retries of a sector. The error rate is then 1 in 512 bytes (the sector size) × 10^14 – one per 51 petabytes of data read!
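The two readings of the 10^14 spec diverge by a factor of 4096; a quick sketch of both interpretations (the drive spec sheets themselves are ambiguous about which is meant):

```python
# The 10^14 figure read two ways. Per-bit: one error per 10^14 bits
# read. Per-sector: one unrecoverable 512-byte sector per 10^14
# sectors read, as the comment above argues.

TB = 10**12  # decimal terabyte, in bytes

per_bit_tb = 10**14 / 8 / TB        # TB read per expected URE
per_sector_tb = 512 * 10**14 / TB   # TB read per expected URE

print(f"per-bit reading:    one URE per {per_bit_tb:.1f} TB")
print(f"per-sector reading: one URE per {per_sector_tb / 1000:,.1f} PB")
```

The per-bit reading lands almost exactly on the 12 TB rebuild in the original post; the per-sector reading pushes the expected error out to the 51 PB figure quoted above.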
I’ve written about this in my blog in Swedish but Google can translate it quite good for you: http://translate.google.com/translate?hl=sv&sl=sv&tl=en&u=http%3A%2F%2Fwww.teknikhemmet.se%2Fblog.php%2F2009%2Fraid-5-fungerar-fortfarande%2F