Late last year, Sun engineer, DTrace co-inventor, flash architect and ZFS developer Adam Leventhal analyzed RAID 6 as a viable data protection strategy. He lays it out in the Association for Computing Machinery’s Queue magazine, in the article Triple-Parity RAID and Beyond, which I draw from for much of this post.
The good news: Mr. Leventhal found that RAID 6 protection levels will be as good as RAID 5 was until 2019.
The bad news: Mr. Leventhal focused on enterprise drives, whose unrecoverable read error (URE) spec has improved faster than that of the more common SATA drives. SATA RAID 6 will stop being reliable sooner unless drive vendors get their game on. More good news: one of them already has.
The crux of the problem
SATA drives are commonly specified with an unrecoverable read error (URE) rate of 1 per 10^14 bits read. Which means that about once in every 24 billion (512 byte) sectors read, the disk will not be able to read a sector.
Twenty-four billion sectors is about 12 terabytes. When a drive fails in a 7 drive, 2 TB SATA disk RAID 5, you’ll have 6 remaining 2 TB drives, roughly 12 TB that must be read in full to rebuild. As the RAID controller is reconstructing the data it is very likely it will see a URE. At that point the RAID reconstruction stops.
Here’s the math:
(1 – 1 /(2.4 x 10^10)) ^ (2.3 x 10^10) = 0.3835
That 0.3835 is the probability of reading all ~23 billion sectors in the 12 TB of surviving data without an error, assuming a 1 in 10^14 bit read error rate. In other words, you have about a 62% chance of data loss due to an unrecoverable read error on a 7 drive (2 TB each) RAID 5 with one failed disk. Feeling lucky?
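For anyone who wants to rerun the numbers, here is a minimal Python sketch of the calculation above. The drive count, capacity and URE spec are the same assumptions used in the post; this is a back-of-the-envelope check, not a full reliability model.

```python
# Chance of hitting an unrecoverable read error (URE) while rebuilding a
# 7-drive RAID 5 set of 2 TB SATA drives: one drive has failed, so all 6
# survivors must be read end to end.
SECTOR_BYTES = 512
URE_BITS = 1e14               # spec: 1 unrecoverable error per 10^14 bits read

surviving_drives = 6
drive_bytes = 2e12            # 2 TB per drive

sectors_to_read = surviving_drives * drive_bytes / SECTOR_BYTES  # ~2.3 x 10^10
p_sector_error = SECTOR_BYTES * 8 / URE_BITS                     # ~1/(2.4 x 10^10)

p_clean = (1 - p_sector_error) ** sectors_to_read
print(f"P(rebuild finishes with no URE) = {p_clean:.3f}")      # ~0.383
print(f"P(data loss during rebuild)     = {1 - p_clean:.3f}")  # ~0.617
```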
When 4 TB drives ship later this year only 3 drives will equal 12 TB. If they don’t up the spec, this will be a mess.
RAID 6
RAID 6 creates enough parity data to handle 2 failures. You can lose a disk and have a URE and still reconstruct your data.
NetApp noted several years ago that you can have dual parity without increasing the percentage of disk devoted to parity: double the size of the RAID 5 stripe and you get dual-disk protection with the same capacity overhead.
Instead of a 7 drive RAID 5 stripe with 1 parity disk, build a 14 drive stripe with 2 parity disks: the same fraction of capacity goes to parity, but now you can survive 2 failures. Of course, every rebuild will require twice as many I/Os, since each disk in the stripe must be read. Larger stripes aren’t cost free.
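A quick sketch of that trade-off, using the same 2 TB drives as the example above (round numbers for illustration, not NetApp’s actual implementation):

```python
# Compare a 7-drive single-parity stripe with a 14-drive dual-parity stripe.
DRIVE_TB = 2

def stripe(name, drives, parity):
    usable_tb = (drives - parity) * DRIVE_TB
    parity_fraction = parity / drives      # share of raw capacity spent on parity
    rebuild_reads = drives - 1             # surviving drives read during a rebuild
    print(f"{name}: {usable_tb} TB usable, {parity_fraction:.0%} parity overhead, "
          f"{rebuild_reads} drives read per rebuild")

stripe("RAID 5,  7 drives, 1 parity", 7, 1)    # 12 TB usable, 14%,  6 drives read
stripe("RAID 6, 14 drives, 2 parity", 14, 2)   # 24 TB usable, 14%, 13 drives read
```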
Grit in the gears
With dual parity the chance that a rebuild will encounter 2 read errors in the same stripe is tiny, so what is the problem?
Mr. Leventhal says a confluence of factors is leading to a time when even dual parity will not suffice to protect enterprise data.
These include:
- Long rebuild times. As disk capacity grows, so do rebuild times. Full-drive writes on a 7200 RPM drive average about 115 MB/sec – they slow down as they fill up – which means about 2.5 hours per TB minimum to rebuild a failed drive (a quick sketch of the arithmetic appears below). Most arrays can’t afford the overhead of a top speed rebuild, so rebuild times are usually 2-5x that.
- More latent errors. Enterprise arrays employ background disk scrubbing to find and correct disk errors before they bite. But as disk capacities increase, scrubbing takes longer. In a large array a disk might go for months between scrubs, meaning more latent errors surface during a rebuild.
- Disk failure correlation. RAID proponents assumed that disk failures are independent events, but long experience has shown this is not the case: 1 drive failure means another is much more likely.
On the last point: in a corridor conversation at FAST ’10 I was told that a large HPC installation found that, among drives from the same manufacturing lot, 1 drive failure made a 2nd 10x more likely – while a 2nd made a 3rd 100x more likely. It isn’t clear whether manufacturing issues, environmental issues, or the interaction between the 2 led to that result. YMMV.
Simplifying: bigger drives = longer rebuilds + more latent errors -> greater chance of RAID 6 failure.
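The rebuild-time arithmetic from the first bullet above is easy to check. Here is a rough sketch; the 115 MB/sec rate and the 2-5x throttling range come from the post, and everything else is an illustrative round number.

```python
# Rough rebuild-time estimate for a failed drive.
def rebuild_hours(drive_tb, rate_mb_s=115, throttle=1.0):
    """Hours to rewrite drive_tb terabytes at rate_mb_s MB/sec, scaled by how
    much the array slows the rebuild to protect foreground I/O (throttle >= 1)."""
    seconds = drive_tb * 1e12 / (rate_mb_s * 1e6) * throttle
    return seconds / 3600

for tb in (1, 2, 4):
    print(f"{tb} TB: ~{rebuild_hours(tb):.1f} h flat out, "
          f"~{rebuild_hours(tb, throttle=3):.0f} h throttled 3x")
# 1 TB at 115 MB/sec works out to ~2.4 hours, i.e. the "about 2.5 hours
# per TB minimum" figure above.
```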
Mr. Leventhal graphs the outcome:
By 2019 RAID 6 will be no more reliable than RAID 5 is today. Mr. Leventhal’s solution: triple-parity protection.
The StorageMojo take
For enterprise users this conclusion is a Big Deal. While triple parity will solve the protection problem, there are significant trade-offs.
21-drive stripes? Week-long rebuilds that mean arrays are always operating in a degraded rebuild mode? A wholesale move to 2.5″ drives to reduce drive and stripe capacities? Functional obsolescence of billions of dollars worth of current arrays?
What is scarier is that Mr. Leventhal assumes disk drive error rates of 1 in 10^16. That is true of the small, fast and costly enterprise drives, but most SATA drives are spec’d 2 orders of magnitude worse: 1 in 10^14.
With one exception: Western Digital’s Caviar Green, model WD20EADS, is spec’d at 10^15, unlike Seagate’s 2 TB ST32000542AS or Hitachi’s Deskstar 7K2000 (pdf).
Before entering full panic mode, though, it would be good to see more detailed modeling of RAID 6 data loss probabilities. Perhaps a reader would like to take a whack at it.
Comments welcome, of course. I worked at Sun years ago and admire what they’ve been doing with ZFS, flash, DTrace and the great marketing job the ZFS team did without any “help” from Sun marketing. An earlier version of this post appeared on Storage Bits. Looking for a scientific calculator program? PCalc – Mac & Windows – is the best I’ve found.
Surprisingly, the economical Samsung F3EG 2 TB drives are also rated at 10^15 for URE.
I know, I’m as shocked as you are.
Does RAID 6 stops working in 2019?
You mean:
Does RAID 6 stop working in 2019?
or
RAID 6 stops working in 2019?
> As the RAID controller is reconstructing the data it is very likely it will
> see a URE. At that point the RAID reconstruction stops.
Presumably you mean for the one file (rather than ALL RAID reconstruction)?
Jack, what does a RAID array know about files? That is the file system’s job.
Based on what the FAST ’10 papers described about latent sector errors – get one and you are highly likely to get another within a few MB of data – even if the rebuild did go forward you’d likely see further corrupted files.
Better to get the backup out and do it right the first time.
Robin
Most SATA drives have actual URE rates much lower than what is spec’d. For both SATA and most SAS drives the specification is for the drive media and not the entire drive; i.e., the data protection written into the data on the disk may recover the data before the end user notices. Most of the routines for this are the same in SATA and SAS drives, as are many other common parts. The main difference is the warranty provided on the drives for reliability. I know two vendors that have demonstrated orders of magnitude better URE rates for several years, yet actively choose not to change their specifications, for two reasons: their marketing departments say that it will not make an appreciable difference in sales, and finance points out that the warranty reserve would have to be increased, resulting in a higher priced product or even less profit margin.
At a previous position, we ran a test over a two year period to determine the actual URE rate of the two largest volume manufacturers of 2.5″ SATA drives. The population included several hundred drives across several build lots and capacity points, and we saw an actual URE rate better than 1 in 10^18. These were standard drives, not nearline or extended duration drives.
2019? So what? Who knows if we’ll still be using spinning magnetic discs in 10 years?
Depending on your Recovery Time Objectives, RAID6 and other dual-parity schemes (e.g. ZFS RAIDZ2) are dead today. We know from hard experience.
Try 3 weeks to recover from a dual-drive failure on an 8x 500 GB ZFS RAIDZ2 array.
It goes like this:
– 2 drives fail
– Swap 2 drives (no hot spares on this array), start rebuild
– Rebuild-while-operating took over one week. How much longer, we don’t know because …
– 2 more drives failed 1 week into the rebuild.
– Start restore from several week old LTO-4 backup tapes. The tapes recorded during rebuild were all corrupted.
– One week later, tape restore is finished.
– Total downtime, including weekends and holidays – about 3 weeks (we’re not a 24xforever shop).
Shipped chassis and drives back to vendor – No Trouble Found!
I can’t even imagine the worst case recovery times for our older Thumpers (one big 46x 500GB ZFS RAIDZ2 pool with two hot spares), or what they could be for a new Oracle Sun x4540 with 48x 2TB drives.
Even in our relaxed requirements environment, multi-week recovery times are not acceptable.
We could consider mirrored servers. Double the server cost, added sys admin cost, added bandwidth needs, … Not Cheap.
Suddenly solutions like Cleversafe are looking much, much better.
I wonder how this impacts distributed RAID, where rebuilds at the RAID array level are many:many instead of many:one? Once the array has restored 100% redundancy, the individual failed disk can be replaced and rebuilt, and that can take all week if it wants; a 2nd failure during that time no longer matters.
Systems like 3PAR, Compellent, Xiotech, and I believe IBM XIV, do distributed RAID. I did testing on my 3PAR array last year, failing a disk: the system rebuilt at about 60 MB/s with no noticeable impact to the system. At the time there wasn’t much data written to the drive, so the rebuild took about 4 minutes (since only written data is rebuilt). Our previous storage system would take more than 24 hours to rebuild an array with ~400 GB SATA-I disks (the new array has a mixture of 750 GB and 1 TB), and 4-6 hours with 146 GB 10k FC disks. And on the old array there was a significant performance impact during array rebuilds.
To me, a double disk failure within a RAID group during an actual rebuild, on a system that rebuilds as fast as this one does, is probably about as likely as a major earthquake happening in the area.
http://www.techopsguys.com/2009/11/20/enterprise-sata-disk-reliability/
I suspect that distributed RAID will push out the window of where RAID 5 and RAID 6 break down, just don’t know how far.
http://www.techopsguys.com/2009/11/24/81000-raid-arrays/
I recall at one point being told that over at Myspace, 3PAR was the only array manufacturer allowed to replace disks during business hours, because it was a no impact event for the array to rebuild, which I thought was interesting, I think there’s a PB or two over there backing their MSSQL databases.
Does an entire array need to be rebuilt if only one block on one disk failed?
Why not just that raid stripe?
If one block has failed, it wasn’t the first. Drives have had transparent re-mapping for a long time now, so generally by the time bad blocks are detected, the “reserve” pool has run out and there is a significant number of bad blocks.
It depends on the storage system, but most will just watch for X number of errors; if the error count is exceeded, the array will forcefully fail the drive and rebuild.
Other systems may wait for total failure. I had two such systems from HP which had major issues when a disk was failing. One was an internal disk, the other a disk in an external MSA enclosure. In both cases the controller predicted the drive was failing, but it would not proactively fail it. Performance to the array suffered enormously, to the point it was pretty much unusable.
There was no way to remotely fail the drive via the HP management tools, so we had to have someone go on site and physically yank the drive to force it to rebuild and restore performance.
HP support later said that the particular issue we saw was fixed in a newer firmware release (don’t they always say that?). And as this was several years ago I hope by now HP has added the software functionality to force fail a drive.
Most (all?) enterprise arrays allow you to look at the error count for each of the drives as well.
Please check your math on this section:
Here’s the math:
(1 – 1 /(2.4 x 10^10)) ^ (2.3 x 10^10) = 0.3835. I’ve recalculated it several times and did it different ways, it comes out to 0.28947 which would make the percentage = 71% and not 62% which is quite a difference.
If you want higher than double parity, look to Isilon. The storage administrator can choose higher than double parity for select data sets. They also have extremely fast rebuild rates, due to the manner in which they marshal all nodes in their clusters to collectively participate in rebuilds. Faster rebuilds = less time risk exposure to data loss.
I just ran it again and got the same number. I agree that 62% seems low. Rounding error?
Robin
Robin,
How do UREs affect newer RAID levels such as RAID 50 and RAID 10?
You were right in the beginning.
0.383531572868………
http://www.wolframalpha.com/input/?i=(1+%E2%80%93+1+/(2.4+x+10^10))+^+(2.3+x+10^10)
Munkie,
Thanks for checking! The prior comment had me going for a while.
Robin
Is it really the case that the URE spec translates into real world reliability statistics when all data encoding done on the drive via the firmware uses an ECC for laying down the sector? I would expect the URE to relate to the bit-level encoding technique and media quality, but I would expect total drive data reliability to be the combination of the media and the ECC encoding mechanism used at the firmware layer. Since we as customers only see the latter, what do URE numbers tell us? I agree with the math on the URE, but I’m unclear whether this tells me something about my reliability, especially since ECC can be as good as you want it to be. The tradeoff is capacity.
I think Robin’s math (1 – 1 /(2.4 x 10^10)) ^ (2.3 x 10^10) = 0.3835 is quite close, ignoring rounding errors. Here are my calculations:
1) The 6 surviving 2 TB RAID drives contain this many 512-byte sectors:
NSect = 6 x 2E12 / 512 = 2.3438E10.
2) Given the URE = 1/1E14 bit error rate (BER) after ECC, the unrecoverable sector error rate is:
SER = 1/(1E14 / 4096) = 4.096E-11.
3) Then the probability of reading all 6 drives with no error is:
Pg = (1 - SER)^NSect = 0.3829.
4) So the probability of this RAID 5 set failing during the rebuild is:
Pe = 1 - Pg = 0.6171.
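Those four steps are easy to reproduce. A tiny Python check, under the same assumptions (512-byte sectors, 1 error per 10^14 bits):

```python
import math

n_sect = 6 * 2e12 / 512        # 1) sectors on the 6 surviving drives: ~2.34e10
ser = 1 / (1e14 / 4096)        # 2) per-sector unrecoverable error rate: ~4.1e-11
p_good = (1 - ser) ** n_sect   # 3) ~0.3829
p_fail = 1 - p_good            # 4) ~0.6171

# Sanity check: for tiny per-sector rates, (1 - p)^n is ~exp(-n*p), and here
# n*p is ~0.96, which is why the result lands near e^-0.96 no matter how the
# intermediate values are rounded.
print(p_good, math.exp(-n_sect * ser), p_fail)
```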
There is an Excel spreadsheet demonstrating the mean time to data loss for several different RAID levels (as well as factoring in bit error rate), published in my June 2009 blog article:
http://www.zetta.net/_wp/?m=200906
ZettaFS is an N+3 configuration.
Others mention in comments that rebuild times are also tremendously important — another factor incorporated in the spreadsheet.
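For readers who just want a rough feel for how parity count and rebuild time interact, here is a minimal sketch of the textbook MTTDL approximations, assuming independent failures and ignoring UREs. It is not the model in the spreadsheet above, and the MTTF and MTTR figures are illustrative round numbers.

```python
import math

def mttdl_hours(n_drives, parity, mttf_hours=1.0e6, mttr_hours=24.0):
    """Classic mean-time-to-data-loss approximation for a group that survives
    `parity` concurrent drive losses: MTTF^(p+1) / (MTTR^p * N*(N-1)*...*(N-p))."""
    perm = math.prod(range(n_drives - parity, n_drives + 1))
    return mttf_hours ** (parity + 1) / (mttr_hours ** parity * perm)

HOURS_PER_YEAR = 24 * 365
for parity, label in ((1, "single parity"), (2, "double parity"), (3, "triple parity")):
    years = mttdl_hours(14, parity) / HOURS_PER_YEAR
    print(f"14-drive group, {label}: ~{years:.2e} years to data loss")
# Longer rebuilds hurt fast: MTTR is raised to the number of parity drives.
```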
What about the new IBM XIV?
From what I’ve read, when it rebuilds it only reads written data relevant to the lost drive. So it seems to rebuild a lot less data per drive? No?
Oh and they claim 35min rebuild times.
What is the news in stating that RAID 6 will stop being a valid technology 10 years from now? That’s like saying Windows 7 will stop being a viable technology in 2019, or my single-core PC won’t be capable of running the current Microsoft OS in 2019. He makes a number of (IMHO) invalid assumptions, and he continues to focus on home/non-commercial/small-business spinning-disk storage rather than enterprise class intelligent storage arrays and state of the art storage technology. His assumptions also seem to imply that an enterprise will still be using a current-technology storage array 10 years from today. I’m not saying it can’t happen, but I will say that I seriously doubt it within GM.
First, the probability that ANYONE will still be selling spinning disk storage 10 years from now I would place at <1%. Probably lower than that. SSDs will have 300GB, 600GB and 900GB drives by the end of the year. As technology improves and more SSDs are created and sold, cost will go down, MTBF will continue to climb and URE rates will continue to fall. As you eliminate moving parts, decrease power consumption and heat dissipation, and improve the silicon and technology of SSDs (which he completely ignores), his entire formula for disk failure starts to change by orders of magnitude. The compute power of storage management subsystems will continue to progress at the same rate as server CPUs. His assertion that rebuild times will take longer because of increasing disk size ignores that storage bandwidth could be 100 or 1000x faster than it is today, with 10-100x more CPU capacity to tackle the problem.
IMHO, I honestly see this as someone trying to make himself relevant by espousing incomplete theory based on incomplete assumptions regarding the state of technology 10 years from now. Nothing to see here. Move along.
And RAID-10 is out of the question? Here is another argument for RAID-10:
An interesting alternative analysis is based on interface error rates. Seagate SAS disks list an interface error rate of less than one per 10 to the twelfth (10^12) bits. At that rate an undetected transfer error occurs roughly every 240 million 512-byte sectors (about 125 GB) moved over the SATA/SAS interface.
Conclusion: read everything twice to be sure it is correct. Or use RAID-1 and perform checksums on every file read.
Interesting article. I started reading up on this potential problem when I started thinking about what could go wrong with our 100+ TB disk arrays, and now I’m worried. Granted, we are using enterprise disks (UBE 10^16), but the problem is still there.
@eric… we are still using tape technology and people said that would have been dead ages ago.
“First, the probability that ANYONE will still be selling spinning disk storage 10 years from now I would place at <1%. Probably lower than that. "
I would say that the probability of selling spinning disks 10 years from now is 100%, unless there is a significant materials change with NAND. Current silicon-based NAND flash cannot scale to be price competitive with current hard drives. There is only so small they can make the cells; they are approaching sizes of only a few atoms thick. And currently each shrink (55nm to 34nm to 25nm) also reduces write endurance, so manufacturers have to use extra reserve capacity to counteract the lower endurance of the newer flash.
"we are still using tape technology and people said that would have been dead ages ago."
Same here. I have a few hundred LTO tapes here at work. I expect to be using tape 10 years from now. NAND certainly will not replace this. Just as hard drives did not either.
I may be off track, but it seems the switch from 512 byte sectors to 4k sectors has changed the numbers for the better.
There’s something fundamentally wrong with this formula…
From the post:
Here’s the math:
(1 – 1 /(2.4 x 10^10)) ^ (2.3 x 10^10) = 0.3835
Change 10^10 to 10^2 and you have roughly the same answer, so it doesn’t seem to matter if you have a high error rate or a low error rate on the drive.
On a 2TB drive (4 * 10^9 sectors) with a 10^15 error rate (the ones Seagate is selling now), the odds of an unrecoverable read error on the first drive is 1/250,000 assuming you have to read every sector to rebuild. The odds of a double-disk failure are miniscule. Even a drive with a 1 in 10^14 error rate means you have only a 1/25,000 odds of a failure.
If your math had been true, we would have lost a LOT of raidsets over the years during rebuilds. We’ve lost some – large (12-14 member) and old RAID5 sets – but not at the rate you seem to be suggesting. With a petabyte of storage on the floor, we frequently go a few weeks without a single drive failing.
Not necessarily true, Ed… you can still have bit rot and not have the RAID fail. The issue is that the data gets corrupted and can’t be read properly when accessed, and if it isn’t accessed for a long enough period of time you would never know it went bad. This happens far more often than RAID failures.
SSDs and ZFS solved all of this. Even RAIDZ (which is the RAID 5 equivalent) is still working as reliably as before.
And there’s a reason for that, in the ZFS case at least. Adam Leventhal, whose research my piece was based on, was also one of the ZFS developers. Fixing the RAID problem was definitely on the ZFS developers’ minds.
As for the claim that SSDs have fixed this: it’s hard to know, as not all SSDs have a URE spec. Seagate’s Nytro series does, and it’s 1 per 10^16 bits, which supports your claim. Disk drives have also tended to up their specs, so the issue is less relevant today.
And, of course, much of today’s new storage capacity doesn’t use RAID at all, preferring either triple replication or advanced erasure codes that offer greater capacity efficiency and greater failure tolerance.