RAID 5 – Do the Math!

by Robin Harris on Tuesday, 18 April, 2006

Even though disk storage gets about 5-10% cheaper every quarter, people still hate paying for it. A new CPU goes faster, a new display is brighter and/or bigger, but new storage just sits there until we fill it up.

For that reason, the idea of RAID 5 (see the World’s Shortest RAID Guide) seems to hold a hypnotic attraction for customers everywhere. While I understand that cheaper and almost as good is a win for most of us, RAID 5 is a mixed bag that may not do what you need, even if it does what you want.

Start RAID 5 definition
A formal engineering definition of RAID would require using words that I think many people would need defined as well, so I’m not going there. Operationally, a RAID 5 controller calculates data recovery information (parity) and spreads your data and the data recovery information across several disks, usually 4-10 disks. The big advantage of RAID 5 is that it protects your data while using only the capacity of one disk to do so. So if you have 6 400GB disks in a RAID 5 configuration, you have 2000GB (6 * 400GB = 2400GB, less the one 400GB disk of recovery info) of usable data storage capacity.

If you mirrored (RAID 1) those 6 400GB disks, you would only have 1200GB of usable capacity. Same disks, same power & space requirements, but 40% less capacity. For what?
End RAID 5 definition
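The capacity arithmetic above is easy to script; a minimal sketch using the article's example figures:

```python
# Usable capacity for the 6 x 400GB example above.
# Drive counts and sizes are just the article's example figures.

def raid5_usable(drives: int, size_gb: int) -> int:
    """RAID 5 spends the capacity of one drive on parity."""
    return (drives - 1) * size_gb

def raid1_usable(drives: int, size_gb: int) -> int:
    """RAID 1 (mirroring) keeps two full copies, halving capacity."""
    return drives * size_gb // 2

print(raid5_usable(6, 400))  # 2000 GB
print(raid1_usable(6, 400))  # 1200 GB
```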

The technical answer to that last question is complicated, because it depends on what you are doing and how the RAID 5 is engineered. The non-technical (i.e. not for gearheads) answer is that by maintaining two complete copies of your data, RAID 1 (and its sibling RAID 1+0) will often complete individual reads faster, usually complete writes faster, and when a disk fails will protect your data better.

If there is a second disk failure in a RAID 5 disk group, ALL the data is LOST. Gone. Pff-f-f-t. So the natural question has always been: “How likely is a second disk failure?” Take the disk vendor’s MTBF (mean time between failure) data and posit a random distribution of disk failures, and the non-tech answer is: “not very.”

To illustrate, take a modern 400GB SATA drive with an MTBF spec of 400,000 hours. In a six-drive RAID 5, like the one above, you would expect a drive failure once almost every 67,000 hours (400,000/6). Since there are only 8,760 hours in a non-leap year, that is about every 7.6 years. So no worries, eh?
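That back-of-envelope division, scripted (a naive model, as the next section explains: it assumes independent, randomly distributed failures):

```python
# Naive model from the text: independent drives, each with the vendor's
# quoted MTBF, so the array sees a failure DRIVES times as often.
MTBF_HOURS = 400_000
DRIVES = 6
HOURS_PER_YEAR = 8_760

array_failure_interval = MTBF_HOURS / DRIVES     # ~66,667 hours
print(array_failure_interval / HOURS_PER_YEAR)   # ~7.6 years between failures
```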

Sorry, yes, there are worries, of two different types:

  • First, what if the drive failures are not random? In my experience they frequently are not. Bad power, poor cooling, heavy duty cycles, shock and vibration problems, all come together to produce unexpected failure clusters. Even with a good environment, there will be clusters of failures simply as a function of statistical variation. So the random failure assumption is not always valid.
  • Second, there is the problem of read failures. As this note in NetApp’s Dave’s Blog explains, complete disk failures are not the only issue. The other is when the drive is unable to read a chunk of data. The drive is working, but for some reason that chunk on the drive is unreadable (& yes, drives automatically try and try again). It may be an unimportant or even vacant chunk, but then again, it may not be. According to Dave’s calculations, if you have a RAID 5 group of four 400GB drives, there is about a 10% chance that you will lose a chunk of data as the data is recovered onto the replacement drive. As Dave notes, even a 1% chance seems high.
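A figure of roughly that size falls out of a standard unrecoverable-read-error (URE) calculation. The 1-in-10^14-bits error rate below is a common SATA spec of that era — my assumption, not a number taken from Dave's post:

```python
# Rough rebuild-failure estimate for a 4-drive RAID 5 of 400GB drives:
# after one drive dies, the 3 survivors must be read end to end.
# URE_PER_BIT is an assumed consumer-SATA spec (1 error per 10^14 bits).

URE_PER_BIT = 1e-14
DRIVE_BYTES = 400e9
SURVIVORS = 3

bits_read = SURVIVORS * DRIVE_BYTES * 8
p_fail = 1 - (1 - URE_PER_BIT) ** bits_read
print(p_fail)  # ~0.09, i.e. roughly the 10% chance cited above
```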

Where Dave and I part company is in our response to this problem. Dave suggests insisting on something called RAID 6, which maintains TWO copies of the recovery data. Compared to our RAID 5 example above, this means that instead of having 2000GB of usable capacity, you would have 1600GB. And now RAID 1 would only have 25% less capacity. I say drop RAID 5 and 6 and go to RAID 1+0, which is both faster and more reliable.

RAID 5 and 6 use much more complicated software to create the recovery data in the first place, and then after a disk fails they need to read each of the remaining disks along with the recovery data to re-create the lost data. For large disks in large RAID groups this can take many hours, if not days. And while the recovery is underway your storage performance is hosed.
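The recovery mechanism can be shown in miniature: RAID 5 parity is the XOR of the data chunks, so rebuilding a lost chunk means reading everything that survives. Toy byte strings stand in for disk stripes here; this is a sketch of the principle, not a controller implementation:

```python
# RAID 5 recovery in miniature: parity = XOR of all data chunks,
# so a lost chunk is rebuilt by XOR-ing the survivors with the parity.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
parity = xor(xor(d1, d2), d3)

# Disk 2 fails; re-create its chunk from the remaining disks plus parity.
recovered = xor(xor(d1, d3), parity)
assert recovered == d2
```

Note that every surviving disk had to be read to rebuild one chunk — which is why real rebuilds over large disks take hours and drag performance down.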

My point is, why even go there? Why not just maintain two complete copies of your data, so when a failure occurs, as it inevitably will (and at the worst possible time, of course) your data is just copied from one disk to another at disk-to-disk speed?

Small and medium businesses face enough uncertainty as it is. Spending a few extra bucks for RAID 1 or 1+0 will make your local digital data storage as bulletproof as it can be. Isn’t that what you really want?

6 comments

NT May 5, 2006 at 9:44 am

Hi Robin,

Let’s do some math for your RAID 1+0 example using a 6-disk RAID group. In this scenario you have 15 unique two-drive failure scenarios. Three of these scenarios are fatal. That means you’re protected against 12/15 of the failure scenarios (i.e. 80%) and you’re exposed with the remaining 20%. So at 2x capacity, 2x cost, I have a 20% exposure. That is a high number. The numbers get much worse as RAID group sizes shrink, a result of drive capacities increasing. For example, in a 4-drive 1+0 RAID group the probability of survival drops to 66%, while the probability of a second failure, due either to a bad drive or an unrecoverable bit error, being fatal increases to 33%. While RAID 1 only has 25% less usable capacity than RAID 6, it also has 2x the cost. The whole idea is to do more with less, not less with more…

Robin May 5, 2006 at 9:57 am

Can you show your work? And also comment on the write performance impact of RAID 6? I’d like to understand where the numbers come from.
Thanks,
Robin

NT May 5, 2006 at 11:11 am

OK. So let’s stick with the 6-drive RG:
——————————————
| Drive 1  Drive 2  Drive 3 |
| Drive 4  Drive 5  Drive 6 |
——————————————

After you pick the 1st drive, there are 5+4+3+2+1 = 15 unique pairings.
A failure of both drives in any one of these three pairings is fatal: (Drive 1/Drive 4, Drive 2/Drive 5, Drive 3/Drive 6). Therefore I’m protected against 12/15 = 80% and I’m not protected against 3/15 = 20%. Another (easier) way of doing it is with the formula 1/(n-1), where n = total number of drives; that gives you the exposure, not the survival %. So 1/(6-1) = 0.2.
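The pair-counting above can be checked by brute force; a quick sketch, using NT's mirror layout:

```python
# Enumerate every two-drive failure in a 6-drive RAID 1+0
# (three mirrored pairs) and count the combinations that lose data.
from itertools import combinations

drives = range(6)
mirrors = [(0, 3), (1, 4), (2, 5)]  # Drive 1/4, Drive 2/5, Drive 3/6

pairs = list(combinations(drives, 2))
fatal = [p for p in pairs if p in mirrors]

print(len(pairs))               # 15 unique pairings
print(len(fatal) / len(pairs))  # 0.2 — the 20% exposure
```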

As far as RAID 6 goes, one thing to note is that not all RAID 6 implementations are the same. There are at least 5 variations out there: RAID 6 using the Reed-Solomon algorithm, RAID 6 using Even-Odd encoding, RAID 6 using X-Code encoding, RAID 6 using Adaptec’s encoding scheme, and NetApp’s RAID-DP. There may be others which I don’t know. Performance will vary across all of these, with some performing much better than others. But in general, I will agree with you that purely from a performance perspective RAID 1/RAID 1+0 offers better performance. One thing to keep in mind is that not everybody needs performance and not everybody deploys FC drives. The other thing that we also need to consider is large array caches that are used to optimize write patterns to disk. In isolation, a RAID 6 write will cause 6 IOs to disk. However, arrays with large caches can hold random writes in cache for a long time, in hopes that adjacent writes will show up so they can be written efficiently to disk with a minimal performance hit.
There are vendors out there today that recommend RAID 5 over RAID 1+0 even for high perf. apps for this reason. I suspect that down the road this will probably continue with RAID 6 because they can throw enough memory at the problem by combining writes to get good performance.
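The I/O counts NT alludes to come from the textbook small-write (read-modify-write) model — an approximation that he himself qualifies, since big caches can coalesce writes; the numbers below are that model, not measurements:

```python
# Textbook per-small-random-write I/O counts (read-modify-write model).
# Approximate: real controllers with large caches can coalesce writes.

SMALL_WRITE_IOS = {
    "RAID 1/1+0": 2,  # write both mirror copies
    "RAID 5": 4,      # read old data + old parity, write new data + new parity
    "RAID 6": 6,      # as RAID 5, plus read + write of the second parity
}

for level, ios in SMALL_WRITE_IOS.items():
    print(level, ios)
```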

Robin May 5, 2006 at 12:19 pm

Thank you for adding that. So my question now is: what’s changed? I’m thinking it is the capacity of the disks which increases the likelihood of hitting one of those 1 in 10^15 errors. If that is so, why not just partition the stripe set into some number of small enough (125GB ?) virtual drives? I can think of some reasons, but I wonder what you think.

BTW, there is a nice paper on The mathematics of RAID-6 that taught me a thing or two.

John June 20, 2007 at 3:57 am

RAID 5 dilemma:
Please illuminate an apparent contradiction as follows.
RAID 5 boasts rotating parity, implying variable write size. Yet the segment size and stripe size are constant.
How do these aspects resolve?

AG April 27, 2011 at 8:03 pm

I can say that I have not lost a single byte of data with my 48 drive array set up in 12 RAID 10 arrays (4 drives per array, total 600GB of 15k storage per array), with 12 drive failures over the past year… damn Toshiba.

The 48 drive archive holds crucial data to both my company’s success and my parent company’s success. Had it not been for RAID 10, would I be home free? Would I have gone on vacation with the family because of the nice bonus I got? I suppose I’ll never know… I don’t really wanna know, but all I can say is that RAID 10 gives your data an impenetrable shield, a Great Wall of China, a nice helping hand; whatever you want to call it.
