Everything You Know About Disks Is Wrong

by Robin Harris on Tuesday, 20 February, 2007

Update II: NetApp has responded. I’m hoping other vendors will as well.

Which do you believe?

  • Costly FC and SCSI drives are more reliable than cheap SATA drives.
  • RAID 5 is safe because the odds of two drives failing in the same RAID set are so low.
  • After infant mortality, drives are highly reliable until they reach the end of their useful life.
  • Vendor MTBFs are a useful yardstick for comparing drives.

According to one of the “Best Paper” award winners at FAST ’07, none of these claims is backed by empirical evidence.

Beyond Google
Yesterday’s post discussed a Google-authored paper on disk failures. But that wasn’t the only cool storage paper.

Google’s wasn’t even the best: Bianca Schroeder of CMU’s Parallel Data Lab won a “Best Paper” award for Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? (BTW, Dr. Schroeder is a post-doc looking for an academic position – but if I were Google or Amazon I’d be after her in a big way.)

Best “academic computer science” paper
As befits the title, it is very heavy on statistics, including some cool techniques like the “autocorrelation function”. Dr. Schroeder explains:

The autocorrelation function (ACF) measures the correlation of a random variable with itself at different time lags l. The ACF, for example, can be used to determine whether the number of failures in one day is correlated with the number of failures observed l days later.

Translation: ever wonder if a disk drive failure in an array makes it more likely that another drive will fail? ACF will tell you.
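For the curious, here’s a back-of-envelope illustration – mine, not the paper’s – of how the ACF separates bursty failures from independent ones. The daily failure counts are invented; only the technique is Schroeder’s.

```python
import random

def acf(xs, lag):
    """Sample autocorrelation of the series xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]
    var = sum(d * d for d in dev)
    return sum(dev[i] * dev[i + lag] for i in range(n - lag)) / var

rng = random.Random(0)

# Invented daily failure counts: "bursty" repeats each count for 4 days,
# as if one bad batch or hot aisle drove several days of failures.
draws = [rng.randint(0, 6) for _ in range(50)]
bursty = [d for d in draws for _ in range(4)]
independent = [rng.randint(0, 6) for _ in range(200)]

print(acf(bursty, 1))       # clearly positive: failures cluster in time
print(acf(independent, 1))  # near zero: memoryless failures
```

A positive ACF at short lags is exactly the “one failure predicts another” effect she found in the field data.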

She looked at 100,000 drives
The drives came from HPC clusters at Los Alamos and the Pittsburgh Supercomputing Center, as well as several unnamed internet service providers. The drives had different workloads, different definitions of “failure” and different levels of data collection, so the data isn’t quite as smooth or complete as Google’s. Yet it probably looks more like a typical enterprise data center, IMHO. Not all of the data could be used to draw all of the conclusions, but Dr. Schroeder appears to have been very careful in her statistical analysis.

Key observations from Dr. Schroeder’s research:
High-end “enterprise” drives versus “consumer” drives?

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors.

Maybe consumer stuff gets kicked around more. Who knows?

Infant mortality?

. . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.

Dr. Schroeder didn’t see significant infant mortality – neither did Google – and she found that wear-out sets in early: drives simply become steadily more likely to fail as they age.

Vendor MTBF reliability?

While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs by dataset and type, are by up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.

Actual MTBFs?

The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

In other words, that 1 million hour MTBF is really about 300,000 hours – about what consumer drives are spec’d at.
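The arithmetic behind that claim is straightforward. A sketch in Python, using the standard exponential-model conversion between MTTF and annualized failure rate (the 3.4× factor is Schroeder’s; everything else follows from it):

```python
import math

HOURS_PER_YEAR = 8760

def afr_from_mttf(mttf_hours):
    """Annual failure rate implied by a datasheet MTTF (exponential model)."""
    return 1 - math.exp(-HOURS_PER_YEAR / mttf_hours)

def mttf_from_afr(afr):
    """MTTF implied by an observed annual replacement rate."""
    return -HOURS_PER_YEAR / math.log(1 - afr)

datasheet_afr = afr_from_mttf(1_000_000)   # ~0.87%
observed_arr = 3.4 * datasheet_afr         # Schroeder's weighted average ARR
print(f"datasheet AFR: {datasheet_afr:.2%}")
print(f"MTTF implied by observed ARR: {mttf_from_afr(observed_arr):,.0f} hours")
```

The implied MTTF comes out a bit over 290,000 hours – right in line with the ~300,000-hour figure above.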

Drive reliability after burn-in?

Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.

Drives get old, fast.

Data safety under RAID 5?

. . . a key application of the exponential assumption is in estimating the time until data loss in a RAID system. This time depends on the probability of a second disk failure during reconstruction, a process which typically lasts on the order of a few hours. The . . . exponential distribution greatly underestimates the probability of a second failure . . . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .
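To see why the exponential assumption matters, here’s a hypothetical sketch – the drive count, rebuild time, and MTTF are my made-up inputs, not the paper’s:

```python
import math

def p_second_failure(n_drives, rebuild_hours, mttf_hours):
    """Chance that at least one surviving drive fails during the rebuild,
    assuming independent exponential lifetimes."""
    surviving = n_drives - 1
    return 1 - math.exp(-surviving * rebuild_hours / mttf_hours)

# Hypothetical 8-drive RAID 5 set, 6-hour rebuild, 1M-hour datasheet MTTF.
p_exp = p_second_failure(8, 6, 1_000_000)
print(f"exponential model: {p_exp:.2e}")   # roughly 4 in 100,000
# Schroeder's point: real failures cluster in time, so the true
# short-window probability is several times this optimistic figure.
```

Multiply that comforting-looking number by the real-world clustering factor and the MTTDL figures on array datasheets start looking shaky.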

Independence of drive failures in an array?

The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!

Big iron array reliability is illusory
One implication of Schroeder’s results is that big iron arrays only appear more reliable. How? Using smaller “enterprise” drives means that rebuilds take less time. That makes RAID 5 failures due to the loss of a second disk less likely. So array vendors not only get higher margins from smaller enterprise disks, they also get higher perceived reliability under RAID 5, for which they also charge more money.

The StorageMojo take
After these two papers, neither the disk drive business nor the array business will ever be the same. Storage is very conservative, so don’t expect overnight change, but these papers will accelerate the consumerization of large-scale storage. High-end drives still have advantages, but those fictive MTBFs aren’t one of them anymore.

Further, these results validate the Google File System’s central redundancy concept: forget RAID, just replicate the data three times. If I’m an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.
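As a back-of-envelope (mine, not Google’s or the paper’s): with three replicas, data is lost only if the two remaining copies die before the failed one is re-replicated. A sketch assuming independent failures, a 3% observed ARR, and a made-up repair window:

```python
import math

HOURS_PER_YEAR = 8760

def p_remaining_copies_lost(afr, n_copies, repair_hours):
    """After one replica fails: rough chance the other copies all fail
    before re-replication completes (independent exponential model)."""
    rate_per_hour = -math.log(1 - afr) / HOURS_PER_YEAR
    p_one = 1 - math.exp(-rate_per_hour * repair_hours)
    return p_one ** (n_copies - 1)

# 3% ARR, 3 copies, a hypothetical 2-hour re-replication window.
p_loss = p_remaining_copies_lost(0.03, 3, 2)
print(f"per-incident loss probability: {p_loss:.1e}")  # vanishingly small
```

The real win is that replication only re-copies the lost data, so the exposure window stays short without exotic hardware. Of course this simple model ignores correlated failures – Schroeder’s very point – so treat it as an upper bound on optimism, not a guarantee.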

Comments welcome, especially from disk drive and array vendors who dispute these conclusions. Moderation turned on to protect the innocent.

Update: Garth Gibson’s name is also on the paper. Since he is busy as a CMU professor and CTO of Panasas, I hope he’ll pardon me for assuming that Dr. Schroeder deserves most of the credit.


Tim April 12, 2010 at 6:08 am

Further to Tracy Valleau

The industry is moving towards using AFR (Annual Failure Rate). The reason is that MTBF is really confusing, and AFR gives the consumer a better idea of what the number means. An AFR of 0.87% is equivalent to an MTBF of 1,000,000 hours; the equation is AFR = 1 − exp(−8760/MTBF).

Both of these measures are POPULATION statistics. One would expect from a large population that a small fraction might be faulty or break earlier than expected. Most people can intuitively understand that about 1% of disks might fail in a single year, or that there is a 1% chance of a given disk failing in a year. They also do not link this failure rate with the disk’s lifetime. As such, AFR is a much more sensible metric for this type of information. And an AFR of 0.87% is exactly the same as an MTBF of 1,000,000 hours.

This statistic also in no way defines how long a disk will last. That is the useful life value (say 30,000 POH (power on hours)). This will be linked to the warranty period, wear-out etc.

On a slightly different note… the paper did not measure disk failures but rather “disk replacements”. There is a difference between the two, namely mis-diagnosis. This may also help explain why she got an autocorrelation. If I wrongly diagnose a fault and replace a disk, I still leave the root cause of the problem in place, and am likely to repeat the same mistake a week or so later… hence the autocorrelation result.

My hypothesis is that the autocorrelation seen is caused by mis-diagnosis. Unfortunately I do not have the data to prove/disprove that hypothesis.

ItsMe October 7, 2010 at 2:23 pm

The last two posts did a good job explaining MTBF. Here’s another way of explaining it.

Get 1,000,000 hard drives in a room. Run them all and see when they fail. Let’s say that in the first year you had one hard drive fail every hour. That would be 8760 drives that would fail. During the second year you might also have 8760 drives fail. During the third year you might have 8760 drives fail. During the 4th year you might have 50,000 drives fail. During the 5th year all the remaining drives might fail.

What is the MTBF? You would clearly decide that the useful life of a hard drive is 3 years, because you start getting a lot of failures in the 4th year, and all of them failed during the 5th year. So you look at your average failure rate for the first three years. Well, for the first three years, you had 1,000,000 drives running, and one failed every hour. But every hour you accumulate 1,000,000 hard-drive hours. So you have one failure per 1,000,000 hours of operation. Thus your MTBF is 1,000,000 hours. MTBF means mean time between failures. You have one failure for every 1,000,000 hours of operation, thus a 1 million hour MTBF.

Notice the fact that all the drives failed by the fifth year. The MTBF has nothing to do with the life expectancy.
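ItsMe’s arithmetic, sketched in Python (the failure schedule is the comment’s invented example, not measured data):

```python
HOURS_PER_YEAR = 8760

drives = 1_000_000
useful_life_years = 3     # failures ramp up after year 3 in the example
failures_per_hour = 1     # one drive dies every hour during useful life

hours = useful_life_years * HOURS_PER_YEAR
failures = failures_per_hour * hours      # 26,280 dead drives
drive_hours = drives * hours              # ignoring the ~2.6% depletion
mtbf = drive_hours / failures
print(f"MTBF: {mtbf:,.0f} hours")         # 1,000,000 hours - even though
                                          # every drive is dead by year 5
```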

On an unrelated note, I have not read any of the referenced papers, but it seems to me that the statistic showing clustered failures is totally bogus. It turns out that when you find a drive has failed and you go to rebuild the raid, it’s not that another drive fails during the rebuild, but rather that the other drive has in fact failed before the rebuild (failure meaning having unreadable data), but the failure is not discovered until the rebuild.

Phil Koenig February 2, 2012 at 11:22 am

Late late late comment, sorry.

Re: the safety of backing up a RAID array with a failed drive first, versus swapping the failed drive and rebuilding the array first.

I would think the main advantage of the “backup first” strategy is that it does not require any new disk writes, only reads.

Seems to me that there would likely be far more potential failures resulting from re-writing all the data/parity during a rebuild than simply reading what’s there onto a backup.

tim newman January 29, 2015 at 9:11 pm

Hi all – 7 years late into this thread. I am amazed at the lectures given by many in this topic. Several of the definitions or examples of MTBF are incorrect. The statement about MTBF not being related to life is also incorrect. MTBF is a term used to measure the time between failures of a repairable system, and how you define a repairable system is up to you. If a system consisted of 10 items that each failed at the 30,000-hour mark, the system MTBF is 3,000 hours, and the MTTF is 30,000 hours. The reason why MTBF is useless for hard drives is that they exhibit an increasing failure rate with respect to time. It is the same reason that the MTBF of brake pads or tyres is misleading. For wear-out items, an “equivalent MTBF” can be determined for simplicity. This merely smooths out the changes in failure rate over time and makes spares calculations simpler.
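The 10-item example works out like this (a trivial sketch; the numbers are the commenter’s):

```python
items = 10
item_mttf_hours = 30_000   # each item wears out at 30,000 hours

# A repairable system of 10 such items sees a failure, on average,
# ten times as often as any single item does.
system_mtbf_hours = item_mttf_hours / items
print(system_mtbf_hours)   # 3000.0
```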

Mike B December 2, 2015 at 11:54 am

Wow – just discovered this thread after how many years? Anyway, it looks like it’s really the Mercedes theory of reliability: if you have a part that, based on testing and design, you expect to last 75,000 +/- 25,000 miles, you schedule it for replacement every 25,000 miles. That way, absent “infant mortality”, it’ll never break. The maintenance bill will be substantial, though.

A practice of automatically replacing drives every 3 years (or however long the warranty is), whether or not degradation is seen, is much the same: the data center sees reliability because, absent failure within warranty, the drives will never break. Of course, with 1,000,000 drives there will be some failures within warranty, so suitable systems (certain forms of RAID) are needed to tolerate those.

People who can survive an extended outage can possibly tolerate more risk of actual failure and the resulting need for restoration from a backup; they will run a drive well beyond the warranty period and will sometimes get away with 5 years or even more. I’ve been able to recover data from stored drives that had been used for 4–5 years then shelved for as long as 15. Which leads me to wonder: besides reliability in operation, what’s the shelf life of a hard disk, such as a backup that’s never had to be used?
