Everything You Know About Disks Is Wrong

by Robin Harris on Tuesday, 20 February, 2007

Update II: NetApp has responded. I’m hoping other vendors will as well.

Which do you believe?

  • Costly FC and SCSI drives are more reliable than cheap SATA drives.
  • RAID 5 is safe because the odds of two drives failing in the same RAID set are so low.
  • After infant mortality, drives are highly reliable until they reach the end of their useful life.
  • Vendor MTBF numbers are a useful yardstick for comparing drives.

According to one of the “Best Paper” award winners at FAST ’07, none of these beliefs is backed by empirical evidence.

Beyond Google
Yesterday’s post discussed a Google-authored paper on disk failures. But that wasn’t the only cool storage paper.

Google’s wasn’t even the best: Bianca Schroeder of CMU’s Parallel Data Lab won a “Best Paper” award for Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? (BTW, Dr. Schroeder is a post-doc looking for an academic position – but if I were Google or Amazon I’d be after her in a big way.)

Best “academic computer science” paper
It is very heavy on statistics, including some cool techniques like the “autocorrelation function”. Dr. Schroeder explains:

The autocorrelation function (ACF) measures the correlation of a random variable with itself at different time lags l. The ACF, for example, can be used to determine whether the number of failures in one day is correlated with the number of failures observed l days later.

Translation: ever wonder if a disk drive failure in an array makes it more likely that another drive will fail? ACF will tell you.
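For the curious, the ACF is simple enough to compute by hand. A minimal sketch (the daily failure counts below are made up for illustration, not from the paper):

```python
def acf(series, lag):
    """Sample autocorrelation of a sequence with itself at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

# Hypothetical daily failure counts: quiet spells punctuated by bursts.
failures = [0, 0, 1, 4, 3, 0, 0, 0, 2, 5, 4, 1, 0, 0, 1, 3, 4, 2, 0, 0]
print(round(acf(failures, 1), 2))  # → 0.45
```

A value near zero would mean failures on different days are unrelated; a clearly positive value, as here, means a failure-heavy day tends to follow another one – failures cluster in time.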

She looked at 100,000 drives
Including HPC clusters at Los Alamos and the Pittsburgh Supercomputer Center, as well as several unnamed internet service providers. The drives had different workloads, different definitions of “failure” and different levels of data collection, so the data isn’t quite as smooth or complete as Google’s. Yet it probably looks more like a typical enterprise data center, IMHO. Not all of the data could be used to draw all of the conclusions, but Dr. Schroeder appears to have been very careful in her statistical analysis.

Key observations from Dr. Schroeder’s research:
High-end “enterprise” drives versus “consumer” drives?

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors.

Maybe consumer stuff gets kicked around more. Who knows?

Infant mortality?

. . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.

Dr. Schroeder didn’t see infant mortality – neither did Google – and she also found that drives just wear out steadily.

Vendor MTBF reliability?

While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs by dataset and type are by up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.

Actual MTBFs?

The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

In other words, that 1 million hour MTBF is really about 300,000 hours – about what consumer drives are spec’d at.
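The conversion is straightforward arithmetic: a datasheet MTTF implies an annualized failure rate (AFR) of hours-per-year divided by MTTF, and Schroeder’s 3.4× observation just scales it. A quick sketch (assuming 8,760 powered-on hours per year):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

def afr_from_mttf(mttf_hours):
    """Annualized failure rate implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours

def mttf_from_afr(afr):
    """Effective MTTF implied by an observed annual replacement rate."""
    return HOURS_PER_YEAR / afr

datasheet_mttf = 1_000_000
datasheet_afr = afr_from_mttf(datasheet_mttf)  # ≈ 0.88%
observed_arr = 3.4 * datasheet_afr             # Schroeder's weighted average
effective_mttf = mttf_from_afr(observed_arr)   # ≈ 294,000 hours

print(f"{datasheet_afr:.2%}  {observed_arr:.2%}  {effective_mttf:,.0f} h")
```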

Drive reliability after burn-in?

Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.

Drives get old, fast.

Data safety under RAID 5?

. . . a key application of the exponential assumption is in estimating the time until data loss in a RAID system. This time depends on the probability of a second disk failure during reconstruction, a process which typically lasts on the order of a few hours. The . . . exponential distribution greatly underestimates the probability of a second failure . . . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .
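To see why the exponential assumption matters, here is a back-of-the-envelope sketch of the optimistic model the paper criticizes (the array size, rebuild window, and MTTF below are illustrative assumptions, not numbers from the paper):

```python
import math

def p_second_failure(n_drives, mttf_hours, window_hours):
    """Probability that at least one surviving drive fails during the
    rebuild window, under the exponential/independence assumption."""
    rate = (n_drives - 1) / mttf_hours  # combined failure rate of the survivors
    return 1.0 - math.exp(-rate * window_hours)

# Hypothetical 8-drive RAID 5 set, 6-hour rebuild, 294,000-hour effective MTTF.
print(f"{p_second_failure(8, 294_000, 6):.4%}")
```

The paper’s point is that this figure is a floor, not an estimate: because failures in a real array cluster in time rather than occurring independently, the true probability of a second failure during the rebuild can be several times higher.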

Independence of drive failures in an array?

The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!

Big iron array reliability is illusory
One implication of Schroeder’s results is that big iron arrays only appear more reliable. How? Using smaller “enterprise” drives means that rebuilds take less time. That makes RAID 5 failures due to the loss of a second disk less likely. So array vendors not only get higher margins from smaller enterprise disks, they also get higher perceived reliability under RAID 5, for which they also charge more money.

The StorageMojo take
After these two papers, neither the disk drive nor the array business will ever be the same. Storage is very conservative, so don’t expect overnight change, but these papers will accelerate the consumerization of large-scale storage. High-end drives still have advantages, but those fictive MTBFs aren’t one of them anymore.

Further, these results validate the Google File System’s central redundancy concept: forget RAID, just replicate the data three times. If I’m an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.

Comments welcome, especially from disk drive and array vendors who dispute these conclusions. Moderation turned on to protect the innocent.

Update: Garth Gibson’s name is also on the paper. Since he is busy as a CMU professor and CTO of Panasas, I hope he’ll pardon me for assuming that Dr. Schroeder deserves most of the credit.


Kensey February 21, 2007 at 10:54 am

hymieg, as I recall it’s not so much the spinup, as it is the polymerization of the lubricant after the drive spins *down* preventing it from spinning up again in the first place. Thus the old “whack it and back it” advice (smack the drive to get the platters unstuck, then boot and *immediately* do a full backup). This is also why disks that have run for a long, long time will continue to run just fine *until* a power outage or something else causes them to spin down, at which point they die, the lubricant having essentially turned to glue.

random February 21, 2007 at 12:58 pm

Tmack, HDs haven’t used stepper motors for almost two decades; they use voice coils now.

robert February 21, 2007 at 2:40 pm

That’s it, I’m going back to chiseling data on a rock tablet!

Alan February 21, 2007 at 4:53 pm

Regarding rob’s “bollocks #2”, where he says, “Not one major disk vendor is looking to provide solid state in the big chassis”: I saw a presentation by a major storage vendor where that was in fact exactly what they were touting. It wasn’t EMC or HP.

John February 21, 2007 at 5:01 pm

According to The Thinker,
you need a massive data collection project tracking hundreds of variables before you can begin to construct an elementary model of any real-world phenomenon…

Not true. By the Central Limit Theorem, the effects of many independent random factors that weren’t taken into account will simply increase the level of variability in the observed population means. The analysis is only problematic if the neglected factors are dependent on the ones being analyzed – for example perhaps SCSI drives were inherently more reliable but for some reason were being put into high-vibration environments more often than other types.
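John’s argument can be sanity-checked with a toy simulation (all numbers here are hypothetical): give two drive populations identical intrinsic failure rates, add an environmental factor that is independent of drive type, and the comparison between groups stays unbiased – the nuisance factor only adds noise.

```python
import random

random.seed(0)  # reproducible toy run

def observed_failure_rate(n_drives=200_000):
    """Fraction of a simulated fleet failing in a year: a fixed intrinsic
    rate plus an environmental factor independent of drive type."""
    failures = 0
    for _ in range(n_drives):
        intrinsic = 0.02                                      # same for every type
        environment = 0.03 if random.random() < 0.5 else 0.0  # independent nuisance
        failures += random.random() < intrinsic + environment
    return failures / n_drives

# Two "types" with identical intrinsic reliability: the independent
# environment factor adds spread, but both group means land near the
# true 3.5%, so the comparison between types stays fair.
print(observed_failure_rate(), observed_failure_rate())
```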

So is there reason to believe that the (admittedly important) initial/operating conditions depend strongly on the variables that were recorded in the data?

If not, I’m inclined to accept the paper conclusions. Which mostly amounted to noting that the current model is broken and proposing one that better fits their observed data. Remember Galileo’s notion of experimental error – something like “don’t take my error of one cubit and try to hide Plato’s error of 100 cubits behind it”. A model can be simplistic, incorrect, and still a VAST improvement on the previous art.

Jason Williams February 21, 2007 at 5:33 pm

Something to keep in mind is that all of these drives were “enterprise” class drives regardless of their interface (SATA, FC, and SCSI). Anything with over a 1,000,000-hour MTTF is an enterprise drive… as noted by the E in front of Hitachi enterprise-class SATA drives. The SATA drives bought by a consumer usually have spec sheets in the 300,000-hour range. Which is really scary if you consider the results of this paper.

Third Grade Math February 21, 2007 at 6:40 pm

1,000,000 hour MTTF?

1000000 / (24 * 365) = over 114 years!

WTF?

If it’s 1,000,000 minutes, we get

1000000 / (24 * 365 * 60) = 1.9 years

which is more like what I’ve experienced. Now the study says 300,000,

300000 / (60 * 24 * 7) = almost 30 weeks!

I’ve had a lot of drives die on me, but this value is ridiculous.

John February 21, 2007 at 8:29 pm

I resoundingly second (the other) John’s comment. The Thinker hides his misconception(s) behind lots of fancy talk, a typical example of someone with a high ratio of verbal skills to actual understanding.

True, there are lots of parameters that were not considered in this study, but (as John noted) as long as the parameters are statistically independent, it’s okay to draw conclusions about the parameters that were in the study based upon those parameters alone.

If you don’t believe this, then just take a look at any scientific paper which uses statistics to make inferences about a phenomenon being studied. Seldom is it practical or possible to model every parameter that exists in reality. Instead, we choose some suitable subset of parameters that explain as much of the variation as possible.

If, for example, drive orientation were responsible for some of the observed variation in drive reliability, and the orientation of a drive had no relationship to the other parameters (e.g. drive type), then this parameter essentially becomes “background noise”; that is, the overall effect of this parameter on drive reliability will be the same on drives of different types, so the difference in reliability between drives of different types due to the drive orientation parameter will average out.

Corley Kinnane February 21, 2007 at 9:23 pm

If you are paranoid at all about RAID5 having inherently more problems than RAID1, consider using *software* RAID5.

I use RAID1 for booting and RAID5 for data – it’s just easier to set up this way.

Now, RAID6 is the better option.

The question of whether RAID1 or RAID5 is more reliable at one-failure tolerance comes down to the reliability of the software – nothing else.

When I make a software RAID5 array, I tag each drive with a number on an unused partition – set it up well, and you can’t really go wrong unless you decide to wipe something important – which you could of course do at any time RAID or not.

Software RAID is a negligible CPU hit these days – and RAID5 is fast, not slow. Even with just 4 drives, you should get around 70–100 MB/sec with 7200rpm drives.

When I need 12+ drives in an array, I use hardware RAID5, knowing it isn’t as secure as the software RAID but is a lot easier to manage.

I think if I had to make a solution right now, it would be software RAID6 backed up to single drive or RAID0 array in another location – I’d install the backup 6 months after setting up the array and swap some drives from the main array with the backup drive/array.
If it had to be always live – I would cluster just 2 arrays.

Triplication? Throw another server on the fire.

Corley.

Arghh February 21, 2007 at 10:44 pm

I would really like to see a paper like this on CDs, DVDs, etc.
With media like this, the validity of the claims generally only becomes evident years after usage.

dp February 21, 2007 at 10:59 pm

Great – so if the title is correct, everything I’ve just learned about disks is wrong.

DZNTUNDERSTAND February 22, 2007 at 3:59 am

Will someone please explain the MTBF values, and why, if the figure works out to 34 years, drives only last about 3?

Robin Harris February 22, 2007 at 5:37 am

Normally I try to respond to comments, but this time there are too many. So I’m going to cherry-pick here.

Richard, you raise many good points. Google appears to have priorities for its computing infrastructure, with #1 being the revenue-generating ad placements. I’ve heard complaints about Gmail uptime, but not about ad placements.

Also, Google has probably been the single most important force in getting chip, power supply and motherboard vendors to focus on power consumption. They’ve been having their motherboards custom-made for several years, and they support three drives per node currently. I know they’ve looked at more drives per node. In fact the impetus for this study may have been to determine the optimum number of drives per node.

Robin

Not Important February 22, 2007 at 5:50 am

I think both the paper and the discussion have misunderstood the notion of MTTF. MTTF is the mean time to failure if the drives are replaced on a regular basis within the warranty period.
In other words, as long as you replace a drive when it reaches about three years of age, there should be an expected average of 114 years between disk failures.
This explains why drives fail a lot more often than the MTTF would suggest once they have been in use for 5–7 years, as the paper states. The misconception is that the MTTF is stated for a specific drive unit, which it is not.

Thankful February 22, 2007 at 1:49 pm

Thank you for finally explaining the MTTF numbers!

Scudchtr February 22, 2007 at 4:15 pm

Not Important, I am calling BS on your definition of MTTF unless you can provide some legitimate references.

JeePee February 23, 2007 at 1:14 am

Not Important,

Brilliant reasoning: if you replace the discs before they fail, there is a bigger chance of avoiding failure. But why would I replace a perfectly good disc? Just to meet the manufacturer’s specifications? That’s a bit of turning the world upside down.

Jessica February 23, 2007 at 6:04 am

I think many people misunderstand the purpose of RAID as it is used in a datacentre.

[ramble, for those who don’t work in datacentres]
RAID is used to reduce (a) loss of data and (b) downtime. A secondary benefit, dependent on configuration, is an increase in performance. Reliability is a byproduct, not the goal.

If a non-redundant disk fails – your system is down and data is probably lost. RAID gets around this and gives you time to plan how to recover.

A disk failure under RAID puts you “at risk”. Your FIRST action in the case of a disk failure under RAID is to ensure you have a good backup, not to slap in a disk and resync. As I’m sure others have noted, any RAID resync will put an abnormal load on the remaining disks. You want to avoid this until you can be sure that should there be a second failure, you can recover.

It is also true that if you do not have an up-to-date backup, taking a backup will load the remaining disks, but this should be a lower load than any resync. You may be fortunate enough that (in decreasing order of preference) either (a) you have a full backup and no data has been added since (b) you can do a quick incremental and place the least stress on the disks (c) the data added can be recreated with minimal effort.
[end of ramble]

My point here is that RAID is not a magic solution, but an important part in an overall strategy.

People have also been talking about the differences between hardware and software RAID. As far as risk is concerned, there is no difference. Until your CPU interfaces directly to the disks there is always a component which could fail and deprive you of data. If you have (S)ATA disks that typically means your motherboard. SCSI or FC – the HBA. In many cases your hardware RAID HBA is just a SCSI/whatever HBA with RAID intelligence added. Your defences against loss of this are (a) duplication of hardware paths and (b) standby spare parts.

There is an illusion with software RAID that because it is host-controlled, you might be able to trawl through the bits on disk and recover things should there be a catastrophic failure, whereas with hardware RAID the data format is inaccessible. In practice, no one would spend the time doing this unless the data were absolutely vital, and were this the case your backup regime and data duplication to other systems is more efficient.

Speed of software RAID is entirely orthogonal to the subject at hand.

Brian February 23, 2007 at 11:31 am

First, no one here, including the paper’s authors, has explained MTTF properly. The paper’s authors got it all wrong. Let me explain.

MTTF is the mean-time-to-failure. That means that each drive will, on average, last a certain amount of time. In this case, each drive will last, on average, 1,000,000 hours. That means some will die sooner, some later, etc.

MTBF is the mean-time-between-failures. That means that the system of drives will, on average, have a certain period of time between failures. That number can be far lower than MTTF.

Also, the authors state that there is no infant mortality effect, yet the results of their Weibull analysis clearly point to infant mortality. It is commonly accepted in reliability analysis that a Weibull shape parameter of less than 1 indicates infant mortality. Some, though, would claim that a value of 0.71 is random. Either way, the system of drives is not exhibiting the wearout failure mode that they state.

MTTF indicates life, but MTBF doesn’t. MTTF generally will not vary with time, but MTBF does. Also, MTTF doesn’t vary with rates of installation or replacement, yet MTBF will.

It is very easy to confuse the two. Many on here have, many on slashdot have, and the paper’s authors have misunderstood as well.

Pipson February 23, 2007 at 12:27 pm

I disagree with Jessica’s statement that backup of a non-redundant RAID is easier on the drives than a rebuild (unless, of course, you don’t have to do a full backup). Moreover, in a production scenario where uptime is important, offlining the RAID to perform a backup instead of rebuilding the array to regain redundancy defeats the purpose of the system. I do agree with your comments on the importance of *regular* backups. This is where providing RPO and RTO that meet business needs is the ultimate fallback.

I absolutely agree with magicalbob’s last two paragraphs on redundancy.

When talking about MTBF, not enough emphasis was put on the duty cycle of each system. In my experience SATA systems simply buckle under constant heavy IO load, with drives popping just like popcorn. Under light to medium load I would expect SATA and SCSI/FC to show similarly low failure rates. Then again, I may just be the exception…

Pete March 8, 2007 at 11:55 am

A few points to blunt the hysteria. Do any of you realize how long 1,000,000 hours is? A quick punch-up on a calculator shows that it is just over 114 years. Even if MTTF and MTBF estimates were off by 50%, that is still 57 years of 24/7 service. Somehow that doesn’t shake my faith in hard drives. Let’s also keep in mind that the price of drive storage has dropped steadily over the last 25 years. I remember when the cost of storage was over $100 per MB. Now that cost is about $.01 per MB. Excuse my math if I miscalculated, but isn’t that a drop by a factor of 10,000?

Given the comparatively low cost of storage, aren’t RAID 1, 4 and 5 outdated now? Doesn’t RAID 10 give better performance and more redundancy? Given the lower cost of storage, isn’t it the epitome of cheap to still be using RAID 5 in server or SAN systems?

Let me also address the myth that “enterprise” drives are somehow better than “consumer” grade drives. Anyone who knew what they were speccing when designing storage systems knew damned well they weren’t paying for fewer failures; they were paying for performance. Faster spindle speeds, lower seek times, higher transfer rates, more write cache and higher throughput were the name of the game. Anyone thinking they were buying lower failure rates was on a fool’s errand.

I would also like to address the silliness that RAID is pointless because you still have a single point of failure in the controller. Well, duh! That’s why the RAID config info is stored on the drives and not in the card or any other volatile memory. If a card fails, it can be replaced without losing data. Also keep in mind that RAID isn’t the end-all and be-all of data security. It is at best one piece of a comprehensive strategy that should include other things like backups, redundant storage and archiving.

Let me also put things into perspective with mfg published MTBF and MTTF rates. As I stated above, 1,000,000 hours is 114 years. The published numbers are ESTIMATES based on predictive TESTING. If they were actually to run real-world tests on samples to get statistical numbers, we would still be putting 20MB MFM and RLL 5.25″ drives in our systems while we waited for manufacturers to complete their testing.

Let me take the opportunity to put it into geek speak since, just by chance, I watched Star Wars the other night: “…So you see, Luke, what I told you IS true… from a certain point of view.” Interpreting statistics is a fool’s game. They are guidelines based on a certain set of conditions, not facts.

Amos March 17, 2007 at 2:59 am

So what practical software/filesystem can you recommend to implement such a file-redundancy setup? Or am I obliged to implement this in my applications?

clockwinder March 27, 2007 at 9:40 am

Permanent data storage?? The hard part is getting rid of stuff you no longer need. I have lived through failures of 9-track tape, DAT tape, Winchester-technology drives, CD platters, 80-column punch cards, and punched paper tape. Information Week a number of years ago published a survey on the longevity of storage media (not quite the same thing as disk drive longevity). Worst was cheap mag tape. Then hard disk. Then high-quality CD (guessed to be reliable for 50–75 years). Most reliable was acid-free paper, good for probably 500 years or more. In this case, we have actual examples!
Gigabytes per page? It depends… don’t throw the books away yet, folks!

From a cost perspective.. April 23, 2007 at 6:11 am

For those taking all this information/comments/thoughts into consideration for real world applications, some cost data to consider…

On a current “big iron” application, we made the change from SATA drives to Fibre drives before implementation this past year. Storage costs increased exactly 100% for the same amount of storage, not the 400 to 600% that has been suggested. So, if you’re thinking of doubling or “tripling” up on SATA, look at the costs also.

Facility costs on “big iron” projects are huge. The costs to double, or triple, up the space to stand up SATA and the added costs for cooling these drives over a period of years can be staggering.

Now if you’re just looking at a simple “one for one” replacement of Fibre with SATA, with the same amount of storage in the end, then it’s worth looking into, because storage costs could be reduced by half.

As an example, our costs could be reduced from $4 million to $2 million. I’ll be taking a look, and will have to make a complicated business decision.

Ted Fay May 20, 2007 at 9:41 pm

Bob,

Of course I’m talking about data corruption due to bad blocks, and the fact that only drive-wide hardware failures were taken into account in this study is the basis of my point.

Robin tried to dismiss my point as being architectural and not real-world, yet my whole point is that this study misses some critical aspects of real-world experience: when you go to fetch data and you can’t get it because the blocks are bad, or you can’t rebuild a portion of the data after a failure because the blocks are bad, then whoever needed that data is going to consider it a failure, regardless of whether the RAID controller labels the disk as failed or not.

Data corruption = failure. Anyone who tells you different is trying to sell you something.

-ted

Ted Fay May 20, 2007 at 9:56 pm

Anonymous,
Regarding your comment “Are you saying we should go back to the ST-506 for reliability?”

Of course not. Radically different technologies, as you know.

Packing twice the blocks onto the same physical spindle as another drive built with the SAME TECHNOLOGY will and does result in twice the number of bad blocks for the same physical damage to, or imperfection in, the platter.

There is no free lunch, and you do indeed get what you pay for. It doesn’t show up in this study, because this study doesn’t take into account the primary advantage of enterprise disks, which is twice the physical media allocated to each block, using the same platter technology as their consumer-grade cousins.

Even if FC, SAS and SATA all do indeed have similar rates of failure for their mechanisms, which I wouldn’t doubt, if you’re willing to pay for RAID redundancy, why not pay for media redundancy for the blocks on your platter?

Apart from the advantages on the controller board of FC or SAS, what you’re paying for is twice the safety of the data contained on those blocks. If you don’t care about what lives on those blocks, I guarantee you someone will when they go missing. :)

Just my two cents.

-ted

A Dutch Library June 1, 2007 at 3:53 am

Well, it’s a bit of a late reply given the date this discussion started, yet I thought it couldn’t hurt to add my own advice. We’re all interested in making our data persistent, which is quite a challenge due to media deterioration and rapid media obsolescence. The topic interested me, and I’m currently graduating by performing research on it for a library interested in digital preservation. There are many difficulties with digital preservation, of which this particular one is just a minor (almost easy) part. I will spare you the whole reasoning behind my conclusion since it isn’t yet finished (and there are probably limits to the text size you can post :)), but the conclusion might be helpful to some of you:

A few assumptions:
-The target storage system needs to be able to contain 10 TB worth of data
-The storage system needs to be scalable
-The storage system needs optimal data security vs. costs (of course data triplication is nice, but most of us, libraries included, don’t have that much money)
-The storage system needs to be web-accessible
-The storage system needs to be disaster-proof

If you are searching for something that should fit these needs as well, this is probably your best solution:

Two separate servers stored at separate locations (the cheapest way of avoiding data loss through disasters). Configure the first server for RAID5EE (hot-spare integration) and the second for RAID60 (SAN). Use 500GB enterprise drives for your first server and 500GB nearline drives for the SAN. Make the first server back up daily to the SAN. Perform nightly disk checks so you can determine when new spare drives should be ordered. And last, but not least, make sure you have the money to buy a whole new server environment within 7 years.

That isn’t anywhere near cheap, but it’s the most cost-effective, near-100% guarantee for preserving your data. This configuration doesn’t necessarily have to be optimal for the next generation of hardware you will buy.

Perhaps no one is helped by this, but I’ll be happy if it just helps someone. Just some (nearly off-topic) side points: for cheap home RAIDs, check the Intel Matrix RAID solution. For future archiving, pay attention to holographic storage development. I’ll spare you the other random findings of my study :)

wgh August 23, 2007 at 9:47 pm

Joe Claborn said (on February 21st, 2007 at 6:41 am): Is this right? An MTBF of ‘only’ 300,000 hours translates to 34 years. Our disk drives seem to last about 3 years. Why the difference?

I’ve skimmed the above thread but didn’t see anyone note that MTBF (and to a degree MTTF) should be divided by the number of drives in your environment to estimate how often you’ll see a single drive within the environment fail. Yes, as you’ve mentioned, the MTBF numbers suggest 34 years to failure for one drive, but if you have 10 drives in your environment you can expect one of them to fail in about 3.4 years – just as, when you have 10 men working construction, there’s 10 times the probability of one of them getting sick on any given day. When working in a “big iron” shop with thousands of RAID devices, this is (usually) taken into account.

Those who say triplicate the data instead of using RAID appear to me not to be faced with needing up-to-date, accurate data available in one location, without time available (due to SLAs) to restore or even to fail over to a separate set of drives. Many in mainframe environments have come to rely heavily on having no downtime to restore or fail over to other drives, unless the situation is very dire (of a disaster type). If one were to “simply” have three copies, as someone suggested above, then which one do you update? All three? Doing so and waiting for validation of completion of I/O would typically cause response times on heavily I/O-burdened systems to degrade beyond acceptability. To not wait on validation opens a window of potential corruption in any copies that were not being synchronously updated (synchronous updates are expensive). Thus RAID.

Yes, drives will fail and drives will be replaced. But a well-laid-out RAID array will still give the needed response times during failures, even at peak transaction time… again, I said if they’re “well laid out”. And yes, if the data is mission critical, such RAID arrays should be copied to another location… for the event of a disaster (including, at a minimum, lightning).
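wgh’s first point is worth a quick sketch: dividing the per-drive MTBF by the fleet size gives the expected time between failures somewhere in the environment. (This assumes independent drives with constant failure rates – exactly the assumption Schroeder’s data undermines – so treat it as a lower bound on trouble.)

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

def fleet_failure_interval_years(mtbf_hours, n_drives):
    """Expected time between failures somewhere in a fleet of n drives,
    assuming independent drives with a constant failure rate."""
    return mtbf_hours / n_drives / HOURS_PER_YEAR

print(f"{fleet_failure_interval_years(300_000, 1):.1f} years")      # one drive: ~34.2
print(f"{fleet_failure_interval_years(300_000, 10):.1f} years")     # ten drives: ~3.4
print(f"{fleet_failure_interval_years(1_000_000, 1000):.2f} years") # 1,000-drive shop: ~0.11
```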

Jered Floyd August 20, 2008 at 2:08 pm

Robin,

A bit of a late comment here, but I think what’s even more interesting than bogus MTBFs for drives is the interesting difference in bit error rate for SCSI/FC vs. SATA drives. I just wrote an article on this, Are Fibre Channel and SCSI Drives More Reliable? It turns out that they are, at least for RAID, and not for the reason you might suspect! I think there’s a false market segmentation going on here…

Jered Floyd
CTO, Permabit Technology Corp.

Kmann August 22, 2008 at 11:01 am

The Bianca Schroeder paper is excellent, but I saw something very interesting in it that seems to have gone unnoticed here:

Table 2. — “Node outages that were attributed to hardware problems broken down by the responsible hardware component.”

Component (HPC1)
CPU 44%
Memory 29%
Hard drive 16%
PCI motherboard 9%
Power supply 2%

Fully 82% of the failures were related to “solid state” components.

This in spite of the fact that the system population included 3,406 disks and 784 servers. DRAM was almost twice as likely to cause a failure and the CPUs were three times more likely to cause an outage. Moreover, 784 motherboards produced 9% of failures while 3,400 disks produced only 16%.

And this is a very high-end system, presumably “top-shelf” DRAM, CPU and motherboard components.

Also, from the text:

“…we have analyzed failure data covering any type of node outage, including those caused by hardware, software, network problems, environmental problems, or operator mistakes. The data was collected over a period of 9 years on more than 20 HPC clusters and contains detailed root cause information. We found that, for most HPC systems in this data, more than 50% of all outages are attributed to hardware problems… Consistent with the data in Table 2, the two most common hardware components to cause a node outage are memory and CPU.”

So much for the myth of “solid state” reliability.

For some perspective: CPU makers stopped publishing MTBF many years ago, and DRAM manufacturers have, to my knowledge, never published them, but most motherboard manufacturers do publish figures, typically in the 100,000-hour range. So if 784 motherboards produced 9% of failures, and 3,400 disks only produced 16%, then it seems the numbers published by the disk drive makers are, in relative terms, not so wildly off the mark. From a system/sub-system perspective, disks appear relatively much more reliable than the “solid state” components.

I wonder how people would react if they actually knew the MTBF numbers on stuff like DRAM and CPUs? Perhaps we should all remember that silicon DOES “wear out” (in a manner of speaking).

All this makes me wonder why everyone assumes that Flash SSD is going to be so much more reliable than other silicon. Are we to believe the ridiculous MTBF claims of the SSD makers (Intel sez 2,000,000 hrs), given the numbers on DRAM?

It will be interesting to see the results of the first large-scale deployments of flash SSD. Unfortunately the “free ride” for SSD will probably continue for five or more years before folks begin to realize that solid state is not necessarily more reliable than mechanical disks… and very frequently (in the case of DRAM and CPUs) less reliable!

Tracy Valleau February 12, 2009 at 10:00 pm

I often get asked about MTBF (Mean Time Between Failure) and it’s amazing how many “industry people” don’t understand it.

And for those who have already figured out that their 1.5M MTBF drives don’t last 150 years, but are not sure what that MTBF thing is… here’s a quickie:

Why your hard drive doesn’t last 150 years.

(There are about 8,760 hours in a year, but to keep this example simple, let’s call it 10,000.)

Here’s how MTBF works: it’s an aggregate of many units based on expected life of a single unit.

Let’s say you have a hard drive that is warranted to last 3 years, or 30,000 hours.

You put it in a server, and behold, it lasts 3 years. You take it out and put in a new one, and that also lasts 3 years. So you replace it with a new one, and that too…. well, you get it.

Let’s say you keep doing that and finally, on the 50th unit, only two years into its life, it breaks.

You now have 3 years or 30,000 hours per unit, times 50 units, which comes to 1,500,000 hours of operation for a single failure.

And that’s your MTBF: 1,500,000 hours.

So anyone who says “Wow! MTBF of 1.5 million hours! that mean this thing will last (1.5M / 10000) 150 years!” -clearly- doesn’t know what they’re talking about.

(MTBF is more complex than my example, including “infant mortality” and “wear out” phases; “theoretical” vs “operational” MTBF and so on, but the gist of what’s here is correct.)
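The arithmetic of this example fits in a few lines (a sketch that mirrors the comment’s simplification of crediting every unit, including the failed one, with the full 30,000 hours):

```python
# 50 drives, each credited with 30,000 hours, and one observed failure.
units = 50
hours_per_unit = 30_000
failures = 1

mtbf = units * hours_per_unit / failures   # 1,500,000 hours
naive_years = mtbf / 10_000                # the misleading "150 years"
```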

Cordially,

Tracy Valleau

“Don’t believe everything you think.”

Tim April 12, 2010 at 6:08 am

Further to Tracy Valleau

The industry is moving towards using AFR (Annual Failure Rate). The reason is that MTBF is really confusing, and AFR gives the consumer a better sense of what the number means. An AFR of 0.87% is equivalent to an MTBF of 1,000,000 hours; the equation is AFR = 1 - exp(-8760/MTBF).
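The conversion formula above can be checked directly (a minimal sketch of the equation, nothing more):

```python
import math

HOURS_PER_YEAR = 8760

def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annual Failure Rate implied by an MTBF, assuming an
    exponential (constant-rate) failure model."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

print(f"{afr_from_mtbf(1_000_000):.2%}")  # 0.87%
```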

Both of these measures are POPULATION statistics. One would expect from a large population that a small fraction might be faulty or break earlier than expected. Most people can intuitively understand that about 1% of disks might fail in a single year, or that there is a 1% chance of a given disk failing in a year, and they do not confuse that failure rate with the disk’s lifetime. As such, AFR is a much more sensible metric for this type of information, and an AFR of 0.87% is exactly the same as an MTBF of 1,000,000 hours.

This statistic also in no way defines how long a disk will last. That is the useful life value (say 30,000 POH (power on hours)). This will be linked to the warranty period, wear-out etc.

On a slightly different note… the paper did not measure disk failures but rather “disk replacements”. There is a difference between the two, namely mis-diagnosis. This may also help explain why she found an autocorrelation. If I mis-diagnose the problem and replace a disk that wasn’t actually at fault, I leave the root cause in place, and am likely to repeat the same mistake a week or so later… hence the autocorrelation result.

My hypothesis is that the autocorrelation seen is caused by mis-diagnosis. Unfortunately I do not have the data to prove/disprove that hypothesis.
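For illustration, the ACF the paper uses can be computed over a series of daily failure counts. The data below is invented purely to show how a mistake repeated a week later would surface as a positive lag-7 autocorrelation:

```python
def autocorrelation(xs, lag):
    """Sample autocorrelation of a series with itself `lag` steps later."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Invented counts: each replacement event echoed one week afterwards.
daily_failures = [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 2, 0, 0]
acf7 = autocorrelation(daily_failures, 7)  # positive: week-later echo
```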

ItsMe October 7, 2010 at 2:23 pm

The last two posts did a good job explaining MTBF. Here’s another way of explaining it.

Get 1,000,000 hard drives in a room. Run them all and see when they fail. Let’s say that in the first year one hard drive fails every hour; that’s 8,760 failed drives. During the second year you might also have 8,760 drives fail, and likewise in the third. During the 4th year you might have 50,000 drives fail, and during the 5th year all the remaining drives might fail.

What is the MTBF? You would clearly decide that the useful life of a hard drive is 3 years, because you start getting a lot of failures in the 4th year, and all of them have failed by the 5th. So you look at your average failure rate for the first three years. For those three years you had 1,000,000 drives running and one failed every hour; but every hour the fleet accumulates 1,000,000 drive-hours. So you have one failure per 1,000,000 hours of operation, and thus an MTBF of 1,000,000 hours. MTBF means mean time between failures: one failure for every 1,000,000 hours of operation is a 1-million-hour MTBF.

Notice the fact that all the drives failed by the fifth year. The MTBF has nothing to do with the life expectancy.
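The thought experiment reduces to a few lines of arithmetic (a sketch of the numbers above, ignoring the few drive-hours lost by the handful of failed drives):

```python
drives = 1_000_000
hours = 8760                        # one year of operation
failures = hours                    # one failure every hour

total_drive_hours = drives * hours  # fleet-wide hours accumulated
mtbf = total_drive_hours / failures # 1,000,000 hours, yet no drive
                                    # survives past year five
```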

On an unrelated note, I have not read any of the referenced papers, but it seems to me that the statistic showing clustered failures is totally bogus. When you find a drive has failed and you go to rebuild the RAID, it’s often not that another drive fails during the rebuild, but rather that the other drive had already failed before the rebuild (“failure” meaning unreadable data), and the failure isn’t discovered until the rebuild touches it.

Phil Koenig February 2, 2012 at 11:22 am

Late late late comment, sorry.

Re: the safety of backing up a RAID array with a failed drive first, versus swapping the failed drive and rebuilding the array first.

I would think the main advantage of the “backup first” strategy is that it does not require any new disk writes, only reads.

Seems to me that there would likely be far more potential failures resulting from re-writing all the data/parity during a rebuild than simply reading what’s there onto a backup.
