Everything You Know About Disks Is Wrong

by Robin Harris on Tuesday, 20 February, 2007

Update II: NetApp has responded. I’m hoping other vendors will as well.

Which do you believe?

  • Costly FC and SCSI drives are more reliable than cheap SATA drives.
  • RAID 5 is safe because the odds of two drives failing in the same RAID set are so low.
  • After infant mortality, drives are highly reliable until they reach the end of their useful life.
  • Vendor MTBFs are a useful yardstick for comparing drives.

According to one of the “Best Paper” award winners at FAST ’07, none of these are backed by empirical evidence.

Beyond Google
Yesterday’s post discussed a Google-authored paper on disk failures. But that wasn’t the only cool storage paper.

Google’s wasn’t even the best: Bianca Schroeder of CMU’s Parallel Data Lab won a “Best Paper” award for her paper Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? (BTW, Dr. Schroeder is a post-doc looking for an academic position – but if I were Google or Amazon I’d be after her in a big way.)

Best “academic computer science” paper
So it is very heavy on statistics, including some cool techniques like the “auto-correlation function”. Dr. Schroeder explains:

The autocorrelation function (ACF) measures the correlation of a random variable with itself at different time lags l. The ACF, for example, can be used to determine whether the number of failures in one day is correlated with the number of failures observed l days later.

Translation: ever wonder if a disk drive failure in an array makes it more likely that another drive will fail? ACF will tell you.
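To make the ACF idea concrete, here is a minimal Python sketch of the standard sample autocorrelation applied to a series of daily failure counts. The function name and the `daily_failures` numbers are made up for illustration; they are not data or code from the paper.

```python
def autocorrelation(counts, lag):
    """Sample autocorrelation of a sequence at the given lag (lag >= 0)."""
    n = len(counts)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(counts)")
    mean = sum(counts) / n
    # Sum of squared deviations (the lag-0 autocovariance, unnormalized).
    variance = sum((x - mean) ** 2 for x in counts)
    if variance == 0:
        raise ValueError("constant series has undefined autocorrelation")
    # Products of deviations for pairs of days that are `lag` days apart.
    covariance = sum(
        (counts[t] - mean) * (counts[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# Hypothetical daily failure counts for one cluster (illustration only).
daily_failures = [0, 0, 3, 2, 0, 0, 0, 4, 3, 0, 0, 1, 0, 0]
for lag in (1, 2, 7):
    print(lag, round(autocorrelation(daily_failures, lag), 3))
```

A value near zero at a given lag suggests failures that many days apart are unrelated; a clearly positive value suggests that bad days tend to cluster.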

She looked at 100,000 drives
Including HPC clusters at Los Alamos and the Pittsburgh Supercomputing Center, as well as several unnamed internet service providers. The drives had different workloads, different definitions of “failure” and different levels of data collection so the data isn’t quite as smooth or complete as Google’s. Yet it probably looks more like a typical enterprise data center, IMHO. Not all of the data could be used to draw all of the conclusions, but Dr. Schroeder appears to have been very careful in her statistical analysis.

Key observations from Dr. Schroeder’s research:
High-end “enterprise” drives versus “consumer” drives?

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors.

Maybe consumer stuff gets kicked around more. Who knows?

Infant mortality?

. . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.

Dr. Schroeder didn’t see infant mortality – neither did Google – and she also found that drives just wear out steadily.

Vendor MTBF reliability?

While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs by dataset and type, are by up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.

Actual MTBFs?

The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

In other words, that 1 million hour MTBF is really about 300,000 hours – about what consumer drives are spec’d at.
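The arithmetic behind that figure can be sketched in a few lines of Python. This assumes the usual simple conversion between an annualized rate and MTTF (hours in a year divided by MTTF); the function names are mine, not the paper's.

```python
HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def afr_from_mttf(mttf_hours):
    """Annualized failure rate implied by an MTTF, as a fraction."""
    return HOURS_PER_YEAR / mttf_hours


def mttf_from_afr(afr):
    """MTTF in hours implied by an annualized rate given as a fraction."""
    return HOURS_PER_YEAR / afr


# Datasheet claim: an MTTF of 1,000,000 hours is roughly a 0.88% AFR.
datasheet_afr = afr_from_mttf(1_000_000)

# Quoted result: the weighted average ARR was 3.4 times 0.88%.
observed_arr = 3.4 * 0.0088

# Convert the observed replacement rate back into an "effective" MTTF.
effective_mttf = mttf_from_afr(observed_arr)

print(f"datasheet AFR:  {datasheet_afr:.2%}")
print(f"observed ARR:   {observed_arr:.2%}")
print(f"effective MTTF: {effective_mttf:,.0f} hours")
```

The effective MTTF lands a little under 300,000 hours, which is where the round number above comes from.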

Drive reliability after burn-in?

Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.

Drives get old, fast.

Data safety under RAID 5?

. . . a key application of the exponential assumption is in estimating the time until data loss in a RAID system. This time depends on the probability of a second disk failure during reconstruction, a process which typically lasts on the order of a few hours. The . . . exponential distribution greatly underestimates the probability of a second failure . . . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .
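For reference, here is roughly what the textbook exponential calculation looks like, as a small Python sketch. The array size, rebuild time and MTTF values are hypothetical, and the whole point of the quote above is that this model understates the real risk, so read the output as a lower bound rather than a prediction.

```python
import math


def p_second_failure(surviving_drives, rebuild_hours, mttf_hours):
    """Probability that at least one surviving drive fails during the rebuild,
    assuming independent drives with exponentially distributed lifetimes."""
    # Combined failure rate of all surviving drives, per hour.
    rate = surviving_drives / mttf_hours
    return 1.0 - math.exp(-rate * rebuild_hours)


# Hypothetical 8-drive RAID 5 set: 7 survivors, 12-hour rebuild.
for mttf in (1_000_000, 300_000):
    p = p_second_failure(surviving_drives=7, rebuild_hours=12, mttf_hours=mttf)
    print(f"MTTF {mttf:>9,} h -> P(second failure during rebuild) = {p:.6f}")
```

Swapping the datasheet MTTF for the observed one already raises the risk; the correlation between failures that the paper reports raises it further, and this simple model cannot capture that.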

Independence of drive failures in an array?

The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!
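A decreasing hazard rate is easy to see with a Weibull distribution whose shape parameter is below 1. The sketch below uses a Weibull purely as an illustration; the shape and scale values are hypothetical and not fitted to any data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate at time t > 0: (k / s) * (t / s) ** (k - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hours since the last replacement in the cluster (hypothetical values).
for hours in (1, 10, 100, 1000):
    decreasing = weibull_hazard(hours, shape=0.7, scale=100.0)
    constant = weibull_hazard(hours, shape=1.0, scale=100.0)
    print(f"{hours:>5} h  shape 0.7: {decreasing:.4f}  shape 1.0: {constant:.4f}")
```

With shape 1.0 the Weibull reduces to the exponential and the hazard is flat, which is the memoryless assumption behind most RAID math. With shape below 1 the hazard is highest right after a replacement and falls as time passes, which matches the quoted observation.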

Big iron array reliability is illusory
One implication of Schroeder’s results is that big iron arrays only appear more reliable. How? Using smaller “enterprise” drives means that rebuilds take less time. That makes RAID 5 failures due to the loss of a second disk less likely. So array vendors not only get higher margins from smaller enterprise disks, they also get higher perceived reliability under RAID 5, for which they also charge more money.

The StorageMojo take
After these two papers neither the disk drive nor the array business will ever be the same. Storage is very conservative, so don’t expect overnight change, but these papers will accelerate the consumerization of large-scale storage. High-end drives still have advantages, but those fictive MTBFs aren’t one of them anymore.

Further, these results validate the Google File System’s central redundancy concept: forget RAID, just replicate the data three times. If I’m an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.

Comments welcome, especially from disk drive and array vendors who dispute these conclusions. Moderation turned on to protect the innocent.

Update: Garth Gibson’s name is also on the paper. Since he is busy as a CMU professor and CTO of Panasas, I hope he’ll pardon me for assuming that Dr. Schroeder deserves most of the credit.

{ 40 comments… read them below or add one }

ugh February 20, 2007 at 7:09 pm

great… my $80k San is doomed to fail.

Warren Strange February 20, 2007 at 7:45 pm

One word:

zfs 🙂

Farhad February 20, 2007 at 8:01 pm

So, what are the physical problems with the notion of ‘permanent’ storage? Will we ever have a medium that can store data reliably and permanently? or will we always have mtbf, mttf and failures and replacements?

hymieg February 20, 2007 at 8:05 pm

In my own experience, one of the best life extending techniques for hard disks is simple – don’t power them off. The systems where I experience the lowest failure rate are consistently the systems that are up 24/7. I have always thought that the biggest killer of drives is the initial power up – the bearings take a bit of a beating getting up to 7200 rpm and beyond from 0 rpm in a hurry.

sooth_sayer February 20, 2007 at 8:51 pm

Get a life .. and some understanding of science.

Your prof. needs to learn 101 of reliability.

Every spec. is with certain conditions .. and to say one number is “false” without a careful setup of “other factors” is yellow journalism .. but I guess content is the king and facts are a secondary artifact of history

mustard mom February 20, 2007 at 9:14 pm

“forget RAID, just replicate the data three times”

Vindication at last!

I have been saying this for years where I work. Not to mention RAIDs have a central point of failure at the controller. Saw this happen last year and the RAID was not backed up due to the perceived reliability.

p-money February 20, 2007 at 9:26 pm

her paper includes LOTS of data. read it and draw your own conclusions about reliability if you want, but don’t claim her data is wrong without offering a similar experiment of your own. what facts are you offering that are more reliable than her tests?

you’re right – every spec IS with certain conditions. and if those conditions are nothing like what is encountered in the real world, they will mislead buyers. notice that the title includes: “What Does an MTTF of 1,000,000 Hours Mean to You?” it does not say “What does an MTTF of 1,000,000 hours mean in a highly environmentally controlled clean room?”

i think the conclusion that can be drawn here is that the drive manufacturer’s claims are not stark facts but are misleading at best and fabricated at worst.

Creditor February 20, 2007 at 9:34 pm

Google invented replication? GoogleFS is pretty standard replication from way way way back.

fairly_reliable February 20, 2007 at 9:35 pm

Re: Reliability
We can either take drive manufacturer’s word for it, or we can go look at real world reliability. The numbers we are given on datasheets reflect certain conditions – which often don’t exist in our datacenters. The point of the paper is that in the REAL WORLD, drives fail more often than under ideal conditions. Seems a no-brainer when you put it that way.

Some Other Professor February 20, 2007 at 9:38 pm

When a paper has two authors, you give them both credit. To assume that the professor on the author list is “busy” and therefore give more credit to the other author is odd and certainly misleading. Ugh.

– A Professor

Creditor February 20, 2007 at 9:42 pm

Google didn’t invent cluster replication, it’s been around since the beginning.

Both studies sure do expose a lot of lies tho 🙂

A Student February 20, 2007 at 9:49 pm

Yes, you should give the professor credit. Without his name, the paper would not have made it into the conference to begin with. And there are also probably one or two sentences that were reorganized because the professor read them! 🙂

Ted Fay February 20, 2007 at 9:54 pm

SATA drives pack far more blocks onto a single platter than a FC or SCSI drive, so they are inherently more prone to losing multiple blocks rather than a single bad block. This means that SATA drive failures, all things in physics being equal, are of necessity of higher probability, and also carry a higher danger of losing data in a RAID 5 set, as losing a larger number of blocks can lead to data loss without having a complete drive failure.

3-way mirrors are definitely one of the most reliable and robust forms of data security, but make no mistake, FC and SCSI drives are WITHOUT QUESTION more reliable. FC and SCSI drives may not match the MTBF specs, but I doubt SATA drives do either, particularly where bad blocks are concerned.

It is nonsense and contrary to physics for a platter that packs more than double the blocks in the same space to be as reliable as one that packs in only half. The platter would have to be twice as robust, and I doubt even Seagate would make that claim.

exHDinsider February 20, 2007 at 10:18 pm

I used to work in the drive industry in the late 80’s through the late 90’s. I can tell you as an insider that ALL of the reliability stats from that time period were bogus. Small sample sizes, extrapolating data from just a week long start-stop experiment, not doing 100% testing of parts before and/or after assembly because a process was thought to be 6-sigma, etc.

Poltsi February 20, 2007 at 10:21 pm

Regarding Farhad’s question:

When you can suspend the enthropy from happening, then you can have your permanent data storage.

Not A Professor Yet February 20, 2007 at 10:21 pm

In general, I give the junior author credit for the detail, the senior author credit for the gist. The ideas and spin generally come top-down, but the grunt work and analysis definitely come up from the bottom.

-Grad student

sickmind fraud February 20, 2007 at 10:32 pm

Probably the best form of semipermanent data storage is an obelisk. But this is not a terribly convenient form factor

rob February 20, 2007 at 10:44 pm

so I’ve been ranting on at some of the larger storage providers for a couple of years now. im a big customer and they just will not listen

Robin Chauhan February 20, 2007 at 10:49 pm

Poltsi, don’t you mean “empathy”?

rob February 20, 2007 at 10:50 pm

who cares if sata is technically not as reliable as FC or SCSI. it’s so much cheaper you can build in more redundancy for a much cheaper price. formula is simple. sata 1/6 of the price as high perf FC running at 1/2 the speed, vendor says it’s 1/3 as reliable, so I can halve my disk cost by putting in sata and increase general performance cause I’ve got three times the spindle and only half the speed… but vendor says this won’t be supported in 24/7 service cycle. bollocks number 1.

Nathan Myers February 20, 2007 at 10:53 pm

I would interpret the decreasing hazard rates the other way: if one drive in a box fails, expect another drive in the box to fail soon after. It makes sense considering they all experience the same conditions, and were probably from the same lot, subject to the same manufacturing defects.

It’s disappointing that they didn’t record mfr/model/serial numbers of failed drives. It’s very common in other fields (munitions, particularly, and aircraft parts) to recall all members of a lot when one fails.

Anonymous February 20, 2007 at 11:12 pm

“SATA drives pack far more blocks onto a single platter than a FC or SCSI drive, so they are inherently more prone to losing multiple blocks rather than a single bad block.”

Are you saying we should go back to the ST-506 for reliability?

rob February 20, 2007 at 11:15 pm

bollocks number two. high perf IO is really really expensive cause you need to buy shedloads of spindle to support it, im talking like an app that needs in excess of 5000 IO ops per sec. solid state flash is now at a cost point where it is 5 times cheaper to get that performance in flash than across spinning disk spindles. great you think. but no. Not one major disk vendor is looking to provide solid state in the big chassis.

Robin Harris February 20, 2007 at 11:34 pm

Too many great comments to respond to right now. Thanks!

I do take issue with Ted Fay’s comment though. I don’t put much stock in architecture arguments because I’ve seen them fail to account for the real world too often. I’m not a disk engineer, yet I suspect the reason “enterprise” disks have lower densities is because of their higher speed, not because of some reliability issue.

As Bianca’s study shows, there was *no difference* in reliability between FC and SATA drives in the environments she examined. So, Ted, if those drives are so much more reliable, how come that didn’t show up in the data?

Finally, the choice isn’t between SSD and fast disks. Multi-port cache is the technology that gives most arrays their performance, especially on writes. In lieu of actual knowledge, most customers of big iron arrays just say “optimize everything” so they can sleep at night. They don’t KNOW what they need, and the industry has been very careful not to figure it out for them. I’m not throwing rocks – I’m an ardent capitalist – but as things continue to scale out it starts to make sense to figure out, as Bianca has, what is really important.


zeropointburn February 20, 2007 at 11:38 pm

He does mean enthalpy. In other words, entropy, random failure, degradation over time. Ordered systems are susceptible to degradation because of entropy. It’s a basic physical property that cannot be avoided (to the best of our current understanding).
If there were a true breakthrough in that regard, much of physics as we know it would be rendered invalid, and it would be possible to store data indefinitely.

rob February 20, 2007 at 11:47 pm

robin, yes cache does do that, except cache has the issue that it becomes flooded much quicker than the actual spindle, especially in a highly intensive IO environment, which is what i’m talking about.

subspawn February 21, 2007 at 12:26 am

Ah, just buffer with a load of nvram or just plain battery backed dram for that matter. High I/O loads are the number one reason for a SAN imho, the huge I/O buffer offers great performance.

Why nobody has written such an application for linux yet puzzles me (I would like to, but my C knowledge is terrible). If you could take that combined with cheap redundant (x3 or so) SATA storage boxes… HP/Netapp/Symmetrix/… are all out of business 🙂

joseph martins February 21, 2007 at 12:54 am

Farhad wrote “Will we ever have a medium that can store data reliably and permanently? or will we always have mtbf, mttf and failures and replacements?”

In the context of long-term information storage, reliable media is not the problem. Sufficiently reliable products exist today. In my opinion, the development of even longer-lasting media is a huge waste of resources.

I’ll give you two reasons:

1. Information Refresh – The increasingly rapid evolution of business applications, databases and file formats undermines the long-term resuscitability of information assets. Companies must develop a plan for ensuring that long-term information assets are always in a usable form/format.
2. Technology Refresh – The long-term viability of any media depends heavily upon the availability of an appropriate interface….ST506 anyone?

In short, if you can’t read the media, and/or you can’t make sense of the information stored on it, using an HDD that lasts 10 years or a tape that lasts 50 isn’t going to matter.

Karl O. Pinc February 21, 2007 at 1:13 am

Isn’t manufacturer MTBF computed on the basis of probability of failure during the warranty period? No wonder real world tests show different MTBF values, many of the failed drives were out of warranty.

Bob February 21, 2007 at 1:27 am

@TedFay: The density of blocks on the drive will have nothing to do with whether the drive hardware fails. I think you’re thinking of data corruption which is another issue.

chronus February 21, 2007 at 1:44 am

enthalpy: The sum of the internal energy of a body and the product of its volume multiplied by the pressure. This seems to have little to do with our topic

Corley Kinnane February 21, 2007 at 3:01 am

Statistics are only a part of the story, usually the boring part.

Regarding RAID5, the conclusion seems to be based on the assumption that people trust RAID5 too much – not the reality of it. RAID5 is a good solution, triplication is better one, by principle, not by statistics.

Of course, many trade off the security of triplication with the value and efficiency of RAID5. These days, HDs are cheap enough to consider options like triplication – it’s not really a revelation many could afford years ago.

On the issue of RAID5 having an increased chance of secondary failure during a rebuild – it’s clearly mostly related to the degree of independence between drives.
Thanks to the stats, we know that there is no point in a drive’s age when it is much more likely to die than any other time – just a high level of early failures, some semi plateau and then a steady decline in reliability after that.
That means that this increase in clustered failures must be due to influences external to the drive but not to the array – power cycles and usage patterns.
That could mean your RAID5 array could fail during a rebuild if it doesn’t normally see that kind of usage – or if you use a cheap SATA solution that requires a power cycle before a rebuild.
Truth is, those factors apply to any array, not the concept of RAID5.
That means they also apply to a triplicated cluster, unless you can afford to force more independence by physically separating the cluster components.

Buy an array or cluster in one shot and it lives for years – all components lose reliability equally, doesn’t matter if you have a RAID5 or triplication, this will threaten your system as if you were comparing a new RAID5 array with a new triplication cluster.

If you are concerned about the reliability of your RAID5 controller, use software RAID (decent software of course).

You’re probably best off duplicating RAID5 arrays in separate locations.


w February 21, 2007 at 3:06 am

enthalpy is certainly not entropy. It (enthalpy) is better understood as available thermal energy (from thermal energy and flow work).

Entropy is a basic physical property, but to some extent, it can be mitigated. It simply provides that thermodynamic processes will tend to go in a certain direction (2d law of thermo) — something that can be overcome by application of restorative work.

The real relationship that this work has to entropy is that both are statistical phenomena describing “one-way” processes. (Be it globally increasing entropy or globally increasing failed drives…)


kevin February 21, 2007 at 3:08 am

RAID has two advantages:
(1) a single HD failure doesn’t halt the server
(2) the replication is latency free
So the optimum setup is RAID 1 on main server and then regular replication to _at least_ two backup servers across fast network.

magicalbob February 21, 2007 at 3:08 am

At last, factual field results from the coalface. Somewhat akin to the realisation a few years ago that real cars don’t crash into brick walls, and hence a major scramble to provide real safety ratings for vehicles.

Redundancy is the key, preferably on a different set of disks, from a different batch, from a different factory in different locations. (Don’t want 20 Fujitsu MPG320 drives in a raid array :P).

Having done a lot of experimental field work, the following motto holds true.

One is none
Two is one
Three is two.

Meaning if you want 1 reliable backup, you need 3 systems.

Ben February 21, 2007 at 3:31 am

(Hardware driven) RAID5 brings more weak points to the system.

The controller is (as someone noted above), materially, a single point of failure by itself.

And it involves complex software with its own set of bugs, hazardous upgrade paths, compatibility problems, … :
– the controller firmware,
– the controller driver,
– the OS’ raid stack,
– the online management utilities

In my (short) experience, RAID controller ROMs and drivers are even more troublesome than disk drives (at least, they bring trouble sooner).

Bram February 21, 2007 at 3:42 am

So how does one “replicate data 3 times” in practice without a raid controller?

The Thinker February 21, 2007 at 4:58 am

Unfortunately, this paper is severely flawed. Similar to the Google paper, it is written by academics with little understanding of the subject matter, but a strong desire to publish lengthy papers.

To write a meaningful paper, there is a lot of data about the drives and the systems they are used in that needs to be collected. These are initial conditions and operating conditions that any real system scientist will tell you cannot be ignored (to say the least). One cannot look at drives in the abstract, but must look at many details of how they are used, including the storage systems they are part of.

Google, to their credit, did collect the SMART measurements. That is a good start, but not sufficient data to support the conclusions of the Google paper.

For example, the orientation of each drive needs to be taken into account. What percentage of the drives analyzed were mounted horizontally vs. vertically? How were the drives themselves mounted? Specific mounting techniques result in a greater incidence of particular failure patterns. How were the drives cooled? Particular cooling techniques similarly result in specific failure patterns. What sort of data usage patterns were in use? What levels of RAID were used across the various drives?

I see no measurements of vibration in this paper. Drive orientation and drive vibration (including system-based vibration) are two factors that are very important in determining drive reliability. Drives have a certain resistance to vibration (and shock) that varies based on the directionality of the vibration.

We also see no meaningful treatment of the conditions for the HPC1, COM1, and COM2 systems. In HPC1 and COM1 we see massive failure levels for memory, likely indicating severe heat problems in those systems. In the COM2 system, we see a very high incidence of motherboard failure, again most likely indicating heat problems (or possibly bad caps). Specific heat conditions are operating conditions for drives that must be taken into account. Maybe the early onset of wear-out degradation is at least in part due to heat?

I have merely touched on several important elements of study that were neglected in both papers. To gain a real understanding of drive failure in the “real world”, real and comprehensive data is needed first. Otherwise we are dealing with merely variations on the “GIGO / Garbage In Garbage Out” theme.

Also, I see a number of irrational conclusions being put forth by readers — no value in RAID just replicate your data 3 times? This sounds a bit like how to get home from Oz. It works in the movies. But it doesn’t work as well in real life.

RAID1 is a very solid solution for many businesses (and their correspondent data usage models), especially if there is a hot spare on the system as well. Many studies have shown the business value of the simple, transparent, low cost redundancy that RAID1 delivers. Even simple probability theory will tell you that RAID1 has clear potential for reliability improvements (that are well measured and proven in the real world).

I see a lot of analysis of RAID5 which people in the real world know is not a good choice for data that matters. There is no sane recovery procedure for RAID5. The drive access patterns tend to result in a lot of vibration as well.

Overall, I am disappointed that with all the investment that large organizations make in purchasing and deploying storage, they seem to have no one in their organization that (1) understands the mechanics and physics of even a single disk drive, (2) understands the concept of initial conditions, (3) understands the concept of operating environment/conditions (4) has the willingness to make actual measurements vs. barf up a bunch of hearsay, and (5) truly wants to understand the reliability of storage systems vs. take pot shots at the drive industry.

Each of these papers, CMU’s and Google’s is incomplete. There is not enough data to support the conclusions. There is not even enough data to support almost any conclusion beyond the basic observation, “drives fail, some days more than others.”

Richard February 21, 2007 at 5:46 am

Quoting … “Google File System’s central redundancy concept: forget RAID, just replicate the data three times”.

Google File System relies on inexpensive “commodity” motherboard hardware and is specifically designed for their internal requirements.

I suggest that Google’s concept is not entirely driven by the need for reliability, as a top requirement.

Google File System need for speed is probably much higher on the list and such triplication of data is a great bonus, especially under reads.

A typical ‘motherboard’- based node can only support a small number of disks, making such small disk configurations inefficient under RAID 5 / 6 algorithms. Software RAID 5 running on ‘bare’ PC hardware is very slow & RAID 6 is a lot worse. Not a lot of choice here…. run mirrors.

A typical low cost ‘motherboard’ environment is a tangle of cabling and a typical PC power supply may not be at its best (i.e. MTBF of 100,000 hrs ) when loaded by high peak power profiles required by disks….this is much worse than those spinning disks.

In fact, I suggest that we may all get very very depressed by the *real* MTBF of a complete Google ‘commodity’ system solution … power & cabling included. This will be further degraded by inter-node cabling, etc… Does Google have any figures on this…. probably not.

Ignoring the initial cost of disks…. has anyone calculated the running cost of such ‘triplicate’ system… with ‘underutilized’ power hungry motherboards included … and the cost of replacement parts. They do care about power, I am sure…. but may be locked-in.

How does …say a 1000 disk solution …. compare against an equivalent centralized big-iron …. or any other well designed multiple RAID solution.. ? I am sure that Google File System can run with inexpensive RAID backends…. is this worse in terms of TCO ?

In general, RAID 6 is available ( even from EMC recently) if you are worried by additional disk failures during data reconstruct time … and these are not getting shorter. Disks come with a multi-year warranty…and as someone suggested … if reliability is important, forget about the MTBF and change the lot at the end of the warranty period.

I am sure that Google strategy relies *a lot* on warranty period of the various “commodity” components… not just the disks, which seem to be better than the rest.

Joe Claborn February 21, 2007 at 6:41 am

Is this right? An MTBF of ‘only’ 300,000 hours translates to 34 years. Our disk drives seem to last about 3 years. Why the difference?
