Everything You Know About Disks Is Wrong

by Robin Harris | Tuesday, February 20, 2007 | Clusters, Enterprise | 87 comments

Update II: NetApp has responded. I’m hoping other vendors will as well.

Which do you believe?

Costly FC and SCSI drives are more reliable than cheap SATA drives.
RAID 5 is safe because the odds of two drives failing in the same RAID set are so low.
After infant mortality, drives are highly reliable until they reach the end of their useful life.
Vendor MTBFs are a useful yardstick for comparing drives.

According to one of the “Best Paper” award winners at FAST ’07, none of these are backed by empirical evidence.

Beyond Google
Yesterday’s post discussed a Google-authored paper on disk failures. But that wasn’t the only cool storage paper.

Google’s wasn’t even the best: the paper by Bianca Schroeder of CMU’s Parallel Data Lab, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, won a “Best Paper” award. (BTW, Dr. Schroeder is a post-doc looking for an academic position – but if I were Google or Amazon I’d be after her in a big way.)

Best “academic computer science” paper
So it is very heavy on statistics, including some cool techniques like the “auto-correlation function”. Dr. Schroeder explains:

The autocorrelation function (ACF) measures the correlation of a random variable with itself at different time lags l. The ACF, for example, can be used to determine whether the number of failures in one day is correlated with the number of failures observed l days later.

Translation: ever wonder if a disk drive failure in an array makes it more likely that another drive will fail? ACF will tell you.
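
For readers who want to see what that looks like in practice, here is a minimal sketch of a sample autocorrelation on daily failure counts. It is not the paper's code: the function name, the `daily_failures` numbers and the choice of lags are all made up for illustration.

```python
from statistics import fmean


def autocorrelation(counts, lag):
    """Sample autocorrelation of a series at the given lag (0 <= lag < len)."""
    n = len(counts)
    if not 0 <= lag < n:
        raise ValueError("lag must be in [0, len(counts))")
    mean = fmean(counts)
    # Sum of squared deviations (the lag-0 term used for normalisation).
    variance = sum((x - mean) ** 2 for x in counts)
    if variance == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Sum of products of deviations that are `lag` days apart.
    covariance = sum(
        (counts[t] - mean) * (counts[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


if __name__ == "__main__":
    # Hypothetical failures per day for one cluster (invented numbers).
    daily_failures = [0, 0, 3, 2, 0, 0, 0, 4, 3, 0, 0, 1, 0, 0]
    for lag in (1, 2, 7):
        print(f"lag {lag}: {autocorrelation(daily_failures, lag):+.3f}")
```

A value near zero at a given lag suggests failures that many days apart are unrelated; a clearly positive value suggests failures cluster in time.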

She looked at 100,000 drives
Including HPC clusters at Los Alamos and the Pittsburgh Supercomputing Center, as well as several unnamed internet service providers. The drives had different workloads, different definitions of “failure” and different levels of data collection, so the data isn’t quite as smooth or complete as Google’s. Yet it probably looks more like a typical enterprise data center, IMHO. Not all of the data could be used to draw all of the conclusions, but Dr. Schroeder appears to have been very careful in her statistical analysis.

Key observations from Dr. Schroeder’s research:
High-end “enterprise” drives versus “consumer” drives?

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors.

Maybe consumer stuff gets kicked around more. Who knows?

Infant mortality?

. . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.

Dr. Schroeder didn’t see infant mortality – neither did Google – and she also found that drives just wear out steadily.

Vendor MTBF reliability?

While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs by dataset and type, are by up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.

Actual MTBFs?

The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

In other words, that 1 million hour MTBF is really about 300,000 hours – about what consumer drives are spec’d at.
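
The arithmetic behind that translation can be sketched in a few lines. This assumes the usual rough conversion of an annualized rate to hours per year divided by MTTF, with drives running 24x7; the function names are mine, not the paper's.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming continuous operation


def afr_from_mttf(mttf_hours):
    """Approximate annualized failure rate implied by a datasheet MTTF."""
    return HOURS_PER_YEAR / mttf_hours


def mttf_from_afr(afr):
    """Approximate MTTF in hours implied by an annualized rate (a fraction)."""
    return HOURS_PER_YEAR / afr


if __name__ == "__main__":
    datasheet_afr = afr_from_mttf(1_000_000)
    observed_arr = 3.4 * 0.0088  # weighted average quoted above
    print(f"datasheet AFR: {datasheet_afr:.2%}")
    print(f"observed ARR:  {observed_arr:.2%}")
    print(f"implied MTTF:  {mttf_from_afr(observed_arr):,.0f} hours")
    # Worst dataset quoted above: 13.5% observed against 0.88% on the datasheet.
    print(f"worst-case ratio: {0.135 / 0.0088:.1f}x")
```

Treating the observed replacement rate as if it were a failure rate is itself a simplification, since a replaced drive is not always a failed drive.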

Drive reliability after burn-in?

Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.

Drives get old, fast.

Data safety under RAID 5?

. . . a key application of the exponential assumption is in estimating the time until data loss in a RAID system. This time depends on the probability of a second disk failure during reconstruction, a process which typically lasts on the order of a few hours. The . . . exponential distribution greatly underestimates the probability of a second failure . . . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .
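
To make the "exponential assumption" concrete, here is a rough sketch of the textbook calculation the quote is criticizing: the chance that any surviving drive fails during a rebuild, if failures are independent with a constant rate. The array size, rebuild time and MTTF values are hypothetical, and the function name is mine. The paper's point is that real drives do worse than this baseline.

```python
import math


def p_second_failure_exponential(surviving_drives, rebuild_hours, mttf_hours):
    """Probability that at least one surviving drive fails during the rebuild,
    assuming independent drives with exponentially distributed lifetimes."""
    # With a constant per-drive rate of 1/MTTF, the combined rate is n/MTTF.
    combined_rate = surviving_drives / mttf_hours
    # 1 - exp(-x), written with expm1 to stay accurate for very small x.
    return -math.expm1(-combined_rate * rebuild_hours)


if __name__ == "__main__":
    # Hypothetical 8-drive RAID 5 set: 7 survivors, 6-hour rebuild.
    for mttf in (1_000_000, 300_000):
        p = p_second_failure_exponential(7, 6, mttf)
        print(f"MTTF {mttf:>9,} h -> P(second failure) = {p:.6%}")
```

Under this model the risk scales almost linearly with rebuild time and with the number of surviving drives, which is why longer rebuilds on bigger drives matter.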

Independence of drive failures in an array?

The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!
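
A decreasing hazard rate is easy to picture with a Weibull model, a common way to describe time between failures. The sketch below is only an illustration: the shape of 0.7 and the scale of 1,000 hours are assumed values, not figures taken from the paper's datasets.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at time t > 0."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    scale = 1_000.0  # hypothetical scale, in hours
    for hours_since_last in (1, 10, 100, 1_000):
        decreasing = weibull_hazard(hours_since_last, 0.7, scale)
        constant = weibull_hazard(hours_since_last, 1.0, scale)
        print(
            f"{hours_since_last:>5} h since last replacement: "
            f"shape 0.7 -> {decreasing:.5f}/h, shape 1.0 -> {constant:.5f}/h"
        )
```

With a shape of exactly 1 the Weibull reduces to the exponential case and the rate never changes; with a shape below 1 the rate is highest right after a replacement and falls off as time passes.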

Big iron array reliability is illusory
One implication of Schroeder’s results is that big iron arrays only appear more reliable. How? Using smaller “enterprise” drives means that rebuilds take less time. That makes RAID 5 failures due to the loss of a second disk less likely. So array vendors not only get higher margins from smaller enterprise disks, they also get higher perceived reliability under RAID 5, for which they also charge more money.

The StorageMojo take
After these two papers neither the disk drive nor the array business will ever be the same. Storage is very conservative, so don’t expect overnight change, but these papers will accelerate the consumerization of large-scale storage. High-end drives still have advantages, but those fictive MTBFs aren’t one of them anymore.

Further, these results validate the Google File System’s central redundancy concept: forget RAID, just replicate the data three times. If I’m an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.

Comments welcome, especially from disk drive and array vendors who dispute these conclusions. Moderation turned on to protect the innocent.

Update: Garth Gibson’s name is also on the paper. Since he is busy as a CMU professor and CTO of Panasas, I hope he’ll pardon me for assuming that Dr. Schroeder deserves most of the credit.

87 Comments

Michael on Wednesday, 21 February, 2007 at 6:44 am

Wow, it does not take rocket science to figure out that profit is the bottom line. Anyone in business knows that survival is linked to margin. Anyone manufacturing product knows their market and builds toward an expected lifetime, not an ideal one. Even if you take the lowest factor necessary of say 100K hours, that’s more than 10 years life expectancy. Sorry, but I don’t know anyone running drives in business that long anymore. Most swap them out in three years. Consumers typically will hold on to their investment longer. Anyway, there is a reason why most manufacturers provide a limited warranty; 5 years is good, but I don’t think a lifetime warranty will ever be there.
straav on Wednesday, 21 February, 2007 at 6:56 am

Google’s article at the end of 3.1 does talk about there being a “noticeable influence of infant mortality” (Failure Trends in a Large Disk Drive Population, Google inc, pg4)

As for differing reliability between SCSI, ATA, FC, have you looked at the model number definitions? Look at some drive vendor sites for how to decode model numbers. The part you will find interesting is that part of the number is what interface it has, while the rest remains the same.

So does anyone else see how despite the interface we are talking about the same drive mechanism? So predicted failure rate would be the same for the same hardware. I can’t say that I know this for a fact in any thing else than a handful of drives I opened years ago, so this may be a bit dated.

So while it is possible it has changed, I would suspect the money savings of mass production makes it a common choice.
Guillaume on Wednesday, 21 February, 2007 at 7:00 am

“one array drive failure means a much higher likelihood of another drive failure”: that’s a well-known fact. The problem is that most of the time when you get your RAID array delivered, most of the disks come not only from the same manufacturer, but also from the same factory and the same run. That means that a series of disks built under the same conditions and used under the same conditions has a higher chance of failing if one of them is failing. Nothing new: a couple of sysadmin friends of mine order disks separately, from different manufacturers, to build their RAID systems and insist on always getting a good mix.
thinkers fanboy on Wednesday, 21 February, 2007 at 7:34 am

i must agree with the thinker. there is so much important data which hasn’t been taken into account.
i think it’s sweet how bianca “scales” figure 2 with little arrows indicating what year
we are at…learn to scale, babe. this paper is almost completely free of sense.
Mike on Wednesday, 21 February, 2007 at 8:05 am

Another interesting thing is that all hardware builders always pack drives together in a brick. If you have a big SCSI server with 8 drive bays, and order 4 drives, THE UNWRITTEN LAW OF SERVERS dictates they be next to each other, whereas if you leave a space between each of them, and throw in an extra fan or two, they seem to last forever.
T’was heat that killed the beast.
Aaron Becker on Wednesday, 21 February, 2007 at 8:06 am

I can’t believe that nobody has mentioned raid6 as a solution for drive failures during rebuild.

Certainly you do have a risk involved if you lose one drive from a raid5 array and then you beat the hell out of the remaining disks to do a rebuild.

But I have to imagine the probability of losing _two more_ drives during a rebuild gets extremely low…
joseph martins on Wednesday, 21 February, 2007 at 8:41 am

Michael, you are correct. And as I pointed out earlier, existing media is sufficiently reliable.

The Thinker wrote “Unfortunately, this paper is severely flawed. Similar to the Google paper, it is written by academics with little understanding of the subject matter, but a strong desire to publish lengthy papers.”

While I agree the study’s design was poorly constructed and constrained, lengthy nuanced debates about drive longevity and MTBF are largely academic anyway. Real world data would certainly be interesting, but I seriously question if the conclusions drawn from real-world data would a) be substantially different from those in the study, and b) really matter.

As a business user, do I really care if a particular model drive has an MTBF of 300,000 or 1,000,000 hours? No. Given today’s drive reliability, and the reasons I mentioned in my previous comment, it is far more likely that I’ll have bigger fish to fry long before my drives are a major concern.

Everybody loves a good debate.
Anonymous Coward on Wednesday, 21 February, 2007 at 9:00 am

“The Thinker” is so proud of his reply that he posted it both here and on slashdot.
Jameson on Wednesday, 21 February, 2007 at 9:27 am

Backups should not remain attached to your computer.
All your RAID disks can be scrambled at once.

I had a 400GB disk drive nightly backing up a 180GB disk drive,
with old files getting renamed, so I somewhat had more than just one backup.
One night, the data on my disk drive was scrambled, and recovery has so far been futile.
Luckily, I had that 400GB backup.
Unluckily, the incident that scrambled the data on my main disk drive
also scrambled my “mounted” backup drive.
A small partition on that backup drive was not mounted and was not scrambled.
I eventually reformatted these two disk drives (from different manufacturers),
using them once again.
Unfortunately, my loss included 60 files that were important for a project,
files that took me over 100 hours to create.

I run Debian Linux and could find no-one else getting scrambled disks.
However, some comments by others implied that my Asus K8N4-E motherboard had problems that might cause this.

So, I upgraded the firmware on that Asus motherboard,
firmware with an update hinting about a tangential problem like mine.

I conclude that some incidents like UPS polarity reversals and motherboard firmware can ruin the data on all your disk drives.
You need to retain backups that are not attached to your computer.
RAID can protect you from disk drives’ physical failures,
but it cannot protect you against numerous other causes.
RAID can keep your system going when disk drives physically fail,
but disk drive physical failures are not sufficiently more common than other causes for data failure.
For my home computing, I once thought I could safeguard my data with RAID,
but I now instead run backups with a few large raw (without a case) SATA disk drives attached externally via USB adapters.
Tmack on Wednesday, 21 February, 2007 at 10:34 am

Regarding reliability of the different drive types, specifically comments made first by Ted Fay: having taken apart numerous drives of different interfaces, I can state that the internals are basically the same across the board. The only difference is what controller gets slapped onto the bottom. That controller determines how data is spaced out on the drive itself, how it talks to the computer, etc, but the actual physical moving parts of the platters, arms, spindles, etc, are basically the same between IDE/SCSI/SATA/FC/whatever. They consist of a stepping motor to drive an aluminum spindle holding 2-4 platters about 1mm thick, stamped metal arms with the read/write head attached to a block with a bearing and a coil on the other side held between very strong magnets that drive it back and forth across the platters. The failures are generally related to these mechanical parts failing, such as the surface of the platters wearing out or the heads crashing into them. The reliability of the controller card on the drive is based on solid state electrical components, which if designed correctly will far outlive the mechanicals. This is supported by the paper, and by my experience. The illusion of better reliability is due to the more expensive SCSI/FC drives being used in a more consistent environment, like a datacenter. As more and more SATA/IDE drives are making their way into data centers thanks to cheaper and more available RAID solutions that can use them instead of the SCSI only solutions of the past, the truth is coming out in studies like these.
Kensey on Wednesday, 21 February, 2007 at 10:54 am

hymieg, as I recall it’s not so much the spinup, as it is the polymerization of the lubricant after the drive spins *down* preventing it from spinning up again in the first place. Thus the old “whack it and back it” advice (smack the drive to get the platters unstuck, then boot and *immediately* do a full backup). This is also why disks that have run for a long, long time will continue to run just fine *until* a power outage or something else causes them to spin down, at which point they die, the lubricant having essentially turned to glue.
random on Wednesday, 21 February, 2007 at 12:58 pm

Tmack, HDs haven’t used stepper motors for almost 2 decades now. They use voice coils now.
robert on Wednesday, 21 February, 2007 at 2:40 pm

That’s it, I’m going back to chiseling data on a rock tablet!
Alan on Wednesday, 21 February, 2007 at 4:53 pm

In regards to rob’s “bollocks #2”, where he says, “Not one major disk vendor is looking to provide solid state in the big chassis”. I saw a presentation by a major storage vendor where that was in fact exactly what they were touting. It wasn’t EMC or HP.
John on Wednesday, 21 February, 2007 at 5:01 pm

According to The Thinker,
you need a massive data collection project tracking hundreds of variables before you can begin to construct an elementary model of any real-world phenomenon…

Not true. By the Central Limit Theorem, the effects of many independent random factors that weren’t taken into account will simply increase the level of variability in the observed population means. The analysis is only problematic if the neglected factors are dependent on the ones being analyzed – for example perhaps SCSI drives were inherently more reliable but for some reason were being put into high-vibration environments more often than other types.

So is there reason to believe that the (admittedly important) initial/operating conditions depend strongly on the variables that were recorded in the data?

If not, I’m inclined to accept the paper conclusions. Which mostly amounted to noting that the current model is broken and proposing one that better fits their observed data. Remember Galileo’s notion of experimental error – something like “don’t take my error of one cubit and try to hide Plato’s error of 100 cubits behind it”. A model can be simplistic, incorrect, and still a VAST improvement on the previous art.
Jason Williams on Wednesday, 21 February, 2007 at 5:33 pm

Something to keep in mind is that all of these drives were “enterprise” class drives regardless of their interface (SATA, FC, and SCSI). Anything over 1,000,000 MTTF is an enterprise drive…as noted by the E in front of Hitachi enterprise class SATA drives. The SATA drives bought by a consumer usually have spec sheets in the 300,000 range. Which is really scary if you consider the results of this paper.
Third Grade Math on Wednesday, 21 February, 2007 at 6:40 pm

1,000,000 hour MTTF?

1000000 / (24 * 365) = over 114 years!

WTF?

If it’s 1,000,000 minutes, we get

1000000 / (24 * 365 * 60) = 1.9 years

which is more like what I’ve experienced. Now the study says 300,000,

300000 / (60 * 24 * 7) = almost 30 weeks!

I’ve had a lot of drives die on me, but this value is ridiculous.
John on Wednesday, 21 February, 2007 at 8:29 pm

I resoundingly second (the other) John’s comment. The Thinker hides his misconception(s) behind lots of fancy talk, a typical example of someone with a high ratio of verbal skills to actual understanding.

True, there are lots of parameters that were not considered in this study, but (as John noted) as long as the parameters are statistically independent, it’s okay to draw conclusions about the parameters that were in the study based upon those parameters alone.

If you don’t believe this, then just take a look at any scientific paper which uses statistics to make inferences about a phenomenon being studied. Seldom is it practical or possible to model every parameter that exists in reality. Instead, we choose some suitable subset of parameters that explain as much of the variation as possible.

If, for example, drive orientation were responsible for some of the observed variation in drive reliability, and the orientation of a drive had no relationship to the other parameters (e.g. drive type), then this parameter essentially becomes “background noise,” that is, the overall effect of this parameter on drive reliability will be the same on drives of different types, so the difference in reliability between drives of different types >due to the drive orientation parameter
Corley Kinnane on Wednesday, 21 February, 2007 at 9:23 pm

If you are paranoid at all about RAID5 having inherently more problems than RAID1, consider using *software* RAID5.

I use RAID1 for booting, RAID5 for data – it’s just easier to set up this way.

Now, RAID6 is the better option.

The question of whether RAID1 or RAID5 is more reliable for one failure tolerance – it comes down to the reliability of the software – nothing else.

When I make a software RAID5 array, I tag each drive with a number on an unused partition – set it up well, and you can’t really go wrong unless you decide to wipe something important – which you could of course do at any time RAID or not.

Software RAID is a negligible CPU hit these days – and RAID5 is fast, not slow – even just using 4 drives, you should get around 70 – 100 Mb/sec with 7200rpm drives.

When I need 12+ drives in an array, I use hardware RAID5, knowing it isn’t as secure as the software RAID but is a lot easier to manage.

I think if I had to make a solution right now, it would be software RAID6 backed up to single drive or RAID0 array in another location – I’d install the backup 6 months after setting up the array and swap some drives from the main array with the backup drive/array.
If it had to be always live – I would cluster just 2 arrays.

Triplication ? throw another server on the fire.

Corley.
Arghh on Wednesday, 21 February, 2007 at 10:44 pm

I would really like to see some paper like this on CDs, DVDs etc.
With media like this the validity of actual claims generally only become evident years after usage.
dp on Wednesday, 21 February, 2007 at 10:59 pm

Great – so if the title is correct, everything I’ve just learned about disks is wrong.
DZNTUNDERSTAND on Thursday, 22 February, 2007 at 3:59 am

Will someone please explain the MTBF values and why, if it is 34 years, drives just last 3?
Robin Harris on Thursday, 22 February, 2007 at 5:37 am

Normally I try to respond to comments, but this time there are too many. So I’m going to cherry-pick here.

Richard, you raise many good points. Google appears to have priorities for its computing infrastructure, with the #1 priority for the revenue generating ad placements. I’ve heard complaints about Gmail uptime, but not about ad placements.

Also, Google has probably been the single most important force in getting chip, power supply and motherboard vendors to focus on power consumption. They’ve been having their motherboards custom-made for several years, and they support three drives per node currently. I know they’ve looked at more drives per node. In fact the impetus for this study may have been to determine the optimum number of drives per node.

Robin
Not Important on Thursday, 22 February, 2007 at 5:50 am

I think both the paper and the discussion have misunderstood the notion of MTTF. MTTF is the mean time to fail if the drives are replaced on a regular basis within the warranty period.
In other words, as long as you replace a drive when it reaches about three years of age, there should be an expected average of 114 years between disk failures.
This explains the fact that drives fail a lot more often than the MTTF would suggest when they have been in use for 5-7 years as the paper states. The misconception is that the MTTF is stated for a specific drive unit, which it is not.
Thankful on Thursday, 22 February, 2007 at 1:49 pm

Thank you for finally explaining the MTTF numbers!
Scudchtr on Thursday, 22 February, 2007 at 4:15 pm

Not Important, I am calling BS on your definition of MTTF unless you can provide some legitimate references.
JeePee on Friday, 23 February, 2007 at 1:14 am

Not Important,

Brilliant reasoning, if you replace the discs before they fail, there is a bigger chance of avoiding failure. But why would I replace a perfectly good disc? Just to meet the manufacturer’s specifications? That’s a bit of turning the world around.
Jessica on Friday, 23 February, 2007 at 6:04 am

I think many people misunderstand the purpose of RAID as it is used in a datacentre.

[ramble, for those who don’t work in datacentres]
RAID is used to reduce (a) loss of data and (b) downtime. A secondary benefit, dependent on configuration, is an increase in performance. Reliability is a byproduct, not the goal.

If a non-redundant disk fails – your system is down and data is probably lost. RAID gets around this and gives you time to plan how to recover.

A disk failure under RAID puts you “at risk”. Your FIRST action in the case of a disk failure under RAID is to ensure you have a good backup, not to slap in a disk and resync. As I’m sure others have noted, any RAID resync will put an abnormal load on the remaining disks. You want to avoid this until you can be sure that should there be a second failure, you can recover.

It is also true that if you do not have an up-to-date backup, taking a backup will load the remaining disks, but this should be a lower load than any resync. You may be fortunate enough that (in decreasing order of preference) either (a) you have a full backup and no data has been added since (b) you can do a quick incremental and place the least stress on the disks (c) the data added can be recreated with minimal effort.
[end of ramble]

My point here is that RAID is not a magic solution, but an important part in an overall strategy.

People have also been talking about the differences between hardware and software RAID. As far as risk is concerned, there is no difference. Until your CPU interfaces directly to the disks there is always a component which could fail and deprive you of data. If you have (S)ATA disks that typically means your motherboard. SCSI or FC – the HBA. In many cases your hardware RAID HBA is just a SCSI/whatever HBA with RAID intelligence added. Your defences against loss of this are (a) duplication of hardware paths and (b) standby spare parts.

There is an illusion with software RAID that because it is host-controlled, you might be able to trawl through the bits on disk and recover things should there be a catastrophic failure, whereas with hardware RAID the data format is inaccessible. In practice, no one would spend the time doing this unless the data were absolutely vital, and were this the case your backup regime and data duplication to other systems is more efficient.

Speed of software RAID is entirely orthogonal to the subject at hand.
Brian on Friday, 23 February, 2007 at 11:31 am

First, no one here, including the paper’s authors, has explained MTTF properly. The paper’s authors got it all wrong. Let me explain.

MTTF is the mean-time-to-failure. That means that each drive will, on average, last a certain amount of time. In this case, each drive will last, on average, 1,000,000 hours. That means some will die sooner, some later, etc.

MTBF is the mean-time-between-failures. That means that the system of drives will, on average, have a certain period of time between failures. That number can be far lower than MTTF.

Also, the authors state that there is no infant mortality effect, yet the results of their Weibull analysis clearly point to infant mortality. It is commonly accepted in reliability analysis that a Weibull shape parameter of less than 1 indicates infant mortality. Some though would claim that a value of 0.71 is random. Either way, the system of drives is not exhibiting the wearout failure mode that they state.

MTTF indicates life, but MTBF doesn’t. MTTF generally will not vary with time, but MTBF does. Also, MTTF doesn’t vary with rates of installation or replacement, yet MTBF will.

It is very easy to confuse the two. Many on here have, many on slashdot have, and the paper’s authors have misunderstood as well.
Pipson on Friday, 23 February, 2007 at 12:27 pm

I disagree with Jessica’s statement that backup of a non-redundant RAID is easier on the drives than a rebuild (unless of course you don’t have to do a full backup). Moreover, in a production scenario, where uptime is important, offlining the RAID to perform a backup instead of rebuilding the array to regain redundancy is defeating the purpose of the system. I do agree with your comments on the importance of *regular* backups. This is where providing RPO and RTO that meet business needs is the ultimate failback.

I absolutely agree with magicalbob’s last two paragraphs on redundancy.

When talking about MTBF, not enough emphasis was put on the duty cycle of each system. In my experience SATA systems simply buckle under constant heavy IO load with drives popping just like popcorn. Under light to medium load I would expect SATA and SCSI/FC to show similarly lower failure rates. Then again I may be just the exception…
Pete on Thursday, 8 March, 2007 at 11:55 am

A few points to blunt the hysteria. Do any of you realize how long 1,000,000 hours is? A quick punch-up in a calculator shows that it is just over 114 years. Even if MTTF and MTBF estimates were off by 50%, that is still 57 years of 24/7 service. Somehow that doesn’t shake my faith in hard drives. Let’s also keep in mind that the price of drive storage has dropped steadily over the last 25 years. I remember when the cost of storage was over $100 per MB. Now that cost is about $.01 per MB. Excuse my math if I miscalculated but isn’t that a 10,000-fold drop?

Given the comparatively low cost of storage, aren’t RAID 1, 4 and 5 outdated now? Doesn’t RAID 10 give better performance and more redundancy? Given the lower cost of storage, isn’t it the epitome of cheap to still be using RAID 5 in server or SAN systems?

Let me also address the myth that “enterprise” drives are somehow better than “consumer” grade drives. Anyone who knew what they were speccing when designing storage systems knew damned well they weren’t paying for fewer failures, they were paying for performance. Faster spindle speeds, lower seek times, higher transfer rates, more write cache and higher throughput were the name of the game. Anyone thinking they were buying lower failure rates was on a fool’s errand.

I would also like to address that silliness that RAID is pointless because you still have a single point of failure in the controller. Well, Duh! That’s why the RAID config info is stored on the drives and not in the card or any other volatile memory. If a card fails, it can be replaced without losing data. Also keep in mind that RAID isn’t the end-all and be-all for data security. It is at best one piece of a comprehensive strategy that should include other things like backups, redundant storage and archiving.

Let me also put things into perspective with mfg published MTBF and MTTF rates. As I stated above 1,000,000 hours is 114 years. The published numbers are ESTIMATES based on predictive TESTING. If they were actually to run real-world tests on samples to get statistical numbers, we would still be putting 20MB MFM and RLL 5.25″ drives in our systems while we waited for manufacturers to complete their testing.

Let me take the opportunity to put it into geek speak since I just by chance, watched Star Wars the other night “..So you see Luke what I told you IS true… from a certain point of view.” Interpreting statistics is a fool’s game. They are guidelines based on a certain set of conditions and not facts.
Amos on Saturday, 17 March, 2007 at 2:59 am

So what practical software/filesystem can you recommend to implement such a file-redundancy setup? Or am I obliged to implement this in my applications?
clockwinder on Tuesday, 27 March, 2007 at 9:40 am

Permanent data storage?? The hard part is getting rid of stuff you no longer need. I have lived with failures of 9-track tape, DAT tape, Winchester technology drives, CD platters, 80-column punch cards, and punched paper tape. Information Week a number of years ago published a survey on longevity of storage media (not quite the same thing as disk drive longevity). Worst was cheap mag tape. Then hard disk. Then high-quality CD (guessed to be reliable for 50-75 years). Most reliable was acid-free paper, good for probably 500 years or more. In this case, we have actual examples!
Gigabytes per page? It depends… don’t throw the books away yet, folks!
From a cost perspective.. on Monday, 23 April, 2007 at 6:11 am

For those taking all this information/comments/thoughts into consideration for real world applications, some cost data to consider…

On a current “big iron” application, we made the change from SATA drives to Fibre drives before implementation this past year. Storage costs increased exactly 100% for the same amount of storage, not 400 to 600% as has been suggested. So, if you’re thinking of doubling or “tripling” up on SATA, look at the costs also.

Facility costs on “big iron” projects are huge. The costs to double, or triple, up the space to stand up SATA and the added costs for cooling these drives over a period of years can be staggering.

Now if you’re just looking at a simple “one for one” replacement of Fibre with SATA, with the same size of storage in the end, then it’s worth looking into because storage costs could be reduced by half.

As an example our costs could be reduced from $4 million to $2 million. I’ll be taking a look, and will have to make a complicated business decision.
Ted Fay on Sunday, 20 May, 2007 at 9:41 pm

Bob,

Of course I’m talking about data corruption due to bad blocks, and the fact that only drive-wide hardware failures were taken into account in this study is the basis of my point.

Robin tried to dismiss my point as being architectural and not real world, yet my whole point is that this study misses some critical aspects of real world experience. When you go to fetch data and you can’t get it because the blocks are bad, or you can’t rebuild a portion of the data after a failure because the blocks are bad, then whoever needed that data is going to consider it to be a failure, regardless of whether the RAID controller labels the disk as failed or not.

Data corruption = failure. Anyone who tells you different is trying to sell you something.

-ted
Ted Fay on Sunday, 20 May, 2007 at 9:56 pm

Anonymous,
Regarding your comment “Are you saying we should go back to the ST-506 for reliability?”

Of course not. Radically different technologies, as you know.

Packing twice the blocks on the same physical spindle as another drive built with the SAME TECHNOLOGY will and does result in twice the number of bad blocks for the same physical damage to, or imperfection in, the platter.

There is no free lunch, and you do indeed get what you pay for. It doesn’t show up in this study, because this study doesn’t take into account the primary advantage of enterprise disks, which is twice the physical media allocated to each block using the same platter technology as their consumer grade cousins.

Even if FC, SAS and SATA all do indeed have similar rates of failure for their mechanisms, which I wouldn’t doubt, if you’re willing to pay for RAID redundancy, why not media redundancy for the blocks on your platter?

Apart from the advantages on the controller board of FC or SAS, what you’re paying for is twice the safety of the data contained on those blocks. If you don’t care about what lives on those blocks, I guarantee you someone will when they go missing. 🙂

Just my two cents.

-ted
A Dutch Library on Friday, 1 June, 2007 at 3:53 am

Well, it’s a bit of a late reply given the date this discussion started, but I thought it couldn’t hurt to add my own advice. We’re all interested in making our data persistent, which is quite a challenge due to media deterioration and rapid media obsolescence. The topic interests me, and I’m currently graduating by performing research on it for a library that is interested in digital preservation. There are many difficulties with digital preservation, of which this particular one is just a minor (almost easy) part. I will spare you the whole reasoning behind my conclusion since it’s not yet finished (and there are probably limits to the text size that you can post :)), but the conclusion might be helpful to some of you:

A few assumptions:
-The target storage system needs to be able to contain 10 TB worth of data
-The storage system needs to be scalable
-The storage system needs optimal data security vs. costs (of course data triplication is nice, but most of us, libraries included, don’t have that much money)
-The storage system needs to be web-accessible
-The storage system needs to be disaster-proof

If you are searching for something that should fit these needs as well, this is probably your best solution:

Two separate servers stored at separate locations (cheapest way of avoiding data loss through disasters). Configure the first server for RAID5EE (hot spare integration) and the second for RAID60 (SAN). Use 500GB enterprise drives for your first server and 500GB nearline drives for the SAN. Make the first server back up daily to the SAN. Perform nightly checkdisks so you can determine when new spare drives should be ordered. And last, but not least, make sure you have the money to buy a whole new server environment within 7 years.

That isn’t anywhere near cheap, but it’s the most cost-effective, almost-100% guarantee for preserving your data. This configuration doesn’t necessarily have to be optimal for the next generation of hardware you will buy.

Perhaps no one is helped by this, but I’ll be happy if it helps someone. Just some (nearly off-topic) side points: for cheap home RAIDs, check the Intel Matrix RAID solution. For future archiving, pay attention to holographic storage development. I’ll spare you the other random findings of my study 🙂
wgh on Thursday, 23 August, 2007 at 9:47 pm

Joe Claborn said (on February 21st, 2007 at 6:41 am): Is this right? A MTBF of ‘only’ 300,000 hours translates in 34 years. Our disk drives seem to last about 3 years. Why the difference?
—
I’ve skimmed the above thread but didn’t see anyone note that MTBF (and to a degree MTTF) should be divided by the number of drives in your environment to estimate how often you’ll see a single drive within that environment fail. Yes, as you’ve mentioned, the MTBF numbers suggest 34 years to failure for one drive, but if you have 10 drives in your environment you can expect one of them to fail in about 3.4 years. It is just as when you have 10 men working construction: there’s 10 times the probability of one of them getting sick on any given day. When working in a “big iron” shop with thousands of RAID devices, this is (usually) taken into account.

Those who say to triplicate the data instead of using RAID appear to me not to be faced with needing up-to-date, accurate data available in one location, with no time available (due to SLAs) to restore or even to fail over to a separate set of drives. Many in mainframe environments have come to rely heavily on having no downtime to restore or fail over to other drives, unless the situation is very dire (of a disaster type).

If one were to “simply” have three copies, as someone suggested above, then which one do you update? All three? Doing so and waiting for validation of completion of I/O would typically cause response times on heavily I/O-burdened systems to degrade beyond acceptability. Not waiting on validation opens a window to potential corruption of any copies that were not being synchronously updated (synchronous updates are expensive).

Thus RAID. Yes, drives will fail and drives will be replaced. But a well laid out RAID array will still give the needed response times during failures, even at peak transaction time… again, I said if they’re “well laid out”. And yes, if the data is mission critical, such RAID arrays should be copied to another location… for the event of a disaster (including, at a minimum, lightning).
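The divide-by-drive-count rule above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes every drive fails independently at the same constant rate, which is the textbook reading of MTBF and not something field data necessarily supports.

```python
HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def years_to_first_failure(mtbf_hours: float, drive_count: int) -> float:
    """Expected years until the first drive failure in a fleet.

    Assumes independent drives with a constant failure rate, so the
    fleet as a whole fails drive_count times as often as one drive.
    """
    if drive_count < 1:
        raise ValueError("drive_count must be at least 1")
    return mtbf_hours / drive_count / HOURS_PER_YEAR


if __name__ == "__main__":
    # 300,000 hours is the MTBF figure quoted in the comment above.
    for drives in (1, 10, 1000):
        years = years_to_first_failure(300_000, drives)
        print(f"{drives:>5} drives: first failure expected after {years:.2f} years")
```

With one drive this gives roughly the 34 years quoted above, and with 10 drives roughly 3.4 years; at 1,000 drives the same arithmetic puts the first expected failure within a couple of weeks.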
Jered Floyd on Wednesday, 20 August, 2008 at 2:08 pm

Robin,

A bit of a late comment here, but I think what’s even more interesting than bogus MTBFs for drives is the interesting difference in bit error rate for SCSI/FC vs. SATA drives. I just wrote an article on this, Are Fibre Channel and SCSI Drives More Reliable? It turns out that they are, at least for RAID, and not for the reason you might suspect! I think there’s a false market segmentation going on here…

Jered Floyd
CTO, Permabit Technology Corp.
Kmann on Friday, 22 August, 2008 at 11:01 am

The Bianca Schroeder paper is excellent, but I saw something very interesting in the paper that seems to have gone unnoticed here:

Table 2. — “Node outages that were attributed to hardware problems broken down by the responsible hardware component.”

Component (HPC1)
CPU 44%
Memory 29%
Hard drive 16%
PCI motherboard 9%
Power supply 2%

Fully 82% of the failures were related to “solid state” components.

This in spite of the fact that the system population included 3,406 disks and 784 servers. DRAM was almost twice as likely to cause a failure and the CPUs were three times more likely to cause an outage. Moreover, 784 motherboards produced 9% of failures while 3,400 disks produced only 16%.

And this is a very high-end system, presumably “top-shelf” DRAM, CPU and motherboard components.

Also, from the text:

“…we have analyzed failure data covering any type of node outage, including those caused by hardware, software, network problems, environmental problems, or operator mistakes. The data was collected over a period of 9 years on more than 20 HPC clusters and contains detailed root cause information. We found that, for most HPC systems in this data, more than 50% of all outages are attributed to hardware problems… Consistent with the data in Table 2, the two most common hardware components to cause a node outage are memory and CPU.”

So much for the myth of “solid state” reliability.

For some perspective, while CPU makers stopped publishing MTBF many years ago, and DRAM manufacturers have to my knowledge never published them, most motherboard manufacturers do publish — typically in the 100,000 hour range. So…if 784 motherboards produced 9% of failures, and 3,400 disks only produced 16%, then it seems that perhaps the numbers published by the disk drive makers are, in relative terms, not so wildly off the mark. It would appear (from a system/sub-system perspective) that disks are relatively much more reliable than the “solid state” components.
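The per-unit comparison behind that argument can be worked out directly from the figures quoted in this comment. The sketch below is only a rough normalization: the dictionary and function names are made up for illustration, and it counts node outages only, so it says nothing about component failures that did not take a node down.

```python
# Share of hardware-related node outages, as quoted from Table 2 (HPC1).
outage_share = {"motherboard": 0.09, "hard drive": 0.16}

# Unit counts as quoted in the comment above.
unit_count = {"motherboard": 784, "hard drive": 3406}


def share_per_unit(component: str) -> float:
    """Fraction of hardware outages attributable to one unit of a component."""
    return outage_share[component] / unit_count[component]


# How much more often a single motherboard shows up as the cause of an
# outage than a single disk does.
ratio = share_per_unit("motherboard") / share_per_unit("hard drive")

if __name__ == "__main__":
    print(f"per-unit outage share, motherboard vs. disk: {ratio:.1f}x")
```

On these numbers a single motherboard is blamed for an outage about two and a half times as often as a single disk.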

I wonder how people would react if they actually knew the MTBF numbers on stuff like DRAM and CPUs? Perhaps we should all remember that silicon DOES “wear out” (in a manner of speaking).

All this makes me wonder why everyone assumes that Flash SSD is going to be so much more reliable than other silicon. Are we to believe the ridiculous MTBF claims of the SSD makers (Intel sez 2,000,000 hrs), given the numbers on DRAM?

It will be interesting to see the results on the first large-scale deployments of flash-SSD. Unfortunately it will probably be five or more years that the “free ride” for SSD continues before folks begin to realize that solid-state is not necessarily more reliable than mechanical disks…and very frequently (in the case of DRAM and CPUs) less reliable!
Tracy Valleau on Thursday, 12 February, 2009 at 10:00 pm

I often get asked about MTBF (Mean Time Between Failure) and it’s amazing how many “industry people” don’t understand it.

And for those who have already figured out that their 1.5M MTBF drives don’t last 150 years, but are not sure what that MTBF thing is… here’s a quickie:

Why your hard drive doesn’t last 150 years.

(There are about 8700 hours in a year, but to make this example simple, let’s call it 10,000.)

Here’s how MTBF works: it’s an aggregate of many units based on expected life of a single unit.

Let’s say you have a hard drive that is warranted to last 3 years, or 30,000 hours.

You put it in a server, and behold, it lasts 3 years. You take it out and put in a new one, and that also lasts 3 years. So you replace it with a new one, and that too…. well, you get it.

Let’s say you keep doing that and finally, on the 50th unit, only two years into its life, it breaks.

You now have 3 years or 30,000 hours per unit, times 50 units = 1,500,000.

And that’s your MTBF.

So anyone who says “Wow! MTBF of 1.5 million hours! That means this thing will last (1.5M / 10000) 150 years!” -clearly- doesn’t know what they’re talking about.

(MTBF is more complex than my example, including “infant mortality” and “wear out” phases; “theoretical” vs “operational” MTBF and so on, but the gist of what’s here is correct.)
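For anyone who prefers to see the arithmetic, the example above can be restated as a short Python sketch. It treats MTBF as total unit-hours divided by the number of failures, and then converts it to a yearly failure probability under a constant-failure-rate assumption. The function names are illustrative, and the constant rate is a simplification that ignores the infant mortality and wear-out phases just mentioned.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def mtbf_hours(total_unit_hours: float, failures: int) -> float:
    """MTBF as a population statistic: hours run by all units per failure."""
    if failures < 1:
        raise ValueError("need at least one observed failure")
    return total_unit_hours / failures


def annual_failure_rate(mtbf: float) -> float:
    """Chance that one unit fails within a year, assuming a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf)


# The example above: 50 units at roughly 30,000 hours each, one failure.
mtbf = mtbf_hours(50 * 30_000, failures=1)
afr = annual_failure_rate(mtbf)

if __name__ == "__main__":
    print(f"MTBF: {mtbf:,.0f} hours")
    print(f"Implied annual failure rate: {afr:.2%}")
```

Read this way, a 1.5 million hour MTBF is a statement about a population: somewhat more than one drive in every 200 would be expected to fail in a given year, not that any single drive runs for 150 years.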

Cordially,

Tracy Valleau

“Don’t believe everything you think.”
