StorageMojo




Robin Harris    


Everything You Know About Disks Is Wrong

February 20th, 2007 by Robin Harris in Clusters, Enterprise

Update II: NetApp has responded. I’m hoping other vendors will as well.

Which do you believe?

  • Costly FC and SCSI drives are more reliable than cheap SATA drives.
  • RAID 5 is safe because the odds of two drives failing in the same RAID set are so low.
  • After infant mortality, drives are highly reliable until they reach the end of their useful life.
  • Vendor MTBFs are a useful yardstick for comparing drives.

According to one of the “Best Paper” award winners at FAST ’07, none of these are backed by empirical evidence.

Beyond Google
Yesterday’s post discussed a Google-authored paper on disk failures. But that wasn’t the only cool storage paper.

Google’s wasn’t even the best: the paper by Bianca Schroeder of CMU’s Parallel Data Lab, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, won a “Best Paper” award. (BTW, Dr. Schroeder is a post-doc looking for an academic position - but if I were Google or Amazon I’d be after her in a big way.)

Best “academic computer science” paper
So it is very heavy on statistics, including some cool techniques like the “auto-correlation function”. Dr. Schroeder explains:

The autocorrelation function (ACF) measures the correlation of a random variable with itself at different time lags l. The ACF, for example, can be used to determine whether the number of failures in one day is correlated with the number of failures observed l days later.

Translation: ever wonder if a disk drive failure in an array makes it more likely that another drive will fail? ACF will tell you.
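For the curious, here is a minimal sketch of the sample autocorrelation the quote describes, applied to a made-up series of daily failure counts. The function name `autocorrelation` and the `daily_failures` numbers are my own illustrations, not anything from the paper, and the paper's exact estimator may differ.

```python
def autocorrelation(counts, lag):
    """Sample autocorrelation of a series with itself at the given lag."""
    n = len(counts)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(counts) - 1")
    mean = sum(counts) / n
    variance = sum((x - mean) ** 2 for x in counts)
    if variance == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Covariance between the series and a copy of itself shifted by `lag`.
    covariance = sum(
        (counts[i] - mean) * (counts[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# Hypothetical daily failure counts for one cluster (invented data).
daily_failures = [0, 0, 3, 4, 2, 0, 0, 0, 1, 0, 0, 5, 3, 2, 0, 0, 0, 0, 2, 3]

for lag in (1, 2, 7):
    print(f"lag {lag}: {autocorrelation(daily_failures, lag):+.2f}")
```

A value near zero at a given lag suggests failures that many days apart are unrelated; a clearly positive value suggests failures cluster in time.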

She looked at 100,000 drives
Including HPC clusters at Los Alamos and the Pittsburgh Supercomputing Center, as well as several unnamed internet service providers. The drives had different workloads, different definitions of “failure” and different levels of data collection so the data isn’t quite as smooth or complete as Google’s. Yet it probably looks more like a typical enterprise data center, IMHO. Not all of the data could be used to draw all of the conclusions, but Dr. Schroeder appears to have been very careful in her statistical analysis.

Key observations from Dr. Schroeder’s research:
High-end “enterprise” drives versus “consumer” drives?

Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors.

Maybe consumer stuff gets kicked around more. Who knows?

Infant mortality?

. . . failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation.

Dr. Schroeder didn’t see infant mortality - neither did Google - and she also found that drives just wear out steadily.

Vendor MTBF reliability?

While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs by dataset and type, are by up to a factor of 15 higher than datasheet AFRs. Most commonly, the observed ARR values are in the 3% range.

Actual MTBFs?

The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

In other words, that 1 million hour MTBF is really about 300,000 hours - about what consumer drives are spec’d at.
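The arithmetic behind that figure can be sketched in a few lines. This assumes the usual back-of-envelope conversion (annual rate is roughly hours per year divided by MTTF); the function names are mine.

```python
HOURS_PER_YEAR = 8760  # 24 * 365


def mttf_to_afr(mttf_hours):
    """Approximate annualized failure rate implied by an MTTF in hours."""
    return HOURS_PER_YEAR / mttf_hours


def afr_to_mttf(afr):
    """Approximate MTTF in hours implied by an annualized failure rate."""
    return HOURS_PER_YEAR / afr


datasheet_afr = mttf_to_afr(1_000_000)   # the datasheet's 1,000,000 hours
observed_arr = 3.4 * 0.0088              # 3.4 times the 0.88% datasheet rate
implied_mttf = afr_to_mttf(observed_arr)

print(f"datasheet AFR: {datasheet_afr:.2%}")
print(f"observed ARR:  {observed_arr:.2%}")
print(f"implied MTTF:  {implied_mttf:,.0f} hours")
```

Note that an MTTF of roughly 300,000 hours (about 34 years) does not mean a drive lasts 34 years. It means that in a large population, around 3% of the drives get replaced each year.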

Drive reliability after burn-in?

Contrary to common and proposed models, hard drive replacement rates do not enter steady state after the first year of operation. Instead replacement rates seem to steadily increase over time.

Drives get old, fast.

Data safety under RAID 5?

. . . a key application of the exponential assumption is in estimating the time until data loss in a RAID system. This time depends on the probability of a second disk failure during reconstruction, a process which typically lasts on the order of a few hours. The . . . exponential distribution greatly underestimates the probability of a second failure . . . . the probability of seeing two drives in the cluster fail within one hour is four times larger under the real data . . . .
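To see what the exponential assumption predicts, here is a rough sketch of the textbook calculation the quote is criticizing. The array size, rebuild time and MTTF values are hypothetical, and the point of the paper is that real-world numbers come out worse than this model says.

```python
import math


def p_second_failure(surviving_drives, mttf_hours, rebuild_hours):
    """Probability that at least one surviving drive fails during a rebuild,
    assuming independent, exponentially distributed failure times."""
    combined_rate = surviving_drives / mttf_hours  # failures per hour
    return 1.0 - math.exp(-combined_rate * rebuild_hours)


# Hypothetical 8-drive RAID 5 set: one drive has failed, 7 remain,
# and the rebuild takes 6 hours.
for label, mttf in (("datasheet, 1,000,000 h", 1_000_000),
                    ("field estimate, 300,000 h", 300_000)):
    p = p_second_failure(surviving_drives=7, mttf_hours=mttf, rebuild_hours=6)
    print(f"{label}: {p:.6%}")
```

Even before accounting for correlated failures, plugging in the field-derived MTTF instead of the datasheet value raises the estimate by about the same factor of 3.4.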

Independence of drive failures in an array?

The distribution of time between disk replacements exhibits decreasing hazard rates, that is, the expected remaining time until the next disk was replaced grows with the time it has been since the last disk replacement.

Translation: one array drive failure means a much higher likelihood of another drive failure. The longer since the last failure, the longer to the next failure. Magic!
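One common way to model a decreasing hazard rate is a Weibull distribution with a shape parameter below 1. The sketch below is only an illustration of that idea: the shape and scale values are assumptions I picked, not figures from the quoted text.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


SCALE_HOURS = 1000.0  # hypothetical scale parameter

for hours_since_last in (1, 10, 100, 1000):
    decreasing = weibull_hazard(hours_since_last, shape=0.7, scale=SCALE_HOURS)
    constant = weibull_hazard(hours_since_last, shape=1.0, scale=SCALE_HOURS)
    print(f"{hours_since_last:>5} h since last replacement: "
          f"shape 0.7 -> {decreasing:.5f}/h, shape 1.0 -> {constant:.5f}/h")
```

With shape 1.0 the Weibull reduces to the exponential model and the hazard never changes. With shape 0.7 the hazard is highest right after a replacement and falls off as time passes, which is the behavior the quote describes.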

Big iron array reliability is illusory
One implication of Schroeder’s results is that big iron arrays only appear more reliable. How? Using smaller “enterprise” drives means that rebuilds take less time. That makes RAID 5 failures due to the loss of a second disk less likely. So array vendors not only get higher margins from smaller enterprise disks, they also get higher perceived reliability under RAID 5, for which they also charge more money.

The StorageMojo take
After these two papers neither the disk drive nor the array business will ever be the same. Storage is very conservative, so don’t expect overnight change, but these papers will accelerate the consumerization of large-scale storage. High-end drives still have advantages, but those fictive MTBFs aren’t one of them anymore.

Further, these results validate the Google File System’s central redundancy concept: forget RAID, just replicate the data three times. If I’m an IT architect, the idea that I can spend less money and get higher reliability from simple cluster storage file replication should be very attractive.
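As a toy illustration of the replicate-three-times idea, here is a sketch that writes a blob to three directories standing in for three storage nodes and reads back the first copy whose checksum matches. This is not how the Google File System is implemented; the function names, the directory layout and the use of SHA-256 are all my own assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def replicate(data, name, replica_dirs):
    """Write `data` under `name` in every replica directory; return its SHA-256."""
    digest = hashlib.sha256(data).hexdigest()
    for directory in replica_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        (directory / name).write_bytes(data)
    return digest


def read_first_good(name, replica_dirs, digest):
    """Return the first replica whose checksum matches `digest`."""
    for directory in replica_dirs:
        try:
            data = (Path(directory) / name).read_bytes()
        except OSError:
            continue  # replica missing or unreadable, try the next one
        if hashlib.sha256(data).hexdigest() == digest:
            return data
    raise OSError("no intact replica found")


with tempfile.TemporaryDirectory() as root:
    dirs = [Path(root) / f"node{i}" for i in range(3)]  # stand-ins for 3 nodes
    digest = replicate(b"important data", "blob.bin", dirs)
    (dirs[0] / "blob.bin").unlink()                  # simulate a dead drive
    (dirs[1] / "blob.bin").write_bytes(b"garbage")   # simulate silent corruption
    recovered = read_first_good("blob.bin", dirs, digest)

print(recovered == b"important data")
```

A real system would put each replica on separate hardware and re-create lost copies in the background; the sketch only shows why two bad copies out of three need not lose data.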

Comments welcome, especially from disk drive and array vendors who dispute these conclusions. Moderation turned on to protect the innocent.

Update: Garth Gibson’s name is also on the paper. Since he is busy as a CMU professor and CTO of Panasas, I hope he’ll pardon me for assuming that Dr. Schroeder deserves most of the credit.

97 Responses to ' Everything You Know About Disks Is Wrong '

  1. ugh said,

    on February 20th, 2007 at 7:09 pm

    great… my $80k SAN is doomed to fail.


  2. on February 20th, 2007 at 7:16 pm

    [...] and one from Bianca Schroeder from the Carnegie Mellon University: Everything You Know About Disks Is Wrong links to the complete studies are in the articles that are linked above. [...]


  4. on February 20th, 2007 at 7:45 pm

    One word:

    zfs :-)

  5. Farhad said,

    on February 20th, 2007 at 8:01 pm

    So, what are the physical problems with the notion of ‘permanent’ storage? Will we ever have a medium that can store data reliably and permanently? or will we always have mtbf, mttf and failures and replacements?

  6. hymieg said,

    on February 20th, 2007 at 8:05 pm

    In my own experience, one of the best life extending techniques for hard disks is simple - don’t power them off. The systems where I experience the lowest failure rate are consistently the systems that are up 24/7. I have always thought that the biggest killer of drives is the initial power up - the bearings take a bit of a beating getting up to 7200 rpm and beyond from 0 rpm in a hurry.


  7. on February 20th, 2007 at 8:15 pm

    [...] As queued by Slashdot, I found an excellent summary (thanks StorageMojo) of a paper presented at the USENIX conference. The paper, which won the Best Paper Award, is titled, Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?. [...]


  8. on February 20th, 2007 at 8:18 pm

    [...] More people are picking up the drive story. I expect to hear rebuttals any time now from the big expensive disk players. FWIW: we have been talking about this for a while. Lots of our partners have observed these things. MTBF is a great way to estimate things. The model appears to be broken, as it is not 5-20% off. But 5-10x off. [...]

  9. sooth_sayer said,

    on February 20th, 2007 at 8:51 pm

    Get a life .. and some understanding of science.

    Your prof. needs to learn 101 of reliability.

    Every spec. is with certain conditions .. and to say one number is “false” without a careful setup of “other factors” is yellow journalism .. but I guess content is the king and facts are a secondary artifact of history

  10. mustard mom said,

    on February 20th, 2007 at 9:14 pm

    “forget RAID, just replicate the data three times”

    Vindication at last!

    I have been saying this for years where I work. Not to mention RAIDs have a central point of failure at the controller. Saw this happen last year and the RAID was not backed up due to the perceived reliability.

  11. p-money said,

    on February 20th, 2007 at 9:26 pm

    her paper includes LOTS of data. read it and draw your own conclusions about reliability if you want, but don’t claim her data is wrong without offering a similar experiment of your own. what facts are you offering that are more reliable than her tests?

    you’re right - every spec IS with certain conditions. and if those conditions are nothing like what is encountered in the real world, they will mislead buyers. notice that the title includes: “What Does an MTTF of 1,000,000 Hours Mean to You?” it does not say “What does an MTTF of 1,000,000 hours mean in a highly environmentally controlled clean room?”

    i think the conclusion that can be drawn here is that the drive manufacturer’s claims are not stark facts but are misleading at best and fabricated at worst.

  12. Creditor said,

    on February 20th, 2007 at 9:34 pm

    Google invented replication? GoogleFS is pretty standard replication from way way way back.

  13. fairly_reliable said,

    on February 20th, 2007 at 9:35 pm

    Re: Reliability
    We can either take drive manufacturer’s word for it, or we can go look at real world reliability. The numbers we are given on datasheets reflect certain conditions - which often don’t exist in our datacenters. The point of the paper is that in the REAL WORLD, drives fail more often than under ideal conditions. Seems a no-brainer when you put it that way.

  14. Some Other Professor said,

    on February 20th, 2007 at 9:38 pm

    When a paper has two authors, you give them both credit. To assume that the professor on the author list is “busy” and therefore give more credit to the other author is odd and certainly misleading. Ugh.

    - A Professor

  15. Creditor said,

    on February 20th, 2007 at 9:42 pm

    Google didn’t invent cluster replication, it’s been around since the beginning.

    Both studies sure do expose a lot of lies tho :)

  16. A Student said,

    on February 20th, 2007 at 9:49 pm

    Yes, you should give the professor credit. Without his name, the paper would not have made it into the conference to begin with. And there are also probably one or two sentences that were reorganized because the professor read them! :)

  17. Ted Fay said,

    on February 20th, 2007 at 9:54 pm

    SATA drives pack far more blocks onto a single platter than a FC or SCSI drive, so they are inherently more prone to losing multiple blocks rather than a single bad block. This means that SATA drive failures, all things in physics being equal, are of necessity of higher probability, and also carry a higher danger of losing data in a RAID 5 set, as losing a larger number of blocks can lead to data loss without having a complete drive failure.

    3-way mirrors are definitely one of the most reliable and robust forms of data security, but make no mistake, FC and SCSI drives are WITHOUT QUESTION more reliable. FC and SCSI drives may not match the MTBF specs, but I doubt SATA drives do either, particularly where bad blocks are concerned.

    It is nonsense and contrary to physics for a platter that packs more than double the blocks in the same space to be as reliable as one that packs in only half. The platter would have to be twice as robust, and I doubt even Seagate would make that claim.

  18. exHDinsider said,

    on February 20th, 2007 at 10:18 pm

    I used to work in the drive industry in the late 80’s through the late 90’s. I can tell you as an insider that ALL of the reliability stats from that time period were bogus. Small sample sizes, extrapolating data from just a week long start-stop experiment, not doing 100% testing of parts before and/or after assembly because a process was thought to be 6-sigma, etc.

  19. Poltsi said,

    on February 20th, 2007 at 10:21 pm

    Regarding Farhad’s question:

    When you can suspend the enthropy from happening, then you can have your permanent data storage.

  20. Not A Professor Yet said,

    on February 20th, 2007 at 10:21 pm

    In general, I give the junior author credit for the detail, the senior author credit for the gist. The ideas and spin generally come top-down, but the grunt work and analysis definitely come up from the bottom.

    -Grad student


  21. on February 20th, 2007 at 10:32 pm

    Probably the best form of semipermanent data storage is an obelisk. But this is not a terribly convenient form factor

  22. rob said,

    on February 20th, 2007 at 10:44 pm

    so I’ve been ranting on at some of the larger storage providers for a couple of years now. im a big customer and they just will not listen

  23. Robin Chauhan said,

    on February 20th, 2007 at 10:49 pm

    Poltsi, don’t you mean “empathy”?

  24. rob said,

    on February 20th, 2007 at 10:50 pm

    who cares if sata is technically not as reliable as FC or SCSI. it’s so much cheaper you can build in more redundancy for a much cheaper price. formula is simple. sata 1/6 of the price as high perf FC running at 1/2 the speed, vendor says it’s 1/3 as reliable, so I can halve my disk cost by putting in sata and increase general performance cause I’ve got three times the spindles and only half the speed… but vendor says this won’t be supported in 24/7 service cycle. bollocks number 1.

  25. Nathan Myers said,

    on February 20th, 2007 at 10:53 pm

    I would interpret the decreasing hazard rates the other way: if one drive in a box fails, expect another drive in the box to fail soon after. It makes sense considering they all experience the same conditions, and were probably from the same lot, subject to the same manufacturing defects.

    It’s disappointing that they didn’t record mfr/model/serial numbers of failed drives. It’s very common in other fields (munitions, particularly, and aircraft parts) to recall all members of a lot when one fails.

  26. Anonymous said,

    on February 20th, 2007 at 11:12 pm

    “SATA drives pack far more blocks onto a single platter than a FC or SCSI drive, so they are inherently more prone to losing multiple blocks rather than a single bad block.”

    Are you saying we should go back to the ST-506 for reliability?

  27. rob said,

    on February 20th, 2007 at 11:15 pm

    bollocks number two. high perf IO is really really expensive cause you need to buy shedloads of spindles to support it, I’m talking like an app that needs in excess of 5000 IO ops per sec. solid state flash is now at a cost point where it is 5 times cheaper to get that performance in flash than across spinning disk spindles. great you think. but no. Not one major disk vendor is looking to provide solid state in the big chassis.

  28. Robin Harris said,

    on February 20th, 2007 at 11:34 pm

    Too many great comments to respond to right now. Thanks!

    I do take issue with Ted Fay’s comment though. I don’t put much stock in architecture arguments because I’ve seen them fail to account for the real world too often. I’m not a disk engineer, yet I suspect the reason “enterprise” disks have lower densities is because of their higher speed, not because of some reliability issue.

    As Bianca’s study shows, there was *no difference* in reliability between FC and SATA drives in the environments she examined. So, Ted, if those drives are so much more reliable, how come that didn’t show up in the data?

    Finally, the choice isn’t between SSD and fast disks. Multi-port cache is the technology that gives most arrays their performance, especially on writes. In lieu of actual knowledge, most customers of big iron arrays just say “optimize everything” so they can sleep at night. They don’t KNOW what they need, and the industry has been very careful not to figure it out for them. I’m not throwing rocks - I’m an ardent capitalist - but as things continue to scale out it starts to make sense to figure out, as Bianca has, what is really important.

    Robin

  29. zeropointburn said,

    on February 20th, 2007 at 11:38 pm

    Robin,
    He does mean enthalpy. In other words, entropy, random failure, degradation over time. Ordered systems are susceptible to degradation because of entropy. It’s a basic physical property that cannot be avoided (to the best of our current understanding).
    If there were a true breakthrough in that regard, much of physics as we know it would be rendered invalid, and it would be possible to store data indefinitely.

  30. rob said,

    on February 20th, 2007 at 11:47 pm

    robin, yes cache does do that, except cache has the issue that it becomes flooded much quicker than the actual spindle, especially in a highly intensive IO environment, which is what i’m talking about.

  31. subspawn said,

    on February 21st, 2007 at 12:26 am

    Ah, just buffer with a load of nvram or just plain battery backed dram for that matter. High I/O loads are the number one reason for a SAN imho, the huge I/O buffer offers great performance.

    Why no one has written such an application for linux yet puzzles me (I would like to, but my C knowledge is terrible). If you could take that combined with cheap redundant (x3 or so) SATA storage boxes… HP/Netapp/Symmetrix/… are all out of business :)


  32. on February 21st, 2007 at 12:54 am

    Farhad wrote “Will we ever have a medium that can store data reliably and permanently? or will we always have mtbf, mttf and failures and replacements?”

    In the context of long-term information storage, reliable media is *not* the problem. Sufficiently reliable products exist today. In my opinion, the development of even longer-lasting media is a huge waste of resources.

    I’ll give you two reasons:

    1. Information Refresh - The increasingly rapid evolution of business applications, databases and file formats undermines the long-term resuscitability of information assets. Companies must develop a plan for ensuring that long-term information assets are always in a usable form/format.
    2. Technology Refresh - The long-term viability of any media depends heavily upon the availability of an appropriate interface….ST506 anyone?

    In short, if you can’t read the media, and/or you can’t make sense of the information stored on it, using an HDD that lasts 10 years or a tape that lasts 50 isn’t going to matter.

  33. Karl O. Pinc said,

    on February 21st, 2007 at 1:13 am

    Isn’t manufacturer MTBF computed on the basis of probability of failure during the warranty period? No wonder real world tests show different MTBF values, many of the failed drives were out of warranty.

  34. Bob said,

    on February 21st, 2007 at 1:27 am

    @TedFay: The density of blocks on the drive will have nothing to do with whether the drive hardware fails. I think you’re thinking of data corruption which is another issue.

  35. chronus said,

    on February 21st, 2007 at 1:44 am

    enthalpy: The sum of the internal energy of a body and the product of its volume multiplied by the pressure. This seems to have little to do with our topic


  37. on February 21st, 2007 at 2:14 am

    [...] According the one of the “Best Paper” awards at FAST ‘07, none of these are backed by empirical evidence. StorageMojo has a good summary of the paper’s key points [...]


  38. on February 21st, 2007 at 2:48 am

    StorageMojo ? Everything You Know About Disks Is Wrong…

    I’ve said it before and I will continue saying so Raid Doesn’t work.

    Now StorageMojo reports Everything You Know About Disks Is Wrong based on the Usenix Paper
    Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?…

  39. Corley Kinnane said,

    on February 21st, 2007 at 3:01 am

    Statistics are only a part of the story, usually the boring part.

    Regarding RAID5, the conclusion seems to be based on the assumption that people trust RAID5 too much - not the reality of it. RAID5 is a good solution, triplication is better one, by principle, not by statistics.

    Of course, many trade off the security of triplication with the value and efficiency of RAID5. These days, HDs are cheap enough to consider options like triplication - it’s not really a revelation many could afford years ago.

    On the issue of RAID5 having an increased chance of secondary failure during a rebuild - it’s clearly mostly related to the degree of independence between drives.
    Thanks to the stats, we know that there is no point in a drive’s age when it is much more likely to die than any other time - just a high level of early failures, some semi plateau and then a steady decline in reliability after that.
    That means that this increase in clustered failures must be due to influences external to the drive but not to the array - power cycles and usage patterns.
    That could mean your RAID5 array could fail during a rebuild if it doesn’t normally see that kind of usage - or if you use a cheap SATA solution that requires a power cycle before a rebuild.
    Truth is, those factors apply to any array, not the concept of RAID5.
    That means they also apply to a triplicated cluster, unless you can afford to force more independence by physically separating the cluster components.

    Buy an array or cluster in one shot and it lives for years - all components lose reliability equally, doesn’t matter if you have a RAID5 or triplication, this will threaten your system as if you were comparing a new RAID5 array with a new triplication cluster.

    If you are concerned about the reliability of your RAID5 controller, use software RAID (decent software of course).

    You’re probably best off duplicating RAID5 arrays in separate locations.

    Corley.

  40. w said,

    on February 21st, 2007 at 3:06 am

    enthalpy is certainly not entropy. It (enthalpy) is better understood as available thermal energy (from thermal energy and flow work).

    Entropy is a basic physical property, but to some extent, it can be mitigated. It simply provides that thermodynamic processes will tend to go in a certain direction (2d law of thermo) — something that can be overcome by application of restorative work.

    The real relationship that this work has to entropy is that both are statistical phenomena describing “one-way” processes. (Be it globally increasing entropy or globally increasing failed drives…)

    -w

  41. kevin said,

    on February 21st, 2007 at 3:08 am

    RAID has two advantages:
    (1) a single HD failure doesn’t halt the server
    (2) the replication is latency free
    So the optimum setup is RAID 1 on main server and then regular replication to _at least_ two backup servers across fast network.

  42. magicalbob said,

    on February 21st, 2007 at 3:08 am

    At last, factual field results from the coalface. Somewhat akin to the realisation a few years ago that real cars don’t crash into brick walls, and hence a major scramble to provide real safety ratings for vehicles.

    Redundancy is the key, preferably on a different set of disks, from a different batch, from a different factory in different locations. (Don’t want 20 Fujitsu MPG320 drives in a raid array :P).

    Having done a lot of experimental field work, the following motto holds true.

    One is none
    Two is one
    Three is two.

    Meaning if you want 1 reliable backup, you need 3 systems.

  43. Ben said,

    on February 21st, 2007 at 3:31 am

    (Hardware driven) RAID5 brings more weak points to the system.

    The controller is (as someone noted above), materially, a single point of failure by itself.

    And it involves complex software with its own set of bugs, hazardous upgrade paths, compatibility problems, … :
    - the controller firmware,
    - the controller driver,
    - the OS’ raid stack,
    - the online management utilities

    In my (short) experience, RAID controller ROMs and drivers are even more troublesome than disk drives (at least, they bring trouble sooner).

  44. Bram said,

    on February 21st, 2007 at 3:42 am

    So how does one “replicate data 3 times” in practice without a raid controller?


  45. on February 21st, 2007 at 4:01 am

    If you can’t trust the storage vendors ……

    … isn’t that another reason to go with massively parallel systems?
    StorageMojo has a great post on storage myth and reality.

    ……


  47. The Thinker said,

    on February 21st, 2007 at 4:58 am

    Unfortunately, this paper is severely flawed. Similar to the Google paper, it is written by academics with little understanding of the subject matter, but a strong desire to publish lengthy papers.

    To write a meaningful paper, there is a lot of data about the drives and the systems they are used in that needs to be collected. These are initial conditions and operating conditions that any real system scientist will tell you cannot be ignored (to say the least). One cannot look at drives in the abstract, but must look at many details of how they are used, including the storage systems they are part of.

    Google, to their credit, did collect the SMART measurements. That is a good start, but not sufficient data to support the conclusions of the Google paper.

    For example, the orientation of each drive needs to be taken into account. What percentage of the drives analyzed were mounted horizontally vs. vertically? How were the drives themselves mounted? Specific mounting techniques result in a greater incidence of particular failure patterns. How were the drives cooled? Particular cooling techniques similarly result in specific failure patterns. What sort of data usage patterns were in use? What levels of RAID were used across the various drives?

    I see no measurements of vibration in this paper. Drive orientation and drive vibration (including system-based vibration) are two factors that are very important in determining drive reliability. Drives have a certain resistance to vibration (and shock) that varies based on the directionality of the vibration.

    We also see no meaningful treatment of the conditions for the HPC1, COM1, and COM2 systems. In HPC1 and COM1 we see massive failure levels for memory, likely indicating severe heat problems in those systems. In the COM2 system, we see a very high incidence of motherboard failure, again most likely indicating heat problems (or possibly bad caps). Specific heat conditions are operating conditions for drives that must be taken into account. Maybe the early onset of wear-out degradation is at least in part due to heat?

    I have merely touched on several important elements of study that were neglected in both papers. To gain a real understanding of drive failure in the “real world”, real and comprehensive data is needed first. Otherwise we are dealing with merely variations on the “GIGO / Garbage In Garbage Out” theme.

    Also, I see a number of irrational conclusions being put forth by readers — no value in RAID just replicate your data 3 times? This sounds a bit like how to get home from Oz. It works in the movies. But it doesn’t work as well in real life.

    RAID1 is a very solid solution for many businesses (and their corresponding data usage models), especially if there is a hot spare on the system as well. Many studies have shown the business value of the simple, transparent, low cost redundancy that RAID1 delivers. Even simple probability theory will tell you that RAID1 has clear potential for reliability improvements (that are well measured and proven in the real world).

    I see a lot of analysis of RAID5 which people in the real world know is not a good choice for data that matters. There is no sane recovery procedure for RAID5. The drive access patterns tend to result in a lot of vibration as well.

    Overall, I am disappointed that with all the investment that large organizations make in purchasing and deploying storage, they seem to have no one in their organization that (1) understands the mechanics and physics of even a single disk drive, (2) understands the concept of initial conditions, (3) understands the concept of operating environment/conditions (4) has the willingness to make actual measurements vs. barf up a bunch of hearsay, and (5) truly wants to understand the reliability of storage systems vs. take pot shots at the drive industry.

    Each of these papers, CMU’s and Google’s is incomplete. There is not enough data to support the conclusions. There is not even enough data to support almost any conclusion beyond the basic observation, “drives fail, some days more than others.”

  48. Richard said,

    on February 21st, 2007 at 5:46 am

    Quoting … “Google File System’s central redundancy concept: forget RAID, just replicate the data three times”.

    Google File System relies on inexpensive “commodity” motherboard hardware and is specifically designed for their internal requirements.

    I suggest that Google’s concept is not entirely driven by the need for reliability, as a top requirement.

    Google File System’s need for speed is probably much higher on the list and such triplication of data is a great bonus, especially under reads.

    A typical ‘motherboard’- based node can only support a small number of disks, making such small disk configurations inefficient under RAID 5 / 6 algorithms. Software RAID 5 running on ‘bare’ PC hardware is very slow & RAID 6 is a lot worse. Not a lot of choice here…. run mirrors.

    A typical low cost ‘motherboard’ environment is a tangle of cabling and a typical PC power supply may not be at its best (i.e. MTBF of 100,000 hrs ) when loaded by high peak power profiles required by disks….this is much worse than those spinning disks.

    In fact, I suggest that we may all get very very depressed by the *real* MTBF of a complete Google ‘commodity’ system solution … power & cabling included. This will be further degraded by inter-node cabling, etc… Does Google have any figures on this…. probably not.

    Ignoring the initial cost of disks…. has anyone calculated the running cost of such ‘triplicate’ system… with ‘underutilized’ power hungry motherboards included … and the cost of replacement parts. They do care about power, I am sure…. but may be locked-in.

    How does …say a 1000 disk solution …. compare against an equivalent centralized big-iron …. or any other well designed multiple RAID solution.. ? I am sure that Google File System can run with inexpensive RAID backends…. is this worse in terms of TCO ?

    In general, RAID 6 is available ( even from EMC recently) if you are worried by additional disk failures during data reconstruct time … and these are not getting shorter. Disks come with a multi-year warranty…and as someone suggested … if reliability is important, forget about the MTBF and change the lot at the end of the warranty period.

    I am sure that Google strategy relies *a lot* on warranty period of the various “commodity” components… not just the disks, which seem to be better than the rest.


  49. on February 21st, 2007 at 5:56 am

    [...] Everything You Know About Disks is Wrong [...]

  50. Joe Claborn said,

    on February 21st, 2007 at 6:41 am

    Is this right? A MTBF of ‘only’ 300,000 hours translates in 34 years. Our disk drives seem to last about 3 years. Why the difference?
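The arithmetic in this question can be checked directly. The sketch below assumes a constant failure rate (the exponential model that vendor MTBF figures are usually read with) and a drive that runs 24/7; the function names are hypothetical. Under that reading, an MTBF of 300,000 hours describes the failure rate of a large population of drives during their service life (roughly 3% per year), not the expected lifespan of any single drive.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def mtbf_to_years(mtbf_hours: float) -> float:
    """Convert an MTBF figure in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def nominal_afr(mtbf_hours: float) -> float:
    """Probability that one drive fails within a year.

    Assumes a constant failure rate, which is an assumption of this
    sketch and not something the vendor figure guarantees.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    mtbf = 300_000
    print(f"{mtbf:,} hours = {mtbf_to_years(mtbf):.1f} years")
    print(f"nominal annual failure rate = {nominal_afr(mtbf):.2%}")
```

Read this way, "34 years" and "drives last about 3 years" are not in direct conflict: the first is a rate statistic for a fleet, the second is about when individual drives wear out or are retired.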

  51. Michael said,

    on February 21st, 2007 at 6:44 am

    Wow, it does not take rocket science to figure out that profit is the bottom line. Anyone in business knows that survival is linked to margin. Anyone manufacturing product knows their market and builds toward an expected lifetime, not an ideal one. Even if you take the lowest factor necessary of, say, 100K hours, that’s more than 10 years of life expectancy. Sorry, but I don’t know anyone running drives in business that long anymore. Most swap them out in three years. Consumers typically will hold on to their investment longer. Anyway, there is a reason why most manufacturers provide a limited warranty; 5 years is good, but I don’t think a lifetime warranty will ever be there.

  52. straav said,

    on February 21st, 2007 at 6:56 am

    Google’s article at the end of 3.1 does talk about there being a “noticeable influence of infant mortality” (Failure Trends in a Large Disk Drive Population, Google inc, pg4)

    As for differing reliability between SCSI, ATA, FC, have you looked at the model number definitions? Look at some drive vendor sites for how to decode model numbers. The part you will find interesting is that part of the number is what interface it has, while the rest remains the same.

    So does anyone else see how, despite the interface, we are talking about the same drive mechanism? So predicted failure rate would be the same for the same hardware. I can’t say that I know this for a fact in anything other than a handful of drives I opened years ago, so this may be a bit dated.

    So while it is possible it has changed, I would suspect the money savings of mass production makes it a common choice.

  53. Guillaume said,

    on February 21st, 2007 at 7:00 am

    “one array drive failure means a much higher likelihood of another drive failure”: that’s a well-known fact. The problem is that most of the time when you get your RAID array delivered, most of the disks come not only from the same manufacturer, but also from the same factory and the same run. That means that a series of disks built under the same conditions and used under the same conditions has a higher chance of failing if one of them is failing. Nothing new: a couple of sysadmin friends of mine order disks separately, from different manufacturers, to build their RAID systems and insist on always getting a good mix.

  54. thinkers fanboy said,

    on February 21st, 2007 at 7:34 am

    i must agree with the thinker. there is so much important data which hasn´t been taken into account.
    i think it´s sweet how bianca “scales” figure 2 with little arrows indicating what year
    we are at…learn to scale, babe. this paper is almost completely free of sense.

  55. Mike said,

    on February 21st, 2007 at 8:05 am

    Another interesting thing is that all hardware builders always pack drives together in a brick. If you have a big SCSI server with 8 drive bays, and order 4 drives, THE UNWRITTEN LAW OF SERVERS dictates they be next to each other, whereas if you leave a space between each of them, and throw in an extra fan or two, they seem to last forever.
    ’Twas heat that killed the beast.

  56. Aaron Becker said,

    on February 21st, 2007 at 8:06 am

    I can’t believe that nobody has mentioned raid6 as a solution for drive failures during rebuild.

    Certainly you do have a risk involved if you lose one drive from a raid5 array and then you beat the hell out of the remaining disks to do a rebuild.

    But I have to imagine the probability of losing _two more_ drives during a rebuild gets extremely low…
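A rough way to put numbers on this intuition, as a sketch only: assume drive failures are independent and follow a constant failure rate, then compute the binomial chance that at least one (fatal for RAID 5) or at least two (fatal for RAID 6) of the surviving drives fail during the rebuild window. The array size, rebuild time, and MTTF below are hypothetical inputs, and the function names are made up for illustration.

```python
import math


def p_fail_during(window_hours: float, mttf_hours: float) -> float:
    """Chance a single drive fails within the window (constant failure rate)."""
    return 1.0 - math.exp(-window_hours / mttf_hours)


def p_at_least(k: int, n: int, p: float) -> float:
    """Binomial tail: chance that at least k of n independent drives fail."""
    return sum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1)
    )


if __name__ == "__main__":
    # Hypothetical example: 8-drive array, one drive already failed,
    # 24-hour rebuild, 300,000-hour MTTF per drive.
    remaining = 7
    p = p_fail_during(window_hours=24, mttf_hours=300_000)
    print(f"RAID 5 loses data (>=1 more failure): {p_at_least(1, remaining, p):.2e}")
    print(f"RAID 6 loses data (>=2 more failures): {p_at_least(2, remaining, p):.2e}")
```

The independence assumption is exactly what the article says the data does not support: if one failure makes another more likely, both figures are optimistic, although the gap between one more failure and two more failures should still favor RAID 6.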


  57. on February 21st, 2007 at 8:41 am

    Michael, you are correct. And as I pointed out earlier, existing media is sufficiently reliable.

    The Thinker wrote “Unfortunately, this paper is severely flawed. Similar to the Google paper, it is written by academics with little understanding of the subject matter, but a strong desire to publish lengthy papers.”

    While I agree the study’s design was poorly constructed and constrained, lengthy nuanced debates about drive longevity and MTBF are largely academic anyway. Real world data would certainly be interesting, but I seriously question if the conclusions drawn from real-world data would a) be substantially different from those in the study, and b) really matter.

    As a business user, do I really care if a particular model drive has an MTBF of 300,000 or 1,000,000 hours? No. Given today’s drive reliability, and the reasons I mentioned in my previous comment, it is far more likely that I’ll have bigger fish to fry long before my drives are a major concern.

    Everybody loves a good debate.

  58. Anonymous Coward said,

    on February 21st, 2007 at 9:00 am

    “The Thinker” is so proud of his reply that he posted it both here and on slashdot.


  59. on February 21st, 2007 at 9:25 am

    [...] StorageMojo » Everything You Know About Disks Is Wrong # Costly FC and SCSI drives are more reliable than cheap SATA drives. # RAID 5 is safe because the odds of two drives failing in the same RAID set are so low. # After infant mortality, drives are highly reliable until they reach the end of their useful life. [...]

  60. Jameson said,

    on February 21st, 2007 at 9:27 am

    Backups should not remain attached to your computer.
    All your RAID disks can be scrambled at once.

    I had a 400GB disk drive nightly backing up a 180GB disk drive,
    with old files getting renamed, so I somewhat had more than just one backup.
    One night, the data on my disk drive was scrambled, and recovery has so far been futile.
    Luckily, I had that 400GB backup.
    Unluckily, the incident that scrambled the data on my main disk drive
    also scrambled my “mounted” backup drive.
    A small partition on that backup drive was not mounted and was not scrambled.
    I eventually reformatted these two disk drives (from different manufacturers),
    using them once again.
    Unfortunately, my loss included 60 files that were important for a project,
    files that took me over 100 hours to create.

    I run Debian Linux and could find no-one else getting scrambled disks.
    However, some comments by others implied that my Asus K8N4-E motherboard had problems that might cause this.

    So, I upgraded the firmware on that Asus motherboard,
    firmware with an update hinting about a tangential problem like mine.

    I conclude that some incidents like UPS polarity reversals and motherboard firmware can ruin the data on all your disk drives.
    You need to retain backups that are not attached to your computer.
    RAID can protect you from disk drives’ physical failures,
    but it cannot protect you against numerous other causes.
    RAID can keep your system going when disk drives physically fail,
    but disk drive physical failures are not sufficiently more common than other causes for data failure.
    For my home computing, I once thought I could safeguard my data with RAID,
    but I now instead run backups with a few large raw (without a case) SATA disk drives attached externally via USB adapters.


  61. on February 21st, 2007 at 9:43 am

    [...] StorageMojo » Everything You Know About Disks Is Wrong: Which do you believe? [...]

  62. Tmack said,

    on February 21st, 2007 at 10:34 am

    Regarding reliability of the different drive types, specifically comments made first by Ted Fay: having taken apart numerous drives of different interfaces, I can state that the internals are basically the same across the board. The only difference is what controller gets slapped onto the bottom. That controller determines how data is spaced out on the drive itself, how it talks to the computer, etc, but the actual physical moving parts of the platters, arms, spindles, etc, are basically the same between IDE/SCSI/SATA/FC/whatever. They consist of a stepping motor to drive an aluminum spindle holding 2-4 platters about 1mm thick, stamped metal arms with the read/write head attached to a block with a bearing and a coil on the other side held between very strong magnets that drive it back and forth across the platters. The failures are generally related to these mechanical parts failing, such as the surface of the platters wearing out or the heads crashing into them. The reliability of the controller card on the drive is based on solid state electrical components, which if designed correctly will far outlive the mechanicals. This is supported by the paper, and by my experience. The illusion of better reliability is due to the more expensive SCSI/FC drives being used in a more consistent environment, like a datacenter. As more and more SATA/IDE drives are making their way into data centers thanks to cheaper and more available RAID solutions that can use them instead of the SCSI only solutions of the past, the truth is coming out in studies like these.

  63. Kensey said,

    on February 21st, 2007 at 10:54 am

    hymieg, as I recall it’s not so much the spinup, as it is the polymerization of the lubricant after the drive spins *down* preventing it from spinning up again in the first place. Thus the old “whack it and back it” advice (smack the drive to get the platters unstuck, then boot and *immediately* do a full backup). This is also why disks that have run for a long, long time will continue to run just fine *until* a power outage or something else causes them to spin down, at which point they die, the lubricant having essentially turned to glue.

  64. random said,

    on February 21st, 2007 at 12:58 pm

    Tmack, HDs haven’t used stepper motors for almost 2 decades now. They use voice coils now.

  65. robert said,

    on February 21st, 2007 at 2:40 pm

    That’s it, I’m going back to chiseling data on a rock tablet!

  66. Alan said,

    on February 21st, 2007 at 4:53 pm

    In regards to rob’s “bollocks #2”, where he says, “Not one major disk vendor is looking to provide solid state in the big chassis”. I saw a presentation by a major storage vendor where that was in fact exactly what they were touting. It wasn’t EMC or HP.

  67. John said,

    on February 21st, 2007 at 5:01 pm

    According to The Thinker,
    you need a massive data collection project tracking hundreds of variables before you can begin to construct an elementary model of any real-world phenomenon…

    Not true. By the Central Limit Theorem, the effects of many independent random factors that weren’t taken into account will simply increase the level of variability in the observed population means. The analysis is only problematic if the neglected factors are dependent on the ones being analyzed - for example perhaps SCSI drives were inherently more reliable but for some reason were being put into high-vibration environments more often than other types.

    So is there reason to believe that the (admittedly important) initial/operating conditions depend strongly on the variables that were recorded in the data?

    If not, I’m inclined to accept the paper’s conclusions, which mostly amounted to noting that the current model is broken and proposing one that better fits their observed data. Remember Galileo’s notion of experimental error - something like “don’t take my error of one cubit and try to hide Plato’s error of 100 cubits behind it”. A model can be simplistic, incorrect, and still a VAST improvement on the previous art.
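The independence argument in this comment can be illustrated with a toy calculation. This is a sketch with made-up numbers, not data from the paper: it assumes a simple additive model in which a neglected factor (high vibration) adds a fixed amount to the annual failure rate, and it looks only at the expected rates, not at the extra variability the comment also mentions.

```python
def observed_rate(base_rate: float, vibration_penalty: float,
                  share_high_vibration: float) -> float:
    """Expected failure rate when a share of the drives sits in high vibration."""
    return base_rate + vibration_penalty * share_high_vibration


# Hypothetical annual failure rates, chosen only for illustration.
SCSI_BASE = 0.02
SATA_BASE = 0.03
VIBRATION_PENALTY = 0.02

# Case 1: vibration is independent of drive type (half of each type exposed).
gap_independent = (
    observed_rate(SATA_BASE, VIBRATION_PENALTY, 0.5)
    - observed_rate(SCSI_BASE, VIBRATION_PENALTY, 0.5)
)

# Case 2: vibration depends on drive type (SCSI mostly in high-vibration racks).
gap_confounded = (
    observed_rate(SATA_BASE, VIBRATION_PENALTY, 0.1)
    - observed_rate(SCSI_BASE, VIBRATION_PENALTY, 0.9)
)

if __name__ == "__main__":
    print(f"true gap (SATA - SCSI):       {SATA_BASE - SCSI_BASE:+.3f}")
    print(f"observed gap, independent:    {gap_independent:+.3f}")
    print(f"observed gap, confounded:     {gap_confounded:+.3f}")
```

When the neglected factor is spread evenly across drive types, the observed gap equals the true gap. When it is tied to drive type, the observed gap can shrink or even change sign, which is the only situation in which the comment says the analysis becomes problematic.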

  68. Jason Williams said,

    on February 21st, 2007 at 5:33 pm

    Something to keep in mind is that all of these drives were “enterprise” class drives regardless of their interface (SATA, FC, and SCSI). Anything over 1,000,000 MTTF is an enterprise drive…as noted by the E in front of Hitachi enterprise class SATA drives. The SATA drives bought by a consumer usually have spec sheets in the 300,000 range. Which is really scary if you consider the results of this paper.

  69. Third Grade Math said,

    on February 21st, 2007 at 6:40 pm

    1,000,000 hour MTTF?

    1000000 / (24 * 365) = over 114 years!

    WTF?

    If it’s 1,000,000 minutes, we get

    1000000 / (24 * 365 * 60) = 1.9 years

    which is more like what I’ve experienced. Now the study says 300,000,

    300000 / (60 * 24 * 7) = almost 30 weeks!

    I’ve had a lot of drives die on me, but this value is ridiculous.

  70. John said,

    on February 21st, 2007 at 8:29 pm

    I resoundingly second (the other) John’s comment. The Thinker hides his misconception(s) behind lots of fancy talk, a typical example of someone with a high ratio of verbal skills to actual understanding.

    True, there are lots of parameters that were not considered in this study, but (as John noted) as long as the parameters are statistically independent, it’s okay to draw conclusions about the parameters that were in the study based upon those parameters alone.

    If you don’t believe this, then just take a look at any scientific paper which uses statistics to make inferences about a phenomenon being studied. Seldom is it practical or possible to model every parameter that exists in reality. Instead, we choose some suitable subset of parameters that explain as much of the variation as possible.

    If, for example, drive orientation were responsible for some of the observed variation in drive reliability, and the orientation of a drive had no relationship to the other parameters (e.g. drive type), then this parameter essentially becomes “background noise,” that is, the overall effect of this parameter on drive reliability will be the same on drives of different types, so the difference in reliability between drives of different types >due to the drive orientation parameter

  71. Corley Kinnane said,

    on February 21st, 2007 at 9:23 pm

    If you are paranoid at all about RAID5 having inherently more problems than RAID1, consider using *software* RAID5.

    I use RAID1 for booting, RAID5 for data - it’s just easier to set up this way.

    Now, RAID6 is the better option.

    The question of whether RAID1 or RAID5 is more reliable for one failure tolerance - it comes down to the reliability of the software - nothing else.

    When I make a software RAID5 array, I tag each drive with a number on an unused partition - set it up well, and you can’t really go wrong unless you decide to wipe something important - which you could of course do at any time RAID or not.

    Software RAID is a negligible CPU hit these days - and RAID5 is fast, not slow - even just using 4 drives, you should get around 70 - 100 Mb/sec with 7200rpm drives.

    When I need 12+ drives in an array, I use hardware RAID5, knowing it isn’t as secure as the software RAID but is a lot easier to manage.

    I think if I had to make a solution right now, it would be software RAID6 backed up to single drive or RAID0 array in another location - I’d install the backup 6 months after setting up the array and swap some drives from the main array with the backup drive/array.
    If it had to be always live - I would cluster just 2 arrays.

    Triplication ? throw another server on the fire.

    Corley.

  72. Arghh said,

    on February 21st, 2007 at 10:44 pm

    I would really like to see some paper like this on CDs, DVDs etc.
    With media like this the validity of actual claims generally only become evident years after usage.

  73. dp said,

    on February 21st, 2007 at 10:59 pm

    Great - so if the title is correct, everything I’ve just learned about disks is wrong.


  74. on February 22nd, 2007 at 1:46 am

    [...] Slashdot has a new article which says: Google’s wasn’t the best storage paper at FAST ‘07. Another, more provocative paper looking at real-world results from 100,000 disk drives got the ‘Best Paper’ award. Bianca Schroeder, of CMU’s Parallel Data Lab, submitted Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? The paper crushes a number of (what we now know to be) myths about disks such as vendor MTBF validity, ‘consumer’ vs. ‘enterprise’ drive reliability (spoiler: no difference), and RAID 5 assumptions. StorageMojo has a good summary of the paper’s key points. [...]

  75. DZNTUNDERSTAND said,

    on February 22nd, 2007 at 3:59 am

    Will someone please explain the MTBF values and why, if it is 34 years, drives just last 3?

  76. Robin Harris said,

    on February 22nd, 2007 at 5:37 am

    Normally I try to respond to comments, but this time there are too many. So I’m going to cherry-pick here.

    Richard, you raise many good points. Google appears to have priorities for its computing infrastructure, with the #1 priority for the revenue generating ad placements. I’ve heard complaints about Gmail uptime, but not about ad placements.

    Also, Google has probably been the single most important force in getting chip, power supply and motherboard vendors to focus on power consumption. They’ve been having their motherboards custom-made for several years, and they support three drives per node currently. I know they’ve looked at more drives per node. In fact the impetus for this study may have been to determine the optimum number of drives per node.

    Robin

  77. Not Important said,

    on February 22nd, 2007 at 5:50 am

    I think both the paper and the discussion have misunderstood the notion of MTTF. MTTF is the mean time to fail if the drives are replaced on a regular basis within the warranty period.
    In other words, as long as you replace a drive when it reaches about three years of age, there should be an expected average of 114 years between disk failures.
    This explains the fact that drives fail a lot more often than the MTTF would suggest when they have been in use for 5-7 years, as the paper states. The misconception is that the MTTF is stated for a specific drive unit, which it is not.

  78. Thankful said,

    on February 22nd, 2007 at 1:49 pm

    Thank you for finally explaining the MTTF numbers!

  79. Scudchtr said,

    on February 22nd, 2007 at 4:15 pm

    Not Important, I am calling BS on your definition of MTTF unless you can provide some legitimate references.

  80. JeePee said,

    on February 23rd, 2007 at 1:14 am

    Not Important,

    Brilliant reasoning, if you replace the discs before they fail, there is a bigger chance of avoiding failure. But why would I replace a perfectly good disc? Just to meet the manufacturer’s specifications? That’s a bit of turning the world around.

  81. Jessica said,

    on February 23rd, 2007 at 6:04 am

    I think many people misunderstand the purpose of RAID as it is used in a datacentre.

    [ramble, for those who don't work in datacentres]
    RAID is used to reduce (a) loss of data and (b) downtime. A secondary benefit, dependent on configuration, is an increase in performance. Reliability is a byproduct, not the goal.

    If a non-redundant disk fails - your system is down and data is probably lost. RAID gets around this and gives you time to plan how to recover.

    A disk failure under RAID puts you “at risk”. Your FIRST action in the case of a disk failure under RAID is to ensure you have a good backup, not to slap in a disk and resync. As I’m sure others have noted, any RAID resync will put an abnormal load on the remaining disks. You want to avoid this until you can be sure that should there be a second failure, you can recover.

    It is also true that if you do not have an up-to-date backup, taking a backup will load the remaining disks, but this should be a lower load than any resync. You may be fortunate enough that (in decreasing order of preference) either (a) you have a full backup and no data has been added since (b) you can do a quick incremental and place the least stress on the disks (c) the data added can be recreated with minimal effort.
    [end of ramble]

    My point here is that RAID is not a magic solution, but an important part in an overall strategy.

    People have also been talking about the differences between hardware and software RAID. As far as risk is concerned, there is no difference. Until your CPU interfaces directly to the disks there is always a component which could fail and deprive you of data. If you have (S)ATA disks that typically means your motherboard. SCSI or FC - the HBA. In many cases your hardware RAID HBA is just a SCSI/whatever HBA with RAID intelligence added. Your defences against loss of this are (a) duplication of hardware paths and (b) standby spare parts.

    There is an illusion with software RAID that because it is host-controlled, you might be able to trawl through the bits on disk and recover things should there be a catastrophic failure, whereas with hardware RAID the data format is inaccessible. In practice, no one would spend the time doing this unless the data were absolutely vital, and were this the case your backup regime and data duplication to other systems is more efficient.

    Speed of software RAID is entirely orthogonal to the subject at hand.


  82. on February 23rd, 2007 at 8:54 am

    [...] I don’t usually geek out over hardware, and I try to resist becoming a sysadmin myself, but I found this interesting: Everything you know about disks is wrong. It has some implications for how preservation systems are run, and is in part a fillip for the LOCKSS approach. [...]

  83. Brian said,

    on February 23rd, 2007 at 11:31 am

    First, no one here, including the paper’s authors, has explained MTTF properly. The paper’s authors got it all wrong. Let me explain.

    MTTF is the mean-time-to-failure. That means that each drive will, on average, last a certain amount of time. In this case, each drive will last, on average, 1,000,000 hours. That means some will die sooner, some later, etc.

    MTBF is the mean-time-between-failures. That means that the system of drives will, on average, have a certain period of time between failures. That number can be far lower than MTTF.

    Also, the authors state that there is no infant mortality effect, yet the results of their Weibull analysis clearly point to infant mortality. It is commonly accepted in reliability analysis that a Weibull shape parameter of less than 1 indicates infant mortality. Some though would claim that a value of 0.71 is random. Either way, the system of drives is not exhibiting the wearout failure mode that they state.

    MTTF indicates life, but MTBF doesn’t. MTTF generally will not vary with time, but MTBF does. Also, MTTF doesn’t vary with rates of installation or replacement, yet MTBF will.

    It is very easy to confuse the two. Many on here have, many on slashdot have, and the paper’s authors have misunderstood as well.
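Taking the comment's distinction at face value, the gap between a per-drive figure and a fleet-level interval is easy to see with a small calculation. This is a sketch that assumes a constant failure rate, immediate replacement of failed drives, and 24/7 operation; the function name and fleet sizes are hypothetical.

```python
def hours_between_failures(mttf_hours: float, drives: int) -> float:
    """Average time between failures somewhere in a fleet of identical drives.

    Assumes a constant per-drive failure rate and that every failed
    drive is replaced at once, so the fleet size stays fixed.
    """
    return mttf_hours / drives


if __name__ == "__main__":
    mttf = 1_000_000  # per-drive figure from the paper's title
    for fleet in (1, 1_000, 100_000):
        hours = hours_between_failures(mttf, fleet)
        print(f"{fleet:>7,} drives: one failure every {hours:,.0f} hours "
              f"({hours / 24:,.1f} days)")
```

With a 1,000,000-hour per-drive figure, a 1,000-drive fleet would see a failure roughly every 42 days and a 100,000-drive fleet roughly every 10 hours, so the fleet-level number shrinks with fleet size while the per-drive number does not.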


  84. on February 23rd, 2007 at 11:41 am

    [...] StorageMojo » Everything You Know About Disks Is Wrong.  Ah.  Statistics. [...]

  85. Pipson said,

    on February 23rd, 2007 at 12:27 pm

    I disagree with Jessica’s statement that backup of a non-redundant RAID is easier on the drives than a rebuild (unless of course you don’t have to do a full backup). Moreover, in a production scenario, where uptime is important, offlining the RAID to perform a backup instead of rebuilding the array to regain redundancy is defeating the purpose of the system. I do agree with your comments on importance of *regular* backups. This is where providing RPO and RTO that meets business needs is the ultimate fallback.

    I absolutely agree with magicalbob’s last two paragraphs on redundancy.

    When talking about MTBF, not enough emphasis was put on the duty cycle of each system. In my experience SATA systems simply buckle under constant heavy IO load with drives popping just like popcorn. Under light to medium load I would expect SATA and SCSI/FC to show similarly lower failure rates. Then again I may be just the exception…


  86. on February 24th, 2007 at 3:01 pm

    [...] More on hard drives. Here’s a paper that won a “Best Paper” award at FAST ‘07. And a wonderful summary from StorageMojo. Schroeder and Gibson are from CMU’s Parallel Data Lab. [...]


  87. on February 25th, 2007 at 11:44 pm

    [...] The study reveals a number of fallacies, misconceptions, and some outright lies about storage technology and reliability. High-dollar SCSI, FC, SATA, and even RAID users are in for a few surprises…what you thought you could depend on might not be so dependable. An excellent summary of all the major points can be found over at Storage Mojo in a post entitled “Everything You Know About Disks Is Wrong.” And yes, that title sums it up nicely. [...]

  88. Pete said,

    on March 8th, 2007 at 11:55 am

    A few points to blunt the hysteria. Do any of you realize how long 1,000,000 hours is? A quick punch-up in a calculator shows that it is just over 114 years. Even if MTTF and MTBF estimates were off by 50%, that is still 57 years of 24/7 service. Somehow that doesn’t shake my faith in hard drives. Let’s also keep in mind that the price of drive storage has dropped steadily over the last 25 years. I remember when the cost of storage was over $100 per MB. Now that cost is about $.01 per MB. Excuse my math if I miscalculated but isn’t that a 10,000-fold drop?
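A quick check of the price arithmetic, using only the two per-MB figures quoted in the comment: going from $100 to $0.01 per MB is a 10,000-fold reduction, which works out to a 99.99% drop (a price cannot fall by more than 100%). The helper names below are hypothetical.

```python
def percent_drop(old_price: float, new_price: float) -> float:
    """Percentage decrease from old_price to new_price."""
    return (old_price - new_price) / old_price * 100.0


def fold_reduction(old_price: float, new_price: float) -> float:
    """How many times cheaper new_price is than old_price."""
    return old_price / new_price


if __name__ == "__main__":
    old, new = 100.0, 0.01  # dollars per MB, as quoted in the comment
    print(f"drop: {percent_drop(old, new):.2f}%")
    print(f"reduction: {fold_reduction(old, new):,.0f}x")
```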

    Given the comparatively low cost of storage, aren’t RAID 1, 4 and 5 outdated now? Doesn’t RAID 10 give better performance and more redundancy? Given the lower cost of storage isn’t it the epitome of cheap to still be using RAID 5 in server or SAN systems?

    Let me also address the myth that “enterprise” drives are somehow better than “consumer” grade drives. Anyone who knew what they were speccing when designing storage systems knew damned well they weren’t paying for fewer failures, they were paying for performance. Faster spindle speeds, lower seek times, higher transfer rates, more write cache and higher throughput were the name of the game. Anyone thinking they were buying lower failure rates was on a fool’s errand.

    I would also like to address that silliness that RAID is pointless because you still have a single point of failure in the controller. Well, Duh! That’s why the RAID config info is stored on the drives and not in the card or any other volatile memory. If a card fails, it can be replaced without losing data. Also keep in mind that RAID isn’t the end-all and be-all for data security. It is at best one piece of a comprehensive strategy that should include other things like backups, redundant storage and archiving.

    Let me also put things into perspective with mfg published MTBF and MTTF rates. As I stated above 1,000,000 hours is 114 years. The published numbers are ESTIMATES based on predictive TESTING. If they were to actually run real-world tests on samples to get statistical numbers, we would still be putting 20MB MFM and RLL 5.25″ drives in our systems while we waited for manufacturers to complete their testing.

    Let me take the opportunity to put it into geek speak, since I just by chance watched Star Wars the other night: “..So you see Luke what I told you IS true… from a certain point of view.” Interpreting statistics is a fool’s game. They are guidelines based on a certain set of conditions and not facts.

  89. Amos said,

    on March 17th, 2007 at 2:59 am

    So what practical software/filesystem can you recommend to implement such a file-redundancy setup, Or am I obliged to implement this in my applications?

  90. clockwinder said,

    on March 27th, 2007 at 9:40 am

    Permanent data storage?? The hard part is getting rid of stuff you no longer need. I have lived with failures of 9-track tape, DAT tape, Winchester technology drives, CD platters, 80-column punch cards, and punched paper tape. Information Week a number of years ago published a survey on longevity of storage media (not quite the same thing as disk drive longevity). Worst was cheap mag tape. Then hard disk. Then high-quality CD (guessed at reliable for 50-75 years). Most reliable was acid-free paper, good for probably 500 years or more. In this case, we have actual examples!
    Gigabytes per page? It depends… don’t throw the books away yet, folks!

  91. From a cost perspective.. said,

    on April 23rd, 2007 at 6:11 am

    For those taking all this information/comments/thoughts into consideration for real world applications, some cost data to consider…

    On a current “big iron” application, we made the change from SATA drives to Fibre drives before implementation this past year. Storage costs increased exactly 100% for the same amount of storage, not 400 to 600% as has been suggested. So, if you’re thinking of doubling or “tripling” up on SATA, look at the costs also.

    Facility costs on “big iron” projects are huge. The costs to double, or triple, up the space to stand up SATA and the added costs for cooling these drives over a period of years can be staggering.

    Now if you’re just looking at a simple “one for one” replacement of Fibre with SATA, with the same size of storage in the end, then it’s worth looking into because storage costs could be reduced by half.

    As an example our costs could be reduced from $4 million to $2 million. I’ll be taking a look, and will have to make a complicated business decision.

  92. Ted Fay said,

    on May 20th, 2007 at 9:41 pm

    Bob,

    Of course I’m talking about data corruption due to bad blocks, and the fact that only drive-wide hardware failures were taken into account in this study is the basis of my point.

    Robin tried to dismiss my point as being architectural and not real world, yet my whole point is that this study misses some critical aspects of real world experience, which is that when you go to fetch data, and you can’t get it because the blocks are bad, or you can’t rebuild a portion of the data after a failure because the blocks are bad, then whoever needed that data is going to consider it to be a failure, regardless of whether the RAID controller labels the disk as failed or not.

    Data corruption = failure. Anyone who tells you different is trying to sell you something.

    -ted

  93. Ted Fay said,

    on May 20th, 2007 at 9:56 pm

    Anonymous,
    Regarding your comment “Are you saying we should go back to the ST-506 for reliability?”

    Of course not. Radically different technologies, as you know.

    Packing twice the blocks on the same physical spindle as another drive built with the SAME TECHNOLOGY will and does result in twice the number of bad blocks for the same physical damage to, or imperfection in, the platter.

    There is no free lunch, and you do indeed get what you pay for. It doesn’t show up in this study, because this study doesn’t take into account the primary advantage of enterprise disks, which is twice the physical media allocated to each block using the same platter technology as their consumer grade cousins.

    Even if FC, SAS and SATA all do indeed have similar rates of failure for their mechanisms, which I wouldn’t doubt, if you’re willing to pay for RAID redundancy, why not media redundancy for the blocks on your platter?

    Apart from the advantages on the controller board of FC or SAS, what you’re paying for is twice the safety of the data contained on those blocks. If you don’t care about what lives on those blocks, I guarantee you someone will when they go missing. :)

    Just my two cents.

    -ted


  94. on May 21st, 2007 at 11:48 am

    [...] StorageMojo » Everything You Know About Disks Is Wrong Everything You Know About Disks Is Wrong February 20th, 2007 by Robin Harris in Enterprise, Clusters [...]

  95. A Dutch Library said,

    on June 1st, 2007 at 3:53 am

    Well, it’s a bit of a late reply seeing the date that this discussion started, yet I thought it couldn’t harm to add my own advice. We’re all interested in making our data persistent, which is quite a challenge due to media deterioration and rapid media obsolescence. The topic interested me and I’m currently graduating by performing research on it for a library that is interested in digital preservation. There are many difficulties with digital preservation of which this particular one is just a minor (almost easy) part. I will save you the whole reasoning behind my conclusion since it’s not yet finished (and there are probably limits to the textsize that you can post :)) but the conclusion might be helpful to some of you:

    A few assumptions:
    -The target storage system needs to be able to contain 10 TB worth of data
    -The storage system needs to be scalable
    -The storage system needs optimal data security vs. costs. (Of course data triplication is nice, but most of us, libraries included, don’t have that much money.)
    -The storage system needs to be web-accessible
    -The storage system needs to be disaster-proof

    If you are searching for something that should fit these needs as well, this is probably your best solution:

    Two separate servers stored at separate locations (the cheapest way of avoiding data loss through disasters). Configure the first server for RAID5EE (hot spare integration) and the second for RAID60 (SAN). Use 500GB enterprise drives for your first server and 500GB nearline drives for the SAN. Make the first server back up daily to the SAN. Perform nightly disk checks so you can determine when new spare drives should be ordered. And last, but not least, make sure you have the money to buy a whole new server environment within 7 years.

    That isn’t anywhere near cheap, but it’s the most cost-effective way to get an almost 100% guarantee of preserving your data. This configuration doesn’t necessarily have to be optimal for the next generation of hardware you will buy.

    Perhaps no one is helped by this, but I’ll be happy if it helps just someone. Just some (nearly off-topic) side points: for cheap home RAIDs, check the Intel Matrix RAID solution. For future archiving, pay attention to holographic storage development. I’ll save you the other random findings of my study :)


  96. on August 3rd, 2007 at 7:06 am

    Specialized Hard Drives: Worth the Effort?…

    Lately, there has been a lot of buzz in the enterprise storage arena about whether so-called “enterprise drives” are really any better than plain-Jane hard drives in Enterprise applications. This came to a head with the controversial findi…

  97. wgh said,

    on August 23rd, 2007 at 9:47 pm

    Joe Claborn said (on February 21st, 2007 at 6:41 am): Is this right? A MTBF of ‘only’ 300,000 hours translates in 34 years. Our disk drives seem to last about 3 years. Why the difference?

    I’ve skimmed the above thread but didn’t see anyone note that MTBF (and to a degree MTTF) should be divided by the number of drives in your environment to estimate how often you’ll see a single drive within that environment fail. Yes, as you’ve mentioned, the MTBF numbers suggest 34 years to fail for one drive, but if you have 10 drives in your environment you can expect one of them to fail in about 3.4 years. Just as when you have 10 men working construction, there’s 10 times the probability of one of them getting sick on any given day. When working in a “big iron” shop with thousands of RAID devices, this is (usually) taken into account.

    Those who say to triplicate the data instead of using RAID appear to me not to be faced with needing up-to-date, accurate data available in one location, without time available (due to SLAs) to restore or even to fail over to a separate set of drives. Many in mainframe environments have come to rely heavily on having no down time to restore or fail over to other drives, unless the situation is very dire (of a disaster type). If one were to “simply” have three copies, as someone suggested above, then which one do you update? All three? Doing so and waiting for validation of completion of I/O would typically cause response times on heavily I/O-burdened systems to degrade beyond acceptability. Not waiting on validation opens a window for potential corruption of any copies that were not being synchronously updated (synchronous updates are expensive). Thus RAID.

    Yes, drives will fail and drives will be replaced. But a well laid out RAID array will still give the needed response times during failures, even at peak transaction time… again, I said if they’re “well laid out”. And yes, if the data is mission critical, such RAID arrays should be copied to another location… for the event of a disaster (including, at a minimum, lightning).
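
The divide-by-the-number-of-drives arithmetic in the comment above can be sketched in a few lines of Python. This is only an illustration: it assumes a constant failure rate (an exponential lifetime model) and independent drives, and it takes the quoted 300,000-hour MTBF at face value, which the post itself calls into question. The function names are made up for this example.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def years_to_first_failure(mtbf_hours, n_drives):
    """Expected years until the first failure among n_drives.

    Assumes independent drives with a constant failure rate, so the
    fleet-wide rate is n_drives / mtbf_hours.
    """
    return mtbf_hours / n_drives / HOURS_PER_YEAR


def prob_any_failure(mtbf_hours, n_drives, years):
    """Probability that at least one drive fails within `years`,
    under the same constant-rate assumption."""
    fleet_rate_per_hour = n_drives / mtbf_hours
    return 1.0 - math.exp(-fleet_rate_per_hour * years * HOURS_PER_YEAR)


if __name__ == "__main__":
    mtbf = 300_000  # hours, the figure quoted in the thread
    for n in (1, 10, 1000):
        print(n, "drives:", round(years_to_first_failure(mtbf, n), 2), "years")
    print("P(any of 10 fails in 1 year):",
          round(prob_any_failure(mtbf, 10, 1), 3))
```

With one drive this gives roughly 34 years and with ten drives roughly 3.4 years, matching the figures in the comment. If real drives do not fail at a constant rate, as the paper discussed in the post suggests, these numbers are at best a rough yardstick.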
