After last year’s publication of the Google and CMU papers on the much-higher-than-expected annual failure rates of disk drives, StorageMojo challenged vendors to respond.
I said
The industry has an excellent opportunity to move to greater transparency with storage consumers. Sometimes relationships need a jolt to remind everyone just how much we rely upon each other. Storage is a vital industry with the responsibility to protect and access an ever increasing fraction of mankind’s data. Customers want the best tools for the job. It appears the industry hasn’t been providing them, at least for disk drives. I know some efforts are underway in IDEMA to improve the quality of the numbers. I’d get serious about ensuring that the revised processes actually benefit customers rather than soothing corporate egos. Otherwise this situation will arise again.
Further, the need to engage at a more personal level is a predictable outcome of the continuing consumerization of IT. This is an example of the new normal. Embrace it.
Working through the weekend, NetApp’s Val Bercovici did. IBM did so a little later. EMC said semi-nothing.
Two weeks later a not-very-bright EMC’er sent an EMC lawyer to shut StorageMojo up. Some people are so-o-o sensitive.
FAST forward
This week at FAST '08 (the USENIX Conference on File and Storage Technologies) a group of research papers responds to the Google and CMU work. In Parity Lost and Parity Regained; Are Disks the Dominant Contributor for Storage Failures?; An Analysis of Latent Sector Errors in Disk Drives; and An Analysis of Data Corruption in the Storage Stack, NetApp researchers, working with academics including Bianca Schroeder (one of the authors of the CMU paper) and Andrea and Remzi Arpaci-Dusseau of the University of Wisconsin, examine the state of the art in data storage.
Often drawing on NetApp’s AutoSupport database, the papers delve into knotty problems in array architecture and component behavior. With the advantage of large sample sizes, the papers can see further into statistically uncommon events.
For example An Analysis of Data Corruption in the Storage Stack looked at over 1.5 million disks on more than 40,000 systems over 41 months. Those numbers dwarf the combined samples of the Google and CMU teams.
Some surprising results
The cynical, myself among them, might be tempted to dismiss the work as an exercise in self-justification. The studies find disk scrubbing useful in eliminating silent data corruption, a result any half-awake SE will use to their advantage.
But in Parity Lost and Parity Regained – nice Milton reference! – they also found that disk scrubbing could spread an error – parity pollution – across multiple disks. In fact,
. . . the tendency of scrubs to pollute parity increases the chances of data loss when only one error occurs.
This is honest research, following the data wherever it goes. It is the difference between science and spin.
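To make the parity pollution finding concrete, here is a minimal sketch in Python, assuming plain XOR parity over a tiny stripe; it is a toy model, not anything from the paper or NetApp’s code.

    # Toy model of parity pollution: a scrub that recomputes parity from a
    # silently corrupted data block overwrites the good parity that could
    # have repaired it. Simple XOR parity over equal-length byte blocks.

    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    # Three data blocks and their parity, all written correctly.
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)

    # Block 1 is silently corrupted on disk.
    data[1] = b"BxBB"

    # Before any scrub, parity still lets us rebuild the original block 1.
    assert xor_blocks([data[0], data[2], parity]) == b"BBBB"

    # A naive scrub trusts the data as it now sits on disk and rewrites parity.
    parity = xor_blocks(data)

    # The corruption is baked in: "reconstruction" returns the corrupted bytes.
    assert xor_blocks([data[0], data[2], parity]) == b"BxBB"

In this toy model a single error plus an eager parity rewrite becomes unrecoverable, which is the failure mode the quoted sentence describes.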
The StorageMojo take
NetApp’s research offensive is commendable. While IBM, HP and Microsoft maintain large research groups and publish regularly, they are many times NetApp’s size.
It is also smart marketing. NetApp’s research gives them a ready entree to corporate system architects and technical opinion leaders with a fresh and data-heavy perspective on IT risk management.
NetApp is to be congratulated for the work they’ve done. By participating in the conversation they advance the state of the art and their stature with customers. The former is good for the industry and both are good for NetApp.
Update: A commenter requested links to the papers. They aren’t all freely available online yet. Here are the two I found online: download the PDFs for Parity Lost and Parity Regained and An Analysis of Data Corruption in the Storage Stack.
Update 2: Prof. Peter Honeyman of CITI wrote in to let us know that the FAST papers are available here. Thanks Doc.
Comments welcome, of course.
Is the paper available online?
Yeah, disk scrubbing can spread corruption across an array. It works if the bad blocks are confined to a single drive. If only the scrubbing programs were smart enough not to spread corruption…
Thanks–Allen
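Allen’s “smart enough” scrub amounts to verifying each block against an independent checksum before trusting it, and rebuilding a failing block from parity rather than recomputing parity from bad data. A hypothetical sketch, with made-up names and a simple CRC standing in for whatever checksum a real array would use:

    import zlib

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    def checksum_aware_scrub(data, checksums, parity):
        """Scrub one stripe: verify blocks against stored CRCs before touching parity."""
        bad = [i for i, b in enumerate(data) if zlib.crc32(b) != checksums[i]]
        if len(bad) == 1:
            # One bad block: rebuild it from the surviving blocks plus parity.
            i = bad[0]
            others = [b for j, b in enumerate(data) if j != i]
            data[i] = xor_blocks(others + [parity])
        elif len(bad) > 1:
            # More errors than the parity can correct: flag, don't "fix".
            raise RuntimeError(f"unrecoverable stripe, bad blocks: {bad}")
        # Only now, with every block verified, is it safe to rewrite parity.
        return xor_blocks(data)

    # Block 1 is silently corrupted after its checksum was recorded.
    data = [b"AAAA", b"BBBB", b"CCCC"]
    checksums = [zlib.crc32(b) for b in data]
    parity = xor_blocks(data)
    data[1] = b"BxBB"

    parity = checksum_aware_scrub(data, checksums, parity)
    assert data[1] == b"BBBB"   # corruption repaired, not propagated into parity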
NetApp uses 11% of the space on SATA disks as checksum protection (separate from RAID and the disks’ internal ECC). Disk space has really become cheap.
So when will more companies add checksums to their storage offerings and file systems? It’s much harder to miss corruption (and create parity pollution) when everything is ‘secured’ via a Merkle (hash) tree.
The IEEE encryption standard (1619) may actually help in this regard somewhat since you have MAC authentication built into the encryption.
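For readers wondering what the Merkle tree suggestion buys you, here is a minimal sketch, illustrative only and not any particular product’s design: block hashes roll up into a single root hash, so silent corruption anywhere below fails verification.

    import hashlib

    def sha(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def build_tree(blocks):
        """Return the tree as a list of levels: leaf hashes first, root hash last."""
        level = [sha(b) for b in blocks]
        levels = [level]
        while len(level) > 1:
            pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
            level = [sha(b"".join(pair)) for pair in pairs]
            levels.append(level)
        return levels

    def verify(blocks, stored_levels):
        """Re-derive the tree from the data and compare against the stored root."""
        return build_tree(blocks)[-1][0] == stored_levels[-1][0]

    blocks = [b"block-0", b"block-1", b"block-2", b"block-3"]
    levels = build_tree(blocks)

    assert verify(blocks, levels)       # clean data validates
    blocks[2] = b"block-2-corrupted"    # silent corruption somewhere below
    assert not verify(blocks, levels)   # the root hash no longer matches

ZFS keeps its block checksums in parent block pointers rather than in a separate structure like this, but the detection property is the same.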
‘Parity Lost/Regained’ leaves a bit to be desired:
a) It observes that parental checksums can’t avoid problems with parity pollution *if the parity mechanism doesn’t coordinate with the checksum verification*, but fails to note that in ZFS they *do* so coordinate (RAID-Z is still brain-damaged, but not in that respect – and if it changed its mechanism to create stripes out of serial blocks within the same file rather than out of each individual file block it would be just fine). So coming to the conclusion that block checksums are the best base from which to move forward is indeed somewhat self-(NetApp-)serving (though it’s possible that this was merely the result of experiential myopia rather than an intentional distortion).
b) Furthermore, not only is the contention that “the tendency of scrubs to pollute parity increases the chances of data loss when only one error occurs” (your quote above) incorrect when using such a ZFS-style approach, but it’s incorrect *in general*. The situation that they describe leading to data loss when parent checksums are used independently of the parity mechanism *does not occur* during conventional scrubbing (which simply verifies that each disk sector can be read without error): it only occurs (and indeed *any* effect of scrubbing on parity pollution can only occur) when scrubbing also verifies (and if necessary corrects) the parity (and it’s not clear why that would be a good thing to do for precisely this reason, though flagging any inconsistency for human analysis would be reasonable: as they state elsewhere, scrubbing is primarily aimed at preventing latent sector errors from combining with a second error later on to cause data loss, and that has nothing to do with verifying/correcting parity information).
c) Write-verify, while it does protect against lost writes, doesn’t protect against torn writes at all, at least if they’re due to power loss (the example given in the paper): if power is lost, the verify never happens and the tear remains a tear (unless something like a log gets replayed on restart, but in that case it takes care of torn writes without any need for write-verify).
d) In view of point (a) above, all you need to reduce the ‘chance of data loss’ to zero (at least within the scope of their analysis) is parity-based redundancy plus in-parent checksums that coordinate with it (plus scrubbing to detect latent errors before they can combine with another error to cause data loss) – with no need at all for write-verify, version-mirroring, logical/physical identity, or in-sector/in-block checksums.
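A rough sketch of the coordination points (a) and (d) rely on: when a block fails its parent checksum, the candidate rebuilt from parity is verified against that same checksum before it is accepted. This is loosely ZFS-flavored; the names and layout are illustrative assumptions, not the actual ZFS code path.

    import zlib

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    def read_block(stripe, parity, parent_checksums, i):
        """Return block i, accepting a parity rebuild only if it passes the parent checksum."""
        block = stripe[i]
        if zlib.crc32(block) == parent_checksums[i]:
            return block                              # fast path: checksum ok
        # Checksum failed: try reconstructing from the other blocks plus parity.
        others = [b for j, b in enumerate(stripe) if j != i]
        candidate = xor_blocks(others + [parity])
        if zlib.crc32(candidate) == parent_checksums[i]:
            return candidate                          # verified reconstruction
        raise IOError(f"block {i}: neither stored data nor reconstruction validates")

    stripe = [b"AAAA", b"BBBB", b"CCCC"]
    parent_checksums = [zlib.crc32(b) for b in stripe]
    parity = xor_blocks(stripe)

    stripe[1] = b"BxBB"                               # silent corruption
    assert read_block(stripe, parity, parent_checksums, 1) == b"BBBB"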
That NetApp finds parity-based redundancy particularly interesting is hardly surprising given the evolution of their product. But that doesn’t excuse the amount of myopia evident in the paper: there’s little reason to use parity-based redundancy save to economize on disk-space use, which means that you can use it only for large files (which account for most disk-space use in the vast majority of installations) without significantly compromising its effectiveness and thus *can* use it only within individual files where it can not only be easily coordinated with parent-checksum mechanisms but provide entirely acceptable run-time performance (all the validation checksums are already in memory) plus reasonably efficient scrubbing (even while following the metadata paths).
The information that 19% of ‘nearline’ disks develop unreadable sectors within 2 years (presumably including those detected and revectored before they become unreadable, which is where scrubbing makes a major difference) was interesting (perhaps the enterprise-class disks are only about 1/10th as prone to this at least in part due to lower recording densities), as was the observation that the incidence of lost or misdirected writes was as high as about 0.03% per year for nearline disks (or about 0.005% per year for enterprise disks); the information about torn writes was less so, since that’s just something any good system knows it has to deal with (or just pass on up to let applications do so).
But that’s starting to get into the territory covered by “Data Corruption in the Storage Stack”, where the emphasis on ‘silent data corruption’ suggests that NetApp feels some need to respond to the ZFS hoopla in this area. Unfortunately, that paper proved disappointing: while it may offer some insights for system administrators into whether to replace a disk after a particular kind of error, in general it added little to conventional understanding of error modes.
Not only did NetApp pioneer stellar technology a decade and a half ago that the competition is only now even beginning to catch up with, but it’s still got some of the world’s best and most innovative file system engineers on tap. These papers just don’t reflect such excellence: they smell a lot more like PR aimed at countering Sun’s attempt to position ZFS as a better solution in that market space.
– bill
All of the FAST papers are online at http://www.usenix.org/events/fast08/tech/