After last year’s publication of the Google and CMU papers on the much-higher-than-expected annual failure rates of disk drives, StorageMojo challenged vendors to respond.

I said:

The industry has an excellent opportunity to move to greater transparency with storage consumers. Sometimes relationships need a jolt to remind everyone just how much we rely upon each other. Storage is a vital industry with the responsibility to protect and access an ever increasing fraction of mankind’s data. Customers want the best tools for the job. It appears the industry hasn’t been providing them, at least for disk drives. I know some efforts are underway in IDEMA to improve the quality of the numbers. I’d get serious about ensuring that the revised processes actually benefit customers rather than soothing corporate egos. Otherwise this situation will arise again.

Further, the need to engage at a more personal level is a predictable outcome of the continuing consumerization of IT. This is an example of the new normal. Embrace it.

Working through the weekend, NetApp’s Val Bercovici did. IBM did so a little later. EMC said semi-nothing.

Two weeks later a not-very-bright EMC’er sent an EMC lawyer to shut StorageMojo up. Some people are so-o-o sensitive.

FAST forward
This week at FAST (File and Storage Technologies ’08) a group of research papers responds to the Google and CMU work. The four papers are “Parity Lost and Parity Regained,” “Are Disks the Dominant Contributor for Storage Failures?,” “An Analysis of Latent Sector Errors in Disk Drives” and “An Analysis of Data Corruption in the Storage Stack.” In them, NetApp researchers worked with academics, including Bianca Schroeder, one of the authors of the CMU paper, and Andrea and Remzi Arpaci-Dusseau of the University of Wisconsin, to examine the state of the art in data storage.

Often using NetApp’s AutoSupport database, the papers delve into knotty problems in array architecture and component behavior. With the advantage of large sample sizes, the papers see further into statistically uncommon events.

For example, “An Analysis of Data Corruption in the Storage Stack” looked at over 1.5 million disks on more than 40,000 systems over 41 months. Those numbers dwarf the combined samples of the Google and CMU teams.

Some surprising results
The cynical, myself among them, might be tempted to dismiss the work as an exercise in self-justification. The studies find disk scrubbing useful in eliminating silent data corruption, a result any half-awake SE will use to their advantage.

But in Parity Lost and Parity Regained – nice Milton reference! – they also found that disk scrubbing could spread an error – parity pollution – across multiple disks. In fact,

. . . the tendency of scrubs to pollute parity increases the chances of data loss when only one error occurs.

This is honest research, following the data wherever it goes. It is the difference between science and spin.
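For readers who want to see how a scrub can make things worse, here is a toy sketch in Python. It is my own illustration, not code from the paper or from any NetApp product. It assumes simple XOR parity across one stripe and a deliberately naive, hypothetical `naive_scrub` that trusts the data blocks and rewrites parity whenever the two disagree. The block contents and function names are made up for the example.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, RAID-style parity."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def naive_scrub(data_blocks, parity):
    """Hypothetical scrub: on a mismatch, assume the data is right
    and return freshly computed parity to overwrite the old one."""
    expected = xor_blocks(data_blocks)
    return parity if expected == parity else expected


# One healthy stripe: three data blocks plus parity.
original = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(original)

# A silent corruption hits the block on disk 1.
on_disk = list(original)
on_disk[1] = b"BxBB"

# Before the scrub, the good copy of disk 1 is still recoverable
# from the other data blocks and the untouched parity.
rebuilt_before = xor_blocks([on_disk[0], on_disk[2], parity])

# The naive scrub sees the mismatch and rewrites parity from the bad data.
parity_after = naive_scrub(on_disk, parity)

# After the scrub, the same rebuild returns the corrupt block:
# the error now lives in both the data disk and the parity disk.
rebuilt_after = xor_blocks([on_disk[0], on_disk[2], parity_after])

print(rebuilt_before)  # the original block
print(rebuilt_after)   # the corrupted block
```

Before the scrub, one bad block is one recoverable error. After it, parity agrees with the bad data and the good copy is gone, which is the sense in which a single error can end in data loss.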

The StorageMojo take
NetApp’s research offensive is commendable. IBM, HP and Microsoft maintain large research groups and publish regularly, but they are many times NetApp’s size.

It is also smart marketing. NetApp’s research gives them a ready entree to corporate system architects and technical opinion leaders with a fresh and data-heavy perspective on IT risk management.

NetApp is to be congratulated for the work they’ve done. By participating in the conversation they advance the state of the art and their stature with customers. The former is good for the industry and both are good for NetApp.

Update: A commenter requested links to the papers. They aren’t all freely available online yet. Here are the two I found: the PDFs for “Parity Lost and Parity Regained” and “An Analysis of Data Corruption in the Storage Stack.”

Update 2: Prof. Peter Honeyman of CITI wrote in to let us know that the FAST papers are available here. Thanks Doc.

Comments welcome, of course.