Earlier this week StorageMojo published summaries of two papers from the USENIX FAST ’07 conference, Google’s Disk Failure Experience and Everything You Know About Disks Is Wrong. I also published a briefer summary on Computerworld.com.

The credibility of the industry is in question
Both FAST papers were listed on slashdot.org, resulting in over 100,000 unique visitors here, and who knows how many downloads of the original papers. In short, the topic of disk MTBFs (or AFRs), along with related issues raised in the papers, excited a great deal of popular attention.

The papers suggested that important assumptions about disks, and by implication, arrays, are wrong – and not just a little.

  • Failure rates are several times higher than reported by drive companies.
  • Actual MTBFs (or AFRs) of “enterprise” and “consumer” drives are pretty much the same.
  • Drive failure rates rise steadily with age rather than staying flat through some n-year mark.
  • SMART is not a reliable predictor of drive failure.
  • Array disk failures are highly correlated, making RAID 5 two to four times less safe than assumed.
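For readers who want to see how the MTBF and AFR figures in the list above relate, here is a minimal Python sketch. It assumes a constant failure rate (the exponential model that datasheet MTBFs imply, and which the papers themselves call into question). The 1,000,000-hour MTBF and the 3x field multiplier are illustrative assumptions, not figures taken from either paper, and the function names are mine.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse of afr_from_mtbf: the MTBF (in hours) implied by an observed AFR."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


# Hypothetical datasheet value, chosen only for illustration.
datasheet_mtbf = 1_000_000
datasheet_afr = afr_from_mtbf(datasheet_mtbf)

# Hypothetical stand-in for "several times higher" in the field.
field_afr = 3 * datasheet_afr
field_mtbf = mtbf_from_afr(field_afr)

print(f"Datasheet MTBF {datasheet_mtbf:,} h implies an AFR of {datasheet_afr:.2%}")
print(f"A field AFR of {field_afr:.2%} implies an MTBF of about {field_mtbf:,.0f} h")
```

The point of the round trip is that a failure rate a few times higher than the datasheet number corresponds to an MTBF a few times lower, so "higher AFR" and "lower MTBF" are two ways of stating the same finding.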

I believe many readers of these papers will conclude that uncomfortable facts were either ignored or misrepresented by companies that knew better or should have known better. For example, in all the discussion of RAID-DP I’ve seen, the argument is couched in terms of unrecoverable read error rates, not in terms of, say, the likelihood of two drives failing in an array being greater than assumed. Given that field failure rates seem to be several times higher than vendors say, I’m now wondering about claimed bit error rates.

Many rivers to cross
The industry may have several responses:

  • The papers’ conclusions are wrong (completely or in important respects) and here’s why. Our hands are clean.
  • Gosh, we never correlated the behavior our field service and/or warranty groups saw with the claims made by our vendors or our marketing. We’ll do that now and get back to you with updated information. Thank you for bringing this to our attention.
  • These academic studies may reflect the conditions seen in these point-off-the-enterprise-curve installations, but thanks to our superior supply-chain management, manufacturing, test, burn-in and skilled field service we’ve never observed these effects. Here to give an in-depth review of our service experience is our director of field service engineering. Thank you for giving us the opportunity to highlight our operational superiority.

Or most likely a combination of all three strategies.

Where do we go from here?
These issues resonate widely based on the comments I’ve seen. This being the age of interactive communication, you’ll need to engage with customers on multiple levels to regain the trust and credibility I know you’d like to enjoy.

I’m offering StorageMojo as a platform for your responses. I’d really like to hear what you have to say about these papers and the anomalies they’ve documented.

I’ll give each of you your own post to write what you will. StorageMojo readers, including me, will be free to comment. You’ll get your statements out without journalistic interpretation. If those of you with bloggers like Hu, Dave or Mark choose to respond there, I’ll be happy to link those posts for my readers who might not otherwise see them.

The StorageMojo take
The industry has an excellent opportunity to move to greater transparency with storage consumers. Sometimes relationships need a jolt to remind everyone just how much we rely upon each other. Storage is a vital industry with the responsibility to protect and provide access to an ever increasing fraction of mankind’s data. Customers want the best tools for the job. It appears the industry hasn’t been providing them, at least for disk drives. I know some efforts are underway in IDEMA to improve the quality of the numbers. I’d get serious about ensuring that the revised processes actually benefit customers rather than soothing corporate egos. Otherwise this situation will arise again.

Further, the need to engage at a more personal level is a predictable outcome of the continuing consumerization of IT. This is an example of the new normal. Embrace it.

So how about it? Will you respond?

Update: after looking at this in the morning, I decided that it fell short of the clarity I strive for. So this version is punched up a bit from yesterday.

Update II: NetApp has responded. I’m hoping other vendors will as well.

More than ever, comments welcome. Moderation turned on to evade the phentermine dealers, among others. What is phentermine, anyway?