Why do storage systems fail?

It’s the disks, right?
We’ve heard much about disk failures, both as recently as last week and in last year’s reports from Google and CMU. But what about the rest of the system?

In a FAST ’08 paper to be presented this week – Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics – authors Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky analyze logs from 39,000 systems over 44 months to get answers.

1.8 million disks in 155,000 shelves
NetApp provided data from a variety of systems, including near-line, low-end, mid-range and high-end arrays. The team analyzed the log reports to understand what components led to failures.

The 15-page paper offers some interesting findings:

Physical interconnect failures are a significant contributor – anywhere from 27-68% – of storage subsystem failures.
Subsystems that use the same disk models show similar disk failure rates – but their overall subsystem failure rates vary significantly.
Enclosures have a strong impact on subsystem failures. Some enclosures work better with some drives than others.
Dual-redundant FC shelf interconnects reduce annual failure rates by 30-40%.
Interconnect and protocol failures are much more bursty than disk failures. Some 48% of overall subsystem failures arrive at the same shelf within 10,000 seconds (~ 3 hours) of the previous failure.
As interconnect failures are so bursty, resilience mechanisms beyond RAID are required to achieve subsystem availability.
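To make the burstiness figure concrete, here is a minimal sketch of how such a number could be computed from a failure log. It is not the paper's code: the function name, the (shelf_id, timestamp) event format, and the choice to count a shelf's first failure as non-bursty are all assumptions made for illustration.

```python
from collections import defaultdict


def bursty_fraction(events, window_seconds=10_000):
    """Fraction of failures that follow a previous failure on the
    same shelf within `window_seconds`.

    `events` is an iterable of (shelf_id, timestamp_in_seconds) pairs.
    This input format is hypothetical, not taken from the paper.
    """
    # Group failure timestamps by shelf.
    by_shelf = defaultdict(list)
    for shelf_id, timestamp in events:
        by_shelf[shelf_id].append(timestamp)

    total = 0
    close = 0
    for times in by_shelf.values():
        times.sort()
        total += len(times)
        # Compare each failure with the one just before it on the same shelf.
        # Assumption: the first failure on a shelf has no predecessor,
        # so it is never counted as bursty.
        for previous, current in zip(times, times[1:]):
            if current - previous <= window_seconds:
                close += 1

    return close / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample log: two shelves, five failures.
    sample = [("a", 0), ("a", 5_000), ("a", 30_000), ("b", 100), ("b", 20_000)]
    print(bursty_fraction(sample))
```

A result near 0.48 on real logs would match the paper's headline number, provided the log is bucketed the same way the authors did.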

What else?
They also found that enterprise drives had an AFR consistent with manufacturer specs – less than 1% AFR. This result derives from looking at the disks as the system does rather than as users see them.
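For readers who want to check how a spec-sheet MTBF lines up with a sub-1% AFR, here is a rough sketch of the usual conversion. It assumes a constant failure rate (an exponential lifetime model) and 8,760 powered-on hours per year. Neither assumption comes from the paper, and the function name is made up for illustration.

```python
import math

HOURS_PER_YEAR = 8_760  # assumes the drive is powered on all year


def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate implied by an MTBF figure.

    Assumes a constant failure rate, so the chance of failing
    within one year is 1 - exp(-hours_per_year / MTBF).
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    # 1.6 million hours is the kind of figure quoted for enterprise drives.
    print(f"{afr_from_mtbf(1_600_000):.2%}")
```

Under these assumptions a 1.6 million hour MTBF works out to an AFR of roughly 0.55%, which is in line with the less-than-1% result.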

The StorageMojo take
Interconnects, especially connectors, have long been fingered as a significant cause of equipment problems – and not just in storage. While the team reports that interconnects are a greater cause of subsystem failure than disks, there is room for disagreement about what the numbers are telling us.

For example, this result doesn’t fully explain the delta between what disk users have found and the “trouble not found” rates that manufacturers report. Even if you accept the common 50% TNF vendors report, drive failures are still higher than this research finds.

Perhaps we should conclude that NetApp’s engineering is higher quality than the general run of storage arrays. Or perhaps system log analysis is still a dark art whose results are more indicative than conclusive.

Comments welcome, as always. I’m at the FAST ’08 conference this week in the San Jose Fairmont hotel.

7 Comments

Tony on Monday, 25 February, 2008 at 10:33 am

Yes, connectors really suck (as I figure out where and what type of connectors to use in a new design) – but no connectors (hard-wired) is even worse!

There’s a lot more science and technology in good connectors than most people realize.

pmwut5 on Tuesday, 26 February, 2008 at 3:18 am

Cannot believe NetApp made no mention of outdated firmware running on their drives as a cause of storage failure. I have experienced several LIP storms from failed drives on LRC, ESH2 and ESH4 storage interconnect modules on production NetApp arrays. NetApp’s solution is to dual connect all storage trays – great if you have enough I/O slots and downtime available.

Mad Morf on Tuesday, 26 February, 2008 at 5:56 am

pmwut5…
You’re mixing apples and oranges here.
Drive firmware is different from shelf module firmware.
Keeping drive and shelf firmware updated is the customer’s responsibility, as it is with any vendor.
LRCs provide no protection from LIP storms, so replace them if you don’t like the results you are getting. ESH2s and ESH4s do provide a lot of protection from LIP events, but no solution is perfect!
Dual connecting shelves requires no downtime if you have sufficient HBA ports available in the appliance.

Pete Steege on Tuesday, 26 February, 2008 at 9:32 am

Great perspective on disk failure. The fact that enterprise drives actually achieve their reliability numbers will surprise some people, given that they are so extreme (1.6 million hours between failures, for example).

Goes to show you get what you pay for.

Pete Steege on Tuesday, 26 February, 2008 at 9:35 am

It’s probably not that NetApp systems are better, but that their application environment is different than other systems on average. I have no facts to support this, but my sense is the NetApp sweet spot is very high capacity apps that tend to stress individual drives less than other apps.

My observation is that the higher the capacity in a system, the lower the IOPS per Gigabyte (or drive).

pmwut5 on Wednesday, 27 February, 2008 at 3:12 am

I still find it interesting that a NetApp report into storage failure rates makes no mention of outdated drive firmware as a cause of storage subsystem failure. Some disk drives being used have N14 releases of firmware – mostly to improve disk error handling, which prevents the drive being marked as failed. If I were going to produce a report on 1.8 million disks, the first thing I would look at is the firmware on those drives before drawing any conclusion.

Richard on Saturday, 1 March, 2008 at 2:48 am

Robin,
Yes… cabling plus connectors plus poor airflow plus power… this is why the ‘commodity’ motherboard hardware approach does not work well.
NetApp controllers seem to be built on standard motherboard technology… right?

Trackbacks/Pingbacks

You get what you pay for « Storage Effect - [...] One failure in a million hours? It’s claims like these that seem extreme to some people when they look…
