StorageMojo






Why do storage systems fail?

February 24th, 2008 by Robin Harris in Architecture, Disk, Enterprise

It’s the disks, right?
We’ve heard much about disk failures - as recently as last week, and in last year’s reports from Google and CMU. But what about the rest of the system?

In a FAST ‘08 paper to be presented this week - Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics - authors Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky analyze logs from 39,000 systems over 44 months to get answers.

1.8 million disks in 155,000 shelves
NetApp provided data from a variety of systems, including near-line, low-end, mid-range and high-end arrays. The team analyzed the log reports to understand what components led to failures.

The 15-page paper offers some interesting findings:

  • Physical interconnect failures are a significant contributor to storage subsystem failures, accounting for anywhere from 27% to 68% of them.
  • Subsystems that use the same disk models show similar disk failure rates - but their overall subsystem failure rates vary significantly.
  • Enclosures have a strong impact on subsystem failures. Some enclosures work better with some drives than others.
  • Dual-redundant FC shelf interconnects reduce annual failure rates by 30-40%.
  • Interconnect and protocol failures are much more bursty than disk failures. Some 48% of overall subsystem failures arrive at the same shelf within 10,000 seconds (~ 3 hours) of the previous failure.
  • As interconnect failures are so bursty, resilience mechanisms beyond RAID are required to achieve subsystem availability.
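
To make the burstiness figure concrete, here is a minimal Python sketch of how such a number could be computed from a failure log. It is not the paper's method: the (shelf_id, timestamp) log format, the function name, and the choice to count only failures that have an earlier failure on the same shelf are all assumptions made for illustration.

```python
from collections import defaultdict


def bursty_fraction(events, window_s=10_000):
    """Fraction of failures that arrive within `window_s` seconds of the
    previous failure on the same shelf.

    `events` is an iterable of (shelf_id, timestamp_seconds) pairs.
    This log format is hypothetical, not NetApp's. Only failures that
    have a predecessor on the same shelf are counted (an assumption).
    """
    by_shelf = defaultdict(list)
    for shelf_id, ts in events:
        by_shelf[shelf_id].append(ts)

    close = 0  # gaps no longer than the window
    total = 0  # all consecutive same-shelf gaps
    for times in by_shelf.values():
        times.sort()
        for prev, cur in zip(times, times[1:]):
            total += 1
            if cur - prev <= window_s:
                close += 1
    return close / total if total else 0.0


# Made-up sample log: shelf A fails twice in quick succession, then again
# much later; shelf B fails twice, far apart. One of three gaps is bursty.
sample = [("A", 0), ("A", 5_000), ("A", 50_000), ("B", 100), ("B", 200_000)]
print(bursty_fraction(sample))
```

A high value from a calculation like this means failures cluster in time on a shelf, which is why RAID alone, built to survive independent failures, may not be enough.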

What else?
They also found that enterprise drives had an AFR consistent with manufacturer specs - less than 1% AFR. This result derives from looking at the disks as the system does rather than as users see them.
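
For readers who want to check what "consistent with manufacturer specs" means, the sketch below converts an MTBF spec into an implied AFR. It assumes a constant failure rate and 24x7 operation, which is the usual simplification behind spec-sheet numbers rather than anything taken from the paper. The 1.6-million-hour MTBF is an illustrative input only.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours, assuming 24x7 power-on


def afr_from_mtbf(mtbf_hours: float) -> float:
    """AFR implied by an MTBF spec, assuming exponentially distributed
    lifetimes (constant failure rate). This is a simplification."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# Illustrative input: a 1.6-million-hour MTBF spec.
afr = afr_from_mtbf(1_600_000)
print(f"{afr:.2%}")  # roughly 0.55%
```

Under those assumptions a 1.6-million-hour spec works out to an AFR of roughly 0.55%, which squares with an observed AFR of less than 1%.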

The StorageMojo take
Interconnects, especially connectors, have long been fingered as a significant cause of equipment problems - and not just in storage. While the team seems to report that interconnects are a greater cause of subsystem failure than disks, there is some room for disagreement about what the numbers are telling us.

For example, this result doesn’t fully explain the delta between what disk users have found and the “trouble not found” rates that manufacturers report. Even if you accept the common 50% TNF vendors report, drive failures are still higher than this research finds.

Perhaps we should conclude that NetApp’s engineering is higher quality than the general run of storage arrays. Or perhaps system log analysis is still a dark art whose results are more indicative than conclusive.

Comments welcome, as always. I’m at the FAST ‘08 conference this week in the San Jose Fairmont hotel.

8 Responses to ' Why do storage systems fail? '

  1. Tony said,

    on February 25th, 2008 at 10:33 am

    Yes, connectors really suck (as I figure out where and what type of connectors to use in a new design) - but no connectors (hard-wired) is even worse!

    There’s a lot more science and technology in good connectors than most people realize.

  2. pmwut5 said,

    on February 26th, 2008 at 3:18 am

    Cannot believe NetApp made no mention of outdated firmware running on their drives as a cause of storage failure. I have experienced several LIP storms from failed drives on LRC, ESH2 and ESH4 storage interconnect modules on production NetApp arrays. NetApp’s solution is to dual connect all storage trays - great if you have enough I/O slots and downtime available.

  3. Mad Morf said,

    on February 26th, 2008 at 5:56 am

    pmwut5…
    You’re mixing apples and oranges here.
    Drive firmware is different from shelf module firmware.
    Keeping drive and shelf firmware updated is the customer’s responsibility, as it is with any vendor.
    LRCs provide no protection from LIP storms, so replace them if you don’t like the results you are getting. ESH2s and ESH4s do provide a lot of protection from LIP events, but no solution is perfect!
    Dual connecting shelves requires no downtime if you have sufficient HBA ports available in the appliance.

  4. Pete Steege said,

    on February 26th, 2008 at 9:32 am

    Great perspective on disk failure. The fact that enterprise drives actually achieve their reliability numbers will surprise some people, given that they are so extreme (1.6 million hours between failures, for example).

    Goes to show you get what you pay for.

  5. Pete Steege said,

    on February 26th, 2008 at 9:35 am

    It’s probably not that NetApp systems are better, but that their application environment is different than other systems on average. I have no facts to support this, but my sense is the NetApp sweet spot is very high capacity apps that tend to stress individual drives less than other apps.

    My observation is that the higher the capacity in a system, the lower the IOPS per Gigabyte (or drive).

  6. pmwut5 said,

    on February 27th, 2008 at 3:12 am

    I still find it interesting that a NetApp report into storage failure rates makes no mention of outdated drive firmware as a cause of storage subsystem failure. Some disk drives being used have N14 releases of firmware - mostly to improve disk error handling, which prevents the drive being marked as failed. If I was going to produce a report into 1.8 million disks, the first thing I would look at is the firmware on those drives before making any conclusion.

  7. Richard said,

    on March 1st, 2008 at 2:48 am

    Robin,
    Yes… cabling plus connectors plus poor airflow plus power… this is why the ‘commodity’ motherboard hardware approach does not work well.
    NetApp controllers seem to be built on standard motherboard technology… right?


  8. on March 7th, 2008 at 12:15 pm

    [...] One failure in a million hours? It’s claims like these that seem extreme to some people when they look at enterprise disk drives. Yet a study of 39,000 NetApp systems by a researcher has found that these drives fail at a 1% annual failure rate (AFR). Robin Harris summarizes the study in his blog. [...]
