As promised in the Open Letter to Seagate, Hitachi, EMC, IBM, NetApp, HP and Sun, StorageMojo is giving this space to NetApp to respond to Everything You Know About Disks Is Wrong and Google’s Disk Failure Experience.
Props to NetApp for being quick off the mark in responding [gee, maybe there is a reason they are the fastest growing large storage vendor] and to NetApp’s Director, Technical Strategy, Val Bercovici, who worked over the weekend to craft a data-rich response. Val’s response is unedited by moi. IMHO, Val has done a good job discussing the issues while keeping self-congratulatory chest beating under control. To improve readability I’ve added some bolded sub-heads in [brackets] which are mine alone.
And all you other guys, getting your lunch eaten by NetApp, the invitation still stands. The NetApp response begins here:
[NetApp feels a bit like Al Gore now]
It’s probably fitting that I’m writing this during the Oscar Awards weekend. Much like the Oscar broadcast itself, this is a long blog post and many overnight sensations will be discussed following the Oscar broadcast. In fact the suddenly hot topic of disk failures and resulting impact on data availability & resiliency might seem like yet another overnight sensation, courtesy of mainstream media coverage such as the “beeb”. However, most professionals in the far less glamorous storage industry know that like all overnight sensations, this one too has actually been many years in the making. Stretching the Oscar theme a bit more (regardless of political affiliation) many of us at NetApp also feel a little bit like Al Gore now. Let me explain …
It may surprise many of those reading StorageMojo (perhaps even the admin himself), but NetApp is actually *thrilled* at the attention this whole topic is now gaining, and much like Al Gore we feel somewhat vindicated since we’ve been banging this drum for a while. I’ll be addressing all of Robin’s provocative points regarding the credibility of the storage industry (specifically drive & array manufacturers) below, but a little bit of NetApp history in this regard will add important context to my response.
[Disrupting ourselves by cannibalizing FC disk storage sales]
Back around the year 2000, NetApp’s thought leaders observed that the gap in density between consumer-oriented drives (then known mostly as IDE, today as xATA) and enterprise drives (SCSI & FC) was becoming too big to ignore. It was clear to us that Clayton Christensen’s “good enough” principle from his seminal disruptive technology work would apply in this case. So our choice came down to either disrupting ourselves by cannibalizing some of our FC disk storage sales with lower revenue per-capita ATA drives, or watching somebody else do it to us. I’m glad we phrased it that way internally since it became an easy choice in hindsight.
That decision prompted NetApp to release in 2001 the first enterprise-class storage array based on ATA technology at new price points previously unavailable to the online storage market. The NetApp NearStore thus created the “Nearline storage” market segment. Little did we know that would thrust us into a virtuous circle where we also learned some hard lessons. The innovation we applied to overcoming those lessons learned has directly contributed to our dominance of the Nearline storage market, as well as ultimate industry capacity leadership in the overall “networked storage” category tracked by IDC. Yes that means today we ship more array-based FC & ATA disk capacity than EMC, HP, Sun and our OEM partner IBM as listed here in StorageMojo’s open letter. That key statistic helps add unmatched credibility to our responses surrounding this issue and the specific points raised below.
FAST (pardon the pun) forward to 2007 and the Google & CMU studies, resulting IT media / blogosphere coverage, plus resulting StorageMojo open letter. Let’s review the key points raised:
1. Failure rates are several times higher than reported by drive companies
Most mature storage array vendors already know this and devote serious engineering, disk qualification / testing and field support resources to mitigating the resulting customer risk. Conversely, most experienced storage array customers have learned to equate the accuracy of quoted drive failure specs to the MPG estimates reported by car manufacturers. A classic case of YMMV, and it often will vary if you deploy these disks in anything but the mildest of eval / demo lab environments.
2. Actual MTBFs (or AFRs) of “enterprise” and “consumer” drives are pretty much the same
This tidbit known mostly to industry insiders is largely true, especially when comparing comparable drive sizes. But how storage arrays handle the respective drive type failures is what continues to perpetuate the customer perception that more expensive drives should be more reliable. One of the storage industry’s dirty secrets is that most enterprise and consumer drives are made up of largely the same components. However, their external interfaces (FC, SCSI, SAS or SATA) and most importantly their respective firmware design priorities / resulting goals play a huge role in determining enterprise vs. consumer drive behavior in the real world.
[Firmware more reliable than people - good]
Considering the awe-inspiring areal density of the platters themselves, combined with drive firmware size and complexity rivaling entire operating systems of a few years ago, NetApp’s storage subsystem team considers contemporary disk drives “miracles of modern engineering”. In fact, that resulting firmware size and complexity is beginning to resemble the anthropological and demographic behaviors of human beings themselves!
To wit, consumer-class drives’ personality is determined by firmware that assumes the drive is isolated inside a laptop or desktop and cannot rely upon parity information stored on adjacent (peer) drives to recover from a partial or full error condition. Consequently, consumer drives exhibit “Type A” personalities as they heroically go offline for non-deterministic periods of time (a few seconds or many minutes), to “take charge” of the situation and perform various pre-programmed techniques attempting to resolve bad blocks, media / checksum errors, etc… Unpredictable and non-deterministic timeouts during these occurrences inside storage arrays can present some challenging circumstances to array designers – yet as one can easily see the end result will not always be a “failed” drive. Safe & efficient handling of this situation without disruption, and often without even physical removal of the drive itself, is one of the innovations NetApp delivers with our “Maintenance Center” suite of disk resiliency technologies which I cover in a bit more detail responding to the next point below.
OTOH enterprise-class drives exhibit markedly different group dynamics since their firmware makes the assumption that they are usually deployed as a member of a RAID set – and should consequently defer to their peers for mirror or parity-based recalculation / recovery during the same set of error conditions cited for the consumer drives above. That makes for much more deterministic behavior which has historically enabled storage arrays using exclusively enterprise-class drives to compensate in a much more consistent and predictable manner when drives fail. The makers of enterprise-class storage arrays now face some daunting challenges as they incorporate consumer-class drives while maintaining the same historical service levels. A quick scan of various enterprise storage vendors’ spec sheets quickly reveals which ones have risen to the engineering challenge with native SATA support vs those that have punted responsibility back onto the drive manufacturers themselves by using lower volume (higher price / not consumer-class) hybrid drives known as FATA or LC-FC.
FYI – NetApp storage engineering has actually moved on to tackle even more challenging consequences of today’s popular & complex dense disk drives. Many Enterprise storage arrays (and increasingly popular filesystems such as ZFS) have evolved sophisticated checksumming algorithms to verify the correctness of normal read operations. Some array vendors go the extra mile to continuously monitor such checksums on data that is not normally (or perhaps never) read after it is initially written. To the best of our knowledge, NetApp is the only array vendor to take the final step and check for the incident known as a “lost write” which conventional checksumming approaches do not (yet) catch. The risks of silent data corruption loom large in any filesystem, disk drive or storage array which does not account for the potential of “lost writes”.
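To make the “lost write” problem concrete, here is a minimal sketch of why a per-block checksum alone can miss it. This is a toy model, not NetApp’s implementation: the `Disk` class, the `generation` stamp and the separately kept `expected` table are all hypothetical names introduced for illustration. The idea is that a dropped write leaves the old block in place, and the old block still matches its own old checksum, so only a check against something stored apart from the block can notice that it is stale.

```python
import zlib


class Disk:
    """Toy block device: each block stores (payload, checksum, generation)."""

    def __init__(self):
        self.blocks = {}
        self.drop_next_write = False  # set to True to simulate a lost write

    def write(self, lba, payload, generation):
        if self.drop_next_write:
            # The drive acknowledges the write but never persists it.
            self.drop_next_write = False
            return
        self.blocks[lba] = (payload, zlib.crc32(payload), generation)

    def read(self, lba):
        return self.blocks[lba]


def checksum_ok(block):
    """Conventional check: does the payload match its stored checksum?"""
    payload, checksum, _ = block
    return zlib.crc32(payload) == checksum


def generation_ok(block, expected_generation):
    """Extra check: is this the version of the block we last wrote?"""
    return block[2] == expected_generation


disk = Disk()
expected = {}  # generation per block, kept apart from the block itself

disk.write(7, b"old data", generation=1)
expected[7] = 1

disk.drop_next_write = True  # the next write is silently lost
disk.write(7, b"new data", generation=2)
expected[7] = 2

stale = disk.read(7)
print("checksum passes:", checksum_ok(stale))
print("generation matches:", generation_ok(stale, expected[7]))
```

The stale block passes the checksum test because it is internally consistent; it fails only the generation comparison, which depends on metadata stored somewhere other than the block that was never written.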
3. SMART is not a reliable predictor of drive failure
We believe this is one of the most tangible points that separates the “men from the boys” in this industry. Few if any of the storage newcomers in this market have endured the real-world field experiences required to come to this difficult realization and make the necessary engineering & support investments to compensate. NetApp considers our solutions in this regard a distinct competitive advantage, so we’ve explicitly decided to drive public industry discussion of this issue. Forums such as the FAST conference have played a key role. Both of this year’s Google and CMU studies refer to seminal NetApp work in this area, and just like the CMU paper here in 2007, NetApp’s RAID-DP won “Best Paper” at the FAST ’04 Conference. Yet as pointed out correctly on this blog RAID-DP (a performance-optimized variety of SNIA-defined RAID 6) is merely a key part of the protection spectrum against this issue, not all of it.
Quick backgrounder substantiating our position – NetApp shipped over 104 PB (petabytes) of capacity during our last reported quarter (ending in Jan 2007). Since we didn’t publicly disclose the number of spindles that equates to, I’ll do some back-of-the-napkin math blending the 500, 300 & 146GB spindle varieties to arrive at a rough average of over 150,000 spindles per month, which by itself every month is well above the total number of drives covered in each of the cited Google & CMU studies.
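For readers who want to check the napkin math, a rough sketch follows. The actual drive mix was not disclosed, so the code only brackets the answer by pretending every drive was a single size, then works backwards to the average drive size that 150,000 spindles per month would imply. It assumes decimal units (1 PB = 1,000,000 GB) and three equal months per quarter.

```python
PB_IN_GB = 1_000_000  # assumption: decimal units, as drive capacities are quoted

shipped_gb_per_quarter = 104 * PB_IN_GB
months_per_quarter = 3
shipped_gb_per_month = shipped_gb_per_quarter / months_per_quarter


def spindles_per_month(avg_drive_gb):
    """Spindles needed per month if the average drive holds avg_drive_gb."""
    return shipped_gb_per_month / avg_drive_gb


# Bracket the estimate: what if every drive shipped were one size?
for size_gb in (500, 300, 146):
    count = spindles_per_month(size_gb)
    print(f"all {size_gb} GB drives: {count:,.0f} spindles/month")

# Work backwards from the quoted 150,000 spindles per month.
implied_avg_gb = shipped_gb_per_month / 150_000
print(f"implied average drive size: {implied_avg_gb:.0f} GB")
```

The quoted figure sits between the all-500GB and all-146GB extremes, which is what a blend of the three sizes should produce.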
[Making SMART smarter]
Much like Google, NetApp has accumulated over the years a massive data warehouse of real-world drive behavior but under a much broader range of production deployment environments and configurations. We track ongoing drive behavior reported on a weekly basis during normal working states as well as in an event-driven manner during the various stages of drive failure. That has enabled us to surround the basic SMART information provided by the drive manufacturers with a comprehensive set of technologies branded as “Maintenance Center” (introduced in my response above) which enable NetApp arrays to take highly safe, accurate, granular and efficient predictive actions described in the response to the next point below.
4. Drive failure rates rise steadily with age rather than staying flat through some n-year mark
The relatively controlled sample sets of the Google & CMU studies enable them to arrive at more specific conclusions than NetApp has observed. OTOH since NetApp will soon ship more drives per month than both of those studies’ multi-year sample sets combined (to a much broader set of production deployment environments) we have learned that the actual list of possible reasons behind drive failures gets longer with the introduction of each new drive model. Consequently there are many best-practices we recommend to storage array administrators, which are derived from the consistent set of resiliency features we supply as a vendor of storage arrays to small, medium and large-sized organizations of all kinds.
If there’s one thing we’ve learned as a result of the massive real-world drive behavior data warehouse we’ve accumulated – it’s that there’s no simple pattern to predict when a drive will fail. But by far our most significant discovery is that drive failures are actually no longer the simple atomic and persistent occurrences they used to be a few short years ago. There are in fact many circumstances not restricted to age, environmentals (NVH), power & cooling, or even electro-mechanical behavior of drive peers within the same array, which can render a drive unusable – and eventually failed. One of the most fascinating Oscar-worthy plot-twists that we’ve uncovered as a result of our vast experience is that drives can also come back from the dead to lead very normal and productive lives! Industry-leading innovation we’ve been shipping with NetApp Maintenance Center allows a NetApp array to use algorithms derived from our aforementioned data warehouse to take intelligent proactive actions such as:
- Predict which drive(s) are likely to fail (using advanced algorithms based on our vast data warehouse).
- Copy readable data directly from the failing spindle onto a global hot spare without parity reconstruct overhead.
- Use RAID-DP parity to calculate the remaining subset of unreadable data (usually a very small percentage of the overall drive).
- Take the suspected “failed” drive offline (while physically maintaining it in the array) and probe said drive with low-level diagnostics to determine whether the failure was transitory or truly and permanently fatal.
- Return fixed drives which exhibited only single-instances of transitory errors back to the global hot spare pool.
Although we’ve only been collecting statistics on the advanced Maintenance Center functionality for about a year now, our assumptions have been validated in that the vast majority of “failed” drives only exhibit isolated incidents of transitory errors and can safely remain in the array while rejoining the spares pool. It should be noted that these drives don’t get a second chance at a second life :-). Should those same drives fail again in any manner, they are marked for physical removal from the array and considered permanently failed.
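The second-chance policy described above can be sketched as a small state machine. This is an illustration only, not Maintenance Center code: the `DriveState` names, the `Drive` class and the `diagnostics_pass` callback are hypothetical, and the data-copy and parity-reconstruction steps are left out so that only the drive’s lifecycle is shown.

```python
from enum import Enum, auto


class DriveState(Enum):
    IN_SERVICE = auto()
    UNDER_TEST = auto()  # offline, but still physically in the array
    SPARE = auto()       # returned to the global hot spare pool
    FAILED = auto()      # marked for physical removal


class Drive:
    def __init__(self, name):
        self.name = name
        self.state = DriveState.IN_SERVICE
        self.prior_incidents = 0


def handle_suspect_drive(drive, diagnostics_pass):
    """Apply the one-second-chance policy to a drive flagged as failing.

    diagnostics_pass is a hypothetical callable that returns True when
    low-level diagnostics find the error was only transitory.
    """
    if drive.prior_incidents >= 1:
        # Already used its second chance: no further testing.
        drive.state = DriveState.FAILED
        return drive.state

    drive.prior_incidents += 1
    drive.state = DriveState.UNDER_TEST
    if diagnostics_pass(drive):
        drive.state = DriveState.SPARE
    else:
        drive.state = DriveState.FAILED
    return drive.state


drive = Drive("example-drive")
first = handle_suspect_drive(drive, lambda d: True)   # transitory error
second = handle_suspect_drive(drive, lambda d: True)  # fails again later
print(first.name, second.name)
```

Note that the second incident goes straight to `FAILED` without consulting diagnostics, mirroring the rule that a drive gets only one return trip to the spares pool.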
[If it ain't broke . . . ]
There are of course many net positive tangible NVH & electro-mechanical advantages in avoiding physical drive removal events from any storage array, which contribute to a different kind of NetApp virtuous circle around overall storage system RAS. Notably, an indirect yet significant benefit of more granular and intelligent drive failure management afforded by NetApp Maintenance Center is improved supply chain efficiencies NetApp customers enjoy. This comes as a result of the reduction in the expensive cycle of drive removal, RMA administrative processing & shipment, plus drive replacement shipping, handling & asset management.
5. Array disk failures are highly correlated, making RAID 5 two to four times less safe than assumed
This is an excellent final point. For readers that made it this far, there’s one takeaway I hope everyone remembers from this discussion. Given the realities of today’s drives (plus all the trends indicating what we can expect from electro-mechanical storage devices in the near future) – protecting online data only via RAID 5 today verges on professional malpractice.
That’s a deliberately strong and provocative statement. I use it often to raise awareness of this very real industry issue and when outlining NetApp competitive advantages such as RAID-DP & Maintenance Center in this regard. Apart from more capacity efficiency than both RAID 5 (in typical 3+1 or 4+1 best-practice configs) and RAID 1/0, RAID-DP (or any sturdy variety of RAID 6) is also becoming a necessary complement to the increasingly dense spindles most organizations are pressured to purchase for financial reasons. Patterson, Gibson and Katz defined some excellent RAID levels with their seminal work based on spindle realities of the eighties. 20 years later it’s time to retire those legacy RAID levels and define and implement modern ones which address the realities of contemporary drive technology in the 21st century.
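The capacity-efficiency comparison is simple arithmetic, sketched here. The 3+1 and 4+1 RAID 5 layouts come from the text; the 14+2 double-parity group is an assumed size chosen for illustration, not a figure from this response.

```python
def parity_raid_efficiency(data_drives, parity_drives):
    """Fraction of raw capacity left for data in a parity RAID group."""
    return data_drives / (data_drives + parity_drives)


layouts = {
    "RAID 5 (3+1)": parity_raid_efficiency(3, 1),
    "RAID 5 (4+1)": parity_raid_efficiency(4, 1),
    "RAID 1/0 (mirrored)": 0.5,  # every block is stored twice
    "double parity (14+2, assumed)": parity_raid_efficiency(14, 2),
}

for name, efficiency in layouts.items():
    print(f"{name}: {efficiency:.1%} usable")
```

The advantage depends on group width: double parity only beats a narrow RAID 5 group on efficiency because the second parity drive makes it safer to run much wider groups. A hypothetical 4+2 group, for example, would be less efficient than RAID 5 at 3+1.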
[RAID 5 today verges on professional malpractice]
In more conservative, controversy-phobic settings one can tone down the rhetoric and merely refer to the copious 3rd party evidence we cite in this regard, including (but certainly not restricted to):
- Enthusiast-oriented reports such as AnandTech, quoting an “8% chance of complete data loss using RAID 5 with 200GB spindles”
- Seagate & Microsoft’s WinHEC 05 Presentation (SATA in the Enterprise) – “Call to Action: Use (only) RAID 1 or RAID 6 in SATA Array”
- IBM Research in Almaden (S. R. Hetzler, IBM Fellow) quoting a controlled study of large capacity drives “With only 2 9’s reliability – RAID 5 is insufficient with SATA”
- The “father of DEC StorageWorks” (now HP EVA) quoting that “If you have one petabyte of desktop drives with RAID 5, you could lose data twice a year”
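One common way to reason about figures like these is to estimate the chance of hitting an unrecoverable read error while rebuilding a RAID 5 group, when no redundancy is left to repair it. The sketch below is a simplified model and does not claim to reproduce any of the quoted numbers. It assumes bit errors are independent and that the drive is specified at one unrecoverable error per 10^14 bits read, a rate often quoted for desktop-class drives but not taken from this response.

```python
import math


def p_read_error_during_rebuild(surviving_drives, drive_gb, bits_per_error=1e14):
    """Probability of at least one unrecoverable read error in a rebuild.

    A RAID 5 rebuild must read every surviving drive in full. Assumes
    independent bit errors at a rate of one per bits_per_error bits.
    """
    bits_read = surviving_drives * drive_gb * 1e9 * 8
    # 1 - (1 - 1/N)^bits_read, computed via log1p/expm1 for accuracy
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))


for drive_gb in (146, 200, 500):
    p = p_read_error_during_rebuild(surviving_drives=4, drive_gb=drive_gb)
    print(f"4+1 group of {drive_gb} GB drives: {p:.1%} chance of a read error")
```

Under these assumptions the risk grows with both drive capacity and group width, which is the trend behind the argument for a second parity drive as spindles get denser.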
Note that leading the industry in terms of transparency on the “inconvenient truth” of RAID 5 today required some sacrifices on NetApp’s behalf the past few years. Our sales force keeps reminding me that NetApp doesn’t win many brownie points among the uninitiated RFP writers out there in customer-land who are scoping out conventional RAID 0/1/5 solutions. Instead of coming across as self-serving scare-mongers when explaining “RAID 5 is not enough”, we at NetApp hope broader coverage of this important issue will help storage customers make more informed and safer array purchasing decisions.
As is clear from the length of this response and record number of blog comments on related posts here at StorageMojo, this is a rich, deep and often esoteric topic with many nuances. Many readers who make a living higher up the storage stack at the host / server or application level may wonder whether or how all of this relates to them. Perhaps the best example of that comes from NetApp’s strategic database alliance partner and major customer Oracle Corp.
Having been in this business a long time and learning some hard storage lessons of their own, Oracle developed a storage resiliency certification program actually named H.A.R.D. Very few Oracle certified storage array partners are able to qualify for this exclusive program, and while NetApp is naturally one of them – we are proud to also be the only storage array vendor in this program that offers the highest form of database storage integrity across our entire online storage product line. All other compliant array vendors restrict this to their highest tier of storage available only to their most lucrative customers. Seems kind of unfair and disappointing for the increasing majority of enterprise storage customers considering modular (mid-range) storage arrays in support of Oracle databases and associated (usually mission-critical) applications. For related archive storage requirements, anything less than this high level of data integrity over the long-term would also be considered a major technical, business and legal issue.
Given the complexity involved with the technology behind the findings raised by the Google and CMU researchers lately, perhaps it’s best to close with a quote from the former US President still most closely associated with Hollywood and the Oscars: “Trust – but Verify”.
Val Bercovici (NetApp)
Director, Technical Strategy
Comments welcome, as always. I’ll be a lot more familiar, I hope, with NetApp after attending their analyst event in a couple of weeks. So don’t be shy about giving your view of Val’s response. Moderation on to keep spamsters under control.