As promised in the Open Letter to Seagate, Hitachi, EMC, IBM, NetApp, HP and Sun, StorageMojo is giving this space to NetApp to respond to Everything You Know About Disks Is Wrong and Google’s Disk Failure Experience.
Props to NetApp for being quick off the mark in responding [gee, maybe there is a reason they are the fastest growing large storage vendor] and to NetApp’s Director, Technical Strategy, Val Bercovici, who worked over the weekend to craft a data-rich response. Val’s response is unedited by moi. IMHO, Val has done a good job discussing the issues while keeping self-congratulatory chest beating under control. To improve readability I’ve added some bolded sub-heads in [brackets] which are mine alone.
And all you other guys, getting your lunch eaten by NetApp, the invitation still stands. The NetApp response begins here:
[NetApp feels a bit like Al Gore now]
It’s probably fitting that I’m writing this during the Oscar Awards weekend. Much like the Oscar broadcast itself, this is a long blog post and many overnight sensations will be discussed following the Oscar broadcast. In fact the suddenly hot topic of disk failures and resulting impact on data availability & resiliency might seem like yet another overnight sensation, courtesy of mainstream media coverage such as the “beeb”. However, most professionals in the far less glamorous storage industry know that like all overnight sensations, this one too has actually been many years in the making. Stretching the Oscar theme a bit more (regardless of political affiliation) many of us at NetApp also feel a little bit like Al Gore now. Let me explain …
It may surprise many of those reading StorageMojo (perhaps even the admin himself 🙂) but NetApp is actually *thrilled* at the attention this whole topic is now gaining, and much like Al Gore we feel somewhat vindicated since we’ve been banging this drum for a while. I’ll be addressing all of Robin’s provocative points regarding the credibility of the storage industry (specifically drive & array manufacturers) below, but a little bit of NetApp history in this regard will add important context to my response.
[Disrupting ourselves by cannibalizing FC disk storage sales]
Back around the year 2000, NetApp’s thought leaders observed that the gap in density between consumer-oriented drives (then known mostly as IDE, today as xATA) and enterprise drives (SCSI & FC) was becoming too big to ignore. It was clear to us that Clayton Christensen’s “good enough” principle from his seminal disruptive technology work would apply in this case. So our choice came down to either disrupting ourselves by cannibalizing some of our FC disk storage sales with lower revenue per-capita ATA drives, or watching somebody else do it to us. I’m glad we phrased it that way internally since it became an easy choice in hindsight.
That decision prompted NetApp to release in 2001 the first enterprise-class storage array based on ATA technology at new price points previously unavailable to the online storage market. The NetApp NearStore thus created the “Nearline storage” market segment. Little did we know that would thrust us into a virtuous circle where we also learned some hard lessons. The innovation we applied to overcoming those lessons learned has directly contributed to our dominance of the Nearline storage market, as well as ultimate industry capacity leadership in the overall “networked storage” category tracked by IDC. Yes that means today we ship more array-based FC & ATA disk capacity than EMC, HP, Sun and our OEM partner IBM as listed here in StorageMojo’s open letter. That key statistic helps add unmatched credibility to our responses surrounding this issue and the specific points raised below.
FAST (pardon the pun) forward to 2007 and the Google & CMU studies, resulting IT media / blogosphere coverage, plus resulting StorageMojo open letter. Let’s review the key points raised:
1. Failure rates are several times higher than reported by drive companies
Most mature storage array vendors already know this and devote serious engineering, disk qualification / testing and field support resources to mitigating the resulting customer risk. Conversely, most experienced storage array customers have learned to equate the accuracy of quoted drive failure specs to the MPG estimates reported by car manufacturers. A classic case of YMMV and often will if you deploy these disks in anything but the mildest of eval / demo lab environments.
2. Actual MTBFs (or AFRs) of “enterprise” and “consumer” drives are pretty much the same
This tidbit known mostly to industry insiders is largely true, especially when comparing comparable drive sizes. But how storage arrays handle the respective drive type failures is what continues to perpetuate the customer perception that more expensive drives should be more reliable. One of the storage industry’s dirty secrets is that most enterprise and consumer drives are made up of largely the same components. However, their external interfaces (FC, SCSI, SAS or SATA) and most importantly their respective firmware design priorities / resulting goals play a huge role in determining enterprise vs. consumer drive behavior in the real world.
[Firmware more reliable than people – good]
Considering the awe-inspiring areal density of the platters themselves, combined with drive firmware size and complexity rivaling entire operating systems of a few years ago, NetApp’s storage subsystem team considers contemporary disk drives “miracles of modern engineering”. In fact, that resulting firmware size and complexity is beginning to resemble the anthropological and demographic behaviors of human beings themselves!
To wit, consumer-class drives’ personality is determined by firmware that assumes the drive is isolated inside a laptop or desktop and cannot rely upon parity information stored on adjacent (peer) drives to recover from a partial or full error condition. Consequently, consumer drives exhibit “Type A” personalities as they heroically go offline for non-deterministic periods of time (a few seconds or many minutes), to “take charge” of the situation and perform various pre-programmed techniques attempting to resolve bad blocks, media / checksum errors, etc… Unpredictable and non-deterministic timeouts during these occurrences inside storage arrays can present some challenging circumstances to array designers – yet as one can easily see the end result will not always be a “failed” drive. Safe & efficient approaches to handling this situation without disruption and often without even physical removal of the drive itself, is one of the innovations NetApp delivers with our “Maintenance Center” suite of disk resiliency technologies which I cover in a bit more detail responding to the next point below.
OTOH enterprise-class drives exhibit markedly different group dynamics since their firmware makes the assumption that they are usually deployed as a member of a RAID set – and should consequently defer to their peers for mirror or parity-based recalculation / recovery during the same set of error conditions cited for the consumer drives above. That makes for much more deterministic behavior which has historically enabled storage arrays using exclusively enterprise-class drives to compensate in a much more consistent and predictable manner when drives fail. The makers of enterprise-class storage arrays now face some daunting challenges as they incorporate consumer-class drives while maintaining the same historical service levels. A quick scan of various enterprise storage vendors’ spec sheets quickly reveals which ones have risen to the engineering challenge with native SATA support vs those that have punted responsibility back onto the drive manufacturers themselves by using lower volume (higher price / not consumer-class) hybrid drives known as FATA or LC-FC. 🙂
FYI – NetApp storage engineering has actually moved on to tackle even more challenging consequences of today’s popular & complex dense disk drives. Many Enterprise storage arrays (and increasingly popular filesystems such as ZFS) have evolved sophisticated checksumming algorithms to verify the correctness of normal read operations. Some array vendors go the extra mile to continuously monitor such checksums on data that is not normally (or perhaps never) read after it is initially written. To the best of our knowledge, NetApp is the only array vendor to take the final step and check for the incident known as a “lost write” which conventional checksumming approaches do not (yet) catch. The risks of silent data corruption loom large in any filesystem, disk drive or storage array which does not account for the potential of “lost writes”.
3. SMART is not a reliable predictor of drive failure
We believe this is one of the most tangible points that separates the “men from the boys” in this industry. Few if any of the storage newcomers in this market have endured the real-world field experiences required to come to this difficult realization and make the necessary engineering & support investments to compensate. NetApp considers our solutions in this regard a distinct competitive advantage, so we’ve explicitly decided to drive public industry discussion of this issue. Forums such as the FAST conference have played a key role. Both of this year’s Google and CMU studies refer to seminal NetApp work in this area, and just like the CMU paper here in 2007, NetApp’s RAID-DP won “Best Paper” at the FAST ’04 Conference. Yet as pointed out correctly on this blog RAID-DP (a performance-optimized variety of SNIA-defined RAID 6) is merely a key part of the protection spectrum against this issue, not all of it.
Quick backgrounder substantiating our position – NetApp shipped over 104 PB (petabytes) of capacity during our last reported quarter (ending in Jan 2007). Since we didn’t publicly disclose the number of spindles that equates to, I’ll do some back-of-the-napkin math blending the 500, 300 & 146GB spindle varieties to arrive at a rough average of over 150,000 spindles per month, which by itself every month is well above the total number of drives covered in each of the cited Google & CMU studies.
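As a sanity check on that napkin math, here is a minimal sketch. The 104 PB per quarter, the three drive sizes and the 150,000 figure come from the text above; the use of decimal units (1 PB = 1,000,000 GB) and the function names are assumptions made only for illustration, and the actual product mix is not disclosed.

```python
# Sanity check of the spindle estimate.
# From the text: 104 PB shipped in one quarter, drives of 500, 300 and 146 GB,
# and a claimed average of 150,000 spindles per month.
# Assumption: decimal units, 1 PB = 1,000,000 GB.

PB_PER_QUARTER = 104
DRIVE_SIZES_GB = [500, 300, 146]
CLAIMED_SPINDLES_PER_MONTH = 150_000

gb_per_month = PB_PER_QUARTER * 1_000_000 / 3

def spindles_if_all(size_gb):
    """Spindles per month if every drive shipped had this capacity."""
    return gb_per_month / size_gb

# Bounds: all-largest drives give the fewest spindles, all-smallest the most.
fewest = spindles_if_all(max(DRIVE_SIZES_GB))
most = spindles_if_all(min(DRIVE_SIZES_GB))

# Average capacity per spindle implied by the claimed monthly count.
implied_avg_gb = gb_per_month / CLAIMED_SPINDLES_PER_MONTH

print(f"between {fewest:,.0f} and {most:,.0f} spindles per month")
print(f"150,000 per month implies about {implied_avg_gb:.0f} GB per spindle")
```

The implied average lands between the smallest and largest drive sizes, so the 150,000 figure is at least consistent with a blend weighted toward the smaller drives.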
[Making SMART smarter]
Much like Google, NetApp has accumulated over the years a massive data warehouse of real-world drive behavior but under a much broader range of production deployment environments and configurations. We track drive ongoing behavior reported on a weekly basis during normal working states as well as in an event-driven manner during the various stages of drive failure. That has enabled us to surround the basic SMART information provided by the drive manufacturers with a comprehensive set of technologies branded as “Maintenance Center” (introduced in my response above) which enable NetApp arrays to take highly safe, accurate, granular and efficient predictive actions described in the response to the next point below.
4. Drive failure rates rise steadily with age rather than staying flat through some n-year mark
The relatively controlled sample sets of the Google & CMU studies enable them to arrive at more specific conclusions than NetApp has observed. OTOH since NetApp will soon ship more drives per month than both of those studies’ multi-year sample sets combined (to a much broader set of production deployment environments) we have learned that the actual list of possible reasons behind drive failures gets longer with the introduction of each new drive model. Consequently there are many best-practices we recommend to storage array administrators, which are derived from the consistent set of resiliency features we supply as a vendor of storage arrays to small, medium and large-sized organizations of all kinds.
If there’s one thing we’ve learned as a result of the massive real-world drive behavior data warehouse we’ve accumulated – it’s that there’s no simple pattern to predict when a drive will fail. But by far our most significant discovery is that drive failures are actually no longer the simple atomic and persistent occurrences they used to be a few short years ago. There are in fact many circumstances not restricted to age, environmentals (NVH), power & cooling, or even electro-mechanical behavior of drive peers within the same array, which can render a drive unusable – and eventually failed. One of the most fascinating Oscar-worthy plot-twists that we’ve uncovered as a result of our vast experience is that drives can also come back from the dead to lead very normal and productive lives! Industry-leading innovation we’ve been shipping with NetApp Maintenance Center allows a NetApp array to use algorithms derived from our aforementioned data warehouse to take intelligent proactive actions such as:
- Predict which drive(s) are likely to fail (using advanced algorithms based on our vast data warehouse).
- Copy readable data directly from the failing spindle onto a global hot spare without parity reconstruct overhead.
- Use RAID-DP parity to calculate the remaining subset of unreadable data (usually a very small percentage of the overall drive).
- Take the suspected “failed” drive offline (while physically maintaining it in the array) and probe said drive with low-level diagnostics to determine whether the failure was transitory or truly and permanently fatal.
- Return fixed drives which exhibited only single-instances of transitory errors back to the global hot spare pool.
Although we’ve only been collecting statistics on the advanced Maintenance Center functionality for about a year now, our assumptions have been validated in that the vast majority of “failed” drives only exhibit isolated incidents of transitory errors and can safely remain in the array while rejoining the spares pool. It should be noted that these drives don’t get a second chance at a second life :-). Should those same drives fail again in any manner, they are marked for physical removal from the array and considered permanently failed.
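The "one second chance" policy described above can be sketched as a small state machine. This is only an illustration under stated assumptions, not the Maintenance Center implementation: the state names, the `Drive` class and the `diagnostics_pass` callback are all hypothetical, and the data copy and parity reconstruction steps are left out.

```python
from enum import Enum, auto

class DriveState(Enum):
    IN_SERVICE = auto()
    UNDER_TEST = auto()   # offline, but still physically in the array
    SPARE = auto()        # returned to the global hot spare pool
    FAILED = auto()       # marked for physical removal

class Drive:
    def __init__(self, name):
        self.name = name
        self.state = DriveState.IN_SERVICE
        self.strikes = 0  # how many times this drive has been suspected

def handle_suspect(drive, diagnostics_pass):
    """Apply the one-second-chance policy to a suspected drive.

    diagnostics_pass is a hypothetical callable that returns True when
    low-level diagnostics judge the error to have been transitory.
    """
    drive.strikes += 1
    if drive.strikes > 1:
        # A drive that already used its second chance is failed for good.
        drive.state = DriveState.FAILED
        return drive.state
    drive.state = DriveState.UNDER_TEST
    if diagnostics_pass(drive):
        drive.state = DriveState.SPARE
    else:
        drive.state = DriveState.FAILED
    return drive.state

# A drive with a transitory error rejoins the spares pool once, then is failed.
flaky = Drive("example-drive")
first = handle_suspect(flaky, lambda d: True)
second = handle_suspect(flaky, lambda d: True)

# A drive that fails diagnostics is failed immediately.
dead = Drive("other-drive")
third = handle_suspect(dead, lambda d: False)
```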
[If it ain’t broke . . . ]
There are of course many net positive tangible NVH & electro-mechanical advantages in avoiding physical drive removal events from any storage array, which contribute to a different kind of NetApp virtuous circle around overall storage system RAS. Notably, an indirect yet significant benefit of more granular and intelligent drive failure management afforded by NetApp Maintenance Center is improved supply chain efficiencies NetApp customers enjoy. This comes as a result of the reduction in the expensive cycle of drive removal, RMA administrative processing & shipment, plus drive replacement shipping, handling & asset management.
5. Array disk failures are highly correlated, making RAID 5 two to four times less safe than assumed
This is an excellent final point. For readers that made it this far, there’s one takeaway I hope everyone remembers from this discussion. Given the realities of today’s drives (plus all the trends indicating what we can expect from electro-mechanical storage devices in the near future) – protecting online data only via RAID 5 today verges on professional malpractice.
That’s a deliberately strong and provocative statement. I use it often to raise awareness of this very real industry issue and when outlining NetApp competitive advantages such as RAID-DP & Maintenance Center in this regard. Apart from more capacity efficiency than both RAID 5 (in typical 3+1 or 4+1 best-practice configs) and RAID 1/0, RAID-DP (or any sturdy variety of RAID 6) is also becoming a necessary complement to the increasingly dense spindles most organizations are pressured to purchase for financial reasons. Patterson, Gibson and Katz defined some excellent RAID levels with their seminal work based on spindle realities of the eighties. 20 years later it’s time to retire those legacy RAID levels and define and implement modern ones which address the realities of contemporary drive technology in the 21st century.
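To make the capacity efficiency comparison concrete, here is a minimal sketch. The 3+1 and 4+1 RAID 5 layouts and RAID 1/0 come from the text; the 14+2 double-parity group size is an assumption chosen only for illustration.

```python
def usable_fraction(data_drives, redundancy_drives):
    """Fraction of raw capacity left for data in one RAID group."""
    return data_drives / (data_drives + redundancy_drives)

layouts = {
    "RAID 5 (3+1)": usable_fraction(3, 1),
    "RAID 5 (4+1)": usable_fraction(4, 1),
    "RAID 1/0 (mirrored pairs)": usable_fraction(1, 1),
    # Assumption: a 14+2 double-parity group; other group sizes are possible.
    "RAID 6 / RAID-DP (14+2, assumed)": usable_fraction(14, 2),
}

for name, fraction in layouts.items():
    print(f"{name}: {fraction:.1%} usable")
```

Note that the advantage depends on group width: with the same 4 data drives, double parity (4+2) leaves less usable capacity than RAID 5 (4+1). Double parity only comes out ahead of a 4+1 group once the group holds more than 8 data drives.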
[RAID 5 today verges on professional malpractice]
In more conservative, controversy-phobic settings one can tone down the rhetoric and merely refer to the copious 3rd party evidence we cite in this regard, including (but certainly not restricted to):
- Enthusiast-oriented reports such as AnandTech, quoting an “8% chance of complete data loss using RAID 5 with 200GB spindles”
- Seagate & Microsoft’s WinHEC 05 Presentation (SATA in the Enterprise) – “Call to Action: Use (only) RAID 1 or RAID 6 in SATA Array”
- IBM Research in Almaden (S. R. Hetzler, IBM Fellow) quoting a controlled study of large capacity drives “With only 2 9’s reliability – RAID 5 is insufficient with SATA”
- The “father of DEC StorageWorks” (now HP EVA) quoting that “If you have one petabyte of desktop drives with RAID 5, you could lose data twice a year”
Note that leading the industry in terms of transparency on the “inconvenient truth” of RAID 5 today required some sacrifices on NetApp’s behalf the past few years. Our sales force keeps reminding me that NetApp doesn’t win many brownie points among the uninitiated RFP writers out there in customer-land who are scoping out conventional RAID 0/1/5 solutions. Instead of coming across as self-serving scare-mongers when explaining “RAID 5 is not enough”, we at NetApp hope broader coverage of this important issue will help storage customers make more informed and safer array purchasing decisions.
As is clear by the length of this response and record number of blog comments on related posts here at StorageMojo, this is a rich and deep often esoteric topic with many nuances. Many readers who make a living higher up the storage stack at the host / server or application level may wonder whether or how all of this relates to them? Perhaps the best example of that comes from NetApp’s strategic database alliance partner and major customer Oracle Corp.
Having been in this business a long time and learning some hard storage lessons of their own, Oracle developed a storage resiliency certification program actually named H.A.R.D. Very few Oracle certified storage array partners are able to qualify for this exclusive program, and while NetApp is naturally one of them – we are proud to also be the only storage array vendor in this program that offers the highest form of database storage integrity across our entire online storage product line. All other compliant array vendors restrict this to their highest tier of storage available only to their most lucrative customers. Seems kind of unfair and disappointing for the increasing majority of enterprise storage customers considering modular (mid-range) storage arrays in support of Oracle databases and associated (usually mission-critical) applications. For related archive storage requirements, anything less than this high level of data integrity over the long-term would also be considered a major technical, business and legal issue.
Given the complexity involved with the technology behind the findings raised by the Google and CMU researchers lately, perhaps it’s best to close with a quote from the former US President still most closely associated with Hollywood and the Oscars “Trust – but Verify” 🙂
Val Bercovici (NetApp)
Director, Technical Strategy
Comments welcome, as always. I’ll be a lot more familiar, I hope, with NetApp after attending their analyst event in a couple of weeks. So don’t be shy about giving your view of Val’s response. Moderation on to keep spamsters under control.
RAID is NOT a backup, never was, never will be.
No backup is complete until there are two copies, one of which is off site.
I have cross-linked to this post, Mojo. While I think that the drive issues you bring up have been discussed for some time (by me at least), Val’s comments are very interesting and fraught with satirical opportunity.
I have a question for Mojo readers about this RAID 5 issue.
Would RAID 5 be acceptable IF:
You never had more than 6 drives in a RAID-5 set and you had 2 hot spares dedicated to the 6 drive set?
I am not asking whether this makes financial sense, only if it would make one more comfortable with using RAID-5.
I’ll review Val’s comments at a later date.
Dale, I’m not sure I get where you are going with this question. FWIW, reducing the number of drives in a RAID 5 set reduces the AFR, so that’s good. Not sure what the benefit of a second hot spare would be. The risk is losing a second drive before the rebuild completes, which a second hot spare doesn’t help. Maybe you have something else in mind though.
John, thanks for the link. I’ve talked to a number of people who say DrunkenData and StorageMojo are the two storage blogs they read regularly.
Norm, you are correct. There is no substitute for offsite backup.
Dale, RAID-5 is fine for protecting against a single drive failure. And even though multi-drive failures are correlated, the chance of two drive failures is still quite low. The problem is that bit errors during array rebuild after a drive failure are VERY likely, especially with big arrays of large 1TB drives. These bit errors cause data loss because the failed drive can’t be rebuilt at that sector. Additional hot spares won’t help this at all. The easiest way to protect against bit errors during rebuild is RAID-6.
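To put a rough number on that, here is a minimal sketch for the 6-drive set Dale asked about, using 1TB drives. The unrecoverable bit error rate of one error per 10^14 bits read is an assumption (a figure often quoted on consumer drive spec sheets, not something stated in this thread), and the model treats bit errors as independent, which real drives need not obey.

```python
import math

def p_read_error_during_rebuild(surviving_drives, drive_bytes, bit_error_rate):
    """Chance of at least one unrecoverable bit error while reading every
    surviving drive in full, assuming independent errors per bit."""
    bits_read = surviving_drives * drive_bytes * 8
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits_read * math.log1p(-bit_error_rate))

ASSUMED_BER = 1e-14   # assumption: one unrecoverable error per 1e14 bits read
ONE_TB = 10**12       # decimal terabyte, in bytes

# 6-drive RAID-5 set: after one failure, the other 5 drives are read in full.
p_six = p_read_error_during_rebuild(5, ONE_TB, ASSUMED_BER)

# A smaller 4-drive set reads less data, so the risk is lower but not small.
p_four = p_read_error_during_rebuild(3, ONE_TB, ASSUMED_BER)

print(f"6-drive set: {p_six:.0%} chance of a bit error during rebuild")
print(f"4-drive set: {p_four:.0%} chance of a bit error during rebuild")
```

Under those assumptions the 6-drive case comes out at roughly one rebuild in three hitting an unreadable sector, which is why extra hot spares don't change the picture while a second parity drive does.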
With respect to RAID-6, one thought that comes to mind is, if you need RAID-6-level availability, is it not likely that you also need a remote copy to survive a whole-site-disaster (or just whole-site outage)?
In any event, in any situation where you *do* need a remote copy, then that gives you an additional copy of the data that will allow primary-site RAID-5 to recover from encountering a bad sector during a rebuild, or even a second disk failure (the remote copy could either be a disk-level mirror of the primary site’s disks or an unRAIDed logical copy of the data). Of course, you need a RAID mechanism that understands how to incorporate this additional copy, or at least how to ask for higher-level help to do so.
Val’s response was a very worthwhile read, but I think does contain one error: ZFS’s ‘checksum-in-parent’ approach should definitely detect and allow recovery from lost writes.
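A toy model shows the difference. This is a sketch only, not ZFS's on-disk format: the `Disk` class and the `lose` flag are hypothetical, and CRC32 stands in for whatever checksum a real filesystem uses. A checksum stored alongside the block still matches when the write never reached the media, while a checksum held by the parent does not.

```python
import zlib

def checksum(data: bytes) -> int:
    # CRC32 is only a stand-in for a real filesystem checksum.
    return zlib.crc32(data)

class Disk:
    """Toy disk: each block stores its data plus a checksum of that data."""

    def __init__(self):
        self.blocks = {}

    def write(self, addr, data, lose=False):
        if lose:
            # Simulated lost write: the drive acknowledges the write
            # but the media still holds the old block.
            return
        self.blocks[addr] = (data, checksum(data))

    def read(self, addr):
        return self.blocks[addr]

disk = Disk()
disk.write(7, b"old contents")

# The block is updated and the parent records the new checksum,
# but the write itself is lost on its way to the media.
disk.write(7, b"new contents", lose=True)
parent_checksum = checksum(b"new contents")

data, stored_with_block = disk.read(7)

# The stale block is self-consistent, so a block-local checksum passes.
self_check_ok = checksum(data) == stored_with_block

# The parent expects the new contents, so the mismatch is detected.
parent_check_ok = checksum(data) == parent_checksum

print("block-local checksum passes:", self_check_ok)
print("parent checksum passes:", parent_check_ok)
```

Detection is only half the story: recovery still needs a good copy from a mirror or from parity.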
I find this entire thread quite disturbing. I would say that Val has raised a few good points, but come on, how can anyone say that there is not a LOT of vendor bias in his post. I only glanced through this, but have a few comments to share.
1) consumer and enterprise drives have a lot more differentiation than claimed. Do you believe that the motors, bearings, and actuators are the same in these drives? They are not. They are in fact considerably different in design and reliability. I agree that the media (platters) is the same, but even they are not tested in the QA phase before shipment in the same manner. More time and tougher standards are used for enterprise drives.
2) I agree that failure rates are higher than one would expect reading MTBF figures. It might be worth noting that MTBF values for consumer drives are usually reported with a duty-cycle of 20%, meaning the drives are not spun-up 80% of every day (saving motor and bearings as much wear and tear as possible). Enterprise drives are in fact normally reported with duty cycle at 100%, meaning they are spun up 24 hours per day. I am not saying that either live up to their MTBF claims, and find that Vendor A has a good run for a year or two, then produces a bunch of duds, Vendor B the same, etc. Trying to determine who is producing good 400GB drives or 750GB drives at any point in time is quite a challenge.
3) RAID-5 – Most of the article, and some of the comments, talk about how RAID-5 is not a good thing today. My experience is that most failures and lost data, or even loss of access to data, are due to other factors, not classic dual-drive failures in the same parity group. This is a very complex subject, but most customers I know rely on RAID-5 for very high availability, but use other technologies including real-time data replication, backup/recovery, and transaction replication, to allow for recovery in those unfortunate times where an array does fail. I do feel it much more important to mention that most failures that cause data loss or loss of access are not the result of drive failures, but of the firmware in the arrays that is supposed to handle the first failure. It is easy to implement RAID-5 today, but some vendors have much more sophisticated implementations of RAID-5. Case in point – One array vendor I know well has an implementation where a drive that has reached some threshold for recoverable errors will begin a recovery, but will not use RAID to rebuild. It is much faster and safer to just copy that drive to a spare before it does have hard errors, then flag it as bad, and go back into normal operation. If another drive has the unfortunate luck to ever have a hard error during this recovery, no problem, the first drive that was being copied is still online, and the parity group can recover using parity from the other drives to recover the second, hard failed drive. This is just one very isolated example of how systems vendors implement very sophisticated methods to make RAID-5 “robust”. Not all vendors do this to the same level, and you only need to ask experienced storage admins to know which vendors do this type of work well, and which are not so well respected.
4) Many data loss situations, and loss of access to data, rest squarely on our shoulders, the users of these technologies. Do you keep your storage farm up-to-date on each level of microcode / firmware that is released? Leading edge or bleeding edge, another topic to discuss, but if you are more than 6 months back and are thinking “if it ain’t broke, don’t fix it” you are a fool. There are hundreds of data integrity bugs fixed in most major firmware releases from the storage vendors listed (usually once every 4 to 12 weeks), and to not apply this new firmware is what is malpractice in my view. Also, do you monitor the heat inside the data center, or in the specific area of the datacenter where your arrays are located? What is the temperature inside the arrays themselves, and how stable is the temperature? Storage arrays hate heat, and they hate temperature fluctuation. This can have a major impact on MTBF observed in your shop!!! And guess what, you are in control of both of these variables.
5) Of the vendors you list, only Seagate and Hitachi actually build disk drives. The rest (including NetApp) all just buy drives from disk manufacturers to use in their storage arrays. They can spin and FUD all they want, and talk about how they influence the specs or designs by these vendors, but that is hogwash. I would venture to say that both Seagate and Hitachi (and Maxtor and Western Digital and Samsung, etc.) all know more about drive MTBF than any systems vendor. Yes, that is only part of the story, and I would also say that storage systems vendors (including NetApp) know more about array firmware than the drive manufacturers.
6) The comment that native support for SATA is better than FATA or other implementations is not backed up by anything more than chest-thumping. Just to make my point, what if vendor-X can produce a more cost effective and reliable storage solution by using FATA instead of SATA, because it requires less re-design and coding (which costs lots of money) than using SATA. Is that bad? I don’t think users should care at the end of the day if they are using SATA or chicken-feed to store the data, as long as it works, reliably, with performance, etc.
Anyway, an interesting thread, but I don’t agree that the chest-thumping is under control in this post, though it does raise some good points and dialog. I would not rate NetApp as the vendor I have been most impressed with in terms of data protection capability, reliability, or field experience, but do respect their products and ability in the market.
I do wish I had more time to really read this article closely, and post a more complete reply.
Can anyone point to a URL where NetApp’s response appears uncommented? Maybe a NetApp URL?
Nestorguy – I’m plagued by continuing network problems so I can’t comment in detail. However, there may be the differences between enterprise and consumer drives that you suggest, but if they aren’t reflected in the AFRs then the real differentiation is strictly performance and cost. I think that would reduce the premium vendors could get for the drives.
James3678, AFAIK the NetApp response is a StorageMojo exclusive. Val is welcome to post it elsewhere if he likes.
Well… interesting stuff – all. Perhaps the most poignant phrase in all of this is YMMV… Where any given storage vendor has zealots, they also have detractors – nay – haters. My philosophy has and always will be… they all suck to a degree – and our job on the “consumer” side of this equation (being IT professionals) is to choose the ones that have proven their ability to recover and maintain our data’s availability (or “suck less”).
We have chosen a hybrid approach to availability that includes the best RAID (6) technology for large disks – combined with off-platform replication. I would love to see someone come out with a replication standard that allows rep-sets between different vendors’ arrays, etc – but there’s no vendor impetus to do such a thing. Why enable interoperability when you can enforce lock-in? I know there are tools out there, but how cool would it be to SRDF to a NetApp and Snap-Mirror that to a Hitachi? True data mobility – but there’s another post and subsequent 200 replies.
Replication plus diligent and appropriate RAID technologies (regardless of who the vendor is) protect better than any single vendor’s claim that they do “this” better than the next guy. NetApp’s approach (IMO) is blatant honesty (most times) which lends itself well to their credibility, and is reflected by their substantial footprint in our environment. When your disk vendor says “you need amount of disk space to accommodate – and here are the exact reasons why” – that goes miles. When they tell you that you’re wasting money on to their own detriment (from a sales perspective) – they’re a true partner.
~~ YMMV ~~