EMC: “These Aren’t The Droids You’re Looking For”

by Robin Harris | Thursday, March 1, 2007 | Enterprise | 15 comments

You can go about your business
Chuck Hollis of EMC took the challenge to respond to the Open Letter about the big differences between drive life specs and what research on 200,000 drives found (see Everything You Know About Disks Is Wrong and Google’s Disk Failure Experience).

Notable in the EMC response: They didn’t deny that the research is correct.

The Force is strong with this one
Chuck’s response is both hilarious and remarkably wide of the mark at the same time. In a parody of the blog-style, Chuck does a “gee-whiz, what’s all the fuss about” schtick that, if it didn’t reflect EMC’s marketing competence, would be merely goofy. Like this bit:

Maybe you saw the interesting white paper from a team at Google.

They tracked a population of disk drives over a period of five years, and concluded “hey, the data doesn’t really match up to what we might have thought”.

Fair enough.

And then the blogging started. Responses to responses. Vendor posturing.

Many of us took a look at this and thought “sheesh, what’s the big deal?”

Fair enough. Here’s the big deal.
First, there were two studies. One from Google, one from Carnegie-Mellon University. Serious techies. Key findings from the two papers – and for Chuck’s benefit a Venn diagram of their topics doesn’t overlap 100% – were

Disk drives have a field failure rate 2-4 times the vendors’ spec.
Reliability of cheap “consumer” drives and expensive “enterprise” drives is about the same, despite several hundred thousand hour differences in their MTBFs.
Drive failures are highly correlated, violating a chief assumption behind the data security of RAID systems.
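
For anyone who wants to put numbers on that gap, here is a minimal sketch of the usual way an MTBF spec is turned into an annualized failure rate (AFR), assuming a constant failure rate. The two MTBF values are hypothetical, picked only to illustrate a difference of several hundred thousand hours; they are not taken from any vendor's datasheet or from either paper.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def afr_from_mtbf(mtbf_hours: float) -> float:
    """AFR implied by an MTBF, assuming a constant (exponential) failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# Hypothetical datasheet MTBFs, for illustration only.
for label, mtbf in [("consumer", 1_000_000), ("enterprise", 1_400_000)]:
    spec = afr_from_mtbf(mtbf)
    # The studies report field failure rates of 2 to 4 times the spec.
    print(f"{label}: spec AFR {spec:.2%}, field range {2 * spec:.2%} to {4 * spec:.2%}")
```

Under that assumption both hypothetical drives come out under 1% AFR on paper, and the few-hundred-thousand-hour MTBF difference between them is small next to a 2-4x miss in the field.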

Or to put it in the context of Chuck’s response:
“hey, the data [100,000 drives] doesn’t really match up to what we might have thought [based on the vendor’s specs]”.
Chuck’s response: Hey, stuff happens. Who knew? Who cares? Yawn.

But sir, nobody worries about upsetting a droid.
I asked the array companies to respond because I thought that with the millions of drives they buy each year and their field service experience they could offer unique insight into the validity of the two studies. I even offered a marketing line that I thought that EMC would find attractive:

These academic studies may reflect the conditions seen in these point-off-the-enterprise-curve installations, but thanks to our superior supply-chain management, manufacturing, test, burn-in and skilled field service we’ve never observed these effects. Here to give an in-depth review of our service experience is our director of field service engineering. Thank you for giving us the opportunity to highlight our operational superiority.

That’s ’cause droids don’t pull people’s arms out of their sockets when they lose.
Note, of course, this means denying that the company has seen these effects. And that’s the rub, isn’t it. Because if you have seen these effects, and you haven’t communicated them to customers, at least through Sales Engineers, then you are at least a tiny bit complicit with the fictions the drive vendors are peddling. If you haven’t seen these effects, then why wouldn’t you just step up and say, “hogwash!”

I used to bullseye womp rats in my T-16 back home.
When I wrote to Chuck, asking for a response, I said “This is an opportunity for EMC to take a leadership role in sorting this out.” EMC’s response: “we’ll pass.”

You will never find a more wretched hive of scum and villainy. We must be cautious.
Chuck goes on to suggest that one of the more “strident” blogs may have:

. . . had their pattern recognition circuitry turned up a bit too high. Either that, or they thought that by being controversial, they could increase their presence in the community. . . .

Do I think there is a conspiracy among vendors to mislead the public?

Don’t be ridiculous.

You guys are giving us way too much credit here.

The StorageMojo take
There are two sides to every story, but only one set of facts. I hoped that EMC would have offered some – any – in their response.

I’ve spent most of my working life in large companies and I respect what they can accomplish. I also have a well-honed appreciation of their many failings, how group dynamics can trump even the best intentions, such as these from EMC’s website:

We pride ourselves on doing what’s right and on putting our customers’ best interests first. We lead change and change to lead. We are devoted to advancing our people, customers, industry, and community. We say what we mean and do what we say.

Chuck, granted, you didn’t have much time to respond. But you and I both know that there are people inside EMC who know the answers to the questions these studies have raised. So here’s a suggestion: go and get the data and then respond. I’m sure that many EMC customers would appreciate the effort.

Update: Chuck kindly wrote me to assure me that he was NOT responding to the Open Letter:

I was responding to the many, many bloggers who’ve commented on the topic, and not you personally or specifically.

Thanks for clarifying that, Chuck. I stand corrected.

Update II: I try to reply to comments, such as the excellent ones this post received, and I realized as I did that my confusion about whether Chuck was responding to me or not was certainly understandable, since in an email to me he said

Hi Robin,

Had a chance to review all of the posts, the original white paper, etc. and I’ve responded from a personal perspective here: [url]

Make of it what you will.

Comments welcome, from one and all, in agreement or not. Moderation turned on because moderation is a virtue, except in the defense of liberty.

15 Comments

PJ on Friday, 2 March, 2007 at 11:22 am

I suspect that all EMC can do is shrug because to do otherwise means either:

1) admitting that they knew, so they were in fact complicit or
2) admitting that they didn’t know (which I suspect is the real truth), which makes them look bad for not running the real numbers on their own product.

On the surface, it’s lose-lose for them to do either one – if they were smart and had a teeny bit of guts, they’d admit 2) and make running those numbers a standard part of their business and hammer on the drive manufacturers to fix the problem.

OTOH, maybe, since they’re to some extent just middlemen in the actual hard-drive-sales world, they just don’t really care that much.
DJ McFadden on Friday, 2 March, 2007 at 12:58 pm

I had a similar exchange with Mr. Hollis. The exchange was taken offline mainly because Mr. Hollis wanted to keep it that way. My blog reply had a note of sarcasm but wasn’t over-the-top and the following conversation was cordial. I was disappointed that he wouldn’t post it since I thought that was the point of having a blog and exchanging ideas. However, he started to reiterate some factually inaccurate statements on NetApp technology – more emotion than facts – so it’s clear that some EMC ideas still can’t stand the light of day.

The gist of my note was that we have two reports that document what I think is pretty obvious. Disks fail. They fail more than the manufacturers estimate in their spec sheets and they fail increasingly over time. The bigger the disk, the longer the rebuild and the more susceptible we are to a second “failure”, be it bit error or total disk failure. We know all this is going to happen (has happened in both my EMC and NetApp storage). I didn’t consider NetApp’s comments fear-mongering. I would consider it stating the obvious.
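
A rough sketch of that rebuild-window exposure, assuming independent drive failures at a constant rate (an optimistic assumption, since the studies found failures to be correlated). The AFR, the drive count and the rebuild times are hypothetical, and bit errors during the rebuild are not modeled.

```python
import math

HOURS_PER_YEAR = 8760


def p_second_failure(afr: float, surviving_drives: int, rebuild_hours: float) -> float:
    """Chance that at least one surviving drive fails before the rebuild finishes.

    Assumes independent failures at a constant rate derived from the AFR.
    """
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * surviving_drives * rebuild_hours)


# Hypothetical group: 7 surviving drives, 3% observed AFR.
for hours in (6, 24, 48):
    print(f"{hours:>2}h rebuild: {p_second_failure(0.03, 7, hours):.4%}")
```

The absolute numbers depend entirely on the assumed inputs; the point is only that the exposure grows with rebuild time, which grows with disk size.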

Given that this is a statement of the obvious – not detracting from the research reports attempting to find better predictive variables – I made the following analogy for Mr. Hollis. It’s raining outside. I don’t really care how much rain NetApp claims to have seen in their day as long as they are willing to hand me an umbrella. EMC is telling me I shouldn’t worry about it. Take it from a customer – wrong answer.

I can forgive NetApp a bit of chest pounding because they came up with a real answer to a real problem. Congratulations, NetApp – you did your job. You met expectations! But you know what, at least they didn’t put their collective heads in the sand and tell us not to worry about it.

The rest of the exchange went off into tangents that were clearly factually inaccurate on Mr. Hollis’ part. I think he lost sight of the fact that I have both NetApp and EMC on my floor and so once again could empirically test EMC claims. It’s clear that NetApp bothers these guys.
Richard on Friday, 2 March, 2007 at 10:04 pm

Robin,
You pose some excellent questions… but do not like practical answers.

What you have ‘uncovered’ here is already well known to experienced RAID controller manufacturers. Disk figures are much overstated … and this is not going to change… due to rapid product cycles, new disk densities and interface standards and pressure on margins. Nonetheless, their warranty periods are impressive.

Disks are only *one* of the problems … and I tend to agree with Chuck, who does not control the design or manufacturing of disks… but needs to deliver a highly reliable solution. As noted, the overall issue is very complex … no easy solution here.

The only way out is through an *overall* product responsibility, protection provided by well tested controller-level RAID firmware. At that point, the quality of firmware becomes a major issue.

A typical disk drive is driven by up to five complex in-line firmware engines… all from different suppliers, usually untested in operational environment. The RAID controller becomes the “caretaker” of all problems…. end to end. This costs a lot of effort & money.

IMHO, it can take up to 3 years of field testing to deliver reliable, highly available RAID firmware infrastructure. Large development team efforts make this worse.

This is often an “unstated” problem with new startups … i.e. if you don’t have reliable code *now*, it is probably too late, with the best of teams. This, coupled with the ever present need to “differentiate” … is why the storage arena has been (mostly) a “black hole” to VC investments…. and some VCs still don’t understand this.

Also…. for the same reason…. it is clear that you cannot rely on a “commodity” motherboard, running under mostly generic OS type of drivers, to deliver the required level of reliability. This is beginning to happen quite often, for reasons of expediency … where reliability evolves with time, at the end-user’s ‘expense’.

On the other hand, NetApp (or EMC) does not have an exclusive ‘license’ on protective algorithms and did not invent RAID 6 or anything else under discussion. In fact one could argue that …. for a long time… both have ignored RAID 5/6 in preference to mirroring…. for some very self-serving reasons…. and customers ‘bought’ their story.

Also… as I have pointed out in my previous comment, Google should look at the overall MTTF figures, including their overhyped “commodity” hardware, power supplies & cabling… they may conclude that with their volume, a well designed specialized hardware/firmware solution is more cost effective.

One more issue…
The FC disk represents ‘reliability’ and is dual-ported … a very important HA feature for RAID controller manufacturers.
Sadly… what is needed is a dual-ported SATA-priced disk drive….. why is this not available..?

All recently evolved silicon to dual-port SATA disks is another level of unneeded stupidity which adds to the problem. I fail to see why disk manufacturers don’t understand this.
IO Guy on Friday, 2 March, 2007 at 10:58 pm

Just curious but no word back from Seagate or HGST?
Ha ha ha hummmmmm.

What a surprise!!!
Storagezilla on Friday, 2 March, 2007 at 11:26 pm

Hi Robin, when anyone from EMC is replying to you they’ll probably link to you. It’s no fun if the other guy doesn’t know you’ve drafted an answer. 😉

As for EMC’s definitive response to such a thing, that wouldn’t appear on anyone’s blog; EMC blogs are author opinions as per the company blogging policy. Though I’ll admit some opinions carry more weight than others, I’ve noticed that anyone who wants a public hanging appears to treat every utterance or idea on those blogs like it came from Joe Tucci himself, as he descended from the mountain top carrying stone tablets inscribed with the word of God in his hands.

Should EMC choose to respond it would probably appear as a white paper drafted by EMC Engineering with input and data from everyone from QA to the Ph.D.s in the CTO’s office.

As for Mr McFadden: Sir you don’t have a right to be sarcastic, over the top or otherwise, on anyone’s blog except your own. It’s rude when you’re a guest in someone else’s space, which is what you are when you’re leaving a comment. Chances are that the person on the other end has a job and a life to be getting on with and they’re under no real obligation to pay you any mind unless they feel like it. Now I suppose you’re talking about RAID 6. It’s already available on Symmetrix, and where one goes I’m of the opinion that others usually follow.
Robin Harris on Saturday, 3 March, 2007 at 10:48 am

All,

Lot of great comments here. We’re having network problems here at Chez Mojo, so I’ve been a bit cramped for response time.

PJ – Any marketing team worth their salt should be able to figure out five ways to spin this to the benefit of an array company. Disk marketers have a tougher problem!

Perhaps the simplest case would be simply to say something like: “We base our maintenance pricing and service policies on the observed service history of our equipment. We pass on disk manufacturer representations to customers, but our focus is on creating high-performance, high-availability storage solutions. We’ve observed variations in drive AFRs and have engineered our systems accordingly.”

Yet to claim, as Chuck did, that the Google and CMU papers don’t really tell us anything about disk drive AFRs because of many dozens of variables is goofy. If the predicted (vendor AFR) and the observed (field AFR) are in wild disagreement, even after excluding the “no trouble found” drives for 100,000 drives in class A datacenters, then most engineers and scientists would conclude that the problem is in the prediction, not the observation.

DJ – Amen brother. EMC’s corporate persona seems to have a chip on its shoulder. Sure, even paranoids have enemies, but for the most part customers want to feel good about the vendors they’ve spent millions on. So stop being defensive and give them a hand!

Richard – I don’t disagree that RAID engineers and presumably disk engineers knew, but I’ve been in the storage business for over 15 years – and professionally concerned with storage for over 25 years – and I didn’t know that drive AFRs were so much higher than spec’d and that there was so little difference between consumer and enterprise drive AFRs. Maybe I should have spent more time drinking with the drive engineers.

Where I do disagree is with the implicit model of complex software/firmware handling all these issues. I believe that the large storage clusters have demonstrated that instead of managing disks you can manage nodes which include disks using very well wrung out interfaces. It doesn’t matter if the node’s AFR is 25,000 hours or 250,000 hours – you just design to the observed behavior. And the software to do that isn’t nearly so complex because we can rely on the nodes – Byzantine failures aside – to handle/report their failures.

Redundancy is the solution. RAID arrays are one instantiation; storage clusters another. I believe it is clear that the storage cluster model is the coming thing and that storage arrays have fully exploited their technical advantages.

Also, I believe it is the array vendor’s advantage to tell the truth about drive AFR: if drives are less reliable than you think, then you need RAID more and it is more valuable.

Agree: enterprise drives have other features, like dual-porting and higher performance, that are valuable in certain applications. So let the marketing focus on those rather than specious AFRs.

IO guy – I’m sure they’re hoping this just blows over, kind of like Intel’s floating point bug a few years ago. And maybe it will. But the “you can’t trust our specs but you can trust our drives” idea is just wrong. Maybe this will spur IDEMA to quarterback a general re-spec’ing by the industry. No one wants to be first with the “New, Higher AFR” claim.

Storagezilla – great name that, wish I’d thought of it – I’d traded some emails with Chuck and in one of them he said:

Hi Robin,

Had a chance to review all of the posts, the original white paper, etc. and I’ve responded from a personal perspective here:

So I did think it was a “response” from someone at EMC, on an EMC company blog. Further, Chuck noted in his response that he talked to other people at EMC for perspective. So it wasn’t just Chuck’s take. Now whether it went through EMC’s rigorous vetting process I can’t say.

As a general rule I try not to take things personally, so with Chuck’s post I tried to focus on the argument he made and not references to “strident” bloggers and such.

As for the comments about not being sarcastic or over-the-top on anyone else’s blog but your own: I couldn’t disagree more. I strongly prefer issue-oriented discussions, but I’m flexible on how someone chooses to make their points. If you aren’t ready for some sarcasm or worse, don’t start blogging. When I wrote about “25x data compression” almost a year ago I got called some awful things by arrogant techno-twerps who didn’t know what they were talking about. But hey, it’s only words. How you choose to react is up to you.

I don’t usually write about the Meaning of Blogging, but I may yet. I think the core of it is that even in cyberspace, we hunger for authentic human contact, the feeling that somewhere out there is another human being who is good company with stimulating information and opinions. Blogs by marketers usually lack this feeling, while most engineers come across as real, even when they are closely associated with a particular product or architecture that they advocate. EMC as a corporation isn’t comfortable with dialogue, while NetApp clearly is.

I wouldn’t go so far as to claim that either has an impact on stock price, but I have observed that companies that people like tend to do better than companies people don’t like. Personally, I believe that Tucci’s time at EMC is over – he’s a turn-around guy and he’s done that – and that EMC needs a thorough re-thinking, ala Gerstner at IBM, and a return to fundamentals. The fundamentals of today, not 15 years ago.

Robin
Alastair McLeod on Sunday, 4 March, 2007 at 9:18 am

Is it just me or has no one else spotted the obvious here? The crucial difference between the AFR quoted by drive manufacturers and that experienced by users in the field is that manufacturers test single drives in isolation, but in the real world drives are used in arrays. The transmission of vibrations from one disk to another is a major cause of wear and increased stress on a disk.

This is backed up by a White Paper and patent application from Western Digital (Rotary Acceleration Feed Forward, or RAFF) which attempts to address the problem. Further, this could also explain why many “failed” disks appear to be OK when tested offline – they fail in an array, but then are OK when tested on their own. This would also explain the findings in the Google paper which referred to failed but “tested OK” drives being very likely to fail again when re-deployed.

The other fundamental observation I would make is that all forms of RAID are designed to cope with drive failure, but not prevent it. This is akin to treating the symptoms, but not curing the disease.

Is there a solution to this problem ? Yes, at least for some specialist applications, and my company have filed patents for such designs and will be launching a radical new array design in mid year. I mention this, not so much to promote ourselves but rather by way of explaining that we have researched the subject in some depth.
DJ McFadden on Sunday, 4 March, 2007 at 6:20 pm

Storagezilla,

It’s actually Ms. or Miss (really Mrs. McFadden) and the sarcasm was meant as a form of humor since Mr. Hollis uses humor to illustrate his points. I would think it would be fine to answer in kind as long as we’re not taking personal jabs at one another. I do find it remarkable that you can take such a strong position on a blog reply that was never posted; that you never read.

I understand that Mr. Hollis has a job and I doubt any humorous poke I take at his position would cost him his job or really threaten him in any way. Maybe he even had a laugh with it. Regardless, he should be prepared to openly defend his position once he takes it.

His position on RAID-6 seems laughable. If you look at his earlier posts, he basically says EMC looked at the problem and determined that the vast majority of data loss or downtime was due to EMC complexity. They have spent a lot of time and effort attempting to simplify things for us “knuckle draggers.” At the end of the day it was determined that RAID-6 was not needed and NetApp was making a mountain out of a molehill. Roughly a month later EMC announces RAID-6. (Doesn’t anyone within EMC read Chuck’s blog? Why spend engineering resources on a feature that doesn’t matter? It might lend credence to your idea that an executive blog doesn’t necessarily relate to company policy, though. (NOTE: joke)). To Chuck’s credit he does attempt to remain consistent and refers to RAID-6 as a “checklist item.” No big deal. Just an item to help EMC out on RFPs. Unfortunately, I don’t give credit for features that are just there to help EMC fill out their checklist. Clearly, that’s what this is.

As far as “following” it appears to me that EMC’s MO is to dismiss a NetApp feature – and then follow them. Does NetApp bother EMC so much that they can’t admit NetApp might actually have a good idea? Having a practical RAID-6 solution is a good idea. If it’s not, take it out of Enginuity. If it is, put it in Flare. I think the worst answer is to tell customers that have personally experienced dual drive failures that there’s nothing to worry about and to look at the pretty GUI we’ve made for you. It’s getting harder and harder to distract us knuckle-draggers with shiny objects.
Robin Harris on Sunday, 4 March, 2007 at 10:54 pm

Alastair,

I’d love to see the research you refer to and am perfectly willing to be persuaded that your company is on to something. However, as a veteran of more than one array engineering effort, I do know that engineering spends a lot of time on the issue of properly mounting disk drives to ameliorate vibration issues. When a bunch of drives start seeking in unison it creates much higher vibration levels – the technical term is escaping me right now – than you’d expect from normal motor and seek activity. Those are the levels that mechanical engineers design for.

Further, while the CMU study covered high-density drive packaging that you are thinking of, Google only packages three drives to a server, AFAIK. Yes, the servers are packaged in large racks, yet this is not the density found in big iron arrays. So maybe Google’s packaging made the consumer drives look better than they otherwise would, but the CMU study found no great difference either. Further, drives are qual’d in the packaging the vendor is using, not in some idealized environment. Qual volumes are usually reduced to the bare minimum for cost reasons, yet array vendors definitely look at how the drives perform in the system.

The only conclusion I can come to is that while it certainly may be possible that high-density drive packaging leads to shorter drive life, given the engineering effort to control vibration and the testing array vendors perform, the drive manufacturers are under no illusions about where their enterprise drives are used or the rotational and seek vibration issues their very best customers face. Bottom line: either Google and CMU don’t know what they are talking about, or drive and/or array vendors haven’t been coming clean with the public about what their devices are really capable of.

DJ, you should be aware that Mr. Storagezilla appears to be an EMC employee, which may explain some of his comments. IMHO, his loyalty is laudable, his positions less so.

EMC is all about getting the sale. If consistency helps that, they’re consistent. If obfuscation helps, they obfuscate. If an abrupt about-face wins the deal, consider it done. Money talks, the rest is pool. So don’t let it bother you. It is just the way they are.

Robin
Richard on Monday, 5 March, 2007 at 9:25 am

Robin,
You are absolutely correct … “money talks, the rest is pool.”

For a long … long time … NetApp did not believe in RAID-protected storage at all. This changed dramatically when they acquired the rights to RAID DP.

As Ms McFadden states, EMC suddenly announced one of the future ‘check list’ items …. are they shipping already? I hope they had ample time for testing… and that such fast implementation of a complex algorithm is not too slow…. but then it could be RAID DP.

Google drives just 3 disks per motherboard … probably because this is all they can pack into a 1U chassis ….. more enclosures into a rack …. and then look for cheap power.

I suspect that the US Patent Office will probably let someone patent the idea of counter-rotating alternate/groups of disks … to reduce vibrations. … if this has not been done already

And… when this topic is about to conclude here … it seems that it is just beginning to gain steam at Computerworld … here…

http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9012066

These are interesting times.
Prof. John on Monday, 5 March, 2007 at 12:25 pm

One of the indications from the CMU and Google papers is really not so surprising: RAID-5 is not enough for many applications and we need RAID-6, RAID-7 and beyond. Why have not many RAID-6 products shipped? Let me tell a story from FAST 2005 (the one before this year’s) from a technical point of view. My friend, another professor, gave a tutorial talk on algorithms for building RAID-6 (and more), and about two dozen tech people from industry attended. After the talk, he was asked the same question many times: where to get a FREE RAID-6 algorithm that has NOT been patented. Where? Actually nowhere so far. All RAID-6 algorithms have been patented. The most well known was from IBM Almaden research center and IBM owns the patent. Now IBM sold its storage lines to Hitachi. I don’t know whether the patent has also been transferred or not. Another one is from NetApp, published in FAST 2004. But if the inventors are competent (a few of them worked at the IBM Almaden research center), they should know their RAID-6 algorithm’s read and small write performance is much worse than IBM’s algorithm. If NetApp’s algorithm is truly integrated into their shipped products as they claim, then their customers are not getting the best. There are a few other companies which have their own unpublished RAID-6 algorithms. But I tend to think those RAID-6 algorithms are no better than the IBM one.

So while everybody in the industry knows it is imperative to have RAID-6 and beyond, technically it is hard to design such algorithms. And worse yet, those companies are not willing to spend big bucks on such R&D. Just talk to the CTO’s office or Advanced Technologies Group in those companies. They just don’t have a long-term view. This is where their marketing comes in to brag about how reliable and sufficient their RAID-5 products are …
Bill Todd on Tuesday, 6 March, 2007 at 4:47 am

Hmmm. I guess Chuck’s blog is meant for more unidirectional communication than this discussion turned out to be: he closed comments early Monday morning without including those that came in over the weekend.

I found Ernst’s comparison of Chuck’s (and now a couple of disk manufacturers’) response to those of the tobacco industry to studies demonstrating the dangers of smoking to be compelling, and Chuck’s decision to close off discussion (just after a supporting diatribe from fellow EMCer Storagezilla) seems entirely consistent with a desire to try to control the discussion rather than actually engage in one. So I’ll just append my early Saturday morning response to him here, so that the discussion can continue if appropriate (whether he chooses to participate or not):

I guess perceptions vary at least as much as tastes do, Chuck. Equating sarcasm (and even mere questioning of something as dry as disk MTBF specifications – not even an EMC spec, for that matter) to ’emotion’ says, at least to me, considerably more about the personal involvement of the person making that claim.

Especially given that you were then responding to Ernst’s second post, which was not sarcastic at all but rather explained precisely the problem as he saw it. But instead of responding to that problem statement, you side-stepped it completely.

Now, I’d agree that Storagezilla’s post has a fair amount of emotion in it (though the post sequence suggests that this is not what you were referring to). As an EMC shill/attack dog he’s a fine complement to your own milder voice: disparaging those who don’t share one’s views as ‘trolls’ and ‘idiots’ probably isn’t something EMC would like a VP to be doing publicly.

Methinks you both protest too much – far too much. Perhaps you’re mildly dismissing (in your case) and aggressively berating (in Storagezilla’s) Ernst and me for statements that others have made on this subject elsewhere (not that you’ve provided any pointers here to suggest where that might be).

So in case you’re confusing us with someone else I will point out that neither of us has advocated anything resembling a ‘public hanging’ or even attempted to assign any kind of blame (save possibly for the manner in which you’ve been parrying rather than responding substantively here). Nor did the CMU study, which focused solely on the disparity between manufacturers’ MTBF specs and real-world experience (by the way, if Storagezilla actually read it before mouthing off he might have noticed that they did observe that even after hypothesizing that 43% of the disks they counted as failed were in fact good the numbers were still way out of line with the specs). The Google study too was very even-handed – unless one considers anyone with the temerity to question disk specs based on extensive real-world analysis to be ‘finger-pointing’.

The bottom line remains that there are now at least 5 large-scale studies of disk failures (all in relatively close quantitative agreement) which demonstrate that in actual use (under professional management) disks are nowhere nearly as robust as their manufacturers claim that they are. These studies, imperfect though they may be, include *far* more detail than the manufacturers have seen fit to release, and therefore (considering the industry experience of the people who conducted them) are commensurately more credible.

Rather than wave your hands vigorously, it might be more appropriate (and would certainly be more helpful to your customers) if you could contribute some actual data of your own. Meanwhile, responding substantively to Ernst’s second post might not be a bad start.

– bill
Richard on Tuesday, 6 March, 2007 at 6:36 am

Regarding RAID 6 algorithms… it seems that most people start by looking at Linux or published data on existing RISC processors, now available with hardware assisted RAID 5/6 algorithms.

As Prof John points out, “where to get a FREE RAID-6 algorithm that has NOT been patented. Actually nowhere so far.” … and … “technically it is hard to design such algorithms”…..

Not to mention the time required to field test and prove any new R6 concepts, under various failure scenarios and system configurations.

Both of these are key issues.

This seems to be the reason why EMC were downplaying the importance of RAID 6. I suggest that they have not found a way (yet) to ‘bypass’ the existing RAID 6 patents.

Controllers from EMC use repackaged “commodity” dual X86 processor motherboard hardware without co-processing hardware. Traditional Reed Solomon software-driven solutions are very memory bandwidth & compute intensive … and there is likely to be a problem with speed.

On the other hand, … RAID DP is relatively easy to implement within the existing RAID5 XOR concept. In that sense RAID DP may be the only short-term solution for EMC and others.
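
For readers who have not seen it, here is a minimal sketch of the RAID 5 XOR concept itself: one parity block per stripe, from which any single lost block can be rebuilt. This is single parity only. It is not RAID DP or any vendor's RAID 6, which add a second, independent parity so that two losses can be survived. The block contents and the function name are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks in one stripe, plus their parity block.
data = [b"disk", b"fail", b"raid"]
parity = xor_blocks(data)

# Lose the second block, then rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Because XOR is its own inverse, the same routine both computes parity and performs the rebuild.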

I am not sure how this stands with NetApp regarding patents, etc.
Perhaps they could comment.
Bill Todd on Tuesday, 6 March, 2007 at 7:00 pm

Chuck responded privately to me this morning, and quite reasonably. Why he didn’t feel inclined to do so in his blog is not clear, but at least he made the effort elsewhere.

And now for something completely different (in the spirit of Robin’s start post – my daughter discovered it):

http://www.youtube.com/watch?v=-Wocg88DS_M

– bill
Dan on Wednesday, 7 March, 2007 at 2:08 am

While there is some importance to trying to get some responses from disk and storage vendors, I think a second, and to me more important, discussion is due.

Based on the two studies, what can we do to better protect our data?

Google mentioned a 3 copies scheme. How does one achieve such a scheme? Is it automated? Are there tools that let you tailor such a scheme? Has anyone thought of a kind of file system that will triplicate data over distinct storage units? Are there other schemes people think of or use?
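
As one possible starting point, here is a minimal sketch of automated placement: picking three distinct storage units for each piece of data with rendezvous (highest-random-weight) hashing. This is an illustration of one known technique, not a description of how Google does it. The function name, the unit names and the key are all hypothetical.

```python
import hashlib


def place_replicas(key: str, units: list[str], copies: int = 3) -> list[str]:
    """Pick `copies` distinct storage units for `key` via rendezvous hashing."""
    if copies > len(units):
        raise ValueError("not enough storage units for the requested copies")

    def score(unit: str) -> int:
        # Each (unit, key) pair gets a stable pseudo-random score.
        digest = hashlib.sha256(f"{unit}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring units hold the copies.
    return sorted(units, key=score, reverse=True)[:copies]


units = ["unit-a", "unit-b", "unit-c", "unit-d", "unit-e"]
print(place_replicas("photos/2007/03/img001.jpg", units))
```

Every client computes the same placement without a central table, and removing a unit only moves the copies that lived on it. Detecting a failed unit and re-copying its data elsewhere is the harder part, and is not shown.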

This can lead to the death of the RAIDs of the world.

Dan