Earlier this week StorageMojo published summaries of two papers from the USENIX FAST ’07 conference, Google’s Disk Failure Experience and Everything You Know About Disks Is Wrong. I also published a briefer summary on Computerworld.com.
The credibility of the industry is in question
Both FAST papers were listed on slashdot.org, resulting in over 100,000 unique visitors here, and who knows how many downloads of the original papers. In short, the topic of disk MTBFs (or AFRs), along with related issues raised in the papers, excited a great deal of popular attention.
The papers suggested that important assumptions about disks, and by implication, arrays, are wrong – and not just a little.
- Failure rates are several times higher than reported by drive companies.
- Actual MTBFs (or AFRs) of “enterprise” and “consumer” drives are pretty much the same.
- Drive failure rates rise steadily with age rather than staying flat through some n-year mark.
- SMART is not a reliable predictor of drive failure.
- Array disk failures are highly correlated, making RAID 5 two to four times less safe than assumed.
I believe many readers of these papers will conclude that uncomfortable facts were either ignored or misrepresented by companies that knew better or should have known better. For example, in all the discussion of RAID-DP I’ve seen, the argument is couched in terms of unrecoverable read error rates, not the greater-than-assumed likelihood of two drives failing in an array. Given that field MTBF rates seem to be several times higher than vendors say, I’m now wondering about claimed bit error rates.
Many rivers to cross
The industry may have several responses:
- The paper’s conclusions are wrong (completely or in important respects) and here’s why. Our hands are clean.
- Gosh, we never correlated the behavior our field service and/or warranty groups saw with the claims made by our vendors or our marketing. We’ll do that now and get back to you with updated information. Thank you for bringing this to our attention.
- These academic studies may reflect the conditions seen in these point-off-the-enterprise-curve installations, but thanks to our superior supply-chain management, manufacturing, test, burn-in and skilled field service we’ve never observed these effects. Here to give an in-depth review of our service experience is our director of field service engineering. Thank you for giving us the opportunity to highlight our operational superiority.
Or most likely a combination of all three strategies.
Where do we go from here?
These issues resonate widely based on the comments I’ve seen. This being the age of interactive communication, you’ll need to engage with customers on multiple levels to regain the trust and credibility I know you’d like to enjoy.
I’m offering StorageMojo as a platform for your responses. I’d really like to hear what you have to say about these papers and the anomalies they’ve documented.
I’ll give each of you your own post to write what you will. StorageMojo readers, including me, will be free to comment. You’ll get your statements out without journalistic interpretation. If those of you with bloggers like Hu, Dave or Mark choose to respond there, I’ll be happy to link those posts for my readers who might not otherwise see them.
The StorageMojo take
The industry has an excellent opportunity to move to greater transparency with storage consumers. Sometimes relationships need a jolt to remind everyone just how much we rely upon each other. Storage is a vital industry with the responsibility to protect and access an ever-increasing fraction of mankind’s data. Customers want the best tools for the job. It appears the industry hasn’t been providing them, at least for disk drives. I know some efforts are underway in IDEMA to improve the quality of the numbers. I’d get serious about ensuring that the revised processes actually benefit customers rather than soothe corporate egos. Otherwise this situation will arise again.
Further, the need to engage at a more personal level is a predictable outcome of the continuing consumerization of IT. This is an example of the new normal. Embrace it.
So how about it? Will you respond?
Update: after looking at this in the morning, I decided that it fell short of the clarity I strive for. So this version is punched up a bit from yesterday.
Update II: NetApp has responded. I’m hoping other vendors will as well.
More than ever, comments welcome. Moderation is turned on to evade the phentermine dealers, among others. What is phentermine, anyway?
You’ve only got two of the major FC and SATA HDD manufacturers: Seagate (#1) and HGST (#3). Consider adding Western Digital (#2), Samsung (#4), Toshiba (#5) and Fujitsu (#6).
The other companies in your open letter do not manufacture HDDs but they OEM from the above, usually Seagate and one other.
Good input.
How does that work then? The OEMs don’t do any testing on drives?
My experience with Storage vendors is that they do a lot of testing at the drive level. Particularly during the qualifying of new drives.
As a matter of fact, I took a few vendors to task for spending more time on qualifying drives than on performance testing the Storage units. I wanted more time spent on Performance Under Load Information from testing by the vendor.
They always tap-danced me on that one saying they couldn’t duplicate my environment. So I developed a Generic set of specs. I also asked them if they didn’t have many of the same business environments that most of their customers had? And couldn’t they test their boxes with their own environments? They said no and no.
Well, I have set up Certification and Qualification Testing Labs that do just that. Usually they are three-level, to minimize the risk of crashing the Production operation.
Yes, there are sometimes still problems installing in Production after successfully passing levels 3 and 2 testing. There are some safeguards that must be employed to minimize this. The real reason no one does this seems to be that this process is not cheap and it takes some time.
As long as the market is not truly competitive this will be the case.
By the way, I don’t take the utopia view that “one size” of testing fits all.
The customer has the responsibility of knowing their environment well enough to inform the vendor and work with them to establish a reasonable test environment for both. Your IT shop may be very different than mine; expecting vendors to know every customer’s environment on their own would be an unjust burden to place on them.
At the moment, due to the “dumbing-down” of IT, customers are totally dependent on the vendors. This needs to change for the welfare of both.
Ok storagemojo, two of your points I’ve got a slight problem with.
# Failure rates are several times higher than reported by drive companies.
# Drive failure rates rise steadily with age rather than staying flat through some n-year mark.
The companies don’t quote a failure rate. They quote a drive life. There *is* a difference. The CMU paper got it wrong. Someone who is not in the reliability field attempted to perform reliability analysis and did so incorrectly. MTBF ≠ MTTF. And until we can actually have someone calculate the MTTF, we won’t know how close or far away from the manufacturers’ specs the real-world numbers are. I will add that the MTTF may not be very difficult to calculate; it just depends on how the Weibull analysis was performed by these researchers and what was taken into account. There are a lot of things to consider and they can make a difference.
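For concreteness, here is a minimal sketch of the kind of MTTF calculation being described, assuming Python with numpy and scipy; the lifetimes are fabricated for illustration, and censoring (drives that have not failed yet) is ignored:

```python
# Minimal sketch: fit a Weibull to per-drive lifetimes and report
# MTTF = eta * Gamma(1 + 1/beta).  The "lifetimes" here are fabricated;
# real field data would also need to handle censored (still-running) drives.
from math import gamma
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Fabricated failure times (hours) standing in for field data.
lifetimes = 800_000 * rng.weibull(0.7, size=500)

beta, _, eta = stats.weibull_min.fit(lifetimes, floc=0)   # shape, loc (fixed at 0), scale
mttf = eta * gamma(1 + 1 / beta)

print(f"fitted shape beta = {beta:.2f}, scale eta = {eta:,.0f} h")
print(f"implied MTTF      = {mttf:,.0f} h")
```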
Also, drive failure rate increasing with time is what is expected in ANY system where a wearout failure mode would be expected.
I’m not with the drive manufacturers in any way, but I AM a reliability engineer and I understand what has been done and the way it has been done incorrectly. Until it has been done properly, a proper comparison cannot be made.
IO, I left out the others partly because I didn’t think of them and partly because with the exception of Fujitsu they aren’t in the 3.5″ enterprise arena, AFAIK. Which is probably why I didn’t think of them.
I included the array vendors because they build systems using “enterprise” drives and “consumer” drives, so they should see the differences. Robert is correct, array vendors spend significant time and money qualifying drives down to the firmware rev level, and they only accept tested rev levels. Array vendors have good visibility into the behavior of large populations of drives.
Brian, I’m pleased to get a reliability engineer’s perspective. I must differ with you on one point: at least one Seagate drive family, the Barracuda 7200.10, quotes an AFR of 0.34%. More probably do; I haven’t checked.
Beyond that, I’d like you to expand your point about MTBF and MTTF. I’ve rarely seen disks repaired, so why wouldn’t the two numbers be equal? Or at least similar enough as not to matter? Or am I totally missing the point?
I agree with you that any mechanical system should expect increasing failure rates with time, yet I don’t think that expectation has been set for customers. Unlike Jack Nicholson in “A Few Good Men,” I think storage customers can handle the truth, and deserve it. Some customers might choose to replace all drives every two years if given the data. Some might pay for a 3-6 month burn-in service. It all starts with the best data we can provide.
Robin
Just for clarification 3.5″ HDDs are manufactured by:
1. Seagate – FC & SATA
2. Western Digital – SATA
3. HGST – both types
4. Samsung – both
5. Toshiba – none
6. Fujitsu – FC only
Robin:
On the point about AFR and manufacturers, I was unaware of that, as Ms. Schroeder had mentioned MTTF, so thank you for that information.
You are correct in that the individual hard drive is not necessarily repairable. You are also correct that MTBF = MTTF for an individual hard drive, and it would be good enough if our system were considered to be one hard drive. But the system of hard drives is repairable. When a drive fails, a new one is put in place. That’s considered a repair of the system. The numbers calculated were an MTBF of the system. Ms. Schroeder took all of the accumulated hours and divided by the total failures. So in this case, the MTBF of the system is not the MTTF of an individual drive.
I ran a quick simulation the other day using some of the parameters from Ms. Schroeder’s paper. Here were my conditions: a Weibull-distributed system with a shape parameter of 0.71, and a characteristic life based on a hard drive manufacturer’s quoted MTTF of 1,000,000 hours. I created a system of 4,000 hard drives, assumed that they ran 24/7 and simulated random failures based on these parameters. I was noticing that even in the first month of operation, the MTBF was something like 200,000 hours, or about 1/5 of what the stated MTTF was. I’m going to re-run the simulation and generate a plot of the MTBF over time for this system. If you think that any of my assumptions are incorrect, let me know. For instance, assuming 24/7 operation may be too strenuous. Also, 4,000 hard drives might be a bit much. Lastly, I assumed all were switched on at the same time initially. If there is a more proper deployment rate, I’d be curious to know.
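For anyone who wants to reproduce this, here is a minimal sketch of that kind of simulation, assuming Python with numpy; the shape parameter, quoted MTTF, population size and 24/7 duty cycle come from the comment above, while the single deployment date and the omission of replacement drives within the first month are simplifying assumptions:

```python
# Sketch of the simulation described above: Weibull lifetimes with beta = 0.71,
# scale chosen so the mean equals the quoted 1,000,000-hour MTTF, 4,000 drives
# running 24/7, observed over the first month.
import numpy as np
from math import gamma

rng = np.random.default_rng(0)

shape = 0.71                    # Weibull shape (beta < 1 means infant mortality dominates)
quoted_mttf = 1_000_000.0       # hours, the manufacturer's quoted MTTF
scale = quoted_mttf / gamma(1 + 1 / shape)   # characteristic life eta so the Weibull mean equals the quoted MTTF

n_drives = 4_000
window = 30 * 24                # first month of 24/7 operation, in hours

# Draw one lifetime per drive; numpy's weibull() uses scale 1, so multiply by eta.
lifetimes = scale * rng.weibull(shape, size=n_drives)

failures = int(np.count_nonzero(lifetimes <= window))
# Each drive accumulates hours until it fails or the month ends (replacements ignored).
accumulated_hours = np.minimum(lifetimes, window).sum()
observed_mtbf = accumulated_hours / max(failures, 1)

print(f"failures in the first month: {failures}")
print(f"observed system MTBF: {observed_mtbf:,.0f} h vs. quoted per-drive MTTF: {quoted_mttf:,.0f} h")
```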
FWIW, Western Digital’s RE2 3.5″ drives, with retries tuned for array use, appear in Agami’s enterprise NAS. I hope WD responds.
see blog post above.
Robin, you are free to consider this “my” official response if you like to post it on your blog, or point to mine, whatever is easier for you. Given that IBM no longer manufactures the DDMs we use inside our disk systems, there may not be any reason for a more formal response.
I wonder if there would be any value (and if it is common practice) for array vendors to use drives of the same model, but of different production runs, in their arrays, so as to minimize the odds of a manufacturing defect causing a multiple failure leading to data loss.
The correlation between disk failures in an array should have been expected. Standard statistical analysis for manufacturing assumes that the correlation matrix for a series of products is block diagonal. That means that rather than treating each product as an independent sample of a random process, you treat each production run as independent but assume that if a run produces some products with large errors then the implicit reliability of other products from the same run is lower. Since disk arrays are likely to be produced by taking N drives from the same run, you get the failure correlation described almost by default.
This implies that building arrays by sampling from different production runs would actually increase reliability substantially for RAID-5.
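Here is a rough Monte Carlo sketch of that effect, assuming Python with numpy; the baseline failure rate, run-to-run spread and 5-drive group width are invented for illustration. It shares one quality multiplier across a group built from a single run, versus independent multipliers for a group built from mixed runs, and counts groups with two or more failures in a year (data loss for single-parity RAID):

```python
# Invented parameters throughout; this only illustrates the block-diagonal
# correlation argument, not any real drive population.
import numpy as np

rng = np.random.default_rng(1)

n_trials = 200_000     # simulated 5-drive RAID-5 groups
width    = 5           # drives per group (4 + 1 parity)
base_afr = 0.03        # assumed annual failure probability per drive
sigma    = 1.0         # assumed run-to-run quality spread (log-normal multiplier)

def data_loss_rate(same_run: bool) -> float:
    """Fraction of groups seeing >= 2 drive failures in a year (loss for single-parity RAID)."""
    if same_run:
        # every drive in a group shares its production run's quality multiplier
        quality = rng.lognormal(mean=-sigma**2 / 2, sigma=sigma, size=(n_trials, 1))
    else:
        # every drive comes from an independent run
        quality = rng.lognormal(mean=-sigma**2 / 2, sigma=sigma, size=(n_trials, width))
    p_fail = np.clip(base_afr * quality, 0.0, 1.0)      # same average AFR in both cases
    failed = rng.random((n_trials, width)) < p_fail     # broadcasting handles the shared-run case
    return float(np.mean(failed.sum(axis=1) >= 2))

print("same production run:", data_loss_rate(True))
print("mixed runs:         ", data_loss_rate(False))
```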
Regarding Richard Hamilton’s inquiry dated March 22nd, ’07.
I work at one of the largest array vendors. There really is no value in micro-managing the makeup of an array with regard to ensuring mixed production dates for disk drives. The arrays themselves are extremely fault tolerant and, depending on the implementation, can reconstruct any lost data very easily.
Without going into much detail (trade secrets and all that…) the intelligent arrays (storage processors) and the drives themselves are tested quite well, so catastrophic failures are weeded out before going to a customer. It would take multiple drives in the same RAID to fail catastrophically at the EXACT SAME time in order for data to be irretrievably lost (and even then there are ways to prevent that from happening such as real-time mirrors, off-site data replication SW, etc.)
Wow…I can’t believe that even among the “largest array vendors”, the effects of ever-larger disk capacities on failure modes in RAID arrays are apparently still not well understood.
For example, from above:
“It would take multiple drives in the same RAID to fail catastrophically at the EXACT SAME time in order for data to be irretrievably lost…”
That’s incorrect. In reality (RAID-5), it would only take a second disk failure DURING THE REBUILD PERIOD of the first failed drive. The distinction may seem trivial, except that as disk drives have gotten bigger and bigger (while IOPS have remained flat), the rebuild times stretch into many hours. In some cases, where rebuild of a very large disk is occurring in a system under heavy load, it can take a full DAY or more to rebuild a terabyte drive in a 4+1 array.
The probability of a second (catastrophic) failure is directly proportional to the time required for a rebuild of the first failure.
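As a rough illustration of that proportionality, here is a back-of-the-envelope sketch, assuming Python; the 3% annual failure rate, the 4+1 group width and the rebuild windows are made-up numbers, and the correlation and rebuild-stress effects discussed in this thread are ignored (they only make things worse):

```python
# Back-of-the-envelope check of the "proportional to rebuild time" claim,
# with made-up numbers.
surviving_drives = 4
afr = 0.03                      # assumed annual failure rate per drive
hours_per_year = 8760

for rebuild_hours in (6, 24):
    p_drive = afr * rebuild_hours / hours_per_year        # one drive failing in the window
    p_second = 1 - (1 - p_drive) ** surviving_drives      # at least one survivor failing
    print(f"{rebuild_hours:>2} h rebuild -> P(second failure) ~ {p_second:.2e}")
```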
Also, parity-based rebuild operations are extremely I/O intensive, giving all of the disks in the array quite a workout, above and beyond whatever production workloads are running. Driving all the disks simultaneously up to 100% actuator duty-cycles “thrashes” the actuators and heats up the HDA beyond the baseline for the workload. Especially in an array of “older” disks, this strenuous workout further increases the likelihood of a multiple-disk failure, and so the probability of a catastrophic failure during parity-RAID rebuild now increases exponentially as a function of the rebuild time.
Of course this is why double-parity has become common, but the largely undiscussed “feature” of double-parity scenarios (apart from the fact that DP only mitigates but does not solve the problems above) is that the capacity economics — the WHOLE JUSTIFICATION for parity-based RAID — degrade to the point that DP RAID is only marginally better than simple mirroring!
The business-value of parity-based RAID has always been that it conserves disk capacity at the expense of write performance and (now with bigger disks) also reliability.
With disk capacity now so cheap it’s almost free, cheap and simple disk mirroring and RAID-10 makes more sense than ever, especially when one considers that rebuilding a failed mirror goes about 10-100x faster than rebuilding a failed drive in a parity-RAID scenario, and stresses only ONE other disk, not ALL of the disks in the array.
Also…as regards Google, I see that no one has mentioned the fact that Google hangs its bare disk drives from the back of the server racks with velcro tape (Google patent 6,906,920), and then sticks a cheap plastic shroud around them to (purportedly) ensure cooling flow. Google’s results must be considered in the context of how Google is mounting disk drives — in a cheap plastic shell and velcro tape, with no consideration whatsoever for vibration control and a very “iffy” air-flow. Anyone who has been around one of these extreme-density server racks can probably appreciate the vibration that is transmitted throughout the structure by hundreds of (generally cheap) server and server power-supply fans, and the inevitable impact this would have on disk reliability.
Given that Google’s disk-mounting system is so far away from the kind of mounting systems that the disks were designed for, and how the quality of disk mounting and handling techniques is THE key variable in the disk longevity equation, it becomes abundantly clear that the vast majority of Google’s reported results are meaningless to anyone but Google.