Comments on: Open Letter to Seagate, Hitachi GST, EMC, HP, NetApp, IBM and Sun

By: Kmann

Kmann — Thu, 21 Aug 2008 14:28:35 +0000

Sorry Robin, I mangled the first post — here’s the corrected text:

Wow…I can’t believe that even among the “largest array vendors”, the effects of ever-larger disk capacities on failure modes in RAID arrays are apparently still not well understood.

For example, from above:

“It would take multiple drives in the same RAID to fail catastrophically at the EXACT SAME time in order for data to be irretrievably lost…”

That’s incorrect. In reality, with RAID-5, it would only take a second disk failure DURING THE REBUILD PERIOD of a first failed drive. The distinction may seem trivial, except that as disk drives have gotten bigger and bigger (while IOPS have remained flat), rebuild times have stretched into many hours. In some cases, where a very large disk is being rebuilt in a system under heavy load, it can take a full DAY or more to rebuild a terabyte drive in a 4+1 array.

The probability of a second (catastrophic) failure is directly proportional to the time required for a rebuild of the first failure.

Also, parity-based rebuild operations are extremely I/O intensive, giving all of the disks in the array quite a workout, above and beyond whatever production workloads are running. Driving all the disks simultaneously up to 100% actuator duty-cycles “thrashes” the actuators and heats up the HDA beyond the baseline for the workload. Especially in an array of “older” disks, this strenuous workout further increases the likelihood of a multiple-disk failure, and so the probability of a catastrophic failure during parity-RAID rebuild now increases exponentially as a function of the rebuild time.
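
To put rough numbers on the rebuild-window argument, here is a minimal sketch. It assumes independent drives with a constant failure rate, which is a simplification; the 3% AFR and the `stress_factor` multiplier are illustrative assumptions, not measured values.

```python
import math

HOURS_PER_YEAR = 8760.0

def second_failure_probability(surviving_disks, rebuild_hours, afr, stress_factor=1.0):
    """P(at least one surviving disk fails before the rebuild finishes).

    afr           -- assumed annualized failure rate per drive (e.g. 0.03 = 3%)
    stress_factor -- hypothetical multiplier for the extra rebuild workload
    """
    # Convert the annual failure probability to a constant hourly hazard rate.
    hourly_rate = -math.log(1.0 - afr) / HOURS_PER_YEAR
    # Total exposure: every surviving disk is at risk for the whole rebuild.
    exposure = surviving_disks * hourly_rate * stress_factor * rebuild_hours
    return 1.0 - math.exp(-exposure)

if __name__ == "__main__":
    # 4+1 RAID-5 with one drive failed leaves 4 survivors.
    for hours in (4, 12, 24, 48):
        p = second_failure_probability(surviving_disks=4, rebuild_hours=hours, afr=0.03)
        print(f"{hours:>3} h rebuild: {p:.4%}")
```

Under these assumptions the risk grows almost linearly with rebuild time for short windows; any extra growth from rebuild stress would have to come through something like `stress_factor`, which this sketch treats as a flat multiplier.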

Of course this is why double-parity has become common, but the largely undiscussed “feature” of double-parity scenarios (apart from the fact that DP only mitigates but does not solve the problems above) is that the capacity economics — the >>whole justification<< for parity-based RAID — degrade to the point that DP RAID is only marginally better than simple mirroring!
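
The capacity trade-off can be checked with simple arithmetic. This is a sketch only: the 4+1 group comes from the example above, while the 4+2 double-parity group is an assumed layout chosen for comparison.

```python
def usable_fraction(data_disks, redundancy_disks):
    """Fraction of raw capacity that holds user data."""
    return data_disks / (data_disks + redundancy_disks)

# Assumed group sizes; real arrays vary.
layouts = {
    "RAID-5 (4+1)": usable_fraction(4, 1),
    "Double parity (4+2)": usable_fraction(4, 2),
    "Mirroring / RAID-10 (1+1)": usable_fraction(1, 1),
}

if __name__ == "__main__":
    for name, fraction in layouts.items():
        print(f"{name}: {fraction:.1%} usable")
    # prints 80.0%, 66.7% and 50.0% respectively
```

The gap between double parity and mirroring depends heavily on group width: a wider group such as 8+2 gets back to 80% usable, at the cost of more disks involved in every rebuild.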

The business-value of parity-based RAID has always been that it conserves disk capacity at the expense of write performance and (now with bigger disks) also reliability.

With disk capacity now so cheap it’s almost free, cheap and simple disk mirroring and RAID-10 make more sense than ever, especially when one considers that rebuilding a failed mirror goes about 10-100x faster than rebuilding a failed drive in a parity-RAID scenario, and stresses only ONE other disk, not ALL of the disks in the array.

Also…as regards Google, I see that no one has mentioned the fact that Google hangs its bare disk drives from the back of the server racks with velcro tape (Google patent 6,906,920), and then sticks a cheap plastic shroud around them to (purportedly) ensure cooling flow. Google’s results must be considered in the context of how Google is mounting disk drives: in a cheap plastic shell and velcro tape, with no consideration whatsoever around vibration control and a very “iffy” air-flow. Anyone who has been around one of these extreme-density server racks can probably appreciate the vibration that is transmitted throughout the structure by hundreds of (generally cheap) server and server power-supply fans, and the inevitable impact this would have on disk reliability.

Given that Google’s disk-mounting system is so far away from the kind of mounting systems that the disks were designed for, and how the quality of disk mounting and handling techniques is THE key variable in the disk longevity equation, it becomes abundantly clear that the vast majority of Google’s reported results are meaningless to anyone but Google.

By: Keith P.

Keith P. — Fri, 11 May 2007 18:04:30 +0000

Regarding Richard Hamilton’s inquiry dated March 22nd, ’07.

I work at one of the largest array vendors. There really is no value in micro-managing the makeup of an array with regard to ensuring mixed production dates for disk drives. The arrays themselves are extremely fault tolerant and, depending on the implementation, can reconstruct any lost data very easily.

Without going into much detail (trade secrets and all that…) the intelligent arrays (storage processors) and the drives themselves are tested quite well, so catastrophic failures are weeded out before going to a customer. It would take multiple drives in the same RAID to fail catastrophically at the EXACT SAME time in order for data to be irretrievably lost (and even then there are ways to prevent that from happening, such as real-time mirrors, off-site data replication SW, etc.).

By: Anonymous

Anonymous — Fri, 11 May 2007 14:40:22 +0000

The correlation between disk failures in an array should have been expected. Standard statistical analysis for manufacturing assumes that the correlation matrix for a series of products is block diagonal. That means that rather than treating each product as an independent sample of a random process, you treat each production run as independent but assume that if a run produces some products with large errors then the implicit reliability of other products from the same run is lower. Since disk arrays are likely to be produced by taking N drives from the same run, you get the failure correlation described almost by default.

This implies that building arrays by sampling from different production runs would actually increase reliability substantially for RAID-5.
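
A toy model of the same-run effect, as a sketch only: it assumes each production run is either "good" or "bad" and that drives fail independently once the run quality is known. Every number below is hypothetical.

```python
def p_at_least_two(n, q):
    """P(at least 2 of n drives fail) when each fails independently with probability q."""
    return 1.0 - (1.0 - q) ** n - n * q * (1.0 - q) ** (n - 1)

def same_run_array(n, p_bad_run, q_good, q_bad):
    # All n drives come from one run, so they share that run's quality.
    return ((1.0 - p_bad_run) * p_at_least_two(n, q_good)
            + p_bad_run * p_at_least_two(n, q_bad))

def mixed_run_array(n, p_bad_run, q_good, q_bad):
    # Each drive comes from a different run, so each one fails
    # independently at the blended (average) rate.
    q_mean = (1.0 - p_bad_run) * q_good + p_bad_run * q_bad
    return p_at_least_two(n, q_mean)

if __name__ == "__main__":
    # Hypothetical: 10% of runs are bad; 2% vs 15% failure probability per drive.
    args = dict(n=5, p_bad_run=0.10, q_good=0.02, q_bad=0.15)
    print(f"same run : {same_run_array(**args):.4f}")
    print(f"mixed run: {mixed_run_array(**args):.4f}")
```

Both arrays have the same average per-drive failure probability; with these made-up parameters, the same-run array has the higher chance of a double failure because a bad run hits every drive in the group at once.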

By: Richard Hamilton

Richard Hamilton — Fri, 23 Mar 2007 05:59:31 +0000

I wonder if there would be any value (and if it is common practice)
for array vendors to use drives of the same model, but of different
production runs, in their arrays, so as to minimize the odds of a
manufacturing defect causing a multiple failure leading to data loss.

By: Tony Pearson

Tony Pearson — Tue, 06 Mar 2007 05:26:05 +0000

see blog post above.

Robin, you are free to consider this “my” official response if you like to post it on your blog, or point to mine, whatever is easier for you. Given that IBM no longer manufactures the DDMs we use inside our disk systems, there may not be any reason for a more formal response.

By: Paul

Paul — Mon, 26 Feb 2007 22:45:50 +0000

FWIW, Western Digital’s RE2 3.5″ drives, with retries tuned for array use, appear in Agami’s enterprise NAS. I hope WD responds.

By: Brian

Brian — Sun, 25 Feb 2007 20:22:17 +0000

Robin:

On the point about AFR and manufacturers, I was unaware, as Ms. Schroeder had mentioned MTTF, so thank you for that information.

You are correct in that the individual hard drive is not necessarily repairable. You are also correct that MTBF = MTTF for an individual hard drive, and it would be good enough if our system was considered to be one hard drive. But the system of hard drives is repairable. When a drive fails, a new one is put in place. That’s considered a repair of the system. The numbers calculated were an MTBF of the system. Ms. Schroeder had taken all of the accumulated hours and divided by the total failures. So in this case, MTBF of the system is not the MTTF of an individual drive.

I ran a quick simulation the other day using some of the parameters from Ms. Schroeder’s paper. Here were my conditions: a Weibull-distributed system with a shape parameter of 0.71, and a characteristic life based on a hard drive manufacturer’s quoted MTTF of 1,000,000 hours. I created a system of 4000 hard drives, assumed that they ran 24/7 and simulated random failures based on these parameters. I was noticing that even in the first month of operation, the MTBF was something like 200,000 hours, or about 1/5 of what the stated MTTF was. I’m going to re-run the simulation and generate a plot of the MTBF over time for this system. If you think that any of my assumptions are incorrect, let me know. For instance, assuming 24/7 operation may be too strenuous. Also, 4,000 hard drives might be a bit much. Lastly, I assumed all were switched on at the same time initially. If there is a more proper deployment rate, I’d be curious to know.
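
For anyone who wants to try something similar, here is a minimal sketch of such a simulation. It is not the original code: it assumes the Weibull scale is chosen so that the mean lifetime equals the quoted MTTF, that failed drives are replaced instantly with new ones, and that a month is 730 hours. The function names are made up for illustration.

```python
import math
import random

def weibull_scale_from_mean(mean_hours, shape):
    # The mean of a Weibull distribution is scale * Gamma(1 + 1/shape).
    return mean_hours / math.gamma(1.0 + 1.0 / shape)

def simulate_fleet_mtbf(n_drives, shape, mttf_hours, window_hours, seed=0):
    """System MTBF = accumulated drive-hours / failures within the window."""
    rng = random.Random(seed)
    scale = weibull_scale_from_mean(mttf_hours, shape)
    failures = 0
    for _ in range(n_drives):
        # All slots start with a new drive at time zero.
        clock = rng.weibullvariate(scale, shape)
        while clock < window_hours:
            failures += 1
            # Assumed: instant replacement with a brand-new drive.
            clock += rng.weibullvariate(scale, shape)
    total_hours = n_drives * window_hours
    return total_hours / failures if failures else math.inf

if __name__ == "__main__":
    mtbf = simulate_fleet_mtbf(n_drives=4000, shape=0.71,
                               mttf_hours=1_000_000, window_hours=730)
    print(f"First-month system MTBF: {mtbf:,.0f} hours")
```

With a shape parameter below 1 the hazard rate is highest early in life, so the fleet MTBF over a short initial window tends to come out well below the quoted MTTF. The exact figure depends on the seed and on how the characteristic life is derived from the MTTF.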

By: IO Guy

IO Guy — Sat, 24 Feb 2007 04:08:27 +0000

Just for clarification 3.5″ HDDs are manufactured by:
1. Seagate – FC & SATA
2. Western Digital – SATA
3. HGST – both types
4. Samsung – both
5. Toshiba – none
6. Fujitsu – FC only

By: Robin Harris

Robin Harris — Fri, 23 Feb 2007 23:59:20 +0000

IO, I left out the others partly because I didn’t think of them and partly because with the exception of Fujitsu they aren’t in the 3.5″ enterprise arena, AFAIK. Which is probably why I didn’t think of them.

I included the array vendors because they build systems using “enterprise” drives and “consumer” drives, so they should see the differences. Robert is correct, array vendors spend significant time and money qualifying drives down to the firmware rev level, and they only accept tested rev levels. Array vendors have good visibility into the behavior of large populations of drives.

Brian, I’m pleased to get a reliability engineer’s perspective. I must differ with you on one point: at least one Seagate drive line, the Barracuda 7200.10 family, quotes an AFR of 0.34%. More probably do; I haven’t checked.

Beyond that, I’d like you to expand your point about MTBF and MTTF. I’ve rarely seen disks repaired, so why wouldn’t the two numbers be equal? Or at least similar enough as not to matter? Or am I totally missing the point?

I agree with you that any mechanical system should expect increasing failure rates with time, yet I don’t think that expectation has been set for customers. Unlike Jack Nicholson in “A Few Good Men”, I think storage customers can handle the truth, and deserve it. Some customers might choose to replace all drives every two years if given the data. Some might pay for a 3-6 month burn-in service. It all starts with the best data we can provide.

Robin

By: Brian

Brian — Fri, 23 Feb 2007 19:44:35 +0000

Ok storagemojo, two of your points I’ve got a slight problem with.

# Failure rates are several times higher than reported by drive companies.
# Drive failure rates rise steadily with age rather than staying flat through some n-year mark.

The companies don’t quote a failure rate. They quote a drive life. There *is* a difference. The CMU paper got it wrong. Someone who is not in the reliability field attempted to perform reliability analysis and did so incorrectly. MTBF ≠ MTTF. And until someone actually calculates the MTTF, we won’t know how close or far away from the manufacturers’ specs the real-world numbers are. I will add that the MTTF may not be very difficult to calculate; it just depends on how the Weibull analysis was performed by these researchers and what was taken into account. There are a lot of things to consider and they can make a difference.

Also, drive failure rate increasing with time is what is expected in ANY system where a wearout failure mode would be expected.

I’m not with the drive manufacturers in any way, but I AM a reliability engineer and I understand what has been done and the way it has been done incorrectly. Until it has been done properly, a proper comparison cannot be made.

By: Robert Pearson

Robert Pearson — Fri, 23 Feb 2007 18:49:10 +0000

Good input.

How does that work then? The OEMs don’t do any testing on drives?
My experience with Storage vendors is that they do a lot of testing at the drive level. Particularly during the qualifying of new drives.

As a matter of fact, I took a few vendors to task for spending more time on qualifying drives than on performance testing the Storage units. I wanted more time spent on Performance Under Load Information from testing by the vendor.

They always tap-danced me on that one saying they couldn’t duplicate my environment. So I developed a Generic set of specs. I also asked them if they didn’t have many of the same business environments that most of their customers had? And couldn’t they test their boxes with their own environments? They said no and no.

Well, I have set up Certification and Qualification Testing Labs that do just that. Usually they are three-level, to minimize crashing the Production operation.
Yes, there are sometimes still problems installing in Production after successfully passing levels 3 and 2 testing. There are some safeguards that must be employed to minimize this. The real reason no one does this seems to be that this process is not cheap and it takes some time.
As long as the market is not truly competitive this will be the case.

By the way, I don’t take the utopia view that “one size” of testing fits all.
The customer has the responsibility of knowing their environment well enough to inform the vendor and work with them to establish a reasonable test environment for both. Your IT shop may be very different than mine. That would be an unjust burden to place on the vendors.

At the moment, due to the “dumbing-down” of IT, customers are totally dependent on the vendors. This needs to change for the welfare of both.

By: IO Guy

IO Guy — Fri, 23 Feb 2007 07:14:19 +0000

You’ve only got two of the major FC and SATA HDD manufacturers. Seagate #1 and HGST #3. Consider adding: Western Digital #2, Samsung #4, Toshiba #5 and Fujitsu #6
The other companies in your open letter do not manufacture HDDs but they OEM from the above, usually Seagate and one other.