StorageMojo’s best paper of FAST ’10 is Understanding Latent Sector Errors and How to Protect Against Them (pdf) by Bianca Schroeder, Sotirios Damouras, and Phillipa Gill, University of Toronto.
The paper builds on research and a dataset that StorageMojo reviewed 2 years ago in Latent sector errors in disk drives. That research analyzed the error logs of 50,000 NetApp arrays with 1.53 million enterprise and consumer disk drives.
Understanding
Understanding LSEs does a statistical deep dive on the disk LSE dataset and then evaluates scrubbing and intra-disk redundancy strategies against the field data.
Latent sector errors are important for 3 reasons:
- 1 LSE can cause a RAID reconstruction failure in a single parity RAID system (RAID 5).
- Ever-tinier disk storage geometries make LSEs more likely.
- The insidious failure mode: no detection until access is attempted.
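The first point lends itself to a back-of-the-envelope check. A minimal sketch, assuming the 1-per-10^14-bits unrecoverable read error rate that nearline spec sheets typically quote – my assumption, not a number from the paper:

```python
# Chance of hitting at least one unrecoverable read error (URE) while
# reading every surviving drive during a single-parity (RAID 5)
# rebuild. The 1e-14 per-bit rate is a typical nearline spec-sheet
# figure, not a number from the Schroeder paper.

def rebuild_failure_probability(drive_tb, surviving_drives, ure_per_bit=1e-14):
    bits_read = surviving_drives * drive_tb * 1e12 * 8  # TB -> bits
    return 1 - (1 - ure_per_bit) ** bits_read

# 7 surviving 1 TB drives in an 8-drive RAID 5 group:
print(f"{rebuild_failure_probability(1, 7):.0%}")  # ~43%
```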
Schroeder et al. used a subset of the LSE dataset that included only drives that had LSEs. This covered 29,615 nearline (presumably SATA) drives and 17,513 enterprise drives that had been in the field at least 12 months.
LSE metrics
Some of the paper’s conclusions:
- For most drives, almost all LSEs are single errors; bursts of multiple contiguous logical block errors account for less than 2.5% of all LSEs.
- If there is a 2nd error, it is most often within 100 sectors of the 1st.
- Depending on the model, between 20% and 50% of errors are in the first 10% of the drive’s logical sector space. Some drives have a higher concentration of errors at the end of the drive as well.
- LSEs are highly concentrated in a few short time intervals, not randomly spread out over a drive’s life.
- It appears that events that are close in space are also close in time.
The rest of the paper
The paper also goes into 2 interesting topics – intra-disk redundancy and scrubbing strategies – that deserve posts of their own. For the latter, the researchers found that changing the order in which sectors are scrubbed can improve mean time to error detection by 40% – with no increase in overhead or scrub frequency.
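One such reordering is staggered scrubbing: instead of reading the disk sequentially, the scrubber reads one segment from each region of the disk per pass, so the spatial clustering of errors gets sampled early. A minimal sketch of the ordering, with illustrative region and segment counts rather than the paper’s parameters:

```python
# Staggered scrub order: visit one segment per region per pass instead
# of marching sequentially across the disk. Because LSEs cluster in
# space, touching every region early tends to find bad neighborhoods
# sooner, at the same total cost as a sequential scrub.

def staggered_order(num_segments, num_regions):
    per_region = num_segments // num_regions
    for offset in range(per_region):        # one pass over all regions
        for region in range(num_regions):   # one segment per region
            yield region * per_region + offset

print(list(staggered_order(num_segments=12, num_regions=4)))
# -> [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]  (vs. sequential 0..11)
```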
Conclusions
Key quote:
We observe that many of the statistical aspects of LSEs are well modeled by power-laws, including the length of error bursts (i.e. a series of contiguous sectors affected by LSEs), the number of good sectors that separate error bursts, and the number of LSEs observed per time. We find that these properties are poorly modeled by the most commonly used distributions, geometric and Poisson. Instead we observe that a Pareto distribution fits the data very well and report the parameters that provide the best fit. . . . We find no significant difference in the statistical properties of LSEs in nearline drives versus enterprise class drives.
[bolding added -ed. However, nearline drives are about 4x more likely to get an error.]
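The field data isn’t public, so the sketch below only illustrates the mechanics of the comparison the authors report, fitting Pareto and geometric distributions to synthetic burst lengths:

```python
# Illustrative only: fit Pareto vs. geometric to synthetic error-burst
# lengths, since the NetApp dataset is not public.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic heavy-tailed integer burst lengths, minimum 1.
bursts = np.floor(stats.pareto.rvs(b=1.3, size=5000, random_state=rng))

# Closed-form Pareto MLE: scale = sample minimum, shape from log ratios.
xm = bursts.min()
shape = len(bursts) / np.log(bursts / xm).sum()
p_geom = 1.0 / bursts.mean()  # geometric MLE

# Loose comparison (continuous density vs. discrete pmf), but the
# heavy-tailed Pareto wins by a wide margin on data like this.
ll_pareto = stats.pareto.logpdf(bursts, shape, scale=xm).sum()
ll_geom = stats.geom.logpmf(bursts.astype(int), p_geom).sum()
print(f"Pareto logL: {ll_pareto:.0f}   geometric logL: {ll_geom:.0f}")
```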
The StorageMojo take
Disk-based storage arrays are facing a real challenge from flash and possibly PCM technology. Disks win the $/GB race, but piling double and triple parity on arrays increases costs and firmware complexity.
Understanding the nature of the enemy – in this case latent sector errors – helps array designers develop more reliable and cost-effective arrays. Yet one has to wonder if the RAID paradigm is reaching the end of the line.
Parallel and object-based systems from Isilon and Panasas, for example, are very fast at disk rebuilds because they can draw data from many disk drives in parallel – without the performance-killing overhead that RAID rebuilds impose.
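A rough back-of-the-envelope shows the effect; the transfer rates below are ballpark assumptions, not measured vendor figures:

```python
# Rebuild time is gated by how fast the missing data can be rewritten.
# A conventional RAID rebuild funnels everything through one
# replacement drive; a parallel/declustered system spreads the writes
# across many drives. Rates here are illustrative assumptions.

def rebuild_hours(drive_tb, effective_mb_per_s):
    return drive_tb * 1e6 / effective_mb_per_s / 3600  # TB -> MB -> hours

print(f"single replacement drive at 50 MB/s: {rebuild_hours(1, 50):.1f} hours")
print(f"rebuild spread across 100 drives:    {rebuild_hours(1, 50 * 100):.2f} hours")
```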
But those are larger systems. Putting these techniques together may give us reliable and economical RAID 5 systems for the SMB market for another decade or more.
Courteous comments welcome, of course. I’ve done work for Isilon – which also advertises on StorageMojo – and Panasas. The official best paper of FAST ’10 was quFiles, which I blogged about last week.
If you spot a typo please let me know. Thanks!
I don’t get the object-based-systems idea as an iteration past RAID – you’re taking the RAID 5/6 overhead of 13% and replacing it with a 100%-400% overhead system. So disk densities doubled, but so did your protection scheme – so now you’re back to square one. Why not just stick with the smaller drives and RAID 5/6? Especially with power/cooling moving to the forefront, I don’t see how technologies like Isilon, LeftHand, XIV, etc. that have such abysmal raw-to-protected ratios make any sense. 1TB drives of which I can only use 50% or less? Why not use 450GB 15K drives, spin about the same number of drives, and get 2x the IOPS?
Just A Storage Guy,
Isilon does not have an “abysmal” raw-to-protected ratio. The reason is that the “RAID set” size is arbitrarily large. I put “RAID set” in quotes as it is not “disks” that Isilon protects, but rather data. So for example, in a cluster of dozens of controllers with hundreds of drives, the effective protection of RAID plus 3 parity only costs you 3 drives’ worth of overhead, plus spares. This is a lot better than the classic RAID-set-limited vendors.
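[For readers following the arithmetic, a quick sketch of how parity overhead shrinks as the protection group widens; group sizes are illustrative, not specific vendor configurations. -ed.]

```python
# Parity overhead = parity drives / total drives in the protection
# group. Wide groups amortize the same parity count over more data.

def parity_overhead(data_drives, parity_drives):
    return parity_drives / (data_drives + parity_drives)

print(f"RAID 6, 6+2:        {parity_overhead(6, 2):.0%}")    # 25%
print(f"RAID 5, 12+1:       {parity_overhead(12, 1):.1%}")   # ~7.7%
print(f"wide stripe, 100+3: {parity_overhead(100, 3):.1%}")  # ~2.9%
print(f"mirroring, N+N:     {parity_overhead(1, 1):.0%}")    # 50%
```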
Your idea of using 15K drives instead of SATA, when compared to Isilon’s approach, would be quite a lot more expensive, unless perhaps you are comparing a bank of 15K drives implemented with white box technology. However, Enterprise Storage is Enterprise Storage, and white box is white box. Let’s not compare the two.
Joe Kraska
San Diego CA
USA
Joe,
I’m not talking about whitebox solutions, I’m talking about enterprise solutions from companies that actually turn a profit.
You’re talking about multinode striping – the main reason why Isilon’s random IO profile is terrible. I’m talking about N+N protection – maybe I should have used LeftHand as a better example.
From what I recall of the history/evolution of RAID levels, the original points were…
+ protection: against entire disk failures
+ performance: balancing random vs. sequential and read vs. write
+ etc.
As storage density continues to grow, the probability of failing to consistently/reliably/accurately read a block is becoming significant.
Depending on the type of error encountered…
+ the entire drive may be bad or dying
+ the block may be bad
+ the block is good but the read was bad (probably correct if read again)
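[A sketch of the triage this list implies: retry a failed read to separate transient errors from persistent bad blocks. The function and threshold are hypothetical, not any real system’s API. -ed.]

```python
# Hypothetical triage: a read that succeeds on retry was a transient
# error; one that keeps failing points at a bad block, to be remapped
# or repaired from redundancy. Drive-level health (SMART data, error
# rate trends) would decide whether the whole drive is dying.
import os

def classify_read_error(fd, offset, length, retries=3):
    for _ in range(retries):
        try:
            os.pread(fd, length, offset)  # POSIX positional read
            return "transient"            # succeeded on retry
        except OSError:
            continue
    return "bad block"                    # persistent: remap / repair
```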
Given the limitations of volume-level RAID (slow rebuild times, etc.), is it not time for file-system or stripe-level redundancy?