StorageMojo’s best paper of FAST ’10 is Understanding Latent Sector Errors and How to Protect Against Them (pdf) by Bianca Schroeder, Sotirios Damouras, and Phillipa Gill, University of Toronto.

The paper builds on research and a dataset that StorageMojo reviewed 2 years ago in Latent sector errors in disk drives. That research analyzed the error logs of 50,000 NetApp arrays with 1.53 million enterprise and consumer disk drives.

Understanding
Understanding LSEs does a statistical deep dive on the disk LSE dataset and then evaluates scrubbing and intra-disk redundancy strategies against the field data.

Latent sector errors are important for 3 reasons:

  • 1 LSE can cause a RAID reconstruction failure in a single-parity RAID system (RAID 5) – see the sketch after this list.
  • Ever-tinier disk storage geometries make LSEs more likely.
  • Their failure mode is insidious: an LSE goes undetected until the affected sector is accessed.
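
Why does 1 LSE threaten a rebuild? A quick back-of-the-envelope sketch in Python – every number below is an illustrative assumption, not field data from the paper:

    # Probability that a RAID 5 rebuild hits at least one LSE while reading
    # every surviving drive end to end. All parameters are hypothetical.
    SECTOR_BYTES = 512
    drive_tb = 1            # capacity of each drive, in TB (assumed)
    surviving_drives = 6    # drives read in full during rebuild (assumed)
    p_sector = 1e-11        # per-sector LSE probability (illustrative)

    sectors_read = surviving_drives * drive_tb * 10**12 // SECTOR_BYTES
    p_rebuild_fail = 1 - (1 - p_sector) ** sectors_read
    print(f"sectors read: {sectors_read:.3e}")
    print(f"P(rebuild hits an LSE): {p_rebuild_fail:.2%}")

Note that this toy model assumes LSEs are independent. The paper's central finding is that they cluster in space and time, so treat the result as a first approximation only.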

Schroeder et al. used a subset of the LSE dataset that included only drives that had LSEs. This covered 29,615 nearline (presumably SATA) drives and 17,513 enterprise drives that had been in the field at least 12 months.

LSE metrics
Some of the paper's conclusions:

  • For most drives, almost all LSEs are single-sector errors; bursts of multiple contiguous logical block errors account for less than 2.5% of all LSEs.
  • When there is a 2nd error, it usually falls within 100 sectors of the 1st – see the sketch after this list.
  • Depending on the model, between 20% and 50% of errors are in the first 10% of the drive’s logical sector space. Some drives have a higher concentration of errors at the end of the drive as well.
  • LSEs are highly concentrated in a few short time intervals, not randomly spread out over a drive’s life.
  • It appears that events that are close in space are also close in time.
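
The spatial locality finding suggests an obvious optimization. Here's a hypothetical sketch – my illustration, not anything the paper specifies – of probing the neighborhood of a freshly detected LSE instead of waiting for the next full scrub:

    # After detecting an LSE, re-check nearby sectors immediately, since
    # follow-on errors tend to land within ~100 sectors of the first.
    def neighbors_to_probe(lse_sector: int, max_sector: int, radius: int = 100):
        """Return the window of sectors worth re-reading around a new LSE."""
        lo = max(0, lse_sector - radius)
        hi = min(max_sector, lse_sector + radius)
        return range(lo, hi + 1)

    # Example: an LSE at sector 5,000 triggers a targeted re-read of
    # sectors 4,900..5,100 on a hypothetical 1M-sector drive.
    window = neighbors_to_probe(5_000, max_sector=999_999)
    print(f"probe {len(window)} sectors: {window.start}..{window.stop - 1}")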

The rest of the paper
The paper also goes into 2 interesting topics – intra-disk redundancy and scrubbing strategies – that deserve posts of their own. For the latter, the research found that changing the order in which sectors are scrubbed can improve mean time to error detection by 40% – with no increase in overhead or scrub frequency.
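
For a feel of what "changing the order" might look like, here is a rough sketch of a staggered scrub order – my illustration of the general idea, with arbitrary parameters, not the paper's exact algorithm:

    # Instead of scrubbing sectors 0..N-1 sequentially, read one chunk from
    # each region per pass. Every part of the drive gets sampled early, so
    # clustered errors surface sooner. Parameters are arbitrary.
    def staggered_scrub_order(total_sectors: int, regions: int = 128,
                              chunk: int = 1024):
        """Yield (start, length) scrub chunks that stripe across the drive."""
        region_len = total_sectors // regions   # remainder ignored for brevity
        offset = 0
        while offset < region_len:
            for r in range(regions):
                yield r * region_len + offset, min(chunk, region_len - offset)
            offset += chunk

    # First few chunks for a hypothetical 1M-sector drive: one chunk from
    # each region before any region is visited a second time.
    for start, length in list(staggered_scrub_order(1_000_000))[:4]:
        print(f"scrub sectors {start}..{start + length - 1}")

The payoff comes from the locality result above: because errors cluster, sampling the whole drive early finds a burst sooner than a sequential pass that may not reach the affected region for hours.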

Conclusions
Key quote:

We observe that many of the statistical aspects of LSEs are well modeled by power-laws, including the length of error bursts (i.e. a series of contiguous sectors affected by LSEs), the number of good sectors that separate error bursts, and the number of LSEs observed per time. We find that these properties are poorly modeled by the most commonly used distributions, geometric and Poisson. Instead we observe that a Pareto distribution fits the data very well and report the parameters that provide the best fit. . . . We find no significant difference in the statistical properties of LSEs in nearline drives versus enterprise class drives.

[bolding added -ed. However, nearline drives are about 4x more likely to get an error.]
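
To make the power-law point concrete, here is a small sketch – using synthetic data, since the NetApp logs aren't public – that fits a Pareto and a geometric distribution and compares their tail predictions:

    # Fit Pareto vs. geometric to synthetic, heavy-tailed "error burst
    # lengths" standing in for the field data, then compare tails.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    bursts = np.ceil(rng.pareto(a=1.5, size=10_000) + 1)  # lengths >= 1 sector

    # Pareto fit (location pinned at 0) vs. a simple geometric fit (p = 1/mean).
    shape, loc, scale = stats.pareto.fit(bursts, floc=0)
    p_hat = 1.0 / bursts.mean()

    # How often does each model expect a burst longer than 100 sectors?
    tail = 100
    print(f"P(burst > {tail}) empirical : {(bursts > tail).mean():.4f}")
    print(f"P(burst > {tail}) Pareto    : {stats.pareto.sf(tail, shape, loc, scale):.4f}")
    print(f"P(burst > {tail}) geometric : {stats.geom.sf(tail, p_hat):.2e}")

The geometric model's tail probability collapses to essentially zero while the Pareto tracks the empirical tail – the same qualitative gap the authors report on the field data.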

The StorageMojo take
Disk-based storage arrays are facing a real challenge from flash and possibly PCM technology. Disks win the $/GB race, but piling double and triple parity on arrays increases costs and firmware complexity.

Understanding the nature of the enemy – in this case latent sector errors – helps array designers develop more reliable and cost-effective arrays. Yet one has to wonder if the RAID paradigm is reaching the end of the line.

Parallel and object-based systems from Isilon and Panasas, for example, are very fast at disk rebuilds because they can draw data from many disk drives in parallel – without the performance-killing overhead that RAID rebuilds impose.

But those are larger systems. For the SMB market, putting the paper's techniques together – smarter scrubbing and intra-disk redundancy – may give us reliable and economical RAID 5 systems for another decade or more.

Courteous comments welcome, of course. I've done work for Isilon – which also advertises on StorageMojo – and Panasas. The official best paper of FAST '10 was quFiles, which I blogged about last week.

If you spot a typo, please let me know. Thanks!