Research a few years ago (see Nightmare on DIMM street) found that DRAM error rates were hundreds to thousands of times higher than vendors had led us to believe.
But what is the nature of those errors? Are they soft errors – as is commonly believed – where a stray alpha particle flips a bit? Or are they hard errors, where a bit gets stuck?
Errors soft and hard
If they’re soft and random, there’s little we can do. But if they’re hard, there are ways to lessen their impact while operating more efficiently.
According to University of Toronto researchers who looked at tens of thousands of processors at Google and several national labs (see Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design by Andy Hwang, Ioan Stefanovici and Bianca Schroeder), hard errors are common, but their nature isn’t binary either. Memory locations can become error prone, without being permanently stuck, perhaps sensitive to access patterns.
The researchers looked deep into four large installations: IBM Blue Gene/L (BG/L) at LLNL; Blue Gene/P (BG/P) at Argonne National Laboratory; an HPC cluster at Canada's SciNet; and 20,000 Google servers. The Google systems weren't as well instrumented as the others, so some of their errors were conservatively estimated.
Differentiating hard from soft errors means determining their root cause. At this study's scale, offline memory testing wasn't feasible, so reasonable assumptions were made to classify errors, even though some intermittent errors only become permanent over time.
The key assumption . . . is that repeat errors at the same location are likely due to hard errors since it would be statistically extremely unlikely that the same location would be hit twice within our measurement period by cosmic rays or other external sources of noise. . . . Note however that in practice, hard errors manifest themselves as intermittent rather than on every access to a particular memory location.
What price data integrity?
Most consumer PCs – including all Macs except the Mac Pro – have no DRAM error correcting code (ECC). Workstations, servers and supercomputers commonly do.
The simplest common ECC is SECDED: it detects and corrects single-bit errors, and detects – but cannot correct – 2-bit errors.
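To make the SECDED idea concrete, here's a toy sketch using a Hamming(7,4) code plus an overall parity bit. It protects only a 4-bit nibble (real DRAM ECC protects 64-bit words with more efficient codes), but shows the mechanism: one flipped bit is located and corrected, two flipped bits are detected but not correctable.

```python
# Toy SECDED: Hamming(7,4) plus an overall parity bit.
# Illustrative only -- real memory controllers use wider codes.

def encode(nibble):
    """Encode 4 data bits (list of 0/1) into an 8-bit SECDED codeword."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4            # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4            # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4            # covers positions 4,5,6,7
    word = [p1, p2, d1, p3, d2, d3, d4]
    word.append(word[0] ^ word[1] ^ word[2] ^ word[3]
                ^ word[4] ^ word[5] ^ word[6])   # overall parity bit
    return word

def decode(word):
    """Return (status, data). status is 'ok', 'corrected' or 'detected'."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of a single flip
    overall = w[0]^w[1]^w[2]^w[3]^w[4]^w[5]^w[6]^w[7]
    if syndrome and overall:          # one bit flipped: correct it
        w[syndrome - 1] ^= 1
        return 'corrected', [w[2], w[4], w[5], w[6]]
    if syndrome:                      # two bits flipped: detect only
        return 'detected', None
    if overall:                       # the overall parity bit itself flipped
        w[7] ^= 1
        return 'corrected', [w[2], w[4], w[5], w[6]]
    return 'ok', [w[2], w[4], w[5], w[6]]

w = encode([1, 0, 1, 1])
w[4] ^= 1                             # flip one bit in transit
print(decode(w))                      # -> ('corrected', [1, 0, 1, 1])
```

The redundancy cost is visible even in the toy: 4 extra bits per 4 data bits. Real SECDED amortizes this to 8 check bits per 64 data bits, which is why ECC DIMMs carry a ninth chip per rank.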
On the high end is the sophisticated and costly chipkill – developed by IBM – which can survive the loss of an entire memory chip, or many multi-bit errors. When you're running a six-month simulation job on one of the world's most powerful supercomputers, with many terabytes of DRAM, you don't want a single chip failure to hose the job.
But all ECC systems rely – like RAID systems – on redundancy and extra computation to do their magic. Which means cost, power and performance hits. Thus the interest in optimizing an ECC strategy.
The paper makes a number of important observations:
- There are strong correlations between errors in space and time, suggesting hard errors.
- The frequency of multi-bit and chipkill errors also points to hard errors, as these are unlikely to be soft errors.
- Many nodes with correctable errors relied on advanced ECC mechanisms: 20%-45% activated redundant bit-steering, and 15% activated chipkill.
- Background scrubbing does not significantly shorten the amount of time until a repeat error. This suggests some errors are intermittent and only seen under certain access patterns.
- Memory used by the OS seems to see more errors.
- Some pages account for a large fraction of errors.
- An operating system that could identify and retire error-prone pages would avoid 90% of errors by retiring only 10% of pages with errors.
- Stringent availability requirements might make chipkill economic.
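The page-retirement observation is easy to see in miniature. The sketch below uses synthetic per-page error counts (not the study's data) arranged with roughly the skew the paper reports: a handful of bad pages account for most errors, so retiring the worst 10% of error-prone pages removes most of the error exposure.

```python
# Toy sketch of OS-level page retirement. The counts are synthetic,
# chosen only to illustrate the skew the study found.

def errors_avoided(page_errors, retire_fraction=0.10):
    """Fraction of total errors removed by retiring the top
    retire_fraction of error-prone pages."""
    ranked = sorted(page_errors, reverse=True)
    n_retire = max(1, int(len(ranked) * retire_fraction))
    return sum(ranked[:n_retire]) / sum(ranked)

# 100 pages with errors: 10 "bad" pages with 90 errors each,
# 90 pages with a single error each.
counts = [90] * 10 + [1] * 90
print(round(errors_avoided(counts), 2))   # -> 0.91
```

The policy costs almost nothing: a retired 4 KB page is a rounding error against gigabytes of DRAM, which is why the 90%-of-errors-for-10%-of-pages trade looks so attractive.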
The StorageMojo take
Consumers don’t care about ECC, but high-scale cloud providers have the economic incentive to optimize ECC strategies. The stepwise enhancement will drive useful changes in servers, but the benefits will accrue to cloud services, not enterprises.
It’s a matter of scale.
Google appears to have plans for this: they refused to release their server data from the study. The paper suggests both hardware and software strategies that would improve error handling and reduce costs, so why give the competition free data?
The prevalence of hard errors is the single most important conclusion. It demolishes the “DRAM errors are soft” myth so we can get on with the work of making DRAM – and all systems that use it – more reliable.
Courteous comments welcome, of course.