Research a few years ago (see Nightmare on DIMM street) found that DRAM error rates were hundreds to thousands of times higher than vendors had led us to believe.
But what is the nature of those errors? Are they soft errors – as is commonly believed – where a stray alpha particle flips a bit? Or are they hard errors, where a bit gets stuck?
Errors soft and hard
If they’re soft and random, there’s little we can do. But if they’re hard, there are ways to lessen their impact while operating more efficiently.
According to University of Toronto researchers who looked at tens of thousands of processors at Google and several national labs (see Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design by Andy Hwang, Ioan Stefanovici and Bianca Schroeder), hard errors are common, but they aren’t simply binary either: memory locations can become error prone, perhaps sensitive to access patterns, without being permanently stuck.
The researchers looked deeply into four large installations: IBM Blue Gene/L (BG/L) at LLNL; Blue Gene/P (BG/P) at Argonne National Laboratory; an HPC cluster at Canada’s SciNet; and 20,000 Google servers. The Google systems weren’t as well instrumented as the others, so some errors were conservatively estimated.
Differentiating hard from soft errors means determining their root cause, but at this study’s scale offline memory testing wasn’t feasible. Reasonable assumptions were made to classify errors, even though some intermittent errors only become permanent over time.
The key assumption . . . is that repeat errors at the same location are likely due to hard errors since it would be statistically extremely unlikely that the same location would be hit twice within our measurement period by cosmic rays or other external sources of noise. . . . Note however that in practice, hard errors manifest themselves as intermittent rather than on every access to a particular memory location.
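To make that heuristic concrete, here’s a minimal sketch (not the paper’s actual pipeline) of the repeat-at-the-same-location rule, with an assumed log format: addresses that err more than once in the measurement window are counted as likely hard errors, while single hits might be soft, or hard errors that simply haven’t repeated yet.

```python
# Minimal sketch of the repeat-error heuristic: an address that errs more than
# once in the measurement window is counted as a likely hard error, since two
# independent cosmic-ray hits at the same location are statistically negligible.
# The log format (one physical address per corrected error) is assumed.
from collections import Counter

def classify(error_addresses):
    """Return (likely_hard, single_hits) sets of physical addresses."""
    counts = Counter(error_addresses)
    likely_hard = {a for a, n in counts.items() if n > 1}   # repeat offenders
    single_hits = {a for a, n in counts.items() if n == 1}  # soft, or hard but not yet repeated
    return likely_hard, single_hits

hard, maybe_soft = classify([0xDEADB000, 0xDEADB000, 0x1000, 0x2000, 0xDEADB000])
print(hard)   # the repeated address is flagged as a likely hard error
```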
What price data integrity?
Most consumer PCs – including all Macs except the Mac Pro – have no DRAM error correcting code (ECC). Workstations, servers and supercomputers commonly do.
The simplest and most common ECC detects and corrects single-bit errors and detects – but cannot correct – 2-bit errors.
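For the curious, here’s a minimal sketch of how that single-error-correct, double-error-detect (SEC-DED) scheme works, using a toy Hamming(7,4) code plus an overall parity bit. Real DRAM ECC uses wider codes (typically 72 bits protecting 64 bits of data), but the mechanism is the same.

```python
# Toy SEC-DED: Hamming(7,4) plus an overall parity bit. Codeword layout is
# [p0, p1, p2, d1, p3, d2, d3, d4]; p1/p2/p3 are Hamming parity bits, p0 is
# overall parity. Real DRAM ECC uses a wider code, but the logic is the same.

def encode(d1, d2, d3, d4):
    p1 = d1 ^ d2 ^ d4                 # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                 # covers positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                 # covers positions 4,5,6,7
    word = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for b in word:                    # overall parity over the 7-bit Hamming word
        p0 ^= b
    return [p0] + word

def decode(cw):
    p0, p1, p2, d1, p3, d2, d3, d4 = cw
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s3 = p3 ^ d2 ^ d3 ^ d4
    syndrome = s1 + 2 * s2 + 4 * s3   # points at the flipped position, 1-7
    overall = 0
    for b in cw:
        overall ^= b                  # 0 if overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:                # odd number of flips: single error, fix it
        cw = cw[:]
        cw[syndrome if syndrome else 0] ^= 1
        status = "corrected"
    else:                             # nonzero syndrome, even parity: 2-bit error
        return None, "detected-uncorrectable"
    return cw[3:4] + cw[5:8], status  # the four data bits

cw = encode(1, 0, 1, 1)
cw[5] ^= 1                            # flip one bit (a soft error, or a stuck cell)
print(decode(cw))                     # -> ([1, 0, 1, 1], 'corrected')
cw[6] ^= 1                            # a second flip in the same word
print(decode(cw))                     # -> (None, 'detected-uncorrectable')
```

The syndrome points at the flipped bit when exactly one bit is wrong; when two bits flip, the syndrome is nonzero but the overall parity still checks out, which is how the double error is detected but not corrected.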
On the high end is the sophisticated and costly chipkill – developed by IBM – that can survive the loss of an entire memory chip – or many multi-bit errors. When you’re running a 6 month simulation job on one of the world’s most powerful supercomputers with many terabytes of DRAM, you don’t want a single chip failure to hose the job.
But all ECC systems rely – like RAID systems – on redundancy and extra computation to do their magic. Which means cost, power and performance hits. Thus the interest in optimizing an ECC strategy.
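To make the RAID analogy concrete, here’s a toy sketch of the erasure-recovery idea behind chipkill. Real chipkill uses symbol-oriented codes rather than simple XOR parity, but the principle of keeping enough redundancy across chips to rebuild one whole chip’s contribution is the same; the values below are purely illustrative.

```python
# Toy chipkill-style recovery: spread a cache line across data chips plus one
# parity chip (RAID-4 style XOR). If one chip dies outright, its contribution
# can be rebuilt from the survivors. Real chipkill uses symbol codes, not XOR.

def parity(chip_bytes):
    p = 0
    for b in chip_bytes:
        p ^= b
    return p

def rebuild(dead, chip_bytes, p):
    """XOR everything except the dead chip's (lost) byte back out of the parity."""
    out = p
    for i, b in enumerate(chip_bytes):
        if i != dead:                 # the dead chip's data is unreadable, skip it
            out ^= b
    return out

chips = [0x3C, 0xA7, 0x19, 0xF0]      # one byte per data chip (illustrative)
p = parity(chips)                     # stored on the extra, redundant chip
print(hex(rebuild(2, chips, p)))      # chip 2 fails -> recovers 0x19
```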
Key observations
The paper makes a number of important observations:
- There are strong correlations between errors in space and time, suggesting hard errors.
- The frequency of multi-bit and chipkill errors also points to hard errors, as these are unlikely to be soft errors.
- Many nodes with correctable errors used advanced ECC mechanisms: 20%-45% activated redundant bit-steering and 15% activated chipkill.
- Background scrubbing does not significantly shorten the amount of time until a repeat error. This suggests some errors are intermittent and only seen under certain access patterns.
- Memory used by the OS seems to see more errors.
- Some pages account for a large fraction of errors.
- An operating system that could identify and retire error-prone pages would avoid 90% of errors by retiring only 10% of the pages with errors (see the sketch after this list).
- Stringent availability requirements might make chipkill economic.
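As promised above, here’s a hedged sketch of the page-retirement idea: count corrected errors per physical page and retire the worst offenders. The log format and the 10% budget are illustrative assumptions; a real OS would also need a hook to actually unmap and blacklist the chosen pages.

```python
# Sketch of error-prone page retirement: because hard errors repeat at the same
# locations, retiring a small fraction of the worst pages avoids most errors.
# The error log format and the retirement budget here are assumptions.
from collections import Counter

def pages_to_retire(error_log, fraction=0.10):
    """error_log: one physical page number per corrected error."""
    counts = Counter(error_log)
    budget = max(1, int(len(counts) * fraction))  # retire ~10% of pages that erred
    chosen = [page for page, _ in counts.most_common()[:budget]]
    covered = sum(counts[p] for p in chosen) / sum(counts.values())
    return chosen, covered            # 'covered' is the fraction of errors avoided

# Synthetic, skewed log: a few pages dominate, as the paper observed.
log = [0x1A2B] * 60 + [0x33F0] * 30 + list(range(100, 120))
retire, coverage = pages_to_retire(log)
print(retire, round(coverage, 2))     # retiring 2 of 22 pages covers ~82% here
```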
The StorageMojo take
Consumers don’t care about ECC, but high-scale cloud providers have the economic incentive to optimize ECC strategies. The stepwise enhancement will drive useful changes in servers, but the benefits will accrue to cloud services, not enterprises.
It’s a matter of scale.
Google appears to have plans for this: they refused to release their server data from the study. The paper suggests both hardware and software strategies that would improve error handling and reduce costs, so why give the competition free data?
The prevalence of hard errors is the single most important conclusion. It demolishes the “DRAM errors are soft” myth so we can get on with the work of making DRAM – and all systems that use it – more reliable.
Courteous comments welcome, of course.
Some DRAM failures can be systemic; I replaced over 500 FBDIMMs in a 150-node Intel 53xx Xeon compute cluster because a particular batch of AMB serial-to-parallel buffer chips was failing at a greatly increased rate when the cluster was just over three years old (but still under parts-only warranty). The nodes ran at close to full power for days at a time with short drops back to idle between jobs, so the AMB chips would have seen regular and large temperature swings. A common failure mode for these bad AMB chips was to fail completely at reboot, which was frustrating because the node would not even POST with the bad FBDIMM installed. It was actually a relief to see ECC messages, because it meant that I could directly replace one FBDIMM instead of pulling sets of FBDIMMs to isolate the one that was preventing the node from POSTing.
Fortunately, not all >500 DIMMs were replaced in that way; the vendor eventually identified the root cause of these high failure rates, and provided advance replacements for the remaining FBDIMMs with suspect AMB chips.
Bad AMB chips aside, anyone running extended full-cluster jobs will be exposed to enough other modes of hardware failure that regular job checkpointing is a vital part of such runs. Lots of batch clusters run with a single IB link per compute node, for starters. The clusters tend to be designed for overall robustness, as losing a few nodes over time at that scale is a fact of life; the important thing is not to lose too many hours of computation when a node dies, and to keep the job throughput high in the face of these small failures.
IBM has Chipkill; HP has Advanced ECC, which is similar:
ftp://ftp.hp.com/pub/c-products/servers/options/c00256943.pdf
I’ve been using Advanced ECC since it became standard on all HP gear around 2003. Some of the really low-end DL00-series systems don’t have Advanced ECC, so I tend to avoid that product line.
At one point I tried finding more in-depth information on how Chipkill works but came up empty; the above document from HP has some interesting info though.
One of the reasons I don’t buy from Dell is that they don’t offer similar technology. When you’re operating with tens of gigabytes of memory, or in some cases 100GB+, this sort of tech is crucial. My current production systems run with 192GB each. ECC alone is not enough.
The latest Intel CPUs have some advanced memory protection features as well, so Dell has been able to offer something similar on their platform for the latest Intel chips (my servers are Opterons, so no help there from Dell). Again, I tried to find more in-depth info on the specifics behind this technology but came up empty.
Many servers have also been able to mirror memory for some time, though I’ve never seen a server deployed with that enabled.
I had my first minor corruption from bad memory on my desktop at work about a week ago. I’ve had RAM go bad before, but it has always resulted in full system crashes (of which I had one on this desktop in recent weeks), never data corruption. Fortunately the corruption was minor – as far as I know just one or two files, both text. The RAM that IT gave me as a system upgrade went bad (it was a brand I had never heard of, so I wasn’t happy when they upgraded me with it).
btw Robin your RSS feed has been broken for a while!
Our observations are that many of these hard errors are caused by corner-case design flaws. The unpredictable nature of the traffic flow on the memory bus once it’s deployed in the field brings out these problems. At the request of a large memory controller vendor we designed a new tool that can look for over 400 JEDEC spec violations on every clock tick; you can see information at the links below. We also find that validation is not a real strong point for some vendors: they just slap these systems together thinking that memory integrity is something somebody else took care of. We routinely see spec violations that could easily lead to hard errors and system crashes on brand-new systems right out of the box.

The bottom line is that these recent research papers are good in that they highlight the problem, but they rely on system logs, which really don’t give the insight that’s needed. We are seeing the problem on a whole different level: the hardware level, clock tick by clock tick. I have tried contacting the authors to share our insight but no response as of yet!
http://www.jedec.org/sites/default/files/Barbara_A_summary.pdf
http://www.memcon.com/pdfs/Response_to_the_Google_Study_Could_Those_Memory_Failures_Be_Caused_by_Protocol_Violations.pdf
video: http://www.memcon.com/conference.aspx
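To make the kind of check described in that comment concrete, here’s a toy sketch of a single JEDEC-style timing rule, the ACTIVATE-to-READ spacing (tRCD), applied to a command trace. The trace format, command names and timing value are illustrative assumptions, not the commenter’s actual tool, which checks hundreds of rules on every clock tick.

```python
# Toy protocol check: flag any READ issued to a bank sooner than tRCD clocks
# after that bank's ACTIVATE. A real checker covers hundreds of JEDEC rules;
# the trace format and timing value here are illustrative assumptions.
TRCD_CLOCKS = 14                        # assumed tRCD for some DDR3 speed grade

def check_trcd(trace):
    """trace: (clock, command, bank) tuples in clock order; yields violations."""
    last_act = {}                       # bank -> clock of its most recent ACTIVATE
    for clock, cmd, bank in trace:
        if cmd == "ACT":
            last_act[bank] = clock
        elif cmd == "RD" and bank in last_act:
            gap = clock - last_act[bank]
            if gap < TRCD_CLOCKS:       # row not open long enough: spec violation
                yield (clock, bank, gap)

trace = [(0, "ACT", 2), (10, "RD", 2), (50, "ACT", 3), (66, "RD", 3)]
print(list(check_trcd(trace)))          # -> [(10, 2, 10)]: the first READ is too early
```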