A 2½-year study of DRAM on tens of thousands of Google servers found DIMM error rates are hundreds to thousands of times higher than previously thought – a mean of 3,751 correctable errors per DIMM per year. Another piece of hallowed Conventional Wisdom bites the dust.
Google and Prof. Bianca Schroeder teamed up on the world’s first large-scale study of RAM errors in the field. They looked at multiple vendors, DRAM densities and DRAM types including DDR1, DDR2 and FB-DIMM.
Every system architect and motherboard designer should read it. And I agree with James Hamilton’s suspicion that even clients need ECC – at least heavily used clients.
If you can’t trust DRAM . . .
Here are some hard numbers from DRAM Errors in the Wild: A Large-Scale Field Study by Bianca Schroeder, U of Toronto, and Eduardo Pinheiro and Wolf-Dietrich Weber, Google.
What you don’t know can hurt you
Most DIMMs don’t include ECC because it costs more. Without ECC the system doesn’t know a memory error has occurred.
Which is part of the reason people aren’t more concerned. Ignorance is bliss.
Everything is fine until a memory error causes a missed memory reference or a flipped bit in file metadata being written to disk. What you see is a “file not found” or “file not readable” message, silent data corruption – or even a system crash. And nothing that says “memory error.”
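To see what ECC buys you, here’s a minimal Python sketch of a Hamming(7,4) single-error-correcting code – a simpler relative of the SEC-DED codes used on ECC DIMMs (the function names and the example values are mine, not from the study):

```python
# Hamming(7,4): 4 data bits + 3 parity bits; corrects any single-bit flip.
# Bit positions are 1-7; parity bits sit at positions 1, 2 and 4.

def hamming74_encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4        # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):
    # Recompute each parity check; the syndrome is the 1-based
    # position of the flipped bit (0 means no error detected).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    c = list(c)
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], syndrome

codeword = hamming74_encode([1, 0, 1, 1])
corrupted = list(codeword)
corrupted[4] ^= 1            # simulate a single-bit memory error
data, syndrome = hamming74_decode(corrupted)
# data == [1, 0, 1, 1]: the flip was detected and corrected
```

A non-ECC DIMM has no check bits at all, so the same flip would simply return wrong data. Real ECC DIMMs use wider codes – typically 72 bits protecting 64 data bits – but the principle is the same.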
The industry take on DRAM reliability is summed up in a quote from an old AnandTech FAQ that took the industry at its word:
Everyone can agree that hard errors are fairly rare. . . . For the frequency of soft errors. . . . IBM stated . . . that at sea level, a soft error event occurs once per month of constant use in a 128MB PC100 SDRAM module. Micron has stated that it is closer to once per six months . . . .
An even bigger surprise: it appears that hard errors, not soft errors, are the dominant error mode – the reverse of the conventional wisdom. This conclusion isn’t solid – the study’s data set didn’t distinguish between hard and soft errors – but the circumstantial evidence is suggestive. There may be another study coming that uses error address data to distinguish hard and soft errors.
The paper has a few issues that make it difficult to understand. One issue is the use of the chip industry’s Failure In Time (FIT) metric.
One FIT = one failure per billion hours of operation per Mbit.
Confused? Me too. Taken at face value, FIT implies that errors scale with capacity: a 2 GB DIMM – 16,384 Mbit – would have 16x the errors of a 128 MB DIMM.
But that isn’t what the study found: higher density DRAM doesn’t have more errors per DIMM. The FIT metric is most useful for comparing with earlier studies.
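To put FIT in concrete terms, here’s a back-of-the-envelope conversion (a sketch: the 25,000–70,000 FIT per Mbit range is the one the paper reports; the 1 GB DIMM size is my assumption):

```python
# Convert a FIT rate (failures per billion device-hours per Mbit)
# into expected errors per DIMM per year.
HOURS_PER_YEAR = 8766  # 365.25 days

def errors_per_dimm_year(fit_per_mbit, dimm_mbit):
    return fit_per_mbit * dimm_mbit * HOURS_PER_YEAR / 1e9

# The paper reports roughly 25,000-70,000 FIT per Mbit across platforms.
# For an assumed 1 GB DIMM (8,192 Mbit):
low = errors_per_dimm_year(25_000, 8192)   # ~1,795 errors/year
high = errors_per_dimm_year(70_000, 8192)  # ~5,027 errors/year
```

A mid-range FIT rate on that 1 GB DIMM lands near the study’s mean of 3,751 correctable errors per DIMM per year, which is a useful sanity check on the numbers.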
The study had some good news:
- Temperature plays little role in errors – just as Google found with disk drives – so heroic cooling isn’t necessary. Good news for data center air economizer architectures.
- Density isn’t a problem. The latest, most dense generations of DRAM perform as well, error wise, as previous generations.
- Heavily used systems have more errors.
- No significant differences between vendors or DIMM types (DDR1, DDR2 or FB-DIMM). You can buy on price – at least for ECC DIMMs.
- On average, only 8% of DIMMs saw errors in a given year. Fewer DIMMs = fewer error problems – good news for users of smaller systems. Bad news for large-memory servers running in-memory databases.
Besides error rates much higher than expected – which is plenty bad – the study found that error rates were motherboard, not DIMM type or vendor, dependent. Some popular mobos must have poor EMI hygiene.
Route memory traces too close to noisy components or skimp on grounding layers and you get instant error problems. Design or manufacturing problems in motherboards? The study did not do a root cause analysis.
Hard errors are much more common than expected as well and may be the most common type of memory failure. Google replaces all DIMMs with hard errors – as do most data centers – as a matter of policy.
The server error reporting could not always differentiate between hard and soft errors. Hard errors are discovered through memory tests run on off-line servers.
Other interesting findings
For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform. There be lemons out there!
In more than 93% of the cases a machine that sees a correctable error experiences at least one more in the same year. They don’t get better by themselves.
High-quality error correction codes are effective in reducing uncorrectable errors. There are “chipkill” DIMM/mobo combinations that can detect and correct 4-bit errors, but few vendors offer those. Kingston and Corsair don’t.
Besides costing more, ECC DIMMs are about 3-5% slower than unprotected DIMMs. Few of us would ever notice that small a performance hit, but gamers might care.
HPC users might care too, for a different reason. James Hamilton noted a talk by Kathy Yelick – she doesn’t keep her web site updated – where she found that ECC recovery times are substantial and the correction latency slows the computation.
The StorageMojo take
You’d think that after several decades of semiconductor DRAM use this study would be old news. I did.
Like most folks I accepted industry assurances that DRAM is reliable. My main machine – a Mac Pro with an Intel server-class mobo – has FB-DIMMs whose 5-watt-per-DIMM overhead has irritated me. But when I found one DIMM reporting errors recently I felt better about it.
I suspect this is another example of the industry’s code of omertà. System vendors have scads of mortality data on disk drives, DRAM, network adapters, OSes and file systems from tech support calls, but do they share it with the consuming public? Nothing to see here folks, just move along.
Kudos to Google for doing the long-term research required for substantive results and then sharing those results with the rest of us. I expect ECC systems will become a lot more popular in the years ahead.
Courteous comments welcome, of course. Note: Much of this was published on ZDnet Sunday night. This version is updated after speaking to Prof. Schroeder Wednesday. This version also dispenses with some consumer-oriented content.