I was surprised at how many ZDNet readers reacted with disbelief to my recent Storage Bits series on data corruption (see How data gets lost, 50 ways to lose your data, and How Microsoft puts your data at risk), claiming it had never happened to them.
Then I thought about it
What does data corruption look like to users? Does a window pop up with big red letters blaring “DATA CORRUPTION!!!”? Nope. We get “File not found” and other notices that could be – but who knows? – related to data corruption. Something goes badly wrong and you have to reinstall an application or the OS. But really, how prevalent is data corruption?
CERN does some research
That’s why I was delighted to see a new paper from CERN. Now, finally, some statistics are in, reported in a recent paper titled Data Integrity by Bernd Panzer-Steindel of the CERN IT group.
Petabytes of on-disk data analyzed
At CERN, the world’s largest particle physics lab, several researchers have analyzed the creation and propagation of silent data corruption. CERN’s huge collider – built beneath Switzerland and France – will generate 15 thousand terabytes of data next year.
The experiments at CERN – high energy “shots” that create many terabytes of data in a few seconds – then require months of careful statistical analysis to find traces of rare and short-lived particles. Errors in the data could invalidate the results, so CERN scientists and engineers did a systematic analysis to find silent data corruption events.
The program
The analysis looked at data corruption at 3 levels:
- Disk errors. They wrote a special 2 GB file to more than 3,000 nodes every 2 hours and read it back, checking for errors, for 5 weeks. They found 500 errors on 100 nodes. (A sketch of such a write/verify probe appears after this list.) The errors broke down as follows:
  - Single bit errors. 10% of disk errors.
  - Sector (512 byte) sized errors. 10% of disk errors.
  - 64 KB regions. 80% of disk errors. This one turned out to be a bug in WD disk firmware interacting with 3Ware controller cards, which CERN fixed by updating the firmware in 3,000 drives.
- RAID errors. They ran the verify command on 492 RAID systems each week for 4 weeks. The disks are spec’d at a Bit Error Rate of 1 in 10^14 bits read/written. The good news is that the observed BER was only about a third of the spec’d rate. The bad news is that in reading and writing 2.4 petabytes of data there were some 300 errors.
- Memory errors. Good news: only 3 double-bit errors in 3 months on 1,300 nodes. Bad news: according to the spec there shouldn’t have been any. Unlike single-bit errors, double-bit errors can’t be corrected.
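For the curious, here’s roughly what a write/verify probe of this kind looks like. This is my own minimal sketch, not CERN’s actual program; the path, file size and checksum choice are stand-ins:

```python
#!/usr/bin/env python3
"""Minimal sketch of a write/verify disk probe (not CERN's actual tool).

Write a file of known content, force it to disk, read it back and compare
checksums; any mismatch is silent corruption somewhere in the chain.
"""
import hashlib
import os

PROBE_PATH = "/data/probe.bin"   # stand-in target path
PROBE_SIZE = 2 * 1024**3         # 2 GB, as in the CERN test
CHUNK = 4 * 1024**2              # work in 4 MB chunks to keep memory use low

def write_probe(seed: bytes) -> str:
    """Write PROBE_SIZE bytes of deterministic data and return its SHA-1."""
    pattern = hashlib.sha512(seed).digest() * (CHUNK // 64)
    h = hashlib.sha1()
    with open(PROBE_PATH, "wb") as f:
        written = 0
        while written < PROBE_SIZE:
            f.write(pattern)
            h.update(pattern)
            written += len(pattern)
        f.flush()
        os.fsync(f.fileno())     # make sure the data actually reaches the disk
    return h.hexdigest()

def verify_probe(expected: str) -> bool:
    """Read the probe file back and compare its checksum to what was written."""
    h = hashlib.sha1()
    with open(PROBE_PATH, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest() == expected

if __name__ == "__main__":
    # A real probe would drop the page cache (or use O_DIRECT) and wait
    # between the write and the read, so the read-back comes off the platters.
    digest = write_probe(b"probe-seed")
    print("OK" if verify_probe(digest) else "SILENT CORRUPTION DETECTED")
```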
All of these errors will corrupt user data. When they checked 8.7 TB of user data for corruption – 33,700 files – they found 22 corrupted files, or 1 in every 1500 files.
The bottom line
CERN found an overall byte error rate of about 1 in 3 * 10^7 – considerably worse than the 1-in-10^14 or 1-in-10^12 rates spec’d for individual components would suggest. This isn’t sinister.
It’s the combined BER of every link in the chain from CPU to disk and back again, plus the fact that some traffic, such as transferring a byte from the network to a disk, requires 6 memory read/write operations. That really pumps up the data volume and, with it, the likelihood of encountering an error.
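To see how the chain compounds, here’s a rough back-of-the-envelope sketch. The hop list and rates are made-up illustrations, not CERN’s measurements:

```python
# Back-of-the-envelope: how small per-hop error rates compound along the
# CPU-to-disk chain. The hops and rates below are illustrative assumptions,
# not measured values from the CERN study.

per_byte_error_rate = {
    "network to memory":     1e-15,
    "memory r/w (6 passes)": 6 * 1e-15,   # 6 memory operations per byte moved
    "bus / controller":      1e-15,
    "disk write":            1e-14,
    "disk read":             1e-14,
}

# For rare, independent faults the per-hop probabilities roughly add.
end_to_end = sum(per_byte_error_rate.values())
print(f"end-to-end per-byte error rate ~ {end_to_end:.1e}")
# The chain as a whole is several times worse than any single component spec,
# and every extra hop or copy makes it worse still.
```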
The cost of accuracy
Accuracy isn’t free. The CERN paper concludes that taking measures to improve accuracy
. . . will lead to a doubling of the original required IO performance on the disk servers and . . . an increase of the available CPU capacity on the disk servers (50% ?!). This will of course have an influence on the costing and sizing of the CERN computing facility.
The Storage Bits take
My system has 1 TB of data on it, so if the CERN numbers hold true for me I have 3 corrupt files. Not a big deal for most people today. But if the industry doesn’t fix silent data corruption the problem will get worse. In “Rules of thumb in data engineering” the late Jim Gray posited that everything on disk today will be in main memory in 10 years.
If that empirical relationship holds, my PC in 2017 will have a 1 TB main memory and a 200 TB disk store. And about 500 corrupt files. At that point everyone will see data corruption and the vendors will have to do something.
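For the record, the arithmetic behind that number, applying CERN’s observed file corruption rate to a hypothetical 200 TB home disk store:

```python
# Quick sanity check on the 2017 projection, using CERN's observed
# file corruption rate (22 corrupt files in 8.7 TB of user data).
corrupt_files_per_tb = 22 / 8.7            # from the CERN user-data check
print(round(200 * corrupt_files_per_tb))   # ~506 corrupt files on 200 TB
```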
So why not start fixing the problem now?
Comments welcome, of course.
Update: Peter Kelemen, one of the CERN researchers, kindly wrote in and pointed out that it is the disks that are rated at 10^14, not the RAID card. There are no specs for the RAID cards. I’ve corrected it above.
There is an interview by Scoble with the ZFS guys, in which they discuss, amongst other things, the data corruption research done by CERN. ZFS seems to address this issue.
A must view.
http://www.podtech.net/scobleshow/technology/1619/talking-storage-systems-with-suns-zfs-team
Double-bit memory errors can’t be corrected, but can be detected, so if the OS is configured properly this will usually cause a crash instead of data corruption. Further, if the system scrubs memory then the odds of two single-bit errors being close enough in both space and time to generate a double-bit error are much reduced.
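As a rough illustration of why scrubbing matters, here’s a toy model of the double-bit error rate as a function of scrub interval. Every number in it is an assumption, not a measurement:

```python
# Toy model: why memory scrubbing reduces double-bit ECC errors.
# All rates and sizes below are illustrative assumptions, not measurements.

single_bit_errors_per_hour = 1e-3          # assumed rate per node
ecc_words = (2 * 1024**3) // 8             # 64-bit ECC words in 2 GB (assumed)

def double_bit_rate(scrub_interval_hours: float) -> float:
    """Approximate rate (per hour) of a second hit landing on a word that
    already holds an uncorrected single-bit error."""
    # A latent single-bit error survives about half a scrub interval on average.
    latent_errors = single_bit_errors_per_hour * scrub_interval_hours / 2
    fraction_of_words_corrupt = latent_errors / ecc_words
    return single_bit_errors_per_hour * fraction_of_words_corrupt

print(f"scrubbed once a year : {double_bit_rate(24 * 365):.1e} per hour")
print(f"scrubbed every day   : {double_bit_rate(24):.1e} per hour")
```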
A second observation is that the study doesn’t seem to have done much to distinguish between media and interface errors except in the 3ware/WD case. Are the 20% of bit or sector errors bad on disk (in which case a re-read will yield similar incorrect results), bad in transmission (in which case a re-read will yield correct data), or bad in a cache somewhere (which could show up either way depending on cache type and warmth)? Obviously the answer has many implications for future system design, though from an OS perspective all answers tend to suggest that end-to-end integrity checking is a good idea. Yet again, an idea that’s old in networking finally shows up in storage.
There are firmware updates that fix timeout issues for a variety of hard drive manufacturers, including specific WD models. Depending on which hard drive you have and whether you have updated its firmware, you will probably have a higher or lower failure rate than average. Multiple drive failures are also more likely on these drives.
The CERN report dug deep enough to find that issue.
The other big failure area is bad blocks that are not routinely fixed, followed by a failure of one of the RAID members. The only way to see these that I am aware of is to look at RAID controllers that log specific error events.
One interesting question not answered is about transient errors. If you read a block from disk and get an error, do you still get an error if you read it again?
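One way to answer that on Linux is to drop the page cache and re-read the suspect region a few times; if the re-reads differ, the error was transient rather than on the media. A hedged sketch (the path and offset are hypothetical):

```python
# Sketch: re-read a suspect region with the page cache dropped in between,
# to see whether a bad read is repeatable (media) or transient (transmission).
# Linux-only; the file path and offset are hypothetical placeholders.
import hashlib
import os

PATH = "/data/suspect.file"   # hypothetical file that read back wrong
OFFSET = 1_048_576            # hypothetical suspect offset, in bytes
LENGTH = 512

def read_region() -> bytes:
    fd = os.open(PATH, os.O_RDONLY)
    try:
        # Ask the kernel to drop cached pages so the read hits the disk again.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return os.pread(fd, LENGTH, OFFSET)
    finally:
        os.close(fd)

digests = {hashlib.sha1(read_region()).hexdigest() for _ in range(5)}
print("re-reads consistent" if len(digests) == 1 else "re-reads differ: transient")
```

Consistent re-reads only show the error is stable, of course; telling “consistently wrong” from “correct” still needs a known-good copy or checksum to compare against.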
The way ZFS uses checksums on all blocks really helps with integrity. Generally, reading over all the ZFS documentation, they’ve thought hard about the integrity issue and seem to have covered a lot of bases others are missing.
One bit of irony with the link you have to CERN above. Where was the World Wide Web created? At CERN, as an “information management system” exactly for distributing information like this (http://en.wikipedia.org/wiki/World_Wide_Web#History). So what format is the paper at CERN in? PDF of course. *sigh*
Robin,
You’re right, the errors experienced at CERN are not unusual and many times are not repeatable. I have a 20 TB array and another 10 TB array, and I see bad data juju once a month.
The 3 * 10^7 BER is not unexpected. I’m glad to see someone has finally blogged about it and brought the issue out of the closet. It’s also good to see an end user finally come out and disclose the data errors they’re actually experiencing.
There is a lot of work done in the telco and satcom worlds to model and correct the types of errors now being experienced in the data storage arena.
Sometimes in the open systems world, an unaddressed critical issue like data integrity is treated like the “crazy aunt locked in the basement” — no one wants to talk about it. In the past, the open systems computer industry has typically responded to this particular issue by passing the buck… Drive mfgs say “it’s the RAID controller’s problem”, the RAID controller mfgs push it off to the host computer systems, who in turn push the problem off to the OS, where it either gets pushed back down the food chain or passed up to the applications to deal with. No one in the industry really wants to take ownership of this issue: it’s a complex problem, it requires participation from several different component vendors, and it requires significant changes to product lines. In other words… it’s expensive to fix.
Are there any heroes in this story of corruption and buck-passing intrigue? That is, besides Robin and his blogs.
In 2003, the T10 group took on an effort to standardize an “end to end” error detection scheme. They called it the T10 Data Integrity Field standard, commonly referred to as T10-DIF. It doesn’t correct data corruption, but at least it can be detected, which may be enough until data capacities get larger. T10-DIF uses an additional data field (the DIF), generated by the host, that accompanies the data from the host application to the disk and is returned to the host during read operations. If any discrepancies occur between the DIF and the data during a read, the host should be able to detect them. The spec is still a work in progress in some areas. Although not complete, T10-DIF still provides significant value for storage data integrity.
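As I understand the format, the DIF is 8 bytes appended to each 512-byte block: a 2-byte guard tag (a CRC of the data), a 2-byte application tag and a 4-byte reference tag (normally the low 32 bits of the LBA). A toy sketch of the idea, not a reference implementation:

```python
# Toy sketch of the T10 DIF idea: 8 extra bytes per 512-byte block.
# Field layout and the 0x8BB7 guard polynomial reflect my reading of the
# spec; treat this as an illustration, not a reference implementation.
import struct

def crc16_t10dif(data: bytes) -> int:
    """CRC-16 with the T10-DIF polynomial 0x8BB7 (MSB-first, initial value 0)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def make_dif(block: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte DIF: guard CRC, application tag, reference tag."""
    assert len(block) == 512
    return struct.pack(">HHI", crc16_t10dif(block), app_tag, lba & 0xFFFFFFFF)

def check_dif(block: bytes, dif: bytes, lba: int) -> bool:
    """On read, recompute the guard and reference tags and compare."""
    guard, _app, ref = struct.unpack(">HHI", dif)
    return guard == crc16_t10dif(block) and ref == (lba & 0xFFFFFFFF)

block = bytes(512)
dif = make_dif(block, lba=42)
print(check_dif(block, dif, lba=42))                   # True: block is intact
print(check_dif(block[:-1] + b"\x01", dif, lba=42))    # False: corruption caught
```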
The good news about T10-DIF? Almost all 2 Gb Fibre Channel and InfiniBand HBAs, all 4 Gb Fibre Channel HBAs, and some midrange disk arrays support it. Also, there is a newly formed Data Integrity Initiative (Oracle/Emulex/LSI/Seagate) that will try to iron out the rest of the technical issues and ensure all their products interoperate at some level.
The bad news about T10-DIF… it’s not yet supported by SAS or iSCSI, and not supported by any low-cost disk arrays. Also, there’s no independent authority validating that T10-DIF operates properly across products and platforms, and there is no specification classifying different levels of DIF support (a marketing jackbox). Most of these issues are temporary and should be worked out in the next 24 months.
We should see T10-DIF rolling out in most midrange arrays in the next 12 to 24 months. When is it going to reach the lower end of the market? When the three 800-pound gorillas demand it. Is it going to help the 1 TB drive in your PC? Sorry, you’ll have to wait another 10 years.
How do they get their corruption levels so low?
-jealous
The usual problem with approaches like the T10-DIF is that they still don’t catch ‘wild’ or ‘lost’ writes, since they write the validation information along with the data that it validates. Only a mechanism which writes some reasonable form of ‘checksum’ separately from the data that it protects (and then checks it on every subsequent read) can handle these kinds of errors (which admittedly don’t fall into the same category of errors characterized by BERs).
Both ZFS and its older progenitor WAFL do provide this form of separate checksumming (truly end-to-end in ZFS’s case, only end-to-end-within-server in WAFL’s case). It’s interesting that the CERN study reportedly stated that this form of protection doubles the disk write overhead, since reasonable implementations can have far lower average impact than that.
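A toy illustration of that difference – an inline checksum stored with the block stays self-consistent after a lost write, while a separately stored (parent) checksum catches it. Purely illustrative, not ZFS or WAFL code:

```python
# Toy illustration: an inline checksum stored alongside the block survives a
# lost write looking "valid", while a checksum kept separately (e.g. in the
# parent block, as ZFS does) catches it. Not real ZFS or WAFL code.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# A "disk" of blocks, each stored together with its inline checksum.
disk = {7: (b"old contents", checksum(b"old contents"))}

# The file system writes new data, but the drive silently drops the write.
new_data = b"new contents"
parent_checksum = checksum(new_data)   # kept separately from block 7 itself
# ...lost write: disk[7] is never actually updated...

stored, inline_sum = disk[7]
print("inline check passes:", inline_sum == checksum(stored))        # True  -> lost write missed
print("parent check passes:", parent_checksum == checksum(stored))   # False -> lost write caught
```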
– bill
All,
The 10^7 number underscores the urgency of end-to-end data protection. So far, ZFS seems to be the only game in town.
The specific causes of the data corruption at CERN don’t seem that interesting given that they seem to be spread out over the entire I/O system. The data chain is so long and subject to so many kinds of error that errors are pretty much baked into the system. If you weren’t CERN would you even figure out the WD/3ware problem?
This is on top of the fact that RAID 5 no longer protects our large drives from UREs. The whole data protection model is up for a serious re-think.
xfer, thanks for the T10-DIF reference. Hadn’t heard of that, even though I agree with Bill’s comment. Half a loaf is better than none.
Rob, good catch on CERN using the PDF. Honestly though, I save a lot of interesting web pages as PDFs myself and use an iPhoto-like PDF viewer to cruise through them.
Robin
(A third submission that disappeared without a trace:)
As I’ve observed before, calling ZFS “the only game in town” in this respect really isn’t fair to WAFL. WAFL provides the same level of end-to-end validation that ZFS does: the main difference is that by virtue of running only on a file server rather than as a local file system WAFL can’t provide end-to-end protection all the way to client RAM – but it does provide end-to-end protection from server RAM to disk and back again, and the normal network checksums provide protection from client RAM to server RAM and back again, so the validation is scarcely weaker than local ZFS protection.
– bill
Hi Robin, glad you liked the reference. Sorry it took so long to get back on the wire; the investor in a new project backed out at the last minute, so I was caught scrambling to either find a new investor or another consulting contract. I still need to pay the mortgage…
Back onto T10-DIF… The DIF is a block data integrity scheme targeting data transferred between CPU/system memory and the storage medium. DIF wasn’t intended to enforce or detect data integrity outside of the block interface. It also works well for volume-to-volume block copies. It’s not intended to detect missing blocks, i.e., the file system forgot to request or write data and the app reads stale data from the buffer. File-based data integrity checks need to occur at another layer in the system. In this case, one size does not fit all.
ZFS is my new favorite technology of the year. More so than WAFL; no insult to the WAFL advocates. Having worked with different remote data synchronization technologies since the mid ’80s, any “anywhere” file system sends up red flags and flashbacks of long, torturous nights. When the sun is shining, everyone is your friend, but once the storm hits… with marginal interconnects between data stores, live systems that cannot go down, hard SLAs and long resync periods, life quickly becomes interesting. It’s better than nothing at all.
Methinks you don’t understand what lost and wild writes are: they’re not the result of the file system failing to write or writing in the wrong place, they’re the result of the *disk* failing to write or writing in the wrong place (and failing to report any error as a result).
These are indeed data integrity deficiencies in the block interface. A lost write cannot be detected by mechanisms such as DIF, nor can the fact that a wild write did not update the block it was supposed to – but at least when the block that a wild write updated in error is subsequently read the DIF mechanism should catch that.
I’m also curious about precisely what worries you about WAFL – especially since ZFS (which you seem to prefer) also uses a very similar ‘write-anywhere’ approach.
– bill
Well Bill, to be honest with you, I was really hoping you didn’t mean the disk drive. You are correct that drives can write to the wrong spot occasionally; I’ve personally experienced that feature too many times in the last few years. But a well-designed RAID system “should” catch that type of issue. Unfortunately, because of the performance degradation, some RAID systems do not validate the parity on each read or do read-after-write checks. They only validate read data (off the disk) if they have a failed or removed drive in the RAID set. You can ask, “What happens if all the drives fail the same exact way?” Then you have a design flaw, and all the error detection on earth may not help.
ZFS is my current favorite not because of write-anywhere, which I feel is very risky behavior for a file system, but because of the low transaction overhead. It’s one of the things they have gotten right. I personally think file systems should behave very predictably; the less variance, the better the chance for reliability. One major advantage is write allocation coalescing and metadata written anywhere on the disk, but I think metadata writes should be under a little closer control for workload, capacity and performance planning, consistent workarounds for undocumented features, and repeatable operation. I tend to treat application servers more like big dedicated embedded systems, especially for synchronous clustering and remote data synchronization. If you treat a server and its systems like a general-purpose computer, well, then you may as well put your apps on a desktop. I’m not saying that WAFL is junk; it just doesn’t scale as linearly as I like. I have a preference for very linearly scaling servers. It’s a personal preference – no, I don’t have a real favorite for linear scaling either (I get asked that often after being on a soapbox).
The only RAIDs that commonly validate parity on every read are RAID-3 (and IIRC RAID-2) configurations – and not even all of them do. The common RAIDs (1 and 5) virtually never validate the entire stripe on every read, because people who desire that behavior should be using RAID-3 instead (choosing RAID-1, -4, or -5 indicates their preference for improved performance; if they want RAID-3-style validation, RAID-3 will perform better due to its spindle synchronization). Furthermore, even when RAID-3 validates the entire stripe, if it discovers a problem it can’t establish whether the problem is with some portion of the data or with the recorded parity – all it can do is propagate the error up the chain.
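(To make the “detect but can’t locate” point concrete, a toy single-parity check – purely illustrative, not any particular array’s implementation:)

```python
# Toy single-parity (RAID-5 style) stripe check: XOR parity reveals that the
# stripe is inconsistent, but not which member went bad. Illustrative only.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR across a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x11" * 4, b"\x22" * 4, b"\x33" * 4]   # three data members
parity = xor_blocks(data)                        # the parity member

data[1] = b"\x99" * 4                            # silent corruption of one member

print("stripe consistent:", xor_blocks(data) == parity)   # False: mismatch found
# But with a single parity there is no way to tell whether data[0], data[1],
# data[2] or the parity block itself is the bad one; all the array can do is
# report the error up the chain (or rebuild a member it already knows has failed).
```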
Read-after-write validation is occasionally offered as an integrity option but is seldom used – again, for performance reasons. And it can’t correct data trashed by a wild write: it can only ensure that the correct location is also written.
The virtues you claim for ZFS are surpassed in WAFL, which collects updates in its NVRAM and then coalesces them to write back to disk as a single request – and it doesn’t have to do so as often, since the NVRAM stabilizes the data without requiring as frequent ‘syncs’ to disk. Your closing comments above are so vague (and apparently confused) that I’ve really got to wonder whether you know what you’re talking about in this area – but I’m always willing to be educated if I’m missing something, so if you’d like to be more specific feel free to continue.
– bill
The DataDirect Networks S2A9550 uses a form of RAID-4 and validates checksums on the fly while reading and writing. It uses some sort of CRC checksumming and validates the whole data path from the controller to the disk, too; the CRC allows it to use 1 or 2 parity channels.
I don’t know how it actually manages parity errors at the drive level, though.
Read-after-write checking is basically free for NAND flash, because it usually tolerates many more reads than writes.
I’m not sure whether it would read through to the medium in almost all cases, but that’s an interesting thought.
Rocky
Can you elaborate on why you said the IBM V7000 has no CRC or T10-DIF?