I’ve been ranting about data loss on Storage Bits. Data loss makes me irate because I see regular folks who know nothing about computers struggling with the fallout, and it is so unnecessary.
The stimulus was a fine PhD thesis, IRON File Systems (pdf), by Vijayan Prabhakaran, now of Microsoft Labs. It explores how commodity file systems corrupt data: he injected errors into ext3, ReiserFS, JFS, XFS and NTFS and recorded how each responded.
Dr. Prabhakaran built an error-injection framework that let him control exactly what kind of errors the file system would see, so he could document how each FS handled them. The injected faults varied along several dimensions (a rough sketch of the idea follows the list):
- Failure type: read or write? If read: latent sector fault or block corruption? Does the machine crash before or after certain block failures?
- Block type: directory block or superblock? Specific inode or block numbers could also be targeted.
- Transient or permanent fault?
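To make those knobs concrete, here is a minimal user-space sketch of the idea in Python. It is not the thesis’s framework, which sits where the file system actually sees the faults; the class name, parameters, and the raw image-file backing store are my own assumptions.

```python
# Hypothetical sketch of a fault-injection shim between a "file system"
# (the caller) and its backing store. Names, parameters, and the image-file
# backing store are illustrative assumptions, not the thesis's code.
import os

BLOCK_SIZE = 4096

class FaultInjector:
    def __init__(self, image_path, target_blocks, failure_type="read",
                 fault="latent", transient=False):
        self.dev = open(image_path, "r+b")        # raw image standing in for a disk
        self.target_blocks = set(target_blocks)   # block numbers to fail (e.g. a superblock)
        self.failure_type = failure_type          # "read" or "write"
        self.fault = fault                        # "latent" (I/O error) or "corrupt"
        self.transient = transient                # fail once, then behave

    def _should_fail(self, op, block_no):
        if op != self.failure_type or block_no not in self.target_blocks:
            return False
        if self.transient:
            self.target_blocks.discard(block_no)  # transient fault: only fail the first time
        return True

    def read_block(self, block_no):
        if self._should_fail("read", block_no):
            if self.fault == "latent":
                raise IOError(f"injected latent sector fault at block {block_no}")
            return os.urandom(BLOCK_SIZE)         # block corruption: return garbage
        self.dev.seek(block_no * BLOCK_SIZE)
        return self.dev.read(BLOCK_SIZE)

    def write_block(self, block_no, data):
        if self._should_fail("write", block_no):
            raise IOError(f"injected write fault at block {block_no}")
        self.dev.seek(block_no * BLOCK_SIZE)
        self.dev.write(data)
```

The three constructor arguments map onto the three dimensions above: what kind of failure, which blocks, and whether the fault is transient or permanent.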
Sure enough, he found a lot of bugs in the file systems, though NTFS’s proprietary nature kept him from probing it as deeply as the others.
From our analysis results, we find that the technology used by high-end systems (e.g., checksumming, disk scrubbing, and so on) has not filtered down to the realm of commodity file systems. Across all platforms, we find ad hoc failure handling and a great deal of illogical inconsistency in failure policy, often due to the diffusion of failure handling code through the kernel; such inconsistency leads to substantially different detection and recovery strategies under similar fault scenarios, resulting in unpredictable and often undesirable fault-handling strategies.
And
We also discover that most systems implement portions of their failure policy incorrectly; the presence of bugs in the implementations demonstrates the difficulty and complexity of correctly handling certain classes of disk failure. We observe little tolerance to transient failures; most file systems assume a single temporarily-inaccessible block indicates a fatal whole-disk failure. We show that none of the file systems can recover from partial disk failures, due to a lack of in-disk redundancy.
This is what the EMC Centera is running on. Feeling better?
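To make the transient-failure finding concrete: tolerating a temporarily-inaccessible block means retrying it, and only then escalating, rather than treating the first error as a dead disk. A hedged sketch of such a policy; the retry count, backoff, and names are mine, not anything the thesis prescribes.

```python
# Hypothetical sketch of a more forgiving read path: retry a transiently
# failing block a few times before escalating, instead of treating the
# first error as a whole-disk failure. Parameters are illustrative.
import time

def robust_read_block(read_block, block_no, retries=3, backoff=0.05):
    """Retry a block read before concluding the failure is permanent."""
    for attempt in range(retries):
        try:
            return read_block(block_no)
        except IOError:
            time.sleep(backoff * (2 ** attempt))  # exponential backoff between retries
    # Only now escalate, and only for this block: remap it, try a redundant
    # copy if one exists, or report a single-block read error -- not a dead disk.
    raise IOError(f"block {block_no} unreadable after {retries} attempts")
```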
As hardware gets more reliable, software is a bigger problem
Software is always buggy, and thanks to Moore’s Law we have more of it at more levels of the storage stack. File systems need to be the enforcers of data integrity, since only file systems know where every block is and what every block is supposed to contain.
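Block checksumming, one of the high-end techniques the thesis says has not filtered down, is the obvious enforcement mechanism: store a checksum for every block on write and verify it on read, so silent corruption is caught instead of handed to the application. A minimal sketch; for illustration the checksum table lives in memory, where a real file system would keep checksums on disk, apart from the data they cover.

```python
# Hypothetical sketch of per-block checksumming at the file-system layer.
# The in-memory checksum table and the class name are assumptions made
# for illustration only.
import hashlib

class ChecksummingBlockLayer:
    def __init__(self, device):
        self.dev = device        # any object with read_block/write_block
        self.checksums = {}      # block_no -> digest; a real FS stores this on disk

    def write_block(self, block_no, data):
        self.checksums[block_no] = hashlib.sha256(data).digest()
        self.dev.write_block(block_no, data)

    def read_block(self, block_no):
        data = self.dev.read_block(block_no)
        expected = self.checksums.get(block_no)
        if expected is not None and hashlib.sha256(data).digest() != expected:
            # Silent corruption detected: with in-FS redundancy we could repair
            # from a mirror here; without it, at least fail loudly.
            raise IOError(f"checksum mismatch on block {block_no}")
        return data
```

With some in-file-system redundancy, a mirror block or parity, the read path could repair the data instead of just failing loudly, which is exactly the partial-failure recovery the thesis finds missing.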
The marketing problem
From my small-town perch, working with computer naifs, I know that most folks have absolutely no idea if a problem is caused by a lame file system or not. So how do you make people care?
I don’t think you can. People don’t care whether their car has a timing belt or a timing chain until they realize two things: first, it costs money to replace a belt; and second, timing chains don’t require replacement. Most folks will never put the two together.
All the vendor can do is add up all the features, like timing chains, electronic ignitions and platinum-tipped spark plugs, and offer “no tune-ups for 100,000 miles.” People understand that, especially if they remember when a tune-up every 3,000 miles was common.
Sell the benefit, not the technology.
The StorageMojo take
One of the things I love about my other blog is that it exposes me to something closer to consumer thinking. On the one hand there are folks who understand some things about the technology, such as “clean power is good,” but don’t get, say, why a file system should be concerned with disk drive problems. The gap is partly education and partly cognitive.
But I think I also see something else: an emotional need for storage confidence; an unwillingness to confront the idea that storage systems fail. At one level I get it. Paranoia is time-consuming and not very productive.
But unlike CPUs and networks, storage is all about persistence. For all its faults, the industry cares deeply about that. How do we tap into the consumer’s concern for persistence in a way that spurs action rather than denial? I’m hoping Apple is coming up with some good ideas as they prepare to roll out Time Machine and ZFS.
Comments welcome, as always. I didn’t try to evaluate Vijayan’s architectural solution as that is beyond my competence. Somebody want to take a look at it and give us the pros and cons?
Part of the problem is distinguishing detection of errors, correction when those errors occur, and outright fatal failure. In most cases all I care about is that my filesystem is fast and survives a power outage. The disk failures I’ve seen have either been fatal (the disk did not start) or the corruption was so vast and obvious that restoration from backups was the only thing to be done. Bit flips will become more common, but I’d much rather have some background backup system (e.g. shadow volumes/snapshotting to somewhere else) than filesystem-level checksumming and correction. Of course, it’s unlikely I’d detect bit flips in uncompressed data myself, so maybe it’s always been happening… Additionally, not everything on my disk is equally important…
(It’s also telling that this research was on Linux filesystems. The barrier to entry is low, there are several of them, you can change the code, and the payoff is high if your work is useful. Of course my question is: do the failures tested reflect how real disks fail?)
RE: [Storage Bits] “Maybe, someday, Microsoft will start measuring success in terms of software quality instead of market share”
Sounds good. Think about it.
Maybe software quality is in the eye of the beholder?
One beholder’s pleasure is another’s poison?
Maybe market share averages price/performance across the eyes of all beholders?
If the quality of the software falls below the market’s acceptable level, it will disappear. Market history proves this.
Market history also proves that acceptable levels of quality can be abysmally low. Hence many products that should have died an early death live on and make money for vendors who laugh and sing all the way to the bank. Their personal livelihood doesn’t depend on the product or its level of quality, only on its acceptability.
It’s like a valid test for parenting. Won’t happen.
That is a great thesis: a clean idea, a testable hypothesis derived from that idea, and some good follow-on ideas for future work. Unfortunately, creating a reliable filesystem takes a lot of time, careful planning, and effort. (Witness the ReiserFS 3/4 saga.) The working ixt3 code could be a good starting point, but I haven’t been able to google sources for the prototype ixt3 driver. Has anyone seen the prototype IRON FS source posted online?