Today is the last day of FAST '17. Yesterday a couple of hours were devoted to Work-in-Progress (WIP) reports.
WIP reports are kept to 4 minutes and a few slides. One in particular caught my eye.
In "On Fault Resilience of File System Checkers," Om Rameshwar Gatla and Mai Zheng posed an interesting question: how fault resilient are *nix fsck file system checkers?
This really happened
Texas Tech's HPC center had a power failure. Once power was restored, file system checking commenced. But then there was another power failure, which led these New Mexico State grad students to look at whether an interrupted checking process would further damage data integrity.
Why isn’t the answer ever no?
Bad news: yes, fsck interruptus can further corrupt data. Good news: it doesn’t always.
More bad news: a second fsck run probably can't repair the damage the interrupted first run produced.
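For the curious, here is a minimal sketch, in Python, of the kind of fault-injection experiment this implies. To be clear, this is not the authors' actual test harness: the image path and the kill timing are invented, and a real study would inject faults at precise points inside fsck rather than after a fixed delay.

# Simulate a power failure during repair: run e2fsck on a deliberately
# corrupted ext4 image, kill it mid-run, then rerun it and inspect the result.
import signal
import subprocess
import time

IMAGE = "test.img"  # hypothetical, pre-corrupted ext4 image file

# First pass: -f forces a full check, -y auto-answers "yes" to repairs.
p = subprocess.Popen(["e2fsck", "-fy", IMAGE])
time.sleep(0.5)                # crude stand-in for an ill-timed power cut
p.send_signal(signal.SIGKILL)  # no chance to clean up, much like a power loss
p.wait()

# Second, uninterrupted pass: does fsck now report damage it cannot fix?
rc = subprocess.run(["e2fsck", "-fy", IMAGE]).returncode
# e2fsck exit codes: 0 = clean, 1/2 = errors corrected,
# 4 = errors left uncorrected, 8 = operational error.
print("second pass exit code:", rc)

If the second pass exits with 4 (errors left uncorrected) on damage the interruption itself introduced, you've reproduced the failure mode in miniature.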
The StorageMojo take
On my more pessimistic days I sometimes wonder whether we have any uncorrupted data stored anywhere. But yes, our storage infrastructure usually works, so that's something.
This is just one more gotcha to be aware of. I hope Om and Mai can extend this research to pin down the sources of this added corruption and figure out how to make fsck more robust.
Courteous comments welcome, of course.
The biggest problem with fsck nowadays is that volume sizes keep getting larger. Scanning an entire volume can take a long time even when it's clean; if there are errors, it will take much longer to correct them.
This is where copy-on-write (COW) file systems such as ZFS truly shine (no fsck). Yes, the file system does its own internal scrubbing to correct both silent and noisy data corruption. But because new data is always written to new locations, and all operations are atomic, the pointer references to the new data are not updated until everything before them is complete and known to be in a good state. This avoids file system inconsistencies.
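To illustrate that ordering, here is a toy sketch in Python (not ZFS internals; the file name and helper are made up): the new version is written and flushed to a new location first, and only then is the pointer flipped atomically. A crash at any point leaves either the old version or the new one intact, never a half-written mix.

import os

def cow_update(path: str, new_data: bytes) -> None:
    # 1. Write the new version to a fresh location; never overwrite in place.
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(new_data)
        f.flush()
        os.fsync(f.fileno())  # data must be durable before the pointer moves

    # 2. Atomically repoint. rename() is atomic on POSIX file systems,
    #    standing in here for ZFS flipping its block pointers/uberblock.
    os.rename(tmp, path)

cow_update("record.db", b"version 2")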
And yes, I agree: fsck utilities, and the file systems themselves, need to be a bit more robust and should be designed to handle interruptions.
Thanks for reporting our work! Actually, this project also received positive feedback from kernel developers at the Linux Summit '17 (held immediately after FAST '17, in the afternoon of March 2). We would love to keep working on it and publish the results in the near future. Stay tuned. :)
Btw, although we are from New Mexico State University, the power outages that motivated our work actually happened at the Texas Tech HPC center. I guess we didn't make that clear enough during the short presentation. Sorry about that.