Today is the last day of FAST 17. Yesterday a couple of hours were devoted to Work-in-Progress (WIP) reports.
WIP reports are kept to 4 minutes and a few slides. One in particular caught my eye.
In “On Fault Resilience of File System Checkers,” Om Rameshwar Gatla and Mai Zheng posed an interesting question: how fault resilient are *nix fsck file system checkers?
This really happened
Texas Tech’s HPC center had a power failure. Once power was restored, file system checking commenced. But then there was another power failure, which led these grad students to look at whether an interrupted checking process would further damage data integrity.
Why isn’t the answer ever no?
Bad news: yes, fsck interruptus can further corrupt data. Good news: it doesn’t always.
More bad news: running fsck a second time probably can’t fix the damage the interrupted run produced.
The StorageMojo take
On my more pessimistic days I sometimes wonder that we have any uncorrupted data stored anywhere. But yes, our storage infrastructure often works, so that’s something.
This is just one more gotcha to be aware of. I hope Om and Mai can extend this research to better understand the sources of the additional corruption and figure out how to make fsck more robust.
Courteous comments welcome, of course.