Guess what? There were TWO best papers at FAST ’07
I wrote about “Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?” (see Everything You Know About Disks Is Wrong). So it is equal time for the other “best paper”, “TFS: A Transparent File System for Contributory Storage” (pdf), by James Cipar, Mark D. Corner, and Emery D. Berger of the U. Mass. Amherst Computer Science department. This is a great piece of work that I’d love to see people run with.
Think of it as SETI@home for data. A giant network-based data storage system that works in the background, using the disk space you aren’t using to provide a world-wide storage resource. Cool.
OK, file system I get. What’s with Transparent?
Yeah, and contributory too.
Let’s start with contributory. It means a file system that uses disk blocks contributed by people on a network. You’d sign up your system and all the unused disk blocks on your hard drive(s) are contributed to this giant on-line file system, TFS.
“Stop right there!” I hear you say. “I don’t want all my unused disk storage sucked up by a giant data vampire!”
TFS is designed to work with contributory applications, such as folding@home, the protein folding app. You volunteer your machine and the app works in the background. These apps already understand that machine cycles can go away at any time, so they correctly handle lost files too. I believe TFS could be adapted to other applications as well – more on that later.
Performance and capacity
Using free disk space on nodes across a network is not a new idea. The problem is getting enough storage to make it worthwhile, while not irritating users. One idea is simply to take over a fixed percentage of the node’s disk. In effect that just adds another big file to the disk, and what if the user suddenly needs that space? Also, it limits the capacity contribution to far less than it could be with a more dynamic allocation.
Another idea is to watermark the contributed capacity, so when the node wants that capacity it has to delete the data first, setting off data replication on another node. This results in higher capacity contributions and much greater overhead – and a big performance hit as each write becomes a delete-first then write. Also, watermarking reduces a disk’s contiguous free space, hastening fragmentation even before the contributed storage needs overwriting.
Here’s where the Transparent comes in
TFS uses free space on your drive, true. It also allows the host system to overwrite any TFS blocks at any time for any reason. TFS is “transparent” to the host operating system: the OS doesn’t know or need to care about any TFS occupied blocks. As a result TFS has little impact on host system performance. If the host needs a big block of contiguous space, it can take it without worrying about TFS.
TFS requires five different block states (see state diagram) to do its magic. It takes more work and larger indices to keep track of the states, which seems a small price to pay for good host performance and more contributed disk space.
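To make the "host can overwrite anything" idea concrete, here is a toy sketch. It is not the paper's design: it uses three made-up states instead of TFS's five, and every name in it (`BlockState`, `ToyDisk`, `contribute`, `host_write`, `read_contributed`) is my own invention for illustration. All it shows is the core bargain: host writes never wait, and the contributed side finds out afterwards that its data is gone.

```python
from enum import Enum


class BlockState(Enum):
    # Hypothetical labels for this toy only; the paper defines five states.
    FREE = "free"                # nobody is using the block
    HOST = "host"                # holds the host's own file data
    CONTRIBUTED = "contributed"  # holds contributed data the host may clobber


class ToyDisk:
    """Toy model: the host treats contributed blocks exactly like free ones."""

    def __init__(self, n_blocks: int) -> None:
        self.state = [BlockState.FREE] * n_blocks
        self.data = [b""] * n_blocks
        self.lost = set()  # contributed blocks the host has overwritten

    def contribute(self, block: int, payload: bytes) -> bool:
        """Store contributed data, but only in a block nobody is using."""
        if self.state[block] is not BlockState.FREE:
            return False
        self.state[block] = BlockState.CONTRIBUTED
        self.data[block] = payload
        return True

    def host_write(self, block: int, payload: bytes) -> None:
        """Host writes always succeed at once, with no delete-first step."""
        if self.state[block] is BlockState.CONTRIBUTED:
            self.lost.add(block)  # record the loss so it can be re-replicated
        self.state[block] = BlockState.HOST
        self.data[block] = payload

    def read_contributed(self, block: int):
        """Return contributed data, or None if the host has taken the block."""
        if self.state[block] is BlockState.CONTRIBUTED:
            return self.data[block]
        return None
```

In this toy, calling `host_write` on a contributed block just works, and a later `read_contributed` on that block returns `None`. A real system would then have to fetch a replica from another node, which is where the replication discussion further down comes in.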
TFS does write its metadata as non-transparent files on the contributing system. Losing metadata is costly for TFS, and the capacity it uses is minimal for the host. Another, possibly more costly, trade-off is that TFS does not allow the overwriting of open TFS files. If TFS files are large or many are typically kept open, that might be a problem. Intelligent application design should eliminate either problem.
Storage capacity, bandwidth and reliability
All storage is flaky so all storage relies on replicating content to assure availability. The flakier the storage the more replication is required. In TFS storage flakiness has more sources than usual:
- TFS data can be overwritten
- The disk or node can fail
- The network can fail
The authors analyze how much replication is required to achieve 0.99999 – five nines – availability. The issue basically boils down to how reliable each of the elements is, with the added fillip that a lot of block, disk or node churn will impact network bandwidth requirements.
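For a feel of how the numbers move, here is a back-of-the-envelope sketch. It is not the authors' analysis: it assumes the three failure sources listed above are independent, that replicas fail independently of each other, and it ignores churn and repair bandwidth entirely. The function names and the example probabilities are mine, picked only for illustration.

```python
def copy_availability(p_not_overwritten: float,
                      p_node_up: float,
                      p_network_up: float) -> float:
    """Chance that one replica is readable, assuming independent failures."""
    return p_not_overwritten * p_node_up * p_network_up


def replicas_needed(p_copy: float, target: float = 0.99999) -> int:
    """Smallest n such that 1 - (1 - p_copy)**n >= target."""
    if not 0.0 < p_copy < 1.0:
        raise ValueError("p_copy must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    allowed_miss = 1.0 - target   # tolerated chance that every copy is gone
    miss = 1.0 - p_copy           # chance that a single copy is unavailable
    n = 1
    all_missing = miss
    while all_missing > allowed_miss:
        n += 1
        all_missing *= miss       # one more independent replica
    return n


if __name__ == "__main__":
    for p in (0.5, 0.8, 0.99):
        print(f"per-copy availability {p}: {replicas_needed(p)} replicas")
```

Under these assumptions, a copy that is reachable only half the time needs 17 replicas for five nines, while one that is reachable 99% of the time needs 3. That gap is the whole argument for stable nodes.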
Net net, TFS is well-suited to highly available private networks, such as campus-wide corporate nets. It also works well with stable groups of users. Trying to mine the laptop space of a group of road warriors, always up and down on the net, is TFS hell. Good node availability reduces network bandwidth requirements and increases the contributed storage. The flakier the nodes the more bandwidth is required for replication. Node and network stability helps TFS do its best work.
How well does it work?
The authors ran a number of tests on a small number of systems. Overall, they found it performed well. They tested four Linux systems with the Ext2 file system at varying levels of capacity contribution (0%, 5%, 35%) against a TFS system at 100% contribution. They found the TFS system performed most like the Ext2 system with 0% contribution. Furthermore, TFS provided 40% more capacity than the other contribution systems it was compared with. The figure gives the gory details.
The StorageMojo take
The TFS prototypes are implemented in the kernel, so this won’t be available to the average Linux user soon. Yet as the world’s broadband build-out continues and powerful computer systems propagate, there will be huge amounts of storage capacity available for contribution.
Once TFS or something like it has kernel support, the problem shifts to engaging large numbers of stable people to contribute. Social networking will solve that problem.
Finally, I wonder if TFS could be combined with Cleversafe’s open-source distributed storage system. Cleversafe enables public networks to store private data securely. If the two could be combined it would be possible to securely store private data on contributed storage. A group of musicians or artists could create a secure private storage system for their works in progress without having to worry about offsite backup.
Just a thought.
Comments welcome, especially if you have another idea. Moderation turned on to encourage you to find your own cheap car insurance.
Diagrams extracted from the paper along with everything else I know about TFS.
Professor Corner wrote and said that you can download the prototype source code right here. Let me know what you think if you look at it.