FAB: the software
This is the good part.
Each brick runs three pieces of software:
- The Coordinator that receives client requests and manages reads and writes
- The Block-manager that does the actual reading and writing of – you guessed it! – disk blocks
- And the Configuration manager that handles administrative changes
Taking it from the top
The Coordinator acts like a disk array controller from a client’s view. Since the Coordinator sits on every brick, any brick can function as an array controller for any client and any Coordinator can access any block. FAB really is a fully distributed system.
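To make that concrete, here is a minimal sketch of the idea that any brick can coordinate any request. The names and layout-table format are invented for illustration, not FAB's actual API; the point is that every brick's Coordinator holds the same segment-to-bricks map, so a client can talk to any of them.

```python
# Illustrative sketch: every brick runs a Coordinator, and each Coordinator
# has a copy of the layout map, so any brick can locate any block's replicas.
# The LAYOUT table and names are hypothetical, not FAB's real data structures.

LAYOUT = {
    # logical segment id -> bricks holding its replicas
    0: ["brick-a", "brick-b", "brick-c"],
    1: ["brick-b", "brick-d", "brick-e"],
}

# 8 GB segments made of 1 KB blocks, per the defaults described below
BLOCKS_PER_SEGMENT = 8 * 1024 ** 3 // 1024

class Coordinator:
    def __init__(self, brick_id):
        self.brick_id = brick_id  # which brick we run on; any brick will do

    def replicas_for(self, block_addr):
        segment = block_addr // BLOCKS_PER_SEGMENT
        return LAYOUT[segment]
```

Two Coordinators on different bricks resolve the same block to the same replica set, which is what lets any brick stand in as the array controller.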
Since it acts as an array controller, the Coordinator must manage data replication, and it does. Like Google's GFS, FAB does simple replication of blocks. No RAID 5. They did the math and calculated that 3-way replication would give them a Mean Time To Data Loss of 1.3 million years and a mean unavailability of 1 second per year. All you enterprise guys: suck it up. No whining.
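For the curious, here is a back-of-envelope version of that kind of reliability math, using the textbook Markov-model approximation for n-way mirroring. The disk MTTF and repair-time figures below are my own illustrative guesses, not the parameters HP used, so don't expect to reproduce the paper's 1.3 million years.

```python
# Back-of-envelope MTTDL for n-way replication using the standard
# approximation MTTDL ~ MTTF^n / (n! * MTTR^(n-1)).
# The input numbers are illustrative, not the paper's actual parameters.
from math import factorial

def mttdl_years(mttf_hours, mttr_hours, copies):
    hours = mttf_hours ** copies / (factorial(copies) * mttr_hours ** (copies - 1))
    return hours / (24 * 365)

# 3-way replication, 100,000-hour disks, 24-hour rebuilds
three_way = mttdl_years(100_000, 24, 3)
```

The striking part is how fast MTTDL grows with each extra copy: each replica multiplies the numerator by another MTTF while the denominator only picks up another MTTR.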
Unlike Google . . .
FAB is fully distributed, so there are no masters providing needed services. Personally, I find the FAB architecture more elegant, but distributed architectures do introduce some problems. Like: what happens when a write to all the replicas doesn't complete? How do all those Coordinators on all those bricks find out without slow-to-a-crawl network overhead? And what happens when the Coordinator itself fails, causing the incomplete write? Excellent questions, grasshopper. The Mojo is strong in you.
Each of the three (default) replicas carries two timestamps. The first is the timestamp of the currently stored data. The second logs the time of the newest ongoing write request. So when a new Coordinator looks at a set of replicas that may have an incomplete update, it compares the two. If the stored data's timestamp is at least as new as the newest write request, the (not-the-original-writer) Coordinator knows the data is OK. If it isn't, the rookie Coordinator checks all three (or however many) replicas, chooses the one with the newest timestamp, sends that data to the client and updates the remaining replicas. Since timestamps are normally cached in brick RAM, the check takes far less time than the disk read it overlaps with, so performance is good.
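A simplified sketch of that two-timestamp read check, with integer timestamps and in-memory replicas standing in for the real thing (field and function names are mine, not FAB's):

```python
# Simplified sketch of the two-timestamp read check. Each replica keeps the
# timestamp of its stored data and the timestamp of the newest write request
# it has seen. Names here are illustrative, not FAB's actual structures.
from dataclasses import dataclass

@dataclass
class Replica:
    data: bytes
    data_ts: int   # when the currently stored data was written
    write_ts: int  # newest (possibly still incomplete) write request seen

def coordinator_read(replicas):
    first = replicas[0]  # in FAB this would be the least-busy replica
    # Fast path: the stored data is at least as new as the newest write
    # request, so no update is incomplete on this replica.
    if first.data_ts >= first.write_ts:
        return first.data
    # Slow path: an update didn't finish. Check every replica, take the
    # copy with the newest data, and repair the stragglers.
    winner = max(replicas, key=lambda r: r.data_ts)
    for r in replicas:
        r.data, r.data_ts = winner.data, winner.data_ts
        r.write_ts = winner.data_ts  # the incomplete write is now resolved
    return winner.data
```

The fast path is just a comparison of two cached integers, which is why the check can hide behind the disk read it overlaps with.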
There are other wrinkles to timestamp management considered in the paper that I don’t consider here. Read the paper if you want to know more.
Tote that bale and balance that load
FAB replicates segments – by default 8 GB collections of 1 KB blocks – randomly across all the bricks. Reads go to the least-busy replica first, and if traffic spikes, a hot segment may be replicated across dozens of bricks if need be. Again, not unlike what the Google File System does.
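The placement and balancing behavior can be sketched in a few lines. This is a toy under my own assumptions – random scatter, queue depth as the "busy" metric, and a simple widen-when-hot rule – not FAB's actual policy code.

```python
# Toy sketch of FAB-style placement and load balancing: scatter a segment's
# replicas randomly across the bricks, read from the least-busy copy, and
# widen a segment's replica set when every copy is swamped. All names and
# thresholds here are invented for illustration.
import random

def place_segment(bricks, copies=3):
    # Random scatter of a segment's replicas across all bricks
    return random.sample(bricks, copies)

def choose_read_brick(replica_bricks, queue_depth):
    # Least-busy replica is read first
    return min(replica_bricks, key=lambda b: queue_depth[b])

def maybe_widen(replica_bricks, all_bricks, queue_depth, hot_threshold=32):
    # If every existing replica is hot, replicate to one more brick
    if all(queue_depth[b] >= hot_threshold for b in replica_bricks):
        spare = [b for b in all_bricks if b not in replica_bricks]
        if spare:
            replica_bricks.append(min(spare, key=lambda b: queue_depth[b]))
    return replica_bricks
```

Under sustained load the widen rule keeps firing, which is how a hot segment ends up spread across dozens of bricks.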
Enterprise Storage On a Shoestring, Pt. III coming Real Soon Now. In the meantime, comments welcome.