How well does it work?
The paper discusses performance testing on the Boxwood prototype, a cluster of eight machines on a Gigabit Ethernet switch. Each machine housed a 2.4 GHz Xeon, 1 GB of RAM, dual SCSI ports and five 15K RPM SCSI drives. Not bad for three years ago.
The holy grail of cluster technology is scalability. It doesn’t take long for even small scaling shortfalls to make adding nodes futile. For example, 90% scalability – where each additional node achieves 90% of the performance of the previous node – sounds pretty good. Until you do the math. 0.9^8 means that the 8th node has about 43% of the performance of the first node. At 12 nodes, the 12th node has only 28% of the performance. At 24 nodes, the 24th node adds only 8% of the first node’s mojo to the cluster. Your Internet Data Center won’t get far with only 90% scalability.
Even 99% scalability fails pretty quickly at internet scale. Node 100 gets you only 37% of the first node’s performance – the same as the 1000th node with 99.9% scalability. Given that Google has clusters with over 8000 nodes, you can see the importance of linear scalability – or as close as you can get.
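If you want to check the arithmetic yourself, here is a minimal sketch. The function name is made up for illustration, and it follows the convention used in the figures above, where node n contributes efficiency^n. A stricter reading that counts the first node at 100% would use efficiency^(n-1) and give slightly higher numbers, but the conclusion is the same.

```python
def nth_node_contribution(efficiency: float, n: int) -> float:
    """Relative contribution of node n under the efficiency**n convention.

    efficiency: per-node scaling factor, e.g. 0.90 for "90% scalability".
    n: 1-based position of the node in the cluster.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return efficiency ** n


# The cases discussed in the text.
cases = [(0.90, 8), (0.90, 12), (0.90, 24), (0.99, 100), (0.999, 1000)]

for efficiency, n in cases:
    share = nth_node_contribution(efficiency, n)
    print(f"{efficiency:.1%} scalability, node {n}: {share:.0%} of the first node")
```

The takeaway is that the decay is geometric: any efficiency below 100% eventually drives the marginal node toward zero, and the only question is how many nodes it takes.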
The Boxwood developers are at pains to demonstrate linear scaling with their tiny 8-node prototype cluster. Just so the suspense doesn’t kill you, they do a credible job.
First rule: if scaling isn’t required, don’t sweat it
A crucial detail of Boxwood’s replicated logical device, or RLDev, is that while all RLDevs could theoretically be served from a single cluster of systems, the Boxwood team realized that it made more sense to have many RLDev servers. By spreading the RLDev load across many small clusters, they eliminate the scalability issue for the all-important disk service. Nonetheless, for the sake of completeness, they did test the RLDev server up to eight nodes and found that:
At small packet sizes, the throughput is limited by disk latency. For large packet sizes, we get performance close to the RPC system imposed limit. In all cases, we observe good scaling.
In this mode RLDev works like a charm.
Second rule: see rule #1
Just above the RLDev is the chunk manager. While the RLDev only performs reads and writes, the chunk manager also allocates and deallocates chunks for applications. For reading and writing through the chunk manager, there is only a bit of local address translation before the RLDev layer does its magic.
It is the allocation and deallocation work that takes cycles and requires interaction with the Paxos service to communicate chunk changes and each chunk’s unique identifier. Allocation is the expensive piece: a single allocation needs a minimum of three writes, which across IP takes a huge amount of time – 24 ms – and makes disks look fast. What the team found, though, is that batching the allocations, much like batching RAID 5 writes, dramatically cuts the time per allocation to as little as 1 ms, even when you ask for a thousand allocations at a time. So far, so good.
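To see why batching helps so much, here is a toy amortization model. It is a sketch and not the paper’s cost breakdown: the function and its two parameters are hypothetical, and the default values are chosen only so that the endpoints echo the 24 ms and 1 ms figures quoted above.

```python
def amortized_allocation_ms(batch_size: int,
                            fixed_ms: float = 23.0,
                            per_item_ms: float = 1.0) -> float:
    """Per-allocation cost when a batch pays the fixed overhead once.

    fixed_ms and per_item_ms are illustrative assumptions, not measured
    Boxwood numbers: fixed_ms stands in for the round trips and writes a
    batch pays once, per_item_ms for the work each allocation still needs.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_ms / batch_size + per_item_ms


for batch_size in (1, 10, 100, 1000):
    cost = amortized_allocation_ms(batch_size)
    print(f"batch of {batch_size:>4}: {cost:.3f} ms per allocation")
```

The shape is what matters here: the fixed cost is divided across the batch, so the per-allocation time falls quickly and then flattens out at the per-item cost.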
Chopping down the B-tree
The one area that the team wasn’t comfortable with was the performance of the B-tree itself. They didn’t come out and say “B-tree performance stinks”. They are scientists, after all. What they did say is:
In general, the performance of B-trees is dependent on several parameters, which include the branching factor, the distribution of keys, and the size of the keys and data. We have not yet done an exhaustive characterization of our B-tree module at various parameter values. The particular trees we measure have 10-byte keys and data and each tree node has a branching factor of 6.
This is the performance red flag with Boxwood, but I don’t think it is fatal. Call me an optimist, but I agree that B-trees are highly tunable and that there are a number of optimizations available that the Boxwood team did not test in time for the paper.
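One reason the branching factor matters so much is tree height. The sketch below is a back-of-the-envelope estimate that assumes a fully packed tree where every level multiplies the number of reachable keys by the branching factor. The one-million-key count and the comparison branching factors of 64 and 256 are hypothetical, and nothing here models Boxwood’s actual B-tree module.

```python
def min_tree_height(num_keys: int, branching_factor: int) -> int:
    """Smallest number of levels h with branching_factor**h >= num_keys.

    Uses integer arithmetic to avoid floating-point log rounding.
    """
    if num_keys < 1:
        raise ValueError("num_keys must be at least 1")
    if branching_factor < 2:
        raise ValueError("branching_factor must be at least 2")
    height, capacity = 0, 1
    while capacity < num_keys:
        capacity *= branching_factor
        height += 1
    return height


num_keys = 1_000_000  # hypothetical workload size
for branching_factor in (6, 64, 256):
    levels = min_tree_height(num_keys, branching_factor)
    print(f"branching factor {branching_factor:>3}: at least {levels} levels")
```

Every extra level is potentially another node fetch on each lookup, so a branching factor of 6 leaves a lot of room for tuning.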
The proof is in the pudding
The Boxwood team’s final test is an actual application: an NFSv2 server. They made a few changes to the NFS spec that they felt did not compromise the usability of the server, but I’m not competent to judge that. What they did find is that the BoxFS server performed very well despite the B-tree concerns:
|Test||BoxFS (sec)||NFS (sec)|
|Create 155 files and 62 dirs||0.7||1.0|
|1000 Chmods and stats||1.5||8.4|
|Write a 1 MB file 10 times||3.7||10.8|
|200 Renames and links on 10 files||0.9||1.9|
This is a subset of the tests they performed – if you want to dig into the details please read the paper. Yet the general thrust is clear: in this application, Boxwood obtained very good performance despite little B-tree tuning.
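For a quick sense of scale, the ratios implied by the four rows in the table can be computed directly. This is just arithmetic on the published times; the `speedup` helper is a name introduced here for illustration.

```python
# (test name, BoxFS seconds, NFS seconds), copied from the table above.
results = [
    ("Create 155 files and 62 dirs", 0.7, 1.0),
    ("1000 Chmods and stats", 1.5, 8.4),
    ("Write a 1 MB file 10 times", 3.7, 10.8),
    ("200 Renames and links on 10 files", 0.9, 1.9),
]


def speedup(boxfs_sec: float, nfs_sec: float) -> float:
    """How many times faster BoxFS finished than NFS on the same test."""
    return nfs_sec / boxfs_sec


for test, boxfs_sec, nfs_sec in results:
    print(f"{test}: {speedup(boxfs_sec, nfs_sec):.1f}x")
```

BoxFS comes out ahead on all four rows, with the largest gap on the chmod and stat test.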
Tomorrow: the StorageMojo.com take on Boxwood
Comments welcome, as usual.