So I sent a note off to the nice folks at HP who wrote this paper, alerting them to their elevation to storage rock stars, and one of them wrote me back to alert me to a newer version of the paper. The newer version is less sprightly yet has a lot more information, so check it out.
A Federated Array of Bricks (FAB) is designed to be an enterprise-class, fully redundant and low-cost block storage system. FAB is a fully distributed system: all bricks run the same software; there are no “masters”; quorums are determined dynamically through an innovative majority-voting algorithm. A client can issue I/O’s to multiple bricks concurrently to improve performance.
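To make the majority-voting idea concrete, here is a minimal sketch of a generic majority quorum check. It is not HP's algorithm; the function names and brick labels are hypothetical. It only illustrates why a majority works without a master: any two majorities of the same replica set share at least one brick, so a later operation always meets a brick that saw the earlier one.

```python
from itertools import combinations

def has_quorum(acks, replicas):
    """True if acknowledgements came from a strict majority of the replicas.

    `acks` and `replicas` are collections of brick names (hypothetical labels).
    """
    replicas = set(replicas)
    return len(set(acks) & replicas) > len(replicas) // 2

def majorities_always_intersect(replicas):
    """Brute-force check that every pair of majority subsets shares a brick."""
    replicas = sorted(set(replicas))
    smallest_majority = len(replicas) // 2 + 1
    majorities = [
        set(combo)
        for size in range(smallest_majority, len(replicas) + 1)
        for combo in combinations(replicas, size)
    ]
    return all(a & b for a, b in combinations(majorities, 2))

if __name__ == "__main__":
    bricks = ["brick-a", "brick-b", "brick-c"]
    print(has_quorum(["brick-a", "brick-c"], bricks))
    print(has_quorum(["brick-b"], bricks))
    print(majorities_always_intersect(bricks))
```

With three copies of the data, two acknowledging bricks are enough to proceed, so a single brick failure does not block I/O.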
FAB differs from the Google File System in a couple of interesting ways. First, there are no masters providing services: FAB distributes those services across all the bricks, so all services *should* scale as the number of bricks grows. Second, HP’s idea of a brick is heavy on the storage side: 12 SATA drives and 1 GB of NVRAM, running Linux, which says to me that low-power 2.5″ drives will find a home in this brick pretty fast. Like GFS, FAB uses commodity products and smart software to achieve enterprise-class (10,000+ year MTTDL) data availability. By default, FAB also maintains three copies of all data, and it still costs way less than enterprise storage arrays.
Enterprise storage arrays benefit from 15 years of performance engineering, an advantage not easily overcome by a few PhDs in a lab. That is just one reason why big-iron storage arrays will be with us for years to come. Yet we don’t all need the highest performance. In fact, as data gets cooler, a lower and lower percentage of capacity will require high performance, which suggests that low-cost, high-availability storage has a very promising future.
The following benchmark consists of three steps: “untar” 177 MB of Linux 2.6.1 source code to an external file system (a bulk write); tar the files back to the local file system (a bulk read); and finally, compile the files on the target file system (a mix of reads, writes and computation). To eliminate cache effects, the target volume was unmounted after each step and the unmount time included in the results. The three result columns follow the order of the steps. The HP guys don’t actually say what the unit is, but I’ll assume seconds unless someone has a better idea.
|Configuration|Untar|Tar|Compile|
|---|---|---|---|
|Local RAID 1|22.32|14.64|319.2|
|iSCSI + raw disk|24.21|24.32|323.9|
|FAB 3way repl.|21.57|24.61|316.0|
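To put the three rows above in perspective, here is a small sketch that expresses each remote configuration as a percentage difference from local RAID 1. The numbers are copied from the table. The step names (untar, tar, compile) assume the columns follow the order in which the benchmark steps were described, and the unit is assumed to be seconds.

```python
# Times copied from the table above; unit assumed to be seconds.
# Column order (untar, tar, compile) is assumed from the benchmark description.
RESULTS = {
    "Local RAID 1": {"untar": 22.32, "tar": 14.64, "compile": 319.2},
    "iSCSI + raw disk": {"untar": 24.21, "tar": 24.32, "compile": 323.9},
    "FAB 3way repl.": {"untar": 21.57, "tar": 24.61, "compile": 316.0},
}

def percent_vs_baseline(results, baseline="Local RAID 1"):
    """Return, per configuration and step, the percent change in elapsed
    time relative to the baseline (positive means slower than baseline)."""
    base = results[baseline]
    return {
        name: {
            step: round(100.0 * (elapsed - base[step]) / base[step], 1)
            for step, elapsed in times.items()
        }
        for name, times in results.items()
        if name != baseline
    }

if __name__ == "__main__":
    for name, deltas in percent_vs_baseline(RESULTS).items():
        print(name, deltas)
```

On these figures, FAB is within a few percent of local RAID 1 on the bulk write and the compile, while the bulk read takes roughly two-thirds longer for both networked configurations.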
If clusters don’t scale, people don’t brag on them. Sure enough, the FAB team concludes:
Overall, as expected, FAB’s throughput scales linearly with the cluster size. The exception is 64 KB random reads, which hit a ceiling due to the capacity limits of our Ethernet switches.
They also tested FAB’s distributed replication protocol against a master/slave replication protocol. Performance was similar for both, which suggests to me that in this case at least, implementation trumps architecture.
One of the irritations of conventional dual active/active RAID controllers is that failover can take a minute or more, possibly causing applications to time out, and just generally slowing things down. Since FAB is distributed, and any brick can service any client with any I/O, one would hope to see much less disruption when a FAB failure occurs. And one does.
This is a worst-case scenario, where a brick fails and five minutes later is declared dead, so its segment groups get re-balanced across remaining bricks.
The actual data movement takes some time, though far less than rebuilding after a disk failure in a similarly sized RAID 5 set. Disruption to the FAB is limited: almost no impact on reads and about a 20% impact on writes.
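The failure scenario above can be sketched in a few lines. This is only an illustration under stated assumptions, not FAB's actual protocol: the data layout (a dict mapping each segment group to the bricks holding its replicas), the function names, and the least-loaded placement rule are all hypothetical. Only the five-minute timeout comes from the scenario described.

```python
DEAD_AFTER_SECONDS = 5 * 60  # five-minute timeout from the scenario above

def is_dead(last_heard, now, timeout=DEAD_AFTER_SECONDS):
    """A brick is declared dead once it has been silent for `timeout` seconds."""
    return now - last_heard >= timeout

def rebalance(assignment, dead_brick):
    """Replace `dead_brick` in every segment group that used it.

    `assignment` maps segment group -> list of bricks holding a replica.
    Each affected group gets the least-loaded surviving brick that does not
    already hold one of its replicas (a hypothetical placement rule).
    """
    survivors = {b for bricks in assignment.values() for b in bricks}
    survivors.discard(dead_brick)
    load = {b: 0 for b in survivors}
    for bricks in assignment.values():
        for b in bricks:
            if b in load:
                load[b] += 1

    new_assignment = {}
    for group, bricks in sorted(assignment.items()):
        if dead_brick not in bricks:
            new_assignment[group] = list(bricks)
            continue
        candidates = sorted(
            (b for b in survivors if b not in bricks),
            key=lambda b: (load[b], b),
        )
        if not candidates:
            # No spare brick: the group runs with one fewer replica.
            new_assignment[group] = [b for b in bricks if b != dead_brick]
            continue
        target = candidates[0]
        load[target] += 1
        new_assignment[group] = [target if b == dead_brick else b for b in bricks]
    return new_assignment

if __name__ == "__main__":
    layout = {"g1": ["a", "b", "c"], "g2": ["a", "d", "e"], "g3": ["b", "c", "d"]}
    if is_dead(last_heard=0, now=300):
        print(rebalance(layout, "a"))
```

Because the replacement replicas are spread over several surviving bricks, the copy work is shared rather than funneled through a single spare, which fits the limited disruption reported above.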
The StorageMojo take
FAB demonstrates, again, that highly available, stable and well-performing storage can be built out of commodity hardware with the right software. Microsoft, Google, Amazon, Cnet and HP have demonstrated it. Moving from lab to product is non-trivial, yet the customer economic advantages are huge. Someone is going to do it and, I predict, turn the storage industry upside down.
Comments welcome, as always. Moderation turned on to control a growing deluge of comment spam.
Update: Wes pointed out that I didn’t include units for the performance benchmark, so I went back and added a paragraph explaining what the benchmark was, along with my guess as to what the units are. Thanks, Wes.