So I sent a note off to the nice folks at HP who wrote this paper, alerting them to their elevation to storage rock stars, and one of them wrote me back to alert me to a newer version of the paper. The newer version is less sprightly yet has a lot more information, so check it out.
The recap
A Federated Array of Bricks (FAB) is designed to be an enterprise-class, fully redundant and low-cost block storage system. FAB is a fully distributed system: all bricks run the same software; there are no “masters”; quorums are determined dynamically through an innovative majority-voting algorithm. A client can issue I/O’s to multiple bricks concurrently to improve performance.
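For the curious, here is roughly what majority-voting replication looks like. This is a toy Python sketch built on my own assumptions (timestamped versions, a fixed replica set) and leaves out recovery and write-back; it is not FAB's actual protocol, which the paper covers in far more detail.

```python
# Toy sketch of majority-voting replication -- my simplification, not FAB's
# actual protocol. Replicas keep timestamped versions; a write counts once a
# majority accepts it, and a read returns the newest version a majority has seen.

class Brick:
    def __init__(self):
        self.store = {}                      # block id -> (timestamp, data)

    def write(self, block, ts, data):
        old_ts, _ = self.store.get(block, (0, None))
        if ts > old_ts:                      # keep only the newest version
            self.store[block] = (ts, data)
        return True                          # ack

    def read(self, block):
        return self.store.get(block, (0, None))

def quorum_write(replicas, block, ts, data):
    acks = sum(1 for b in replicas if b.write(block, ts, data))
    return acks > len(replicas) // 2         # succeeded on a majority?

def quorum_read(replicas, block):
    majority = replicas[: len(replicas) // 2 + 1]
    replies = [b.read(block) for b in majority]
    return max(replies, key=lambda r: r[0])  # newest timestamp wins

bricks = [Brick() for _ in range(3)]
quorum_write(bricks, block=7, ts=1, data=b"hello")
print(quorum_read(bricks, 7))                # -> (1, b'hello')
```

The nice property: any majority of bricks can serve the request, so there is no single master to lose or to bottleneck on.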
FAB differs from the Google File System in a couple of interesting ways. First, there are no masters providing services – the FAB distributes those services across all the bricks – so all services *should* scale as the number of bricks grows. Second, HP’s idea of a brick is heavy on the storage side: 12 SATA drives and 1 GB of NVRAM running Linux, which says to me that low-power 2.5″ drives will find a home in this brick pretty fast. Like GFS, FAB uses commodity products to achieve enterprise-class (10,000+ year MTTDL) data availability using smart software. Also, by default, FAB maintains three copies of all data. And still costs way less than enterprise storage arrays.
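To give a feel for how three copies gets you into 10,000+ year MTTDL territory, here is the back-of-envelope version using the usual Markov approximation for a group that survives two failures. The MTTF and repair-time inputs are my own illustrative guesses, not numbers from the HP paper.

```python
# Back-of-envelope MTTDL for one 3-way replicated group, using the usual
# approximation MTTDL ~= MTTF^3 / (3 * 2 * MTTR^2). The inputs below are
# illustrative guesses, not figures from the FAB paper.

mttf_hours = 100_000        # assumed mean time to failure of one replica
mttr_hours = 24             # assumed time to re-replicate onto a healthy brick

mttdl_hours = mttf_hours ** 3 / (3 * 2 * mttr_hours ** 2)
print(f"MTTDL ~ {mttdl_hours / (24 * 365):,.0f} years")   # ~ 33 million years
```

Per-group numbers look absurdly good; dividing by the thousands of segment groups in a real system and factoring in correlated failures (power, software, cabling) is what pulls the system-wide figure back toward the 10,000-year range.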
FAB Performance
Enterprise storage arrays benefit from 15 years of performance engineering, an advantage not easily overcome by a few PhDs in a lab. Which is just one reason why big-iron storage arrays will be with us for years to come. Yet we don’t all need the highest performance. In fact, as data gets cooler, a lower and lower percentage of capacity will require high performance, which suggests that low-cost, high-availability storage has a very promising future.
The following benchmark consists of: “untar” 177 MB of Linux 2.6.1 source code to an external file system – a bulk write; tar the files back to the local file system – a bulk read; and finally, compile the files on the target file system – a mix of R/W and computes. To eliminate cache effects the target volume was unmounted after each step and the unmount time was included in the results. The HP guys don’t actually say what the unit is, but I’ll assume seconds unless someone has a better idea; there’s a quick arithmetic check after the table.
Configuration | Untar | Tar | Compile |
Local Disk | 21.76 | 14.80 | 318.9 |
Local RAID 1 | 22.32 | 14.64 | 319.2 |
iSCSI + raw disk | 24.21 | 24.32 | 323.9 |
FAB 3way repl. | 21.57 | 24.61 | 316.0 |
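As a sanity check on the “seconds” guess, here is the arithmetic (mine, not the paper’s):

```python
# Quick arithmetic check, assuming the table's numbers are seconds.
source_mb = 177                       # size of the Linux 2.6.1 source tree
untar_s, compile_s = 21.57, 316.0     # FAB 3-way replication row above

print(f"untar throughput ~ {source_mb / untar_s:.1f} MB/s")   # ~8.2 MB/s
print(f"compile time     ~ {compile_s / 60:.1f} minutes")     # ~5.3 minutes
```

Roughly 8 MB/s for an untar of thousands of small files and a five-minute kernel compile both pass the smell test, so seconds it is – probably.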
Scalability
If clusters don’t scale, people don’t brag on them. Sure enough, the FAB team concludes:
Overall, as expected, FAB’s throughput scales linearly with the cluster size. The exception is 64 KB random reads, which hit a ceiling due to the capacity limits of our Ethernet switches.
They also tested FAB’s distributed replication protocol against a master/slave replication protocol. Performance was similar for both, which suggests to me that in this case at least, implementation trumps architecture.
Failover
One of the irritations of conventional dual active/active RAID controllers is that failover can take a minute or more, possibly causing applications to time out, and just generally slowing things down. Since FAB is distributed, and any brick can service any client with any I/O, one would hope to see much less disruption when a FAB failure occurs. And one does.
This is a worst-case scenario: a brick fails and five minutes later is declared dead, so its segment groups get re-balanced across the remaining bricks.
The actual data movement takes some time, though far less than rebuilding after a similar-size RAID 5 disk failure, and disruption to the FAB is limited: almost no impact on reads and about a 20% impact on writes.
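To make “re-balanced” concrete, here is a toy sketch of the bookkeeping involved: every segment group that lost its copy on the dead brick picks a replacement from the survivors, ideally spreading the copy work evenly. The group structure and the least-loaded placement policy are my assumptions, not FAB’s actual algorithm.

```python
# Toy sketch of rebalancing segment groups after a brick is declared dead.
# The placement policy (least-loaded survivor) is an assumption of mine,
# not FAB's actual algorithm.

from collections import Counter

def rebalance(segment_groups, dead_brick, live_bricks):
    new_copies = Counter()                   # brick -> copies it must receive
    for group in segment_groups:             # group = set of bricks holding a segment
        if dead_brick in group:
            group.discard(dead_brick)
            candidates = [b for b in live_bricks if b not in group]
            target = min(candidates, key=lambda b: new_copies[b])
            new_copies[target] += 1          # this segment gets copied to 'target'
            group.add(target)
    return new_copies

groups = [{"b0", "b1", "b2"}, {"b0", "b2", "b3"}, {"b1", "b3", "b4"}]
print(rebalance(groups, dead_brick="b0", live_bricks=["b1", "b2", "b3", "b4"]))
# -> Counter({'b3': 1, 'b1': 1})
```

Because every surviving brick picks up a small share of the copies, the rebuild load spreads across the cluster instead of hammering a single spare, which is presumably a big part of why foreground I/O barely notices.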
The StorageMojo take
FAB demonstrates, again, that highly available, stable and well-performing storage can be built out of commodity hardware with the right software. Microsoft, Google, Amazon, Cnet and HP have demonstrated it. Moving from lab to product is non-trivial, yet the customer economic advantages are huge. Someone is going to do it and, I predict, turn the storage industry upside down.
Comments welcome, as always. Moderation turned on to control a growing deluge of comment spam.
Update: Wes pointed out that I didn’t include units for the performance benchmark, so I went back and added a paragraph explaining what the benchmark was and adding my guess as to what the units are. Thanks, Wes.
“Someone is going to do it …”
Object Matrix have done just that. Check them out; they are a small startup with a couple of customers.
So what about TCO? What about the cost of management and the increased maintenance of commodity hardware? What about power and cooling costs for extra kit to provide the resilience?
How does TCO of this setup compare to say Amazon’s S3 service?
Jello, actually, a number of folks are. I’m thinking I should put together an article on them. Thanks for the pointer to Object Matrix.
Alex, all good questions. Probably the best answer to the TCO, management and P&C issues is found in – where else? – StorageMojo’s Killing With Kindness: Death By Big Iron.
The short answer: first-world people are expensive; custom-built, low-volume, highly-redundant hardware is expensive; everything else is much cheaper. As long as you’ve got enough of the everything else, you’re cool. Once you don’t, you have an LP problem. We know LP.
The real cost of the coming storage revolution will be getting all the glass house conservatives re-trained or fired/retired.
BTW, Alex is an Oracle guru with Pythian Group, a company providing remote Oracle DBA services. Nice classical allusion guys. Pythian means relating to Delphi, the temple of Apollo at Delphi, or its oracle. PG claims they are at least 20% better/cheaper than your own Oracle DBAs, and in today’s winner-take-all world, who am I to doubt that?
Alex, hope to hear more from you in the future.
Robin
The real Storage Revolution may have started at:
“The Sands are Shifting on the Desert”
http://www.drunkendata.com/?p=698
At least I tried to start it after some “Right On!” comments by Pq65 and Chuck Hollis of EMC.
Robin,
The goal of FAB is to be admired; however, some of the assumptions may be false.
Here are a few ‘loose’ observations.
Typical ‘motherboards’ are viewed as unreliable and the FAB paper proposes hardware triplication to achieve reliability… so more pressure on the cost of hardware. This is OK… but I suggest that the MTTDL situation is a lot worse than projected, as their calculation may not have considered software, power and cabling … all prominent sources of failure.
Testing of new software… will take ‘some’ time. How are they going to get into a position to field test new software on multiple, very large ‘production’ systems… do such customers exist?
Given time & money, the software will get better… but can you visualize the problems associated with cabling, not to mention the cost of power backup for short-term or prolonged outages… on a 5,000-node storage cluster?
NVRAM is OK but it does not solve the problem… you still have caches in servers to flush… presumably these are considered to be off-site and the need for power backup is a separate cost-center issue.
I have a problem with the constant obsession with so-called ‘commodity’ hardware, i.e. the low-cost, standard ‘motherboard’ concept. Typical motherboards are very inefficient in terms of power and are not optimized to drive storage. Plug-in interfaces for host & backend disks (and the associated internal cabling)… are not a good fit from the standpoint of cost or reliability. The cost simply moves to the plug-in cards, and the internal wiring mess adds to labor and unreliability.
The major cost is in the processor chip, memory, NVRAM, disk interface, host interface and disks. Given time, software-related issues will be solved … but the hardware must be optimized to the task on hand… down to ‘chip’ level.
Hardware cost & reliability dramatically improve if one eliminates the processor socket, all of the plug-in options plus the remaining unnecessary hardware, expansion sockets, cabling, etc… and yes… you are right back to a well-designed ‘brick’ controller, similar to the existing RAID controller concepts.
Good processor speed is required to do what is projected in the base software… plus XOR software and the iSCSI stack. The x86 ‘commodity’ processor is not cheap and not the most efficient compute engine… i.e. higher MHz and more power.
This project should consider a RISC engine… where the compute efficiency, lower power budget and cost are more attractive. I suggest that perhaps a new form of interconnect technology is required… fibre-optic or wireless… this should be evaluated.
As we all know, any hardware design can be ‘commoditized’ with volume supported by open specification & packaging standard, enabling multiple sources of hardware supply… all driven by an open software project … much like Linux.
None of this is new … but someone must take the lead ….and someone large needs to underwrite the concept. Any volunteers..?
With regard to Richard’s comments…
Good sound engineering.
Very appropriate in the classic Triangle of Unobtainium where the choices were:
Cost – reasonable, relative to Quality and Speed
Speed – relative to minimum acceptable Quality
Quality – 80% acceptable politically, financially, technically
Pick any two—
But in the new Triangle:
Low Cost – dirt cheap to a bargain
Speed – lightning fast, “we’ll fix it after it ships”
Quality – what will the customer go for?
What do people want?
The IDCs want really low cost based on commodity pricing and long-term Management control of the infrastructure. So they “Rolled Their Own”.
Outside of the government, the Financial Services companies, hospitals, and any others subject to regulation, people want what the IDCs have. The problem is they do not need enough to qualify for the same “Economies of Scale” as the IDCs do. So they want a Strategy that will allow them to use commodity pricing and low Management costs, and to keep the lawsuits from “loss of Information” to a minimum.
Given those guidelines, what would your design criteria be?
Mine would be redundancy, redundancy, redundancy!
How many commodity-priced motherboards do I need to support the failure rate?
How many parallel paths into the “Enabling” Units of Technology do I need to provide redundancy and acceptable performance, given the known failure rate of commodity-priced JBOD?
How do I manage this?
How do I build enough internal bandwidth into these “boxen” to support the bandwidth requirements of all the failovers?
The fallacy of all this is obvious.
Competition has given us really great automobiles at ridiculously low prices, relative to the engineering in them. I can remember when I had to check the oil, the water, the battery, the tires before starting my automobile each morning. I never look under the hood now.
People want the same thing with Storage.
Hi Robin,
I think you have this particular Alex (one and the same who left the previous comment on this article) confused with someone else. 🙂
I’m actually a professional in the IT Services industry. One of my remits is looking at efficient (cost-effective) storage for my company.
Thanks for the link back to your previous article; must have missed that at the time. Very helpful 🙂
Alex,
Oops! Please pardon my leaping to a Google-powered conclusion. I should be more sensitive to the issue since if you Google “Robin Harris” you might conclude I am a dead comedian.
Richard,
I’ll also point you back to Killing With Kindness: Death By Big Iron. While there are certainly trade-offs when moving to a new architecture, a substantial cost advantage can be a powerful motivator. Inevitably, cost predictions are risky, which is why CFOs put much more emphasis on capital cost savings than on the frequently mythical operating expense savings.
Your meaty comment deserves a longer response, one that would be a long article in itself. I’ll be mulling that over for the next few days.
Robert, you hit the nail on the head: utility storage. Cheap, reliable, and as nearly management-free as humanly possible. We’ll always have the F1s, NASCARs and dragsters of storage, but your garden-variety data center wants a Toyota, not a Ferrari.
Robin
Come on, give us some units on that performance data!
Wes,
Good catch! I went back to the paper and, surprise! unless I’m going blind they don’t list the performance units either! Freaky.
But I’ll go out on a limb and guess that the measurement is in seconds. I don’t do a lot (ok, any) compiling, so you tell me: does compiling 177 MB of Linux source in 5 minutes sound about right?
Robin
Yes, I would guess seconds. If they’re getting 316 MB/s that’s pretty awesome. 🙂