Sometimes in the midst of the endless tweaking needed to maximize storage performance one just wants to say “screw it! Put everything in RAM!” And that’s just what RAMCloud does.
Disk is the new tape, flash the new disk, DRAM the new flash.
RAMCloud is a research paper (pdf) and an open software project. The goal is enterprise-class availability with every bit of active data stored in DRAM, not disk or flash, for maximum performance. It is a key-value object store today, though as pure software that could change.
It’s the brainchild of John Ousterhout, a Stanford prof who invented Tcl back in the 80s at Berkeley.
Isn’t DRAM volatile and costly?
Right on both counts, grasshopper, so RAMCloud isn’t a 1 for 1 disk-style architecture. No Google FS-style triple replication here, or RAID-style erasure coding.
Instead RAMCloud uses buffered logging:
. . . a single copy of each object is stored in DRAM of a primary server and copies are kept on the disks of two or more backup servers; each server acts as both primary and backup. However, the disk copies are not updated synchronously during write operations. Instead, the primary server updates its DRAM and forwards log entries to the backup servers, where they are stored temporarily in DRAM.
Instead of working around crashes by keeping multiple live copies of each object, as scale-out storage does, RAMCloud rebuilds the lost data at high speed from the backup logs, whether they are still in DRAM or already on disk. That’s possible because all the log data is either in DRAM or spread across many disks.
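For the code-minded, here is a rough sketch of the buffered logging idea described above. It is a toy model, not RAMCloud's actual code: the class names `Primary` and `Backup`, the `flush` method and the `recover` helper are all made up for illustration, and a Python list stands in for the disk.

```python
from collections import deque


class Backup:
    """Hypothetical backup server: buffers log entries in DRAM, writes to disk later."""

    def __init__(self, name):
        self.name = name
        self.dram_buffer = deque()  # log entries not yet on disk
        self.disk = []              # stand-in for the on-disk log

    def append(self, entry):
        # Write path: entry lands in DRAM only, so no disk I/O is on the critical path.
        self.dram_buffer.append(entry)

    def flush(self):
        # Runs asynchronously in a real system; here it is called by hand.
        while self.dram_buffer:
            self.disk.append(self.dram_buffer.popleft())

    def log(self):
        # Full log in write order: flushed entries first, then buffered ones.
        return self.disk + list(self.dram_buffer)


class Primary:
    """Hypothetical primary: holds the single DRAM copy of each object."""

    def __init__(self, backups):
        self.objects = {}
        self.backups = backups

    def write(self, key, value):
        self.objects[key] = value          # update DRAM
        for backup in self.backups:        # forward the log entry to each backup
            backup.append((key, value))

    def read(self, key):
        return self.objects[key]


def recover(backup):
    """Rebuild a crashed primary's objects by replaying one backup's log."""
    objects = {}
    for key, value in backup.log():
        objects[key] = value  # later entries overwrite earlier ones
    return objects
```

Replaying the log in order is what makes the scheme work: an overwritten key simply appears twice in the log and the later entry wins.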
In a recent paper (Fast Crash Recovery in RAMCloud) (pdf) Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum (co-founder of VMware) go into more detail on this critical feature.
The key elements are:
- Scale. Servers scatter their backup data across all other servers so thousands of disks can serve the recovery.
- Log-structure. Reduces complexity and offers high performance.
- Randomization. Many decisions need to be made in a large cluster. Rather than relying on deterministic coordination, which consumes CPU, time and bandwidth, RAMCloud injects randomization to make decisions faster and with less overhead.
- Dynamic tablets. The key-value store tracks resource usage within a single table and ensures that no single partition is too large for fast restores.
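To make the scale and randomization points concrete, here is a small sketch of scattering log segments across a cluster. It is a simplification under my own assumptions, not the algorithm from the paper: the function names, the fixed `replicas=2` default and the "read from the least loaded holder" rule are all hypothetical.

```python
import random


def scatter_segments(num_segments, servers, primary, replicas=2, seed=0):
    """Place each log segment on `replicas` randomly chosen servers other than the primary."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    candidates = [s for s in servers if s != primary]
    return {seg: rng.sample(candidates, replicas) for seg in range(num_segments)}


def recovery_sources(placement):
    """Pick one holder per segment to read from and return segments read per server."""
    load = {}
    for seg, holders in placement.items():
        # Read from whichever holder has been asked for the fewest segments so far.
        source = min(holders, key=lambda s: load.get(s, 0))
        load[source] = load.get(source, 0) + 1
    return load


if __name__ == "__main__":
    servers = [f"server{i}" for i in range(60)]
    placement = scatter_segments(600, servers, primary="server0")
    load = recovery_sources(placement)
    print(len(load), "servers take part; busiest reads", max(load.values()), "segments")
```

No server needs a global view to place a segment, and when the primary dies its log is read back from many disks at once instead of one.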
DRAM is volatile, so the log data is replicated to servers on other racks for redundancy before it is committed to disk. Still, total system write throughput is limited by disk write speed, and those limits are a key reason people are moving away from disks. Flash drives may help, but other techniques, such as log truncation and sharding, make it possible to get good performance from several thousand SATA drives.
How good? The team reports that in a 60-node cluster they recover 35GB in 1.6 seconds. With more nodes, larger partitions should be restored even faster. Scale is good.
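A quick back-of-the-envelope check on those numbers. The 35GB, 1.6 seconds and 60 nodes come from the reported result; the even split across nodes and the 100 MB/s single-disk figure are my assumptions for comparison, not measurements from the paper.

```python
def recovery_rates(recovered_gb=35.0, seconds=1.6, nodes=60):
    """Return (aggregate GB/s, per-node GB/s), assuming work is spread evenly over the nodes."""
    aggregate = recovered_gb / seconds
    per_node = aggregate / nodes
    return aggregate, per_node


def single_disk_seconds(recovered_gb=35.0, disk_mb_per_s=100.0):
    """Time to read the same data from one disk at an ASSUMED sequential rate (1 GB = 1000 MB)."""
    return recovered_gb * 1000.0 / disk_mb_per_s


if __name__ == "__main__":
    aggregate, per_node = recovery_rates()
    print(f"aggregate: {aggregate:.1f} GB/s, per node: {per_node * 1000:.0f} MB/s")
    print(f"one disk at the assumed 100 MB/s: {single_disk_seconds():.0f} s")
```

That works out to roughly 22 GB/s across the cluster, or about 365 MB/s per node, versus nearly six minutes for a single disk at the assumed rate.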
Lights out!
Power failures wipe all the data in DRAM. The obvious defense is to avoid failures: combine battery backup with diesel generator sets. Power ride-through will handle interruptions into the hundreds of milliseconds.
But who is going to trust that? That’s why future commercial implementations will insist on logging to stable storage, such as flash SSDs.
They’re getting cheaper fast – faster than DRAM – which will make this a common approach.
Cost
Professor Ousterhout kindly sent a short note about cost, correctly noting that
. . . if you measure cost/operation, DRAM is roughly 100x cheaper than disk, since a disk can only perform about 100-200 operations/second. This is why RAMCloud makes sense for data-intensive applications. . . .
While you and I might find that persuasive, too many enterprises don’t. The deep conservatism of the storage culture – both figuratively and literally – makes cost a good excuse to stay with the tried and true, and easy to explain to CFOs.
The good news for the company I hope he is starting is that the primacy of $/GB is slowly eroding as customers see the system level savings from fast storage. SSD vendors and companies like TMS and Kaminario are breaking trail for RAMCloud.
The StorageMojo take
Make no mistake: RAMCloud is a research project, not a commercial product, years and million$ away from commercial application. But the concept is promising.
Imagine a world where data layout doesn’t matter, where apps are optimized for sub-millisecond storage, where 100 byte I/Os are faster and just as efficient as 8KB I/Os. The architectural implications are huge and would take a decade or more to absorb.
RAMCloud raises the thorny issue of tiering: getting hot data on the hot storage and everything else off to disk. There are OK answers for tiering but nothing insanely great.
RAMCloud shows we’re far from the end of the line in what storage can do. Faster, better, arguably cheaper: 2 out of 3 ain’t bad.
Courteous comments welcome, of course. A shorter version of this post appeared on ZDNet.
John Ousterhout is giving a tech talk on RAMCloud at LinkedIn HQ in Mountain View, CA on Wed, Oct 12th. This event is free and open to the public.
Details and registration at http://events.linkedin.com/RAMCloud-Scalable-High-Performance/pub/793368
Here is the video recording of the above-mentioned presentation at LinkedIn:
http://www.youtube.com/watch?v=lcUvU3b5co8
He mentions that they use 25 Gbit/s InfiniBand in their tests, so the backplane on the switching hardware really is the bottleneck.