One of the decade’s grand challenges in storage is making efficient advanced erasure coded object storage (AECOS) fast enough to displace most file servers.

Advanced erasure codes can give users the capability to survive four or more device failures – be they disks, SSDs, servers, or datacenters – with low capacity overhead. By low I mean 40% over the net stored data, rather than the 3x replication default for many object stores today.

Advanced erasure codes are capacity efficient, but at a price: computational and read latency overhead. Accessing the data requires a lot of processing, which is why most of these codes are used for archives, not active data.

Yet as data volumes rise faster than either areal density grows (for disks) or cost-per-bit drops (for SSDs), capacity efficiency will become critical. Despite multiple variables and leverage points, there is a growing need for capacity efficient storage with at least good, and preferably great, performance.

Hitchhiker
Since CPUs and networks are not getting (much) faster, the obvious place to look for the needed performance improvements are in the algorithms underlying advanced erasure codes. In the paper A “Hitchhiker’s” Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers, K. V. Rashmi1, Nihar B. Shah1, and Kannan Ramchandran of UC Berkeley, and Dikang Gu, Hairong Kuang, and Dhruba Borthakur, of Facebook, present

. . . Hitchhiker, a new erasure-coded storage system that reduces both network traffic and disk IO by around 25% to 45% during reconstruction of missing or otherwise unavailable data, with no additional storage, the same fault tolerance, and arbitrary flexibility in the choice of parameters, as compared to RS-based systems. Hitchhiker “rides” on top of RS codes, and is based on novel encoding and decoding techniques. . . .

Piggybacking
The unintuitive part of Hitchhiker is that it builds on top of existing Reed-Solomon codes. So how does adding more data to an existing code make it more, rather than less, efficient?

At this point, those with a professional interest should download the PDF for detailed explanation. Essentially, the piggyback framework uses finite field arithmetic to add one byte of data to impart new properties to the underlying RS code.

These added properties can be designed to achieve different goals. The team focussed on reconstruction efficiency.

The authors present three different codes to demonstrate the concept and to test for production efficiency. Two of the codes use only low-overhead XOR operations, while the third – and most efficient – requires complex finite field arithmetic.

Test implementation
The team implemented their algorithms in the Hadoop Distributed File System (HDFS), which is widely used at Facebook. They built on the HDFS-RAID mmodule using RS codes as normally deployed in Facebook infrastructure. Here’s a diagram of what they implemented:

Click to enlarge.

Click to enlarge.

Results
The team evaluated both computation times for encoding and degraded reads and read times for degraded reads and recovery. As expected, the additional computation overhead for encoding the different Hitchhiker variants is higher than straight RS codes.

Bottom line: substantial improvement in read times over traditional RS codes:

Click to enlarge.

Click to enlarge.

This graph compares encoding times:

Click to enlarge.

Click to enlarge.

The StorageMojo take
Most files aren’t accessed more than a handful of times. So why put them on costly high performance file servers?

Better to use commodity object storage with advanced erasure codes to get lower cost and higher availability than legacy active-active file servers can provide. Of course, but the performance penalty for advanced erasure coding has been a problem, as Cleversafe and others found.

Nonetheless, this paper demonstrates that significant progress is possible. Expect a decade of stepwise enhancements until AECOS displaces the vast majority of enterprise file servers.

Courteous comments welcome, of course. AECOS is a terrible acronym. Anyone have a better idea?