We want our data protected from device failures. When there is a failure we want to get our data back quickly. And we want to pay as little as possible for the protection and the restore. How?

Recent research by hyper-scale system managers – mostly Microsoft and Facebook engineers and scientists – has tried to answer that question. And the answers are way better than what we have today.

In XORing Elephants: Novel Erasure Codes for Big Data, authors Maheswaran Sathiamoorthy of USC, Alexandros G. Dimakis, Megasthenis Asteris, and Dimitris Papailiopoulos of [then] USC and now UT Austin, and Dhruba Borthakur, Ramkumar Vadali and Scott Chen of Facebook delve deeply into the issue.

RAID repair problem
Standard Reed-Solomon erasure codes aren’t well suited to hyperscale Hadoop workloads. Repair is a big issue, in both time and overhead.

Reed-Solomon codes suffer from an efficiency versus repair trade-off. Standard RAID5 or RAID6 needs a wide stripe for capacity efficiency, but a wide stripe makes the time to repair a failed disk much longer. Data has to be transferred from every other disk in the stripe, using up scarce disk I/Os and internal storage bandwidth.
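To make the repair cost concrete, here is a minimal Python sketch (mine, not from the paper) that simply counts the blocks a repair must read; the (k, m) layouts shown are illustrative examples, not benchmark results.

    # Toy accounting of single-block repair cost. Assumes an RS(k, m) layout
    # where any k of the k + m blocks can reconstruct the stripe; counts only,
    # no actual erasure-code arithmetic.

    def rs_repair_reads(k: int) -> int:
        """Blocks that must be read to rebuild one lost block under Reed-Solomon."""
        return k  # the decoder needs any k surviving blocks

    def replication_repair_reads() -> int:
        """Blocks read to rebuild one lost copy under simple replication."""
        return 1  # just copy a surviving replica

    if __name__ == "__main__":
        for k, m in [(4, 1), (10, 4), (14, 4)]:  # RAID5-ish, Facebook-scale, wider still
            print(f"RS({k},{m}): read {rs_repair_reads(k)} blocks to repair 1 lost block")
        print(f"3x replication: read {replication_repair_reads()} block to repair 1 lost block")

The wider the stripe, the more blocks every repair drags across the disks and the network.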

Replication nation
To avoid this problem, scale-out storage a decade ago dispensed with Reed-Solomon codes in favor of simple double or triple replication. Because it ran on inexpensive commodity disks rather than expensive arrays, this was economical.

But exponential data growth has overwhelmed the ability of big web companies to build infrastructure fast enough and large enough to handle the tsunami of data. Something had to give. Triple replication was it.

But even triple replication isn’t enough for them. Since they can’t back up data at this scale, they need a level of redundancy that even RAID6 cannot approach.

These systems are now designed to survive the loss of up to four storage elements – disks, servers, nodes or even entire data centers – without losing any data. What is even more remarkable is that, as this paper demonstrates, these codes achieve this reliability with a capacity overhead of only 60%.
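A quick back-of-the-envelope check, using the (10, 6, 5) layout described below: each stripe stores 10 data blocks plus 6 parity blocks, so 16 blocks of raw capacity protect 10 blocks of data. That is 1.6x, or 60% overhead, versus 3.0x (200% overhead) for triple replication of the same data.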

Xorbas the Geek
They examined a large Facebook analytics Hadoop cluster of 3,000 nodes with about 45 PB of raw capacity. On average about 22 nodes failed each day, but on some days failures spiked to more than 100.

Facebook node failures

As Hadoop clusters are usually network bandwidth constrained – a few active disks can fill a 10Gb pipe – the network traffic generated by the failure repairs was non-negligible. An optimal storage solution would not only be capacity efficient, but also reduce network repair traffic.
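To put a rough number on that: a 10 Gb/s link carries about 1.25 GB/s, and assuming a commodity disk of the era streams on the order of 100-150 MB/s, roughly 8 to 12 busy disks can saturate it. Repairs that read an entire stripe eat into that budget quickly.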

The authors developed a new family of erasure codes called Locally Repairable Codes, or LRCs, that are efficient to repair in both disk I/O and network bandwidth. They implemented these codes in a new Hadoop module called HDFS-Xorbas and tested it on Amazon and within Facebook.
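Here’s a toy Python sketch of the locality idea, not HDFS-Xorbas code: it uses plain XOR local parities over two groups of five data blocks instead of the paper’s Galois-field coefficients, so it only covers the common single-block failure; the real code also keeps Reed-Solomon parities for multi-block failures.

    # Toy locally repairable layout: 10 data blocks in two groups of 5, each
    # group protected by an XOR local parity. Illustrative only; a real LRC
    # uses GF(2^8) coefficients and global RS parities as well.

    def xor_blocks(blocks):
        """XOR a list of equal-length byte strings together."""
        out = bytearray(len(blocks[0]))
        for b in blocks:
            for i, byte in enumerate(b):
                out[i] ^= byte
        return bytes(out)

    # Ten tiny data blocks of 8 bytes each, purely for illustration
    data = [bytes([i] * 8) for i in range(10)]
    groups = [list(range(0, 5)), list(range(5, 10))]
    local_parity = [xor_blocks([data[i] for i in g]) for g in groups]

    def local_repair(lost: int) -> bytes:
        """Rebuild one lost data block from its 4 group-mates plus 1 local parity."""
        g = 0 if lost < 5 else 1
        survivors = [data[i] for i in groups[g] if i != lost]
        return xor_blocks(survivors + [local_parity[g]])  # 5 reads, not 10

    assert local_repair(7) == data[7]
    print("Rebuilt block 7 from 5 reads; Reed-Solomon alone would need 10.")

The point is the read count: a single failed block is rebuilt from five blocks inside its group rather than ten from across the stripe, which is where the halved disk I/O and network traffic below come from.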

The LRC tests produced several key results:

  • Disk I/O and network traffic were reduced by half compared to RS codes.
  • The LRC required 14% more storage than RS, which is information-theoretically optimal for the locality obtained (see the arithmetic after this list).
  • Repair times were much lower thanks to local repair.
  • Reliability was much greater thanks to the fast repairs.
  • Reduced network traffic makes LRCs suitable for geographic distribution.
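Where does the 14% come from? The paper’s baseline Reed-Solomon code stores 4 parity blocks per 10 data blocks, or 14 blocks in all (1.4x); the LRC’s two extra local parities make that 16 blocks (1.6x), and 1.6 / 1.4 ≈ 1.14, about 14% more raw capacity.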

Here’s the table comparing replication, Reed-Solomon and LRC. The (10, 6, 5) refers to 10 data blocks and 6 parity blocks per stripe, with a locality of 5: a single lost block can be rebuilt from just 5 other blocks.

Facebook LRC results

The StorageMojo take
Good news: WebCos will be able to store massive amounts of data more efficiently than ever before. Bad news: so will anyone else.

These codes increase the pressure on enterprise storage vendors. Cheaper web-scale storage puts CIOs on the defensive when CFOs ask why their CapEx and OpEx are so high compared to the WebCos. And that, in turn, forces CIOs to look beyond their current array vendors.

Creative destruction is hard on the array vendors, who’ve enjoyed a nice 15-year run, but it is necessary if we are to have a persistent digital culture. We’re still a long way from the longevity and easy access of a book, but this buys us more time to figure everything out.

A couple of the authors were kind enough to review the post and suggest changes to make it more accurate, which are now incorporated. You can also access the project page for more details and a link to the code.

Comments welcome, as always. Do you think the web industry is doing enough to protect our data?