The crack StorageMojo analyst team has finally named a StorageMojo FAST '15 Best Paper. It was tough to get agreement this year because of the many excellent contenders. Here’s a rundown of the most interesting before a more detailed explication of the winner.

CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems impressed with its ambition.

Analysis of the ECMWF Storage Landscape, a look at a 100PB active archive, impressed with its scale.

FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs answered important questions about big data and flash.

Reliable, Consistent, and Efficient Data Sync for Mobile Apps holds out hope of a fix for a major failure of most sync services.

The mostly-EMC paper RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures offered useful insight into drive failure modes based on EMC’s internal support records – something StorageMojo has been agitating for these many years past.

Towards SLO Complying SSDs Through OPS Isolation offered the long-needed observation that

. . . performance SLOs [Service Level Objectives] cannot be satisfied with current commercial SSDs. We show that garbage collection is the source of this problem and that this cannot be easily controlled because of the interaction between VMs.

And Skylight: A Window on Shingled Disk Operation – the FAST Best Paper winner – definitely deserves a post. But there can only be one StorageMojo Best Paper.

Best Paper
The winner is Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage and Network-bandwidth, by K. V. Rashmi, Preetum Nakkiran, Jingyan Wang, Nihar B. Shah, and Kannan Ramchandran, all of the University of California, Berkeley.

The paper explores a holistic view of high-scale storage, simultaneously optimizing I/O, capacity and network bandwidth.

Our design builds on top of a class of powerful practical codes, called the product-matrix-MSR codes. Evaluations show that our proposed design results in a significant reduction [in] the number of I/Os consumed during reconstructions (a 5× reduction for typical parameters), while retaining optimality with respect to storage, reliability, and network bandwidth.

In a Reed-Solomon 7+1 RAID 5 (7 data blocks and 1 parity block) the loss of a single block forces reads of all 7 surviving blocks and consumes 7 times the size of a single block in network bandwidth. When the loss is a terabyte+ disk, the glacial pace of reconstruction is mute testament to this feature of Reed-Solomon codes.
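To put rough numbers on that, here’s a minimal sketch of the conventional RS repair arithmetic – the 7+1 layout and 1 GB block size are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope cost of rebuilding one lost block under a
# Reed-Solomon k+m layout: the classic repair path reads k surviving
# blocks to decode, so reads and network transfer both scale with k.

def rs_reconstruction_cost(k: int, block_size_gb: float):
    """Disk reads and network transfer to rebuild one lost block."""
    reads = k                          # one read per surviving block used
    transfer_gb = k * block_size_gb    # each read crosses the network
    return reads, transfer_gb

# The 7+1 RAID 5 example from the text, with an assumed 1 GB block size:
reads, transfer = rs_reconstruction_cost(k=7, block_size_gb=1.0)
print(f"{reads} block reads, {transfer:.0f} GB moved to rebuild 1 GB")
# -> 7 block reads, 7 GB moved to rebuild 1 GB
```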

Dimakis et al. introduced minimum-storage regenerating (MSR) codes, which can reduce the data transferred for reconstruction by two-thirds or more.
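The arithmetic behind that claim comes from the regenerating-codes framework: striping a file of size M over n nodes at the MSR point, a repair contacts d helpers and each sends M/(k(d−k+1)). A quick sketch, with parameters chosen for illustration rather than taken from the paper:

```python
# Repair traffic at the minimum-storage regenerating (MSR) point, per the
# Dimakis et al. framework: each of d helpers sends M / (k * (d - k + 1)),
# versus moving the whole file (size M) for a Reed-Solomon repair.

def msr_repair_traffic(M: float, k: int, d: int) -> float:
    """Total bytes transferred to regenerate one node at the MSR point."""
    return d * M / (k * (d - k + 1))

M, k, d = 1.0, 7, 13   # illustrative: file size normalized to 1, d = n - 1
rs_traffic = M         # RS repair ships k blocks, i.e. the full file
msr_traffic = msr_repair_traffic(M, k, d)
print(f"RS: {rs_traffic:.3f}  MSR: {msr_traffic:.3f}  "
      f"reduction: {1 - msr_traffic / rs_traffic:.0%}")
# -> RS: 1.000  MSR: 0.265  reduction: 73%
```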

However, the I/O overhead of MSR codes can be much higher than that of the Reed-Solomon codes used in current RAID arrays and some scale-out storage. For disk-based systems, that’s a problem.

The paper proposes product-matrix reconstruct-by-transfer (RBT) codes that achieve optimal system resource utilization. The authors also offer an algorithm that converts any vanilla product-matrix code into an RBT code.
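Why RBT cuts I/O: in a vanilla product-matrix repair each helper reads its whole stored chunk just to compute the small symbol it sends, while an RBT helper keeps that symbol precomputed on disk and reads only what it transfers. A rough model of the accounting – the parameters are illustrative, not the paper’s:

```python
# Helper-node I/O for one repair: a vanilla product-matrix MSR helper reads
# its full stored chunk (alpha) to compute the symbol (beta) it sends; a
# reconstruct-by-transfer helper reads exactly the beta it transfers.

def repair_reads(M: float, k: int, d: int):
    alpha = M / k                    # per-node storage at the MSR point
    beta = M / (k * (d - k + 1))     # per-helper network transfer
    vanilla = d * alpha              # every helper reads its whole chunk
    rbt = d * beta                   # every helper reads only what it sends
    return vanilla, rbt

vanilla, rbt = repair_reads(M=1.0, k=7, d=13)
print(f"vanilla PM reads: {vanilla:.3f}  RBT reads: {rbt:.3f}  "
      f"({vanilla / rbt:.0f}x fewer I/Os)")
# -> vanilla PM reads: 1.857  RBT reads: 0.265  (7x fewer I/Os)
```

In this simple model the improvement factor is d − k + 1, so it moves with the code parameters; the paper’s measured 5× figure reflects its own typical parameters.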

Performance
The paper offers some graphs showing the results of experiments with Reed-Solomon (RS), product-matrix (PM) and RBT codes carried out on Amazon EC2 instances:

[Figure: RBT Performance]

The StorageMojo take
Disks are going to be with us for decades to come thanks to their cost and streaming performance. Networks – typically Ethernet – are a limited and costly resource as well. Learning how to optimize both in scale-out systems is necessary.

The shift to high-IOPS media, like flash drives, means cheap I/Os on expensive media. But that doesn’t change anything for disk-based scale-out storage, where massive capacity guarantees that data reconstruction will be common.

For future research I’d like to see more on the latency impact of advanced erasure codes. As object storage continues to displace file servers, latency will become a critical issue. Update: K.V. Rashmi was nice enough to let me know that they are indeed working on the latency issue. Good to know! End update.

Courteous comments welcome, of course.