Storage is at a tipping point: much of the existing investment in the software stack will be obsolete within two years. This will be the biggest change in storage since the invention of the disk drive by IBM in 1956.

This is not to deprecate the other seismic forces of flash, object storage, cloud and the newer workloads that are driving investment in scale-out architectures and NoSQL databases. But the 50 years of I/O stack development – based on disks and, later, RAID – is essentially obsolete today, as will become obvious to all very soon.

In a nutshell, the performance optimization technologies of the last decade – log-structured file systems, coalesced writes, out-of-place updates and, soon, byte-addressable NVRAM – are conflicting with similar-but-different techniques used in SSDs and arrays. Case in point: “Don’t Stack Your Log on My Log,” a recent paper by Jingpei Yang, Ned Plasson, Greg Gillis, Nisha Talagala, and Swaminathan Sundararaman of SanDisk.

Log structured writes are written to free space at the “end” of the free space pool, as if the free space were a continuous circular buffer. Stale data must be periodically cleaned up and its blocks returned to the free space pool – the process known as garbage collection.
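
To make the mechanics concrete, here is a minimal sketch of a single log layer in Python. It is a toy, not the paper's model or any real file system or FTL: the class name `TinyLog`, the fixed segment geometry, and the "clean the segment with the fewest live blocks" policy are all assumptions chosen for illustration. It shows how overwrites leave stale blocks behind and how garbage collection adds device writes on top of user writes.

```python
class TinyLog:
    """Toy log-structured store (hypothetical): fixed-size segments, append-only."""

    def __init__(self, num_segments, blocks_per_segment):
        self.blocks_per_segment = blocks_per_segment
        # Each segment holds logical addresses; None marks a stale slot.
        self.segments = [[] for _ in range(num_segments)]
        self.head = 0                              # segment being appended to
        self.free = list(range(1, num_segments))   # segments available for reuse
        self.location = {}                         # logical address -> (segment, slot)
        self.user_writes = 0
        self.device_writes = 0

    def _append(self, addr):
        if len(self.segments[self.head]) == self.blocks_per_segment:
            if not self.free:
                raise RuntimeError("no free segments; run collect() first")
            self.head = self.free.pop(0)
        old = self.location.get(addr)
        if old is not None:
            # Out-of-place update: the previous copy becomes stale.
            self.segments[old[0]][old[1]] = None
        seg = self.segments[self.head]
        seg.append(addr)
        self.location[addr] = (self.head, len(seg) - 1)
        self.device_writes += 1

    def write(self, addr):
        self.user_writes += 1
        self._append(addr)

    def collect(self):
        """Clean one sealed segment: copy its live blocks forward, then free it."""
        sealed = [i for i in range(len(self.segments))
                  if i != self.head and i not in self.free]
        victim = min(sealed,
                     key=lambda i: sum(a is not None for a in self.segments[i]))
        live = [a for a in self.segments[victim] if a is not None]
        for addr in live:
            self._append(addr)      # GC traffic: device writes with no user write
        self.segments[victim] = []
        self.free.append(victim)
        return len(live)


log = TinyLog(num_segments=4, blocks_per_segment=4)
for addr in [0, 1, 2, 3, 0, 1, 2]:   # fill segment 0, then overwrite three blocks
    log.write(addr)
log.collect()                         # segment 0 has one live block left to copy
print(log.device_writes, log.user_writes)   # 8 7
```

The ratio of device writes to user writes is the write amplification of this one layer. Put a second log underneath and the lower layer sees the upper layer's garbage collection copies as ordinary writes, which is where the stacking trouble starts.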

Log structured file systems and the SSD’s flash translation layer (FTL) both use similar techniques to improve performance. But one has to wonder: what is the impact of two or more logs on the total system? That’s what the paper addresses.

The paper explores the impact of log structured apps and file systems running on top of log structured SSDs. In summary:

We show that multiple log layers affects sequentiality and increases write pressure to flash devices through randomization of workloads, unaligned segment sizes, and uncoordinated multi-log garbage collection.

How bad?
The team found several pathologies in log-on-log configurations. Readers are urged to refer to the paper for the details. Here are the high points.

  • Metadata footprint
    Each log layer needs to store metadata to keep track of physical addresses as it appends new data. Many log layers support multiple append streams, which, the authors discovered, has important negative effects on the lower log. File system write amplification could increase by as much as 33% as the number of append streams went from 2 to 6.
  • Fragmentation
    When two log layers do garbage collection, but at different segment sizes and boundaries, the result is segment size mismatch, which creates additional work for the lower layer. When the upper layer cleans one segment, the lower layer may need to clean two segments.
  • Reserve capacity over-consumption
    Each layer’s garbage collection requires consumption of reserve capacity. Stack the GC layers and more storage is used.
  • Multiple append streams
    Multiple upper layer append streams – useful for segregating different application update frequencies – can cause the lower log to see more data fragmentation.
  • Layered garbage collection
    Each layer’s garbage collection runs independently, creating multiple issues, including:

    • Layered TRIMs. TRIM at the upper layer doesn’t reach the lower layer, so the lower layer may have invalid data it assumes is still valid.
    • GC write amplification. Independent GC can mean the lower layer cleans a segment ahead of the upper layers, causing re-writes when the upper layer communicates its changes.
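
The segment size mismatch described under Fragmentation comes down to simple arithmetic. The sketch below is an illustration, not code from the paper: the function name, the block-based units, and the assumption that an upper-layer segment lands contiguously in the lower layer's address space are all simplifications of my own. It counts how many lower-layer segments a single upper-layer segment overlaps.

```python
def lower_segments_touched(upper_index, upper_size, lower_size, offset=0):
    """Count lower-layer segments overlapped by one upper-layer segment.

    All sizes are in blocks. `offset` is the (hypothetical) block address in
    the lower layer where upper segment 0 begins.
    """
    start = offset + upper_index * upper_size
    end = start + upper_size - 1          # last block, inclusive
    return end // lower_size - start // lower_size + 1


# Same size, aligned boundaries: one upper segment maps to one lower segment.
print(lower_segments_touched(1, upper_size=4, lower_size=4))            # 1

# Same size, boundaries shifted by two blocks: it straddles two lower segments.
print(lower_segments_touched(1, upper_size=4, lower_size=4, offset=2))  # 2
```

In the misaligned case, cleaning one upper segment invalidates part of two lower segments, and each of those still holds live data from a neighbor that the lower layer must copy before it can reclaim the space.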

The StorageMojo take
Careful engineering could solve the log-on-log problems, but why bother? I/O paths should be as simple as possible. That means a system, not storage, level attack on the I/O stack.

50 years of HDD-enabling cruft won’t disappear overnight, but the industry must get started. Products that already incorporate log-structured I/O will have a definite advantage adapting to the brave new world of disk-free flash and NVM-based memory and storage.

Storage is the most critical and difficult problem in information technology. In the next decade new storage technologies will enable a radical rethink and simplification of the I/O stack beyond what flash has already done.

Six months ago I spoke to an IBM technologist and suggested that the lessons of the IBM System/38 – which sported a single persistence layer including RAM and disk – could be useful today. He hadn’t heard of it.

The SanDisk paper doesn’t directly address latency, but that’s the critical element in the new storage stack. Removing multiple log levels won’t optimize for latency, but it’s a start.

Courteous comments welcome, of course.