Storage is at a tipping point: much of the existing investment in the software stack will be obsolete within two years. This will be the biggest change in storage since the invention of the disk drive by IBM in 1956.
This is not to deprecate the other seismic forces of flash, object storage, cloud and the newer workloads that are driving investment in scale-out architectures and NoSQL databases. But 50 years of I/O stack development – based on disks and, later, RAID – is essentially obsolete today, as will soon become obvious to all.
Why?
In a nutshell, the performance optimization technologies of the last decade – log structured file systems, coalesced writes, out-of-place updates and, soon, byte-addressable NVRAM – are conflicting with similar-but-different techniques used in SSDs and arrays. Case in point: Don’t stack your Log on my Log, a recent paper by Jingpei Yang, Ned Plasson, Greg Gillis, Nisha Talagala, and Swaminathan Sundararaman of SanDisk.
Log-structured writes go to free space at the “end” of the free space pool, as if the free space were a continuous circular buffer. Stale data must be periodically cleaned up and its blocks returned to the free space pool – the process known as garbage collection.
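To make the mechanics concrete, here is a minimal sketch of a log-structured store (hypothetical Python with invented names – not any particular file system or FTL). Updates always append at the head of the log, superseded copies become stale, and garbage collection copies the live blocks out of the oldest segment so the whole segment can be reclaimed:

```python
# Minimal sketch of a log-structured store (illustrative, invented names).
SEGMENT_BLOCKS = 4                 # blocks per cleanable segment

class LogStore:
    def __init__(self, capacity_blocks):
        self.log = []              # physical log: (logical block, data), or None if stale
        self.index = {}            # logical block -> position in the log
        self.capacity = capacity_blocks

    def write(self, lblock, data):
        """Out-of-place update: append at the head, invalidate the old copy."""
        if lblock in self.index:
            self.log[self.index[lblock]] = None     # old copy is now stale
        self.index[lblock] = len(self.log)
        self.log.append((lblock, data))
        if len(self.log) >= self.capacity:
            self.garbage_collect()

    def garbage_collect(self):
        """Reclaim the oldest segment, copying its live blocks to the head."""
        victim, self.log = self.log[:SEGMENT_BLOCKS], self.log[SEGMENT_BLOCKS:]
        # Rebuild the index for the shifted survivors.
        self.index = {e[0]: pos for pos, e in enumerate(self.log) if e is not None}
        for entry in victim:
            if entry is not None:                   # live block: rewrite it
                self.index[entry[0]] = len(self.log)
                self.log.append(entry)

store = LogStore(capacity_blocks=8)
for i in range(12):
    store.write(i % 3, f"version {i}")  # hot blocks are rewritten repeatedly
```

The copying in `garbage_collect` is the source of write amplification: every live block in a victim segment gets written a second time, and that cost is paid at every log layer in the stack.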
Log structured file systems and the SSD’s flash translation layer (FTL) both use similar techniques to improve performance. But one has to wonder: what is the impact of two or more logs on the total system? That’s what the paper addresses.
The paper explores the impact of log structured apps and file systems running on top of log structured SSDs. In summary:
We show that multiple log layers affects sequentiality and increases write pressure to flash devices through randomization of workloads, unaligned segment sizes, and uncoordinated multi-log garbage collection.
How bad?
The team found several pathologies in log-on-log configurations. Readers are urged to refer to the paper for the details. Here are the high points.
- Metadata footprint. Each log layer needs to store metadata to track physical addresses as it appends new data. Many log layers support multiple append streams, which, the team discovered, has important negative effects on the lower log: file system write amplification increased by as much as 33% as the number of append streams went from 2 to 6.
- Fragmentation. When two log layers do garbage collection at different segment sizes and boundaries, the segment size mismatch creates additional work for the lower layer: when the upper layer cleans one segment, the lower layer may need to clean two.
- Reserve capacity over-consumption. Each layer’s garbage collection consumes reserve capacity. Stack the GC layers and more storage is used.
- Multiple append streams. Multiple upper-layer append streams – useful for segregating different application update frequencies – can cause the lower log to see more data fragmentation.
- Layered garbage collection. Each layer’s garbage collection runs independently, creating multiple issues, including:
  - Layered TRIMs. A TRIM at the upper layer doesn’t reach the lower layer, so the lower layer may hold invalid data it assumes is still valid.
  - GC write amplification. Independent GC can mean the lower layer cleans a segment ahead of the upper layer, causing re-writes when the upper layer communicates its changes (a toy calculation follows this list).
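A back-of-the-envelope way to see the compounding (my toy numbers, not measurements from the paper): the lower log must absorb everything the upper log writes, including the upper log’s own cleaning traffic, so the write amplification factors of stacked logs multiply.

```python
# Toy calculation (invented numbers): stacked log write amplification multiplies.
user_writes_gb = 100       # data the application actually wrote
wa_upper = 1.5             # upper log's write amplification (assumed)
wa_lower = 1.5             # lower log / FTL write amplification (assumed)

to_lower = user_writes_gb * wa_upper     # traffic the lower log receives
to_media = to_lower * wa_lower           # traffic the flash actually absorbs

print(f"{to_media:.0f} GB hit the media for {user_writes_gb} GB of user data "
      f"({to_media / user_writes_gb:.2f}x end-to-end)")   # 225 GB, 2.25x
```

Segment size mismatch feeds the same multiplication: an upper segment that straddles two lower segments leaves both partially live, giving the lower layer’s GC more to copy.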
The StorageMojo take
Careful engineering could solve the log-on-log problems, but why bother? I/O paths should be as simple as possible. That means a system-level, not storage-level, attack on the I/O stack.
50 years of HDD-enabling cruft won’t disappear overnight, but the industry must get started. Products that already incorporate log structured I/O will have a definite advantage adapting to the brave new world of disk-free flash and NVM memory and storage.
Storage is the most critical and difficult problem in information technology. In the next decade new storage technologies will enable a radical rethink and simplification of the I/O stack beyond what flash has already done.
Six months ago I spoke to an IBM technologist and suggested that the lessons of the IBM System/38 – which sported a single persistence layer spanning RAM and disk – could be useful today. He hadn’t heard of it.
The SanDisk paper doesn’t directly address latency, but that’s the critical element in the new storage stack. Removing multiple log levels won’t optimize for latency, but it’s a start.
Courteous comments welcome, of course.
I say IT people often forget the lessons of history. Not having heard of the System/38 [1] is a common example. It was one of the best-designed systems of its time: portable, future-proof apps without source recompilation; high reliability; capability-based security; memory as objects with runtime checks; an OS written in a high-level language with object checks during compilation; and an integrated data store with checkpointing. I’d take a system like that over Windows or UNIX-like systems any day, as long as the price is reasonable. I’d like to see FOSS copy its good qualities, too, but the bazaars rarely copy the cathedrals.
Anyway, they dropped capability security at the processor level, added stuff, renamed it AS/400 (know it now?), added open standards/languages/etc., added virtualization at the firmware level, and now it’s called IBM i. Most of my company’s locations have one as old as the location itself, with no malware or unplanned downtime. Real workhorses. The interface is horrible by today’s standards, though.
[1] http://homes.cs.washington.edu/~levy/capabook/Chapter8.pdf
Hi Robin –
Although legacy I/O stacks will increasingly evolve and/or be bypassed as media and use cases evolve, it should not be a result of media-related write logging technology. The log-on-log problem you discuss is understood, and is well on its way to full resolution without obsoleting higher level I/O stacks. NetApp’s multi-decade experience with log-structured methods helped us identify this issue early and push for a standardized solution. We expect these improvements will enable SSD and system vendors to improve both flash capacity utilization and wear efficiency, saving end users millions of dollars/euros/you-name-it over time.
You wrote: “Careful engineering could solve the log-on-log problems, but why bother? I/O paths should be as simple as possible. That means a system, not storage, level attack on the I/O stack.”
I’m unsure what you propose, but I would disagree with any notion that the log-on-log problem obsoletes existing I/O stacks. In particular, neither applications nor upper layers of an I/O stack need to (or should!) implement log structure for media optimization themselves; I discuss why below. The log-on-log problem described is real, but its ideal solution is local, not global. Applications may benefit from tagging their random write streams that will later be rewritten or deleted with similar temporal locality, but that absolutely does not imply they should convert random writes to sequential streams themselves.
In general, one wants any log-structured layout (transforming random writes to sequential streams) to be implemented as close to the relevant media as possible without sacrificing essential efficiencies (particularly wear and/or capacity efficiency). In the absence of parity RAID (or comparable erasure codes) and byte-granular compression above an SSD, the best place to transform random writes to sequential is generally within the SSD FTL.
The application layer is generally NOT the best place to implement log-structure because of the amplified bandwidth costs that span the portion of the stack from the log-structured layer to the media. Even if that cost were acceptable, other serious issues include media sensitivity, media sharing challenges, and loss of centralized data management functions. Further, not all storage media will even need log-structured writes, and those that do have varying geometric characteristics, so there is no one-size-fits-all log-structured method that an application could appropriately apply across heterogeneous media.
NetApp has understood the log-on-log problem for many years, in fact years before SSDs were conceived. When NetApp first proposed a log-on-log solution to SSD vendors a few years ago, it seemed that we were the first customer requesting multiple segregated write streams and minimal overprovisioning. These capabilities enable (practical) elimination of amplification within the drive for multiple sequential write streams, and without consuming reserve space. Now these SSD features are being standardized and starting to appear in commodity drive implementations. The authors of “Don’t stack your Log on my Log” cited Samsung’s HotStorage paper “The Multi-streamed Solid State Drive,” but did not explore the benefits of simply mapping each upper layer stream onto an independent lower stream exposed by the SSD. Even Samsung’s paper obscures this potential by focusing on hints and methods (like TRIM) that are unnecessary for log-structured writes within consistent segment boundaries.
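For illustration only (invented names, not a real vendor API or NetApp’s implementation): the essence of multi-stream placement is that the host tags each write with a stream ID and the SSD gives each stream its own open segment, so data with similar lifetimes lands in the same erase segments.

```python
from collections import defaultdict

SEGMENT_PAGES = 4

class MultiStreamSSD:
    """Sketch of multi-stream placement: one open segment per stream."""
    def __init__(self):
        self.open = defaultdict(list)   # stream id -> pages in the open segment
        self.sealed = []                # full segments, each from a single stream

    def write(self, stream_id, lba, data):
        seg = self.open[stream_id]
        seg.append((lba, data))
        if len(seg) == SEGMENT_PAGES:   # segment full: seal it whole
            self.sealed.append((stream_id, seg))
            self.open[stream_id] = []

ssd = MultiStreamSSD()
ssd.write(0, 100, "hot metadata")       # short-lived stream
ssd.write(1, 900, "cold user data")     # long-lived stream
```

Because a sealed segment holds data of one lifetime class, whole segments tend to become invalid at once, and the FTL can erase them without copying – the practical elimination of amplification described above.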
In a parity RAID system, to minimize wear one wants to perform write logging above RAID (thus avoiding parity-related write amplification), in which case logging within the drives needs to be able to cooperate with (and optimize for) log-structured methods above the drives. (Write logging above RAID also negates alignment and fragmentation issues for compressed extents, so it facilitates maximum capacity savings from byte-granular compression.) Single stream log-on-log wear and capacity costs can be very small even without special optimization, as long as the upper level stream’s segment size is a decent multiple of the lower stream’s segment size. The need for multiple independent SSD write streams comes both from multiple append streams within a single log-structured layer and from multiple peers sharing an SSD, and it is this need that is driving the current set of optimizations.
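The alignment point lends itself to quick arithmetic (mine, not the commenter’s): count how many lower-log segments one upper-log segment straddles.

```python
import math

def lower_segments_touched(upper_bytes, lower_bytes, offset=0):
    """How many lower-log segments one upper-log segment overlaps."""
    return (math.ceil((offset + upper_bytes) / lower_bytes)
            - math.floor(offset / lower_bytes))

MB = 1 << 20
# Aligned multiple: a 32 MB upper segment covers exactly 8 lower 4 MB segments,
# so cleaning it invalidates whole lower segments and lower GC copies nothing.
print(lower_segments_touched(32 * MB, 4 * MB))                # -> 8

# Unaligned mismatch: a 3 MB upper segment starting at 2 MB straddles two
# 4 MB lower segments, leaving both partially live for lower GC to clean up.
print(lower_segments_touched(3 * MB, 4 * MB, offset=2 * MB))  # -> 2
```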
I hope this helps explain the broader context for the log-on-log problem and how it can be addressed in a straightforward way within flash storage firmware/software.
Best regards,
– Jeff Kimmel, NetApp flash systems architect
Log-structured file-systems as an optimisation of the last decade? Ahem, Spiralog from Digital on OpenVMS in 1996!
http://www.hpl.hp.com/hpjournal/dtj/vol8num2/vol8num2art1.pdf
I suppose I’ll concede “optimisation” of the last decade, rather than everyone else finally catching up to the brilliance of DEC’s storage guys.
The minis in the mid-range always had the best stuff. Sigh…
This may sound like a stupid question, but in an age of cheap hardware, why do the logs need to be written to the same device?
In fairness, not too many people have heard of the System/38, but if you say “AS/400”, you’re likely to get a lot more nods.
It’s interesting to think that most of the issues raised by the transition to solid-state storage were correctly identified by IBM’s (and other) engineers back in the 70s, but it took a lot longer for the technology to actually develop than it seems they anticipated.
Reminds me of the good old days of Stan Poley, SOAP and the IBM 650: optimizing assembly to ensure the next instruction was just about to pass under the read head of the drum memory.
http://www.columbia.edu/cu/computinghistory/650.html
All the old things are new again eventually.
Just musing here, but what about the file system as part of the storage device? I’m thinking, for example, of an SSD with the file system built in (rather than implemented on top of it by outside software, as today). It would cause some headaches by being less flexible, but it could certainly solve the log-on-log issue. Vendors could put even more in hardware if they wanted. It doesn’t go as far as your IBM System/38 example, so it would be more easily achievable. Normally I would prefer storage devices to stay completely standardized for system access, though, since a tremendous amount of efficiency comes from that, along with making new innovation easier. Perhaps another standardization layer could sit in the middle. I wonder what the Windows and OS X developers would say about this.