A new Usenix paper looks at NAND flash SSD performance. It comes from a team at Microsoft Research and the University of Wisconsin, including Ted Wobber, who worked on last year’s A Design for High-Performance Flash Disks [see Flash chance for the StorageMojo take on that excellent paper – a post Ted was kind enough to review and comment on].

Design Tradeoffs for SSD Performance (by Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse and Rina Panigrahy) takes a deep dive into flash translation layer (FTL) issues. As the authors note, flash vendors keep their FTL designs secret, so the team developed a NAND flash simulator to look at how design choices affect performance.

What they found
They ran several workloads on their trace-based simulator, including TPC-C, Exchange and some file system benchmarks. They found several critical issues in SSD design.

  • Data placement – needed for wear leveling and load balancing.
  • Parallelism – single flash chips aren’t very fast, so they need to work together.
  • Write ordering – small random writes are a killer.
  • Workload management – you can optimize for sequential or random workloads, but managing both well is hard.

Canonical part
The paper’s discussion of flash memory is based on the spec for Samsung’s K9XXG08UXM 4 GB Single Level Cell (SLC) package. Other parts may differ, but NAND physics are the basic challenge.

The Samsung part has two 2 GB dies (chips) in the package. Each die has 8192 blocks – a block is 64 pages of 4 KB each – organized into 4 planes of 2048 blocks. The dies can be addressed independently, while cross-plane operations are limited to planes 0 & 1 or 2 & 3. Each page also has 128 bytes for metadata.

Cross-plane operations are a form of parallelism. The Samsung part also provides a copy-back operation so one page can be copied to another without transporting the data off the die. Copy-back is limited to copies within the same 2048-block flash plane.
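To make that geometry concrete, here is a minimal sketch (mine, not the paper’s) that checks the package capacity and splits a package-wide physical page number into die, plane, block and page. The constants come from the numbers above; the address ordering and function names are my own assumptions.

    # Geometry of the 4 GB Samsung-style package described above:
    # 2 dies, each with 4 planes of 2048 blocks, 64 pages of 4 KB per block.
    PAGES_PER_BLOCK = 64
    BLOCKS_PER_PLANE = 2048
    PLANES_PER_DIE = 4
    DIES_PER_PACKAGE = 2

    def decompose(ppn):
        """Split a package-wide physical page number into (die, plane, block, page)."""
        page = ppn % PAGES_PER_BLOCK
        block = (ppn // PAGES_PER_BLOCK) % BLOCKS_PER_PLANE
        plane = (ppn // (PAGES_PER_BLOCK * BLOCKS_PER_PLANE)) % PLANES_PER_DIE
        die = ppn // (PAGES_PER_BLOCK * BLOCKS_PER_PLANE * PLANES_PER_DIE)
        return die, plane, block, page

    def can_pair(plane_a, plane_b):
        """Cross-plane operations are limited to planes 0 & 1 or planes 2 & 3."""
        return {plane_a, plane_b} in ({0, 1}, {2, 3})

    total_pages = DIES_PER_PACKAGE * PLANES_PER_DIE * BLOCKS_PER_PLANE * PAGES_PER_BLOCK
    print(total_pages * 4 // (1024 * 1024), "GB of data pages per package")  # 4
    print(decompose(1_000_000))               # an arbitrary page somewhere on die 1
    print(can_pair(0, 1), can_pair(1, 2))     # True False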

Expensive writes
NAND flash is a type of EEPROM. About the only characteristics it shares with disks are block structure and persistence. To write – or, as the flash guys say, program – a page, the page must first be erased. And you can’t just erase a 4 KB page – you have to erase an entire block.

An erase operation takes 1.5 ms, making it considerably more expensive than a read or a write. To maintain a supply of empty blocks, a cleaning process – garbage collection – runs when the free block supply gets low.
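To see why writes are expensive, here is a toy sketch of the erase-before-write rule and the cleaning step – my own simplification, not the paper’s FTL: an overwrite lands in a fresh page, the old copy is merely marked stale, and the cleaner later relocates whatever is still valid before issuing the costly block erase.

    # Toy model of one flash block: pages can only be programmed when erased,
    # and the only way back to "erased" is to erase the whole block.
    ERASED, VALID, INVALID = "erased", "valid", "invalid"
    PAGES_PER_BLOCK = 64

    class Block:
        def __init__(self):
            self.pages = [ERASED] * PAGES_PER_BLOCK
            self.next_free = 0                 # program pages in order (typical for NAND)

        def program(self):
            """Program the next erased page; impossible once the block is full."""
            if self.next_free >= PAGES_PER_BLOCK:
                raise RuntimeError("block full - must erase the whole block first")
            self.pages[self.next_free] = VALID
            self.next_free += 1
            return self.next_free - 1

        def invalidate(self, page):
            """A logical overwrite lands elsewhere; the old copy is just marked stale."""
            self.pages[page] = INVALID

        def erase(self):
            """Erase the entire block (the 1.5 ms operation mentioned above)."""
            self.pages = [ERASED] * PAGES_PER_BLOCK
            self.next_free = 0

    def clean(block, relocate):
        """Garbage collection: move the block's valid pages elsewhere, then erase it."""
        for i, state in enumerate(block.pages):
            if state == VALID:
                relocate(i)                    # e.g. copy-back within the same plane
        block.erase()

    b = Block()
    old = b.program()                          # first write of some logical page
    b.invalidate(old)                          # overwrite: old physical copy goes stale
    new = b.program()                          # new data lands in the next erased page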

SLC flash is good for about 100,000 program/erase cycles per block, so not only do you have to manage the full block erasure problem, but you also have to manage the life span of each block – the wear-leveling problem.

[Wear-leveling will become even more acute with next-gen 3- and 4-level cells. Speculation is that the write spec could drop as low as 1,000 cycles per cell.]
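Wear-leveling, at its simplest, means tracking erase counts and handing out the least-worn free block first. A minimal sketch under that greedy assumption; real FTLs also move cold data around, which this ignores.

    import heapq

    ERASE_LIMIT = 100_000          # rated program/erase cycles for the SLC part above

    class WearLeveler:
        """Hand out the least-erased free block first (a crude greedy policy)."""
        def __init__(self, num_blocks):
            self.erase_counts = [0] * num_blocks
            self.free = [(0, b) for b in range(num_blocks)]   # (erase count, block id)
            heapq.heapify(self.free)

        def allocate(self):
            _, block = heapq.heappop(self.free)
            return block

        def retire(self, block):
            """Block has been cleaned and erased; return it to the free pool."""
            self.erase_counts[block] += 1
            if self.erase_counts[block] < ERASE_LIMIT:
                heapq.heappush(self.free, (self.erase_counts[block], block))
            # otherwise the block is worn out and quietly dropped from the pool

    leveler = WearLeveler(num_blocks=8192)     # one die's worth of blocks
    b = leveler.allocate()
    leveler.retire(b)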

The paper tabulates the operational flash parameters for the Samsung part: a page read into the on-chip register takes 25 μs, programming a page from the register takes 200 μs, a block erase takes 1.5 ms, and moving a 4 KB page across the serial bus adds roughly 100 μs.

SSD controller architecture
The flash packages, of course, are only the building blocks of an SSD. Much of the magic comes from the architecture and optimizations of the SSD controller logic. The paper presents a generalized block diagram for an SSD controller.

Key elements:

  • Host interconnect – SATA, USB, FC, PCI-e.
  • Buffer management – for pending and satisfied requests.
  • Multiplexer – manages instruction and data transport along the serial connections to the flash packages.
  • Processor – manages request flow and the mappings from logical block addresses to physical flash locations (a minimal sketch of such a map follows this list).
  • RAM – for the processor.
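For the logical-to-physical mapping mentioned above, here is the smallest useful sketch I can offer – hypothetical names, far simpler than a production FTL: a table from logical block address to a physical location, remapped on every write because data never returns to the same physical page.

    # Minimal logical-to-physical map: every write allocates a fresh physical page
    # and remaps the LBA; the old physical page is left for the cleaner.
    # Names and structure are illustrative, not from the paper.
    class LogicalBlockMap:
        def __init__(self):
            self.map = {}                 # lba -> (package, block, page)
            self.stale = []               # physical pages awaiting cleaning

        def write(self, lba, new_location):
            if lba in self.map:
                self.stale.append(self.map[lba])   # old copy becomes garbage
            self.map[lba] = new_location

        def read(self, lba):
            return self.map.get(lba)      # None if never written

    ftl = LogicalBlockMap()
    ftl.write(42, ("pkg0", 17, 3))
    ftl.write(42, ("pkg0", 17, 4))        # overwrite goes to a new page
    print(ftl.read(42), ftl.stale)        # ('pkg0', 17, 4) [('pkg0', 17, 3)]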

On a cheap USB thumb drive all these elements may be integrated into a single chip. On a high-performance Fibre Channel SSD they may be separate components on their own PC board.

The size of the flash packages also has an impact on cost and architecture. A 32 GB SSD built with the Samsung parts would require 136 pins at the controller. Larger SSDs may not have enough pins for full interconnection between the controller and the flash packages, requiring additional engineering trade-offs.
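Back-of-the-envelope on that pin count, assuming roughly 17 signal pins per package (an 8-bit data bus plus control lines – my inference from the 136-pin figure, not a number stated in the paper):

    # Rough pin arithmetic for full controller-to-package connectivity.
    # ASSUMPTION: ~17 signal pins per package (8-bit data bus + control lines).
    PACKAGE_GB = 4                 # the Samsung part discussed above
    PINS_PER_PACKAGE = 17

    def controller_pins(ssd_gb):
        packages = ssd_gb // PACKAGE_GB
        return packages, packages * PINS_PER_PACKAGE

    print(controller_pins(32))     # (8, 136)  - matches the figure above
    print(controller_pins(256))    # (64, 1088) - why big SSDs can't fully interconnect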

Faking it
The team borrowed a simulator – DiskSim, from Garth Gibson’s Parallel Data Lab at CMU – and modified it to reflect SSD latencies and architecture. Features unique to SSDs, such as multiple request queues, logical block maps, and cleaning and wear-leveling state, were added.
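The flavor of such a simulator can be captured in a few lines: replay a trace of requests, charge each flash operation the latencies quoted earlier, and keep a queue per package. This is a cartoon of the idea only – it is not DiskSim or the authors’ extension, and the static striping and one-bus-per-package assumptions are mine.

    # Cartoon of a trace-driven SSD latency model (not DiskSim, not the paper's code).
    # Latencies (microseconds) from the Samsung figures quoted above; cleaning,
    # wear-leveling and interleaving within a package are all ignored.
    READ_US, PROGRAM_US, BUS_US = 25, 200, 100
    NUM_PACKAGES = 8
    PAGE_KB = 4

    def simulate(trace):
        """trace: list of (arrival_us, lba, size_kb, op) tuples, op in {'read', 'write'}."""
        package_free_at = [0.0] * NUM_PACKAGES      # one request queue per package
        total_latency = 0.0
        for arrival, lba, size_kb, op in trace:
            pkg = lba % NUM_PACKAGES                # naive static striping by LBA
            pages = max(1, size_kb // PAGE_KB)
            per_page = (READ_US if op == "read" else PROGRAM_US) + BUS_US
            start = max(arrival, package_free_at[pkg])
            finish = start + pages * per_page
            package_free_at[pkg] = finish
            total_latency += finish - arrival
        return total_latency / max(1, len(trace))

    trace = [(i * 50.0, i * 7, 4, "write") for i in range(1000)]
    print(f"mean latency: {simulate(trace):.1f} us")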

Workloads
They used a collection of workload traces – TPC-C, Exchange, IOzone and Postmark – as well as a group of microbenchmarks generated by DiskSim.

The TPC-C trace came from a large-scale configuration comprising 14 HP MSA1500 FC controllers, each supporting 28 disks of 36 GB. Exemplifying the current high-end OLTP problem, each controller had over a terabyte of disk, but the benchmark used only 160 GB of that capacity.

The Exchange server was similarly over-configured, with 6 RAID controllers each managing about 1 TB of capacity, while the 15-minute trace touched only 250 GB of that, with roughly 3 reads for every 2 writes.

Microbenchmarks
These were run using 4 KB I/Os. With cleaning enabled, the write operations include the extra overhead; sequential I/Os incur less of it. Note that cleaning takes roughly a 30% bite out of the random write rate.

Trade-off summary
The researchers looked at several design techniques:

  • large allocation pool
  • large page size
  • over provisioning
  • ganging
  • striping

These deserve some explanation.

A large allocation pool is convenient for achieving performance – it gives the controller more freedom in placing data for load balancing and wear leveling – but there is a cost. If the page size is small, there is more overhead in managing the pages.

If the page size is large, the pages are easier to manage, but writes smaller than the page size require a read-modify-write operation, which kills performance.
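The read-modify-write penalty is easy to put numbers on, using the rough latencies quoted earlier; a hedged sketch, since real parts and FTLs will differ in the details:

    # Cost of a small write when the flash page is larger than the write,
    # using the rough latencies quoted above (microseconds).
    READ_US, PROGRAM_US, BUS_US = 25, 200, 100

    def write_cost_us(write_kb, page_kb):
        """Full-page writes just program; sub-page writes must read-modify-write."""
        if write_kb >= page_kb:
            return PROGRAM_US + BUS_US
        return (READ_US + BUS_US) + (PROGRAM_US + BUS_US)   # read old page, merge, rewrite

    print(write_cost_us(4, 4))    # 4 KB write to a 4 KB page  -> 300 us
    print(write_cost_us(4, 8))    # 4 KB write to an 8 KB page -> 425 us
    print(write_cost_us(1, 8))    # 1 KB write to an 8 KB page -> 425 us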

Over-provisioning – reserving more physical flash than the advertised capacity – reduces the cleaning overhead, at the cost of more expensive storage.
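Why spare capacity helps: a cleaned block’s still-valid pages have to be copied somewhere before the erase, and the fuller the flash, the more valid pages each cleaned block tends to hold. The estimate below is a crude textbook approximation under uniform random writes, not a result from the paper.

    # Crude estimate of cleaning overhead vs. over-provisioning under uniform
    # random writes: if user data fills a fraction u of the flash, a cleaned
    # block is roughly u full, so freeing its (1 - u) pages costs about
    # u / (1 - u) extra page copies per user write. Illustrative only.
    def extra_copies_per_write(over_provision):
        u = 1.0 - over_provision
        return u / (1.0 - u)

    for spare in (0.07, 0.20, 0.40):
        print(f"{int(spare * 100)}% spare -> ~{extra_copies_per_write(spare):.1f} extra page copies per write")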

Ganging requires more explanation. A flash package is made of one or more dies, or chips. The serial interface to the flash packages is a primary bottleneck for SSD performance. Spreading a write across multiple serial interfaces is an obvious way to improve performance. The cost comes in the interconnect density between the controller and the flash packages.

If a write can be interleaved across multiple flash packages, read or write bandwidth can be substantially improved. The ability to place multiple packages in an SSD, and to interleave operations across those packages, is key to the performance improvements that SSD vendors have been advertising.
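A sketch of the interleaving payoff, again using the latencies quoted earlier and assuming each package has its own serial bus (my simplification): the slow program operations overlap, so a large write finishes roughly packages-times faster.

    # Striping a multi-page write round-robin across packages that operate in
    # parallel; rough figures from the latencies quoted above (microseconds).
    PROGRAM_US, BUS_US = 200, 100

    def write_time_us(pages, packages):
        """Time to write `pages` 4 KB pages striped over `packages` packages,
        each with its own serial bus (a simplifying assumption)."""
        per_package = -(-pages // packages)            # ceiling division
        return per_package * (BUS_US + PROGRAM_US)

    print(write_time_us(8, 1))    # 2400 us on a single package
    print(write_time_us(8, 4))    #  600 us across four packages
    print(write_time_us(8, 8))    #  300 us across eight packages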

The StorageMojo take
This paper is too rich in detail to summarize well. If understanding SSD controller design is important there is no substitute for a careful read.

The net is that engineers have many options in configuring and managing flash devices inside a solid state disk. The interaction of these design choices with applications is likely to remain a fruitful area of study for years to come.

Expect to see many performance oddities as new solid state disk designs are released. This is a different world than disk drives. There is much innovation and much to learn.

A larger, longer-term trade-off is the extent to which SSD vendors should attempt to alter operating system behavior to better match SSDs. In the short term designers must conform to today’s disk-I/O-oriented operating systems. In the long term, however, there must be major opportunities to tweak operating systems to enhance solid-state disk performance.

For this reason SSDs may find their best short-term market to be inside storage arrays, where array vendors have complete control over the interface to the array software. This will be no small advantage as array vendors struggle to remain relevant in a world where high-performance solid state disks have the potential to replace midsize arrays.

Comments welcome, of course.

Update:
Ted Wobber kindly wrote in with a comment I’m reproducing in full, since he does a better job of getting to the heart of the matter than I did:

I think the bottom line is that flash devices are a lot more complicated than you might think they would be. At first glance, the conventional wisdom is that something constructed out of solid-state circuitry should be fundamentally simpler than a device with very small parts moving at high speed. However, you have to remember that NAND-flash is built on quantum tunneling, and while the software layers that build up from there don’t involve advanced physics, the properties of the medium create complexities and tradeoffs that might not be expected.

We don’t talk with SSD vendors at a great level of detail since we’d prefer not to be under NDA unless there is a good reason. However, informal discussions and other materials I’ve seen have convinced me that our evaluation of the state of affairs isn’t far from the truth. It’s my opinion that most manufacturers are well aware of these sorts of tradeoffs, and they carefully consider them along with the requirements of their target markets and cost structures. The point of our article was to talk about these tradeoffs in an academic forum unconstrained by IP issues, and to begin to tease apart the tangle of related issues.

In sum, SSDs constitute a marvelous step forward and are really useful in many applications. However, they are not a panacea, at least not yet.

/Ted

Thank you, Ted.