A new Usenix paper looks at NAND flash SSD performance. It comes from a team at Microsoft Research and the University of Wisconsin, including Ted Wobber, who worked on last year’s A Design for High-Performance Flash Disks [see Flash chance for the StorageMojo take on that excellent paper – a post Ted was kind enough to review and comment on].

Design Tradeoffs for SSD Performance (by Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse and Rina Panigrahy) takes a deep dive into flash translation layer (FTL) issues. As the authors note, flash vendors keep their FTL designs secret, so the team developed a NAND flash simulator to look at how design choices affect performance.

What they found
They ran several workloads on their trace-based simulator, including TPC-C, Exchange and some file system benchmarks. They found several critical issues in SSD design.

  • Data placement – needed for wear leveling and load balancing.
  • Parallelism – single flash chips aren’t very fast, so they need to work together.
  • Write ordering – small random writes are a killer.
  • Workload management – you can optimize for sequential or random workloads, but managing both well is hard.

Canonical part
The paper’s discussion of flash memory is based on the spec for Samsung’s K9XXG08UXM 4 GB Single Level Cell (SLC) package. Other parts may differ, but NAND physics are the basic challenge.

The Samsung part has two 2 GB dies (chips) in the package. Each die has 8192 blocks – a block is 64 pages of 4 KB each – organized into 4 planes of 2048 blocks. The dies can be addressed independently, while cross-plane operations are limited to planes 0 & 1 or 2 & 3. Each page also has 128 bytes for metadata.

Cross-plane operations are a form of parallelism. The Samsung part also provides a copy-back operation so one page can be copied to another without transporting the data off the die. Copy-back is limited to copies within the same 2048-block flash plane.
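To make that geometry concrete, here is a minimal sketch (mine, not the paper’s) that checks the package capacity and splits a package-wide physical page number into die, plane, block and page. The constants come from the numbers above; the address ordering and function names are my own assumptions.

    # Geometry of the 4 GB Samsung-style package described above:
    # 2 dies, each with 4 planes of 2048 blocks, 64 pages of 4 KB per block.
    PAGES_PER_BLOCK = 64
    BLOCKS_PER_PLANE = 2048
    PLANES_PER_DIE = 4
    DIES_PER_PACKAGE = 2

    def decompose(ppn):
        """Split a package-wide physical page number into (die, plane, block, page)."""
        page = ppn % PAGES_PER_BLOCK
        block = (ppn // PAGES_PER_BLOCK) % BLOCKS_PER_PLANE
        plane = (ppn // (PAGES_PER_BLOCK * BLOCKS_PER_PLANE)) % PLANES_PER_DIE
        die = ppn // (PAGES_PER_BLOCK * BLOCKS_PER_PLANE * PLANES_PER_DIE)
        return die, plane, block, page

    def can_pair(plane_a, plane_b):
        """Cross-plane operations are limited to planes 0 & 1 or planes 2 & 3."""
        return {plane_a, plane_b} in ({0, 1}, {2, 3})

    total_pages = DIES_PER_PACKAGE * PLANES_PER_DIE * BLOCKS_PER_PLANE * PAGES_PER_BLOCK
    print(total_pages * 4 // (1024 * 1024), "GB of data pages per package")  # 4
    print(decompose(1_000_000))               # an arbitrary page somewhere on die 1
    print(can_pair(0, 1), can_pair(1, 2))     # True False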

Expensive writes
NAND flash is a type of EEPROM. About the only characteristics it shares with disks are block structure and persistence. To write – or, as the flash guys say, program – a page, the page must first be erased. And you can’t just erase a 4 KB page – you have to erase an entire block.

An erase operation takes 1.5 ms, making it considerably more expensive than a read or a write. To maintain a supply of empty blocks, a cleaning process – garbage collection – runs when the free block supply gets low.
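To see why writes are expensive, here is a toy sketch of the erase-before-write rule and the cleaning step – my own simplification, not the paper’s FTL: an overwrite lands in a fresh page, the old copy is merely marked stale, and the cleaner later relocates whatever is still valid before issuing the costly block erase.

    # Toy model of one flash block: pages can only be programmed when erased,
    # and the only way back to "erased" is to erase the whole block.
    ERASED, VALID, INVALID = "erased", "valid", "invalid"
    PAGES_PER_BLOCK = 64

    class Block:
        def __init__(self):
            self.pages = [ERASED] * PAGES_PER_BLOCK
            self.next_free = 0                 # program pages in order (typical for NAND)

        def program(self):
            """Program the next erased page; impossible once the block is full."""
            if self.next_free >= PAGES_PER_BLOCK:
                raise RuntimeError("block full - must erase the whole block first")
            self.pages[self.next_free] = VALID
            self.next_free += 1
            return self.next_free - 1

        def invalidate(self, page):
            """A logical overwrite lands elsewhere; the old copy is just marked stale."""
            self.pages[page] = INVALID

        def erase(self):
            """Erase the entire block (the 1.5 ms operation mentioned above)."""
            self.pages = [ERASED] * PAGES_PER_BLOCK
            self.next_free = 0

    def clean(block, relocate):
        """Garbage collection: move the block's valid pages elsewhere, then erase it."""
        for i, state in enumerate(block.pages):
            if state == VALID:
                relocate(i)                    # e.g. copy-back within the same plane
        block.erase()

    b = Block()
    old = b.program()                          # first write of some logical page
    b.invalidate(old)                          # overwrite: old physical copy goes stale
    new = b.program()                          # new data lands in the next erased page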

SLC flash is good for about 100,000 program/erase cycles per block, so not only do you have to manage the full block erasure problem, but you also have to manage the life span of each block – the wear-leveling problem.

[Wear-leveling will become even more acute with next-gen 3- and 4-level cells. Speculation is that the write spec could drop as low as 1,000 cycles per cell.]
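Wear-leveling, at its simplest, means tracking erase counts and handing out the least-worn free block first. A minimal sketch under that greedy assumption; real FTLs also move cold data around, which this ignores.

    import heapq

    ERASE_LIMIT = 100_000          # rated program/erase cycles for the SLC part above

    class WearLeveler:
        """Hand out the least-erased free block first (a crude greedy policy)."""
        def __init__(self, num_blocks):
            self.erase_counts = [0] * num_blocks
            self.free = [(0, b) for b in range(num_blocks)]   # (erase count, block id)
            heapq.heapify(self.free)

        def allocate(self):
            _, block = heapq.heappop(self.free)
            return block

        def retire(self, block):
            """Block has been cleaned and erased; return it to the free pool."""
            self.erase_counts[block] += 1
            if self.erase_counts[block] < ERASE_LIMIT:
                heapq.heappush(self.free, (self.erase_counts[block], block))
            # otherwise the block is worn out and quietly dropped from the pool

    leveler = WearLeveler(num_blocks=8192)     # one die's worth of blocks
    b = leveler.allocate()
    leveler.retire(b)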

The paper tabulates the operational flash parameters for the Samsung part: a page read into the on-chip register takes 25 μs, programming a page from the register takes 200 μs, a block erase takes 1.5 ms, and moving a 4 KB page across the serial bus adds roughly 100 μs.

SSD controller architecture
The flash packages, of course, are only the building blocks of an SSD. Much of the magic comes from the architecture and optimizations of the SSD controller logic. The paper presents a generalized block diagram for an SSD controller.

Key elements:

  • Host interconnect – SATA, USB, FC, PCI-e.
  • Buffer management – for pending and satisfied requests.
  • Multiplexer – manages instruction and data transport along the serial connections to the flash packages.
  • Processor – manages request flow and the mappings from logical block addresses to physical flash locations (a minimal sketch of such a map follows this list).
  • RAM – for the processor.
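For the logical-to-physical mapping mentioned above, here is the smallest useful sketch I can offer – hypothetical names, far simpler than a production FTL: a table from logical block address to a physical location, remapped on every write because data never returns to the same physical page.

    # Minimal logical-to-physical map: every write allocates a fresh physical page
    # and remaps the LBA; the old physical page is left for the cleaner.
    # Names and structure are illustrative, not from the paper.
    class LogicalBlockMap:
        def __init__(self):
            self.map = {}                 # lba -> (package, block, page)
            self.stale = []               # physical pages awaiting cleaning

        def write(self, lba, new_location):
            if lba in self.map:
                self.stale.append(self.map[lba])   # old copy becomes garbage
            self.map[lba] = new_location

        def read(self, lba):
            return self.map.get(lba)      # None if never written

    ftl = LogicalBlockMap()
    ftl.write(42, ("pkg0", 17, 3))
    ftl.write(42, ("pkg0", 17, 4))        # overwrite goes to a new page
    print(ftl.read(42), ftl.stale)        # ('pkg0', 17, 4) [('pkg0', 17, 3)]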

On a cheap USB thumb drive all these elements may be integrated into a single chip. On a high-performance Fibre Channel SSD they may be separate components on their own PC board.

The size of the flash packages also has an impact on cost and architecture. A 32 GB SSD built with the Samsung parts would require 136 pins at the controller. Larger SSDs may not have enough pins for full interconnection between the controller and the flash packages, requiring additional engineering trade-offs.
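Back-of-the-envelope on that pin count, assuming roughly 17 signal pins per package (an 8-bit data bus plus control lines – my inference from the 136-pin figure, not a number stated in the paper):

    # Rough pin arithmetic for full controller-to-package connectivity.
    # ASSUMPTION: ~17 signal pins per package (8-bit data bus + control lines).
    PACKAGE_GB = 4                 # the Samsung part discussed above
    PINS_PER_PACKAGE = 17

    def controller_pins(ssd_gb):
        packages = ssd_gb // PACKAGE_GB
        return packages, packages * PINS_PER_PACKAGE

    print(controller_pins(32))     # (8, 136)  - matches the figure above
    print(controller_pins(256))    # (64, 1088) - why big SSDs can't fully interconnect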

Faking it
The team borrowed a simulator – DiskSim, from Garth Gibson’s Parallel Data Lab at CMU – and modified it to reflect SSD latencies and architecture. Features unique to SSDs, such as multiple request queues, logical block maps, and cleaning and wear-leveling state, were added.
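The flavor of such a simulator can be captured in a few lines: replay a trace of requests, charge each flash operation the latencies quoted earlier, and keep a queue per package. This is a cartoon of the idea only – it is not DiskSim or the authors’ extension, and the static striping and one-bus-per-package assumptions are mine.

    # Cartoon of a trace-driven SSD latency model (not DiskSim, not the paper's code).
    # Latencies (microseconds) from the Samsung figures quoted above; cleaning,
    # wear-leveling and interleaving within a package are all ignored.
    READ_US, PROGRAM_US, BUS_US = 25, 200, 100
    NUM_PACKAGES = 8
    PAGE_KB = 4

    def simulate(trace):
        """trace: list of (arrival_us, lba, size_kb, op) tuples, op in {'read', 'write'}."""
        package_free_at = [0.0] * NUM_PACKAGES      # one request queue per package
        total_latency = 0.0
        for arrival, lba, size_kb, op in trace:
            pkg = lba % NUM_PACKAGES                # naive static striping by LBA
            pages = max(1, size_kb // PAGE_KB)
            per_page = (READ_US if op == "read" else PROGRAM_US) + BUS_US
            start = max(arrival, package_free_at[pkg])
            finish = start + pages * per_page
            package_free_at[pkg] = finish
            total_latency += finish - arrival
        return total_latency / max(1, len(trace))

    trace = [(i * 50.0, i * 7, 4, "write") for i in range(1000)]
    print(f"mean latency: {simulate(trace):.1f} us")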

Workloads
They used a collection of workload traces – TPC-C, Exchange, IOzone and Postmark – as well as a group of microbenchmarks generated by DiskSim.

The TPC-C trace came from a large-scale configuration comprising 14 HP MSA1500 FC controllers, each supporting 28 disks of 36 GB. Exemplifying the current high-end OLTP problem, each controller had over a terabyte of disk, but the benchmark used only 160 GB of that capacity.

The Exchange server was similarly over-configured, with 6 RAID controllers each managing about 1 TB of capacity, while the 15-minute trace touched only 250 GB of that, with roughly 3 reads for every 2 writes.

Microbenchmarks
These were run using 4 KB I/Os. With cleaning enabled, the write operations include the extra overhead; sequential I/Os incur less of it. Note that cleaning takes roughly a 30% bite out of the random write rate.

Trade-off summary
The researchers looked at several design techniques:

  • large allocation pool
  • large page size
  • over provisioning
  • ganging
  • striping

These deserve some explanation.

A large allocation pool is convenient for achieving performance – it gives the controller more freedom in placing data for load balancing and wear leveling – but there is a cost. If the page size is small, there is more overhead in managing the pages.

If the page size is large, the pages are easier to manage, but writes smaller than the page size require a read-modify-write operation, which kills performance.
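The read-modify-write penalty is easy to put numbers on, using the rough latencies quoted earlier; a hedged sketch, since real parts and FTLs will differ in the details:

    # Cost of a small write when the flash page is larger than the write,
    # using the rough latencies quoted above (microseconds).
    READ_US, PROGRAM_US, BUS_US = 25, 200, 100

    def write_cost_us(write_kb, page_kb):
        """Full-page writes just program; sub-page writes must read-modify-write."""
        if write_kb >= page_kb:
            return PROGRAM_US + BUS_US
        return (READ_US + BUS_US) + (PROGRAM_US + BUS_US)   # read old page, merge, rewrite

    print(write_cost_us(4, 4))    # 4 KB write to a 4 KB page  -> 300 us
    print(write_cost_us(4, 8))    # 4 KB write to an 8 KB page -> 425 us
    print(write_cost_us(1, 8))    # 1 KB write to an 8 KB page -> 425 us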

Over-provisioning – reserving more physical flash than the advertised capacity – reduces the cleaning overhead, at the cost of more expensive storage.
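Why spare capacity helps: a cleaned block’s still-valid pages have to be copied somewhere before the erase, and the fuller the flash, the more valid pages each cleaned block tends to hold. The estimate below is a crude textbook approximation under uniform random writes, not a result from the paper.

    # Crude estimate of cleaning overhead vs. over-provisioning under uniform
    # random writes: if user data fills a fraction u of the flash, a cleaned
    # block is roughly u full, so freeing its (1 - u) pages costs about
    # u / (1 - u) extra page copies per user write. Illustrative only.
    def extra_copies_per_write(over_provision):
        u = 1.0 - over_provision
        return u / (1.0 - u)

    for spare in (0.07, 0.20, 0.40):
        print(f"{int(spare * 100)}% spare -> ~{extra_copies_per_write(spare):.1f} extra page copies per write")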

Ganging requires more explanation. A flash package is made of one or more dies, or chips. The serial interface to the flash packages is a primary bottleneck for SSD performance. Spreading a write across multiple serial interfaces is an obvious way to improve performance. The cost comes in the interconnect density between the controller and the flash packages.

If a write can be interleaved across multiple flash packages, read or write bandwidth can be substantially improved. The ability to place multiple packages in an SSD, and to interleave operations across those packages, is key to the performance improvements that SSD vendors have been advertising.
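A sketch of the interleaving payoff, again using the latencies quoted earlier and assuming each package has its own serial bus (my simplification): the slow program operations overlap, so a large write finishes roughly packages-times faster.

    # Striping a multi-page write round-robin across packages that operate in
    # parallel; rough figures from the latencies quoted above (microseconds).
    PROGRAM_US, BUS_US = 200, 100

    def write_time_us(pages, packages):
        """Time to write `pages` 4 KB pages striped over `packages` packages,
        each with its own serial bus (a simplifying assumption)."""
        per_package = -(-pages // packages)            # ceiling division
        return per_package * (BUS_US + PROGRAM_US)

    print(write_time_us(8, 1))    # 2400 us on a single package
    print(write_time_us(8, 4))    #  600 us across four packages
    print(write_time_us(8, 8))    #  300 us across eight packages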

The StorageMojo take
This paper is too rich in detail to summarize well. If understanding SSD controller design is important there is no substitute for a careful read.

The net is that engineers have many options in configuring and managing flash devices inside a solid state disk. The interaction of these design choices with applications is likely to remain a fruitful area of study for years to come.

Expect to see many performance oddities as new solid state disk designs are released. This is a different world than disk drives. There is much innovation and much to learn.

A larger, longer-term trade-off is the extent to which SSD vendors should attempt to alter operating system behavior to better match SSDs. In the short term designers must conform to today’s disk-I/O-oriented operating systems. In the long term, however, there must be major opportunities to tweak operating systems to enhance solid-state disk performance.

For this reason SSDs may find their best short-term market to be inside storage arrays, where array vendors have complete control over the interface to the array software. This will be no small advantage as array vendors struggle to remain relevant in a world where high-performance solid state disks have the potential to replace midsize arrays.

Comments welcome, of course.

Update:
Ted Wobber kindly wrote in with a comment I’m reproducing in full, since he does a better job of getting to the heart of the matter than I did:

I think the bottom line is that flash devices are a lot more complicated than you might think they would be. At first glance, the conventional wisdom is that something constructed out of solid-state circuitry should be fundamentally simpler than a device with very small parts moving at high speed. However, you have to remember that NAND-flash is built on quantum tunneling, and while the software layers that build up from there don’t involve advanced physics, the properties of the medium create complexities and tradeoffs that might not be expected.

We don’t talk with SSD vendors at a great level of detail since we’d prefer not to be under NDA unless there is a good reason. However, informal discussions and other materials I’ve seen have convinced me that our evaluation of the state of affairs isn’t far from the truth. It’s my opinion that most manufacturers are well aware of these sorts of tradeoffs, and they carefully consider them along with the requirements of their target markets and cost structures. The point of our article was to talk about these tradeoffs in an academic forum unconstrained by IP issues, and to begin to tease apart the tangle of related issues.

In sum, SSDs constitute a marvelous step forward and are really useful in many applications. However, they are not a panacea, at least not yet.

/Ted

Thank you, Ted.