From their earliest days, people have reported that SSDs were not providing the performance they expected. As SSDs age, for instance, they get slower. But how much slower? And why?

A common use of SSDs is for servers hosting virtual machines. The aggregated VMs create the I/O blender effect, which SSDs handle a lot better than disks do.

But they’re far from perfect, as a FAST ’15 paper, Towards SLO Complying SSDs Through OPS Isolation, by Jaeho Kim and Donghee Lee of the University of Seoul and Sam H. Noh of Hongik University, points out:

In this paper, we show through empirical evaluation that performance SLOs cannot be satisfied with current commercial SSDs.

That’s a damning statement. Here’s what’s behind it.

The experiment
The researchers used a 128GB commercial MLC SSD purchased off the shelf and tested it in two states: clean and aged. Aging was produced by issuing random writes ranging from 4KB to 32KB until the total volume written exceeded the SSD’s capacity, forcing garbage collection (GC).
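The aging procedure can be sketched in a few lines. This is a hypothetical stand-in, not the paper’s actual tool: it writes random-sized chunks (4KB–32KB) at random page-aligned offsets to an ordinary file standing in for a raw device, until the cumulative volume written exceeds the stated capacity.

```python
import os
import random

CAPACITY = 128 * 1024**3   # 128 GB, as in the paper
PAGE = 4096                # assume 4KB page-aligned offsets

def age_device(path, capacity=CAPACITY):
    """Issue random 4KB-32KB writes until total writes exceed capacity."""
    written = 0
    with open(path, "r+b") as dev:
        while written < capacity:
            size = random.choice([4, 8, 16, 32]) * 1024
            offset = random.randrange(0, capacity - size, PAGE)
            dev.seek(offset)
            dev.write(os.urandom(size))
            written += size
    return written
```

On a real SSD the same pattern dirties pages faster than the drive can quietly reclaim them, which is what puts the FTL into its steady-state GC behavior.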

They then tested performance in each state by replaying traces from the UMass Trace Repository, generating real I/Os to the SSD for three workloads: Financial, MSN, and Exchange.

In addition to clean and aged SSD performance, they tested each VM with its own partition on a clean SSD, and then ran the workloads concurrently on a single partition of a clean SSD.

They repeated the tests using an aged SSD, to notable effect:

IO bandwidth of individual and concurrent execution of VMs.

Much of garbage collection’s activity takes place in the over-provisioning space – the OPS of the title. While you can confine a single VM to a single partition, the over-provisioning space in an SSD is shared among all partitions – at least as far as the authors can tell.

Garbage collection
The authors ascribe the massive performance deltas to garbage collection. For those new to the issue: the basic unit of flash storage is the page – typically a few KB – and pages are grouped into blocks – typically anywhere from 128KB to 512KB.

But the rub is that entire blocks – not individual pages – have to be erased, so as pages are invalidated by overwrites there comes a time when the invalid pages must be reclaimed. Once the number of invalid pages in a block reaches a threshold, the remaining valid data is rewritten to a fresh block – along with other valid data – and the old block is erased.

Erasing a block takes many milliseconds, so one key issue is tuning the aggressiveness of GC against the need to minimize writes and so maximize flash’s limited write endurance. This is but one of the many trade-offs required in engineering the flash translation layer (FTL) that makes flash look like a disk.
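The GC cycle just described can be reduced to a toy model. This is a sketch under simplifying assumptions (the block size, threshold, and function names are illustrative – real FTLs are proprietary, as the next section notes): a block tracks valid and invalid pages, and once invalid pages cross a threshold, surviving valid pages are migrated to a fresh block and the old block is erased whole.

```python
from dataclasses import dataclass, field

PAGES_PER_BLOCK = 64   # e.g. a 256KB block of 4KB pages
GC_THRESHOLD = 48      # invalid pages that trigger collection (illustrative)

@dataclass
class Block:
    valid: set = field(default_factory=set)    # live page numbers
    invalid: set = field(default_factory=set)  # overwritten (stale) pages

def collect(block, fresh):
    """Threshold-driven GC: migrate valid pages, then erase the block."""
    if len(block.invalid) < GC_THRESHOLD:
        return 0                   # not yet worth a millisecond-scale erase
    moved = len(block.valid)
    fresh.valid |= block.valid     # rewrite surviving data elsewhere
    block.valid.clear()
    block.invalid.clear()          # erase: the whole block at once
    return moved                   # extra writes = write amplification
```

The `moved` count is the write amplification cost of each collection – exactly the extra data movement that eats into both performance and flash endurance, and that gets worse when unrelated VMs’ pages are mixed in the same blocks.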

Black box
But, as the researchers note, it is not possible to know exactly what is going on inside an SSD because the FTL is a proprietary black box.

Our work shows that controlling the SSD from outside the SSD is difficult as one cannot control the internal workings of GC.

GC is the likeliest explanation for the big performance hit when VMs share a partition. The GC process affects all the VMs sharing the partition, causing all of them to slow down. Here’s another chart from the paper:

(a) Data layout of concurrent workloads in conventional SSD and (b) number of pages moved for each workload during GC.

Another variable is the degree of over-provisioning in the SSD. Since flash costs money, over-provisioning adds cost. It may range from as little as 7% for consumer SSDs to as much as 35% for enterprise SSDs.
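The arithmetic behind those percentages is simple – assuming, as is common, that over-provisioning is counted as extra raw flash on top of the advertised user-visible capacity:

```python
def ops_gb(advertised_gb, ops_pct):
    """Hidden raw flash, assuming OPS is a percentage of advertised capacity."""
    return advertised_gb * ops_pct / 100

# For a 128GB SSD like the one tested:
consumer = ops_gb(128, 7)      # roughly 9 GB of hidden flash
enterprise = ops_gb(128, 35)   # roughly 45 GB of hidden flash
```

That spread – a few GB versus tens of GB of scratch space for GC – goes a long way toward explaining the price and performance gap between consumer and enterprise drives.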

Yet another variable is how the OPS is shared among partitions. If it is shared at the page level, extra data movement – and reduced performance – is virtually assured. But again, that is under the control of the FTL, and it is hard to know how each vendor handles it.

The StorageMojo take
Flash storage has revolutionized enterprise data storage. With disks, I/Os are costly. With flash, reads are virtually free.

But as the paper shows, SSDs have issues of their own. Until vendors give users the right controls – the ability to pause garbage collection would be a good start – SSDs will fail to reach their full potential.

My read of the paper suggests several best practices:

  • Give each VM its own partition.
  • Age SSDs before testing performance.
  • Plan for long-tail latencies due to garbage collection.
  • Pray that fast, robust, next-gen NVRAM gets to market sooner rather than later.

Comments welcome, as always.