Flash drives are known to have latency issues. The requirement to erase and program large blocks – even for small writes – means that if the drive runs out of free blocks a 50+ ms delay is possible while garbage collection works to provide one.

Since free blocks are used up in write-intensive operations, these slowdowns occur when the system is busiest and rapid response is critical – hence “write cliff.” Vendors understand the problem and take measures to reduce or eliminate free block exhaustion.
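To make the mechanism concrete, here is a minimal sketch of a toy drive, not a model of any real SSD. The function name, the pool size and the reclaim rate are all made-up assumptions; only the 50 ms stall comes from the figure above. Each write takes a pre-erased block, background garbage collection hands one back every few writes, and a write that finds the pool empty waits for an erase.

```python
def simulate_writes(n_writes, free_blocks, reclaim_every,
                    fast_ms=0.1, stall_ms=50.0):
    """Return per-write latency in ms for a toy flash drive.

    Toy model (all parameters are illustrative assumptions):
    - each write consumes one pre-erased free block
    - background garbage collection returns one block every
      `reclaim_every` writes
    - with no free block left, the write waits `stall_ms` for a
      foreground erase before it can proceed
    """
    latencies = []
    for i in range(1, n_writes + 1):
        if i % reclaim_every == 0:
            free_blocks += 1              # background GC frees a block
        if free_blocks > 0:
            free_blocks -= 1              # normal case: use a free block
            latencies.append(fast_ms)
        else:
            latencies.append(stall_ms)    # pool empty: wait for an erase
    return latencies


# A sustained burst that consumes blocks faster than GC returns them.
lat = simulate_writes(n_writes=1000, free_blocks=200, reclaim_every=4)
first_stall = next(i for i, ms in enumerate(lat) if ms >= 50.0)
print("first stalled write (0-based index):", first_stall)
print("mean latency before the cliff, ms:", sum(lat[:first_stall]) / first_stall)
print("mean latency after the cliff, ms:",
      sum(lat[first_stall:]) / len(lat[first_stall:]))
```

In this toy, nothing looks wrong until the pool drains. After that, most writes pay the full stall, which is the cliff shape the name describes.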

Update: This is another in a series of posts on SSDs. The first, “Are SSD-based arrays a bad idea?”, garnered some industry responses. The second, “SSD arrays: limits of architectural critiques,” was a response to one of those from Howard Marks. I use the term SSD in a narrow sense to refer to flash drives packaged in standard SATA/SAS disk drive form factors rather than, for example, the flash DIMM design of the Sun F5100.

In theory, one would expect little difference between flash packaged as DIMMs or as disks. But this TPC-C data shows that the packaging makes a difference. End update.

So how does that work in real life?
Defining “real life” as audited TPC-C benchmarks simplifies things. While there’s wiggle room in the process – who would configure a real-world system with 4000 LUNs? – the basics are the same for everyone.

I took a look at the results from the last couple of years – all included SSDs except for the HP DL370 G6 (whose performance was competitive) – to see if any conclusions could be drawn about SSD performance. Focusing on the 2 most common transactions – New-Order and Payment – I graphed the results for the tested systems. The 2 non-SSD systems used either Sun’s F5100 or Violin Memory arrays.

The numbers measure how long it takes to complete a transaction that is typically made up of 10 or more I/Os. Thus small differences in I/O latency start adding up.
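A back-of-the-envelope sketch shows how this adds up. It assumes the I/Os in a transaction run one after another and stall independently, which real workloads need not do. The function names and the sample numbers (0.2 ms base latency, a 1% stall chance) are hypothetical; the 50 ms stall is the figure mentioned earlier.

```python
def p_transaction_stalls(p_io_stall, ios_per_txn):
    """Chance that at least one I/O in a transaction hits a stall,
    assuming each I/O stalls independently with the same probability."""
    return 1.0 - (1.0 - p_io_stall) ** ios_per_txn


def expected_txn_ms(base_ms, stall_ms, p_io_stall, ios_per_txn):
    """Expected transaction time when its I/Os are issued serially."""
    per_io = base_ms + p_io_stall * stall_ms
    return ios_per_txn * per_io


# Hypothetical inputs: 10 I/Os per transaction, 1% of I/Os stall for 50 ms.
print(p_transaction_stalls(0.01, 10))        # roughly 0.096
print(expected_txn_ms(0.2, 50.0, 0.01, 10))  # roughly 7 ms, versus 2 ms with no stalls
```

Under these assumptions, a stall that touches 1 in 100 I/Os touches nearly 1 in 10 transactions.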

In the graphs the systems that primarily use flash in non-SSD form factors are marked with an asterisk. The non-SSD flash storage comes from either Sun/Oracle or Violin Memory.

Let’s start with average response times.

Now let’s look at 90th percentile response times. I use different colors to alert readers to the fact that the scales are different: although all measurements are in seconds, the range of seconds on the X axis varies from graph to graph.

Now here are the maximum latency times. Note that some systems go out to more than 80 seconds.
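For readers who want to see how the three measures relate, here is a small sketch with made-up response times, not TPC-C data. The helper name is hypothetical, and the percentile uses the nearest-rank definition, which may differ from how the benchmark reports compute theirs.

```python
def summarize(samples):
    """Return (mean, 90th percentile, max) of response times in seconds.

    The percentile uses the nearest-rank method; interpolating
    definitions can give slightly different values.
    """
    ordered = sorted(samples)
    n = len(ordered)
    rank = (90 * n + 99) // 100           # ceil(0.90 * n), 1-based rank
    mean = sum(ordered) / n
    return mean, ordered[rank - 1], ordered[-1]


# Made-up data: 95 quick transactions and 5 that hit a long stall.
samples = [0.2] * 95 + [8.0] * 5
mean, p90, worst = summarize(samples)
print(mean, p90, worst)
```

In this example, the few slow transactions pull the mean well above the typical value and set the maximum, while the 90th percentile still looks healthy. That is why all three graphs are worth reading together.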

The StorageMojo take
It appears that, with the glaring exception of the SPARC Cluster, the 3 systems with the lowest latency at the average, 90th percentile and maximum response times do not use SSDs. Some SSD-based systems equal or exceed the non-SSD systems at some points, but overall the non-SSD systems seem to have an important latency advantage.

What about the SPARC Cluster? Since the SPARC Cluster uses the same storage as the much lower-latency SunFire, it is likely that the issue is with the cluster, not the storage. Perhaps someone can run DTrace and figure it out.

What does this mean? While correlation does not prove causation, the results suggest that the behavior we expect from SSDs – the write cliff – is seen in real life. If so, it means that the current measures taken by vendors aren’t solving the problem.

Of course, if we didn’t have non-SSD flash storage there wouldn’t be a “problem.” We’d just be comparing performance, thankful that we had an option to disk arrays. But since we are going to flash, shouldn’t we have the fastest flash?

But I’m open to hearing other views on these observed differences. What else could explain these results?

Courteous comments welcome, of course. I’ve been doing some work for Violin Memory.