Maybe software will eat the world, but sometimes the physical world gives software indigestion. That fact was evident at the Flash Memory Summit this month.
As mentioned in Flash slaying the latency dragon?, several companies were showing remote storage accesses – using NVMe and hopped-up networks – in the 1.5 to 2.5µsec range. That’s roughly 500 times better than the ≈1msec averages seen on today’s flash storage.
That’s amazing. Really. But will it help?
Fusion-io investigated sharing storage – to help amortize the cost of their PCIe cards across multiple servers – for years, but the tools weren’t there to make it work. Now, with NVMe and PMC’s Switchtec PSX PCIe Gen3 or Avago’s ExpressFabric PCIe storage switches, the hardware tools are there.
The software problem
But lopping off 998µsec from storage I/O isn’t the boost we’d like, because the storage stack is so freakin’ sl-o-o-w-w. How slow?
In a recent record-setting SPC-2 benchmark, an EMC VMAX 400k achieved 3.5ms response time with 64KiB transfers, 800 streams, and 4 IOs per stream.
However, looking at a recent TPC-C benchmark – which is at the application level, not the storage device level – we see minimum response times of 110ms and maximum response times of almost 10 seconds. Clearly, 1ms doesn’t make much difference.
Granted, the TPC-C results include application and database overhead – in this case SAP – not just the storage stack. But with all-flash arrays averaging under 1ms SPC response times, performance improvements need to come from the software.
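To put rough numbers on that, here’s a back-of-the-envelope sketch in Python. The 110ms application response time and ~1ms flash average come from the benchmarks above; the assumption of a single storage access on the critical path per transaction is mine, purely for illustration.

```python
# Back-of-the-envelope: how much does faster storage move application latency?
# The 110ms TPC-C response time and ~1ms flash average come from the text above;
# one storage access per transaction is an assumption for this sketch.

app_response_ms = 110.0     # minimum TPC-C response time cited above
storage_today_ms = 1.0      # ~1ms average flash storage access
fabric_us = 2.0             # ~2µsec remote NVMe access shown at FMS

saved_ms = storage_today_ms - fabric_us / 1000.0
new_response_ms = app_response_ms - saved_ms

print(f"Saved per transaction: {saved_ms:.3f} ms")
print(f"End-to-end improvement: {100 * saved_ms / app_response_ms:.2f}% "
      f"({app_response_ms / new_response_ms:.3f}x speedup)")
# => roughly a 1% gain: the software above the device dominates the latency.
```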
The StorageMojo take
The yawning chasm between SPC and TPC results calls into question the value of the SPC benchmarks. Great for vendors’ “plausible deniability” when customers complain about performance. But since storage is such a small portion of total latency, it’s obvious that software – and perhaps server hardware – are key to reduced latency and higher performance.
Software may be eating the world, but the days of easy performance boosts from new CPUs are over. Software has to step up to improve performance.
Courteous comments welcome, of course.
Great observations, Robin. Reviewing the results of technical advancements in a vacuum can set false expectations for real-world adoption. We’ve long stated that replacing disk storage with solid state storage will accelerate an application by 10X. Optimize an application and IO stack for flash / solid state and the gains grow to 100X or more.
Removing IO constraints is forcing the rewriting of the software stack. Viva la innovation!
– cheers,
v
There are two different problems here: average latency and maximum latency.
Often it is maximum latency that a customer really notices. An SSD drive has a big impact on how often a Windows machine “goes out to lunch”, but a hybrid drive doesn’t.
Average latency is attacked by “cutting the fat”, but maximum latency has a lot to do with the architecture of distributed systems and the expectations we have of them.
I lived in Germany in 1999, and back then the pipe to the U.S. had several seconds of buffer capacity. From morning to midday you would see the buffer fill up, and eventually you would have a long latency time AND 30% packet loss; if there had been no buffer at all, you could have had some packet loss without the latency, which would have been a lot more tolerable. I guess Deutsche Telekom was happy with the situation because it made VoIP impossible.
The moral is that (i) mechanisms that you think “buffer” the system from stress can make the reaction to stress worse, and (ii) adding parts to a system usually makes these problems worse.
When you average random variables, the central limit theorem applies and you get increased predictability. If you take the maximum, you converge on a different distribution with a violently long tail that gets worse as you add variables. For instance, if you have to query N database shards to answer a query, you have to wait for the slowest one; as you go from N=10 to N=100 to N=1000, the tail latency gets progressively worse.
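To make the shard fan-out effect concrete, here’s a quick Monte Carlo sketch. The lognormal per-shard latency distribution and its parameters are assumptions chosen only to show the shape of the problem, not measurements of any real system.

```python
# Monte Carlo sketch of shard fan-out: a query is only as fast as the slowest
# of its N shards, so the tail of the maximum grows with N.
# The lognormal per-shard latency (median ~10ms) is an assumption.
import random

def shard_latency_ms():
    return random.lognormvariate(2.3, 0.5)   # exp(2.3) ≈ 10ms median

def query_latency_ms(n_shards):
    # wait for the slowest of n_shards parallel shard queries
    return max(shard_latency_ms() for _ in range(n_shards))

def p99(samples):
    return sorted(samples)[int(0.99 * len(samples))]

for n in (1, 10, 100, 1000):
    samples = [query_latency_ms(n) for _ in range(2000)]
    print(f"N={n:5d}  p99 ≈ {p99(samples):6.1f} ms")
# p99 climbs steadily with N even though each individual shard is unchanged.
```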
A related problem is the hysteresis built into most servers. For instance, if you hit a web server with a pulse of requests you may increase the memory consumption dramatically and cause the machine to swap, and at that point the original volume of requests will keep the server in a bad state. In a distributed system, a slowdown at point A can cause things to back up at point B and thus snowball.
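A toy simulation of that snowball, with all numbers made up: once a single burst pushes the backlog past a hypothetical swap threshold, service slows down enough that the ordinary arrival rate alone keeps the backlog growing.

```python
# Toy model of the snowball: a one-time burst pushes the server past a
# (hypothetical) swap threshold; degraded service then can't keep up with
# the ordinary arrival rate, so the backlog never drains. Numbers are made up.

ARRIVALS_PER_TICK = 8      # steady offered load
NORMAL_SERVICE = 10        # requests served per tick when healthy
DEGRADED_SERVICE = 6       # requests served per tick while "swapping"
SWAP_THRESHOLD = 50        # backlog that pushes the server into swapping

backlog, swapping = 0, False
for tick in range(31):
    burst = 100 if tick == 5 else 0            # a single pulse of requests
    backlog += ARRIVALS_PER_TICK + burst
    swapping = swapping or backlog > SWAP_THRESHOLD
    backlog = max(0, backlog - (DEGRADED_SERVICE if swapping else NORMAL_SERVICE))
    if tick % 5 == 0:
        print(f"tick {tick:2d}: backlog={backlog:3d} swapping={swapping}")
# After the pulse, the original volume of requests keeps the server degraded
# and the backlog grows without bound.
```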
The answer to this involves having a different attitude about reliability; if you can’t accept latency of more than 0.01 sec, you have to stop the request at 0.01 sec – you have to trade latency for failures, and then you have to deal with the “half-done requests” and whatever error-handling problems that creates for the client. This is a big mental shift for a lot of people.
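A minimal sketch of trading latency for failures, assuming a hypothetical backend_call() and the 0.01 sec budget from the comment: every request gets a hard deadline, and requests that miss it come back as errors the client must handle.

```python
# Minimal sketch of trading latency for failures: cap every request at a
# deadline and surface the ones that miss it as errors instead of waiting.
# backend_call() and the 8 worker threads are assumptions for illustration.
import concurrent.futures
import random
import time

DEADLINE_S = 0.01   # the 0.01 sec budget from the comment above

def backend_call():
    # hypothetical backend: usually fast, occasionally slow
    time.sleep(random.choice([0.002, 0.004, 0.05]))
    return "ok"

def call_with_deadline(pool):
    future = pool.submit(backend_call)
    try:
        return future.result(timeout=DEADLINE_S)
    except concurrent.futures.TimeoutError:
        # The caller stops waiting here; the work may still finish in the
        # background, and the client now has a "half-done request" to handle.
        return "deadline-exceeded"

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = [call_with_deadline(pool) for _ in range(20)]
    print(f"{results.count('deadline-exceeded')}/20 requests gave up at the deadline")
```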