Unpacking the data services vs performance metric debate. Why we should stop the IOPS wars and focus on latency.
IOPS is not that important for most data centers today because flash arrays are so much faster than the storage they replace. That’s why the first post was titled IOPS is not the key number.
The point of that post was that in the context of all flash arrays the greater benefit comes from lower latency, not more IOPS. Everyone agrees more IOPS aren’t much use once the needed threshold value is crossed. But lower latency is a lasting benefit.
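One way to make that concrete is Little’s Law: outstanding I/Os = IOPS x latency. The sketch below is a back-of-envelope illustration in Python – the queue depth, the IOPS requirement and the array latencies are all numbers I’ve assumed for the example, not benchmark data – showing two flash arrays that both clear a workload’s IOPS threshold, where the only remaining difference the application can feel is that the lower-latency array finishes each I/O sooner.

# Back-of-envelope sketch: why latency is the lasting benefit once the
# IOPS threshold is crossed. All numbers are hypothetical.
# Little's Law: outstanding I/Os = IOPS x latency, so at a fixed queue
# depth the achievable IOPS follow directly from per-I/O latency.

QUEUE_DEPTH = 64          # I/Os the application keeps in flight (assumed)
NEEDED_IOPS = 50_000      # what the workload actually requires (assumed)

arrays = [("disk array", 5.0), ("flash array A", 1.0), ("flash array B", 0.5)]

for name, latency_ms in arrays:
    achievable = QUEUE_DEPTH / (latency_ms / 1000.0)   # Little's Law
    verdict = "meets" if achievable >= NEEDED_IOPS else "misses"
    print(f"{name}: {achievable:,.0f} IOPS possible at {latency_ms} ms per I/O "
          f"({verdict} the {NEEDED_IOPS:,} IOPS need)")

Both flash arrays are “fast enough” by the IOPS test; what separates them is that array B completes every I/O in half the time, and that is the benefit that keeps paying off.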
The second post, Data services more important than latency? Not!, was more controversial. I was responding to a Twitter thread where an integrator CTO first asserted that customers don’t care about latency (true, but they should) and then questioned the data center savings that flash performance delivers.
My response: where has this guy been for the last 10 years? Hasn’t he noticed what flash has done to the market? Could he not wonder why?
What his tweets underscored is that we as an industry have done a poor job of giving customers the tools to understand latency in data center performance and economics. We clearly don’t understand it well ourselves.
Safety doesn’t sell
Compare this to auto safety. Fifty years ago Detroit argued that “safety doesn’t sell” because consumers didn’t care about it. They fought seatbelt laws, eye-level brake lights, head restraints, airbags and more because, they said, consumers didn’t want to pay for them.
Today, of course, safety does sell. There are easily understood (and sometimes controversial) benchmarks for crash safety that make it easy for concerned consumers to make safety-related choices. Not all do, but clearly safety is a constant in mass-market car ads today, showing how much market sentiment has shifted as consumers understood it meant keeping their children, family and friends safer.
When it comes to latency, the storage industry is where Detroit was 50 years ago. People like the CTO, who should know better, don’t.
The VMware lesson
VMware offers a more recent lesson. They offered a simple value proposition: use VMware and get rid of 80% of your servers.
That wasn’t entirely true, but it encapsulated an important point: you can save a lot of money. Oh, and there are some other neat features that come with VMs, like vMotion.
Give people a simple and compelling economic justification and they will change. But it has to be simple and verifiable.
Data services platform?
The rapid rise of the “data services platform” meme is a tribute to EMC’s marketing. Google it and you’ll see that until EMC’s VMAX SVP, Fidelma Russo, wrote about it a couple of weeks ago, it wasn’t even a thing. Now we’re debating it.
Likewise, asserting that data services are more important than performance contravenes 30+ years of experience with customers. Yes, data services are important – mostly because today’s storage is so failure prone – but give a customer a choice between fast enough and not fast enough with data services, and you’ll quickly see where data services sit in the pecking order.
EMC is changing the subject because the VMAX is an overpriced and underperforming dinosaur. Until they get the DSSD array integrated into the VMAX backend, it will remain underperforming.
The StorageMojo take
Is performance – thanks to flash arrays – a solved problem? Those who argue that flash arrays are fast enough for most data centers seem to think so. And they may be correct for a few years.
It’s easy to forget that we’ve had similar leaps in performance before, most notably when RAID arrays entered the scene almost 25 years ago. It took a few years for customers to start demanding more RAID performance.
What happened is what always happens: the rest of the architecture caught up with RAID performance. CPUs and networks got faster; applications more demanding; expectations higher.
Storage is still the long pole in the tent and will remain so for years, if not decades, to come. In the meantime we need to refocus customers from IOPS to latency.
How? A topic for future discussion.
Courteous comments welcome, of course.
Good article. Performance is what really matters, and latency is the key metric by which to measure performance. The inability to measure application workload latency has been a gap in the toolset for most storage professionals. I’d encourage you to check out Virtual Instruments. The core of our business is infrastructure performance management: helping customers solve performance issues and enhancing their ability to tune and optimize application workloads.
Robin,
First, we as an industry have to stop talking about IOPS and latency separately. An IOPS number should always specify the latency as well. We can quibble about whether that’s average, 90th-percentile or worst-case latency, but while IOPS tell you how much work the system is doing, latency is really the measure of how fast each piece of work gets done.
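To make that concrete, here is a minimal sketch in Python (illustrative only; the 10-second sample below is invented) that reports IOPS together with average, 90th-percentile and worst-case latency from recorded per-I/O completion times:

# Summarize a measurement window as IOPS plus latency percentiles,
# rather than quoting either number alone. Sample data is made up.
import statistics

def summarize(latencies_s, window_s):
    """Return IOPS, plus average, 90th-percentile and worst-case latency in ms."""
    iops = len(latencies_s) / window_s
    ordered = sorted(latencies_s)
    avg_ms = statistics.mean(ordered) * 1000
    p90_ms = ordered[int(0.9 * len(ordered)) - 1] * 1000
    worst_ms = ordered[-1] * 1000
    return iops, avg_ms, p90_ms, worst_ms

# Hypothetical 10-second window: 5,000 I/Os, mostly fast with a slow tail.
sample = [0.0004] * 4500 + [0.002] * 450 + [0.02] * 50
iops, avg_ms, p90_ms, worst_ms = summarize(sample, window_s=10)
print(f"{iops:.0f} IOPS @ {avg_ms:.2f} ms avg, {p90_ms:.2f} ms p90, {worst_ms:.1f} ms worst")

A bare IOPS figure hides the latency tail, which is exactly the part the application feels.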
My view is that there are a small number of customers (100–5,000 or so) who have single workloads big enough to require 500K+ IOPS at 1ms or lower latency. For those users, capacity and latency are all that should matter.
For most of the rest of us, even if we need a million or two million aggregate IOPS at <5ms latency, that's to support many workloads on many VMs. If one brand F system can deliver the performance without data services but I can buy four brand Q boxes and get both, I'm going with brand Q. The extra work of managing four devices is less than the work needed to live without data services.
In truth there's only fast enough and not fast enough. Newer, faster gear causes three feedback effects:
1: Our definition of fast enough rises to 105% of the fastest thing we've ever seen
2: Developers use higher performance hardware as an excuse for writing less optimized code
3: Some new software application or technique becomes practical because there's so much horsepower available.
Number 3 is an especially nasty one because the new tech is only barely usable on current hardware, so we need to chase faster hardware to make that new tech work better. All three are a big part of why a new technology like the AFA – or the disk array before it – gets a few-year head start before it's not fast enough any more.
Not to be overlooked is the Enterprise. The large vendors can throw a lot of experienced bodies at a problem. I recall a storage win when a big player pointed out that the competition had 3 nationwide support engineers. Dirty pool? Not really. If you are running a business, you might want to consider a lot of factors in a storage decision. Similarly, there are a number of companies that go with a “one throat to choke” strategy, which will dictate the storage choice. Every now and then they trot in the competition just to get a better price. But when issues arise (they do), there is no finger pointing because the vendor owns all the pieces (or most). That is a heck of a good thing. We have ALL been in the situation where vendor A says: “Appears to be an issue with vendor B’s kit, give them a ring… click”
I agree with Howard somewhat when he says that “developers use higher performance hardware as an excuse for writing less optimized code”. I don’t think they purposely write bad code; they just don’t get the right incentives to make their algorithms and code paths as efficient as possible. The answer to poorly performing software products is always “get faster hardware” (an opinion fully supported by hardware vendors, BTW).
Customers seem much more willing to spend big bucks on a new system that runs its software twice as fast. They wouldn’t dream of spending the same amount of money for software that runs twice as fast on their old hardware, however. Inefficient code not only requires more time, but it burns power, generates heat, and wears out hard drives.
Software vendors would spend a lot more time making every function more efficient if doing so translated directly to their bottom line. I’m sure the guy who built my house would have done a better job of insulating and installing more energy-efficient appliances if he got a cut of every dollar I saved in energy costs.
Andy,
Great point! It’s another big advantage for cloud suppliers because they DO have the incentive to improve the software as much as the hardware.
Robin
Re: Andy’s comment,
Please excuse my facetiousness, but new hardware never asks for a raise when “optimizing” an app to run at 2× the old speed…
Then, of course, optimizing an application won’t make any core OS services run faster; depending on how much time an app spends calling those (likely a lot if it’s running something clustered), many other scenarios come to mind.
This is before we think of the ability to depreciate new kit, energy consumption or space cost savings, and so on.
Then again, throwing hardware at problems can pose questions that require a modicum of advanced thinking. In-memory databases don’t fall into the no-thought-required category too often. Even reloading memory after a node failure is going to require some thought.
I’m inadvertently harking back to my ideal of having programming teams double as systems admin and ops, because of the overlap, and not only when you throw radical upgrades at the hardware. But that’s an ideal only for some shops, and unlikely for those with big audit requirements, e.g.
Andy’s point about inefficient code burning up resources is too true. But in reality, if you have really inefficient code, you have incumbent staffing anyhow, so what do you do? If they are good, they have told you when they expect to optimize. If they’re less than ideal, you’re not going to miraculously get improvements. HR headaches are so much less fun than taking delivery of shiny new iron…
My cynicism is that the era of copy-and-paste code being “good enough”, plus some judicious Stack Overflow action, is keeping sales humming. Buying hardware alone has been scary fun recently, even on very small budgets. Has the industry started acting like IBM, shipping new Z ‘frames to run long-forgotten code? You could take a devil’s-advocate position that there are enough libraries out there to do enough of the jobs people want done that all we are doing now is glue code and UI… but that’s enough cynicism already.
The same challenges are also playing out in Networking:
“More Bandwidth Doesn’t Matter (much)” – Belshe – https://docs.google.com/a/chromium.org/viewer?a=v&pid=sites&srcid=Y2hyb21pdW0ub3JnfGRldnxneDoxMzcyOWI1N2I4YzI3NzE2
or:
“Latency: The New Web Performance Bottleneck” – Grigorik https://www.igvita.com/2012/07/19/latency-the-new-web-performance-bottleneck/
My original point was that most programmers are rarely given an incentive to create really efficient code or go back and fix inefficient code. If it works “good enough” there is almost no pressure or incentive to make it better.
If a Toyota software engineer discovered a way to make all of their cars get an extra 1 mpg by changing the software controlling the fuel injection system, then Toyota would probably spend lots of time and money to get that code developed, tested, and deployed. It would do this because it knows customers watch mpg tests very carefully and often make buying decisions based on them.
But if a Microsoft software engineer wanted to make a change to improve the performance of one of the Windows base disk drivers, they might never get the chance. Even though that change might save every customer $5 in real costs over the life of their computer (and would save more than $1B globally), Microsoft might not even think about doing what is necessary to make that happen.
Why? Because the customer would never know that they are spending an extra $5 due to the inefficient code. The customer would never pay an extra $2 for the software in order to save the ambiguous $5. Microsoft would make the same amount of money with or without the fix. Microsoft does not have to pay for any of the extra power, cooling, or wear and tear to the hardware caused by the inefficient code.
How much inefficient code are we all running that is costing us unnecessary money just because of this condition?