The coming all flash array/NVMe/PCIe SSD dogfight

by Robin Harris on Thursday, 23 February, 2017

In this morning’s post on ZDNet on the diseconomies of flash sharing I discuss the fact that many NVMe/PCIe SSDs are as fast as most all flash arrays (AFA). What does that mean for the all flash array market?

Short answer: not good
Today a Dell PowerEdge Express Flash NVMe Performance PCIe SSD offers – ok, is spec’d to offer – ≈700,000 IOPS, with gigabytes per second of bandwidth. That’s in the range of many AFAs. The NVMe/PCIe SSD does all that for thousands of dollars, not $100k or more. And you can order one from NewEgg or Dell.
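
To put rough numbers on that gap, here is a back-of-envelope sketch in Python. Only the ≈700,000 IOPS figure and the $100k array price come from the text above. The $5,000 SSD price is an assumed stand-in for "thousands of dollars", and the array is assumed to deliver the same IOPS as the SSD, so treat the result as illustrative rather than a measured comparison.

```python
# Illustrative cost-per-IOPS comparison.
# From the post: ~700,000 IOPS for the SSD, "$100k or more" for an AFA.
# Assumptions (not from the post): the $5,000 SSD price and the AFA
# delivering the same IOPS as the SSD.

def dollars_per_kiops(price_usd: float, iops: float) -> float:
    """Return the price in dollars per thousand IOPS."""
    return price_usd / (iops / 1000.0)

SSD_IOPS = 700_000        # spec'd figure quoted in the post
SSD_PRICE_USD = 5_000     # assumption: stand-in for "thousands of dollars"
AFA_IOPS = 700_000        # assumption: same performance as the SSD
AFA_PRICE_USD = 100_000   # lower bound quoted in the post

ssd_cost = dollars_per_kiops(SSD_PRICE_USD, SSD_IOPS)
afa_cost = dollars_per_kiops(AFA_PRICE_USD, AFA_IOPS)
ratio = afa_cost / ssd_cost

print(f"SSD: ${ssd_cost:.2f} per thousand IOPS")
print(f"AFA: ${afa_cost:.2f} per thousand IOPS")
print(f"AFA / SSD cost ratio: {ratio:.0f}x")
```

Under those assumptions the array costs about 20 times as much per thousand IOPS. Change the two assumed constants to match real quotes and the ratio moves accordingly.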

There are two obvious problems with the idea that NVMe/PCIe SSDs can take a major piece of the AFA market.

  • Services. AFAs offer many services that enable managing and sharing the storage. NVMe/PCIe SSDs are drives, leaving the management up to you.
  • Sharing. Put an AFA on a SAN and you have a shared resource. Any PCIe device is marooned in its host server.

But if hyperscale datacenters have taught us anything, it is that shared-nothing clusters can offer many services and share hardware. All it takes is an appropriate software layer and lots and lots of network bandwidth.

With the rapid advent of 25 Gb/sec and faster Ethernet, the bandwidth issue is manageable. That leaves the software.

Given the size of the market opportunity, the software should arrive soon.

The StorageMojo take
AFAs and their more cost-effective hybrid brethren aren’t disappearing. There will always be applications where they will make sense, and a cadre of people who just don’t like NVMe/PCIe SSDs for enterprise work.

But I think this will be a hot area of contention, since most of the SSD vendors don’t make AFAs. They have little to lose by pushing NVMe/PCIe SSDs for broad adoption.

But it will mark the beginning of the end for array controllers as service platforms. Why rely on a doubly redundant array controller when you can rely on a virtually immortal cluster to host services?

This is going to be fun to watch.

Courteous comments welcome, of course.

{ 8 comments… read them below or add one }

Fazal Majid February 23, 2017 at 2:41 pm

You forgot to mention latency, where the PCIe DAS thoroughly trounces the SAN AFA by easily one order of magnitude. For many applications like OLTP databases, that has a huge impact on real-world performance and throughput.

Wes Felter February 23, 2017 at 2:44 pm

If an AFA is $100K and its hardware is $10K then clearly the software is worth $90K. It will get cheaper over time but disaggregating software and hardware isn’t magic; a software layer enabling sharing of SSDs is exactly the same thing as the firmware inside an AFA. AFAIK SolidFire, XIV, and SVC are already available in software-only versions.

Robin Harris February 23, 2017 at 6:17 pm

Wes, good to see you comment. But I must disagree on a couple of points.

The AFA controller software I’m aware of doesn’t scale across hundreds or thousands of nodes. With that level of scaling, I’d expect – and most customers would too – that the software cost per unit of goodness would decline significantly.

In addition, I’d argue that much of what customers pay for when they buy an AFA – or any hardware array – is the integration the vendor has done and the promise of one throat to choke to get it fixed or improved. But much depends on the (hypothetical) software vendor’s channel strategy.

I guess we’ll just have to wait and see when and how this pans out.

Steve Chalmers February 23, 2017 at 6:29 pm

This will absolutely be fun to watch, I agree!

I made a long comment on this morning’s article and won’t repeat that here.

Over time, byte addressable storage class memory will allow the same kind of transformation we saw in network routers (L3 switches) a decade or two ago, where there is very lightweight (think tens of instructions) execution in the data plane, and all the things we think of as done by storage systems today are done in a control plane (in drivers which set up address mapping and access rights tables, rather than in lines of code which execute with each I/O). In this best and highest use of byte addressable storage class memory, I assure you the storage system and network have just as many lines of code as they do now (just look at the 10 million+ lines of code it takes for a full function layer 3 Ethernet switch). They’re just implemented differently.

Happy to talk if this is from an unfamiliar perspective…


Ryan February 23, 2017 at 8:10 pm

Robin, do you see Microsoft’s S2D solution as a player in this market? They’re emphasizing NVMe/PCIe SSD support in their shared-nothing architecture.

Robin Harris February 24, 2017 at 9:26 am

Ryan, S2D is a form of server managed storage, without the need for an array – flash or otherwise – to provide the capacity. Unless there are bottlenecks in the Windows Server implementation, using high performance NVMe/PCIe SSDs would simply add more go-fast to the cluster. When you look at the MS commitment to large physical memories – 24TB – and their work on storage class memory – 3D XPoint – it looks like they’re all in on using all available hardware technologies to enable customers to build the fastest possible Windows clusters.

Bottom line: MS doesn’t have a dog in this fight, but they are fully exploiting new hardware opportunities as they come up. They aren’t helping the array vendors, but hurting them is collateral damage.

Wes Felter February 24, 2017 at 11:55 am

Robin, I think these are both good points but it seems like they could go either way.

Software that’s more scalable tends to be more complex and thus costs more to develop; if something costs more but we’re paying less for it, what are the implications?

Integration is a lot of work and thus it’s really valuable. In a disaggregated world we’ve seen different approaches to this. Enterprise software tends to be certified against hardware (MS and VMware are the masters of this), but that’s N times as much integration work as an appliance so again I wouldn’t expect it to be cheaper. Open source tends to have a DIY approach to integration and this goes wrong unless you have skilled staff to do it; because the people are largely a fixed cost this is more expensive at small scale and cheaper at large scale. And in the middle you have “enterprise-washed” open source where it’s certified and pre-integrated but if you modify it you lose the certification.

In my lab I have an A9000 (you aren’t even allowed to install it yourself) and a Ceph cluster so we’ll see.

billybathgates February 28, 2017 at 11:19 am

I’m skeptical, Fazal. Would this really be an order of magnitude faster than a SAN array? (Unlike many of you, I’m not a vendor selling anything; I’m a storage end user.)

What is the typical percentage of latency in the (properly working) SAN itself, in the array, and in the host stack? Usually it has been the storage backend (media latency), not the array or SAN, that is the biggest contributor, although obviously it’s a much smaller piece of the pie for fast flash, e.g. flashcore, and still a moving target.

Also, are you talking about just that single host to the DAS? To use this in a generally useful way (a shared resource), you would still have to provide network access to the blocks via some sort of “software defined storage”.

As kind of mentioned by others, arrays (and SDS) provide additional functions that take time and add latency, but are useful (compression, dedupe, etc.).

What is the exact situation where you get the order of magnitude?
