Scale and the all-flash datacenter

by Robin Harris | Monday, May 9, 2016 | Architecture, Cloud computing & storage, Disk, Enterprise, SSD/Flash/NVRAM | 7 comments

There’s a gathering vendor storm pushing the all-flash datacenter as a solution to datacenter ills, such as high personnel costs and performance bottlenecks. There’s some truth to this, but its application is counter-intuitive.

Most of the time, storage innovations benefit the largest and – for vendors – most lucrative datacenters. OK, that’s not counter-intuitive. But in the case of the AFDC, it is smaller datacenters that stand to benefit the most.

Why?
It’s a matter of scale. Small, low data capacity datacenters are naturals for all flash. The initial cost may be higher, but the simplification of management and the generally high performance make it attractive.

Your databases go fast with little (costly) tuning and management. VDI is snappy. Performance-related support calls – and their costs – drop off. Ideally, SSD failure rates will be lower than those of HDDs, but make sure you’re backing up, due to the higher rate of data corruption on SSDs.

Scale drives this because even though flash may be only 5-10x the cost of raw disk capacity, as capacities grow the media cost – SSD and/or HDD – comes to dominate the total system cost. The costs for an array controller and associated infrastructure outweigh the media cost until some threshold capacity is reached.
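A rough sketch of that scale effect, in Python. It assumes a simple model in which total cost is a fixed controller and infrastructure cost plus a per-TB media cost, and that the fixed cost is the same for a disk build and a flash build. The dollar figures are hypothetical placeholders, not vendor prices; only the 5x multiple comes from the range quoted above.

```python
def system_cost(capacity_tb, fixed_cost, media_cost_per_tb):
    """Total cost = fixed controller/infrastructure cost + media cost."""
    return fixed_cost + capacity_tb * media_cost_per_tb


def flash_premium(capacity_tb, fixed_cost, hdd_per_tb, flash_multiple):
    """Ratio of all-flash system cost to disk system cost at one capacity."""
    flash = system_cost(capacity_tb, fixed_cost, hdd_per_tb * flash_multiple)
    disk = system_cost(capacity_tb, fixed_cost, hdd_per_tb)
    return flash / disk


# Hypothetical placeholder numbers, chosen only to show the shape of the curve.
FIXED_COST = 100_000   # controller + infrastructure, assumed equal for both builds
HDD_PER_TB = 100       # raw disk media cost per TB
FLASH_MULTIPLE = 5     # low end of the 5-10x range

if __name__ == "__main__":
    for tb in (10, 50, 300, 1_000, 10_000):
        ratio = flash_premium(tb, FIXED_COST, HDD_PER_TB, FLASH_MULTIPLE)
        print(f"{tb:>6} TB: all-flash system costs {ratio:.2f}x the disk system")
```

Under these assumptions the premium for flash is a few percent at small capacities and climbs toward the full media multiple as capacity grows, which is the pattern the paragraph above describes. Where the curve crosses your own tolerance depends entirely on the real numbers you plug in.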

This explains why Nimble’s average customer is interested in AFDCs, while Google, Facebook and AWS aren’t. Nimble’s SMB customers are a fair example of where an all-flash array (AFA) will often make sense.

Where is the cutoff? Today it looks like 250 to 350 TB is where it makes sense to include disk in your datacenter. It’s not likely that you’ll be pounding on 300TB enough to justify flash. But expect the cutoff to rise over time, as it has for tape.

The StorageMojo take
The scale vs cost issue isn’t new. Tape continues as a viable storage technology because the cost of the media is so low. But tape’s customer universe is limited because, for more and more users, backing up to disk or cloud object stores is a cost/functional equivalent.

What is new is that disk is going down the same path that tape has been on for decades. The bigger problem for HDDs is that the PC market – the disk volume driver – continues to shrink while flash takes a larger chunk of the remaining PC business. Disk vendors have to adjust to a lower volume market, just as tape vendors have.

Lest SSD vendors get complacent, the really high performance database applications are going in-memory. It’s a dogfight out there! And even more changes are in store.

Courteous comments welcome, of course. In your experience, where is the cutoff point?

7 Comments

Ernst Lopes Cardozo on Monday, 9 May, 2016 at 3:36 pm

Is disk to SSD really like tape is to disk? I found that the most demanding application consistently was the nightly backup. With tape, you can make only so many incremental backups, lest your restore take forever. So usually it was incremental during the week, then a full backup during the weekend. For a 7×24 shop, that was a problem. So we moved from tape to disk as backup medium, because disks can be updated. As long as your backup disks are in a known state, you only have to update the backup. Just make sure your backup disks never get corrupted, or you will have to do a full copy (while being unprotected).
So when disk took the place of tape in the hierarchy, the process changed. With disk as backup for SSDs, you can use the same processes, but updating your disk backup needs to be quick enough. You may need to write a sequential transaction log to SSD that is then processed asynchronously to update the backup disks.
SSD may take part of the management chore away, but unless you back up to SSD, the backup process will remain a critical process that has to be tuned and monitored.
Of course, backup serves to make your storage “indestructible”, it is not meant to be an archive where you pull out data from last decade’s project.

Jon Forrest on Monday, 9 May, 2016 at 5:36 pm

“Your databases go fast with costly little tuning and management.”

I think you mixed up some words. Shouldn’t this be

“Your databases go fast with little costly tuning and management.”

Robin Harris on Monday, 9 May, 2016 at 7:06 pm

Jon, right you are! Thanks for the catch.

Robin

Dennis on Tuesday, 10 May, 2016 at 8:26 am

A couple of other items to consider, which are leading us to replace all of our spinning disk (~500T) with flash at a neutral budget impact:

– Maintenance costs on aging disk go up over time
– Software to support the existing storage is being rolled into the HW cost of the flash arrays or running directly on the flash controllers (or not needed). For us, SW costs hit our budget in full the year of purchase. HW costs are depreciated, usually over a 3 year period. Due to the nature of flash storage and how long we keep our storage in general, we are pushing for a 5 or 7 year depreciation.

The age of our storage on the floor is necessitating a buy this year. It is pretty much to the point that we can’t justify not going all flash.

Kevin Stay on Thursday, 12 May, 2016 at 10:22 am

Regardless of datacenter “size”, the challenge in providing acceptable performance to any given application lies in guaranteeing performance in specific areas. (What those areas are and what performance is required in each obviously varies from application to application.) Were it possible, you would keep your “working set” of data all in L1 cache on the CPU. Since that is not really an option, we have seen L2, L3 and L4 caches introduced, then in-memory databases and now flash storage. In the end it is all about getting instructions onto the CPU pipeline and storing data efficiently the vast majority of the rest of the time.

If you can identify your “working set” for each system and the inter-system communications for each application then you can decide what makes sense there. So, the first challenge is effectively and efficiently identifying that for each system and collection of systems making up an application. The next challenge is having an architecture in place allowing you to then provide the properly sized resources for each system; again as efficiently as possible.

The “smaller” datacenter can now economically deploy both RAM and NAND (in various flavors) in capacities allowing identified working sets to be properly assigned to the best resource and provide vastly better performance than was possible at any price even a few years ago. At the same time, the difference between the size of working data sets and the total size of data needing to be stored is significant and probably always will be. I believe we still have a fair few years until AFA makes sense for the whole thing.

So, for me the goal is identification and proper classification of working data sets (>64k block random read etc.) then assigning them to the proper resource among RAM, NVMe, low cost SAS-TLC. For the next few years I see caching with relatively large amounts of each as preferable for most small datacenters.

Andy Lawrence on Monday, 16 May, 2016 at 2:05 pm

Data has always grown to fill the affordable storage capacity available. I don’t see that changing anytime soon. There are many tiers where our data resides now:

Tape
HDD
SSD
RAM
CPU caches
CPU registers

Upcoming NVRAM technologies like 3D XPoint will add even more layers. Each layer seems to differ by an order of magnitude for both capacity and price when compared to the level next to it.

One of the most important things with respect to data management has been how to get the right 1% (or 5% or 10%) of the data that makes up your working set into the next layer to speed up performance.

Big data has made it more important than ever to make sure metadata for each element is as compact as possible. The file system is the perfect example. Every file has a small metadata record (256 byte inode in Ext3, 4096 byte FRS in NTFS, etc.) None of those structures seem that big until you start multiplying them by tens of millions of files. If you have 100 million files on your NTFS volume then the file table alone takes up 400 GB. Try putting that in RAM to speed things up.
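The multiplication above is easy to check. The short Python sketch below uses the per-file record sizes quoted in the comment as inputs; they are taken as given here, and the actual record size on a real volume depends on how it was formatted. The function names are illustrative only.

```python
def metadata_footprint_bytes(file_count, record_bytes):
    """Space taken by fixed-size per-file metadata records."""
    return file_count * record_bytes


def human(n_bytes):
    """Format a byte count using decimal (power-of-1000) units."""
    value = float(n_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if value < 1000 or unit == "TB":
            return f"{value:.1f} {unit}"
        value /= 1000


# Record sizes as quoted in the comment above (assumed, not measured).
RECORD_SIZES = {"256-byte record": 256, "4096-byte record": 4096}

if __name__ == "__main__":
    files = 100_000_000
    for label, size in RECORD_SIZES.items():
        total = metadata_footprint_bytes(files, size)
        print(f"{files:,} files x {label}: {human(total)}")
```

With 4096-byte records, 100 million files come to roughly 410 GB of metadata, in line with the 400 GB figure in the comment. A smaller record size shrinks that proportionally, which is the case for keeping per-element metadata compact.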

Brian Politis on Tuesday, 2 August, 2016 at 8:25 am

My organization fits neatly into the parameters listed above.

Through this blog and some other sources we saw this shift coming and we actively managed our refresh budget cycle so that our VMWARE hosts and our SAN would be due for a major refresh simultaneously.
We are converting to All Flash VSAN across the board for our production workloads, including upgrading our pilot VMWARE hybrid array to All Flash.

One thought about combining all flash with HyperConverged that I don’t see mentioned often is Density. Flash can provide densities that don’t exist with disk.

A single VMWARE host in our environment can support 24 2.5″ disks. We are using 3.84TB capacity disks, so that host can carry 60+TB raw of disk even after dedicating host slots to caching disks. Once Erasure Coding\Dedupe\Compression are factored in, this host can hold nearly 100TB of our typical data sets. This is a density you can’t get with spinning SAS disks. You might come close with SATA 3.5″ disks, but the write performance for destaging from cache then falls off considerably. Given our write performance needs, we calculated we could make do with relatively few very large SSDs.
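The density arithmetic can be sketched in a few lines of Python. The slot count and the 3.84TB disk size come from the paragraph above. The number of slots held back from capacity (8) and the data reduction ratio (1.6x) are assumptions picked only because they land near the quoted 60+TB raw and nearly 100TB effective; they are not the actual configuration.

```python
def host_capacity_tb(slots, reserved_slots, disk_tb, reduction_ratio):
    """Return (raw_tb, effective_tb) for one host.

    reserved_slots: slots not used for capacity disks (cache devices etc.).
    reduction_ratio: combined gain assumed from erasure coding, dedupe
    and compression relative to raw capacity.
    """
    capacity_disks = slots - reserved_slots
    raw_tb = capacity_disks * disk_tb
    return raw_tb, raw_tb * reduction_ratio


if __name__ == "__main__":
    # 24 slots and 3.84 TB disks are from the comment; 8 and 1.6 are assumed.
    raw, effective = host_capacity_tb(24, 8, 3.84, 1.6)
    print(f"raw: {raw:.2f} TB, effective: {effective:.2f} TB")
```

Changing the reserved slot count or the reduction ratio shows how sensitive the per-host figure is to the cache layout and to how well a given data set compresses and dedupes.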

Given how VMWARE VSAN works, we have chosen to trade off spindle counts now that would allow higher speeds for de-staging writes from cache by using only 3 capacity disks per host in our initial build. Still, the performance far outstrips what we’ve historically engineered around on our NetApp filers. Reads are at the 1 ms latency level continuously for VMs. And although we see flutters in write performance, it is almost never above 5 ms even with only 3 “spindles” backing the write cache SSDs.

We now have room to triple our initial storage allotment by scaling out in small increments over time. The ability to scale up will always be there as well – although I don’t see our compute needs expanding past the 6 VMWARE hosts that are in each VSAN cluster today.

So one thing I tell colleagues who are considering all flash is to remember to consider density as well in your calculations. Depending on workloads and your actual performance requirements, using very high capacity SSDs, particularly combined with hyperconverged storage, can provide some outstanding value.

I’ve heard people complain that VSAN requires too many hosts and too large an investment on the host software side of the equation.

The density levels available with All Flash VSAN really shift those numbers in my opinion. Given that even greater SSD Densities will be coming to the market, Density is something that people should consider when looking at an All Flash Data Center. Beyond the initial purchase justifications, we foresee these density levels having a significant impact on our space and power budgets. Even if the business grows dramatically we will probably be able to shrink those budgets or at least maintain them at current levels in the coming years.