Dear StorageMojo: should I go all SSD?

by Robin Harris | Tuesday, March 19, 2013 | Architecture, Enterprise, NAS, IP, iSCSI, SSD/Flash/NVRAM | 8 comments

This came in this morning’s email from a reader I’ll call Perplexed. How would you advise Perplexed?

I’m looking at a new iSCSI storage system for two sites with ~20 servers each – 10TB each should do it. Picture two fairly usual manufacturing/mining sites, 200-500 users, email, file/print, finance and production database services, MS Domain etc.

Looking at IOPS – we would be serviced by 24 x 2.5″ SAS 10K disks in a RAID6 array. So – the thought occurs – that SSD would easily match that performance with far fewer devices.

Say 15 VMs and 5 servers per location. Requirement for about 5TB of data with limited growth – let’s say 10TB storage and under 1000 IOPS.

Throughput is not an issue except for backup and DR. If we can saturate 2 or 3 Gigabit Ethernet links that is adequate.

This would be served comfortably by 24 x 10K 2.5″ RAID6 arrays at each location. Two of them for redundancy.

But – a single Intel 710 SSD could meet that IOPS rate and probably throughput as well. One SSD disk replacing an entire 24 disk array!

I would then ask, why have RAID at all? RAID is based on spindles being the smallest block for failure. With SSD, that block could be much smaller. The controller is already doing some ECC for wear management with overprovisioning.

Is there a new paradigm where the granularity is no longer a “spindle”? Should we simply over-provision by 50%? SSD generally comes with provisioned spare capacity – starting to sound like redundancy and error correction are built into the controllers to some degree already.

What would be ideal is a 1RU box full of 10TB solid state storage with 10G iSCSI – no separate disks.

Has SSD let us start to move beyond RAID? With the death of spindles and the huge IOPS available, is the entire R1, R5, R6, R10 debate finished? Does RAID have its place in a box full of chips, and if yes, does it look the same as what we know?

Has the world started to change in storage, or is SSD still just non-moving spindles?

10TB, 1000IOPS, 10G iSCSI – how would you buy it?

Readers, what say you?
What suggestions do you have for Perplexed? The IOPS are low and he doesn’t suggest heavy bandwidth requirements either. But he does seem very interested in reliability.
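
As a rough sanity check on Perplexed’s numbers, here is a minimal sketch of the usual rule-of-thumb estimate for a RAID array’s effective IOPS. The per-disk figure (140 IOPS for a 10K SAS drive), the 70/30 read/write mix and the RAID6 write penalty of 6 are assumptions for illustration, not figures from the email, and the estimate ignores any controller cache.

```python
def raid_effective_iops(disks, iops_per_disk, read_fraction, write_penalty):
    """Rule-of-thumb host-visible IOPS for a RAID group.

    Each host write costs `write_penalty` back-end I/Os (6 is the
    commonly quoted figure for RAID6); each host read costs one.
    """
    raw_iops = disks * iops_per_disk
    write_fraction = 1.0 - read_fraction
    return raw_iops / (read_fraction + write_fraction * write_penalty)


# Assumed inputs: 24 disks, ~140 IOPS per 10K disk, 70% reads, RAID6.
estimate = raid_effective_iops(24, 140, 0.7, 6)
print(f"Estimated array IOPS: {estimate:.0f}")
```

Under those assumptions the array lands somewhat above the 1000 IOPS target, which fits the view that IOPS is not the hard part of this requirement.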

Update: Vendors are welcome to comment. I only ask that you identify yourself as such. End update.

The StorageMojo take
Aside from cost – I’d expect a minimum of $4-$5 per gigabyte or ≈$100k+ for the storage – the low IOPS requirement means SSD could be overkill. Perhaps a hybrid SSD/disk solution? SSDs can and do fail, so relying on a single SSD is as dangerous as relying on a single HDD.
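
The cost arithmetic can be sketched as follows. The $4-$5 per gigabyte range is from the estimate above; the 10TB per site and two sites come from Perplexed’s email. Using decimal terabytes and ignoring any redundancy or spare-capacity overhead are simplifying assumptions.

```python
def flash_cost_usd(capacity_tb, sites, usd_per_gb):
    """Raw flash cost, with 1 TB taken as 1000 GB (an assumption)."""
    return capacity_tb * 1000 * sites * usd_per_gb


low = flash_cost_usd(10, 2, 4.0)   # lower bound at $4/GB
high = flash_cost_usd(10, 2, 5.0)  # upper bound at $5/GB
print(f"All-flash estimate: ${low:,.0f} to ${high:,.0f}")
```

Any mirroring or RAID overhead on top of the usable 10TB per site would push the total higher.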

A number of companies might be appropriate, including Nexsan, Nimble, TwinStrata, Nexenta, Nutanix, Tintri, Violin, Pure, Nimbus, Tegile and Avere among others. Some have features, such as WAN replication or cloud backup, that might prove useful. Others have VM support, but not with iSCSI.

Performance isn’t likely to be an issue with any of these vendors, so I’d focus on availability, management, support and then look at cost.

Courteous comments welcome, of course. I’ve done work for some of the companies mentioned.

8 Comments

Fazal Majid on Tuesday, 19 March, 2013 at 11:43 pm

We’re in the process of upgrading from servers with 24x10K rpm SAS to 1 Intel 910 drive (800GB capacity), but we have extremely high throughput/low latency requirements.

Not all SSDs are made equal, in terms of endurance, lack of the dreaded SSD performance cliff, and also how resilient they are in case of power failure. The failure modes of flash are also quite weird.

The only drives I’d trust are the Intel 910, FusionIO’s cards, the old Intel 320, the Intel SSD DC S3700 and the Crucial m4. Most of the enterprise SAS SSDs are probably fine as well.

PCIe SSDs are difficult to hot-swap, which may be a consideration based on their SLA.

For their system it would make more sense to load up on RAM for buffer caching, keep the spinning rust and add an NVRAM-backed RAID controller.

Jerry Leichter on Sunday, 24 March, 2013 at 7:57 am

Before going 100% SSD, you might want to read http://www.cse.ohio-state.edu/~zhengm/papers/2013_FAST_PowerFaultSSD.pdf – cited here as one of the best papers from FAST’13. Abstracting the abstract:

“[W]e propose a new methodology to expose reliability issues in block devices under power faults. Our framework includes specially-designed hardware to inject power faults directly to devices, workloads to stress storage components, and techniques to detect various types of failures. Applying our testing framework, we test fifteen commodity SSDs from five different vendors using more than three thousand fault injection cycles in total. Our experimental results reveal that thirteen out of the fifteen tested SSD devices exhibit surprising failure behaviors under power faults, including bit corruption, shorn writes, unserializable writes, metadata corruption, and total device failure.”

(We dealt with the inherent unreliability of disks by designing redundancy algorithms tuned to their failure modes. In fact, we’re *still* improving on those – witness the spread of erasure codes in place of more general codes. We’re just beginning to understand the failure modes of SSDs and what appropriate redundancy algorithms for them might be.)

– Jerry

Chris M Evans on Sunday, 24 March, 2013 at 11:15 am

There are a couple of things to think of here. First of all, rather than just reliability, there’s availability. You could dispense with the array altogether and use a couple of high end PCIe cards in a server. However, servicing a failure would mean downtime to remove the failing card and replace it.

You could replace the disks in an array with SSD; that too is a viable option. Devices such as the Drobo already take SSD as an acceleration layer and let you grow your capacity requirements as you need. This potentially is the ideal solution, where you use cache to accelerate the writes, with the ability to scale out capacity storage at the back end. You’d do this because most of your data doesn’t need the IOPS density (IOPS per GB) that flash/SSD can offer.

I’m not suggesting Drobo as the only solution, but it is one of them. I’d suggest that kind of hybrid approach could work best. Obviously if you aren’t price sensitive, then just use SSD in a redundant configuration.

Chris

Hans De Leenheer on Monday, 25 March, 2013 at 1:19 am

I have to agree with Chris that a hybrid solution would be a better choice in your case than all flash. All flash makes sense if you have HPC (High Performance Computing), hundreds of VDI desktops or a highly demanding DB application, which you don’t. And even with VDI, a lot of the hybrid solutions have proven to be sufficient. The killer is in the last word: sufficient. What is sufficient today might not be tomorrow. Don’t buy anything today that you are not ready to replace in 3 years. Yes, you heard me, not 5 but 3. In 3 years storage architectures WILL look different. Don’t ask me how much, but they will. If you look at what we bought less than 3 years ago and how it compares on value versus price with what is on the market today … nuff said! As for the choices Robin lists in this area, both Tintri and Nimble would be on my watchlist.

Now towards networking. You mentioned 10GbE. Make smart choices here! 10GbE is still pretty expensive, but this equipment will probably last longer than those 3 years. So don’t treat it as secondary and focus only on the storage part. Choose a platform where you are certain of decent development for storage protocols and whose hardware is good for high bandwidth, high memory, … And as for the protocols: although block may have been the clear choice a few years ago, I would definitely not dismiss file protocols today.

My last point is a somewhat radical move: if you don’t want to care about all that SAN/NAS/servers/… architecture, do have a look at Nutanix, a well-thought-out x86 “mainframe” cluster-in-a-box.

PS: things fail, like Jerry said, but if I drive 10 cars for 100 miles at 8000 rpm … I think I know the result. Just don’t push your limits; that’s why they are called limits. And yes, even then cars will break. Get your insurance.

Karl Katzke on Monday, 25 March, 2013 at 7:30 pm

In a word: No. No, you should not go all SSD.

SSDs are useful in the enterprise as caching disks and startup disks. With the current level of -production- non-vendor RAID systems (including Linux kernel software RAID and LVM) and implementations of TRIM, you should not be using SSDs as part of RAID arrays. You should especially not be using SSDs with RAID arrays in devices whose vendors do not fully support SSDs.

The problem is that when SSDs fail, they tend to fail all at once. While we haven’t recorded hard numbers in this situation, we’ve been using SSDs purchased in one batch to run a bunch of SQLite databases. We’ve tried configuring them as Linux Software RAID (but with XFS, as ext4 was a dog for some reason), with LVM striping, and with hardware RAID on the Adaptec controller. We’ve had at least half a dozen cases across 20 machines over 2 years of production where more than three of the SSDs in the group failed within the rebuild time of the array, even when we took the machine offline so that 100% of the disk bandwidth was used for the rebuild. This is logical: SSDs have a failure rate that goes up dramatically after a certain number of read/write cycles, and we’re presumably hitting that number of read/write cycles on all of them at once.

Spinning rust does not have that failure case, though. Where the failure rate against r/w cycles for SSDs is an almost perfect bell curve, the failure rate for spinning rust has a very long tail.
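
The bell-curve versus long-tail argument can be illustrated with a small Monte Carlo sketch. Everything numeric here is a made-up assumption for illustration: SSD lifetimes are drawn from a narrow normal distribution (same batch, same write load), disk lifetimes from an exponential distribution with the same mean, with 8 drives per group and a one-day rebuild window. For simplicity it only asks how often the second failure follows the first within the window, which is a milder condition than the multi-drive failures described above.

```python
import random


def second_failure_within_window(draw_lifetime, drives, window_days, trials, seed):
    """Fraction of trials in which the second drive to fail does so
    within `window_days` of the first failure in the group."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        lifetimes = sorted(draw_lifetime(rng) for _ in range(drives))
        if lifetimes[1] - lifetimes[0] <= window_days:
            hits += 1
    return hits / trials


def ssd_lifetime(rng):
    # Assumed: wear-out clustered tightly around 730 days (sd 10 days).
    return rng.gauss(730, 10)


def hdd_lifetime(rng):
    # Assumed: long-tailed failures with the same 730-day mean.
    return rng.expovariate(1 / 730)


ssd_risk = second_failure_within_window(ssd_lifetime, 8, 1.0, 10_000, seed=1)
hdd_risk = second_failure_within_window(hdd_lifetime, 8, 1.0, 10_000, seed=1)
print(f"Second failure inside rebuild window: SSD {ssd_risk:.1%}, HDD {hdd_risk:.1%}")
```

With lifetimes clustered this tightly, back-to-back failures are far more likely in the SSD group than in the long-tailed one. The real distributions are unknown, so treat the output as a picture of the argument and not as a measurement.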

Now, an exception to this is if you’re using the iSCSI enclosure to centralize storage, and you’re assigning an individual drive as a LUN and relying on your backups for redundancy. That doesn’t seem to be the way that you’re using these. Another exception is if you’re working with a SAN or other device that can create a mirrored pair (RAID1) on a spinning disk and an SSD. I know that some storage middleware, such as Oracle’s ASM, can handle this. With Linux software RAID and most hardware RAIDs, the extremely different write rates will force almost constant resyncs.

This limitation will obviously go away once the portion of the enterprise vendor community that cares about stability has developed better wear-leveling, wear detection, and possibly data protection algorithms, the latter being focused on proactively migrating data away from disks that are showing early signs of failure. Those things don’t exist right now (aside from SMART, which isn’t) because the long-tail failure modes of spinning rust allow arrays to rebuild or human operators to make things right again before the next failure. They aren’t out of the lab and in the mainline stable releases of operating system kernels or RAID cards yet. They are, however, starting to appear in SAN vendor solutions. They’re still immature in practice, though, despite what the salesdroid says.

I’m a bit of a curmudgeon because I’m laser-focused on reliability and uptime in my current role, but in my opinion, the only places where SSDs should currently be used in the enterprise are startup disks in machines that are easily replicated via some sort of configuration management system like Chef or Puppet, swap disks and temp partitions, and caching. Enterprise data is too valuable, and the time required to restore or recover from a failure too expensive, to trust your enterprise data to SSDs.

Chris McCall on Friday, 29 March, 2013 at 10:44 am

I agree with Hans and Chris, hybrid is the best route (full disclosure, I work for a hybrid storage company – NexGen Storage). There’s a lot of appeal around the performance solid-state brings to the table, but from the list of applications, it seems that not all of the apps require high performance. For some, capacity is more important. If you choose an all-SSD route, you’ll be paying a higher $/GB (even with deduplication technology) for applications that only need capacity. If all of your apps were extremely sensitive to latency and performance, all SSD would be the way to go.

Phil Robins on Thursday, 4 April, 2013 at 3:56 am

The failures of SSDs purchased a few years ago have been addressed in newer models. Those that do not focus so strongly on speed, using controllers from Hyperstone (or Novachips – a team from Indilinx), don’t have the fastest R/W and IOPS performance, but they offer stable performance and better wear levelling. It’s like anything really – the design gets better over time.

For this application you would do well to look at GreenBytes Desktop Virtualisation platform http://getgreenbytes.com/ – come to me if you need a quote… These are amazing in IO performance and can demo how 30TB can be squeezed to 300GB without a loss of performance!

KD Mann on Wednesday, 22 May, 2013 at 7:16 am

Perplexed, the reason you are perplexed is that you started with a bad assumption.

"Looking at IOPS – we would be serviced by 24 x 2.5″ SAS 10K disks in a RAID6 array. So – the thought occurs – that SSD would easily match that performance with far fewer devices."

The disk array you describe will have a substantial amount of DRAM-based write cache fronting those disks. Think of this DRAM write cache as “tier-zero” storage that is (because it is DRAM) about 100-1000x faster than Flash in terms of burst IOPS and response time.

Because modern operating systems and file systems are so fantastically good at read-ahead and metadata caching, read IOPS are less important than ever, and the vast majority of storage performance bottlenecks are now found in the form of high latencies on write operations. This is where Flash is weakest, so DRAM write cache wins this easily, at far lower costs than SSD.

You’ve talked about IOPS but not about response times, so the thought occurs that you probably don’t understand your workload yet. I would start there and try to understand the problem you are trying to solve, before picking a hardware solution.