Over at Network Computing my friend Howard Marks took StorageMojo to task for questioning the use of commodity SSDs in solid state storage arrays. In his response he made several points about latency, bandwidth, cost and reliability.
[Robin] says that SAS/SATA stacks, being designed for disks with millisecond latency, aren’t optimized for the 50-microsecond latency of flash and are adding latency themselves. While that’s probably true for the volume managers and file systems of most operating systems, the 6-Gbps SAS/SATA chips used on server motherboards and RAID controllers were designed knowing they may be connected to SSDs and introduce minimal latency. After all, SSD-based array vendors like Nimbus Data and Whiptail can deliver latency of less than 250 microseconds from SAS SSDs.
OK, I’m impressed that Nimbus and Whiptail have achieved sub-250 µs latency from SSDs. But as I’ll document in a future post, flash-based storage latencies are all over the map. Some folks may have broken the code but not everyone has.
[Robin’s] second performance argument is that the 6-Gbps disk interface doesn’t provide enough bandwidth between the flash and processor. While this can be true for systems that use SAS expanders to multiplex the traffic for multiple SSDs through a single channel, the difference between a 6-Gbps SAS channel and an 8-Gbps, 16-lane PCIe 2.0 slot is 33%, not an order of magnitude.
Fair enough, although SandForce already has a controller chip that can saturate 6 Gbps SATA. That’s why SATA-IO is developing the SATA Express spec that enables SATA devices to use PCIe bandwidth up to 16 Gbps.
. . . Robin argues that the cost of DRAM on a DIMM is well over 90% of the cost of the DIMM, where the flash in an SSD accounts for only 50% to 65% of the cost of that SSD. My problem with this argument is that DIMMs are the very definition of a commodity. A DIMM is just 36 RAM chips and a very simple controller. . . .
SSDs are, by comparison, much more complex. An SSD maker, even one building SSDs from the industry parts bin, needs to choose a controller, select firmware options, add a RAM cache, over-provision the flash to improve performance and device life, and, for an enterprise SSD, add an ultra capacitor to power the flushing of the RAM cache to flash in the event of a power outage.
Howard is making the case for me, and I thank him. The issue is whether or not all these needed SSD features are best provided at the device or system level. If you can take them off the device and put them into the system, the device becomes more DIMM-like and my comment stands.
Yes, there are differences between DIMMs and NAND storage devices. But since NAND devices can be simpler than disks, they probably should be.
Robin’s reliability argument is based on Google’s experience with over 100,000 disk drives that showed that only about half of disk failures were in the mechanical marvel we call the head disk assembly. If, he argues, the SAS or SATA interface electronics cause half of all disk drive failures, why are we building SSDs with these failure-prone components? However, I’m pretty sure most of the electronics failures on hard disk drives aren’t the Marvell, LSI or PMC-Sierra SAS or SATA chip, but the head preamplifiers, read channels and other, more analog components on the hard-disk PC board.
I’d expect – but don’t know – that failures in the data path would likely be seen by SMART. But since about half the failures aren’t seen by SMART, where do they come from?
Howard supposes they come from analog components. The likelier suspects in my experience are power-related. But neither of us has hard data to settle the issue.
The StorageMojo take
Our discussion underlines the often inconclusive nature of architecture-based arguments. The gap between architecture and implementation can be too wide to resolve system-level issues.
Even if an architecture doesn’t support some function, clever engineering can often find another way to achieve the same result. On the other hand, poor implementation makes a mess of the most inspired architecture.
But the meta-role of architecture is that of a Platonic ideal. Even though real-world implementations will have bugs and economic trade-offs, an architecture is an ideal that we may fail to achieve, but that points to a better world.
Courteous comments welcome, of course. Tomorrow I shift my discussion with Howard to the realm of documented application-level outcomes. Stay tuned.
An interesting philosophical discussion. I think we’re both agreed that the proof of the pudding is in the eating so implementation is more important than architecture.
More battling blog posts to come.
I’ve commented on this elsewhere, but the popular misconception repeated here really needs to be dispelled.
“…since NAND devices can be simpler than disks, they probably should be.”
NAND devices cannot be simpler than disks, because both are fundamentally “block devices”. Even if Flash were an otherwise perfect storage medium, it would still be limited to block addressability only.
In fact, the opposite is true. Of necessity (in large part because of the huge disparity between read and write performance), NAND-based storage must use a far more complex layer of controller logic to achieve dollar-for-dollar performance parity with spinning disk for write operations.
Of necessity, every Flash-based storage product on the market (excepting thumb-drives and the like) must be and is constructed of an array of NAND chips. This introduces an array controller which, on an SSD, can be very tightly coupled to Flash. In the SSD model, aggregate array controller performance scales up with each additional SSD added. Moreover, because of slow write performance of Flash, a DRAM or SRAM write cache/buffer also must be used.
Add this all up and you find that every Flash implementation requires the literal equivalent of a write-caching RAID controller, regardless of where you put the Flash.
PCIe 2.0 x16 is 8GB, not 8Gb, per second. Slight difference there. (64Gbps).
So the improvement from 6Gbps SAS to 8GB/s PCIe is about 967% (64 ÷ 6 ≈ 10.7 times the bandwidth).
Which I’d say can be safely rounded up to “an order of magnitude”.
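For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the nominal figures quoted in this thread (6 Gbps for SAS, 8 GB/s for a PCIe 2.0 x16 slot) and ignores protocol and encoding overhead on both sides; the function and constant names are made up for illustration.

```python
# Back-of-the-envelope comparison using the nominal figures from this thread.
# Assumption: protocol and encoding overhead are ignored on both interfaces.

def percent_improvement(old_gbps, new_gbps):
    """Relative gain of new over old, expressed as a percentage."""
    return (new_gbps - old_gbps) / old_gbps * 100.0

SAS_GBPS = 6.0                   # nominal 6 Gbps SAS/SATA link
PCIE2_X16_GBYTES_PER_SEC = 8.0   # nominal PCIe 2.0 x16, in gigaBYTES per second
PCIE2_X16_GBPS = PCIE2_X16_GBYTES_PER_SEC * 8  # convert bytes to bits: 64 Gbps

# Reading "8" as gigabits gives the 33% figure from the original post.
misread = percent_improvement(SAS_GBPS, 8.0)

# Reading "8" as gigabytes gives the order-of-magnitude figure.
corrected = percent_improvement(SAS_GBPS, PCIE2_X16_GBPS)
ratio = PCIE2_X16_GBPS / SAS_GBPS

print(f"8 Gbps vs 6 Gbps:  {misread:.0f}% improvement")
print(f"64 Gbps vs 6 Gbps: {corrected:.0f}% improvement ({ratio:.1f}x)")
```

The whole disagreement comes down to the bits-versus-bytes conversion on the PCIe side: the same "8" gives a 33% gain read as gigabits and a gain of more than ten times read as gigabytes.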