The Sun Oracle F5100 flash-DIMM array joined Violin Memory’s flash array as the storage behind the lowest-latency TPC-C benchmarks (see The SSD write cliff in real life for that data).
Alert reader KD Mann pointed out that the F5100’s flash DIMMs are simply “. . . a SATA SSD without the sheet metal wrapper.” He later contended that it was the speed of the redo log, not the storage devices per se, that drove latency.
A Reliable Source close to the F5100 told StorageMojo:
Yes the F5100 is just an SSD array. The DIMMs are just repackaged SSDs that were designed to optimize the amount of flash per controller based on a surface area computation from several years ago. I’d actually say that it’s much worse than a JBOD (JBOS?) since the modules aren’t hot swappable and the individual DIMMs get none of the benefits of economies of scale that typical SSDs do. The only advantage potentially is that they sized the SAS switches in the F5100 to push the IOPS limit of the DIMMs.
. . . The DIMM wasn’t a bad idea at the time, just in hindsight. The MicroSSD format with the MicroSATA connector (à la the MacBook Air) didn’t exist yet. And there was no clear standard for small-format SSDs.
I’ve asked a follow-up question about those generously sized SAS switches and will update the post when and if there’s an answer.
Update: Got a quick answer to the question:
The SAS switches were tuned to deliver optimal throughput and bandwidth for the devices, and to have enough capacity to max out all of them simultaneously. The germane comparison isn’t to SSDs — as I said, the miniDIMMs are just SSDs — it’s to the SAS switches in other JBODs.
End update.
The StorageMojo take
Mr. Mann’s argument that redo logs are the key TPC-C latency bottleneck is provocative. These benchmark systems are carefully configured and latency numbers are posted in the short summary documents, so it stands to reason that latency would get serious attention.
If Mr. Mann is correct, the benchmark teams are falling down on the job of showing their systems in the best possible light. How could that be?
Readers, what say you?
Courteous comments welcome, of course.
You should not discount the OS as a factor. Systems using the Sun array were likely running Solaris, whose ZFS can accelerate writes and reads separately with SSDs (a dedicated ZIL log device, a.k.a. Logzilla, for writes and the L2ARC read cache for reads).
Benchmarks on other storage arrays were probably run on Linux, Windows, HP-UX or AIX, which do not always have these optimizations and so cannot fully exploit the SSDs’ latency benefits.
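To make that concrete, here is a minimal sketch of how ZFS splits the two roles across separate flash devices; the pool name tank and the device names are illustrative assumptions, not taken from any published benchmark configuration:

    # Hypothetical Solaris/ZFS layout; "tank" and the cXtYdZ device names are assumptions
    zpool add tank log c2t0d0     # dedicated ZIL device ("Logzilla") absorbs synchronous writes
    zpool add tank cache c2t1d0   # L2ARC device extends the read cache beyond RAM
    zpool status tank             # the log and cache vdevs appear alongside the data disks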
The conventional wisdom of putting redo logs on SSDs is inaccurate, based on experience tuning Oracle on Texas Memory Systems’ RamSan. Redo logs have sequential I/O patterns that spinning rust handles very well. Enterprise-grade hard drives match SSD throughput; they don’t match SSD latency, but a good controller with an NVRAM write cache takes care of that. Most transactions are read-write, not write-only, so to optimize latency you also need to speed up access to the datafiles (assuming they don’t fit in the database’s buffer cache, which is no longer a given with an obsolete benchmark like TPC-C).
The optimal strategy in many cases is in fact to put key tablespaces on SSD rather than the redo logs or transaction journal.
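As a hedged sketch of that strategy (not the commenter’s actual setup), the hot tablespace goes on flash while redo stays on sequential-friendly disk behind the controller’s write cache; the mount points, file names and sizes below are assumptions:

    # Hypothetical Oracle placement; mount points, names and sizes are assumptions
    # Hot, randomly read tablespace goes on flash:
    echo "CREATE TABLESPACE hot_oltp DATAFILE '/ssd/oradata/hot_oltp01.dbf' SIZE 32G;" \
      | sqlplus -s / as sysdba
    # Redo stays on spinning disk; the controller's NVRAM write cache hides its latency:
    echo "ALTER DATABASE ADD LOGFILE GROUP 4 ('/hdd/oraredo/redo_g4_a.log') SIZE 1G;" \
      | sqlplus -s / as sysdba

The split mirrors the comment’s reasoning: the write cache hides redo latency, while the flash goes after the random datafile reads that dominate read-write transactions.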