Fusion-io – great demo. Now comes the hard part.

by Robin Harris | Friday, September 28, 2007 | Future Tech, SSD/Flash/NVRAM | 10 comments

Several readers have asked
Boy, do the Fusion-io guys give a good demo! Now if only they can ship a product.

I like startups.
People putting their heart and soul into an idea that rarely turns out as well as they hoped. As is said of 2nd marriages, the triumph of hope over experience.

I expect near-delusional optimism and careful misdirection. All part of the fun.

Yet the Fusion-io media blitz leaves me skeptical. The info is sparse and perhaps inconsistent. And the claims seem too good to be true – which usually means they are. Here are some concerns.

The economics are dicey. $30/GB means a $1200 40 GB drive – assuming the $/GB number comes from the smallest configuration and not the more likely 640 GB version. Some gamers and blade buyers will shell out that kind of money, but why not just max out your RAM at $35/GB?
Their site says nothing about the architecture, which is evidently some kind of parallel-channel design like the STEC Zeus drive. That just seems odd.
Finally, their numbers don’t pass the smell test. 600 MB/s of 4KB writes equals 150,000 IOPS. STEC, whose experience in flash I find credible, claims 50,000 IOPS for its Zeus FC drive. What does Fusion-io know that STEC doesn’t?
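
That last calculation is easy to check in a few lines. A minimal sketch, assuming decimal megabytes and treating 4KB as 4,000 bytes (the rounding that yields 150,000; with 4,096-byte writes the figure is closer to 146,000). The function name is illustrative only:

```python
def iops_for_bandwidth(mb_per_s: float, request_bytes: int) -> float:
    """Requests per second needed to sustain mb_per_s at a fixed request size."""
    return mb_per_s * 1_000_000 / request_bytes

claimed = iops_for_bandwidth(600, 4_000)  # Fusion-io: 600 MB/s of 4KB writes
stec = 50_000                             # STEC's claimed IOPS for the Zeus FC drive

print(f"Implied Fusion-io write IOPS: {claimed:,.0f}")
print(f"Ratio to STEC's claim: {claimed / stec:.1f}x")
print(f"With 4,096-byte writes: {iops_for_bandwidth(600, 4_096):,.0f}")
```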

The StorageMojo take
I’d love to be proven wrong, but I’ll be waiting for independent confirmation of the product’s performance and price. Besides the price, the product has all the disadvantages of direct attached storage. Gamer heaven? Yes. Economic server enhancement? I’ll wait and see.

Comments welcome, of course. Has anyone played with one of these puppies?

10 Comments

Hans on Friday, 28 September, 2007 at 9:33 am

For the current flash devices: random read access is 25uS + 50nS/bank
1. So for a 4KByte random read you have 25uS + (200/M)uS where M is the array size in bytes.
The picture shows 8 flash chips + probably ECC; say they are x16 devices, which means the array is 8×16 bits = 16 bytes wide.
So their random 4KB read on a 16byte wide array takes approx 38uS
That’s about 25K transactions a sec (approx).

2. Now flash chips can be stacked up to 4 high in the same package. Say they did a 4-chip stack. That gives you 4 planes of 16-byte width each. By pipelining 4 random reads through the controller they will get 100K operations/sec.

3. Now here is an interesting thing. 400K transactions over PCIe is NOT the same as 400K IOPS on a storage array. A flash disk under Linux or Windows is just a direct-attached block device. So..
a. Operating as a fast internal disk replacement this thing is pretty darn good. Say you put one of these inside an Isilon cluster storage server – I think you will see throughput go up. How much – I don’t know – it depends on Isilon’s software. Sujal over there can probably tell you off the top of his head.
b. If you stuff one in a server, run Linux on it and use it as an NFS server – how many IOPS will you see from the network? More than a BlueArc? Better than Isilon? Will it be cheaper? TBD

4. This now gives users a new option: (a) add server memory (b) use a PCIe card RAM expander at $35/GB {does anybody even make these?} (c) use Fusion-io at $15/GB
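
The read-latency estimate in points 1 and 2 can be sketched as follows. This assumes, as the 200/M figure implies, that the 50 ns is a per-byte transfer cost on a byte-wide bank, and that pipelined planes scale throughput linearly. The names are illustrative, not taken from any datasheet:

```python
FIXED_LATENCY_US = 25.0  # fixed cost per random read (from point 1)
NS_PER_BYTE = 50.0       # assumed: transfer cost per byte on a byte-wide bank

def random_read_us(request_bytes: int, array_width_bytes: int) -> float:
    """Time for one random read: fixed latency plus transfer across the array."""
    transfer_us = request_bytes * NS_PER_BYTE / 1000 / array_width_bytes
    return FIXED_LATENCY_US + transfer_us

def reads_per_second(request_bytes: int, array_width_bytes: int, planes: int = 1) -> float:
    """Reads per second, assuming each pipelined plane works independently."""
    return planes * 1_000_000 / random_read_us(request_bytes, array_width_bytes)

print(f"4KB read on a 16-byte array: {random_read_us(4096, 16):.1f} us")
print(f"One plane:   {reads_per_second(4096, 16):,.0f} reads/s")
print(f"Four planes: {reads_per_second(4096, 16, planes=4):,.0f} reads/s")
```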

You will probably see the Asian builders flooding the market with cards like this if it gets any traction.

One last note – it’ll be interesting to take an EqualLogic, Intransa, Ibrix or Isilon storage cluster array and bake it off against a standard server with the Fusion-io cards.

Flash cards like these may make the clustered storage guys obsolete….

– hz
{disclaimer: I don’t know the fusion-io guys and was not at DEMO. I did speak with somebody who was at DEMO and saw them. My analysis is strictly off generally available information}
xfer_rdy on Friday, 28 September, 2007 at 5:05 pm

Another solid state disk (SSD). Whether it’s flash, DRAM, MirrorBit or some other technology, it doesn’t matter. Once you have more than one memory chip on the card, you can have parallel access to data. In fact, optimized ramdisks can easily achieve 400K IOPS on today’s quad-CPU computers. I have seen Chelsio’s iSCSI product reach over 500K IOPS (I never reached the 700K – test fixture issue). So I don’t think 160K IOPS is unreasonable for an SSD. At $15/GB it costs a lot less than Texas Memory’s products.

Still, for some reason, I feel like I’m being sold a 25-year-old solution. But then again, history has a way of repeating itself and some of the best ideas are the simplest.

As for bang for the buck, it’s not bad for a small business, a web server cache, or a desktop video editing or prepress application.

Just for grins, hookup an HSM to it…

Any idea on the number of write cycles? Any BER numbers?
David Flynn on Sunday, 30 September, 2007 at 6:58 pm

“Finally, their numbers don’t pass the smell test. 600 MB/s of 4KB writes equals 150,000 IOPS. The STEC Zeus drive, whose experience in flash I find credible, claim 50,000 IOPS for their FC drive. What does Fusion io know that STEC doesn’t?”

Peak bandwidth and peak IOPS ratings are two separate numbers – it was never stated that it was 100K IOPS *at the same time* as 600MB/s. In the demo I pointed out that it was doing 400MB/s of 4K packets. Clearly the peak bandwidth is using large packets.

FYI: the array is 20 wide by 8 banks deep of independent NAND dies. It has a theoretical 1GB/s bandwidth and 320K seeks per second (8 banks doing 40K IOPS).
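
A rough check of these figures, assuming decimal megabytes and that the eight banks scale independently (the function names are illustrative):

```python
def peak_seeks_per_s(banks: int, iops_per_bank: int) -> int:
    """Theoretical seek rate if every bank works independently."""
    return banks * iops_per_bank

def request_bytes_needed(mb_per_s: float, iops: float) -> float:
    """Average request size required to reach mb_per_s at a given request rate."""
    return mb_per_s * 1_000_000 / iops

print(peak_seeks_per_s(8, 40_000))         # 8 banks doing 40K IOPS each
print(request_bytes_needed(400, 100_000))  # demo: 400 MB/s at 100K requests/s
print(request_bytes_needed(600, 100_000))  # 600 MB/s at the same request rate
```

At 100K requests per second, 600 MB/s would need requests of roughly 6 KB rather than 4 KB, which fits the point that the peak bandwidth figure relies on larger packets.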

The factors that limit performance are the bandwidth of the PCIe x4 bus, the DMA engines, and more importantly the operating system’s ability to handle that request rate. At the end of the day, I fully expect to achieve more than even 150K IOPS – we’ve simply used a conservative number (what we’ve already achieved).

The ZeusIOPS doesn’t stand a chance at competing in performance – it has to suck data through the FC/SCSI bus/protocol. At hundreds of dollars per GB (or wait, is it thousands?) and requiring FC infrastructure to use – it’s much harder to justify / integrate.

-David Flynn

CTO Fusion-io
Duane Sand on Monday, 1 October, 2007 at 11:53 pm

Hi David,
The prelim datasheet at fusionio.com/iodrivedata.pdf does claim sustained 600 MB/s at random 4KB writes, ie 150K writes/sec. Perhaps it needs correction.

I’m puzzled about some of the numbers reached in the iozone benchmark at Demo. With 8 reader processes, each doing one random 4KB page at a time and each page going to a single NAND die, I would expect the bandwidth to be capped to 8 times the 40MB/s rate of the NAND die interface, ie 320 MB/s. But iozone is reporting 383 MB/s. I wonder if Linux’s block cache in DRAM is here saving a significant fraction of pages from the prior write test. Or do you stripe the user’s 4KB page across several die?

I was surprised to see iozone report sequential read rates of 925 MB/s. This is significantly higher than the 800 MB/s stated by Fusion-io in several places. And it seems to show the PCIe x4 bus carrying DMA data at 92% of the bus’s 1GB/s msg protocol datarate. I thought the maximum DMA payload for PCIe was about 85% for the block transfer sizes supported by Intel or AMD chipsets. Perhaps the Linux memory cache is skipping some of these reads? Or does that particular HP server box have a chipset supporting especially long DMA blocks? Do you get similar iozone results when running on commodity motherboards?
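
The two sanity checks above reduce to simple ratios. A sketch using only the figures quoted in this comment (40 MB/s per die interface, a 1 GB/s bus, an 85% payload ceiling). The helper names are illustrative, and the 85% figure is the recollection stated above rather than a spec value:

```python
DIE_INTERFACE_MB_S = 40           # per-die NAND interface rate quoted above
PCIE_X4_MB_S = 1_000              # bus data rate quoted above (1 GB/s)
EXPECTED_PAYLOAD_FRACTION = 0.85  # assumed ceiling, as recalled in the comment

def die_limited_mb_s(concurrent_reads: int) -> int:
    """Bandwidth cap if each outstanding read is served by exactly one die."""
    return concurrent_reads * DIE_INTERFACE_MB_S

def bus_fraction(observed_mb_s: float) -> float:
    """Share of the PCIe x4 data rate that an observed throughput represents."""
    return observed_mb_s / PCIE_X4_MB_S

cap = die_limited_mb_s(8)
print(f"Expected random-read cap: {cap} MB/s; iozone reported 383 MB/s ({383 / cap:.0%} of the cap)")
print(f"Sequential read uses {bus_fraction(925):.1%} of the bus; expected ceiling {EXPECTED_PAYLOAD_FRACTION:.0%}")
```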

I see that the demo’d PCIe board has a double-sided daughtercard containing 20+ NAND packages plus a controller. 80 GB total, if each package contains two 2GB NAND dies, stacked together. With 20 independent NAND interface channels, one per package. Very fast and very dense.

I presume the 640 GB product would contain 8 of these daughtercard modules, plugged sideways into a single PCIe-slot card. And the x4 PCIe bus is subdivided 8 ways. What are the performance and cost trade-offs of installing multiple PCIe-slot cards, each containing a single 80GB module, versus installing a single card containing multiple modules?

I am excited to see that your random write rates are nearly as fast as your random reads, instead of being many times slower as on all other announced NAND drives (including Zeus). This shows that you’ve solved NAND’s “write amplification” or garbage collection problem, where rewriting a single random page can trigger the overhead of copying up to 63 old data pages. Intel has apparently solved this too, and so has EasyCo’s MFT software.
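
To make the write-amplification point concrete: a minimal sketch, assuming an erase block of 64 pages (implied by the 63-page figure above) and a naive controller that copies every still-valid page before erasing. The function name is illustrative:

```python
def write_amplification(valid_pages_copied: int, new_pages_written: int = 1) -> float:
    """Physical pages programmed per logical page the host asked to write."""
    return (valid_pages_copied + new_pages_written) / new_pages_written

PAGES_PER_ERASE_BLOCK = 64  # assumption implied by "up to 63 old data pages"

worst = write_amplification(PAGES_PER_ERASE_BLOCK - 1)  # every other page still valid
best = write_amplification(0)                           # block held only stale pages
print(f"Worst case: {worst:.0f}x, best case: {best:.0f}x")
```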

— Duane Sand
Bill Todd on Tuesday, 2 October, 2007 at 3:07 am

(Since your system apparently ate my earlier post, I’ll try again:)

The demo is certainly cute, but the product seems (in both performance and pricing, though its capacity may be noticeably larger) equivalent to a simple battery-backed RAM card. 600 MB/s is nothing special for such a card, and of course only equates to about 10 disk drives streaming data full-tilt (and they’d cost only about 1/10th as much doing so). 100K IOPS is also nothing special for such a card, and while it would take on the order of 1,000 disks to equal that (at far higher total cost) they’d have on the order of 1,000 times the card’s capacity as well…
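
Those disk equivalents follow from per-drive figures that the comparison implies but does not state: roughly 60 MB/s of streaming bandwidth and roughly 100 random IOPS per disk. A sketch under those assumed figures:

```python
import math

DISK_STREAM_MB_S = 60   # assumed per-disk streaming rate (implied by "about 10 disk drives")
DISK_RANDOM_IOPS = 100  # assumed per-disk random IOPS (implied by "on the order of 1,000 disks")

def disks_needed(target: float, per_disk: float) -> int:
    """Number of disks required to match a target rate."""
    return math.ceil(target / per_disk)

print(disks_needed(600, DISK_STREAM_MB_S))      # disks to stream 600 MB/s
print(disks_needed(100_000, DISK_RANDOM_IOPS))  # disks to serve 100K IOPS
```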

Given your comment about a $30/GB price I didn’t bother looking further: for small-read-intensive workloads buying more system RAM instead at a fairly similar cost/MB is almost certainly a better (and more flexible) solution, and while fast NVRAM can be a boon for small-write-intensive workloads a relatively small NV write-back cache fronting conventional rotating storage is usually a far more cost-effective solution.

– bill
David Flynn on Saturday, 6 October, 2007 at 9:09 pm

Duane,

Yes, the data sheet implies 150K write IOPS by stating 600MB/s bandwidth @4K packet size. That is an error… sort-of. While the ioDrive is capable of 150K write IOPS, we’ve found that the block layer of the OS isn’t really capable of handling that. So, it will take larger packets to get the OS to go at the maximum bandwidth.

Each request is not sent to a single die of a single chip, but to the same die across several chips.

You are correct that the maximum completion unit size for our tests was larger than normal. As you might imagine the northbridge and DMA tuning at these rates makes a big difference. HP has facilitated some great tuning.

We put only 2GB of DRAM in each server to reduce page cache effects. The total transfer size chosen was 8GB to further reduce the effects.

The 640GB card doesn’t require additional ioMemory modules, just the one – loaded with 128Gbit NAND chips stacked, 16GB * 40 chips = 640GB.
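
The capacity arithmetic, spelled out as a small sketch (8 bits per byte; the names are illustrative):

```python
BITS_PER_BYTE = 8

def package_gbytes(package_gbits: int) -> float:
    """Convert a NAND package capacity from gigabits to gigabytes."""
    return package_gbits / BITS_PER_BYTE

per_package = package_gbytes(128)  # 128 Gbit stacked package
total = per_package * 40           # 40 packages on one module
print(f"{per_package:.0f} GB per package, {total:.0f} GB per card")
```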

-David
Duane Sand on Monday, 8 October, 2007 at 5:44 pm

Thanks David,

You are striping small 4KB records across multiple NAND chips to reduce transfer times. That has a side effect of multiplying the NAND chip’s erase-block size. For most designs, bigger erasures would further worsen their problem of very expensive random writes. But you have extremely fast random writes anyhow! Your method for combining and reusing partially-stale blocks must be very very efficient!

640 GBytes in just 40 chips?!

The TGDaily interview quotes you as saying “the card has 160 parallel pipelines”. I had assumed this meant that its NAND controller(s) has up to 160 independent external 8-bit data buses, each shifting data into and out of a single NAND storage chip. And I assumed that the demo’d 20-chip product had 20 such pipelines. Or is your pipeline something else?

If your highest-capacity cards have 8 stacked 2GB die inside each chip package, those 8 die would all be sharing pins for a single data bus. Only one die per package can be doing its slow data shift-in/shift-out work, so at most 40 of the 160 pipelines are transferring data concurrently. But 40 is fine, because 20 is already enough to saturate the PCIe x4 bus anyhow. Perhaps the 160 pipe design was to allow for an alternate, less-dense packaging of the NAND chips?

Does the electrical load of stacking 8 die together cause their bus to clock slower? Samsung’s commodity 4-stack double-high packages have only half the clock rate of their 2-stack and unstacked products. With 40 chips, you maybe have enough excess parallelism to compensate for possibly slow buses.

Is this degree of die stacking common in the next generation of SSD drives? It sounds costly and slower to bring to market.

— Duane
I on Thursday, 18 October, 2007 at 12:55 pm

Does anyone know how much the 80GB and 160GB versions may cost?
And will they run in something like a RAID 5 config (3d+1p)?
Any chance they’ll support Solaris 10 (x64) and/or SUSE 10?
Rubbish on Thursday, 28 February, 2008 at 9:32 pm

You can only get that kind of throughput by using it as a tape drive. Do some math and you will figure out that the random throughput will be far less than what has been claimed.
I have one on Friday, 20 February, 2009 at 11:56 am

We are testing one of these for a sophisticated enterprise surveillance application that needs to perform lots of SQL Server transactions (some large ones) very quickly. I don’t know a lot about the technical aspects of NANDs and chip stacking and all the other technical terms here. I do know from years and years in the business that specs are tedious and the real test comes when you install and use the gear. It makes little difference if it is a hard drive, an infrared camera, or anything else.

I can tell you firsthand this thing PERFORMS. It tears through rapid-fire SQL transactions and large reads like they were child’s play. We’re storing motion data and video analysis in real time from tens of thousands of cameras, with data connections coming in from all over the world, and our old SAS / RAID 5 setup was struggling to keep up. This baby is handling all we can throw at it, no sweat.

We also tested a RamSan from Texas Memory Systems, which performed just as well (real-world, don’t know about specs) but was so unwieldy we opted for this solution.

So feel free to talk your talk, but I’m here to tell you these babies are sweet. If you need a real solution to a bogged down database solution, I’d give it a shot.