1 million IOPS in 1 RU

by Robin Harris on Monday, 12 October, 2009

Sun announced, or rather removed from stealth mode, the F5100, their flash-based storage array that uses SO-DIMM form-factor flash modules (see last month’s post for the StorageMojo take on the then-unannounced product). With 20 flash modules and 480 GB of capacity it starts at $50k, which means big customers will pay less than $40k.
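
A quick sketch of the price-per-gigabyte arithmetic, using only the figures above:

```python
# Back-of-the-envelope $/GB for the entry configuration (figures from the paragraph above).
list_price = 50_000      # USD, 20-module / 480 GB starter config
street_price = 40_000    # USD, rough big-customer estimate
capacity_gb = 480

print(f"list:   ${list_price / capacity_gb:.0f}/GB")    # ~$104/GB
print(f"street: ${street_price / capacity_gb:.0f}/GB")  # ~$83/GB
```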

The box supports 64 SAS connections, SAS zoning, up to 1.9 TB of capacity, and up to 9.7 GB/s of sequential write bandwidth. Here’s the official Sun web page. Joerg Moellenkamp has a good short writeup as well. (Thanks, David!)

Here’s a picture:
[photo: the Sun F5100 Flash Array]

The StorageMojo take
A shareable flash resource at $100/GB list should be popular. It should also get more data center guys thinking about power per IOP instead of just performance.

Another shot across the bow of the big iron storage flotilla and a nifty advance. Good luck to them.

Courteous comments welcome, of course. Would a project name like “The Beast” have upped the sex appeal?

Comments (16)

Ed Monday, 12 October, 2009 at 9:08 am

“Would a project name like “The Beast” have upped the sex appeal? ”

Like the Nexsan SATABeast, DATABeast or SASBeast? Sorry, but “The Beast” would have probably gotten them sued.

The Oracle F5100 will, IMHO, die. I’m seeing a lot of grumbling from corporate customers suggesting that they’re moving away from Sun hardware because of Sun’s acquisition by Oracle.

Emmanuel Florac Tuesday, 13 October, 2009 at 10:06 am

This is the Oracle Perfect System: tailored for your biggest database needs. :)

KD Mann Thursday, 15 October, 2009 at 9:33 am

F5100…a million IOPS, or only 10,000? Maybe not even that?

On Sun’s F5100 web page under the “performance” tab, there are three application benchmark scenarios described:

Case 1: “ABAQUS IO Intensive”: Sun says “With four RAID0 72GB 15K RPM internal drives, the benchmark execution time was 958.81 seconds. With a 20 Flash Module F5100 Flash array, the time dropped to 462.74 seconds. This is a 2.1x improvement for this benchmark test case.”

Hold on here…it took TWENTY SSDs to beat four HDDs by 2.1x??? Must be an anomaly. Next up…

Case 2: MSC/NASTRAN MDR3: “…Sun Storage F5100 Flash Array beat the leading posted results from an HP BL460c G6 with striped SAS drives…F5100 Flash Array: 1970 Seconds; HP Striped SAS Array: 2062 Seconds.”

Ok…that’s a 5% performance gain by plugging in 20 SSDs versus a striped array of…hold on…waittaminute…Sun didn’t say how many HDDs.

http://www.mscsoftware.com/support/prod_support/nastran/performance/vr3_par.cfm#Vend_HP_BL465c

Got it. An F5100 array of 20 SSDs can outperform an array of six HDDs by 5%.
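
(To make the per-drive arithmetic explicit, here is a rough sketch using only the run times and drive counts quoted above; it crudely treats the whole runtime ratio as an I/O speedup:)

```python
# Rough "HDDs replaced per flash module" for the two cases above.

# Case 1 (ABAQUS): 4 HDDs -> 20 flash modules, 958.81 s -> 462.74 s
speedup_abaqus = 958.81 / 462.74               # ~2.07x
per_module_abaqus = 4 * speedup_abaqus / 20    # ~0.41 HDD-equivalents per module

# Case 2 (MSC/NASTRAN): 6 HDDs -> 20 flash modules, 2062 s -> 1970 s
speedup_nastran = 2062 / 1970                  # ~1.05x
per_module_nastran = 6 * speedup_nastran / 20  # ~0.31 HDD-equivalents per module

print(round(per_module_abaqus, 2), round(per_module_nastran, 2))
```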

Case 3: PeopleSoft Payroll Benchmark “F5100 Flash Array…produced World Record Performance on PeopleSoft Payroll 9.0 benchmark…”

The tested configuration was a “two-tiered” setup: an F5100 with 40 SSDs plus an additional traditional array of 12 x 15K HDDs. Sun’s blogger said the setup delivered…drum roll please…15% faster performance than a single-tiered array of 58 x 15K HDDs.

http://blogs.sun.com/BestPerf/entry/oracle_peoplesoft_payroll_sun_sparc

But that’s not how it shook out. I had to (again) track down the actual published result for the HDD-based system to count spindles, and while I was there I noticed that the numbers on Sun’s blog didn’t match those in the result published by Oracle. The 58-disk HDD system was actually quite a bit faster than reported on Sun’s blog, and the claimed 15% performance increase for SSD vanished.

http://www.oracle.com/apps_benchmark/doc/peoplesoft/performance-report/ps9-na-pay-9_ora_hp_rx6600.pdf

Turns out, a Flash-free single tier of 58 x 15K RPM HDDs delivered performance exactly equal to the F5100’s 40 SLC SSDs plus 12 x 15K HDDs. At this point, I’m beginning to wonder which tier is the fast one.

Payroll is the ubiquitous, archetypal “IO-intensive” enterprise transactional application. In this benchmark, even with a two-tiered deployment where the tough stuff (log-file writes) is offloaded to spinning disk, we still can’t replace two spinning disks with a single Flash SSD.

Something wrong here?

Jörg M. Friday, 16 October, 2009 at 9:53 am

@KD Mann: Yes … you can’t cite. The numbers you’ve used for NASTRAN are for ABAQUS, and vice versa. You and some other commentators at other blogs deserve a blog comment that I’m writing at the moment to explain some stuff.

Kebabbert Saturday, 17 October, 2009 at 8:35 am
Kevin Closson Saturday, 17 October, 2009 at 9:03 am

KD Mann wrote:
“Something wrong here?”

Yes, there is something wrong. It is no small correction for me to point out that the F5100 is not an array of SSDs but an array of flash modules. KD Mann’s comment cites “SSD” nearly 10 times. Flash modules are not SSDs. I’m not joining this thread to quibble about the numbers, but it occurs to me that Sun must not be making the distinction clear.

Since the Sun Oracle Exadata Storage Server comes with 4 of the 96GB Flash cards (same family of flash products), and since I spoke to at least 50 people at Oracle Open World who also use the term SSD interchangeably for flash modules, I think it is time for me to write a blog entry on the matter.

Joerg M. Saturday, 17 October, 2009 at 10:44 am

@Kevin:

1. Those cards have 24 GB each.
2. Where is the distinction, for you, between an SSD and a flash card with its own controller, its own cache, its own wear-leveling?

puff65537 Saturday, 17 October, 2009 at 11:59 am

KD: it’s the shitty latency. Note the following two paths:

CPU – QPI – PCIe – BBC on raid card.

CPU – QPI – PCIe – cheap HBA – multiplexed x4 SAS cables – SAS zoner – BBC on flash module

and by cheap I mean $100

The obvious solution (move the cache closer by using a smart RAID controller) doesn’t work because nothing out there can handle the B/W or IOPS. The Sun F5100 DB acceleration cookbook acknowledges this obliquely by stating that lightly threaded apps won’t benefit.

PS: for HP gear, measured latencies for a lightly to moderately threaded app are 400 microseconds (P410i + 512MB BBC) vs. 2,000 to 20,000 microseconds (SC08ge).
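
(A rough sketch of why that latency gap pinches lightly threaded apps: per-thread throughput is bounded by roughly one outstanding I/O per round trip, so plugging in the latencies above:)

```python
# Per-thread IOPS is bounded by roughly (outstanding I/Os) / (round-trip latency).
def iops_ceiling(latency_us, outstanding_ios=1):
    return outstanding_ios / (latency_us / 1_000_000)

print(iops_ceiling(400))     # ~2,500 IOPS per thread behind the P410i + 512MB BBC
print(iops_ceiling(2_000))   # ~500 IOPS per thread at 2 ms on the SC08ge path
print(iops_ceiling(20_000))  # ~50 IOPS per thread at 20 ms on the SC08ge path
```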

KD Mann Sunday, 18 October, 2009 at 9:28 am

@Kevin Closson,

Each FMOD in the F5100 “array” is a SAS-interface SSD using SLC, connected via the SAS/SATA expanders in the F5100 to external SAS ports. An F5100 is an array of high-end SSDs. The Marvell interface chip is easily identifiable in the photograph here:

http://storagemojo.com/2009/09/01/the-sun-4-tb-flash-array-f5100/

@Joerg M.,

The Abaqus and MSC/NASTRAN benchmarks indeed got juxtaposed…by Sun, not by me. Sun’s blogger got them wrong, reporting 4x72GB HDDs for MSC/NASTRAN…

http://blogs.sun.com/BestPerf/entry/mcae_mcs_nastran_faster_on

…which actually used 6x72 (see the link I provided above). The numbers I reported are correct, taken from the actual benchmark results.

Joerg, I appreciate your lengthy rebuttal at c0t0d0s0.org, but you offer only one possible explanation for why none of these configurations can replace more than 1.5 fast HDDs per SSD: your assertion that the applications required more capacity than could be economically provided by SSDs. This is not true for any of the three.

For MSC and Abaqus (respectively), neither the 6x72GB HDD configuration nor the 4x72GB HDD configuration provides more capacity than the 20 x 24GB SSDs in the F5100.

For PeopleSoft, Sun’s 40 x 24GB FMODs (RAID-0) plus 12 x 450GB 15K HDDs (RAID-0) provide more capacity than the HDD-only system with 58 x 146GB (RAID-1).
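
(Spelling out the capacity arithmetic behind that claim, assuming RAID-1 halves usable capacity and RAID-0 does not:)

```python
# Usable capacity (GB) of the two PeopleSoft configurations described above.
flash_tier = 40 * 24                 # F5100 FMODs, RAID-0        ->   960 GB
hdd_tier   = 12 * 450                # 15K HDDs, RAID-0           -> 5,400 GB
two_tier   = flash_tier + hdd_tier   # ~6,360 GB usable

hdd_only   = 58 * 146 // 2           # 58 x 146GB HDDs in RAID-1  -> ~4,234 GB usable
print(two_tier, hdd_only)
```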

rockmelon Sunday, 18 October, 2009 at 6:17 pm

The 24GB Fmod is actually a SATA (not SAS) SSD which happens to be packaged in an SO-DIMM form factor. They talk to SAS HBAs via the SATA Tunnelling Protocol feature of SAS.

In the Sun X2270, there are two Fmod sockets on the motherboard, hooked up directly to the ICH10-R SATA ports. It’s the same part number as used in the F5100. For more details, see the X2270 Server Architecture white paper, or

http://www.sun.com/storage/disk_systems/sss/flash_modules/specs.xml

Kevin Closson Monday, 19 October, 2009 at 12:26 pm

I’ve gone through this thread a couple of times and realize I commented from the standpoint of my overly myopic view of storage. When I contrast SSD to the Flash Oracle offers in the Sun Oracle Exadata Storage Server, I’m actually aiming to contrast Enterprise Flash Drives a la STEC ZEUS with PCIe flash cards, because that, it seems, is what Oracle customers have boiled the comparison down to. During the lifespan of Exadata Version 1, Oracle was routinely questioned on why Exadata Storage Servers did not include Flash SSDs connected via the HP SmartArray, and a lot of folks naturally presumed that when Exadata did include Flash it would indeed be downwind of a standard drive controller. I think the 4 PCIe Flash cards (used as cache) alongside the 12 spinning drives in the Sun Oracle Exadata Storage Server is a much better approach than plumbing CPU->QPI->controller->drive.

Sorry if I hijacked the broader conversation.

Wes Felter Monday, 19 October, 2009 at 2:43 pm

Too bad the Sun F20 (used in Exadata 2) is not a PCIe flash card; it’s a SAS HBA with four SSDs bolted onto it.

Kevin Closson Monday, 19 October, 2009 at 4:32 pm

Wes Felter writes:

“…the Sun F20… is not a PCIe flash card…”

Wes,

Sun calls it the “Sun Flash Accelerator F20 PCIe Card” and I’ve personally pressed down with my thumb seating the card into a 4275 server PCIe slot. Can you point out where they, and I, are wrong on the matter? Am I just missing a point of terminology? Either way, please let me know.

See http://www.sun.com/storage/disk_systems/sss/f20/

And, yes, there are 24 GB components in the card, and that is why I mentioned in my first comment on this thread that they (the F20 and F5100) are of the same family of products.

Som Sikdar Wednesday, 28 October, 2009 at 12:15 pm

Kevin Closson writes:

..”Sun calls it “Sun Flash Accelerator F20 PCIe Card” and I’ve personally pressed down with my thumb seating the card into a 4275 server PCIe slot. Can you point out where they, and I are wrong on the matter? Am I just missing the point in terminology? Either way, please let me know…”

I believe this is the same ‘Flashfire’ PCI-e card Sun was showing in their booth at Oracle OpenWorld. If so…
1. I was told by Sun folks at the booth that it uses the standard LSI 1068 controller used in other HBAs
2. There are four flash DIMMs attached through SAS/SATA ports to this controller and mounted ‘saddlebag-style’ on the card
3. Additional ports of the LSI controller are brought out to the end of the card and can be attached to regular hard drives

So… this card IS equivalent to connecting regular SSDs like a STEC or Intel X off a RAID card. There may be secret sauce in the Sun flash DIMMs (note, however, that the modules use the Marvell controller used by other SSD makers; I was told the Samsung-supplied flash chips are ‘binned’ for better durability).

If the LSI 1068 chip is indeed used, then it’s not even the highest-performance controller. That controller is used in entry-level LSI SAS/SATA HBAs and is found on a lot of Supermicro motherboards as the on-board disk interface.
It’s not even Gen 2.0 PCIe…
http://www.lsi.com/DistributionSystem/AssetDocument/files/docs/marketing_docs/storage_stand_prod/SCG_LSISAS1068E_PB_040407.pdf

rockmelon Wednesday, 28 October, 2009 at 8:36 pm

Just because the LSI1068E is found on some Supermicro motherboards does not mean it is a low-end or low-performance part. It’s also used on this HBA:

http://www.lsi.com/DistributionSystem/AssetDocument/documentation/storage/hbas/sas/lsisas3801e_pb.pdf

Applications – mid to high-end servers.

Typical Uses – Mission critical applications

Features – provides over 140,000 IOPS

It looks to be a good fit for the 100k IOPS claimed for the F20 at

http://www.sun.com/storage/disk_systems/sss/f20/specs.xml

Som Sikdar Thursday, 29 October, 2009 at 12:05 pm

rockmelon writes..
…”Just because the LSI1068E is found on some Supermicro motherboards does not mean it is a low-end or low-performance part. It’s also used on this HBA:”…

Agreed. My comment was not meant to imply that there is anything wrong with the LSI 1068E. It’s a very widely used controller and as you have pointed out it’s got enough headroom beyond Sun’s F20 spec.

My points were:
1. The Sun F20 is architecturally equivalent to a RAID+SSD setup (unlike a Fusion IO or TMS PCIe card).
2. LSI has higher ‘grade’ controller chips – including support of next generation PCIe and SAS

I am a huge fan of Exadata V1 and V2 (where the Sun F20 is used), both from HW and SW architecture perspectives. But that discussion is for a database thread…
I am still trying to sort out where the Sun F5100 is substantially better than a server full of SSDs or the Violin flash appliance. And by the way, the Violin appliance IS architecturally a PCIe flash-chip setup.

Disclosure: I am not affiliated with Violin, Sun, Fusion IO or TMS. I think all these solutions are innovative and have their uses.
