Most of us know what it is like when a relationship goes bad: the sinking feeling that this just isn’t going to work.
Can this configuration be saved?
Dear StorageMojo:
I joined a company last year that is running Oracle 10g on a NetApp NAS/SAN.
Immediately I asked why they were not using Clustering, Oracle RAC, Oracle ASM or Fibre Channel. No answer.
Fast forward to a year later, and they are asking me to deploy this to an I/O-bound customer with hundreds of connections and lots of transactions to their DB over NFS.
Long story short it’s slow-w-w-w. They tried trunking multiple network connections. They tried tuning. They tried a bunch of stuff. And it’s still a dog.
How slow?
I have a screaming Dell R710 running a 7TB database attached over SAS to a set of MD3000 storage arrays. I am getting 450MB/s … and this barely suffices …
The “new” system they showed me gets 50MB/s … the same screaming Dell R710, but connected over NFS (instead of SAS) to the NetApp NAS.
Do you have any suggestions?
Thank you for reading this nightmare.
Bob
Poor Bob! He’ll be getting grief from the client for months, maybe years to come, unless this gets fixed.
The StorageMojo take
Maybe Bob could have been better about developing a relationship with the guys configuring the systems. More questions, fewer conclusions, at first.
Suggestions to the customer for acceptance testing might be in order.
But there are 2 problems here:
- What to do now.
- How to keep this from happening again.
What would you suggest to Bob on either or both topics? I’ve asked him to watch the comments, so if more info would be useful, I hope he’ll provide it.
Courteous comments welcome and needed. When a multi-billion-dollar near-sighted telescope can get sent into orbit, it is surprising that more IT projects don’t go wrong.
Hi Bob,
For Oracle, many people prefer NFS over SAN or direct-attached storage since NFS is simpler to scale and manage. There can be a performance penalty for NFS, which it sounds like you are suffering from now.
Avere Systems (www.averesystems.com) provides scalable appliances that accelerate NFS for applications like Oracle, and they are cost-effective since they work with your existing NFS solution. We have a couple of different models, but I think a couple of our FXT 2700 appliances, each providing 64GB of RAM and 512GB of SSD, will dramatically improve your performance. Check out the product pages on our website for more info (http://www.averesystems.com/Products.aspx) and feel free to send me email (jtabor@averesystems.com) if you need help.
Jeff
Bob, you may also get some great responses from the Toasters list (http://toasters.mathworks.com/toasters/Explanation.html). Fantastic expertise on that list.
I see two main issues here:
1. How to address the technical issues.
2. How to save face in front of the customer.
For the first one I would recommend reaching out to your vendors to see if they can help. Also, I would see if the folks on the various NFS lists can offer additional suggestions for boosting throughput. Getting 450MB/s out of the various NFS clients will be tough, especially when you are dealing with a massive number of small I/Os.
Regarding #2, I would be open and honest with the customer. Tell them about the issues you are encountering and give them options to address it. I’m a firm believer in being open and honest with people, and hopefully they will be understanding and work with you to re-engineer a solution that scales to the I/O levels you need to obtain.
Great blog!
– Ryan
Which bottleneck are you actually hitting? Interface speed? TCP Offload capacity? NFS thread limit?
Are you using Oracle’s NFS client, or the default OS one? What OS are you running? Linux? Solaris?
Are the queries tuned appropriately? (I assume they are: I’m also assuming that you just changed the storage back-end).
Is the NetApp licensed for Fibre Channel attach? Could you use the same storage and access it as block storage? What’s the speed difference between NFS and FC attach?
Oracle was really pushing their DB on NFS on NetApp for a while, so I’m assuming they have a ton of white papers on how to tune that. Start there.
Are you doing mostly reads or writes? It might be interesting to mount NFS async as a test. Usually NFS writes on NetApp aren’t that bad because of the NVRAM though. In the end, disks are disks, so you really should be able to get about the same performance with the same number of DAS vs. NAS spindles.
Is there other load on the filer?
Have you checked to see if you need a WAFL defrag (reallocate measure)?
Is the filer using SATA disks with a small RAID group size? It’s possible it’s a configuration that can’t give you the I/O you need.
You trunked the network; is it actually using all the links, or just one link’s worth of bandwidth? There may be a /vol0/etc/log/lacp_log file that’ll tell you whether the trunk is working. Jumbo frames may help, maybe not.
There are a lot of filer tools to diagnose performance problems, but that’s probably outside the scope of a blog comment. You may find http://www.netapp.com/us/library/technical-reports/tr-3322.html interesting; I’m not an Oracle user, but it appears Oracle over NFS is the configuration NetApp suggests.
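One cheap client-side sanity check, by the way (a minimal sketch, assuming a Linux NFS client; the exact /proc/self/mountstats layout varies by kernel, so treat it as a starting point rather than a diagnostic tool): dump the negotiated mount options and byte counters for each NFS mount, so you can confirm the rsize/wsize and protocol version actually in use before blaming the filer.

```python
#!/usr/bin/env python3
"""Rough sketch: print mount options and traffic counters for NFS mounts.

Assumes a Linux client exposing /proc/self/mountstats. The field layout
varies by kernel version, so this only does simple string matching.
"""

MOUNTSTATS = "/proc/self/mountstats"

def nfs_mount_summaries(path=MOUNTSTATS):
    """Yield (device_line, detail_lines) for each NFS mount found."""
    current, details = None, []
    with open(path) as f:
        for line in f:
            line = line.rstrip()
            if line.startswith("device "):
                if current:
                    yield current, details
                current, details = None, []
                if "fstype nfs" in line:          # matches nfs and nfs4
                    current = line
            elif current and line.strip().startswith(("opts:", "age:", "bytes:")):
                details.append(line.strip())
        if current:
            yield current, details

if __name__ == "__main__":
    for device, details in nfs_mount_summaries():
        print(device)
        for d in details:
            # opts: shows rsize/wsize/proto/vers; bytes: shows cumulative read/write totals
            print("   ", d)
        print()
```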
Hello everyone, Dimitris from NetApp here.
NFS is not the issue per se – we have some of the largest and busiest databases in the world (including most of Oracle themselves) on NFS.
Doing it right is another matter, like with everything. There are several tunings for NFS for pre-11g versions of Oracle (with 11g you can use dNFS, which screams and is easier to deploy).
I suggest you reach out to NetApp support, who will be able to assist you with setting the right parameters.
Of course, if this is all GigE, you’ll need to use multiple links to hit over 500MB/s, and the right back-end. I have no idea how many disks you have.
Reaching out to NetApp DB performance experts can help size this properly.
I’d also recommend you check the various efficiency technologies NetApp provides with respect to Oracle (FlexClones, RAID-DP, SnapManager for Oracle to name a few). Performance is one aspect (easily tackled). Functionality and reliability is a totally different story.
Otherwise, everyone would be just doing RAID0 on JBOD… 🙂
Thx
D
Slow NFS, or any other protocol over IP, is often caused by the following:
- Network switches with oversubscribed backplanes. Several models oversubscribe without warning; read the fine print in your documentation. They are not rare at all.
- Failure to isolate the storage network the way you would with FC. Use a separate IP switch and dedicated Ethernet ports on your server; don’t just put the storage devices on a VLAN.
- Aggregating multiple slow links. Move to 10Gb Ethernet to get speed instead. That way you avoid the extra latency and the quirky aggregation protocols that rarely work well with storage devices.
This is like watching a kid with one of those toys that has square, triangle, and circle blocks and corresponding holes, trying to mash the triangle through the circle slot: no matter how hard he pushes, it won’t go through. Some things just are not meant to be, frankly.
It’s old, but it’s got similarities to your situation: http://media.netapp.com/documents/tr-3496.pdf
Even NetApp knows that NFS can’t compete with FC, but that’s a given.
Alex McDonald, NetApp here; please use NetApp support! You may want to read this too: http://media.netapp.com/documents/tr-3862.pdf “Oracle on NetApp NFS: A Database Implementation over NFSv3 and NFSv4”
There are also documents for specific platforms;
• http://media.netapp.com/documents/tr-3408.pdf “AIX Performance with NFS, iSCSI, and FCP Using an Oracle Database on NetApp Storage” May 2009
• http://media.netapp.com/documents/tr-3557.pdf “HP-UX NFS Performance with Oracle Database 10g Using NetApp Storage” Nov 2008
(Search our media library at http://www.netapp.com/us/library/ on “10g” for example)
I’d also consider using dNFS, which is pretty bulletproof.
And again, please contact NetApp for advice.
@ tgs –
I beg to differ. True, comparing a single 4Gb FC link to a single 1GbE link, performance on FC will of course be better. But even looking at that paper, NFS performance isn’t 10x slower, it’s 20-something PERCENT slower, and that’s in THAT old document.
Technology has evolved though: http://media.netapp.com/documents/tr-3700.pdf is a more recent document showing the same thing but with newer Oracle and dNFS. Look at the speeds. That’s without 10GbE…
The way to get GbE performance is parallelization if you don’t have 10GbE.
If you can use 10GbE as Jean suggested, performance is stellar.
You always have to design with the requisite performance first in mind, then decide on everything else.
I wouldn’t mind knowing the exact config and the customer name – contact me off-blog please 🙂
Because, you see, I have no idea whether your box is a 2020 with 6 drives, or a 6280 with 1400 drives and 8TB cache. Kind of a big range of possible boxes in between. If your box is small, it won’t be able to hit 500MB/s.
Also check the switches, NICs… there’s an entire chain of stuff there.
If all else fails, as someone else suggested, you can always use FC on the NetApp system anyway… it supports everything.
Thx
D
Hi,
There’s another Ethernet-based option that offers superior performance without having to buy 10GbE and without having to switch over to FC and its expensive switches:
iSCSI with MPIO.
iSCSI with 1GbE links and a good MPIO driver will fairly well dust any NFS implementation over the same links. We’ve proven this with a strong degree of rigor in our VMware environment, with a NetApp 3070 and clients with four 1GbE links.
One might also wish to experiment with a direct-mount iSCSI LUN (via a software driver or iSCSI card) on the database host. We did this with an IOPS-intensive workload once, and found the whole arrangement to be vastly superior to NFS on the very same filer.
And yes, we very much know how to tune a NetApp.
Joe Kraska
San Diego CA
USA
@Bob
First of all, you didn’t mention which “NetApp” you are using. Even the fastest FAS6000 series isn’t exactly “FASt” by today’s x86 standards. How many spindles and how much write cache do you have?
One thing ZFS guys like us know is that synchronous NFS writes are slow if you don’t have flash write caching like the ZFS ZIL.
One of the secrets the NetApp guys don’t want you to know is that the “WAFL-style file system” is now an open-source commodity. Entry-level NetApp gear typically doesn’t have more horsepower than a $1,000 PC these days, and spindle costs are exorbitant.
Let me point you in the right direction:
Call up Dell:
1. Buy a pair of R710s with E5620s and 72GB of RAM, and fill them with 6x LSI 9200-8e HBAs.
2. Buy 6x MD1000s with AAMUXes.
3. Fill them with 1TB Hitachi UltraStars from Provantage (20-packs).
4. Add a pair of Vertex 2 EX 50GB ZILs and 4x X25-E/M L2ARC devices per R710.
Then call up Nexenta for an active-active NexentaStor license and run cross-JBOD, cross-HBA RAIDZ2s across the six MD1000s. Benchmark NFS performance on that. You can even test it for free with the NexentaStor Community Edition.
I think Jason and Jean are onto something here. In order to troubleshoot and come up with a solution, we’ll need a bit more information. Many organizations have Oracle over NFS, but that doesn’t really mean they all work the same.
When it comes to IP storage, I personally would examine the network first. After spending so much money on their new storage systems, many organizations tend to ignore the network. Plus, didn’t the sales guys make a point about using your existing infrastructure when setting up IP storage? 🙂
I agree with Jean’s suggestions. Check your switches and make sure you have enough throughput in the backplane first, and make sure you isolate the storage traffic from the rest of the network.
Ultimately, it’d be great to upgrade to 10GbE, but it could cost a lot if the infrastructure is not in place today.
Regarding the comparison to the MD3000 setup, I’m not surprised, but I don’t think you’re comparing apples to apples. Pound for pound, the MD3000 is just not in the same league as a FAS2000 or above.
My 2 ¢.
As for TS’s recommendation to fill Dell MD1000s with third-party drives, one will want to be aware that this will void the Dell warranty on the MD1000. Since I cannot imagine an MD1000 breaking, I don’t see how this will be a problem, but one does want to be clear in the advice one gives. 50MB/s is far too low for a decently provisioned midrange NetApp filer, even with SATA drives. We’d need to know a lot more about the configuration before saying what is wrong with it, if anything. Later 3000-series filers are capable of well in excess of 8x that much.
Additionally, one will want to take care. I recall that some Dell storage products use commodity hard drives but with custom tagged firmware. Without the tag, the drive will not be accepted by the array. I have no idea if this is true of the MD1000.
Bob,
There are potentially two separate issues you are dealing with:
1) NFS is slower than FC/SAS block storage. It’s a file-access protocol, and both the protocol overhead and the file system used by the NAS give reads much higher latency than direct-attached storage. Additional read latency reduces CPU efficiency and database application performance.
2) The disk media on the NAS may also be too slow. To sustain 500MB/s with 16K accesses, you would need about 30K IOPS. On a typical NAS you would need roughly 300 HDDs (a whole rack) available to your application.
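For what it’s worth, the arithmetic behind those figures is easy to sanity-check yourself. A back-of-the-envelope sketch; the per-drive IOPS numbers below are generic rules of thumb, not measurements of any particular array:

```python
# Back-of-the-envelope check: how many random IOPS does 500MB/s of 16K I/O
# imply, and roughly how many spindles would that take? The per-drive figures
# are generic rules of thumb (7.2K SATA ~100 IOPS, 15K SAS/FC ~175 IOPS),
# not vendor specs for any particular array.

target_mb_s = 500
io_size_kb = 16

iops_needed = target_mb_s * 1024 // io_size_kb
print(f"{target_mb_s} MB/s at {io_size_kb}K per I/O = {iops_needed:,} IOPS")
# -> 32,000 IOPS, i.e. the ~30K figure above

for drive, iops_per_drive in [("7.2K SATA", 100), ("15K SAS/FC", 175)]:
    drives = -(-iops_needed // iops_per_drive)    # ceiling division
    print(f"  ~{drives} x {drive} drives (at ~{iops_per_drive} IOPS each), before RAID overhead")
```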
The more cost effective solution to increasing IOPS is using flash memory arrays either as SAN attached storage or as an NFS cache. You can fit 7TB into a single 3U system with more than double the bandwidth you need.
If you’d like to discuss this option, please contact Violin Memory at support@vmem.com.
Good luck
Steve
@Joe Kraska
That is only true for the MD3000, where the active-active RAID controllers inside the MD3000 restrict the drive firmware. Dell’s H800/H700/H200 HBAs have also started enforcing drive-firmware restrictions. But you won’t have a problem with an LSI 9200-8e controlling MD1000s. If you don’t like Dell, the Supermicro SBB JBOD chassis is a perfect replacement for the MD1000, even with upgraded LSI SAS 2.0 expanders.
You may think of my solution as the “ghetto” solution. But in reality, that’s exactly what Dell/Oracle end up selling you, only with a reputable-brand re-badge job.
@TS
I think Pogolinux sells exactly what you’re describing: the SuperMicro SBB with Nexenta, and external shelves if you want them. It reduces the number of support points.
A
ps no affiliation, just saw them in passing
This’ll be contentious.
At your price range (you want clusters, yes?), get RDB and some new Hitachi direct attach. Take the hand-holding. Get refunds.
A lot of speculation here, but what makes sense to me is Oracle playing FS gurus, and it not working well downstream.
– j
I’d rule out a network problem first. Get yourself an x86 box and attach it in place of the SAN. Run iperf on both sides and verify that the switch(es) can pump the kind of bandwidth you’re expecting. As someone pointed out earlier, MOST Ethernet switches are oversubscribed (2:1 or even 4:1 in some cases, because of ASIC sharing across multiple ports). Depending on your port availability, it might be as simple as trying ports 1, 5, 10, and 15 and then bonding those together. Bottom line: you need to isolate each subsystem and troubleshoot them individually. Start with the network.
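If iperf isn’t installed on both ends, a crude stand-in is easy to write. This is only a rough sketch (one TCP stream, default socket options, hypothetical script name), so it mostly answers whether the path can get anywhere near line rate at all; use iperf for real numbers.

```python
#!/usr/bin/env python3
"""Crude single-stream TCP throughput check between two hosts.

Not a replacement for iperf: one stream, default window sizes, no warm-up.

Usage (hypothetical example):
    on the server:  python3 tput.py server 0.0.0.0 5001
    on the client:  python3 tput.py client <server-ip> 5001
"""
import socket
import sys
import time

CHUNK = 1 << 20      # 1 MiB send/receive buffer
DURATION = 10        # seconds the client transmits for

def server(host, port):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    conn, addr = srv.accept()
    total, start = 0, time.time()
    while True:
        data = conn.recv(CHUNK)
        if not data:
            break
        total += len(data)
    secs = time.time() - start
    print(f"received {total/1e6:.0f} MB in {secs:.1f}s = {total/secs/1e6:.1f} MB/s from {addr[0]}")
    conn.close()
    srv.close()

def client(host, port):
    buf = b"\0" * CHUNK
    conn = socket.create_connection((host, port))
    total, start = 0, time.time()
    while time.time() - start < DURATION:
        conn.sendall(buf)
        total += len(buf)
    conn.close()
    secs = time.time() - start
    print(f"sent {total/1e6:.0f} MB in {secs:.1f}s = {total/secs/1e6:.1f} MB/s")

if __name__ == "__main__":
    role, host, port = sys.argv[1], sys.argv[2], int(sys.argv[3])
    (server if role == "server" else client)(host, port)
```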
Oh! Another note: anyone looking at moving to 10GbE needs to take a VERY hard look at Arista. You can pick up a 7148S (48 10Gb SFP+ non-blocking WIRESPEED ports in a 1U package) for under $20k; we’re talking ~$400 per 10GbE port. Or for very high density, their 7500 series (2010 Best of Interop Grand Prize winner) packs 384 10GbE ports (again wirespeed, no oversubscription) into an 11U package. Andy Bechtolsheim started the company; he was a Sun founder who also went on to start Granite Systems, which built 1GbE switches in the 1990s that Cisco bought before making Andy the head of their gigabit switching line for years.
No affiliation, just a fan 🙂
If I had to guess, I would say it’s your switch or switches.
Put your SAN ON ITS OWN SWITCHES and make sure those switches have a large backplane (no cheapie gig switches with a tiny little backplane). I have seen waaaaay too many people try routing SAN traffic over the same switches as the rest of the network. EVEN using VLANs, you are NOT going to get good results. It will be crap. Don’t let someone tell you different.
John,
can we please know if any raw sockets are being used there?
– John
(I did write another comment, but I was on a link which timed out. Just random. You guys are right to fix the network gear, I just wish the OP could bypass it all.)
@Bob
You mentioned that the R710 is bonded. The R710 has 4x 1GbE onboard interface ports, so I assume there is some kind of bonding in use (the Linux bonding module or other link aggregation) to get 4Gbps of bandwidth.
But most bonding techniques assign only one link per connection, which is just 1Gbps when the connection is made between one machine and another, since the source and destination MAC/IP/port are the same.
In your case, 50MB/s over NFS sounds like your bonding algorithm is assigning just one link and not using all four.
But there are other bonding modes that use all four links at the same time, like balance-rr (mode 0), which round-robins packets across all the links, or mode 6 (balance-alb), adaptive load balancing (the preferred one). You can try these modes and see if you can increase NFS performance.
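A quick way to check which mode the bond is in and whether the slaves are actually sharing the load (a minimal sketch assuming the standard Linux bonding driver; “bond0” is just a placeholder for whatever your bond interface is called):

```python
#!/usr/bin/env python3
"""Report the bonding mode and per-slave traffic counters for a Linux bond.

Assumes the standard Linux bonding driver (/proc/net/bonding/<bond>) and
sysfs byte counters; "bond0" is a placeholder for your interface name.
"""
import sys

BOND = sys.argv[1] if len(sys.argv) > 1 else "bond0"

def bond_info(bond):
    """Return (bonding mode, list of slave interface names)."""
    mode, slaves = None, []
    with open(f"/proc/net/bonding/{bond}") as f:
        for line in f:
            if line.startswith("Bonding Mode:"):
                mode = line.split(":", 1)[1].strip()
            elif line.startswith("Slave Interface:"):
                slaves.append(line.split(":", 1)[1].strip())
    return mode, slaves

def byte_counters(iface):
    """Return cumulative rx/tx byte counts for one interface."""
    counts = {}
    for direction in ("rx", "tx"):
        with open(f"/sys/class/net/{iface}/statistics/{direction}_bytes") as f:
            counts[direction] = int(f.read())
    return counts

if __name__ == "__main__":
    mode, slaves = bond_info(BOND)
    print(f"{BOND}: mode = {mode}")
    for s in slaves:
        c = byte_counters(s)
        print(f"  {s}: rx {c['rx'] / 1e9:.2f} GB, tx {c['tx'] / 1e9:.2f} GB")
```

Run it before and after a test load; if only one slave’s counters move, the bond is not spreading your NFS traffic.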
If your NetApp has enough spindles in its RAID group and can deliver much higher NFS throughput, then the network is definitely your bottleneck. If 10Gb infrastructure is available, it will be well worth upgrading both the filer and the R710 to 10GbE.
Good luck!
-Raj
Regarding Raj’s hope that bonding multiple 1GbE links can produce NFS read throughput higher than that of a single 1GbE link, I am not optimistic. For one, the choice of link aggregation on the client has little to do with packets flowing in the client’s direction. For two, even if the switch facing the client is configured with a suitable link aggregation setup, I know for certain that Cisco configurations won’t produce the desired throughput results. They just don’t work that way, I’m sorry to inform you.
Be that as it may, from extensive testing I know that you can get near 124MB/s of link utilization (including line overhead) with good Cisco switches (e.g., the 4948) and NetApp 3000-series filers. That would be with large-block, continuous, front-to-back-of-file I/O. If the situation is more random and read-oriented, consider a PAM card. If the situation is write-oriented and random, you’ll have to tell us the IOPS instead, as we cannot assess whether your results are good or bad from an MB/s figure in that case.
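That 124MB/s figure lines up with simple line-rate arithmetic, by the way (a rough sketch; the framing overheads below are approximations for standard Ethernet/IP/TCP headers, not measurements):

```python
# Rough ceiling for a single GbE link carrying TCP traffic. The per-frame
# overhead numbers are approximations for standard framing, not measurements.

line_rate_bps = 1_000_000_000             # 1 Gb/s
raw_mb_s = line_rate_bps / 8 / 1e6        # 125 MB/s of raw bits

# Per-frame overhead: Ethernet header + FCS (18 bytes), preamble + inter-frame
# gap (20 bytes), and IP + TCP headers (40 bytes, no options) inside the MTU.
for mtu in (1500, 9000):
    payload = mtu - 40                    # TCP payload per frame
    wire = mtu + 18 + 20                  # bytes actually on the wire per frame
    print(f"MTU {mtu}: ~{raw_mb_s * payload / wire:.0f} MB/s usable TCP payload")
# -> about 119 MB/s at a 1500-byte MTU and ~124 MB/s with 9000-byte jumbo
#    frames, so 50 MB/s is well below what even one GbE link can carry.
```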
Regardless of all this, a few things:
For the network, make sure that flow control is “on” (in the switch). If you are running jumbo frames (~9K packets), also make certain they are configured end-to-end in this system. As for the filer itself, you’ll need two shelves of 15K drives before you are configured with enough disk to run at full speed. If you are using 7200RPM drives instead, you’ll need four shelves. If IOPS are your desire, you should stick with the faster drives. And if you are doing read-heavy IOPS, I’ll say again: consider the PAM card.
All this said, nothing really compares to a local JBOD. It’s all in the latencies, and it’s pretty hard to beat the latencies of direct attached storage.
A tool like CopperEgg’s FocusOnNAS could help you understand the issues and bottlenecks in NFS performance.
Ok, so I am biased. 🙂
Feel free to contact me (anderson at copperegg com).
Or: http://CopperEgg.com
Eric