Reader Kyle asks a good question:
SANs are advertised up the wazoo as having lots of internal redundancy such as redundant power, redundant controllers, etc. I’ve spent enough time with redundancy to know that having two pieces of hardware often doesn’t cut it. I was wondering what the real story is from someone who has spent a lot of time in the storage space. Do complete SAN failures really pretty much *never* happen or are they just on the rare side? If so what are the common points of failure? Perhaps people, the OS, non-redundant hardware parts?
Please, SAN folks, tell StorageMojo readers your experience. In the meantime, here’s
The StorageMojo take
Kyle asks two questions: how reliable and available are the individual devices that make up a SAN, and how reliable and available is the system – the SAN as a whole?
Redundancy is aimed at ensuring availability. But because redundancy increases the component count, it also means more component failures.
Failures of redundant components shouldn’t affect availability – assuming, that is, that failures are not correlated. That assumption turned out not to be true of RAID arrays, making them less available than advertised.
How much redundancy is enough? Customers generally prefer triple redundancy if they can afford it, partly for availability and partly for performance: losing a third of system performance is less painful than losing half. But for the moonshots, NASA chose quintuple redundancy on critical systems.
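To make the availability arithmetic concrete, here is a minimal sketch of the textbook parallel-redundancy formula, assuming independent failures and an invented 99.5% per-component availability (not vendor data). The correlated-failure caveat above is exactly what breaks this math in practice.

```python
# Minimal sketch: availability of N redundant components, assuming
# failures are independent. The 0.995 figure is an illustration only.

def parallel_availability(component_availability: float, copies: int) -> float:
    """System is up as long as at least one of the copies is up."""
    return 1 - (1 - component_availability) ** copies

a = 0.995  # hypothetical availability of a single controller/fabric
for n in (1, 2, 3, 5):
    print(f"{n} copies: {parallel_availability(a, n):.9f}")
# 1 copy gives 0.995, 2 copies ~0.999975, 3 copies ~0.999999875 --
# but a shared firmware bug on every copy collapses it back toward 0.995.
```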
Yet I’d guess that most readers are more concerned about SAN system availability – which includes not only what we consider SAN hardware, but also the server-side HBAs, drivers and management software. It is here that the nastiest bugs lurk: untestable interactions between applications, drivers, firmware and architecture that bite us hard – and crash entire SANs.
But what is your experience, gentle reader? Many of us would like to know.
Courteous comments welcome, of course. Update: Bayesian analysis is the best tool to evaluate system-level availability, as noted in this StorageMojo video. Sadly, the tool referred to is no longer online. Anyone want to take a whack at a new one?
We’ve been running a clustered iSCSI SAN for about 7 years now (HP P4000, formerly known as LeftHand). Individual devices are about as reliable as standard servers (which they are in this case, just HP DL-series boxes running Linux and the storage software). The individual boxes have indeed gone down (bad RAID controller, flaky RAM stick, etc.). But the cluster as a whole, and all volumes, have had 100% availability and zero data loss in 7 years. So clustering unreliable components works.
Of course, you need redundant networking in place too for a clustered SAN. In a typical cluster design like the HP P4000, there are 3, 5 or 7 nodes acting as “cluster managers”, and they vote on availability and reachability of individual storage nodes.
In traditional “monolithic” SANs, you have just two controllers connected to each other and a bunch of disk shelves. In order to avoid the “split brain” that could happen in such a setup, typically some other component (disk shelf or FC switch) acts as a tiebreaker and one controller is shut down or forced into passive mode to prevent both nodes from writing to the same disks at the same time. I have no direct experience with such a setup, but they sell well, so it probably works too.
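For readers unfamiliar with the voting scheme described above, here is a rough sketch of the basic majority-quorum check that such clusters rely on; the function and numbers are illustrative, not the P4000’s actual implementation.

```python
# Illustrative majority-quorum check (not vendor code): with an odd number
# of cluster managers (3, 5 or 7), at most one network partition can hold
# a majority, which is what prevents split-brain writes.

def has_quorum(reachable_managers: int, total_managers: int) -> bool:
    return reachable_managers > total_managers // 2

# Example: a 5-manager cluster split 3/2 by a network fault.
print(has_quorum(3, 5))  # True  -> this side keeps serving volumes
print(has_quorum(2, 5))  # False -> this side goes passive to avoid split brain
```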
I can say from my experience with 2 different largish SANs, maybe 50-100TB each: one had a complete failure, and parts replacement was an issue in getting it up again; the other didn’t fail completely, but the lower-cost disks, i.e. non-FC disks, had massive performance issues and were unusable, thus driving up the cost considerably.
I would personally only consider a SAN as available as it was designed to be, and I would also factor in the complexity of operating the entire system (i.e. more complexity can lead to human error and downtime) – and most are designed for dual redundancy.
I think large SANs are often given much more credit than they are due in terms of uptime. I don’t see head-based storage dominating for much longer as scale-out type solutions take hold, i.e. drop the cost of the raw storage but spread it out over more hosts, e.g. Isilon, Gluster, OpenStack Swift, etc.
My experience is with an enterprise-class Fibre Channel SAN. It’s large, hosting many TB of data, and has thousands of ports. We see an outage on a large number of hosts once or twice a year, usually lasting a couple of hours – including time to recover the hosts. These outages have come in two types:
1) A partial failure somewhere in the SAN, on an ISL where the connection will flap up and down or send lots of errors. Rather than recognizing the issue and offlining the failing path, the SAN will just wreak havoc on the datapath, causing SCSI timeouts (remember, FC is a lossless protocol); many machines will offline the disk, crashing their file systems, etc.
2) All our FC disk arrays have a virtualization layer between them and the server. If a disk frame is saturated with IOPS and not responsive enough to the virtualization layer, it will offline the disk array, killing all the servers that rely on it for their storage. I think this could happen without the virt layer in between; it would just take enough SCSI command timeouts between your machine and the host.
This gentle reader (in a small/mid-sized datacenter) is happy to report that while we have suffered outages due to failure over the years, we have never had a catastrophic failure resulting in data loss. All of the outages that I can think of were not related to a single component but rather some kind of interdependent issue. For example (without gory details):
The core storage bits are running but the network dies, which “kills the SAN” (this has happened multiple times).
The storage hardware is running fine but a particular tier misbehaves which causes the whole virtualized stack to crater.
In my experience the SAN (meaning all of the stuff needed to deliver data) overall is fairly reliable. The most reliable component overall (by far) is the storage hardware. The least reliable component is (no question) the humans.
NASA also required two independent teams to develop critical software, to avoid correlated software failures.
How many SAN vendors do anything like that?
First, thank you for posting my question. It will be great to see what your readers think.
You mentioned "Bayesian analysis is the best tool to evaluate system-level availability". Although less formal and less precise (as far as probabilities are concerned), you can draw up a fault tree in Visio to try to brainstorm deductively about what could go wrong. I did a write-up a year or so ago on this at http://blog.serverfault.com/post/747186396/ . It is a technique that might be well suited for smaller businesses or more amateur analysts.
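As a rough illustration of the fault-tree idea (not Kyle’s Visio write-up, and with invented probabilities), here is a sketch of how AND/OR gate arithmetic combines component failure probabilities, assuming independent events:

```python
# Toy fault-tree arithmetic with made-up numbers, assuming independence.
# An OR gate fires if any input event occurs; an AND gate fires only if
# all inputs occur (i.e. the redundancy has to fail on both sides).

def p_or(*probs):   # probability that at least one input event occurs
    p_none = 1.0
    for q in probs:
        p_none *= (1 - q)
    return 1 - p_none

def p_and(*probs):  # probability that all input events occur
    p_all = 1.0
    for q in probs:
        p_all *= q
    return p_all

both_controllers = p_and(0.01, 0.01)    # both controllers down together
both_fabrics     = p_and(0.005, 0.005)  # both fabrics down together
human_error      = 0.02                 # someone pushes a bad zoneset
outage = p_or(both_controllers, both_fabrics, human_error)
print(f"annual outage probability ~ {outage:.4f}")  # dominated by the human term
```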
We have some fairly large SANs comprising many SAN switches. All servers connect to dual/redundant SANs.
We have had several bad situations.
1) One time we had switches rebooting themselves. The problem turned out to be a bug in the microcode that handled the Ethernet management port, which caused a memory leak. At some point the switch would run out of memory and reboot. With switches rebooting on both sides of redundant fabrics, we had problems! In other words, regardless of the hardware redundancy we were still running the same microcode with the same bug on both sides. We had switches out on both redundant fabrics concurrently, but never such that we lost redundancy to particular servers.
2) We had a situation where certain VMware servers could not handle failover events. This turned out to be a driver code problem. It resulted in full outages on some VMware servers when they didn’t handle the failover events related to storage array code upgrades.
3) We had a case where third-party SAN management software hit a bug where it pushed out a bad zoneset and took out one entire side of the redundant fabrics.
My experience is that the hardware is fairly reliable but the software can cause big problems.
I completely concur with Robin’s last point. For one organization that I worked for it was how we designed our directory layout that doomed our very pricey NetApp SAN. We hit a practical wall with the number of files in our directories and the thing ended up crashing every other day until it was rectified. Not fun at all.
@Nik:
I ran into that NetApp OnTap FS (was it WAFL?) limit once (http://serverfault.com/questions/76018/maximum-number-of-files-in-a-single-directory-for-netapp-nfs-mounts-on-linux/76325#76325). The generic solution I thought of was just to hash the file name with something like md5sum and use the first N characters as directories. In theory that would create a reasonably balanced tree structure: (http://serverfault.com/questions/217043/optimal-folder-structure-for-storing-100k-files-on-a-usb-drive/217054#217054). Never had the chance to try it, though, and see how it actually works.
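A minimal sketch of that hashing idea, assuming you shard on the leading hex characters of an MD5 of the file name; the two-character, two-level layout is just one reasonable choice, not something either thread prescribes:

```python
# Hash-based directory fan-out: hash the file name and use leading digest
# characters as nested directories so files spread evenly instead of
# piling up in one huge directory.
import hashlib
import os

def sharded_path(base_dir: str, filename: str, levels: int = 2) -> str:
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * 2:(i * 2) + 2] for i in range(levels)]
    return os.path.join(base_dir, *parts, filename)

# e.g. /export/files/<xx>/<yy>/invoice-000123.pdf
print(sharded_path("/export/files", "invoice-000123.pdf"))
```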
I’ve been running NetApp, Brocade, and HP server gear for years and found them to be pretty much bulletproof. Outages due to power source migrations and the odd driver/firmware/kernel kludge have happened. Servers tend to be the most brittle part. Running dual, clustered, redundant systems has always served me well.
Working in Tech Support I saw lots of situations where redundancy did not help at all, unfortunately. As described above, a bug can hit both fabrics, as in many cases they are in the same situation (being on the same code, consisting of similar devices, polled by the same monitoring tools, managed by the same scripts, parallel zonings, etc.). The same goes for conceptual flaws and mis-configurations, because they are also often made in parallel. Also, the points where both fabrics come together again can face problems which render the redundancy useless, such as bugs in multipath drivers or storage array code.
From a plain SAN point of view, I think bottlenecks such as mis-configured or over-utilized ISLs, and especially credit starvation / congestion introduced by slow-drain devices, can be a killer for a whole environment’s performance. A slow-drain device is usually connected to both fabrics, so if, for example, a host is not able to cope with the incoming traffic in a timely manner, it will often congest both fabrics. For Brocade-based SANs I recently wrote an article about the bottleneckmon – a tool that helps find bottlenecks: http://seb-t.de/bneck
Don’t forget the human component of the system. The worst enterprise SAN event I’ve seen was when someone blew away a LUN configuration, including the backup.
I’m in the camp that, for the most part, serious faults in SANs are pretty rare (I have a couple of war stories from that side). But all you really need to do is dig into the release notes as to “what is fixed” in a particular release of SAN firmware to see what kinds of horrors there may be in store if you happen to be unlucky.
Or talk to the insiders at the manufacturers. I know about tons of nastiness that various customers have encountered this way, but I don’t talk about it since, well, it’s private information.
Things are only as strong as their weakest link, which is usually the software. Both of my biggest SAN failures involved both controllers failing at the same time (different companies, different manufacturers). Controller #1 detected a fault and rebooted, then controller #2 encountered the same fault and rebooted before controller #1 was back up. It sucked, hard, as I’m sure you all can understand. Both failures involved days of trying to recover data (or flat-out deleting corrupted data); after the first failure we were still occasionally encountering corrupted data a year later (big Oracle DB).
In both failures there really was no budget for a backup storage system, so the main system was IT.
All things considered though, my life is made a lot easier with a good SAN than with something that’s likely going to be a lot more complicated. So provided I have the budget to get a 2nd system (or can set expectations in the event we don’t), I can live with the rare failure – a few days of pain every half dozen years – versus the constant pain of something more annoying to deal with every day.
At the last company I was at I remember one comment from the VP of engineering; he said, “Well, the XXX (insert vendor here) wasn’t supposed to go down, but it did.” I said, “Yeah, true – they make replication software for a reason; it’s up to the customers whether or not they want to take advantage of the functionality.” Ironically enough, after that “incident” I happened to be working on a DR project which would have prevented data loss had we had another event, and the company ended up killing the budget for the DR project to direct the funds to another project that they had massively under-budgeted.
The other side of fault tolerance is how you measure downtime. If you have a two-controller storage system and one controller has failed gracefully, so now you’re running on a single controller, in write-through mode, for the next 4 hours while your vendor diagnoses the issue and gets replacement parts & staff on site – the remaining controller may or may not be able to handle the load, so your application may be suffering and you may be out of SLA. I suppose you can work around this to some degree by massively undersubscribing your system(s); one storage guy I talked to recently suggested that on the NetApp he was proposing we not load the controllers to more than 40% of total capacity, which makes sense for that kind of situation but seems so wasteful. I’d rather have a storage system that is more efficient in its utilization of resources.

On my last big array, the engineering department about shit their pants when they saw how much I/O we were shoehorning through the SATA disks on our SAN. They insisted we get more disks even though I had a year’s worth of data that showed we didn’t need them. It was funny. I gave the quote to my boss and said you can go with option A, or B, or not do anything. I was surprised how quickly he signed off on the most expensive upgrade option. I gave my notice to leave the company the next day (totally unrelated to the situation, but kind of funny).
The same goes for disk failures and the latency incurred during rebuilds – and the risk of double/triple disk failures if your array takes too long to rebuild.
One company I was at, for some reason, decided to run SQL Server on top of CIFS (!) of all things; Microsoft even approached them to do a case study years ago, though they never went through with it. But SQL Server was so sensitive to latency on CIFS that they ended up devoting a 110-disk array (10K RPM) to databases that were pushing in the realm of maybe 1,000 IOPS at peak. If you breathed on it wrong, SQL Server would drop the DBs. This storage system was ONLY capable of NAS at the time, so they could not use something like FC or iSCSI to talk to the disks; it was NFS or CIFS.
I have seen several 497-day bugs that reboot your switches.
If all your switches are installed and updated at the same time… 497 days later, both your redundant, independent fabrics reboot at the same time. Duh!
How to avoid? Always space reboots so uptimes are never the same.
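For those wondering where the oddly specific 497 days comes from: the usual explanation for that class of bug is a 32-bit uptime counter ticking every 10 ms and wrapping around – back-of-the-envelope arithmetic, not a claim about any particular vendor’s firmware.

```python
# Why "497 days"? A 32-bit counter of 10 ms ticks wraps after roughly:
ticks = 2 ** 32          # counter range
tick_seconds = 0.01      # 10 ms per tick
print(ticks * tick_seconds / 86400)  # ~497.1 days until the counter wraps
```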
I have seen many installs where the redundant pair of SAN switches are installed on top of each other with cabling run in such a way as to make replacing either switch impossible without disconnecting both.
How to avoid? Always install redundant hardware in different racks.
Our customers usually start the conversation by trying to plan for every possible failure scenario, which always starts out with the question “what happens if the SAN or storage goes away?”, to which I tell them our data centers are FAR more likely to fail than our SAN and storage (good N+1 power, cooling, etc. are much less sexy than big storage but just as costly). Also, I consider the Storage Area Network a separate entity, physically and logically, from the storage frames. We have two redundant fabrics that have no logical or physical interconnect, so we can, and have, taken out entire core nodes on one fabric at a time without any service interruption. And yes, you never rack big core node switches in the same rack.
Hello,
Over the years I’ve been responsible for operating a number of different vendors’ equipment. I’ve never seen a platform that, across all of the instances we’ve run, has delivered 100% uptime. Some do get quite close, though – but to make up for that, some were really, really awful.
Some vendors have real uptime data aggregated from their dial-home data; you’ll need to sign an NDA to get it, but it’s worth trying to get your hands on if you can. It’s also worth understanding how they collect it and what it really means, so you can make meaningful comparisons between vendors.
You need to properly plan your overall architecture with the underlying assumption that ALL components (yes, even resilient ones that shouldn’t fail) will eventually fail. Hopefully your big expensive disk array will fail a lot less often than your 1k USD web server. However, it will fail eventually – you need to make sure that you’ve not made assumptions which, when it does happen, will hang you out to dry. You’d be surprised how often the components which are there to protect you from failures cause the failures – e.g. RAC causing your database to crash, or the clusterware breaking and taking everything down. There is a trade-off between increased complexity and uptime, and it’s a hard one to get right.
I would imagine that you’d get significantly better total uptime from a SAN which can do 99.995% uptime plus an application stack on top of it which can recover quickly in the event of failure, than from a SAN with 100% uptime and an application stack which can’t recover quickly. So don’t gold-plate the storage (or any part of the solution) if the money spent to do so would add more to total system uptime if spent elsewhere.
Direct answers to your questions
1) Complete SAN failures do happen – your vendors might (and some vendors definitely do) have uptime data if you ask the right questions and are planning on spending enough money, but you’ll definitely need to sign an NDA to get it. Most vendors can get to 99.99% or 99.999% uptime with their mid-range kit (see the downtime arithmetic sketched after item 2), but getting total system uptime beyond this requires a lot more than just a well-engineered SAN.
2) It’s not possible to get 100% uptime from any single thing or single set of things (especially a complex system, managed by humans).
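To put those percentages in perspective, here is a quick conversion of availability figures into allowable downtime per year – plain arithmetic, not vendor data:

```python
# Availability percentages translated into allowed downtime per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in (0.999, 0.9999, 0.99999):
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability:.3%} uptime -> ~{downtime:.1f} minutes of downtime/year")
# 99.900% -> ~526 minutes (~8.8 hours)
# 99.990% -> ~52.6 minutes
# 99.999% -> ~5.3 minutes
```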
What is your business uptime requirement?
Thanks,
Alex
In addition, regarding the “How much redundancy is enough?” paragraph in the StorageMojo take: in my opinion you should not have any performance degradation at all if one of two fabrics fails, because otherwise the second fabric can’t be regarded as redundancy. If you need both fabrics to achieve 100% performance for your applications, you have in fact no real redundancy! Some more thoughts about that here: http://seb-t.de/100
@Kyle Brandt
That was basically our solution. Each file related to an id which we hex converted then reversed. Then we did directories of 00-FF nested 4 levels to distribute the files. Worked pretty well. Was only a beast when we needed to crawl the structure to do some selective pruning.
1) firmware upgrades often cause outages, regardless of how expensive the SAN is
2) SANs have a lot of hard drives, yet often don’t realize a drive is failing until it’s actually dead
3) SAN volumes are a lot harder to support and tune than local storage. For example, running RHEL5 with ext3 on an LPAR might be a common thing, but you could be the first using the latest Ubuntu or ext4.
We ran into some firmware bugs with IBM’s DS4000 line about five years ago that caused us some serious headaches, but haven’t had any serious problems since.
I do second what mrperl says about firmware upgrades, but I would add boot from SAN as another complicating factor. I recall an instance where the customer was booting all of their “development” LPARs from SAN. They did a firmware upgrade on their DS4000 unit right before a critical code push and it went badly wrong. After we helped them clean up the firmware upgrade, the production LPARs booting from local disk were fine, but the development LPARs couldn’t boot at all. If I remember correctly the solution turned out to be unplugging half of the HBAs so that the LPARs could find their boot disks, but it took us a while to figure that out.
RAID with DAS often doesn’t realize a drive is failing until it’s already dead either; some DAS is so stupid that it can’t proactively fail a failing disk at all.
I remember two different times, the last time I used DAS (HP MSA), where a disk was failing and the controller knew it, but there was no way to force the drive offline short of physically pulling it from the enclosure (this was ~2006/2007). Leaving the drive in the array was *killing* performance to the point the array was practically unusable for production needs. HP told me at the time that the lack of ability to force a drive offline (something my 3ware cards have had for as long as I can remember) was a known issue and of course due to be fixed in firmware at some point. This was a remote facility (sort of – a 45-minute drive from the office), so not being able to force the drive offline was certainly annoying – and I didn’t trust remote hands to do that sort of work; they weren’t very technical.
As for tuning, it really depends on the SAN, for the most part tuning SAN volumes is SIGNIFICANTLY easier than local disk because you can convert between different levels of RAID online, you can move volumes to different tiers online, and with the very sophisticated caching algorithms in the controllers in a lot of cases you really don’t have to do anything.
Other controllers tout their ability to pin things directly in cache, or have fancy QoS algorithms to maintain SLAs.
Tuning for DAS is often significantly more complex because the system is so rigid: you often end up either massively over-provisioning to cushion against performance issues, or taking large amounts of downtime to do data migrations when things change significantly (unless you can manage to do the migration at a higher level such as Oracle ASM or something).
What happens when you grow a DAS? Add more disks? Re-striping existing data for the most part isn’t an option on DAS (again, unless you do it at a higher level like ASM), so you end up with hot spots all over the place, assuming you’re I/O bound.
Most organizations I have worked for really lack the information needed to effectively configure the data layout for a storage system (many lower-end storage systems – including the aforementioned MSA, even today; I was having a talk with a friend at HP storage on this just two days ago – simply don’t give you good performance metrics either). So the ideal solution in that situation is a storage system that can be adjusted on the fly without impacting application availability, so that when you do see these changes in patterns and get the real data, you can respond.
But most organizations aren’t to that point yet which is why industry utilization rates are still hovering well below 20% in a lot of cases.
I was at a company that was leveraging the Amazon cloud. This is the cloud! What was their disk utilization rate? 3%. THREE PERCENT. This was in large part due to two factors, the biggest being Amazon has cookie cutter designs for their virtual machines, forcing you to pay for hundreds of gigs (if not more) depending on the instance size even if you may only need 5GB. The second was striping of mysql databases over many volumes in the attempt to get more performance.
The scenario I mentioned above – where a company I was working at had devoted an entire rack of spindles to something that only did about 1,000 IOPS – happened because their storage system was not flexible enough and didn’t give enough performance information to make good decisions.
When I think of SAN I think of a modern intelligent infrastructure, I don’t think of a fibre channel fabric filled with HP MSA, Infortrend, LSI logic type stuff. Also I’m not thinking of older generation SAN systems while technically a “SAN” they had little to no intelligence or flexibility.
One NAS vendor I knew (they were bought by Dell) had the performance of their NFS clusters double (with the same number/type of back-end disks) simply by using a different back-end storage system.
Another NAS vendor I knew (they were bought by HDS) had a standard practice to disable the write cache on their disk subsystems because the caching algorithms were so bad they were able to get better performance by caching it up in their own cache and devoting the array cache to be completely a read cache.
SANs don’t always make sense of course; it really depends on the environment, how centralized the applications are, and what sort of scale you’re at.
DAS is really only good if you *really* know your application(s)’ I/O profiles, can plan in advance, and can handle storage intelligence at a higher level – or you just massively over-provision, knowing you won’t get good utilization out of the system and will have islands of storage all over your data center that in most cases are wasted since they’re decentralized.
The same goes for using physical servers vs virtual infrastructure. You can think of SANs as the VMware of storage. If you know your workload and know you can tax the server to its full extent, then you probably don’t NEED to go the VM route, since the biggest benefit of VMs is getting that 10:1, all the way to maybe 50:1 or greater, consolidation ratio.
I’m a SAN admin for a large hospital in Arizona.
My experience? Depends on the vendor.
We’ve had both IBM and EMC here in the shop, an IBM ESS F20 and 800… Later, we became an EMC shop. We currently have an EMC Clariion CX4-480, and a CX4-240 in Vegas. By far, the more problematic has been the EMC.
Case in point: Our CX4-480 has a fairly limited amount of cache. 4GB. It is very, very easy to have a 480 operating within specifications be in a state where the rate of incoming writes exceeds the controller’s ability to efficiently destage those writes to platter. Result: Your SAN is now dead in the water. With no acknowledgement on writes coming back to the hosts, things go south in a hurry. It isn’t very smart about smoothly mitigating this situation, as you might expect. It will literally allow itself to become clogged.
Worse, it is also fairly easy to exceed 50% CPU utilization on each SP simultaneously… so, in the event of a failover, you’ll have one remaining SP trying (and failing) to handle not only its own workload, but the computational workload of its peer. Same story. Result: your SAN is now dead in the water, intermittently deadlocked. And, as one might imagine, latency will hockeystick and cartwheel the hosts into a netherworld of latency on par with a DDoS attack.
Again, each of these problems can be encountered simply by virtue of normal use. Caveat emptor.
IBM was different. But then again, that was nearly 10 years ago. We never, ever, at any turn, questioned our faith in the SAN’s ability to accept or provide data. Both our ESS F20 and 800 were built like a Sherman tank. Ridiculously awful web front-end aside, and like much of IBM, the hardware was God-like but the software looked like it was written by teenagers at a summer-break computer camp.
Redundancy is not something you should hang your hat on. It’s like an alarm system on a car: what good is it if society has gotten so used to the sound of car alarms that no one pays attention? If your SAN is operating in such a way that redundancy doesn’t matter, you’re still exposed.
I’ve been working Fibre Channel and enterprise class storage for more than a decade and I have seen my share of issues. But I do think the technology is very reliable when compared to other IT stuff I have worked on.
The biggest mistake I see is not testing failover. With multiple host paths, multiple storage ports, the various masking and fabric zoning tools, multipathing software (MPIO, DSM, PowerPath), etc., you have to perform failover tests to ensure it is all implemented correctly. There are too many points where a mistake can be made. Companies spend all this money to ensure availability, and nobody tests it. Then there is a failure and everyone is surprised that it didn’t work. Only one path was properly zoned or masked. Both nodes of a cluster weren’t zoned or masked to all the same disks. Multipathing software was set up for failover but not failback. Vendor-recommended timeout values were not set.
The few times we’ve had a “SAN” outage, it’s really been a storage array outage. A couple of times it was the (unnamed) vendor’s engineer pulling the wrong CPU board out (if you pull the only other working processor board from a server, it’s going to crash). The other time it was a real bug in a drive firmware combined with a faulty line card that caused the crash.
From a topology perspective I’ve seldom ever seen a systemic failure, except one time when we were migrating our infrastructure to new core switches. But that is a risk that’s always going to be there. Could it have been mitigated if we reined in some cowboys in our shop? Probably…imho human errors happen because of poor planning (impatient cowboys jumping the gun).
Last week The Register wrote about an incident that Nordic IT-services vendor Tieto ran into with an EMC storage system in November.
Titsup EMC VNX kit unleashes 5 days of chaos in Sweden
http://www.theregister.co.uk/2012/01/13/tieto_emc_crash/
Flash drive meltdown fingered in Swedish IT blackout
http://www.channelregister.co.uk/2012/01/16/tieto_vnx5700/
I think the articles + comments are basic reading to anyone involved in or getting into storage systems. With great mass comes even greater inertia.