Reader Kyle asks a good question:
SANs are advertised up the wazoo as having lots of internal redundancy such as redundant power, redundant controllers, etc. I’ve spent enough time with redundancy to know that having two pieces of hardware often doesn’t cut it. I was wondering what the real story is from someone who has spent a lot of time in the storage space. Do complete SAN failures really pretty much *never* happen or are they just on the rare side? If so what are the common points of failure? Perhaps people, the OS, non-redundant hardware parts?
Please, SAN folks, tell StorageMojo readers your experience. In the meantime, here’s
The StorageMojo take
Kyle asks two questions: how reliable and available are the individual devices that make up a SAN, and how reliable and available is the system – the SAN as a whole.
Redundancy is aimed at ensuring availability. But redundancy also raises the component count, and more components mean more individual failures.
Failures of redundant components shouldn’t affect availability – assuming, that is, that failures are not correlated. That assumption turned out not to be true of RAID arrays, making them less available than advertised.
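To put rough numbers on that point, here is a minimal Python sketch comparing redundant units whose failures are independent with units that share a common cause of failure. The simplified "beta factor" split, the function names and every number in it are illustrative assumptions, not measurements from any real array.

```python
def parallel_availability(a: float, n: int) -> float:
    """Availability of n redundant units, assuming independent failures.

    The group is down only when all n units are down at once.
    """
    return 1.0 - (1.0 - a) ** n


def common_cause_availability(a: float, n: int, beta: float) -> float:
    """Availability of n redundant units under a simplified beta-factor model.

    Assumption: a fraction `beta` of each unit's unavailability comes from a
    shared cause (bad firmware, same drive batch) that takes out every unit
    together; the remainder is independent.
    """
    u = 1.0 - a                           # per-unit unavailability
    shared = beta * u                     # outage that redundancy cannot mask
    independent = ((1.0 - beta) * u) ** n # all units down by coincidence
    return (1.0 - shared) * (1.0 - independent)


def expected_failures_per_year(annual_failure_rate: float, n: int) -> float:
    """More components means more failure events to service."""
    return annual_failure_rate * n


if __name__ == "__main__":
    a = 0.99  # hypothetical per-unit availability
    for n in (1, 2, 3):
        print(
            n,
            round(parallel_availability(a, n), 6),
            round(common_cause_availability(a, n, beta=0.1), 6),
            expected_failures_per_year(0.02, n),
        )
```

In this toy model, even a small shared-cause fraction caps the availability that extra units can buy, which is the gap between advertised and observed availability described above.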
How much redundancy is enough? Customers generally prefer triple redundancy if they can afford it, partly for availability and partly for performance: losing ⅓ of system performance is less painful than losing ½. But for the moonshots, NASA chose quintuple redundancy on critical systems.
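The performance half of that argument is simple arithmetic. A small sketch, assuming the load is spread evenly across active units (an assumption; real controllers rarely balance this neatly), with `capacity_lost` a hypothetical helper name:

```python
from fractions import Fraction


def capacity_lost(n: int, failed: int = 1) -> Fraction:
    """Fraction of total performance lost when `failed` of n equal units die.

    Assumes every unit carries an equal share of the load.
    """
    if n <= 0 or not 0 <= failed <= n:
        raise ValueError("need n > 0 and 0 <= failed <= n")
    return Fraction(failed, n)


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(f"{n} units, one failure: lose {capacity_lost(n)} of performance")
```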
Yet I’d guess that most are more concerned about SAN system availability – which includes not only what we consider SAN hardware, but also the server-side HBAs, drivers and management software. It is here that the nastiest bugs lurk: untestable interactions between applications, drivers, firmware and architecture that bite us hard – and crash entire SANs.
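For a first-order feel of why the whole system is less available than any one box, here is a sketch that treats the data path as a chain in which every link must be up. The component list and availability figures are made up for illustration, and the model assumes independent links, so it cannot capture the cross-layer interaction bugs described above.

```python
from math import prod

HOURS_PER_YEAR = 8760


def series_availability(availabilities) -> float:
    """Availability of a chain where every link must be up (independent links)."""
    return prod(availabilities)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures, not vendor data.
    path = {
        "HBA + driver": 0.9995,
        "switch fabric": 0.9999,
        "array controllers": 0.9999,
        "management software": 0.9990,
    }
    total = series_availability(path.values())
    print(f"end-to-end availability: {total:.5f}")
    print(f"expected downtime: {downtime_hours_per_year(total):.1f} hours/year")
```

The chain is always worse than its weakest link, so a very reliable array does not by itself make a very reliable SAN.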
But what is your experience, gentle reader? Many of us would like to know.
Courteous comments welcome, of course.
Update: Bayesian analysis is the best tool to evaluate system-level availability, as noted in this StorageMojo video. Sadly, the tool referred to is no longer online. Anyone want to take a whack at a new one?