Stupid storage failures

by Robin Harris | Tuesday, November 25, 2008 | Architecture, Disk, SSD/Flash/NVRAM | 14 comments

Valiant but doomed
The ZFS discussion thread had an interesting comment from Sun’s Jeff Bonwick, architect of ZFS, on storage device failure modes. How do you know a disk or a tape has failed?

You don’t. You wait, while the milliseconds stretch into seconds and maybe even minutes. Jeff states the problem – and Sun’s solution – this way:

. . . we’re trying to provide increasingly optimal behavior given a collection of devices whose failure modes are largely ill-defined. (Is the disk dead or just slow? Gone or just temporarily disconnected? Does this burst of bad sectors indicate catastrophic failure, or just localized media errors?) . . . there’s a lot of work underway to model the physical topology of the hardware, gather telemetry from the devices, the enclosures, the environmental sensors etc, so that we can generate an accurate FMA [Fault Management Architecture] fault diagnosis and then tell ZFS to take appropriate action.

With all due respect to Jeff, that solution seems iffy: how will you ever keep up with all the devices and firmware levels needed to make that work?

A community of prima donnas
There are lots of messy failure modes in computer systems. The literature around the Byzantine Generals Problem (see Wikipedia; for a rigorous treatment, download The Byzantine Generals Problem by L. Lamport et al.) tackles the problem of the malicious server in a community of network servers. That is a hard problem.

Knowing whether a storage device is alive, dead or only sleeping shouldn’t be so hard. They have powerful 32-bit processors – more powerful than a VAX 780 – and lots of statistics on what the drive is doing.

It seems like a disk could give a modulated heartbeat signal to drivers – “ready” “reboot” “caught in retry hell” “dead” – to decrease uncertainty.
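
To make the idea concrete, here is a minimal Python sketch of what such a status signal might look like from the driver's side. The four states come straight from the examples above; the names `DriveStatus` and `driver_action`, and the actions chosen for each state, are hypothetical illustrations and not part of any existing drive interface.

```python
from enum import Enum


class DriveStatus(Enum):
    """Hypothetical heartbeat states, one per example in the text."""
    READY = "ready"
    REBOOT = "reboot"
    RETRYING = "caught in retry hell"
    DEAD = "dead"


def driver_action(status: DriveStatus) -> str:
    """Map a reported state to a plausible driver response (illustrative policy only)."""
    if status is DriveStatus.READY:
        return "issue I/O"
    if status is DriveStatus.REBOOT:
        return "hold I/O and wait"
    if status is DriveStatus.RETRYING:
        return "redirect I/O to redundancy"
    # Only DriveStatus.DEAD remains.
    return "fail the drive"
```

The point of the sketch is that the driver no longer has to infer the drive's condition from a timeout: each state maps to one unambiguous response.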

The StorageMojo take
Drive vendors may think that non-standards for drive condition reporting are a form of lock-in, but that misses the bigger picture: the quality and timeliness of condition reports – even with a standard format – would be a competitive differentiator.

At the margin it would help slow the move to commodity-based cluster storage by enabling array vendors to improve their error handling and perceived reliability. It would also help disks versus flash SSDs, whose perceived reliability is partly due to the gap between user-judged drive “failures” and vendor “no trouble found” test results.

Storage systems all know how to deal with disk failures – they have to. So drive vendors, how about getting together to help make knowing a drive’s status a lot easier? Hey, IDEMA, make yourself useful!

Courteous comments welcome, of course.

14 Comments

Steve Todd on Tuesday, 25 November, 2008 at 11:00 am

Robin,
When I was designing CLARiiON’s internal disk failure handling in the 80s/90s I had to struggle with this very same issue. At that time we came up with an atomic solution that would handle any “messy failure modes” that might occur: we turned off the power to the drive and marked it as dead. We decided we would rather shoot a drive than have it lead to a data integrity issue.
I’ve often wondered how software RAID on top of anybody’s JBOD could provide this same level of data integrity. It seems like you agree. I’m not sure standardizing error reporting will cover all the bases, however.
Steve
marc farley on Tuesday, 25 November, 2008 at 11:06 am

Robin,

I thought error handling was one of the last value adds that separates low cost drives from high cost drives? Remember how “enterprise” SATA drives were differentiated from desktop drives? The relationship between controller and device is client/server, as described in the SCSI standards. The controller sends commands and waits for a response. There is no close coupling, no heartbeat, no “Damn it Jim, we’re at full impulse power already and if we give it any more she’s going to blow!” I don’t think it fits the Byzantine Generals model very well because devices and controllers are not peers and devices usually don’t take actions themselves – except for error recovery.

I think what you are asking for is economically impossible for the disk drive industry. The market doesn’t want to pay for better error recovery because the market is mostly system and subsystem vendors who think they can solve the problem better anyway (like Sun). Disk drive vendors have historically been punished for trying to add functions that others further up the food chain want credit for.
Rob McCrea on Tuesday, 25 November, 2008 at 8:09 pm

Marc,

I don’t think this is about a hard drive manufacturer adding anything to the drive – other than a small firmware change. All the hard drives I see already keep a stunning amount of information on health and could easily give a status on the state of the drive. And, they do give this information when the service interfaces are used.

I think the bigger problem here is getting the HDD manufacturers to report in a standard way. SMART data could be useful but it isn’t really good for a yes or no status. But, returning something in a SCSI LOG SENSE command might be useful.

Robin,

“Drive vendors may think that non-standards for drive condition reporting are a form of lock-in, but that misses the bigger picture: the quality and timeliness of condition reports – even with a standard format – would be a competitive differentiator.”

I agree.

If every car manufacturer made a gas gauge read differently on each car it just makes life difficult for the driver. If the gas gauge is the same on each car that doesn’t reduce the competitive advantage one brand of car has over another. However, it does become apparent which car is better on gas …

I think it’s the last point that may be at the heart of the issue. Standard reporting may result in HDD manufacturers being held to a higher standard.
Robert Pearson on Tuesday, 25 November, 2008 at 8:43 pm

RE: …”With all due respect to Jeff, that solution seems iffy: how will you ever keep up with all the devices and firmware levels needed to make that work?”

Jeff seems to be talking about the first step in the long-overdue Event Management System (EMS).
Any event occurring in the “managed” IT infrastructure is registered.
The local decision making policy whether to monitor, alert or sound the crisis alarm is a very busy process. Of necessity it is a multi-level decision making hierarchy.
This is not a Unit of Technology function. Any of several Warning, Failure, Severe Failure levels can be indicated by the Unit of Technology if it is a Managed Unit of Technology.
If there was a defined procedure, protocol and API that Unit of Technology manufacturers could adhere to – would they?
Not having an Event Management System standard, feel free to be creative and design your own. I did.
Mine was process based.
Every Unit of Technology was registered when it was commissioned, de-commissioned, reconfigured or re-deployed. Every application process in use was registered, and de-registered when removed.
The manual system to do this would drive people crazy and be full of human errors so it has to be automated.
One way to do this is with the “infotone”. The infotone is a local way of giving Units of Technology and Managed Units of Technology, as well as application processes, a “known” local identity. The infotone runs constantly. Preferably out of band but it can run in band.
The original of this was pretty crude. Most people fell down laughing when it was mentioned. It would be pretty slick today, especially with “Flashman” in the Flash sub-layer.
The last thing I worked on was using “emoticons” for EMS status indicators in the NOC. Cryptic text messages just don’t cut it.
You could click on the emoticon and drill down for more information.
With SOA and “Flashman” the Event Management System would be awesome.
M.S. on Wednesday, 26 November, 2008 at 1:41 am

Steve,

I’m afraid I don’t get your point. Why can’t I apply the very same “strategy” for software RAID, too? Why should software (for making up a RAID) not be as good as a hardware RAID? Several database vendors (Oracle et al.) put software RAID into their flagship products – do you think they would do that while knowing it would compromise their data?!
Jeff Bonwick on Wednesday, 26 November, 2008 at 2:39 am

Standardizing disk failure modes would certainly be helpful, but it’s not enough. The problem is integration of information. As a disk drive vendor, you know everything there is to know about your disk drive’s innards. What you don’t know — but we, the systems vendor, do know — is which fans cool the drive, which power supplies power the drive, the thermal and vibrational envelopes of the enclosure, and so on. Since every disk drive exists in some known environment (laptop, array, blade, etc), the failure analysis is probably best done higher in the stack. My ideal disk drive would have very simple failure semantics: yes, no, or later, with no timeouts. The difference between no and later is just like dating: later probably means no, but it’s acceptable to retry some (small, respectable) number of times before giving up.
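
The yes/no/later semantics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Reply` and `submit`, and the default of three retries, are hypothetical stand-ins for the "small, respectable" number in the comment, not anything ZFS or a real drive interface defines.

```python
from enum import Enum
from typing import Callable


class Reply(Enum):
    """The three answers the idealized drive is allowed to give."""
    YES = "yes"
    NO = "no"
    LATER = "later"


def submit(request: Callable[[], Reply], max_retries: int = 3) -> bool:
    """Return True if the request succeeded, False if it should be treated as failed.

    LATER is retried up to max_retries times and then treated as NO.
    There is no timeout: the drive always answers, so the caller only counts replies.
    """
    for _ in range(1 + max_retries):
        reply = request()
        if reply is Reply.YES:
            return True
        if reply is Reply.NO:
            return False
        # Reply.LATER: fall through and ask again.
    return False
```

For example, a drive that answers later, later, yes succeeds on the third attempt, while one that only ever answers later is given up on after the retry budget is spent.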
hirni on Wednesday, 26 November, 2008 at 3:03 am

Disk drives nowadays are actually “rotating rust managed by a huge OS”.
Just add the ego-trips of the “RAID-OS code kids” – and here you go.
We ended up with a storage eco-system more complex than Washington DC’s lobby machinery.
Sure – everyone has excellent job security – but technologically it doesn’t bring you forward… – instead every year – we add more and more useless bloat to the whole thing enabling us to distinguish states like:
“something failed”, “was just kidding”, “no idea” … hmmm.
the storage anarchist on Wednesday, 26 November, 2008 at 8:47 am

M.S. –

I believe that Steve’s point is that an architected storage array can be designed such that individual drives can be shut down by turning off the power to them (and hard reset by restoring the power – which often will clear any errors). Software RAID on generic hardware lacks this ability, since few (if any) standardized/COTS server platforms include the ability to power off individual drives.

And FWIW, it’s astonishing how often power cycling is required, and how often it actually corrects the errors (or at least clears them long enough to snatch a copy of the data).
Rob on Wednesday, 26 November, 2008 at 2:20 pm

“Software RAID on generic hardware lacks this ability, since few (if any) standardized/COTS server platforms include the ability to power off individual drives.”

Forget powering them off. Get a real OS that:

– Retries IOs
– Will timeout a drive and give it the boot if unresponsive. Fix it later.

Shadow (host mirror) across datacenters if your data is important. You can’t
afford loss of a frame or datacenter to impact application availability.

Here’s one of the many parameters that you can modify in this OS to affect
behaviour:

SHADOW_MBR_TMO=10 ! Allows 10 seconds for physical members to fail over
! before removal from the shadow set

Yeah – VMS, old but reliable. Surely many study it to figure out how to
improve upon their OSes (many does not equal ALL).
Steve Jones on Thursday, 27 November, 2008 at 7:45 am

‘At the margin it would help slow the move to commodity-based cluster storage by enabling array vendors to improve their error handling and perceived reliability. It would also help disks versus flash SSDs, whose perceived reliability is partly due to the gap between user-judged drive “failures” and vendor “no trouble found” test results.’

I would have thought the opposite – one of the reasons that companies buy integrated, monolithic arrays is that they are very carefully tested with a very limited range of disk drives, I/O interfaces and so on. That includes a huge amount of work on soak testing, error handling, disk firmware levels, reporting and so on. They aren’t generally systems where you are allowed just to plug in any old drive.

On the other hand, storage clusters seem to be the other way round – far more commodity level servers, storage and drivers with mix-and-match. Indeed they are “just” software laid over the top of commodity servers and storage. To make those work reliably, it seems to me it is very important to have standardised exception behaviour. Even on top-end arrays, we have observed performance effects of non-deterministic behaviour during exception handling. With shared storage it doesn’t have to be that high to have huge knock-on effects (especially with timing sensitive things like shared disks in clusters).

With storage clusters, individual nodes still have exactly the same requirement to drive physical drives. It might be possible to standardise the exception handling at the cluster level so it can “map out” an aberrant node and use the cluster redundancy, but if that happens too often you will have severe operational problems. It might be that it is possible to tolerate the occasional bit of erratic device driver/disk behaviour in a single server as the extent is constrained (depending on the role of the server). However, once you move to a shared storage model then reliability, predictability and deterministic behaviour become much more critical.
Ask Bjørn Hansen on Saturday, 29 November, 2008 at 7:50 pm

Hoping for smarter drives to detect failures is sorta missing the point.

By definition when there’s a failure the drive isn’t working right; so you can’t trust whatever it says.

Of course the better the drives are at detecting and dealing with failures the better it’ll be; but that doesn’t mean the higher layers won’t have to get smarter, too.

Just because you can plan for some failures, doesn’t mean you can get out of planning for the unforeseen.

– ask
Pete Steege on Monday, 1 December, 2008 at 6:48 am

These issues are exactly what is gating SSD adoption in the enterprise. Disk drives are much more than media. These days their true value is in integration – making hundreds or thousands of them work in unison & collectively adapt to unpredictable changes.
Jason on Tuesday, 2 December, 2008 at 10:25 am

ANSI T10-DIF
Chuck McManis on Tuesday, 9 December, 2008 at 5:34 pm

Ok, why do you care?

On the surface that might sound specious but let’s look at this for a moment. Disk drives are probabilistic devices: they “probably” can return the data you wrote. Even if everything in the world is perfect, once in every 10^14 or 10^15 bits (depending on who you believe) a disk drive won’t give you back your data. Period. Does it really matter why?
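
To put a number on that, here is a rough Python calculation of the chance of hitting at least one unrecoverable read while reading a whole drive, for error rates of one per 10^14 and one per 10^15 bits. The 1 TB drive size, the function name `p_unrecoverable`, and the assumption that bit errors are independent are all illustrative choices, not figures from the comment.

```python
import math


def p_unrecoverable(bytes_read: float, bit_error_rate: float) -> float:
    """Probability of at least one unrecoverable bit when reading bytes_read bytes.

    Assumes independent bit errors: 1 - (1 - p)^bits, computed with
    log1p/expm1 so that the tiny per-bit rate is not lost to rounding.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-bit_error_rate))


ONE_TB = 1e12  # assumed drive size in bytes, for illustration only

p_at_1e14 = p_unrecoverable(ONE_TB, 1e-14)  # about 0.077
p_at_1e15 = p_unrecoverable(ONE_TB, 1e-15)  # about 0.008

print(f"1 in 10^14: {p_at_1e14:.3f}, 1 in 10^15: {p_at_1e15:.4f}")
```

Under these assumptions, a full read of a 1 TB drive fails somewhere roughly 8% of the time at the worse rate and just under 1% at the better one.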

I believe the answer is no, it doesn’t matter. You play the probability game the way everyone does: you add error correction bits outside the failure domain of a single drive and you use them. So rather than design a system where error correction is computationally complex, make it simple and use it. Consider a JBOD with 12 disks split into groups of four, each on a separate power supply and interface chain. Replicating x 3 (with standard on-disk block checksums to ensure the disk gave you the block you asked for) puts you into the five 9s category. Sure it’s 3x the raw storage but it’s 1000x more reliable than 1 drive. Worth it? And since computation on replication is essentially nil, if one of the drives is having a bad hair day, who cares? You note it and move on; if it continues to have issues you kick it out and replace it.
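
A minimal Python sketch of that replicate-and-checksum read path follows. It is an illustration only: CRC32 stands in for whatever block checksum a real system would use, and the names `write_block` and `read_block` are hypothetical. In a real layout each copy would sit on a drive in a different power and interface group; here the copies are simply items in a list.

```python
import zlib
from typing import List, Optional, Tuple

Replica = Tuple[bytes, int]  # (block data as read back, checksum stored at write time)


def write_block(data: bytes, copies: int = 3) -> List[Replica]:
    """Store the same block and its checksum on `copies` drives (modelled as list entries)."""
    checksum = zlib.crc32(data)
    return [(data, checksum) for _ in range(copies)]


def read_block(replicas: List[Replica]) -> Optional[bytes]:
    """Return the first copy whose data still matches its checksum, or None if none do."""
    for data, checksum in replicas:
        if zlib.crc32(data) == checksum:
            return data
        # This drive is having a bad day: note it and move on to the next copy.
    return None
```

The read path does no reconstruction arithmetic at all. A bad copy costs one extra read, which is the sense in which computation on replication is essentially nil.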

So the question of “machine learning for enhanced understanding of failure modes” is not worth it in relative terms. As long as drives have to be reliable enough to use in singles for desktops, using triples of them in the Enterprise will be just fine.

–Chuck