I wrote about how clouds fail on ZDNet today, but there was another wrinkle in the paper that I found interesting: high redundancy hurts. Counterintuitive?
This comes from the paper Gray Failure: The Achilles’ Heel of Cloud-Scale Systems, by Peng Huang, Chuanxiong Guo, Lidong Zhou, and Jacob R. Lorch, of Microsoft Research, and Yingnong Dang, Murali Chintalapati, and Randolph Yao, of Microsoft Azure. The paper explores the “gray failure” problem, where component failures are subtle, often intermittent, and thus difficult to detect and correct.
Go read the ZDNet piece to get the gist of their findings. This post focuses on the problem of redundancy reducing availability.
Department of redundancy department
Cloud networks are configured with high redundancy to better tolerate failures. A switch stoppage is usually a non-event because the protocols re-route packets through other switches. Thus redundancy increases availability in the case of a switch failure.
But some switch failures are intermittent gray failures: random and silent packet drops. The protocols see the dropped packets and resend them, so the packets are not re-routed. But the applications see increased latency or other glitches as those lost packets are resent.
Let’s say your cloud has a front-end server that fans out a request to many back-end servers, and the front-end must wait until almost all of the back-end servers respond. If you have 10 core switches fanning out to 1,000 back-end servers, there is an almost 100% chance that a gray failure at any one core switch will delay nearly every front-end request.
Thus, the more core switches you have, the more likely you are to have a gray failure somewhere, and, with a high fan-out factor, the more likely that gray failure is to delay nearly every front-end request.
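Here is a rough back-of-the-envelope version of that argument. The drop rate and packet counts below are my own assumptions for illustration, not numbers from the paper:

```python
# Back-of-the-envelope model of the fan-out argument. The drop rate and
# packets-per-flow are assumptions for illustration, not numbers from the paper.

switches = 10
backends = 1000
flows_through_bad_switch = backends // switches  # ~100 back-end flows cross the sick switch
packets_per_flow = 20                            # assumed packets per back-end response
drop_rate = 0.01                                 # assumed silent drop rate on the gray switch

# Probability that one flow through the bad switch loses at least one packet
# and stalls on a retransmission the front end has to wait out:
p_flow_delayed = 1 - (1 - drop_rate) ** packets_per_flow

# Probability that at least one of the request's flows is delayed,
# i.e. the whole front-end request is delayed:
p_request_delayed = 1 - (1 - p_flow_delayed) ** flows_through_bad_switch

print(f"P(one flow delayed)      = {p_flow_delayed:.3f}")     # ~0.18
print(f"P(whole request delayed) = {p_request_delayed:.6f}")  # ~1.0
```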
Ouch!
The StorageMojo take
The paper is a highly recommended read if you architect for or rely upon one of the major cloud vendors, especially if your main focus is software. While human errors are a major cause of cloud outages, the authors make the point that undetected gray failures tend to accumulate over time, stressing the remaining healthy infrastructure, and can lead to cascading failures and a major outage.
As anyone experienced with hardware can tell you, gray failures are regrettably common, and a total bear to diagnose and correct. The late, great Jim Gray coined the term Heisenbugs to describe them, because, like quantum particles, they behave differently when you try to observe them.
The bigger lesson of the paper, though, is that scale changes everything. Even the kinds of bugs that can take a 100,000-server system down.
Courteous comments welcome, of course. If you’re a cloud user, have you seen behavior that gray failures might explain? Please comment!
I’ve seen this play out in some very storage-specific ways too. For example, it’s pretty easy for an error or failure in one replica to make its entire replica set unresponsive as the others wait for it. Maybe they’ll eventually detect the failure, but often after way too long; effective availability was already reduced by the detection interval. Maybe they won’t detect the failure, and continue to believe the replica is up (because it responds to pings) even though it’s clearly misbehaving. Developers will argue that it’s a Byzantine failure and they don’t promise to handle those, but that excuse wears thin when it’s their code causing the failure. Availability was still reduced, and the more replicas you have the harder you can get bitten by this.
The most interesting case is when it’s not a failure but just a sporadic slowdown. You might hardly even notice if it’s just one, but sooner or later some other server needs to wait for the one that’s slow, then another one needs to wait for that one, and so on until the snowball has grown to include the entire cluster. This is particularly evident with wide striping or erasure coding, which can cause literally every server to have some kind of performance dependency on every other. Even if the individual probabilities are tiny, in a large enough system these “amazing coincidences” start to occur with depressing regularity.
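To put rough numbers on those coincidences, suppose each server is independently slow just 1% of the time and a read has to touch some number of them:

```python
# How often a read waits on at least one slow server, as stripe width grows.
# The 1% slowdown probability is an assumption for illustration.
p_slow = 0.01
for width in (3, 10, 30, 100, 300):
    p_wait = 1 - (1 - p_slow) ** width
    print(f"stripe width {width:3d}: P(read delayed) = {p_wait:.2f}")
```

At a stripe width of 100, nearly two out of three reads wait on a slow server.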
When greater redundancy drives greater dependency, it’s time to take a good hard look at whether the net result is still a good one.
War story time: Large newspaper needed a fully redundant in-house network. They used smart switching routers and the network was tested and proven to be redundant. Switch off one of the core routers and with minimal delay the traffic would be rerouted. Until the active core switch developed a memory problem and, being a smart switch, shut down that memory bank and struggled along, crippling all traffic. This smart switch did not know that it was part of a redundant network and that it could and should pass on the responsibility when sick, rather than trying to cope with the problem itself. The lesson here is: for redundancy to work, every component needs to be designed/configured with redundancy awareness.
Oh, this one makes me cringe.
How many years has it been that we’ve listened to so-called ‘software engineers’ insist that they don’t need reliable hardware widgets, because they can achieve the same system reliability from a pile of cheap consumer/commodity widgets by virtue of using large numbers of them? The first example I can recall was when Google’s founders made a case for their ultra-cheap servers by comparing a mountain of them to a ridiculously priced IBM Power system with similar aggregate compute capability. Of course, being >>software<< engineers their reliability numbers failed to include the concepts discussed in this paper. They also failed to do the math on power consumption — but that's another story.
I think there is no such thing as a 'software engineer' — and that is why most countries — Canada for one — prohibit the use of the word 'Engineer' in such contexts.
Real engineers (the ones who study applied math and physics) know that both failures and failure modes increase exponentially as the number of interrelated widgets used to accomplish a given task increases. Real engineers have known this forever. Software 'engineers' are just figuring it out.
Another great paper, Robin — but I think its chief value is that it hopefully teaches arrogant software engineers that they need to go and talk to real engineers more often.
Hardware advances have caused a number of programmer problems. Higher-performance CPUs make coding for efficiency less appealing. This has made some programmers lazy and their managers unwilling to devote resources toward optimizations.
Why spend an hour optimizing a crucial function to be 50% faster when it will only save 50 milliseconds every time it is called? If it is called 10 times a month, then it is not worth it. When the code changes so it is now called 10 times a minute, it adds up fast. How much lost time and MW of power are wasted due to some inefficient code in a popular operating system or application?
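Run those numbers:

```python
# The same 50 ms saving, at the two call rates above.
seconds_saved_per_call = 0.050

calls_per_month_rare = 10                 # "called 10 times a month"
calls_per_month_hot = 10 * 60 * 24 * 30   # "called 10 times a minute", all month

print(f"rare path: {seconds_saved_per_call * calls_per_month_rare:.1f} seconds saved per month")
print(f"hot path:  {seconds_saved_per_call * calls_per_month_hot / 3600:.1f} CPU-hours saved per month")
```

Half a second a month versus roughly 6 CPU-hours a month, per instance running the code.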
When it comes to “big data” the architects just want to chop it up and fan it out to commodity servers because that is cheap and easy (in the short term). A whole new algorithm that would give you the same performance on 10 servers as the old one on 100 servers is not even explored half the time. They just want to throw more hardware at the problem. This leads to the gray failures you talk about.
I am working on a new data management system that focuses on speed and efficiency at the node level. If each node works 10x better because of efficient algorithms and data structures, then when you distribute it across a cluster you need far fewer nodes to do the same job.
Instead of making 100 copies of all the data, sending them out to 100 servers, and having every query look on each server, partition the data so that there are only 3 copies of each piece of data in the network. When a request comes in, only query the 3 servers holding that copy of the data, so the chance of hitting a gray failure is low. The trick is knowing which 3 servers have the data without querying all of them. With today’s architectures, that is easier said than done, which is why I am redesigning from the ground up.
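For illustration, one common way to answer the "which 3 servers" question (not necessarily what I am building) is consistent hashing, where any client can compute the placement locally instead of asking the cluster:

```python
# Consistent hashing with virtual nodes: a generic sketch, not the
# commenter's design. Any client can compute which servers hold a key.
import hashlib
from bisect import bisect_right

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=64, copies=3):
        self.copies = copies
        self.ring = sorted((_hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def servers_for(self, key: str):
        """Return the `copies` distinct servers responsible for `key`."""
        idx = bisect_right(self.points, _hash(key)) % len(self.ring)
        chosen = []
        while len(chosen) < self.copies:
            server = self.ring[idx % len(self.ring)][1]
            if server not in chosen:
                chosen.append(server)
            idx += 1
        return chosen

ring = HashRing([f"server{i:03d}" for i in range(100)])
print(ring.servers_for("customer:42"))  # the only 3 servers this query needs to touch
```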
Google uses strategies like canary requests, backup requests with cross-server cancellation, latency-induced probation, and proactively abandoning slow subsystems to mitigate some of these issues.
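A rough sketch of the backup-request idea, with client-side cancellation standing in for true cross-server cancellation (the query function, replicas, and delays are placeholders):

```python
# Backup requests: send to one replica; if it hasn't answered within a hedge
# delay, send the same request to a second replica, take whichever finishes
# first, and cancel the loser.
import asyncio

async def hedged_request(replicas, req, query, hedge_delay=0.010):
    tasks = [asyncio.create_task(query(replicas[0], req))]
    try:
        done, _ = await asyncio.wait(tasks, timeout=hedge_delay)
        if not done:
            # Primary is slow: fire the backup request.
            tasks.append(asyncio.create_task(query(replicas[1], req)))
            done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for t in tasks:
            if not t.done():
                t.cancel()  # best-effort cancellation of the slower request

async def _demo():
    async def fake_query(replica, req):
        # Pretend replica "a" has a gray failure (slow) while "b" is healthy.
        await asyncio.sleep(0.200 if replica == "a" else 0.005)
        return f"{req} answered by {replica}"

    print(await hedged_request(["a", "b"], "GET /key", fake_query))

asyncio.run(_demo())
```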
> both failures and failure modes increase exponentially as the number of interrelated widgets used to accomplish a given task increases.
Not necessarily. If a distributed system is properly designed, individual nodes’ ability to cover for each other will outweigh their dependency on each other, so adding nodes really will improve overall availability. Such systems exist. I’ve worked on a few. Unfortunately, they’re damn rare. Too many people read the papers (really submarine ads IMO) from Google or Amazon, which focus on the positives and gloss over the negatives, so of course they come away thinking that adding nodes is magic.
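To put rough numbers on the difference, with made-up figures:

```python
# Redundancy vs. dependency: three replicas, each independently 99% available.
p_up = 0.99
n = 3

p_any_replica_suffices = 1 - (1 - p_up) ** n  # replicas cover for each other
p_all_must_respond = p_up ** n                # replicas depend on each other

print(f"any one replica can serve: {p_any_replica_suffices:.6f}")  # 0.999999
print(f"all replicas must respond: {p_all_must_respond:.6f}")      # 0.970299
```

Same three boxes; whether adding them helped or hurt depends entirely on which way the dependencies run.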
The main gray failures I’ve seen in production are:
1) stale caches, often with excessive TTLs “to fix the origin outages”
2) load balancer health checks choosing only the “most healthy device” instead of “all healthy devices”
3) application-level monitoring always green despite failover pair(s) being down since “they got a 200 response back”
Attempting to interpret monitoring displays when the above issues were never resolved is pointless, since green doesn’t signify that the system is indeed working.
All of those issues increase latency, so adding latency checks can help (if you know what the latency should be in the first place).
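For example, a check that treats a slow 200 as a failure might look like this (the URL, timeout, and latency budget are placeholders):

```python
# Minimal sketch of a health check that fails on latency, not just on HTTP status.
import time
import urllib.request

def healthy(url="http://localhost:8080/healthz", latency_budget=0.250):
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=2.0) as resp:
            ok = resp.status == 200
    except OSError:
        return False
    elapsed = time.monotonic() - start
    # A 200 that arrives after the budget counts as unhealthy:
    return ok and elapsed <= latency_budget
```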
This leads into the difference between “availability” and “observability” in monitoring systems. In 2017, most are still primarily focused on the former, when we really need the latter for insight into how well the system is actually working.
Grey failures are a nightmare to diagnose; Heisenbugs is an excellent name for them.
Monitoring can help, but the problem is always the interpretation of the data and the fact that noise increases at a greater rate than signal. Expect a monitoring epiphany soon, similar to the one over big data more generally: namely, that this stuff is hard and not as useful as we thought.
As for grey failures, most of those discussed above appear to be network related, possibly another manifestation of the data volume problem.