I wrote about how clouds fail on ZDNet today, but there was another wrinkle in the paper that I found interesting: high redundancy hurts. Counter intuitive?

This comes from the paper Gray Failure: The Achilles’ Heel of Cloud-Scale Systems, by Peng Huang, Chuanxiong Guo, Lidong Zhou, and Jacob R. Lorch, of Microsoft Research, and Yingnong Dang, Murali Chintalapati, and Randolph Yao, of Microsoft Azure. The paper explores the “gray failure” problem, where component failures are subtle, often intermittant, and thus are difficult to detect and correct.

Go read the ZDNet piece to get the gist of their findings. This post focuses on the problem of redundancy reducing availability.

Department of redundancy department
Cloud networks are configured with high redundancy to better tolerate failures. A switch stoppage is usually a non-event because the protocols re-route packets through other switches. Thus redundancy increases availability in the case of a switch failure.

But some switch failures are intermittant gray failures: random and silent packet drops. The protocols see the dropped packets and resend them, so the packets are not re-routed. But the applications see increased latency or other glitches as those lost packets are resent.

Let’s say your cloud has a front-end server that fans out a request to many back-end servers, and the front-end must wait until almost all of the back-end servers respond. If you have 10 core switches that fan out to 1000 backend servers, you have an almost 100% chance that a gray failure at any core switch will delay nearly every front-end request.

Thus, the more core switches you have, the more likely you are to have a gray failure, and, with a high fan-out factor, the more likely you are to have a gray failure that delays nearly every front-end request.


The StorageMojo take
The paper is a highly recommended read if you architect for or rely upon one of the major cloud vendors, especially if your main focus is software. While human errors are a major cause of cloud outages, the authors make the point that undetected gray failures tend to accumulate over time, stressing the healthy infrastructure, and can lead to cascading failures and a major outage.

As anyone experienced with hardware can tell you, gray failures are regretably common, and a total bear to diagnose and correct. The late, great Jim Gray coined the term Heisenbugs to describe them, because, like quantum particles, they behave differently when you try to observe them.

The bigger lesson of the paper though is that scale changes everything. Even the kinds of bugs that can take 100,000 server system down.

Courteous comments welcome, of course. If you’re a cloud user, have you seen behavior that that gray failures might explain. Please comment!