Amazon’s outage was caused by a failure of the underlying storage – the Elastic Block Store (EBS). Here’s what they learned.

EBS
The Elastic Block Store (EBS) is a distributed, replicated storage service optimized for consistent, low-latency I/O from EC2 instances. EBS runs on clusters that store data and serve I/O requests, plus a set of control services that coordinate user requests and propagate I/Os to the clusters.

Each EBS cluster consists of EBS nodes where data is replicated and I/Os are served. The nodes are connected by two networks: a primary high-bandwidth network that carries traffic between the EBS nodes and EC2 instances, and a slower replication network intended as a backup and for reliable inter-node communication.
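
To make the topology concrete, here’s a minimal sketch in Python of one way to model a cluster and its two networks. The structure and the bandwidth figures are my own illustration, not Amazon’s actual configuration.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Network:
        name: str
        gbps: float                  # illustrative bandwidth, not Amazon's spec

    @dataclass
    class EBSCluster:
        nodes: List[str]
        primary: Network             # EC2 <-> EBS data traffic plus normal replication
        secondary: Network           # lower-capacity backup / inter-node network

    cluster = EBSCluster(
        nodes=[f"ebs-node-{i:02d}" for i in range(12)],
        primary=Network("primary", gbps=10.0),
        secondary=Network("replication-backup", gbps=1.0),
    )
    print(len(cluster.nodes), cluster.primary.gbps, cluster.secondary.gbps)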

Newly written data is replicated as quickly as possible. An EBS node searches the cluster for a node with enough spare capacity, connects to it and replicates the data, usually within milliseconds.
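
A rough sketch of that peer-selection step might look like this. The function names and the capacity model are my assumptions for illustration, not EBS internals.

    import random

    def find_replica_target(free_capacity, data_size, exclude):
        """Pick a peer with enough spare capacity to hold a new replica.

        free_capacity: node id -> free bytes (a stand-in for the real metadata).
        """
        candidates = [n for n, free in free_capacity.items()
                      if n not in exclude and free >= data_size]
        if not candidates:
            raise RuntimeError("no node with enough spare capacity")
        return random.choice(candidates)

    def replicate(source, data, free_capacity):
        target = find_replica_target(free_capacity, len(data), exclude={source})
        free_capacity[target] -= len(data)   # reserve space on the chosen peer
        return target                        # the real node would now stream the data

    cluster = {"node-a": 500, "node-b": 80, "node-c": 300}
    print(replicate("node-a", b"x" * 100, cluster))   # node-c is the only peer with room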

If a node loses connectivity to the node it is replicating to, it assumes that node has failed and searches for another node to re-mirror its data to. In the meantime it holds onto all of the data until it can confirm it has been safely replicated elsewhere.
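
In code, the re-mirroring behavior might look roughly like the sketch below. The data structures and names are mine; the real node logic is certainly more involved.

    def re_mirror(node, failed_peer, free_capacity):
        """Sketch of the re-mirroring step described above; all names are assumed.

        node: dict with "id" and "pending_writes" (data not yet safely replicated)
        free_capacity: node id -> free bytes on other reachable nodes
        """
        needed = sum(len(w) for w in node["pending_writes"])
        # Treat the unreachable peer as dead and look for any other peer with room.
        candidates = [n for n, free in free_capacity.items()
                      if n not in (node["id"], failed_peer) and free >= needed]
        if not candidates:
            return None          # nowhere to go: keep holding the data and retry later
        new_peer = candidates[0]
        # ...stream the pending writes to new_peer; only after it acknowledges them
        # is it safe to release the local copy.
        node["pending_writes"] = []
        return new_peer

    node = {"id": "node-a", "pending_writes": [b"x" * 64, b"y" * 64]}
    print(re_mirror(node, "node-b", {"node-b": 0, "node-c": 1024}))   # node-c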

The outage
During a network change on April 21 intended to upgrade primary network capacity, a mistake was made: the primary network’s data traffic was shifted onto the slower secondary network.

The secondary network couldn’t handle the traffic, which isolated many nodes in the cluster. Having lost contact with the nodes they were replicating to, the stranded EBS nodes searched for new replication targets, but the few nodes with spare capacity were quickly overwhelmed in a retry storm.
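
Some back-of-the-envelope arithmetic shows why the few nodes with spare capacity get swamped. Every figure below is invented for illustration; Amazon did not publish cluster sizes.

    # All figures are made up to show the shape of the problem, not Amazon's numbers.
    total_nodes       = 1000
    stranded_fraction = 0.45        # nodes that lost their replication partner
    spare_nodes       = 20          # reachable nodes that still have free capacity
    retry_interval_s  = 1.0         # aggressive fixed-interval retries, no backoff

    stranded = int(total_nodes * stranded_fraction)
    per_target = stranded / spare_nodes
    print(f"{stranded} stranded nodes re-mirroring onto {spare_nodes} targets")
    print(f"about {per_target:.0f} simultaneous requests per target, "
          f"repeated every {retry_interval_s:g} s until they succeed")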

The degraded EBS cluster then slammed the coordinating control services with requests. Because the control services were configured with a long timeout, the retry requests backed up, tying up threads until the control services suffered thread starvation.
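
The thread starvation failure mode is easy to reproduce in miniature. The sketch below is not Amazon’s control plane, just a toy thread pool where a few requests stuck waiting on an unreachable node block everything queued behind them.

    import concurrent.futures
    import time

    POOL_SIZE = 4          # toy control-service thread pool
    LONG_TIMEOUT = 5.0     # seconds a thread waits on a node that never answers

    def wait_on_dead_node():
        time.sleep(LONG_TIMEOUT)   # stands in for a blocked call that runs out the clock
        return "timed out"

    def quick_request():
        return "ok"

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=POOL_SIZE)

    # Enough stuck requests to occupy every thread for the full timeout...
    stuck = [pool.submit(wait_on_dead_node) for _ in range(POOL_SIZE)]

    # ...so an ordinary, healthy request just sits in the queue behind them.
    start = time.time()
    print(pool.submit(quick_request).result(),
          "after", round(time.time() - start, 1), "seconds")   # roughly LONG_TIMEOUT
    pool.shutdown()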

Once a large number of requests had backed up, the control services could no longer service I/O requests and began to fail requests from other Amazon availability zones as well. Within two hours the Amazon team had identified the issue and disabled all new create volume requests in the affected cluster.

But then another bug kicked in.

A race condition in the EBS node software caused nodes to fail when closing a large number of replication requests. Because there were so many replication requests in flight, the race condition caused even more EBS nodes to fail, re-creating the need to replicate even more data, and again the control services were overwhelmed.
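
Amazon hasn’t described the bug in detail, but a classic check-then-act race gives the flavor. In this toy example several workers close entries in a shared table of replication sessions without a lock; when two of them race, the loser crashes, standing in for a failed node.

    import threading
    import time

    # A shared table of open replication sessions being closed concurrently.
    open_sessions = {i: object() for i in range(10_000)}
    crashed = []

    def close_sessions(worker_id):
        try:
            for sid in list(open_sessions):
                if sid in open_sessions:        # check...
                    time.sleep(0)               # another closer can slip in here
                    open_sessions.pop(sid)      # ...then act: KeyError if it did
        except KeyError:
            crashed.append(worker_id)           # the unhandled error takes the "node" down

    workers = [threading.Thread(target=close_sessions, args=(i,)) for i in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    print(f"{len(crashed)} of {len(workers)} workers crashed on the race")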

Recovery
The Amazon team got the replication storms under control in about 12 hours. Then the problem was recovering customer data.

Amazon optimizes its systems to protect customer data. When a node fails, it is not reused until its data has been replicated elsewhere.

But since so many nodes had failed, the only way to ensure no customer data was lost was to add more physical capacity – no easy chore – but that wasn’t all.

The replication mechanisms had been throttled to control the storm, so adding physical capacity also meant delicate management of the many queued replication requests. It took the team two days to implement a process.
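
The delicate part is pacing: the queued re-mirroring requests have to be released slowly enough that the new capacity isn’t swamped the way the original spare nodes were. A crude throttle along these lines illustrates the idea; the queue contents and rates are invented.

    import collections
    import time

    # Invented backlog of queued re-mirroring requests waiting for new capacity.
    backlog = collections.deque(f"re-mirror volume-{i}" for i in range(10))
    MAX_PER_TICK = 2        # throttle: how many requests to release per interval
    TICK_SECONDS = 0.1      # pacing interval (illustrative)

    while backlog:
        for _ in range(min(MAX_PER_TICK, len(backlog))):
            request = backlog.popleft()
            print("releasing:", request)   # the real system hands this to the new nodes
        time.sleep(TICK_SECONDS)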

Amazon Relational Database Service
The Amazon Relational Database Service (RDS) uses EBS for database and log storage. RDS can be configured to operate within a single availability zone or to replicate across multiple zones. Customers with single-zone RDS instances were quite likely to be affected, but 2.5% of multi-zone RDS customers were affected as well due to another bug.

Lessons learned
The network upgrade process will be further automated to prevent a similar mistake. But the more important issue is keeping a cluster from entering a replication storm. One measure is to increase the amount of free capacity in each EBS cluster.

Retry logic will be changed as well, to back off more aggressively and to focus on re-establishing connectivity before retrying. And of course, the race condition bug will be fixed.
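
The standard shape for that kind of retry logic is exponential backoff with jitter, re-establishing the connection before each attempt. Here’s a minimal sketch with placeholder connect and replicate functions standing in for the real node behavior.

    import random
    import time

    def retry_with_backoff(connect, replicate, max_delay=30.0):
        """Back off between attempts and reconnect before retrying the work."""
        delay = 0.5
        while True:
            try:
                peer = connect()          # re-establish connectivity first...
                return replicate(peer)    # ...then retry the actual replication
            except ConnectionError:
                # Exponential backoff with jitter so thousands of nodes
                # don't all hammer the cluster in lockstep.
                time.sleep(delay * random.uniform(0.5, 1.5))
                delay = min(delay * 2, max_delay)

    attempts = {"n": 0}
    def flaky_connect():
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("peer unreachable")
        return "node-c"

    print(retry_with_backoff(flaky_connect, lambda peer: f"replicated to {peer}"))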

Finally, Amazon has learned it must improve the isolation between availability zones. They will tune timeout logic to prevent thread exhaustion, increase the control services’ awareness of per-zone load, and move more of the control services into each EBS cluster.

The StorageMojo take
Data center opponents of cloud computing will point with alarm to this incident to make the case that their infrastructure is still needed. But they forget that today’s enterprise gear is reliable only because of the many failures that led to better error handling.

While painful for the affected, the Amazon team’s response shows a level of openness and transparency that few enterprise infrastructure vendors ever display. Of course, that is due to the public nature of these large cloud failures; nevertheless the outcome is commendable.

But the battle is not only between large public clouds and private enterprise infrastructures, but between architectures. Traditionally, enterprise infrastructures have focused on increasing MTBF (Mean Time Between Failures). Cloud architectures, on the other hand, have focused on fast MTTR (Mean Time To Repair).
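
The arithmetic behind that trade-off is the steady-state availability formula, MTBF / (MTBF + MTTR). The numbers below are purely illustrative, but they show how commodity gear with fast repair can match expensive gear that rarely fails but is slow to fix.

    # Steady-state availability = MTBF / (MTBF + MTTR). Illustrative numbers only.
    def availability(mtbf_hours, mttr_hours):
        return mtbf_hours / (mtbf_hours + mttr_hours)

    enterprise = availability(mtbf_hours=10_000, mttr_hours=8)    # rare failures, slow repair
    cloud      = availability(mtbf_hours=1_000,  mttr_hours=0.5)  # frequent failures, fast repair
    print(f"enterprise: {enterprise:.5f}   cloud: {cloud:.5f}")
    # enterprise: 0.99920   cloud: 0.99950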

What can be scaled up can also be scaled down. Not every application is suitable for public cloud hosting. But small-scale, commodity-based, self-managing infrastructures are very doable. They are the bigger threat to the large proprietary hardware vendors of today.

Courteous comments welcome, of course. In Amazon’s experience: fault tolerance and fault finding I speculated about the cause of the failure, but I was wrong. A failure precipitated by a network upgrade? Way-y-y too simple.