Amazon’s outage was caused by a failure of the underlying storage, the Elastic Block Store (EBS). Here’s what they learned.
EBS
The Elastic Block Store (EBS) is a distributed, replicated storage service optimized for consistent, low-latency I/O from EC2 instances. EBS runs on clusters that store data and serve requests, and on a set of control services that coordinate and propagate I/Os.
Each EBS cluster consists of EBS nodes where data is replicated and I/Os are served. Nodes are connected by two networks: a primary, high-bandwidth network that carries traffic between the EBS nodes and EC2 server instances, and a slower secondary replication network intended as a backup and for reliable inter-node communication.
Newly written data is replicated ASAP. An EBS node searches the cluster for a node with enough capacity, connects to it and replicates the data, usually in milliseconds.
If a node loses connectivity to the node it is replicating to, it assumes that node has failed and tries to find another node to replicate the data to. In the meantime it holds onto all the data until it can confirm the data is replicated.
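Here is a rough Python sketch of what that re-mirroring behavior might look like. The class and method names are my guesses at the shape of the logic from Amazon’s description, not their actual code.

```python
# Hypothetical sketch of EBS-style re-mirroring, inferred from the public
# post-mortem; names and structure are illustrative, not Amazon's code.
import random
import time

class EbsNode:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.blocks = []
        self.unreplicated = []        # writes held until a replica confirms

    def store_replica(self, block):
        if len(self.blocks) >= self.capacity:
            return False              # no room; the writer must look elsewhere
        self.blocks.append(block)
        return True

    def write(self, block, cluster):
        self.unreplicated.append(block)
        self.re_mirror(block, cluster)

    def re_mirror(self, block, cluster):
        # Search the cluster for a peer with spare capacity; under normal
        # conditions this completes in milliseconds.
        while block in self.unreplicated:
            for peer in random.sample(cluster, len(cluster)):
                if peer is not self and peer.store_replica(block):
                    self.unreplicated.remove(block)   # release only once confirmed
                    return
            # No peer reachable or with capacity: hold the data and keep trying.
            time.sleep(0.01)

cluster = [EbsNode(f"node-{i}", capacity=4) for i in range(5)]
cluster[0].write("block-A", cluster)
```

The key property is the last branch: a node never drops unconfirmed data, which is exactly what makes a cluster-wide loss of connectivity so dangerous.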
The outage
During a network change on April 21, intended to upgrade primary network capacity, a mistake occurred: the primary network’s data traffic was shifted onto the slower secondary network.
The secondary network couldn’t handle the traffic, which isolated many nodes in the cluster. Having lost contact with the nodes they were replicating to, the remaining EBS nodes sought new replication partners, but the few nodes still reachable were quickly overwhelmed in a retry storm.
The now-degraded secondary network then slammed the coordinating control services. Because the retry requests were configured with a long timeout, they backed up and the control services suffered thread starvation.
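To see how a long timeout turns a backlog into thread starvation, here is an illustrative Python toy. The pool size, timeout and request counts are invented; the point is only the queuing behavior.

```python
# Illustrative only: a fixed-size pool, long timeouts, and a flood of retries.
from concurrent.futures import ThreadPoolExecutor
import time

POOL_SIZE = 8
LONG_TIMEOUT = 2.0   # stands in for the control services' generous timeout

def replicate_over_degraded_network(volume_id):
    time.sleep(LONG_TIMEOUT)          # the request just waits, then fails
    return False

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

# A backlog of retries ties up every worker for LONG_TIMEOUT seconds each...
retries = [pool.submit(replicate_over_degraded_network, v) for v in range(40)]

# ...so even a trivial request, say from a healthy availability zone, waits.
start = time.time()
pool.submit(lambda: "ok").result()
print(f"healthy-zone request waited {time.time() - start:.1f}s")
```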
Once a large number of I/O requests had backed up, the control services could no longer service them and began to fail I/O requests from other Amazon availability zones. Within two hours the Amazon team had identified the issue and disabled all new create-volume requests in the cluster.
But then another bug kicked in.
A race condition in the EBS code caused nodes to fail when closing a large number of replication requests. Because there were so many replication requests in flight, the race condition caused even more EBS nodes to fail, creating still more data to re-replicate, and again the control services were overwhelmed.
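Amazon didn’t publish the exact bug, so the Python below is only a generic check-then-act race of the same flavor: a window that is almost never hit in normal operation, but becomes likely when thousands of replication requests are torn down at once.

```python
# Generic illustration of a close-time race; not Amazon's actual code.
import threading

class ReplicationRequest:
    def __init__(self):
        self.conn = ["open"]          # stand-in for a live connection
        self.lock = threading.Lock()  # the fix would be to take this in close()

    def close(self):
        if self.conn is not None:     # check ...
            # ... a second closer can slip in right here during mass teardown
            self.conn.clear()         # raises if the other closer already won
            self.conn = None

# Two concurrent closers rarely collide; thousands closing at once, as in the
# storm, make the collision (and the resulting node failure) far more likely.
req = ReplicationRequest()
for t in [threading.Thread(target=req.close) for _ in range(2)]:
    t.start()
```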
Recovery
The Amazon team got control of the replication storms in about 12 hours. Then the problem became recovering customer data.
Amazon optimizes its systems to protect customer data. When a node fails, it is not reused until its data has been replicated elsewhere.
But since so many nodes had failed, the only way to ensure no customer data was lost was to add more physical capacity, no easy chore, and that wasn’t all.
The replication mechanisms had been throttled to control the storm, so adding physical capacity also meant delicate management of the many queued replication requests. It took the team two days to implement a process.
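As a toy model of that constraint, with invented numbers throughout: the backlog can only drain as fast as both the throttle and the newly added capacity allow.

```python
# Toy model of draining a throttled re-mirroring backlog as capacity arrives.
import collections
import time

backlog = collections.deque(range(1000))   # queued replication requests
spare_nodes = 0                            # newly added physical capacity
RATE_PER_TICK = 20                         # throttle: keep the storm contained

def add_capacity(n):
    global spare_nodes
    spare_nodes += n

while backlog:
    if spare_nodes == 0:
        add_capacity(100)                  # stand-in for racking new servers
    drained = min(RATE_PER_TICK, spare_nodes, len(backlog))
    for _ in range(drained):
        backlog.popleft()                  # one volume re-mirrored
    spare_nodes -= drained                 # the new replicas consume capacity
    time.sleep(0.01)                       # paced drain instead of a retry storm
```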
Amazon Relational Database Service
The Amazon Relational Database Service (RDS) uses EBS for database and log storage. RDS can be configured to operate within a single Amazon availability zone or replicated across multiple zones. Customers with single-zone RDS were quite likely to be affected, but 2.5% of multi-zone RDS customers were affected as well, due to another bug.
Lessons learned
The network upgrade process will be further automated to prevent a similar mistake. But the more important issue is keeping a cluster from entering a replication storm. One step is to increase the amount of free capacity in each EBS cluster.
Retry logic will also be changed to back off faster and to focus on re-establishing connectivity before retrying. And of course, the race condition bug will be fixed.
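The post-mortem doesn’t show the new retry code, but “back off faster” generally means something like the following exponential backoff with jitter; the constants and names here are invented.

```python
# Sketch of retry-with-backoff; illustrative, not Amazon's implementation.
import random
import time

def replicate_with_backoff(attempt_replication, max_attempts=8):
    delay = 0.1
    for _ in range(max_attempts):
        if attempt_replication():
            return True
        # Back off exponentially, with jitter so thousands of nodes don't
        # re-synchronize their retries into another storm.
        time.sleep(random.uniform(0, delay))
        delay = min(delay * 2, 30.0)
    return False   # give up; re-establish connectivity before trying again

# e.g. replicate_with_backoff(lambda: peer_accepts_replica())  # hypothetical callable
```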
Finally, Amazon has learned it must improve the isolation between zones. It will tune timeout logic to prevent thread exhaustion, increase the control services’ awareness of per-zone load and, ultimately, move more control services into each EBS cluster.
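One common way to express that kind of isolation, sketched here with hypothetical zone names and pool sizes, is a bulkhead: each zone gets its own bounded worker pool so a storm in one zone cannot exhaust the threads serving the others.

```python
# Bulkhead sketch: per-zone bounded pools; names and sizes are hypothetical.
from concurrent.futures import ThreadPoolExecutor

ZONES = ["zone-a", "zone-b", "zone-c"]
pools = {zone: ThreadPoolExecutor(max_workers=16) for zone in ZONES}

def submit_control_request(zone, fn, *args):
    # Requests from a degraded zone queue only in that zone's pool, leaving
    # the other zones' threads free.
    return pools[zone].submit(fn, *args)
```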
The StorageMojo take
Data center opponents of cloud computing will point with alarm to this incident to make the case that they are still needed. But they forget that today’s enterprise gear is reliable only because of the many failures that led to better error handling.
While painful for those affected, the Amazon team’s response shows a level of openness and transparency that few enterprise infrastructure vendors ever display. Of course, that is due to the public nature of these large cloud failures; nevertheless the outcome is commendable.
But the battle is not only between large public clouds and private enterprise infrastructures, but between architectures. Traditionally, enterprise infrastructures have focused on increasing MTBF – Mean Time Between Failures. Cloud architectures, on the other hand, have focused on fast MTTR – Mean Time To Repair.
What can be scaled up can also be scaled down. Not every application is suitable for public cloud hosting. But small-scale, commodity-based, self-managing infrastructures are very doable. They are the bigger threat to today’s large proprietary hardware vendors.
Courteous comments welcome, of course. I speculated about the cause of the failure in “Amazon’s experience: fault tolerance and fault finding,” but I was wrong. A failure precipitated by a network upgrade? Way-y-y too simple.
IMHO there is even more to learn
1) data sets, especially large ones, have their own logistics problems
2) being as successful and large as AWS is a two-sided coin: when you fail, you might fail really hard (again, see logistics problems)
3) data stores with strong consistency requirements, like block storage, are bound to fail hard when they do fail (even though this might be highly unlikely). All the algorithms I know of for consistent, distributed data stores with high resilience are effectively incompatible with block storage’s low-latency access patterns. There is an inherent problem in providing block storage in a cloud environment at Amazon’s scale, AFAICS.
Martin, this is an example of Brewer’s CAP theorem…
Mxx, yes. But it also shows a problem inherent in AWS’s scale and size. It’s not just CAP; IMHO EBS is the right and the wrong offering at the same time, wrong mostly because of the scale Amazon is operating at and because block-storage workloads are not sustainable in a cloud setting.
Just my 2 cents.
Well, they started offering EBS because customers were asking for persistent instance storage. There are situations where NFS simply wouldn’t work.
Robin:
With all due respect…
This is like many other reviews of the outage that basically reprise the Amazon explanation. Because Amazon talks about customer data in terms of “volumes” throughout, most people see the potential loss to customers as minimal (0.07% of volumes were lost, according to this explanation). However, what I think you should be asking is:
How large were those volumes?
How critical was that data in those volumes?
AWS could have lost a significant amount of customer data (if the lost volumes were large), and, depending on how critical that data was, data whose loss could threaten the continued health of its customers.
John,
Good point. How elastic is Elastic? If someone goes bust because of it we’ll know.
The Universe hates our data – which is why storage geeks will always have jobs.
I see this over and over again (even at my last org, which stored everything in AWS) – teams confusing AWS with a production platform. It’s been built with developers in mind, and while admirably suited to startups (which aren’t self-funded, making uptime/reliability less important) or certain models (like Netflix, which is almost all reads), not every application makes sense in AWS.
There are other clouds, of course (Joyent takes a markedly different approach to architecture than Amazon does), and really, large enterprise has always used ‘the cloud’ in some sense as we consolidate and virtualize resources. It’s just that now it’s on the other side of the firewall if you don’t have the money/expertise to do it yourself.
I strongly suspect that real competition will begin to appear for AWS as customers begin to have carrier-grade expectations of web apps. Once a market segment becomes saturated with players, performance and stability will become differentiators in the public’s eyes.
At Zadara Storage we have a totally different architecture that allows enterprise storage reliability and capabilities for the cloud. Our storage is even available at AWS for EC2 machines, for those customers who prefer to have a “storage array” in the cloud as opposed to a volume from a huge pool of shared resources.
Hi, Nelson: does Zadara have its own data centers, or does it just use AWS as a physical backend storage pool? If the latter, how do you deal with an AWS outage?