Mission Impossible: Managing Amazon’s Datacenter, Pt I

by Robin Harris on Monday, 18 September, 2006

The Church of Small Business
Dave Hickey, one of my favorite culture critics, describes Mission Impossible as the Church of Small Business: the team shows up on time, they know their jobs, they don’t complain, and their equipment, their equipment, my friend, always works!

Next time things go loopy in your datacenter, wouldn’t it be nice to call in an MI team of Really Smart People to figure out the problems and make them go away? A team including one of the smartest and most productive computer scientists in the world? With other folks who are tops in their fields? Welcome to the Church of the Datacenter. Welcome to Amazon.com.

Advanced Tools for Operators at Amazon.com (pdf) details just such an effort. OK, the guys are probably all taller than Tom Cruise, and they walked in the front door and signed in just like everyone else, but still. David Patterson is best known for his work developing RISC and later RAID, inventions that roiled two major IT industries. Then there’s the Michael Jordan of statistical learning theory, Prof. Michael Jordan. A smart grad student, Peter Bodik. Stanford prof Armando Fox. And Amazon’s own homegrown team of sharp CS types.

While Amazon is not as forthcoming as Google about their infrastructure and technology, they have much smarter marketing. IMHO Amazon is Google’s real competition, not Yahoo, eBay or Microsoft, because they offer an alternative to Google’s tone-deaf arrogance. For example, instead of offering web applications, such as Writely, Amazon is offering building blocks for other developers with Amazon Web Services. Amazon must be looking at taking on eBay, since they already have the OLTP infrastructure and, I suspect, lower costs as well.

There are two threads in the paper I find interesting. The first is what it tells us about Amazon’s internal operations and infrastructure. The second is the tools the MI team developed and how they are used. I’ll take them in that order, although they are interwoven in the paper itself.

Guppies with frikkin’ laser pointers!!
Unlike Google’s clean-sheet approach to creating internet-class infrastructure, Amazon has made every mistake in the book. The original site was one hairball: database, OLTP and web server all on one system. Due to their phenomenal growth they’ve been playing catch-up ever since. In a fit of desperation they even bought a mainframe hoping it would solve their problems. They unplugged it a year later. They seem to use a lot of blade server clusters today, but information on the rest of their infrastructure is scarce.

Amazon has adopted a service-oriented architecture (SOA) approach. Individual services run on dedicated clusters that range in size up to several hundred machines. The services rely on each other to get the job done, so when an entire service bogs down or fails the problem propagates rapidly.
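
To make the propagation point concrete, here is a minimal Python sketch of how one failed service can drag down everything that depends on it. The service names and the dependency map are invented for illustration; nothing here comes from the paper or from Amazon's actual architecture.

```python
from collections import deque

# Hypothetical services and dependencies, invented for illustration only.
# Each key depends on the services in its list.
DEPENDS_ON = {
    "web_frontend": ["catalog", "cart"],
    "catalog": ["inventory"],
    "cart": ["inventory", "pricing"],
    "inventory": [],
    "pricing": [],
}

def affected_by(failed, depends_on):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map: for each service, which services call it?
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk outward from the failed service.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents[current]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

print(sorted(affected_by("inventory", DEPENDS_ON)))
```

In this toy map a failure in the lowest-level service reaches every service above it, which is the kind of cascade the paragraph describes. A real system would have far more services and no single authoritative dependency map.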

Amazon’s IT ops don’t employ guppies, but other than that it sounds like a gonzo data center:

  • They run complex systems with a lot of software churn and many dependencies that cause failures to propagate
  • Operators understand the part of the system they spend the most time on, but no one understands all the dependencies across the systems
  • Data rich and information poor – the fine grain instrumentation of the Amazon infrastructure often overwhelms operators with detail
  • Amazon uses multiple software frameworks, the same kind of legacy problem mere enterprise data centers face

With all that, fewer than a dozen sysadmins monitor the entire Amazon website 7×24.

Ugly crash: bits spewing out the back, fried data on the ceiling . . .
Amazon classifies problems as either Severity 1 (sev1), problems that impact customers, or sev2, problems that could affect customers if not fixed.

Amazon’s unique (AFAIK) response is to require that development teams also support the apps they develop. There are only a few dozen development teams, so most developers also rotate into beeper-carrying bug chasers for several days every couple of months. A clever way to encourage sound coding and testing practices, don’t you think?

Sev1 problems result in a concall that the designated support guys have 15 minutes to join. A common problem is that the failure of one component leads to alarms and failures of other components. In the cascade of failures operators are often reduced to examining alarm time stamps to determine where the problem started.
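
The timestamp triage described above can be sketched in a few lines of Python. The alarm records, service names and timestamps below are hypothetical, and the earliest alarm is only a hint about where to look first, not proof of the root cause.

```python
from datetime import datetime

# Hypothetical alarm records (service, ISO timestamp), invented for illustration.
ALARMS = [
    ("web_frontend", "2006-09-18T10:04:12"),
    ("cart", "2006-09-18T10:03:40"),
    ("inventory", "2006-09-18T10:02:55"),
    ("web_frontend", "2006-09-18T10:05:01"),
]

def first_alarm_per_service(alarms):
    """Return (service, first alarm time) pairs, ordered by time."""
    first = {}
    for service, stamp in alarms:
        when = datetime.fromisoformat(stamp)
        # Keep only the earliest alarm seen for each service.
        if service not in first or when < first[service]:
            first[service] = when
    return sorted(first.items(), key=lambda item: item[1])

timeline = first_alarm_per_service(ALARMS)
for service, when in timeline:
    print(when.isoformat(), service)
```

The service at the top of the timeline is the first suspect. Clock skew between machines and alarms with different detection delays can reorder events, which is one reason this manual approach is so unsatisfying.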

The dev-team resolvers get to know the other services they rely on, but nobody at Amazon has a mental map of the entire system. There is no “big picture” at Amazon.

Don’t like the software? Wait a minute.
With hundreds of software developers and pressure for new features, higher performance and bug fixes, code churn is a fact of life at Amazon, more so than at enterprise data centers. The code churn creates its own climate of continuing new sev2 problems. Documentation is often out of date or incomplete. This is one reason it makes sense to turn developers into problem resolvers – they are the only ones who know enough to intelligently attack the problems.

All in all, Amazon sounds like one of the crazier data centers to work in. Then you look at the services they are rolling out and the fact that they are a money-spinning machine, and crazy as it sounds, Amazon folks are making it work. It isn’t a church, but a cathedral of innovative web services.

Next: The Tools the MI Team Cooked Up

PuReWebDev December 26, 2008 at 11:29 pm

It’ll be interesting to see if some of those same data center issues are making their way into Amazon Web Services. This post was very interesting, kind of puts into perspective the amount of effort that Amazon needs, in order to operate and keep the money machine running.
