Here are my notes from Alyssa Henry’s Keynote on Amazon Web Services. Alyssa is the GM of Amazon’s S3. Not much editing and no slides – yet – to link. Update: slides are here. Alert readers Tim and Justin found the link. Thanks, guys. Update 2: I’ve added a couple of her slides to the post. End updates.

The quotes are hers. I didn’t catch them all.

Goals

  • Durable
  • 99.99% availability
  • Scalability
    • Scalability – virtually infinite
    • Support an unlimited number of web scale apps
    • Use scale as an advantage – linear scalability
    • Vendors weren’t interested in engineering for the 1% – they want to engineer for the 80%
  • Secure
  • Fast
  • Simple
    • Straightforward APIs
    • Few concepts to learn
    • AWS handles partitioning – not customers
    • Some customers have billions of objects
  • Cost-effective

[Slide: Amazon S3 objects stored]

The big issue
Uncertainty is a constant at AWS. They don’t know:

  • Predominant workload type
  • Usage consistent or changing
  • Object access frequency – volume – locality – lifetime – size
  • Use cases varied and may change – suddenly – or over time
  • Must embrace uncertainty

This isn’t the standard data center mindset.

“Everything is vague to a degree you do not realize till you have tried to make it precise.”
-Bertrand Russell

Security
Amazon stores millions of credit card numbers – a criminal bonanza and a commercial disaster if breached. Secure protocols, authentication mechanisms, and loggable access controls.

Forensic drive wipes. They are not transparent about where their data centers are located.

Need for speed
You’ve heard numbers about how quickly people start leaving as response times rise. So has Amazon.

S3 latency insignificant relative to Internet latency. They reduce Internet latency by adding new locations. They also perform multiple low-latency retries before returning a high-latency error to customers.

Simple
Simple self-service with a straightforward API. Few concepts to learn, limited command sets.

For storage they handle partitioning, not customers. Just throw it in the bucket. And some customers have billions of objects.
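
To make “few concepts to learn” concrete, here is a minimal sketch of the put/get pattern using the boto3 Python client (a later library, not something from the talk); the bucket and key names are made up:

    import boto3

    # Two concepts: a bucket and a key. Partitioning is the service's problem, not the caller's.
    s3 = boto3.client("s3")  # credentials come from the environment

    s3.put_object(Bucket="example-bucket", Key="photos/cat.jpg", Body=b"...")
    obj = s3.get_object(Bucket="example-bucket", Key="photos/cat.jpg")
    data = obj["Body"].read()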

Cost effective
Pay as you go. Pay only for what is used. No long term contracts or commitments.

Amazon uses their own software and massive scale on commodity boxes to reduce costs.

Failure happens
“Try again. Fail again. Fail better.” -Samuel Beckett

What are the odds? Many failures happen frequently – but even low probability events happen at high scale.

  • Natural disasters
  • Load balancers corrupt packets
  • Techs pull live fiber
  • Routers black hole traffic
  • Power and cooling failure
  • NICs corrupt packets
  • Disk drives fail
  • Bits rot

[Slide: Amazon failure types]

Failure types
Scope: small to large
Duration: temporary or permanent
Effect: harmless to catastrophic

Techniques
“Do not let what you cannot do interfere with what you can do.”
-John Wooden

Amazon’s basic techniques for dealing with the depressing litany of tech failure include:

-Redundancy
Department of redundancy department.
A broadly applicable technique that increases durability, availability, cost, and complexity. Interesting analogy: seat belt & air bag vs belt & suspenders.

The focus is on avoiding catastrophe (seat belt/air bag) rather than over-engineering to avoid inconvenience (belt/suspenders).

Plan for the catastrophic loss of entire data center: redundantly store data in different data centers. Expensive, but once you’ve done it smaller features become belt & suspenders kinds of features: costly, inconvenient and they don’t solve big problems. [Ed. note: like RAID.]

-Retry
Resolves transient failures: retry in real time or later – it’s quicker for AWS to retry than for customers to do it. Leverage redundancy – retry from different copies, as sketched below.
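
A minimal sketch of that idea – try another redundant copy before surfacing an error to the caller – with the replica list, fetch callable, and error type all hypothetical:

    import random

    class TransientError(Exception):
        """Stand-in for a temporary failure (timeout, corrupt packet, etc.)."""

    def read_with_retry(key, replicas, fetch, attempts_per_replica=2):
        # It's quicker for the service to retry internally than to push
        # the failure back to the customer; redundancy gives other copies to try.
        last_error = None
        for replica in random.sample(replicas, len(replicas)):
            for _ in range(attempts_per_replica):
                try:
                    return fetch(replica, key)
                except TransientError as err:
                    last_error = err
        raise last_error or KeyError(key)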

-Idempotency
An idempotent action is one whose result doesn’t change if the action is repeated. Taking a number’s absolute value or reading a customer record are actions that can be repeated without changing the result. If an idempotent action is taking too long, run it again. Lather, rinse, repeat.
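
A small illustration of the difference (the record names are invented):

    def put_record(store, key, value):
        # Idempotent: setting the same key to the same value any number
        # of times leaves the store in the same state.
        store[key] = value

    def append_record(log, value):
        # Not idempotent: every repeat changes the result, so a blind
        # retry would duplicate data.
        log.append(value)

    store = {}
    for _ in range(3):  # retrying the idempotent write is harmless
        put_record(store, "customer:42", {"name": "Alyssa"})
    assert store == {"customer:42": {"name": "Alyssa"}}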

-Surge protection
Rate limiting is a bad idea – build the infrastructure to handle uncertainty. Don’t burden already stressed components with retries. Don’t let a few customers bring down the system.

Surge management techniques include exponential back-off (shades of CSMA/CD!) and extending cache TTLs (time to live) to avoid waiting for renewals.
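
A rough sketch of exponential back-off with jitter (the operation and the tuning constants are illustrative, not Amazon’s):

    import random
    import time

    def call_with_backoff(operation, max_attempts=5, base_delay=0.1, cap=5.0):
        # Back off exponentially so a burst of retries doesn't pile onto an
        # already stressed component; jitter spreads clients apart in time.
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                delay = min(cap, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, delay))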

-Eventual consistency
Spectrum of choices. The time lag is typically the result of node failure.

Amazon sacrifices some consistency for availability. And sacrifices some availability for durability. A matter of priorities.

For example, objects are written to multiple data centers. Within each data center there are multiple pointers to the objects in case a pointer gets corrupted. Pointers are cheap; retrieving an object from another data center isn’t.
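
A toy sketch of that layout – every data center gets a copy, each data center keeps redundant pointers, and reads prefer local pointers – with all names invented:

    class DataCenter:
        # Toy stand-in for one data center's blob store and pointer index.
        def __init__(self):
            self.blobs = {}   # location -> object data
            self.index = {}   # (key, pointer_slot) -> location

        def store(self, data):
            location = len(self.blobs)
            self.blobs[location] = data
            return location

    def write_object(key, data, data_centers, pointers_per_dc=2):
        # Durability first: every data center gets a copy, and each keeps
        # redundant pointers to it -- pointers are cheap.
        for dc in data_centers:
            location = dc.store(data)
            for slot in range(pointers_per_dc):
                dc.index[(key, slot)] = location

    def read_object(key, local_dc, remote_dcs, pointers_per_dc=2):
        # Prefer any surviving local pointer; only fall back to another
        # data center (an expensive read) if all local pointers are gone.
        for dc in [local_dc] + list(remote_dcs):
            for slot in range(pointers_per_dc):
                location = dc.index.get((key, slot))
                if location is not None:
                    return dc.blobs[location]
        raise KeyError(key)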

-Routine failure
Component failure is normal. Everything fails. Therefore don’t have unused/rarely used code paths since they are most likely to be buggy.

Amazon routinely fails disks, servers, data centers. For data center maintenance they just turn the data center off to exercise the recovery system.

-Diversity
Monocultures are risky. For software there is version diversity: they engineer systems so different versions are compatible.

They also maintain diversity within commodity hardware. Had one case where all the drives from one vendor failed. Another case where a storage server had a firmware bug: a failed drive would nuke the server. Another shipment of servers had faulty power cords. Correlated failures happen – especially at scale.

Diversity of workloads: customer workloads can be interleaved. Amazon is not a monoculture.

-Integrity checking
Identify corruption inbound, outbound, and at rest. Interesting failure mode: NICs that corrupt data after computing the checksum. Store checksums and compare at read – plus scan all the data at rest as a background task.

Application-level checksumming: increases cost and complexity for the customer. Increases durability and availability.
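
A simplified sketch of checksum-on-write, verify-on-read, plus a background scrub (the hash choice and store layout are assumptions, not what the keynote described):

    import hashlib

    def checksum(data):
        return hashlib.sha256(data).hexdigest()

    def put(store, key, data):
        # Checksum computed end to end, before the data passes through
        # hardware that might silently corrupt it (the NIC failure mode above).
        store[key] = (data, checksum(data))

    def get(store, key):
        data, expected = store[key]
        if checksum(data) != expected:
            raise IOError("corruption detected reading %r" % key)
        return data

    def scrub(store):
        # Background task: walk all data at rest and flag silent corruption.
        return [key for key, (data, expected) in store.items()
                if checksum(data) != expected]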

-Telemetry
Internal, external. Real time, historical. Per host, aggregate. Strong telemetry platform.

-Autopilot
Human processes fail. Human reaction time is slow. They want the system to be on autopilot and just run. If a human screws up an Amazon system, it isn’t the human’s fault. It’s the system’s.

Final thoughts
Storage is a lasting relationship that requires trust.

Reliability at low cost achieved through engineering, experience and scale.

The StorageMojo take
Amazon is the world leader in scale out system engineering. Google may have led the way, but the necessity to count money and ship products set a higher bar for Amazon. Plus they made every mistake in the book getting to where they are today. Hard-won learning.

Amazon Web Services is a logical extension of Amazon’s massive infrastructure investment. It may very well dwarf their products business in a few short years. I’d like to see them open the kimono more in the future.

Courteous comments welcome, of course. In a later conversation with Alyssa I noted that Google seems to have a problem scaling their clusters beyond 7-8 thousand nodes and asked where Amazon’s clusters topped out. She said they hadn’t seen that problem yet. Hmm-m-m. . . .