Amazon’s Glacier secret: BDXL

by Robin Harris on Friday, 25 April, 2014

Remember when Amazon Web Services (AWS) announced Glacier, a data archiving service, almost 2 years ago? Long-term, slow-retrieval (3-5 hours) storage for 1¢/GB per month while maintaining several copies across geographies.

Pretty amazing. Less amazing now that disk prices are reaching 3¢/GB, but there are still power, cooling, mounting and replacement costs to consider in addition to multiple copies.

Tape? Amazon denied that. Plus the long-term storage requirements for tape require a level of climate control that their data centers may not support.

Not tape.

Hard drives to the rescue?
That left disk. Perhaps Shingled Magnetic Recording (SMR) drives that, in theory, could double existing drive density at the cost of expensive rewrites. Which an archive wouldn’t have.

Seagate announced they’d sold a million SMR drives – and not through NewEgg. WD is getting on board with SMR as well.

But as Rick Branson, an Instagram engineer, suggested in a tweet:

Economics of AMZN Glacier: 3TB drives are about $0.003/mo/GB racked and powered + erasure encoding = thin, but survivable margins.

That’s ≈$108/drive/year. Since 3TB drives cost about that, it’s clear that over a 3-5 year life the rack, power and redundancy cost is 2/3rds to 4/5ths of total cost. At Glacier’s $10/TB/mo in 2012 – and today – and a 2012 cost in excess of today’s $108, you wouldn’t need Bezos’ financial acumen to see a non-starter. SMR could double margins, but in 2012 – remember the Thai floods – even Amazon couldn’t make this pencil.
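The per-drive figure and the cost split are easy to check. Here is a minimal sketch in Python, assuming 1 TB = 1,000 GB and, as the paragraph does, that a drive's purchase price is roughly one year of Branson's racked-and-powered figure. The function names are illustrative only.

```python
# Back-of-the-envelope check of the disk economics quoted above.
# Assumptions (not from Amazon): 1 TB = 1,000 GB, and the drive's purchase
# price is about equal to one year of the racked-and-powered cost.

RACKED_COST_PER_GB_MONTH = 0.003  # dollars, from Rick Branson's tweet
DRIVE_CAPACITY_GB = 3_000         # one 3TB drive


def annual_cost_per_drive() -> float:
    """Yearly all-in cost of one racked, powered 3TB drive, in dollars."""
    return RACKED_COST_PER_GB_MONTH * DRIVE_CAPACITY_GB * 12


def non_drive_share(years: int) -> float:
    """Fraction of lifetime cost that is rack, power and redundancy."""
    total = annual_cost_per_drive() * years
    drive_price = annual_cost_per_drive()  # "3TB drives cost about that"
    return (total - drive_price) / total


if __name__ == "__main__":
    print(f"Cost per drive per year: ${annual_cost_per_drive():.0f}")
    for years in (3, 5):
        share = non_drive_share(years)
        print(f"{years}-year life: {share:.0%} of cost is not the drive")
```

Under these assumptions the 3-year case gives two thirds and the 5-year case gives four fifths, matching the range in the text.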

Not disks. Even SMR disks.

The plot thickens
But if Glacier’s data was stored on disks – even spun down disks – why the tape-like 3-5 hour retrieval delay? Fake delay to make sure only archive data is stored? Disk drive robots?

Disks are sensitive – never mind the specs – to physical handling. I’ve never seen an HDD handling robot or the Zero-Insertion Force drive connector that would be required to minimize physical shock.

One more thing: tape libraries – the obvious robotic starting point – are designed to handle 200 gram tapes, not 600+ gram 3.5″ HDDs.

Not disk robots.

Lightning strikes
It was a couple of blog posts from AWS architect and all-around nice guy James Hamilton that cleared things up.

James wrote Glacier: Engineering for Cold Data Storage in the Cloud at the time of the announcement. The post carefully avoids discussing the underlying storage, but in the comments James says

Many of us would love to get into more detail on the underlying storage technology.

Almost 2 years later, no one has.

Timing is everything
In June 2010 the BD-R 3.0 Spec (BDXL) defined a multi-layered disc recordable in BDAV format capable of 100/128 GB. In July 2010 Sharp announced triple-layer 100GB players and recorders.

Two years later, in August 2012, Amazon announced Glacier. Two years is about the time it would take to develop a custom optical disc mass storage system, test it, and announce the service.

Despite the obvious lack of consumer uptake, development continues on high capacity optical. Somebody is buying these things in volume. Unlike commercial Blu-ray discs – which are stamped, not written – writable optical requires chemistry.

Figure 10,000 3 layer discs per petabyte, the number of petabytes that AWS, FB and others are putting into cold storage, and that’s millions of discs per year. Pure OEM revenue with very low sales, marketing and support costs, and regular massive orders delivered every month by the semi-load.
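The disc count is simple division. The sketch below assumes decimal units (1 PB = 1,000,000 GB). The yearly ingest and copy count in the example are hypothetical inputs, since neither AWS nor FB publishes those numbers.

```python
import math

DISC_CAPACITY_GB = 100  # triple-layer BDXL disc
GB_PER_PB = 1_000_000   # decimal units (assumption)


def discs_per_petabyte(disc_gb: int = DISC_CAPACITY_GB) -> int:
    """Discs needed to hold one petabyte, rounded up to whole discs."""
    return math.ceil(GB_PER_PB / disc_gb)


def discs_per_year(petabytes_per_year: float, copies: int = 1) -> int:
    """Discs consumed per year for a given ingest rate and copy count."""
    return math.ceil(discs_per_petabyte() * petabytes_per_year * copies)


if __name__ == "__main__":
    print(discs_per_petabyte())            # 100GB discs per petabyte
    # Hypothetical: 100 PB of new cold data per year, kept as two copies.
    print(discs_per_year(100, copies=2))
```

At 100GB per disc it takes hundreds of petabytes a year, or several copies of less, to reach millions of discs.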

Another piece of the puzzle
In February James wrote another post Optical Archival Storage Technology.

He starts with an important comment about today’s market:

It’s an unusual time in our industry where many of the most interesting server, storage, and networking advancements aren’t advertised, don’t have a sales team, don’t have price lists, and actually are often never even mentioned in public. The largest cloud providers build their own hardware designs and, since the equipment is not for sale, it’s typically not discussed publically.

Then he starts discussing the growth of cold data and what FB will be showing at OCP Summit V:

This Facebook hardware project is particularly interesting in that it’s based upon an optical media rather than tape. . . . [T]hey are leveraging the high volume Blu-ray disk market with the volume economics driven by consumer media applications. Expect to see over a Petabyte of Blu-ray disks supplied by a Japanese media manufacturer housed in a rack built by a robotic systems supplier.

I’m sure his friends at FB previewed the preso, but the lack of surprise or affect at the viability of 10,000 Blu-ray discs in a rack is telling: this is the discussion about Glacier he’d like to have. More telling: “the volume economics driven by consumer media applications” as if BD-R and BDXL were a great success. Which they are, but only at Glacier.

Media cost?
The biggest objection to mass optical storage is media cost. While 100 piece online BD-R 25GB media ranges from 46¢ on up – 2¢/GB, triple-layer BDXL media quantity 1 starts at ≈$25 or 25¢/GB. How does THAT pencil?

Disc production costs are mostly fixed. Once you set up a line the variable cost of plastic, chemicals and test are less than $1/disc. If the line is properly sized for expected demand, it can run 24/7, and the learning curve will drive prices even lower.

Assuming aggressive forward pricing by Panasonic or TDK, Amazon probably paid no more than $5/disc or 5¢/GB in 2012. Written once, placed in a cartridge, barcoded and stored on a shelf, the roughly $50 of media per terabyte cost less than a hard drive, and Blu-ray writers are cheap. At $10/TB/mo Amazon would recoup variable costs in the first year, and after that it is mostly profit.
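To see how the media numbers pencil out, here is a rough sketch in Python. The $5 disc price is the article's guess, not a known Amazon price. The `copies` parameter is a hypothetical knob for redundancy, and drives, robots, power and labor are ignored.

```python
# Rough media-cost arithmetic for BDXL-backed cold storage.
# All prices are the article's estimates or retail observations, not
# confirmed figures from Amazon or any disc manufacturer.

GB_PER_TB = 1_000                   # decimal units (assumption)
GLACIER_PRICE_PER_TB_MONTH = 10.00  # dollars, i.e. 1 cent/GB/month


def cents_per_gb(price_dollars: float, capacity_gb: float) -> float:
    """Media price expressed in cents per gigabyte."""
    return price_dollars * 100 / capacity_gb


def media_cost_per_tb(disc_price: float = 5.00, disc_gb: int = 100,
                      copies: int = 1) -> float:
    """Dollars of blank media needed to store one terabyte."""
    return GB_PER_TB / disc_gb * disc_price * copies


def months_to_recoup(media_cost: float) -> float:
    """Months of Glacier revenue per TB needed to cover the media cost."""
    return media_cost / GLACIER_PRICE_PER_TB_MONTH


if __name__ == "__main__":
    print(f"BD-R 25GB in bulk: {cents_per_gb(0.46, 25):.2f} cents/GB")
    print(f"BDXL, quantity 1:  {cents_per_gb(25.00, 100):.2f} cents/GB")
    print(f"BDXL at $5/disc:   {cents_per_gb(5.00, 100):.2f} cents/GB")
    for copies in (1, 2):
        cost = media_cost_per_tb(copies=copies)
        print(f"{copies} copy/copies: ${cost:.0f}/TB, "
              f"recouped in {months_to_recoup(cost):.0f} months")
```

With a single copy the media pays for itself in about five months. With two copies it takes about ten, still inside the first year.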

The StorageMojo take
Therefore, by a process of elimination, Glacier must be using optical discs. Not just any optical discs, but 3 layer Blu-ray discs.

Not single discs either, but something like the otherwise inexplicable Panasonic 12 disc cartridge shown at this year’s Creative Storage conference. That’s 1.2TB in a small, stable cartridge with RAID so a disc can fail and the data can still be read. And since the discs weigh ≈16 grams, 12 weigh 192g.
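The cartridge figures can be set against the robot payloads mentioned earlier. This is a small sketch using only numbers from this post. It counts the discs alone, since the weight of the cartridge shell is not given here.

```python
# Cartridge capacity and weight versus what a tape robot is built to carry.
# Disc count, capacity and weights are the figures quoted in this post;
# the cartridge shell's own weight is unknown and left out.

DISCS_PER_CARTRIDGE = 12
DISC_CAPACITY_GB = 100  # triple-layer BDXL
DISC_WEIGHT_G = 16      # approximate, per the post
TAPE_WEIGHT_G = 200     # typical tape cartridge, per the post
HDD_WEIGHT_G = 600      # 3.5-inch hard drive, per the post


def cartridge_capacity_tb() -> float:
    """Raw capacity of one 12-disc cartridge in terabytes."""
    return DISCS_PER_CARTRIDGE * DISC_CAPACITY_GB / 1_000


def cartridge_disc_weight_g() -> int:
    """Combined weight of the discs in one cartridge, in grams."""
    return DISCS_PER_CARTRIDGE * DISC_WEIGHT_G


if __name__ == "__main__":
    weight = cartridge_disc_weight_g()
    print(f"Capacity: {cartridge_capacity_tb():.1f} TB")
    print(f"Disc weight: {weight} g")
    print(f"Lighter than a tape: {weight < TAPE_WEIGHT_G}")
    print(f"Lighter than an HDD: {weight < HDD_WEIGHT_G}")
```

The discs alone come in just under the weight of a tape, which is why a tape-library robot is a plausible starting point for handling them.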

For several years I didn’t see how optical disk technology could survive without consumer support. But its use by major cloud services explains its continued existence.

Courteous comments welcome, of course. This analysis is inspired by one of my favorite books, Most Secret War, the great story of British Scientific Intelligence from 1939 to 1949 told by its young physicist director, R. V. Jones. Competitive analysis with life and death stakes.

Update: In the just-added link to the Sony-Panasonic press release above, Sony closes by saying:

In recent years, there has been an increasing need for archive capabilities, . . . from cloud data centers that handle increasingly large volumes of data following the evolution in network services.

Gosh, whose cloud data centers could they have in mind? End update.

Update 2: Best tweet on the topic comes from Don MacAskill of SmugMug:

@StorageMojo FWIW, this contradicts what I’ve heard from ex-AWS employees. Their explanation sounded crazier than yours, though. :)

{ 19 comments }

hemancuso April 25, 2014 at 9:59 am

A “former S3 engineer” commented in a Hacker News thread during the Glacier launch. Nothing verifiable, but it suggests some contrast with the idea that Glacier is optical backed [also interesting: he suggests that S3 has an erasure encoding strategy.]

https://news.ycombinator.com/item?id=4416065

“They’ve optimized for low-power, low-speed, which will lead to increased cost savings due to both energy savings and increased drive life. I’m not sure how much detail I can go into, but I will say that they’ve contracted a major hardware manufacturer to create custom low-RPM (and therefore low-power) hard drives that can programmatically be spun down. These custom HDs are put in custom racks with custom logic boards all designed to be very low-power. The upper limit of how much I/O they can perform is surprisingly low – only so many drives can be spun up to full speed on a given rack. I’m not sure how they stripe their data, so the perceived throughput may be higher based on parallel retrievals across racks, but if they’re using the same erasure coding strategy that S3 uses, and writing those fragments sequentially, it doesn’t matter – you’ll still have to wait for the last usable fragment to be read.”

Justin Alan Ryan April 25, 2014 at 10:50 am

There’s actually a video of one of Facebook’s hardware engineers demo-ing an optical cartridge robot as part of OpenCompute. I’m sure Amazon is not as forthcoming, but they must have the same friends.

Robin Harris April 25, 2014 at 12:31 pm

I don’t doubt that drive vendors would do this for Amazon, or that Amazon would find it useful as another tier. For me the critical issue for the Glacier application is the 3-5 hour wait time to access data. Is that a fake marketing requirement? Or does it reflect operational requirements? If Amazon could offer Glacier with a shorter access time at the same price, why wouldn’t they? I think the delay reflects the underlying technology.

hemancuso April 25, 2014 at 1:03 pm

@Robin what if Amazon did something as simple as extremely wide parity striping across 2-3 datacenters and did their best to keep as much of it powered off as possible? In this scenario you can localize the racks you’re ingesting writes into, but when they fill up you would want a strong disincentive on reading them if you hope to keep power off most of the time. A good way to do that would be to queue up requests and turn on small subsets of the stripes at max every 4-5 hours.

Emmanuel Florac April 25, 2014 at 1:32 pm

Spectra have been making disk packs for its tape libraries for ten years. It’s actually quite easy to use disks only in such a library, so the “disk robot” is a real possibility.

Ryan April 25, 2014 at 1:54 pm

Love your articles! Thank you for your continued insights into the storage industry that let folks like me peer behind the veil.

geoffrey April 25, 2014 at 3:30 pm

Glacier is S3 with code added for waiting. Future releases might do something different. For now Glacier is S3 with waiting and lower prices. All of this speculation is interesting.

trans April 25, 2014 at 4:07 pm

It could be, they are simply eliminating lots of duplicate data.

mark April 25, 2014 at 7:18 pm

Brilliant deductions @mojo! What a great read!

anon April 25, 2014 at 9:45 pm

Glacier data is stored on densely-packed spun-down disks where only a fraction can be on at a time due to heat and vibration constraints. The wait time reflects the scarcity of disk bandwidth due to these constraints. It was initially fake, but it may be increasingly real as Glacier sees heavier loads.

Rob April 25, 2014 at 10:21 pm

I’ve speculated in various forums about what it could be. I’d say the data would have to exist in at least 2 places, possibly 3 to guarantee the “9s” they do. I’m not sure if you take that into account.

Rather silly or stubborn to avoid tape – if it isn’t tape. Run on really thin margins with no headroom?

http://www.spectralogic.com/blog/index.cfm/2012/8/24/Economics-of-Tape-Indicate-Warm-Waters-for-Glacier

“$0.0008 (cost per GB per month amortized over 5 years). Add in power, floor space and personnel costs for all 5 years and the total cost should still be well below $0.01 per GB per year for the period.”

With those numbers, you actually make money at $0.01/GB per month (Glacier charges) with data duplicated in two places, parity protected at a tape level to boot? Maybe.

Even if you can almost print money, if you have enough data you *still* have to use tape. It isn’t sexy, but imagine how much it would cost to have point in time offline backup in gmail not on tape?

http://www.tested.com/tech/1926-why-google-uses-tape-to-back-up-all-your-emails/

The headline isn’t addressed anywhere in the article. Why use tape? They could perform those same backups to spun down [fill in the blank]. But tape is probably a factor of 5-10 cheaper versus other targets and with the data you are looking at there, that’s a lot of money saved.

From that link:

“But it just goes to show that despite the many advances we’ve seen in storage technology sometimes that which might seem archaic is still the most reliable. ”

Let’s fix that:

“But it just goes to show that despite the many advances we’ve seen in storage technology sometimes that which might appear to be on its way out still makes the most sense. “

Rob April 25, 2014 at 10:29 pm

“For me the critical issue for the Glacier application is the 3-5 hour wait time to access data. Is that a fake marketing requirement? Or does it reflect operational requirements? If Amazon could offer Glacier with a shorter access time at the same price, why wouldn’t they? I think the delay reflects the underlying technology.”

Yep. Takes a while to get all those tapes mounted and staging the retrieval, heh. But Amazon is working on a disk based solution with faster retrieval:

http://www.theregister.co.uk/2013/12/23/amazon_to_introduce_diskbased_archive/

“Vulture Central’s storage desk thinks Amazon may introduce a disk-based archive with instant retrieval in the new year. It will be priced less than EVault’s LTS2, but still cost more than Glacier.

Let’s call it Snowfield for short and predict a cost of $0.0125/GB/month with Glacier potentially dropping to $0.0075/GB/month.”

Seraj April 27, 2014 at 12:27 am

Glacier is very cool – hence the name ;p – but it’s so hard to maneuver, especially for a normal non-tech guru type of person. I’ve found a number of solutions that are built upon the Glacier technology but my favorite was by far Zoolz; so simple to use, and I could back up whatever the heck I wanted with their unlimited section for $2/month. I know a bargain when I see it and I love this one :)

Nikunj April 29, 2014 at 8:17 am

@geoffrey said:
Glacier is S3 with code added for waiting.

I can’t help but notice that “lean practitioners” would definitely see a strong case for doing above in the beginning. Why make huge investments upfront in actual datacenters without validating how big the market would be?

But that wouldn’t explain the overly complicated retrieval pricing model AWS chose (some details here: https://aws.amazon.com/glacier/faqs/#How_much_data_can_I_retrieve_for_free). It made it too difficult for customers as well as solution providers to package Glacier well.

So for me it’s back to disk. Optical or low powered or whatever.

Robin Harris April 29, 2014 at 10:15 am

“I can’t help but notice that “lean practitioners” would definitely see a strong case for doing above in the beginning. Why make huge investments upfront in actual datacenters without validating how big the market would be?”

Great point! I wish I’d made it myself.

Robin

piet May 1, 2014 at 1:56 am

Glacier is nothing more than a bunch of tape libraries, most likely with disk staging in front of it. My bet would be a cluster of SL8500’s with LTO if they run cheap, or T10K drives if they went the enterprise route. The last drives can fit 8TB of data on a single tape and the throughput is much higher than any disk at this time can manage. Tape was, is and will remain the most effective medium for large datasets. Hell, CERN and a whole lot of other research facilities that work with really large datasets use tape, with a damn good reason.

The usage of optical storage is well known in medical and insurance type environments, mostly WORM because of the need to keep records for x years which can not be tampered with. So the continued development of that type of technology is well justified.

And bto disks? maybe, but why make things so complex when you do not need to? From a business pov you do not want to be the only customer with a technology like that, it is likely expensive or presents a larger risk than commodity hardware.

So I’ll go with tape storage, perhaps with some form of HSM (like DMF) or LTFS.

It’s been said for years that tape is dead and that tape is not fancy, but I personally predict the demise of spinning rust BEFORE tape, ie. we’ll have SSD like storage and tape storage but disk will be gone.

Matthew June 7, 2014 at 4:24 pm

The robots are coming, they will be necessary for SSD/Flash based farms, not just for librarian purposes but for “wear” levelling and retirement.

Tape will persist, as there is still no easier way to move big data physically. It’s all very well to talk about storage costs, but the bandwidth costs and the opportunity cost of propagation delay must be taken into account.

Disclosure : ex-IBM.

Tibaut Houzanme September 19, 2014 at 7:44 am

I agree ODD tech is the most likely option AWS is using. Besides, the economics make sense, and the energy saving is good for the environment. The only inconvenience is how to carry the disc cartridge around from system to system, but I digress…there are robots, just for that! AWS could reduce the time to deliver a ‘pull’ or ‘get’ request by taking humans out of the process.

Alican December 13, 2014 at 4:35 pm

It was a great read! I was suspecting that they are using tapes, but Blu-ray optical drives make much more sense to me now.
