Reactions to the post on Amazon’s Glacier secret were varied and sometimes enlightening – with one savvy observation that I wish I’d made. The post made Hacker News (h/t to Mark Watson for the alert) and received 40+ comments.
Alternate ideas
A number of folks suggested that Glacier used some variation on what Amazon already does: disks in commodity servers. With wrinkles:
- S3 plus code to enforce the retrieval wait.
- Special low-RPM disks designed to stay powered down, but racked and ready to spin up, with low-power controllers.
- Old hard drives that are no longer economical for more intensive service, supported by disk-handling robotics.
Long story short, almost everyone thought it was disks in some power-down configuration. Which makes sense if power is the driving cost for an Internet-scale data center.
But power isn’t the driving cost. It is one cost, but when you buy megawatts, your pricing is very different than at home.
Why not disks?
An easy reason is that the cost of a disk slot is significant. It has to be powered, racked, controlled and managed. While powering down the disks allows for power system over-provisioning – which lowers the cost per unit of the power system – you still have to cable it up. As StorageMojo noted in a review of a Google paper:
- The capital cost of provisioning a single watt of power is more expensive than 10 years of power consumption.
- Data centers are most economically efficient operating at close to 100% of provisioned power.
- The greatest opportunity for power savings comes from reducing the power consumption of idle kit, not from making busy kit more efficient.
Saving power is a Good Thing, but at Internet scale it is also a Different Thing. While power is important to operating expense, it is the capital expense – the first money in – that drives economic efficiency. In 10 years all the servers, switches and disks get replaced, so you can improve OpEx, but capital dollars sit there forever.
Unless the prices of copper, PDUs and diesel-generators have started following Moore’s Law, this is probably more true today than in 2007.
Bottom line: even if power cost nothing, nada, zip, you’d save at most 50% and probably less. When Glacier came out its discount over S3 was much larger than that; even today it’s a third the price of S3. Power savings alone can’t justify Glacier’s pricing.
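To put rough numbers on the provisioning point: at an assumed industrial rate of $0.07/kWh, a watt run flat-out for 10 years costs about $6, while provisioning that watt of data center capacity costs more than that. The figures below are my own illustrative assumptions, not Amazon’s – a minimal sketch:

```python
# Back-of-the-envelope: provisioning capex vs. 10 years of power.
# Both numbers below are illustrative assumptions, not Amazon's costs.

ELECTRICITY = 0.07            # $/kWh, assumed industrial rate
PROVISION_PER_WATT = 12.00    # $ of capex per provisioned watt, assumed

def ten_year_power_cost(watts: float) -> float:
    """Cost of running `watts` continuously for 10 years."""
    kwh = watts * 24 * 365 * 10 / 1000.0
    return kwh * ELECTRICITY

print(f"10 years of juice for 1 W: ${ten_year_power_cost(1):.2f}")   # ~ $6.13
print(f"capex to provision 1 W:    ${PROVISION_PER_WATT:.2f}")       # more than the power itself
```

Powering disks down attacks the smaller of those two numbers; the slot, rack and cabling capex is still spent.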
The savvy observation I wish I’d made came from StorageMojo commenter Nikunj Verma:
I can’t help but notice that “lean practitioners” would definitely see a strong case for doing the above in the beginning. Why make huge investments upfront in actual datacenters without validating how big the market would be?
That answers the question of why supposedly ex-AWS people might believe Glacier is disk-based.
Glacier’s pricing wrinkle
Some people pointed to Glacier’s pricing as evidence, unless, as some suggested, AWS doesn’t need to make money. Uh-huh. But one point neither they nor I mentioned is Glacier’s price for data deletion within 3 months of upload:
In addition, there is a pro-rated charge of $0.03 per gigabyte for items that are deleted prior to 90 days.
Thus AWS is intent on getting at least 3¢/GB out of all data uploaded to Glacier. Which suggests that they have some fixed costs they want to recover, such as, say, media? No deletion charge on S3.
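The arithmetic is suggestive: Glacier launched at $0.01/GB/month, so the $0.03/GB early-deletion charge works out to exactly three months of storage revenue – a minimum commitment in all but name. A quick sketch (the pro-rating formula is my assumption; the $0.03 and the 90 days are from AWS’s pricing page):

```python
# Glacier's early-deletion charge as a 3-month minimum, assuming the
# launch price of $0.01/GB/month. The exact pro-rating is assumed.

PRICE = 0.01       # $/GB/month
MIN_MONTHS = 3     # 90-day window from the pricing page

def early_delete_fee(months_stored: int) -> float:
    """Pro-rated $/GB charged if the archive is deleted early."""
    return max(0, MIN_MONTHS - months_stored) * PRICE

for m in range(4):
    total = m * PRICE + early_delete_fee(m)
    print(f"deleted after {m} month(s): fee ${early_delete_fee(m):.2f}, total paid ${total:.2f}/GB")
```

However you slice it, every uploaded gigabyte returns at least 3¢.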
The BDXL question
But NONE of the Hacker News commenters addressed Sony and Panasonic’s continued investment in high-density optical disc technology. Not only are they making triple-layer BDXL today, but they’ve announced plans to go from 300GB to 1TB over time.
It could be that they’re stupid and/or obstinate – which explains a lot of real world behavior – but that’s unlikely given the financial stress both companies are under. There has to be a business reason for the continued investment, i.e. customers prepared to buy a lot of product in the future and buying a lot right now.
The intelligence community could buy a lot and probably does. But I’ve seen credible suggestions that Facebook and Amazon each buy petabytes of storage a week. If, as research has found, much of that data is not accessed after a few months, it would make sense for them to go optical, as FB has announced it is testing.
The need for higher data bandwidth also explains why Panasonic has a 12-disc optical RAID. With replication you could even skip the RAID.
BDXL discs on Amazon are at least $45 each. You can buy a 1TB disk for about that. So somebody is buying BDXL discs in bulk or they wouldn’t exist – and it sure isn’t consumers.
The StorageMojo take
The biggest surprise of the Hacker News comments was how reductionist most views of the issue were. Cheap storage? Powered down disks.
But power isn’t the major driver of cost at Internet scale.
Maybe AWS is making stuff up, or doesn’t need to make a profit on the service. In competitive analysis you assume you’re dealing with a rational actor, or anything goes. That may over-estimate their smarts – as the British did looking at German radar during WWII – but at least you won’t be caught unawares. Much better than under-estimating their smarts, as the Germans did with Enigma decryption at the same time.
The solution space has to take into account these facts:
- Glacier is significantly cheaper than S3
- They charge for deletions in the first 3 months
- Power is not the driving cost for Internet scale infrastructure
- Sony and Panasonic continue to invest in a product that has no visible commercial uptake
- Facebook believes optical is a reasonable solution to their archive needs
So unless these aren’t facts, the answer points to optical media. But please offer another suggestion.
Courteous comments welcome, of course. Commenters, start your engines!
The most expensive part of a data center is power.
The machine has banks of spinning hard drives. Each ‘bank’ is on for 10 minutes every three hours. There are 18 banks per server, representing 540 terabytes per ‘rack’.
Writes are instant; to read, you have to wait until the bank that contains your data is powered on, at which point it is copied to a staging location.
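Those numbers are internally consistent: 18 banks at 10 minutes apiece fill the 3-hour cycle exactly, so one bank per server spins at any moment and a read waits at most a little under 3 hours – in the same ballpark as Glacier’s multi-hour retrievals. A quick sketch of that rotation (the round-robin schedule is assumed, not anything Amazon has described):

```python
# Sketch of the bank-rotation idea above: 18 banks per server, each
# powered for 10 minutes out of every 3 hours, so exactly one bank
# spins at a time. The round-robin schedule itself is assumed.

BANKS = 18
ON_MINUTES = 10
CYCLE = BANKS * ON_MINUTES        # 180 minutes = 3 hours

def active_bank(minute: int) -> int:
    """Index of the bank that is powered at a given minute."""
    return (minute % CYCLE) // ON_MINUTES

def wait_for_bank(bank: int, minute: int) -> int:
    """Minutes until `bank` next powers up (0 if it is on now)."""
    delta = (bank * ON_MINUTES - minute) % CYCLE
    return 0 if active_bank(minute) == bank else delta

print(active_bank(25))        # bank 2 is spinning at minute 25
print(wait_for_bank(0, 10))   # just missed bank 0: 170 minutes
```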
Hi Robin.
I’ve been following cold storage for a while too, and I think modern tape or optical disc media is the web-scale answer because:
1) incredibly low cost/space
2) most people believe powering down hard disks for an extended time period tends to shorten their lifespan compared to spinning them.
I did not see any worthwhile HN comments on the topic, and “lean startup” tips don’t apply to FB, etc.
James.
To be honest I didn’t expect to hear that optical will be the solution, but I am glad you pointed out that BDXL discs are bought in big quantities, because it makes sense of their price now (compared to 2 years ago).
But let’s not forget the advantage of using a hard disk, in case you want some sort of remote access to your data without too much hassle.
What I find more interesting than Glacier pricing is Google’s $10/TB/month without any of the hidden fees and with instant access.
The data density growth of disks is flattening ever more (and has been for the past few years). The density growth for tape is still more or less linear, as long as you use enterprise-class tapes, that is, like IBM or Oracle drives. (LTO cannot grow since, as an OPEN tape format, it cannot use the closed media type in the enterprise-level tapes.) This means if you buy a library today it will hold twice the amount of data with only upgrading a handful of drives (and the tapes of course) every x years.
The library investment therefore is written off, usually over periods of 10 years or more. The PowderHorn, one of STK’s most legendary libraries, was supported from the early nineties up to 2010 if memory serves. Tape is also much better equipped to be moved around than disks are. You can drop a tape without problems, you can place a magnet on top of the case, even move it OVER the tape, and your data is still there (not saying you should take that risk!); a disk is far more fragile.
My bet is still: tape.
I think it’s an interesting idea but upon reflection clearly wrong. First, a quick search of Amazon.com reveals 100GB BDXL media costing $45. This is horribly expensive, and if enterprise customers can somehow drive prices lower than consumers can, it would be something of a first. It’s not like consumers wouldn’t love a cheap, effective 1TB backup system. The problem with BD is that it’s smaller than the media we need to back up, which makes backing up a pain. CDR was successful for a long time because when CDR went mainstream a big hard disk was 200MB. DVDR never got past the starting gate as a backup medium because it started out too small. With BDXL we’re talking $0.45/GB vs. $0.03/GB, and $0.03 sounds suspiciously like $100 per 3TB hard disk.
Second, the argument about provisioning power being the decisive expense misses the point. If you build a facility which expects 99.9% of its hard disks to be idle, you only need to provision for 0.1% (plus a safety margin) of the power. You save both provisioning and usage costs.
And of course you’re ignoring the cost of building custom robots to move the discs around. This all starts to sound expensive and complicated, when I suspect the whole point is simply to maximize economies of scale and more-or-less break even. Deduplication will be a huge win here, and the cost for ephemeral data ensures that Glacier won’t be wasted on things like rotating daily backups and log files.
Robin,
I appreciate the effort in trying to walk this back and figure out what the underpinnings are. It has fascinated me also.
In that other thread I wondered if you took into account that the data has to exist in multiple places. From their FAQ:
Q: How durable is Amazon Glacier?
Amazon Glacier is designed to provide average annual durability of 99.999999999% for an archive. The service redundantly stores data in multiple facilities and on multiple devices within each facility.
…
Multiple devices in each facility. How about a design that does full-stride writes to, say, 14 tape drives with DP protection à la NetApp, in three data centers. Would that hit 11 9’s of durability? I guess the point here of course is you need to add another fact: “Multiple devices in multiple data centers.” Now… get ‘er done without tape, let us know how those numbers work with optical.
guest is 100% correct. I’ve seen them, touched them, serviced them. 90+ disks to a 4U shelf. 3 shelves to a head. I’m pretty sure there were 4TB drives in the trays 6 months ago. Shingled drives are also likely to be deployed soon if not by now. Given shingled drives’ extreme sensitivity to vibration there can only be like 3 other spinning disks in the tray. But since S3/Glacier use what is tantamount to write-anywhere, re-writing tracks is less of a problem than in your more normal workloads.
The racks are underprovisioned in power significantly so they can only handle about 1/4 of the drives actually spinning at any given point in time or they’ll blow the circuit breaker. S3 and Glacier both use erasure-coding though I don’t know what N:K ratio Glacier uses.
So *please* people, enough with the stupid optical media/tape poppycock. It’s disk. Always has been disk. The nonsense supposedly “smart” industry people will believe…
Amazon operates at an effective profit margin of 0.7%, if you bothered to read their disclosures. Admittedly AWS is vastly more profitable than that. Joe Six-pack can buy a 3TB drive for $100. Why ON EARTH would you use those numbers as if they had any bearing on what Amazon pays for their gear? Amazon buys how many thousands of drives every week? Each rack has ~1000 drives, and deploying anywhere up to a dozen Glacier (or S3) racks EVERY week somewhere in the world is hardly unusual.
What about this? Sony blows away record with 185TB cassette tape http://zite.to/1fIjYu6
Glacier would be a profit center for Amazon if they just bought Backblaze Storage Pods (fixed cost of about $0.05 per GB at Backblaze’s prices) and some low overhead error correction codes to deal with drive failures. I doubt they are actually using Backblaze Storage Pods, but they are probably using something with similar storage cost per GB.
Back of the envelope calculations:
- Fixed cost: $0.05 per GB
- Monthly recurring cost: $0.001 per GB (let’s say $2k per rack for 2,000 TB)
- Monthly profit margin for Amazon: $0.009 per GB (at Glacier’s $0.01/GB/month price)
- Time till break-even: 5.5 months
If the storage pod lasts 36 months they make back over 6x their initial investment; not bad…
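Spelled out, and assuming Glacier’s $0.01/GB/month list price (the hardware and overhead figures are the commenter’s estimates, not Amazon’s):

```python
# The back-of-the-envelope above, per GB, assuming Glacier's
# $0.01/GB/month price. Hardware and overhead figures are the
# commenter's estimates, not Amazon's.

fixed_cost   = 0.05     # $/GB for the hardware (Backblaze-pod class)
monthly_cost = 0.001    # $/GB/month overhead ($2k/rack for 2,000 TB)
price        = 0.01     # $/GB/month Glacier revenue

margin = price - monthly_cost            # $0.009 per GB per month
print(f"break-even: {fixed_cost / margin:.1f} months")                  # ~5.6
print(f"36-month margin vs hardware: {36 * margin / fixed_cost:.1f}x")  # ~6.5x
```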
Seems to me the point of Glacier is to decimate the value proposition of and indeed annihilate the very thought of ‘tape’ across the entire IT landscape. If you’re big enough to be spending millions on a 20-ft diameter, multi-thousand-tape jukebox and zillions on the robotics, I don’t think Bezos has you in mind. But if you were thinking of buying a 4-12 drive and 1500 cartridge unit from the likes of Quantum and retaining the services of Iron Mountain et al., the economics of Glacier should put any such foolishness out of your mind (ignoring for the moment the trade-off of not having positive, physical control of your data, or the time and bandwidth charges needed to read it back).
Once the likes of Ceph and Gluster implement erasure-coding the business viability of the present-day preposterously expensive commercial offerings goes to zero.
Will banks and other financial, chemical, and bio-tech companies fall over themselves to go Glacier? Rather doubt it. But anything that is archive (genomes, census data, economic statistics, geo-location, weather observations, GIS and all kinds of massive datasets, even more pedestrian stuff like executive email archives, legally producible artifacts) would be just fine. Just encrypt your data before you send it over.
Backblaze pods are complete and total junk, IMO. Why Backblaze haven’t engaged the ODMs to build them a non-suck-#$% chassis and backplane is beyond me. Admittedly they won’t get Amazon’s volume pricing but the professional in me cringes at the silly levels of questionable consumer-grade stuff they used. Sure, it’s worked out for them. However, I wouldn’t dream of using such low-grade components for my own personal, home server. Guess I’m a snob but 20% more for a SAS multiplier backplane and properly robust 1200W power supplies makes sense to me.
By contrast, the storage nodes for S3/Glacier have to WORK. It’s not just a few haphazard home users on slow connections using your service in a throw-away fashion. It’s tens of millions of users and quite a few entities whose business offerings are heavily dependent on it. Of course one would write the software to expect hardware failure, but there is a limit to how far that can be pushed. It’s one thing to not spend money on motherboard features or high-speed SAS controllers you don’t really need, but it’s another thing entirely to use cut-rate consumer parts or screw up basic mechanical and electrical engineering.
Penguin Computing has rather nice examples of compute+storage in a 4U combined enclosure. No, they aren’t $3000 each (more like 2x that), but that’s in onesie-twosie quantities. Put an order in for 1000 and watch that price come WAY down.
Tape and other “3x” strategies are just pathetically inefficient and lacking. The major problem with optical, and tape for that matter, is that you can’t write randomly to them. Let’s say your erasure coding is 8:15, so to write any given piece of data you need no less than 15 drives and 15 pieces of media spread across no less than, say, 3 or 4 geographic locations. If it’s tens of MB it would be pretty lousy. Multiple tens of GB and we’re starting to get better trade-offs. Say all 15 pieces were successfully written but now 2 of the discs are bad. How do you detect that they’re bad besides engaging the robot and doing a media scan and checksum? And you can’t just leave the file degraded forever lest you keep losing more and more pieces till there aren’t enough to recover the file, so now you have to proactively rebuild it. You’ll need to load no less than 8 drives with the right discs, compute the missing pieces and write those out to other cartridges in the proper geographic locations. Given that it takes 20+ seconds to load a tape cartridge (assuming a drive is available) and all kinds of time to seek to the file, the only rational system is to use disk, which has access times in milliseconds and which also lends itself wonderfully to repeated and blazing-fast turn-around for integrity checks.
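To make the bookkeeping concrete: an 8-of-15 code stores 1.875 raw bytes per logical byte versus 3.0 for triplication, but every repair still means reading back 8 surviving fragments – cheap on disk, painful when each fragment costs a robot load and a seek. A small sketch (the 8:15 ratio is hypothetical, as above):

```python
# Bookkeeping for a k-of-n erasure code, e.g. the hypothetical 8:15
# scheme above (8 data fragments, 7 parity). Not an actual codec.

def raw_overhead(k: int, n: int) -> float:
    """Raw bytes stored per logical byte (3.0 for plain triplication)."""
    return n / k

def fragments_to_read_for_repair(k: int) -> int:
    """Any k surviving fragments are enough to regenerate lost ones."""
    return k

print(raw_overhead(8, 15))              # 1.875x vs 3x replication
print(fragments_to_read_for_repair(8))  # 8 loads/seeks per repair
```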
I wouldn’t be surprised if the reason for the ‘delete penalty’ in Glacier has everything to do with influencing user behavior so people treat it appropriately. Just look at your own backup strategies for the glaring stupidity of most backup regimes. Indeed I can’t imagine a Glacier delete is really any different than an S3 delete – update some metadata and put the data blocks on the ‘free’ list. It’s just that the housekeeping jobs take a whole lot longer to run on Glacier than S3 because giant cross-sections of the environment aren’t online.
Guest2:
“Seems to me the point of Glacier is to decimate the value proposition of and indeed annihilate the very thought of ‘tape’ across the entire IT landscape.”
“But if you were thinking of buying a 4-12 drive and 1500 cartridge unit from the likes of Quantum and retaining the services of Iron Mountain et al., the economics of Glacier should put any such foolishness out of your mind.”
A few thoughts… the bandwidth to the tape drives is quite a bit higher than the bandwidth to and from Glacier. Unfortunately, bandwidth costs to the internet are much too high, and so Glacier’s long-term archive use cases are rather narrow (5 hours to begin the restore, and then slow going given how little internet bandwidth *most* end users have).
Regarding value proposition and Iron Mountain… Tape is still considerably cheaper:
http://www.spectralogic.com/blog/index.cfm/2012/8/24/Economics-of-Tape-Indicate-Warm-Waters-for-Glacier
“Quantum today saying they can offer 10PB of tape stored for 5 years for $669,663.70 versus around $6,000,000 for Glacier.” … Now maybe that is all bluster, but tape is quite a bit cheaper as far as I can tell. Regarding Iron Mountain, that’s so 1990s. What is happening now is tapes being managed in-house and shipped to a COLO; unless you are really large, and then sure, let Iron Mountain do it – less management on your end. That isn’t the be-all and end-all. Avamar grids replicating to and from the COLO. Of course database backups go to VTL and are de-staged to tape; the change rate is too high to go to Avamar for most folks. Cost per TB for Avamar is outrageous. My point here is there is quite a tapestry out there and tape isn’t going away in large shops; in SMBs somewhat, as the bean counters are the blockers when it comes to doing away with tape.
Finally…
“I wouldn’t be surprised that the reason for the ‘delete penalty’ in Glacier has everything to do with influencing user behavior so people treat it appropriately.”
That is exactly what I was thinking.
Glacier, using BDXL WORM, fits a very specific niche in the Cloud archive storage market. If you know that your need to access a specific set of data is a very low probability, if the data “chunks” are generally smaller than 50GB and you are mature/sophisticated enough in your data archive deployment planning, then an automated BDXL archive system makes perfect sense as one option.
Energy savings in tech companies is often measured in social responsibility and NOT in cost savings. Look at their construction and operations investments to reduce legacy energy use or exploit alternative energy, and the “energy is not important” argument seems foolish to even voice.
The BDXL OPARG discs are really big WORM-style media, intended to be left alone after the data is recorded. If you want to “delete” data, you have to keep a deletions database to tell the system what is not supposed to be available. At some point, you would re-burn the BDXLs to omit the deletions and shrink the database overhead and the physical size of the media collection. This easily explains a delete penalty in Glacier. Pruning the data after it is committed to WORM is a major headache, and doing it in the correct order is necessary.
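A deletions database like that is essentially a tombstone catalog: deleting only touches metadata, and a disc gets re-burned once enough of its contents are dead. A minimal sketch of the idea (the names and the threshold are illustrative, not anything AWS has described):

```python
# Sketch of a deletions database for WORM media: a delete is only a
# tombstone until enough of a disc is dead to justify a re-burn.
# Names and the threshold are illustrative, not anything AWS has described.

class WormCatalog:
    def __init__(self):
        self.index = {}           # archive_id -> disc_id
        self.tombstones = set()   # archive_ids logically deleted

    def delete(self, archive_id: str) -> None:
        """Logical delete: the bytes stay on the disc until a re-burn."""
        self.tombstones.add(archive_id)

    def is_live(self, archive_id: str) -> bool:
        return archive_id in self.index and archive_id not in self.tombstones

    def reburn_candidates(self, threshold: float = 0.5):
        """Discs where more than `threshold` of the entries are tombstoned."""
        counts = {}   # disc_id -> [live, dead]
        for aid, disc in self.index.items():
            c = counts.setdefault(disc, [0, 0])
            c[1 if aid in self.tombstones else 0] += 1
        return [d for d, (live, dead) in counts.items()
                if dead / (live + dead) > threshold]
```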
Finally, there is a fundamental observation that has not been discussed – BDXL discs are not mechanical devices. The drives are relatively robust and inexpensive. Likewise, they can be purpose-built to perform either the write or the read function more efficiently.
Tape has always suffered from being a mechanical device within another mechanical device. Cassettes malfunction even when the record/playback device is functioning perfectly. Hard disks are encapsulated mechanical devices where the media is not separate from the device. In terms of cost and long-term reliability, simplicity breeds low cost, low capital investment and lower overhead costs.
J Hamilton has a blog entry that, without saying as much, hints that Glacier is using or will be using Blu-ray:
http://perspectives.mvdirona.com/2016/03/everspan-optical-cold-stroage/
“But, leveraging an existing disk design will not produce a winning product for archival storage. Using disk would require a much larger, slower, more power efficient and less expensive hardware platform. It really would need to be different from current generation disk drives.
Designing a new platform for the archive market just doesn’t feel comfortable for disk manufacturers and, as a consequence, although the hard drive industry could have easily won the archive market, they have left most of the market for other storage technologies.”
It is ironic, though, that Backblaze is offering a low-brow alternative to Glacier based on consumer 4TB drives: http://www.sabi.co.uk/blog/16-two.html#160701