Building a 1.8 exabyte data center

by Robin Harris on Sunday, 12 October, 2008

StorageMojo gets questions from baffled civilians out in prospect-land. This one seems worthy of a thorough airing.

The writer is a student and a storage newbie, but she has the kind of question that more folks are asking. Here’s her note, edited for clarity:

I am working on an archive. The main idea is to store large files of 90 GB each. This archive is meant for research purposes (that’s why the files are so big) so we can think of storing millions of these files (20-40 million). I read the article about Google’s warehouse and I think that’s the scale of the project.

I’m sending you an image [ed. note: a BlueArc system] with the kind of rack that I found so far (1TB/blade) but I am not completely sure that’s the latest technology I can use for this huge scale project.

On the other hand, I have the problem of trying to figure out the electricity required for it (thinking of thousands of racks with blades working in a net with backup servers).

I appreciate your time and help in advance.

Let’s do the math
OK, arithmetic. 20 million 90 GB files is, um-m, 1.8 billion GB or, hm-m, 1.8 million TB, which is 1800 petabytes or 1.8 exabytes.

Strictly, 1.2 million 1.5 TB drives would just hold 1.8 EB; round up to 2 million for headroom, still with no redundancy. At an average price of $100 per drive – they are higher now of course, but unboxing 2 million drives would take some time – our student is looking at $200 million just for drives.
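The capacity and drive arithmetic can be checked with a short sketch. It assumes decimal units (1 TB = 1,000 GB) and the low end of the 20-40 million file range; the variable names are illustrative. It reports the bare minimum drive count separately from the 2 million drives the post budgets for.

```python
# Capacity and drive-count arithmetic, decimal units (1 TB = 1,000 GB).
FILES = 20_000_000           # low end of the 20-40 million range
FILE_GB = 90                 # size of each file
DRIVE_TB = 1.5               # capacity of one drive
DRIVE_PRICE_USD = 100        # the post's assumed average price per drive
DRIVES_BUDGETED = 2_000_000  # drive count used in the rest of the post

total_gb = FILES * FILE_GB
total_tb = total_gb / 1_000
total_eb = total_tb / 1_000_000

min_drives = total_tb / DRIVE_TB                   # drives needed with zero headroom
raw_eb = DRIVES_BUDGETED * DRIVE_TB / 1_000_000    # raw capacity of the budgeted drives
drive_cost = DRIVES_BUDGETED * DRIVE_PRICE_USD

print(f"archive size:    {total_eb:.1f} EB")
print(f"minimum drives:  {min_drives:,.0f}")
print(f"budgeted drives: {DRIVES_BUDGETED:,} ({raw_eb:.1f} EB raw)")
print(f"drive cost:      ${drive_cost:,}")
```

At the 40 million file upper end the archive doubles to 3.6 EB, which would exceed the raw capacity of 2 million drives.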

High density packaging is required to reach 500 3.5″ disks in a rack. The Sun Fire 4540 is a 48 drive, 4U box, so ten of them could support 480 drives in 42U – if OSHA would lighten up and you didn’t mind using a ladder to replace failed drives. Any other ideas?

The 4540 also supplies the server cycles this project needs. Other server/storage enclosures would have a hard time equalling the 4540’s density (note: 10 4540’s in a rack is illegal in the US and not practical anywhere) but HP’s ExDS9100 comes close.

At 750 TB per rack (500 drives), 4,000 racks could handle it. A cheap 4 post/42U rack might cost $200 in volume, so call it $800,000 for racks. I don’t know what the boxes would cost but it would be at least several times the racks.
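A quick sketch of the rack count, using only the figures quoted above (500 drives per rack, 1.5 TB drives, $200 per rack). The names are illustrative, and the enclosures themselves are left out because the post gives no price for them.

```python
import math

DRIVES = 2_000_000
DRIVES_PER_RACK = 500
DRIVE_TB = 1.5
RACK_PRICE_USD = 200   # the post's guess for a cheap 4 post/42U rack in volume

tb_per_rack = DRIVES_PER_RACK * DRIVE_TB
racks = math.ceil(DRIVES / DRIVES_PER_RACK)   # round up: no partial racks
rack_cost = racks * RACK_PRICE_USD

print(f"{tb_per_rack:.0f} TB per rack, {racks:,} racks, ${rack_cost:,} for bare racks")
```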

This could also be a great application for Copan’s enterprise MAID systems if the workload is right. With up to 896 TB per rack they have the density. Price/performance would be the next issue to look at.

Would you like a data center with that?
4,000 racks require about 40,000 square feet of floor space. Racks typically account for about 1/3rd of a data center’s floor space – hallways and other equipment occupy the rest – so a 120,000 sq. ft. or 11,100 sq. meter facility is required.

At $500/sq. ft. the data center would cost an additional $60 million. YMMV.
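The floor space estimate follows from two rules of thumb in the text: about 10 sq. ft. per rack (implied by 40,000 sq. ft. for 4,000 racks) and racks taking one third of the floor. A minimal sketch, with illustrative names:

```python
RACKS = 4_000
SQFT_PER_RACK = 10             # implied by 40,000 sq. ft. for 4,000 racks
FACILITY_TO_RACK_RATIO = 3     # racks occupy about one third of the floor
COST_PER_SQFT_USD = 500
SQM_PER_SQFT = 0.09290304      # exact conversion factor

rack_sqft = RACKS * SQFT_PER_RACK
facility_sqft = rack_sqft * FACILITY_TO_RACK_RATIO
facility_sqm = facility_sqft * SQM_PER_SQFT
building_cost = facility_sqft * COST_PER_SQFT_USD

print(f"rack area:     {rack_sqft:,} sq. ft.")
print(f"facility:      {facility_sqft:,} sq. ft. ({facility_sqm:,.0f} sq. m)")
print(f"building cost: ${building_cost:,}")
```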

Oh, and redundancy is extra.

Power
Each drive’s operational power consumption is about 10 watts, so 2 million drives alone eat 20 megawatts. Side note: disk power consumption is the reason that storage vendors find differentiation on power consumption elusive.

Assuming one 250 W server for every 100 drives adds another 5 MW for servers – a low-side estimate. Leaving aside network infrastructure and lighting, the HVAC load for 25 MW is around 12.5 MW, according to some rules of thumb.

Let’s round up and call it 40 megawatts. You’ll want to locate this facility near the Columbia River to get cheap hydropower – maybe next door to Google in The Dalles, Oregon.

I haven’t deciphered BPA power pricing, but I’d guess 40 MW would run about $2 million a month. Copan, much less.
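The power budget can be sketched the same way. The drive, server and HVAC figures are the post's; the 730 hours per month (8,760 / 12) is an assumption added here to back out the electricity rate implied by the $2 million guess.

```python
DRIVES = 2_000_000
WATTS_PER_DRIVE = 10
DRIVES_PER_SERVER = 100
WATTS_PER_SERVER = 250
HVAC_FRACTION = 0.5          # 12.5 MW of cooling for 25 MW of IT load
ROUNDED_MW = 40              # the post's rounded-up total
MONTHLY_BILL_USD = 2_000_000 # the post's guess
HOURS_PER_MONTH = 730        # assumption: 8,760 hours / 12 months

drive_mw = DRIVES * WATTS_PER_DRIVE / 1e6
servers = DRIVES // DRIVES_PER_SERVER
server_mw = servers * WATTS_PER_SERVER / 1e6
it_mw = drive_mw + server_mw
hvac_mw = it_mw * HVAC_FRACTION
total_mw = it_mw + hvac_mw

# Rate implied by the guessed monthly bill at the rounded-up load.
monthly_kwh = ROUNDED_MW * 1_000 * HOURS_PER_MONTH
implied_usd_per_kwh = MONTHLY_BILL_USD / monthly_kwh

print(f"drives {drive_mw} MW + servers {server_mw} MW + HVAC {hvac_mw} MW = {total_mw} MW")
print(f"implied rate: ${implied_usd_per_kwh:.3f}/kWh")
```

The computed total is 37.5 MW before network gear and lighting, which is why rounding up to 40 MW is reasonable.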

The StorageMojo take
Well, what do you know? Building an exabyte data center is feasible. All it takes is money – $400 million with all the goodies – and power.

Time to readjust the mental model of storage possibility. Other than the NSA’s acres of disk at Fort Meade though, I’m not aware of any exabyte data centers.

Technically, I’m not aware of the NSA’s facility either.

Google isn’t a good hardware model, since their storage density is only about 120 drives per rack, not 500. Plus they pass over the highest capacity drives in favor of the lowest $/GB – drives nearing the end of their product life cycle. Google’s GB/sq. ft. is way lower than my interlocutor could afford.

Their software – BigTable and GFS – looks to be a good fit. But even Google doesn’t run 20,000 node clusters. That is Lustre territory.

Who has the best solution to the exabyte problem? Nominations are now open.

I remember hawking 1.8 – one point eight! – gigabytes in a single 42″ rack. For $50,000! A terabyte was almost beyond imagination. Today my home system has 2.5 TB. Whoa.

Comments welcome, of course. Who else could use an exabyte of data? Disclosure: I’ve done work for HP and Sun. None for the NSA, though.

25 comments

Matt Bernstein October 12, 2008 at 11:10 pm

Your last point–the 1.8G for $50k–is possibly the most pertinent. You’ve estimated the cost of storing 1.8EB _today_. The time taken to just write such a volume of data suggests it will be written over a long period of time.

Let’s guess that 2.5″ 10TB drives are available in about 5 years. That gives you over 4PB per rack. So why not ramp up, starting with 3.5″ 750GB or 1TB drives today working up to whatever’s available when the extra capacity is needed? You’re surely going to need to hire a sysadmin anyway, so they could participate in the expansion planning.

I bet it doesn’t all need to be available. 1.6TB LTO5 drives and tapes should be available next year, which are a lot cheaper to cool than spinning hard drives.

Nickolay October 13, 2008 at 2:22 am

Hi. Could you explain “10 4540’s in a rack is illegal in the US and “, how come it is illegal?
Thanks.

Ewan Leith October 13, 2008 at 2:47 am

Thinking about it, this would surely be the ultimate test case for data de-duplication software?

I’ve no idea of the actual usage of the data, but I can’t imagine 1.8 Exabytes of data being used in random read-write situations at the moment; it’s most likely to be serial read-only? Because of this, I’d be looking hard at DataDomain (maybe too pricey?), or the GreenBytes Cypress appliance, which after all is just a Sun X4540 with de-dup software on top – http://www.green-bytes.com/cypress.html

$90,000 list price per 46TB of raw storage for the GreenBytes box is pretty good, though they only put in a de-dup rate of around 30% in their examples, compared to DataDomain who seem happy to claim 90%+ de-duplication.

Either way, I think the most difficult problem in this project would be getting the budget to do it, not the technology to actually make it happen :)

DarkFlib October 13, 2008 at 3:06 am

Besides a research establishment on a similar scale to CERN or the various three-letter agencies around the world, I have a hard time seeing a need for such large amounts of space; even the Internet Archive is only in the 100 TB range at present.

As to the building of the archive, if we assume that 1 engineer can install 500 drives per day and that we have 10 engineers doing this task, then for 2 million drives it would take 400 days to install just the drives, assuming no days off.

As to getting data into the archive, unless the archive is collocated next to the data source, it’s unlikely that full use could be made of it in a reasonable timeframe. 1800PB @ single GigE speeds would take 173000 days to fill, or 475 years. Even a single 10GigE link only brings it down to 47.5 years. If we assume that there are 100 10GigE links between the source and the archive, then it would still take 173 days (assuming 100% ethernet line speed) to fill the archive.

Clearly, the best way to do this would be an incremental expansion, where you bring on new servers/arrays every day and upgrade capacity on drives as they become available. Even if you could have the archive online from day 1 at full capacity, it’s unlikely that you would be able to fill it in under a year. At $500M, there is an appreciable amount of interest that could be gained from delaying even part of it for a few months.
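The install and fill times in this comment can be reproduced with a small sketch. It assumes decimal units (1800 PB = 1.8e18 bytes) and 1 Gbit/s = 1e9 bits/s; the `efficiency` parameter is a hypothetical knob for protocol overhead and is not something the comment specifies.

```python
DRIVES = 2_000_000
DRIVES_PER_ENGINEER_PER_DAY = 500
ENGINEERS = 10
ARCHIVE_BYTES = 1.8e18        # 1800 PB in decimal units
SECONDS_PER_DAY = 86_400

install_days = DRIVES / (DRIVES_PER_ENGINEER_PER_DAY * ENGINEERS)

def fill_days(links, gbit_per_link, efficiency=1.0):
    """Days to write the whole archive over `links` parallel Ethernet links."""
    bits_per_second = links * gbit_per_link * 1e9 * efficiency
    return ARCHIVE_BYTES * 8 / bits_per_second / SECONDS_PER_DAY

print(f"drive installation: {install_days:.0f} days")
for links, speed in [(1, 1), (1, 10), (100, 10)]:
    days = fill_days(links, speed)
    print(f"{links:>3} x {speed:>2} Gbit/s: {days:,.0f} days ({days / 365:.1f} years)")
```

At raw line rate the single GigE case comes out a few percent under the 173,000 days quoted; an efficiency of about 0.96 lands close to that figure, so the conclusion is unchanged either way.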

Robin Harris October 13, 2008 at 3:39 am

Matt, good points. Assuming the disk industry keeps tracking 10x capacity every 5 years, we’ll see 10-15 TB 3.5″ drives in 2013. That would dramatically decrease the rack and floor space requirements as well as power and cooling. $400 million today; $40 million in 5 years.

LTO4 cartridges – 1.6 TB compressed, 800 GB raw – are ~$60 in volume today, or about 1/3rd the cost of 1.5 TB drives. That could work. How big is the largest tape library?

Nickolay, I can’t cite chapter and verse, but Sun’s product page for the 45xx series expansion unit says “192 drives in 16 RU.” A mechanical lift is required to install a fully loaded unit. I believe that the US Occupational Safety and Health Administration (OSHA) has weight and balance requirements that drive the 4 unit limit.

Robin
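The "10x capacity every 5 years" assumption translates into an annual rate as follows. This is a sketch of the compounding only, assuming cost per TB falls exactly as capacity per drive rises; the function names are illustrative.

```python
def annual_factor(factor=10.0, years=5.0):
    """Yearly capacity multiplier implied by `factor` growth over `years`."""
    return factor ** (1.0 / years)

def projected_cost(cost_today, years_ahead, factor=10.0, period=5.0):
    """Cost of the same capacity bought `years_ahead` years from now."""
    return cost_today / factor ** (years_ahead / period)

growth = annual_factor()
decline = 1.0 - 1.0 / growth   # yearly drop in cost per TB

print(f"capacity grows {growth:.3f}x per year; cost per TB falls {decline:.1%} per year")
for year in range(6):
    print(f"year {year}: ${projected_cost(400e6, year) / 1e6:,.0f} million")
```

Under that assumption cost per TB drops by roughly 37% a year, which is the argument for a phased build-out.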

Jason October 13, 2008 at 3:53 am

I feel the need to echo Matt’s comments about LTO5 (and LTO4) drives. The classic answer to questions of this magnitude is to throw Powderhorns at it (BIG Tape libraries). The Sun StorageTek SL8500 is expandable to hold 70 PB worth of tapes. Multiply that out to get the density you require.

You (of course) need some really big disk based systems to work for realtime access, but they’re MUCH smaller than your current scale.

This of course depends on your data usage model… How frequently will each 90 GB file be touched? If it’s more than 3-4 times per day on average, this solution probably won’t work. However, if a file will be used a lot (many times per day) early in its life, and then very rarely after a certain period, this solution is perfect… who cares if the file takes an extra 10 minutes to spin out from tape if you only need it every 3 months or so?

–Jason

Paul October 13, 2008 at 10:12 am

I can’t see deduplication being a good fit here. Just seeing how long it takes my smaller 3020 Netapp to dedupe a 1TB volume, deduping an exabyte just wouldn’t be feasible.

The idea of using something like a Powderhorn is good if the data doesn’t need real time access. It works if you can live with the time it takes to pull data off tape and cache it on disk, but the tape management might make this too complex.

Fazal Majid October 13, 2008 at 5:55 pm

Pay Caterpillar $10M to design a Texas autoloader robot – it uses Sun 4540s as data cartridges…

Pete Steege October 14, 2008 at 5:50 am

I agree with Matt.
Robin, can you calculate the annual cost/TB “decay” that comes from the relentless march of progress? That would drive a very interesting discussion.

Jake October 14, 2008 at 6:20 am

DS8300 in 6 frames (racks) would get you an exabyte for approximately $10M list.

Robin Harris October 14, 2008 at 6:47 am

Jake, did you slip a few decimal points? It looks like a fully expanded DS 8300 is about a thousand drives in a couple of racks. You’d need over a thousand to reach 1.8 EB.

Robin

Jake October 14, 2008 at 7:19 am

Fully loaded capacity is 512TB each, so yes you would need more than my original petabyte calculation. It still is a far better choice than a DIY solution using Sun boxes to front end.

Kevin Closson October 14, 2008 at 8:21 am

I didn’t see the word compression anywhere in this post (did I speed read?). Since it isn’t mentioned, we are presumably fantasizing about 1.8 exabytes of compressed data (perhaps 6 to 8 exabytes of uncompressed data). If the “Thumpers” (SunFire 45XX) were to write uncompressed data as fast as the AMD HT 2.0 interconnect could sustain, it would take something like 3 years of nonstop maxed-out writing of random bytes to lay 1.8 exabytes onto disk.

Did the Soviets start talking about strapping Yuri to a lightning bolt and sending him to another solar system after his first lap around the earth? :-)

Matt L October 14, 2008 at 6:32 pm

What about data de-dup options like Data Domain, EMC, FalconStor and others? Even compressed data can be de-duped at the block level so that only unique data is archived. Now admittedly I am a little green in the storage game, but what type of availability do you require for the data? How quickly does it need to be accessed, and how will it be accessed?

Joe Kraska October 16, 2008 at 7:10 pm

Another issue with data storage at this scale will be namespace management. 1.8 exabytes divided by typical maximum volume sizes today will lead to a volume proliferation nightmare, upon which much very nasty OPEX could very easily rest. As far as who is most suited for this activity now, I’d take a look, perhaps, at HP’s Extreme Data Store when combined with PolyServe. You have both industry-leading density (1.6 racks per PB) as well as fairly large volumes.

This, assuming that you wish to store everything on spinning media.

BTW, I did an analysis like this very recently for a future sustained 17GB/s problem (sustained: 24/7/365). That’s a 1.5PB/day write rate. Kinda big, eh? Anyway, the problem is much more achievable than it sounds if the problem is in 2013. In 2013, I would expect an 8TB SATA drive (or something equivalent) to be readily procured at or below the price of today’s 1TB drives.

Others mentioned deduplication and compression, but there is no mention regarding the duplicative nature of the data or its current compression. I will say this: it sounds a bit like a content addressable storage (CAS) problem. The CAS space seems to be kind of formative to me, but some of those technologies might complement a truly gigantic archive.

Finally: I would think such an archive would be dying for ILM of some sort.

Joe.

Joe Kraska October 16, 2008 at 7:13 pm

BTW, the 1.8 exabyte archive: how fast does it fill up? If it takes a year, I believe that this is ~60 GB/s … 24/7/365. So this archive may have some needs for a few fairly hefty OC links as well.

Joe.
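Both ingest figures in these two comments can be checked with a short sketch, assuming decimal units and a 365 day year. The function names are illustrative.

```python
SECONDS_PER_DAY = 86_400

def sustained_gb_per_s(total_bytes, days):
    """Average write rate in GB/s needed to land `total_bytes` in `days`."""
    return total_bytes / (days * SECONDS_PER_DAY) / 1e9

def pb_per_day(gb_per_s):
    """Daily volume in PB produced by a sustained rate in GB/s."""
    return gb_per_s * SECONDS_PER_DAY / 1e6

fill_rate = sustained_gb_per_s(1.8e18, 365)   # fill 1.8 EB in one year
daily_volume = pb_per_day(17)                  # the 17 GB/s problem

print(f"1.8 EB in a year needs {fill_rate:.1f} GB/s sustained")
print(f"17 GB/s sustained is {daily_volume:.2f} PB/day")
```

The results are about 57 GB/s and 1.47 PB/day, in line with the rounded ~60 GB/s and 1.5 PB/day quoted.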

Richard B October 17, 2008 at 8:49 am

Ewan, and others who’ve mentioned de-dup ratios, you should take a large pinch of salt with vendors’ claims about the reduction they can effect. Data Domain, EMC Avamar and others talk in the context of *backup* data – that is, repeated storage of very similar data – so they are able to point to huge reductions. You will clearly see a lot of block level duplication in that sort of environment. When you’re looking at primary storage, as GreenBytes seem to be and NetApp are now starting to, the type of data is all important. If this is archived data it could be scanned images, in which case the amount of duplication could be close to zero.

Nicolai Plum October 18, 2008 at 1:56 pm

About the “OSHA won’t let you put ten X45xx in a rack”: the weight of the J4500 is specified at 77kg. Most racks and the datacentre floors they sit on are rated for something like 400kg including the weight of the rack, PDU, etc. I expect that Sun’s 4 boxes-in-the-rack figure is based on what’s generally achievable in most datacentres today.
Sure, you could get racks that will hold 800kg, and floors to hold up the rack-and-storage combo that weighs in at about 1000kg/rack all up, but they’re not standard items. Might want to use racks sitting on the subfloor, run the power and network and air overhead, and talk to the architect about the building’s design parameters.
You’ll also still need a baby forklift to get the units into the top of the rack.
Then each rack is going to consume and emit 11 kW, so heavy duty air circulation will be required.
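The weight argument can be sketched as below. The 77 kg unit weight and 400 kg rating come from the comment; the 90 kg allowance for the empty rack and PDUs is a hypothetical figure chosen for illustration, so the exact totals will differ for real hardware.

```python
UNIT_KG = 77               # J4500 weight quoted in the comment
FLOOR_RATING_KG = 400      # typical rack/floor rating quoted in the comment
RACK_AND_PDU_KG = 90       # hypothetical allowance for empty rack, PDUs, cabling

# Units that fit within the rating once the rack itself is accounted for.
max_units = (FLOOR_RATING_KG - RACK_AND_PDU_KG) // UNIT_KG

# Load if ten units were stacked in one rack anyway.
ten_unit_load_kg = 10 * UNIT_KG + RACK_AND_PDU_KG

print(f"units within a {FLOOR_RATING_KG} kg rating: {max_units}")
print(f"ten units plus rack: {ten_unit_load_kg} kg")
```

With that allowance the limit works out to four units per rack, and a ten unit rack lands well above the rating of standard racks and floors.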

Joe October 25, 2008 at 10:11 am

I didn’t see this post till today. It is interesting given what occurred at a small regional event we went to.

We had a booth at a small event (Ohio Linux Fest: … hey, we sold a JackRabbit to an attendee from that last year, so we were hoping to replicate that success). You can see your humble contributor in some of the pictures here.

The interesting thing was, during our booth time, someone came up to me and started grilling me on JackRabbit and ΔV capability. While Sun and others were there, he didn’t quite look like your typical G2-gatherer.

He finally caved, indicating he was looking to build an exabyte sized data center. I gave him a rough estimate of the number of units of anyone’s storage, power consumption, costs, etc. Not that far off Robin’s.

I had chalked this up to a somewhat … eccentric … person asking if it were possible. I doubt this was a person from a three-letter-agency (though curiously we are getting more hits on the sites from the “Maryland procurement office” these days … hmmmm).

Now Robin, you are giving me pause to reflect upon this conversation. They asked me similar questions. If that person is the same one who asked you about this, and they are lurking, I am curious as to how serious they were. Intrigued, not from a vendor perspective, but from a management perspective. As you scale up the number of parts, your probability of failure over some interval approaches an asymptotic limit of 1. Which means that no matter what you do, you will always have to contend with some aspect of a failed part. This is IMO far more important than other considerations mentioned. Some mentioned de-dup as a technology to use for this, though this begs the question as to why they think that 90 GB data sets would have duplication in them? I would imagine that run length encoded compression may be more beneficial and far faster than de-dup.
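The point about failure probability approaching 1 can be made concrete. This sketch assumes independent failures at a constant rate and a hypothetical 3% annualized failure rate (AFR); neither the rate nor the independence assumption comes from the thread.

```python
import math

def p_any_failure(parts, afr, days):
    """Probability that at least one of `parts` fails within `days`,
    assuming independent parts with annualized failure rate `afr`."""
    # 1 - (1 - afr) ** (parts * days / 365), computed stably via logs.
    return -math.expm1(parts * days / 365 * math.log1p(-afr))

DRIVES = 2_000_000
AFR = 0.03   # hypothetical annualized failure rate

# Approximate daily replacement count (valid for small failure rates).
expected_failures_per_day = DRIVES * AFR / 365

for n in (1, 100, 10_000, DRIVES):
    print(f"{n:>9,} drives: P(at least one failure in a day) = {p_any_failure(n, AFR, 1):.6f}")
print(f"expected failures per day at {DRIVES:,} drives: {expected_failures_per_day:.0f}")
```

At this scale the question is not whether something has failed today but how many replacements the staff must handle, which is why failure handling dominates the design.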

More to the point, I didn’t get a sense from my questioner what their data sets were. I don’t know if Robin got that. If this is imagery, and they want to avoid lossy compression, RLE and other lossless techniques could help, at the cost of processing power. If this is genomic or similar data, you have other techniques. De-dup doesn’t quite factor into these.

It seems to me that the phased build out approach would make the most sense. Moreover, someone suggested slotting in and out x4500’s (ok, we would prefer JackRabbits). This may be feasible, though rather than adapt the robotics to handle that, mount the units vertically, and use the robotic mechanisms to slot drives in and out of the chassis.

The large tape storage folks could do this. Then there is the question of how to have some sort of file system handle this. You would need to envision some sort of large cache file system for handling inbound data, and some sort of distributed meta-data mapper (standard meta-data plus a directory of where the data lives). Rather than de-dup, you would want to either dup or code the data to handle drive lossage.

Would be interesting to talk about the tech behind this.

Joe Kraska October 25, 2008 at 10:37 am

The exabyte data center is not far off from being real.

We have been working a coming challenge problem for a three letter agency involving ingest and store rates of 1.5PB/day. That’s about 17GB/s sustained write, 24/7/365. Individual streams can come in at 1+ GB/s.

The real humdinger of it all is that the customer prefers disk and not tape for all storage.

Neither dedup nor compression is possible with these data types (they are not duplicative, and they are already compressed).

–Joe

Emmanuel Florac October 30, 2008 at 4:51 am

The new Datadirect 6620 may be a step in the right direction: 60 drives per 4U enclosure, with sophisticated power management (automatic drive spindown). They actually seem to sell the setup S2A9900+6620 enclosure with 1TB drives; it’s 1.2PB per 2 racks with 2 RAID controllers, something like 1PiB available storage (with parity and spare) with an average power consumption of 36.6kW. You’ll have to add some servers to that setup, however.

Joe Kraska October 31, 2008 at 4:01 pm

I think it’s 1.6 racks per PB, formatted and usable on the DDN storage. They also have some nice power savings capabilities. Their RAID-6 rebuild on the fly and real time capability is, however, what I find most exemplary about DDN. You basically never even really need to know that RAID rebuild is going on.

–Joe.

Bill Mottram November 8, 2008 at 11:12 am

There is a limit to how many Sun Fire 4540s can be mounted in one rack. The limit, as I understand it, is 4. This will increase the rack count significantly.

joe December 3, 2008 at 5:55 pm

Is this a joke? When you say “I don’t know what the boxes would cost but it would be at least several times the racks.”, are you budgeting $1000 for 10 Sun Fire 4540 boxes? These numbers are way, way wrong.

Mark October 4, 2011 at 1:23 pm

Perhaps it is in poor taste to revive a dead thread, however given my present (2011) is now almost three years in the future from this past conversation, I am wondering how far we are now toward the projected 2013 technology and what could now be projected to 2015 and beyond.
