StorageMojo gets questions from baffled civilians out in prospect-land. This one seems worthy of a thorough airing.

The writer is a student and a storage newbie, but she has the kind of question that more folks are asking. Here’s her note, edited for clarity:

I am working on an archive. The main idea is to store large files of 90 GB each. This archive is meant for research purposes (that’s why the files are so big) so we can think of storing millions of these files (20-40 million). I read the article about Google’s warehouse and I think that’s the scale of the project.

I’m sending you an image [ed. note: a BlueArc system] with the kind of rack that I found so far (1TB/blade) but I am not completely sure that’s the latest technology I can use for this huge scale project.

On the other hand, I have the problem of trying to figure out the electricity required for it (thinking of thousands of racks with blades working in a network with backup servers).

I appreciate your time and help in advance.

Let’s do the math
OK, arithmetic. 20 million 90 GB files is, um-m, 1.8 billion GB or, hm-m, 1.8 million TB, which is 1800 petabytes or 1.8 exabytes.
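The unit conversions are easy to fumble at this scale, so here is a minimal Python sketch of the same arithmetic. It assumes decimal units (1 TB = 1,000 GB), as the figures above do; the function name is made up for illustration.

```python
def archive_size(files: int, gb_per_file: int) -> dict:
    """Return the total archive size in decimal GB, TB, PB and EB."""
    total_gb = files * gb_per_file
    return {
        "GB": total_gb,
        "TB": total_gb / 1_000,
        "PB": total_gb / 1_000_000,
        "EB": total_gb / 1_000_000_000,
    }

# Low and high ends of the 20-40 million file range in the note.
low = archive_size(20_000_000, 90)
high = archive_size(40_000_000, 90)

print(f"low end:  {low['TB']:,.0f} TB = {low['PB']:,.0f} PB = {low['EB']:.1f} EB")
print(f"high end: {high['TB']:,.0f} TB = {high['PB']:,.0f} PB = {high['EB']:.1f} EB")
```

The low end reproduces the 1.8 EB figure; the high end of her range doubles it to 3.6 EB.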

Strictly, 1.2 million 1.5 TB drives would hold 1.8 EB with no redundancy. Call it 2 million, or 3 EB raw, to leave headroom for file system overhead and growth. At an average price of $100 per drive – they are higher now of course, but unboxing 2 million drives would take some time – our student is looking at $200 million just for drives.

High density packaging is required to reach 500 3.5″ disks in a rack. The Sun Fire 4540 puts 48 drives in a 4U box, so ten of them could support 480 drives in 42U – if OSHA would lighten up and you didn’t mind using a ladder to replace failed drives. Any other ideas?

The 4540 also supplies the server cycles this project needs. Other server/storage enclosures would have a hard time equalling the 4540’s density (note: ten 4540s in a rack is illegal in the US and not practical anywhere) but HP’s ExDS9100 comes close.

At 750 TB per rack (500 drives), 4,000 racks could handle it. A cheap 4 post/42U rack might cost $200 in volume, so call it $800,000 for racks. I don’t know what the boxes would cost but it would be at least several times the racks.
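For anyone who wants to vary the inputs, here is a rough sketch of the drive and rack numbers. Every constant is one of the round figures used in this post, not a vendor quote, and the variable names are only for illustration.

```python
DRIVES = 2_000_000        # working drive count used in this post
TB_PER_DRIVE = 1.5
USD_PER_DRIVE = 100       # assumed average price per drive
DRIVES_PER_RACK = 500     # high density packaging target
USD_PER_RACK = 200        # cheap 4 post/42U rack in volume

raw_tb = DRIVES * TB_PER_DRIVE
tb_per_rack = DRIVES_PER_RACK * TB_PER_DRIVE
racks = -(-DRIVES // DRIVES_PER_RACK)   # ceiling division
drive_cost = DRIVES * USD_PER_DRIVE
rack_cost = racks * USD_PER_RACK

print(f"raw capacity: {raw_tb / 1e6:.1f} EB")
print(f"per rack:     {tb_per_rack:.0f} TB")
print(f"racks:        {racks:,}")
print(f"drives:       ${drive_cost:,}")
print(f"racks:        ${rack_cost:,}")
```

Raw capacity comes out at 3 EB, comfortably above the 1.8 EB low-end target, which is worth remembering before any redundancy scheme takes its cut.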

This could also be a great application for Copan’s enterprise MAID systems if the workload is right. With up to 896 TB per rack they have the density. Price/performance would be the next issue to look at.

Would you like a data center with that?
4,000 racks require about 40,000 square feet of floor space. Racks typically account for about one third of a data center’s floor space, with hallways and other equipment occupying the rest, so a 120,000 sq. ft. (about 11,100 sq. meter) facility is required.

At $500/sq. ft. the data center would cost an additional $60 million. YMMV.
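The floor space estimate is just a chain of multiplications, sketched below. The 10 sq. ft. per rack is implied by the 40,000 sq. ft. figure rather than measured, and the 3x multiplier is the rule of thumb that racks take about a third of the floor; both are assumptions to adjust.

```python
RACKS = 4_000
SQFT_PER_RACK = 10         # implied by 40,000 sq. ft. for 4,000 racks
FACILITY_MULTIPLIER = 3    # racks occupy roughly 1/3 of the floor
USD_PER_SQFT = 500         # assumed construction cost
SQM_PER_SQFT = 0.09290304  # exact, since 1 ft = 0.3048 m

rack_sqft = RACKS * SQFT_PER_RACK
facility_sqft = rack_sqft * FACILITY_MULTIPLIER
facility_sqm = facility_sqft * SQM_PER_SQFT
building_cost = facility_sqft * USD_PER_SQFT

print(f"rack floor space: {rack_sqft:,} sq. ft.")
print(f"facility:         {facility_sqft:,} sq. ft. ({facility_sqm:,.0f} sq. m)")
print(f"building cost:    ${building_cost:,}")
```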

Oh, and redundancy is extra.

Each drive’s operational power consumption is about 10 watts, so 2 million drives alone eat 20 megawatts. Side note: because the disks themselves dominate the power budget, storage vendors find differentiation on power consumption elusive.

Assuming one 250 W server for every 100 drives adds another 5 MW for servers, a low-side estimate. Leaving aside network infrastructure and lighting, the HVAC load for 25 MW is around 12.5 MW, according to some rules of thumb.

Let’s round up and call it 40 megawatts. You’ll want to locate this facility near the Columbia River to get cheap hydropower – maybe next door to Google in The Dalles, Oregon.
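Here is the power budget as a short sketch, using only the assumptions stated above. The 0.5 HVAC factor is simply the ratio implied by 12.5 MW of cooling for 25 MW of load; treat it as a rule of thumb, not a design number.

```python
DRIVES = 2_000_000
WATTS_PER_DRIVE = 10
DRIVES_PER_SERVER = 100
WATTS_PER_SERVER = 250
HVAC_FRACTION = 0.5    # cooling load as a fraction of IT load (rule of thumb)

drive_mw = DRIVES * WATTS_PER_DRIVE / 1e6
servers = DRIVES // DRIVES_PER_SERVER
server_mw = servers * WATTS_PER_SERVER / 1e6
it_mw = drive_mw + server_mw
hvac_mw = it_mw * HVAC_FRACTION
total_mw = it_mw + hvac_mw

print(f"drives:  {drive_mw:.1f} MW")
print(f"servers: {server_mw:.1f} MW ({servers:,} servers)")
print(f"HVAC:    {hvac_mw:.1f} MW")
print(f"total:   {total_mw:.1f} MW before networking and lighting")
```

The total is 37.5 MW, which is where the rounded 40 MW comes from.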

I haven’t deciphered BPA power pricing, but I’d guess 40 MW would run about $2 million a month. A Copan system would cost much less to power.
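To sanity-check that guess, this sketch works backwards from the bill to the electricity rate it implies. It assumes a constant 40 MW draw and an average month of 730 hours; it says nothing about actual BPA tariffs.

```python
LOAD_MW = 40
HOURS_PER_MONTH = 8_760 / 12     # average month, 730 hours
MONTHLY_BILL_USD = 2_000_000     # the guess above, not a quoted tariff

kwh_per_month = LOAD_MW * 1_000 * HOURS_PER_MONTH
implied_usd_per_kwh = MONTHLY_BILL_USD / kwh_per_month

print(f"energy:       {kwh_per_month:,.0f} kWh per month")
print(f"implied rate: ${implied_usd_per_kwh:.4f} per kWh")
```

That works out to a little under 7 cents per kWh, so if the real rate is lower, the monthly bill drops in proportion.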

The StorageMojo take
Well, what do you know? Building an exabyte data center is feasible. All it takes is money – $400 million with all the goodies – and power.

Time to readjust the mental model of storage possibility. Other than the NSA’s acres of disk at Fort Meade though, I’m not aware of any exabyte data centers.

Technically, I’m not aware of the NSA’s facility either.

Google isn’t a good hardware model, since their storage density is only about 120 drives per rack, not 500. Plus they pass over the highest capacity drives in favor of those with the lowest $/GB: drives nearing the end of their product life cycle. Google’s GB/sq. ft. is way lower than my interlocutor could afford.

Their software – BigTable and GFS – looks to be a good fit. But even Google doesn’t run 20,000-node clusters. That is Lustre territory.

Who has the best solution to the exabyte problem? Nominations are now open.

I remember hawking 1.8 – one point eight! – gigabytes in a single 42″ rack. For $50,000! A terabyte was almost beyond imagination. Today my home system has 2.5 TB. Whoa.

Comments welcome, of course. Who else could use an exabyte of data? Disclosure: I’ve done work for HP and Sun. None for the NSA, though.