StorageMojo’s favorite FAST 08 paper

by Robin Harris on Friday, 14 March, 2008

It didn’t win Best Paper honors at FAST 08 – IIRC that went to An Analysis of Latent Sector Errors in Disk Drives (the link is to last month’s StorageMojo review of that excellent paper) – but I really like the thinking behind Pergamum: Replacing Tape with Energy Efficient, Reliable, Disk-Based Archival Storage.

Written by Mark W. Storer, Kevin M. Greenan, Ethan L. Miller (UC Santa Cruz) and Kaladhar Voruganti (NetApp), the paper discusses a prototype that

. . . is a distributed network of intelligent, disk-based, storage appliances that stores data reliably and energy-efficiently. While existing MAID systems keep disks idle to save energy, Pergamum adds NVRAM at each node to store data signatures, metadata, and other small items, allowing deferred writes, metadata requests and inter-disk data verification to be performed while the disk is powered off.

They call the appliances tomes.

Tape: where data goes to die
One of tape’s big advantages is that it uses no power at rest. Any disk-based tape replacement will have to come as close as possible to that same ideal.

Each tome uses a single hard drive and an ARM-based processor board with NIC and NVRAM. Total power use – when powered up – is about 11.5 watts, less than a 15k FC drive. With tighter code, a slower drive and more integration, I’d bet they could cut that in half.

The single disk drive means that tomes must be used in groups to enable distributed RAID techniques and exchange of algebraic signatures to ensure inter-disk recovery. The paper goes into those techniques in detail.
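To see why single-drive tomes have to work in groups, here is a minimal sketch of the simplest form of cross-device redundancy: plain XOR parity. This is an illustration only, not the paper's actual scheme (which also relies on algebraic signatures); the tome contents and the `xor_blocks` helper are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical group: three data tomes plus one parity tome.
data_tomes = [b"archive1", b"archive2", b"archive3"]
parity_tome = xor_blocks(data_tomes)

# Suppose the second tome's drive dies. XOR-ing the survivors with
# the parity block reproduces the lost block.
survivors = [data_tomes[0], data_tomes[2], parity_tome]
rebuilt = xor_blocks(survivors)
```

No single tome can do this on its own, which is why the redundancy has to live across the group.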

NVRAM

The purpose of the NVRAM is to provide low-power, persistent storage; operations such as metadata searches and signature requests do not require the unit’s drive to be spun up.

. . . the NVRAM primarily holds metadata such as algebraic signatures and index information, flash writes are relatively rare; flash writes coincide with disk writes.
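A toy model may make the division of labor clearer. In this sketch, which is an assumption-laden illustration rather than the authors' code, a write touches the disk and the NVRAM together, while a signature request is answered from NVRAM with the disk spun down. The `Tome` class and its methods are hypothetical, and `zlib.crc32` merely stands in for the paper's algebraic signatures.

```python
import zlib

class Tome:
    """Toy model of one appliance: a disk that is usually off, plus NVRAM."""

    def __init__(self):
        self.spinning = False
        self.spin_ups = 0
        self.disk = {}    # block id -> data, reachable only when spinning
        self.nvram = {}   # block id -> signature, readable at any time

    def _spin_up(self):
        if not self.spinning:
            self.spinning = True
            self.spin_ups += 1

    def spin_down(self):
        self.spinning = False

    def write(self, block_id, data):
        # The NVRAM update coincides with the disk write.
        self._spin_up()
        self.disk[block_id] = data
        self.nvram[block_id] = zlib.crc32(data)  # stand-in signature

    def signature(self, block_id):
        # Served from NVRAM; the drive stays powered off.
        return self.nvram.get(block_id)

    def read(self, block_id):
        # Reading the data itself does require the drive.
        self._spin_up()
        return self.disk[block_id]

tome = Tome()
tome.write("doc-1", b"archived bytes")
tome.spin_down()
sig = tome.signature("doc-1")  # no spin-up needed
```

After the signature request the spin-up counter is still at one, which is the whole point: verification traffic between tomes does not cost a drive spin-up.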

The Ethernet interconnect is important – by using cheap unmanaged switches for fan out, high aggregate bandwidth, exceeding that of current tape libraries, is easily and inexpensively achieved. The use of power-over-Ethernet would further reduce costs, especially if the system used 4200 RPM drives.

The StorageMojo take
Most of the disk vs tape discussions look at the disk device vs tape cartridge cost issue – and they aren’t that different even today. But the tape library market is a $4-5 billion market. A disk-based alternative to slow tape libraries could take a big chunk of that.

Further, this design could be integrated into a single disk controller board, creating a disk with a single Ethernet port and incredible packaging and manufacturing economies.

If Seagate were smart they’d jump on this. This is a major opportunity to drive another significant consumer of disk drive units – without encroaching on existing OEM customer businesses. That doesn’t happen very often.

Comments welcome, as always. Pergamum was an ancient Greek city known for its sizable library, second only to the library of Alexandria.

{ 10 comments… read them below or add one }

Pete Steege Friday, 14 March, 2008 at 7:50 am

I agree – huge opportunity for disk to be what tape has been, at least for a good chunk of that tape market. Interesting how disk is finally on a path to take from the tape space just as flash does the same to the performance disk space. Clearly a changing of the guard is under way for storage tiers.

Both shifts will take years to materialize in a macro way, of course.

Steve Shockley Friday, 14 March, 2008 at 12:07 pm

What about offsite storage? This might be an alternative for disk -> disk -> tape for the second disk, but unless these devices are easy to move around (and as cheap as a tape) I don’t think they’ll take over.

PTZ Friday, 14 March, 2008 at 8:07 pm

Here is an idea: what if such a disk with a single Ethernet port would run ATA-over-Ethernet? http://en.wikipedia.org/wiki/ATA_over_Ethernet

I think I’ve got an idea for my future startup ;-)

David Magda Friday, 14 March, 2008 at 8:16 pm

In Section 5.3.1 of the paper they say that the absolute maximum transfer rate they can get is 10 MB/s. This actually drops further once they add parity and such: they state 3-5 MB/s in the conclusion.

LTO-4 currently gets you 800 GB at 120 MB/s, uncompressed. If you need to string together several of the Pergamum systems to match that, then power may come out a wash (Quantum’s LTO-4HH SAS drive uses 30.1 W max, 28.8 W typical, according to their spec sheet).

So, to get 120 MB/s, they need to use at least twelve units, which leads to 12 * 11.5 W = 138 W for 10 MB/s; 276 W for 24 units at 5 MB/s to reach 120 MB/s. Though one advantage of this system is that you can stream at any speed and you don’t have to worry about shoe-shining.

Sun’s T10000 gets you 500 GB at 180 MB/s, uncompressed, running at a max of 90 W according to the spec sheet. 36 * 11.5 W = 414 W to get 180 MB/s, if each Pergamum unit gives you 5 MB/s. The tape silos also take power, so for a fair comparison you should add that as well.
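The arithmetic in this comparison is easy to reproduce. The sketch below uses only the figures quoted in the post and this comment (11.5 W per tome, 5 to 10 MB/s per tome) and assumes throughput adds up linearly across tomes; the function names are made up for illustration.

```python
import math

TOME_WATTS = 11.5  # per-tome power when spun up, as quoted in the post

def tomes_needed(target_mb_s, per_tome_mb_s):
    """Number of tomes needed to match a tape drive's streaming rate."""
    return math.ceil(target_mb_s / per_tome_mb_s)

def total_watts(target_mb_s, per_tome_mb_s):
    """Power draw of that many tomes, all spun up at once."""
    return tomes_needed(target_mb_s, per_tome_mb_s) * TOME_WATTS

lto4_best_case = total_watts(120, 10)  # 12 tomes
lto4_with_parity = total_watts(120, 5) # 24 tomes
t10000_match = total_watts(180, 5)     # 36 tomes

print(lto4_best_case, lto4_with_parity, t10000_match)
```

This counts only the tomes and only the tape drives; as noted, the power for the tape silo and its host would have to be added on the tape side.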

You’ll also have to rewrite software to use Pergamum’s software interface (Section 3.3).

I still have to go through the paper more thoroughly, but this is an interesting research product that will hopefully give another option in protecting our data.

Joe Saturday, 15 March, 2008 at 5:38 am

So Robin, do you know about Coraid? We recommend them for backup purposes. Disk drives available over ethernet, drivers for Linux, Windows, OS/X, and Solaris. (disclosure: we do resell them, precisely for the purposes you indicated).

Coraids make for a great, inexpensive backup system. You can pop media out and put new media in (media == disk), and treat the media like tape. You can ship arrays out as whole backup units if you wish.

They are excellent backup devices and work quite well with our high performance storage units. They aren’t high speed in terms of storage, but 100 MB/s sustained (uncompressed) to disk isn’t bad for backup. That’s 1 GB per 10 seconds, or 1 TB in 10,000 seconds. Since it is ATA over Ethernet, you can have multiple parallel streams. The Coraid spec lets you get up to 40k drives per network segment, so, in theory, with a well-designed network and backup system (as well as a cluster of fast machines to save stuff off of), you may be able to achieve very high (sustainable) backup rates. 10 JackRabbits pumping data to 10 Coraids could conceivably back up 1 TB in 1000 seconds (under 1/3 of an hour). 100 such units of each may be able to achieve 1 TB in 100 seconds.
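Those numbers follow from a simple division, sketched here under the "in theory" assumption that parallel streams scale perfectly and that 1 TB means 1,000,000 MB. The function name and constants are illustrative only.

```python
TB_IN_MB = 1_000_000  # decimal terabyte, an assumption for round numbers

def seconds_to_back_up(total_mb, streams, mb_per_s_per_stream=100):
    """Idealized backup time with perfectly parallel streams."""
    return total_mb / (streams * mb_per_s_per_stream)

one_unit = seconds_to_back_up(TB_IN_MB, 1)       # single 100 MB/s stream
ten_units = seconds_to_back_up(TB_IN_MB, 10)     # 10 parallel streams
hundred_units = seconds_to_back_up(TB_IN_MB, 100)

print(one_unit, ten_units, hundred_units)
```

Real networks will fall short of this ideal, so treat the results as an upper bound on speed.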

The pain will shift from depositing the data to managing the data streams and quality in this model, as well as media management, etc. This isn’t the tape rotation schedule issue, but rather how do you guarantee that you can always read data from a possibly unreliable spinning medium.

Robin Harris Saturday, 15 March, 2008 at 9:58 am

Pete, you’re right – this is just a beginning. I keep being surprised by the disk guys’ apparent inability to imagine new markets. I hope this paper stirs them up. I’d be happy to help Seagate (or WD or STK or Hitachi or . . .) scope this out.

Steve, I see this as a replacement for the tape silo market, not tape itself. Do people routinely empty their tape silos and move all the tapes offsite? Also, if you look at the shock and vibe specs for tape cartridges they aren’t much better than spun-down disks. Hmmm. . .

David, thanks for the additional analysis – good stuff! With some serious engineering (tight code, slow disk, HW parity assist and such), the power and performance can be brought into line with tape – while retaining the massive advantage of rapid random access. Nobody wants to do legal discovery from a tape silo.

PTZ, Joe, I’ve written about Coraid in the past and would be happy to write about them again. I hear very little about them, so they aren’t top of mind for me.

Robin

Ethan Monday, 17 March, 2008 at 12:46 pm

In response to David Magda, I would agree that current tape systems can stream at 180MB/s for 90W, but that’s just the tape drive. Want a CPU to run your drive? Add at least 150W. Want a tape robot? That’s more energy. Want redundancy? Tape can do it with mirroring, but now you’ve doubled your power consumption (and hardware and media cost) with no commensurate increase in performance. The 5 MB/s Pergamum number is for fully redundant storage with MTTDL from media failure in the millions of years; the tape numbers are for raw storage with MTTDL from media failure significantly shorter.

BTW, I hope you’re planning to buy a lot of tape drives: the LTO-4 quotes an MCBF of 100,000 load/thread/unload cycles. At 240 per drive per day (just 10 per hour!), you’ll replace that tape drive after a bit more than a year; at over $3000 per drive, that’s a big chunk of change. Not an issue for streaming backup, but definitely an issue for an archival storage system.
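The replacement interval is a one-line calculation. This sketch uses only the figures given above (100,000 cycles MCBF, 240 cycles per day) and assumes the drive is replaced exactly when it reaches its rated cycle count; the names are illustrative.

```python
MCBF_CYCLES = 100_000  # quoted LTO-4 load/thread/unload rating

def drive_life_days(cycles_per_day):
    """Days until a drive reaches its rated cycle count."""
    return MCBF_CYCLES / cycles_per_day

days = drive_life_days(240)  # 10 cycles per hour, around the clock
years = days / 365

print(round(days, 1), round(years, 2))
```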

If you want to stream out backups (that you plan to never use), a system like Pergamum may not be the right choice. If you want an archival storage system (not a backup system), we believe that energy-efficient disk-based archives are the right way to go.

Storage Alexandria Saturday, 29 March, 2008 at 1:58 pm

Found your site on technorati, under “data storage 101”.

Cool post. I agree tape is a waste of…tape.

Mary Ellen Monday, 21 April, 2008 at 1:18 pm

You need to get this info out to the museums of the world. Here in Canada, a big movement was made around 2002 to digitize archival collections onto CDs then DVDs.
I agree with your “lost culture” — the last 50 years will become wiped from memories of future generations.
thanks for your efforts,
Mary Ellen

Peter Li Wednesday, 23 April, 2008 at 7:29 am

I don’t get how you can go from $0.50/GB (in the conclusion of the paper) to $4700 for 10 PB (petabytes)... Can someone help me understand how to achieve that level of cost reduction?

Thanks.
