A 1 petabyte science project

by Robin Harris on Tuesday, 8 December, 2009

But not that kind of science project. This is the real deal, already running near a petabyte, needing to upgrade and looking for answers. Sounds like they’ll be spending real money real soon.

I edited for brevity and asked the writer to monitor the comments to help answer any questions.

The neuroscience institute:

I work for a large neuroscience institute. We’re big data generators, now using Sun’s SAM-FS for data migration between FC-AL tier1, SATA tier2 and large LTO4 tape silo stores for tier3. We use LTO4 because the cost/benefit in STK’s T10K B drives just didn’t add up!

We’ve run this for the last couple of years, nearing a PB between disk and tape.

It’s been a bumpy road, as HSM can be, if implemented with “end user touching” in mind. It’s taken a couple of years, a lot of development between us and the SAM-Q engineers, and many sleepless nights to make it work near seamlessly for end users.

Our problem:

  1. Our meta data slices (for various reasons) on 15k RPM FC-AL disk in the STK 6140 arrays are barely able to keep up with their workload.
  2. We need to expand our disk infrastructure for both front end high performance disk and backing space commodity disk archive (SATA and lots of it!).
  3. Tape is all good. Cool, calm and collected ;).

HDS tells me their AMS2500 is great: beautiful SAS backplane; amazing cache partitioning; and brilliant scalability. I worry that 3Gbit/sec SAS spindles will not have the same kind of concurrency that my traditional FC-AL 6140 chassis does (not that it’s helping much!).

Sun tells me their monster HPC array STK 6780 will blitz anything. And they’re willing to qualify SSD’s inside the FC-AL connected array shelf to fix metadata I/O constraints.

Further, Sun is suggesting their F5100 (SAS connected only, if I recall correctly), will be the answer to my meta data latency and IOPS woes. But it’s only SAS “direct connected” and doesn’t support the FC-AL loop for fail-over between hosts.

My take, so far:

  1. Flash and SSD are being pushed hard. It *seems* sensible to win a small IOPS war (I need thousands of IOPS, not hundreds), and the latency war of mechanical disk – but I worry about Sun’s roadmaps, future and overall strategy. I’m currently in talks with Fusion-io.
  2. With regard to the OpenStorage promise of “changing storage economics”, I am unconvinced the price is as sharp as it could be, considering I could white box it myself, with some large supermicro JBOD arrays + OpenSolaris!
  3. Between HDS and Sun, it’s a tough choice. I’ve not seen HDS play in this space before, and am unfamiliar with their market strategy, their gearing towards more HPC/big data mover scenarios and “science” in general. Are they really geared for these kinds of workloads? I’ve only ever seen them employed in the generic enterprise “Let’s run MS Exchange and Oracle” space.
  4. With Sun, I like the promise of the big 6780 chassis – everyone says it is the “monster” array that will handle anything, but with the premium price of FC-AL compared to SAS….

I’d love your thoughts!

The StorageMojo take
Seems like a job for scale-out cluster storage man – like IBRIX, Parascale or Scaleout. It may also be a fit for some of the new tiering-in-a-box vendors like Avere Systems or Storspeed.

But if we stick with the big boys – Sun and HDS – how should they sort this out? Is there anyone else they should look at? Update: Readers suggested Gluster, a very cool scale-out file system, and Isilon, the easy-to-manage cluster storage system. Oh, and Bycast, which is quite big – through OEM deals with IBM and HP – in the medical imaging space. End update.

Vendors are encouraged to respond. Please do us the favor of identifying yourself as such.

Courteous comments welcome, of course. I did work for IBRIX and Parascale at one time.

45 comments

Karl Katzke December 8, 2009 at 5:47 pm

Just napkin-thinking here, but… I think we need more metrics.

How big is each tier? What’s the average life of a chunk of data before it falls down a tier, and how many times is it moved back and forth between tiers on average? How often is archival data accessed all the way from tape? How peaky is the load on the system? Where are the bottlenecks?

How close to the producer and consumer is the data? Is it written mostly sequentially or is it written in very random chunks?

My thinking: The distributed file systems work well when you’re dealing with large masses of data that’s accessed at the same time. Three-tier file archival systems work well when you’re dealing with a small amount of data that’s being read or written that’s the same, and then masses of data that’s rarely accessed. It doesn’t sound like the current approach is a bad one, it sounds like the tuning parameters are off and the different tiers are mis-sized or mis-configured for their roles.

Visiotech December 8, 2009 at 7:47 pm

As a rule of thumb, any 15K RPM SAS or FC disk will average about 170 to 200 IOPS “sustained”, no matter its size or brand, or whether it sits in an HDS, EMC, HP, IBM or Sun array. Work out the average IOPS per disk from the Storage Performance Council benchmark reports; see http://www.storageperformance.org for more.

You will see that most of them are pretty equal, plus or minus 10 IOPS per disk. DO NOT FALL FOR IOPS per subsystem; look at IOPS per disk installed during the tests. BIG difference here. Some might show millions of IOPS but also have 1000 disks during the test. LSI/Sun/IBM are typically the best in that category. Bottom line: the Sun 6780 is better than the HDS AMS.
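A minimal sketch of the per-disk normalisation described above. The two benchmark entries are made-up placeholders, not published results; only the division itself comes from the comment.

```python
# Normalise a benchmark result by the number of disks installed for the test.
# The figures below are hypothetical placeholders, not real benchmark data.

def iops_per_disk(total_iops: float, disk_count: int) -> float:
    """Return the average IOPS contributed by each installed disk."""
    if disk_count <= 0:
        raise ValueError("disk_count must be positive")
    return total_iops / disk_count

# (reported subsystem IOPS, disks installed during the test)
results = {
    "array_a": (190_000, 1000),
    "array_b": (45_000, 250),
}

for name, (total, disks) in results.items():
    print(f"{name}: {iops_per_disk(total, disks):.0f} IOPS per disk")
```

A headline figure four times larger can still come out nearly level once it is divided by the spindle count.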

Most HSMs are not set up properly. It takes years of expertise to get them right. If they are not planned and set up carefully, this is what happens. Most admins lose their minds quickly. Mainframe guys who have been using these technologies since the early ’80s can tell you more about the gotchas… like me.

If you have lots of small files, use SAMFS round robin between a series of FC RAID groups to spread the load. Do not stripe or concatenate storage, so that you avoid hitting every disk group at the same time for a few I/Os. Round robin will distribute the load among all RAID groups by itself. That is a golden secret with SAMFS. Round robin does an excellent job at this type of workload. Tuning your file systems is another part. Make sure you have someone who deeply understands SAMFS tuning parameters. Very few Sun people and admins understand them. Ask Sun to help you on this if required. Not simple. I used a free Sun analysis tool called SWAT, by Henk, to get it right. http://blogs.sun.com/henk/entry/storage_performance_and_workload_analysis
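The difference between the two layouts can be sketched with a toy model. This is not SAMFS configuration or its allocator; the group and file names are hypothetical, and the only point is how many RAID groups a single small-file read touches under each layout.

```python
from itertools import cycle

def round_robin_place(files, groups):
    """Assign each file, whole, to the next RAID group in turn."""
    next_group = cycle(groups)
    return {f: [next(next_group)] for f in files}

def striped_place(files, groups):
    """Spread every file across all RAID groups."""
    return {f: list(groups) for f in files}

groups = ["rg0", "rg1", "rg2", "rg3"]          # hypothetical RAID groups
files = [f"small_{i}" for i in range(8)]       # hypothetical small files

rr = round_robin_place(files, groups)
st = striped_place(files, groups)

# Number of RAID groups a read of one small file has to touch
print("round robin:", len(rr["small_0"]))
print("striped:", len(st["small_0"]))
```

In this model a round-robin read keeps one group busy and leaves the others free for other requests, while a striped read occupies every group for a single small I/O.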

Typically you need to have 80% or more of your daily workload happening on tier 1. Make sure you do NOT restore files from tier 2 to tier 1 for workload unless this is what you have planned. Avoid the shoeshine effect of data bouncing between tiers. Tier 1 should offload to tier 2, and tier 2 should offload to tier 3.

I like SSD when it is required. SAMFS can certainly benefit using them if IOPS is the problem.

z December 8, 2009 at 9:27 pm

Hi Karl.

Agreed. Metrics are probably important here to get any real feel for the hauling taking place.

4Gbit/sec fabric. On many of the ports, we can easily slam 380MB/sec for sustained hours of transfer.

Current tier distribution sits like this:

Tier1 [FC-AL disk] @ ~10TB
Tier2 [Enterprise SATA, FC connected chassis] @ 60TB
Tier3 [StorageTek SL libraries, multiple arms etc] @ 100’s of TB.

The average life of a chunk of data before it falls down the tiers really is very hard to give firm and non-transient stats on. We leave disk VSN (SATA Tier2) on slow disk (and not exclusively tape) for 200 days before what is known as ‘unarchive’. In any given day, we can see anywhere from only a couple of hundred MB being pushed from lower tiers to live filesystems, up to 2 or 3TB. It is that sporadic.

The *reason* for such sporadic behaviour is the odd and unquantifiable way researchers operate with the kinds of technology they use, I guess!

In terms of how often tape data is accessed, simple logs can tell us that in a day, we’ve seen 150 tape stages ranging from 3MB files, all the way up to 35GB singular TIFF images. Extrapolation out to a month doesn’t hold or show any patterns.

System load on meta data controllers is indeed peaky at times, with UNIX load averages of 9; 10; 11 on quad socket 16 thread boxes being not unusual when big-data pushing is taking place.

Bottlenecks currently live at the MDC meta-io end, in the physical (mechanical) latency of spindles, in getting access to meta data – i.e, if a client wants something and the spindles are sufficiently busy, the client will have some appreciable “wait” and, indirectly iowait as a result.

Hence the strong thoughts of SSD, to clear the IOPS air.

Additionally, one could conjecture that the (comparatively) old and little 6140 LSI based [Engenio?] controllers are less than capable of putting up with such grief, with a lot of mixed I/O taking place.

When you have 10 drives of LTO4 all blazing, each capable of line rate 120MB/sec native transfer, with a small 6140 controller and some FC-AL disk, one could also assume that it simply doesn’t add up, in terms of the I/O that can be attained from said spindles, out to xyz number of tape drives.
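A back-of-the-envelope version of that mismatch, using only figures quoted in this thread (10 drives, 120MB/sec native per drive, and the roughly 380MB/sec sustained per 4Gbit port mentioned earlier). What the 6140 controllers can actually sustain is not stated here, so the sketch only sizes the demand.

```python
import math

LTO4_NATIVE_MB_S = 120         # per-drive native rate quoted above
TAPE_DRIVES = 10               # number of drives quoted above
FC_PORT_SUSTAINED_MB_S = 380   # sustained per-port rate quoted earlier in the thread

# Aggregate rate the disk side must sustain to keep every drive streaming
aggregate_mb_s = LTO4_NATIVE_MB_S * TAPE_DRIVES

# How many ports' worth of sustained bandwidth that represents
ports_needed = math.ceil(aggregate_mb_s / FC_PORT_SUSTAINED_MB_S)

print(f"aggregate tape demand: {aggregate_mb_s} MB/sec")
print(f"4Gbit ports needed at {FC_PORT_SUSTAINED_MB_S} MB/sec each: {ports_needed}")
```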

Finally, the fact that the meta IO spindles are lumped into the same controller and same chassis as userland LUN’s really hurts! A design decision made years ago, not by us, that with maturity and understanding, would never have been done in hindsight! Hence, the idea to separate meta entirely, and have it living on a separate, high speed SSD or flash based solution!




Ron December 8, 2009 at 9:33 pm

I am not sure how you missed, in all your discussions, one of the most dominant vendors in the HPC space: DataDirect Networks (DDN).

For multi-petabyte, high density, high reliability this is the best of breed solution for HPC & life sciences.

Check out the SFA10K option, answering both intensive Throughput & IOPS demand at the same time : http://ddn.com/index.php?id=227

Over 10GB/s of throughput… and an impressive amount of IOps to disk & cache.

In only 20U you get 300 Slots for mixture of SAS/SATA/SSD..
If you’re in for the big boys, in a single rack you get 600 Slots.

Combine that with an HPC parallel file system such as Lustre or GPFS, to break the barriers of traditional CIFS/NFS, you get a system that will answer your needs for the longer run.

Shehjar Tikoo December 9, 2009 at 12:03 am

In the scale-out cluster storage man space, IBRIX, parascale and scaleout are not your only options. You should consider GlusterFS too. It brings all the advantages of a scale-out cluster FS that one can run on commodity hardware and the advantages of being open-source and a strong community following, even in the HPC arena.

TS December 9, 2009 at 2:23 am

“With regard to the OpenStorage promise of “changing storage economics”, I am unconvinced the price is as sharp as it could be, considering I could white box it myself, with some large supermicro JBOD arrays + OpenSolaris!”

Bingo. I couldn’t have said it better. Hadoop HDFS, GlusterFS, MogileFS. Many alternatives exist. Next year pNFS would be production ready?

For the metadata server, I highly recommend 2U Supermicro 24 bay 2.5 inch JBODs or Dell MD1120s filled with Intel SSDs driven by a whole bunch of SAS adapters in L2ARC mode. Beats Fiber channel HDDs any day.

If only the stupid L2ARC were persistent.
Nearly 2 years after the bug report, I still haven’t seen it in OpenSolaris. I guess the Sun guys see the L2ARC as an extension of ram, rather than a faster persistence layer than HDDs.

z December 9, 2009 at 2:49 am


@ Visiotech: Your comment on the 6780 being the hero interests me a lot. It was always suggested to me that the AMS2500 was the king of the hill for modular arrays. Can you give me some reasons for the 6780 being the top dog? Is it based purely off SPC-1/2 results, or is there something specific in the silicon that really makes the big shiny silver and metal-blue array a clear winner here? Controller tech? Cache algorithms? Host port/backplane config?

Agreed, HSM can be a harsh mistress at the best of times, and the downfall of an engineer/admin at worst. I think it took us many months, if not years, to really understand the beast in depth. As far as tweaks go, we’ve got some round robin in place to minimal effect. Could do with more groups. We also use some dark stage-ahead, associative stage flags, cache stubbing of whole/partial file, stage-direct-to-memory, fine grained lease locking across QFS, and a whole host of other stuff, such as fifo path buffer adjustment, tape buffer maxphysio etc. It’s just not enough, unfortunately, to keep up on the meta data front.

SSD really is hitting a strong chord with me currently, but keeping it ‘inter-loop’ on the arbitrated loop seems difficult, with so many companies going for the direct connected (SAS) strategy. Does it spell signs that we should all be moving away from an FC connected AL space, and to a SAS, direct host connected architecture? The AMS2500 is due to get inter-array SSD qualification shortly, so I am told, if not already. The 6780 is apparently about to get the same treatment.

@ Ron:

Epic niche unit there. I have to wonder how well it would integrate for our purposes – and how it goes against things such as BluArc/HNAS. Can I lay any filesystem I want on top of it, and effectively make it behave as I want, or are we locked into a “buy per service, CIFS, NFS, pNFS, iSCSI” style?

@ Shehjar:

Indeed, another tangible option. I guess I’m finding it hard to let go of a “SAN” in the traditional sense, and I need to learn to think of it as lots of little fast/well connected bricks of storage, doing their own thing with massive aggregated I/O. How well developed and mature do you feel things are at this point? I have a lot of room for experimentation (hey, it’s one giant science experiment, this place!), but I need to be mindful of the fact that I can’t afford for it to go wrong ;).

Thanks all 🙂


Andrey Kuzmin December 9, 2009 at 3:51 am

> Seems like a job for scale-out cluster storage man – like IBRIX, Parascale or Scaleout.
Surprising not to see Isilon mentioned; they report a high penetration rate in the bioinformatics space.

Visiotech December 9, 2009 at 6:51 am

If you archive to tape from your primary disks, you will affect the I/O rates of your tier 1. If your SAMFS is set up properly, it will archive from tier 2, which is the best candidate for the job. Ten LTO-4 drives is kind of big for the amount of tier 2 storage you have. Minimize your tape activity and plan your staging if possible. This way you avoid tape activity during your peak workload. That is a way to limit IOPS on your storage.

Yes, 10x LTO-4 will bring your 6140 to its knees. I have brought high-end HDS arrays to their knees too during massive backups to LTO and T10K. Like I said, it is not related to vendors but to how many physical disks are involved in your architecture. Tape is extremely demanding on sustained disk I/O.

Brent Rohloff December 9, 2009 at 7:40 am

You should look at more vendors than just Sun, even though the Dartmouth College fMRI Data Center was a Sun Center of Excellence.

Gary Orenstein December 9, 2009 at 10:34 am

Seems like the cart before the horse without more discussion on the application and workload:
-how much data is active/inactive
-how much new data is generated every day, week or month
-how are updates made, from how many servers
-what are the internal SLAs to the application users

That said this is certainly a scale out issue. Here’s my take:

-Since total capacity is a primary concern, I’d start there
-I’d stick with commodity hardware and smart software, also commodity networking like IP and Ethernet
-I’d focus on high-capacity SATA (1-2TB) drives for as much of the data as possible
-Ensure that the scale-out file system can reach hundreds to thousands of nodes
-Capacity requirements will drive the need for many nodes
-If you have a truly distributed system, you’ll be able to make use of all the CPU, memory, network bandwidth and disk capacity available on all of those nodes
-Don’t get too carried away with flash yet. Flash won’t solve your capacity issue and I’d solve that first
-Once you have a distributed system in place, smart software features like retrieving small files in a single disk I/O might alleviate the urgent need for expensive memory by delivering peak performance from high-capacity SATA drives
-Other software features like client side routing and distributed metadata will offer more economical scale and greater performance compared to alternatives [note we just did a Cloud Infrastructure Chalk Talk on this topic here: http://bit.ly/8P3BVd]
-With hardware pricing improving every year, ensure that the system allows seamless migration to new hardware platforms while it is up and running. Planning these days involves application life far outlasting individual hardware component life.

These are just a few ideas that come to mind. Perhaps you could get your three tiers down to just one tier that uses a very low cost server node paired with innovative software for a combination of high-performance and low-cost capacity. I know this might be a stretch, but perhaps interesting to consider. Drop by http://www.MaxiScale.com for more info. Also note our news announcement today with Supermicro where you can get a four node, 16TB configuration for under $10,000, and a 32TB configuration for under $11,500 http://bit.ly/51EFuK

GreggT December 9, 2009 at 11:32 am

I was wondering about Isilon as well. It’s probably the most popular solution in the next-gen sequencing field, and will only get more so as Illumina announces they are using Isilon as part of their turnkey informatics platform to go along with their sequencers.

Shehjar Tikoo December 9, 2009 at 1:40 pm

@z: I see. I suppose the SAN-affinity is due in large part to the need for an HSM which, AFAIK, is being done better at the block level as of now. However, I do see GlusterFS as a potential candidate, especially when I read in your earlier post, “Bottlenecks currently live at the MDC meta-io end..”. GlusterFS has a distributed hash table based namespace aggregation, which in other words implies that the meta-data disk accesses are not hitting a single disk, or even a subset of disks, but are spread uniformly over the array. Plus, there is no need to move away from your current concept of lots of little fast/well connected bricks of storage, doing their own thing with massive aggregated I/O; it’s just that, instead of block-level I/O for SANs, GlusterFS enables the same for files and directories.
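The general idea of hash-based placement can be sketched in a few lines. This is a generic illustration, not GlusterFS’s actual algorithm; the brick names, file paths and the choice of MD5 are all assumptions made for the example.

```python
import hashlib
from collections import Counter

def brick_for(path: str, bricks: list[str]) -> str:
    """Pick a brick from a hash of the path, so no central lookup table is needed."""
    digest = hashlib.md5(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(bricks)
    return bricks[index]

bricks = [f"brick{i}" for i in range(4)]                       # hypothetical bricks
paths = [f"/data/scan_{i:05d}.tif" for i in range(10_000)]     # hypothetical files

# Count how many files land on each brick
load = Counter(brick_for(p, bricks) for p in paths)
for brick in bricks:
    print(brick, load[brick])
```

Any client can compute the location of a file from its name alone, which is why lookups do not converge on one set of metadata spindles. A plain modulo like this reshuffles most files when a brick is added, which real systems take care to avoid.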

Things in the Gluster world are fairly advanced in terms of stability and manageability, considering that we’ve had production deployments for a couple of years now. Plus, we did a major release just yesterday (yes, I work for Gluster).

@TS: I am not sure if that one-year-production-ready timeline being touted in Linux kernel lists is actually going to result in adoption in production environments. Although I am highly impressed by the way NFS insiders have constantly accelerated the pace of development, adoption and stabilization, the complexity of pNFS and NFSv4 just doesn’t seem to lend itself to such an early production adoption.

Adam December 9, 2009 at 4:48 pm

Totally agree about Isilon!

Paul Rutherford December 9, 2009 at 11:21 pm


It’s simple. Isilon has many 1PB file systems installations. We will deliver a petabyte for this science project and have it up and running on their network in a couple hours. Better yet, since it will take some time to move 1 PB of data off the HSM system, why not buy a few hundred TB to start and add additional storage as needed with no downtime. In less than 2 racks you can have a 1PB system up and running and accessible using standard network protocols. It will have 684 SATA drives with meta data spread across all of them. It will have 304GB of cache and 152 cores. This system will scream!

Follow that up with the reduction in time spent managing the system which is not even a full time job. Maybe a few hours a week. Compare that to partnering with a vendor to write code and running HSM and what you have is a life changing experience for the IT department. Not to mention we can guarantee at least 80 percent utilization.

Don’t believe me? I will hook them up with as many users as they want. Let’s start with the Broad Institute, the J. Craig Venter Institute, Complete Genomics, the Center for Inherited Disease Research at Johns Hopkins, the Oklahoma Medical Research Foundation and Cold Spring Harbor Labs as just a few of the bioinformatics and life sciences organizations using our products. Or we can hook you up with Illumina to look at an integrated solution with their Next Gen Sequencer and our storage distributed by Dell.

Oh, and if what I propose is more power than you need, we can use a lower performance node. Not fast enough? We can configure with more SATA or all SAS. The ROI will be compelling.

Want to look at SSD technology? Imagine if the system described above had a scalable SSD component as well. Applications that require 100s of millions of files in a single file server with requirements for high performance access to every file will perform like they are on a Tier 1 SAN. The only difference is you will not have all the complexities of building and managing the SAN and associated software. Try building that yourself.

There is no way they should consider HSM or multi-tier at this time. Replace the HSM solution ASAP. This project is clearly a production operation so I would steer clear of the open source approach unless you have a team to support it. Why have people spending time/energy with an open source file system when you could have a PB up and running this weekend, managed with much less than one person?

This is clearly not a science project unless you want to make it one.


Paul Rutherford
CTO, Isilon

z December 10, 2009 at 3:48 am


The Isilon comments are interesting and relevant, I believe, on the basis that we’re going to be employing some large Illumina hardware shortly for DNA sequence analysis/synthesis. They say ~4.6TB per full run!

@ Gary:

You mentioned commodity components in ethernet/IP. I was heavily considering 10GigE directly into the back of the main file-service head boxes (that share out CIFS/NFS), as we’ve found, traditionally, that our 1GbE copper wet-bits-of-string really aren’t up to the task any more. Does 10GigE bode well to you, in terms of ‘commodity’? 10Gig optics still aren’t cheap…

Agreed, flash won’t solve a capacity issue, nor do I see it doing so cheaply any time within the next 2 to 4 years. That said, it seems to solve the issue of meta data latency.

I really like your idea of getting three tiers down to one, but all that being said – I’m very much a guy living in tape land for compliance, density, and because I can’t afford the power + cooling for aisles and aisles of disk arrays…

The idea of one front tier is really interesting to me, and as TS mentioned above – SSD + SATA paired in L2ARC could kick down FC modules any day.

That said, I’m interested in that statement. How do folks perceive things like HSP’s (Hybrid Storage Pools) in acting as a direct replacement for traditionally costly, fast, powerful FC disk arrays?

We’ve all heard the marketing, and seen the awesome blog posts by the FISHworks crew – but what are people finding out there, in the real world?


TS December 10, 2009 at 1:02 pm

@Paul Rutherford:

Quick question: how much does the Isilon software license cost, if you don’t mind me asking?

The thing about recent developments is that if you go with high end software packages, the software is a sunk cost. There is no amortization period. It is sunk, and it is gone. Licenses can’t be resold, unlike hardware.

Even if your software is much faster than the available open source alternatives, people can end up buying twice as much hardware to make up the performance difference (assuming the software license is 50% of total acquisition cost), and hardware can be resold or repurposed, and enjoys a flat 3 year amortization period.
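The arithmetic behind that claim, as a minimal sketch. The 50% license share is the assumption stated in the comment; the function name is made up for illustration.

```python
def hardware_multiple(license_fraction: float) -> float:
    """How many times more hardware the same budget buys if the
    license share of the spend goes to hardware instead."""
    if not 0 <= license_fraction < 1:
        raise ValueError("license_fraction must be in [0, 1)")
    return 1 / (1 - license_fraction)

# With the license at 50% of total acquisition cost, the budget buys 2x the hardware
print(hardware_multiple(0.5))
```

Whether twice the hardware really closes the performance gap depends on the workload; the sketch only covers the budget side of the argument.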

Gary Orenstein December 10, 2009 at 4:01 pm


In my opinion stick with the Ethernet wave, 1Gb now, 10Gb as soon as it becomes cost effective. Those prices drop quarterly, so it should not be long before they fall in line. For now, I don’t personally believe you’ll need 10GbE at each end node, but rather as a switch to switch interlink.

I’d ask what is driving your need for SSD. If it is just IOPS, I’d invite you to look more closely at software enhancements such as MaxiScale’s ability to deliver small files in a single disk I/O operation. In a fully distributed system, this can give you TONS of IOPS. Combined with our distributed metadata approach, I’m not sure you would need anything else.

Yes, you probably need tape for now, but I always find it helpful to push the thinking and then come back to reality.

My personal opinion is that no one really wants tiered storage, but rather they fall into it via economic necessity. Hybrid storage pools sounds like a loaded term to me. Our approach at MaxiScale is that we believe there is a ton of juice to be squeezed from SATA drives through smart software, and effective implementation of a distributed system as opposed to racing to solve the issue with expensive hardware solutions.

Hope this makes sense…happy to chat in more detail anytime.


[ed. note: Gary is the VP of Marketing at Maxiscale.]

Ricardo Garza December 11, 2009 at 7:44 am

Have you looked at a RamSan from Texas Memory Systems?


Mike Maxey December 11, 2009 at 5:24 pm

You pose an interesting problem that we hear quite often. Where do I go now that I have a tiered architecture that has hit its management and performance limits? I can’t keep moving data and guessing about user access. I can’t afford to stick everything in a Tier-1 NAS or clustered file system. I want commodity but how do I get performance?

The ParaScale answer is to implement ILM in place.

Today you tier in an attempt to balance cost and performance and end up spending all your time fine tuning HSM policies and wrangling with vendors on issues. Follow the lead that Robin suggested and look to a solution like Avere combined with a distributed scale-out storage system. The combination enables massive flexibility in what is accelerated and enables configuration changes without the need for HSM, stubs, links or data migrations.

With this solution you can leverage the benefits of scale out and commodity via standard protocols and accelerate where and when necessary. Add capacity OR performance as necessary – don’t overpay for one when you need the other. You can also stop moving data, and instead move the processing – with the potential of integrating applications directly on the storage.

Big vendors have made a killing selling the fastest data movement solutions that enable you to move massive amounts of data to processing. Clustered file systems were created to help solve this problem and Wall Street and others paid top dollar to implement tier zero systems. Combine this with leading data warehousing or business intelligence software and you can easily start spending the GDP of an African country. This is the old way and it’s broken. Moving a few megabytes of application code is much easier than moving 100TB of data. Add the ability to process in parallel on many nodes and you have something that can change the cost economics of data analysis. It’s been proven in map reduce and leveraged in life sciences.

I’ve expanded more on these concepts on the ParaScale blog:

Mike Maxey
Director of Product Management

Visiotech December 11, 2009 at 6:58 pm

Most comments are about rip and replace with new widgets that promise to solve it. What happens if they do not…

What a waste of money and energy in some of your comments…

Is this the new way of solving problems in IT now? No one can solve a simple architecture problem… where are the storage experts here… trapped in sales cycles of new widgets…

None of you think it is a setup problem, when MOST of the time it is. Yes, the disk array “might” be the cause of latency. It can be fixed by using cheap disks on the side, or even a few SSDs, at 100x less expense than rip & replace.

Looks like the “real” storage architects have retired… replaced by sales reps…

Ron December 12, 2009 at 1:42 am

Hi Z,

You’re free to put whatever open storage system & protocol you like on top of the DDN storage; it is FC/IB storage. It is completely open to any gateway / interface you would want to put on top of it.

You are not bound to NFS, CIFS, iSCSI..
On the contrary, you should explore more parallel oriented file systems (as many mentioned above). This gives you the freedom to scale out, not be vendor dependent, etc.

Blue-Arc & DDN are partners if you decide to go down that path.

There are several options out there, but I wouldn’t go exploring configurations that require too much change management & patching control, such as cloud file systems, etc. While those are great, in my opinion, for internet use, they are not currently suitable for HPC style file services.

HSM to tape December 13, 2009 at 1:50 am

A few have made the very valid point that HSM (especially HSM to tape) is a dinosaur. It is from an era when both large space and high I/O required hundreds to thousands of spindles, each costing $1000+.

In the last five years storage (even managed storage) has become so cheap that the overhead cost of HSM to tape is higher than the cost of today’s spinning disks.

And SSDs solve the I/O issue; all that is missing now is for your vendor of choice to implement smart software to automatically serve hot data from SSD. Call this HSM for the new century.

The only bottleneck that may remain is the controller based architecture.

Distributed (file or storage) systems that scale performance with extra nodes are possible solutions. A distributed file system takes a bit more integration than a storage system that presents a storage protocol (block or file).

deadhead92 December 13, 2009 at 10:33 am

If money is no object, Isilon is great. We didn’t want to spend as much or more of our budget on storage as we do on sequencers. We built open storage on whiteboxes for a fraction of the cost. It may just be a combination of our needs and good talent on our team that this has worked out so well.

David Magda December 13, 2009 at 10:38 am

Can you slot in ZFS, with its use of SSDs to speed up reads and writes (aka “hybrid storage pools”)?

AFAIK, SAM-FS doesn’t run on ZFS directly, but you may be able to create ZFS volumes (which look like raw disks), and then run SAM on top of that. This may be able to at least solve Problem #1.


Recent posts to the “sam-managers” indicate this is a popular topic:


z December 13, 2009 at 2:21 pm

Some great thoughts and comments reading upwards.

A particular problem is illuminated by David M, however. Things like SAM-FS (and, by extension, QFS) capitalize greatly on the use of horizontally scaling physical QFS client systems, which require multi-mounted block level access – across something such as an FC-AL SAN. The issue in running ZFS as the volume management method for over-arching SAM is that it’s not a multi-mount savvy FS, nor does it provide concurrent mounting in this respect. If it did, this whole thing would be a lot easier.

TS made some good points about software costs and hardware volume, I believe.

@ Ricardo: Yup, we have looked at TMS stuff, and the truth is, it’s only beneficial at the very top end and at the “Tier 0” point. To replace an entire environment of spinning rust would be financially huge.

Like Visiotech suggests, I’m unconvinced that yanking an entire environment out and replacing it with the new trends in disk/node methodologies is necessarily the way “up” and “out” of this, given I have significant compliance issues to adhere to, in terms of when my data is ‘cold’, how it is ‘cold’ and the costs of powering a couple of racks of super dense disk. Additionally, the whole idea behind SSD/Flash based solutions here was to drive up IOPS for specific very sensitive loads within a multi load, multi IOPS/skew system, without needing to push 100’s of mechanical spindles into place.

@ Ron: I’m interested in the fact that I can put whatever I want on top of that kit – it’s got some flexibility. I’ll look more into it.

@ HSM to Tape: I guess the other thing to consider in the tape layer is that it offers some interesting abilities in the data protection space. It’s not just about having ‘density’ or ‘cheap media’. It’s also very much about keeping power and cooling costs down, as well as compliance up. At this point, I’m unsure as to how many of the solutions we’ve talked about could automatically copy data down the pipe to a tape filesystem and then stub, to be recalled when needed. It’s kind of breaking rules/mixing metaphors in some ways. In growing, and scaling out, if heat, cooling, physical space etc are an issue – then the question of density and flexibility isn’t actually solved by the ‘just add more nodes’ concept that we are toying with here.

@ deadhead92: Yup. Am thinking about it, indeed. That said, whiteboxing has its own set of issues. Do you find patching overhead, stability, supportability and overall physical component reliability to be OK in your situation?

Things that seem obvious to me so far:

1. People love disk/clickable node architecture, but in the realm of compliance for a big bang compliance scenario, I’m not sure where we are left if one can’t instantly have the data on a ‘backup’ media mechanism as well – remember, tape is about more than just HSM density or being cool.

2. There seems to be great economy in the methods mentioned above to scale out and increase physical throughput of the overall entity here, but if the costs of the actual software that glue it all together are significantly high, in a way, we’re back where we started – with poor storage economics (when you could generate several TB in a day and think nothing of it, and you license by the TB, that can become a bad situation quickly), but some killer performance, which doesn’t actually match all our compliance needs.

3. It seems that there isn’t really a happy medium to engineer within here. It’s not a case of merging the best of both worlds. It’s traditional methods vs the new methods – and attempting to blend or integrate the two seems to keep coming unstuck at certain points along the thought path…

So I’d throw the question out there:

Let’s say I wanted to keep my large three tier HSM architecture, but I wanted to integrate the ease of management, scalability of storage and physical ‘IO as you go!’ style thinking into it, without spending the GDP of a small island nation, keeping in mind that it needs to get to tape somehow, instantly, or close to instantly. How would you engineer it? What challenges would you face?


Visiotech December 13, 2009 at 10:03 pm

deadhead92, you need to look at this blog here before stating that tape is dead.

I would say disk has been a temporary repository for tape for as long as IT has existed, and it will remain true for another 10 years at least.

Tape remains the cheapest way to store “long term” archive data. Even backup benefits from this medium when it is used properly. Most young people think tape is old. Get the facts and you might be surprised: tape surpasses disk in many areas, including speed, and has better error correction.

If you need to archive hundreds of PB, you will like tape. Just the time required to store that data on disk will outlast the disks’ expected 5-year life, if you can even get a spare drive by then; disk vendors replace their models much faster than tape vendors do.

Tape lasts 10+ years easily. I have 20-year-old tapes being read daily, while the disk array that stores my tier 1 has been replaced twice in the same period.

HSM to tape December 14, 2009 at 12:47 am


Perhaps my views are colored by newer storage players that don’t license each feature per controller and amount of attached storage.

I agree tape has a role to play as a backup archive medium, not sure HSM at the filesystem layer is better (scalable/reliable) than automated tape backup/restore.
Especially in an environment such as yours where your large volumes of data are moving back to spinning disk from tape.

Good luck

obi December 14, 2009 at 3:58 am

There is one thing I seem to be missing out of all of this.

With a SAM-FS HSM system, one can take daily snapshots (SAM-FS metadata dumps) and have, reasonably instantly, depending on the number of inodes to traverse, a point in time backup to restore from.

When storing user data on such a system, this becomes a brilliant way of covering ones arse. This can go back as far as you can keep storing the metadata dumps. Not to mention, one can perform the seemingly amazing miracle of an entire filesystem restore, if someone just happens to blow the whole thing away.

How do the companies proposing that z replace his HSM system deal with the very real issue of backing up a petabyte of disk? Or is that something for the end user to figure out after you dump that on them?

Rob Peglar December 14, 2009 at 5:51 am


I have been encouraged to respond, and am glad to do so. As Robin mentions, vendors are encouraged to identify, so I will. My name is Rob Peglar, and I am Vice President, Technology for Xiotech. Amongst my 32 years in the industry, I also worked for STK as a storage architect for nearly a decade, so I understand fully your explanation of HSM, SAM-FS and the issues with your current disk storage. Believe me, I spent many a long night in the datacenter working with SAM, so I feel your pain 🙂

I would like to offer an alternative approach to the ‘big chassis’ solution; instead, use highly scalable intelligent storage elements (ISE), much as the compute side is using compute elements. Without trying to sound like a sales rep, the ISE is the highest performing disk element available, based on SPC-1 and SPC-2 results for protected data. It also is the most economical solution for performance-starved applications (PSAs) in terms of footprint, watts, IOPS/disk, and $/IOP. Our current figure of $3.05/SPC-1 IOP is over twice as efficient as the Sun 6780, which measured $7.15/IOP. For SPC-2, the ISE was measured at 789 MB/sec per 3U. Since the ISE is an independent element, unlike a drive bay, performance scales linearly with the number of elements.
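As a quick sanity check on the figures quoted above, here is a minimal sketch of the arithmetic. It uses only the vendor-supplied numbers from this comment; the function names are made up for illustration, and the perfectly linear scaling is the comment's claim taken at face value, not something measured here.

```python
# Figures as quoted in the comment (vendor-supplied, not re-measured here).
ISE_DOLLARS_PER_IOP = 3.05         # Xiotech ISE, SPC-1
SUN_6780_DOLLARS_PER_IOP = 7.15    # Sun 6780, SPC-1
ISE_MB_PER_SEC_PER_ELEMENT = 789   # SPC-2, per 3U element


def cost_ratio(cheaper: float, dearer: float) -> float:
    """Return how many times more the dearer system costs per IOP."""
    return dearer / cheaper


def aggregate_throughput(elements: int,
                         per_element: float = ISE_MB_PER_SEC_PER_ELEMENT) -> float:
    """Ideal aggregate MB/s, ASSUMING perfectly linear scaling per element."""
    return elements * per_element


if __name__ == "__main__":
    ratio = cost_ratio(ISE_DOLLARS_PER_IOP, SUN_6780_DOLLARS_PER_IOP)
    print(f"$/IOP ratio: {ratio:.2f}x")
    for n in (1, 4, 10):
        print(f"{n:2d} elements -> {aggregate_throughput(n):,.0f} MB/s (ideal)")
```

The ratio works out to a little over 2.3, which is consistent with "over twice as efficient". Real aggregate throughput would also depend on the fabric and the hosts, which this sketch ignores.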

Besides performance, however, both reliability and data integrity are vital in petabyte-scale repositories. The ISE is designed with several autonomic and self-healing capabilities which are intended to minimize disruption due to FC-AL or individual disk failures, and help eliminate false positive failures – which are by Seagate’s own admission over half of the drives returned to them classified as ‘failed’ by legacy RAID controllers. The ISE is a fabric-based scale-out storage element, containing its own bandwidth, cache, and IOPS pool per element. Since it is a fabric machine, there is no limit to the number of elements you can deploy. There are several types of media contained within the element – currently 6 different types ranging from 40 2.5″ 146G/15K disks per 3U, several types of 3.5″ 15K disks, to 20 3.5″ 600G/10K disks to 20 3.5″ 1TB/7200 disks per 3U.

The ISE also performs the ANSI T-10 Data Integrity Field (DIF) function on all I/Os entering the element. DIF is a vital part of any petabyte-scale installation. Two summers ago, CERN published a remarkable paper illustrating their own data integrity issues, which were revealed with simple programs and resulted from the lack of both payload and address integrity checks on their existing disk arrays. I encourage you to research DIF and construct your ongoing disk array strategy with DIF included. Here is the URL from Robin Harris’ article on same, two years ago.


Finally, there is system reliability. The ISE is a 5-year, $0 warranty product, unlike many other arrays. Its disks are contained in sealed datapacs, highly protected against excessive heat and vibration. The ISE is also being studied, tested and implemented in certain large US national laboratories and academic HPC environments, because the product of performance X reliability is second to none. I would be happy to discuss these under NDA, due to the sensitive nature of the projects.

The ISE also works with many clustered filesystems as well as your current SAM-FS as a disk tier. In truth, it is designed to do one thing very, very well – form the basis for disk storage for large scale-out compute problems without locking the researcher into a particular filesystem, FC-AL design, or HSM capability. The ISE also works as block storage with several high-performance NAS heads, several of which have been mentioned already.

My email is robert_peglar@xiotech.com. Thank you.

Paul Rutherford December 15, 2009 at 9:59 am

Z, TS, Deadhead92,

The best way to understand what Isilon can do for you is to have a one on one meeting. Email me and I will set up time for an in depth discussion of your needs and our products.


paul December 15, 2009 at 1:03 pm

After reading all these point solutions from vendors, I would encourage you to evaluate the IBM Smart Archive solution suite and the Information Archive platform.


This is a next generation archiving solution that delivers on the promise of “ILM” with software integration to almost any structured or unstructured data/application in the enterprise. The infrastructure consists of IBM 2U servers which FC connect to either the IBM TruMAID (Massive Array of Idle Disk) storage and/or to your tape libraries. The IA pools could look to SAM-FS as another tier, but with much lower TCO than any of these other scale out NAS or internal cloud solutions.

TruMAID provides 179TBytes per square foot using 2TB SATA drives, and a full cabinet (10 sq. ft.) with 1.79PBytes requires only 5.5kW maximum power.
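The density and power figures above can be cross-checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in this comment (decimal terabytes assumed); the helper names are hypothetical.

```python
# Figures as quoted in the comment above.
TB_PER_SQ_FT = 179      # TB per square foot with 2TB SATA drives
CABINET_SQ_FT = 10      # footprint of a full cabinet
CABINET_MAX_KW = 5.5    # maximum power for a full cabinet


def cabinet_capacity_tb(tb_per_sq_ft: float = TB_PER_SQ_FT,
                        sq_ft: float = CABINET_SQ_FT) -> float:
    """Capacity of one cabinet in TB (decimal units assumed)."""
    return tb_per_sq_ft * sq_ft


def watts_per_tb(max_kw: float = CABINET_MAX_KW) -> float:
    """Worst-case power draw per TB stored in a full cabinet."""
    return max_kw * 1000 / cabinet_capacity_tb()


if __name__ == "__main__":
    print(f"cabinet capacity: {cabinet_capacity_tb() / 1000:.2f} PB")
    print(f"max power per TB: {watts_per_tb():.2f} W")
```

The capacity matches the 1.79 PB stated, and the maximum power comes to roughly 3 W per TB, a figure that could be set against the power cost of an always-spinning SATA tier.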

The IBM Information Archive is a complete software and hardware solution that would integrate well into your existing SAM-FS environment, giving you full protected archive features for compliance and eDiscovery requirements, and provide a very compelling TCO vs. standard NAS or cloud storage.

I am the IBM Data Archive Institute storage consultant for the western U.S. Send me your contact information if interested and let’s schedule a meeting to explore further.

Carter George December 15, 2009 at 3:00 pm

Most of the responses here have focused on your problems in the fast tier. If I understand correctly, you’re not looking for a new file system, but for storage to make the different tiers go faster with your existing SAM-FS infrastructure.

Having gone to the effort of implementing an HSM deployment (easier said than done), one way to take advantage of that is to optimize your tier 2. You can get more cost-savings than just using cheap SATA disk. For scientific data sets – such as next-gen sequencing, mass spectrometry, images that come off of many types of instrument – it is possible to integrate content-aware compression transparently in the second (SATA) tier.

This kind of compression recognizes specific file types and can get up to 75% compression at close to wire speed performance. This won’t solve your IOPS problem for the metadata slices, but saving 75% for a petabyte of storage at that tier could help you find the money to buy the cool stuff for tier one. If your data looks like TIFFs, SAM/BAM/SRF for genomics, or any other scientific image or coded data set, this would be worth looking into. If it’s just alphanumeric, then generic compressors (such as those in ZFS) could be turned on.
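To put the "up to 75%" figure in capacity terms, here is a small sketch. The 1 PB (1000 TB) tier size comes from the discussion; the lower reduction ratios are added purely for comparison, since 75% is quoted as a best case rather than a guarantee.

```python
def stored_tb(raw_tb: float, reduction: float) -> float:
    """TB actually occupied after a given fractional size reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return raw_tb * (1.0 - reduction)


if __name__ == "__main__":
    RAW_TB = 1000  # one petabyte of tier 2 data, decimal units
    # 0.75 is the best case quoted; 0.25 and 0.50 are illustrative assumptions.
    for reduction in (0.25, 0.50, 0.75):
        kept = stored_tb(RAW_TB, reduction)
        print(f"{reduction:.0%} reduction: {kept:.0f} TB stored, "
              f"{RAW_TB - kept:.0f} TB saved")
```

At the best case, a petabyte of raw data would occupy 250 TB of SATA; how close real data sets get to that depends entirely on the file types involved.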

As it happens, we do have some experience with the AMS 2500. I can’t compare it directly to the STK models, which I believe are made for Sun by LSI. We’ve found the HDS a bit difficult to get configured and ordered, but once it is in place, it’s rock solid. Super highly available, no disparity between vendor performance claims and actual performance, and very good at using intelligent cache to get that claimed 900,000 IOPS out of an array with SAS drives. (We’ve found it’s possible to get just as much performance with the 15K rpm SAS as with Fibre Channel drives.)

Cache is important in the HDS scheme, so you’d want to get the full cache size on offer. Finally, I am a bit surprised that HDS has not qualified any SAS form factor SSD’s in this array yet, as it would be a natural thing to do. SSD would be a good fit for high IOPS to a read-mostly metadata slice.

Although we have not used the AMS with SAM-FS, we have used it with multiple cluster file systems (Ibrix, PolyServe, Lustre) to good effect.

Carter George, VP Products, Ocarina Networks

paul December 15, 2009 at 3:25 pm

The Ocarina post prompted me to point out that the IBM Information Archive appliance supports both compression and deduplication of archived files. This provides a 20-80% reduction in storage size. The compression and deduplication are performed either at the client or upon ingest to the archive through up to three 8-core IBM System X servers, providing plenty of horsepower for the job.

Paul Hewitt
IBM Data Archive Institute

Dave Brown December 16, 2009 at 1:05 pm

You mention not wanting to get connected to the hip, or hip pocket, to a large storage vendor. Consider technology from DataCore Software that is complementary to your existing SAM-FS software. DataCore is coming up on 12 years old now and has been an innovator in the industry. DataCore was the inventor of thin provisioning technology and continues to lead with things like the ability to have up to 1TB of cache in a given storage controller, and it was the first 8Gb FC target on the market. DataCore also just released support for very large volumes.

As Rob of Xiotech mentioned, their ISE 5000 is an awesome array, and by adding some DataCore SANsymphony to complement it you increase your performance even more. If you want to add SSD, you can do that today: just connect up SSDs from any of the vendors to your commodity server hardware running Intel or AMD, and you can use things like STEC’s 3Gb or 6Gb SAS connected SSDs, or their 4Gb FC SSD. Pliant just released some and Intel has them as well. With DataCore, you can pool all those and any other disk type from any vendor into as many pools and tiers as you’d like.

Just as the CPU has evolved over the years from vacuum tubes to silicon to a small internal cache to now three layers of cache, look at storage in the same way, with DataCore’s software being your fastest Layer 1 cache. The software will not slow things down but speed them up, typically cutting I/O latency by up to an order of magnitude compared with a normal cached array controller. There are many more features and capabilities of the software than I’ll write about here, although if you’re interested in finding a solution that offers you scalability, performance, and the ability to fit into your existing storage environment (DataCore can present storage to any Open System host), and that doesn’t tie you to a monolithic stack of storage, you should look at DataCore.

Dave Brown

Joe Landman December 21, 2009 at 10:20 am

Coming late with a response … we’ve had a busy month/quarter/year …

First off, we are a vendor. We build very nice, dense, and fast storage boxen/targets/storage clusters and systems for this sort of work. We deliver and sustain data rates that are quite good (e.g. non-marketing numbers, real, end-user-repeatable results). We have units in production that easily supply more than 10GB/s to large HPC cluster systems, including for informatics analysis applications.

The issues I see being requested to be addressed are 1) existing meta data servers are barely able to keep up with load, 2) a need to expand the capacity (increase the density?). I don’t see the tape as an issue, even though it is listed as a problem.

Other comments seem to suggest replacing the existing infrastructure with something new. Choose your flavor and go forth or something like this.

If the issue is to solve #1 and #2, this is easy to do without replacing much. The question I would have is how this would scale going forward. If you see going from 1PB to 10PB as problematic, then a new architecture is probably not a bad idea. More on that in a moment.

Question #2 is “solvable” by keeping the cost of additional nodes down. This may be at odds with some vendor solutions (not ours).

Question #1 is “solvable” by replacing the MDS in the current design with something faster. We just benchmarked one of our JackRabbit-Flash units, with 1k random reads against 256GB of data at a sustained 180k IOPs. This isn’t a terribly expensive unit, and its flash drives obviously run circles around the 15k RPM drives. You can’t repeal physics; mechanics will not be as fast as electronics in most cases that I am aware of.
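To make the "flash runs circles around 15k RPM drives" point concrete, here is a back-of-the-envelope sketch. The 180k IOPS and 1k I/O size are from the benchmark quoted above ("1k" is read as 1 KiB here); the 3.5 ms average seek time is an assumed, typical-looking figure for a 15k RPM drive, not something stated in this thread.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half a revolution, in milliseconds."""
    return 60_000 / rpm / 2


def disk_random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough random IOPS for one spindle: one seek plus half a rotation per I/O."""
    return 1000 / (avg_seek_ms + rotational_latency_ms(rpm))


def throughput_mib_s(iops: float, io_kib: float) -> float:
    """Bandwidth implied by an IOPS figure at a fixed I/O size."""
    return iops * io_kib / 1024


if __name__ == "__main__":
    FLASH_IOPS = 180_000      # quoted benchmark result
    ASSUMED_SEEK_MS = 3.5     # ASSUMPTION: typical 15k RPM average seek

    per_disk = disk_random_iops(15_000, ASSUMED_SEEK_MS)
    print(f"one 15k RPM spindle: ~{per_disk:.0f} random IOPS")
    print(f"spindles to match flash unit: ~{FLASH_IOPS / per_disk:.0f}")
    print(f"bandwidth at 1 KiB I/Os: {throughput_mib_s(FLASH_IOPS, 1):.1f} MiB/s")
```

Under these assumptions a single spindle manages on the order of a couple of hundred random IOPS, so matching the quoted figure mechanically would take something like a thousand drives. Array caching and short-stroking would change the exact numbers, but not the order of magnitude.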

Ok. Onto design issues. If as you scale up, the MDS is only going to get worse (as all centralized designs will), then replacing it provides only a bandaid over the issue, and avoids solving the real problem, that of good design. #2 isn’t affected as much as #1 on the design side. Bulk data storage should be lower cost and fast. But, if you have a single point of information flow in your scale out process, your design will eventually fall over.

So if you do plan to scale up well beyond 1PB, the centralized MDS has got to go (and any design that utilizes a centralized MDS is likely to have the same issues during scale up). Here things like Gluster (which we sell/support/integrate into our offerings) and a few others make a great deal of sense. You scale up as you need, with reasonable economics.

Feel free to ping me on/offline if you need to talk about these designs. Basically, if you are not trashing your existing infrastructure, you need to have a clear conception of how much higher it can scale, and whether or not an SSD replacement will help your MDS for your planned future. If you really do need to scale up/out, our siCluster (info to appear soon at http://scalableinformatics.com/sicluster) product is certainly one worthy of consideration, providing some of the best end user achievable scale-out performance we have seen on customer applications to date.

TimC December 26, 2009 at 6:19 pm

Just a note on your AMS2500 concerns. The SAS disks in an AMS are mechanically identical to their FC brethren; they just have a different backplane interface. You should actually see BETTER concurrency because they are a point-to-point connection rather than the loop topology of FC-AL.

NG December 29, 2009 at 6:45 am

If you want to keep HSM/tape in your configuration, consider something like Quantum’s StorNext; it’s a SAN file system that can scale in performance with the addition of more nodes and has some very good references. If you want to move away from tape and stay with all disk, take a look at object based storage solutions (this is all in addition to what has been mentioned) like Caringo; they have petabyte implementations and demonstrate performance.

Finally, there are other ways you can reduce the cost of your infrastructure while using inexpensive SATA drives, such as deduplication and compression. There are two companies that do a really good job at this independent of the file based storage you have, Ocarina Networks and Storwize. Ocarina is a post process for more static content and is able to optimize images along with text files and other precompressed files. Storwize is an inline compression engine that is optimal with a variety of files, excluding precompressed ones. By using these technologies, you can reduce the footprint of data and the cost of the storage and its environment.

Not to forget, though: Open ZFS has been adopted by a few vendors who might be good for tier two, including Greenbytes, who added inline deduplication, and Nexenta. Might be an interesting option.

Kebabbert January 17, 2010 at 4:04 pm

Nice solutions, but don’t forget SILENT CORRUPTION. All solutions (except ZFS) are subject to silent corruption, where your data slowly accumulates rotten bits without the hardware even telling you. If you value your data, do you want to avoid bit rot? What happens if your data suddenly changes from a “1” to a “0” without the hardware informing you?

Big physics centre CERN did a study on this, and across 3000 hardware RAID rack servers they found 152 instances of bit rot, where the data was altered without the hardware even knowing it! The sysadmins were never notified. CERN discovered this by using a program that wrote a known bit pattern and then compared what it read back with the expected result.
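The write-a-known-pattern-then-compare idea is simple enough to sketch. The following is not CERN's actual program, just a minimal illustration of the technique; the block size, the pattern layout and the function names are all assumptions made for the example.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # bytes per block; an arbitrary choice for this sketch


def make_block(index: int) -> bytes:
    """Deterministic block content: the block index repeated.

    Embedding the index means a block that lands at the wrong address
    is detected too, not only flipped bits within a block.
    """
    return index.to_bytes(8, "big") * (BLOCK_SIZE // 8)


def write_pattern(path: str, blocks: int) -> None:
    """Write the known pattern and push it to stable storage."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(make_block(i))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path: str, blocks: int) -> list:
    """Re-read the file and return the indices of blocks that do not match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(BLOCK_SIZE) != make_block(i):
                bad.append(i)
    return bad


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_pattern(path, blocks=256)
        print("corrupt blocks:", verify_pattern(path, blocks=256))
    finally:
        os.remove(path)
```

A real test would use files much larger than RAM and re-verify after days or weeks, because an immediate read-back is usually served from the page cache rather than from the disks being tested.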

All hardware solutions have some rudimentary protection against bit rot and silent corruption, but none protects completely, except ZFS. Here are two more articles for you to read, if you want to learn more about CERN and bit rot (he concludes end-to-end checksums are needed, ordinary checksums will not do – he suggests ZFS)

ZFS is designed from scratch, to NEVER EVER trust the underlying hardware (cosmic radiation might flip a bit, power spike, bugs in BIOS, not really connected card slots, etc etc):

In my opinion, ZFS protection against bit rot is THE main reason to use ZFS. Why use fast and unreliable storage? Better to use safe storage which guarantees that your bits are not altered. Read those links for more information.

Francis Kim January 31, 2010 at 3:36 pm


I would definitely look at SSDs, since you have a mountain of metadata to churn through before you can get to your petabyte. Of course you’re talking to FIO. They love science experiments. Their ioDrive cards are the IOPS kings at the moment, as their pricing suggests. One caveat. FIO’s model of “an ioDrive in every server” is going to be at odds with your existing environment running SAM-Q on Solaris. Better to look at stuffing a box with a number of SSDs (disk form factor, PCIe, etc.), then present them out as storage target LUNs for your SAM-Q server to use for metadata store. You want to disrupt your fragile HSM server as little as possible. This way, you can remain flexible with respect to SSD adoption and take advantage of the SSD’s rapidly falling price/(capacity:performance) curve.

Jacob February 16, 2010 at 6:08 pm

I’ve been doing a bunch of massive capacity projects with an archival file system from FileTek. It is not an HSM. It’s an archival file system that stores files on tape and caches them on disk. It’s not designed for crazy performance, but it is designed for enormous file counts and data integrity. In addition to archiving files, it archives SQL databases too. Very cool. The next thing you need is a virtual file system to feed it: IRODS, SRB, Nirvana, maybe even Acopia. Drop me a line if you want references or want to learn more.

KD Mann February 26, 2010 at 5:17 pm

Just a few quick thoughts here (though I’m a couple months late)…

1) IBM’s SONAS (Scale Out NAS): infinibanded HPC derived clusters with DDN storage on the back-end. Just saw these with my own eyes last week — way fast, way scalable, way efficient, way cool

2) IBM’s SAN Volume Controller; especially now that they’re able to license per spindle instead of per terabyte. SVC’s 380,000 IOPS in SPC-1 is almost twice as fast as anything else ever tested on spinning disks, and is even about 30% faster than the big SSD arrays recently tested by both IBM and TMS. All that, and you can even put your existing spindle-farm underneath it.

Finally — wouldn’t Isilon be DQ’d here on performance? Isilon is all about cheap, massive capacity across a single namespace, but Isilon performance is not anywhere near the rest of the solutions discussed here.

A newbe March 2, 2010 at 3:19 am

Allen May 22, 2010 at 4:28 pm

Check out the Gravity stuff at Infiscale. They are the makers of a ton of free open source HPC and cloud stuff with links to systems they did that have several petabytes of storage. Met a couple of them at SC09 and they had a nice demo of their Perceus running at the Intel booth. Always partial to those that show us the source and give us code 😉

Tim May 24, 2010 at 7:48 am

might also want to have a look at http://openstoragepod.org — “Petascale storage for the rest of us!”.
