Choice is a great thing, unless there’s too much of it. And choice is what we have a lot of in today’s data storage market.
A longtime StorageMojo reader has an interesting problem: architect a 3PB data storage facility. Can you help?
Here’s what he wrote to StorageMojo. His email has been slightly edited for clarity and length.
One of my current problems is to design one of the nodes for a large research data storage facility. I’ve had to do this stuff in varying degrees, varying modalities and varying tech in times gone by.
I’ve been given a number and “capacity” to look into – somewhere near or around 3PB to begin with. We won’t even go down the path of discussing workloads or disk technology fit for purpose at this stage, but something has struck me as interesting.
There is this clear divergence in disk technologies at the moment and I’m finding it hard to resolve what is the “right” one for the task.
Currently, I see:
- Heavy-end storage virtualisation frames [VSP, Symmetrix et al]
- Big grid-ish things [IBM XIV etc]
- Weird “stacked” commodity LSI Silicon [NetApp E5400/5500, SGI IS5500/IS5600, Dell MD3660F etc – all the same silicon I think?!]
- Quasi virtualisation arrays with modular form factors (Hitachi’s HUS-VM?)
- High performance dense trays in modular form factors [DDN’s SFA-12K Exa and Grid scaler tech?]
- Bog-standard performance dense trays in modular form factors [Hitachi HUS, EMC VNX, HP EVA, Dell compellent etc etc]
- That wild crazy pure flash/RAM/SSD/NAND world that guys like Violin inhabit.
Currently I’m trying to rationalise what I should be using for a storage platform that needs to scale big, but do it from a sensible economic standpoint, with density, interconnect performance and throughput under grossly mixed workloads all being big factors.
Some folks suggest to me that I should be happy enough with the LSI horizontally stacked 60-drive trays, but I am not sure the technology is tracking too well in terms of performance or density (Hitachi, DDN and maybe some others can now do 84 drives in as little as 4-RU!).
I guess my question to you is – where do you see that dense high performance market heading? I know the guys at the LLNL over your way were crowing about the NetApp E5400 LSI stuff where they managed their “1TB/sec” file system (I think it was Lustre based?), but I have to wonder if that could have been more efficiently carried out using a DDN GridScaler/SFA-12K-E etc.
The StorageMojo take
Two issues here: is the segmentation our correspondent offers realistic and helpful? And what are the core architectural issues he needs to think about?
For the first issue, an object store or a highly parallel NFS – like Panasas – seems to be indicated.
Given that this is a general purpose high-performance system, the critical problem seems to be how the system – however architected – handles file creation/update/deletion metadata. String enough disks together – 1,000 to 2,000 – and you can get a reasonable # of IOPS and, if you need more, put some SSDs in front.
There are a number of scale-out storage systems that will credibly and economically grow to 3PB. Metadata is often the bottleneck, as Isilon buyers have found when creating many small files.
A maximum performance spec – including file creation etc. rates – will probably help eliminate likely laggards, while a budget $ per usable TB/PB will eliminate the uneconomic products.
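To make that screening concrete, here is a rough back-of-the-envelope sketch in Python. The per-spindle IOPS figure, the usable-capacity fraction and the quote amount are illustrative assumptions, not vendor numbers.

```python
# Back-of-the-envelope screening for a 3PB bid -- all figures are assumptions.

def aggregate_iops(spindles, iops_per_spindle=150):
    """Rough aggregate random IOPS from a pile of 7.2K/10K disks."""
    return spindles * iops_per_spindle

def cost_per_usable_tb(quote_usd, raw_tb, usable_fraction=0.7):
    """$/usable TB after RAID/erasure-code and filesystem overhead."""
    return quote_usd / (raw_tb * usable_fraction)

if __name__ == "__main__":
    # 1,000 to 2,000 spindles, as suggested above:
    for n in (1000, 2000):
        print(f"{n:,} spindles -> ~{aggregate_iops(n):,} IOPS")
    # Screen a hypothetical $2.5M quote against 3PB raw:
    print(f"~${cost_per_usable_tb(2_500_000, 3000):,.0f} per usable TB")
```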
Vendors are welcome to offer their perspectives. Please just identify your company so we know where you’re coming from.
Practitioners who’ve done this, or something similar, are encouraged to share their hard-earned wisdom. 3PB is non-trivial today.
Courteous comments welcome, of course. I’m going to start offering almost-free consulting for end-users. Stay tuned!
We currently have about 700 TB of raw EMC/Isilon (OneFS) storage over 24 individual “bricks”. However, we’re about to upgrade them (in a rolling fashion, one at a time, with no downtime – hopefully) to hit about 1.3 PB of raw storage in our main HPC storage island.
Each brick comes with ~24 drives, 4 SSDs, and 96 GB of RAM, though you can order different specs depending on whether you want bulk storage (SATA), faster storage (10Krpm), or lots of IOps (more SSD). Everything shows up in one namespace, and so all the clients can have a single entry in the fstab, but load is spread by distributing a Class C of IP space over all the bricks (you don’t have to use a /24; any CIDR division is possible). We’re generally quite happy with the set up in our HPC environment.
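For readers unfamiliar with the IP-spreading trick, here is a minimal sketch of the idea; the subnet, brick names and brick count are made-up examples, and a real cluster handles this itself via round-robin DNS rather than a script.

```python
# Sketch: round-robin a client-facing CIDR block over N storage bricks.
import ipaddress

def spread_ips(cidr, bricks):
    """Assign each usable address in `cidr` to a brick, round-robin."""
    hosts = list(ipaddress.ip_network(cidr).hosts())
    plan = {b: [] for b in bricks}
    for i, ip in enumerate(hosts):
        plan[bricks[i % len(bricks)]].append(str(ip))
    return plan

bricks = [f"brick-{n:02d}" for n in range(1, 25)]   # 24 bricks, as above
plan = spread_ips("10.20.30.0/24", bricks)          # any CIDR division works
print(f"brick-01 answers for {len(plan['brick-01'])} of the 254 addresses")
```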
We have another Isilon cluster/namespace for our “corporate” infrastructure that has about 200 VMs running on a few dozen blades using KVM. All of the virtual disks are on NFS, and we also have separate NFS exports for things like homedirs and other data that’s shared between different systems but belongs to the same group.
I would recommend breaking up the 3 PB into different storage islands (even if the equipment is from the same vendor). Each island should be focused on a different I/O pattern. Specifically, don’t put (say) the HPC scratch space on the same spindles as the homedirs, on the same spindles as any VMs you have, on the same spindles as read-mostly reference data.
Given that storage islands are the way to go (IMHO), I would generally look at a system which allows you to add/move storage units from one island to another with relative ease, and that will redistribute/re-stripe the data.
Generally speaking, we found Isilon Just Works(tm) and management overhead is minor. Of course you have to pay for the convenience. We’ve had a few hiccups, and have generally been content with the support.
For comparison, we used to have an IBM N-series (read: NetApp), and it fell over from the load regularly once we got to a certain scale, though it was a few years old when we retired it. We still have a NetApp for some light work, and so does our Windows team (Exchange, SharePoint). I’m sure some people may use NetApp in HPC, but we wouldn’t dare (about 400 compute nodes, doing mostly biomedical stuff), as we can’t see it surviving the load. If you’re going to hit storage hard, you want to be able to go wide with as many heads as possible, and I don’t see NetApp working in that space (even with their clustering).
We also have BlueArcs (mostly with DDN), and the management headache with them is atrocious. Perhaps now that they’ve been purchased by Hitachi/HDS there will be better integration between the front-end NAS heads and the back-end storage shelves, but we’ll be very happy once we can retire the stuff we have.
BlueArc with DDN? How old is that? I was a BlueArc customer going back to 2008 and they did not support anything other than LSI or HDS. I tried hard to get them to support other things and they flat out refused (we were a sizable customer at the time and were looking to migrate to a newer BlueArc – but ended up going with a competitor instead). They said it was too complicated to get things certified properly. You could technically run with other things if you wanted, but it was totally unsupported. The systems we had were going EOL at the end of 2008, so either your system is many, many years beyond EOL or whoever set it up made some very poor decisions.
I don’t think folks can answer the question at hand without knowing more specifically what the workload and availability requirements are. DDN systems, for example, are designed so that you can have maintenance periods where you take the system offline for things like certain software upgrades, hardware replacements, etc. You don’t want DDN if you want high availability, but they are good with throughput. If, however, you are an HPC shop and are “job” based – where you run for a few weeks/months on some specific task, with downtime between that and the next job – those environments seem to suit DDN quite well. If you do go DDN, absolutely get the systems cabled at the factory – do not risk cabling the systems in your own racks – you have been warned!
Enterprise systems obviously can get expensive pretty quickly and are generally optimized more for IOPS than throughput, and often have very strict lists of devices (switches/firmware/HBAs/drivers/MPIO) they support being connected to their systems. If you’re running a lot of oddball stuff you may end up negating a lot of the promised stability of an enterprise system when the vendor has you signing all sorts of documents absolving them of any responsibility because you’re running in an unsupported configuration. (And those that don’t make you sign such documents put you at an even greater risk IMO, because the customer is not well informed enough as to what could happen; ignorance is not bliss here.)
Or maybe this is just a bunch of bulk object storage for a massive file server or something – in which case maybe a ton of systems using DAS and something simple like Red Hat Storage Server (an open-source, supported object storage system) would fit the bill. At the same time, I would NEVER use Red Hat Storage Server for anything transactional in nature.
Or maybe you can write your own software and use a bunch of Backblaze(?) systems..
But really without significantly more info I don’t see how anyone could intelligently answer this question.
Looking through your basic requirements, we can instantly scrap a few competitors:
– Heavy-end storage virtualisation frames [VSP, Symmetrix et al]
– That wild crazy pure flash/RAM/SSD/NAND world that guys like Violin inhabit.
Shovelling 3PB into either of these is going to make your vendor buy you dinner. Forever.
The next question is planning around scaling.
– Weird “stacked” commodity LSI Silicon [NetApp E5400/5500, SGI IS5500/IS5600, Dell MD3660F etc – all the same silicon I think?!]
– Bog-standard performance dense trays in modular form factors [Hitachi HUS, EMC VNX, HP EVA, Dell compellent etc etc]
– Big grid-ish things [IBM XIV etc]
None of these are going to work, as they won’t even scale to 3PB. Unless, of course, you have something to tie it together:
– Quasi virtualisation arrays with modular form factors (Hitachi’s HUS-VM?)
So that is more of an option than the first two.
So, for an all in one box solution out of your list, we are down to this:
– High performance dense trays in modular form factors [DDN’s SFA-12K Exa and Grid scaler tech?]
But then it comes back to the mighty dollar, and there is always more than one way to build a bucket of bits. Are you better off, as the first comment suggested, to split your build into islands? Can you use something like a HUS VM to mask commodity storage? Do you go to full on commodity storage and tie it together with your choice of clustered file system? Do you need to have all the data online, or can you go to a TSM solution?
From the data given, the answer is the DDN SFA12k or similar.
But it all depends, as always, on money/performance/simplicity. Normally you would pick two, but it’s IT, so you can kind of have one maybe if you’re lucky.
I always love it when I hear things like:
“I’m trying to rationalise what I should be using for a storage platform that needs to scale big, but do it from a sensible economic standpoint…”
ie: I need LOTS of space, and LOTS of performance but we don’t have any money. This is like the old ‘Fast, Good, Cheap, pick two’ joke.
In the 3+PB game there aren’t really tons of players to start with. Either buy it to scale, or build it with parts and software (Lustre, PVFS2, GPFS, etc.) and tune forever.
And let’s be clear: it’s ALWAYS the metadata that kills you at scale. And now we are finding out that disks can have massive effects on performance as well. (See the Sequoia reports on disks.)
I’m also a touch curious why the formerly Engenio/LSI gear is called “weird”. I would have lumped it with either the bog-standard or the high-density packaging. It’s an old-fashioned dual-head style controller setup with disks behind it. (Yeah, OK, the controllers plug in now.) It comes with several flavors of controller now, and 3 or 4 enclosures. It can mix SSD/disk like most everyone else now, can cable a couple of extra enclosures into the loop, etc. I wouldn’t call it weird at all. (At least until they release their rumored XIV-like grid RAID.)
Quite a few places are using this NetApp E-Series (formerly Engenio/LSI) gear to build some rather sizable file systems lately. A 3-tiered setup with a high-speed cache and L2ARC-style front end, a stable NAS/SAN middle and something like LTFS EE behind it? Still can’t beat tape for cost.
Building your own cloud service? Buy some VSP or V7000 or other virtualisation gear. Our tests at scale on the virt gear out there now have not been promising; for SMB and small enterprise it seems great.
YMMV
Well, the post is dated April 1, 2013…
Never mind, I can see that a scale-out ATA-over-Ethernet solution is not on the list, so I suggest your longtime StorageMojo reader talk with the Coraid guys.
http://www.coraid.com/solutions/business_solutions/high_performance_computing
Regards,
Cristian.
Hmmm, I note the project is a data storage facility, yet there’s a great deal of talk about hardware and virtualization capabilities.
One critical requirement seems to be missing from your client’s description: the use cases. Discussing architecture as a homogeneous solution may be premature until use patterns and data classifications are disclosed.
But throwing big rocks and hoping for the best is another approach, in lieu of causal system design.
cheers,
gary
This is one instance where it’d be reasonable to consider a soft-SAN like DataCore. You can commoditize whatever you please behind it and provide blocks. Problem is, there aren’t many native filesystems that will play well at this size, depending on file/folder count – except maybe ZFS, EXT4, or something proprietary like VxFS.
The filesystem could be (though it depends heavily on the project goals) one already used in production at CERN (www.cern.ch): Quantum’s StorNext.
Links:
http://www.quantum.com/products/software/stornext/index.aspx
http://www.quantum.com/iqdoc/doc.aspx?id=5340
Regards,
Cristian.
As CTO of DDN, I want to thank the authors of all the comments mentioning our SFA and Gridscaler technology 🙂
Now, as a guy somewhat familiar with storage, I would recommend that your client profile his 3PB needs, especially if it is a research institution. As someone remarked, unless you are in the media and entertainment space and will store a few half-petabyte movies, at 3PB it is the metadata that kills you. Your client’s clients, the researchers, may need just 1PB of storage for their active cluster, in which case a high-performance device will do the job; meanwhile the balance of 2PB may be just reference data or archive, and then some storage with a scaled-out namespace may be enough.
As an example, and being DDN-centric this time, one could use a 12K-powered GridScaler with one 500TB fs of high-performance SAS drives behind a cluster, plus one 1PB fs of SATA drives for home directories. The balance of 1.5PB could be on an object store like WOS, with our GridScaler–WOS bridge seamlessly tying all three together.
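A rough sketch of what that split implies in drive counts; the drive capacities and usable-after-RAID fractions below are assumptions for illustration, not DDN configuration guidance.

```python
# Rough drive-count math for the three tiers sketched above.
# Drive sizes and usable fractions are illustrative assumptions.

tiers = {
    "scratch fs, high-perf SAS": {"usable_tb": 500,  "drive_tb": 0.9, "usable_frac": 0.80},
    "home dirs fs, SATA":        {"usable_tb": 1000, "drive_tb": 4.0, "usable_frac": 0.80},
    "object store (WOS-style)":  {"usable_tb": 1500, "drive_tb": 4.0, "usable_frac": 0.75},
}

for name, t in tiers.items():
    drives = t["usable_tb"] / (t["drive_tb"] * t["usable_frac"])
    print(f"{name:26s} ~{drives:,.0f} drives")
```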
No matter which vendor, there is rarely a one-size-fits-all solution. Your client can save $ and headaches by understanding what this 3PB will be used for.
Any particular reason why leveraging an object storage based solution has not been discussed?
My company works in the channel with Cleversafe (www.cleversafe.com). They scale effortlessly, leverage erasure coding for high availability/efficiency and, in any configuration I have seen thus far, are much more cost effective than traditional storage players in multi-petabyte environments (even with the cloud gateways required for CIFS/NFS translation costed in).
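The efficiency argument is easy to see with a little arithmetic; the 10+6 code width below is just an example, not necessarily Cleversafe’s actual configuration.

```python
# Raw capacity needed for 3PB usable: replication vs. an erasure code.

def raw_needed_pb(usable_pb, data_slices, total_slices):
    """Raw PB required by a (total_slices, data_slices) erasure code."""
    return usable_pb * total_slices / data_slices

usable = 3.0  # PB
print(f"3-way replication : {usable * 3:.1f} PB raw")
print(f"10+6 erasure code : {raw_needed_pb(usable, 10, 16):.1f} PB raw")
```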
However, I guess there are no strict answers until we understand application use cases and how dense you need the solution to be.
I’m wondering why HDFS hasn’t been mentioned. While less generic, it’s a storage platform and can be a great fit for certain needs.
Hi All,
Swimming in deeper water than I usually do, but as a long time reader, first-time poster, thought I’d share something that’s on my mind. I agree it’s a bit cart before horse to recommend anything without knowing more, but I’d like to add one more system for consideration, until further req’s rule it out.
We’re helping a client build a 1PB-capable system that can handle mixed workloads (we’re a VAR/integrator that works with 5 storage lines). The line I’ll mention here, due to its unique form factor, is Nexsan; it happens to be what we’re working on for this 1PB system, so it’s fresh on my mind.
This make/model was missing from the OP’s listing and Toby’s response too – maybe it’s not Tier-1 enough – but I’m listing it here in case the OP cares to check it out.
Lots of folks have seen/enjoyed the 42-drive SATABeast and SASBeast products over the years. The 4U, 750W, 126TB (using 3TB drives) dual-controller iSCSI and Fibre Channel “basic block” SAN has served a backup-storage role for many service providers and orgs. The “car hood” design seen in the Beast line 11 years ago has its merits – raw density – as the DDN model below can attest.
Beasts: http://www.netstoragesales.com/images/SASBeast/SASBeast_topfront_md.jpg
DDN: http://www.ddn.com/images/i5.jpg and http://www.ddn.com/pdfs/DDN_SS8460_Datasheet.pdf
But a few years ago Nexsan wanted to move away from the car-hood design to a drawer system, to allow packs of drives to be accessed (while still online) via the front of the chassis – their CTO said it best: http://nexsan.net/2011/03/ – so you get 60 drives in 4U (E60 model) without needing to slide out the whole chassis, and without the accompanying wire-management headache in the back. http://www.nexsan.com/en/products/~/media/Nexsan/Files/media/ProductESeriesjpg.jpg
Anyways, we’re working on an NST5300 for a client:
* the 5100 is small, only 31 drives max – think of this for branch-office type setups, i.e. to use for a dev cluster or to replicate to one of these…
* the 5300 goes to 360 drives, or 1440TB using 4TB drives
* the 5500 goes to 1260 drives, or 5040TB using the same.
The last 2 digits are either xx10 for NAS (NFS, CIFS), xx20 for SAN (iSCSI), or xx30 for unified (NAS+SAN) – obviously just a license key.
Its connectivity: 4x 10GigE to the clients (each controller would have dual 10GigE and 4x 1GigE) for NAS or iSCSI SAN, so if you need IB or FC, or greater than 40Gbps, this is not for you. Then, based on your replication specs: if you need synchronous writes to 2 sites, you need FC from the heads back to the storage blocks. If async is enough (which is IP-based out the front), then SAS-attached E-Series storage blocks are likely your better choice in the back. All of these E-Series support AutoMAID, which really does work to lower power usage.
this sheet may help: http://www.nexsan.com/products/~/media/Nexsan/Files/library/datasheets/NST5000_DS.pdf
I find this sheet confusing: http://www.nexsan.com/products/~/media/Nexsan/Files/library/datasheets/NST5000_SpecSheet.pdf – but it is still useful if you ignore the two rightmost columns. For ‘storage shelves’ at this scale, only consider the E60 (the drawer system mentioned above). In short, ignore the 224x storage JBODs unless you need to support VMware View or something with very high IOPS where the working set will exceed the FastTier cache but you still want 2.5″ SAS drives – btw, this is the only ‘storage block’ where the RAID calcs are done in the heads/controllers; the E-Series (E18, E48, E60) all have their own multi-proc RAID controllers and themselves host expansion chassis (i.e., this isn’t ZFS across all JBODs). Note: you can choose different classes of disk, or a 224x plus E60s, for different pools of storage and different workloads.
Lastly, FastTier, the special sauce – this is housed in the dual-controller chassis on a common backplane. A mix of STEC ZeusRAM, or SLC and eMLC, is used for write and read caching respectively. You can use all 16 slots in the head unit to add more FastTier as needed by your apps.
So, not sure if this unit is applicable for your task, but it’s the value winner for our client’s iSCSI and NAS needs – a client who does have some apps with high I/O requirements, plus a much larger, traditional, % of their dataset that will benefit from the more economical large-capacity drives. And with the active drawer system in the E-Series storage systems, keeping the whole system online when doing drive replacements is a nice plus! 3PB, with controller/head, could fit in one 55U rack, if your datacenter supported the power density.
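For what it’s worth, here is a quick density check on that claim, assuming E60-style shelves of 60 x 4TB drives in 4U and counting raw capacity only (RAID overhead and hot spares would shrink it).

```python
# Back-of-the-envelope rack-density check: 60-drive 4U shelves for ~3PB raw.
import math

drives_per_shelf, tb_per_drive, ru_per_shelf = 60, 4, 4
target_raw_tb = 3000

shelves = math.ceil(target_raw_tb / (drives_per_shelf * tb_per_drive))
print(f"{shelves} shelves -> {shelves * ru_per_shelf}U of shelf space, "
      f"plus a few U for the controller heads")
```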
Happy shopping!
A couple of observations:
Ed’s comments on the importance of file system metadata are critical – especially the ability of said file system to perform a scan.
Take a look at the scan performance, as it’s relevant to storage services such as backup and archive (ever see how a NetApp filer acts when your backup software asks it what files have changed?).
Also, Gary’s comments on ‘throwing big rocks’ at the problem are spot-on. Infrastructure design is driven by workflow/user requirements, rather than a ‘my pile of disks is faster than your pile of disks’ argument. Please, unless someone’s changing the laws of physics, the world sources disk drives from the same two vendors (i.e., everyone’s disks are the same under the hood). It’s the utility and features of the software (disk file system) that dictate the architectural differences between a pile of Isilon disk and a pile of Panasas, IBM, Hitachi, etc.
When looking at storage, and your Tier-1 requirement, please consider the broader picture: a comprehensive infrastructure that supports your data throughout its entire lifecycle, not just the Tier-1 HPC phase.
Have you considered what the really big shops do: blend disk with tape-based storage for optimal CAPEX/OPEX costs?
You should also consider the most obvious storage service (and the most mundane): data protection/backup/disaster recovery. How is this solved at petabyte scale with an all-disk approach? How does one back up 3PB? If you ask the disk-only vendor, you’ll get the same answer: buy more disk.
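A quick bit of arithmetic shows why this question matters: the throughput figures below are assumptions, but at any realistic sustained rate a full pass over 3PB takes days to weeks.

```python
# How long does a full backup pass over 3PB take? Rates are assumptions.

def full_pass_hours(capacity_pb, gb_per_sec):
    """Hours to stream `capacity_pb` petabytes at a sustained GB/s rate."""
    return capacity_pb * 1_000_000 / gb_per_sec / 3600

for rate in (1, 5, 10):  # GB/s sustained to the backup target
    print(f"{rate:2d} GB/s -> ~{full_pass_hours(3, rate):,.0f} hours per full pass")
```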
Consider IBM’s GPFS file system: GPFS 3.5 now has a Native RAID feature that eliminates hardware-based RAID, for advantages in both performance and availability (delivering up to 1.5TB/sec sustained I/O at a large early-adopter site); its ability to include a tape-based storage tier (for ‘active’ archive and integrated data protection) can dramatically reduce overall storage costs; and its ability to maintain a single global namespace over multiple geographies may be useful if your organization has a multi-site file-sharing requirement.
Btw, we’re biased. As an IBM partner that works on HPC infrastructure projects nationally with IBM, we design multi-PB infrastructures that blend disk and tape-based storage, and fuse compute, storage, archive and data protection into one contiguous storage utility.
Let us know if you’d like to whiteboard through ideas on “what’s possible”, using tools from IBM’s toolkit.
Cheers –
John Aiken
Re-Store