IBRIX CTO talks segmented file system

by Robin Harris | Thursday, May 22, 2008 | Off-Topic | 17 comments

What’s in a name?
I’ve started doing some work for IBRIX. Despite looking at their web site several times I’d never understood what, exactly, they did. Turns out they make a cluster file system. One good enough for Dell, EMC and HP to resell.

As part of our get acquainted process I spent some time with their CTO, Sudhir Srinivasan. I taped him discussing their segmented file system.

The Cliff Notes version
The segmented file system is similar to the Google File System in that it uses commodity servers with a local file system and puts a software layer above that to cluster them. It differs in that there is no dedicated metadata server node that needs to be hardened or prepared for failover.

The secret sauce is that when there is a file request, the receiving node can swiftly refer the request to the node(s) with the data. And it does it without a lot of back channel chatter eating up cycles and bandwidth.
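Here is a rough sketch of what that kind of referral could look like. It is not IBRIX's code: it assumes, purely for illustration, that a file identifier carries the number of the segment holding the file and that every node keeps a small in-memory table mapping segments to owning nodes. The names `FileId`, `Node`, `route` and the 48-bit split are all hypothetical.

```python
from dataclasses import dataclass

# Assumption for illustration: the low 48 bits of a file ID are a
# per-segment local number, and the bits above them name the segment.
LOCAL_BITS = 48


@dataclass(frozen=True)
class FileId:
    segment: int
    local: int

    def pack(self) -> int:
        """Combine segment and local number into one integer ID."""
        return (self.segment << LOCAL_BITS) | self.local

    @staticmethod
    def unpack(value: int) -> "FileId":
        """Split an integer ID back into segment and local number."""
        return FileId(value >> LOCAL_BITS, value & ((1 << LOCAL_BITS) - 1))


class Node:
    def __init__(self, name, segment_owner):
        self.name = name
        # Hypothetical per-node copy of the table: segment number -> node name.
        self.segment_owner = segment_owner

    def route(self, packed_id):
        """Return the node that should serve this file ID.

        Only a local table lookup is done; no other node is contacted.
        """
        fid = FileId.unpack(packed_id)
        return self.segment_owner[fid.segment]


segment_owner = {0: "node-a", 1: "node-a", 2: "node-b", 3: "node-c"}
receiver = Node("node-a", segment_owner)
fid = FileId(segment=2, local=12345).pack()
target = receiver.route(fid)
print(target)  # node-b
```

In this toy version the receiving node needs nothing but its own copy of the table to find the owner, which is one way to avoid back channel chatter on every request.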

The StorageMojo take
I haven’t fully grokked the technology. But the appeal is undeniable: commodity servers and storage; scalable; adept at both large and small files – the latter usually a problem with dedicated metadata servers.

EMC is using IBRIX as the file system for the local storage pools under their upcoming Maui product – IMHO their most important product introduction since the original DMX. I got the feeling Joe Tucci wasn’t entirely happy about that at his press conference this week, but hey! they’ve got bigger fish to fry.

Comments welcome, of course. If you’ve used IBRIX how has it worked out for you?

17 Comments

Tracy Reed on Thursday, 22 May, 2008 at 11:24 pm

How does this compare with something like Sun’s Lustre filesystem?

http://www.lustre.org
Huw Lynes on Friday, 23 May, 2008 at 2:57 am

It seems like a very similar concept to Isilon. Have I misunderstood it?
Robin Harris on Friday, 23 May, 2008 at 7:27 am

Tracy, Lustre has a dedicated metadata cluster. They are able to scale their cluster because they can scale their metadata service through clustering – a cluster within a cluster if you will. Ibrix appears to have another method, which was developed by a Princeton math professor (now at Yale, IIRC).

Huw, no I didn’t explain it very well because I still don’t fully get it myself.

Looking at Isilon specifically – even though many clustering architectures have a similar issue – Ibrix appears to have reduced synchronization overhead. Most clusters have a dedicated backend network for inter-node communication because they need a low-latency way of keeping metadata synchronized.

I’ll keep digging into it. Stay tuned.

Robin
Chris on Friday, 23 May, 2008 at 9:43 am

I’ve been looking at IBRIX for a couple of months now for an upcoming project. I can see how IO could be handled if you statically segment file storage based on file name – hash the name into a small Int; if between 1-99 go to server x, if between 100-199 go to server y, etc. When you try to do a directory listing, the server that gets the request forwards a request to all other servers for their file list for the same path. If you ask the system for a specific file, it just has to hash the name to find out which box owns it, no communication necessary.

Obviously, just a guess, but that’s how I got my head around it.
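Chris's guess can be sketched in a few lines. This illustrates the static hash placement he describes, not how IBRIX actually places files. The server names, the bucket count, the use of CRC32 as the hash, and the `Cluster` class are all assumptions made for the example.

```python
import zlib

SERVERS = ["server-x", "server-y", "server-z"]  # hypothetical servers
BUCKETS_PER_SERVER = 100                        # assumed bucket range per server


def owner(file_name: str) -> str:
    """Hash the file name into a small integer and map its range to a server."""
    bucket = zlib.crc32(file_name.encode("utf-8")) % (BUCKETS_PER_SERVER * len(SERVERS))
    return SERVERS[bucket // BUCKETS_PER_SERVER]


class Cluster:
    def __init__(self):
        # server -> {directory path: set of file names stored on that server}
        self.files = {server: {} for server in SERVERS}

    def create(self, directory, name):
        self.files[owner(name)].setdefault(directory, set()).add(name)

    def lookup(self, directory, name):
        """Find a specific file: hash the name, ask only the owning server."""
        server = owner(name)
        return server if name in self.files[server].get(directory, set()) else None

    def listdir(self, directory):
        """Directory listing: gather the partial lists from every server."""
        names = set()
        for server in SERVERS:
            names |= self.files[server].get(directory, set())
        return sorted(names)


cluster = Cluster()
for name in ["a.dat", "b.dat", "c.dat"]:
    cluster.create("/proj", name)
listing = cluster.listdir("/proj")
print(listing)
print(cluster.lookup("/proj", "a.dat"))
```

The trade-off shows up directly: `lookup` touches one server, while `listdir` has to fan out to all of them.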
Joe on Friday, 23 May, 2008 at 7:11 pm

Lustre can scale their metadata through clustering? No.

That’s a roadmapped capability, not a current one. Lustre’s only current clustering for metadata is active-passive (for failover).

As for Ibrix, it’s being turnkeyed, and is industrial grade. It doesn’t perform at the scale of Lustre, of course. Lustre is practically purpose-built for supercomputer file I/O concerns. As such, it lacks mission criticality support. One would be… crazy… to run Lustre in a “five nine” data center, although one might expect at least nine fives out of it. *wink*

Ibrix does not “stripe” files across the enterprise the way Lustre does. They had that feature implemented at one time, but it just wasn’t important enough to Ibrix customers to maintain it. The Ibrix folks have been kind enough to offer to reimplement the feature for large buyers.

Isilon’s scaling architecture is a bit different from Lustre’s. It redirects clients to various cluster nodes, round-robin style. In the event a client needs a file that one of the cluster nodes doesn’t have, the cluster node uses the backend InfiniBand channel to fetch the file. This works pretty well for many workloads, but of course cannot compete with any serious Lustre installation.

Isilon is the most turnkey of the whole bunch. Their major claim to fame is setup simplicity. I doubt very much there is any simpler appliance on the market. Certainly they are the top of all the Gartner recognized vendors for that measure of merit. After the first node, adding storage to an Isilon cluster takes less than 30 seconds. Literally.

Joe Kraska
San Diego CA
USA
Pauly W on Friday, 23 May, 2008 at 8:32 pm

Isilon uses a distributed lock manager and a large default block size, which limit its scalability and efficiency. For customers looking for a large-sequential-block, low-IOPS, single-vendor solution, Isilon is a good fit.
Eric on Sunday, 25 May, 2008 at 4:46 am

Exanet does something similar as well with their ExaStore Clustered NAS software.
Did you get a chance to play with it as well?
Bill Todd on Monday, 26 May, 2008 at 3:49 pm

IBRIX is neat, and a decent technical overview can be found in the Dell “Achieving Scalable I/O Performance” paper on its Web site.

But it leaves a few things to be desired:

1. Explicit segment maps just don’t scale: they may take it to PB-level systems, but are likely to become cumbersome in EB-level systems.

2. The necessary central allocation coordination required by such explicit mapping scales even worse (even after applying optimizations to help distribute its lower levels).

3. Segment server fail-over configuration set-up (basically the old Sun cluster resource fail-over approach) sounds distressingly manual (in which case managing it doesn’t scale either). It’s possible to imagine add-on management utilities that mitigate this, but something more architecturally integrated and automated would be nicer.

4. For that matter, motherboards and processors are pretty inexpensive compared with the amount of storage that they can handle: just providing the option to include an extra board in each segment server to guard against server failure might be the best approach when the underlying shared-access storage is itself RAIDed (if it isn’t, it’s reasonable to ask whether making its segment manager redundant is worthwhile, since either the storage is already replicated somewhere else under some other manager or at least some of it will be completely lost if any of it fails; in any event, serially-shared access to locally-attached storage – e.g., via SAS or port selectors to SATA drives – between two such partnered fail-over boards is a far more cost-effective approach than more general-purpose sharing and should cover that particular issue adequately – when it needs to be covered at all – without the need for any additional management/configuration).

5. The glowing description of the virtues of segmented storage elsewhere on their site rings a bit hollow, since if indeed directories are distributed the same way that files are (as the Dell paper states) then whole sub-trees of the directory hierarchy that may be otherwise still valid can be isolated by the loss of a higher-level parent directory in a different corrupted segment. The attempt to make a fault-isolation virtue out of the segmentation (the primary purpose of which has nothing to do with fault isolation) seems a bit of a stretch: better to have borrowed a leaf from the ZFS book and used additional metadata replication with checksum validation, and then extended that leaf by replicating across segment servers such that loss of any given segment (or associated server) could not affect metadata availability.

6. It’s not clear that IBRIX has paid attention to the problem of having a zillion accessors pounding on a single humongous file (the converse of the ‘many small files’ metadata distribution case). In particular, while such a file’s ‘home’ node must at least play a minor role in things like keeping mtimes and atimes up to date, the need to funnel *all* read and write requests for the file *through* it makes it a potential bottleneck (though admittedly simplifies file-level coordination of things like metadata and lock management).

Still, not at all bad, and obviously useful.

– bill
Clive Bearman on Tuesday, 27 May, 2008 at 6:01 am

Sorry to correct you Robin, but EMC is NOT using IBRIX in its Maui system.
Robin Harris on Tuesday, 27 May, 2008 at 8:28 am

Clive, then what 3rd party software was Joe referring to in his press conference comment on Maui?

Robin
Stewey on Tuesday, 27 May, 2008 at 9:18 am

Robin (and others),

Any information on the Ibrix pricing/licensing model? Also, I enjoyed the video and your article. The comments section for this has been unexpectedly informative as well. So many things I would not have thought of in the evaluation of such a product.
Steweu on Friday, 30 May, 2008 at 7:07 pm

Yeah, it’s not in Maui, it’s in hulk…

Maui is a home grown beast
Sudhir Srinivasan on Sunday, 22 June, 2008 at 7:41 pm

Some really great comments and discussion here.

Bill, regarding the segment map, it is a few bytes times the number of segments in a file system and so it’s very small even for very large file systems (EB). You refer to a central allocation scheme – can you give more details of where you think IBRIX has a central allocation scheme and for what?
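For a sense of scale, here is a back-of-the-envelope version of that claim. The 8 bytes per entry and the 4 TiB segment size are assumptions picked for the arithmetic, not figures from IBRIX; only "a few bytes times the number of segments" comes from the comment above.

```python
BYTES_PER_ENTRY = 8           # assumption: "a few bytes" per segment
SEGMENT_SIZE = 4 * 2**40      # assumption: 4 TiB per segment
FILE_SYSTEM_SIZE = 2**60      # 1 EiB file system

segments = FILE_SYSTEM_SIZE // SEGMENT_SIZE
map_bytes = segments * BYTES_PER_ENTRY

print(segments)               # 262144
print(map_bytes / 2**20)      # 2.0 (MiB)
```

Under those assumptions an exabyte-scale file system needs a map of about 2 MiB, which fits easily in memory on every server. Smaller segments grow the map proportionally.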

Regarding failover set up and the whole issue of ease of set up in general, one of the things our customers like the most about us is our flexibility – flexibility in choice of server and storage hardware, storage topology, access method, etc. etc. Even for failover, we have customers that deploy 1+1 pairs and others who do n+1 clusters. All this flexibility is sometimes perceived as complexity. To that end, we’ve worked with our partners to make the most popular hardware configurations available as pre-configured units with minimal on-site set up. I urge you to read the recently published report by ESG on a lab validation of IBRIX they did, with specific focus on this issue. You can get it from http://www.ibrix.com/media/Analyst/ESG_Lab_Validation_Report_IBRIX_Fusion_Apr_08.pdf

Bill brings up a great point about providing segment-server redundancy in pairs – this is in fact a very popular configuration with our customers. The system is built out of “bricks” of two servers sharing a RAID array – you can scale out by adding as many of these bricks as you need and also scale up by stacking storage (and server resources like CPU and memory) within each brick.

Regarding the benefits of segmentation for fault isolation, perhaps it would help to look at it from the perspective of time-to-repair. Should a bad failure happen (e.g. due to hardware RAID failure), only the affected segment(s) need to be repaired. As you know, repairing a file system requires scanning the directory structure and cleaning it up. In a 100TB file system with a billion files and directories, that can take a very very long time indeed. With our segmentation, only the affected segments need to be scanned and repaired and we do them in parallel, so the time-to-repair is a function of the size of a segment, not the whole file system – orders of magnitude improvement. We have seen such benefits in actual real-world situations at some of our customers where unfortunate accidents happened and we were able to use segmentation to not only minimize down time but also recover data. Here’s more information: http://www.ibrix.com/media/Collateral/Technical%20Brief_Segmented%20File%20System_14Feb08.2f.pdf.
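The time-to-repair argument can be put into rough numbers. The billion files come from the comment above. The segment count and the scan rate are assumptions chosen only to make the arithmetic concrete, and the model ignores any cross-segment checks a real repair tool might need.

```python
TOTAL_FILES = 1_000_000_000   # from the example: a billion files and directories
SEGMENTS = 100                # assumption: files spread evenly over 100 segments
SCAN_RATE = 10_000            # assumption: objects checked per second per checker


def repair_seconds(objects, rate=SCAN_RATE):
    """Time for one checker to scan the given number of objects."""
    return objects / rate


whole_fs = repair_seconds(TOTAL_FILES)
one_segment = repair_seconds(TOTAL_FILES // SEGMENTS)

print(whole_fs / 3600)        # hours to scan everything with a single checker
print(one_segment / 60)       # minutes to scan one damaged segment
```

With these made-up figures a full scan takes about 28 hours and a single segment about 17 minutes. If several damaged segments sit on different servers and are checked in parallel, the wall-clock time stays close to the single-segment figure.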
Bill Todd on Monday, 23 June, 2008 at 8:32 am

After a month I can’t remember where I got the impression that Ibrix had a central allocation scheme – all I could find now with a quick look was a reference to policy-based allocation for new files (though that *could* be done without global knowledge of free space distribution – that’s what I was referring to as requiring ‘central allocation coordination’ – across an independent set of storage segments, at least until space started to get tight since you wouldn’t want to have to fumble around segment after segment looking for space that wasn’t there), the sentence “IBRIX Fusion software takes the capacity within each server and presents a single pool of storage” (which can be read as implying centralized management, but might not), and the assertion that Ibrix “uniformly redistributes stored data between [segments]” (which could imply, again, that global knowledge is used).

Your video comment about any server being able to route “any I/O” to its proper destination without itself performing any I/O may have led me to infer that the (in-memory) segment map information was far more detailed (and global in nature) than you suggest now. For example, for that statement to be true for an NFS random access to a file whose ‘home’ server was elsewhere (and for that matter whose contents may have spanned multiple segment servers) would require a *lot* of in-memory mapping information at every server.

If your customers value flexibility over simplicity then you’ve clearly made the right trade-off for them: the only question is whether that choice has turned away significant numbers of would-be customers (but if you’re happy with the market penetration that you have then that’s not a significant question for you).

But you haven’t eliminated my skepticism about the virtues of segmentation for fault isolation – in part because I have difficulty imagining *any* scalable cluster-storage approach that couldn’t conduct the kind of parallel recovery activity that you describe (i.e., that seems to me to have little to do with ‘segmentation’ per se, unless you define segmentation to mean *any* partitioning of responsibility that allows the system to scale out).

I certainly didn’t intend to criticize Ibrix unfairly, and in fact consider it to be one of the better-designed cluster storage products out there. I heartily agree that centralized metadata management doesn’t scale, even when handled by an expandable cluster as Lustre’s eventually will be: for metadata-intense activity, you need just as many servers (or perhaps more importantly disks) available to perform it as you do for file-access-intense activity, and that can only happen if the metadata management is spread across the same number of servers (which effectively means the same set of servers).

But my other hot button is explicit maps (because they don’t scale either: ZFS’s up-to-6-levels of indirect blocks being a good case in point) – and I don’t know how Ibrix stacks up in that area.

In any event, Ibrix seems a worthwhile addition to the storage arsenal, and I wish you luck with it.

– bill
PBlog on Sunday, 17 August, 2008 at 9:31 pm

How does this compare with GlusterFS?
Boris Zuckerman on Tuesday, 7 October, 2008 at 3:01 pm

Bill Todd said:
“But you haven’t eliminated my skepticism about the virtues of segmentation for fault isolation – in part because I have difficulty imagining *any* scalable cluster-storage approach that couldn’t conduct the kind of parallel recovery activity that you describe (i.e., that seems to me to have little to do with ‘segmentation’ per se, unless you define segmentation to mean *any* partitioning of responsibility that allows the system to scale out).”
Bill,
Ibrix addressed those concerns in several ways.
1. Ibrix has significant built-in redundancy to recover broken meta-connections after complete or partial segment failures. In other words, directories that are lost due to catastrophic loss of segments can be re-stitched using the contents of shadow directories.
2. Ibrix can be configured to support multiple copies/replicas of files and directories. Replication can be performed synchronously by Ibrix clients or asynchronously by the servers. In case of file system partitioning or catastrophic loss of segments, the replicas can be used.
3. Adding additional replicas or re-stitching is done in parallel, utilizing the CPU and IO busses of all servers.
4. In a SAN or iSCSI environment, Ibrix can withstand crashes or power failures of individual servers by migrating control over segments from one server to another.
Boris Zuckerman on Wednesday, 8 October, 2008 at 5:25 pm

To PBlog on August 17th, 2008 at 9:31 pm
“How does this compare with GlusterFS?”

There are some conceptual similarities between GlusterFS bricks and Ibrix segments.
However, Ibrix is a completely kernel-mode product. It’s much easier to set up and use. It delivers absolutely linear scaling.
In essence, it’s written for users, not for hackers!