A new web-scale filesystem from MaxiScale, which claims linear scaling to 50,000 nodes, has some interesting wrinkles.
Wrinkle #1 – like several of the largest Web-scale filesystems, MaxiScale does not use RAID. Instead, it replicates files among peer sets: groups of two or three disks on separate storage nodes that replicate each other’s data.
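To make the peer-set idea concrete, here’s a minimal sketch of what replication without RAID might look like. The node names, the hash-based placement and the function names are my assumptions for illustration, not MaxiScale’s actual design.

```python
# Hypothetical sketch of peer-set replication (names and layout are assumptions,
# not MaxiScale's code). Instead of RAID, a write is simply sent to every node
# in the file's peer set.
import hashlib

PEER_SETS = {
    0: ["node-a", "node-b", "node-c"],   # three disks on separate storage nodes
    1: ["node-d", "node-e"],             # peer sets may also be pairs
}

def peer_set_for(path: str) -> list[str]:
    """Pick a peer set for a path; a hash is one plausible placement rule."""
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return PEER_SETS[h % len(PEER_SETS)]

def send_to_node(node: str, path: str, data: bytes) -> None:
    """Stand-in for the actual RPC to a storage node."""
    print(f"replicating {path} ({len(data)} bytes) to {node}")

def write_file(path: str, data: bytes) -> None:
    """Write the whole file to every member of its peer set."""
    for node in peer_set_for(path):
        send_to_node(node, path, data)

write_file("/photos/cat.jpg", b"\xff\xd8...")
```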
Wrinkle #2 – is a property of peer sets: distributed metadata. Unlike a distributed lock manager, which requires a low-latency backend network for efficient management, file metadata is attached to the peer set rather than to the entire cluster.
In effect, a MaxiScale filesystem can be thought of as potentially tens of thousands of two- or three-node clusters sharing a single namespace. The file system does require a MaxiScale client, which keeps track of the location of file objects.
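Since the client keeps track of file-object locations, I imagine it behaves roughly like the sketch below: resolve which peer set owns a path, cache that, and ask only that peer set for metadata. Again, this is a sketch under my own assumptions, not MaxiScale’s real client API.

```python
# Minimal sketch of a client that tracks file-object locations itself, assuming
# metadata lives with the peer set rather than a cluster-wide service.
# All names here are illustrative.
import hashlib

class PeerSetClient:
    def __init__(self, peer_sets: dict[int, list[str]]):
        self.peer_sets = peer_sets
        self.location_cache: dict[str, int] = {}  # path -> peer-set id

    def locate(self, path: str) -> int:
        """Resolve which peer set owns a path, caching the answer locally."""
        if path not in self.location_cache:
            h = int(hashlib.md5(path.encode()).hexdigest(), 16)
            self.location_cache[path] = h % len(self.peer_sets)
        return self.location_cache[path]

    def stat(self, path: str) -> dict:
        """Ask only the owning peer set for metadata; no cluster-wide locking."""
        ps = self.locate(path)
        node = self.peer_sets[ps][0]   # any member of the set can answer
        return {"path": path, "peer_set": ps, "served_by": node}

client = PeerSetClient({0: ["node-a", "node-b"], 1: ["node-c", "node-d"]})
print(client.stat("/photos/cat.jpg"))
```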
Wrinkle #3 – MaxiScale offers three different repositories optimized for different workloads: normal (>1 MB) files; small (<1 MB) files; and a key-value/object store.
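A rough illustration of routing I/O to one of the three repository types by workload. The 1 MB split comes from the point above; the function name and return labels are invented for the example, not MaxiScale’s API.

```python
# Hedged sketch of picking a repository type by workload: normal files (>1 MB),
# small files (<1 MB), and a key-value/object store.
from typing import Optional

ONE_MB = 1 << 20

def choose_repository(size_bytes: Optional[int], is_object: bool) -> str:
    if is_object:
        return "key-value/object store"   # put/get by key, no POSIX path
    if size_bytes is not None and size_bytes >= ONE_MB:
        return "normal-file repository"   # streaming-friendly large-file layout
    return "small-file repository"        # packed to avoid per-file overhead

print(choose_repository(4 * ONE_MB, False))  # normal-file repository
print(choose_repository(12_000, False))      # small-file repository
print(choose_repository(None, True))         # key-value/object store
```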
More cool stuff
Their software also enables drive hot-swapping – critical for staying online.
They also run MapReduce to manage the cluster. A massively parallel data management tool running a massively parallel cluster. Cool.
And it runs on commodity hardware.
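The MapReduce-for-management point is fun to think about. Here’s a toy map/reduce roll-up of per-node health reports into a cluster-wide view; it is purely illustrative and says nothing about how MaxiScale actually applies MapReduce to cluster management.

```python
# Toy illustration, not MaxiScale code: a map/reduce pattern rolling up
# per-node health reports into a cluster view. Node names and metrics invented.
from collections import defaultdict
from functools import reduce

node_reports = [
    {"node": "node-a", "disk_used_gb": 620, "errors": 0},
    {"node": "node-b", "disk_used_gb": 710, "errors": 2},
    {"node": "node-c", "disk_used_gb": 480, "errors": 0},
]

def map_report(report):
    # Map: each report becomes (metric, value) pairs.
    yield ("disk_used_gb", report["disk_used_gb"])
    yield ("errors", report["errors"])

# Shuffle: group values by metric.
grouped = defaultdict(list)
for report in node_reports:
    for key, value in map_report(report):
        grouped[key].append(value)

# Reduce: aggregate each metric across the cluster.
cluster_view = {k: reduce(lambda a, b: a + b, v) for k, v in grouped.items()}
print(cluster_view)   # {'disk_used_gb': 1810, 'errors': 2}
```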
The StorageMojo take
Just when I thought HP had cornered the massively scalable market with its IBRIX purchase, MaxiScale comes along with a new take on the problem. They aren’t a general purpose file system – the required client software eliminates that – but for the web-facing file serving market they appear to have a compelling solution.
Courteous comments welcome, of course. James Hamilton shares his take on MaxiScale here. Gary Orenstein of MaxiScale writes about Small Files, Big Headaches: Ensuring Peak Performance.
Re point 1: it’s a wrong notion, one I’ve read in other places as well, that machine/server-level replication can be a complete replacement for disk-level replication like RAID1 in all types of setups. The two operate at different levels of the storage stack and have their own pros and cons. Which one, or what combination of the two, gets used should be a function of the particular deployment. I think, or rather hope, that MaxiScale is not limiting users to peer-level replication only.
Re point 2: I am probably not understanding it right, but how is the concept of a peer set a wrinkle? One would think that peer sets actually reduce the size of the domain within which metadata contention can happen.
On the other hand, it is still not clear how smaller peer sets are aggregated into a cluster without, in turn, causing metadata lock contention among the various peer sets.
“They also run MapReduce to manage the cluster. A massively parallel data management tool running a massively parallel cluster. Cool.”
Here’s hoping it does not require a PhD in CS to manage the cluster. 😉
Robin,
Just curious why you believe IBRIX is the end-all-be-all with respect to the massively scalable market VS say … Isilon? How do you segment the “massively scalable market” and what are the key properties a solution needs in order to serve it?
Thanks.
Nick
http://twitter.com/Isilon_Nick
Gary Orenstein… I wondered where he landed after Gear6… Gary, if you see this I’m saying, “Hello.”
Sounds like interesting stuff…
HP and Ibrix:
HP, having acquired not just IBRIX but /also/ PolyServe /and/ LeftHand Networks, looks like it’s assembling a powerful arsenal of IP, wouldn’t you say? I.e., it certainly looks as if it’s TRYING to corner cluster file systems, doesn’t it? (Even if they patently have not done so, YET.) Anyway, IBRIX can do blocks and I/Os. Isilon isn’t marketed for either.
A bit of a shame, that, because I think Isilon could do iSCSI passably well. Isilon’s distributed clustered architecture suggests it would adapt to SCSI I/O-redirect protocol capability naturally. An iSCSI LUN distributed across a whole Isilon cluster, anyone?
I covet, I covet. But alas, no.
Joe Kraska
San Diego CA
USA
While I’ve watched Isilon since 2001, I don’t think they’d argue that they are a web-scale or massively-scalable system ready to drive Internet-facing apps. Nor should they want to be since that is not where the market is.
For example, IIRC they currently top out at 128 nodes, which, I hasten to add, is not an architectural limit but a testing one. IBRIX claimed at least one client with multiple PB running on about 1,000 nodes. And while Isilon prices are competitive, they aren’t white box commodity prices.
Isilon might have a marketing play in the “private cloud” meme because that won’t be driven by economies of scale as much as the public cloud market. Their ease of management will be a win there, as it is in the media market today.
I like the IBRIX architecture a lot, but I’m happy to believe that something better is out there – and maybe MaxiScale is it.
Thanks, Robin – great discussion. I’m glad to see the presence of scalable architectures being discussed more and more often.
Isilon has many huge internet-scale customers – including very large photo sites, leading social networks, outsourced IT providers, etc. Many of these are well over a petabyte and some are getting to the 10’s of PBs (sometimes in the same filesystem, sometimes not).
Today’s single file system limits (deployed, not theoretical) are 5 PBs, 144 nodes, and 50 Gbps – but the more compelling reason why customers choose Isilon is not simply because of the large numbers we post, but rather, the integral business challenges we’re solving.
Ultimately Isilon customers look not only to saving money (in terms of CapEx AND OpEx) but also to accelerating productivity while minimizing business risk. If you consider the total cost of ownership, I challenge anyone to find a system which is more reliable, easier to manage and grow, and higher performing than an Isilon cluster.
Nick
http://twitter.com/Isilon_Nick
The only place I can find the claim about Ibrix running on 1000+ nodes is on this site. Can the claim be substantiated?
Jeff,
IBRIX made the claim to me in person several times. HP reiterated the claim this week at the tech day I attended. IIRC, the site they referred to was supposed to be running the IBRIX client software which maintains a client side index and reduces latency and backend chatter. They’ve been consistent about this for some time, to me anyway.
Robin
Are you sure the claim was about 1000+ servers, not clients? The well publicized IBRIX deployment at Pixar involved thousands of *rendering* servers, which are clients as far as the filesystem is concerned. Even the biggest Lustre or Panasas HPC sites (e.g. ORNL or LANL respectively) don’t have that many servers, because even with their many thousands of clients they don’t need that many. If there’s really a thousand-server configuration somewhere, especially one supporting commensurate I/O rates (at least 500GB/s), I’d expect it to be big news reported many places. It sure would be a shame to see other vendors or projects dinged for only supporting 32 or 128 servers because IBRIX had deployed 1000 clients, don’t you think?
Jeff,
Good point. I’ve got a call into a contact at IBRIX/HP. Let’s see if he can clear it up.
Robin