What about Sun’s acquisition of Cluster File Systems, Inc.?
Yawn. CFSI was going out of business. Sun bought the assets, not the company.
Good for CFSI employees
They get a paycheck from a solvent company. They may even get some sensible marketing. Hey, it could happen.
What is Lustre?
Arguably the highest-end parallel file system. At the Seattle Conference on Scalability, founder Peter Braam spoke about current 25,000-node Lustre clusters and plans to grow that number 10x in the next 5 years.
Update: It appears the Lustre.org and Lustreusers.org sites are suspended. Hm-m-m? Update II: They are back up.
Cool, huh?
So why aren’t they rich?
CFSI was a tech playpen, not a company. Like Formula 1 racing. Instead of Ferrari, CFSI had the national labs backing them. Great stuff, except nobody else has the problems the national labs have, so it limits the market.
Lustre will be facing some serious competition from pNFS once it gets baked into Linux and other operating systems. The fast-growing commercial HPC market will eat pNFS clusters up. Lustre isn’t part of that.
The StorageMojo take
Sun bought a hook into a customer base that, when budgets are good, can be very profitable. They also bought a technical team that is very knowledgeable about fabric interconnects, which in the shift to cluster storage and grids will be a very good thing for Sun.
Comments welcome, as always. OK, Lustre proponents, tell me where I’m wrong.
As one of the architects for the EMC technology underlying pNFS, currently working for a commercial-HPC company, I wouldn’t be so sure that those particular customers will eat pNFS up. The design point for pNFS is an environment where *everyone* has their own connection to the storage. People who make really big clusters either can’t or don’t want to put an HBA in every node. They can do bridging between the cluster interconnect and the SAN, or virtual-HBA tricks, but those are less preferred routes for many reasons.

The Lustre design point, where some small percentage (but still a large number) of nodes have a connection to storage and talk to the rest over the interconnect that’s already there for MPI and such, is a much better fit for those environments. When you consider caching that becomes even more true. That’s not to say Lustre is perfect or unique, but pNFS probably won’t be – or be perceived as – a compelling alternative.
If I understand the pNFS design from the RFC drafts right, it will come in several flavours – only one of them being blocks/FC.
In the other cases (files/objects) it can use existing Ethernet or IB as the interconnect.
(FC seems to be completely dead in the HPC space – in favour of IB or 10gigE)
Lustre being owned by Sun has no real implications for the market because, as you wrote, the main users of Lustre are the US government/labs/universities – and for them you’d better deliver what they want, not what your marketing blokes think 😉
But yes – Lustre solves problems nobody really has – I agree 🙂
(and it doesn’t solve a lot of problems everybody else has)
Jeff, thanks for the input. I think we’re in agreement.
My point was not that the natlabs would go to pNFS – they have huge jobs and need all the help Lustre can give them – but that the commercial/enterprise high performance computing market would go pNFS in a big way. The bulk of the commercial HPC market is in the dozens to hundreds of nodes, not thousands to tens of thousands.
BTW, I checked out the company you work for, SiCortex, and downloaded some whitepapers. Very impressive crew.
Robin
Jeff, what SAN? Why not just attach some pNFS OSTs (or whatever they are called) containing local disks (e.g. Thumper) directly to the cluster interconnect?
Good question, Wes, but let me turn it around. Why reinvent OSTs when Lustre already has them? Why sort out all of the issues around things like buffer handling and interconnect-abstraction layers for a client-to-OST-to-storage model, when those things are already done and tuned for several interconnects in Lustre? Sure, you can do things to pNFS to move it from its original design point to the one we’re talking about, but somewhere along the way you’ll realize that you’ve turned it into Lustre – which was already open source. That’s not entirely insane, but it won’t motivate anyone to change horses.
At a deeper technical level, as the person who wrote more of the original FMP spec than anyone else, I think I can get away with saying that it might not be the right kind of protocol for this scale. The caching and fault-recovery models weren’t designed for it, and the pNFS specs I’ve seen don’t seem to have changed as much as I would have expected to accommodate such a paradigm shift. Maybe very little adaptation was necessary, but I think I’d be accused – rightly! – of arrogance if I made that my first assumption.
Robin, as for dozens-to-hundreds vs. thousands, I really don’t know. To some extent it depends on how you define HPC. What I can say is that some of the issues we’ve talked about apply even before a hundred. Adding an HBA to a node increases the per-node cost by a non-negligible percentage. If your hundred-node cluster only generates about ten HBAs’ worth of I/O at peak, buying a hundred HBAs because that’s the way your cluster filesystem works isn’t going to go over very well – especially if your I/O need is only during one phase of the application and those nodes could otherwise be used for computation. Sooner or later, somebody’s going to ask why you didn’t use another filesystem and use the money to buy more nodes, and I for one wouldn’t want to be the one trying to answer that.
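To put rough numbers on Jeff’s hundred-node example, here is a quick back-of-envelope sketch. The prices are illustrative assumptions, not anything Jeff quoted, but the shape of the argument holds at any realistic price point:

```python
# Back-of-envelope for the hundred-node case above. All prices are
# illustrative assumptions, not vendor quotes.
nodes = 100
hba_cost = 800           # assumed cost of one HBA (switch port extra)
node_cost = 3_000        # assumed cost of one compute node
peak_io_hbas = 10        # peak I/O equals roughly ten HBAs' worth

# Design point where every node attaches to storage directly:
everyone_attached = nodes * hba_cost
# Routed design point where only a few I/O nodes attach (Lustre-style):
io_nodes_only = peak_io_hbas * hba_cost
saved = everyone_attached - io_nodes_only

print(f"HBAs in all {nodes} nodes: ${everyone_attached:,}")
print(f"HBAs in {peak_io_hbas} I/O nodes:  ${io_nodes_only:,}")
print(f"Savings buy ~{saved // node_cost} more compute nodes")
```

At these assumed prices the all-attached design spends $80,000 on HBAs against $8,000, and the difference buys roughly two dozen more compute nodes, which is exactly the question nobody wants to be asked.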
pNFS does hold a lot of promise.
From another perspective: parallel filesystems are distributed systems tightly integrated into the operating system, and they are complex things. Having seen many parallel-filesystem crashes and issues, I can say the maturity of such products definitely shows.
The real interest, I think, is that Sun has done an excellent job with ZFS and its marketing, but lacked a modern parallel I/O solution (now solved). It’d be neat to see how this technology pairing goes.
Robin,
SiCortex… a very impressive crew and a very impressive machine.
PCIe cluster interconnect… no ‘commodity’ parts or single-source IB technology.
They (Lustre developers) have already started a discussion about getting Lustre+ZFS done. See http://www.opensolaris.org/jive/thread.jspa?threadID=39234&tstart=0
The Lustre developers have been working on porting Lustre on top of ZFS since at least early this summer (probably one factor leading to their purchase by Sun). Some stuff:
http://arch.lustre.org/index.php?title=Feature_Lustre_ZFS
http://arch.lustre.org/index.php?title=Architecture_ZFS_for_Lustre
Apart from the (somewhat stratified) HPC world, one interesting thing they’re working on is pCIFS, i.e. parallel fileserving for Windows clients. See
http://arch.lustre.org/index.php?title=CTDB_with_Lustre
http://arch.lustre.org/index.php?title=Feature_pCIFS
It wouldn’t be surprising if Sun is very interested in that part as well. Package it up with the X4500 Thumper as a scalable and (compared to the traditional SAN-based cluster filesystem competition) affordable NAS appliance. Need more storage, or more bandwidth? Just buy another Thumper, connect it to the Lustre/pCIFS cluster and boom: instant capacity _and_ performance improvement for the clients.
BJ: all the links you posted are currently down. Shame, they sounded interesting!
(Another earlier submission that your system quietly threw away:)
It’s amusing that you think CFS didn’t qualify as a ‘company’ – perhaps you define a ‘company’ as an organization which feels compelled to ship product according to the short-term desires of its investors rather than when it’s ready. CFS, by contrast, took the time required to design and implement its (rather complex and ambitious) product adequately, and is now reaping the rewards of having done so (both at least modest monetary rewards and the satisfaction its owner-engineers can derive). No, it’s not the “take the money and run” kind of start-up that has become fashionable of late, but at least some of us think all the better of it for that.
What Sun gets from CFS is the following:
1. A working, fairly mature implementation of something quite close to pNFS (certainly close enough to massage into a pNFS-compliant product in far, far less time than it would take to build one from scratch, though it seems they have a project afoot there as well) – including facilities superior to those apparently specified in pNFS (e.g., mirrored metadata servers for uninterrupted availability, plus the hooks for clustered metadata servers to support far greater scalability than any single metadata server can).
2. Developers with world-class (perhaps even unparalleled) very-large-scale file-storage experience (and certainly at least some of the most knowledgeable very-large-scale file-storage architects/engineers in the world).
3. A system almost tailor-made to allow ZFS to add noticeable value to it (Peter Braam recently commented on how well ZFS should function as the base storage for the individual Lustre nodes – I suspect largely because it takes so much of the manual labor out of node-level storage changes in a system with very large node counts).
4. A potential solution to the problem of extending ZFS to support cluster-style file service (ZFS itself is not particularly amenable to such extension, save perhaps by the same kind of separation of data from metadata that Lustre already provides and pNFS eventually will but without Lustre’s metadata clustering facilities).
In other words, a state-of-the-art solution to a significant gap in their current storage line-up, with considerable future-proofing built in as a bonus.
What neither Lustre nor pNFS provides is extreme scalability across all kinds of file storage requirements. For example, ‘layout’ information for extremely large files hits scaling limits if the individual file pieces are small enough (a few MB max) to allow extreme parallelization (the layouts simply become too large to manage) – likely one of the reasons that the recent scalability conference you attended observed that algorithmic data distribution (e.g., via consistent hashing) was the way to go (ease of reorganization when required is another).

Centralized metadata and mapping also imply centralized allocation management, which doesn’t scale all that well either (even if individual servers take on some of the load by ‘objectifying’ their storage, and that has the downside of adding an additional level of allocation – and mapping – to the overhead) – a third good reason to take the algorithmic distribution route.

Separating data from metadata costs extra accesses and/or network hops when the data is not all that large (say, up to at least a few MB) – but adding a few MB to the size of each file’s metadata server footprint to allow modest-sized files to reside there as well clearly isn’t possible in Lustre/pNFS-style designs that centralize metadata on a small subset of the system: you really want the metadata spread across *all* the servers, which has the added advantage of efficiently supporting installations where all files are relatively small and providing good performance in installations where metadata activity may dominate the access mix.
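To make the layout-size point concrete: a 1 PB file striped in 4 MB pieces implies on the order of 250 million layout entries, which no client or metadata server wants to ship around or keep consistent. Algorithmic placement sidesteps that. Below is a minimal consistent-hashing sketch (purely illustrative; the server names, hash choice, and ring parameters are assumptions, not anything Lustre or pNFS actually implements) in which any client computes a stripe’s location directly, so there is no per-file layout table at all:

```python
# Minimal consistent-hashing sketch of "algorithmic data distribution."
# Illustrative only: server names, hash, and ring size are assumptions.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, points_per_server=100):
        # Many ring points per server even out load and ensure that
        # adding/removing a server moves only a small fraction of data.
        self._ring = sorted(
            (self._hash(f"{s}#{i}"), s)
            for s in servers
            for i in range(points_per_server)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def locate(self, key: str) -> str:
        # Any client computes placement directly; there is no central
        # layout table to fetch, grow, or keep consistent.
        i = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[i % len(self._ring)][1]

ring = ConsistentHashRing([f"oss{i}" for i in range(8)])
# Place the 4 MB stripes of a big file with no per-file layout map:
for n in range(4):
    print(f"stripe {n} -> {ring.locate(f'/scratch/bigfile:{n}')}")
```

Adding a ninth server to the ring relocates only about one-ninth of the stripes, which is the ‘ease of reorganization’ advantage mentioned above.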
But Lustre is still about as good as it gets today (with possible apologies to Storage Tank, which also merits respect) and eminently suited for modest-scale installations as well as intermediate-scale ones (as evidenced by their cooperative deals with the likes of HP as well as the national labs). Sun got a pearl – let’s hope they can appreciate and cultivate it.
– bill
Bill,
No, I define a company as an entity that has a viable business plan. I don’t think CFS ever did. Like Cray and Thinking Machines before them, they tried to build a business out of the national lab market, which is a very tough thing to do since those customers are at the mercy of Congress.
I don’t doubt the quality and expertise of the CFS team. What I have heard is that Lustre, like IBM’s GPFS, requires deep expertise to install and tune.
Most enterprise HPC uses fewer than 250 servers – not 25,000 – and is managed by mere mortals, not physics PhDs.
The bottom line is that if Sun wants to grow Lustre into a commercial product, they’ll have to get their new team to reverse course and focus on the boring issues of usability, not ever-greater scale. Sun could do it. The question is: will they?
Robin
Au contraire: CFS’s business model was to get a small set of big customers with special requirements to fund early development of a system that had far more general-purpose applicability down the road – without giving up control of the product to investors (or some ‘angel’ parent corporation) in the process. In that, they succeeded admirably, though I suspect that customer support consumed more development resources over time than they had expected, slowing the product’s planned evolution.
I can believe that tuning Lustre is a challenge, and that’s at least in part a fundamental design flaw inherent in choices like limiting metadata to a small subset of the servers and using explicit rather than algorithmic mapping above the individual server level (explicit mapping within a single server, e.g., as ZFS practices it, is fine). Networking can also require more careful tending at the scales they operate at (perhaps also exacerbated by the heterogeneity of the data/metadata distribution). However, these problems are far less critical for the medium-size systems more characteristic of commercial needs: while Lustre may not be any *easier* to manage there than some of its competition (nor as easy as it could be with different design choices), it should at least be competitive with most of them in this respect – and at least equally amenable to the creation of automated heuristics that compensate for the underlying inherent problems (though, as I noted before, likely not sufficiently to allow the near-infinite scalability that a different approach might have).
– bill
Bill,
Many folks have trod the road of “a small set of big customers with special requirements to fund early development” and few succeed. Why?
It is a perfectly reasonable strategy on the face of it, but the basic problem is that it is a lot harder to come down the pyramid to a higher-volume market than it is to climb up into a more specialized market.
The big customers have very interesting problems. People develop relationships. The product is tuned for knowledgeable users. The deals are big and prestigious. It is a very comfortable world. Corporate quicksand that sucks you in.
The net-net is that after 5 years, CFSI sold its assets to Sun, which is a polite way of saying it didn’t have any business value beyond those assets.
The product is great. The marketing thought behind it less so.
Robin
I’m afraid you’ve got it backwards in this case: the reason there was a niche for CFS to fill in the first place was because it’s damn difficult to retrofit a more limited product to address the up-market needs of truly scalable storage (as everyone else has been finding out for decades).
CFS had a goal: build something of unique value that its founders were interested in building and get paid for it. It succeeded in that goal. However, during this success I suspect that its principals discovered that (being engineers, after all) they weren’t all that interested in marketing the product they had created more widely, or in devoting the amount of time to customer hand-holding that it required.
So they found a buyer who *was* interested in those aspects, and can now get back to doing what they enjoy: win for them, win for Sun, and a great deal easier than trying to bring in additional executive and marketing talent (an endeavor likely far outside their realm of expertise) just to keep the company name.
Some people just don’t *like* non-technical wheeling and dealing, Robin: they consider it at best annoying overhead, and are more than happy to sacrifice some income if it means they can avoid it. It can be argued that those who manage public corporations have a fiduciary obligation to their stockholders to put up with such annoyances in order to maximize stockholder return, but in the case of a privately-owned corporation such as CFS no such obligation exists: if the company satisfies its owners’ goals and meets its customers’ needs, it’s successful – period. And the fact that Sun found it worth buying indicates that it’s considered potentially successful in a more conventional sense as well.
– bill
I work in the HPC vendor industry. Here are my thoughts:
I was not surprised CFS was acquired. For the last year I thought CFS would be acquired, but I expected a company like Platform Computing to buy it. When that did not happen, I assumed the reason CFS had not been acquired was that potential suitors thought Lustre was doomed to be disintermediated by pNFS. Given Sun’s work on NFS v4 and pNFS, and some statements by Sun executives at their analyst conference last February, I was stunned when Sun announced they were buying CFS.
CFS has definitely had success outside of the national labs, but it is still purely a research HPC play. The success of CFS was driven mostly by disappointment with alternative cluster filesystems. Many of my HPC customers use CFS, and many of those who do not are planning to. But from a significant minority of customers, I have seen interest in pNFS. This interest is driven by the desire for an open, standardized alternative to the myriad of parallel storage solutions. The belief is that an open, standardized alternative will have enough critical mass that bugs will be fixed, required features will be added, and the pain of implementing cluster filesystems will be reduced. pNFS can run purely on Ethernet, and these customers were considering Ethernet transports.
But there is also an emerging class of I/O-bound HPC customers, mostly in EDA and oil-and-gas seismic analysis, who are ideal pNFS candidates. These customers are exceeding the bandwidth of GigE connections. Today, they are looking at NFS and cluster filesystems attached via InfiniBand and 10 Gigabit Ethernet. They are not considering SAN-based cluster filesystems. These customers are ideal for pNFS over 10GE or IB. In the case of pNFS over 10GE, pNFS over IP with TOE-equipped NICs on the clients might provide enough performance. iWARP RDMA is an option, but most of the pNFS-over-RDMA work is in IB right now. In the case of pNFS over IB, there is considerable development effort from several vendors.
So I tend to agree with Robin that pNFS will displace Lustre and other parallel file systems in HPC. Many of these customers like to try the latest technologies when they deploy a new cluster.
Which makes Sun’s acquisition of CFS all the more of a head-scratcher. Unless Sun figures there are parts of CFS’s assets that will benefit its plans around network-attached storage.
I worked for an “HPC customer” (seismic). They chose Lustre because it was cheap. But it was no better than the commercial SAN products – it still chokes a few times a week. I look forward to seeing pNFS in action.