Building REALLY big clusters
You may be surprised to learn that Google DOESN’T build the world’s largest clusters. That honor goes to the government agencies that are Cluster File Systems Inc. customers. CFSI produces the Lustre File System, today’s high-end cluster file system, which is also available as an open source project.
Lustre stores data as objects on object storage servers, while file metadata lives on metadata servers, which can themselves be clustered for scale and uptime. This architecture is not unlike the pNFS proposal before the IETF.
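To see why the object-storage split scales, here’s a minimal sketch (not Lustre source; the names and defaults are illustrative) of RAID-0-style striping, where a file is spread round-robin across several object storage targets and a client can compute which server holds any offset:

```python
STRIPE_SIZE = 1 << 20   # assumed 1 MiB stripe size
STRIPE_COUNT = 4        # assumed number of object storage targets

def locate(offset, stripe_size=STRIPE_SIZE, stripe_count=STRIPE_COUNT):
    """Map a file offset to (target index, offset within that target's object)."""
    stripe_no = offset // stripe_size          # which stripe of the file
    target = stripe_no % stripe_count          # round-robin across targets
    object_stripe = stripe_no // stripe_count  # which stripe on that target
    return target, object_stripe * stripe_size + offset % stripe_size

# e.g. the second megabyte of the file lands on the second target:
print(locate(1 << 20))
```

The point of the design: clients ask the metadata servers for the layout once at open time, then read and write against the storage servers directly, so aggregate bandwidth grows with the number of servers rather than bottlenecking on metadata.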
How high is high?
Peter Braam, founder and CEO of CFSI, stated that they have clusters with over 25,000 nodes doing stuff that CFSI employees aren’t cleared to know. That is about 3x the size of the biggest published Google cluster size.
They also have clusters that support 25,000 clients. For Google that’s a rounding error.
With such a monster file system would you expect networking to occupy half the code? Me neither. But that’s the word from Dr. Braam. Turns out that really high-end clusters might use any of some ten networks. Let’s see: Ethernet, Fibre Channel, Myrinet, InfiniBand, Quadrics – man, there must be a lot of high-end networks I’ve never heard of.
Double your pleasure
Storage tidbit: Peter reports that with 2,000 disks he sees double-disk failures every two months. He also thinks ZFS is “beautiful” – so beautiful that he is planning to support Lustre on Solaris with ZFS.
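That rate is roughly what naive arithmetic predicts, as a back-of-envelope check shows. This assumes independent failures, an annualized failure rate of 2.5%, and a one-day rebuild window – all illustrative numbers of mine, not Peter’s:

```python
import math

n_disks = 2000
afr = 0.025          # assumed 2.5% annualized failure rate per disk
window_days = 1.0    # assumed rebuild/replacement window

failures_per_year = n_disks * afr          # ~50 single failures a year
rate_per_day = failures_per_year / 365.0
# Poisson model: chance a second disk fails before a rebuild finishes
p_overlap = 1 - math.exp(-rate_per_day * window_days)
doubles_per_year = failures_per_year * p_overlap
print(round(doubles_per_year, 1))          # ~6.4, i.e. about every two months
```

Real-world failures correlate (shared batches, vibration, heat), so the independent-failure model is the optimistic floor, which makes the observed every-two-months figure unsurprising.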
The pace is accelerating
With a Petabyte FS, Peter says Lustre can do 100 GB/sec sustained I/O supporting 25,000 clients. That is a lot of iTunes video.
He’s expecting to see the first Petaflop system in 1-2 years, with 1 TB/sec I/O growing to 10 TB/sec a few years later.
By 2020 – just over 12 years away – he expects to see Exascale computing:
- 250 million cores
- 2 million CPUs with 125 cores each
- 250 TB/sec sustained bandwidth
With Terabit Ethernet and a really big switch fabric, I suppose you could move that much data. 10 Tb Ethernet would make it more manageable.
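The arithmetic behind that, using the exascale bandwidth figure above and the hypothetical Ethernet generations as link speeds:

```python
target_bytes = 250 * 10**12     # 250 TB/sec sustained bandwidth
target_bits = target_bytes * 8  # = 2,000 Tb/sec

for link_tbps in (1, 10):       # Terabit vs. 10 Tb Ethernet links
    links = target_bits / (link_tbps * 10**12)
    print(f"{link_tbps} Tb/sec links needed: {links:.0f}")
```

Two thousand Terabit links versus two hundred 10 Tb links – either way the switch fabric, not the link speed, is the hard part.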
This is for you, ZFS team
With clusters this large, the disconnection and subsequent reintegration of nodes is a serious problem. Peter recommends that versioning become a standard part of cluster file systems because it helps keep everyone coordinated. I’d just like to have versioning so I know what I sent to people, or backed up, or just lost. Most people aren’t familiar with the concept, but I love it.
The StorageMojo take
After his informative and well-delivered talk I asked Peter if he expected pNFS to displace Lustre in the market. At the low end, yes, once adoption gets under way. But he is confident that CFSI and Lustre will continue to own the high end. They will support pNFS anyway, so they’ll be playing there as well.
Clearly, Lustre has capabilities that will keep it attractive at the very high end. Yet CFSI is missing an opportunity to build a volume business by not going after the sub-100-node cluster market, which will become much more common in the enterprise over the next several years.
Comments welcome. More on the conference coming soon.