Just when everyone agreed that scale-out infrastructure built from commodity nodes of tightly coupled CPU, memory and storage is the way to go, Facebook’s Jeff Qin, a capacity management engineer – in a talk at Storage Visions 2015 – offers an opposing vision: disaggregated racks. One rack for compute, another for memory and a third – and fourth – for storage.
The rationale: applications need different amounts of each resource over time. Having thousands of similarly configured servers ignores this fact and leads to substantial – at FB scale – waste.
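To make the stranding argument concrete, here’s a toy back-of-envelope sketch – the workload figures are invented for illustration, not Facebook’s – showing that a fleet of identical servers has to be sized for the largest per-resource demand, while disaggregated pools only need the sum of actual demand:

```python
# Toy model of stranded capacity: identical servers vs disaggregated pools.
# All workload figures below are hypothetical, purely for illustration.

workloads = [
    # (cpu_cores, ram_gb, storage_tb) demanded by each service
    (32, 64, 2),    # CPU-heavy service
    (8, 256, 4),    # memory-heavy service
    (4, 32, 40),    # storage-heavy service
]

# Identical servers: every box must cover the largest demand for each resource.
per_server = tuple(max(w[i] for w in workloads) for i in range(3))
fixed_fleet = tuple(r * len(workloads) for r in per_server)

# Disaggregated pools: each resource rack only needs the summed demand.
pooled = tuple(sum(w[i] for w in workloads) for i in range(3))

for name, fixed, pool in zip(("CPU cores", "RAM (GB)", "storage (TB)"),
                             fixed_fleet, pooled):
    stranded = 100 * (fixed - pool) / fixed
    print(f"{name}: fixed fleet {fixed}, pooled {pool} (~{stranded:.0f}% stranded)")
```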
They’ve also found that different components reach functional obsolescence at different rates. Refreshing hardware at the rack level is simpler than opening thousands of servers and replacing dusty bits.
Enabling this dramatic change is their new network. No details on this network, but it must offer high bandwidth and extraordinarily low latency.
Another rack resource coming soon: optical cold storage racks starting at 1PB and expected to go to 3-4PB with the advent of 400GB optical discs.
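Back-of-the-envelope, taking the quoted figures at face value (decimal petabytes and the 400GB-per-disc number), the disc counts per rack work out to a few thousand:

```python
# Rough disc counts for the optical cold-storage racks described above.
# Assumes decimal units (1 PB = 1,000,000 GB) and the quoted 400 GB per disc.
disc_gb = 400
for rack_pb in (1, 3, 4):
    discs = rack_pb * 1_000_000 // disc_gb
    print(f"{rack_pb} PB rack ≈ {discs:,} discs")
```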
The StorageMojo take
Holy disaggregation, Batman! The hooded crusaders at Facebook are roaring out of the Zuckcave with architectures blazing. Maybe hyperscale is even odder than we imagined.
What does this mean for the rest of us? A first approximation: very little.
Facebook is an amalgam of services with very different requirements: instant messaging; friend news feeds; gaming; video; long-term photo storage; and oodles of advertising and user tracking.
An Amazon home page draws on over 100 distributed asynchronous services, but the focus is your shopping cart and payments. Facebook is, in comparison, a realtime feed mashed up with a massive personal archive.
Facebook is popular culture and its application resource requirements reflect that. Apps, like memes or fads, ebb and flow with users’ whims. Search, by contrast, is almost static.
To the extent that there is a larger lesson, it’s the network that FB has designed. If they can actually make disaggregation work, the network is key.
The advantages of stripped down, warehouse-optimized LANs recall the earlier battle between RISC and CISC in CPUs. Simpler, cheaper and faster vs complex, costly and slower.
That is an idea with legs.
Courteous comments welcome, of course. As is traditional, Internet access at CES is spotty.
> What does this mean for the rest of us? A first approximation: very little.
Agree. Loading/unloading pictures and scrolling through timelines would be okay with CPU in one rack and memory in another rack – I guess. I can’t see this winning out in general, as the TB-scale databases a business trawls wouldn’t tolerate that latency.
I’m sure they are going to use the blazing-fast OpenPOWER POWER8 CPUs. Time will tell.
This is the direction that makes sense for the web giants, which are very few in number but which predominantly build their own infrastructure. The impact on the rest of the world is negligible, and that’s why for the traditional IT dept a Hyper Converged offering is a viable solution. That said, I see a trend for a segment of customers at the high end of the Enterprise looking to do something similar by adopting Rack Scale dynamics but leveraging technologies like Moonshot and Cisco M-Series for hyperscale/rack-scale implementations. The challenge there is the robustness of the storage complement. Most vendors are currently focusing on disaggregated compute.
Well, this idea isn’t new. It sounds like the mainframe and unix world all over again, but on a bigger scale. Take a mainframe rack. Need more memory? Hot-add some memory cards. More processing power? Hot-add a cpu-card. More I/O? Add some pci cards. The interconnect is already there, so everything connects together. That’s what FB had to reinvent, and apparently, they succeeded. But new technology? No, I don’t think so. Only the scale is new.
FB just posted a long article on their new network.
https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/
I think in the short term it will probably mean little change; as already hinted, the method of transport across the data centre will be the focal point, and latency challenges specifically will be the K2 to climb.
For the long term, thinking of vast hyperscale deployments, isn’t this what a successful public cloud is? Removing a layer of software that needs to run on every hypervisor would be very beneficial, I suspect.
There was a project in Open Compute in 2011 that was starting to tackle this; the project eventually got split into two working groups, but it is really encouraging to see Facebook still talking about and working on this.
I think IBM also had some ideas that resonated along the same lines – but it was not in any way at the same scale as this. From memory that would have been the x-server and being able to daisy-chain multiple servers together to create one large logical unit.
Looks like Qin was describing something further into the future as the new network makes no mention of separate memory racks. That or the new network wasn’t described/interpreted correctly.
This trend has existed for a long time – look at Intel Rack Scale Architecture, or even the (discontinued) Hecatonchire project I started at SAP almost 5 years ago. Slide deck: http://www.slideshare.net/blopeur/project-hecatonchire
My most recent blog posts about it are here: http://www.reflectionsofthevoid.com/2014/09/from-converged-infrastructure-to.html
and here: http://www.reflectionsofthevoid.com/2014/11/on-emergence-of-hardware-level-api-for.html
When choosing a way ahead between disaggregation and hyperconvergence or any other technology, you have to analyze your business needs thoroughly. What data mix do you want to store, short-term vs long-term needs, workloads, security… there’s no technology that is going to save you from doing some hard-headed thinking.
Their traffic growth chart (in the video) of “machine to user” and “machine to machine” traffic is starting to look like an HPC cluster.
Don’t see this making much difference in the long haul of aggregation techniques, but I could be wrong.
== Disclaimer: Pure Storage Employee ==
Robin – thanks for sharing Jeff’s message. Hyperconverged is an exciting technology that I believe is seeking to find its sweet spot in the market. Will it flourish at mass-scale or is it more apropos for ultra-small deployments?
Hyperconverged allowed web giants to address the IO bottlenecks of the early 2000s. With all-flash arrays as the new norm in general-purpose shared storage, this pain point may no longer exist.
Thanks for reminding all of us that innovation never sleeps.
— cheers,
v
Intel is developing rack scale server technology that does exactly this. Shelves of CPUs, memory, storage and networking in a coherent rack architecture. Today it uses Infiniband to connect the pieces together (pretty stupid) although Ethernet is starting to creep into the model.
The PCIe 4.0 bus runs at only 15.75 Gbps per lane and supports up to 16 lanes, for roughly 252 Gbps. That could easily be handled by standard Ethernet silicon today.
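A quick sanity check of those figures (assuming PCIe 4.0’s 16 GT/s signalling rate and 128b/130b encoding):

```python
# Sanity check of the PCIe 4.0 figures quoted above.
raw_gt_per_s = 16.0        # PCIe 4.0 signalling rate per lane (GT/s)
encoding = 128 / 130       # 128b/130b line-code efficiency
lanes = 16

per_lane_gbps = raw_gt_per_s * encoding   # ~15.75 Gbps usable per lane
link_gbps = per_lane_gbps * lanes         # ~252 Gbps for a x16 link
print(f"per lane: {per_lane_gbps:.2f} Gbps, x{lanes} link: {link_gbps:.0f} Gbps")
```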
I’m building multi-terabit non-blocking Ethernet backbones today for a large scale HPC customer at very low cost.
So this is a “when” not “if” in my book.