Fat trees and skinny switches

by Robin Harris on Sunday, 24 August, 2008

Jim Gray, in his paper Distributed Computing Economics, (pdf) noted that there is a rough price parity between 10 bytes of network traffic and a megabyte of disk bandwidth. One of his conclusions: computing has to be as close to the data as possible in order to avoid expensive network traffic.

The paper was written in 2003 and the numbers may not be as accurate today. What if we could change that relationship? How would that impact data center architectures?

In their paper, A Scalable, Commodity Data Center Network Architecture, (pdf) Mohammad Al-Fares, Alexander Loukissas and Amin Vahdat, three UC San Diego computer scientists, present an architecture that may do just that. They propose to leverage commodity Ethernet switches to support the full aggregate bandwidth of clusters.

They claim the approach requires no changes to the host network adapter, operating systems or applications. And it works with today’s Ethernet, IP and TCP protocols.

The problem
The record growth in massive compute and storage clusters is a tribute to the rapid increases in CPU power and disk capacity. Local area network bandwidth, however, has not kept pace. While it’s possible to aggregate link bandwidth by adding network interface cards, that solution doesn’t scale.

The authors note that inter-node bandwidth is a common bottleneck in large-scale clusters. The typical application efficiency of MPI clusters of 10-15% is stark evidence of this problem.

The paper seeks to:

. . . design a data center communication architecture that meets the following goals:

  • Scalable interconnection bandwidth: it should be possible for an arbitrary host in the data center to communicate with any other host in the network at the full bandwidth of its local network interface.
  • Economies of scale: just as commodity personal computers became the basis for large-scale computing environments, we hope to leverage the same economies of scale to make cheap off-the-shelf Ethernet switches the basis for large-scale data center networks.
  • Backward compatibility: the entire system should be backward compatible with hosts running Ethernet and IP. That is, existing data centers, which almost universally leverage commodity Ethernet and run IP, should be able to take advantage of the new interconnect architecture with no modifications.

Can they do it?
The San Diego team relies on a network design that was first developed more than 50 years ago by Charles Clos. They chose a particular instance of his topology called a fat tree topology.

The essential difference between current data center architectures and the fat tree architecture is that the aggregation layer is made up of two layers of smaller switches.

Here is their diagram of today’s data center architecture:

Here is their diagram of the fat tree architecture, showing how a packet would be routed:

The key to making this two-layer aggregation level work is the two-level routing table. As the team describes it:

. . . we modify routing tables to allow two-level prefix lookup. Each entry in the main routing table will potentially have an additional pointer to a small secondary table of (suffix, port) entries. A first-level prefix is terminating if it does not contain any second-level suffixes, and a secondary table may be pointed to by more than one first-level prefix. . . .

This two-level structure will slightly increase the routing table lookup latency, but the parallel nature of prefix search in hardware should ensure only a marginal penalty. This is helped by the fact that these tables are meant to be very small. . . . the routing table of any pod switch will contain no more than k/2 prefixes and k/2 suffixes.
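To make the two-level lookup concrete, here is a minimal Python sketch of the idea as I read it from the quote above. It is not the authors' implementation: the function name `lookup`, the table layout, the 10.x.x.x addresses and the port numbers are all hypothetical values chosen for illustration. A first-level entry either names an output port directly (a terminating prefix) or points to a secondary table of (suffix, bits, port) entries that are matched against the low-order bits of the destination address.

```python
from ipaddress import ip_address, ip_network

# Hypothetical secondary table: (suffix value, number of low-order bits, port).
# Here the last byte of the destination picks the uplink port.
UPLINKS = [(2, 8, 2), (3, 8, 3)]

# Hypothetical main table: (prefix, target). An int target is a terminating
# prefix; a list target is a pointer to a secondary (suffix, port) table.
TABLE = [
    ("10.2.0.0/24", 0),
    ("10.2.1.0/24", 1),
    ("0.0.0.0/0", UPLINKS),
]

def lookup(table, dst):
    """Return the output port for dst, or None if nothing matches."""
    addr = ip_address(dst)

    # First level: longest-prefix match over the main table.
    best_net, best_target = None, None
    for prefix, target in table:
        net = ip_network(prefix)
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_target = net, target
    if best_net is None:
        return None

    # Terminating prefix: no secondary table, so the port is final.
    if isinstance(best_target, int):
        return best_target

    # Second level: longest-suffix match on the low-order address bits.
    value = int(addr)
    best_port, best_bits = None, -1
    for suffix, bits, port in best_target:
        mask = (1 << bits) - 1
        if (value & mask) == (suffix & mask) and bits > best_bits:
            best_port, best_bits = port, bits
    return best_port

print(lookup(TABLE, "10.2.0.3"))  # matches a terminating prefix -> 0
print(lookup(TABLE, "10.3.0.3"))  # falls through to the suffix table -> 3
```

The software loops here stand in for what the paper expects to be a parallel hardware search. The point of the sketch is only the shape of the data: a handful of prefixes, with one of them fanning out by suffix.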

Power and cooling
Another advantage of the fat tree architecture is that low-cost switches use less power and require less cooling per port than the big iron switches. The team found that 10 GigE switches consume roughly double the watts per gigabit of bandwidth and dissipate roughly 3 times the heat of commodity GigE switches per Gbit.

The difference looks like this:

Experimental implementation
The researchers configured a small test bench with 16 virtual hosts, 4 pods each with 4 switches, and 4 core switches. While this configuration has 20 switches and 16 hosts, in a larger cluster the number of switches will be smaller than the number of hosts.
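A quick back-of-the-envelope sketch shows how the switch and host counts diverge as the switches get bigger. It uses the (k^3)/4 host count that co-author Amin Vahdat gives in the comments, and it assumes the test bench layout generalizes the way I understand the paper's topology: k pods of k switches each, plus (k/2)^2 core switches. The function name `fat_tree_size` is my own.

```python
def fat_tree_size(k):
    """Return (hosts, switches) for a fat tree built from k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even port count")
    hosts = k ** 3 // 4            # (k^3)/4 hosts, per the co-author's comment
    pod_switches = k * k           # assumed: k pods, each with k switches
    core_switches = (k // 2) ** 2  # assumed: (k/2)^2 core switches
    return hosts, pod_switches + core_switches

for k in (4, 48, 64, 96):
    hosts, switches = fat_tree_size(k)
    print(f"k={k}: {hosts} hosts, {switches} switches")
```

With k=4 this reproduces the test bench (16 hosts, 20 switches). With 48-port switches it gives 27,648 hosts on 2,880 switches, roughly ten hosts per switch.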

They found that with a couple of modest enhancements to the fat tree topology they were able to achieve worst-case bandwidth of over 87% of the ideal bisection bandwidth for the model. The worst-case bandwidth for the standard tree topology was just under 28%.

The team also considered the problem of cabling together small switches. In the context of a maximum-capacity 27,648-node cluster using 48-port switches, they found that by placing all the switches in a single rack they could minimize connections between the edge and aggregation layers.

While not as pretty a solution as the costly internal bandwidth of big switches, the cabling issues are workable.

Other fat-tree architectures
Fat tree architectures have been around for several decades and were first used for telephone switching. Other examples are systems from SGI and Sun’s 3,456 port InfiniBand switch. What is new is the authors’ method of routing Ethernet traffic in the system and the optimizations they suggest.

The StorageMojo take
The persuasive power of low cost commodity architectures has been well demonstrated in the CPU and storage areas. Networks have remained a special case.

What this paper demonstrates is that lower cost network infrastructures are possible. Google’s infrastructure makes common use of unmanaged commodity switches, but they’ve made a fetish of keeping data close to processors, something not always possible in general-purpose clusters.

I suspect that Cisco’s analyst relations team will have new marching orders this week. This is not a message Cisco will want customers to hear.

The power efficiency argument is not going to be that persuasive. The real win is in the dollars. And that’s all you need to get a CIO’s attention in these difficult times.

Courteous comments welcome, of course. This paper was presented at SIGCOMM ’08, as was a paper on a related topic, Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises. The latter paper looks at a new Ethernet protocol that enables building networks without the flooding, broadcasting and spanning tree techniques that force the use of IP routing in large LANs.

And thanks to Wes for helping me stay informed.

{ 6 comments… read them below or add one }

Anonymous August 24, 2008 at 8:00 pm

“fat tree”? “fact tree”? Which is it?

Robin Harris August 25, 2008 at 12:29 am


You’ve noted a “wordo” – an artifact of learning to write with my mouth.

I’ve started using dictation software. The spelling is good, but it misses entire words and I haven’t yet figured out how to catch those.

Please bear with me.


Amin Vahdat August 25, 2008 at 7:03 am

Thank you very much for your note and for the article. I think that the article is well done. As co-author of the original article, I have a few comments:
– You mention that “The power efficiency argument is not going to be that persuasive. The real win is in the dollars.” I would argue that, if you care to, you can reduce the power efficiency argument to dollars. For PCs in a data center, the cost to cool the machines dominates the hardware cost over a 2-3 year period. There will be increasing pressure on the IT industry both from within (dollars) and without (carbon footprint, regulation, etc.) to limit energy usage.
– You describe a “maximum capacity 27,648 node cluster using 48 port switches”. In fact, we based most of our discussion on 48-port switches because they currently offer the most performance per dollar and are a good match to current rack sizes in the data center. However, one could use 64-port switches in a fat tree topology to build out a 65,536 node cluster or 96-port switches to build out a 221,184 node cluster. One nice property of a fat tree is that the number of hosts that it supports grows with the cube of the number of ports in the baseline switching element. More specifically, using k-port switches, a fat tree can support (k^3)/4 hosts.
– Finally, I think that one of the compelling arguments for this approach is that it does not rely upon aggregation to higher speed links to achieve its scale (relative to traditional tree-based techniques). Thus, if you wanted to build out a cluster with 10 GigE all the way to the edge servers, you would not have any options today using traditional techniques because there is no standard for higher-speed Ethernet (e.g., 40 GigE) yet available. Even when it does become available, available switches will have low port density and, typically, prohibitive costs. The fat tree approach uses identical switching elements throughout the structure, meaning that it could deliver 10GigE to the edge even today.

Som August 25, 2008 at 10:59 am

You can achieve the same using existing switches using ECMP. The first level switches do not even need to run IP – they can be L2 switches.

HPC clusters have been using two and multi-layer CLOS configuration for at least seven/eight years.

What’s the benefit of the two level routing table scheme?
(I have not gone through the full paper yet. I hope to do that shortly; maybe the answer will be obvious.)

Som Sikdar

bofkentucky August 25, 2008 at 3:55 pm

Check out http://aggregate.org/FNN/ for more research on maximizing bisection bandwidth while controlling per-port costs for clusters.

Steve Jones August 26, 2008 at 1:02 am

The price parity between 10 bytes of network traffic and 1 MB of disk bandwidth refers to WAN network, not LAN. As the article refers to data centre articles, I’m not sure of the relevance. Indeed if you read the distributed computing economics paper, then it has price parity between LAN and disk bandwidth (about 1TB per $). For data centres that’s not surprising: big iron SAN switches and LAN switches are both expensive items. In many cases, it’s even the same basic frames. That’s not to say that there isn’t some potential value in using commodity switches for some types of connection, but the reality of many data centres is that extensive use is made of network virtualisation in SAN and Ethernet connections. Commodity switches generally can’t do those sort of things.

As for the energy consumption per GBps and the heat dissipation, the chart shows an approximate 3:1 reduction for both. That’s hardly surprising: unless something has gone seriously wrong with the law of conservation of energy (or somehow a very large proportion of the “big iron” switches’ power is being dissipated in the form of electromagnetic radiation outside the data centre, which is hardly likely), pretty well all that electrical energy eventually comes out as heat. (Yes, there are a few little niceties about power factors, reactive loads and the like, but generally electrical power in = heat out to a very close order in a closed environment.)
