Jim Gray, in his paper Distributed Computing Economics (pdf), noted that there is a rough price parity between 10 bytes of network traffic and a megabyte of disk bandwidth. One of his conclusions: computing has to be as close to the data as possible in order to avoid expensive network traffic.

The paper was written in 2003 and the numbers may not be as accurate today. What if we could change that relationship? How would that impact data center architectures?

In their paper A Scalable, Commodity Data Center Network Architecture (pdf), Mohammad Al-Fares, Alexander Loukissas and Amin Vahdat, three UC San Diego computer scientists, present an architecture that may do just that. They propose to leverage commodity Ethernet switches to support the full aggregate bandwidth of clusters.

They claim the approach requires no changes to the host network adapter, operating systems or applications. And it works with today’s Ethernet, IP and TCP protocols.

The problem
The record growth in massive compute and storage clusters is a tribute to the rapid increases in CPU power and disk capacity. Local area network bandwidth, however, has not kept pace. While it’s possible to aggregate link bandwidth by adding network interface cards, that solution doesn’t scale.

The authors note that inter-node bandwidth is a common bottleneck in large-scale clusters. The typical application efficiency of MPI clusters of 10-15% is stark evidence of this problem.

The paper seeks to:

. . . design a data center communication architecture that meets the following goals:

  • Scalable interconnection bandwidth: it should be possible for an arbitrary host in the data center to communicate with any other host in the network at the full bandwidth of its local network interface.
  • Economies of scale: just as commodity personal computers became the basis for large-scale computing environments, we hope to leverage the same economies of scale to make cheap off-the-shelf Ethernet switches the basis for large-scale data center networks.
  • Backward compatibility: the entire system should be backward compatible with hosts running Ethernet and IP. That is, existing data centers, which almost universally leverage commodity Ethernet and run IP, should be able to take advantage of the new interconnect architecture with no modifications.

Can they do it?
The San Diego team relies on a network design that was first developed more than 50 years ago by Charles Clos. They chose a particular instance of his topology called a fat tree.

The essential difference between current data center architectures and the fat tree architecture is that the aggregation layer is made up of two layers of smaller switches.

Here is their diagram of today’s data center architecture:

Here is their diagram of the fat tree architecture, showing how a packet would be routed:

The key to making this two-layer aggregation level work is the two-level routing table. As the team describes it:

. . . we modify routing tables to allow two-level prefix lookup. Each entry in the main routing table will potentially have an additional pointer to a small secondary table of (suffix, port) entries. A first-level prefix is terminating if it does not contain any second-level suffixes, and a secondary table may be pointed to by more than one first-level prefix. . . .

This two-level structure will slightly increase the routing table lookup latency, but the parallel nature of prefix search in hardware should ensure only a marginal penalty. This is helped by the fact that these tables are meant to be very small. . . . the routing table of any pod switch will contain no more than k/2 prefixes and k/2 suffixes.
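To make the quoted description concrete, here is a minimal Python sketch of a two-level lookup. It is not the authors' implementation: the function names, the table layout and the sample addresses and ports are all assumptions made up for illustration. A first-level entry either carries a port directly (a terminating prefix) or points to a small secondary table of (suffix, port) entries that is matched against the low-order bits of the destination.

```python
from ipaddress import IPv4Address


def to_int(addr):
    """Convert a dotted-quad string to a 32-bit integer."""
    return int(IPv4Address(addr))


def prefix_mask(length):
    """Mask selecting the top `length` bits of a 32-bit address."""
    return (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF


def suffix_mask(length):
    """Mask selecting the bottom `length` bits of a 32-bit address."""
    return (1 << length) - 1


def lookup(table, dst):
    """Return the output port for dst, or None if nothing matches.

    table is a list of (prefix, prefix_len, port, secondary) tuples.
    A terminating prefix has secondary=None and a concrete port.
    Otherwise secondary is a list of (suffix, suffix_len, port) tuples.
    """
    addr = to_int(dst)

    # First level: longest matching prefix.
    best = None
    for prefix, plen, port, secondary in table:
        mask = prefix_mask(plen)
        if addr & mask == to_int(prefix) & mask:
            if best is None or plen > best[0]:
                best = (plen, port, secondary)
    if best is None:
        return None

    _, port, secondary = best
    if secondary is None:
        return port  # terminating prefix: no second-level suffixes

    # Second level: longest matching suffix in the secondary table.
    best_suffix = None
    for suffix, slen, sport in secondary:
        mask = suffix_mask(slen)
        if addr & mask == to_int(suffix) & mask:
            if best_suffix is None or slen > best_suffix[0]:
                best_suffix = (slen, sport)
    return None if best_suffix is None else best_suffix[1]


# Hypothetical table for one pod switch: two terminating prefixes for
# subnets inside the pod, plus a default route whose secondary table
# spreads outbound traffic across uplinks by the last byte of the address.
UPLINKS = [("0.0.0.2", 8, 2), ("0.0.0.3", 8, 3)]
TABLE = [
    ("10.2.0.0", 24, 0, None),
    ("10.2.1.0", 24, 1, None),
    ("0.0.0.0", 0, None, UPLINKS),
]
```

With this table, lookup(TABLE, "10.2.1.2") matches a terminating prefix and stays inside the pod, while lookup(TABLE, "10.0.1.2") falls through to the default route and is steered to an uplink by its suffix. The linear scans are only for readability; the quoted passage points out that hardware would search the entries in parallel.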

Power and cooling
Another advantage of the fat tree architecture is that low cost switches use less power and require less cooling per port than the big iron switches. The team found that 10 GigE switches consume roughly double the Watts per gigabit of bandwidth and dissipate roughly 3 times the heat of commodity GigE switches per Gbit.

The difference looks like this:

Experimental implementation
The researchers configured a small test bench with 16 virtual hosts, 4 pods each with 4 switches, and 4 core switches. While this configuration has 20 switches and 16 hosts, in a larger cluster the number of switches will be much smaller than the number of hosts.
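How the counts change with switch size can be checked with a few lines of Python. This is a sketch that assumes the usual k-ary fat tree construction, which this post does not spell out: k pods built from k-port switches, each pod holding k/2 edge and k/2 aggregation switches, (k/2)^2 core switches, and k/2 hosts per edge switch. The function name and the returned keys are made up for illustration.

```python
def fat_tree_size(k):
    """Switch and host counts for a fat tree built from k-port switches.

    Assumes the standard k-ary construction: k pods, each with k/2 edge
    and k/2 aggregation switches, (k/2)**2 core switches, and k/2 hosts
    attached to every edge switch.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even number of ports, 2 or more")
    half = k // 2
    edge = k * half          # k pods, k/2 edge switches each
    aggregation = k * half   # k pods, k/2 aggregation switches each
    core = half * half
    return {
        "pods": k,
        "edge": edge,
        "aggregation": aggregation,
        "core": core,
        "switches": edge + aggregation + core,
        "hosts": edge * half,  # k/2 hosts per edge switch
    }


if __name__ == "__main__":
    for ports in (4, 48):
        print(ports, fat_tree_size(ports))
```

For k = 4 this gives the test bench above: 4 pods of 4 switches, 4 core switches, 20 switches in total and 16 hosts. For k = 48 it gives 27,648 hosts, the maximum cluster size the paper quotes for 48-port switches, served by 2,880 switches.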

They found that with a couple of modest enhancements to the fat tree topology they were able to achieve worst-case bandwidth of over 87% of the ideal bisection bandwidth for the model. The worst-case bandwidth for the standard tree topology was just under 28%.

The team also considered the problem of cabling together small switches. In the context of a maximum-capacity 27,648-node cluster using 48-port switches, they found that by placing all of a pod’s switches in a single rack they could minimize the cabling between the edge and aggregation layers.

While not as pretty a solution as the costly internal bandwidth of big switches, the cabling issues are workable.

Other fat-tree architectures
Fat tree architectures have been around for several decades and were first used for telephone switching. Other examples are systems from SGI and Sun’s 3,456-port InfiniBand switch. What is new is the authors’ method of routing Ethernet traffic in the system and the optimizations they suggest.

The StorageMojo take
The persuasive power of low cost commodity architectures has been well demonstrated in the CPU and storage areas. Networks have remained a special case.

What this paper demonstrates is that lower cost network infrastructures are possible. Google’s infrastructure makes common use of unmanaged commodity switches, but they’ve made a fetish of keeping data close to processors, something not always possible in general-purpose clusters.

I suspect that Cisco’s analyst relations team will have new marching orders this week. This is not a message Cisco will want customers to hear.

The power efficiency argument is not going to be that persuasive. The real win is in the dollars. And that’s all you need to get a CIO’s attention in these difficult times.

Courteous comments welcome, of course. This paper was presented at SIGCOMM ’08, as was a paper on a related topic, Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises. The latter paper looks at a new Ethernet protocol that enables building networks without the flooding, broadcasting and spanning tree techniques that force the use of IP routing in large LANs.

And thanks to Wes for helping me stay informed.