A distributed fabric for rack scale computing

by Robin Harris on Monday, 12 June, 2017

After years of skepticism about rack scale design (RSD), StorageMojo is coming around to the idea that it could work. It’s still a lab project, but researchers are making serious progress on the architectural issues.

For example, in a recent paper, XFabric: A Reconfigurable In-Rack Network for Rack-Scale Computers, Microsoft researchers Sergey Legtchenko, Nicholas Chen, Hugh Williams, Daniel Cletheroe, Antony Rowstron, and Xiaohan Zhao discuss

. . . a rack-scale network that reconfigures the topology and uplink placement using a circuit-switched physical layer over which SoCs perform packet switching. To satisfy tight power and space requirements in the rack, XFabric does not use a single large circuit switch, instead relying on a set of independent smaller circuit switches.

The network problem
My concerns around RSD have always centered on the network. It’s obvious that Moore’s Law is making more powerful and efficient Systems on a Chip (SoCs) more attractive. And flash has eliminated many issues around storage, particularly power, cooling, weight, and density – while cost is steadily improving.

Which leaves the network. Network bandwidth is much more costly than internal server bandwidth, and, due to the bursty nature of traffic, much more likely to constrain overall system performance.

Which, in a nutshell, is the business justification for hyperconverged infrastructure: blocks of compute, memory, and storage using cheap internal bandwidth, with Ethernet interconnecting the blocks. But today we can have a couple of thousand microservers in a rack.

Now if we could only figure out how to network them at reasonable cost and performance. Traditional Top-of-Rack (ToR) switches are costly and don’t scale well.

Higher server density requires a redesign of the in-rack network. A fully provisioned 40 Gbps network with 300 SoCs would require a ToR switch with 12 Tbps of bisection bandwidth within a rack enclosure, which imposes power, cooling, and physical space constraints.
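The 12 Tbps figure checks out with simple arithmetic (assuming one fully provisioned 40 Gbps link per SoC):

```python
# Back-of-the-envelope check of the ToR bandwidth figure quoted above.
socs = 300
link_gbps = 40  # assumed line rate per SoC link

# A fully provisioned ToR switch must carry every SoC at line rate.
bisection_tbps = socs * link_gbps / 1000
print(bisection_tbps)  # 12.0
```

Twelve terabits per second in a single rack-mounted switch is why the power, cooling, and space constraints bite.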

Fully distributed networks are much cheaper, but inflexible. That’s why HPE’s Moonshot uses three network topologies: one for ingress/egress traffic, a multi-hop network for storage, and a 2D torus fabric for in-rack traffic.

The XFabric answer
With XFabric the Microsoft Research team decided to split the difference.

. . . XFabric uses partial reconfigurability. It partitions the physical layer into a set of smaller independent circuit switches such that each SoC has a port attached to each partition. Packets can be routed between the partitions by the packet switches embedded in the SoCs. The partitioning significantly reduces the circuit switch port requirements enabling a single cross point switch ASIC to be used per partition. This makes XFabric deployable in a rack at reasonable cost.
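The port-count savings from partitioning are easy to see with illustrative numbers (the partition count here is an assumption, not the paper’s exact configuration): a single monolithic circuit switch would need one port per SoC link across every partition, while each small independent crosspoint ASIC only needs one port per SoC.

```python
# Illustrative port arithmetic for a partitioned circuit-switched
# physical layer. Numbers are assumptions for illustration only.
socs = 300
partitions = 6  # one SoC port attached to each partition (assumed)

monolithic_ports = socs * partitions  # one big circuit switch: every link
per_partition_ports = socs            # each small crosspoint ASIC: one port per SoC

print(monolithic_ports, per_partition_ports)  # 1800 300
```

Dropping the per-ASIC requirement from 1800 ports to 300 is what makes a single crosspoint switch chip per partition feasible at rack cost.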

Of course, you then have to deal with the fact that the fabric is not fully reconfigurable. Which is where the XFabric secret sauce comes in.

XFabric uses a novel topology generation algorithm that is optimized to generate a topology and determine which circuits should be established per partition. It also generates the appropriate forwarding tables for each SoC packet switch. The algorithm is efficient, and XFabric can instantiate topologies frequently, e.g. every second at a scale of hundreds of SoCs, if required.
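The paper’s topology generator optimizes circuits against the observed traffic matrix; as a much-simplified stand-in, the sketch below establishes a fixed circuit pattern per partition (rings with different strides, my assumption) and then derives the per-SoC next-hop forwarding tables with a breadth-first search:

```python
# Simplified stand-in for XFabric-style topology generation and
# forwarding-table construction. Not the paper's algorithm.
from collections import deque

def ring_topology(n, stride):
    """One circuit per SoC in a partition: a ring with the given stride."""
    return {(i, (i + stride) % n) for i in range(n)}

def forwarding_tables(n, circuits):
    """Per-SoC next-hop tables over the established circuits (BFS)."""
    adj = {i: set() for i in range(n)}
    for a, b in circuits:
        adj[a].add(b)
        adj[b].add(a)
    tables = {}
    for src in range(n):
        next_hop, seen, frontier = {}, {src}, deque([src])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    # First hop on the shortest path src -> v.
                    next_hop[v] = v if u == src else next_hop[u]
                    frontier.append(v)
        tables[src] = next_hop
    return tables

n = 8  # toy rack; the paper targets hundreds of SoCs
circuits = ring_topology(n, 1) | ring_topology(n, 3)  # two partitions
tables = forwarding_tables(n, circuits)
print(tables[0])  # next hops from SoC 0 to every other SoC
```

The real algorithm also has to pick which circuits to establish in each partition so that hot SoC pairs get short paths, and it must do so fast enough to re-run every second or so at rack scale.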

The team modeled XFabric on a testbed, and the results were stunning:

The results show that under realistic workload assumptions, the performance of XFabric is up to six times better than a static 3D-Torus topology at rack scale. We also show it provides comparable performance to a fully reconfigurable network while consuming five times less power.

The StorageMojo take
With the work being done on PCIe fabrics, I/O stack routing, composable infrastructure, and resilience in distributed storage, we are reaching a critical mass of basic research that points to a paradigm-busting architecture for RSD. In 10 years today’s state-of-the-art hyperconverged systems will look like a Model T Ford sitting next to a LaFerrari Aperta.

A key implication of RSD is that it will favor warehouse scale systems. That’s good news for cloud vendors.

But if RSD is as configurable as the current products and research suggests, it will also find a home in the enterprise. The tension that exists today between object storage in the cloud and object storage in the enterprise will govern enterprise adoption.

But that’s a topic for another post.

Courteous comments welcome, of course.


Hans July 9, 2017 at 7:38 am

Gen-Z is one of the players in the race to be the bus for rack scale (and larger) computing.
https://en.m.wikipedia.org/wiki/Gen-Z
TCP and Ethernet will move to the edge of the DC, and FC will go the way of the dodo, IMHO.
