Amazon Web Services architect James Hamilton has been posting on network issues for over a year and researching them much longer. As Ethernet becomes the de facto SAN technology, his views become more relevant to the larger storage market.
Part of Mr. Hamilton’s concern is the structure of the networking industry: the high margins; the dominance of a single player, Cisco; the closed technology; and the heavy vertical integration. All antithetical to the dynamics that have driven server costs down so successfully in the last 20 years.
These are issues the storage industry knows too well. But Mr. Hamilton is more concerned about the waste the current high-cost industry structure causes.
The cost of network bandwidth leads to network over-subscription. Networks are configured as tree topologies: the farther you move from the end nodes, the worse the over-subscription becomes.
As described in the 2009 Microsoft Research paper VL2: A Scalable and Flexible Data Center Network:
. . . the capacity between different branches of the tree is typically over-subscribed by factors of 1:5 or more, with paths through the highest levels of the tree oversubscribed by factors of 1:80 to 1:240. This limits communication between servers to the point that it fragments the server pool — congestion and computation hot-spots are prevalent even when spare capacity is available elsewhere.
This throttles data center performance by limiting server-to-server bandwidth, fragmenting resources and reducing network utilization. The latter reflects the redundant paths needed in case of switch failure: ≈50% or more of costly data center bandwidth goes unused.
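To see how those ratios arise, the over-subscription at a switch tier is just downstream demand divided by upstream capacity, and ratios multiply along the path between branches. A minimal sketch, with hypothetical port counts and link speeds (not taken from any particular data center):

```python
# Rough illustration of how over-subscription compounds in a tree network.
# Port counts and link speeds are hypothetical examples.

def oversubscription(down_links, down_gbps, up_links, up_gbps):
    """Ratio of downstream demand to upstream capacity at one switch tier."""
    return (down_links * down_gbps) / (up_links * up_gbps)

# Top-of-rack switch: 40 servers at 1 Gb/s sharing 2 x 10 Gb/s uplinks.
tor = oversubscription(40, 1, 2, 10)    # 2.0 -> 1:2 at the rack

# Aggregation switch: 20 such racks (20 Gb/s of uplink each)
# sharing 4 x 10 Gb/s links to the core.
agg = oversubscription(20, 20, 4, 10)   # 10.0 -> 1:10 above the racks

# Two servers in different branches see the product of every tier
# between them.
path = tor * agg                        # 20.0 -> 1:20 end to end
print(tor, agg, path)
```

Stack a couple more tiers like this and the 1:80 to 1:240 figures from the paper stop looking surprising.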
As might be expected, big Internet data centers like Amazon’s have complex and unpredictable workloads. They need lots of bandwidth between all servers all the time.
The VL2 paper describes an experimental solution to these problems that includes location-specific and application-specific addressing, multi-path traffic load balancing and a novel directory design that efficiently handles lookups and updates to network mappings.
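The key trick in the addressing scheme is separating a server's permanent application address from its current location address, with a directory service mapping one to the other. A toy sketch of that idea (my own illustration of the concept, not VL2's actual implementation or identifiers):

```python
# Toy sketch of VL2-style address separation: servers keep stable
# application addresses (AAs) while the fabric routes on
# location-specific addresses (LAs). A directory maps AA -> LA, so a
# server or VM can move (new LA) without changing its AA.
# Illustrative only; names and formats are invented.

class Directory:
    def __init__(self):
        self._aa_to_la = {}

    def update(self, aa, la):
        """Register or move a server: bind its AA to its current LA."""
        self._aa_to_la[aa] = la

    def lookup(self, aa):
        """Resolve an AA to the LA the fabric should route toward."""
        return self._aa_to_la[aa]

d = Directory()
d.update("10.0.0.5", "tor-3:port-12")   # server lives under ToR switch 3
print(d.lookup("10.0.0.5"))             # tor-3:port-12

d.update("10.0.0.5", "tor-9:port-4")    # VM migrates; AA is unchanged
print(d.lookup("10.0.0.5"))             # tor-9:port-4
```

Because senders address the stable AA, the directory update is the only state that changes on a migration — the appeal for a cloud provider shuffling VMs around is obvious.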
In a 75-node test cluster the design moved 2.75TB of data in 395 seconds – 94% of maximum network bandwidth – at a fraction of the cost of current enterprise networks. The paper calculates that a cloud-service scale network with no over-subscription could be built with commodity switches at 1/14th the cost of a traditional data center Ethernet.
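A quick back-of-envelope check on those shuffle numbers, assuming TB means 10^12 bytes and that the 75 nodes contribute roughly equally (both my assumptions, not stated in the paper):

```python
# Back-of-envelope check on the reported 2.75TB-in-395-seconds shuffle.
# Assumes TB = 10^12 bytes and equal contribution from all 75 nodes.

data_bytes = 2.75e12   # 2.75 TB moved
seconds = 395          # elapsed time
nodes = 75

aggregate_gbps = data_bytes * 8 / seconds / 1e9   # cluster-wide throughput
per_node_gbps = aggregate_gbps / nodes            # per-server average

print(round(aggregate_gbps, 1))   # ~55.7 Gb/s across the cluster
print(round(per_node_gbps, 2))    # ~0.74 Gb/s per server
```

Roughly 0.74 Gb/s sustained per server is consistent with gigabit-attached hosts running near their NIC limits for the whole run – exactly the "full bandwidth between all servers" behavior a tree network can't deliver.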
The StorageMojo take
VC and engineering dollars follow high-growth markets. What Google, Amazon and Microsoft want, they get. With the rapid growth of public cloud services the network over-subscription problem will get solved.
Merchant silicon from Broadcom, Intel and Marvell is making a tried-and-true Moore’s Law attack on hardware cost. The protocol stack is tougher, but several open-source industry initiatives are under way with support from major companies. Progress will be slower than hoped, but within 3 years we’ll have a viable stack to build on.
Where does this leave the networking industry? That depends on where you sit.
Cisco will be the biggest loser, because they’ve been the biggest winner with the current model. They may need to pull an IBM and move big into services if they want to stick around. Ironically, Cisco’s UCS product line – which bakes in the tree-structured network – has further motivated broader industry action.
The rest of the industry can go after this emerging market with a lower-gross-margin business model. Not all of them will, but it will be a critical success factor.
The big winner will be storage. Scale-out storage relies on spraying data across multiple racks for maximum availability, utilization and performance. Cheaper, faster, better scale-out networks will only drive storage demand.
For most of us this is an academic problem today. Lightly used systems – such as for backup and archiving – don’t see Amazon’s problems. But in 5 years this will be common even outside the public cloud providers.
Just as IT users have benefited from Google’s push on energy efficiency and much more, they will also benefit from much lower cost and more scalable networks.
Courteous comments welcome, of course. I can’t help but continue to marvel at how dumb Cisco’s UCS has turned out to be. It’s a gift that keeps on giving.