The limits of disaggregation

by Robin Harris on Wednesday, 30 August, 2017

Hyperconvergence – aka aggregation – is pushing scale-out architectures in one direction. But Rack Scale Design (RSD) – aka disaggregation – is pushing scale-out in another direction. And Composable Infrastructure is hoping to split the difference, with the power to define aggregations in software, rather than hardware.

But this continuum is not symmetrical on each end. We have a pretty good idea of what can be done with hyperconvergence – check out the growing vendors – but disaggregation is still mostly in the theory stage.

That’s why the recent paper, Understanding Rack-Scale Disaggregated Storage, by Sergey Legtchenko, Hugh Williams, Kaveh Razavi, Austin Donnelly, Richard Black, Andrew Douglas, Nathanaël Cheriere, Daniel Fryer, Kai Mast, Angela Demke Brown, Ana Klimovic, Andy Slowey, and Antony Rowstron, of Microsoft Research, is so useful.

For the research, the authors developed an experimental research fabric, dubbed the Flexible Fabric, to test four levels of disaggregation based on how often a reconfiguration is needed.

The levels are:

  • Complete disaggregation. Assumes any drive can be connected to any server on a per I/O basis. Most frequent reconfig.
  • Dynamic elastic disaggregation. Assumes drives will connect to servers for multiple I/Os, but that the number of drives connected to any one server will vary over time.
  • Failure disaggregation. Reconfigure only on drive or server failures.
  • Configuration disaggregation. Reconfigure only during deployment, or if a rack is repurposed. Least frequent reconfiguration.

Flexible fabric
The team needed a fabric that could reconfigure in a millisecond to even get close to testing the complete disaggregation model. With SSDs capable of hundreds of thousands of IOPS, even a millisecond is much too long, but who can do better?
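
A rough back-of-envelope sketch shows why. It assumes a hypothetical SSD rated at 200,000 IOPS (the post says only "hundreds of thousands"), a 1 ms reconfiguration during which no I/O is served, and strictly serial I/O. The function name and the numbers are illustrative and not taken from the paper.

```python
def effective_iops(device_iops, reconfig_seconds, ios_per_reconfig):
    """Sustained IOPS when the fabric pauses for one reconfiguration
    every `ios_per_reconfig` I/Os (serial I/O, no overlap assumed)."""
    service_seconds = ios_per_reconfig / device_iops
    return ios_per_reconfig / (reconfig_seconds + service_seconds)

ASSUMED_SSD_IOPS = 200_000   # hypothetical drive, not a figure from the paper
RECONFIG_SECONDS = 1e-3      # the 1 ms target mentioned above

per_io = effective_iops(ASSUMED_SSD_IOPS, RECONFIG_SECONDS, 1)
long_lived = effective_iops(ASSUMED_SSD_IOPS, RECONFIG_SECONDS, 1_000_000)
print(f"reconfigure on every I/O:  {per_io:,.0f} IOPS")
print(f"reconfigure every 1M I/Os: {long_lived:,.0f} IOPS")
```

Under these assumptions, switching on every I/O caps the drive at roughly 1,000 IOPS, while holding a mapping for a million I/Os costs almost nothing.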

The paper describes the Flexible Fabric:

The core of the Flexible Fabric is a 160-port switch, which implements a circuit switch abstraction. The switch allows any port to be connected to any other port. When any two ports are connected, we refer to them as being mapped . . . . The switch supports both SAS and SATA PHYs and is transparent to all components connected to it.
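
The circuit switch abstraction in that description can be pictured with a toy model. This is only a sketch of the "any port to any other port" idea. The class, its method names, and the choice of which ports face servers or drives are assumptions for illustration, not the Flexible Fabric's actual control interface.

```python
class CircuitSwitch:
    """Toy model of a circuit switch: each port is mapped to at most one peer."""

    def __init__(self, num_ports=160):
        self.num_ports = num_ports
        self._peer = {}  # port -> port it is currently mapped to

    def _check(self, port):
        if not 0 <= port < self.num_ports:
            raise ValueError(f"no such port: {port}")

    def map(self, a, b):
        """Connect ports a and b, tearing down any circuit either was part of."""
        self._check(a)
        self._check(b)
        if a == b:
            raise ValueError("a port cannot be mapped to itself")
        self.unmap(a)
        self.unmap(b)
        self._peer[a] = b
        self._peer[b] = a

    def unmap(self, port):
        """Disconnect a port (and its peer) if it is mapped."""
        peer = self._peer.pop(port, None)
        if peer is not None:
            self._peer.pop(peer, None)

    def peer(self, port):
        """Return the port mapped to `port`, or None if it is unmapped."""
        return self._peer.get(port)


switch = CircuitSwitch()
switch.map(0, 100)   # hypothetical: a server-side port to a drive-side port
switch.map(0, 101)   # remapping port 0 releases port 100
```

Each reconfiguration level in the list above then differs only in how often something like `map` gets called.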

The authors take pains to point out that the Flexible Fabric is a research tool and is not intended for production use. They would recommend against even attempting to use the architecture in any kind of production environment.

It’s a research tool, not a stalking horse for a new kind of fabric product.

In their research the team found some anomalies. They couldn’t use a modern PHY like SAS 3.0 because it does link quality scanning – a good thing – which makes setup time last as much as a second – a bad thing.

They also discovered that rapid and frequent drive switching crashed some host bus adapters. For the SATA configuration, they finally selected the Highpoint Rocket 640 Lite 4-port SATA 2.0 PCIe 2.0 controller.

Summary results

  • Complete disaggregation was killed by the overhead of rapid switching. Not a huge surprise.
  • Dynamic elastic disaggregation, where drives are connected to servers for minutes to hours at a time, proved to be technically viable, and potentially a boon for variable workloads.
  • Failure disaggregation also proved to be technically viable, and its use case – migrating drives from a failed server to minimize the network overhead of rebuilds – is definitely interesting.
  • Configuration disaggregation, where configurations are set at deployment, turned out to be a bust, because the flexibility and cost of the fabric didn’t provide a commensurate benefit.

The StorageMojo take
So the extremes aren’t interesting, at least given the issues with current technology. But that leaves a wide swath of possibilities for system architects to explore as RSD/disaggregation/composable infrastructure ideas gain steam.

Of course, now and always, reliability trumps flexibility. And there are, no doubt, many gremlins in dynamic disaggregation scenarios.

But greater disaggregation seems to be a secular trend due to the dissimilar rates of technological change in the underlying CPU, network, and storage technologies. Work like this paper helps sort out the issues.

Courteous comments welcome, of course.


Eclipse 2017

by Robin Harris on Tuesday, 22 August, 2017

Things have been a bit quiet here at Chez Mojo – at least on the publishing side. On the personal side I’ve been busy with a few things, one being a move to a new place. Not much of a move – about 100 yards as the crow flies – but packing isn’t much different for 100 yards or 100 miles.

The other project was a 1,000 mile drive to Torrington, Wyoming, for the 2017 US eclipse. We left on Friday, spent a night in Las Vegas, New Mexico – not nearly as glitzy as the Nevada version – and then a couple of nights in Denver.

Torrington wasn’t the first choice for eclipse watching, but horrendous traffic on I-25 from Denver pushed us to the backroads. And the small town in southeastern Wyoming – not far from the Nebraska border – turned out to be a very friendly place.

The totality was amazing and all too brief. The NASA-approved shades worked as advertised, or else I couldn’t see to type this.

The StorageMojo take
Now we’re on our way back to the red rocks of northern Arizona. Right now we’re in Aspen, Colorado, on our way to Durango – a place I’ve long wanted to visit – and then on Thursday I’ll finish moving.

After a day to recover from a 2,000 mile road trip and some heavy lifting, I’ll be back to blogging on the state of emerging technology. Cheers!


How high redundancy can hurt availability

by Robin Harris on Monday, 24 July, 2017

I wrote about how clouds fail on ZDNet today, but there was another wrinkle in the paper that I found interesting: high redundancy hurts. Counterintuitive?

This comes from the paper Gray Failure: The Achilles’ Heel of Cloud-Scale Systems, by Peng Huang, Chuanxiong Guo, Lidong Zhou, and Jacob R. Lorch, of Microsoft Research, and Yingnong Dang, Murali Chintalapati, and Randolph Yao, of Microsoft Azure. The paper explores the “gray failure” problem, where component failures are subtle, often intermittent, and thus are difficult to detect and correct.

Go read the ZDNet piece to get the gist of their findings. This post focuses on the problem of redundancy reducing availability.

Department of redundancy department
Cloud networks are configured with high redundancy to better tolerate failures. A switch stoppage is usually a non-event because the protocols re-route packets through other switches. Thus redundancy increases availability in the case of a switch failure.

But some switch failures are intermittent gray failures: random and silent packet drops. The protocols see the dropped packets and resend them, so the packets are not re-routed. But the applications see increased latency or other glitches as those lost packets are resent.

Let’s say your cloud has a front-end server that fans out a request to many back-end servers, and the front-end must wait until almost all of the back-end servers respond. If you have 10 core switches that fan out to 1000 backend servers, you have an almost 100% chance that a gray failure at any core switch will delay nearly every front-end request.

Thus, the more core switches you have, the more likely you are to have a gray failure, and, with a high fan-out factor, the more likely you are to have a gray failure that delays nearly every front-end request.
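
The arithmetic behind that claim can be sketched as follows, assuming each back-end request independently crosses one core switch chosen uniformly at random and that exactly one core switch is suffering a gray failure. That model, along with the function name, is a simplifying assumption for illustration.

```python
def p_request_hits_gray_switch(core_switches, fan_out):
    """Probability that at least one of `fan_out` back-end requests
    crosses the single gray-failing core switch."""
    p_miss_one = 1.0 - 1.0 / core_switches  # one request avoids the bad switch
    return 1.0 - p_miss_one ** fan_out

for fan_out in (1, 10, 100, 1000):
    p = p_request_hits_gray_switch(10, fan_out)
    print(f"10 core switches, fan-out {fan_out:>4}: {p:.4f}")
```

With a fan-out of 1, only one request in ten is exposed to the bad switch. At a fan-out of 1000 the probability is indistinguishable from 1, which is the "almost 100%" case above.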


The StorageMojo take
The paper is a highly recommended read if you architect for or rely upon one of the major cloud vendors, especially if your main focus is software. While human errors are a major cause of cloud outages, the authors make the point that undetected gray failures tend to accumulate over time, stressing the healthy infrastructure, and can lead to cascading failures and a major outage.

As anyone experienced with hardware can tell you, gray failures are regrettably common, and a total bear to diagnose and correct. The late, great Jim Gray coined the term Heisenbugs to describe them, because, like quantum particles, they behave differently when you try to observe them.

The bigger lesson of the paper, though, is that scale changes everything. Even the kinds of bugs that can take a 100,000-server system down.

Courteous comments welcome, of course. If you’re a cloud user, have you seen behavior that gray failures might explain? Please comment!


Hike blogging: 07-17-2017

by Robin Harris on Monday, 17 July, 2017

Hike blogging has been on hiatus for several reasons, including no good pictures, packing up for a short move, too much rain – it’s monsoon time now – and I’ve been getting back to biking as well.

But this morning I got out at 6:30 onto the Twin Buttes/Hog Heaven/Hog Wash loop. It’s about 4.5 miles, with about 370 feet of vertical.

The Hog Heaven portion is a double black diamond mountain bike trail. Given that I find it a little hairy on foot, I can’t imagine how skilled – or crazy – you have to be to bike it.

But the views were fabulous in the early morning light. Here’s one:


The StorageMojo take
Let me know if you come to town. It’s a beautiful place and well worth a visit. Happy to recommend hikes and places to go in town for food, wine, music, and art.

Courteous comments welcome, of course.


Flash Memory Summit next month

by Robin Harris on Monday, 17 July, 2017

StorageMojo’s crack analyst team will be attending next month’s Flash Memory Summit. The dates are August 8-10, at the Santa Clara Convention Center.

Wasn’t able to attend last year, but the 2015 summit was the best storage show I’d seen in years. Flash is where the action is, with NVRAM coming along as well.

I’ve got a couple of meetings scheduled, but if your company is doing something early stage, I’d like to talk to you. Comment below to set up a meeting. I won’t publish invites.

The StorageMojo take
With flash products moving into maturity, StorageMojo is really interested in NVRAM technologies and in how they are affecting system architectures. Especially interested in emerging concepts.

Courteous comments welcome, of course.


The moving target problem

by Robin Harris on Tuesday, 11 July, 2017

With the news that Toshiba has developed 3D quad-level cell flash with 768Gb die capacity, I’m reminded of the moving target problem. This is a problem whenever a new technology seeks to carve out a piece of an existing technology’s market.

Typically, a startup seeks funding based on producing a competitive product in, say, two years. Good analysis will allow for the fact that competition will improve, typically based on then-current improvement trends.

Often two things happen to derail the projections. The most likely is that the new product development cycle slips out, so when the product ships it is up against another 6-18 months of incumbent improvement.

But sometimes the pace of incumbent improvement rises, so even if the newtech meets its schedule projections – when does THAT ever happen? – it is still facing a tougher competitor than planned.

Disk vs flash
Flash had this problem for a couple of decades with disk. In the early 90s I bought an HP Omnibook 300 and forked over another $400 for a 10MB Compact Flash card to replace the power hungry disk. Some flash proponents probably hoped this was the beginning of a trend.

But it was not to be. Disk vendors discovered how to increase bit density on a regular basis, and disk capacities and areal densities started rising at ≈40% a year. They also built rugged 2.5″ drives for the burgeoning notebook market, and invested in power-saving technologies.

That helped keep flash at bay for another 15 years.
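
To see how quickly that target moves, here is a small compounding sketch. It takes the ≈40% annual rate above at face value and applies it to the two-year plan and the 6-18 month slip described earlier. Treating that one rate as constant is an assumption, and the function name is hypothetical.

```python
def incumbent_gain(annual_rate, years):
    """Factor by which the incumbent improves after `years` of
    compounding at `annual_rate` (0.40 means 40% a year)."""
    return (1.0 + annual_rate) ** years

RATE = 0.40  # the approximate disk growth rate cited above, assumed constant

print(f"planned 2-year project:   {incumbent_gain(RATE, 2.0):.2f}x")
print(f"plus a 6-month slip:      {incumbent_gain(RATE, 2.5):.2f}x")
print(f"plus an 18-month slip:    {incumbent_gain(RATE, 3.5):.2f}x")
print(f"15 years of compounding:  {incumbent_gain(RATE, 15):.0f}x")
```

At this rate the incumbent roughly doubles over the planned two years and more than triples if the project slips by 18 months.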

But finally, the flash cost-per-bit dropped below that of DRAM, and the floodgates opened. Flash won the smartphone market, which powered investment in huge fabs, and soon flash prices were dropping faster than disk prices.

But the key was that flash found niches that disks could not serve. And when one of those niches exploded into industry-altering size, the economics of critical mass and mass production kicked in.

I’ve been following NVRAM with great interest for years. That’s partly due to interest in what it could mean for system architecture, but also for its potential as a substitute for flash.

While it’s clear that the NAND flash cost advantage is good for the next decade, it’s also clear that flash has been shoehorned into applications – such as caches – for which it is suboptimal. NVRAM will encroach around the edges of the flash market, not the heart.

MRAM, for example, is already doing a good business in the automotive and mil-spec sectors, because it is really tough. Diablo’s current hybrid NVDIMMs – combo DRAM and flash – could certainly benefit from a pure NVRAM solution if the price was right.

The key is that NVRAM’s sweet spot is well away from flash’s cost-per-bit and density sweet spots. A fact that Toshiba’s announcement exemplifies.

The StorageMojo take
Watching how flash and NVRAM interact in the marketplace over the next decade will be instructive for students of technology diffusion. The two technologies are close in some ways, but differ dramatically in others, so simple flash out/NVRAM in stories will be the exception.

That also ignores the potential creativity of architects and engineers as they explore the capabilities of new kinds of NVRAM. Or the potential for a new class of devices that drive NVRAM adoption, as the smartphone drove flash.

In any case the calculus of the moving target will remain. To the nimble go the spoils.

Courteous comments welcome, of course.


Why startups fail

June 21, 2017

A great piece at CB Insights. They collected the failure stories of 101 startups and then broke those failures into 20 categories. Spoiler alert! Here are the top 10 reasons for failure, as compiled by CB Insights. What I find interesting is that 8 of the top 10 reasons are marketing related. No market need. […]


A transaction processing system for NVRAM

June 19, 2017

Adapting to NVRAM is going to be a lengthy process. This was pointed out by a recent paper. More on that later. Thankfully, Intel wildly pre-announced 3D XPoint. That has spurred OS and application vendors to consider how it might affect their products. As we saw with the adoption of SSDs, it takes time to […]


A distributed fabric for rack scale computing

June 12, 2017

After years of skepticism about rack scale design (RSD), StorageMojo is coming around to the idea that it could work. It’s still a lab project, but researchers are making serious progress on the architectural issues. For example, in a recent paper, XFabric: A Reconfigurable In-Rack Network for Rack-Scale Computers Microsoft Researchers Sergey Legtchenko, Nicholas Chen, […]


Infinidat sweetens All Flash Array Challenge

June 6, 2017

In response to yesterday’s StorageMojo post on Infinidat, Brian Carmody of Infinidat tweeted: Robin, Verde Valley is a great organization. @INFINIDAT will donate $10K for every Infinidat Challenge customer who mentions your blog post. — Brian Carmody (@initzero) June 5, 2017 Thanks, Brian! The StorageMojo take Verde Valley Sanctuary is a fine organization that StorageMojo […]


Infinidat’s sweet AFA challenge

June 5, 2017

StorageMojo has observed, many times, that great marketing of a mediocre product beats mediocre marketing of a great product all the time. Thus it is always of interest when someone comes up with an innovative marketing wrinkle. That’s what Infinidat has done with their Faster than all flash challenge. Their claim is that their system […]


Hike blogging: Devils Creek Road

June 3, 2017

Taking a vacation from the usual slog in NoAZ. I’m some 60 miles north of Seattle, working on my rain tan. The weatherman claims we’ll break 70 degrees sometime during my visit, but I’m not counting on it. Occasional patches of blue sky remind me of what is possible, if not likely. Took a 4.5 […]


Routing the I/O stack

May 30, 2017

Lots of energy around the concept of Rack Scale Design (Intel’s nomenclature) in systems design these days. Instead of depositing a cpu, memory, I/O, and storage on a single motherboard, why not have a rack of each, interconnected over a high-bandwidth, low-latency network – PCIe is favored today – and use software to define bundles […]


Liqid’s composable infrastructure

May 8, 2017

The technology wheel is turning again. Yesterday it was converged and hyperconverged infrastructure. Tomorrow it’s composable infrastructure. Check out Liqid a software-and-some-hardware company that I met at NAB. The software – Element – enables you to configure custom servers from hardware pools of compute, network, and, of course, storage. I met Liqid co-founder Sumit Puri […]


NAB 2017 storage roundup

May 4, 2017

Spent two days at the annual National Association of Broadcasters (NAB) confab in Las Vegas. With 4k video everywhere, storage was a hot topic as well. Here’s what caught my eye. Object storage – often optimized for large files – continues to be a growth area. Scality, Dynamic Data Pool, Object Matrix, HGST, Data IO, […]
