Panasas pushes on

by Robin Harris on Thursday, 16 November, 2017

Panasas has long been one of the most innovative storage companies – and the industry’s best kept secret. The latter fact is due to their focus on High Performance Computing (HPC), and a steadfast refusal to market themselves as “enterprise” storage.

So, yeah, it is an engineering company. But they keep turning the product development crank and growing their product capabilities.

Their latest announcement is a case in point. They have a new controller platform that is server-based rather than blade-based. This enables them to put considerably more scale-up grunt behind their controller software.

Panasas has disaggregated their director software to run on commodity servers. That software can also be run on a cluster of at least 4 nodes for high availability and performance – with up to 360GB/sec of bandwidth.

The director software is not in the data path, but as Isilon users can attest, poor metadata handling can cripple nominally powerful storage controllers. Panasas maintains separate control and data planes to mitigate metadata performance issues.

Parallel file system
Panasas was also an early advocate for the NFS 4.1 parallel file access protocol. The PNFS standard was stillborn, though, because vendors who couldn’t take advantage of the extra performance declined to support it.

But the PNFS architecture lives on in Panasas Direct Flow, which the company has continued to develop. Direct Flow enables multiple 10gig or faster Ethernet links to act in parallel to speed large file transfers.

The StorageMojo take
There’s a lot more to the announcement, but the bottom line is that Panasas continues to innovate and push the envelope of high performance storage for HPC.

AI’s need for large training data sets is a natural for HPC storage. We’re at the beginning of a very interesting curve in high performance storage.

Courteous comments welcome, of course.


The limits of disaggregation

by Robin Harris on Wednesday, 30 August, 2017

Hyperconvergence – aka aggregation – is pushing scale-out architectures in one direction. But Rack Scale Design (RSD) – aka disaggregation – is pushing scale-out in another direction. And Composable Infrastructure is hoping to split the difference, with the power to define aggregations in software, rather than hardware.

But this continuum is not symmetrical on each end. We have a pretty good idea of what can be done with hyperconvergence – check out the growing vendors – but disaggregation is still mostly in the theory stage.

That’s why the recent paper, Understanding Rack-Scale Disaggregated Storage, by Sergey Legtchenko, Hugh Williams, Kaveh Razavi, Austin Donnelly, Richard Black, Andrew Douglas, Nathanaël Cheriere, Daniel Fryer, Kai Mast, Angela Demke Brown, Ana Klimovic, Andy Slowey, and Antony Rowstron, of Microsoft Research, is so useful.

For the research, the authors developed an experimental fabric, dubbed the Flexible Fabric, to test four levels of disaggregation based on how often a reconfiguration is needed.

The levels are:

  • Complete disaggregation. Assumes any drive can be connected to any server on a per I/O basis. Most frequent reconfiguration.
  • Dynamic elastic disaggregation. Assumes drives will connect to servers for multiple I/Os, but that the number of drives connected to any one server will vary over time.
  • Failure disaggregation. Reconfigure only on drive or server failures.
  • Configuration disaggregation. Reconfigure only during deployment, or if a rack is repurposed. Least frequent reconfiguration.

Flexible fabric
The team needed a fabric that could reconfigure in a millisecond to even get close to testing the complete disaggregation model. With SSDs capable of hundreds of thousands of IOPS, even a millisecond is much too long, but who can do better?
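To put rough numbers on why a millisecond is too long, here is a minimal back-of-the-envelope sketch in Python. It assumes a single drive sustaining 100,000 IOPS and a fabric that takes the full millisecond to remap. The function name and the figures plugged in are illustrative assumptions, not taken from the paper.

```python
from typing import Tuple


def switching_cost(iops: float, reconfig_seconds: float) -> Tuple[float, float]:
    """Estimate what one fabric reconfiguration costs in drive time.

    Returns (ios_forgone, useful_fraction):
      ios_forgone     - I/Os the drive could have served during one reconfiguration
      useful_fraction - share of wall-clock time spent on I/O if the fabric
                        reconfigures before every single I/O
    """
    if iops <= 0 or reconfig_seconds < 0:
        raise ValueError("iops must be positive and reconfig_seconds non-negative")
    io_seconds = 1.0 / iops                  # average service time per I/O
    ios_forgone = reconfig_seconds * iops    # I/Os that fit into one remap window
    useful_fraction = io_seconds / (io_seconds + reconfig_seconds)
    return ios_forgone, useful_fraction


if __name__ == "__main__":
    # Assumed figures: 100,000 IOPS drive, 1 ms fabric reconfiguration.
    forgone, useful = switching_cost(iops=100_000, reconfig_seconds=1e-3)
    print(f"I/Os forgone per reconfiguration: {forgone:.0f}")
    print(f"Useful time with per-I/O switching: {useful:.2%}")
```

Under those assumptions each remap costs about 100 I/Os of drive time, and switching on every I/O leaves the drive doing useful work roughly 1% of the time.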

The paper describes the Flexible Fabric:

The core of the Flexible Fabric is a 160-port switch, which implements a circuit switch abstraction. The switch allows any port to be connected to any other port. When any two ports are connected, we refer to them as being mapped . . . . The switch supports both SAS and SATA PHYs and is transparent to all components connected to it.

The authors take pains to point out that the Flexible Fabric is a research tool and is not intended for production use. They would recommend against even attempting to use the architecture in any kind of production environment.

It’s a research tool, not a stalking horse for a new kind of fabric product.

In their research the team found some anomalies. They couldn’t use a modern PHY like SAS 3.0 because it does link quality scanning – a good thing – which makes setup time last as much as a second – a bad thing.

They also discovered that rapid and frequent drive switching crashed some host bus adapters. For the SATA configuration, they finally selected the Highpoint Rocket 640 Lite 4-port SATA 2.0 PCIe 2.0 controller.

Summary results

  • Complete disaggregation was killed by the overhead of rapid switching. Not a huge surprise.
  • Dynamic elastic disaggregation, where drives are connected to servers for minutes to hours at a time, proved to be technically viable, and potentially a boon for variable workloads.
  • Failure disaggregation also proved to be technically viable, and its use case – migrating drives from a failed server to minimize the network overhead of rebuilds – is definitely interesting.
  • Configuration disaggregation, where configurations are set at deployment, turned out to be a bust, because the fabric’s flexibility didn’t provide a benefit commensurate with its cost.

The StorageMojo take
So the extremes aren’t interesting, at least given the issues with current technology. But that leaves a wide swath of possibilities for system architects to explore as RSD/disaggregation/composable infrastructure ideas gain steam.

Of course, now and always, reliability trumps flexibility. And there are, no doubt, many gremlins in dynamic disaggregation scenarios.

But greater disaggregation seems to be a secular trend due to the dissimilar rates of technological change in the underlying CPU, network, and storage technologies. Work like this paper helps sort out the issues.

Courteous comments welcome, of course.


Eclipse 2017

by Robin Harris on Tuesday, 22 August, 2017

Things have been a bit quiet here at Chez Mojo – at least on the publishing side. On the personal side I’ve been busy with a few things, one being a move to a new place. Not much of a move – about 100 yards as the crow flies – but packing isn’t much different for 100 yards or 100 miles.

The other project was a 1,000 mile drive to Torrington, Wyoming, for the 2017 US eclipse. We left on Friday, spent a night in Las Vegas, New Mexico – not nearly as glitzy as the Nevada version – and then a couple of nights in Denver.

Torrington wasn’t the first choice for eclipse watching, but horrendous traffic on I-25 from Denver pushed us to the backroads. And the small town in southeastern Wyoming – not far from the Nebraska border – turned out to be a very friendly place.

The totality was amazing and all too brief. The NASA-approved shades worked as advertised, or else I couldn’t see to type this.

The StorageMojo take
Now we’re on our way back to the red rocks of northern Arizona. Right now we’re in Aspen, Colorado, on our way to Durango – a place I’ve long wanted to visit – and then on Thursday I’ll finish moving.

After a day to recover from a 2,000 mile road trip and some heavy lifting, I’ll be back to blogging on the state of emerging technology. Cheers!


How high redundancy can hurt availability

by Robin Harris on Monday, 24 July, 2017

I wrote about how clouds fail on ZDNet today, but there was another wrinkle in the paper that I found interesting: high redundancy hurts. Counterintuitive?

This comes from the paper Gray Failure: The Achilles’ Heel of Cloud-Scale Systems, by Peng Huang, Chuanxiong Guo, Lidong Zhou, and Jacob R. Lorch, of Microsoft Research, and Yingnong Dang, Murali Chintalapati, and Randolph Yao, of Microsoft Azure. The paper explores the “gray failure” problem, where component failures are subtle, often intermittent, and thus difficult to detect and correct.

Go read the ZDNet piece to get the gist of their findings. This post focuses on the problem of redundancy reducing availability.

Department of redundancy department
Cloud networks are configured with high redundancy to better tolerate failures. A switch stoppage is usually a non-event because the protocols re-route packets through other switches. Thus redundancy increases availability in the case of a switch failure.

But some switch failures are intermittent gray failures: random and silent packet drops. The protocols see the dropped packets and resend them, so the packets are not re-routed. But the applications see increased latency or other glitches as those lost packets are resent.

Let’s say your cloud has a front-end server that fans out a request to many back-end servers, and the front-end must wait until almost all of the back-end servers respond. If you have 10 core switches that fan out to 1,000 back-end servers, you have an almost 100% chance that a gray failure at any core switch will delay nearly every front-end request.

Thus, the more core switches you have, the more likely you are to have a gray failure, and, with a high fan-out factor, the more likely you are to have a gray failure that delays nearly every front-end request.
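The arithmetic behind that claim can be sketched in a few lines of Python. This is a simplified model, not the paper’s code: it assumes each back-end call independently crosses one core switch chosen uniformly at random, and the per-switch gray failure probability used below is a made-up illustrative number.

```python
def p_request_crosses_switch(core_switches: int, fan_out: int) -> float:
    """Probability that at least one of a request's `fan_out` back-end calls
    crosses one particular core switch (for example, the one with a gray failure).

    Assumption: each call independently picks a core switch uniformly at random.
    """
    if core_switches < 1 or fan_out < 0:
        raise ValueError("need at least one core switch and a non-negative fan-out")
    p_call_avoids = (core_switches - 1) / core_switches
    return 1.0 - p_call_avoids ** fan_out


def p_some_switch_gray(core_switches: int, p_gray_per_switch: float) -> float:
    """Probability that at least one core switch is in a gray-failure state.

    Assumption: switches fail independently with the same (hypothetical) probability.
    """
    if not 0.0 <= p_gray_per_switch <= 1.0:
        raise ValueError("p_gray_per_switch must be a probability")
    return 1.0 - (1.0 - p_gray_per_switch) ** core_switches


if __name__ == "__main__":
    # 10 core switches, as in the example above; fan-out grows toward 1,000.
    for fan_out in (1, 10, 100, 1000):
        p = p_request_crosses_switch(core_switches=10, fan_out=fan_out)
        print(f"fan-out {fan_out:>4}: P(request touches the bad switch) = {p:.6f}")

    # Hypothetical 1% chance that any given switch is gray at a given moment.
    for n in (10, 20, 40):
        print(f"{n} switches: P(some switch is gray) = {p_some_switch_gray(n, 0.01):.4f}")
```

With 10 switches and a fan-out of 1,000, the first probability is indistinguishable from 1 in floating point, which is the “almost 100%” above. The second function shows the other half of the argument: under the same independence assumption, adding switches raises the odds that at least one of them is misbehaving.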


The StorageMojo take
The paper is a highly recommended read if you architect for or rely upon one of the major cloud vendors, especially if your main focus is software. While human errors are a major cause of cloud outages, the authors make the point that undetected gray failures tend to accumulate over time, stressing the healthy infrastructure, and can lead to cascading failures and a major outage.

As anyone experienced with hardware can tell you, gray failures are regrettably common, and a total bear to diagnose and correct. The late, great Jim Gray popularized the term Heisenbugs to describe them, because, like quantum particles, they behave differently when you try to observe them.

The bigger lesson of the paper, though, is that scale changes everything. Even the kinds of bugs that can take a 100,000-server system down.

Courteous comments welcome, of course. If you’re a cloud user, have you seen behavior that gray failures might explain? Please comment!


Hike blogging: 07-17-2017

by Robin Harris on Monday, 17 July, 2017

Hike blogging has been on hiatus for several reasons, including no good pictures, packing up for a short move, too much rain – it’s monsoon time now – and I’ve been getting back to biking as well.

But this morning I got out at 6:30 onto the Twin Buttes/Hog Heaven/Hog Wash loop. It’s about 4.5 miles, with about 370 feet of vertical.

The Hog Heaven portion is a double black diamond mountain bike trail. Given that I find it a little hairy on foot, I can’t imagine how skilled – or crazy – you have to be to bike it.

But the views were fabulous in the early morning light. Here’s one:


The StorageMojo take
Let me know if you come to town. It’s a beautiful place and well worth a visit. Happy to recommend hikes and places to go in town for food, wine, music, and art.

Courteous comments welcome, of course.


Flash Memory Summit next month

by Robin Harris on Monday, 17 July, 2017

StorageMojo’s crack analyst team will be attending next month’s Flash Memory Summit. The dates are August 8-10, at the Santa Clara Convention Center.

Wasn’t able to attend last year, but the 2015 summit was the best storage show I’d seen in years. Flash is where the action is, with NVRAM coming along as well.

I’ve got a couple of meetings scheduled, but if your company is doing something early stage, I’d like to talk to you. Comment below to set up a meeting. I won’t publish invites.

The StorageMojo take
With flash products moving into maturity, StorageMojo is really interested in NVRAM technologies and in how they are affecting system architectures. Especially interested in emerging concepts.

Courteous comments welcome, of course.


The moving target problem

July 11, 2017

With the news that Toshiba has developed 3D quad-level cell flash with 768Gb die capacity, I’m reminded of the moving target problem. This is a problem whenever a new technology seeks to carve out a piece of an existing technology’s market. Typically, a startup seeks funding based on producing a competitive product in, say, two […]


Why startups fail

June 21, 2017

A great piece at CB Insights. They collected the failure stories of 101 startups and then broke those failures into 20 categories. Spoiler alert! Here are the top 10 reasons for failure, as compiled by CB Insights. What I find interesting is that 8 of the top 10 reasons are marketing related. No market need. […]


A transaction processing system for NVRAM

June 19, 2017

Adapting to NVRAM is going to be a lengthy process. This was pointed out by a recent paper. More on that later. Thankfully, Intel wildly pre-announced 3D XPoint. That has spurred OS and application vendors to consider how it might affect their products. As we saw with the adoption of SSDs, it takes time to […]


A distributed fabric for rack scale computing

June 12, 2017

After years of skepticism about rack scale design (RSD), StorageMojo is coming around to the idea that it could work. It’s still a lab project, but researchers are making serious progress on the architectural issues. For example, in a recent paper, XFabric: A Reconfigurable In-Rack Network for Rack-Scale Computers, Microsoft researchers Sergey Legtchenko, Nicholas Chen, […]


Infinidat sweetens All Flash Array Challenge

June 6, 2017

In response to yesterday’s StorageMojo post on Infinidat, Brian Carmody of Infinidat tweeted: Robin, Verde Valley is a great organization. @INFINIDAT will donate $10K for every Infinidat Challenge customer who mentions your blog post. — Brian Carmody (@initzero) June 5, 2017 Thanks, Brian! The StorageMojo take Verde Valley Sanctuary is a fine organization that StorageMojo […]


Infinidat’s sweet AFA challenge

June 5, 2017

StorageMojo has observed, many times, that great marketing of a mediocre product beats mediocre marketing of a great product all the time. Thus it is always of interest when someone comes up with an innovative marketing wrinkle. That’s what Infinidat has done with their Faster than all flash challenge. Their claim is that their system […]


Hike blogging: Devils Creek Road

June 3, 2017

Taking a vacation from the usual slog in NoAZ. I’m some 60 miles north of Seattle, working on my rain tan. The weatherman claims we’ll break 70 degrees sometime during my visit, but I’m not counting on it. Occasional patches of blue sky remind me of what is possible, if not likely. Took a 4.5 […]


Routing the I/O stack

May 30, 2017

Lots of energy around the concept of Rack Scale Design (Intel’s nomenclature) in systems design these days. Instead of depositing a CPU, memory, I/O, and storage on a single motherboard, why not have a rack of each, interconnected over a high-bandwidth, low-latency network – PCIe is favored today – and use software to define bundles […]


Liqid’s composable infrastructure

May 8, 2017

The technology wheel is turning again. Yesterday it was converged and hyperconverged infrastructure. Tomorrow it’s composable infrastructure. Check out Liqid, a software-and-some-hardware company that I met at NAB. The software – Element – enables you to configure custom servers from hardware pools of compute, network, and, of course, storage. I met Liqid co-founder Sumit Puri […]
