Automating remote system support

by Robin Harris | Monday, August 13, 2012 | Architecture, Backup, Enterprise, Management | 11 comments

“Call home” support has been standard in large arrays for 15 years. But Nimble Storage has kicked it up a notch with their advanced telemetry data from installed systems. It gives new meaning to the term “after-sale support.”

Talk to me
Their system gathers configuration details and more. Feature use, such as snapshots and backups. Volume protection. Application performance. Updated every 10 minutes.

Here’s a partial screen shot of representative data:

Nimble now has over 4TB of customer use data. Customers opt in to the program. Over 80% have.

Uses
When a problem is detected, Nimble’s software creates a trouble ticket. Then a human gets involved.

It might be as simple as having more Ethernet links on one of the active-passive storage controllers, causing asymmetrical performance. Perhaps a volume is not protected. Or the replication policy won’t meet RPO objectives.

Email alerts flag issues to customers. Nimble support engineers can log in remotely for real-time troubleshooting.
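
To make the kinds of checks described above concrete, here is a minimal sketch in Python of rule-based detection that turns a telemetry report into trouble tickets for a human to review. Nimble's actual software is not described in any detail here, so every name in it (`Telemetry`, `find_issues`, `open_tickets`, and all of the field names) is a hypothetical stand-in, and the three rules simply mirror the three examples given: uneven Ethernet links, an unprotected volume, and a replication policy that misses the RPO.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Telemetry:
    """One hypothetical telemetry report from an array (all fields are assumptions)."""
    array_id: str
    controller_links: Dict[str, int]            # Ethernet links per controller
    unprotected_volumes: List[str] = field(default_factory=list)
    replication_interval_min: int = 0           # minutes between replications
    rpo_target_min: int = 0                     # recovery point objective, minutes


def find_issues(report: Telemetry) -> List[str]:
    """Apply the three illustrative rules and return a description of each hit."""
    issues = []
    # Rule 1: controllers with different link counts can perform asymmetrically.
    if len(set(report.controller_links.values())) > 1:
        issues.append("controllers have unequal Ethernet link counts")
    # Rule 2: every volume without protection is reported individually.
    for volume in report.unprotected_volumes:
        issues.append(f"volume {volume} is not protected")
    # Rule 3: replicating less often than the RPO allows cannot meet the RPO.
    if report.rpo_target_min and report.replication_interval_min > report.rpo_target_min:
        issues.append("replication schedule cannot meet the RPO")
    return issues


def open_tickets(report: Telemetry) -> List[dict]:
    """Create one ticket per detected issue; a human picks it up from there."""
    return [
        {"array": report.array_id, "issue": issue, "status": "needs human review"}
        for issue in find_issues(report)
    ]
```

Calling `open_tickets(Telemetry("array-1", {"A": 4, "B": 2}, ["vol7"], 120, 60))` would produce three tickets, one per rule. The point of the sketch is only the shape of the flow: automated rules find the problem, and a person takes over once the ticket exists.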

But that’s not all!
Detailed information on usage allows customers to compare their usage to average usage. Nimble can also look at how customers with the most efficient utilization manage their systems, automating the documentation of best practices.

For example, backup: most customers are retaining snapshots for more than a month. Over 50% of customers replicate workloads for DR.

The StorageMojo take
This is what 21st century support should look like. The best infrastructure is invisible – until it breaks – and the best support keeps the infrastructure from breaking.

The bad news: customers don’t want to buy and manage storage arrays. The good news: they want fast and reliable access to their data.

The Apple model of “it just works” breaks down if the applications are too complex, as many data center applications are. But that doesn’t require vendors to throw all the load on customers.

Automating the capture, review and disposition of system data gives a vendor important advantages:

Perceived reliability goes up, a fact established with early phone-home experience.
A stronger customer relationship makes follow-on sales easier and is a competitive barrier.
The “virtual user group” of shared data enables users to get smarter, faster using their Nimble arrays.
The real-time remote troubleshooting gives customers help when they need it most – not 4 hours later.

Courteous comments welcome, of course. What other support strategies have you experienced that either worked well or didn’t? If you want to learn more about Nimble, I did a video white paper on Nimble’s architecture last year.

11 Comments

Gordon Hadfield on Monday, 13 August, 2012 at 6:12 pm

There are a few start-up storage companies at the moment (Pure Storage, Tintri, Tegile, Nimble Storage to name a few). They all provide “disruptive” storage solutions with pricing that undercuts the incumbents.

I think it’s great to see a maturity developing from amongst these start-ups, beyond the “Gee Whiz” technology story, beyond the sale, to look at what customers require on an ongoing basis.

Interesting Times for the incumbents. Will they wait to see if these up-starts fail (hopefully not) or start buying them out?
Jacob Marley on Monday, 13 August, 2012 at 7:41 pm

I was told this Nimble feature uploads twice a day, not real time.
Robin Harris on Tuesday, 14 August, 2012 at 1:29 am

Jacob, ~~I believe you are correct~~ an engineer from Nimble sets us both straight. It is the remote login by a support engineer that is real time.

Robin
Rod Bagg on Tuesday, 14 August, 2012 at 9:03 am

We do in fact receive certain configuration, HW and SW health-check information as well as capacity and performance metrics every 5-7 minutes from the array. This all feeds into the same back-end mechanisms that drive our monitoring and automation.
Dan Leary on Tuesday, 14 August, 2012 at 10:11 am

Robin – great overview! One minor note: actually over 90% of our customers opt in to the automated support program (we call it “proactive wellness”), given all the benefits.
Random Storage Admin on Tuesday, 14 August, 2012 at 10:14 am

NetApp does this already. I wouldn’t classify this as disruptive.
Robin Harris on Tuesday, 14 August, 2012 at 10:39 am

Random, NetApp says:

To optimize your data center, you need a simple and effective method to proactively monitor and manage your storage infrastructure. Each of the AutoSupport family components complements one another by adding additional layers of proactive and preventative support coverage for your storage infrastructure.

AutoSupport – AutoSupport is an integrated and efficient monitoring and reporting technology that constantly checks the health of AutoSupport enabled NetApp systems. It’s one of the most important and effective troubleshooting tools for our customers and NetApp support.

My AutoSupport – My AutoSupport is a web based application that works in conjunction with AutoSupport to provide customers with information and tools designed to analyze, model, and optimize their storage infrastructure. My AutoSupport improves self-service support and operational efficiency of your NetApp systems. Our new mobile support app provides predictive and personalized support at your fingertips.

Remote Support Diagnostics Tool (RSDT) – RSDT helps NetApp Support solve storage system issues without the need for customer staff intervention. Remote support automation also enables faster case resolution and helps minimize system downtime.

Maybe they do it and don’t say it, but it looks like they focus on system health and do not handle configuration issues like the MPIO problem. Also giving a customer tools to do their own modeling is hardly the same as the support organization looking at it themselves. In my experience it is a rare customer who actually tries to model their infrastructure. Instead they use a combination of rule of thumb and squeaky wheel lube in practice.

With Nimble’s focus on small and medium enterprises they are disruptive for existing vendors in that market.
Lee Johns on Tuesday, 14 August, 2012 at 3:47 pm

At Starboard Storage Systems we also use remote login to the customer’s system as a part of our phone home process. It is real time support like this that really makes the customer feel valued.
Darius Zanganeh on Tuesday, 14 August, 2012 at 6:01 pm

Interesting, I would argue that it doesn’t compare to the DTrace-based analytics on the Oracle ZFS Storage Appliance. They offer realtime information and troubleshooting right to the storage admin without having to log in to another site or get someone from support on the phone. They also of course have phone home and automatically create tickets and such. See a small demo video I made here. https://blogs.oracle.com/si/entry/oracle_zfssa_hybrid_storage_pool1 . The video covers more of the hybrid storage pool design than the actual detailed level of analytics. But you do get a quick glimpse of some of the questions you can ask both realtime and historical. Such as tell me how many IOPS an individual VM is getting? What is the read/write ratio of that VM? What is the latency of that VM? What is the latency of a particular ESX server or Oracle DB server? How many IOPS of this VM are coming from cache… and it goes on and on and on. True power to make intelligent decisions about your storage. Robin, I would be happy to give you a tour of the zfssa analytics anytime.
Dmitriy Sandler on Thursday, 16 August, 2012 at 8:23 am

Darius,
Those are two very different things. What you are talking about is reporting capabilities around what is happening to the various data components (volumes, VMs, NICs, etc.) from a performance perspective. Nimble Storage does that as well. What this post was about is the extension of that into a fully proactive monitoring and support infrastructure to take things to the next level. Reporting performance is definitely a great capability, which is why Nimble has that built right into the interface. But having support be notified of a potential problem before it even arises, and have support automatically send out a detailed resolution is unparalleled. And that’s only the beginning…
Ed on Thursday, 23 August, 2012 at 8:17 am

Hmm EMC arrays used to do this and it was part of the support contract requirements back in the early days of EMC. A modem was required by some support contracts. EMC Engineers would occasionally show up with parts at the site, and customers didn’t even know there were issues yet. Not sure how this is disruptive though. Hitachi, NetApp, EMC, and others have had failure-only services like this for quite some time. Even third party (well sort-of) providers like Vion do this today already.

Analytics is an entirely different ballgame. Bring on the ‘fishworks’ style ZFS tools in other arrays industry… PLEASE!