SSDs and the TPC-C top 10

by Robin Harris on Thursday, 19 January, 2012

If SSDs are so great, shouldn’t we see the results in TPC-C benchmarks? They are, and we do.

But there are some surprises.

Cost
A look at the TPC-C top 10 performance results shows the dramatic impact SSDs have had on the cost per tpmC (transactions per minute, the TPC-C throughput metric).

  • There are no top-10 disk-only results after 2009.
  • The most expensive top-10 SSD result is some 15% cheaper than the least expensive disk-based result – and the other SSD results cost much less.
  • No top-10 results were posted during 2009 – the depth of the Great Recession.

Capacity
The conventional wisdom has it that disks must be way over-configured to get enough IOPS. You’d expect to see disk solutions have a lot more capacity than SSD solutions in top-10 results.

But we don’t:


The highest capacity – 1760 TB – is for an Oracle SSD-based solution. Yet the lowest capacity solution – 83 TB – is also SSD-based and is also the cheapest per tpmC.

Are we seeing issues with the rest of the infrastructure?

The StorageMojo take
I’ll be taking a deeper dive into the data, but perceptions may be at odds with what this limited set of performance-focused benchmarks is showing us.

Readers: what do you think?

Courteous comments welcome, of course. Events beyond my control have reduced StorageMojo’s usual posting frequency. Hope to get things back to normal over the next several weeks.


Gridstore snags Geoff Barrall

by Robin Harris on Tuesday, 10 January, 2012

BlueArc and Drobo founder Geoff Barrall has a new perch: Gridstore, one of the companies I’ve been following for almost 3 years. Geoff is the new executive chairman. Formal announcement is expected this week.

Gridstore’s concept is a low-cost scale-out NAS appliance designed for office environments. Each box is a small, low-power node with a couple of TB. Stack ‘em for as much redundancy, capacity and performance as you want.

Think of it as the consumerization of hyper-scale technology. Nutanix writ small.

Gridstore details
Gridstore is offering a low-cost, scale-out network file server for $500 a node. That is too cheap for the enterprise storage companies to sell directly.

Founded 5 years ago, Gridstore got a beta out in 2010 and has been shipping for well over a year. The product is a Microsoft CIFS protocol file server, using Microsoft’s storage server software. Running on small, 25 watt Atom-based boxes, a 6 node configuration is the size of a bread box.


Like other scale-out NAS systems, the Gridstore NAS has no single point of failure and can survive multiple node failures without going down or losing data.

They call their redundancy scheme RAIDg. When you set up a volume you dial in how many faults you want to survive and the software handles the rest.

Today the number of faults they can handle is limited to half the number of nodes minus one. If you have a 6 node configuration it can handle the loss of 2 nodes. They expect to relax that requirement in the future.
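
For the arithmetic-minded, the current limit works out as below. This is only a sketch of the rule as stated above: the function name is made up for illustration, and rounding down for odd node counts is my assumption, not something Gridstore has specified.

```python
def max_tolerated_faults(nodes):
    """Most node failures a volume can be set to survive: half the nodes, minus one.

    Rounding down for odd node counts is an assumption, not a vendor statement.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return max(nodes // 2 - 1, 0)


for nodes in (2, 4, 6, 8, 12):
    print(f"{nodes} nodes -> up to {max_tolerated_faults(nodes)} faults")
```

Under this rule a 2 node setup tolerates no node loss at all, which is one reason the vendor's plan to relax the limit matters for small offices.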

The StorageMojo take
Haven’t spoken to Geoff about this, but Gridstore seems like a natural for him. If there’s a theme to his many endeavors, it’s making advanced NAS technology more accessible.

Gridstore fits the bill nicely. If there’s one complaint about Drobo, it’s the lack of box-level redundancy. Gridstore answers this objection, at a higher price point.

Drobo – over 200,000 units sold – has blazed a trail for bringing advanced storage technology to the masses at affordable prices. They may be the first, but as Gridstore and others demonstrate, they won’t be the last.

Courteous comments welcome, of course. Hoping to make it to CES later this week. Readers: anyone I should make a point to see?


Learning from customers

by Robin Harris on Wednesday, 7 December, 2011

EMC’s Chuck Hollis blogged about The Vendor Beating a couple of months ago. The unspoken question in the post is “how do we understand what customers are telling us?”

He writes

As an employee of a large IT vendor, I’ve been at the receiving end of a reasonable number of vendor beatings.

Occasionally it’s richly deserved. But, sometimes, it’s masking a deeper set of issues that have very little to do with any vendor whatsoever.

Unhappy customers, like unhappy families, are all unhappy in their own way. This customer appeared to be overstaffed, under-skilled and poorly managed.

Interpretation
Interpreting customer complaints and behavior is hard. When companies can’t decipher what customers want – which is usually what the company isn’t selling – it is easy and dangerous to tune them out.

Customers can tell you things about your company and products that you can’t directly discover for yourself, but what customers say may be different from what they think. And both are influenced by the customer’s context, which can include company politics, prior vendor experiences, knowledge deficits and employee level.

Diagnosis
Steve Jobs once said that customers don’t know what they want until you show it to them. Customers know what would improve the current product in the current use case, but they can’t imagine bringing multiple novel technologies to bear on a much broader problem.

Tablet computers flopped for years until the iPad crystalized the market. Everyone saw the tablet problems: thick; heavy; slow; clunky UI; poor battery life; and, thanks to low volumes, cost. Incremental improvements – faster processors, more RAM, larger disks – didn’t help.

Tablets required a deep rethinking and application of several novel technologies – flash, gestures, CNC case milling, an app store and an energy-efficient OS – to create a compelling user experience.

The iPad illustrates the problem of listening to customers: they describe symptoms and suggest fixes, but can’t articulate the underlying problem: how the use case differs from desktop and notebook PCs. That requires an act of imagination, not transcription.

The StorageMojo take
In Chuck’s post an EMC presales engineer identified the root cause of the customer’s pain:

. . . the database environment had grown willy-nilly over the years — it wasn’t laid out well, the queries weren’t particularly well written, and so on.

Sure, there were things we could do on the storage side (e.g. faster storage, better layouts, etc.), but it was a bigger issue than just storage performance.

But the larger question is: with high-speed and high-capacity SSDs, why isn’t this customer moving to an infrastructure that doesn’t need this fancy tuning? EMC can’t manage the fight between DBAs and storage admins, but they could be making it less contentious.

From within the EMC ecosystem the solution is clear: more training, professional services and faster gear. But from the outside the question is: who is building “it just works” high performance storage?

Courteous comments welcome, of course. I admire Tucci’s innovative EMC business model: outbid everyone else for chasm-crossing companies; give them global distribution and support; and watch the bucks roll in. It may not be innovative technically but it is innovative.


How fault tolerant are SANs?

by Robin Harris on Monday, 7 November, 2011

Reader Kyle asks a good question:

SANs are advertised up the wazoo as having lots of internal redundancy such as redundant power, redundant controllers, etc. I’ve spent enough time with redundancy to know that having two pieces of hardware often doesn’t cut it. I was wondering what the real story is from someone who has spent a lot of time in the storage space. Do complete SAN failures really pretty much *never* happen or are they just on the rare side? If so what are the common points of failure? Perhaps people, the OS, non-redundant hardware parts?

Please, SAN folks, tell StorageMojo readers your experience. In the meantime, here’s

The StorageMojo take
Kyle asks 2 questions: how reliable and available are the individual devices that make up a SAN and how reliable and available is the system – the SAN as a whole.

Redundancy is aimed at ensuring availability. But redundancy means more components, so you also see more component failures.

Failures of redundant components shouldn’t affect availability – assuming, that is, that failures are not correlated. That assumption turned out not to be true of RAID arrays, making them less available than advertised.
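
A back-of-the-envelope sketch shows why that assumption matters. The availability figures are illustrative placeholders rather than measurements, the function names are mine, and the common-cause term is a deliberately crude stand-in for correlated failures such as a shared firmware bug.

```python
def redundant_availability(component_availability, copies):
    """Availability of `copies` interchangeable components, assuming independent failures."""
    return 1.0 - (1.0 - component_availability) ** copies


def with_common_cause(component_availability, copies, common_cause_unavailability):
    """Same group, but a shared fault takes out every copy at once for some fraction of time."""
    independent = redundant_availability(component_availability, copies)
    return (1.0 - common_cause_unavailability) * independent


single = 0.99  # illustrative: one controller that is up 99% of the time
pair = redundant_availability(single, 2)
pair_correlated = with_common_cause(single, 2, 0.001)  # illustrative shared-fault rate

print(f"one controller:             {single:.4%}")
print(f"two, independent failures:  {pair:.4%}")
print(f"two, with a shared fault:   {pair_correlated:.4%}")
```

With these made-up inputs the shared fault, not the second controller, sets the ceiling on availability: adding a third copy would barely move the last number.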

How much redundancy is enough? Customers generally prefer triple redundancy if they can afford it, partly for availability and partly for performance: losing ⅓ of system performance is less painful than losing ½. But for the moonshots, NASA chose quintuple redundancy on critical systems.

Yet I’d guess that most are more concerned about SAN system availability – which includes not only what we consider SAN hardware, but also the server-side HBAs, drivers and management software. It is here that the nastiest bugs lurk: untestable interactions between applications, drivers, firmware and architecture that bite us hard – and crash entire SANs.

But what is your experience, gentle reader? Many of us would like to know.

Courteous comments welcome, of course. Update: Bayesian analysis is the best tool to evaluate system-level availability, as noted in this StorageMojo video. Sadly, the tool referred to is no longer online. Anyone want to take a whack at a new one?


Ask StorageMojo: 80,000 mailboxes need help

by Robin Harris on Wednesday, 2 November, 2011

A StorageMojo reader has a problem. Can you help?

Our mail hub (80,000+ mailboxes) is virtualized with vSphere 4.1 with Red Hat Enterprise Linux 5 x64 and Dovecot 2.0 [an open source IMAP/POP3 email server for Linux/UNIX-like systems]. We are using HP LeftHand Networks P4300 iSCSI storage in a “network RAID10 setup of RAID10 storage” for Dovecot indexes and multiple “network RAID1 of RAID5 storage” for actual mailboxes.

This is my take: our Dovecot indexes are getting hammered with lots of small I/O requests, about 8,000 IOPS continuous during 8-working-hour days, 75% write. Indexes are fairly small (50 GB) and expected to grow to 100-150 GB, but need a lot of random I/O. We need real-time replication in storage (LeftHand is ok for us) and we think that SSD should shine in this situation. Bandwidth is not a problem (200-300 megabits of index traffic), but we need more IOPS.

The problem is the indexes, but our total mailbox capacity is expected to grow to 6 TB compressed using zlib compression in Dovecot.

We want to buy a storage appliance with the following requirements:

  • vSphere 4.1 & 5 certified storage, VAAI enabled (if possible)
  • iSCSI (1 Gbps)
  • High number of IOPS (at least 12,000+, most of them writes)
  • Small size (200 GB)
  • Fault tolerant (RAID, battery-backed write cache, power supply, fans, multiple gigabit uplinks, synchronous replication)
  • Cheap (less than $30k the full setup)

We want to buy at the beginning of 2012. Any product that fits?

The StorageMojo take
I suspect price will be the most significant limiter. But the reader only needs index storage, not the whole shooting match. He’s pretty happy with LeftHand for mailbox storage.
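
A quick sizing of the index workload, using only the numbers in the reader's note. I'm assuming the 200-300 megabits is a per-second figure and that it all belongs to the 8,000 IOPS; the variable and function names are mine.

```python
IOPS_TODAY = 8_000
WRITE_FRACTION = 0.75
TARGET_IOPS = 12_000
TRAFFIC_MBIT_PER_S = (200, 300)  # assumed to be megabits per second


def average_io_bytes(mbit_per_s, iops):
    """Average bytes moved per I/O at a given traffic rate and I/O rate."""
    return mbit_per_s * 1_000_000 / 8 / iops


write_iops = IOPS_TODAY * WRITE_FRACTION
read_iops = IOPS_TODAY - write_iops
headroom = TARGET_IOPS / IOPS_TODAY
io_sizes = [average_io_bytes(rate, IOPS_TODAY) for rate in TRAFFIC_MBIT_PER_S]

print(f"writes/s: {write_iops:.0f}, reads/s: {read_iops:.0f}")
print(f"requested headroom over today's load: {headroom:.1f}x")
print(f"average I/O size: {io_sizes[0]:.0f} to {io_sizes[1]:.0f} bytes")
```

On those assumptions the average request is only a few KB, which is why bandwidth isn't the constraint and small random writes are.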

But if we can solve both problems for him, why not? If he should relax some constraint, feel free to suggest it.

He’ll be watching the comments, so if you have questions please ask them. I’ll be following the comments as well.

Courteous comments welcome, of course. His email was edited for clarity.


14 things to know about XIV

by Robin Harris on Tuesday, 1 November, 2011

It was almost 4 years ago that IBM bought XIV (See 2008: cluster storage goes mainstream). StorageMojo couldn’t understand IBM’s product positioning – yeah, the world was clamoring for a block device for multi-media – but liked the architecture.

Now XIV appears to be making good on its early promise. Here are XIV bullet points to consider.

  1. Is the XIV interface the “most elegantly simple” in the industry?
  2. Design goal: use large drives to produce the needed app performance.
  3. 3rd gen version – now with InfiniBand! – announced.
  4. Supports 3TB SAS drives.
  5. Data divided into 1MB chunks and auto distributed across all disks pseudo-randomly.
  6. InfiniBand boosts performance ≈4x, making it competitive with Big Iron arrays at lower cost.
  7. Full system is 180 drives.
  8. Gen3 3TB drive data rebuild takes 54 minutes.
  9. Failing drives are rebuilt from data copies on other drives.
  10. Minimum LUN size is 17GB.
  11. Thin provisioning is standard.
  12. With InfiniBand, XIV’s latency and bandwidth are on a par with IBM’s Big Iron DS8000 series.
  13. Fidelity Investments replaced a 240 drive – all 15k – Big Iron array with a 180 drive XIV and found it was 60% faster.
  14. Fidelity now has 86 XIV frames installed.
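
Points 5, 8 and 9 fit together, and a toy model shows how. Nothing cited here spells out IBM's placement algorithm, so the hash-based scheme below is a stand-in of my own, as are the function names and the two-copy assumption. The rebuild arithmetic assumes the failed 3TB drive was full, decimal units, and that all 179 survivors pitch in equally.

```python
import hashlib
from collections import Counter

DRIVES = 180  # full system, per point 7


def place_chunk(volume, chunk_index, drives=DRIVES):
    """Return (primary, mirror) drive numbers for one chunk; the two always differ."""
    digest = hashlib.sha256(f"{volume}:{chunk_index}".encode()).digest()
    primary = int.from_bytes(digest[:8], "big") % drives
    offset = 1 + int.from_bytes(digest[8:16], "big") % (drives - 1)
    return primary, (primary + offset) % drives


def rebuild_sources(failed_drive, chunks, volume="vol0"):
    """Count, per surviving drive, the chunk copies it supplies to rebuild `failed_drive`."""
    sources = Counter()
    for index in range(chunks):
        primary, mirror = place_chunk(volume, index)
        if primary == failed_drive:
            sources[mirror] += 1
        elif mirror == failed_drive:
            sources[primary] += 1
    return sources


def rebuild_rate_bytes_per_s(drive_bytes, minutes):
    """Aggregate rate needed to re-create a full drive in the given time."""
    return drive_bytes / (minutes * 60)


sources = rebuild_sources(failed_drive=0, chunks=100_000)
aggregate = rebuild_rate_bytes_per_s(3 * 10**12, 54)
per_drive = aggregate / (DRIVES - 1)

print(f"{len(sources)} surviving drives hold copies of the failed drive's chunks")
print(f"aggregate rebuild rate: {aggregate / 1e6:.0f} MB/s")
print(f"per surviving drive:    {per_drive / 1e6:.1f} MB/s")
```

The point of the sketch: because copies are scattered, nearly every surviving drive contributes a little, so a 54 minute rebuild asks only a few MB/s of each one.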

Is this the most elegant management interface in storage?

The StorageMojo take
The XIV architecture was its first strong point. And now it looks like commodity hardware has caught up with the architecture.

IBM likes the XIV management interface so much that they are standardizing on it. It is the best looking storage GUI I’ve seen – much better than the 1-step-up-from-Excel look of most – and I’d like to hear how it works for large shops.

I expect that a lot of shops that are using EMC and HDS arrays would be surprised at how far XIV has come. I was.

Courteous comments welcome, of course. IBM brought me out to a well done customer event in California and put me up for a couple of nights while schmoozing me. Other than that no lucre changed hands.

Love to see some comments from XIV users. How’s it working for you?


The network is choking our storage

by Robin Harris on Thursday, 20 October, 2011

Amazon Web Services architect James Hamilton has been posting on network issues for over a year and researching them much longer. As Ethernet becomes the de facto SAN technology, his views become more relevant to the larger storage market.

Critique
Part of Mr. Hamilton’s concern is the structure of the networking industry: the high margins; the dominance of a single player, Cisco; the closed technology; and the heavy vertical integration. All antithetical to the dynamics that have driven server costs down so successfully in the last 20 years.

These are issues the storage industry knows too well. But Mr. Hamilton is more concerned about the waste the current high-cost industry structure causes.

Waste?

Workload placement
The cost of network bandwidth leads to network over-subscription. Networks are configured as tree topologies: the further you move from end nodes, the worse the over-subscription.

As described in the 2009 Microsoft Research paper VL2: A Scalable and Flexible Data Center Network:

. . . the capacity between different branches of the tree is typically over-subscribed by factors of 1:5 or more, with paths through the highest levels of the tree oversubscribed by factors of 1:80 to 1:240. This limits communication between servers to the point that it fragments the server pool – congestion and computation hot-spots are prevalent even when spare capacity is available elsewhere.

This throttles data center performance by limiting server-to-server bandwidth, fragmenting resources and reducing network utilization. The latter reflects the redundant paths needed in case of switch failure: ≈50% or more of costly data center bandwidth goes unused.
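
To put those ratios in per-server terms, here is a small sketch. The 1 Gbps server link is my assumption for illustration, and the model is the simplest possible one: every server under an over-subscribed tier transmits across it at once.

```python
def worst_case_mbps(link_mbps, oversubscription_factor):
    """Per-server share of bandwidth when every server sends across the over-subscribed tier."""
    return link_mbps / oversubscription_factor


LINK_MBPS = 1_000  # assumed 1 Gbps server NIC

for factor in (1, 5, 80, 240):
    share = worst_case_mbps(LINK_MBPS, factor)
    print(f"1:{factor:<3} -> {share:7.1f} Mbps per server")
```

At the top of the tree a gigabit-attached server is down to single-digit or low double-digit megabits in the worst case, which is why workload placement ends up dictated by the network.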

As might be expected, big Internet data centers like Amazon’s have complex and unpredictable workloads. They need lots of bandwidth between all servers all the time.

A solution
The VL2 paper describes an experimental solution to these problems that includes location-specific and application-specific addressing, multi-path traffic load balancing and a novel directory design that efficiently handles lookups and updates to network mappings.

In a 75-node test cluster the design moved 2.75TB of data in 395 seconds – 94% of maximum network bandwidth – at a fraction of the cost of current enterprise networks. The paper calculates that a cloud-service scale network with no over-subscription could be built with commodity switches at 1/14th the cost of a traditional data center Ethernet.

Whoa!

The StorageMojo take
VC and engineering dollars follow high-growth markets. What Google, Amazon and Microsoft want, they get. With the rapid growth of public cloud services the network over-subscription problem will get solved.

Merchant silicon from Broadcom, Intel and Marvell is making a tried-and-true Moore’s Law attack on hardware cost. The protocol stack is tougher, but several open-source industry initiatives are under way with support from major companies. Progress will be slower than hoped, but within 3 years we’ll have a viable stack to build on.

Where does this leave the networking industry? That depends on where you sit.

Cisco will be the biggest loser, because they’ve been the biggest winner with the current model. They may need to pull an IBM and move big into services if they want to stick around. Ironically, Cisco’s UCS product line – which bakes in the tree-structured network – has further motivated broader industry action.

The rest of the industry can go after this emerging market with a lower-GM business model. Not all of them will, but it will be a critical success factor.

The big winner will be storage. Scale-out storage relies on spraying data across multiple racks for maximum availability, utilization and performance. Cheaper, faster, better scale-out networks will only drive storage demand.

For most of us this is an academic problem today. Lightly used systems – such as for backup and archiving – don’t see Amazon’s problems. But in 5 years this will be common even outside the public cloud providers.

Just as IT users have benefited from Google’s push on energy efficiency and much more, they will also benefit from much lower cost and more scalable networks.

Courteous comments welcome, of course. I can’t help but continue to marvel at how dumb Cisco’s UCS has turned out to be. It’s a gift that keeps on giving.


RAMCloud is the new flash

by Robin Harris on Wednesday, 5 October, 2011

Sometimes in the midst of the endless tweaking needed to maximize storage performance one just wants to say “screw it! Put everything in RAM!” And that’s just what RAMCloud does.

Disk is the new tape, flash the new disk, DRAM the new flash.
RAMCloud is a research paper (pdf) and an open software project. The goal is enterprise-class availability with every bit of active data stored in DRAM, not disk or flash, for maximum performance. It is a key-value object store today, though as pure software that could change.

It’s the brainchild of John Ousterhout, a Stanford prof who invented Tcl back in the 80s at Berkeley.

Isn’t DRAM volatile and costly?
Right on both counts, grasshopper, so RAMCloud isn’t a 1 for 1 disk-style architecture. No Google FS-style triple replication here, or RAID-style erasure coding.

Instead RAMCloud uses buffered logging:

. . . a single copy of each object is stored in DRAM of a primary server and copies are kept on the disks of two or more backup servers; each server acts as both primary and backup. However, the disk copies are not updated synchronously during write operations. Instead, the primary server updates its DRAM and forwards log entries to the backup servers, where they are stored temporarily in DRAM.

Instead of working around crashes – using multiple object copies as scale-out storage does – RAMCloud rebuilds lost data at high speed from the logs held in backup DRAM or on disk. That’s possible because all the log data is in DRAM or spread across many disks.
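
The write path and the recovery path can be sketched in a few lines. This is a toy model of the idea in the quote, not RAMCloud's code: the class and method names are invented here, and segments, checksums and parallel recovery are all left out.

```python
class Backup:
    """Holds another server's log: recent entries in DRAM, older ones on disk."""

    def __init__(self):
        self.buffer = []  # log entries buffered in DRAM
        self.disk = []    # log entries already flushed to disk

    def append(self, entry):
        self.buffer.append(entry)

    def flush(self):
        """Move buffered entries to disk later, off the write path."""
        self.disk.extend(self.buffer)
        self.buffer.clear()

    def log(self):
        return self.disk + self.buffer


class Primary:
    """Keeps the single live copy of each object in DRAM."""

    def __init__(self, backups):
        self.dram = {}
        self.backups = backups

    def write(self, key, value):
        self.dram[key] = value
        for backup in self.backups:
            backup.append((key, value))  # forwarded to backup DRAM, no disk I/O here


def recover(backup):
    """Rebuild a crashed primary's objects by replaying one backup's log in order."""
    table = {}
    for key, value in backup.log():
        table[key] = value
    return table


backups = [Backup(), Backup()]
primary = Primary(backups)
primary.write("a", 1)
primary.write("b", 2)
backups[0].flush()
primary.write("a", 3)

recovered = recover(backups[0])
print(recovered == primary.dram)
```

Replaying the log in order means the latest write to each key wins, whether the entry was still buffered or already on disk when the primary died.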

In a recent paper (Fast Crash Recovery in RAMCloud) (pdf) Diego Ongaro, Stephen M. Rumble, Ryan Stutsman, John Ousterhout, and Mendel Rosenblum (co-founder of VMware) go into more detail on this critical feature.

The key elements are:

  • Scale. Servers scatter their backup data across all other servers so thousands of disks can serve the recovery.
  • Log-structure. Reduces complexity and offers high performance.
  • Randomization. Many decisions need to be made in a large cluster. Rather than relying on deterministic choices that consume CPU, time and bandwidth, RAMCloud injects randomization to speed decisions with less overhead.
  • Dynamic tablets. The key-value store tracks resource usage within a single table and ensures that no single partition is too large for fast restores.

DRAM is volatile, so the log replication data is spread to other servers on other racks for redundancy before being committed to disk. Still, total system write throughput is limited by disk write speed, and disk limits are a key reason people are moving from disks. Flash drives may help, but other techniques, such as log truncation and sharding, make it possible to get good performance from several thousand SATA drives.

How good? The team reports that in a 60 node cluster they recover 35GB in 1.6 seconds. With more nodes larger partitions should be restored even faster. Scale is good.
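
The arithmetic behind that result, as a sketch. The 100 MB/s single-disk rate is my assumption for comparison, not a figure from the paper, and GB is taken as 10^9 bytes.

```python
RECOVERED_GB = 35
RECOVERY_SECONDS = 1.6
NODES = 60
SINGLE_DISK_MB_PER_S = 100  # assumed sequential rate of one SATA disk

aggregate_gb_per_s = RECOVERED_GB / RECOVERY_SECONDS
per_node_mb_per_s = aggregate_gb_per_s * 1_000 / NODES
single_disk_seconds = RECOVERED_GB * 1_000 / SINGLE_DISK_MB_PER_S

print(f"aggregate recovery rate: {aggregate_gb_per_s:.1f} GB/s")
print(f"per node:                {per_node_mb_per_s:.0f} MB/s")
print(f"one disk alone:          {single_disk_seconds:.0f} seconds")
```

Spreading the log is what turns a restore that would take minutes from one disk into one that finishes in under two seconds.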

Lights out!
Power failures wipe all the data in DRAM. The obvious defense is to avoid failures: combine battery backup with diesel generator sets. Power ride-through will handle interruptions into the hundreds of milliseconds.

But who is going to trust that? That’s why future commercial implementations will insist on logging to stable storage, such as flash SSDs.

They’re getting cheaper fast – faster than DRAM – which will make this a common approach.

Cost
Professor Ousterhout kindly sent a short note about cost, correctly noting that

. . . if you measure cost/operation, DRAM is roughly 100x cheaper than disk, since a disk can only perform about 100-200 operations/second. This is why RAMCloud makes sense for data-intensive applications. . . .

While you and I might find that persuasive, too many enterprises don’t. The deep conservatism of the storage culture – both figuratively and literally – makes cost a good excuse to stay with the tried and true, and easy to explain to CFOs.

The good news for the company I hope he is starting is that the primacy of $/GB is slowly eroding as customers see the system level savings from fast storage. SSD vendors and companies like TMS and Kaminario are breaking trail for RAMCloud.

The StorageMojo take
Make no mistake: RAMCloud is a research project, not a commercial product, years and million$ away from commercial application. But the concept is promising.

Imagine a world where data layout doesn’t matter, where apps are optimized for sub-millisecond storage, where 100 byte I/Os are faster and just as efficient as 8KB I/Os. The architectural implications are huge and would take a decade or more to absorb.

RAMCloud raises the thorny issue of tiering: getting hot data on the hot storage and everything else off to disk. There are OK answers for tiering but nothing insanely great.

RAMCloud shows we’re far from the end of the line in what storage can do. Faster, better, arguably cheaper: 2 out of 3 ain’t bad.

Courteous comments welcome, of course. A shorter version of this post appeared on ZDNet.
