Tintri responds on SSD arrays

by Robin Harris on Tuesday, 20 March, 2012

StorageMojo offered its soapbox to any vendors willing to weigh in on the question of whether enterprise arrays should be built from flash SSDs or not. Ed Lee, architect at Tintri, formerly of Data Domain and a Berkeley Ph.D, elected to respond. It is a long piece but rich in insight.

Tintri produces hybrid disk/flash SSD appliances optimized for virtual environments, not Symm-killers. They use SSDs in their products, as do other folks like Nimble Storage.

No money changed hands between Tintri and StorageMojo or related entities. My accountant is weeping in the next room.

Begin Tintri’s response:

Outside the SSD Box: More than Faster Disk
Robin Harris of StorageMojo in his recent article, “Are SSD-based arrays a bad idea?”, and Matt Kixmoeller of Pure in his response, “The SSD is Key to Economic Flash Arrays”, present interesting perspectives on whether or not SSDs are the best technology for building flash-based arrays. Robin argues that by rethinking how flash can be packaged outside the SSD box, you can achieve better performance, reliability, cost and flexibility. And these observations are supported by the experience of existing flash-based storage vendors who have developed their own custom flash modules and packaging. Matt argues that SSDs provide an industry-standard product that requires less investment to leverage, better economies of scale, and rapid improvement in technology. These are also very valid points, especially for startups with limited time and capital.

Latency
Taking latency as a point for comparison, flash-based storage vendors using custom packaging often quote IO latencies in the tens of microseconds versus SSD latencies of low hundreds of microseconds. While this is a notable difference, software and interfaces can also add overhead and the final latency seen at the subsystem level may differ by only a factor of two to four. Server-side flash products can avoid more of the software and interface overhead and provide better latencies – but may require rewriting applications to capitalize on this advantage. Keep in mind that hard disk latencies can easily reach tens of milliseconds under even moderate load. ALL of these flash-based products have latencies that are hundreds of times faster than disk.

In short, most of the performance improvement comes from simply replacing hard disk with some form of flash. This immediately shifts the performance bottleneck from storage to some other component in your system. As a result, you won’t be able to take full advantage of flash performance without also optimizing the performance of the rest of your infrastructure, and ultimately rewriting your applications as well.

The above phenomenon explains why replacing your hard disk with flash often speeds up your applications by only a factor of two to three rather than ten or a hundred. Congratulations! You’ve just moved the bottleneck from storage to some other component of your system. By Amdahl’s Law, further improving only storage performance has diminishing returns. So while custom packaging does provide significant advantages in latency, most applications are unlikely to benefit until the rest of the computing ecosystem is optimized to take full advantage of flash.
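The Amdahl's Law point can be made concrete with a few lines of arithmetic. This is only a sketch: the 60% figure for time spent waiting on storage is an assumed workload, not a measurement, and the function name is illustrative.

```python
def amdahl_speedup(storage_fraction: float, storage_speedup: float) -> float:
    """Overall application speedup when only the storage portion gets faster.

    storage_fraction: share of run time spent waiting on storage (0.0 to 1.0).
    storage_speedup:  how many times faster the storage becomes.
    """
    if not 0.0 <= storage_fraction <= 1.0:
        raise ValueError("storage_fraction must be between 0 and 1")
    if storage_speedup <= 0:
        raise ValueError("storage_speedup must be positive")
    return 1.0 / ((1.0 - storage_fraction) + storage_fraction / storage_speedup)


if __name__ == "__main__":
    # Assumption: the application spends 60% of its time waiting on storage.
    for s in (10, 100, 400):
        print(f"storage {s}x faster -> application {amdahl_speedup(0.6, s):.2f}x faster")
```

With that assumed 60% share, the overall speedup can never exceed 1 / (1 - 0.6) = 2.5x, no matter how fast the storage gets. The remaining 40% of the run time is the new bottleneck.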

To take a closer look at SSD latencies, I ran the following simple experiment:
1) Erase an MLC SSD so that no logical blocks were actually mapped to flash, and then issue small random reads.
2) Overwrite the entire SSD so that all logical blocks are mapped, and issue the same small random reads as in step 1.

The idea here is to measure the software and protocol overheads of accessing flash packaged as SSD separately from accessing the data on the SSD. Reads with no blocks mapped had latencies of around 70us, while the reads with all blocks mapped had latencies of 250us. In this case only a fraction of the overall IO latency was due to SW and protocol overhead, indicating that SSDs may still have significant room for improving latency.
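One way to read those two numbers, sketched below, is to treat the unmapped-read latency as the software and protocol overhead and the difference as time spent inside the SSD fetching data. That split is an interpretation of the experiment, and the function and key names are illustrative.

```python
UNMAPPED_READ_US = 70.0   # reads with no logical blocks mapped (from the experiment)
MAPPED_READ_US = 250.0    # reads with all logical blocks mapped (from the experiment)


def latency_breakdown(unmapped_us: float, mapped_us: float) -> dict:
    """Split a mapped-read latency into overhead and media components.

    Assumes the unmapped read exercises only the software/protocol path.
    """
    overhead_us = unmapped_us
    media_us = mapped_us - unmapped_us
    return {
        "overhead_us": overhead_us,
        "media_us": media_us,
        "overhead_fraction": overhead_us / mapped_us,
    }


if __name__ == "__main__":
    parts = latency_breakdown(UNMAPPED_READ_US, MAPPED_READ_US)
    print(f"overhead: {parts['overhead_us']:.0f} us "
          f"({parts['overhead_fraction']:.0%} of total), "
          f"inside the SSD: {parts['media_us']:.0f} us")
```

Under that reading, a bit over a quarter of the 250us is overhead and the remaining 180us is spent in the SSD itself, which is where the room for improvement would be.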

Form factor
Another important issue discussed by both Robin and Matt is the relative cost of flash packaged in SSD versus non-SSD form factors. Robin argues that an SSD costs significantly more $/GB than the underlying flash while Matt argues that non-SSD packaging is expensive to develop, and SSDs provide useful flash management functions as well as hot-swap capability. It’s certainly true that developing custom packaging has a high up front cost, although this is likely balanced by lower unit costs. But as Robin points out, there are also standard packaging options available for non-SSD form factor flash, which may make custom packaging for non-SSD flash unnecessary.

A very important point to keep in mind when thinking about commercially available SSD vs. non-SSD form factors is that SSDs are designed as a substitute for disk, while non-SSD form factors are often designed as substitutes for memory. This means that SSDs focus primarily on reducing $/GB (their greatest weakness vs. disk), while non-SSDs focus on reducing $/IOPS (their greatest weakness vs. DRAM). This explains why SSD is currently much cheaper on a $/GB basis than PCIe flash, while PCIe flash designed as memory expansion is cheaper on a $/IOPS basis than SSD. This is not to say that you can’t build a non-SSD form factor that has lower $/GB than SSD, just that the primary application for these non-SSD form factors today is usually not as a replacement for disk.

Whether flash in SSD versus non-SSD form factors is better for use in storage subsystems in the long run primarily depends on the relative volumes of these products, and the feature and price sensitivity of the applications these products serve. At this point the ‘winning’ form-factor seems hard to predict. So as a flash subsystem vendor, it seems desirable to keep your options open and ensure that your technology will work well with a variety of packaging options.

More than just a faster disk
But flash is about more than just performance and packaging. Flash enables much more than just a faster, denser replacement for disk. With flash, we can finally remove a key mechanical barrier to scaling not only storage systems, but computing systems in general. Going forward, CPU, network and storage can now all scale with improvements in semiconductor technology. When transistors replaced vacuum tubes, we got more than just compact radios; we got simpler, more powerful computing systems. Similarly, flash is a catalyst that will enable far greater levels of automation and functionality for storage and computing systems than is possible today.

I tend to think of the value of new technology as the product of its simplicity times the functionality it offers. It’s clear why functionality is important, but why is simplicity so important? Technology that is simple to use will be used more often, to solve more problems, in less time. As a result, simplicity has a compounding effect on value:

Value = Simplicity * Functionality

How does one measure simplicity? One way is to list the basic steps it takes to perform a task and how long each step takes. One to three is good, four to six is manageable, and anything resembling a twelve step program will likely require written directions and a significant amount of focus. Note that in assessing the simplicity and functionality of a technology, one must do it in the context of the job that needs to be done. For example, a chainsaw has great features for cutting down trees but not for giving haircuts.

A common problem with many general purpose storage products when applied to applications such as virtualization is that they require executing long lists of steps to get anything done – and most of the features are not directly applicable to virtualization. Paradoxically, many of the features that try to make these products better suited to the application end up making the products more complex – resulting in little improvement in overall value. Kind of like adding too many tools to a Swiss army knife until you have so many that the attachments start to stick and rub against each other.

Flash as a catalyst
Flash eliminates a key mechanical barrier to scaling computing systems and is 400 times faster than disk. To keep things in perspective, the speed of sound is “only” 250 times faster than walking! If I could get to work at supersonic speeds, I would no doubt save a lot of time each year. But would I do no more with such an ability? Similarly, is flash just a faster replacement for disk? Will it make no significant difference in the way storage is managed and used? We obviously don’t think so. Flash will greatly increase the value of storage by improving both the simplicity and functionality of enterprise storage products. But these gains will not come easily or without their own set of problems.

An obvious way flash promotes simplicity is by eliminating performance bottlenecks, but as flash enables more dense storage systems many of those gains will be converted to problems in quality-of-service. A more significant way flash promotes value is by providing a better building block for constructing storage systems: flash promotes simplicity by enabling higher levels of automation and allows the implementation of more powerful functionality.

Flash will fragment the enterprise storage market. The general purpose storage systems of today will be supplanted by new flash-based products that are far simpler and more powerful for the specific application areas that they target. This will amplify the simplicity and power that flash already makes possible, and further accelerate the fragmentation of the storage market. This is precisely what happened in the 1980’s when advances in networking technology caused a shift from centralized computing to networked computing – and in the process fragmented the direct attached storage market into ones based on networked storage technology. Over time, the networked storage markets consolidated into the current general purpose storage market dominated by a few major vendors. And so the cycle is repeating itself.

We are at the start of a new technological shift. A shift that is made possible by flash and one that will disrupt the existing enterprise storage market. Just as transistors enabled new products such as personal computers and smart phones, flash will enable simple, intelligent and fast enterprise storage systems. In turn, this will lead to much higher value for end users, but only if we think outside the storage box and treat flash as more than just a faster, denser disk.

The StorageMojo take
For the record, the original post wasn’t looking at hybrid solutions, although it is obvious that SSDs can help legacy designs stay competitive without replacing all disks for a few years. For folks like Tintri and Nimble who want to speed up disk storage and stay affordable, SSDs make sense. Why engineer a small part of your system when an off-the-shelf solution will suffice?

But for high end transactional SAN storage I still don’t see how SSDs are the right way to go. But I’m expecting more responses, so stay tuned.

Courteous comments welcome, of course. I’m working on a post that reflects directly on Ed’s comment about SSD latency. You’ll like it.


Dear StorageMojo: migrating from Centera to Isilon

by Robin Harris on Thursday, 15 March, 2012

A reader asks:

Do you have any tool to move External Files (nearly 70 TB) from Celerra & Centera to Isilon faster?

The StorageMojo take
I know EMC made it difficult to leave their Centera system for competitive systems, but making it difficult to leave for another EMC product seems perverse. Or maybe they don’t know how to do it either.

Readers, or Isiloners, any suggestions?

Courteous comments welcome, of course.


SSDs in arrays: the Pure Storage view

by Robin Harris on Monday, 12 March, 2012

Pure’s Matt Kixmoeller saw the “Are SSD-based arrays a bad idea?” post and, unsurprisingly, responded. “The SSD is Key to Economic Flash Arrays” is a good post and I urge interested readers to check it out.

Pure has a stellar team with deep experience. Their views are worth considering.

As Matt notes:

This post caught our eye for an obvious reason: Pure Storage did start “fresh” to build an all-flash enterprise storage array, and we did decide to use the SSD form factor, after quite exhaustive looks at all the other options. Quite simply, we found that SSDs are the most efficient and economic building blocks from which to build a flash array. Let’s explore why.

After dismissing disk arrays that add flash drives – as I do – Matt focuses on (1) all flash appliances built from raw NAND and (2) flash arrays using flash SSDs.

SSDs are most efficient
Matt argues that SSD-based arrays have 3 key advantages:

  • Economics. SSDs are a commodity product that raw flash arrays will have a hard time out-engineering.
  • Flash controller complexity. Matt notes, correctly, that the flash controller is at the heart of the argument. Better to use a controller that goes into millions of SSDs or one purpose-built for a single vendor’s array? How will the single vendor be able to keep up?
  • Serviceability. Pure’s use of SSDs enables them to offer a familiar hot-swap experience that higher density designs may not offer. Furthermore, Pure’s data reduction features increase effective density to rival raw flash designs.

In conclusion, Matt makes a couple more points. First, that SSD form factors will become much more compact, such as Apple’s DIMM-like mini-SATA SSD used in the MacBook Air. Second, that the proof is in the pudding: Pure, he says, has “. . . delivered with break-through performance, at a cost below traditional spinning disk.”

The StorageMojo take
How does Matt’s response stack up to the criteria in the original post? Not that there’s anything magic about them, but . . . .

  • Latency. No response, which doesn’t mean they’re worse.
  • SSD bandwidth. No response, but to be fair, with enough SSDs you should be able to saturate 16Gb Fibre Channel.
  • Reliability. No direct response. Instead a focus on serviceability. More on that below.
  • Cost. Says Pure is cost-effective using their data reduction technology.
  • Flexibility. This is the heart of Matt’s argument: due to the commodity volume of the flash controllers flash SSDs will evolve faster – in functionality and cost – than any proprietary solution could. Proprietary flash controllers, he says, will be boat anchors for flash array vendors and are likely to end up controlled by flash manufacturers.

Serviceability is an interesting response to the question of reliability. After all, hot swap is important for some components but not others because a) they fail often, individually or in aggregate; b) their failure compromises the product; or c) online expansion, upgrading or reconfiguration is desirable.

Power supplies are routinely hot swappable because they have the lowest MTBF of any major system component. Disks are hot swappable because they come in multiples that reduce their aggregate MTBF while their standardized design makes hot swap cheap. I/O cards are often hot swappable because they are critical and needs change.

SSDs should be hot swappable because their failure rates are at best about half that of disks. But DIMMs, another critical component, especially if you invest in high-capacity ones, aren’t, because they rarely fail.

While I’m not aware of any non-SSD enterprise array vendor whose arrays don’t include hot swap components – love to be educated – which is more important: a short mean time to repair (MTTR) or a long mean time between failures (MTBF)? Because that is the argument about serviceability.
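The MTTR versus MTBF tradeoff can be put in numbers with the usual steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below assumes independent components with constant failure rates, so N units fail N times as often as one. It also counts every component failure as downtime, which a redundant array would not. All the hour figures are hypothetical, not vendor data.

```python
def aggregate_mtbf_hours(unit_mtbf_hours: float, count: int) -> float:
    """MTBF of a group of identical, independent units (constant failure rate)."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return unit_mtbf_hours / count


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the component is working."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical design A: 24 hot-swap SSDs, shorter MTBF, 1 hour to repair.
    a = availability(aggregate_mtbf_hours(1_000_000, 24), mttr_hours=1)
    # Hypothetical design B: 24 non-hot-swap modules, 2x the MTBF, 8 hours to repair.
    b = availability(aggregate_mtbf_hours(2_000_000, 24), mttr_hours=8)
    print(f"design A availability: {a:.6f}")
    print(f"design B availability: {b:.6f}")
```

With these made-up inputs the hot-swap design comes out ahead despite failing twice as often, because repair time dominates. Change the assumed repair times and the answer flips, which is why the question is worth asking.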

I’d like to publish responses from vendors who feel strongly about this issue. Not in the comments, but as a blog post. Any takers?

Courteous comments welcome, of course. I was so impressed with the Pure Storage team that I signed a rare NDA with them last spring to get briefed, the first of 2 visits to their Castro street HQ.


StorageMojo webinar Tuesday, March 13

by Robin Harris on Friday, 9 March, 2012

The friendly folks at Panasas are sponsoring Taming the Big Data Beast: Big Data for Design and Discovery at 10am PDT. I’ll present the StorageMojo take on big data.

I’d like to hear from you on any issues I should address. Feel free to comment or email me at robin at this domain.

Update: Here’s the link to the WMV file for the webinar.

The StorageMojo take
Panasas founder Garth Gibson – he of the original Berkeley RAID paper – was so far in advance of the rest of the industry with scale-out architecture, object storage and extreme bandwidth that it is only in the last few years that enterprises have caught on to why these are all Good Things. I’m glad they’ve hung in there and pleased by their support for StorageMojo.

Courteous comments welcome, of course. I’ll be in Silicon Valley Wednesday morning with some free time. I’d like to see cool stuff that people are working on.


Are SSD-based arrays a bad idea?

by Robin Harris on Monday, 5 March, 2012

Think: if NAND flash storage arrays were being developed today, what is the chance that we’d put the flash into little bricks and then plug a bunch of them into a backplane? So why do it now?

It is a truism of design that when a new technology is developed, we use it to build what we have today. It is only in later generations that we realize the new possibilities enabled by the technology. And those generations can be long, even in computers.

For all our talk about the rapid pace of computer innovation, the market for the tried-and-true is much larger than the one innovators fight over.

Why SSD-based arrays are a bad idea
To be clear, this discussion covers storage arrays built with standards-based (i.e. SATA, SAS, 2.5″ or similar) SSDs.

  • Latency. Low compared to disks, but substantial compared to flash. SAS/SATA stacks were never optimized because disk latency was the big problem.
  • SSD bandwidth. There are wider options, especially close to the CPU.
  • Reliability. SSDs replace the head/media assembly in disk drives with NAND chips. The rest of the SSD has all the tender bits of a regular disk – bits that account for about half of all disk failures. Compare DIMM and disk replacement rates.
  • Cost. SSDs cost 50%-100% more than the raw flash, even after using all the high-volume disk components. Mounting directly on PC boards, like DIMMs or PCIe cards, is much more cost effective.
  • Flexibility. The good news with SSDs is that they take advantage of the huge tech infrastructure that supports disks. But that’s the bad news too, if an optimized clean-sheet architecture is the goal.

How big an issue is cost? DRAM on a DIMM is ≈98% of the DIMM’s cost, whereas the flash in an SSD is ≈50%-65% of the cost. And since flash costs are dropping faster than the other component costs, so will its percentage of SSD cost.
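The "50%-100% more than the raw flash" premium and the "≈50%-65% of the cost" share are two views of the same arithmetic. A minimal sketch, assuming the SSD's cost is simply raw flash cost plus a proportional premium (the function name is illustrative):

```python
def flash_share_of_ssd_cost(premium_over_raw_flash: float) -> float:
    """Fraction of SSD cost that is flash, given the SSD's premium over raw flash.

    premium_over_raw_flash: 0.5 means the SSD costs 50% more than its flash.
    """
    if premium_over_raw_flash < 0:
        raise ValueError("premium must be non-negative")
    return 1.0 / (1.0 + premium_over_raw_flash)


if __name__ == "__main__":
    for premium in (0.5, 1.0):
        share = flash_share_of_ssd_cost(premium)
        print(f"SSD costs {premium:.0%} more than its flash -> flash is {share:.0%} of SSD cost")
```

A 100% premium means flash is half the SSD's cost, and a 50% premium means it is about two thirds, which brackets the range quoted above.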

Given the high cost of flash media compared to disk, efficient media usage is a major issue. Will flash SSDs pass that test?

A less important but related metric: rackspace. SSDs are inefficient users of racks, taking perhaps 2x the space of non-SSD flash arrays per TB. Few customers will care, but the ones who do write big checks.

The StorageMojo take
The massive technological momentum behind SSD-based arrays makes them a popular option for both vendors and customers. After 20 years of RAID arrays, customers get the model. There’s a large raft of hardware and software support for disk drives that SSDs can use.

That cuts time-to-market and development cost. Given the performance advantages of SSDs over disks it is an easy win for customers even if the architecture is sub-optimal.

The squeeze comes later: if non-SSD architectures have significant advantages the SSD-based arrays will lose market share and gross margin. Flash-based SSDs make sense for many applications where their cost is a small percentage of the total solution.

Building storage arrays from SSDs is opportunistic, not strategic. It isn’t the future for high-end storage, but less-demanding mid-markets may not care.

Courteous comments welcome, of course. I’m really interested in any holes in the logic of this analysis. Please weigh in.


NAND’s dimming future

by Robin Harris on Wednesday, 29 February, 2012

Another StorageMojo Best paper, The Bleak Future of NAND Flash Memory, presented at this year’s FAST ’12 conference, quantifies flash’s declining reliability, endurance, and performance as density increases.

Researchers Laura M. Grupp and Steven Swanson from the UCSD Non-volatile Systems Lab and John D. Davis of Microsoft Research collected data from 45 flash chips from 6 manufacturers. Using that empirical data they predict the performance and cost characteristics of future SSDs.

Faster better cheaper or slower worse cheaper?
While NAND flash is produced with semiconductor processes, smaller feature sizes don’t lead to faster performance or greater reliability. As NAND features shrink, so does the number of trapped electrons that store information.

Figures of merit
The research found that performance, program/erase endurance, energy efficiency, and data retention time all got worse with feature shrink.

Based on past performance, the team derived equations to describe how changes in feature size have affected key specs. They looked at SLC, MLC and TLC and feature sizes scaled from 72 nm to 6.5 nm (the consensus smallest feature size published in the International Technology Roadmap for Semiconductors (ITRS)), and assumed a fixed silicon budget for flash storage.

Key results

  • Latency. MLC write latency will double over time. Triple-level cell writes will grow to over 2.5 ms, noticeably reducing TLC’s performance advantage over disk writes.
  • Bandwidth. Small – 512B – read bandwidth and all writes decline by up to 50% over time. The impact is greatest on high-performance SLC flash.
  • IOPS. MLC flash I/O rates will drop almost in half.

Flash may be the new disk in a few years.

The StorageMojo take
One important qualifier is that for the purposes of their modeling the team constrained the number of chips in the hypothetical future devices whose performance they predicted. While fine for isolating the impact of future chip shrinks, it ignores the potential of much greater parallelism for managing these changes.

Bandwidth drops by half? Double the number of chips.
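The parallelism argument is simple multiplication, sketched below. It assumes bandwidth scales perfectly with chip count, ignoring controller, channel and power limits, and the MB/s figures are hypothetical rather than taken from the paper.

```python
import math


def aggregate_bandwidth_mb_s(chips: int, per_chip_mb_s: float) -> float:
    """Idealized device bandwidth: every chip contributes its full rate."""
    return chips * per_chip_mb_s


def chips_needed(target_mb_s: float, per_chip_mb_s: float) -> int:
    """Smallest chip count that reaches the target under perfect scaling."""
    if per_chip_mb_s <= 0:
        raise ValueError("per_chip_mb_s must be positive")
    return math.ceil(target_mb_s / per_chip_mb_s)


if __name__ == "__main__":
    target = 1600.0  # hypothetical device bandwidth target, MB/s
    for per_chip in (50.0, 25.0):  # hypothetical: today's chip, then one half as fast
        n = chips_needed(target, per_chip)
        print(f"{per_chip:.0f} MB/s per chip -> {n} chips for {target:.0f} MB/s")
```

Halving per-chip bandwidth doubles the chip count for the same target, which is exactly the cost the paper's fixed silicon budget rules out.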

But if something can’t go on forever, it won’t. NAND flash will soon enter an end-of-life crisis for computer applications that need performance. That’s why ReRAM (resistance RAM) looks to be a good bet for replacing computer flash – not mobile device flash – over the next decade.

Courteous comments welcome, of course. A version of this post was published on ZDNet last week.


Virtualizing storage controllers

by Robin Harris on Tuesday, 28 February, 2012

A hardware storage controller is an expensive guarantee that you’re using old technology to handle your most important data. Hardware specs are frozen early in the typical 18-24 month development cycle so by the time you get your “new” controller it is already 2 years old.

But it may not have to be that way. In Adding Advanced Storage Controller Functionality via Low-Overhead Virtualization researchers Muli Ben-Yehuda, Michael Factor, Eran Rom, Avishay Traeger, Eran Borovik and Ben-Ami Yassour of IBM Research–Haifa wanted to find out if virtualized storage controller features are feasible.

Short answer: with some tweaking, yes.

The big question is overhead. Storage controllers are typically in the data path, so latency, as well as compute efficiency on out-of-date processors, are real concerns.

Unlike the gateway approach of virtual storage appliances (VSA), the team ran the VMs directly on storage controllers using the Linux KVM hypervisor.

Overhead
The team identified 3 sources of performance overhead:

  • Base. System work such as virtual memory management or process switching.
  • External communication. Important if a new function is layered on top of the storage system, such as a file server.
  • Internal communication. Virtual machine coordination and communication with the hardware controller.

Reducing overhead
Different techniques are used to limit each type of overhead.

Base. They statically allocate CPU cores to the guest to ensure sufficient resources. Memory is also statically allocated to the VM to reduce translation overheads.

External. Device assignment is the highest-performing approach as it eliminates hypervisor intervention for physical events. This requires assigning the network device directly to the guest using an SR-IOV (single root I/O virtualization) enabled adapter which allows the guest to send requests directly to the device.

Internal communications. To reduce internal communication overhead, they modified KVM’s block driver to poll instead of interrupt. This gives a fast, exit-less, zero-copy transport.

Results

“By using these techniques, we show no measurable difference in network latency between bare metal and virtualized I/O and under 5% difference in throughput. For internal communication, micro-benchmarks show 6.6μs latency overhead, read throughput of 357K IOPS, and write throughput of 284K IOPS; roughly seven times better than a base KVM implementation. In addition, an I/O intensive filer workload running in KVM incurs less than 0.4% runtime performance overhead compared to bare metal integration.”

That sounds pretty good.

The StorageMojo take
While the static assignments may reduce flexibility, the win is updating storage functionality on the fly. But are there viable use cases? The arc of controller history suggests there are.

The earliest disk drives were directly controlled by the host CPU. Over the decades that and much other functionality migrated to controllers and to disks. Lately that trend has slowed because of large investments in existing standards.

This paper shows that it is possible to migrate more functionality to controllers without lengthy development cycles, enabling architects to make different tradeoffs.

For example, big data requires big pipes, and big pipes are expensive. If volume-reducing preprocessors could be added to file servers, existing bandwidth could be optimized.

More importantly, it suggests that by virtualizing the controller’s applications, the underlying hardware can be updated more frequently. To be fair, that’s not what the authors suggested, but it certainly seems possible based on their work.

Courteous comments welcome, of course. Jeff Darcy of Red Hat has his own list of favorite papers from FAST ’12 here.


Doubling flash write performance through retention relaxation

by Robin Harris on Monday, 27 February, 2012

FAST – File and Storage Technologies – is a must-see conference for StorageMojo, and I’ll be reviewing several Best Papers from FAST ’12. While most emerging technology is developed in private company labs, FAST is where much of the first publicly available research is published.

Case in point, a StorageMojo Best Paper of FAST ’12: Optimizing NAND Flash-Based SSDs via Retention Relaxation by Ren-Shuo Liu and Chia-Lin Yang of National Taiwan University, and Wei Wu of Intel. NAND engineers have known for years that it is possible to speed up writes by allowing for shorter retention, but this paper quantifies the process.

Data retention was a theme of several papers. Disk drives don’t care if an update needs to last a minute or a year, but flash does.

NAND retention
NAND flash writes are spec’d – by JEDEC – for one year of retention. But relaxing that retention requirement can be beneficial.

  • Speed. Writes can be 1.8 to 5.7x faster, depending on how long the data is to be kept.
  • SSD architecture. The need for overprovisioning and other choices is a direct result of incoming data rates and flash write speeds. Faster writes might also allow less aggressive garbage collection.
  • ECC. As feature sizes shrink and NAND cells get flakier, the ECC overhead required to achieve a year’s retention grows. Single error correcting codes used to suffice. Now we need 24-error correcting codes and the arms race continues.

These advantages are meaningless if most writes need to be retained for more than, say, 2 weeks. The authors looked at a number of workload traces and found that for all but one of them, at least 50% of the writes were retained for 1 week or less. For active enterprise workloads the percentage is likely to be over 75%.
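For readers who want to check their own workloads, here is one way such a measurement could be made. It is a sketch, not the authors' method: it assumes a time-sorted trace of (timestamp in seconds, logical block address) write records, and it counts a write as short-lived if the same block is overwritten within the threshold.

```python
ONE_WEEK_S = 7 * 24 * 3600


def short_lived_fraction(trace, threshold_s: float = ONE_WEEK_S) -> float:
    """Fraction of writes whose data is overwritten within threshold_s.

    trace: iterable of (timestamp_s, lba) tuples, sorted by timestamp.
    Writes never overwritten inside the trace count as long-lived.
    """
    last_write = {}  # lba -> timestamp of the most recent write to it
    short_lived = 0
    total = 0
    for timestamp_s, lba in trace:
        total += 1
        previous = last_write.get(lba)
        # This write ends the life of the previous write to the same block.
        if previous is not None and timestamp_s - previous <= threshold_s:
            short_lived += 1
        last_write[lba] = timestamp_s
    return short_lived / total if total else 0.0


if __name__ == "__main__":
    # Tiny made-up trace: block 1 is rewritten after 10 s, then again much later.
    demo = [(0, 1), (10, 1), (20, 2), (1000, 1)]
    print(short_lived_fraction(demo, threshold_s=100))
```

Because writes at the end of a trace never get the chance to be overwritten, a short trace will understate the short-lived fraction.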

What happens when the time is up?
The authors propose that the Flash Translation Layer keep track of how long each block remains unchanged. When – and if – it reaches the threshold, a background process rewrites the data for the standard 1 year retention.
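The bookkeeping described above might look something like the following. This is a toy model under stated assumptions, not the paper's FTL: the class and method names are invented for illustration, timestamps are plain seconds, and a real design would rewrite with some safety margin before the relaxed retention window actually expires.

```python
from dataclasses import dataclass

ONE_WEEK_S = 7 * 24 * 3600


@dataclass
class BlockState:
    written_at: float  # when the block's current data was written
    relaxed: bool      # True if written in fast, short-retention mode


class RetentionTracker:
    """Tracks block ages and picks blocks to rewrite at full retention."""

    def __init__(self, threshold_s: float = ONE_WEEK_S):
        self.threshold_s = threshold_s
        self.blocks = {}  # block_id -> BlockState

    def host_write(self, block_id: int, now: float) -> None:
        # Host writes use the fast, relaxed-retention mode and reset the clock.
        self.blocks[block_id] = BlockState(written_at=now, relaxed=True)

    def background_scan(self, now: float) -> list:
        """Rewrite relaxed blocks that reached the threshold; return their ids."""
        rewritten = []
        for block_id, state in self.blocks.items():
            if state.relaxed and now - state.written_at >= self.threshold_s:
                # Stand-in for the real rewrite at standard 1 year retention.
                state.written_at = now
                state.relaxed = False
                rewritten.append(block_id)
        return rewritten
```

A block that the host overwrites before the threshold never costs a background rewrite, which is where the win comes from if most data is short-lived.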

It is feasible to differentiate between host writes and background writes – garbage collection, for example – and to write them differently. Long-term writes would get improved ECC, while host writes would avoid the costly ECC encoding required.

Yes, there is overhead in managing the fast blocks and rewriting long-term data. But the added performance appears to make that a small price to pay.

The StorageMojo take
The paper presents a strong case for relaxing retention requirements to improve performance. As future generations of flash become less reliable and slower we’ll need this and other techniques to improve – or at least maintain – performance.

Many performance enhancement schemes require unrealistic levels of intelligence about application or system behavior to be effective. But this is within the realm of practical implementation.

The retention issue is a fair example of being handed a lemon and making lemonade. Or offering another degree of freedom to system architects.

In fact, some vendors are already exploring this possibility. If it extends the useful life of flash for a few years it will be well worth the engineering effort.

Courteous comments welcome, of course. A somewhat analogous process for disks is the concept of shingled writes, an area UCSC has been working in. Will disk vendors pick it up?
