The storage tipping point

by Robin Harris on Monday, 29 June, 2015

Storage is at a tipping point: much of the existing investment in the software stack will be obsolete within two years. This will be the biggest change in storage since the invention of the disk drive by IBM in 1956.

This is not to deprecate the other seismic forces of flash, object storage, cloud and the newer workloads that are driving investment in scale-out architectures and NoSQL databases. But the 50 years of I/O stack development – based on disks and, later, RAID – is essentially obsolete today, as will become obvious to all very soon.

In a nutshell, the performance optimization technologies of the last decade – log structured file systems, coalesced writes, out-of-place updates and, soon, byte-addressable NVRAM – are conflicting with similar-but-different techniques used in SSDs and arrays. Case in point: Don’t stack your Log on my Log, a recent paper by Jingpei Yang, Ned Plasson, Greg Gillis, Nisha Talagala, and Swaminathan Sundararaman of SanDisk.

Log structured writes are written to free space at the “end” of the free space pool, as if the free space were a continuous circular buffer. Stale data must be periodically cleaned up and its blocks returned to the free space pool – the process known as garbage collection.
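To make the mechanism concrete, here is a minimal Python sketch of that append-and-clean cycle. It is purely illustrative – the class, its names and its parameters are invented for this example, not taken from any real file system or FTL.

```python
# Minimal sketch of a log-structured store: appends go to the head of a
# circular log; garbage collection copies live data forward and returns
# the cleaned segment to the free pool. All names here are illustrative.

class LogStore:
    def __init__(self, num_segments=8, segment_size=4):
        self.segments = [[] for _ in range(num_segments)]  # each holds (key, value)
        self.head = 0                  # segment currently being appended to
        self.latest = {}               # key -> (segment, slot) of the live copy
        self.segment_size = segment_size

    def write(self, key, value):
        seg = self.segments[self.head]
        if len(seg) >= self.segment_size:          # head full: advance circularly
            self.head = (self.head + 1) % len(self.segments)
            seg = self.segments[self.head]
        seg.append((key, value))
        self.latest[key] = (self.head, len(seg) - 1)  # older copies are now stale

    def gc(self, seg_idx):
        """Copy live data out of a segment, then free it."""
        live = [(k, v) for slot, (k, v) in enumerate(self.segments[seg_idx])
                if self.latest.get(k) == (seg_idx, slot)]
        self.segments[seg_idx] = []                 # segment returns to free pool
        for k, v in live:                           # rewrite live data at the head
            self.write(k, v)
        return len(live)                            # extra writes = write amplification
```

Note the return value of `gc()`: every live page copied forward is a write the application never asked for, which is exactly the write amplification the SanDisk paper measures when two of these loops stack on top of each other.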

Log structured file systems and the SSD’s flash translation layer (FTL) both use similar techniques to improve performance. But one has to wonder: what is the impact of two or more logs on the total system? That’s what the paper addresses.

The paper explores the impact of log structured apps and file systems running on top of log structured SSDs. In summary:

We show that multiple log layers affects sequentiality and increases write pressure to flash devices through randomization of workloads, unaligned segment sizes, and uncoordinated multi-log garbage collection.

How bad?
The team found several pathologies in log-on-log configurations. Readers are urged to refer to the paper for the details. Here are the high points.

  • Metadata footprint
    Each log layer needs to store metadata to keep track of physical addresses as it appends new data. Many log layers support multiple append streams, which, they discovered, have important negative effects on the lower log. File system write amplification could increase by as much as 33% as the number of append streams went from 2 to 6.
  • Fragmentation
    When two log layers do garbage collection, but at different segment sizes and boundaries, the result is segment size mismatch, which creates additional work for the lower layer. When the upper layer cleans one segment, the lower layer may need to clean two segments.
  • Reserve capacity over-consumption
    Each layer’s garbage collection requires consumption of reserve capacity. Stack the GC layers and more storage is used.
  • Multiple append streams
    Multiple upper layer append streams – useful for segregating different application update frequencies – can cause the lower log to see more data fragmentation.
  • Layered garbage collection
    Each layer’s garbage collection runs independently, creating multiple issues, including:
    • Layered TRIMs. TRIM at the upper layer doesn’t reach the lower layer, so the lower layer may have invalid data it assumes is still valid.
    • GC write amplification. Independent GC can mean the lower layer cleans a segment ahead of the upper layers, causing re-writes when the upper layer communicates its changes.
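The segment size mismatch pathology above is easy to see with a little arithmetic. This hypothetical Python helper counts how many lower-layer segments one upper-layer segment spans; the sizes and offsets are made up purely for illustration.

```python
# Back-of-envelope: when an upper log cleans one of its segments, how many
# lower-layer segments does that clean touch? Sizes/offsets are illustrative.

def lower_segments_touched(upper_offset, upper_seg_size, lower_seg_size):
    """Count lower-layer segments overlapped by one upper-layer segment."""
    first = upper_offset // lower_seg_size
    last = (upper_offset + upper_seg_size - 1) // lower_seg_size
    return last - first + 1

# A 1 MB upper segment aligned to a 2 MB lower segment touches 1 segment...
print(lower_segments_touched(0, 1 << 20, 2 << 20))        # 1
# ...but shifted by 1.5 MB it straddles a boundary and touches 2:
print(lower_segments_touched(3 << 19, 1 << 20, 2 << 20))  # 2
```

Since the two layers don't coordinate, the unaligned case is the common one – and the lower layer pays for it with extra cleaning.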

The StorageMojo take
Careful engineering could solve the log-on-log problems, but why bother? I/O paths should be as simple as possible. That means a system, not storage, level attack on the I/O stack.

50 years of HDD-enabling cruft won’t disappear overnight, but the industry must get started. Products that already incorporate log structured I/O will have a definite advantage adapting to the brave new world of disk-free flash and NVM memory and storage.

Storage is the most critical and difficult problem in information technology. In the next decade new storage technologies will enable a radical rethink and simplification of the I/O stack beyond what flash has already done.

Six months ago I spoke to an IBM technologist and suggested that the lessons of the IBM System/38 – which sported a single persistence layer spanning RAM and disk – could be useful today. He hadn’t heard of it.

The SanDisk paper doesn’t directly address latency, but that’s the critical element in the new storage stack. Removing multiple log levels won’t optimize for latency, but it’s a start.

Courteous comments welcome, of course.


Hike blogging: the Twin Buttes loop

by Robin Harris on Sunday, 21 June, 2015

Summer finally arrived in Northern Arizona, about 6 weeks later than usual. Good news: no wildfires, thanks to lots of rain. Bad news: I was freezing!

I got out to the Twin Buttes before 7am – and was a little late. Shade is a valuable commodity in the Arizona summer!

In another 10 days or so the summer monsoon will start, my favorite time of year. It will still be hot, but moist air from the Gulf of California arrives with plenty of clouds and thunderstorms. The rain ends the annual wildfire season, while the clouds dapple the already dramatic landscape in ever-changing patterns.

The Twin Buttes Loop is about 6 miles and relatively flat: just a few hundred feet of ascent. But the scenery is anything but flat.

Courthouse and Bell rocks as seen from Chicken Point:


The rocks are a popular mountain biking destination, as the sign in this picture suggests. The double black diamond is for bikers, not hikers. Having hiked the trail, I can assure you that there are places where once you start you are committed to a rapid descent whether you are still on your bike or not.


Finally, another picture: Spring in the Desert. The fruiting body of an agave plant is on the left. These stems shoot up several inches a day and present their flowers for pollination and then seed distribution. The stem is so large that it can cause the whole plant to capsize, ripping its roots out of the ground.

The desert isn’t easy.


The StorageMojo take
May you always walk in beauty.

Courteous comments welcome, of course.


Can NetApp be saved?

by Robin Harris on Wednesday, 17 June, 2015

If NetApp is going to save itself – see How doomed is NetApp? – it needs to change the way it’s doing business and how it thinks about its customers. Or it can continue as it is and accelerate into oblivion.

NetApp’s problem
NetApp is essentially a single-product line company, and that product line is less and less relevant to customer needs. There’s faster block and SAN storage and much cheaper object storage in the cloud and on-prem. NetApp is in a sour spot, not a sweet one.

Here’s what NetApp needs to do to regain momentum.

Embrace multiple product lines. OnTap, while competitive in no-growth legacy applications, is not competitive with modern scale-out object storage systems. NetApp needs more arrows in its quiver.

NetApp could learn from EMC, a company that has developed almost no products itself in the last 20 years – Atmos is the exception that comes to mind. Instead, EMC buys sector leaders and pushes them through its enterprise sales channel. Both Isilon and Data Domain have major architectural flaws compared to more modern products, but EMC’s sales clout wins the day.

Embrace scale out storage. NetApp made a brilliant move when they purchased object storage pioneer Bycast. The Canadian company suffered from timid marketing thanks to a traditional Canadian reluctance to toot one’s own horn. Overcommitment to CDMI hasn’t helped either.

But Bycast had a strong foothold in medical imaging and a great story: a Bycast installation survived Hurricane Katrina in New Orleans without any data loss. Haven’t heard that story? You, and everyone else.

Buy Avere. Avere’s product is an intelligent front end cache for multiple NetApp filers. It simplifies filer management by keeping hot data local and eliminating the need to balance hot files across multiple filers.

But when you buy it, don’t try to integrate it with OnTap. It is a network device, and needs to be sold as such.

Pump up the channel. Easier said than done, but NetApp has to get ready for a lower margin future, and embracing the channel is the easiest way to start. More will need to be done – products that need less support thanks to automated support for instance – but getting lean and mean is table stakes for our brave new storage world.

The StorageMojo take
Despite the fact that NetApp stopped talking to me several years ago – except for a recent briefing invite – I still like them and wish them well. Thus this advice.

With a well-regarded global brand and a broad enterprise presence, NetApp has assets that startups can only dream of. But so did DEC, Sun and Kodak, and bad management frittered those assets away.

NetApp urgently needs a strategy reboot. Whether the new management team is up to the task remains to be seen. I hope they are.

Comments welcome, as always.


Why it’s hard to meet SLAs with SSDs

by Robin Harris on Wednesday, 3 June, 2015

From their earliest days, people have reported that SSDs were not providing the performance they expected. As SSDs age, for instance, they get slower. But how much slower? And why?

A common use of SSDs is for servers hosting virtual machines. The aggregated VMs create the I/O blender effect, which SSDs handle a lot better than disks do.

But they’re far from perfect, as a FAST ’15 paper, Towards SLO Complying SSDs Through OPS Isolation, by Jaeho Kim and Donghee Lee of the University of Seoul and Sam H. Noh of Hongik University points out:

In this paper, we show through empirical evaluation that performance SLOs cannot be satisfied with current commercial SSDs.

That’s a damning statement. Here’s what’s behind it.

The experiment
The researchers used a 128GB commercial MLC SSD purchased off-the-shelf and tested it either clean or aged. Aging is produced by issuing random writes ranging from 4KB through 32KB for a total write that exceeds the SSD capacity, causing garbage collection (GC).
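As a rough sketch of that aging procedure – not the authors' actual tool – a generator like the following could produce the random-write stream. The power-of-two size mix and 4KB alignment are my assumptions; the paper only specifies random writes of 4KB through 32KB exceeding the drive's capacity.

```python
import random

def aging_workload(capacity_bytes, passes=1.5, seed=0):
    """Yield (offset, size) random writes totaling `passes` x device capacity,
    with sizes from 4 KB to 32 KB, enough to force garbage collection.
    Illustrative only: size mix and alignment are assumptions."""
    rng = random.Random(seed)
    written = 0
    target = int(capacity_bytes * passes)
    while written < target:
        size = rng.choice([4, 8, 16, 32]) * 1024        # 4KB..32KB writes
        offset = rng.randrange(0, capacity_bytes - size, 4096)  # 4KB-aligned
        yield offset, size
        written += size
```

Writing more than the drive's capacity in small random chunks guarantees the FTL has consumed its clean blocks and must garbage collect to service further writes – which is the "aged" state the paper measures.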

They then tested performance in each mode using traces from the Umass Trace Repository. The traces were “replayed” generating real I/Os to the SSD for three workloads: financial; MSN; and Exchange.

In addition to clean and aged SSD performance, they tested each VM with its own partition on a clean SSD and running the workloads concurrently on a single partition on a clean SSD.

They repeated the tests using an aged SSD, to notable effect:

IO bandwidth of individual and concurrent execution of VMs.

One of the major effects of garbage collection is on the over provisioning space – the OPS of the title. While you can confine a single VM to a single partition, the over provisioning space in an SSD is shared among all partitions – at least as far as the authors know.

Garbage collection
The authors ascribe the massive performance deltas to garbage collection. For those new to this issue, the basic unit of flash storage is the page – typically a few KB – and pages are contained within blocks – typically anywhere from 128KB to 512KB.

But the rub is that entire blocks – not pages – have to be erased, so as pages are invalidated there comes a time when the invalid pages have to be flushed. Once the number of invalid pages in a block reaches a threshold, the remaining good data is rewritten to a fresh block – along with other valid data – while the old block is erased.

Erasing a block takes many milliseconds, so one of the key issues is tuning the aggressiveness of GC against the need to minimize writes so as to maximize flash’s limited life. This is but one of the many trade-offs required in engineering the flash translation layer (FTL) that makes flash look like a disk.
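A back-of-envelope model shows why that tuning matters. This illustrative function – not anything from the paper – estimates write amplification from the fraction of a block that is still valid when the FTL cleans it.

```python
# Rough write-amplification estimate for block-level GC: before a block can
# be erased, the FTL must copy out any still-valid pages. Illustrative only.

def gc_write_amplification(pages_per_block, valid_fraction):
    """Flash writes issued per page of space actually reclaimed."""
    valid = pages_per_block * valid_fraction       # pages that must be copied
    reclaimed = pages_per_block - valid            # pages freed by the erase
    return (reclaimed + valid) / reclaimed         # host writes + GC copies

# A 256-page block that is still 50% valid costs 2 flash writes per host write:
print(gc_write_amplification(256, 0.5))   # 2.0
# At 75% valid it jumps to 4 writes per host write:
print(gc_write_amplification(256, 0.75))  # 4.0
```

The steep rise as blocks get fuller is why FTLs prefer to clean the emptiest blocks – and why mixing workloads with different update frequencies in the same blocks hurts so much.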

Black box
But, as the researchers note, it is not possible to know exactly what is going on inside an SSD because the FTL is a proprietary black box.

Our work shows that controlling the SSD from outside the SSD is difficult as one cannot control the internal workings of GC.

GC is the likeliest explanation for the big performance hit when VMs share a partition. The GC process affects all the VMs sharing the partition, causing all of them to slow down. Here’s another chart from the paper:

(a) Data layout of concurrent workloads in conventional SSD and (b) number of pages moved for each workload during GC.

Another variable is the degree of over provisioning in the SSD. Since flash costs money, over provisioning adds cost to the SSD. Over provisioning may be as little as 7% for consumer SSDs or as high as 35% for enterprise SSDs.
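For reference, the over provisioning percentage is usually quoted as reserve capacity divided by usable capacity. A quick illustrative calculation – the figures below are examples, not any specific drive:

```python
def over_provisioning(raw_gb, usable_gb):
    """OP ratio as commonly quoted: reserve capacity over usable capacity."""
    return (raw_gb - usable_gb) / usable_gb

# A consumer drive with 128 GiB of raw flash (~137.4 GB) sold as 128 GB
# usable lands near the 7% consumer figure:
print(round(over_provisioning(137.4, 128.0), 3))   # 0.073
# An enterprise drive reserving far more flash per usable GB:
print(round(over_provisioning(512.0, 400.0), 2))   # 0.28
```

The convenient accident that 128 GiB of raw flash formatted as 128 GB yields roughly 7% reserve is why that number shows up so often on consumer drives.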

Yet another variable is how the OPS is shared among partitions. If shared at the page level, much extra data movement – and reduced performance – is virtually assured. But again, that is under the control of the FTL, and it is hard to know how each vendor handles it.

The StorageMojo take
Flash storage has revolutionized enterprise data storage. With disks, I/Os are costly. With flash, reads are virtually free.

But as the paper shows, SSDs have their own issues that can waste their potential. Until vendors give users the right controls – the ability to pause garbage collection would be useful – SSDs will inevitably fail to reach their full potential.

My read of the paper suggests several best practices:

  • Give each VM its own partition.
  • Age SSDs before testing performance.
  • Plan for long-tail latencies due to garbage collection.
  • Pray that fast, robust, next-gen NVRAM gets to market sooner rather than later.

Comments welcome, as always.


Make Hadoop the world’s largest iSCSI target

by Robin Harris on Monday, 1 June, 2015

Scale-out storage and Hadoop are a great duo for working with masses of data. Wouldn’t it be nice if they could also be used for more mundane storage tasks, like block storage?

Well, it can. Some Silicon Valley engineers have produced a software front end for Hadoop that adds an iSCSI interface. The team had 3 goals:

  • Create an iSCSI volume as an HDFS file
  • Make it interoperate with native iSCSI Initiators on Windows and Linux
  • Deliver performance comparable to common NAS appliances
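The core mapping behind the first goal is straightforward: the target translates each logical block address (LBA) into a byte range of the backing file. Here is a minimal sketch with a local temp file standing in for the HDFS file – the names, block size and function shapes are illustrative, not the team's actual code.

```python
import tempfile

BLOCK_SIZE = 512  # iSCSI logical block size (assumed; 4K is also common)

def read_blocks(f, lba, count):
    """Serve a READ(10)-style request from the backing file."""
    f.seek(lba * BLOCK_SIZE)
    return f.read(count * BLOCK_SIZE)

def write_blocks(f, lba, data):
    """Serve a WRITE(10)-style request into the backing file."""
    assert len(data) % BLOCK_SIZE == 0
    f.seek(lba * BLOCK_SIZE)
    f.write(data)

# Usage, with a temp file standing in for the HDFS-backed volume:
with tempfile.TemporaryFile() as vol:
    vol.truncate(1024 * BLOCK_SIZE)                  # a 512 KB "LUN"
    write_blocks(vol, lba=10, data=b"x" * BLOCK_SIZE)
    print(read_blocks(vol, lba=10, count=1)[:4])     # b'xxxx'
```

The hard part, of course, isn't the arithmetic – it's making random block updates fast on top of a file system built for large sequential writes, which is where the team's engineering lives.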

The payback is that clients get a robust, resilient, scale-out infrastructure, at commodity hardware prices. Even small iSCSI arrays can’t compete, assuming, of course, that you’ve got a Hadoop cluster.

Hmm-m-m, turning an enormous key-value store into a block device. What could go wrong?

Performance could suck, for one. But surprisingly, the untuned prototype software offers disk levels of performance.

With optimization it could likely do 25%-50% better. Better yet: put the iSCSI daemon on each node and your bandwidth grows with your cluster.

The team has done some testing on Hadoop on Ubuntu with a standard Windows 7 client. Everything is off-the-shelf, with a W7 client, a namenode and 3 data nodes on a 10Gb Ethernet network.

The test payload includes a single 2GB binary file; 25.2GB (~4,200 files) of ~5MB JPEGs plus a few 10MB+ MPEGs in many subfolders (JPEGs and MPEGs don’t compress much); and 10,000 1KB text files.

Here are the team’s results on the 25GB J/MPEG test. Note that zero is on the graph’s right side and incoming data is the blue line.

Test result 1

And the results for copying 2 streams of J/MPEGS plus the 2GB binary:

Test results 2

The StorageMojo take
So, are we on the verge of creating the scale-out iSCSI market niche? That’s Marketing 101: create a niche and dominate!

Thankfully, no. iSCSI target mode on Hadoop is clearly a feature that should be incorporated into a larger product. And that’s why the engineers contacted StorageMojo.

They’d like to sell or license their IP and software prototype to a company looking to differentiate their product – customers love options – and expand their use cases with a speedy block service.

If you’re interested, please contact StorageMojo by sending mail. After some diligence I’ll put you in touch with the team.

Courteous comments welcome, of course. Readers, what say you? Does it broaden Hadoop’s appeal or meh?


Hospital ship Haven in Nagasaki, Japan, 1945

by Robin Harris on Monday, 25 May, 2015

StorageMojo is republishing this post to mark this Memorial Day, 2015. In a few months we will be marking the 70th anniversary of the end of World War Two as well. My father was a career Navy officer and this is a small part of his legacy. See the original post for the comments, many from the children of my father’s shipmates.

While most of what modern storage systems protect are business records there is also the use of storage for saving our cultural heritage – of which this is a small part.

My father, Tom, was an officer in the US Navy Medical Corps during World War II. As a newly commissioned 2nd lieutenant he was aboard a submarine tender anchored at Pearl Harbor on December 7, 1941. As a doctor he spent the next 36 hours in an operating room working on the wounded.

Less than 4 years later he was aboard one of the first US ships to enter Nagasaki’s harbor after the Japanese surrender. In a brief memoir he describes a visit to Okinawa on the way to Tokyo – where he was aboard the USS Missouri when the formal surrender was signed – and then on to Nagasaki, the 2nd city to suffer an atomic bomb attack.

USS Haven, a USN hospital ship, in 1954

The primary mission of the Haven was the collection of Allied POWs in need of medical care from the many camps in the area.

The trains began arriving every three or four hours each one with several hundred men. Each new arrival was a thrill with the band playing “Hail, Hail the Gang’s All Here” and the sailors and marines on the platform cheering. It was an experience to see the somewhat bewildered expressions of the men on the trains change to tears, smiles and laughter as they realized that they had reached the end of the road – that the day, the longing for which had sustained them through months and years of torture and mistreatment, was at hand.

While in Nagasaki he visited a Japanese hospital:

What we saw in that hospital was something I wouldn’t have missed seeing for anything but something I never want to see again.

Everywhere you looked there were desperately sick people, mostly women and children. Many were horribly burned and over and around all of them were flies by the millions. There were no beds – all patients were lying on straw mats on the floor. In the corridors of the hospital, the patient’s kin had set up their charcoal burners and were preparing a meal thus filling the hospital with smoke. One sensed that death was hovering over many of these people – while we were examining one recent admission, two died close by.

My father soon had his hands full with some very sick POWs.

Within a few days after the released prisoners of war had started arriving at our processing station, my two wards were filled with sick men, many of them living skeletons. Many people thought that it would be only a matter of “resting them up for a couple of days” and giving them plenty to eat. Those of us working with them, however, soon realized that a great many of them were desperately ill and urgent measures were necessary to save them.

But he also got the chance to meet with some of the scientists from the Manhattan project that developed the bomb.

It was our good fortune that the committee sent out by president Truman to study the atomic bomb explosion arrived in Nagasaki soon after we did. They asked to be quartered on board the Haven and inasmuch as I was in charge of the officer’s mess, it was my duty to look after them. As a result I had many interesting discussions regarding the atomic bomb and its possibilities with the members of the committee several of whom were scientists who had worked with the bomb from the beginning. Of course, they gave out no information except what had been released for publication, still it was a thrill to talk with the men who had done much to work it out.

After a lecture by one of the scientists my father concluded:

It may have ended the war for us, but it may some day be turned against us and we would lose the things for which we fought this long bloody war. Our country could be the greatest force for peace and security in the world if it would but accept the responsibility. Even out here few think of anything but getting back and forgetting what they have seen out here. “Let’s get home and look after our own affairs – what these people do out here is none of our business”, they say. And these are intelligent men – it depresses me. We are still selfish and materialistic, we have learned nothing apparently.

Courteous comments welcome, of course. The complete 12 page document – scanned and OCR’d into a PDF – is available here. Scanning from an old typescript is imperfect so there may be errors.

StorageMojo will be back to its regularly unscheduled programming tomorrow.


No-budget marketing for small companies

May 13, 2015

You are a small tech company. You have a marketing guy but it’s largely engineers solving problems that most people don’t even know exist. How do you get attention and respect at a low cost? Content marketing. When most people think about marketing, they think t-shirts, tradeshows, advertising, telephone calls, white papers and brochures. Those […]


Hike blogging: Sunday May 10 on Brins Mesa

May 11, 2015

The Soldiers Pass, Brins Mesa, Mormon Canyon loop is my favorite hike. It has about 1500 feet of vertical up to over 5000 ft and the combination of two canyons and the mesa means the scenery is ever changing. This shot is taken looking north from the mesa to Wilson Mt. It was a beautiful […]


FAST ’15: StorageMojo’s Best Paper

May 11, 2015

The crack StorageMojo analyst team has finally named a StorageMojo FAST 15 Best Paper. It was tough to get agreement this year because of the many excellent contenders. Here’s a rundown of the most interesting before a more detailed explication of the winner. CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems […]


EMC II’s ragged last quarter

April 27, 2015

As reported in a Seeking Alpha quarterly call transcript, EMC’s storage unit had a $75 million shortfall in Q1. CEO Joe Tucci said . . . we were disappointed that we fell a bit short of our Q1 revenue plan, approximately $75 million short. This $75 million revenue shortfall occurred in our storage business. That […]


How doomed is NetApp?

April 13, 2015

The current turmoil caused by plummeting cloud storage costs, new entrants sporting modern architectures and the forced re-architecting due to flash and upcoming NV memories is a perfect storm for legacy vendors. Some are handling it better than others, but some, like IBM and NetApp, appear to be sinking. NetApp is signalling that their 2015 […]


EMC’s DSSD hiring is exploding

February 18, 2015

DSSD, the Valley startup acquired by EMC last year (see EMC goes all in with DSSD) is continuing to hire at an accelerating rate. Informed sources put the current DSSD team at 160 heads with plans to grow it to 800 over the next year. This is a program in a hurry. Hiring such numbers […]


Latency in all-flash arrays

February 17, 2015

StorageMojo has been writing about latency and flash arrays for years (see The SSD write cliff in real life), with a focus on data from the TPC-C and SPC-1 benchmarks. The folks at Violin Memory asked me to create a Video White Paper to discuss the problem in a bite-size chunk. Latency is the long […]


EMC’s missing petabytes: the cost of short stroking

February 10, 2015

A couple of weeks ago StorageMojo learned that while a VMAX 20k can support up to 2400 3TB drives, it can only address ≈2PB. Where did the remaining 5 petabytes go? Some theories were advanced in the comments, and I spoke to other people about the mystery. No one would speak on the record, but here’s […]


Help StorageMojo find the VMAX 20k’s lost petabytes!

January 21, 2015

While working on a client configuration for a VMAX 20k – and this may apply to the 40k as well, as I haven’t checked – I encountered something odd: The 20k supports up to 2400 3TB drives, according to the EMC 20k spec sheet. That should be a raw capacity of 7.2PB However, the same […]
