StorageMojo




Robin Harris    


The value of guaranteed uptime

May 1st, 2008 by Robin Harris in Architecture, Enterprise, Future Tech

What, if any, is the value of multi-year storage uptime?

Xiotech and Atrato promise 5 and 3 year uninterrupted service on their new arrays. Now it is time to ask, as some commenters have, so what?

After all, enterprise data centers are already well-equipped to deal with disk failures. RAID keeps the data available. 7×24 service replaces the failed drive with a new hot spare. Experienced storage admins paper over the cracks.

It isn’t like you’re going to fire all your storage admins just because arrays stop breaking.

Opex vs capex
The direct cost saving - no maintenance contract for x years - may or may not be reflected in the purchase price. From a buyer’s perspective there are 2 costs: the capital expense - capex - and the operating expense - opex. Opex is fully tax deductible in the year incurred, so it is easier to get.

Atrato and Xiotech need to think creatively about maintenance pricing.

Breaking into the glass house
Breaking into data centers with the promise of cost savings isn’t easy. The provable cost savings have to be 50% or better to get conservative data centers to change vendors. And it helps if there is a recession or the business is tanking. Motivation.

A case can be made that after adding up a standard array’s maintenance costs, random disruption costs and additional management it will be cheaper to go with the new product. The CFO will demand it.

But if you want to change the market, you have to change the way the market thinks.

Re-thinking the issue
Straight cost-displacement arguments aren’t going to have the legs both companies would like. They need a different model.

Enterprise IT is a manufacturing plant - not an engineering testbed. It confuses the engineers because it seems like a techie haven - but it isn’t.

It is all about shipping product, each and every day. Like a real factory.

SPC
Everyone accepts that statistical process control has changed the face of manufacturing. A core idea behind SPC - reducing variability improves quality - is directly applicable to IT factories.

What Atrato and Xiotech do, ideally, is reduce IT ops variability. There is always a known level of performance. Availability is 100%.

Thus most of the usual dependencies are no longer dependencies. I/O slowdowns and timeouts should disappear. Drive rebuilds won’t impact performance. Admins won’t pull the wrong drive - which happens about 2% of the time - and bring down the array. And so on.

The StorageMojo take
Enterprises over-configure because they never know what is going to hit them - but they do know it will be at the worst possible time. Ideally they want to be ready to handle the biggest shopping day of the year - even after an array failure.

Workload variability isn’t going away. But wouldn’t it be nice if equipment performance and availability variability did?

That’s what Atrato and Xiotech are selling. I wish them luck communicating a value prop that strikes at the heart of what every other array vendor is selling.

Comments welcome, of course.

Notebook flash SSD market: fantasy or mirage?

April 27th, 2008 by Robin Harris in Architecture, SSD/Flash Disk

Fresh off the HD-DVD fiasco, Toshiba execs are stepping up to pursue another expensive flop: notebook SSDs. Memo to Toshiba: people won’t pay huge SSD premiums for nothing. And almost nothing is what flash SSDs provide today - and for the foreseeable future.

Please sir, may I have another!
Given the multi-billion dollar cost of semiconductor fabs, getting the notebook SSD market wrong would make Toshiba’s $250 million HD-DVD loss look cheap. The president of Toshiba semi, Shozo Saito, recently opined that flash drives will be in 25% of notebooks by the beginning of 2011.

He is so-o-o wrong.

Hand me the back of the envelope, please
Guessing 200M notebook sales in 2011, 50 million flash drives of, say 250 GB, for total sales of 12.5 million TB of flash. Assuming a cost reduction curve of 50% annually from today’s spot market MLC $2500/TB to ~$320/TB in 2011 . . . hmm-m . . . $4 billion in chip sales.

Give or take. Yummy!

If Toshiba projects winning 20% of the market, $800 million in sales would justify over $1 billion in flash factory capacity. And if the market doesn’t appear, a billion dollar write off.
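For anyone who wants to rerun the envelope with their own guesses, here is a minimal sketch of the arithmetic above. The function and parameter names are mine, the defaults are the post's guesses, and the 3 years of price decline (2008 to 2011) is my reading of the timeline.

```python
def flash_market_estimate(
    notebooks=200e6,          # guessed 2011 notebook unit sales
    flash_share=0.25,         # Saito's 25% projection
    drive_gb=250,             # assumed capacity per flash drive
    price_per_tb_now=2500.0,  # today's spot-market MLC price, $/TB
    annual_decline=0.50,      # assumed yearly price reduction
    years=3,                  # assumption: 2008 -> 2011
    vendor_share=0.20,        # hypothetical Toshiba share of the market
):
    """Return the back-of-the-envelope figures as a dict."""
    drives = notebooks * flash_share
    total_tb = drives * drive_gb / 1000            # decimal terabytes
    price_per_tb = price_per_tb_now * (1 - annual_decline) ** years
    chip_sales = total_tb * price_per_tb
    return {
        "drives": drives,
        "total_tb": total_tb,
        "price_per_tb": price_per_tb,
        "chip_sales": chip_sales,
        "vendor_sales": chip_sales * vendor_share,
    }


est = flash_market_estimate()
for name, value in est.items():
    print(f"{name:>13}: {value:,.1f}")
```

Three straight halvings give $312.50/TB and roughly $3.9 billion in chip sales, which the post rounds to ~$320/TB and $4 billion.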

Same power, same performance and way more costly - I’m sold!
If flash drives delivered what proponents claim there would be no problem. But they don’t and they won’t.

Power: no SSD notebook has gained more than 10 minutes battery life over disks. Since flash is already power-efficient that won’t change. Disks have multiple opportunities to improve power use - and with over $1 billion a year in R&D behind them - they will.

Performance: tested application performance hardly changes either - even with a $3,800 flash drive. Notebook I/O doesn’t favor flash drives - and the engineering contortions needed to fix flash aren’t cheap.

The one big win for flash performance: boot and app load times. It makes the system feel a lot snappier - if you often reboot. Sleep mode makes that much less important.

Reliability/durability: flash vendors tout 2 million hour MTBFs and superior shock & vibe specs. Yet Dell reports that their SSD infant failure rates are about the same as disks. And the return rates are higher.

So where, exactly, is the flash advantage? Plus, it is only conjecture that flash drives will prove to be more reliable in actual notebook use. Only time will tell.

And what about the 4-bit MLC that Toshiba is counting on to drive costs down at 40-50% per year? This will be less durable than current SLC. No hard numbers from the vendors - depends on how good their signal processing algorithms are - but it could easily be 5,000 writes - down from 10,000 today.

How do you explain that to consumers?

Data integrity: the unasked question
Of all the questions about flash drives, this is the biggest. I have yet to see an SSD read error spec.

Flash has read errors - that’s why vendors implement error detection.

But flash has a problem disks don’t: flash drives move your data around a lot more often than disks do. Every time a flash drive writes a page, it has to erase the entire block that page is in.

So what happens to the data in the block? It gets read - almost always correctly - and rewritten along with the new page. The new location must be tracked by the drive.

The map that keeps track of where your data is rapidly gets very complex - and itself is regularly read and rewritten. How well protected is this critical data structure? If it isn’t bulletproof you can kiss your data good bye.
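Here is a toy model of the read-and-rewrite cycle described above. It is a sketch only: `ToyFlash`, the 4-page block size and the block-level map are my assumptions for illustration. Real flash translation layers are far more elaborate, and vendors don't publish theirs.

```python
PAGES_PER_BLOCK = 4  # hypothetical; chosen small so the example is readable


class ToyFlash:
    """Toy block-remapping flash: every page write relocates its whole block."""

    def __init__(self, num_blocks):
        # Each physical block is a list of pages; None means erased.
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.free = list(range(num_blocks))   # erased physical blocks
        self.map = {}                         # logical block -> physical block
        self.pages_copied = 0                 # pages moved only as bystanders

    def write(self, logical_page, data):
        lblock, offset = divmod(logical_page, PAGES_PER_BLOCK)
        old = self.map.get(lblock)
        new = self.free.pop(0)
        if old is not None:
            # Read the untouched pages and rewrite them alongside the new one.
            for i, page in enumerate(self.blocks[old]):
                if i != offset and page is not None:
                    self.blocks[new][i] = page
                    self.pages_copied += 1
        self.blocks[new][offset] = data
        self.map[lblock] = new                # the drive must track the move
        if old is not None:
            self.blocks[old] = [None] * PAGES_PER_BLOCK   # erase the old block
            self.free.append(old)

    def read(self, logical_page):
        lblock, offset = divmod(logical_page, PAGES_PER_BLOCK)
        return self.blocks[self.map[lblock]][offset]


flash = ToyFlash(num_blocks=3)
for page, data in [(0, "A"), (1, "B"), (2, "C"), (0, "A2")]:
    flash.write(page, data)
print(flash.read(0), flash.read(1), flash.read(2), flash.pages_copied)
```

Four user writes caused five bystander page copies and four map updates. Every one of those copies and map updates is a chance for an error that the user never asked for.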

If FTLs (flash translation layers) are like every other storage product, catastrophic failure modes are hiding in the statistical weeds. Enterprise IT is rightly suspicious of storage that “auto-magically” moves data around. Consumers have no idea. SSD vendors better have their act together or the class action suits could be as big a problem as the empty fabs.

The StorageMojo take
The further I wade into flash issues, the worse it gets. My sense is that the flash industry is close to creating a multi-billion dollar fiasco. Why?

  • Over-promising on performance, reliability, battery life and data integrity. Take a systems level perspective, folks. Consumers do.
  • Over-broad positioning of flash drives as a general replacement for notebook hard drives - when pricing clearly says they aren’t.
  • Relying on system OEMs like Dell to market SSDs to consumers is a freeway to failure. They don’t have the bandwidth. The flash vendors need to market flash SSDs directly to consumers. Not sell them - market them.

The flash guys are caught in a vise: big expensive fabs that need to run all year; and seasonal demand that whipsaws their pricing all year.

Notebook flash drives can help even out demand - but only if consumers accept them for the right reasons. Otherwise Toshiba’s new fabs will build chips for a non-existent market.

Update: Flash has a place in one notebook niche: below the $40-$50 minimum cost of a disk. As we’re already seeing with the Asus Eee, replacing $50 of disk with $10 of flash makes a big price difference. But those units won’t solve the seasonality problem and may even make it worse. End update.

Comments welcome, of course.

NAB shorts: Omneon Video Networks

April 24th, 2008 by Robin Harris in Architecture, Clusters, Video

A video networking company in StorageMojo?
Omneon isn’t new to StorageMojo. Their price list has been on the price list page since January 2007.

Their booth was about 50 yards from Isilon’s and EMC’s and it was a madhouse each time I walked by. Partly that was because they were holding all their meetings there, but it also seemed like there was lots of traffic.

Building storage into an app
Founded in 1998, Omneon started offering storage in response to customer demand. They decided on a commodity-based cluster and built their own storage software, MediaGrid.

Their architecture hews to the post-array Google-style storage model:

  • No RAID - slices are replicated one or more times based on policy or demand
  • Single global namespace
  • Out-of-band meta-data servers manage content servers

They can replicate the data from a failed 1 TB drive in less than an hour. Just add 4 or 24 drive content servers to scale capacity. Update: My original wording - that they can rebuild a failed 1 TB drive in less than an hour - was incorrect. Thanks to Bill Todd for elucidating Omneon’s mechanism. End update.
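A quick sanity check on the under-an-hour figure: re-replicating 1 TB in an hour needs a modest aggregate data rate once the work is spread across a cluster. The sketch below is my arithmetic, not Omneon's. The function names and the 10-server example are hypothetical, and it ignores network and metadata overhead.

```python
def required_rate_mb_s(capacity_tb, hours):
    """Aggregate MB/s needed to copy capacity_tb (decimal TB) in the given time."""
    return capacity_tb * 1e12 / (hours * 3600) / 1e6


def per_server_rate_mb_s(capacity_tb, hours, servers):
    """Share of that rate per server if the copying is spread evenly."""
    return required_rate_mb_s(capacity_tb, hours) / servers


total = required_rate_mb_s(capacity_tb=1, hours=1)
share = per_server_rate_mb_s(capacity_tb=1, hours=1, servers=10)  # hypothetical cluster size
print(f"aggregate: {total:.0f} MB/s, per server across 10: {share:.0f} MB/s")
```

Because the slices of a failed drive are copied from replicas scattered over many drives, no single disk has to sustain the whole rate the way a RAID rebuild target does.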

But that’s not all!
Omneon’s content servers do more than serve content. They put their unused CPU power to work doing jobs like transcoding - translating content from one format like HD to iPhone-suitable QuickTime.

Given the growth in multi-core processors that will become a more important part of their market appeal over time. Since they process files, not blocks, they have many more opportunities to add value than a modular array.

The StorageMojo take
Omneon made a lot of smart choices with their MediaGrid architecture. It shows how a company with a few bright engineers can build a basic storage utility to take advantage of low commodity costs.

Where they win is their integration with the application and the workflow. They’ve created a video utility that integrates ingest, post, media management and playout with the smart and scalable storage needed to make it all work.

Application specific storage writ large. They’ve taken the same storage the rest of us use and wrapped broadcast interfaces around it that broadcasters already know.

Comments welcome, of course.

Xiotech’s ISE: beast or gamine?

April 13th, 2008 by Robin Harris in Architecture, Disk, Enterprise

What’s behind the hype?
Congrats to the Xiotech team on generating the most interest at SNW. Their demos were crowded with the curious. Their claims bordered on the implausible, but the credibility of the engineering team kept derision in the corners.

I talked to Ellen Lary, engineering VP, and Steve Sicola, CTO, as well as taping the very helpful Chad. Before going any further, let’s roll the 103 second - less if you skip the credits - tape:

How do they do it?
Darned if I know - they weren’t talking. Reading between the lines:

  • Systems thinking: each disk drive is more powerful than that 1980’s workhorse VAX-11/780 supermini. Put that intelligence to work!
  • Clean code: Xiotech has had free run of Seagate’s best thinking - so they’ve gotten rid of the firmware hairballs inside disk drives to create a distributed architecture where components cooperate in a trusted environment instead of competing. Their disks won’t work with your Brand X controller.
  • Spare no expense: the Xiotech team is going for the gold with a top-of-the-line resource-intensive architecture. If you have to ask how much it costs you can’t afford it.

With 350 IOPS per 15k FC drive claimed - and Sicola said more was coming - this is a lot of bang. When we see some pricing we’ll know about the bucks.

The value proposition
Xiotech’s bet is this: all is forgiven if it kicks butt 7×24 for 5 years. Each ISE is a storage utility writ small. With these building blocks, they promise, you can build an infrastructure whose availability and performance - still the storage ne plus ultra - will beat anything from EMC, IBM or HP.

A worthy goal, indeed.

The StorageMojo take
Just when EMC is assuming that Maui’s new Über-layer will win them the undying cashflow of multinationals, Xiotech comes along and exposes EMC’s feet of clay.

That sucking sound you hear is EMC emptying the datacenter’s coffers to run 7×23.999. If Xiotech can win even 10% of EMC’s business, they’ll be a $1 billion company sooner than they dreamed. And their VCs will be high-fiving in Aspen this winter.

NetApp, IBM and HP should worry as well. It sounded like Xiotech was OEM’ing the ISE to others - if so it makes sense to add them to the product line.

The disk-in-a-box model needed a thorough rethink and kudos to Xiotech for doing it. But many promising - on paper - products have failed. Once Xiotech is shipping and there is independent testing - then we’ll know what they’ve really got.

Comments welcome, of course. The indefatigable Beth Pariseau homes in on the Atrato/Xiotech nexus.

SNW update - Xiotech’s ISE and the dilithium solution

April 9th, 2008 by Robin Harris in Architecture, Disk, Enterprise

It looks like Xiotech is going to cop the “Best Announcement at Spring SNW ’08” prize. See the nifty flash intro.

I did speak to Ellen Lary, Engineering VP last night after going through their mobbed booth. Later today I have an appointment with Steve Sicola, Xiotech’s CTO. I’ll have a more complete report later. Here’s what I’ve gleaned so far.

Remember Atrato?
Interesting stuff:

  • Sealed unit starting at 1.5 TB. They had a 1 PB system on display in three 54 RU - i.e. bigger than you use - racks.
  • 5 year warranty and nifty blue LED light. Are we in a data center or a cocktail lounge?
  • Uses the draft T10 DIF (Data In Flight or Data Integrity Field, Data Integrity Feature - depending on where you read it - evidence that humans have a far greater problem with data integrity than computers do) standard to protect data within the array.
  • Uses Seagate’s own drive test software to attempt repairs on drives in place. Ellen said that about 70% of drives work normally after a power cycle.
  • If power cycling doesn’t work, the box can perform a complete reformat of the drive, starting with laying down tracks and proceeding on to what you and I consider “formatting”.
  • If a particular head is the problem, they can electrically disable that side of a platter while continuing to use the rest of the capacity of the drive.
  • It is cheaper to put in a couple of extra high-end drives than it is to make a service call. This won’t be true in China of course.

The best announcement that WASN’T made at Spring SNW
A company has figured out how to enable long distance synchronous replication. Here in America we like things big - including our idiots in Washington - and our disasters are no exception.

Hurricanes, earthquakes, volcanos, floods, blizzards, tornados and fires - and purblind ideologues - can lay waste to hundreds or thousands of square miles. So normal synchronous replication distances don’t cut it for gotta-have-it infrastructure.

The still-in-stealth-mode company’s Chief Engineer, Montgomery Scott, explained that by running dilithium crystals a little hot, a special hyperspace “tunnel” is created enabling . . . .

Just kidding. Their actual solution looked good in principle but the devil is in the details. I asked all the hard questions I could think of and they had answers for all of them, so it looks like they have something real.

Look for a fall announce.

The StorageMojo take
Those of you wondering if this year would be more of the same old, same old, fear not. The spirit and fact of invention is still strong in the ever-more-vital storage industry.

Comments welcome, of course. Would you use 1,000 mile synchronous replication if you could get it?

StorageVideoMojo

April 9th, 2008 by Robin Harris in Architecture, Video

On the occasion of announcing a new HPC modular array, the Engenio-based 4600, SGI commissioned me to do a StorageMojo video for them.

Some interesting comments about modular vs cluster storage and CXFS. And I got to practice my radio voice.

We spoke for about an hour and I boiled down the comments of Raj Das of SGI and LSI’s Flavio Santoni - before putting the StorageMojo take on it.

Must get video page up soon
One thing about video: every syllable counts. This one gets into Apple’s Motion for the first time. Nothing wild though.

Update: The video came down for a couple of modest tweaks. Now it’s back - new and improved.
End update.

The StorageMojo take
Video is another way to reach people who aren’t going to plow through a white paper. In 4 minutes you meet some people, get exposed to some new ideas and maybe learn something. And you can be drinking coffee in your bathrobe at the same time!

Comments welcome, of course.

Dear Uncle StorageMojo: Datacore vs EqualLogic

March 31st, 2008 by Robin Harris in Architecture, Enterprise

The 2nd installment of an occasional feature . . .
A reader writes:

I think your input would be valuable in helping me make a decision on storage for my company. I’ve done loads of research and I’m fairly certain I have good players narrowed down, but have reservations about both. . . .

Players:
-Datacore SANMelody H/A solution on HP hardware.
-Equallogic PS3800XV

The app
It is an up-to-the-minute commercial application supporting virtual machines. The VM’s run proprietary messaging/transactional servers that spend 99% of their disk I/O time appending very small messages - ~300 bytes - to transaction logs.

Update: After the initial comments, the prefers-to-remain-anonymous reader (BTW, I did check him out and his company is for real) added this clarification:

  • Yes, there are DR and HA requirements.
  • Each VM has its own transaction logs that can grow to GBs in size. These transaction logs are not for archival purposes, rather to recover state in the event of an application restart
  • Traffic: Traffic will come in bursts and maximize at about 1500 iops between 10 separate hosts.
  • Reservations: Is Equallogic a “true” H/A solution considering it does not support synchronous replication between completely separate hardware? Are the competitors’ claims of Datacore’s “unprotected cache” well-founded? (Datacore insists in H/A mode that all cache is synchronously written and requires a commit from its H/A partner before committing to client.)
  • Storage size requirements are small, so I’ll pay for SAS performance over SATA terabytes.

End update.

Update II: The anonymous reader comes back with more crucial detail:

Let’s pretend the budget is around $60k-$70k. I know the two finalists can provide an acceptable degree of HA, DR, and iSCSI performance at that price. What products should one be looking at from HDS/EMC/NetApp? They were not considered initially for the perception of being unaffordable.

End update II.

Update III:

The plan is for an H/A setup in a class 1 datacenter with asynchronous replication over an existing DS3 (..but dark fiber is in the works) to a remote site.

All things considered, the question could be framed, “Whom/What should be demanded for trial?”

End update III.

The StorageMojo take
It is interesting that this customer is NOT looking at the traditional OLTP storage vendors. This is a business-critical application - the company is handling Other People’s Money.

What are the questions the reader should be asking of vendors? How should the problem be framed? I surmise that price is an issue. Where else might the reader go?

I welcome comment from vendors, but please do us the courtesy of identifying yourself as such.

Comments welcome, of course.

Atrato disk array goes public

March 28th, 2008 by Robin Harris in Architecture, Disk, Future Tech

6 weeks ago StorageMojo covered the leaving-stealth-mode non-announce of Atrato’s new storage box. I spoke to Dan McCormick, Atrato’s co-founder and CEO a few days ago for an update.

They’ll have more details at SNW. But here’s what I found interesting.

Density and capacity
The new Atrato box is 3U, not 5, and has about 200 2.5″ drives, for 50 TB raw. With the new 500 GB 2.5s coming out they’ll be able to do 100 TB.

That blows away the density of EMC’s soon-to-be-announced Hulk box. And with the declining delta between 3.5″ and 2.5″ drive capacities, the Atrato box should increase their capacity per rack unit lead.

Performance
In a refreshing change from normal industry practice Atrato quotes IOPS to disk, not cache. Thus their quoted 10,000 IOPS is a real-life number. Dan said that one user got up to 20,000 IOPS after tuning their app.

Apps with big files and large I/Os need disk I/Os, not cache I/Os. Most controllers turn off cache when they see large I/Os anyway. Quoting cache IOPS to their market would be a mistake.
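The per-drive and per-rack-unit numbers implied by those figures are easy to work out. A small sketch follows; the helper name is mine, and since the post says "about 200" drives, treat the results as approximate.

```python
def per_unit_figures(drives, raw_tb, rack_units, iops):
    """Break box-level claims down per drive and per rack unit."""
    return {
        "gb_per_drive": raw_tb * 1000 / drives,   # decimal GB
        "tb_per_rack_unit": raw_tb / rack_units,
        "iops_per_drive": iops / drives,
    }


quoted = per_unit_figures(drives=200, raw_tb=50, rack_units=3, iops=10_000)
tuned = per_unit_figures(drives=200, raw_tb=50, rack_units=3, iops=20_000)
next_gen = per_unit_figures(drives=200, raw_tb=100, rack_units=3, iops=10_000)

for label, figures in [("quoted", quoted), ("tuned", tuned), ("500 GB drives", next_gen)]:
    print(label, {k: round(v, 1) for k, v in figures.items()})
```

That works out to 250 GB and about 50 IOPS per drive at the quoted number, 100 IOPS per drive for the tuned app, and roughly 17 TB per rack unit today versus 33 TB with the 500 GB drives.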

Power
Atrato claims an 80% reduction in power per I/O. Two-thirds of that is due to the power efficiency of 2.5″ drives. The remaining third though is their own special sauce.

Virtual drive hospital
When a drive starts acting up - and with 200 drives that doesn’t take very long - their software “pulls” the drive and tests it. If the drive is failing they leave it alone, but Atrato has found that over half the problem drives can be put back into service.

The StorageMojo take
Still cool. An interesting metric will be uptake into space and power constrained enterprise data centers. If power really is an issue - and while I’m sure it is at some level, the priority is the question - I’d expect to see all the big NYC data centers testing these things within 90 days.

Comments welcome, of course. Dan also commented that StorageMojo’s original Atrato post was the best researched and most insightful of all the reportage they saw. Flattery works.

Punctuated equilibrium in the digital universe

March 27th, 2008 by Robin Harris in Architecture, Future Tech

Mobile computing. Cloud computing. Client-server computing. Green computing.

A new mainframe. A 9U supercomputer. Scale-out clusters. High-bandwidth RAID controllers. Multi-core processors. Massive memory servers.

Facebook. YouTube. Twitter. Blogging. MySpace. Google apps.

The Next Big Thing: there is no Next Big Thing
Punctuated equilibrium is an evolutionary theory that posits that long periods of “normal” evolution - stepwise enhancements that fine-tune environmental adaptation - are interrupted by big events - asteroid strikes, climate change - that engender explosions of mutation and variety. These variations then get whittled down by the pressures of the new normal.

The current hype around “cloud computing” is a case in point. Much over-heated prognostication about how this changes everything. But does it?

Cloud computing will host a certain class of applications that

  • Have low bandwidth requirements
  • Only require ~99% uptime
  • Are latency insensitive

Both “low bandwidth” and “latency insensitive” are relative measures. They will change over time. We’ve always had those applications and always will.

In the 1980s those requirements fit PCs and Novell LANs. In the 1990s they fit browsers and 56k modems. Today they fit smart phones, social media, some web-hosted productivity apps and cool data storage.

But there will always be important apps that don’t meet these restrictions and never will. Plus there will be new products that provide “cloud” advantages of cost and scale without the disadvantages of security, latency and bandwidth costs. Is a local “cloud” still a cloud?

The StorageMojo take
Our human pattern-recognition hardware craves simple patterns and big stories - even if they aren’t there.

What is actually happening is that we are seeing an explosion of new computing forms to take advantage of many new market niches. Old forms will either bend - as the mainframe has - or break - as the minicomputer companies did.

Implicit requirements are becoming explicit. Market demand is great enough to support a larger number of niches. Application users are gradually understanding what they need - as opposed to what they’ve always wanted.

Out of this stew will come the new normal. For a few years anyway.

Comments welcome, of course.

Will FCoE save storage networks?

March 23rd, 2008 by Robin Harris in Architecture, SAN, FC

Back in ‘96, when I was flogging FC networks for Sun under NDA, the most common objection was “I don’t want another layer to manage.” Despite that FC became successful in big enterprise IT shops. But the objection is still valid and a major factor, with price, in the low uptake of FC in smaller shops.

Is FCoE (Fibre Channel over Ethernet) the answer?

FC vendors are - reluctantly - hoping it is
The future of pure FC looks pretty bleak in the long term. 10 GigE is coming down the cost curve just as earlier generations of Ethernet did. The volume Force is with them.

As 10 GigE gets cheaper its total available market gets larger. It may not be optimal, but for many shops “good enough” is good enough.

FC partisans aren’t quitting. 8 Gbit has just started shipping, 16 Gbit is on the drawing boards and there are noises about future generations beyond that.

FCoE follows in the footsteps of VTLs
When 1 Gbit FC started rolling out in ‘97, it was 10x-20x the speed of the then hot 100 Mbit Ethernet in either its full or half duplex flavors. And today - 8 Gbit FC is slower than 10 GigE. It is cheaper, but for how long?

An Emulex VP explained at a recent conference that enterprise shops have well-developed processes for managing FC SANs. FCoE enables shops to continue using those processes minus the fibre. The problem: FCoE won’t be ready for volume deployment until 2010 - if you believe the current schedules.

Any technical problems could easily drop FCoE into 2011, leaving Emulex, Qlogic and Brocade with a 3+ year chasm to cross. The Emulex VP tried to sound enthusiastic about FCoE but wasn’t succeeding. Maybe his teeth hurt.

The StorageMojo take
Enterprise data center inertia is a powerful market driver. Witness the success of VTLs. It’s understandable: they have work to do. Can’t be overhauling the engines in mid-flight.

But Wall Street isn’t as understanding as StorageMojo. FC is topping out, so where is the growth going to come from for FC companies? Especially when new iSCSI, Infiniband and pNFS products are coming to market in the near term.

The current economic malaise will force companies to get tough on data center requirements. The “good enough” standard will be the only standard for apps that aren’t absolutely core to business success.

Comments welcome, of course.

P4P: smart, fast and easy P2P

March 16th, 2008 by Robin Harris in Architecture, Future Tech, Off-Topic, SAN, FC

The P4P working group demo’d their work Friday at the Distributed Computing Industry Association show in New York. Not only did they show 2-3x faster downloads, but they also cut the average number of inter-metro hops - the expensive kind - from over 5 to less than 1. Cool.

The P4PWG idea is that if P2P is both cheaper for ISPs and faster for users we will all have a happier Internet. Folks from the Yale CompSci department - Haiyong Xie, Y. Richard Yang and Avi Silberschatz - along with Verizon and Pando Networks, cooperated on the demo.

The P4PWG includes AT&T, Verizon, Pando, BitTorrent, Cisco and LimeWire among others. The cable companies are there as observers. The P4P work is an open standard with the hope that all ISPs and P2P networks will endorse it.

How does it work?
The tech papers aren’t available yet on the web, but this is what I’ve pieced together from an afternoon’s websurfing. Update: Wide-awake reader Paul found this P4P Overview on Ars Technica. Thanks Paul! End update.

P2P is network oblivious. When you start downloading streams they might be from anywhere, regardless of network cost. The problem is that big routers - not to mention undersea fiber - are costly, while smaller routers are much cheaper.

What P4P does is inject some knowledge into the P2P network so peering decisions are made more intelligently. It looks like a network version of locality of reference.

Implementation
There are at least 2 ways to deliver network awareness to peers. Here’s one of them.

A peer-tracker (pTracker) and an Internet tracker (iTracker) are added to the P2P network. A peer requests peering information of the pTracker, which has knowledge of local (metro area) and recent non-local resources. The pTracker sends back an edited server list and the peer goes its merry way.

If the resources aren’t local and the pTracker doesn’t know the network topology, it pings the iTracker, which returns high-level peering suggestions. If locality of reference works as well in cyberspace as it does with other data the pTracker won’t be querying the iTracker very often.

It is expected that the pTracker will be maintained by the P2P network, while the iTracker could be maintained by the ISP, network or a trusted 3rd party. This should help preserve P2P user privacy, although the *Tracker names certainly won’t reduce user paranoia.
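Since the tech papers aren't out, here is only a guess at the shape of the exchange described above, written as a small sketch. The class names, method names, metro labels and the `want` parameter are all hypothetical. The real P4P interfaces may look nothing like this.

```python
class ITracker:
    """ISP-side tracker: hands out high-level peering suggestions."""

    def __init__(self, nearby_metros):
        # Hypothetical data: metro -> ordered list of cheap-to-reach metros.
        self.nearby_metros = nearby_metros
        self.queries = 0

    def suggest(self, metro):
        self.queries += 1
        return list(self.nearby_metros.get(metro, []))


class PTracker:
    """P2P-side tracker: knows local peers and remembers recent suggestions."""

    def __init__(self, itracker):
        self.itracker = itracker
        self.peers_by_metro = {}       # metro -> set of peer ids
        self.recent_suggestions = {}   # metro -> earlier iTracker answers

    def register(self, peer, metro):
        self.peers_by_metro.setdefault(metro, set()).add(peer)

    def peers_for(self, requester, metro, want=3):
        # Prefer peers in the requester's own metro area.
        local = sorted(self.peers_by_metro.get(metro, set()) - {requester})
        if len(local) >= want:
            return local[:want]
        # Not enough local resources: ask the iTracker once, then reuse the answer.
        if metro not in self.recent_suggestions:
            self.recent_suggestions[metro] = self.itracker.suggest(metro)
        candidates = list(local)
        for other in self.recent_suggestions[metro]:
            candidates.extend(sorted(self.peers_by_metro.get(other, set())))
        return candidates[:want]


itracker = ITracker({"nyc": ["boston", "dc"]})
ptracker = PTracker(itracker)
for peer, metro in [("a1", "nyc"), ("a2", "nyc"), ("b1", "boston"),
                    ("c1", "dc"), ("s1", "seattle")]:
    ptracker.register(peer, metro)
print(ptracker.peers_for("a1", "nyc", want=3), itracker.queries)
```

In this sketch the pTracker only bothers the iTracker when local peers run short, and the distant Seattle peer never makes the list. That is the locality-of-reference behavior the post is describing.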

Guys, how about something less Big Brotherish? PeerServer and RoutServer? Just a thought.

The StorageMojo take
As file sizes continue their secular trend upward the need for P2P will continue to grow. By aligning ISP, telco and user needs for faster and more efficient P2P the P4PWG has pulled off a win/win/win situation.

A less obvious benefit of this work is on VoIP networks, which are also P2P. It doesn’t take much to degrade VoIP quality. To the extent that it enables improvement in P2P network node selection, the P4P project will benefit the rapidly growing population of VoIP users as well.

Kudos to the P4PWG and especially the Yale team.

Comments welcome, of course. Images courtesy of the P4PWG.

StorageMojo’s favorite FAST 08 paper

March 14th, 2008 by Robin Harris in Architecture, Backup, Disk

It didn’t win Best Paper honors at FAST 08 - IIRC that was An Analysis of Latent Sector Errors in Disk Drives (the link is to the StorageMojo review of that excellent paper last month) - but I really like the thinking behind Pergamum: Replacing Tape with Energy Efficient, Reliable, Disk-Based Archival Storage.

Written by Mark W. Storer, Kevin M. Greenan, Ethan L. Miller (UC Santa Cruz) and Kaladhar Voruganti (NetApp) the paper discusses a prototype that

. . . is a distributed network of intelligent, disk-based, storage appliances that stores data reliably and energy-efficiently. While existing MAID systems keep disks idle to save energy, Pergamum adds NVRAM at each node to store data signatures, metadata, and other small items, allowing deferred writes, metadata requests and inter-disk data verification to be performed while the disk is powered off.

They call the appliances tomes.

Tape: where data goes to die
One of tape’s big advantages is that it uses no power at rest. Any disk-based tape replacement will have to come as close as possible to the same ideal.

The tomes use a single hard drive, an ARM-based processor board with NIC and NVRAM. Total power use - when powered up - is about 11.5 watts, less than a 15k FC drive. With tighter code, a slower drive and more integration, I’d bet they could cut that in half.

The single disk drive means that tomes must be used in groups to enable distributed RAID techniques and exchange of algebraic signatures to ensure inter-disk recovery. The paper goes into those techniques in detail.

NVRAM

The purpose of the NVRAM is to provide low-power, persistent storage; operations such as metadata searches and signature requests do not require the unit’s drive to be spun up.

. . . the NVRAM primarily holds metadata such as algebraic signatures and index information, flash writes are relatively rare; flash writes coincide with disk writes.
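To make the idea concrete, here is a toy tome that answers signature requests and queues small writes without spinning up its disk. It is a sketch under loud assumptions: the `Tome` class and its methods are invented for illustration, SHA-256 stands in for the paper's algebraic signatures, and the flush policy is mine, not Pergamum's.

```python
import hashlib


class Tome:
    """Toy Pergamum-style appliance: one disk plus a little NVRAM."""

    def __init__(self):
        self.disk = {}            # block id -> bytes; touching it costs a spin-up
        self.nvram_sigs = {}      # block id -> signature, always available
        self.nvram_pending = []   # deferred (block id, data) writes
        self.spin_ups = 0

    @staticmethod
    def _sign(data):
        # Stand-in only: the paper uses algebraic signatures, not SHA-256.
        return hashlib.sha256(data).hexdigest()

    def write(self, block_id, data):
        # Record the signature and defer the disk write; the disk stays off.
        self.nvram_sigs[block_id] = self._sign(data)
        self.nvram_pending.append((block_id, data))

    def signature(self, block_id):
        # Metadata request served from NVRAM with no spin-up.
        return self.nvram_sigs[block_id]

    def flush(self):
        # One spin-up commits every deferred write.
        if not self.nvram_pending:
            return
        self.spin_ups += 1
        for block_id, data in self.nvram_pending:
            self.disk[block_id] = data
        self.nvram_pending.clear()

    def read(self, block_id):
        # Serve still-pending data from NVRAM, otherwise spin up the disk.
        for pending_id, data in reversed(self.nvram_pending):
            if pending_id == block_id:
                return data
        self.spin_ups += 1
        return self.disk[block_id]


a, b = Tome(), Tome()
a.write(7, b"archive me")
b.write(7, b"archive me")
print(a.signature(7) == b.signature(7), a.spin_ups + b.spin_ups)
```

Two tomes can compare signatures for the same block, a crude form of inter-disk verification, while both disks stay powered off. The disks only spin when data is flushed or read back.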

The Ethernet interconnect is important - by using cheap unmanaged switches for fan out, high aggregate bandwidth, exceeding that of current tape libraries, is easily and inexpensively achieved. The use of power-over-Ethernet would further reduce costs, especially if the system used 4200 RPM drives.

The StorageMojo take
Most of the disk vs tape discussions look at the disk device vs tape cartridge cost issue - and they aren’t that different even today. But the tape library market is a $4-5 billion market. A disk-based alternative to slow tape libraries could take a big chunk of that.

Further, this design could be integrated into a single disk controller board, creating a disk with a single Ethernet port and incredible packaging and manufacturing economies.

If Seagate were smart they’d jump on this. This is a major opportunity to drive another significant consumer of disk drive units - without encroaching on existing OEM customer businesses. That doesn’t happen very often.

Comments welcome, as always. Pergamum was an ancient Greek city known for its sizable library, second only to the library of Alexandria.

Cleversafe’s dispersed storage network

I had a con call with Chris Gladwin and Russ Kennedy of Cleversafe a couple of weeks ago. They’ve come to market with a product line that seeks to deliver:

  • Massive scalability to meet growing digital content requirements
  • Unprecedented Security and Privacy for critical digital assets
  • Survivability against disasters, dishonesty and time
  • Extremely cost-effective infrastructure compared to traditional methods

That’s a quote from their pitch.

Cleversafe’s product line
Cleversafe, IIRC, started as a software company, but their announced products come in nice rack-mountable boxes. There are 3 of them:

  • CS Slicestor - Dispersed Storage server - $11.3k
  • CS Accesser - Dispersed Storage router - $12.3k
  • CS Manager - Dispersed Storage network manager - $12.3k

The Slicestor is a 1U storage server containing 4 disks. The Accesser slices up the data and distributes it - think slice router. The Manager works out of band to monitor and manage the storage network components.

I assume the pricing includes some room for volume discounts. There is an open-source version (c. 2006) of the software. The company intends to offer a software-only version as well.

Why hardware?
The Conventional Wisdom in VC circles is that tin-wrapped software ramps revenues faster - hey, you’re selling tin + bits - at the cost of lower margins and loss of focus.

Qualifying hardware is non-trivial; so you tend to stay on one platform longer than you should. At liquidity event time, software companies fetch higher multiples, so it may be a net loss. VCs live by the Golden Rule: he who has the gold makes the rules.

What it does
Cleversafe has an iSCSI or block storage interface. It takes the data, slices it into small pieces using Information Dispersal Algorithms and then ships the slices off to storage either locally or around the world.

In the latest version you can specify how many slices the system makes and how many slices are required to rebuild the data. If you have 11 data centers around the world, you can specify that, say, 6 are required to recreate the data.

You could lose access to 5 data centers and still recover. If the local controlling authority busts into 3 or 4 data centers, they get nothing. Pretty cool if you worry about corrupt government officials getting hold of your company secrets.
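To make the 6-of-11 arithmetic concrete, here is a minimal sketch. The function and field names are made up for illustration, and the capacity figure assumes each slice holds 1/threshold of the data, which is my assumption about a generic dispersal scheme rather than anything Cleversafe has confirmed.

```python
def dispersal_summary(total_slices: int, threshold: int) -> dict:
    """Basic arithmetic for a k-of-n dispersal scheme (illustrative only)."""
    if not 0 < threshold <= total_slices:
        raise ValueError("threshold must be between 1 and total_slices")
    return {
        # Sites that can be unreachable while the data stays readable.
        "tolerable_losses": total_slices - threshold,
        # Holding this many slices (or fewer) is not enough to rebuild.
        "max_slices_insufficient_to_rebuild": threshold - 1,
        # Raw capacity used per byte stored, ASSUMING each slice
        # holds 1/threshold of the original data.
        "storage_expansion": total_slices / threshold,
    }


summary = dispersal_summary(total_slices=11, threshold=6)
print(summary)
```

Under that assumption the 11-site example costs a bit under 2x raw capacity, which is cheaper than keeping even two full replicas while surviving the loss of five sites.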

The company is planning on adding FTP, CIFS and NFS in the fullness of time.

How well it works
Cleversafe claims that given sufficient low-latency bandwidth the dispersed storage is as fast as a local disk. That’s a tall order, but for now I’ll take their word for it.

Who should buy it?
The company is aiming the Dispersed Storage Network at ISPs to offer as a service and multinationals with round the clock operations and critical data.

How it works
Cleversafe uses Cauchy Reed Solomon erasure codes to slice and dice the data. These codes have several advantages:

  • More capacity efficient and failure tolerant than parity codes
  • Doesn’t require a license
  • Code and decode are faster than other stack operations
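For readers who want a feel for how "any k of n slices" can work, here is a toy sketch. It is plain Reed-Solomon-style polynomial interpolation over the prime field GF(257), not Cauchy Reed Solomon and not Cleversafe's code; the names `encode`, `decode` and `_interpolate` are hypothetical. It is also systematic, so the first k slices are the data itself and it offers no secrecy. It needs Python 3.8+ for the modular inverse.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial through `points`, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse of den.
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, n: int):
    """Turn k = len(data) bytes into n slices; any k of them suffice."""
    k = len(data)
    points = [(x, data[x - 1]) for x in range(1, k + 1)]
    # Slices 1..k repeat the data; slices k+1..n are extra evaluations.
    return [(x, _interpolate(points, x)) for x in range(1, n + 1)]


def decode(slices, k: int) -> bytes:
    """Rebuild the original k bytes from any k surviving slices."""
    if len(slices) < k:
        raise ValueError("not enough slices to rebuild the data")
    points = slices[:k]
    return bytes(_interpolate(points, x) for x in range(1, k + 1))


data = b"mojo!!"                      # k = 6 symbols
slices = encode(data, n=11)           # 11 slices, one per site
survivors = slices[5:]                # pretend the first 5 sites are gone
print(decode(survivors, k=6) == data)
```

Production codes work on GF(2^w) with bit-matrix tricks so that every slice symbol fits in a machine word and encoding reduces to XORs, which is where the Cauchy variant earns its speed.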

If you’d like to play with Cauchy Reed Solomon, check out Dr. Jim Plank’s software page which includes

. . . Reed-Solomon coding, Cauchy Reed-Solomon coding, general bit-matrix coding, Reed-Solomon coding optimized for RAID-6, and Liberation coding. The documentation provides some tutorial material on matrix and bit-matrix based erasure coding.

I met the good doctor at FAST, where he was delighted to find that Cleversafe - also a FAST presenter - was using techniques he’d worked on a decade ago.

The StorageMojo take
I’m impressed with what Cleversafe has done. They will look even smarter after EMC’s Hulk/Maui announcement this spring. I suspect they’ll be bought by year’s end.

Kudos to the Cleversafe team.

Comments welcome, of course.

NetApp’s research offensive

February 26th, 2008 by Robin Harris in Architecture, Disk

After last year’s publication of the Google and CMU papers on the much-higher-than-expected annual failure rates of disk drives, StorageMojo challenged vendors to respond.

I said

The industry has an excellent opportunity to move to greater transparency with storage consumers. Sometimes relationships need a jolt to remind everyone just how much we rely upon each other. Storage is a vital industry with the responsibility to protect and access an ever increasing fraction of mankind’s data. Customers want the best tools for the job. It appears the industry hasn’t been providing them, at least for disk drives. I know some efforts are underway in IDEMA to improve the quality of the numbers. I’d get serious about ensuring that the revised processes actually benefit customers rather than soothing corporate egos. Otherwise this situation will arise again.

Further, the need to engage at a more personal level is a predictable outcome of the continuing consumerization of IT. This is an example of the new normal. Embrace it.

Working through the weekend, NetApp’s Val Bercovici did. IBM did so a little later. EMC said semi-nothing.

Two weeks later a not-very-bright EMC’er sent an EMC lawyer to shut StorageMojo up. Some people are so-o-o sensitive.

FAST forward
This week at FAST (File and Storage Technologies ‘08) a group of research papers responds to the Google and CMU work. The papers are Parity Lost and Parity Regained; Are Disks the Dominant Contributor for Storage Failures?; An Analysis of Latent Sector Errors in Disk Drives; and An Analysis of Data Corruption in the Storage Stack. In them, NetApp researchers working with academics - including Bianca Schroeder, one of the authors of the CMU paper, and Andrea and Remzi Arpaci-Dusseau of the University of Wisconsin - examine the state of the art in data storage.

Often using NetApp’s AutoSupport data base, the papers delve into knotty problems in array architecture and component behavior. With the advantage of large sample sizes the papers see further into statistically uncommon events.

For example An Analysis of Data Corruption in the Storage Stack looked at over 1.5 million disks on more than 40,000 systems over 41 months. Those numbers dwarf the combined samples of the Google and CMU teams.

Some surprising results
The cynical, myself among them, might be tempted to dismiss the work as an exercise in self-justification. The studies find disk scrubbing useful in eliminating silent data corruption, a result any half-awake SE will use to their advantage.

But in Parity Lost and Parity Regained - nice Milton reference! - they also found that disk scrubbing could spread an error - parity pollution - across multiple disks. In fact,

. . . the tendency of scrubs to pollute parity increases the chances of data loss when only one error occurs.
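A toy model shows how a scrub can make one error worse. This sketch uses single-byte "blocks", plain XOR parity and a deliberately naive scrub that trusts the data and rewrites parity on any mismatch. All of that is my simplification for illustration, not the paper's model or NetApp's RAID code.

```python
from functools import reduce


def xor_parity(blocks):
    """Single-parity RAID: parity is the XOR of all data blocks."""
    return reduce(lambda a, b: a ^ b, blocks)


def rebuild(index, blocks, parity):
    """Reconstruct one block from the other blocks plus parity."""
    others = [b for i, b in enumerate(blocks) if i != index]
    return xor_parity(others + [parity])


def naive_scrub(blocks, parity):
    """Hypothetical scrub: on a mismatch, trust the data and rewrite parity."""
    expected = xor_parity(blocks)
    return expected if expected != parity else parity


data = [0x11, 0x22, 0x33]
parity = xor_parity(data)

# One silent error: block 1 is corrupted on disk.
damaged = [0x11, 0x99, 0x33]

before = rebuild(1, damaged, parity)     # parity still "remembers" 0x22
polluted = naive_scrub(damaged, parity)  # scrub rewrites parity from bad data
after = rebuild(1, damaged, polluted)    # parity now agrees with the bad block

print(hex(before), hex(after))  # prints: 0x22 0x99
```

Before the scrub the good value was still recoverable from parity. After it, the error lives on two disks and nothing in the stripe disagrees, so the original data is gone.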

This is honest research, following the data wherever it goes. It is the difference between science and spin.

The StorageMojo take
NetApp’s research offensive is commendable. While IBM, HP and Microsoft maintain large research groups and publish regularly, they are many times NetApp’s size.

It is also smart marketing. NetApp’s research gives them a ready entree to corporate system architects and technical opinion leaders with a fresh and data-heavy perspective on IT risk management.

NetApp is to be congratulated for the work they’ve done. By participating in the conversation they advance the state of the art and their stature with customers. The former is good for the industry and both are good for NetApp.

Update: A commenter requested links to the papers. They aren’t all freely available on line yet. Here are the two I found online. Download the pdf for Parity Lost and Parity Regained, An Analysis of Data Corruption in the Storage Stack.

Update 2: Prof. Peter Honeyman of CITI wrote in to let us know that the FAST papers are available here. Thanks Doc.

Comments welcome, of course.

Why do storage systems fail?

February 24th, 2008 by Robin Harris in Architecture, Disk, Enterprise

It’s the disks, right?
We’ve heard much about disk failures - as recently as last week as well as last year’s reports from Google and CMU. But what about the rest of the system?

In a FAST ‘08 paper to be presented this week - Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics - authors Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky analyze logs from 39,000 systems over 44 months to get answers.

1.8 million disks in 155,000 shelves
NetApp provided data from a variety of systems, including near-line, low-end, mid-range and high-end arrays. The team analyzed the log reports to understand what components led to failures.

The 15 page paper offers some interesting findings

  • Physical interconnect failures are a significant contributor - anywhere from 27-68% - of storage subsystem failures.
  • Subsystems that use the same disk models show similar disk failure rates - but the subsystem failure rates vary significantly.
  • Enclosures have a strong impact on subsystem failures. Some enclosures work better with some drives than others.
  • Dual-redundant FC shelf interconnects reduce annual failure rates 30-40%.
  • Interconnect and protocol failure rates are much more bursty than disk failures. Some 48% of overall subsystem failures arrive at the same shelf within 10,000 seconds (~ 3 hours) of the previous failure.
  • As interconnect failures are so bursty, resilience mechanisms beyond RAID are required to achieve subsystem availability.
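If you want to check your own logs for the same burstiness, the measurement might look something like this. It is a rough sketch: the `(shelf_id, timestamp)` event format, the function name and the sample data are all made up, and the paper's exact methodology may well differ.

```python
from collections import defaultdict


def bursty_fraction(events, window_seconds=10_000):
    """Fraction of all failures that arrive at the same shelf within
    `window_seconds` of that shelf's previous failure.

    `events` is an iterable of (shelf_id, timestamp_in_seconds) pairs.
    """
    by_shelf = defaultdict(list)
    for shelf, ts in events:
        by_shelf[shelf].append(ts)

    close, total = 0, 0
    for times in by_shelf.values():
        times.sort()
        total += len(times)
        # Count failures that follow their predecessor within the window.
        close += sum(
            1 for prev, cur in zip(times, times[1:])
            if cur - prev <= window_seconds
        )
    return close / total if total else 0.0


# Made-up log entries: shelf A has a burst, shelves B and C do not.
events = [
    ("A", 0), ("A", 3_600), ("A", 9_000),
    ("B", 50_000), ("B", 900_000),
    ("C", 1_000),
]
print(bursty_fraction(events))
```

In the sample, two of the six failures follow a previous failure on the same shelf within the window, so the function returns one third.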

What else?
They also found that enterprise drives had an AFR consistent with manufacturer specs - less than 1% AFR. This result derives from looking at the disks as the system does rather than as users see them.

The StorageMojo take
Interconnects, especially connectors, have long been fingered as a significant cause of equipment problems - and not just in storage. While the team seems to report that interconnects are a greater cause of subsystem failure than disks, there seems to be some room for disagreement about what the numbers are telling us.

For example, this result doesn’t fully explain the delta between what disk users have found and the “trouble not found” rates that manufacturers report. Even if you accept the common 50% TNF vendors report, drive failures are still higher than this research finds.

Perhaps we should conclude that NetApp’s engineering is higher quality than the general run of storage arrays. Or perhaps system log analysis is still a dark art whose results are more indicative than conclusive.

Comments welcome, as always. I’m at the FAST ‘08 conference this week in the San Jose Fairmont hotel.


