StorageMojo





Robin Harris    


Stupid storage failures

November 25th, 2008 by Robin Harris in Architecture, Disk, SSD/Flash Disk

Valiant but doomed
The ZFS discussion thread had an interesting comment from Sun’s Jeff Bonwick, architect of ZFS, on storage device failure modes. How do you know a disk or a tape has failed?

You don’t. You wait, while the milliseconds stretch into seconds and maybe even minutes. Jeff states the problem - and Sun’s solution - this way:

. . . we’re trying to provide increasingly optimal behavior given a collection of devices whose failure modes are largely ill-defined. (Is the disk dead or just slow? Gone or just temporarily disconnected? Does this burst of bad sectors indicate catastrophic failure, or just localized media errors?) . . . there’s a lot of work underway to model the physical topology of the hardware, gather telemetry from the devices, the enclosures, the environmental sensors etc, so that we can generate an accurate FMA [Fault Management Architecture] fault diagnosis and then tell ZFS to take appropriate action.

With all due respect to Jeff, that solution seems iffy: how will you ever keep up with all the devices and firmware levels needed to make that work?

A community of prima donnas
There are lots of messy failure modes in computer systems. The literature around the Byzantine Generals Problem (see Wikipedia or, for a rigorous treatment, The Byzantine Generals Problem by L. Lamport et al.) tackles the problem of the malicious server in a community of network servers. That is a hard problem.

Knowing whether a storage device is alive, dead or only sleeping shouldn’t be so hard. Drives have powerful 32-bit processors - more powerful than a VAX 780 - and lots of statistics on what they are doing.

It seems like a disk could give a modulated heartbeat signal to drivers - “ready” “reboot” “caught in retry hell” “dead” - to decrease uncertainty.
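To make the idea concrete, here is a minimal sketch of how a driver might act on such a self-reported state. No drive exposes this interface today; the `DriveState` names come from the list above, while `host_action` and its `patience` threshold are purely hypothetical.

```python
from enum import Enum


class DriveState(Enum):
    """Hypothetical states a drive could self-report, taken from the post's list."""
    READY = "ready"
    REBOOT = "reboot"
    RETRY = "caught in retry hell"
    DEAD = "dead"


def host_action(state, seconds_in_state, patience=5.0):
    """Pick a host-side response from a self-reported drive state.

    `patience` is an assumed tuning knob: how long the host tolerates a
    transient state before treating the drive as failed.
    """
    if state is DriveState.READY:
        return "issue I/O normally"
    if state is DriveState.DEAD:
        return "fail the drive and start rebuild"
    # REBOOT and RETRY are transient: wait briefly, then give up.
    if seconds_in_state <= patience:
        return "serve reads from redundancy and wait"
    return "fail the drive and start rebuild"


if __name__ == "__main__":
    samples = [(DriveState.READY, 0), (DriveState.RETRY, 2), (DriveState.RETRY, 30)]
    for state, secs in samples:
        print(state.value, "->", host_action(state, secs))
```

The point of the sketch is that the host no longer has to guess from a timeout alone: an explicit "dead" lets it fail over immediately, and an explicit "retry" tells it the wait is probably worth a few seconds.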

The StorageMojo take
Drive vendors may think that non-standards for drive condition reporting are a form of lock-in, but that misses the bigger picture: the quality and timeliness of condition reports - even with a standard format - would be a competitive differentiator.

At the margin it would help slow the move to commodity-based cluster storage by enabling array vendors to improve their error handling and perceived reliability. It would also help disks versus flash SSDs, whose perceived reliability is partly due to the gap between user-judged drive “failures” and vendor “no trouble found” test results.

Storage systems all know how to deal with disk failures - they have to. So drive vendors, how about getting together to help make knowing a drive’s status a lot easier? Hey, IDEMA, make yourself useful!

Courteous comments welcome, of course.

Economic crisis and the storage industry

November 19th, 2008 by Robin Harris in Clusters, Enterprise, Future Tech

Yes, Virginia, the storage industry will survive the crisis
Economists and business leaders generally agree that the current, as yet unofficial, recession will be the worst we have seen since the Great Depression. The credit bubble has popped and we are facing global de-leveraging that will take years to unwind.

De-leveraging is a fancy term for “a lot less money rolling around.” The computer industry started after the Great Depression so these will be the worst times we’ve ever seen.

How bad will it get for storage?
Storage is a special case. Disk drives underlie everything we do and they show no sign of slowing their capacity increases and price drops.

Data growth rates are a little less certain - contracting businesses produce less data - but the economic advantages of online data continue to grow as cost per gigabyte drops. Even in the financial sector someone is going to have to unravel all of those credit derivative swaps and synthetic securities that the “rocket scientists” - heckuva job, guys! - developed.

Where will this impact IT operations? Right in the heart of the array business.

A little smarter, a lot cheaper
Assume 80% of all business data is unstructured. And suppose 80% of that data is stored on storage arrays that are optimized for transactional data.

If RAID arrays average $6/GB today and cluster storage averages $2/GB we can begin to estimate the potential impact. In a perfect world 64% - 80% of 80% - of all corporate data could be migrated from high cost storage arrays to much lower cost storage clusters.

If the storage array business is $21 billion a year today, that means there is roughly a total available market of $13 billion of IT spend that could go to storage clusters. If storage clusters are 1/3 the price of storage arrays that suggests a total storage cluster business of $4 billion a year.
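For anyone who wants to play with the assumptions, the back-of-envelope arithmetic can be sketched in a few lines. The function and parameter names are mine; the default figures are the ones used above. With those defaults the unrounded results come to about $13.4 billion and $4.5 billion, which I round to $13 billion and $4 billion.

```python
def cluster_market_estimate(array_market_busd=21.0, unstructured_share=0.80,
                            on_arrays_share=0.80, cluster_price_ratio=1 / 3):
    """Reproduce the post's back-of-envelope estimate (figures in $ billions)."""
    # Share of all corporate data that could migrate: 80% of 80%.
    migratable_share = unstructured_share * on_arrays_share
    # Array spend that is, in a perfect world, up for grabs.
    addressable_spend = array_market_busd * migratable_share
    # The same capacity bought at the lower cluster price.
    cluster_spend = addressable_spend * cluster_price_ratio
    return migratable_share, addressable_spend, cluster_spend


if __name__ == "__main__":
    share, addressable, cluster = cluster_market_estimate()
    print(f"migratable share: {share:.0%}")
    print(f"addressable array spend: ${addressable:.1f}B")
    print(f"cluster spend at 1/3 price: ${cluster:.1f}B")
```

Change any one input, say the share of unstructured data, and the estimate moves proportionally, so treat the output as a rough sizing and not a forecast.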

That ignores, of course, the traditional impact of sharply lower storage costs: a rapid increase in the amount of data stored. Online and easily searched data is much more valuable than data stored on paper or tape. A first-order guess is that in today’s market there is the potential for an $8 billion a year storage cluster IT spend.

That’s the theory, anyway. The reality is that most IT professionals will not give up the storage arrays they know and love without a fight. But the economic pressure will be unrelenting.

Winners and losers
This won’t be a rapid process. The early not-very-good storage arrays came out in 1990, and it took 8 years for their sales to reach 50% of enterprise storage capacity. The economic advantages of cluster storage are greater and the pressure to contain costs much stronger today. It will be 6 years before half of all enterprise storage capacity sales are in storage clusters.

The winners will be those companies that embrace and extend the capability of storage clusters the soonest. Among large companies HP and EMC appear to have the lead. Among the small companies several will be purchased while others will continue to grow as independent entities.

The losers? IBM appears to have no discernible strategy. NetApp is bogged down in its efforts to integrate the GX global namespace with the contradictory requirements of its traditional Data OnTap code base.

Sun has good building blocks but will fail if they lead with Lustre. HDS will wait until the market is defined to start moving - but that may be too late. This is a software play in more ways than one.

Smaller companies in the array business have a steep learning curve with cluster storage. Expect most of them to fade over time. There will be opportunities for OEM suppliers to the mid-tier vendors.

The StorageMojo take
The age of the RAID array is coming to an end. They won’t disappear any more than mainframes have. But they will become much less common. The array business will see single-digit sales drops and general long-term stagnation. The storage cluster business will show robust growth.

The race for storage cluster dominance is still young. There are many variables where newcomers and existing players can find or fumble important advantages. Can storage clusters be effectively productized? Or will integration requirements favor service-oriented companies? How will flash be best integrated into storage clusters? How will the SMB market be cracked?

The economic crisis does not create new trends. It accelerates existing ones. IT professionals should not underestimate the power and impact of the current crisis on once sacrosanct IT budgets.

IT likes to talk about “business partnership.” Now is the time for action. Show the CFO that you know how to do more with less and you’ll be a partner. Insistence on business as usual is the wide road to a pink slip.

Courteous comments welcome, of course. Disclosure: I’ve recently done some work for HP on their announced but not-quite-shipping Extreme Data Storage 9100. I was impressed.

Atmos: EMC rolls the dice

November 17th, 2008 by Robin Harris in Off-Topic

EMC’s Atmos, the product formerly known as Hulk/Maui, has gotten the full EMC marketing machine treatment. With a twist: EMC is rolling the dice on an unproven concept.

If it’s eat lunch or be lunch, EMC prefers to dine. I like it.

The pig
I covered Atmos’ academic antecedents - OceanStore and Antiquity - in an earlier post. After looking at the announcement material it is clear that Atmos offers far less than the Berkeley folks envisioned.

They may want to get there, but they aren’t there yet. That’s why we have v1 software.

Squinting past the hype
There are some oddities in the announcement.

  • No customer endorsement. Normal EMC announcements always have joyful customers endorsing the product. For a product that has been shipping since June - according to some EMC bloggers - the absence is unusual.
  • “Powerful object metadata and policy-based information management capabilities . . . .” Atmos is not a file system - file systems exist on the client - so the lack of OceanStore’s introspective data management feature is ugly.
  • How do you access it? Most attention has focused on REST and SOAP. It does support CIFS, NFS and IFS (Installable File System - haven’t seen that in a while). The latter are more important.
  • Centera vs Atmos. EMC is at great pains to claim that Atmos doesn’t compete with Centera. Obviously it does, since it would be trivial to add Centera’s features to a cheaper storage infrastructure.
  • EMC tossed out the IBRIX cluster file system in favor of something they gen’d up fairly quickly. A CFS is non-trivial so one must wonder how stable and feature-rich the local storage pools are.

The perfume
All the touting of the policy-based management doesn’t answer the need for introspective object management. In OceanStore, the storage system doesn’t know about relationships between objects - it isn’t a file system - so introspection is important for the system to react intelligently to change.

Let’s say that a webpage with a video on it has links to other videos and multi-megabyte downloads. The policy system in Atmos relies on the user to specify the content’s policy. But if the videos and downloads are specified with different policies, the availability of each component on the page will vary when it catches fire on the web.

An introspective system would note that these objects are associated and move/replicate them together. Introspection isn’t easy, but in a billion object system, humans just get in the way.
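A toy illustration of the problem, assuming three objects on one page were given three different placement policies. The site names, object names and the `sites_serving_whole_page` helper are all made up for the example; nothing here reflects how Atmos actually stores placement.

```python
def sites_serving_whole_page(placement, page_objects):
    """Return the sites that hold every object the page needs."""
    site_sets = [placement[obj] for obj in page_objects]
    return set.intersection(*site_sets)


if __name__ == "__main__":
    # Hypothetical placement produced by three separately assigned policies.
    placement = {
        "page.html": {"boston", "london", "tokyo"},
        "video.mp4": {"boston", "london"},
        "download.zip": {"boston"},
    }
    page = ["page.html", "video.mp4", "download.zip"]
    print(sites_serving_whole_page(placement, page))
```

Only the sites common to every policy can serve the complete page locally, so the least-replicated object sets the ceiling for the whole page.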

The StorageMojo take
None of the big storage companies is doing more to shake up the industry than EMC. Atmos is bold, whatever you think about its chances.

The important point is that EMC is embracing, however gingerly, commodity storage for enterprise customers. They aren’t the first with sub-$2/GB bulk storage, but CIOs listen to them.

Atmos batters EMC’s core value prop with a beta+ product for a not-sure-it-exists nascent market. Atmos is EMC’s boldest move since the original Symm. It may also turn out to be its most successful. Or not.

Atmos seems to have an unusual dispensation from profitability in the interests of giving the technology and the market time to mature. This speaks to a seriousness of purpose that competitors would be wise to note.

From an architecture perspective it isn’t clear whether the overhead of an Atmos is worth the cost. Perhaps a simpler content delivery network structure would deliver 95% of the benefit of Atmos at half the cost.

Right now the product is far from fully baked. EMC will no doubt learn valuable lessons about what the global 5000 and ISPs need from Internet-era storage. Competitors who wait too long will be looking at a steep learning curve.

Google’s Jeffrey Dean is actively looking for an integration strategy to knit together their global collection of data centers into a single namespace. While they have special requirements their reluctance to embrace an OceanStore-like architecture suggests that global cloud storage hasn’t reached a technical consensus.

Make no mistake: Atmos is huge. Whether it wins or someone else does is beside the point. The battle for massive-scale commercial storage has been joined.

Courteous comments welcome, of course.

The computer science behind EMC’s cloud storage

November 12th, 2008 by Robin Harris in Architecture, Clusters, Enterprise, Future Tech

EMC has announced Hulk/Maui, now known as Atmos. I’m flying to Boston today and don’t have access to EMC’s announcement documents.

But I have something better: the papers that provide the theoretical underpinning for Atmos. They provide an in-depth background that isn’t often available for new products.

These papers have too many interesting details to summarize them all. Here are some points that strike my fancy. YMMV.

If you want to understand Atmos these papers are essential. Details of EMC’s implementation will differ of course, but the underlying architectural trade-offs and management issues remain.

A 10 trillion file store
In 2000 a UC Berkeley paper OceanStore: An Architecture for Global-Scale Persistent Storage, authored by John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao, laid out the architecture of what is now Atmos. EMC provided funding for the research and Patrick Eaton went to work for EMC a couple of years ago.

The abstract says:

OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is comprised of untrusted servers, data is protected through redundancy and cryptographic techniques. To improve performance, data is allowed to be cached anywhere, anytime. Additionally, monitoring of usage patterns allows adaptation to regional outages and denial of service attacks; monitoring also enhances performance through proactive movement of data.

The design center: 1 billion users; each storing 10,000 files. 10 trillion files. Utility storage indeed!

A cluster of clusters
OceanStore is a software layer that creates a global storage cluster. While the paper simply refers to servers, the servers can be clusters as well.

EMC’s engineers chose to use a 3rd party cluster product - IBRIX I think - for the local data stores so they could focus on the layer that glues the sites together. Each local store can itself be a petabyte or more.

Update: several commenters assure us that IBRIX is not the local cluster file system. EMC is using some open source software in Atmos. End update.

Untrusted infrastructure
A key goal of the paper and its prototype was to assume untrusted infrastructure - a phrase that fairly sums up today’s Internet. Only clients are trusted with cleartext - all stored content is encrypted - but most servers are assumed to be working correctly and to help maintain file consistency.

Nomadic data
A global storage system has a unique requirement for locality. But it also needs to be able to store data anywhere, anytime to maintain persistence in the face of outages and catastrophes. Thus data has to be separated from its physical location.

Files are encrypted at the source and stored as persistent objects with globally unique identifiers (GUIDs). OceanStore has no knowledge of how a file’s objects relate to one another, so it relies on introspection, a mechanism that notes correlations among objects.

Thus the system moves highly correlated objects together, reducing the latency problems that a non-introspective object store faces in a global infrastructure.
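One simple way to picture introspection is co-access counting: objects that keep showing up in the same client sessions get grouped and placed together. This is my own minimal sketch, not the algorithm from the paper; the session format, the `threshold` parameter and both function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(sessions):
    """Count how often each pair of object IDs shows up in the same session."""
    counts = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    return counts


def correlated_groups(sessions, threshold=2):
    """Group objects whose pairwise co-access count meets `threshold`.

    A small union-find makes correlation transitive: if A goes with B
    and B goes with C, all three end up in one placement group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), n in co_access_counts(sessions).items():
        if n >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for obj in list(parent):
        groups.setdefault(find(obj), set()).add(obj)
    return [g for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    sessions = [["page", "video1", "zip"], ["page", "video1"], ["video1", "zip"], ["other"]]
    print(correlated_groups(sessions))
```

A placement layer could then migrate or replicate each group as a unit, which is the behavior described above.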

Ciphertext
The paper notes that restricting OceanStore to ciphertext limits what can be done with the data. But there is more flexibility than you might suppose.

The operations compare-version, compare-size, compare-block, and search are all possible. In addition there are several feasible update operations, such as replace-block, insert-block, delete-block and append.
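Here is a rough sketch of how a server that never sees plaintext could still support compare-block, replace-block and append. The `toy_encrypt_block` function is a deliberately insecure stand-in for a real position-dependent block cipher, and the `CiphertextStore` class and its method names are hypothetical, not OceanStore's API.

```python
import hashlib


def toy_encrypt_block(key, index, block):
    """Toy position-dependent XOR cipher. NOT secure; a stand-in for a real cipher.

    Applying it twice with the same key and index returns the original block.
    """
    if len(block) > 32:
        raise ValueError("toy cipher handles blocks of at most 32 bytes")
    pad = hashlib.sha256(key + index.to_bytes(8, "big")).digest()
    return bytes(p ^ k for p, k in zip(block, pad))


class CiphertextStore:
    """Hypothetical server that only ever holds ciphertext blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def compare_block(self, index, expected_digest):
        # The server hashes the ciphertext it holds; no key is needed.
        return hashlib.sha256(self.blocks[index]).digest() == expected_digest

    def replace_block(self, index, new_cipher):
        self.blocks[index] = new_cipher

    def append(self, new_cipher):
        self.blocks.append(new_cipher)


if __name__ == "__main__":
    key = b"client-only secret"
    plain = [b"hello, world!!!!", b"second block...."]
    store = CiphertextStore(toy_encrypt_block(key, i, p) for i, p in enumerate(plain))

    # The client knows what block 0 should encrypt to, so it can ask the
    # server to confirm the block is unchanged before replacing it.
    digest = hashlib.sha256(toy_encrypt_block(key, 0, plain[0])).digest()
    if store.compare_block(0, digest):
        store.replace_block(0, toy_encrypt_block(key, 0, b"HELLO, WORLD!!!!"))
    print(toy_encrypt_block(key, 0, store.blocks[0]))
```

The client does the encrypting and hashing; the server just compares and stores bytes, which is why these operations survive the ciphertext-only restriction.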

Applications
Multi-petabyte data stores for scientific, security or commercial applications are obvious applications. But telcos and ISPs are most interested in mobile apps.

The authors call out email as an apt OceanStore application.

OceanStore alleviates the need for clients to implement their own locking and security mechanisms, while enabling powerful features such as nomadic email collections and disconnected operation. Introspection permits a user’s email to migrate closer to his client, reducing the round trip time to fetch messages from a remote server. OceanStore enables disconnected operation through its optimistic concurrency model—users can operate on locally cached email even when disconnected from the network; modifications are automatically disseminated upon reconnection.

APIs
OceanStore offered its own API. But the authors also developed facades for the base API that emulated a Unix file system, a transactional database, and a World Wide Web gateway.

Replication
OceanStore used erasure codes, not unlike the mechanism Cleversafe uses for its distributed data store system. Replica management is a major task for a global system and the paper goes into some detail on their solutions.
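For readers new to erasure codes, the simplest possible example is single XOR parity: split the data into k fragments, add one parity fragment, and any one lost fragment can be rebuilt from the rest. This sketch is only meant to show the idea. Production systems use stronger codes, such as Reed-Solomon, that survive multiple losses, and the `encode` and `recover` names are mine.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k=4):
    """Split `data` into k equal fragments plus one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(frag_len * k, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = bytes(frag_len)
    for frag in frags:
        parity = xor_bytes(parity, frag)
    return frags + [parity]


def recover(frags):
    """Rebuild the single missing fragment (marked None) by XOR-ing the rest."""
    missing = [i for i, frag in enumerate(frags) if frag is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one lost fragment")
    present = [frag for frag in frags if frag is not None]
    rebuilt = bytes(len(present[0]))
    for frag in present:
        rebuilt = xor_bytes(rebuilt, frag)
    repaired = list(frags)
    repaired[missing[0]] = rebuilt
    return repaired


if __name__ == "__main__":
    data = b"OceanStore fragments"
    frags = encode(data)
    damaged = list(frags)
    damaged[2] = None  # simulate one lost server
    restored = recover(damaged)
    print(b"".join(restored[:4])[:len(data)])
```

Spread the five fragments across five servers and the loss of any one server costs nothing but a rebuild, at 25% storage overhead instead of the 100% a full replica would need.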

The 2nd paper
A 2nd paper, Antiquity: Exploiting a Secure Log for Wide-Area Distributed Storage (available at the same link above) published last year, expands on the OceanStore work.

. . . the secure log interface implemented by Antiquity is a result of breaking OceanStore into layers. In particular, a component of OceanStore was a primary replica implemented as a Byzantine Agreement process. This primary replica serialized and cryptographically signed all updates. Given this total order of all updates, the question was how to durably store and maintain the order? . . . The secure log structure assists the storage system in durably maintaining the order over time. The append-only interface allows a client to consistently add more data to the storage system over time. Finally, when data is read from the storage system at a later time, the interface and protocols ensure that data will be returned and that returned data is the same as stored.

Finally, self-verifying structures such as a secure log lend themselves well to distributed repair techniques. The integrity of a replica can be checked locally or in a distributed fashion. In particular, we implemented a quorum repair protocol where the storage server replicas used the self-verifying structure. The structure and protocol provided proof of the contents of the latest replicated state and ensured that the state was copied to a new configuration.
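A stripped-down picture of a self-verifying, append-only log: each entry records the hash of the entry before it, and each entry carries an authentication tag. This is my own sketch of the general technique, not Antiquity's design. In particular the HMAC stands in for the public-key signatures a real system would use, and the class name and field layout are assumptions.

```python
import hashlib
import hmac


class SecureLog:
    """Append-only log whose entries are hash-chained and tagged with an HMAC."""

    GENESIS = b"\0" * 32  # assumed placeholder "previous hash" for the first entry

    def __init__(self, key):
        self._key = key
        self.entries = []  # each entry is (data, prev_hash, tag)

    def _head(self):
        if not self.entries:
            return self.GENESIS
        data, prev, _tag = self.entries[-1]
        return hashlib.sha256(prev + data).digest()

    def append(self, data):
        prev = self._head()
        tag = hmac.new(self._key, prev + data, hashlib.sha256).digest()
        self.entries.append((data, prev, tag))


def verify(entries, key):
    """Check every link and tag; a replica holding the key can run this locally."""
    expected_prev = SecureLog.GENESIS
    for data, prev, tag in entries:
        good_tag = hmac.new(key, prev + data, hashlib.sha256).digest()
        if prev != expected_prev or not hmac.compare_digest(tag, good_tag):
            return False
        expected_prev = hashlib.sha256(prev + data).digest()
    return True


if __name__ == "__main__":
    log = SecureLog(b"demo key")
    for update in (b"create file", b"append block 1", b"append block 2"):
        log.append(update)
    print(verify(log.entries, b"demo key"))
```

Because each entry commits to everything before it, altering, dropping or reordering an earlier entry breaks verification, which is what lets a repair protocol trust the replica it copies from.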

The StorageMojo take
Bravo! EMC is taking cutting edge computer science and turning it into a product. I’ll comment on the specifics of Atmos later.

New storage paradigms are rare. To have so many academic papers on the underlying technology is rarer still.

EMC would never provide this much information themselves - it would slow down the sales cycle. But these papers - and the couple of dozen others on the OceanStore site - provide implementors with a wealth of technical background.

Comments welcome, of course. Anybody want to comment on what these papers mean for the patentability of Atmos?

How bad do the ads suck?

November 10th, 2008 by Robin Harris in Off-Topic

I’ve been working with IDG to monetize StorageMojo through ad sales without much success. The latest iteration of the process you may have noticed: the ad that covers the page until you click “close.” They pay OK, but they aren’t the difference between hamburger and steak.

Clicking “close,” BTW, is something you are welcome to do as soon as you like. Please don’t suffer through the ads on my account.

I think I was told that the ad would only appear like once a week per viewer, but I don’t know if that is correct or true.

Anyway, I invite StorageMojo readers to comment. What are the right limits for ads on StorageMojo?

Are the “roadblock” ads - I think that is what these coverall ads are called - too much? How much advertising is OK?

The StorageMojo take
I make no apologies for being a capitalist tool. But I also don’t want to drive off readers. So let me know what you think.

If anyone has a line on a low-overhead ad network that pays reasonably well for a high-quality audience, I’d love to hear about it.

Courteous comments welcome, of course. Especially on this topic. Wes, thanks for the tickler and yes, I think I know where you are.

Flash-talking with Fusion-io

November 7th, 2008 by Robin Harris in Off-Topic

Fusion-io commissioned me to create a video with David Flynn, Fusion-io co-founder and CTO, talking about their architecture and the benefits of high bandwidth NAND flash. Even though I’ve been researching flash for a couple of years, some of David’s comments surprised me.

Flash doesn’t make a good disk
Anyone who cares to can track how my view of flash has evolved. From early enthusiasm, based on my happy experience with a flash-based HP Omnibook 300 - the original netbook - in the ’90s, to increasing skepticism.

The “aha” moment came at the Flash Memory Summit in August, when an industry panel agreed that

. . . NAND flash is best seen as an extension to DRAM and a layer between DRAM and disk - not as the guts of a disk drive replacement.

BTW, I started skeptical on Fusion-io and have become a convert. Go figure.

The learning continues
Fusion-io isn’t the only company offering flash storage in a non-disk format, but they do seem to be furthest along. I think their perspective is way more important than, say, Seagate’s. Here’s the video.

The StorageMojo take
Every time a new technology appears, our first impulse is to recreate the products of the old technology with it. Such is the case with flash.

We’ve run into the limits of the old disk/RAID/array/SAN paradigm. With storage clusters, flash and changing workloads we now face the exhilarating - and sometimes frightening - prospect of re-architecting our storage infrastructures.

Fusion-io won’t be the final word on flash, but they’ve made a great start. Not to mention a real head start.

Courteous comments welcome, of course.


