The computer science behind EMC’s cloud storage

by Robin Harris on Wednesday, 12 November, 2008

EMC has announced Hulk/Maui, now known as Atmos. I’m flying to Boston today and don’t have access to EMC’s announcement documents.

But I have something better: the papers that provide the theoretical underpinning for Atmos. They provide an in-depth background that isn’t often available for new products.

These papers have too many interesting details to summarize them all. Here are some points that strike my fancy. YMMV.

If you want to understand Atmos, these papers are essential. Details of EMC’s implementation will differ, of course, but the underlying architectural trade-offs and management issues remain.

A 10 trillion file store
In 2000 a UC Berkeley paper OceanStore: An Architecture for Global-Scale Persistent Storage, authored by John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao, laid out the architecture of what is now Atmos. EMC provided funding for the research and Patrick Eaton went to work for EMC a couple of years ago.

The abstract says:

OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is comprised of untrusted servers, data is protected through redundancy and cryptographic techniques. To improve performance, data is allowed to be cached anywhere, anytime. Additionally, monitoring of usage patterns allows adaptation to regional outages and denial of service attacks; monitoring also enhances performance through proactive movement of data.

The design center: 1 billion users, each storing 10,000 files. That’s 10 trillion files. Utility storage indeed!

A cluster of clusters
OceanStore is a software layer that creates a global storage cluster. While the paper simply refers to servers, the servers can be clusters as well.

EMC’s engineers chose to use a 3rd party cluster product – IBRIX I think – for the local data stores so they could focus on the layer that glues the sites together. Each local store can itself be a petabyte or more.

Update: several commenters assure us that IBRIX is not the local cluster file system. EMC is using some open source software in Atmos. End update.

Untrusted infrastructure
A key goal of the paper and its prototype was to assume untrusted infrastructure – a phrase that fairly sums up today’s Internet. Only clients are trusted with cleartext – all stored content is encrypted – though most servers are assumed to work correctly and to help maintain file consistency.

Nomadic data
A global storage system faces competing requirements: data should live close to its users for performance, yet it must be storable anywhere, anytime, to remain persistent through outages and catastrophes. Thus data has to be decoupled from any particular physical location.

Files are encrypted at the source and stored as persistent objects, each named by a globally unique identifier (GUID). OceanStore has no semantic knowledge of the objects it stores, so it relies on introspection, a mechanism that observes correlations among objects.

Thus the system moves highly correlated objects together, reducing the latency problems that a non-introspective object store faces in a global infrastructure.
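The mechanics can be sketched simply: count how often objects are accessed in the same session, and treat pairs that cross a threshold as migration candidates. This is a toy model of the introspection idea, not EMC’s implementation; all names here are illustrative.

```python
from collections import defaultdict
from itertools import combinations

class IntrospectionMonitor:
    """Track which objects are accessed together so the store can
    co-locate highly correlated objects (a toy model of OceanStore-style
    introspection; the class and method names are invented)."""

    def __init__(self):
        self.pair_counts = defaultdict(int)

    def observe(self, accessed_guids):
        # Record one client session: every pair of objects touched
        # together counts as one correlation observation.
        for a, b in combinations(sorted(set(accessed_guids)), 2):
            self.pair_counts[(a, b)] += 1

    def correlated_pairs(self, threshold):
        # Pairs seen together often enough to be migrated as a group.
        return [p for p, n in self.pair_counts.items() if n >= threshold]

monitor = IntrospectionMonitor()
for _ in range(5):
    monitor.observe(["inbox", "contacts"])   # always accessed together
monitor.observe(["inbox", "archive"])        # rarely together
print(monitor.correlated_pairs(threshold=3))  # → [('contacts', 'inbox')]
```

A real system would also weight observations by recency and client location before deciding where to migrate a correlated group.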

Ciphertext
The paper notes that restricting OceanStore to ciphertext limits what can be done with the data. But there is more flexibility than you might suppose.

The operations compare-version, compare-size, compare-block and search are all possible. In addition, several update operations are feasible, such as replace-block, insert-block, delete-block and append.
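To make the idea concrete, here is a minimal sketch of a server-side object that supports those block operations on ciphertext it cannot read. It assumes deterministic per-block encryption, so equal plaintext blocks yield equal ciphertext and compare-block is a simple equality test; a keyed hash stands in for the cipher (it isn’t invertible, so this is purely illustrative).

```python
import hashlib

def enc(key, block):
    # Stand-in for deterministic per-block encryption: equal plaintext
    # under the same key yields equal ciphertext, so the server can run
    # compare-block without ever seeing cleartext. A real system would
    # use an actual cipher; a hash is used here only for illustration.
    return hashlib.sha256(key + block).digest()

class EncryptedObject:
    """Server-side object: an ordered list of ciphertext blocks.
    The server applies the update operations without plaintext access."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def compare_block(self, i, ciphertext):   # equality test on ciphertext
        return self.blocks[i] == ciphertext

    def compare_size(self, n):
        return len(self.blocks) == n

    def replace_block(self, i, ciphertext):
        self.blocks[i] = ciphertext

    def insert_block(self, i, ciphertext):
        self.blocks.insert(i, ciphertext)

    def delete_block(self, i):
        del self.blocks[i]

    def append(self, ciphertext):
        self.blocks.append(ciphertext)

key = b"client-secret"
obj = EncryptedObject([enc(key, b"hello"), enc(key, b"world")])
# The client encrypts locally; the server only compares and moves blocks.
assert obj.compare_block(0, enc(key, b"hello"))
obj.replace_block(1, enc(key, b"there"))
assert not obj.compare_block(1, enc(key, b"world"))
```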

Applications
Multi-petabyte data stores for scientific, security or commercial use are the obvious applications. But telcos and ISPs are most interested in mobile apps.

The authors call out email as an apt OceanStore application.

OceanStore alleviates the need for clients to implement their own locking and security mechanisms, while enabling powerful features such as nomadic email collections and disconnected operation. Introspection permits a user’s email to migrate closer to his client, reducing the round trip time to fetch messages from a remote server. OceanStore enables disconnected operation through its optimistic concurrency model—users can operate on locally cached email even when disconnected from the network; modifications are automatically disseminated upon reconnection.
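The optimistic concurrency model in that quote can be sketched in a few lines: a client works on a cached copy while disconnected, and on reconnection its update applies only if the version it was based on is still current. This is a toy compare-version scheme, not OceanStore’s actual protocol (which serializes conflicting updates through a primary replica rather than simply rejecting them).

```python
class Replica:
    """Toy optimistic-concurrency replica: updates carry the version
    they were based on, and apply only if that version is still current."""

    def __init__(self):
        self.version = 0
        self.messages = []

    def try_update(self, based_on_version, new_message):
        if based_on_version != self.version:
            return False          # conflict: client must rebase and retry
        self.messages.append(new_message)
        self.version += 1
        return True

server = Replica()
cached = server.version           # client caches the mailbox, goes offline
assert server.try_update(cached, "draft written offline")
# A second client still holding the old version conflicts on reconnect:
assert not server.try_update(cached, "stale update")
```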

APIs
OceanStore offered its own API. But the authors also developed facades for the base API that emulated a Unix file system, a transactional database and a World Wide Web gateway.

Replication
OceanStore used erasure codes, not unlike the mechanism Cleversafe uses for its distributed data store system. Replica management is a major task for a global system, and the paper goes into some detail on its solutions.
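The core idea of erasure coding is that data is split into fragments plus redundant fragments, and any sufficiently large subset reconstructs the original. The sketch below uses the simplest possible code, a single XOR parity fragment that tolerates exactly one loss; OceanStore used far stronger Reed-Solomon-style codes that survive many simultaneous losses.

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    # Split data into k equal fragments plus one XOR parity fragment.
    # A single parity fragment tolerates exactly one lost fragment;
    # real erasure codes tolerate many.
    assert len(data) % k == 0
    size = len(data) // k
    frags = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, frags)
    return frags + [parity]

def recover(frags):
    # frags: list of k+1 fragments with at most one entry set to None.
    # XOR of all surviving fragments reconstructs the missing one.
    if None in frags:
        missing = frags.index(None)
        frags[missing] = reduce(xor_bytes, [f for f in frags if f is not None])
    return b"".join(frags[:-1])

frags = encode(b"nomadic data everywhere!", 4)   # 4 data fragments + parity
frags[2] = None                                   # lose any one fragment
assert recover(frags) == b"nomadic data everywhere!"
```

The attraction for a global store is that fragments can be scattered across untrusted servers: no single server holds readable data, yet the object survives server loss.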

The 2nd paper
A 2nd paper, Antiquity: Exploiting a Secure Log for Wide-Area Distributed Storage (available at the same link above), published last year, expands on the OceanStore work.

. . . the secure log interface implemented by Antiquity is a result of breaking OceanStore into layers. In particular, a component of OceanStore was a primary replica implemented as a Byzantine Agreement process. This primary replica serialized and cryptographically signed all updates. Given this total order of all updates, the question was how to durably store and maintain the order? . . . The secure log structure assists the storage system in durably maintaining the order over time. The append-only interface allows a client to consistently add more data to the storage system over time. Finally, when data is read from the storage system at a later time, the interface and protocols ensure that data will be returned and that returned data is the same as stored.

Finally, self-verifying structures such as a secure log lend themselves well to distributed repair techniques. The integrity of a replica can be checked locally or in a distributed fashion. In particular, we implemented a quorum repair protocol where the storage server replicas used the self-verifying structure. The structure and protocol provided proof of the contents of the latest replicated state and ensured that the state was copied to a new configuration.
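The self-verifying property comes from hash-chaining: each log entry is bound to its predecessor’s digest, so any replica can check locally that returned data matches what was stored, in order. The sketch below shows just the chain; Antiquity additionally signs entries so that untrusted servers cannot forge a valid-looking log.

```python
import hashlib

class SecureLog:
    """Append-only log in the spirit of Antiquity: each entry records the
    digest of the log head it extends, forming a verifiable hash chain.
    Signatures are omitted for brevity."""

    GENESIS = b"\x00" * 32

    def __init__(self):
        self.entries = []        # list of (prev_digest, payload)

    def head(self):
        if not self.entries:
            return self.GENESIS
        prev, payload = self.entries[-1]
        return hashlib.sha256(prev + payload).digest()

    def append(self, payload):
        self.entries.append((self.head(), payload))

    def verify(self):
        # Recompute the chain; any tampered, dropped or reordered
        # entry breaks the linkage.
        expect = self.GENESIS
        for prev, payload in self.entries:
            if prev != expect:
                return False
            expect = hashlib.sha256(prev + payload).digest()
        return True

log = SecureLog()
log.append(b"update 1")
log.append(b"update 2")
assert log.verify()
log.entries[0] = (log.entries[0][0], b"tampered")
assert not log.verify()   # the chain no longer matches
```

This is also why the quorum repair protocol works: the chain head is a compact, verifiable proof of the latest replicated state, so a new configuration can confirm it received an intact copy.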

The StorageMojo take
Bravo! EMC is taking cutting edge computer science and turning it into a product. I’ll comment on the specifics of Atmos later.

New storage paradigms are rare. To have so many academic papers on the underlying technology is rarer still.

EMC would never provide this much information themselves – it would slow down the sales cycle. But these papers – and the couple of dozen others on the OceanStore site – provide implementors with a wealth of technical background.

Comments welcome, of course. Anybody want to comment on what these papers mean for the patentability of Atmos?


Comments

Ryan Malayter Wednesday, 12 November, 2008 at 10:03 am

Sounds great in theory. Assuming the technology actually works well, and is low latency, the big issue I see is with the economic model.
First, if this is an EMC-only solution, without open APIs, then the “global” network mentioned in the first paper will simply never happen.
Second, it sounds like getting SaaS providers to cooperate in a global mesh would have even more pitfalls than the issues surrounding peering between Tier-1 ISPs. How do they assign value to not only bandwidth, but also capacity, latency, customer count, etc.? And how long before a “storage Cogent” comes in with low prices, angering the “established” cloud storage vendors, getting de-peered, and segmenting the global storage network?

the storage anarchist Wednesday, 12 November, 2008 at 10:29 am

Nice coverage, Robin. I for one sincerely appreciate that you’ve dug deep into this and revealed the science behind Atmos. And honestly, I don’t think EMC necessarily wanted it kept secret.

That said, dare I ask – do you think this means EMC actually *IS* innovative?

Or is recognizing, funding and then converting a research project into a real-world product no more innovative than say, acquiring a smart-but-struggling startup and turning their technology into more than it ever was going to be on its own?

Jeff Darcy Wednesday, 12 November, 2008 at 11:20 am

I wouldn’t quite refer to the OceanStore papers as the technical underpinnings for Atmos. Certainly there’s some relationship there, which I find pleasing because I was involved in making that connection at one time (I was one of the first two industry representatives to see OceanStore run – across two laptops at Granlibakken). However, Atmos might not look at all like OceanStore, for three reasons.

(1) It’s highly likely to be a new codebase even if it implements the same ideas.

(2) Atmos doesn’t quite have the same goals as OceanStore (especially wrt multi-tenancy) and the state of the industry has changed in quite a few ways since OceanStore was being actively developed.

(3) Even if Atmos is exactly like OceanStore, which OceanStore? Some of the basic approaches, including in areas you mention above, changed quite a bit during the project lifetime.

From what I’ve read so far, Atmos seems more clearly related to an EMC Cambridge project I used to know as RAIDiant than to OceanStore, and might even be related to at least one other that I’ll keep quiet about. I think I’ve succeeded in prodding Steve Todd to write more about the aspects that are most likely to be OceanStore-like, though, so that’s probably where we should look for a more authoritative answer.

Dave Graham Wednesday, 12 November, 2008 at 12:59 pm

Hey Robin!

love reading your articles.

just to note that Atmos is NOT using IBRIX (or Clustered XFS) for the underlying filesystem. This is a truly organic, object-based filesystem designed by Eaton and Co.

cheers,

Dave Graham

Anonymous Wednesday, 12 November, 2008 at 1:20 pm

Robin said:
“EMC’s engineers chose to use a 3rd party cluster product – IBRIX I think – for the local data stores so they could focus on the layer that glues the sites together. Each local store can itself be a petabyte or more.”

EMC did use IBRIX for some 9+ months during the initial development of Maui/Hulk. But the use of IBRIX merely bought EMC time to develop their own (theoretically) scalable “local data store”. The verdict is still very much out on EMC’s ability to compete in the local clustered storage market (let alone cloud).

Mike Feinberg, vice president of EMC’s cloud infrastructure group, shared the following with techtarget.com:

“The talk in the storage industry is that EMC was using Ibrix software for its clustered file system. And while Atmos does include capabilities (such as global namespace) that Ibrix offers, Feinberg said that no other vendor’s IP is used outside of EMC. “It’s all EMC technology,” he said. “Frankly, there’s no Ibrix software and no other people’s software. It’s built from the ground with home-grown technology, although we do use open source technology as we see fit.”

Steve Jones Thursday, 13 November, 2008 at 7:08 am

It would be possible to engineer a storage system like this on a peer-to-peer basis. There’s a neat little job for the open source devotees. With P2P file sharing some of the mechanisms are already there, but some really clever work would be required on locality, public quota controls and resilience. Not a great thing for mastering data, but a neat way of using lots of cheap storage otherwise tied up in underused PC storage for backup, network access and so on.

It shouldn’t be assumed that this sort of service has to be constrained to commercial providers.

More heavyweight commercial DP is more of an issue. The killer there is often performance – high latency is inherent to anything geographically spread. Global latency is truly awful – up to about 400ms round-trip, but even continental (typically 50-100ms) is unusable for many functions. In that case you want your compute capability close to the data.

There are also issues over consistency of multiple copies. Still, it’s an interesting approach and I can see its use for low access rate storage, archival stores (email looks a lot like that), remote backups etc.

drsan Friday, 14 November, 2008 at 6:48 am

to answer a few of these questions:
1) is Atmos patentable? well, only the EMC engineers/IP attorneys know for sure. i suspect a few parts of it might be, otherwise not.
2) is EMC innovative? of course they are, but no more so than any other cloud approach or wide-area/object-oriented filesystem approach.

but it’s really a shame that Atmos is implemented, at least currently, using technology that does not include ANSI T10 DIF. Without DIF, the nice, large cloud of petabytes of data is going to end up as nice, large, cloud of petabytes of data, many gigabytes of which is corrupt, and you don’t know which ones. file or object replication doesn’t help here – it doesn’t really protect the data, it merely makes potentially corrupt copies of it. Read the CERN paper from the spring of 2007.

