EMC has announced Hulk/Maui, now known as Atmos. I’m flying to Boston today and don’t have access to EMC’s announcement documents.

But I have something better: the papers that provide the theoretical underpinning for Atmos. They provide an in-depth background that isn’t often available for new products.

These papers have too many interesting details to summarize them all. Here are some points that strike my fancy. YMMV.

If you want to understand Atmos these papers are essential. Details of EMC’s implementation will differ of course, but the underlying architectural trade-offs and management issues remain.

A 10 trillion file store
In 2000 a UC Berkeley paper, OceanStore: An Architecture for Global-Scale Persistent Storage, laid out the architecture of what is now Atmos. Its authors are John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao. EMC provided funding for the research, and Patrick Eaton went to work for EMC a couple of years ago.

The abstract says:

OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is comprised of untrusted servers, data is protected through redundancy and cryptographic techniques. To improve performance, data is allowed to be cached anywhere, anytime. Additionally, monitoring of usage patterns allows adaptation to regional outages and denial of service attacks; monitoring also enhances performance through proactive movement of data.

The design center: 1 billion users, each storing 10,000 files. That's 10 trillion files. Utility storage indeed!

A cluster of clusters
OceanStore is a software layer that creates a global storage cluster. While the paper simply refers to servers, the servers can be clusters as well.

EMC’s engineers chose to use a 3rd party cluster product – IBRIX I think – for the local data stores so they could focus on the layer that glues the sites together. Each local store can itself be a petabyte or more.

Update: several commenters assure us that IBRIX is not the local cluster file system. EMC is using some open source software in Atmos. End update.

Untrusted infrastructure
A key goal of the paper and its prototype was to assume untrusted infrastructure – a phrase that fairly sums up today’s Internet. Only clients are trusted with cleartext – all stored content is encrypted – but most servers are assumed to be working correctly and to help maintain file consistency.

Nomadic data
A global storage system has a unique requirement for locality. But it also needs to be able to store data anywhere, anytime to maintain persistence in the face of outages and catastrophes. Thus data has to be separated from its physical location.

Files are encrypted at the source and stored as persistent objects, each named by a globally unique identifier (GUID). Because everything it stores is ciphertext, OceanStore has no knowledge of what a file's objects contain, so it relies on introspection, a mechanism that notes correlations among objects.

Thus the system moves highly correlated objects together, reducing the latency problems that a non-introspective object store faces in a global infrastructure.
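
To make the idea concrete, here is a minimal sketch of correlation-based grouping. It is not the paper's algorithm: the session lists, the `min_count` threshold and the function names are all my own assumptions. It counts how often two GUIDs show up in the same access session and groups the pairs that co-occur often enough to be worth moving together.

```python
from collections import Counter
from itertools import combinations

def correlated_pairs(sessions, min_count=2):
    """Count how often two object IDs are accessed in the same session.

    `sessions` and `min_count` are hypothetical inputs for illustration.
    """
    counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

def migration_groups(pairs):
    """Merge correlated pairs into groups that should move together (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())

# Made-up access log: each inner list is one client session.
sessions = [
    ["guid-a", "guid-b", "guid-c"],
    ["guid-a", "guid-b"],
    ["guid-c", "guid-d"],
    ["guid-b", "guid-a", "guid-e"],
]
pairs = correlated_pairs(sessions)
groups = migration_groups(pairs)
print(groups)
```

Note that the grouping works on identifiers alone. Nothing here needs to read the encrypted contents, which is the point of introspection.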

Ciphertext
The paper notes that restricting OceanStore to ciphertext limits what can be done with the data. But there is more flexibility than you might suppose.

The operations compare-version, compare-size, compare-block, and search are all possible. In addition there are several feasible update operations, such as replace-block, insert-block, delete-block and append.
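
Here is a rough sketch of how a server can do useful work on ciphertext alone. Everything in it is an assumption for illustration: the `CiphertextStore` class, the 16-byte block size, and especially the "cipher", which is a toy XOR against a hash-derived keystream and is NOT secure. The point is only that compare-size, compare-block, replace-block and append need no key on the server side.

```python
import hashlib

BLOCK = 16  # toy block size in bytes (assumption)

def _keystream(key: bytes, index: int) -> bytes:
    # Position-dependent keystream derived from the key and block index.
    return hashlib.sha256(key + index.to_bytes(8, "big")).digest()[:BLOCK]

def encrypt_block(key: bytes, index: int, plaintext: bytes) -> bytes:
    """Toy position-dependent cipher: XOR with a keystream. NOT secure."""
    padded = plaintext.ljust(BLOCK, b"\0")[:BLOCK]
    return bytes(p ^ k for p, k in zip(padded, _keystream(key, index)))

# XOR is its own inverse; the result keeps the zero padding.
decrypt_block = encrypt_block

class CiphertextStore:
    """Hypothetical server: holds ciphertext blocks only, never the key."""

    def __init__(self):
        self.blocks = []

    def append(self, cblock: bytes) -> None:
        self.blocks.append(cblock)

    def replace_block(self, i: int, cblock: bytes) -> None:
        self.blocks[i] = cblock

    def compare_size(self, n: int) -> bool:
        return len(self.blocks) == n

    def compare_block(self, i: int, digest: bytes) -> bool:
        # The client sends a hash of the ciphertext it expects to find.
        return hashlib.sha256(self.blocks[i]).digest() == digest

key = b"client-secret"  # known to the client only
store = CiphertextStore()
for i, text in enumerate([b"hello", b"world"]):
    store.append(encrypt_block(key, i, text))

# Replace block 1 only if it still holds what the client last saw.
expected = hashlib.sha256(encrypt_block(key, 1, b"world")).digest()
if store.compare_block(1, expected):
    store.replace_block(1, encrypt_block(key, 1, b"ocean"))

recovered = [decrypt_block(key, i, c).rstrip(b"\0")
             for i, c in enumerate(store.blocks)]
print(recovered)
```

Insert-block and delete-block are harder with a position-dependent cipher because later blocks shift, so this sketch leaves them out.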

Applications
Multi-petabyte data stores for scientific, security or commercial use are obvious applications. But telcos and ISPs are most interested in mobile apps.

The authors call out email as an apt OceanStore application.

OceanStore alleviates the need for clients to implement their own locking and security mechanisms, while enabling powerful features such as nomadic email collections and disconnected operation. Introspection permits a user’s email to migrate closer to his client, reducing the round trip time to fetch messages from a remote server. OceanStore enables disconnected operation through its optimistic concurrency model—users can operate on locally cached email even when disconnected from the network; modifications are automatically disseminated upon reconnection.
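
A minimal sketch of the optimistic concurrency idea in that passage, assuming a simple version-number check. The `Mailbox` and `DisconnectedClient` classes are hypothetical, and the paper's real conflict handling is richer than "reject on mismatch". Clients edit a local cache while offline, queue their changes, and on reconnection each change commits only if the server's version is the one the client expected.

```python
class Mailbox:
    """Hypothetical server-side object; version bumps on every commit."""

    def __init__(self):
        self.version = 0
        self.messages = []

    def apply(self, expected_version, action):
        # Compare-version predicate: commit only if nobody else got there first.
        if expected_version != self.version:
            return False
        action(self.messages)
        self.version += 1
        return True

class DisconnectedClient:
    """Works on a local cache and replays queued updates on reconnect."""

    def __init__(self, server):
        self.server = server
        self.cache = list(server.messages)
        self.base_version = server.version
        self.pending = []

    def delete(self, msg):
        self.cache.remove(msg)  # visible locally right away
        self.pending.append(lambda msgs, m=msg: msgs.remove(m))

    def reconnect(self):
        results = []
        for action in self.pending:
            ok = self.server.apply(self.base_version, action)
            if ok:
                self.base_version += 1
            results.append(ok)
        self.pending.clear()
        return results

server = Mailbox()
server.apply(0, lambda msgs: msgs.extend(["a", "b", "c"]))

alice = DisconnectedClient(server)
bob = DisconnectedClient(server)
alice.delete("a")  # both edit while offline
bob.delete("b")

alice_results = alice.reconnect()  # commits first
bob_results = bob.reconnect()      # stale version, so rejected
print(alice_results, bob_results, server.messages)
```

In this sketch Bob's rejected update is simply dropped. A real client would refresh its cache and retry or merge.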

APIs
OceanStore offered its own API. But the authors also developed facades for the base API that emulated a Unix file system, a transactional database and a World Wide Web gateway.

Replication
OceanStore used erasure codes, not unlike the mechanism Cleversafe uses for its distributed data store system. Replica management is a major task for a global system and the paper goes into some detail on their solutions.
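
For readers new to erasure codes, here is the simplest possible instance: k data fragments plus one XOR parity fragment, which survives the loss of any single fragment. This is not the code OceanStore or Cleversafe uses; production erasure codes tolerate many lost fragments. The function names and the sample string are my own.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    """Split data into k equal fragments plus one XOR parity fragment."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    frags = [padded[i * size:(i + 1) * size] for i in range(k)]
    return frags + [reduce(xor_bytes, frags)]

def decode(frags, length: int) -> bytes:
    """Rebuild the data from k+1 fragments, at most one of which is None."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost fragment")
    frags = list(frags)
    if missing:
        present = [f for f in frags if f is not None]
        # XOR of all surviving fragments equals the missing one.
        frags[missing[0]] = reduce(xor_bytes, present)
    return b"".join(frags[:-1])[:length]  # drop parity and padding

data = b"nomadic data survives outages"
fragments = encode(data, 4)   # 4 data fragments + 1 parity
fragments[2] = None           # simulate a lost server
restored = decode(fragments, len(data))
print(restored)
```

The storage overhead here is 25 percent for one-failure tolerance, versus 100 percent for keeping a full second copy. That trade-off is why erasure codes appeal for global-scale stores.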

The 2nd paper
A 2nd paper, Antiquity: Exploiting a Secure Log for Wide-Area Distributed Storage (available at the same link above), published last year, expands on the OceanStore work.

. . . the secure log interface implemented by Antiquity is a result of breaking OceanStore into layers. In particular, a component of OceanStore was a primary replica implemented as a Byzantine Agreement process. This primary replica serialized and cryptographically signed all updates. Given this total order of all updates, the question was how to durably store and maintain the order? . . . The secure log structure assists the storage system in durably maintaining the order over time. The append-only interface allows a client to consistently add more data to the storage system over time. Finally, when data is read from the storage system at a later time, the interface and protocols ensure that data will be returned and that returned data is the same as stored.

Finally, self-verifying structures such as a secure log lend themselves well to distributed repair techniques. The integrity of a replica can be checked locally or in a distributed fashion. In particular, we implemented a quorum repair protocol where the storage server replicas used the self-verifying structure. The structure and protocol provided proof of the contents of the latest replicated state and ensured that the state was copied to a new configuration.
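
A self-verifying, append-only log is easy to sketch. What follows is my own minimal illustration, not Antiquity's protocol: each entry's hash covers the previous entry's hash, so any replica can check order and integrity locally. An HMAC stands in for the public-key signatures a real system would use, and the `SecureLog` class and `verify` function are hypothetical names.

```python
import hashlib
import hmac

GENESIS = b"\0" * 32  # hash value that precedes the first entry

def _chain_hash(prev: bytes, payload: bytes) -> bytes:
    return hashlib.sha256(prev + payload).digest()

class SecureLog:
    """Append-only log; every entry commits to all entries before it."""

    def __init__(self, signing_key: bytes):
        self._key = signing_key
        self.entries = []  # list of (payload, chain_hash, tag)

    def append(self, payload: bytes) -> bytes:
        prev = self.entries[-1][1] if self.entries else GENESIS
        chain = _chain_hash(prev, payload)
        # HMAC is a stand-in for a real digital signature (assumption).
        tag = hmac.new(self._key, chain, hashlib.sha256).digest()
        self.entries.append((payload, chain, tag))
        return chain

def verify(entries, key: bytes) -> bool:
    """Check a replica's copy: the chain must link and every tag must match."""
    prev = GENESIS
    for payload, chain, tag in entries:
        if _chain_hash(prev, payload) != chain:
            return False
        expected = hmac.new(key, chain, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            return False
        prev = chain
    return True

key = b"writer-key"
log = SecureLog(key)
for update in [b"create", b"append block 1", b"append block 2"]:
    log.append(update)

good_replica = list(log.entries)
bad_replica = list(log.entries)
# A faulty server swaps in a different payload for the second entry.
bad_replica[1] = (b"evil", bad_replica[1][1], bad_replica[1][2])

print(verify(good_replica, key), verify(bad_replica, key))
```

Because verification needs only the entries and the key, a repair process can pick any replica that verifies and copy it to new servers, which is the property the quoted passage relies on.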

The StorageMojo take
Bravo! EMC is taking cutting edge computer science and turning it into a product. I’ll comment on the specifics of Atmos later.

New storage paradigms are rare. To have so many academic papers on the underlying technology is rarer still.

EMC would never provide this much information themselves – it would slow down the sales cycle. But these papers – and the couple of dozen others on the OceanStore site – provide implementors with a wealth of technical background.

Comments welcome, of course. Anybody want to comment on what these papers mean for the patentability of Atmos?