pNFS technical intro

by Robin Harris on Monday, 15 October, 2007

I don’t normally link and run, but this is a good article on the Next Big Thing in NFS: pNFS in v4.1.

Written by three NetApp engineers, Garth Goodson, Sai Susarla, and Rahul Iyer, Standardizing Storage Clusters offers a good overview of what’s new. It’s on the ACM Queue web site.

If paragraphs like

protocol operations

The pNFS protocol adds a total of six operations to NFSv4. Four of them support layouts (i.e., getting, returning, recalling, and committing changes to file metadata). The two other operations aid in the naming of data servers (i.e., translating a data server ID into an address and getting a list of data servers). All the new operations are designed to be independent of the type of layouts and data-server names used. This is key to pNFS’s ability to support diverse back-end storage architectures.

get you interested, the article is well worth a read.
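For readers who think in code, the six new operations group naturally into four layout operations and two device-naming operations. A minimal sketch of that grouping (the operation names follow the NFSv4.1 drafts later published as RFC 5661; this is an illustration, not protocol code):

```python
from enum import Enum, auto

class PnfsOp(Enum):
    # Layout operations: getting, returning, recalling, and committing
    LAYOUTGET = auto()        # client asks where a file's data lives
    LAYOUTRETURN = auto()     # client hands a layout back
    CB_LAYOUTRECALL = auto()  # server callback recalling a layout from a client
    LAYOUTCOMMIT = auto()     # client commits layout-related metadata changes
    # Device-naming operations
    GETDEVICEINFO = auto()    # translate a data-server ID into an address
    GETDEVICELIST = auto()    # get the list of data servers

LAYOUT_OPS = {PnfsOp.LAYOUTGET, PnfsOp.LAYOUTRETURN,
              PnfsOp.CB_LAYOUTRECALL, PnfsOp.LAYOUTCOMMIT}
DEVICE_OPS = set(PnfsOp) - LAYOUT_OPS
```

All six are deliberately layout-type-agnostic, which is what lets the same protocol front block, object, and file back ends.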

The StorageMojo take
pNFS is going to commoditize parallel data access. In 5 years we won’t know how we got along without it.


Mike Trogni October 15, 2007 at 5:12 pm

Thank you, very nice article.

Bill Todd October 15, 2007 at 5:23 pm

Hmmm.

pNFS certainly seems to fit the bill for HPC applications, which IIUC typically need high-bandwidth access to files without the guarantees of POSIX single-system semantics. And it should fit the bill equally well for bulk access to immutable files as required by things like video servers and hospital storage of x-ray and similar data.

But for general-purpose file systems? No way. Windows file system semantics are as stringent as POSIX semantics, and pNFS addresses neither (nor, IIUC, does it provide CIFS-compatible behavior: only environments which can live with NFS’s existing semantics will find pNFS acceptable – though the article suggests that pNFS’s write-atomicity guarantees may be weaker than current NFS’s). Does anyone really think that large numbers of applications will be rewritten to start issuing specific byte-range lock requests just so they can use pNFS rather than a distributed file system that gives them *real* transparency? Part of the value which a file system adds to an environment is its inter-process coordination of both data and its associated metadata, and a great many applications have come to depend upon it and may break without it: applications that don’t share volatile data may be OK, but how many environments have *no* applications that depend upon coordinated updates to data?

Lustre, by contrast, got it right in this area: it provides single-system POSIX semantics (and I believe CIFS as well) for applications on distributed clients while allowing clients to optimize access (pretty much as pNFS does) in cases where no one else is currently competing for access to the same data (or where clients specifically elect to live dangerously). There’s no real reason pNFS couldn’t have done the same and in the process fixed a long-standing limitation of NFS (with the option to retain it in the few cases where applications might care to).

pNFS also doesn’t scale as well as it could. I’ve mentioned before the need for algorithmic data-distribution mechanisms to eliminate pNFS-style ‘layout’ maps that can grow linearly with file size to unmanageable proportions. But perhaps even worse, they must be updated (especially when the layout is reorganized rather than just extended or truncated) – and the paper suggests that such updates require something of a major hiccup in access to the file while all clients are notified synchronously about the change (though better design could avoid this). Allowing some kind of client ‘layout plugin’ that would allow use of mapping algorithms could provide a growth path, though.

Furthermore, the statement that “Although the pNFS architecture assumes that a single server is responsible for all metadata, it can also be applied to systems where metadata, in addition to data, is distributed among multiple servers” sounds like a hand-wave: the paper explicitly states that the client must ‘mount’ a specific metadata server, and presumably must then pass all control requests through that server (though I suppose it could forward them to other metadata servers if the metadata were distributed).

Finally (at least on the scalability front), file- and object-style data servers must hold access control information for all the files they store, meaning that changing permissions for a single large file could potentially require updating every data server in the system. My impression is that capability-based access control with time-outs is a better approach to this issue (especially if most permission changes can afford to be performed a bit lazily, letting existing capabilities simply time out rather than actively invalidating them).
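A rough sketch of what capability-based access with time-outs could look like (everything here – the shared secret, field names, and TTL – is hypothetical, not from the paper): the metadata server signs a short-lived token, data servers verify it locally with no per-file ACL state, and a permission change can simply wait for outstanding tokens to expire.

```python
import hmac, hashlib, time

SECRET = b"shared-secret"  # assumed shared between metadata and data servers

def issue_capability(file_id: str, mode: str, ttl: float, now=None) -> dict:
    """Metadata server signs a short-lived capability; data servers
    can verify it locally instead of storing per-file ACLs."""
    expires = (now if now is not None else time.time()) + ttl
    msg = f"{file_id}:{mode}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return {"file_id": file_id, "mode": mode, "expires": expires, "sig": sig}

def verify_capability(cap: dict, now=None) -> bool:
    """Data server check: signature intact and token not yet expired."""
    msg = f"{cap['file_id']}:{cap['mode']}:{cap['expires']}".encode()
    good_sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(good_sig, cap["sig"]):
        return False
    return (now if now is not None else time.time()) < cap["expires"]
```

The lazy-revocation trade-off is visible in the code: revoking access early would require contacting every data server, while letting the `expires` field lapse costs nothing.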

The paper does not specifically address redundancy, but my impression is that it’s left up to the clients to handle. Consistent redundancy is as much a part of system structural integrity as metadata is, and if it’s left up to fallible (let alone potentially malicious) clients that’s a major exposure: it may be a necessary one in the case of dumb block-storage data servers, but not otherwise.

Many of the above limitations could be addressed by a design that distributed metadata and file management throughout the system rather than limiting it to a central server. It would then be feasible to funnel all requests for a given file through a single ‘home’ server for that file, allowing that server to coordinate all locking and metadata updates for the file (locks could actually be handled at the subsidiary locations where the data was stored, save for the awkwardness of handling cases where a lock was requested for a large portion of a large file, potentially spanning many servers). Read requests would just be forwarded to the appropriate location (they could go there directly if capabilities were used to validate requests, locking weren’t handled centrally, and atime updates could be handled lazily), and write request descriptions would be forwarded as well (again, they could potentially go directly with capabilities, distributed locking, and lazy mtime updates) to allow the target to request the data from the client (thus not only keeping the bulk data movement direct but allowing the server to manage its available bandwidth better). This would retain pNFS’s ability to leverage the aggregate bandwidth of the entire server set for data movement while centralizing the handling of the small control messages and locking/metadata activity for each file at one of the servers in the system (but distributing the aggregate load for *all* files and directories across the entire server set rather than concentrating it on a single master server that could get swamped).

Instead, the central metadata server bottleneck and tripartite data-server universe suggest the worst of design-by-committee. Rather than aim pNFS at a really good solution, its architects seem to have aimed to accommodate as many *existing* configurations as they could: dumb disks/arrays (hey, SANergy and its imitators have been doing that for a decade or more now: do we really need yet another?), stand-alone file-server nodes (Yet Another File Server Virtualization Product), and the buzzword-heavy object-based storage that Garth Gibson has been stumbling around trying to make sense of since the early ’90s (Lustre did a better job with that as well, though they did start quite a bit later without any older baggage to carry around, which may have helped make their vision clearer).

Will pNFS satisfy some users? Yes. Will it ‘commoditize parallel data access’? Not likely – no more than NFS commoditized remote access (save in Unix environments).

– bill

Jeff Darcy October 18, 2007 at 7:54 am

You can’t use algorithmic layout to avoid maps. Consistent hashing etc. are great when the cost of migrating the data looked up that way is small, which is why they’re great for looking up the *pointers to the real data* in P2P systems, but if the cost of migration is high then you can’t deal with the problem of a changing server set by updating pointers. Instead you have to deal with it by redirecting from the guessed-at server to the one that was designated for that datum when it was written, which is a bit of a performance problem but more importantly a correctness-proving nightmare when you’re operating on big batches of blocks. Looking up a per-file server and then getting a layout from that server actually scales much better, except in the most pathological write-sharing cases which applications tend to avoid (for good reason) anyway. You can use consistent hashing to find the per-file server, which would be an improvement over either pNFS or Lustre as each currently works, but doing it for the blocks themselves doesn’t work. Been there, done that, wouldn’t be keen on repeating it.

Bill Todd October 18, 2007 at 5:01 pm

Of course you can use algorithmic layout to avoid maps, just as RAID has since considerably before its acronym was invented: you just can’t reasonably use algorithmic layout to avoid *all* maps.

So you distribute data around the server set algorithmically (avoiding the ‘layout’ maps used in pNFS and Lustre which not only can become unreasonably large as file sizes move toward the PB range but which inevitably introduce additional levels of look-up overhead for random access to such files) and then have only the (pretty much unavoidable) server-local mapping to deal with (which also has the virtue of being *managed* entirely locally – one of the legitimate strengths claimed for ‘object storage devices’ if they’re used sensibly).

And consistent hashing works well for the higher-level distribution, because when the membership of the server set changes you only have to migrate as much of the data as it takes to rebalance the set (and you *always* have to migrate that much data unless you’re willing to allow significant imbalance – especially the kind that otherwise makes a new server a very hot spot because it has loads of free space for in-coming material and the existing servers don’t). When the membership changes you simply use the old configuration (with forwarding when necessary: an additional network hop really isn’t a serious problem as long as it’s limited to small control messages rather than bulk data, and part of the beauty of the mechanism is that *the individual servers know where to send the relocated data* and can do so in parallel while managing that forwarding) until the new configuration has stabilized.

Of course, you don’t distribute small files this way, just files large enough to be worthwhile in terms of both distributing space use and leveraging the throughput of multiple servers. And you don’t distribute at very fine grain (a ‘chunk size’ of a few MB per server is a good figure for current hardware characteristics).
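A minimal sketch of the scheme just described (illustrative Python; the server names, virtual-node count, and 4 MB chunk size are assumptions): chunks of a large file hash onto a ring of servers, and when a server is added only the chunks that now land on it have to move.

```python
import hashlib
from bisect import bisect

class Ring:
    """Consistent-hash ring mapping (file, chunk) pairs to servers."""
    def __init__(self, servers, vnodes=64):
        # Each server gets many virtual points on the ring for balance
        self.points = sorted(
            (self._h(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
        self.keys = [p for p, _ in self.points]

    @staticmethod
    def _h(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def server_for(self, file_id, chunk_no):
        # First ring point at or after the chunk's hash owns the chunk
        h = self._h(f"{file_id}/{chunk_no}")
        return self.points[bisect(self.keys, h) % len(self.points)][1]

CHUNK = 4 * 2**20  # a 'few MB' chunk size, per the comment above

def chunk_no(byte_offset):
    return byte_offset // CHUNK
```

Because placement is computed from the file ID and chunk number alone, no per-file layout map exists anywhere in the system; each server keeps only its own local extent map for the chunks it stores.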

If you indeed tried something along these specific lines I’d be very interested in hearing about exactly what problems you had with it.

– bill

Jeff Darcy October 19, 2007 at 11:07 am

You seem to have a fundamental misunderstanding of what layout maps are or how they’re used, Bill. They impose no additional levels of lookup overhead beyond what’s also in the “server-local mapping” in your scheme. Driving all of the data through the same servers as the metadata is clearly insane, as it precludes optimization for one of those two very different workloads. If you have data/metadata separation then the metadata servers need some sort of maps, and since the whole point of the exercise is to make things distributed so they can scale, sharing those maps is the obvious next step. There’s less overhead (though somewhat more complexity) involved in doing that, along with the occasional expiration or invalidation, than in getting involved in every transfer. Data/metadata separation is also very useful in other ways, such as access control or transparently turning a cluster filesystem into a distributed one (my last project at EMC after I was done with HighRoad). Hand-waves just don’t compare to years spent actually thinking through the second- and third- and tenth-level issues necessary to produce a working implementation.

Bill Todd October 19, 2007 at 1:57 pm

I’m afraid that you’re the one who seems confused, Jeff.

1. It simply isn’t feasible to map even many-TB files explicitly without either using multiple levels (in fact, *several* levels as files approach multi-PB sizes) in the map or requiring that each extent mapped be something like 1 GB in size (which eliminates the possibility of enlisting multiple servers in streaming serial data at high bandwidth): a conventional Unix inode, for example, can only map around 10 file blocks (let’s be generous, though, and assume that it can map that many contiguous extents) before requiring the first additional level of mapping indirection, and you’d have to increase the inode size by several orders of magnitude before making a serious dent in the number of levels required to map TB, let alone PB, files. By contrast, when the only explicit mapping involved is the server-local mapping distributed across the server set then keeping *all* the mapping information for *all* the large files in the system RAM becomes eminently feasible as long as the extents are at least a few MB in size (a size that *does* allow flexible aggregation of multiple servers in streaming serial data): even if each mapping entry consumes a few dozen bytes, they still constitute only about 0.001% of the total local storage and hence can reasonably be maintained in RAM (since RAM costs only about 100x as much as an equal amount of disk space you can easily configure an amount of RAM equal to a significant fraction of 1% of the node’s disk storage space without noticeably increasing the overall cost of the node, and keeping all the extent pointers for all local extents of large files in that amount of RAM will consume only a few percent of it). 
In summation, there’s no way in hell you could maintain the kind of huge map required to describe a huge file explicitly on the client without using multiple levels of look-up indirection – and in fact no way to keep that large a map on modestly-configured clients at all (unless, as noted, you’re willing to sacrifice server bandwidth aggregation for serial access to the large file by placing huge contiguous pieces of it on each server).
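The arithmetic behind that claim is easy to check (the figures below are illustrative, not from the comment: 4 MB extents, 32-byte map entries, a 10 TB node):

```python
EXTENT = 4 * 2**20   # 4 MB extents (assumed)
ENTRY = 32           # bytes of mapping state per extent (assumed)

# Explicit whole-file map for a 1 PB file, as a client would need it:
file_size = 2**50
client_map_bytes = (file_size // EXTENT) * ENTRY   # 2**28 entries -> 8 GiB

# Purely local map on one server node with 10 TB of disk:
node_disk = 10 * 10**12
local_map_bytes = (node_disk // EXTENT) * ENTRY    # roughly 73 MiB
ram_fraction = local_map_bytes / node_disk         # well under 0.001%
```

So an explicit client-side map for a petabyte file runs to gigabytes, while the per-server local maps stay small enough to pin in RAM, which is the crux of the argument above.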

2. I never in any way suggested that data pass through the metadata server(s): that would be ridiculous. In fact, had you read more carefully you would have found multiple places in which I stated explicitly that bulk data transfer should go directly between the client and the server node holding the data. It is, however, no real problem if small *control* (e.g., read request or notification of ‘desire to write’) messages incur an extra hop once in a while (or even all the time if you want to coordinate authentication – and perhaps locking – through a specific ‘home’ server for each file rather than use mechanisms like capabilities which can become complicated to revoke if permissions change dynamically); hell, if you’re maintaining atimes and mtimes (or just the EOF marker) the file’s metadata may need to be updated anyway, so the hop may not be ‘extra’ after all.

3. Whether you have data/metadata separation or allow both to be spread across the entire server complement, you do *not* need overall maps for a large file if you distribute its pieces among the servers using a hash function: you only need the local mapping at each storage server, as I already described – the hash describes which storage server to target for any given ‘chunk’ (virtual address range within the file, at least for byte-stream files), and the storage node’s purely local RAM-resident map takes things from there. So you don’t have big file-specific maps, you don’t have to give them to clients (clients don’t even have to understand the mapping algorithm if they funnel file request information through the ‘home’ server and look-up requests through any server that’s convenient, though for the latter path-caching and direct targeting are also reasonable options), you don’t have to worry about keeping all map copies synchronized, you don’t have multi-level look-ups, you don’t have to coordinate allocation globally (or at least between a metadata handler and the target storage node)…

Suck it up and take this seriously if you want to continue to discuss it, Jeff: so far, your ‘years spent actually thinking’ about such issues are looking singularly unimpressive, and the only hands that are waving appear to be your own.

– bill

Jeff Darcy October 19, 2007 at 7:33 pm

I’m going to ignore your extreme rudeness, Bill, and try to concentrate on the technical issues. I was on Usenet once, as I see you still are; I long ago lost patience with the kind of posturing and flaming that you’ve brought with you.

1. Wherever did you get this notion that clients do or should fetch the *entire* map for a file? Certainly not from me, and I don’t think from the pNFS spec either. That’s the confusion to which I was referring. You seem to have spent a lot of effort, or at least typed a lot of text, theorizing about the dimensions of a problem that actually doesn’t exist in practice. All this talk about “traditional UNIX inodes” mapping blocks individually is actually pretty funny (and yet I’m accused of not being serious). I was around when things worked that way, but it was a long time ago. Modern systems simply don’t work that way, so the map for a file doesn’t get anywhere near as large as you seem to think. People have known for years how to allocate ever-larger extents as files grow larger, so the number of extents is usually a lot more like O(log(blocks)) than O(blocks) and the map even for very large files grows far more slowly than you suggest.
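To put a number on the O(log(blocks)) claim, here is a sketch of a simple doubling heuristic (illustrative only; real allocators cap the growth and the 64 KB starting size is an assumption):

```python
def extent_count(file_size, first_extent=64 * 2**10):
    """Count extents when each newly allocated extent doubles
    the previous one, starting from a 64 KB first extent."""
    count, allocated, ext = 0, 0, first_extent
    while allocated < file_size:
        allocated += ext   # grant the next extent
        count += 1
        ext *= 2           # and double the size of the one after it
    return count
```

Under pure doubling, `extent_count(2**50)` comes to 35 extents for a 1 PB file: about a kilobyte of map entries, versus the hundreds of millions of entries a fixed extent-per-few-MB map would need.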

Clients fetch only the parts of the map that they need, and any multi-level lookup is for the server to resolve so there are no extra round trips associated with it. If clients are doing big sequential I/Os, they fetch huge gobs of it; if they’re not, they don’t (to avoid false contention). Also, the maps are (or at least were) extent-based, so if space had been allocated in multi-gigabyte chunks then each such chunk would still only occupy one map entry. Maps can be fetched in parallel with I/O, in plenty of time for the next chunk of I/O, releasing old extents to make room as they go, without much difficulty at all. We used to test a lot with large files, especially for geophysics. We found that, even with a small extent cache and an unambitious prefetch strategy, waits for new extents hardly ever caused stalls. That one little bit of empirical data trumps all of your “clients can’t possibly” theorizing.

2. I just checked, and there are not “multiple places” where you “explicitly” said that bulk data transfer should not be forwarded. On the contrary, your comment about “knowing where to send the relocated data” suggests such forwarding, though that might just be the result of sloppy terminology. You were talking about two different servers of the same type (combined data/metadata) and should at least have specified which one you were talking about. Since we have also been talking about split data/metadata servers, it would have been even better if you had specified which type of server you were talking about as well.

I also take issue with your “no real problem” attitude toward extra hops. They might or might not be a performance problem; lots of small/quick messages can still clog up a system pretty badly just as lots of raindrops can make a flood. More importantly, they make the already-difficult parts of the protocol more so. You seem to think of authentication and locking as a small aside when in fact they’re critical pieces of the puzzle, and you haven’t even mentioned recovery – one of the hairiest bits – at all. When someone goes on endlessly about the error-free uncontended “happy path” for I/O without even considering the less common cases, I call that handwaving and I don’t apologize for it. So do most of the other people I know who’ve gone to great lengths to ensure that even if something happens very rarely it won’t cost you your data when it does. In production code, even the rarely exercised pieces have to work, and those pieces get much more difficult when transactions get forwarded around. I’ve developed such distributed protocols when I had to deal with latency in a global-scale network, but I believe it’s something that even the most experienced practitioners using the best model checkers and other tools should only do when there’s a compelling reason. In this case there’s not. As superficially appealing as it might be, there’s no performance- or reliability-related benefit.

3. Your point about “clients don’t even have to know the algorithm” only makes it more evident that you have failed to understand pNFS before attacking it, because *they don’t have to know now*. That’s a problem that only exists in your half-proposal, not in pNFS. The whole idea of FMP, which morphed into pNFS, was that the server does all the “cooking” of whatever big hairy data structures it uses to keep track of file contents, and then presents the easily digestible result to clients. Splitting the map for a single file is also a big lose because then you’ll have to deal with all of the boundary-crossing issues not just for reads and writes but for locks and during recovery. Here’s some more empirical data for you: a surprising number of applications are profligate about locking, which will create server-spanning locks and complicate not just other locks but read and write handling as well. Again, this is something that requires a compelling justification, and there isn’t one. As I mentioned above, with modern techniques all of the metadata even for a very large file can easily be handled on one server, and if you want to split load across servers you can do it just fine at file boundaries. Empirical datum #3: even in the HPC field where I now work, petabyte files are practically nonexistent. In some ways I wish they were more common, because the frequently-chosen alternative is huge masses of smaller files with funny names. That’s actually even more difficult for a cluster or distributed filesystem to handle efficiently. Be that as it may, that’s what people do and that’s what those who actually design such systems need to account for.

So, Bill, who needs to start taking this seriously? Where are your empirical data points to guide design of these sorts of systems? What’s “impressive” about a mountain of text that’s mostly off in the weeds concept-wise? I would never claim that pNFS is an ideal solution; in fact I’m on record at this very site saying that I think some aspects of it are less ideal than our host seems to think. Ditto for Lustre, which I spent all day today working on – and by working on I mean code, not just setup. There are still some significant unsolved problems in this area, but that doesn’t mean anyone with a few P2P notions and a half hour on their hands can turn the field on its head. Many, if not most of the people working on cluster filesystems have also worked on distributed systems. They’ve known about consistent hashing and Byzantine generals and Lamport clocks and relaxed ordering/consistency models for quite a while. That honeymoon period ended years ago. Those ideas are being applied, or often have been applied, judiciously, where they fit, guided by empirical need and not mere “I found a hammer; where’s a nail?” impulse. What will your first original contribution be?

Bill Todd October 19, 2007 at 10:15 pm

Y’know, Jeff, for someone who started the business of throwing around slights like “You seem to have a fundamental misunderstanding…” and accusations of ‘hand-waving’, you’re kind of thin-skinned yourself when they’re turned back in your own direction. Before lecturing others on their attitude, you should look to your own for possible reasons for it. Perhaps that’s what happens when someone enters a conversation expecting not to learn anything from it: they can then get kind of uppity when things don’t quite go as they expect them to, because they’re primarily pontificating instead of actually listening.

1. I actually have read enough about pNFS to know that clients may not need to inhale the entire map at one time, but of course that means that they will need to perform metadata server queries each time they need another portion of it (which could be on every access for a random access pattern) – not exactly a great recipe for performance. But in any event that was not my primary objection to the map: its size on the metadata server still requires multi-level look-ups for large files (instead of continuing to wave your hands so vigorously, why not actually work through some numbers on this: it’s hardly a graduate-level exercise), and it still requires distributed updating (if copies or portions are distributed to clients) when portions of it change. And you seem utterly oblivious to the problems inherent in the ‘multi-gigabyte’ extents you describe – which limit sequential streaming within that range to the capabilities of the single server on which it resides (again, hardly a recipe for great performance where high-bandwidth access is desired: it’s the same problem that allocating a multi-GB extent on a single disk encounters in that respect, though IIRC EMC had a brain-damaged ‘RAID-S’ implementation that did just that so perhaps this explains your blind spot in that area).

2. Though it’s a kind of petty point, your ‘checking’ described above seems to have been as incompetent as the rest of your discussion so far. The multiple instances in which I stated that bulk data should move directly between clients and the storage server holding it were (you can search for this text which you should have found the first time rather than blustering about its non-existence) “read requests would just be forwarded to the appropriate location … and write request descriptions would be forwarded as well … to allow the target to request the data from the client (thus not only keeping the bulk data movement direct… This would retain pNFS’s ability to leverage the aggregate bandwidth of the entire server set for data movement” and (this one was more implicit, since I felt that I had made the concept quite clear in those multiple earlier lines) “an additional network hop really isn’t a serious problem as long as it’s limited to small control messages rather than bulk data”. Your problem with the part about servers knowing where to send the relocated data was due to sloppy reading, not sloppy terminology (but since you clearly haven’t bothered to read carefully from the start, no surprises there: take a look at the context in which the phrase was used – including both the start of the paragraph and the start of the sentence – and light may dawn).

2a. Nice try at exploding an occasional extra hop (or possibly even sometimes two) per client request into “Lots of small, quick messages” being the equivalent of raindrops making a flood. If the system can handle the actual data movement itself, associating an extra small packet (or rarely even two) with it is hardly a cause for alarm – especially since it occurs only with files too large to reside wholly on their ‘home’ server node.

2b. Of course I’ve only touched upon authentication and locking, and even less on recovery: since you seem to have difficulty even understanding how the data can be distributed, going into greater detail in other areas seems premature. This doesn’t imply that I haven’t worked them out myself, of course. As for the ‘happy path’, perhaps you’re simply unacquainted with the precept that one should make the common path fast when that’s possible without making uncommon paths unreasonably complex – again, since you appear unable to grasp even the common path, I’ve been working on that piece of your education before moving elsewhere.

2c. Simple request-forwarding is hardly the potential bugaboo that you suggest; in particular, it does not bring additional members into any ‘transaction’ (when something as heavy-weight as a traditional transaction or distributed transaction is in fact necessary – and since I’ve written distributed transaction managers I do know what they entail).

3. Your reading comprehension failed you yet again, I’m afraid: the comment about clients not needing to understand the mapping algorithm was in *contrast* to pNFS’s requirement that clients employ explicit maps, which itself is a consequence of using centralized metadata on servers with commensurately limited capabilities (whereas with fully-distributed partitioned metadata sending a client request through the file’s ‘home’ server does not run the risk of overwhelming a central resource).

3a. Your comment about ‘splitting the map for a single file’ simply indicates that you still have no clue what I’ve been talking about; THERE IS NO MAP TO SPLIT (at least save for the opaque-to-any-external-observer server-local maps of what in OSD terms would be called the ‘objects’ comprising the local pieces of the file), and in the simple version of the proposal (with locking and authentication for a large, distributed file centralized at its ‘home’ server) the only ‘boundary crossing issues’ are the same ones pNFS has (at the points where a file’s storage moves from one storage server to the next and write-atomicity must be coordinated when necessary).

– bill

Bill Todd October 21, 2007 at 4:52 pm

Despite the conspicuous lack of any substantive objections to the architecture that I’ve sketched out here, I’m always willing to revisit my assumptions in the face of a challenge to them. So I decided that it was time to take a look at Amazon’s ‘Dynamo’ architecture (Google up amazon-dynamo-sosp2007.pdf) for their very-large-scale production environment, since I had seen a passing reference in the Scalability Conference that Robin reported on here a while ago to the use of consistent hashing as its data-distribution mechanism.

It turns out that the Amazon architects and I are on very much the same page in several areas (I’ll only touch upon the ones that seem pertinent to the existing discussion here, though there are others as well).

They think that incremental scalability is important:

“Dynamo should be able to scale out one storage host (henceforth, referred to as “node”) at a time, with minimal impact on both operators of the system and the system itself.” (“Minimal impact… on the system itself” is one of the hallmarks of the use of consistent hashing as a data distribution mechanism rather than some mechanism which must not only rebalance data according to some global knowledge of available space but coordinate this movement closely with updates to pointers to that data.)

They value symmetry:

“Every node in Dynamo should have the same set of responsibilities as its peers; there should be no distinguished node or nodes that take special roles or extra set of responsibilities. In our experience, symmetry simplifies the process of system provisioning and maintenance.” (I.e., they deprecate use of special ‘metadata servers’.)

They value decentralization:

“An extension of symmetry, the design should favor decentralized peer-to-peer techniques over centralized control. In the past, centralized control has resulted in outages and the goal is to avoid it as much as possible. This leads to a simpler, more scalable, and more available system.” (Again, this favors techniques which do not rely upon global knowledge of the distribution of free space, let alone require global synchronization when performing allocations and deallocations, and which allow those transactions which *do* require participation by multiple storage nodes to involve only the minimal set of nodes that are affected.)

They choose consistent hashing to distribute their data, and use it in exactly the same manner as I’ve proposed, including their initial mechanism for handling nodes of varying storage capacities. They later refined that mechanism because of the overhead of redistributing small objects when rebalancing was necessary, and because of their use of Merkle trees to detect stale object copies. Both problems are far less significant in my proposed system, which does not distribute small objects via consistent hashing (it is desirable to cluster them for efficient look-up and aggregate access) and does not suffer from stale-object problems at all (thanks to its synchronous replication strategies). Whether their later mechanism would be a net win for my approach is not clear: it ultimately either runs out of steam at very large node counts or consumes a great deal more space than necessary at modest node counts, and it also requires careful, centrally-coordinated tending to keep things in balance – but I’m glad I encountered it. AFAIK this is the first large-scale integrated storage system – as distinguished from, e.g., edge-of-the-Web caching facilities – to use consistent hashing (and until I heard about Dynamo I was surprised that no one seemed to be taking advantage of it).
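For anyone unfamiliar with the technique, here is a minimal sketch of a consistent-hashing ring with virtual nodes (‘tokens’), roughly in the style Dynamo describes; the node and key names are purely illustrative, and this omits everything a real system needs (persistence, failure handling, replication):

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal consistent-hashing ring with virtual nodes ('tokens')."""

    def __init__(self, vnodes_per_node: int = 64):
        self.vnodes = vnodes_per_node
        self.ring = []  # sorted list of (hash, node) tokens

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        # Giving a node more tokens gives it a larger share of the
        # keyspace - one way to handle heterogeneous node capacities.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self.ring = [t for t in self.ring if t[1] != node]

    def nodes_for(self, key: str, n: int = 3) -> list[str]:
        """First n distinct nodes clockwise from the key's position -
        the key's replica set, in the style of Dynamo's 'preference list'."""
        if not self.ring:
            return []
        start = bisect.bisect(self.ring, (self._hash(key), ""))
        result = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in result:
                result.append(node)
                if len(result) == n:
                    break
        return result
```

The point of the structure is that adding or removing one node moves only the keys adjacent to that node’s tokens, which is exactly the “minimal impact… on the system itself” property quoted above.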

They leverage the use of consistent hashing to balance dynamic load as well as space use.

They believe that most outages are transient (though they may entail waiting for a component to be repaired rather than being virtually instantaneous) rather than permanent, and hence do not immediately respond to them by rebalancing.

Areas in which Dynamo differs include:

1. A relaxed consistency model. I just don’t believe that this is necessary for adequate performance in a controlled environment (a single secure data center, or even multiple such centers), and the speed of light allows replication to sites up to 100 miles away at latencies which are small relative to that of an average disk access. Request prioritization should provide acceptable response times for synchronous replication, especially when expedited by initial acceptance in a low-latency transaction log (possibly even in NVRAM). Quickly timing out and then queuing updates to unresponsive redundant recipients (or, for really remote sites, possibly through a modest queuing repeater located a bit closer but still outside the expected blast radius) arguably attains similar levels of performance and availability without allowing the kind of replica divergence which requires some form of ‘conflict resolution’ – as long as each site has reasonable internal redundancy in both storage and in the paths to its copies (and to other sites), and as long as one can tolerate having a ‘primary’ data center which coordinates such activity in the absence of a whole-site disaster (though it can still farm out reads, and some write service requests, to a remote site that the primary site knows has up-to-date data, thus leveraging that site’s additional bandwidth or potential proximity to a client). Since even synchronously-replicated storage facilities must tolerate at least temporary decreases in their level of redundancy when components fail, even shorter-term decreases in redundancy due to such brief queuing activity should be acceptable (as long as some minimal acceptable level of synchronized redundancy is retained – e.g., via emergency redirection of one copy to a non-standard location if otherwise only one up-to-date copy would exist in the system).
Note that when special requirements exist it is possible to give primary-copy responsibility to different sites for different portions of the overall data set – either by partitioning it or (though considerably more complex) identifying the primary (coordinating) site for each object in an otherwise homogeneous database (I have some difficulty imagining that this would be worthwhile, but it is possible).

The last time I measured U.S. coast-to-coast round-trip ping times on the Internet they were pretty respectable (significantly under 0.1 second): until we start replicating data off-planet I’ll continue to suspect that synchronous replication (with the possible addition of those ‘queuing repeaters’ that I mentioned to help handle large distances) is a viable option for most interactive environments (the Web itself being the major variable), as long as the relevant *applications* can execute fairly close to the data. By contrast, Dynamo’s ‘quorum’ mechanism, which requires reading from multiple nodes before data can be returned, seems a questionable trade-off, even taking into account Dynamo’s extreme reluctance to stall updates for any reason: it clearly *works* for Amazon, but I question whether it even performs as well as (let alone measurably better than) well-designed synchronous replication would – and not having to deal with version divergence is a non-trivial advantage of the latter.
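For the record, the back-of-the-envelope arithmetic behind those latency claims (assuming signals propagate at roughly two-thirds of c in fiber, and ignoring switching, serialization, and protocol overheads – so these are lower bounds):

```python
# Round-trip propagation delay for synchronous replication to a remote
# site, for comparison with a typical ~10 ms disk access. Assumption:
# light in fiber travels at about 2/3 the vacuum speed of light.
C_VACUUM_KM_S = 299_792
FIBER_SPEED_KM_S = C_VACUUM_KM_S * 2 / 3


def round_trip_ms(distance_km: float) -> float:
    """Round-trip propagation delay in milliseconds (propagation only)."""
    return 2 * distance_km / FIBER_SPEED_KM_S * 1000


miles_100_km = 100 * 1.609
print(f"100-mile round trip: {round_trip_ms(miles_100_km):.2f} ms")  # ~1.6 ms
print(f"Coast-to-coast (~4500 km): {round_trip_ms(4500):.1f} ms")    # ~45 ms
```

Roughly 1.6 ms for a 100-mile round trip – small compared with an average disk access – and about 45 ms coast-to-coast, consistent with the sub-0.1-second ping times I observed.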

2. Support for large and potentially complex objects (files). Dynamo’s largest objects are 1 MB in size, hence it has no need to distribute pieces of them across multiple nodes and coordinate updates across those pieces (and associated metadata) using POSIX-like single-system file semantics.

3. Support for a conventional directory hierarchy. Dynamo just accesses by hashing the object ID to the object’s target node(s).

4. Support for any form of object clustering for improved aggregate access (Dynamo offers none).

5. Support for transactional updates (generally not required in Dynamo since its objects are simple and don’t span nodes, and its consistency is relaxed).

6. A sub-optimal redundancy strategy (I believe that Dynamo is more exposed than necessary to data loss from multiple failures) – especially in a single-site environment.

7. Distributing a single redundancy strategy across multiple sites. Dynamo has only one redundancy strategy: propagating replicas forward to successor nodes in its consistent-hashing ‘ring’. It achieves multi-site disaster tolerance by requiring that the ring nodes be distributed (apparently) round-robin around the multiple sites. While this is the most storage-efficient way to achieve multi-site disaster-tolerance, it has multiple down-sides: the sites are joined at the hip rather than essentially independent (there’s some value in being able to run completely different software at a replica site); you lose the option of propagating only logical updates (rather than, say, images of every updated page) between sites, which matters especially if they’re far apart and/or connected by a link of modest bandwidth; and having to fetch data from multiple remote sites just to satisfy every read operation (due to the ‘quorum’ requirement) sucks. Raw storage has become cheap enough that incorporating both site-local redundancy (perhaps parity-based, if you’re feeling strapped) *and*, when disaster-tolerance is required, remote replication is hardly infeasible – and using site-local parity protection plus a single similar remote replication site gives you more availability using less storage than three Dynamo-style sites would (unless you feel the need to ride out two concurrent whole-site disasters).

8. Read/Write ‘quorum’ and total replica count control (per object?). Quorum control is not applicable to a synchronously-updated system, but my approach does provide per-object replication control (number of local copies or RAID-5/6; all objects would be replicated at each remote site at which the file system containing them was replicated, but potentially with different local replication strategies there).
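For reference, Dynamo’s quorum condition is simply R + W > N over N replicas. A toy sketch of the idea (the in-memory ‘replicas’ list stands in for real storage nodes, and the parameter names follow the paper):

```python
# Dynamo-style quorum I/O over N replicas: a write needs W
# acknowledgements and a read consults R replicas. Choosing R + W > N
# forces any read set to overlap the most recent write set, so a read
# always sees at least one copy of the latest acknowledged write.
N, R, W = 3, 2, 2
assert R + W > N, "quorum condition"

replicas = [dict() for _ in range(N)]  # each: key -> (version, value)


def put(key, value, version):
    # Stop after W acks; the remaining replicas may be left stale
    # (Dynamo repairs them later via read-repair / anti-entropy).
    for rep in replicas[:W]:
        rep[key] = (version, value)


def get(key):
    # Deliberately read the *other* end of the replica list to show the
    # overlap guarantee: any R replicas intersect any W replicas.
    replies = [rep[key] for rep in replicas[-R:] if key in rep]
    return max(replies)[1] if replies else None


put("cart:42", "v1", version=1)
put("cart:42", "v2", version=2)
print(get("cart:42"))  # "v2" - the newest acknowledged write
```

This also shows why every read must touch multiple (possibly remote) nodes before returning, which is the cost I’m objecting to above.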

9. Decentralized membership. Dynamo is able to tolerate simultaneous inconsistent views of the membership in part because of its relaxed consistency mechanism. I’m not yet certain whether my system would find that level of uncertainty tolerable, though by queuing updates from individual nodes that cannot communicate with a peer while initiating an investigation that will soon establish for everyone the presence or absence of that peer I think it can combine reasonable responsiveness (both to on-going activity and to temporarily restoring redundancy if necessary) with safe replication (the critical period is that between the first unsuccessful update attempt and everyone’s notification that the node is unresponsive – though not yet expelled).

In sum, while I applaud Dynamo’s innovation I’m not entirely convinced that it’s the best solution even for Amazon’s very special requirements, let alone for more conventional needs. Still, its demonstrated success over the two-year period prior to the paper’s publication helps validate several of my own approaches, for which I’m duly appreciative.

– bill

P.S. Robin, I think the update that you appended to the Mac ZFS debate post may have been meant for this one.

P.P.S. While discussing relative ZFS virtues on the “ZFS Hater Redux” sequel to the “Don’t be a ZFS Hater” thread that you linked to, I was reminded that I had downloaded a UWisc thesis early last month but never gotten around to reading it. It’s a *great* description of the kinds of nasties encountered while managing storage, even though the author occasionally plays a bit fast and loose with paraphrasing his references: Google “Iron File Systems” and choose either the vijayan-thesis.pdf reference for the Full Monty or the iron-sosp05.pdf reference for the Reader’s Digest Condensed Version. Particularly scary are the descriptions of how many current file systems sometimes just blithely ignore write errors (not ‘silent’ errors, not even ‘latent’ errors: errors the underlying layers *report* to them): I’ve downloaded the current Linux source to see whether the author may possibly have misinterpreted what was happening, but it sure sounds bad.
