On Saturday June 14th Google will host the 2nd Seattle Conference on Scalability. I’m planning to be there.
Interesting topics
Given the dual-track format, I may not be able to see everything I’d like. Here’s what looks cool to me:
- CARMEN: a Scalable Science Cloud. “CARMEN is a $9M project building a scalable science cloud. Its focus is on supporting neuroscientists who will use it to store, share and analyse 100s of TBs of data.”
- Chapel: Productive Parallel Programming at Scale. “Chapel is a new programming language being developed by Cray Inc. as part of the DARPA-led High Productivity Computing Systems Program (HPCS). . . . Language concepts that support this goal include abstractions for globally distributed data aggregates and anonymized task-based parallelism.”
- Conquering Scalability Challenges with Transactional Billing. “A huge problem faced in the world of billing and charging is the ability to process a large number of transactions per day.”
- GIGA+: Scalable Directories for Shared File Systems. “This talk is about how to build file system directories that contain billions to trillions of entries and grow the number of entries instantly with all cores contributing concurrently. The central tenet of our work is to push the limits of scalability by minimizing serialization, eliminating system-wide synchronization, and using weaker consistency semantics.”
- maidsafe: A new networking paradigm. “We describe a significant new way of networking and data handling globally. This data centric network is likely to revolutionize the IT industry in a very positive fashion.”
Here’s a link to registration and all the presentations.
The StorageMojo take
They all look good, don’t they? I’m very interested in the transactional billing talk, since the major reason for the disconnect between IT and the lines of business is that IT is essentially tax-supported rather than priced.
IT costs are peanut-buttered across LOB overhead – so expensive resources get over-consumed – leading to over-configuration and business value blindness. These guys aren’t talking about that problem – but it is the key to making utility computing and SOA business tools instead of marketectures.
Comments welcome, of course. I’m looking forward to seeing Seattle with blue skies.
Scalable Directories for Shared File Systems
Well, scalability is really easy to achieve if you concede weaker consistency semantics. That probably makes sense for a lot of large web applications, where individuals are only writing emails that go into their own folder, or uploading content to their own photo album or video share. The web company hosting all that may have trillions of files, but those are from hundreds of millions of users, with very little contention across them.
Weak consistency file systems aren’t very interesting in business, though, or in any case where multiple people might share a file in that shared file system.
There are already some commercial shared file systems (and/or clustered NAS solutions) that give you weaker consistency on directory updates, but they don’t admit it publicly.
I think relaxing consistency constraints is both fine and necessary at huge scale, as long as file system vendors are being upfront about what the consistency guarantees really are.
Out of curiosity, what do you mean by weaker consistency on directory updates, Carter?
Consistency in data systems has been an issue since news group servers decided to replicate each other, back when Usenet ruled the world. It was an issue when the home office trucked tapes around the country to local offices because mainframe connectivity was not a technical or cost-effective option. The problem space is the same; only some of the expectations have changed.
The scientific community needs a decent publishing schema for the TBs of data it has. Isn’t this point moot since CERN complained about one (1) bad bit in every MB they have?
Whether scalability meets the needs of business depends on the expectations of the data publisher and the consumer. Before we discuss scalability, someone should address the issues and models of the authoring, publishing and consumption life cycles; what the data authors, publishers and consumers expect; and what they really need. Especially during authoring, when multiple authors edit multiple copies of the same data set (file) in a time-shifted environment.
Lotus Notes, in the 1980s (’90s), addressed this issue with a manual reconciliation process. Will a similar process be accepted and adopted by today’s market as a viable solution? Would anyone tolerate it today?
Carter notes it’s not “interesting” for business. I think he’s right for the most part.
Our expectations for data propagation have already been set by existing applications, i.e. email. You would assume that as these applications became more ubiquitous, our expectations would have driven the demand for increased performance. But they haven’t. In fact the opposite is occurring: there is a social trend of communicating using text messaging instead of voice calls. On average, a text message takes between 45 and 120 seconds to arrive. Funny how we wouldn’t tolerate that on a phone call.
This “near off-line” social communication trend will help set the acceptable levels of performance and data consistency in the business and home consumer information markets. As long as the consumer’s and publisher’s expectations are met, the system will be deemed acceptable for a great number of business applications.
Carter brings out another very good point: some commercial NAS suppliers are not informing their customers that there may be delays in directory updates. Holding back this information is unscrupulous on the suppliers’ part. But in today’s business climate it’s OK to mislead customers and find some way to justify excessive pricing, consulting and service fees for “open source” software.
I just meant that if I create a file in a directory, then everyone and anyone who lists that directory should see that file in there right now. On some cluster file systems, or clustered NAS implementations, you can create a file on node A, and list the directory on node B and you won’t see the file. Maybe if you try it again in a minute, you will see the file. That’s an example of weak consistency – it eventually gets made consistent, but there are periods of time where different users will list the same directory and get different results.
In some very large scale operations – like those at this Google conference – weak consistency is OK. If there are 10,000 videos uploaded to YouTube in an hour, does it really matter what order they were uploaded in, exactly? It’s different if you are talking order entry, or bank transactions. I am not saying that weak consistency file systems are evil, just that you have to admit it when you are one, so that people can reasonably say, oh, OK, that’s fine for my application, or no, that’s absolutely unacceptable.
The lights are out in this room, so excuse any typos please – I can’t see what I’m writing… and no, that’s not supposed to be a veiled dig at weak consistency file systems – it’s just a dig at the failing eyesight of an old file systems engineer!
Wow – while I haven’t done any thorough analysis, off the top of my head I’m not aware of a significant distributed file system since early implementations of AFS (Andrew) that didn’t have pretty strong consistency guarantees. I was going to ask you to name names, but then I took a quick glance at your blog and it appeared that you weren’t really talking about the kind of commercial file systems that people buy (rather than those which they create themselves for internal use, without making their design public).
My own view is that appropriate object partitioning can be used to scale POSIX-compliant approaches sufficiently that weaker file systems will turn out to have been stop-gaps filling a temporary need, but until someone actually builds such an alternative (if you don’t think that Lustre already has) that’s open to debate.
– bill
Bill,
Your comments are a good example of some misconceptions in the storage communities. Technically, all file systems have a consistency component to them. If you look at the difference between the file data stored on disk and what is residing in the file cache, it can be considered a “consistency” issue.
We (the industry) developed weakly consistent file systems to overcome some of the performance latencies experienced with synchronous semantics across geographically dispersed file replicas or replica stores that are not online. Most companies didn’t have Gbit pipes connecting facilities 10 years ago, so weaker consistency was acceptable for the application (Carter’s comments). But today, companies are moving much more data than they were 10 years ago.
Examining the impact of geographic distance, fibre optics will incur about a 1ms delay for every 50KM of cable distance. Round trip, that’s 2 ms per disk read or write request. Locally, we experience travel latency under 1us. That’s at least 2000x increase in latency for 40KM. If you were in NYC and your remote store was in Boston, you will experience latencies of 14ms, which can significantly slow file operations. Transport technologies like ethernet will experience higher latencies. Transports on public networks even higher, 20-30ms.
The bottom line is: you need to pick/tune your distributed file system to the needs of your application, cost structure, operating systems and end user experience. They all basically operate the same way from a user’s perspective. When examining products for deployment, the critical issue is error and failure recovery.
Whether file system semantics are POSIX or something else is losing value rather quickly, especially in the home and mid-markets.
I don’t agree with the point that these measures are stop-gap. There are transport cost factors and physical constraints affecting platform performance and user experience. I would classify varying consistency as a cost optimization and a workaround to improve user experience.
In the near term, we will probably continue to see variable consistency file systems until someone comes up with a way to instantaneously transport information from one location to another (zero latency). Or until we stop paying a 10x premium for a 2x increase in performance. Or until we become satisfied with the cost and experience of the file solution. Kind of like our cable bill…
“Your comments are a good example of some misconceptions in the storage communities.”
No, they are not.
“Technically, all file systems have a consistency component to them. If you look at the difference between the file data stored on disk and what is residing in the file cache, it can be considered a ‘consistency’ issue.”
No, it cannot (at least not in anything like the sense being discussed here): that ‘issue’ is simply well-defined POSIX (and similar file system) semantics rather than the weaker guarantees that we’re talking about.
“We (the industry) developed weakly consistent file systems to overcome some of the performance latencies experienced with synchronous semantics across geographically dispersed file replicas or replica stores that are not online.”
No, by and large ‘we’ didn’t: perhaps you’re confusing this with some of the asynchronous *replication* facilities developed to support remote disaster-tolerant copies (which aren’t part of the file system per se and if readable at all are readable only with the caveat that they may not be up to date – similar to the way that a local file system is not up to date after a crash that loses its cached dirty data).
…
“Examining the impact of geographic distance, fibre optics will incur about a 1ms delay for every 50KM of cable distance.”
No, it will not: it’s about 1 ms for each 100 miles of distance (i.e., you’re off by a factor of about 3).
“Locally, we experience travel latency under 1us.”
Which is in most cases significantly over-shadowed by hardware and software interface overhead (which can vary from a few us for FC to tens of us for Ethernet).
“That’s at least 2000x increase in latency for 40KM.”
More like a 10x – 100x increase (depending on the interface overhead, and after correcting your cable-speed error) – and over a mere 40 KM the added latency (about 250 us) is insignificant compared to the latency of an average disk access (5 – 13 ms).
“If you were in NYC and your remote store was in Boston, you will experience latencies of 14ms”
More like 4 – 5 ms round-trip.
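As a rough sanity check of that figure, assuming roughly 400 km of fiber route between NYC and Boston (fiber paths are rarely straight lines) and light in fiber travelling at about two-thirds of its vacuum speed:

```python
# Back-of-the-envelope propagation delay, NYC <-> Boston over fiber.
# Assumptions: ~400 km of fiber route, light in fiber at ~2/3 c.

C_VACUUM_KM_PER_MS = 300.0                      # ~300,000 km/s
FIBER_KM_PER_MS = C_VACUUM_KM_PER_MS * 2 / 3    # ~200 km/ms

route_km = 400                                  # assumed fiber path length
one_way_ms = route_km / FIBER_KM_PER_MS         # ~2.0 ms
round_trip_ms = 2 * one_way_ms                  # ~4.0 ms

print(f"one way:    {one_way_ms:.1f} ms")
print(f"round trip: {round_trip_ms:.1f} ms")    # lands in the 4-5 ms range
```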
“which can significantly slow file operations.”
Maybe, maybe not: while 5 ms is certainly enough to slow a single disk access noticeably, it’s only incurred on *write* operations. Given that the systems are isolated from each other (and from common power failures), you don’t have to wait for the remote write to complete (i.e., you only have to wait 5 ms for an ACK, rather than 5 ms plus the remote disk access, and can hide that 5 ms behind the local-site disk access: only if you’re using non-volatile write-back local cache will the round-trip latency become uncomfortable). You can send all writes lazily (don’t have to wait for an immediate ACK) except those which represent transaction (synchronous file operation) commit points. If you can accept the limits of asynchronous replication you don’t have to wait even for those. And if none of these is acceptable to you, you can establish a relatively inexpensive closer repeater which will still give you disaster tolerance at your primary site without having to wait for synchronous replication all the way to Boston.
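A minimal sketch of the latency-hiding point above: if the remote ACK and the local disk write proceed in parallel, the caller waits roughly max(local, remote) rather than their sum. The timings are illustrative assumptions, not measurements of any real system.

```python
# Illustrative only: overlap a ~5 ms WAN ACK with an ~8 ms local disk write
# so the mirrored write costs about max(8, 5) ms rather than 8 + 5 ms.

import time
from concurrent.futures import ThreadPoolExecutor

def local_disk_write(block: bytes) -> str:
    time.sleep(0.008)            # assumed ~8 ms average disk access
    return "local-ok"

def remote_write_ack(block: bytes) -> str:
    time.sleep(0.005)            # assumed ~5 ms WAN round trip for the ACK
    return "remote-ok"

def mirrored_write(block: bytes):
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(local_disk_write, block),
                   pool.submit(remote_write_ack, block)]
        return [f.result() for f in futures]

start = time.time()
mirrored_write(b"some data")
print(f"{(time.time() - start) * 1000:.1f} ms")   # ~8 ms, not ~13 ms
```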
“Transport technologies like ethernet will experience higher latencies. Transports on public networks even higher, 20-30ms.”
You really should time a coast-to-coast ping sometime: the results will surprise you.
“The bottom line is: you need to pick/tune your distributed file system to the needs of your application, cost structure, operating systems and end user experience.”
Not unless you require concurrent write access to very-widely-separated replicas (which most environments do not): otherwise, you can just update at a single site and, optionally, allow slightly out-of-date concurrent reads at asynchronously-replicated remote sites (if they’re too distant to make synchronous replication feasible).
“They all basically operate the same way from a user’s perspective.”
That’s because they all obey essentially the same (POSIX-like) de facto standards (save possibly for asynchronously-updated replicas, but as I already noted that’s not part of the file system semantics per se but is defined separately).
“When examining products for deployment, the critical issue is error and failure recovery.”
Those are attributes of the remote replication mechanism, not of the file system per se – unless you’re talking about geographically-dispersed cluster file systems, the exemplars of which (e.g., those developed by Tandem, DEC, and IBM) are by definition synchronously consistent (and DEC deployed them successfully with inter-node distances up to 500 miles, though I won’t suggest that the link latency was unnoticeable at such distances).
“Whether file system semantics are POSIX or something else is losing value rather quickly, especially in the home and mid-markets.”
Not to people who would like their databases to be recoverable after something as common as a power failure: without guaranteed update order, most databases would likely be toast.
“I don’t agree with the point that these measures are stop-gap.”
No problem.
– bill
Lots of interesting comments on weak consistency.
Perhaps our abstract could have been clearer. Users and applications of a GIGA+ file system see strong consistency — once a file has been created in a directory, lookups on every other node see it.
What Swapnil is doing in GIGA+ is implementing the file system client in each node with weak consistency on the layout of the directory entries across all the servers. Each lookup will in fact reach across to the correct server where the actual lookup is strongly consistent. But the client might send the lookup to the wrong server first, because of its weakly consistent mapping of the layout of the directory, causing the server to correct the client’s mapping and induce the client to retry (a bounded number of times in the worst case).
The benefit of this approach is that no function ever has to grab a global lock and invalidate/update every client’s cached map of the directory layout. The most innovative parts of GIGA+ have to do with the freedom each server has to split or merge buckets of entries in a huge directory at any time, without notifying central or global data structures, and to update interested client caches relatively quickly but asynchronously (since any time a client mis-addresses a given server, the server updates the client with a dense bitmap that expresses everything it knows about the directory).
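As a reader’s aid, here is a heavily simplified sketch of that lookup/retry idea as I understand it from the description above. The real GIGA+ uses per-server split bitmaps; this toy collapses that to a single split depth, and all of the names (Server, Client, placement) are hypothetical rather than the actual GIGA+ code.

```python
# Toy model of a GIGA+-style lookup: the client's map of the directory
# layout may be stale; a mis-addressed server bounces the request back
# with its own (deeper) view, and the client retries a bounded number
# of times.  Hypothetical names; not the real implementation.

import hashlib

def h(name: str) -> int:
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def placement(bucket: int, num_servers: int) -> int:
    return bucket % num_servers              # assumed bucket -> server rule

class Server:
    def __init__(self, sid: int, num_servers: int, depth: int):
        self.sid, self.n, self.depth = sid, num_servers, depth
        self.entries = {}                    # bucket -> {name: metadata}

    def lookup(self, name: str):
        bucket = h(name) % (2 ** self.depth)
        if placement(bucket, self.n) != self.sid:
            return "wrong_server", self.depth    # correct the stale client
        return "ok", self.entries.get(bucket, {}).get(name)

class Client:
    MAX_RETRIES = 8                          # bounded in the worst case

    def __init__(self, servers, depth=0):
        self.servers, self.depth = servers, depth   # depth may be stale

    def lookup(self, name: str):
        for _ in range(self.MAX_RETRIES):
            bucket = h(name) % (2 ** self.depth)
            server = self.servers[placement(bucket, len(self.servers))]
            status, payload = server.lookup(name)
            if status == "ok":
                return payload
            self.depth = max(self.depth, payload)   # absorb the correction
        raise RuntimeError("layout map failed to converge")
```

In the real system each server’s bitmap can describe a different amount of splitting, so a lookup may bounce more than once, but as Garth notes the number of retries is still bounded in the worst case.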
Swapnil is presenting GIGA+ in a couple of hours at the Google Scalability Conference. Hope you are here to ask him questions. If not, I believe video of the talk is going to be made available.
Garth Gibson (CMU and Panasas)
That makes a lot more sense. I’m curious, though, why you use client maps rather than an algorithmic distribution (consistent hashing would seem the obvious candidate) to locate directories: if you did, then the only times that client location information would need updating would be the far rarer occasions when the system membership changed (and the location information itself would be far more concise, which might have some value for very small clients like cell phones).
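For readers unfamiliar with the term, here is a minimal consistent-hashing sketch (hypothetical names, not tied to GIGA+ or any product): keys map to positions on a hash ring, and only the keys falling in the arcs adjacent to a joining or leaving server have to move.

```python
# Minimal consistent-hashing ring.  Adding or removing a server remaps
# only the keys that fall in the affected arcs of the ring.

import bisect
import hashlib

def _h(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, vnodes=64):
        # Several virtual points per server smooth out the distribution.
        self._points = sorted((_h(f"{s}#{i}"), s)
                              for s in servers for i in range(vnodes))
        self._hashes = [p for p, _ in self._points]

    def server_for(self, key: str) -> str:
        i = bisect.bisect(self._hashes, _h(key)) % len(self._hashes)
        return self._points[i][1]

ring = Ring(["mds1", "mds2", "mds3"])
print(ring.server_for("/bigdir/bucket-0042"))   # stable until membership changes
```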
– bill
Bill,
I can come up for air again.
You caught my math error. Can’t believe I did that; NY to Boston is 218us each way. That day was the end of a long two-nighter, but another silent data corruption is now dead. Thanks for keeping me honest.
Just to clarify, file system features like “consistency” and POSIX compliance are part of the selection criteria for the application. File system semantics insulate the application from the underlying file system mechanics; no performance criteria are given in the POSIX spec.
Synchronous operations (which are also a class of “consistency”) across multiple nodes in geographically dispersed regions with varied transport quality can suffer from some serious performance challenges for many applications.
Features like asynchronous replication, which have been sold like overpriced clothing and auto accessories, are becoming embedded in file systems or at least shipped with the operating systems. You may disagree that remote or local asynchronous replication is part of the file system, but if the semantics hide the replication features from the application and the replication features execute without intervention by the application, replication must be considered part of the file system.
Examining another type of weak consistency model in storage, look at a RAID volume (pool if you prefer) with a write cache enabled (no write-through). The data written to the disk volume is technically a replicated copy (superset) of the RAID volume cache, which is synchronized to the disk at some other point in time — weak consistency at the block level.
Why would someone deliberately use a weakly consistent disk volume configuration? More than likely for the “performance improvement” experienced by enabling the write cache, a weakly consistent configuration. You can even configure your PC running Windows to run with the disk write cache enabled; just don’t lose power before the cache is committed to disk. LOL. Weak consistency is implemented in a wide variety of layers in file and data storage systems. Yes, WE, the architects, marketers and developers in the data storage industry, put weak consistency features in our products — because our customers ask for them.
2011 will be the year we start to see the shift in attitude, a migration from traditional monolithic file systems to data/file services. It won’t be driven by businesses for internal IT or by the scientific community, but by the consumerism of data. Already we can see MSOs and mobile carriers offering backup/synchronization services to their customers. Today I back up my Outlook, data files, and other non-file data, like my contacts and phone book, to Verizon and then “resynchronize” to a different device or my home or office desktop. Some information I can access directly from their web portal.
I’m not saying traditional file systems are going to go away; their capabilities will change. Lustre, which has been around since the early 90s, is currently moving toward using ZFS as the backing store, mapping its replication scheme over a traditional file system. I don’t think anyone would say that the “Lustre global file system”, using a traditional file system as a backing store, is just a file or metadata replication mechanism attached to a traditional file system.
Lastly, distance equals time and money. Two important features of file systems are integrity and availability. We currently solve the integrity part by replicating data, either as multiple copies in the clear or encoded with ECC and parity. Availability attributes include accumulated latencies from caching, disk access (if disk is used) and accumulated propagation delays of the interconnects. As replicated copies are geographically dispersed, the cost of interconnect and the impact of interconnect capacity, reliability and latency on the user experience need to be strongly considered for the solution provided.

In terms of cost, in the US we generally pay between $20 and $40/Mb/s/mo to be hooked up long haul (not including local loop fees). That’s between $0.00006172 and $0.00012345 per megabyte transferred, assuming 100% channel utilization and 1 Mb/s bandwidth. Mobile networks like EVDO are equally expensive. Local storage, on the other hand, costs about $4/GB amortized over 36 months, plus $5/mo for power to keep disks spinning. But burdened data center costs drive the cost per rack to about $10K/yr. Depending on data capacity, a data center with local data storage could easily equal the cost of long-haul-connected data storage. Depending on data center demands, it may be prudent to use long haul with a centralized data center or a high speed connection to an outsourcer like Amazon. Does it really matter, outside of liabilities, where the data is stored as long as the end user is satisfied with the features they are paying for?
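For what it’s worth, the per-megabyte figures above do check out under those assumptions (a 30-day month and 100% utilization of a 1 Mb/s link):

```python
# Quick check of the $/MB numbers: a 1 Mb/s link at 100% utilization
# for a 30-day month, with MB taken as 10^6 bytes.

seconds_per_month = 30 * 24 * 3600                               # 2,592,000 s
mb_per_month = (1_000_000 / 8) * seconds_per_month / 1_000_000   # ~324,000 MB

for monthly_fee in (20, 40):
    print(f"${monthly_fee}/Mb/s/mo -> ${monthly_fee / mb_per_month:.8f}/MB")
# matches the ~$0.0000617/MB and ~$0.0001235/MB figures above
```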
The data storage industry has to start seriously reconsidering the positioning of data storage and file systems. Traditional packaging of storage systems and file systems is at the end of its life. The historical method of tossing technologies out on the back porch, asking “does anyone want to buy this?” and hoping that some uninformed editor will think their widget is the next “shiny new object” may work for a little while, but consumers are smarter today. I’ll use iSCSI and the HBA API as a case in point.
The industry needs an end-to-end planning model: not a sales tool that places some company’s product into an application, but one that finds and plans the next incremental technological and product opportunities. One that helps us select where we should be making investments and areas we should be selecting for research. One that integrates economics, technological approaches, data format capabilities, degraded operations, error recovery, technology maturity models and user requirements and experiences.
As you said, most file system semantics behave pretty much like the POSIX de facto standard. This also means there is a lack of differentiation between file systems from an application’s perspective. As long as there is feature parity between file systems, the end user doesn’t really care if the developers used trellis encoding, Reed-Solomon, CRCs or notes tied to mice to provide the features. For the most part, end users just want their storage to work within their expectations and cost models. Whether a file system uses one hashing method vs. another isn’t important (except to patent attorneys) as long as it meets the requirements of the user, database applications included.
The bottom line with distributed file systems: file consistency depends on the workload and performance of the system – interconnects included. The client commonly needs access to the most current data available, whether stored locally or remotely, either whole or in parts, replicated data included. All the other techniques deployed in global and distributed file systems boil down to ways to improve the availability of correct, current data throughout the distribution. Not hard stuff if you know what you want. Most of these models were already explored in the ’30s, ’40s and ’50s by US and French phone companies.
-x
Bill, we do use an algorithmic distribution over a very large number of buckets, but we don’t start with very small directories spread over all of these buckets. Instead we start with one bucket and repeatedly split buckets, one or many at a time, into the pre-determined set of buckets — this is the central idea in Fagin’s 1979 seminal paper on extensible hashing. The map is a dynamic picture of the progress of that splitting process, and it allows all splitting to be locally triggered and executed by each server independently. One big reason for doing it this way is readdir speed — too many tiny buckets leads to really bad readdir speed. In terms of the size of the map, we generally compare it to the inode + indirect blocks for a file big enough to contain the directory — the classic directory implementation in a file system. By this measure, our map is small. Of course for cell phones as clients, that might not make sense. Without much thought on that problem, I’d suggest proxies on the servers for collections of cell phone clients, so the cell phones don’t have to keep even this state.
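For readers who don’t know Fagin’s scheme, here is a toy sketch of the splitting idea (a plain single-node extendible hash table, not GIGA+ itself): the directory starts with one bucket, a bucket is split only when it fills, and the table of bucket pointers doubles only when a bucket’s local depth catches up with the global depth.

```python
# Toy extendible hashing (Fagin et al., 1979): start with one bucket,
# split a bucket only when it overflows, and double the directory of
# bucket pointers only when a bucket's local depth reaches the global depth.

import hashlib

BUCKET_CAPACITY = 4

def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Bucket:
    def __init__(self, depth: int):
        self.depth = depth               # local depth of this bucket
        self.items = {}

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 0
        self.table = [Bucket(0)]         # directory of bucket pointers

    def _slot(self, key: str) -> int:
        return _h(key) & ((1 << self.global_depth) - 1)

    def insert(self, key: str, value):
        b = self.table[self._slot(key)]
        if key in b.items or len(b.items) < BUCKET_CAPACITY:
            b.items[key] = value
            return
        self._split(b)
        self.insert(key, value)          # retry after the split

    def _split(self, b: Bucket):
        if b.depth == self.global_depth:     # must double the directory first
            self.table += self.table
            self.global_depth += 1
        lo, hi = Bucket(b.depth + 1), Bucket(b.depth + 1)
        bit = 1 << b.depth
        for k, v in b.items.items():         # redistribute by the new bit
            (hi if _h(k) & bit else lo).items[k] = v
        for i, entry in enumerate(self.table):   # repoint directory slots
            if entry is b:
                self.table[i] = hi if i & bit else lo
```

Pre-splitting into all possible buckets up front would scatter a small directory into many mostly-empty buckets, which is exactly the readdir penalty Garth mentions.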
garth
xfer, you remain confused.
1. NYC to Boston is about 200 miles as the car drives, a bit less as the crow flies, but likely no less as the fiber lies. Unless you’ve managed to increase the speed of light by an impressive amount, that’s about 1 ms each way in a vacuum, around 2 ms each way in fiber.
2. Asynchronous replication *has nothing to do with consistency* (unless concurrent read access to the remote replica is allowed, in which case its consistency semantics are clear: the remote replica is always *internally* consistent with any competent asynchronous replication mechanism, but it may be a bit out of date compared with the source system).
3. Volatile write-back caching *has nothing to do with consistency*: the image all users see is always consistent, and if access is interrupted by a power failure the image that all users see after restart is also always consistent (though may not be identical to what they saw prior to the failure).
4. Non-volatile controller-level caching *has nothing to do with consistency* – and in fact nothing to do with much of anything save performance, since in a competently-designed and -managed system whether data is in that cache or on the disk platter should be completely transparent.
5. Lustre has certainly not “been around since the early 90s”: it was barely a gleam in Peter’s eye when I spoke with him in 1998, formal research began in 1999, and the first release was in 2003. And it’s hardly just a “replication scheme”: it uses file systems like ZFS only as storage for named objects that is somewhat more convenient than organizing them on block-level storage, and creates its own completely independent file system on top of that storage.
6. Replicating data contributes to its availability, not to its integrity. And the kinds of latencies under discussion here (on the order of milliseconds) are not considered relevant to availability in any analyses that I’ve ever seen.
– bill
I should have clarified the fact that post-power-failure consistency with a volatile write-back cache does depend upon a file system that uses transaction-like mechanisms (whether by using a log, BSD-style ‘soft update’ facilities, log-structured underlying storage, ZFS-style shadow paging mechanisms, etc.) coupled with judicious cache flushes to protect consistency: otherwise, the post-failure data state may be inconsistent despite the best that fixup utilities like fsck can accomplish (just as is true using a simple-minded file system write-back cache) and potential lack of consistency after such failures is a well-defined – though not necessarily well-understood – aspect of such systems (in contrast to design/implementation flaws like XFS’s inclination – finally curbed IIRC – to zero out some post-crash data despite its use of a log).
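The “judicious cache flushes” are the key ordering constraint. A minimal write-ahead-logging sketch of that idea (a generic illustration with made-up file names, not any particular file system’s code):

```python
# Write-ahead logging in miniature: force the log record describing an
# update to stable storage before the in-place update it describes, so
# after a crash the data file is either untouched or replayable.

import json
import os

def logged_update(log_path: str, data_path: str, offset: int, new_bytes: bytes):
    # 1. Append an intent record to the journal and flush it through the cache.
    with open(log_path, "a") as log:
        log.write(json.dumps({"file": data_path,
                              "offset": offset,
                              "bytes": new_bytes.hex()}) + "\n")
        log.flush()
        os.fsync(log.fileno())       # barrier: the record is durable first

    # 2. Only then apply the update in place; a crash here is recoverable
    #    by replaying the journal at restart.
    with open(data_path, "r+b") as data:
        data.seek(offset)
        data.write(new_bytes)
        data.flush()
        os.fsync(data.fileno())
```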
And any kind of master/slave replication mechanism performed at the block-write rather than the file-operation level similarly depends upon a source file system that ensures such post-crash consistency, since after a source system failure the replica must be ‘recovered’ from its crash-consistent state before being usable (thus block-write-level replication does not lend itself to concurrent reading at the replica – with the exception of shadow-paging and possibly ‘soft update’ approaches).
– bill
Hmmm – the last sentence of my next-to-last post seems to have mysteriously evaporated somewhere along the way: if Robin is up to his editing tricks again he’d damn well better stop it (posts should either stand in their entirety or be removed in their entirety – he doesn’t have any right to modify their content without so noting, thereby changing the author’s words into his own without acknowledging it), but I *could* just have fumble-fingered it out before submitting it.
In any event, the essence of that sentence was to point out the GIGO quality of xfer’s observations: being so confused in both technical and terminological areas, it’s hardly surprising that his conclusions seem so random.
– bill
Bill,
I encourage free-wheeling discussion on StorageMojo, but within some limits. That’s why – in addition to controlling comment spam – I have comment moderation turned on.
I’ve had numerous people whose expertise I respect tell me that they don’t like commenting on blogs because of the personal attacks and trolling they see on other sites. Nonetheless they do comment on StorageMojo to our benefit. I empathize with them as I have plenty of experience with gratuitous insult elsewhere.
While I and other StorageMojo readers appreciate your contributions, sometimes you veer off into areas that detract from your content. I deleted the sentence in question – rather than attempting to reword it – because of that.
If you’d rather I pulled the post entirely I will. I’d rather keep it because of the value it adds. Or if you’d like to rephrase things to make your points respectfully I’d be pleased to update the comment.
StorageMojo is largely a labor of love. I’d prefer to publish every comment as received because it is less work for me. And I certainly appreciate your technical expertise. But I want to keep the comments on StorageMojo civil to encourage contributions from everyone.
I’m not comfortable with the role of censor and perhaps I could handle these rare instances better. Would a [sentence deleted by moderator] tag help? I’m open to suggestion.
Let me know what you’d like to do.
Regards,
Robin