Bigtable to the rescue (sort of)
In Part 1, Sean Quinlan, a Google engineer, related how the original GFS single master architecture became a bottleneck. But since Google controls its entire software stack from OS to apps, it could compensate by tweaking the apps and the infrastructure, like Bigtable.

Bigtable is Google’s structured storage system. If you need a refresher – I did – check out Google’s Bigtable Distributed Storage System.

The short version from 3 years ago is:

Google’s Bigtable is essentially a massive, distributed 3-D spreadsheet. It doesn’t do SQL, there is limited support for atomic transactions, nor does it support the full relational database model. In short, in these and other areas, the Google team made design trade-offs to enable the scalability and fault-tolerance Google apps require. . . .Bigtable today supports almost 400 Google apps with data stores ranging up to several hundred terabytes.

Bigtable has a distributed lock server, Chubby, that coordinates the several thousand nodes in large Bigtable clusters. Presumably that is why Bigtable has been able to scale to handle many of the problems the single-master GFS has created.

But – and there’s always a but – Quinlan says that Bigtable isn’t an optimal solution to the many files/small files problem:

. . . [U]sing BigTable . . . as a way of fighting the file-count problem where you might have otherwise used a file system to handle that — then you would not end up employing all of BigTable’s functionality . . . . BigTable isn’t really ideal . . . in that it requires resources for its own operations that are nontrivial. Also, it has a garbage-collection policy that’s not super-aggressive, so that might not be the most efficient way to use your space. . . . people who have been using BigTable purely to deal with the file-count problem probably haven’t been terribly happy, but . . . it is one way for people to handle that problem.

GFS was designed to maximize bandwidth to disk as the crawlers sluiced data back to Google. Low latency was a non-goal. But as Google offered more user-facing apps, latency became important. Sean notes:

. . . if you’re writing a file, it will typically be written in triplicate—meaning you’ll actually be writing to three chunkservers. Should one of those chunkservers die . . . the GFS master will notice the problem and schedule what we call a pullchunk, which means it will basically replicate one of those chunks. That will get you back up to three copies, and then the system will pass control back to the client, which will continue writing.

When we do a pullchunk we limit it to something on the order of 5-10 MB a second. So, for 64 MB, you’re talking about 10 seconds. . . . If you are working on Gmail, however, and you’re trying to write a mutation that represents some user action, then getting stuck for a minute is really going to mess you up.

Consistency, consistency, cOnsiStencY
Since GFS and BigTable were designed to run on massive pools of commodity hardware failures and faults were a given.

For example, disk drives would tell Linux they supported some IDE versions when they really didn’t, leading to silent data corruption when drives and kernels disagreed about the drive’s state. GFS includes rigorous end-to-end check-summing to protect data from network and storage corruption, but other decisions compromised data consistency.

GFS simply assumes that there will be times that stale data is returned to applications. Data appended to an open file won’t be seen until the file is reopened.

Given that the Google owned GFS, Bigtable and the apps, it seemed acceptable to ask the apps to handle some problems. But some of the inherent problems are hard ones.

If a client crashes in the middle of a write, data could be left in an indeterminate state. The RecordAppend operation supported multiple writers to a single file, so if a primary writer failed you could end up with multiple inconsistent copies of the data in a single file – with different versions of the file in different chunks.

These things may not happen all that often, but it’s Murphy’s Law: if they can happen, they will. With several million servers and dozens of data centers, it is a continuing headache.

Sean makes an interesting comment about the GFS snapshot feature, which he calls “the most general-purpose snapshot capability you can imagine.”

I also think it’s interesting that the snapshot feature hasn’t been used more since it’s actually a very powerful feature. . . from a file-system point of view, it really offers a pretty nice piece of functionality.

I’d like to hear from Google app developers why they didn’t use the snapshot feature. I suspect it is an interesting set of reasons.

The StorageMojo take
Google engineers have been hard at work for the last 2 years building a distributed master system that will work better with Bigtable to fix many of the current problems.

Still, it is amazing that in 1 year 4 or 5 people could put together a file system critical to Google’s success for almost 10 years. It looks creaky now, but it has also scaled far beyond what it’s developers expected.

“Scalability” is one of the most abused words in the IT marketing lexicon. It is often used where “expandability” is more appropriate.

That GFS has scaled 1,000x or more is a benchmark for Internet data center infrastructure. With billions of people still not on the web and the growth of sensor networks, machine translation and other scale intensive apps, 1,000x is the new normal.

Get used to it. Plan for it.

Courteous comments welcome, of course.