Google File System v2, part 2

by Robin Harris on Tuesday, 18 August, 2009

Bigtable to the rescue (sort of)
In Part 1, Sean Quinlan, a Google engineer, related how the original GFS single master architecture became a bottleneck. But since Google controls its entire software stack from OS to apps, it could compensate by tweaking the apps and the infrastructure, like Bigtable.

Bigtable is Google’s structured storage system. If you need a refresher – I did – check out Google’s Bigtable Distributed Storage System.

The short version from 3 years ago is:

Google’s Bigtable is essentially a massive, distributed 3-D spreadsheet. It doesn’t do SQL, there is limited support for atomic transactions, nor does it support the full relational database model. In short, in these and other areas, the Google team made design trade-offs to enable the scalability and fault-tolerance Google apps require. . . .Bigtable today supports almost 400 Google apps with data stores ranging up to several hundred terabytes.

Bigtable has a distributed lock server, Chubby, that coordinates the several thousand nodes in large Bigtable clusters. Presumably that is why Bigtable has been able to scale to handle many of the problems the single-master GFS has created.

But – and there’s always a but – Quinlan says that Bigtable isn’t an optimal solution to the many files/small files problem:

. . . [U]sing BigTable . . . as a way of fighting the file-count problem where you might have otherwise used a file system to handle that — then you would not end up employing all of BigTable’s functionality . . . . BigTable isn’t really ideal . . . in that it requires resources for its own operations that are nontrivial. Also, it has a garbage-collection policy that’s not super-aggressive, so that might not be the most efficient way to use your space. . . . people who have been using BigTable purely to deal with the file-count problem probably haven’t been terribly happy, but . . . it is one way for people to handle that problem.

GFS was designed to maximize bandwidth to disk as the crawlers sluiced data back to Google. Low latency was a non-goal. But as Google offered more user-facing apps, latency became important. Sean notes:

. . . if you’re writing a file, it will typically be written in triplicate—meaning you’ll actually be writing to three chunkservers. Should one of those chunkservers die . . . the GFS master will notice the problem and schedule what we call a pullchunk, which means it will basically replicate one of those chunks. That will get you back up to three copies, and then the system will pass control back to the client, which will continue writing.

When we do a pullchunk we limit it to something on the order of 5-10 MB a second. So, for 64 MB, you’re talking about 10 seconds. . . . If you are working on Gmail, however, and you’re trying to write a mutation that represents some user action, then getting stuck for a minute is really going to mess you up.

Consistency, consistency, cOnsiStencY
Since GFS and BigTable were designed to run on massive pools of commodity hardware failures and faults were a given.

For example, disk drives would tell Linux they supported some IDE versions when they really didn’t, leading to silent data corruption when drives and kernels disagreed about the drive’s state. GFS includes rigorous end-to-end check-summing to protect data from network and storage corruption, but other decisions compromised data consistency.

GFS simply assumes that there will be times that stale data is returned to applications. Data appended to an open file won’t be seen until the file is reopened.

Given that the Google owned GFS, Bigtable and the apps, it seemed acceptable to ask the apps to handle some problems. But some of the inherent problems are hard ones.

If a client crashes in the middle of a write, data could be left in an indeterminate state. The RecordAppend operation supported multiple writers to a single file, so if a primary writer failed you could end up with multiple inconsistent copies of the data in a single file – with different versions of the file in different chunks.

These things may not happen all that often, but it’s Murphy’s Law: if they can happen, they will. With several million servers and dozens of data centers, it is a continuing headache.

Sean makes an interesting comment about the GFS snapshot feature, which he calls “the most general-purpose snapshot capability you can imagine.”

I also think it’s interesting that the snapshot feature hasn’t been used more since it’s actually a very powerful feature. . . from a file-system point of view, it really offers a pretty nice piece of functionality.

I’d like to hear from Google app developers why they didn’t use the snapshot feature. I suspect it is an interesting set of reasons.

The StorageMojo take
Google engineers have been hard at work for the last 2 years building a distributed master system that will work better with Bigtable to fix many of the current problems.

Still, it is amazing that in 1 year 4 or 5 people could put together a file system critical to Google’s success for almost 10 years. It looks creaky now, but it has also scaled far beyond what it’s developers expected.

“Scalability” is one of the most abused words in the IT marketing lexicon. It is often used where “expandability” is more appropriate.

That GFS has scaled 1,000x or more is a benchmark for Internet data center infrastructure. With billions of people still not on the web and the growth of sensor networks, machine translation and other scale intensive apps, 1,000x is the new normal.

Get used to it. Plan for it.

Courteous comments welcome, of course.

{ 1 trackback }

Google waves goodbye to Mapreduce «
Sunday, 12 September, 2010 at 10:31 am

{ 1 comment… read it below or add one }

Robert Pearson Thursday, 10 September, 2009 at 7:27 am

[Enhanced Previous Post]
[Personal Statement]
These are two of your finest posts.

“Google File System v2, part 1″
“Google File System v2, part 2″

To my regret these articulate, salient posts went unanswered.
This tells me that the level of “Storage Problem Solving” is still an
art and may never be a science.

[Background for Technical Statement]
The original problem of “Access Density” was tied specifically to disk drives.
Your two posts highlight where the coming “Access Density” problem
will be joined.
When I started looking at bandwidth needs for “End-to-End Information
on Demand (E2EIoD)” there were many other areas (bottlenecks, hot spots) for
restricting or impeding the flow of Information. I tried to identify
the most important ones with the “Speed Limit of the Information
Universe” concept.

[Possible Solution]
To deal with the problem of the “fire fighting” IT mentality for bottlenecks and hotspots with $pounds and $pounds of $cure, an ounce of prevention would be cheaper and work better. The ounce of prevention would be to have Infrastructure Bandwidth costs added as a line item in the IT budget.
Two options:
1) How much bandwidth do you need? How much can you afford?
2) How much can you afford? How much do you need?

[Two Ways To Determine Bandwidth Needs]
1) Take a WAG (Wild Ass Guess) at a typical Unit of Information (UoI) for your shop. Preferrably a Managed Unit of Information (MUoI) which means there is an SLA (Service Level Agreement) assigned to the UoI. By definition a Managed Unit of Technology must reside on a Managed Unit of Technology.
A UoI may be as small as a block in a file or as large as database but represents a “functional” entity to the IT system.
There may need to be at least one of these for each Line Of Business (LOB) managed. Probably more than one MUoI. Make that MUoI the center of your Information universe. Start listing all the services it provides to application requests on the ROI (Return On Investment – making money +$$$) requirements versus all the TCO (Total Cost of Ownership – costing money -$$) requirements. Then assign a Bandwidth FOM (Figure Of Merit – another WAG?) and total them up. You begin to get an idea of the demands placed on your Information.
2) Sometimes it is easier to start from the Unit of Technology (UoT). This is a good way to learn the process since there are fewer UoT’s than UoI’s.

You can automate both of these processes. There are third party tools that will do this.

[Personal Statement]
File systems have outlived their usefulness. Prepare for the fundamental shift in the paradigm.
Backups, as currently known, are passing from the scene.
How will you do Disaster Recovery (DR) in the future?
Chris Poelker of FalconStor talks about continuous imaging for DR in his latest two posts on his blog. I second that motion… and now you have Backblaze cheap Storage to do it with…

Leave a Comment

Previous post:

Next post: