Google File System Eval: Part I

GFS: Fundamental Paradigm Shift or One-off Oddity?
As regular readers know, I believe that the current model of enterprise storage is badly broken. When I see something that blows that model away I like to learn more. In particular, I assess the marketability of the innovation. So a cool technology, like GFS or backup compression, makes me wonder if and how customers would buy it.

This article offers a 100,000 foot view of GFS by way assessing its commercial viability. If you want a technical article about GFS, I can only recommend The Google File System by Ghemawat, Gobioff, & Leung, from which this article draws heavily. The Wikipedia article, oddly, isn’t very good on GFS (as of 5-16-06). If you don’t have a BS/CS or better you’ll likely find Ghemawat et. al. a slog. Probably worthwhile. Good for temporary relief of insomnia.

Google From Space
GFS is one of the key technologies that enables the most powerful general purpose cluster in history. While most IT folks lose sleep over keeping the Exchange server backed up for a couple of thousand users, Google’s infrastructure both supports massive user populations and the regular roll out of compute and data intensive applications that would leave most IT ops folks gibbering in fear. How do they do it?

Partly they are smarter than you. Google employs hundreds of CompSci PhDs as well as many more hundreds of really smart people. Partly it is their history: impoverished PhD candidates can’t afford fancy hardware to build their toys, so they started cheap and got cheaper. And finally, being really smart and really poor, they rethought the whole IT infrastructure paradigm.

Their big insight: rather than build availability into every IT element at great cost, build availability around every element at low cost. Which totally changes the economics of IT, just as the minicomputer in the ’70s, the PC in the ’80’s and the LAN in the ’90’s all did. Only more so. When processors, bandwidth and storage are cheap you can afford to spend lots of cycles on what IBM calls autonomic computing. With the system properly architected and cheap to build out, it scales both operationally and economically.

All that said, Google hasn’t done anything unique with their platform that other people hadn’t already. They just put it together and scaled it to unprecedented heights.

Note to CIOs: it isn’t going to take your users long to notice that Google can do this stuff and you can’t. Your life isn’t going to get any easier.

GFS From Low-Earth Orbit
Despite the name, GFS (not to be confused (although how exactly I don’t know) with Sistina’s GFS – maybe we should call it GooFS) is not just a file system. It also maintains data redundancy, supports low-costs snapshots, and, in addition to normal create, delete, open, close, read, write operations also offers a record append operation.

That record append operation reflects part of the unique nature of the Google workload: fed by hundreds of web-crawling bots, Google’s data is constantly updated with large sequential writes. Rather than synchronize and coordinate the overwriting of existing data it is much cheaper to simply append new data to existing data.

Another feature of the Google workload is that it mostly consists of two kinds of reads: large streaming reads and small random reads. As large reads and writes are so common, GFS is optimized for sustained bandwidth rather than low latency or IOPS. As multi-gigabyte files are the common case, GFS is optimized for handling a few million files, so, doing the math, a single GFS should be able to handle a few petabytes of active data.

All this is built on very cheap components whose frequent failure, given the size of cluster, is expected. The system monitors itself and detects, tolerates, and recovers quickly from component failures, including disk, network and server failures.

GFS From An SR-71 Blackbird
A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients. Each of these is typically a dirt-cheap Linux box (lately dual 2 GHz xeons with 2 GB ram and ~800GB of disk).

Files are divided into chunks, each identified by a unique 64-bit handle, and are stored on the local systems as Linux files. Each chunk is replicated at least once on another server, and the default is three copies of every chunk (take that RAID-6 fanboys!). The chunks are big, like the files they make up: 64MB is the standard chunk size. The chunkservers don’t cache file data since the chunks are stored locally and the Linux buffer cache keeps frequently accessed data in memory.

If, like me, you thought bottleneck/SPOF when you saw the single master, you would, like me, have been several steps behind the architects. The master only tells clients (in tiny multibyte messages) which chunkservers have needed chunks. Clients then interact directly with chunkservers for most subsequent operations. Now grok one of the big advantages of a large chunk size: clients don’t need much interaction with masters to gather access to a lot of data.

That covers the bottleneck problem, but what about the SPOF (single point of failure) problem? We know the data is usually copied three times — when disk is really cheap you can afford that — but what about the all-important metadata that keeps track of where all the chunks are?

The master stores — in memory for speed — three major types of metadata:

  • File and chunk names [or namespaces in geekspeak]
  • Mapping from files to chunks, i.e. the chunks that make up each file
  • Locations of each chunk’s replicas

So if the master crashes, this data has to be replaced pronto. The first two — namespaces and mapping — are kept persistent by a log stored on the master’s local disk and replicated on remote machines. This log is checkpointed frequently for fast recovery if a new master is needed. How fast? Reads start up almost instantly thanks to shadow masters who stay current with the master in the background. Writes pause for about 30-60 seconds while the new master and the chunkservers make nice. Many RAID arrays recover no faster.

The last type of metadata, replica locations, is stored on each chunkserver — and copied on nearby machines — and given to the master at startup or when a chunkserver enters a cluster. Since the master controls the chunk placement it is able to keep itself up-to-date as new chunks get written.

The master also keeps track of the health of the cluster through handshaking with all the chunkservers. Data corruption is detected through checksumming. Even so, data may still get pooched. Thus the GFS reliance on appending writes instead of overwrites; combined with frequent checkpoints, snapshots and replicas, the chance of data loss is very low, and results in data unavailability, not data corruption.

They don’t call it that, but cares about storage, and I find this part particularly interesting. GFS doesn’t use any RAID controllers, fibre channel, iSCSI, HBAs, FC or SCSI disks, dual-porting or any of the other costly bling we expect in a wide-awake data center. And yet it all works and works very well.

Take replica creation or what you and I would call mirroring. All the servers in the cluster are connected over a full duplex switched Ethernet fabric with pipelined data transfers. This means that as soon as a new chunk starts arriving, the chunkserver can begin making replicas at full network bandwidth (about 12MB/sec) without reducing the incoming data rate. As soon as the first replica chunkserver has received some data it repeats the process, so the two replicas are completed soon after the first chunk write finishes.

In addition to creating replicas quickly, the master’s replica placement rules also spread them across machines and across racks, to limit the chance of data unavailability due to power or network switch loss.

Pooling and Balancing
Storage virtualization may be on the downside of the hype cycle, and looking at GFS you can see what simple virtualization looks like when built into the file system. Instead of a complex software layer to “pool” all the blocks across RAID arrays, GFS masters place new replicas on chunkservers with below average disk utilization. So over time disk utilization equalizes across servers without any tricky and expensive software.

The master also rebalances replicas periodically, looking at disk space utilization and load balancing. This process also keeps a new chunkserver from being swamped the moment it joins the cluster. The master allocates data to it gradually. The master also moves chunks from chunkservers with above average disk utilization to equalize usage.

Storage capacity is reclaimed slowly. Rather than eager deletion, the master lets old chunks hang around for a few days and reclaims storage in batches. Done in the background when the master isn’t too busy, it minimizes the impact on the cluster. In addition, since the chunks are renamed rather than deleted, the system provides another line of defense against accidental data loss.

Cap’n, The Dilithium Crystals Canna Take ‘N’More! Oh, Shut Up, Scotty.
Google ran a couple of tests to test dilithium crystals GFS clusters.

We all know this must work in the real world since we all use Google everyday. But how well does it work? In the paper they present some statistics from a couple of Google GFS clusters.

Read Part II of Google File System Eval

Why an SR-71? Because it is the coolest airplane ever built.