Google File System Eval: Part II

In yesterday’s post I ran through a quick (really, it was!) overview of the Google File System’s organization and storage-related features such as RAID and high-availability. I want to offer a little more data about the performance of GFS before offering my conclusion about the marketability of GFS as a commercial product.

The paper, The Google File System by Ghemawat, Gobioff, and Leung, includes some interesting performance data. The examples can’t be regarded as representative, since we don’t know enough about the population of GFS clusters at Google, so any conclusions drawn from them are necessarily tentative.

They looked at two GFS clusters configured like this:

Cluster                      A         B
Chunkservers                 342       227
Available Disk Cap.          72 TB     180 TB
Used Disk Cap.               55 TB     155 TB
Number of Files              735 k     737 k
Number of Dead Files         22 k      232 k
Number of Chunks             992 k     1550 k
Metadata at Chunkservers     13 GB     21 GB
Metadata at Master           48 MB     60 MB

So we have a couple of fair-sized storage systems, one using about 76% of available space (55 of 72 TB) and the other about 86% (155 of 180 TB). Respectable numbers for any data center storage manager. We also see that chunk metadata appears to scale linearly with the number of chunks, at roughly 13 KB per chunk on both clusters. Good. Dividing used capacity by the number of files, the average file sizes appear to be about 75 MB for A and 210 MB for B, so files on A are roughly one-third the size of those on B. Both are much larger than the average data center file size.
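The arithmetic behind those observations can be checked with a short script. This is only a back-of-envelope sketch: the figures are copied from the table above, decimal units (1 TB = 10^12 bytes) are an assumption, and the `summarize` helper is a name made up for illustration. It divides used disk capacity by the file count, just as the paragraph above does; if the used figure counts every replica, the logical file sizes would be correspondingly smaller.

```python
# Figures copied from the cluster table; decimal units are an assumption.
TB, GB, MB, KB = 1e12, 1e9, 1e6, 1e3

clusters = {
    "A": {"available_tb": 72, "used_tb": 55, "files_k": 735,
          "chunks_k": 992, "chunk_meta_gb": 13},
    "B": {"available_tb": 180, "used_tb": 155, "files_k": 737,
          "chunks_k": 1550, "chunk_meta_gb": 21},
}

def summarize(c):
    """Return (utilization, average file size in MB, chunk metadata per chunk in KB)."""
    utilization = c["used_tb"] / c["available_tb"]
    avg_file_mb = c["used_tb"] * TB / (c["files_k"] * 1e3) / MB
    meta_per_chunk_kb = c["chunk_meta_gb"] * GB / (c["chunks_k"] * 1e3) / KB
    return utilization, avg_file_mb, meta_per_chunk_kb

for name, c in clusters.items():
    u, f, m = summarize(c)
    print(f"Cluster {name}: {u:.0%} used, ~{f:.0f} MB/file, ~{m:.1f} KB metadata/chunk")
```

Run as written, this reports about 76% utilization, 75 MB per file, and 13.1 KB of metadata per chunk for A, and about 86%, 210 MB, and 13.5 KB for B. The near-identical per-chunk figures are what suggest the linear scaling.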

Next we get some performance data for the two clusters:

Cluster                        A           B
Read Rate – last minute        583 MB/s    380 MB/s
Read Rate – last hour          562 MB/s    384 MB/s
Read Rate – since restart      589 MB/s    49 MB/s
Write Rate – last minute       1 MB/s      101 MB/s
Write Rate – last hour         2 MB/s      117 MB/s
Write Rate – since restart     25 MB/s     13 MB/s
Master Ops – last minute       325 Op/s    533 Op/s
Master Ops – last hour         381 Op/s    518 Op/s
Master Ops – since restart     202 Op/s    347 Op/s

Just as the gentlemen said, there is excellent sequential read performance, very good sequential write performance, and unimpressive small write performance. Looking at cluster A’s write rate of 1 MB/s over the last minute, I infer that it was performing about 125 small writes per second, assuming they averaged about 8 KB each. Clearly, not ready for the heads-down, 500-desk, Oracle call center. Not the design center either. It appears to me, though, that this performance would compete handily with an EMC Centera or even the new NetApp FAS6000 series on a large-file workload. Not bad for a three-year-old system constructed from commodity parts.
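The small-write estimate for cluster A is simple division, sketched below. The 8 KB average write size is this post’s guess, not a figure from the paper, and decimal units (1 MB = 1000 KB) are assumed; the function name is made up for illustration. Dividing 1 MB/s by 8 KB per write gives about 125 writes per second.

```python
def writes_per_second(rate_mb_per_s, avg_write_kb):
    """Estimate write operations per second from throughput and an assumed write size."""
    return rate_mb_per_s * 1000 / avg_write_kb

# Cluster A, "Write Rate – last minute" = 1 MB/s; 8 KB per write is an assumption.
per_second = writes_per_second(1, 8)
per_minute = per_second * 60
print(per_second, per_minute)  # 125.0 7500.0
```

Changing the assumed write size changes the answer proportionally, so the estimate is only as good as that guess.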

The GFS implementation we’ve looked at here offers many winning attributes.
These include:

  • Availability. Triple redundancy (or more if users choose), pipelined chunk replication, rapid master failovers, intelligent replica placement, automatic re-replication, and cheap snapshot copies. All of these features deliver what Google users see every day: datacenter-class availability in one of the world’s largest datacenters.
  • Performance. Most workloads, even databases, are about 90% reads. GFS performance on large sequential reads is exemplary. It was child’s play for Google to add video download to their product set, and I suspect their cost-per-byte is better than YouTube or any of the other video sharing services.
  • Management. The system offers much of what IBM calls “autonomic” management. It manages itself through multiple failure modes, offers automatic load balancing and storage pooling, and provides features, such as snapshots and the three-day window during which dead chunks remain on the system, that give management an extra line of defense against failures and mistakes. I’d love to know how many sysadmins it takes to run a system like this.
  • Cost. Storage doesn’t get any cheaper than ATA drives in a system box.

Yet as a general purpose commercial product, it suffers some serious shortcomings.

  • Performance on small reads and writes, which it wasn’t designed for, isn’t good enough for general data center workloads.
  • The record append file operation and the “relaxed” consistency model, while excellent for Google, wouldn’t fit many enterprise workloads. It might be that email systems, where SOX requirements are pushing retention, could be redesigned to eliminate deletes. But since appending is key to GFS write performance in a multi-writer environment, GFS might give up much of its performance advantage in the enterprise, even on large serial writes.
  • Lest we forget, GFS is NFS, not for sale. Google must see its infrastructure technology as a critical competitive advantage, so it is highly unlikely to open source GFS any time soon.

Looking at the whole gestalt, even assuming GFS were for sale, it is a niche product and would not be very successful on the open market.

As a model for what can be done, however, it is invaluable. The industry has striven for the last 20 years to add availability and scalability to an increasingly untenable storage model of blocks and volumes by building ever-costlier “bulletproof” devices.

GFS breaks that model and shows us what can be done when the entire storage paradigm is rethought. Build the availability around the devices, not in them. Treat the storage infrastructure as a single system, not a collection of parts. Extend the file system paradigm to include much of what we now consider storage management, including virtualization, continuous data protection, load balancing, and capacity management.

GFS is not the future. But it shows us what the future can be.