Google’s Bigtable Distributed Storage System, Pt. II

by Robin Harris on Friday, 8 September, 2006

Google’s Bigtable is essentially a massive, distributed 3-D spreadsheet. It doesn’t do SQL, it offers only limited support for atomic transactions, and it doesn’t implement the full relational database model. In short, in these and other areas, the Google team made design trade-offs to enable the scalability and fault-tolerance Google apps require. As we shall see, Bigtable today runs in almost 400 clusters, with data stores ranging up to several hundred terabytes.

Dear Buddha, please bring me a pony and a little plastic rocket…
And a few hundred tablets.

Tablets, which are collections of rows, are stored in SSTables, a name that probably has some arcane compsci meaning, but that I like to think of as Spread Sheet Tables. As mentioned, once an SSTable is written, it is never changed. If the data in the tablet changes, a new SSTable is generated.
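
To make the write-once idea concrete, here is a toy sketch (my illustration, not Google’s code): an SSTable is just a sorted, serialized blob that is frozen the moment it is written, and an update produces a brand-new blob rather than editing the old one.

```python
# Hypothetical sketch of a write-once "SSTable": rows are sorted by key and
# serialized once; the result is never modified. A change to the tablet is
# persisted by writing a NEW SSTable, leaving the old one untouched.
import json

def write_sstable(rows):
    """Serialize rows (a dict of key -> value) into an immutable, sorted blob."""
    return json.dumps(sorted(rows.items()))  # sorted by row key, frozen once written

v1 = write_sstable({"com.example/b": "old", "com.example/a": "x"})
# Updating a row yields a new SSTable; v1 is unchanged.
v2 = write_sstable({"com.example/b": "new", "com.example/a": "x"})
```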

Graven In Plated Media
The immutability of SSTables is what makes it cost-effective for them to be stored in GFS, a file system that is not optimized for small writes. GFS also stores the tablet log that records all the changes to a tablet. GFS normally stores data on three different disks on different servers so the data is highly available.

Permission To Speak, Sir!
Tablet readers and writers need to have permission, which is checked against a list maintained in Chubby. Writes are committed to the log where they are grouped for efficient writing. Reads are done against a merged view of the data. Since the rows are sorted alphabetically it is easy for the system to form the merged view.
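
The merged view falls out of that sorting almost for free. Here is a minimal sketch (names and structure are mine, not from the paper) of merging sorted runs where the newest source shadows older ones:

```python
# Illustrative sketch (not Google's code): a read merges the in-memory buffer
# with several immutable SSTables. Because every source is already sorted by
# row key, heapq.merge produces the combined sorted view in one linear pass;
# the newest source wins when a key appears in more than one place.
import heapq

def merged_view(*sources_newest_first):
    """Merge sorted (key, value) runs; the first source seen for a key wins."""
    # Tag each pair with its source rank so ties sort newest-first.
    tagged = [
        [(key, rank, value) for key, value in src]
        for rank, src in enumerate(sources_newest_first)
    ]
    result, seen = [], set()
    for key, _rank, value in heapq.merge(*tagged):
        if key not in seen:          # an older SSTable's value is shadowed
            seen.add(key)
            result.append((key, value))
    return result

memtable = [("row2", "fresh")]
sstable = [("row1", "a"), ("row2", "stale"), ("row3", "c")]
view = merged_view(memtable, sstable)
```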

Let’s Get Down!
In size, that is. Bigtable uses a number of strategies to control and reduce memory and storage consumption. For example, the contents of critical memory buffers get committed to SSTables when they grow too large.
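
A toy version of that buffer-flush strategy, with names and an absurdly small threshold of my own invention: writes pile up in a sorted in-memory buffer, and when it crosses the threshold it is frozen into an immutable run and a fresh buffer starts.

```python
# Hedged sketch of the memory-buffer flush described above. The class name,
# structure, and 3-entry threshold are illustrative, not from the paper.

class TinyTablet:
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memtable = {}
        self.sstables = []            # frozen, sorted, immutable runs

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Freeze the buffer as a sorted immutable run and start over.
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

t = TinyTablet()
for i in range(7):
    t.put(f"row{i}", i)
# Seven writes at threshold 3 leave two frozen runs and one buffered row.
```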

Client apps can optionally specify compression for SSTables, and can even specify which compression scheme to use. The results can be pretty surprising. For example, due to the repetition across webpages in a website and the types of compression used, compression ratios of 10-to-1 are common. All of a sudden 10 billion webpages take up no more storage than an uncompressed 1 billion webpages. Whew! How easy is that?
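
A quick demonstration of why repetitive web content compresses so well. Here zlib stands in for whatever schemes Google actually uses, and the toy corpus is mine; the 10-to-1 figure comes from the paper, not from this snippet.

```python
# Near-duplicate pages from one site: same template, tiny per-page changes.
# Repetition across pages is exactly what a dictionary coder feeds on.
import zlib

pages = "".join(
    f"<html><body>Product page {i}: same template, same nav bar, same footer.</body></html>"
    for i in range(1000)
)
raw = pages.encode()
packed = zlib.compress(raw)
ratio = len(raw) / len(packed)   # well above 1 on repetitive input
```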

Bigtable Bigperformance
So yada, yada, yada, these smart guys tweaked this six ways from Sunday. Who cares? What does it deliver? Good question. The paper includes some performance data based on tests the Google team ran. Interesting tidbit:

The tablet servers were configured to use 1 GB of memory and to write to a GFS cell consisting of 1786 machines with two 400 GB IDE hard drives each. . . . Each [client] machine had two dual-core Opteron 2 GHz chips, enough physical memory to hold the working set of all running processes, and a single gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. All of the machines were in the same hosting facility and therefore the round-trip time between any pair of machines was less than a millisecond.

Which I’d bet reflects today’s standard Google hardware configurations.

And this probably reflects the standard software configuration:

The tablet servers and master, test clients, and GFS servers all ran on the same set of machines. Every machine ran a GFS server. Some of the machines also ran either a tablet server, or a client process, or processes from other jobs that were using the pool at the same time as these experiments.

In other words, the average Google Bigtable server is also running a GFS server and is not 100% dedicated to Bigtable.

The benchmark description is too long and involved for my short attention span. To focus on the bad news, the worst scaling result was in the random reads, where

The random read benchmark shows the worst scaling (an increase in aggregate throughput by only a factor of 100 for a 500-fold increase in number of servers).

Which is still a result that would, as a thesis topic, get you a compsci PhD. Except these folks already have theirs.
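
The arithmetic behind that scaling complaint, spelled out: if 500 times the servers yields only 100 times the aggregate throughput, each server is doing a fifth of the work it did alone.

```python
# Per-server efficiency implied by the paper's random-read scaling numbers.
servers_factor = 500        # 500-fold increase in number of servers
throughput_factor = 100     # only a 100-fold increase in aggregate throughput
per_server_efficiency = throughput_factor / servers_factor  # 20% of solo rate
```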

American Idle
As of August 2006, Google has 388 Bigtable clusters running, and 12 of those have over 500 tablet servers each. One busy group of clusters handles more than 1.2 million requests per second, using more than 8,000 tablet servers. Do the math: each tablet server handles about 150 requests per second, a number that server makers would be too embarrassed to quote, and that gets to the heart of Google’s architectural philosophy: use all the software you need and throw cheap hardware at the problem.
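
The "do the math" above, spelled out:

```python
# Aggregate load spread across the serving pool, using the figures in the text.
requests_per_sec = 1_200_000
tablet_servers = 8_000
per_server = requests_per_sec / tablet_servers  # about 150 requests/sec/server
```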

I also went over to the TPC benchmarks website to see if there was anything there I could compare this to. The short answer is no, but I went ahead anyway. It is an apples-to-oranges exercise (TPC has a set of rules and expectations that Google’s performance tests come nowhere near satisfying), yet a thought-provoking one, since Google is doing everything on commodity hardware with open-source and home-built software.

The Benchmark Guys Will Never Forgive Me – But Here Goes
IBM mainframes are the OLTP champs, with over 4 million TPC-C transactions per minute, while the best cluster solution, from HP, does about 1.2 million per minute. So Google’s not-comparable result of 72 million requests per minute from a cluster that I estimate cost about the same as the IBM mainframe’s $12 million list price (which includes maintenance) suggests that they are in the ballpark in terms of efficiency.
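
The back-of-envelope conversion behind that comparison: put Google's per-second rate into per-minute terms so it lines up with TPC-C's transactions-per-minute metric. (Apples to oranges, as noted: Bigtable requests are not TPC-C transactions.)

```python
# Units conversion for the rough comparison in the text.
google_req_per_min = 1_200_000 * 60   # 72 million "requests" per minute
ibm_tpmc = 4_000_000                  # best mainframe TPC-C result cited
hp_cluster_tpmc = 1_200_000           # best cluster TPC-C result cited
ratio_vs_mainframe = google_req_per_min / ibm_tpmc  # at a similar ~$12M cost
```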

So How About Some REAL Apps?
Real for Google, anyway.

  • Google Analytics. This program tracks traffic patterns at participating websites and maintains a ~200 TB raw click table (with 80 billion cells) and a ~20 TB summary table.
  • Google Earth. You’ve played with this one: high-res earth photos. The main table contains 70 TB of data, with a relatively modest ~500 GB index file that must serve tens of thousands of queries per second.
  • Crawl. One of the crawl tables holds 800 TB of data in about one trillion cells.

Lessons Learned
This is my favorite part of the paper because they talk about what went wrong, rarely seen in marketing-vetted heroic engineering stories. But these folks are scientists, and they want to show that they are still awake. Good.

  • Everything breaks. One can only imagine the stories behind this list of woes: “. . . we have seen problems due to all of the following causes: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance.”
  • Understand the use of new features. Add only what users really need, not a generalized solution to the entire problem space. In my experience this is a huge trap for developers, aided and abetted by mediocre marketing.
  • System-level monitoring. You can’t fix it if you don’t know it’s broke. Monitor not only Bigtable, but also Chubby, GFS and client processes.
  • Simple designs are good. For example, only rely upon widely-used features in your code and in the systems you interact with.

StorageMojo’s Conclusion
Bigtable reflects Google’s unique design culture and huge scale. Instead of trying to be all things to all people, Bigtable is a few things to a wide range of applications. The benefits are massive scale, high reliability, low cost and reduced maintenance. And it may be a launch pad for more Google ventures.

As DTrace developer Bryan Cantrill once noted to me, in reference to Google, “you can’t spray on transaction processing”. With Bigtable’s limited atomicity, application-dependent consistency, uncertain isolation and excellent durability, it certainly doesn’t meet the normal requirements of an ACID database for transaction processing. Yet perhaps, if Google decides to go after eBay (despite their recent pact), they could develop a set of applications that work around the limitations of Bigtable. With a cost structure that is a fraction of eBay’s, they could rapidly grab the business of millions of small sellers just by undercutting eBay’s rising prices.

Bigtable, like GFS, is no direct threat to the enterprise database market. It is too specialized and limited. Yet it does show what a clean-sheet rethinking of distributed storage can accomplish. I predict there will be a number of startups over the next five years that try to bring the benefits of the Google architecture to new applications in the enterprise data center.

As ever, your comments welcome.
Oh, and BTW, I upgraded to a new version of WordPress and haven’t figured out how to turn off registration without turning off moderation. So please take a moment to register until I figure it out. Thanks.


jmd2121 Wednesday, 27 September, 2006 at 3:38 pm

The phrase “the Google team made design trade-offs to enable the scalability and fault-tolerance” is a critical point. Google is really good at making apps and search that point TO the answers, but seems to have missed the point that what people (and the computers) really need ARE the answers. Hackerville. ;)

jones Tuesday, 20 February, 2007 at 3:32 pm

“IBM mainframes are the OLTP champs”

Try a few Intel Woodcrests in a “Standard” 20-30 SCSI drive array.

Gaju Friday, 17 July, 2009 at 9:53 pm

This article was super-helpful. I did what would make you light your hair on fire and then try to put it out with an icepick: I went through the academic paper twice and cobbled together a decent understanding of what it’s about. But this summary, apart from being far more readable, hangs the different aspects of Bigtable on the right pegs.

