Google rolls out new applications to millions of users with surprising frequency, which is pretty amazing all by itself. Yet when you look at the variety of the applications, ranging from data-sucking behemoths like webcrawling to intimate apps like Personalized Search and Writely, it is even more startling. How does the Google architecture manage the conflicting requirements of such a wide range of workloads? Bigtable, a Google-developed distributed storage system for structured data, is a big piece of the answer.
Isn’t The Google File System The Answer?
Another part, yes. And like a fractal, we see some of the same patterns repeating. Which isn’t too surprising, for two reasons. First, all Google apps need to scale way beyond what most commercial systems ever consider. Second, Sanjay Ghemawat, a Google Fellow and former DEC researcher whose long-time interests include large-scale, safe, persistent storage, is a designer of not only Bigtable, but of the Google File System and MapReduce, Google’s tool for processing large data sets. The man’s got big on the brain and he’s in the right playpen.
If It’s a Storage System, Where Are The Disks?
Don’t be so literal-minded. For these guys “storage” is where in data space you put the data. Not only disks, but data structures and flows. Lock management. Data layout. And more. The disks are there, under GFS and the local OS on the servers.
This article is adapted from a paper entitled Bigtable: A Distributed Storage System for Structured Data (PDF) that was just released. If you like reading PhD-level comp-sci conference papers, go right to it. Me, I’d rather light my hair on fire and put it out with an ice pick. To each his own.
Scale To Thousands Of Terabytes And Servers
Every good product solves a problem. Bigtable is designed to solve several: scaling to petabytes of data and thousands of servers; supporting workloads ranging from throughput-intensive webcrawlers to latency-sensitive end users; and providing high availability despite disk, server and network failures.
So, It’s A Database, Right?
They say it resembles a database in many ways, but they don’t call it a database. For one thing, it doesn’t support the full relational DB model. Nor does it speak SQL. Atomic transactions are limited. And the data elements are simply strings of characters, so it is up to the applications to know what the data is and how it is structured. So Oracle stockholders can relax. Bigtable is not a commercial database. It wouldn’t pass the ACID test.
So It’s The Mother Of All Spreadsheets?
Except no nifty 3-D bar charts. But I hate those anyway. Bigtable stores data in, surprise, tables that are addressed by row name, column name, and timestamp. So if you wanted to store, hypothetically, several billion webpages, you’d use the URLs as row names; features of the webpages (like language or keywords) as column names; and changed content, like a new post, would be timestamped and stored in the cell alongside the original.
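The row/column/timestamp idea is easier to see in code. Here’s a toy sketch in Python — the names are invented for illustration, this is not Google’s API — of a sparse map from (row, column, timestamp) to an uninterpreted string:

```python
# Toy sketch of Bigtable's data model: a sparse map from
# (row, column, timestamp) to an uninterpreted string value.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    """Return the newest value stored for a cell, or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# A webpage row: columns hold features, timestamps hold versions.
put("www.cnn.com/index", "language:", 1, "EN")
put("www.cnn.com/index", "contents:", 1, "<html>v1</html>")
put("www.cnn.com/index", "contents:", 2, "<html>v2</html>")
```

Both versions of the page coexist in the same cell; a read simply picks the newest unless asked otherwise.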
So Far, I Get It. But How Do They Scale To Petabytes?
Patience, grasshopper. First of all, the rows are maintained in alphabetic order (which the paper calls lexicographic order) by row name. The row ranges are broken up into partitions called tablets. Tablets are distributed across servers for load balancing. As with large chunk size RAID 5, if your request for a group of rows falls within a single tablet, you get efficient I/O and minimize network traffic. It is this concept of the tablet and how it is implemented that gives Bigtable much of its power. But before I go into that, there are a few other things you should know.
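The sorted-rows-into-tablets idea can be sketched in a few lines. This is a toy model, not how Bigtable actually routes requests: each tablet is a contiguous row range, named here by its last row, and a binary search finds the tablet for any row.

```python
import bisect

# Toy sketch: rows are kept in lexicographic order and partitioned into
# tablets (contiguous row ranges). Each tablet is identified by its end row.
tablet_ends = ["g", "p", "zzzz"]   # three tablets: [..g], (g..p], (p..zzzz]

def tablet_for(row):
    """Binary-search the tablet end keys to find which tablet serves a row."""
    return bisect.bisect_left(tablet_ends, row)

# Rows that sort near each other land on the same tablet, so a short
# range scan touches one server.
assert tablet_for("apple") == tablet_for("banana") == 0
assert tablet_for("house") == 1
assert tablet_for("zebra") == 2
```

That locality is the point: choose row names so related data sorts together, and most scans stay on one tablet.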
So Bigtable Stores Everything Forever . . .
No. That would be a stretch, even for Google-sized data centers, and the data would get unwieldy. So Bigtable lets applications tell columns to keep either the last n versions of a cell or only the versions written in the last n days. The old data eventually gets garbage-collected out of Bigtable. Applications that want it can also keep data indefinitely.
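Those two retention policies are simple enough to sketch. A toy version (the function name and signature are invented, not Bigtable’s):

```python
def garbage_collect(versions, now, max_versions=None, max_age_days=None):
    """Toy sketch of per-column retention: keep the last n versions,
    or only versions newer than n days. versions is a list of
    (timestamp, value) pairs; timestamps here are in days."""
    versions = sorted(versions, reverse=True)        # newest first
    if max_versions is not None:
        versions = versions[:max_versions]           # last n versions
    if max_age_days is not None:                     # last n days
        versions = [(ts, v) for ts, v in versions if now - ts <= max_age_days]
    return versions

history = [(1, "old"), (5, "newest"), (3, "middle")]
```

Anything the policy drops is what eventually gets garbage-collected.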
Transactional Analysis: I’m OK, You’re OK, If You = Row
There is a reason that webpage URLs are used as row names in the example above. Bigtable supports atomic read-modify-write transactions, but only within a single row. So when you want a complete update of that webpage, it needs to be stored in a row, not a column.
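One way to picture single-row atomicity — and this is a toy analogy, not Bigtable’s implementation — is a lock per row: a read-modify-write takes exactly one row’s lock, and there is no way to lock two rows in one transaction.

```python
import threading

# Toy sketch: atomicity holds per row, so a read-modify-write takes a
# single row's lock; there are no cross-row transactions.
rows = {"www.cnn.com/index": {"fetch_count": 0}}
row_locks = {}

def read_modify_write(row, fn):
    lock = row_locks.setdefault(row, threading.Lock())
    with lock:                        # atomic with respect to this row only
        rows[row] = fn(dict(rows[row]))

read_modify_write("www.cnn.com/index",
                  lambda r: {**r, "fetch_count": r["fetch_count"] + 1})
```

Put everything a webpage update touches in one row, and the whole update is atomic; spread it across rows and it isn’t.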
Talk To Me, You Big Oaf
Bigtable’s API supports pretty much what you would expect, if you expect anything of a database. You can read, write, update, delete and control access. You can limit the rows, columns and timestamps produced by a scan of the data. There is also a scripting language, with the great name Sawzall, that enables a variety of data transformations, filtering and summarization. It isn’t SQL, and there are probably some things the Google apps people would love to see added, but it’s working today.
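A bounded scan — limit the rows, limit the columns, limit the timestamps — looks roughly like this toy sketch (names and signature invented, not the real API):

```python
# Toy sketch of a bounded scan over (row, column, timestamp) -> value cells.
cells = {
    ("a.com", "contents:", 2): "new",
    ("a.com", "contents:", 1): "old",
    ("a.com", "language:", 1): "EN",
    ("b.com", "contents:", 1): "page",
}

def scan(start_row, end_row, columns=None, min_ts=0):
    """Yield cells in row order, restricted by row range, column set,
    and minimum timestamp."""
    for (row, col, ts) in sorted(cells):
        if start_row <= row < end_row and ts >= min_ts \
           and (columns is None or col in columns):
            yield (row, col, ts, cells[(row, col, ts)])

results = list(scan("a.com", "b.com", columns={"contents:"}, min_ts=2))
```

Because rows come back in sorted order, the filters just trim what a sequential sweep of one row range would produce.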
Like man, Bigtable is no island. It relies on several other systems to do what it does. Once I discuss them I’ll get back to Bigtable proper.
- GFS. Bigtable uses the Google File System to store data and log files. Regular StorageMojo.com readers know GFS imparts all kinds of performance and availability advantages without costly RAID arrays.
- Cluster management. Google has a cluster management system, so far publicly undocumented (maybe they’re embarrassed), that schedules, monitors and manages the Bigtable cluster.
- SSTable. This is the underlying file format used to store Bigtable data. SSTables are designed so that a data access requires, at most, a single disk access. An SSTable, once created, is never changed. If new data is added, a new SSTable is created. Once an old SSTable is no longer needed, it is set out for garbage collection. SSTable immutability is at the core of Bigtable’s data checkpointing and recovery routines.
- Chubby. Cute name, huh? Chubby is the distributed lock server that allows a multi-thousand node Bigtable cluster to stay coordinated. Chubby itself is a cluster app that maintains five active replicas, one of which is the master. Like GFS’s master node, Chubby is architected to keep lock management traffic very light. Chubby also rules over tablet server life and death, stores access control lists, data schemas and the bootstrap location of Bigtable data.
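The SSTable behavior described above — never change a file, just write new ones and let reads merge — can be sketched in a few lines. A toy model, not the real on-disk format:

```python
# Toy sketch of SSTable immutability: new writes produce new sorted
# tables; old tables are never modified, and reads merge across them
# with the newest table winning.
sstables = []   # list of sorted dicts, oldest first

def flush(memtable):
    """Freeze a batch of writes into a new, immutable SSTable."""
    sstables.append(dict(sorted(memtable.items())))

def read(key):
    for table in reversed(sstables):   # newest SSTable wins
        if key in table:
            return table[key]
    return None

flush({"a": 1, "b": 2})
flush({"b": 3})   # supersedes b, but the older SSTable is untouched
```

Since a finished SSTable never changes, checkpointing and recovery reduce to remembering which files exist — which is exactly why immutability is at the core of Bigtable’s recovery routines.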
Take 2,000 Tablets And Call Me In The Morning
Bigtable has three main components:
- A library linked into every client
- A master server (why couldn’t they call it Master Chief?)
- And many tablet servers
The master assigns tablets (remember those? a collection of rows) to tablet servers, balances the tablet server load, detects the loss or addition of tablet servers, performs garbage collection and handles some other chores. As with the GFS master node, client data doesn’t move through the master. In fact, the system is architected so that most clients never communicate with the master at all, which helps keep the master lightly loaded in practice.
With These Tablets We’d Have Had 2^27 Commandments
Each tablet server typically manages between 10 and 1,000 tablets. Each tablet averages about 100-200 MB. Each Bigtable table commonly consists of multiple tablets. Each tablet contains all the data of a group of rows. A newly created table consists of one tablet. As it grows it is dynamically broken into multiple tablets.
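Some quick back-of-the-envelope arithmetic from those figures (an illustration, not anything the paper specifies):

```python
# Rough arithmetic: how many tablets a hypothetical 1 TB table breaks
# into, and how many tablet servers it might spread across.
tablet_mb = 150                  # midpoint of the 100-200 MB range
table_mb = 1024 * 1024           # a hypothetical 1 TB table
tablets = table_mb // tablet_mb  # about 7,000 tablets
servers = tablets // 100         # at a (assumed) load of ~100 tablets/server
```

So even a single large table fans out across dozens of machines, which is where the load balancing comes from.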
Operator, Give Me Tablet 634-5789
Finding the right tablet with the least amount of effort and time is critical to keeping a huge database efficient. Bigtable’s three-level addressing scheme accomplishes that while supporting a huge address space.
- Chubby stores the location of the root tablet
- The root tablet contains the addresses of all metadata tablets
- The metadata tablets contain the locations of a group of user tablets
This simple scheme allows a single Bigtable to address over 16 billion tablets.
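Where does that 16-billion figure come from? The paper caps metadata tablets at 128 MB and budgets roughly 1 KB per tablet-location entry, so the arithmetic works out like this:

```python
# Three-level addressing capacity, from the paper's stated limits:
# metadata tablets capped at 128 MB, ~1 KB per location entry.
entries_per_metadata_tablet = (128 * 2**20) // 2**10    # 2**17 entries
addressable_tablets = entries_per_metadata_tablet ** 2  # root -> metadata -> user
```

That is 2^34 tablets, a bit over 17 billion — hence "over 16 billion."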
Part II of Google’s Bigtable Distributed Storage System will cover tablet serving, data compression, lessons learned and application performance. Coming Soon To A Web Page Near You!
Comments, as always, welcome.
Trivia test: what is the name of the Hollywood movie whose subject is the making of a fictional movie named “Chubby Rain”? For extra credit: what is the catch phrase repeated by the action star whose name is also the acronym for the phrase? This is why you need more storage.