A couple of years ago at the first Seattle Conference on Scalability, Google’s Jeffrey Dean remarked that the company wanted 100x more scalability. Unsurprising, given the rapid growth of the web. But there was more to it than that: GFS, the Google File System, was running out of scalability.

The culprit: the single-master architecture of GFS. As described in the StorageMojo post Google File System Eval: Pt. 1:

A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients. . . .

If, like me, you thought bottleneck/SPOF when you saw the single master, you would, like me, have been several steps behind the architects. The master only tells clients (in tiny multibyte messages) which chunkservers have the needed chunks. Clients then interact directly with chunkservers for most subsequent operations. Now grok one of the big advantages of a large chunk size: clients don’t need much interaction with the master to get access to a lot of data.

The master stores — in memory for speed — three major types of metadata:

  • File and chunk names [or namespaces in geekspeak]
  • Mapping from files to chunks, i.e. the chunks that make up each file
  • Locations of each chunk’s replicas
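
To make that division of labor concrete, here is a minimal sketch in Python of the idea as described above: a single master holding those three kinds of metadata in memory and answering small lookup requests, while the bulk data moves directly between clients and chunkservers. The names (Master, lookup, client_read) and the chunkserver read call are my own illustrative inventions, not Google’s actual interfaces.

    # Illustrative sketch only -- not Google's implementation or RPC interface.
    # It models the three kinds of in-memory metadata listed above and shows why
    # the master stays out of the data path: it hands back chunk locations, and
    # the client fetches the bytes from a chunkserver directly.

    from dataclasses import dataclass, field

    CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks keep the chunk count (and metadata) small


    @dataclass
    class Master:
        namespace: set = field(default_factory=set)          # file and chunk names
        file_to_chunks: dict = field(default_factory=dict)   # path -> [chunk handles]
        chunk_locations: dict = field(default_factory=dict)  # chunk handle -> [chunkserver addresses]

        def lookup(self, path: str, offset: int):
            """Tiny metadata reply: which chunk covers this offset, and where its replicas live."""
            handle = self.file_to_chunks[path][offset // CHUNK_SIZE]
            return handle, self.chunk_locations[handle]


    def client_read(master: Master, chunkservers: dict, path: str, offset: int) -> bytes:
        # One small round trip to the master...
        handle, replicas = master.lookup(path, offset)
        # ...then the (potentially huge) data transfer goes straight to a chunkserver.
        return chunkservers[replicas[0]].read(handle, offset % CHUNK_SIZE)

Note that with 64MB chunks, a client streaming through a large file bothers the master only once per 64MB of data, which is exactly why those tiny metadata messages were sufficient, for a while.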

Not as scalable as the Internet
As it turned out, though, the single master did become a bottleneck. As Google engineer Sean Quinlan explained in a recent ACM Queue interview:

The decision to go with a single master was actually one of the very first decisions, mostly just to simplify the overall design problem. That is, building a distributed master right from the outset was deemed too difficult and would take too much time. . . .

. . . in sketching out the use cases they anticipated, it didn’t seem the single-master design would cause much of a problem. The scale they were thinking about back then was framed in terms of hundreds of terabytes and a few million files. . . .

Problems started to occur once the size of the underlying storage increased. Going from a few hundred terabytes up to petabytes, and then up to tens of petabytes, that really required a proportionate increase in the amount of metadata the master had to maintain. Also, operations such as scanning the metadata to look for recoveries all scaled linearly with the volume of data.

. . . this proved to be a bottleneck for the clients, even though the clients issue few metadata operations themselves—for example, a client talks to the master whenever it does an open. When you have thousands of clients all talking to the master at the same time, given that the master is capable of doing only a few thousand operations a second, the average client isn’t able to command all that many operations per second.

Also bear in mind that there are applications such as MapReduce, where you might suddenly have a thousand tasks, each wanting to open a number of files. Obviously, it would take a long time to handle all those requests, and the master would be under a fair amount of duress. . . .

We ended up putting a fair amount of effort into tuning master performance, and it’s atypical of Google to put a lot of work into tuning any one particular binary. Generally, our approach is just to get things working reasonably well and then turn our focus to scalability—which usually works well in that you can generally get your performance back by scaling things. . . .

It could be argued that managing to get GFS ready for production in record time constituted a victory in its own right and that, by speeding Google to market, this ultimately contributed mightily to the company’s success. A team of three was responsible for all of that — for the core of GFS — and for the system being readied for deployment in less than a year.

But then came the price that so often befalls any successful system — that is, once the scale and use cases have had time to expand far beyond what anyone could have possibly imagined. In Google’s case, those pressures proved to be particularly intense.

Although organizations don’t make a habit of exchanging file-system statistics, it’s safe to assume that GFS is the largest file system in operation (in fact, that was probably true even before Google’s acquisition of YouTube). Hence, even though the original architects of GFS felt they had provided adequately for at least a couple of orders of magnitude of growth, Google quickly zoomed right past that.

Google has now developed a distributed master system that scales to hundreds of masters, each capable of handling about 100 million files.
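
To put those figures in perspective, here is a purely illustrative back-of-envelope using the rough numbers quoted above. The specific values (5,000 ops/sec, 2,000 clients, 200 masters) are my assumptions for the arithmetic, not Google’s published figures.

    # Back-of-envelope only, using the rough figures quoted above.
    master_ops_per_sec = 5_000      # "a few thousand operations a second"
    clients = 2_000                 # "thousands of clients all talking to the master"
    print(master_ops_per_sec / clients)   # ~2.5 metadata ops/sec per client -- easily starved

    masters = 200                   # "hundreds of masters"
    files_per_master = 100_000_000  # "about 100 million files" each
    print(f"{masters * files_per_master:,}")  # ~20 billion files of aggregate namespace capacity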

Application pressures
Not only did the system need to scale far more than the designers anticipated, but the number of apps GFS supported also grew. And not all of those apps – think Gmail – could use a 64MB chunk size efficiently.

Thus the need to handle a 1MB chunk size, and with it the much larger number of chunks and files that smaller chunk and file sizes imply. That’s where BigTable comes in, as both a savior and a problem in its own right.
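
Why smaller chunks hurt so much is easy to see with another illustrative calculation. The 10PB corpus and the 64 bytes of per-chunk metadata below are assumptions chosen to show scale, not published figures.

    # Illustrative only: how chunk size drives the master's metadata volume.
    data = 10 * 2**50                  # assume a 10PB corpus
    bytes_per_chunk_record = 64        # assumed order-of-magnitude metadata cost per chunk

    for chunk_size_mb in (64, 1):      # 64MB chunks vs. 1MB chunks
        chunks = data // (chunk_size_mb * 2**20)
        metadata_gib = chunks * bytes_per_chunk_record / 2**30
        print(f"{chunk_size_mb}MB chunks -> {chunks:,} chunks, ~{metadata_gib:.0f} GiB of chunk metadata")

Sixty-four times as many chunks means roughly sixty-four times as much metadata for a single in-memory master to hold and scan, which is the pressure BigTable was pressed into relieving by packing many small items into larger GFS files.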

The StorageMojo take
I’ll continue this in a later post, but the moral of the story is obvious: Internet scale is unlike anything we’ve seen in computing before. Even guys at the epicenter with immense resources and a clean sheet underestimated the challenge.

Estimating exponential growth is a universal weakness of Homo sapiens. All the cloud infrastructure vendors and providers need to think long and hard about how they will manage the growth that at least some of them will have.

But as Sean notes, if over-engineering gets you to market late, you may not have the scale problem you were planning (hoping?) for.

Continued next post

Courteous comments welcome, of course.