Google File System Eval: Part II

In yesterday’s post I ran through a quick (really, it was!) overview of the Google File System’s organization and storage-related features such as RAID and high-availability. I want to offer a little more data about the performance of GFS before offering my conclusion about the marketability of GFS as a commercial product.

The Google File System, by Ghemawat, Gobioff, & Leung, includes some interesting performance data. These examples can’t be regarded as representative, since we don’t know enough about the population of GFS clusters at Google, so any conclusions drawn from them are necessarily tentative.

They looked at two GFS clusters configured like this:

                             Cluster A   Cluster B
Chunkservers                       342         227
Available Disk Capacity          72 TB      180 TB
Used Disk Capacity               55 TB      155 TB
Number of Files                  735 k       737 k
Number of Dead Files              22 k       232 k
Number of Chunks                 992 k      1550 k
Metadata at Chunkservers         13 GB       21 GB
Metadata at Master               48 MB       60 MB

So we have a couple of fair-sized storage systems: one utilizing about 76% of its available space, the other about 86%. Respectable numbers for any data center storage manager. We also see that chunk metadata appears to scale linearly with the number of chunks. Good. The average file size on A appears to be roughly one-third that of B: about 75 MB for A and 210 MB for B. Much larger than the average data center file size.
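
For the curious, the arithmetic behind those estimates is nothing more than this back-of-the-envelope Python sketch; the inputs are simply the table values above.

```python
# Back-of-the-envelope numbers derived from the table above.
TB = 10**12
MB = 10**6

clusters = {
    "A": {"available": 72 * TB, "used": 55 * TB, "files": 735e3},
    "B": {"available": 180 * TB, "used": 155 * TB, "files": 737e3},
}

for name, c in clusters.items():
    utilization = c["used"] / c["available"]      # fraction of capacity in use
    avg_file_mb = c["used"] / c["files"] / MB     # rough average file size
    print(f"Cluster {name}: {utilization:.0%} utilized, "
          f"~{avg_file_mb:.0f} MB average file")

# Cluster A: 76% utilized, ~75 MB average file
# Cluster B: 86% utilized, ~210 MB average file
```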

Next we get some performance data for the two clusters:

                               Cluster A   Cluster B
Read Rate – last minute        583 MB/s    380 MB/s
Read Rate – last hour          562 MB/s    384 MB/s
Read Rate – since restart      589 MB/s     49 MB/s
Write Rate – last minute         1 MB/s    101 MB/s
Write Rate – last hour           2 MB/s    117 MB/s
Write Rate – since restart      25 MB/s     13 MB/s
Master Ops – last minute       325 Op/s    533 Op/s
Master Ops – last hour         381 Op/s    518 Op/s
Master Ops – since restart     202 Op/s    347 Op/s

Just as the gentlemen said, there is excellent sequential read performance, very good sequential write performance, and unimpressive small-write performance. Looking at cluster A’s numbers, I infer that over the last minute it averaged about 125 small writes per second of roughly 8 KB each. Clearly, not ready for the heads-down, 500-desk Oracle call center. Not the design center either. It appears to me, though, that this performance would compete handily with an EMC Centera or even the new NetApp FAS6000 series on a large-file workload. Not bad for a three-year-old system constructed from commodity parts.
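
That small-write inference is just one division; the 8 KB average write size is my assumption, not a number from the paper:

```python
# Rough inference about cluster A's recent write activity.
# The 8 KB average write size is my assumption, not a figure from the paper.
write_rate = 1 * 10**6          # 1 MB/s, from the table above
assumed_write_size = 8 * 10**3  # 8 KB per small write

writes_per_second = write_rate / assumed_write_size
print(f"~{writes_per_second:.0f} small writes per second")   # ~125
```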

Conclusion
The GFS implementation we’ve looked at here offers many winning attributes.
These include:

  • Availability. Triple redundancy (or more if users choose), pipelined chunk replication, rapid master failover, intelligent replica placement, automatic re-replication, and cheap snapshot copies. All of these features deliver what Google users see every day: datacenter-class availability in one of the world’s largest datacenters. (A toy sketch of the replica placement and re-replication ideas appears right after this list.)
  • Performance. Most workloads, even databases, are about 90% reads. GFS performance on large sequential reads is exemplary. It was child’s play for Google to add video download to their product set, and I suspect their cost-per-byte is better than YouTube or any of the other video sharing services.
  • Management. The system offers much of what IBM calls “autonomic” management. It manages itself through multiple failure modes, offers automatic load balancing and storage pooling, and provides features, such as snapshots and the three-day window during which dead files remain on the system, that give management an extra line of defense against failures and mistakes. I’d love to know how many sysadmins it takes to run a system like this.
  • Cost. Storage doesn’t get any cheaper than ATA drives in a system box.
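
To make the availability bullet a bit more concrete (as promised above), here is a toy Python sketch of two of the ideas: spreading replicas across racks, and re-replicating the most under-replicated chunks first. The paper describes these policies only at a high level, so this is my own simplified illustration, not Google’s code.

```python
import random
from collections import defaultdict

REPLICATION_GOAL = 3   # GFS defaults to three replicas; users can ask for more

def place_replicas(chunkservers, goal=REPLICATION_GOAL):
    """Choose replica locations, preferring to spread them across racks.

    chunkservers: list of (server_id, rack_id) pairs. This is a simplified
    stand-in for whatever placement policy GFS actually uses.
    """
    by_rack = defaultdict(list)
    for server, rack in chunkservers:
        by_rack[rack].append(server)

    racks = list(by_rack)
    random.shuffle(racks)
    chosen = []
    # First pass: at most one replica per rack, so losing a rack
    # cannot take out every copy of a chunk.
    for rack in racks:
        if len(chosen) == goal:
            break
        chosen.append(random.choice(by_rack[rack]))
    # Second pass: if there are fewer racks than replicas, fill in from
    # any servers not already holding a copy.
    remaining = [s for s, _ in chunkservers if s not in chosen]
    while len(chosen) < goal and remaining:
        chosen.append(remaining.pop(random.randrange(len(remaining))))
    return chosen

def rereplication_order(chunk_replicas):
    """Re-replicate the chunks furthest below the replication goal first."""
    # chunk_replicas: dict mapping chunk_id -> current number of live replicas
    return sorted(chunk_replicas, key=chunk_replicas.get)

servers = [("cs1", "rackA"), ("cs2", "rackA"), ("cs3", "rackB"),
           ("cs4", "rackB"), ("cs5", "rackC")]
print(place_replicas(servers))                           # e.g. ['cs2', 'cs3', 'cs5']
print(rereplication_order({"c7": 1, "c9": 2, "c3": 3}))  # ['c7', 'c9', 'c3']
```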

Yet as a general purpose commercial product, it suffers some serious shortcomings.

  • Performance on small reads and writes, which it wasn’t designed for, isn’t good enough for general data center workloads.
  • The record append file operation and the “relaxed” consistency model, while excellent for Google, wouldn’t fit many enterprise workloads. Email systems, where SOX requirements are pushing longer retention, might be redesigned to eliminate deletes. But since appending is the key to GFS write performance in a multi-writer environment, GFS might give up much of its performance advantage in the enterprise, even on large sequential writes. (A toy model of the append semantics follows this list.)
  • Lest we forget, GFS is NFS, not for sale. Google must see its infrastructure technology as a critical competitive advantage, so it is highly unlikely to open source GFS any time soon.
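
For readers who haven’t met record append, here is a toy Python model of the at-least-once semantics as I read the paper; it illustrates the behavior that trips up enterprise expectations, and is not GFS code.

```python
import random

class ToyChunk:
    """Toy model of GFS-style record append: at-least-once, not exactly-once.

    A reported failure makes the client retry, so the same record can end up
    in the file more than once (the paper says applications cope using
    checksums and unique record IDs). Traditional enterprise software expects
    the file system to hide all of that.
    """
    def __init__(self):
        self.records = []

    def record_append(self, data, failure_rate=0.3):
        """Append data at an offset of the system's choosing; retry until acknowledged."""
        while True:
            self.records.append(data)             # the chunk gets the bytes...
            if random.random() > failure_rate:    # ...but the ack may be lost
                return len(self.records) - 1      # offset of the defined record
            # On a reported failure the client simply retries, creating a duplicate.

chunk = ToyChunk()
offsets = [chunk.record_append(f"record-{i}") for i in range(5)]
print(offsets)        # offsets of the successfully acknowledged appends
print(chunk.records)  # every record present at least once, some possibly twice
```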

Looking at the whole gestalt, even if GFS were for sale, it would be a niche product and would not be very successful on the open market.

As a model for what can be done, however, it is invaluable. For the last 20 years the industry has striven to add availability and scalability to an increasingly untenable storage model of blocks and volumes by building ever-costlier “bulletproof” devices.

GFS breaks that model and shows us what can be done when the entire storage paradigm is rethought. Build the availability around the devices, not in them; treat the storage infrastructure as a single system, not a collection of parts; and extend the file system paradigm to include much of what we now consider storage management, including virtualization, continuous data protection, load balancing and capacity management.

GFS is not the future. But it shows us what the future can be.

Comments

Brian Wednesday, 28 June, 2006 at 11:36 am

Cost: Let’s not forget that Google is running into huge electric bill problems by using thousands of regular PCs. I think I heard a quote somewhere that power is 50% of their operating costs. For my server setup, though, I’d rather spend the money on power than on parts that can go awry.

Jarom Wednesday, 28 June, 2006 at 1:30 pm

Great review! Thanks.

Ben Wednesday, 28 June, 2006 at 1:31 pm

I invite you to check out the Apache Hadoop project, which implements various bits of the Google infrastructure (GFS and MapReduce in particular). It was broken off from the Nutch project.

http://lucene.apache.org/hadoop/

Bobby Wednesday, 28 June, 2006 at 5:10 pm

Wow, I really like your writing style. I think I’m an instant fan. Great read too. Can’t wait to see their big bad data house in Oregon soon.

Roman Thursday, 29 June, 2006 at 7:08 am

I’m trying to figure out why you’re comparing a Google Cluster with a Centera, and in the same breath, a FAS 6000. That’s comparing apples with oranges and crowbars… did you mean Celerra?

rush Saturday, 1 July, 2006 at 4:08 am

Informative article. Thanks!

John Tuesday, 15 August, 2006 at 10:40 pm

I don’t know if I am a fan of RAID-6 (or RAID-n, n > 6), but as quoted, the real utilization of their disk capacity is at most 1/3 of that 80% or 90%, which most IT managers would NOT like to see. The whole problem of GooFS, from a storage point of view, is too many ineffective redundant disks. The Google clusters have at least 100,000 boxes, only 1/3 of which are “effective” boxes. Excluding the installation cost of each box, the extra equipment alone is costing Google at least $40,000,000 (each box estimated at $600, excluding the power and physical space to keep them). As the Google system expands, this cost will scale as well, not to mention they still need live people to replace bad components on the spot (I hope someday they will use robots). Can the same availability and performance be achieved at much lower cost? I hope and think so. Is there any such system already out there? Not yet (a real shame on the storage industry). Why can’t, or doesn’t, Google design it? I guess they lack such talent, plus they are already much better than Yahoo and MSN … (Sigh.)

Robin Harris Wednesday, 16 August, 2006 at 9:42 am

John, that sort of reasoning is why business people don’t get IT people. As a business decision the answer is return on investment, not some abstract measure of capacity utilization.

To wit: if my BigIron IOtron 90000 costs $20 per usable GB (after allowing for RAID and growth and short stroking or whatever) and the commercial version of the GooFS system costs $1.50/GB with all the same caveats, no financial or business type will care whether utilization is 9% or 90%.

GooFS HAS achieved all this at much lower cost – as near as I can tell (see Killing With Kindness: Death By Big Iron) – so the real question is, can they do it for even LESS. We know they are certainly thinking that way because the competitive advantage of lower costs is so compelling. Stay tuned.

John Wednesday, 16 August, 2006 at 7:41 pm

Robin, first I am new to your site but liked it immediately. A good site indeed.

Back to the Google system: I agree with most of your points. But don’t forget its development cost. Dozens if not hundreds of good PhDs (or candidates) and programmers spent a lot of time on it. So if GFS were a commercial product now, it would be much more expensive than the equipment cost alone suggests.

After all, RAID started as an inexpensive (which is what the initial I stood for; it now stands for independent rather than inexpensive) means of providing highly available storage. Just as you said in another of your articles, the difference between RAID and GFS-type systems is where you put the availability function: at the disk/array driver, the node (box), or simply the application level.

GFS is successful only for Google. Not many other companies could pull it off. Yahoo and MSN are still struggling to catch up. I once heard a Yahoo systems engineer say: “Where can we buy a system like Google’s?” For them, as you said, the return on investment would be much quicker if they could buy such a system. Right now, they (including MSN, Amazon and maybe eBay) are doing the storage part in a very ad hoc way, probably designed and implemented by a couple of smart programmers. This provides a big opportunity for the storage companies, such as EMC, NetApp, HP, Sun and IBM. It is a shame they still cannot provide such systems while charging zillions of dollars for their BigIron IOtrons. Maybe you know better …

To me, a storage system should be like power sockets. If we run out of sockets, we buy a couple of extension cords. The beauty is that we can cascade these extensions indefinitely (within the load limit of the incoming power line, of course). We don’t need to worry about which socket we get power from. I wonder why such storage devices don’t exist now. Maybe you could elaborate on that …

Robin Harris Thursday, 17 August, 2006 at 12:22 pm

John, glad you like the site. I have a blast researching and writing StorageMojo.com and it’s great connecting with new people.

You raise a good point about the investment required for GFS. GFS is not a commercial product, so it doesn’t require a lot of the product packaging and marketing work. Yet I’d guess the GFS team wasn’t all that large – maybe a dozen or so really smart folks. After all, they used Google’s standard Linux distribution to run each of the nodes, including the storage, with a Linux file system handling the local details.

The three guys who wrote the article I relied upon certainly seemed like they knew how to architect a system, so then it’s the coding. I’d bet at least one of the architects also did a fair amount of coding himself. So even though Google has a lot of PhDs, most of them were doing other things. It would be nice to learn more about the GFS team.

The really cool thing is that once a group has done something like this, it becomes far easier to be the second team to do it – this time with naked commercial ambition behind an elegant and bullet-proof architecture. It seems bound to happen: so much money in Silicon Valley; the team that’s done it once; an obvious new market that will be huge; no market leader. It will be fun.

Ankur Sethi Tuesday, 29 May, 2007 at 1:05 pm

What is really interesting is that they have not patented it. It is chock full of innovations and could get a few patents. So why aren’t they doing it? They are so far ahead of the competition that people would just get a few ideas from the patent and modify them slightly to keep away from the lawyers Google would have to hire. I guess they just don’t want to go in the patent direction that other corporations cannot live without. (I heard that Microsoft patented the right click on a stylus in a touch-screen interface, by the method of holding down the stylus to the screen.) So anyway, Google is not marketing a product and they are happy to keep GFS in the background.

Tom Tuesday, 13 January, 2009 at 1:42 am

First of all, you’re doing a great job with your articles! Excellent work!

Secondly, it might be nice to make a comparison with Hadoop, which is also used by Yahoo.
http://hadoop.apache.org/

Nate Tuesday, 15 February, 2011 at 10:25 pm

I’m no programmer or genius, but I don’t see how this could not be made much more efficient and cost-friendly per annum. Drives keep getting more massive, and I’m sure you could slap them into ultra-low-power computers. Mainly, though, I’m sure the chunk size could be changed or adapted to more efficiently meet the needs of a given database workload, on the assumption that smaller chunks would handle smaller files much better.

Rob Peglar Saturday, 3 November, 2012 at 6:12 am

Robin, thanks for writing the article. I enjoyed reading it. A few comments for you.
1) Your URL to the Ghemawat paper is broken; a nice 404 back from Google. Here’s the correct link: research.google.com/en/us/archive/gfs-sosp2003.pdf
2) Many of the points you bring up are absolutely correct. However, since you like GFS, and rightly so, you should have studied OneFS (Isilon) before you wrote your post. OneFS is not only earlier work but has fewer of the architectural drawbacks you correctly pointed out that GFS has, such as the master failover scenario, consistency and the shortcomings on general data center workloads. There is no master-slave relationship in OneFS; all nodes perform both metadata and data service to all clients; symmetric scale-out. The use of Infiniband for internodal communication is key, much like a scale-out compute cluster. I could go on.
3) Lastly, the method for data protection in GFS (triple replicas) is hardly unique in the world of filesystems. OneFS, for example, on a directory or file basis, either performs mirroring (2-way to 8-way) or uses Reed-Solomon encoding (single to quadruple). This allows great flexibility for not only users but datacenter admins in constructing efficient, optimal infrastructure.

Again, Robin, nice job on the article. As the kids would say, if you like GFS, you’ll really like OneFS.
