Architecting the Internet Data Center – Parts I-IV

Introduction: Internet Scale vs Enterprise Scale
The rise of the Internet Data Center (IDC), exemplified by Amazon, Google and Yahoo, over the last 10 years is perhaps the fastest and most radical transformation in the history of information technology. From humble beginnings using standard enterprise-style architectures and products, the IDC has rapidly evolved into a form whose implications are only slowly being understood by the rest of the IT sector, vendors and customers alike. Soon IDCs will dwarf the Enterprise Data Center (EDC) not by 10x, but by 100x or even 1,000x in any metric that matters.

Nor is the IDC a muscle-bound, one-trick giant. Though huge, the IDC is lithe and flexible. While a “state of the art” EDC struggles through three-year projects to architect, implement, provision and manage new applications for a few thousand users, IDCs routinely roll out to millions and even tens of millions of users in months. An IDC is, first and foremost, an application delivery engine.

Of course, as the old IT joke goes, God created the universe in seven days, but He didn’t have an installed base. And neither do the IDCs, today. Yet even there the IDCs have important lessons to offer as we create the 21st century infosphere. These lessons just aren’t always obvious. If they were, everyone would have figured them out by now.

Not The Same As The Old Boss
IDCs differ from EDCs in many ways. The question is why. The answer can tell us much about how we can expect to see IDCs evolve as well as how to re-think the EDC. Is it the economics? The applications? The technology? The scale? Or is something else behind it?

Some of the hardware elements that differentiate IDCs from EDCs are:

  • No big iron RAID arrays
  • No Fibre Channel
  • Direct-attached disk (with the exception of Yahoo Mail)
  • Many small, low-cost servers

This is a startling contrast to the standard EDC approach: big-iron arrays and servers, heavy on Fibre Channel and RAID, with mainframes and servers in the $15k-and-up range. In this series of articles I’m going to look at a few of these differences to see what is behind them, using one of my favorite papers as a guide.

Finally, A Crystal Clear Crystal Ball
In an excellent six-year-old paper, Rules of Thumb in Data Engineering, Jim Gray and Prashant Shenoy offered a set of useful generalizations about data storage design. How useful? Let’s see what they can tell us about IDC architecture.

I/O Cost and System Architecture
Bits want to be free, but I/Os aren’t and never will be. An IDC may be thought of as a massive I/O processor: data streaming in from bots; requests from customers; search results, ads, email, maps, etc. flowing out. Sure, computation is needed to find and sort the content. Yet the raw material is the many petabytes each IDC stores and accesses.

So the cost of an I/O, in CPU cycles and overhead, is important. Gray and Shenoy derive some rules of thumb for I/O costs:

  • A disk I/O costs 5,000 instructions and 0.1 instructions per byte
  • The CPU cost of a Systems Area Network (SAN) network message is 3,000 clocks and 1 clock per byte
  • A network message costs 10,000 instructions and 10 instructions per byte

[For some reason, the authors switched metrics from instructions to clocks. I’m assuming, conservatively, that 1 instruction = 1 clock. The authors note elsewhere that the clocks per instruction (CPI) actually ranges from 1-3.]

So an 8KB I/O, a standard I/O size for Unix systems, costs:

  • Disk: 5,000 + 800 = 5,800 instructions
  • SAN: 3,000 + 8,000 = 11,000 clocks
  • Network: 10,000 + 80,000 = 90,000 instructions
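
To make that arithmetic easy to play with, here is a minimal sketch that applies the paper’s fixed and per-byte costs to any transfer size, using the same 1 instruction = 1 clock assumption as above:

```python
# Gray & Shenoy's rule-of-thumb I/O costs: (fixed cost, cost per byte).
# Units are instructions for disk and network, clocks for SAN; I assume
# 1 instruction = 1 clock, as noted above.
IO_COST = {
    "disk":    (5_000, 0.1),
    "san":     (3_000, 1.0),
    "network": (10_000, 10.0),
}

def cpu_cost(kind: str, size_bytes: int) -> float:
    """Approximate CPU cost of one I/O of the given size."""
    fixed, per_byte = IO_COST[kind]
    return fixed + per_byte * size_bytes

for kind in IO_COST:
    print(f"8 KB {kind:7} I/O ~ {cpu_cost(kind, 8_000):>9,.0f} instructions")
```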

Thus it is obvious why IDCs generally prefer local disks to SANs or networked storage. Not only is local disk cheaper economically, it is much cheaper in CPU resources. Looked at another way, this simply confirms what many practitioners already know from ample experience: the EDC architecture doesn’t scale easily or well.

Comments – moderated to eliminate spam – always welcome. BTW, the numbers in the paper are six years old. More current estimates much appreciated.

Pt. II: Storage Cost and Implementation

Storage Cost and Implementation

The Internet Data Center (IDC) is architected very differently from an Enterprise Data Center (EDC). In an EDC, RAID arrays are used to hide the disk’s physical limitations. In the IDC the infrastructure is designed to work with those limitations to reduce complexity, increase availability, lower cost and optimize performance. It seems likely that at some point the EDC will have to follow suit.

As in Part I, I look at the six-year-old paper Rules of Thumb in Data Engineering by Jim Gray and Prashant Shenoy and relate their conclusions to the trends we see today in the IDC. The value of this exercise is that Rules looks at critical technology trends and draws logical conclusions about the resulting IT model we should be using. The IDCs stand as a test of the paper’s conclusions, enabling us to see how accurate and relevant the authors’ metrics are to the real world of massive-scale IT.

Disk and Data Trends
As Rules notes, disk trends are clear and quantifiable. For example, in 1981 DEC’s RP07 disk drive stored about 500 MB and was capable of about 50 I/Os per second (IOPS), or 1 IOPS for every 10 MB of capacity (it was also the size of a washing machine). The hot new Seagate 750 GB Barracuda 7200.10 is capable of 110 random IOPS, or about 1 IOPS for every 7 GB. So in 25 years, despite all the technology advances, this amazing device offers about 1/700th the I/O performance per unit of capacity.

Looked at another way, in two and a half decades the ratio between disk capacity and disk accesses has been increasing at more than 10x per decade.
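
A quick back-of-the-envelope check of that trend, using the two drives quoted above (a sketch, not a benchmark):

```python
def iops_per_gb(capacity_gb: float, iops: float) -> float:
    return iops / capacity_gb

rp07_1981 = iops_per_gb(0.5, 50)        # DEC RP07: ~500 MB, ~50 IOPS
barracuda_2006 = iops_per_gb(750, 110)  # Seagate 750 GB Barracuda 7200.10

print(f"1981: {rp07_1981:.0f} IOPS/GB   2006: {barracuda_2006:.2f} IOPS/GB")
print(f"Access density fell ~{rp07_1981 / barracuda_2006:.0f}x in 25 years")
# ~680x over 25 years, i.e. more than 10x per decade
```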

Gray and Shenoy conclude these trends imply two things. First, our data has become cooler; that is, there are far fewer accesses per block than in the past. Second, disk accesses are a scarce resource and have grown costlier. Disk I/Os must be used wisely to optimize system performance.

IDC Adaptations to Disk I/O Rationing
IDC architectures reveal an acute sensitivity to disk I/O scarcity. Since Google has released the most detailed information about their storage, I’ll use them as the example. From the limited information available it appears the other IDCs use similar strategies, where possible, or simply throw conventional hardware at the problem, at great cost (see Killing With Kindness: Death By Big Iron for a detailed example).

Two I/O intensive techniques are RAID 5 and RAID 6. In RAID 5, writing a block typically requires four disk accesses: two to read the existing data and parity and two more to write the new data and parity (RAID 6 requires even more). Not surprisingly, Google avoids RAID 5 or RAID 6 and favors mirroring, typically mirroring each chunk of data at least three times and many more times if it is hot. This effectively increases the IOPS per chunk of data at the expense of capacity, which is much cheaper than additional bandwidth or cache.
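
Here is a rough sketch of the I/O accounting behind that choice. The counts are the textbook read-modify-write cycle for RAID 5 small writes versus simple n-way replication; real controllers, and GFS itself, differ in the details.

```python
RAID5_SMALL_WRITE_IOS = 4          # read old data + read old parity +
                                   # write new data + write new parity

def mirror_write_ios(copies: int = 3) -> int:
    # one write per replica, no read-modify-write cycle
    return copies

def mirror_read_iops(per_disk_iops: float = 110, copies: int = 3) -> float:
    # any replica can serve a read, so read capacity scales with the copy count
    return per_disk_iops * copies

print("RAID 5 small write:", RAID5_SMALL_WRITE_IOS, "disk I/Os (two of them reads)")
print("3-way mirror write:", mirror_write_ios(3), "disk I/Os (all writes)")
print("3-way mirror reads:", mirror_read_iops(), "IOPS from one chunk's replicas")
```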

I/O rationing favors fast sequential I/O as well. As Gray and Shenoy note:

A random access costs a seek time, half a rotation time, and then the transfer time. If the transfer is sequential, there is no seek time, and if the transfer is an entire track, there is no rotation time. So track-sized sequential transfers maximize disk bandwidth and arm utilization. The move to sequential disk IO is well underway. . . . caching, transaction logging, and log-structured file systems convert random writes into sequential writes. This has already had large benefits for database systems and operating systems. These techniques will continue to yield benefits as disk accesses become even more precious.

Google specifically optimized GFS for large reads and writes. Nor did they stop there. They also append new data to existing data rather than synchronize and coordinate the overwriting of existing data. This again optimizes the use of disk accesses at the expense of capacity.
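
The append-over-overwrite idea is easy to illustrate with a toy log-structured key/value store: every update is a sequential append, old versions are simply orphaned, and an in-memory index points at the latest copy. This is a sketch of the general technique, not of GFS’s actual format or API.

```python
class AppendOnlyStore:
    """Toy log-structured store: spends capacity to buy sequential writes."""

    def __init__(self, path: str):
        self.path = path
        self.index: dict[str, tuple[int, int]] = {}   # key -> (offset, length)
        open(path, "ab").close()                      # create the log if missing

    def put(self, key: str, value: bytes) -> None:
        with open(self.path, "ab") as log:            # always append: sequential I/O
            offset = log.seek(0, 2)                   # current end of the log
            log.write(value)
        self.index[key] = (offset, len(value))        # the old version is orphaned

    def get(self, key: str) -> bytes:
        offset, length = self.index[key]
        with open(self.path, "rb") as log:
            log.seek(offset)
            return log.read(length)

store = AppendOnlyStore("data.log")
store.put("user:42", b"first version")
store.put("user:42", b"second version")   # no in-place overwrite, just another append
print(store.get("user:42"))               # b'second version'
```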

Conclusion
Gray and Shenoy’s paper is surprisingly successful in predicting key design elements of an I/O intensive infrastructure as exemplified by Google and others. Yet they didn’t get everything right, although even their misses are instructive. Stay tuned.

Next: The Storage Management Crisis in Architecting the Internet Data Center: Pt. III

Pt III: Feeding the Beast

Our conversation about why Internet Data Centers (IDCs) are architected as they are has covered, so far, I/O Cost and System Architecture and IDC Adaptations to Disk I/O Rationing. I was going to go into management today, but then another issue raised in the paper Rules of Thumb in Data Engineering by Jim Gray and Prashant Shenoy caught my eye.

Feeding the Beast
A venerable concept in data storage is the storage pyramid. At the top of the pyramid is the fastest and most expensive storage and at the bottom is the slowest and cheapest. The taxonomy starts with on-chip storage (registers, buffers, private L1 instruction and data caches, perhaps an L2 cache) and then moves off-chip to external caches, main memory (RAM), disk cache and finally the magnetized bits on a spinning disk. Disks have their own performance hierarchy, ranging from dual-ported, Fibre Channel, 2.5″-platter, 15,000 RPM speedsters with 16MB of cache all the way down to 3,600 RPM 1″ microdrives that are no faster than USB thumb drives. Tape is still the slowest, but with the rise of 25x backup compression, it isn’t always the cheapest.

The pyramid is important because CPUs are voracious consumers of data. For example, Intel’s new Core 2 processors can issue up to four instructions per clock cycle. On a 2GHz processor that is up to 8 billion instructions per second, and dual-core comes close to doubling that number, although the instructions actually completed per clock are typically 2-3. Do the math: 2 GHz means a 0.5 nanosecond clock. With two cores averaging a total of 5 instructions per clock you get 10 instructions per nanosecond. The very fastest RAM, which few of us use, is about 5ns, so every access to it stalls the processor for about 10 clock cycles, or roughly 50 lost instruction slots. A disk with a 10ms access means a 20,000,000 clock cycle stall.
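
Here is the same arithmetic as a tiny script, using the clock rate and latencies quoted above, so you can plug in your own numbers:

```python
CLOCK_NS = 0.5                 # 2 GHz clock
INSTR_PER_CLOCK = 5            # ~5 instructions per clock across two cores

def stall(access_ns: float) -> tuple[float, float]:
    """Return (stalled clock cycles, lost instruction slots) for one access."""
    cycles = access_ns / CLOCK_NS
    return cycles, cycles * INSTR_PER_CLOCK

print("5 ns RAM access:   %.0f cycles, %.0f lost instruction slots" % stall(5))
print("10 ms disk access: %.0f cycles, %.0f lost instruction slots" % stall(10e6))
```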

This huge I/O access cost is one of the key factors that led Intel to de-emphasize clock speed and focus on dual-core processors to grow performance. The storage couldn’t keep up with the CPU.

Even with the lower speeds we’re now seeing, feeding such processors is beyond what storage alone can do. Intelligent software design is required to ensure the greatest possible data locality and to reduce cache misses and disk accesses.

For the foreseeable future processor data demand will continue to outpace storage bandwidth. Which gets us to the issue of how best to use such blazing fast processors.

Caging the Beast
IDCs employ hundreds of thousands of processors. Google will pass the 1,000,000 processor mark this year. So the architecture of the storage system that feeds those processors is a critical problem. Take a look at all the multiprocessor options, from dual-core CPUs to clusters (of several different varieties), symmetric multi-processors (SMP), SIMD and MIMD machines, and it’s clear that there is a lot of experimentation about how to create cost-effective multi-processor architectures.

In their six-year-old paper, Gray and Shenoy write about the issues that SMP systems face. SMP systems typically share resources among processors and run one copy of the operating system, which coordinates the work the processors do, and as Gray and Shenoy note, getting good performance from massive SMPs is not easy. They then suggested

An alternative design opts for many nodes each with its own IO and bus bandwidth and all using a dataflow programming model and communicating via a high-speed network [15]. These designs have given rise to very impressive performance, for example, the sort speed of computer systems has been doubling each year for the last 15 years through a combination of increased node speed (about 60%/year) and parallelism (about 40%/year). The 1999 terabyte sort used nearly 2,000 processors and disks [see sort benchmark].

Dataflow programming is a paradigm organized on the principle of “when all the inputs are present, start the program and ship the results as inputs to the next program”. This model works well for parallel processing, since the availability of data drives processing, not some hopelessly complex master scheduling algorithm. Unlike the SMP model, each node has its own operating system and local resources, which confines contention to the high-speed switched LAN that interconnects the nodes.
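
A toy dataflow scheduler makes the principle concrete: each node declares its inputs, and it fires as soon as the last one arrives, with its output feeding downstream nodes. The node names and functions here are invented for illustration; they have nothing to do with any real IDC codebase.

```python
from collections import defaultdict

class DataflowGraph:
    def __init__(self):
        self.nodes = {}                     # node name -> (function, input names)
        self.values = {}                    # name -> value produced or fed in
        self.waiting = defaultdict(list)    # input name -> nodes that consume it

    def add_node(self, name, func, inputs):
        self.nodes[name] = (func, inputs)
        for inp in inputs:
            self.waiting[inp].append(name)

    def feed(self, name, value):
        """Deliver a value; run any node whose inputs are now all present."""
        self.values[name] = value
        for node in self.waiting[name]:
            func, inputs = self.nodes[node]
            if all(i in self.values for i in inputs):
                self.feed(node, func(*(self.values[i] for i in inputs)))

g = DataflowGraph()
g.add_node("words", lambda page: page.split(), ["page"])
g.add_node("count", lambda words: len(words), ["words"])
g.feed("page", "the quick brown fox")       # data arrival drives the computation
print(g.values["count"])                    # 4
```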

Many-Little Beasts
They call this approach the “many-little scalable design” and note that this design

. . . leverage[s] the fact that mainframe:mini:commodity price ratios are approximately 100:10:1. That is, mainframes cost about 100 times more than commodity components, and semi-custom mini-computers have a 10:1 markup over commodity components . . . .

This is the approach taken by Google and to a lesser extent by Amazon with blade servers. No one has discovered a more cost-effective method to deliver internet-scale services.

Simplicity Matters
Processors are fast and getting faster. Interconnects other than gigabit Ethernet are expensive. So it makes sense that local resources, rather than SANs, are the infrastructure of choice for IDC deployment.

Next: The Storage Management Crisis in Architecting the Internet Data Center: Pt IV And this time I mean it!

Comments always welcome. And thank you Jim Gray for your comment on Part I of Architecting Internet Data Centers.

Pt IV: The Storage Management Crisis in Architecting the Internet Data Center

IDC-scale Storage Management
We’ve seen how IDC storage practices follow the dynamics outlined by Gray and Shenoy. Yet one of the most interesting questions to practitioners has to be storage management. Yes, huge disk farms can be built, but how are they managed? Rules of Thumb in Data Engineering makes many interesting statements with implications for management, yet few prescriptions. I’ll explore some of those statements and then offer my own conclusions. You may have different ones, which I’d like to hear.

RAM, Disk and Tape
The key rules of thumb affecting storage management are:

  • Storage capacities increase 100x per decade
  • Storage device throughput increases 10x per decade
  • Disk data cools 10x per decade
  • In ten years RAM will cost what disk costs today
  • NearlineTape:OnlineDisk:RAM storage cost ratios are approximately 1:3:300
  • A person can administer a million dollars of disk storage

The first four describe exponential trends, and since humans are bad at estimating exponential effects, let’s explore them further.

Storage Capacities Increase 100x Per Decade
Today, in 2006, 750 GB is the highest-capacity disk drive available. This rule means that in 2016 the largest disk drive will be 75 TB. The largest laptop drive today is 160 GB; in ten years the largest laptop drive will be 16 TB.

One implication of this is that capacity is cheap and getting cheaper faster than any other storage technology except RAM. Therefore, a storage solution engineered for the coming decade will spend capacity rather than accesses, just as several of the IDCs do today.
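
The projection is trivial, but writing it down makes the rule easy to apply to other starting points (a sketch of the rule of thumb itself, not a forecast):

```python
def project(capacity_gb: float, years: float, per_decade: float = 100) -> float:
    """Apply the 100x-per-decade capacity rule of thumb."""
    return capacity_gb * per_decade ** (years / 10)

print(f"750 GB drive, 10 years out:  {project(750, 10) / 1000:,.0f} TB")   # 75 TB
print(f"160 GB laptop, 10 years out: {project(160, 10) / 1000:,.0f} TB")   # 16 TB
print(f"750 GB drive, 5 years out:   {project(750, 5) / 1000:,.1f} TB")    # ~7.5 TB
```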

Storage device throughput increases 10x per decade
Disk data cools 10x per decade
These rules are two sides of the same coin: if capacity increases 100x per decade while throughput increases only 10x, then accesses per byte must drop by 10x. Over time, as Gray and Shenoy note, disk becomes the new tape. The cost of an access keeps climbing relative to the cost of capacity, and eventually the storage is only good as an archive, unless the usage model is adjusted to reflect the change.

In ten years RAM will cost what disk costs today
What saves disk from tape’s fate is cheap RAM. Everything on disk today will be in RAM. Disk will be saved for stuff we hardly ever look at.

Ten years from now the average laptop will have a 6-8 TB disk drive and 60-80 GB of RAM. Loading the OS at 2-3 GB/sec will take about 10 seconds. Loading all your favorite tunes, apps, documents and movies will take another 30 seconds. You’ll be happy, even though disk accesses are more costly than ever.
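
The load-time arithmetic, spelled out. The OS image size is an assumption (roughly 25 GB, which is what the ten-second figure implies); the transfer rate is the 2-3 GB/sec used above.

```python
def load_seconds(gigabytes: float, gb_per_sec: float = 2.5) -> float:
    return gigabytes / gb_per_sec

print(f"~25 GB OS image (assumed): ~{load_seconds(25):.0f} seconds")
print(f"70 GB to fill RAM:         ~{load_seconds(70):.0f} seconds")
```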

NearlineTape : OnlineDisk : RAM storage cost ratios are approximately 1:3:300
In the paper, Gray and Shenoy note

Historically, tape, disk, and RAM have maintained price ratios of about 1:10:1000. That is, disk storage has been 10x more expensive than tape, and RAM has been 100x more expensive than disk. . . . But, when the offline tapes are put in a nearline tape robot, the price per tape rises to 10K$/TB while packaged disks are 30K$/TB. This brings the ratios back to 1:3:240. It is fair to say that the storage cost ratios are now about 1:3:300.

Disk and nearline tape costs will inevitably reach parity despite the hard work of tape engineers. Why? Because the cost of mechanical robots is declining only 2% (if that) annually, while disk costs are declining 40-50% annually. Even if tape were on the same price/capacity slope as disk, which it isn’t, the mechanicals’ increasing share of total cost puts nearline at a permanent disadvantage. It isn’t a question of if the cost lines will cross, only when. Fortunately for StorageTek, Quantum, et al., storage consumers are a conservative bunch, and tape buyers are the most conservative of all. It will take 10-15 years for word of tape’s obsolescence to spread.
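
A sketch of the crossover argument with the numbers asserted above: nearline tape starts at the paper’s 10K$/TB against 30K$/TB for packaged disk, disk declines 45% a year, and the robot-dominated tape system declines 2% a year. Under those assumptions the lines cross almost immediately; change the inputs and the conclusion barely moves.

```python
def years_to_crossover(disk=30_000, tape=10_000,
                       disk_decline=0.45, tape_decline=0.02) -> int:
    """Years until disk $/TB drops below nearline tape $/TB."""
    years = 0
    while disk > tape:
        disk *= 1 - disk_decline
        tape *= 1 - tape_decline
        years += 1
    return years

print(f"Disk undercuts nearline tape in ~{years_to_crossover()} years")
```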

A person can administer a million dollars of disk storage
Of all of Gray and Shenoy’s rules this is the one I find the most interesting. In discussing it they note

The storage management tools are struggling to keep up with the relentless growth of storage. If you are designing for the next decade, you need to build systems that allow one person to manage a 10 PB store.

Now, that number is from six years ago. I suspect that if they revisited that recommendation today, they would start from today’s big systems. 100x in 10 years is the same as 10x in each of two successive five-year spans. The biggest arrays today are about one PB and can presumably be managed by one person (feel free to correct me on that point). So in five years, 10 PB seems reasonable. Yet in 10 years, if the 100x rule holds, one person will manage 100 PB. So my question to readers is: are any of today’s storage management paradigms scalable to 100 PB in 10 years? To 10 PB in five years?
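
Here is that projection in a few lines, starting from the roughly one PB per administrator estimated above (an assumption, not a measured figure):

```python
def pb_per_admin(pb_today: float = 1.0, years: float = 10,
                 per_decade: float = 100) -> float:
    """Scale today's PB-per-administrator estimate by the 100x-per-decade rule."""
    return pb_today * per_decade ** (years / 10)

print(f"5 years out:  ~{pb_per_admin(years=5):.0f} PB per administrator")
print(f"10 years out: ~{pb_per_admin(years=10):.0f} PB per administrator")
```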

Conclusion
Storage’s secular trend is to move intelligence from people to the CPU and then to the storage itself. With early drum storage, programmers hand-optimized the data layout on the drum to reduce rotational latency. With the RAMAC, the CPU controlled the disk drive’s arm movement directly. Over time many functions, such as mirroring, moved from person to CPU to storage. Storage has already achieved a degree of automation and virtualization far beyond what practitioners imagined even 20 years ago.

Looking at the rules of thumb in Gray and Shenoy’s paper and comparing them to what we know about IDCs – mostly Google, since they’ve been the most open – suggests that the rules are essentially correct. More importantly, not following the rules as one scales up is costly in equipment and labor.

Enterprise Data Centers (EDCs) don’t have the growth rate of a Google or Amazon, so they don’t face the stark choices that come with triple digit growth. Yet it is clear that today’s EDC architectures will not scale indefinitely, either technically or economically. Smart EDC architects and CIOs will start thinking sooner, rather than later, about how to start adapting IDC architectures to EDC problems.