
Update: A major power problem at the building where Dreamhost and many other hosting and network firms are colocated kept StorageMojo.com down for several hours Friday. As luck would have it, the failure occurred just as I hit the publish button. My apologies for not being faster on the draw.

Our conversation about why Internet Data Centers (IDCs) are architected as they are has covered, so far, I/O Cost and System Architecture and IDC Adaptations to Disk I/O Rationing. I was going to take up management today, but another issue raised in Rules of Thumb in Data Engineering, the paper by Jim Gray and Prashant Shenoy, caught my eye.

Feeding the Beast
A venerable concept in data storage is the storage pyramid. At the top of the pyramid is the fastest and most expensive storage; at the bottom, the slowest and cheapest. The taxonomy starts with on-chip storage, such as registers, buffers, and private L1 instruction and data caches, perhaps an L2 cache, and then moves off-chip to external caches, main memory (RAM), disk cache, and finally the magnetized bits on a spinning disk. Disks have their own performance hierarchy, ranging from dual-ported, Fibre Channel, 2.5″-platter, 15,000 RPM speedsters with 16MB of cache all the way down to 3,600 RPM 1″ microdrives that are no faster than USB thumb drives. Tape is still the slowest, but with the rise of 25x backup compression, it isn't always the cheapest.
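To make the pyramid concrete, here is a minimal sketch in Python that prints rough access times for each tier. The ~5ns RAM and ~10ms disk figures are the ones used in the next paragraph; the rest are order-of-magnitude assumptions for illustration, not measurements.

```python
# A sketch of the storage pyramid as a table of rough access times.
# The RAM and disk figures match the article; the others are
# order-of-magnitude assumptions for illustration.
PYRAMID = [
    ("CPU registers / L1", 0.5e-9),  # assumed: ~1 clock at 2GHz
    ("L2 cache",           2.5e-9),  # assumed
    ("Main memory (fast)", 5e-9),    # article: ~5ns
    ("15K RPM FC disk",    5e-3),    # assumed: ~5ms seek + latency
    ("Commodity disk",     10e-3),   # article: ~10ms access
    ("Tape",               30.0),    # assumed: mount + seek, tens of seconds
]

for tier, seconds in PYRAMID:
    print(f"{tier:22s} {seconds:>10.1e} s")
```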

The pyramid is important because CPUs are voracious consumers of data. For example, Intel’s new Core 2 processors can issue up to four instructions per clock cycle. On a 2GHz processor, that is up to 8 billion instructions per second. Dual-core probably comes close to doubling that number, although actual instructions per clock are typically 2-3. Do the math: 2GHz means a 0.5 nanosecond clock. With dual processors averaging a total of 5 instructions per clock, you get 10 instructions per nanosecond. The very fastest RAM, which few of us use, is about 5ns, so every memory access means a 10 clock cycle stall. A disk with a 10ms access means a 20,000,000 clock cycle stall.
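A quick sketch of that arithmetic:

```python
# Reproducing the paragraph's math: how many clock cycles does the
# CPU stall while waiting on RAM or on disk?
CLOCK_HZ = 2e9                  # 2GHz -> 0.5ns per clock
clock_period_s = 1 / CLOCK_HZ   # 5e-10 s

ram_access_s  = 5e-9            # ~5ns for the fastest RAM
disk_access_s = 10e-3           # ~10ms disk access

print(ram_access_s / clock_period_s)    # 10 clock cycles
print(disk_access_s / clock_period_s)   # 20,000,000 clock cycles

# At ~5 instructions per clock across two cores, one disk access
# costs on the order of 100 million instruction slots.
print(5 * disk_access_s / clock_period_s)  # 1e8 instructions
```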

This huge I/O access cost is one of the key factors that led Intel to de-emphasize clock speed and focus on dual-core processors to grow performance. The storage couldn’t keep up with the CPU.

Even with the lower clock speeds we’re now seeing, feeding such processors is beyond what storage alone can do. Intelligent software design is required to ensure the greatest possible data locality and to reduce cache misses and disk accesses.
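As one example of what data locality means in practice, consider traversing an array in the order it is laid out in memory versus jumping across it. The sketch below is Python, which hides most of the hardware, so treat the timings as a hint of what compiled code would show rather than a cache benchmark:

```python
import time

# Same elements, two access orders: sequential (stride 1) versus
# jumping a whole "row" between touches (stride COLS).
ROWS, COLS = 1_000, 1_000
data = list(range(ROWS * COLS))   # flat, row-major layout

def row_major_sum():
    # Walks addresses sequentially: good locality, cache-friendly.
    total = 0
    for r in range(ROWS):
        for c in range(COLS):
            total += data[r * COLS + c]
    return total

def column_major_sum():
    # Jumps COLS elements between touches: poor locality.
    total = 0
    for c in range(COLS):
        for r in range(ROWS):
            total += data[r * COLS + c]
    return total

for fn in (row_major_sum, column_major_sum):
    start = time.perf_counter()
    fn()
    print(fn.__name__, f"{time.perf_counter() - start:.2f}s")
```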

For the foreseeable future, processor data demand will continue to outpace storage bandwidth, which brings us to the question of how best to use such blazingly fast processors.

Caging the Beast
IDCs employ hundreds of thousands of processors; Google alone will pass the 1,000,000 processor mark this year. So the architecture of the storage system that feeds those processors is a critical problem. Look at all the multiprocessor options, from dual-core CPUs to clusters (of several varieties), symmetric multiprocessors (SMP), and SIMD and MIMD machines, and it’s clear there is a lot of experimentation in how to create cost-effective multiprocessor architectures.

In their six-year-old paper, Gray and Shenoy write about the issues SMP systems face. SMP systems typically share resources among processors and run one copy of the operating system, which coordinates the processors’ work. Gray and Shenoy noted that getting good performance from massive SMPs is not easy. They then suggested:

An alternative design opts for many nodes each with its own IO and bus bandwidth and all using a dataflow programming model and communicating via a high-speed network [15]. These designs have given rise to very impressive performance, for example, the sort speed of computer systems has been doubling each year for the last 15 years through a combination of increased node speed (about 60%/year) and parallelism (about 40%/year). The 1999 terabyte sort used nearly 2,000 processors and disks [see sort benchmark].

Dataflow programming is a paradigm organized around the principle of “when all the inputs are present, start the program and ship the results as inputs to the next program”. This model works well for parallel processing, since the availability of data drives processing rather than some hopelessly complex master scheduling algorithm. Unlike the SMP model, each node has its own operating system and local resources, which confines contention to the high-speed switched LAN that interconnects the nodes.
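Here is a toy sketch of that firing rule; the node and port names are invented for illustration. The point is that the arrival of data, not a central scheduler, triggers each step:

```python
# A node fires as soon as all of its named inputs have arrived,
# then ships its result downstream.
class DataflowNode:
    def __init__(self, name, inputs, fn, downstream=None):
        self.name = name
        self.waiting = set(inputs)   # inputs not yet received
        self.received = {}
        self.fn = fn
        self.downstream = downstream or []

    def deliver(self, input_name, value):
        self.received[input_name] = value
        self.waiting.discard(input_name)
        if not self.waiting:         # all inputs present: fire
            result = self.fn(**self.received)
            print(f"{self.name} fired -> {result}")
            for node, port in self.downstream:
                node.deliver(port, result)

# Wire up: two independent scans feed a join; the join runs the
# moment its second input lands, with no master schedule.
join = DataflowNode("join", ["left", "right"],
                    lambda left, right: left + right)
scan_a = DataflowNode("scan_a", ["rows"], lambda rows: sum(rows),
                      downstream=[(join, "left")])
scan_b = DataflowNode("scan_b", ["rows"], lambda rows: sum(rows),
                      downstream=[(join, "right")])

scan_a.deliver("rows", [1, 2, 3])    # join waits...
scan_b.deliver("rows", [10, 20])     # ...now join fires
```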

Many-Little Beasts
They call this approach the “many-little scalable design” and note that this design

. . . leverage[s] the fact that mainframe:mini:commodity price ratios are approximately 100:10:1. That is, mainframes cost about 100 times more than commodity components, and semi-custom mini-computers have a 10:1 markup over commodity components . . . .

This is the approach taken by Google and, to a lesser extent, by Amazon with blade servers. No one has discovered a more cost-effective way to deliver internet-scale services.
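A back-of-envelope reading of those ratios, with per-box throughput numbers that are pure assumptions for illustration: even granting the big iron a hefty performance edge per box, the commodity node delivers each unit of work more cheaply.

```python
# Back-of-envelope on the 100:10:1 price ratios. The throughput
# figures are assumptions chosen for illustration, not benchmarks.
tiers = {
    # name: (relative price, assumed relative throughput per box)
    "mainframe": (100, 20),
    "mini":      (10,  4),
    "commodity": (1,   1),
}

for name, (price, throughput) in tiers.items():
    print(f"{name:10s} cost per unit of throughput: {price / throughput:5.1f}")
# mainframe 5.0, mini 2.5, commodity 1.0 -- buy many little boxes.
```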

Simplicity Matters
Processors are fast and getting faster. Interconnects other than gigabit Ethernet are expensive. So it makes sense that local resources, rather than SANs, are the infrastructure of choice for IDC deployment.

Next: The Storage Management Crisis, in Architecting the Internet Data Center: Pt. III. And this time I mean it!

Comments always welcome. And thank you Jim Gray for your comment on Part I of Architecting Internet Data Centers.