Update: Read the entire article on one page here.
Update: A major power problem at the building where Dreamhost and many other hosting and network firms are colocated kept StorageMojo.com down for several hours Friday. As luck would have it, the failure occurred just as I hit the publish button. My apologies for not being faster on the draw.
Our conversation about why Internet Data Centers (IDC) are architected as they are has covered, so far, I/O Cost and System Architecture and IDC Adaptations to Disk I/O Rationing. I was going to go into management today, and then another issue raised in the paper Rules of Thumb in Data Engineering by Jim Gray and Prashant Shenoy, caught my eye.
Feeding the Beast
A venerable concept in data storage is the storage pyramid. At the top of the pyramid is the fastest and most expensive storage and at the bottom is the slowest and cheapest. The taxonomy starts with the on-chip storage, such as registers, buffers, instruction, data and private L1 caches, perhaps an L2 cache and then moves off-chip to external caches, main memory (RAM), disk cache and finally the magnetized bits on a spinning disk. Disks have their own performance hierarchy, ranging from dual-ported, fibre channel, 2.5″ platter, 15,000 RPM speedsters with 16MB of cache all the way down to 3600 RPM 1″ microdrives that are no faster than USB thumb drives. Tape is still the slowest, but with the rise of 25x backup compression, it isn’t always the cheapest.
The pyramid is important because CPUs are voracious consumers of data. For example, Intel’s new Core 2 processors can issue up to four instructions per clock cycle. On a 2GHz processor, that is up to 8 billion instructions per second. Dual-core probably comes close to doubling that number – although actual instructions per clock are typically 2-3. Do the math: 2 Ghz = 0.5 nanosecond clock. With dual processors averaging a total of 5 instructions per clock you get 10 instructions per nanosecond. The very fastest RAM, which few of us use, is about 5ns. So every memory access means a 20 clock cycle stall. A disk with a 10ms access means a 20,000,000 clock cycle stall.
This huge I/O access cost is one of the key factors that led Intel to de-emphasize clock speed and focus on dual-core processors to grow performance. The storage couldn’t keep up with the CPU.
Even with the lower speeds we’re now seeing, feeding such processors is beyond what storage alone can do. Intelligent software design is required to ensure the greatest possible data locality and to reduce cache misses and disk accesses.
For the foreseeable future processor data demand will continue to outpace storage bandwidth. Which gets us to the issue of how best to use such blazing fast processors.
Caging the Beast
IDCs employ hundreds of thousands of processors. Google will pass the 1,000,000 processor mark this year. So the architecture of the storage system that feeds those processors is a critical problem. Take a look at all the multiprocessor options, from dual-core CPUs to clusters (of several different varieties), symmetric multi-processors (SMP), SIMD and MIMD machines, and it’s clear that there is a lot of experimentation about how to create cost-effective multi-processor architectures.
In their six year old paper, Gray and Shenoy write about the issues that SMP systems face. SMP systems typically share resources among processors and run one copy of the operating system, which coordinates the work the processors do. In the paper Gray and Shenoy noted that getting good performance from massive SMPs is not easy. They then suggested
An alternative design opts for many nodes each with its own IO and bus bandwidth and all using a dataflow programming model and communicating via a high-speed network [15]. These designs have given rise to very impressive performance, for example, the sort speed of computer systems has been doubling each year for the last 15 years through a combination of increased node speed (about 60%/year) and parallelism (about 40%/year). The 1999 terabyte sort used nearly 2,000 processors and disks [see sort benchmark].
Dataflow programming is a paradigm organized on the principle of “when all the inputs are present, start the program and ship the results as inputs to the next program”. This model works well for parallel processing, since the availability of data drives processing, not some hopelessly complex master scheduling algorithm. Unlike the SMP model, each of the nodes has its own operating system and local resources, which reduces contention only to the high-speed switched LAN that interconnects the nodes.
Many-Little Beasts
They call this approach the “many-little scalable design” and note that this design
. . . leverage[s] the fact that mainframe:mini:commodity price ratios are approximate 100:10:1. That is, mainframes cost about 100 times more than commodity components, and semi-custom mini-computers have a 10:1 markup over commodity components . . . .
This is the approach taken by Google and to a lesser extent by Amazon with blade servers. No one has discovered a more cost-effective method to deliver internet-scale services.
Simplicity Matters
Processors are fast and getting faster. Interconnects other than gigabit Ethernet are expensive. So it makes sense that local resources, rather than SANs, are the infrastructure of choice for IDC deployment.
Next: The Storage Management Crisis in Architecting the Internet Data Center: Pt. III And this time I mean it!
Comments always welcome. And thank you Jim Gray for your comment on Part I of Architecting Internet Data Centers.
My compliments on the “Architecting The Internet Data Center”, especially Pt III.
Why Pt III?
It has been my observation that the state-of-the-art of the “Enabling Technology” is several orders of magnitude ahead of IT thinking. You can always process more Information than you can supply.
I have been working on this from the Storage side. Hence my “Speed Limit of the Information Universe” is Storage oriented not Processing oriented. My “storage pyramid” is inverted from the venerable concept in your article and ordered by Content. Lots of important Information waits on less important Information, based on Content. Hence the inversion.
It strikes me your IDC model is OLTP focused. That’s nice work if you can get it. IMHO, there seems to be a lot more non-OLTP Processing in Ad-Hoc Information spaces. Of course there are all those server pages generated in the IDC?
My focus is on the Content of the Information entity or object. To get around the naming conflict I use the term “Managed Unit of Information” for the Content an Information entity or object.
What are the demands on this Managed Unit of Information?
One demand is Findability Search, Find and Obtain network requests.
Quoting from Wikipedia, who say it better than I, “The TCP software implementations on host systems require extensive computing power. Gigabit TCP communication using software processing alone is enough to fully load a 2.4 GHz Pentium 4 processor, resulting in little or no processing resources left for the applications to run on the system. As of 2006, very few consumer network interface cards support TOE.”
If the processor bandwidth is already exhausted by your OLTP scenario, what do I do for my needs?
I prayed to the network Gods for years for TCP/IP rewrites. Then I prayed to the Broadband God and he delivered Gigabit Ethernet. Then I prayed to the InfiniBand God without success. Now I pray to the TOE God along with the Access Density God. There is no TCP/IP God.
This may be why Google uses 100 Mb ethernet – at least as of three years ago. Wouldn’t surprise me if they still were – the 3x bandwidth improvement normally seen over fast ethernet hardly seems worth it.