Storage Cost and Implementation
The Internet Data Center (IDC) is architected very differently from an Enterprise Data Center (EDC). In an EDC, RAID arrays are used to hide the disk’s physical limitations. In the IDC the infrastructure is designed to work with those limitations to reduce complexity, increase availability, lower cost and optimize performance. It seems likely that at some point the EDC will have to follow suit.
As in Part I, I look at the six-year-old paper Rules of Thumb in Data Engineering by Jim Gray and Prashant Shenoy and relate its conclusions to the trends we see today in the IDC. The value of this exercise is that Rules looks at critical technology trends and draws logical conclusions about the resulting IT model we should be using. The IDCs stand as a test of the paper’s conclusions, enabling us to see how accurate and relevant the authors’ metrics are to the real world of massive-scale IT.
Disk and Data Trends
As Rules notes, disk trends are clear and quantifiable. For example, in 1981 DEC’s RP07 disk drive stored about 500 MB and was capable of about 50 I/Os per second (IOPS), or 1 IOPS for every 10 MB of capacity (it was also the size of a washing machine). The hot new Seagate 750 GB Barracuda 7200.10 is capable of 110 random IOPS, or about 1 IOPS for every 7 GB. So in 25 years, despite all the technology advances, this amazing device offers 1/700th the I/O performance per unit of capacity.
Looked at another way, in two and a half decades the ratio between disk capacity and disk accesses has been increasing at more than 10x per decade.
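For readers who want to check the arithmetic, here is the back-of-envelope version in Python, using only the drive figures quoted above:

    # Back-of-envelope check of the capacity vs. disk-access trend above.
    rp07_capacity_mb = 500               # DEC RP07, 1981
    rp07_iops = 50
    barracuda_capacity_mb = 750 * 1000   # Seagate 750 GB Barracuda 7200.10
    barracuda_iops = 110

    old_iops_per_mb = rp07_iops / rp07_capacity_mb            # ~0.1, i.e. 1 IOPS per 10 MB
    new_iops_per_mb = barracuda_iops / barracuda_capacity_mb  # ~0.00015, i.e. 1 IOPS per ~7 GB

    decline = old_iops_per_mb / new_iops_per_mb   # ~680x over 25 years
    per_decade = decline ** (10 / 25)             # ~14x per decade
    print(f"Decline: {decline:.0f}x overall, roughly {per_decade:.0f}x per decade")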
Gray and Shenoy conclude these trends imply two things. First, our data has become cooler; that is, there are far fewer accesses per block than in the past. Second, disk accesses are a scarce resource and have grown costlier, so disk I/Os must be spent wisely to optimize system performance.
IDC Adaptations to Disk I/O Rationing
IDC architectures reveal an acute sensitivity to disk I/O scarcity. Since Google has released the most detailed information about their storage, I’ll use them as the example. From the limited information available it appears the other IDCs use similar strategies, where possible, or simply throw conventional hardware at the problem, at great cost (see Killing With Kindness: Death By Big Iron for a detailed example).
Two I/O-intensive techniques are RAID 5 and RAID 6. In RAID 5, writing a block typically requires four disk accesses: two to read the existing data and parity and two more to write the new data and parity (RAID 6 requires even more). Not surprisingly, Google avoids RAID 5 and RAID 6 in favor of mirroring, typically keeping at least three copies of each chunk of data and many more if it is hot. This effectively increases the IOPS per chunk of data at the expense of capacity, which is much cheaper than additional bandwidth or cache.
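A toy count of physical disk operations, not a benchmark, makes the trade-off concrete: RAID 5 spends four accesses on every small write, while an N-way mirror spends one write per copy and can spread reads of a hot chunk across all N copies.

    # Toy model: physical disk I/Os per logical operation (not a benchmark).
    def raid5_small_write_ios():
        # read old data + read old parity + write new data + write new parity
        return 4

    def mirror_write_ios(copies=3):
        # one physical write per replica (Google-style triple mirroring)
        return copies

    def mirror_read_iops(per_disk_iops=110, copies=3):
        # any replica can serve a read, so a hot chunk gets copies * per-disk IOPS
        return per_disk_iops * copies

    print(raid5_small_write_ios())   # 4
    print(mirror_write_ios())        # 3
    print(mirror_read_iops())        # 330 random reads/sec for a 3-way mirrored chunk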
I/O rationing favors fast sequential I/O as well. As Gray and Shenoy note:
A random access costs a seek time, half a rotation time, and then the transfer time. If the transfer is sequential, there is no seek time, and if the transfer is an entire track, there is no rotation time. So track-sized sequential transfers maximize disk bandwidth and arm utilization. The move to sequential disk IO is well underway. . . . caching, transaction logging, and log-structured file systems convert random writes into sequential writes. This has already had large benefits for database systems and operating systems. These techniques will continue to yield benefits as disk accesses become even more precious.
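To put rough numbers on that quote, here is a simple service-time model: seek, plus half a rotation, plus transfer. The drive parameters are illustrative assumptions for a 7,200 RPM drive, not figures from the paper:

    # Simple service-time model: seek + half a rotation + transfer time.
    # Drive parameters are illustrative assumptions, not measured values.
    seek_ms = 8.5                     # average seek
    rotation_ms = 60_000 / 7200       # 8.33 ms per revolution at 7,200 RPM
    transfer_mb_per_s = 70.0          # sustained media rate

    def service_time_ms(transfer_kb, random=True):
        positioning = (seek_ms + rotation_ms / 2) if random else 0.0
        transfer = transfer_kb / 1024 / transfer_mb_per_s * 1000
        return positioning + transfer

    print(f"4 KB random read: {service_time_ms(4):.1f} ms")                      # ~12.7 ms
    print(f"512 KB sequential transfer: {service_time_ms(512, random=False):.1f} ms")  # ~7.1 ms

Even charging the seek and rotation only once, the random 4 KB read spends nearly all of its 12.7 ms positioning the arm, while the track-sized transfer spends all of its time moving 128 times as much data.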
Google specifically optimized GFS for large reads and writes. Nor did they stop there. They also append new data to existing files rather than synchronizing and coordinating the overwriting of existing data. This again optimizes the use of disk accesses at the expense of capacity.
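Here is a sketch of the append idea, not GFS code: new versions of a record go to the tail of a log file, so the disk keeps doing sequential writes and never reads back or rewrites existing blocks. Stale versions simply sit there until a later cleanup pass, trading cheap capacity for scarce accesses.

    # Sketch of append-only writes (illustrative, not GFS code).
    def append_record(path, record: bytes):
        # "ab" opens the file in append mode, so every write lands at the
        # current end of file; nothing already on disk is read or rewritten.
        with open(path, "ab") as f:
            f.write(len(record).to_bytes(4, "big"))  # simple length prefix
            f.write(record)

    append_record("chunk_00001.log", b"new version of the record")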
Conclusion
Gray and Shenoy’s paper is surprisingly successful in predicting key design elements of an I/O-intensive infrastructure as exemplified by Google and others. Yet they didn’t get everything right, although even their misses are instructive. Stay tuned.
Next: The Storage Management Crisis in Architecting the Internet Data Center: Pt. III
Comments
Regarding I/O costs and the characteristics of Google’s system, does Sun’s ZFS have the same (or similar) advantages? Most writes are in large chunks (usually 128 KB) and occur on any free block (they don’t overwrite the data they’re replacing), so you possibly save on seek time.
Good question, David. I thought about interweaving some of the ZFS info, and thought better of it since it isn’t, AFAIK, in use at any IDCs. Yet you are correct: ZFS uses a number of techniques to conserve I/Os, including:
In 10 years I think that 128k block limit will be one of the few things the team will wish they’d made much larger.
I believe the 128 KB block size is the current default. According to the tutorial presentation at:
http://www.sun.com/software/media/real/zfs_learningcenter/high_bandwidth.html
they can go up to 32 MB (I don’t remember off-hand which segment they mention it in). Given the design thought put into ZFS, I’d be very surprised if they had such arbitrary limits.
Doing some digging, the current minimum write size is 4 KB and the maximum is 128 KB, but both values are defined by variables that could easily be changed (ZIL_MIN_BLKSZ and ZIL_MAX_BLKSZ, respectively):
http://blogs.sun.com/roller/page/realneel?entry=the_zfs_intent_log
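To make those bounds concrete, here is a toy Python sketch, not the actual ZFS C source, of what clamping a log write between the two tunables looks like; the constant names are borrowed from the comment above.

    # Illustration only, not ZFS source: clamp a log write to the allowed range.
    ZIL_MIN_BLKSZ = 4 * 1024      # 4 KB minimum, per the comment above
    ZIL_MAX_BLKSZ = 128 * 1024    # 128 KB maximum, per the comment above

    def clamp_log_block(requested_bytes):
        # a write smaller than the minimum is rounded up,
        # and one larger than the maximum is capped
        return max(ZIL_MIN_BLKSZ, min(requested_bytes, ZIL_MAX_BLKSZ))

    print(clamp_log_block(512))      # 4096
    print(clamp_log_block(1 << 20))  # 131072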
David, good to hear it isn’t hardwired into ZFS. I guess I have to amend my comment to:
In 10 years the ZFS team will be real glad they made block size a variable.