It isn’t clear how serious the enterprise storage vendors and their customers are about reducing energy consumption. A server may have 4-8 cores, consuming 50 W when idle, attached to 8, 16 or even 24 drives each pulling 8 W at idle.

High-end FC drives, whose demise is widely predicted, may consume 12 W at idle. If vendors and customers are serious, storage is a good place to start.

But how?
A recent paper from Microsoft Research in Cambridge, Write Off-Loading: Practical Power Management for Enterprise Storage (pdf) by Dushyanth Narayanan, Austin Donnelly and Antony Rowstron, studies the issue. The traditional view is that enterprise workloads are too intense to generate savings by spinning down disks.

The team analyzed block level traces from 36 volumes in an enterprise data center and found that significant idle periods exist. They found that a technique they call write off-loading can save 60% of the energy used by enterprise disk drives.

Ring for the MAID
Main memory caches are good for handling reads, but their lack of persistence means they are not effective for writes. That is the impetus for the write off-loading technique.

Blocks intended for one volume are redirected to other storage in the data center: the disks are spun down during idle periods and any incoming writes are redirected. Blocks are off-loaded temporarily, for as much as several hours, and are reclaimed in the background after the home volume’s disks are spun up.

The team reports

Write off-loading modifies the per-volume access patterns, creating idle periods during which all the volume’s disks can be spun down. For our traces this causes volumes to be idle for 79% of the time on average. The cost of doing this is that when a read occurs for a non-off-loaded block it incurs a significant latency while the disks spin up. However, our results show that this is rare.

Locality of reference hasn’t gone away.

Yes, you can spin disks down in the enterprise
The Microsoft team used servers in their Cambridge research facility to measure volume access patterns. This isn’t hard-core OLTP but there are generic server functions such as user home directories, project directories, print server, firewall, Web staging, Web/SQL server, terminal server and a media server.

They acknowledge that for TPC-C and TPC-H benchmarks disks are too busy to benefit from write off-loading. Nonetheless, even OLTP systems have significant variations in their workloads. At night for example, traffic might be light enough to power down many array disks.

The team took a week’s worth of traces. The total number of requests was 434 million, with 70% reads. They found that peak loads were substantially higher than average loads. This over-provisioning enables the power savings of write off-loading.
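The core observation is that idle gaps in a block-level trace, if long enough, can pay for a spin-down. A minimal sketch of that analysis, assuming a trace reduced to request timestamps in seconds (the function name and threshold are illustrative, not from the paper):

```python
# Illustrative sketch: given a block-level trace of request timestamps
# (seconds), sum the gaps long enough to justify spinning a disk down
# and report what fraction of the traced interval they cover.

def idle_fraction(timestamps, min_gap=60.0):
    """Fraction of the traced interval spent in gaps >= min_gap seconds."""
    if len(timestamps) < 2:
        return 0.0
    ts = sorted(timestamps)
    total = ts[-1] - ts[0]
    idle = sum(b - a for a, b in zip(ts, ts[1:]) if b - a >= min_gap)
    return idle / total if total > 0 else 0.0

# A toy trace: a burst of requests, a ten-minute lull, another burst.
trace = [0, 1, 2, 3, 603, 604, 605]
print(idle_fraction(trace, min_gap=60.0))  # ~0.99: one 600 s gap out of 605 s
```

A real analysis would also subtract the spin-down/spin-up energy cost from each gap, but even this crude version shows why over-provisioned volumes have savings to harvest.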

They also found that the workload is read-dominated overall. Yet 19 of the 36 traced volumes saw 5 writes for every read.

How write off-loading works
A dedicated manager is responsible for each volume. The manager decides whether to spin the disks up or down, and also when and where to off-load writes.

The manager off-loads blocks to one or more loggers for temporary storage. The storage could be a disk or SSD, but the team only tested disk-based loggers.

Loggers support four remote operations: write, read, invalidate and reclaim. They write the blocks and the associated metadata, including the source manager’s identity, the logical block numbers and a version number.

The invalidate request includes the version number, and the logger marks the corresponding versions as invalid. A reclaim is like a read, except that the logger can return any valid range it is holding for the requesting manager.

Their implementation uses a log-based on-disk layout.
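The four logger operations over a log-structured store can be sketched in a few lines. This is a hypothetical in-memory model, not the paper’s on-disk implementation; a Python list stands in for the append-only log, and all names are illustrative:

```python
# Hypothetical sketch of a logger's four remote operations (write, read,
# invalidate, reclaim) over an append-only, log-structured store.

class Logger:
    def __init__(self):
        self.log = []  # append-only records, oldest first

    def write(self, manager_id, lbn, version, data):
        # Append the block plus its metadata: source manager, logical
        # block number, and version.
        self.log.append({"mgr": manager_id, "lbn": lbn,
                         "ver": version, "data": data, "valid": True})

    def read(self, manager_id, lbn):
        # Return the latest valid copy of the block, if any.
        for rec in reversed(self.log):
            if rec["valid"] and rec["mgr"] == manager_id and rec["lbn"] == lbn:
                return rec["data"]
        return None

    def invalidate(self, manager_id, lbn, version):
        # Mark versions up to `version` invalid; space is recovered lazily.
        for rec in self.log:
            if rec["mgr"] == manager_id and rec["lbn"] == lbn and rec["ver"] <= version:
                rec["valid"] = False

    def reclaim(self, manager_id):
        # Hand back all valid blocks held for the requesting manager.
        return [(rec["lbn"], rec["ver"], rec["data"])
                for rec in self.log if rec["valid"] and rec["mgr"] == manager_id]

lg = Logger()
lg.write("mgr-A", lbn=42, version=1, data=b"block")
print(lg.read("mgr-A", 42))  # b'block'
```

The append-only layout is what makes off-loaded writes fast: the logger never seeks to update in place, it just appends and cleans up later.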

The manager determines when to off-load blocks and when to reclaim them, while ensuring consistency and performing failure recovery. The manager fields all read and write requests, handing them off to loggers and/or caches as needed.
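The manager’s request path can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper’s implementation: a dict stands in for both the home volume and the remote logger, versioning and failure recovery are omitted, and all names are made up:

```python
# Simplified sketch of a per-volume manager's read/write path.
# A dict stands in for the home volume and for remote logger storage.

class Manager:
    def __init__(self, volume):
        self.volume = volume    # lbn -> data; stands in for the home volume
        self.logger = {}        # lbn -> data; stands in for a remote logger
        self.spun_up = True
        self.offloaded = set()  # blocks whose latest copy is on the logger

    def write(self, lbn, data):
        if self.spun_up:
            self.volume[lbn] = data  # normal path: write to the home volume
        else:
            self.logger[lbn] = data  # off-load while the disks are spun down
            self.offloaded.add(lbn)

    def read(self, lbn):
        if lbn in self.offloaded:
            return self.logger[lbn]  # latest copy lives on the logger
        if not self.spun_up:
            self.spun_up = True      # rare, slow path: spin the disks up
        return self.volume.get(lbn)

    def reclaim(self):
        # Background reclaim once the home volume is spun up again.
        self.spun_up = True
        self.volume.update(self.logger)
        self.logger.clear()
        self.offloaded.clear()

m = Manager({1: "old"})
m.spun_up = False     # simulate the volume's disks being spun down
m.write(2, "new")     # goes to the logger, not the home volume
print(m.read(2))      # "new", served from the logger without spinning up
```

Note the asymmetry the paper relies on: writes never force a spin-up, while reads of non-off-loaded blocks do, which is why the technique wins only when such reads are rare during idle periods.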

Write off-loading is vulnerable to 10-15 second delays when a read forces a disk to spin up; 1% of the read requests had a response time of more than 1 second.

The write performance is equivalent to array performance in 99.999% of the cases. Here’s a figure that gives results for the “least idle” servers.

The tested configurations:

  • baseline: Volumes are never spun down. This gives
    no energy savings and no performance overhead.
  • vanilla: Volumes spin down when idle, and spin up
    again on the next request, whether read or write.
  • machine-level off-load: Write off-loading is enabled,
    but managers can only off-load writes to loggers running
    on the same server: here the “server” is the original
    traced server, not the test bed replay server.
  • rack-level off-load: Managers can off-load writes to
    any logger in the rack.

And this differs from MAID how?
In a massive array of idle disks (MAID), a small number of the disks are kept spinning to act as a cache while the rest are spun down. This requires additional disks per volume. Copan Systems claims power savings of 75% with their “enterprise MAID” product. [Note to Copan – I’d be happy to have you compare your approach in the comments.]

Write off-loading does not require additional disks per volume or new hardware. The technique can use any unused data storage on the LAN.

The StorageMojo take
Can write off-loading become a viable commercial product? If Microsoft were to commercialize it in Windows Server at a low price, it certainly could. Given Redmond’s general reluctance to productize Microsoft Research concepts, I wouldn’t expect anything soon. Too bad.

What this also underscores is how tightly coupling storage and server architectures can yield cost-effective solutions with unique benefits. The ability to relax some constraints of the (increasingly atypical) “typical” enterprise data center workload shows what can be accomplished through creative architecture.

As the leading OS vendor, Microsoft has an unparalleled opportunity to bring these ideas to market and create functional differentiation with Linux. I hope someone with clout in Redmond is looking at this.

Comments welcome, of course. What could be more appropriate in an era of massive write-offs?