Write off-loading enterprise storage

by Robin Harris on Sunday, 20 July, 2008

It isn’t clear how serious the enterprise storage vendors and their customers are about reducing energy consumption. A server may have 4-8 cores consuming 50 W when idle, attached to 8, 16 or even 24 drives, each pulling 8 W at idle.

High-end FC drives, whose demise is widely predicted, may consume 12 W at idle. If they are serious, storage is a good place to start.

But how?
A recent paper from Microsoft Research in Cambridge, Write Off-Loading: Practical Power Management for Enterprise Storage (pdf) by Dushyanth Narayanan, Austin Donnelly and Antony Rowstron, studies the issue. The traditional view is that enterprise workloads are too intense to generate savings by spinning down disks.

The team analyzed block level traces from 36 volumes in an enterprise data center and found that significant idle periods exist. They found that a technique they call write off-loading can save 60% of the energy used by enterprise disk drives.

Ring for the MAID
Main memory caches are good for handling reads, but their lack of persistence means they are not effective for writes. That is the impetus for the write off-loading technique.

Blocks intended for one volume are redirected to other storage in the data center. While the disks are spun down, incoming writes are redirected, so even write-intensive periods need not spin them up. Blocks are off-loaded temporarily, for as much as several hours, and are reclaimed in the background after the home volume’s disks are spun up.
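The core write-side idea can be sketched in a few lines of Python. This is a toy model with illustrative names, not the paper’s implementation: while a volume’s disks are spun down, writes go to a logger elsewhere in the data center; on spin-up, the off-loaded blocks are reclaimed in the background.

```python
class Logger:
    """Stand-in for temporary storage elsewhere in the data center."""
    def __init__(self):
        self.blocks = {}

    def store(self, lbn, data):
        self.blocks[lbn] = data

    def drain(self):
        # Hand back everything held for the volume, then forget it.
        items = list(self.blocks.items())
        self.blocks.clear()
        return items


class OffloadingVolume:
    """Toy sketch of write off-loading for one volume."""
    def __init__(self, logger):
        self.spun_up = True
        self.home = {}          # logical block number -> data on the home volume
        self.logger = logger

    def spin_down(self):
        self.spun_up = False

    def spin_up(self):
        self.spun_up = True
        # Background reclaim: pull off-loaded blocks back to the home volume.
        for lbn, data in self.logger.drain():
            self.home[lbn] = data

    def write(self, lbn, data):
        if self.spun_up:
            self.home[lbn] = data
        else:
            self.logger.store(lbn, data)   # off-load instead of spinning up
```

A write that arrives while the disks are down never forces a spin-up; the block simply lands on the logger until the home volume comes back.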

The team reports:

Write off-loading modifies the per-volume access patterns, creating idle periods during which all the volume’s disks can be spun down. For our traces this causes volumes to be idle for 79% of the time on average. The cost of doing this is that when a read occurs for a non-off-loaded block it incurs a significant latency while the disks spin up. However, our results show that this is rare.

Locality of reference hasn’t gone away.

Yes, you can spin disks down in the enterprise
The Microsoft team used servers in their Cambridge research facility to measure volume access patterns. This isn’t hard-core OLTP but there are generic server functions such as user home directories, project directories, print server, firewall, Web staging, Web/SQL server, terminal server and a media server.

They acknowledge that for TPC-C and TPC-H benchmarks disks are too busy to benefit from write off-loading. Nonetheless, even OLTP systems have significant variations in their workloads. At night for example, traffic might be light enough to power down many array disks.

The team took a week’s worth of traces. The total number of requests was 434 million, with 70% reads. They found that peak loads were substantially higher than average loads. This over-provisioning enables the power savings of write off-loading.

They also found that the overall workload is read-dominated. Yet 19 of the 36 traced volumes had 5 writes for every read.

How write off-loading works
A dedicated manager is responsible for each volume. The manager decides whether to spin the disks up or down, and also when and where to off-load writes.

The manager off-loads blocks to one or more loggers for temporary storage. The storage could be a disk or SSD, but the team only tested disk-based loggers.

Loggers support four remote operations: write, read, invalidate and reclaim. They write the blocks and the associated metadata, including the source manager’s identity, the logical block numbers and a version number.

The invalidate request includes the version number, and the logger marks the corresponding versions as invalid. A reclaim is like a read, except the logger can return any valid range it is holding for the requesting manager.

Their implementation uses a log-based on-disk layout.
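Those four operations can be sketched against an append-only log, mimicking the log-based layout with an in-memory list. The record fields here are assumptions for illustration, not the paper’s on-disk format:

```python
class Logger:
    """Sketch of a logger's four remote operations over an append-only log."""

    def __init__(self):
        # Each appended record: source manager, logical block number,
        # version, data, and a validity flag.
        self.log = []

    def write(self, manager_id, lbn, version, data):
        # Append the block plus its metadata; never overwrite in place.
        self.log.append({"mgr": manager_id, "lbn": lbn,
                         "ver": version, "data": data, "valid": True})

    def read(self, manager_id, lbn):
        # Return the most recently written valid copy of the block, if any.
        for rec in reversed(self.log):
            if rec["valid"] and rec["mgr"] == manager_id and rec["lbn"] == lbn:
                return rec["data"]
        return None

    def invalidate(self, manager_id, lbn, version):
        # Mark the named version (and any older ones) invalid;
        # the space can be garbage-collected later.
        for rec in self.log:
            if (rec["mgr"] == manager_id and rec["lbn"] == lbn
                    and rec["ver"] <= version):
                rec["valid"] = False

    def reclaim(self, manager_id):
        # Like a read, but the logger may return any valid record it is
        # holding for the requesting manager.
        for rec in self.log:
            if rec["valid"] and rec["mgr"] == manager_id:
                return rec["lbn"], rec["ver"], rec["data"]
        return None
```

After the home volume reclaims and invalidates everything, the logger holds nothing valid for that manager and the log space can be recycled.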

The manager determines when to off-load blocks and when to reclaim them, while ensuring consistency and performing failure recovery. The manager fields all read and write requests, handing them off to loggers and/or caches as needed.
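A minimal sketch of that dispatch logic, with hypothetical names and the logger reduced to an in-memory dict for brevity: reads for off-loaded blocks are served by the logger, while a read for a non-off-loaded block on a spun-down volume pays the spin-up cost and triggers reclaim.

```python
class Manager:
    """Sketch of a per-volume manager's read/write dispatch (illustrative)."""

    def __init__(self):
        self.volume = {}        # home volume blocks: lbn -> data
        self.logger = {}        # stand-in for a remote logger: lbn -> data
        self.spun_up = True
        self.spin_ups = 0       # count the expensive spin-up events

    def write(self, lbn, data):
        if self.spun_up:
            self.volume[lbn] = data
        else:
            self.logger[lbn] = data     # off-load rather than spin up

    def read(self, lbn):
        if lbn in self.logger:
            return self.logger[lbn]     # latest copy is off-loaded; no spin-up
        if not self.spun_up:
            self.spun_up = True         # rare path: pay the 10-15 s spin-up
            self.spin_ups += 1
            self._reclaim()
        return self.volume.get(lbn)

    def _reclaim(self):
        # Background reclaim once the home disks are spinning again.
        self.volume.update(self.logger)
        self.logger.clear()
```

The consistency burden in the real system (versioning, failure recovery, multiple loggers) lives exactly where this sketch keeps its `logger` dict.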

Performance
Write off-loading is vulnerable to 10-15 second delays when a read forces a disk to spin up. 1% of the read requests had a response time of more than 1 second.

Write performance is equivalent to array performance in 99.999% of cases. Here’s a figure that gives results for the “least idle” servers.


The tested configurations:

  • baseline: Volumes are never spun down. This gives no energy savings and no performance overhead.
  • vanilla: Volumes spin down when idle, and spin up again on the next request, whether read or write.
  • machine-level off-load: Write off-loading is enabled, but managers can only off-load writes to loggers running on the same server; here the “server” is the original traced server, not the test bed replay server.
  • rack-level off-load: Managers can off-load writes to any logger in the rack.

And this differs from MAID how?
In a massive array of idle disks (MAID), a small number of the disks are kept spinning to act as a cache while the rest are spun down. This requires additional disks per volume. Copan Systems claims power savings of 75% with their “enterprise MAID” product. [Note to Copan – I’d be happy to have you compare your approach in the comments.]

Write off-loading does not require additional disks per volume or new hardware. The technique can use any unused data storage on the LAN.

The StorageMojo take
Can write off-loading become a viable commercial product? If Microsoft were to commercialize it in Windows Server at a low price it certainly could. Given Redmond’s general reluctance to productize Microsoft Research concepts, I wouldn’t expect anything soon. Too bad.

What this also underscores is the continued value of tightly coupling storage and server architectures to build cost-effective solutions with unique benefits. The ability to relax some constraints of the (increasingly atypical) “typical” enterprise data center workload shows what can be accomplished through creative architecture.

As the leading OS vendor, Microsoft has an unparalleled opportunity to bring these ideas to market and create functional differentiation with Linux. I hope someone with clout in Redmond is looking at this.

Comments welcome, of course. What could be more appropriate in an era of massive write-offs?


Ryan Malayter July 21, 2008 at 5:48 am

Wouldn’t variable-speed disks be a much cleaner solution to the same problem? That way a 15K disk could spin down to something really slow and low-power… say 600 rpm during idle periods, and then speed up as necessary. The latency penalty for reads would then be much smaller (and much less problematic for, say, an OLTP application).

I would imagine most of the power used by an HDD is used by the motors, not by the electronics, right?

Wes Felter July 21, 2008 at 12:05 pm

IMO write offloading across a network is already obsolete; you might as well use a writeback flash cache.

Robin Harris July 21, 2008 at 12:37 pm

Ryan, the paper addressed the variable speed disk idea thusly:

DRPM [12] and Hibernator [32] are recently proposed approaches to save energy by using multi-speed disks (standard enterprise disks spin at a fixed rate of 10,000 or 15,000 rpm). They propose using lower spin speeds when load is low, which decreases power consumption while increasing access latency. However, multi-speed disks are not widely deployed today in the enterprise, and we do not believe their use is likely to become widespread in the near future.

Maybe the disk vendors are getting ready to do this, but I think they need their OEMs to buy in. That will take a couple of years.

Wes, the paper noted they didn’t investigate SSDs but that there was no reason, in principle, that they wouldn’t work. The advantage of a network cache is that unlike a server-based cache it can still be accessed if the server goes down. With a network cache you can afford to build in all the HA features. Other thoughts?

Robin

Bill Todd July 21, 2008 at 9:02 pm

The whole idea seems pretty silly given the numbers that you quoted: if 1% of the read accesses take close to 1000 times as long as a typical read access (because the disk has to be spun up: if you look at the graph, most of the accesses that took more than 1 second took *a lot* more), then overall throughput is down by close to 90% – and you’d be far, far better off using 7200 RPM SATA disks to drop power requirements by about the same amount as the saving claimed while taking at most a 50% throughput hit – or taking advantage of the far higher storage density available on SATA drives to use fewer of them and trade more throughput for efficiency.

Even better, use commodity 2.5″ drives to cut power requirements by closer to 90% of what those ‘enterprise’ drives use while retaining not all that much less than 50% of the throughput – more, if you use a few more of them to spread random accesses across.

Not to mention not having to explain to your users why some of their requests take so long to complete.

If increasing data center storage power efficiency was the goal, framing the question as “How do we do this using enterprise disks?” seems to have been far too narrow. And once current enterprise disks have been replaced by far more efficient ones, whether taking the same kind of drastic throughput hit just to reduce power consumption *another* 60% would seem questionable – though letting half of each mirrored pair of RAID-1 drives spin down during periods of low demand might be worthwhile, since that wouldn’t expose readers to the kind of 10-second delays that the described approach does.

– bill

Anders Gregersen July 22, 2008 at 4:16 am

Most seem unwilling to sacrifice performance for power savings. We have all grown accustomed to high performance from our disk systems without really looking at the power consumption (or heat generation). Now we see that power consumption is a parameter equal in priority to performance, and you can’t do that without some sort of sacrifice. Most disk systems are oversized for their task (let’s call it an investment in future growth) and now we need to plan in greater detail for the future, only investing in what we know, and not buying disk systems that can perform at a level the business really does not need. Most don’t even know how many IOPS we need or what our systems cost in $/IOPS or Watts/IOPS – perhaps a future measure for purchasing?

Robin Harris July 22, 2008 at 5:46 am

Anders, good point! That is why I am as yet unconvinced that “green” IT is a real priority. If you are bumping up against your power company – ok, you’ll change – but otherwise, power is pretty far down the list.

When order processing backs up on the last day of the quarter who is going to give a hoot about IT’s power conservation program? There is very little upside for IT to go green.

Robin

Jeff Darcy July 22, 2008 at 7:12 pm

Not a bad idea, really, but if done at significant scale it will run into some of the same problems as other things (e.g. storage virtualization, CDP) that fork or redirect writes – maintaining the indices so that you can satisfy reads from the alternate location, and/or so that since-overwritten blocks can be reclaimed. That’s where the real fun tends to be.

As for “green IT” not being a real priority, I think you’ll find it’s another thing filtering down from HPC into other markets. The guys who run the really big computers are already very well aware that they will probably spend more on power and cooling than on the equipment itself. These guys care about performance as much as anyone, and they know that to get better performance they need to overcome current limits on system size. Why do you think Roadrunner or Blue Gene are based on power-efficient processors instead of Intel heat pumps?

Sure, there are a lot of people paying lip service to “green IT” and there’s little upside to providing the same functionality with the same physical plant in a greener fashion, but there’s very significant upside in meeting *new* needs without having to build a new data center. I’ve seen customers base purchase decisions on these factors. It’s a real priority for some real people, even if they’re not the people where you hang out.

Ryan Malayter July 22, 2008 at 8:22 pm

Another major issue is that this technique works well with DAS, but ignores the realities of the SAN. Volumes on most popular SAN platforms are not a dedicated set of disks, but an entire pool of disks (and sometimes even a pool of controllers). You can’t spin down a SAN array when almost every volume in the datacenter is striped onto that RAID set – there are almost no idle or write-mostly periods.

Steve Jones July 24, 2008 at 4:27 am

As many have pointed out, it’s reads that are the problem – writes can be dealt with by write-back. As there is a huge difference between the access time on a “spun up” disk (about 5ms) and perhaps 2,000 x that on a “spun-down” disk, then it only needs a tiny percentage of I/Os to be so affected to make a massive difference on the average access time. Frankly this approach could only work if the workload pattern is such that “spin up” requirements are very rare indeed. That begins to sound something much closer to an archival-type requirement than general purpose, shared storage, for which this sounds wholly unsuited.

An alternative approach might be to use very large disks with multiple copies of the data. Then only spin up enough copies to support the current I/O demand requirement. Of course that will build up a whole lot of write-back operations which will be required when the devices are eventually “spun up”. However, a suitable storage array with non-volatile cache and dirty data markers could deal with this and do a “lazy-resync” to flush out cache data.

In effect there would be a series of mirrored disks, but it would be possible (in principle) to just have one of these online at any time due to the persistence in the array of the write-back cache. From time to time the array logic could spin the drives up and stage out the data to avoid cache exhaustion. Of course if there’s a huge data churn then that approach won’t work, but then the array will simply power up all the drives anyway. It’s during “quiet” periods that this would save power.

There is the objection that there are many copies of data held, but as we already have to do that for hardware redundancy reasons, then we already have to allow for some of that. Of course this solution envisages (potentially) many more copies, but as many of us know to our cost, the reducing cost per GB of storage is not matched by an increase in IOP capability. With 1.5TB (and shortly 2TB) disks available, then this might be a way of being able to exploit these huge empty spaces.

Another thing to note is that queuing theory tells you that with a single-queue, multiple-server model (such as you’d get with this principle) you can drive the back-end devices to a much higher level of utilisation without adversely affecting service times. The effect of that is that you can eke more IOPs out of a given set of disks by having multiple mirrors and still get good service time (at the expense of space utilisation due to the multiple copies). Such a design would work best for workload patterns characterised by high read-to-write ratios.

However, I’m not wholly convinced of all this – I think something more dramatic in the form of SSD is going to be required as, for high-performance storage, rotating disks are proving to be a major bottleneck. That goes for the power too, although I would be interested to know how much of the power in an enterprise array comes from the rotating disks, and how much from the (necessarily) very powerful controllers and electronics.

Chris Santilli, CTO, COPAN Systems August 8, 2008 at 7:24 am

Write-offloading looks like an interesting and compelling approach for transactional enterprise storage, where writes can be buffered before being sent to destination LUNs. As mentioned in the whitepaper, a solid state disk cache is very effective at read caching. There remains the issue of “locality of reference” that is not solved with write-offloading. In fact, there may exist more conditions in which write-offloading would create random access due to the locality of the initial write. COPAN Systems’ MAID platform is purpose-built for persistent data in the enterprise – for backup, archive or tiered storage applications – where data access is regular intensive sequential write, followed by occasional random read of small or large objects.

COPAN Systems Enterprise MAID uses disk power-off to massively reduce power consumption by switching disks off, rather than spinning down the drives but keeping them electrically active. The Enterprise MAID power consumption feature allows for the highest disk density packaging in the industry: 896 disk drives in a single cabinet. At any one time no more than 25% of user data disks in the system are powered on (although all disk devices always appear available). By limiting the maximum spin-up ratio, the total maximum power requirement is permanently reduced to a minimum. This gives not only utility power savings but also savings in associated cooling. The saving is further enhanced by the reduction in data-centre overhead power costs (UPS, batteries etc), which could amount to a further 40% of actual load split across all the services in the datacentre.

COPAN Systems uses an Always On Region™ (AOR) to cache data destined for MAID storage that is frequently accessed (for read or write). This ensures that application metadata is given a fast response, just as in any traditional storage environment. The AOR is used to map specific portions of disk LUNs to disks that are always spinning. This simplifies locating a persistent application on an Enterprise MAID subsystem, as metadata can also be located in the same array.

Some estimates suggest that up to 80% of all data stored in enterprise applications is persistent or inactive, not currently supporting business operations, but essential for compliance, audit and future use. If 80% of the data in an organization was able to be identified and migrated to a COPAN Systems MAID platform, the power savings could far exceed that of write-offloading alone. If COPAN Systems MAID was deployed for persistent data, whilst write-offloading was deployed for transactional storage (potentially with SSDs as the offload target), the overall total benefit, in power cooling and infrastructure costs could be truly huge.

COPAN Systems offers solutions today with 896TB of raw storage in a single footprint with a total power consumption of 7.3 kW. This equates to 8.14 watts per TB. Compare this to 12 W for a 300GB Cheetah (40 watts per TB); the COPAN Systems MAID platform is approximately 5x more power efficient. This includes all of the controllers necessary to drive large disk pools, which are excluded from the comparison for DAS disks. It should be noted that RAID protection differences would have to be considered too: smaller RAID groups equate to better power management, while larger RAID groups lessen the benefit.

Power savings in storage can only be achieved by maximizing disk sizes and switching disks fully off when possible. Identifying persistent data and moving it from transactional storage subsystems to a MAID platform (COPAN Systems is often up to 6x more dense than transactional storage), and then maximizing efficiency in transactional storage through caching and write-offloading, would seem to be a good strategy for dealing with massive data growth.

In addition, there have been many studies of power-managed disk drives (e.g. http://csl.cse.psu.edu/publications/ispass03.pdf ) whose focus is the access patterns to power-managed disk arrays. In the research paper referenced above, Penn State and IBM Research determined that using power-managed disk drives for Tier1/Tier2 applications would actually consume more power. This is because it takes more power (18-20 W) to spin up a disk drive; if the frequency of spin up/down is too high, the drives consume more power.

Write-offloading is a very interesting technique for minimizing disk spin-up in transactional environments. COPAN Systems believes that minimizing the data in transactional environments, by relocating data written once, occasionally accessed, and never changed, to a purpose-built platform could deliver even further benefits before write-offloading is considered.

Disk spin-down cannot be implemented in isolation. Thought must be given to data access patterns, data protection schemes, and disaster recovery. Any environment where spin-down is used, as opposed to imposing a hard limit on disk use, requires an infrastructure that can deliver at full load, with datacentre overhead. Spin-down does not address these issues, which may come to represent a far higher cost than the savings from drive spin-down alone. When used in a persistent environment, massive savings can be accrued with MAID platforms with a hard power budget.
