Why do we chill data centers?

by Robin Harris | Friday, April 10, 2009 | Enterprise, Management | 11 comments

James Hamilton pointed to an Intel brief titled Reducing Data Center Cost with an Air Economizer that cost-conscious folks will want to look at. The concept: let’s take vendors at their word on temperature specs and use free outside air to cool a data center.

Google reported that temperatures below 40C (104F) did not affect drive life. So the Intel researchers – Don Atwood, an Intel regional data center manager and John G. Miner, a senior systems engineer – conducted a proof of concept test using 900 servers in a 1,000 sq. ft. trailer.

They divided the trailer into two compartments, each housing 8 racks with a total of 448 blades and a power density over 200 watts per sq. ft. They air-conditioned one compartment, and for the other they

. . . used essentially the same air-conditioning equipment, but with modifications that enabled it to operate as an economizer by expelling hot air to the outdoors and drawing in 100 percent outside air for cooling.

The economizer used outside air and only kicked in to heat or cool when the air temp was less than 65F or more than 90F. They didn’t control for humidity – not a problem in their temperate desert location – and only used a standard household air filter to remove large particles.

Their cheap air conditioner let temperature vary from 64F to over 92F. Humidity varied from 90% to 4%. And the servers got covered in dust.

Results
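The air-conditioned space had a server failure rate of 2.45% and the naked space 4.46%. The main data center’s rate over the same period was 3.83%, so the air-conditioned compartment actually did better than the main data center. The authors called the difference between the economizer compartment and the main data center “minimal” and not knowing the statistics I have to take them at their word.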
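
For readers who want a rough check on whether the gap between the two compartments is more than noise, here is a minimal sketch of a two-proportion z-test. It assumes each compartment held 448 blades and that the reported rates correspond to 11 and 20 failed servers (11/448 is about 2.45% and 20/448 is about 4.46%). Those counts are inferred from the percentages, not taken from the Intel brief, and the function name is purely illustrative.

```python
import math

def two_proportion_z(fail_a, n_a, fail_b, n_b):
    """Two-sided z-test for a difference between two failure proportions."""
    p_a = fail_a / n_a
    p_b = fail_b / n_b
    # Pooled failure rate under the null hypothesis of "no real difference".
    pooled = (fail_a + fail_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# ASSUMED counts: 11 of 448 (air-conditioned) vs. 20 of 448 (economizer).
z, p_value = two_proportion_z(11, 448, 20, 448)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

Under those assumptions z comes out around 1.6 with a two-sided p-value near 0.1, so the gap would not clear the usual 5% significance bar, though a sample this small can’t rule out a real effect either.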

The power savings are impressive:

Based on our 74 percent measured decrease in power consumption when using the economizer during the PoC, and assuming that we could rely on the economizer 91 percent of the year, we could potentially save approximately 67 percent of the total power used annually for cooling . . . .

. . . In a larger 10-MW data center, the estimated annual cost reduction would be approximately USD 2.87 million.
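
The 67 percent figure in the quote is just the product of the two numbers before it. A minimal sketch of that arithmetic, assuming the economizer saves nothing during the part of the year it can’t be used (the variable names are mine, not Intel’s):

```python
# Figures quoted from the Intel brief.
measured_reduction = 0.74        # drop in cooling power while the economizer runs
economizer_share_of_year = 0.91  # fraction of the year the economizer is usable

# ASSUMPTION: zero savings for the remaining 9 percent of the year.
annual_savings = measured_reduction * economizer_share_of_year
print(f"Estimated annual cooling power savings: {annual_savings:.1%}")
```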

Plus think of the HVAC gear you wouldn’t need to buy or maintain.

The StorageMojo take
IT gear has grown more rugged over the years. Enterprise disks used to have a 25,000 hour MTBF – and vendors bragged on it.

The combination of plunging hardware prices, improved availability through software and increasing energy prices means it is time to examine the assumptions of 40 years ago. Even if free air cooling increased server mortality, $3 million will buy a lot of servers.

Intel will continue testing this concept with a larger PoC. Kudos to Don, John and Intel for this ground-breaking work.

Update: I misinterpreted the server failure rates and corrected the data and conclusion above. Thanks to alert reader Paul for spotting the error. End update.

Courteous comments welcome, of course.

11 Comments

Rex on Monday, 13 April, 2009 at 8:28 am

Lots of good presentations at Google’s Efficient Data Centers Summit held April 1, 2009, available on YouTube:
http://www.youtube.com/watch?v=Ho1GEyftpmQ (2+ hours!)
http://www.youtube.com/watch?v=m03vdyCuWS0
http://www.youtube.com/watch?v=91I_Ftsd-7s (2+ hours!)

Relevant to chilling data centers:

— Google runs containerized data centers at 81 F.

— James Hamilton wonders why ASHRAE and IT folks insist on restricted temperature ranges for equipment. Dell guarantees servers to 95 F; Telco NEBS requires operations to 104 F; major server components are speced to work to 122 F – 140 F. With the latter specs, you could run your IT equipment in the hottest place on earth — without “air conditioning”. Hamilton’s slides are here:
http://mvdirona.com/jrh/TalksAndPapers/JamesHamilton_Google2009.pdf

=====

From 1987 to 2006, I ran lots of off-the-shelf IT equipment in unconditioned industrial spaces on south San Francisco Bay. Our failure rates were no higher than identical equipment in data centers and air-conditioned offices a few miles away. When we opened up one five-year-old router to upgrade RAM, we had to scrape gypsum dust off the motherboard and DIMMs to see what we were doing — thanks to a large, open pile of gypsum dust next door. Our equipment was obsolete before it was killed by heat, humidity, salt air, gypsum dust, etc.

=====

I learned IT in the days of glass-walled mainframes and punched cards, bringing a down parka to work in Mojave desert summers, because of the icy hurricane blowing through the computer room. Too much of what we do in IT is based on hearsay and tradition, especially for data center design. I’m as guilty as anyone.

Ironic for a field that often calls itself “computer science”.
RC on Monday, 13 April, 2009 at 11:10 am

If you had a data center in each hemisphere, you could use the one that is in cool season.

Using cool outdoor air is happening in green data centers:
http://www.sabey.com/real_estate/about_fiber.html

I’ve seen several problems when people use more power in the data center than their PDU, UPS, and/or generators can supply during an outage. I worry that people will creep past the amount of cooling that their heat pumps can provide when a hot spell comes along.

I looked at the Intel paper, and I can’t tell if power cycling servers was included in the test. When a data center gets too hot, typically few systems fail outright. But if you lose power at the end of a hot time, systems fail at start up. Thermal cycling will bring out every flaky solder joint, every bad power supply component, and every sticky sleeve bearing in the shop.

Also, replacing every server once a year may save on power, but would the replaced gear go to the landfill? Has anyone committed to rebuilding or recycling the old equipment?
TimC on Monday, 13 April, 2009 at 6:10 pm

The problem is they only ran the test over 10 months. That isn’t long enough to cover things like fans wearing out prematurely from dust. They also don’t specify what “failure” is. Do they mean “complete server failure”? Do they mean “CPU fan failure”?

ANY downtime is a *failure* where I come from; if they’re not counting that, I’d say their numbers are more than slightly skewed. And again, this would need something closer to a 3 year run-rate to be anywhere near realistic. I don’t know too many companies that rip and replace gear after 10 months.
Paul on Friday, 17 April, 2009 at 8:29 am

I enjoy your highlighting unconventional wisdom in IT…thanks! More details would help: e.g., failure modes, whether each blade included a disk drive, differences with the main data centers.

BTW, please check your numbers in Results. I see Intel’s cooled-side (68 F) failure rate in the study as 2.45%, lower than their main data center rate of 3.83%. So the 4.46% rate for the economizer side (65-90 F) is actually higher than both.
Robin Harris on Friday, 17 April, 2009 at 5:48 pm

Paul, good catch. I misread their conclusion on server failure rates:

Despite the dust and variation in humidity and temperature, there was only a minimal difference between the 4.46 percent failure rate in the economizer compartment and the 3.83 percent failure rate in our main data center over the same period. The failure rate in the trailer compartment with DX cooling was 2.45 percent, actually lower than in the main data center.

I’ll correct it in the post. Thanks!

Robin
Jonathan on Thursday, 30 April, 2009 at 6:56 am

When looking at cooling capacity, it’s important to look at more than ambient room temperature. Make sure that equipment is cooled evenly all of the way up the rack.

The reality is that you can run things more efficiently if you have total control over the environment. Google is targeting ~70 degrees in some of their data centers at this point.

In situations where more than one group has access to the facility, it’s unlikely that you will have the COBS (control over bullshit) to evenly cool all of your servers. In many traditional rack designs, the top 2/3 of servers in a rack are significantly undercooled. Having temperature measurements all the way up the rack will provide you with a much better picture of what will work for you.

HotLok blanking panels are one easy way to increase cooling efficiency and get this data at the same time.

Best.
Jonathan
Rick Cockrell on Friday, 1 May, 2009 at 7:11 am

Robin, I’ll say the same thing I said on James’s blog: there are solutions that are more efficient than outside air and that maintain the environment at consistent levels, so we don’t generate more e-waste with failure rates that increase from 2.46 to 4.45% in a controlled 8 month study. This is actually a 180% increase and that’s not small. If your bonus for selling servers went up that much you’d be a happy man. You IT guys aren’t getting it, so I suggest you guys put your money where your mouth is and start a sign-up sheet to address where we start shipping this waste when the failures due to ASE happen… By the way, what happens when government starts getting it and they start taxing the manufacturers to get them to build true reclamation facilities? According to the EPA, 18% of e-waste actually gets recycled, and of that 18%, 0% gets brought down into its raw materials to be used in the production of new products. It’s just brought down to the level where it can either be reused in school computers (that get thrown away in a couple of years) or piled up without anybody being able to figure out what it was.

I’m just saying ASE isn’t the best solution. There are others. I won’t post again. thanks

I see this whole issue like MTBE in California: it looked like a good idea, and every industry expert was touting it. We all know why they were and we all know what happened. It looks like the server industry has the same lobbyists as the petroleum industry, as they are never held accountable for their actions and nobody ever talks about their profit levels and record sales.
Robin Harris on Friday, 1 May, 2009 at 10:10 am

Rick, you misinterpret the Intel statistics. But I defer to James Hamilton’s response, from which I excerpt this:

The most important observation I’ll make is that 85% to 90% of servers are replaced BEFORE they fail which is to say that obsolescence is the leading cause of server replacement. They no longer are power efficient and get replaced after 3 to 5 years. If I could save 10% of the overall data center capital expense and 25%+ of the operating expense at the cost of having an additional 2% in server failures each year. Absolutely yes.

Robin
Rick Cockrell on Saturday, 2 May, 2009 at 9:10 am

Yeh, I already read that and disagree totally, but I’m never going to be able to convince the IT world of it. A server is only as efficient as its use pattern and utilization. Which means, I can have the most inefficient server in the world and if I utilize it correctly it will be more efficient than any “new highly efficient” server in a typical data center … So with that said we need servers that will last 6-10 years (regardless of claims on efficiency increases) and we need to utilize them or manage their power. The ways of the past are over; changing servers based on efficiency is only going to get you so much if you’re not utilizing them. Yes, this will slow down the rate at which I get my info from my favorite search engine, but to be honest I am, and most would be, willing to wait. It’s still faster and more sustainable than driving down to the library to look up facts or picking up my newspaper from the lawn. We need data centers and we need every technology that saves energy, water, and e-waste. We need them soon. Yes, data centers in general reduce 10X more emissions than they create, but they could be a lot better.

I’d talk about energy use but we all know about that, so let’s talk about water use for a data center. According to the USGS, 39% of the water use in the United States is for power production, followed by public sector water use at 13% (which includes water used at the data centers). http://ga.water.usgs.gov/edu/wupt.html. What this tells us is that the energy we use at our data centers has a bigger effect on the nation’s or regional water tables than almost every other industry. Then, according to the EPA, the data center industry as of 2006 used 61 billion kWh of energy, which according to the NREL equates to 120 billion gallons of water used and evaporated to power data centers. http://www.nrel.gov/docs/fy04osti/33905.pdf Does this water come back? Yes, in some cases the next year’s rains bring this water back to the area, but in drought years we can hardly afford to waste this water. Also, with the data center industry’s expected 2X growth, do we have enough water to support the energy needed for these power-hungry data centers in your area? Water use often gets overlooked when designing and permitting a data center. Water only becomes an issue when a city starts talking about rationing, and in most cases the city only rations the domestic consumer or public sector uses. Data centers must realize their total sustainable footprint includes the water used at a power plant. Energy use is critical to the total water use.

As an example of water use: if you take a typical 2.0 PUE 1 MW facility, we assume the water use will be limited to humidifiers and to any water used for the HVAC equipment. That can’t be farther from the whole story. A 1 MW facility uses 17,520,000 gallons of water at the power plant annually. Any reduction in energy has great effects at the power plant, not only in energy and CO2 reductions but also in the facility’s total water footprint. If this facility were improved to a PUE of 1.29, the water use would be reduced to 11,300,400 gallons per year; a PUE of 1.19 would be 10,424,400 gallons a year. Yes, these are all staggering numbers but obtainable numbers, and that is the point.
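
The figures in this example can be reproduced with a short calculation. This is only a sketch: it assumes "1 MW" means 1,000 kW of IT load running all 8,760 hours of the year, and it uses 1.0 gallon of water per kWh at the power plant, a rate inferred from the numbers above (17,520,000 gallons for 2 MW over a year) rather than stated anywhere in the comment. The function name is hypothetical.

```python
HOURS_PER_YEAR = 8760

def plant_water_gallons(it_load_kw, pue, gallons_per_kwh=1.0):
    """Annual power-plant water use attributable to a data center.

    gallons_per_kwh defaults to 1.0, an ASSUMED rate inferred from the
    comment's figures, not a published constant.
    """
    facility_kwh = it_load_kw * pue * HOURS_PER_YEAR
    return facility_kwh * gallons_per_kwh

for pue in (2.0, 1.29, 1.19):
    gallons = round(plant_water_gallons(1000, pue))
    print(f"PUE {pue}: {gallons:,} gallons per year")
```

Plugging in a different gallons_per_kwh value shows how strongly the result depends on the local generation mix.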

Let’s evaluate water use at facilities in California, where water is about as precious as gold. According to the NREL, California uses 4.42 gallons of water to produce 1 kWh of energy. For a 1 MW facility that equates to more than 38 M gallons of water use at the power plant. Now you are beginning to understand why our reservoirs are almost empty when we are not even technically in a drought. Now that’s some water use. Just for reference, according to the USGS a typical house uses 70 gallons of water a day and 25000 gallons of water annually. By comparison, it would take more than 1520 houses to accumulate more water use than a 1 MW facility, and we all know that’s a small facility.

I’m working hard on getting a major server manufacturer to come out and say that if they had failure increases from 2-4% (as much as Intel suggests in the controlled study in 8 months), the industry leader’s profit margins would disappear and they would be delisted from the NYSE (yeh, that’s gonna be tough). Also, this company suggests that they are building servers to last and realize that utilization and power management can do far more than a change based solely on server efficiency. This would be really good for the world’s e-waste problems; yes, we have e-waste problems. Wake up: nobody knew about the water use, so why would they know about the e-waste?

When it comes down to it, the IT world (which needs to get off its high horse) will begin to see that its practices are as bad as drilling offshore for oil, driving an SUV, taking an hour-long shower, and not using the new CFL bulbs. The data center industry is growing faster than any other power, water, and waste consuming and creating industry. The current practices are not sustainable. Either it changes itself or it’s regulated to change. Which would everyone prefer? It’s coming! Now go jump in your Prius and think as you drive about the future impact of our actions.

I could go on for years as to why ASE is bad, but the facts are that there are designs out there that are easier to implement (WSE), that create less waste (RSE) and that are more efficient (WSE & RSE) than the ASE we’re talking about. So this is a moot point. No ASE; it’s actually funny we are even arguing about it.
Rick Cockrell on Saturday, 2 May, 2009 at 9:31 am

One more thing (doubt it’s getting published anyways): does anyone ever wonder why we are not talking about a metric for utilization in a data center, when it can make the single biggest impact on energy & water use? We have the PUE, which includes every mechanical component. If we can get a facility down to 1.19 PUE at 68F SAT and 78 RAT with a cooling retrofit without ASE, why would we even talk about adding to the e-waste stream? We’re talking about server efficiency and the fact that servers are disposable. Do you know how funny that sounds? Servers aren’t the problem; it’s the application of them. Facts are facts, bits are bits, a kW is a kW (3412 btu’s of heat), a btu is a btu, and water is wet. Some things we can’t change; now let’s get more bits moved for our kW and let’s remove more btu’s with less kWs, while producing less waste and reducing the water use.
Jayctd on Wednesday, 21 October, 2009 at 2:18 pm

Interesting concept. I like the idea of increasing efficiencies like this, especially coming from a northern state (Minnesota) where in the winter dust is low and temperature even lower.

(Heck we would have to worry about frost if we really exchanged the -30F outside air at the rate we do now)

I do wonder though if the cost saving calculations for warmer areas take into account things like increased power usage from fan/PSU cooling. Most of today’s systems allow scaling cooling based on internal temperature. With fans being as inefficient as they are, one would wonder if having all my blade chassis fans spun all the way up would impact the cost savings.