Disks: how hot is too hot?

by Robin Harris on Friday, 30 May, 2014

This is a StorageMojo summary of technical research.

Does temperature shorten the life of disk drives or not? Most studies say no – including a new one – but Microsoft/UVA researchers seem to disagree.

Backblaze published a detailed blog post on the observed effects of temperature on disk drives. Like most studies, they didn’t find any:

After looking at data on over 34,000 drives, I found that overall there is no correlation between temperature and failure rate.

But they also linked to a study by 3 Microsoft researchers who DID find an issue. Which led to this post, a summary of available research on temperature and disks.

Backblaze data
Backblaze looked at 17 drive models from Seagate, WD, Hitachi and Toshiba. Author Brian Beach used a point-biserial correlation coefficient on drive average temperatures and whether drives failed.
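For readers who haven't met the point-biserial coefficient: it is simply the Pearson correlation between a continuous variable (here, average drive temperature) and a yes/no variable (did the drive fail). The sketch below is not Backblaze's code, and the function name and sample numbers are made up for illustration. It only shows the standard formula.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(temps, failed):
    """Point-biserial correlation between temperatures and 0/1 failure flags.

    temps  -- average temperature per drive (any unit)
    failed -- 1/True if the drive failed, 0/False otherwise
    Both arguments are hypothetical inputs, not Backblaze's data format.
    """
    if len(temps) != len(failed):
        raise ValueError("temps and failed must have the same length")
    hot = [t for t, f in zip(temps, failed) if f]        # failed drives
    ok = [t for t, f in zip(temps, failed) if not f]     # surviving drives
    if not hot or not ok:
        raise ValueError("need at least one failed and one surviving drive")
    spread = pstdev(temps)  # population standard deviation of all temps
    if spread == 0:
        raise ValueError("all temperatures are identical")
    p = len(hot) / len(temps)  # fraction of drives that failed
    return (mean(hot) - mean(ok)) / spread * sqrt(p * (1 - p))


# Made-up example: six drives, the two warmest failed.
sample_temps = [20, 22, 24, 26, 28, 30]
sample_failed = [0, 0, 0, 0, 1, 1]
r = point_biserial(sample_temps, sample_failed)
print(f"r_pb = {r:.3f}")
```

A value near 0 means temperature tells you little about which drives failed; values near +1 or -1 mean failures cluster at the warm or cool end. Judging whether a given value is statistically significant is a separate step that this sketch leaves out.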

He found one drive – a Seagate 1.5TB Barracuda LP – that had a weak but statistically significant correlation between failure rate and higher temperature. The Annual Failure Rate (AFR) doubled from cool drives to warm (above average temperature) drives. But because so many continued to work fine at any temperature, the correlation was weak.

Two more models, a Seagate Barracuda 3TB and a Hitachi Deskstar, showed weaker correlations – but in opposite directions. The Hitachi failed slightly more often at 21°C than at 31°C, while the Seagate failed slightly more often at the higher temperature.

Google study
These results fit well with the results of a 2007 Google study, Failure Trends in a Large Disk Drive Population, by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso. They found:

Microsoft/UVA study
A 2010 Microsoft study, Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures by Sriram Sankar, Mark Shaw and Kushagra Vaid of Microsoft and Sudhanva Gurumurthi, U of Virginia, reached different conclusions:

1) We show strong correlation between temperature observed at different location granularities and failures observed. . . .
2) Although average temperature shows a correlation to disk failures, we show that variations in temperature or workload changes do not show significant correlation to failures observed in drive locations.
3) We . . . show that Chassis design knobs (disk placement, fan speeds) have a larger impact than tuning Workload knobs (intensity, different workload patterns), on disk temperature.
4) With the help of Arrhenius based temperature models and the datacenter cost model, we . . . show that datacenter temperature control has a significant cost advantage over increased fan speeds.
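The Arrhenius model mentioned in point 4 is the usual reliability-engineering way to estimate how much faster a failure mechanism proceeds at a higher temperature. Here is a minimal sketch of the acceleration factor. The activation energy is an assumed, illustrative value, not one taken from the paper, and the function name is hypothetical.

```python
from math import exp

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(base_c, hot_c, activation_energy_ev):
    """Arrhenius acceleration factor going from base_c to hot_c (both in °C).

    A result of 2.0 means the modeled failure mechanism runs twice as fast
    at hot_c as at base_c. activation_energy_ev depends on the failure
    mechanism; the caller must supply it.
    """
    base_k = base_c + 273.15  # convert to kelvin
    hot_k = hot_c + 273.15
    return exp(activation_energy_ev / BOLTZMANN_EV_PER_K * (1 / base_k - 1 / hot_k))


# ASSUMPTION: 0.5 eV is a placeholder activation energy for illustration only.
factor = arrhenius_acceleration(30, 40, 0.5)
print(f"30°C -> 40°C acceleration factor: {factor:.2f}")
```

The model depends on average temperature alone, which matches the paper's finding that average temperature, rather than temperature variation, correlates with failures.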

Here are a couple of relevant charts:

Some like it hot
In a 2012 study, Temperature Management in Data Centers: Why Some (Might) Like It Hot, researchers Nosayba El-Sayed, Ioan Stefanovici, George Amvrosiadis, Andy A. Hwang and Bianca Schroeder of the University of Toronto, using data from 3 organizations with several dozen data centers, found (among many other things):

Drive vendors have their say
It may surprise you to know that most drives are now spec’d at higher temperatures than any of these studies considered: 60°C (140°F) or 70°C (158°F) operating temperature. Per the Microsoft-UVA study, it is the average temperature, not variations in temperature, that affects drive life the most.

Of course, they say the drive will operate, but not for how long.

Reconciliation, to a point
Backblaze temps stop at 31°C. The Google study shows declining AFR up to almost 40°C (104°F), while Microsoft found increasing AFR after 40°C, as did the UToronto study.

Part of the difference between the MS/UVA study and the others may be the granularity of the study. They looked at how the location of the drive in the server box affected AFR, while the others did not.

The StorageMojo take
Readers with a professional interest are encouraged to read each of the papers carefully. There’s a wealth of data on other temperature and age related topics far beyond what is covered here.

For example, do higher temperatures also reduce server performance? At what temperature does a disk’s Read after Write (RaW) process kick in? What is the impact of age on drive life?

The MS/UVA paper is worth special consideration, because they go deep into issues such as where the drive is placed in a chassis and the impact of chassis fan speed vs reduced data center temperature. Given that giant data centers typically buy power by the megawatt, the results may not apply to you directly, but it is food for thoughtful analysis.

Hard drives are marvels, mash-ups of mechanical, electrical, manufacturing and architectural genius. 30 years ago a 25,000 hour MTBF was considered good, and today we get more than 10x that – and at higher operating temperatures – at a much lower price.
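To put those MTBF figures in the same terms as the AFR numbers the studies report, here is a rough conversion. It assumes a constant failure rate (the exponential model) and 24x7 operation, which is a simplification the studies above would not fully endorse. The 250,000-hour figure is simply 10x the old number, used for illustration.

```python
from math import exp

HOURS_PER_YEAR = 8760  # 24 hours x 365 days, assuming always-on drives


def afr_from_mtbf(mtbf_hours):
    """Approximate annual failure rate implied by an MTBF figure.

    ASSUMPTION: failures follow an exponential distribution (constant
    failure rate), so AFR = 1 - exp(-hours_per_year / MTBF).
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1 - exp(-HOURS_PER_YEAR / mtbf_hours)


for mtbf in (25_000, 250_000):
    print(f"MTBF {mtbf:>7,} h -> AFR about {afr_from_mtbf(mtbf):.1%}")
```

Under those assumptions the old 25,000-hour spec implies roughly three drives in ten failing per year, versus a few percent at 250,000 hours.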

But the bottom line – that drives have a 30°C – 40°C sweet spot – seems to be the consensus result. Plan accordingly.

Courteous comments welcome, of course.
