<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Everything You Know About Disks Is Wrong</title>
	<atom:link href="http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/feed/" rel="self" type="application/rss+xml" />
	<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/</link>
	<description>Data storage info &#38; analysis</description>
	<lastBuildDate>Tue, 07 Feb 2012 16:02:02 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Phil Koenig</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-3/#comment-222894</link>
		<dc:creator>Phil Koenig</dc:creator>
		<pubDate>Thu, 02 Feb 2012 18:22:07 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-222894</guid>
		<description>Late late late comment, sorry. 

Re: the safety of backing up a RAID array with a failed drive first, versus swapping the failed drive and rebuilding the array first.

I would think the main advantage of the &quot;backup first&quot; strategy is that it does not require any new disk writes, only reads.

Seems to me that there would likely be far more potential failures resulting from re-writing all the data/parity during a rebuild than simply reading what&#039;s there onto a backup.</description>
		<content:encoded><![CDATA[<p>Late late late comment, sorry. </p>
<p>Re: the safety of backing up a RAID array with a failed drive first, versus swapping the failed drive and rebuilding the array first.</p>
<p>I would think the main advantage of the &#8220;backup first&#8221; strategy is that it does not require any new disk writes, only reads.</p>
<p>Seems to me that there would likely be far more potential failures resulting from re-writing all the data/parity during a rebuild than simply reading what&#8217;s there onto a backup.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: SSD reliability analysis compared with hard drives &#124; Chekanov Laboratory</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-3/#comment-216236</link>
		<dc:creator>SSD reliability analysis compared with hard drives &#124; Chekanov Laboratory</dc:creator>
		<pubDate>Tue, 10 May 2011 08:33:21 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-216236</guid>
		<description>[...] SSD, our news will be like balm for the soul. As Robin Harris wrote on StorageMojo, &quot;Forget RAID, just copy your data three [...]</description>
		<content:encoded><![CDATA[<p>[...] SSD, our news will be like balm for the soul. As Robin Harris wrote on StorageMojo, &quot;Forget RAID, just copy your data three [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Choosing a Nas &#124; mischkulanz</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-3/#comment-215038</link>
		<dc:creator>Choosing a Nas &#124; mischkulanz</dc:creator>
		<pubDate>Sun, 13 Feb 2011 12:14:57 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-215038</guid>
		<description>[...] Another one is: which hard disk to put into a NAS? According to these resources, it doesn&#039;t matter whether a hard disk is a server model (made for running all the time) or a cheaper desktop model &#8211; mtf is at the same low value: http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/ [...]</description>
		<content:encoded><![CDATA[<p>[...] Another one is: which hard disk to put into a NAS? According to these resources, it doesn&#8217;t matter whether a hard disk is a server model (made for running all the time) or a cheaper desktop model &#8211; mtf is at the same low value: <a href="http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/" rel="nofollow">http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/</a> [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: How to Setup Software RAID for a Simple File Server on Ubuntu :: Sysadmin Geek</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-3/#comment-214606</link>
		<dc:creator>How to Setup Software RAID for a Simple File Server on Ubuntu :: Sysadmin Geek</dc:creator>
		<pubDate>Fri, 07 Jan 2011 20:04:05 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-214606</guid>
		<description>[...] may say that there is no difference in fail rate between the two types. That may be true, however despite these claims, server grade drives still [...]</description>
		<content:encoded><![CDATA[<p>[...] may say that there is no difference in fail rate between the two types. That may be true, however despite these claims, server grade drives still [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Disk disaster! Recommendations for the future!</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-3/#comment-211653</link>
		<dc:creator>Disk disaster! Recommendations for the future!</dc:creator>
		<pubDate>Wed, 24 Nov 2010 12:58:17 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-211653</guid>
		<description>[...] in their range?  ...which may be sad, as some of those questions are easier to answer than others.  Here is an article that you ought to read, although it is mostly a summarisation/popularisation of an [...]</description>
		<content:encoded><![CDATA[<p>[...] in their range?  &#8230;which may be sad, as some of those questions are easier to answer than others.  Here is an article that you ought to read, although it is mostly a summarisation/popularisation of an [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ItsMe</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-3/#comment-211103</link>
		<dc:creator>ItsMe</dc:creator>
		<pubDate>Thu, 07 Oct 2010 21:23:52 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-211103</guid>
		<description>The last two posts did a good job explaining MTBF.  Here&#039;s another way of explaining it.

Get 1,000,000 hard drives in a room.  Run them all and see when they fail.  Let&#039;s say that in the first year you had one hard drive fail every hour.  That would be 8760 drives that would fail.  During the second year you might also have 8760 drives fail.  During the third year you might have 8760 drives fail.  During the 4th year you might have 50,000 drives fail.  During the 5th year all the remaining drives might fail.

What is the MTBF?  You would clearly decide that the useful life of a hard drive is 3 years, because you start getting a lot of failures in the 4th year, and all of them failed during the 5th year.  So you look at your average failure rate for the first three years.  Well, for the first three years, you had 1,000,000 drives running, and one failed every hour.  But every hour you have 1,000,000 hard-drive hours accumulated.  So you have one failure per 1,000,000 hours of operation.  Thus your MTBF is 1,000,000 hours.  MTBF means mean time between failures.  You have one failure for every 1,000,000 hours of operation, thus a 1 million hour MTBF.

Notice the fact that all the drives failed by the fifth year.  The MTBF has nothing to do with the life expectancy.

On an unrelated note, I have not read any of the referenced papers, but it seems to me that the statistic showing clustered failures is totally bogus.  It turns out that when you find a drive has failed and you go to rebuild the RAID, it&#039;s not that another drive fails during the rebuild, but rather that the other drive had in fact failed before the rebuild (failure meaning having unreadable data), and the failure is not discovered until the rebuild.</description>
		<content:encoded><![CDATA[<p>The last two posts did a good job explaining MTBF.  Here&#8217;s another way of explaining it.</p>
<p>Get 1,000,000 hard drives in a room.  Run them all and see when they fail.  Let&#8217;s say that in the first year you had one hard drive fail every hour.  That would be 8760 drives that would fail.  During the second year you might also have 8760 drives fail.  During the third year you might have 8760 drives fail.  During the 4th year you might have 50,000 drives fail.  During the 5th year all the remaining drives might fail.</p>
<p>What is the MTBF?  You would clearly decide that the useful life of a hard drive is 3 years, because you start getting a lot of failures in the 4th year, and all of them failed during the 5th year.  So you look at your average failure rate for the first three years.  Well, for the first three years, you had 1,000,000 drives running, and one failed every hour.  But every hour you have 1,000,000 hard-drive hours accumulated.  So you have one failure per 1,000,000 hours of operation.  Thus your MTBF is 1,000,000 hours.  MTBF means mean time between failures.  You have one failure for every 1,000,000 hours of operation, thus a 1 million hour MTBF.</p>
<p>Notice the fact that all the drives failed by the fifth year.  The MTBF has nothing to do with the life expectancy.</p>
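<p>The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the example as stated: the function name is made up, and it assumes the population stays at 1,000,000 drives for the whole three-year window, which the example also glosses over.</p>

```python
HOURS_PER_YEAR = 8760


def population_mtbf(drives: int, failures: int, hours: float) -> float:
    """MTBF estimate: accumulated drive-hours divided by observed failures.

    Simplifying assumption: the population stays at `drives` for the whole
    window (failed units are not subtracted), as in the example above.
    """
    return drives * hours / failures


# One failure per hour among 1,000,000 drives during the first three years.
window_hours = 3 * HOURS_PER_YEAR
failures_in_window = window_hours  # one failure every hour
mtbf_hours = population_mtbf(1_000_000, failures_in_window, window_hours)

print(mtbf_hours)                   # MTBF in hours
print(mtbf_hours / HOURS_PER_YEAR)  # the same figure in years, far beyond the 5-year wear-out
```

<p>The second number comes out above a century even though every drive in the example is dead by year five, which is the point: MTBF describes the failure rate during useful life, not the life expectancy.</p>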
<p>On an unrelated note, I have not read any of the referenced papers, but it seems to me that the statistic showing clustered failures is totally bogus.  It turns out that when you find a drive has failed and you go to rebuild the RAID, it&#8217;s not that another drive fails during the rebuild, but rather that the other drive had in fact failed before the rebuild (failure meaning having unreadable data), and the failure is not discovered until the rebuild.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: M-A-O-L &#187; Everything You Know About Disks Is Wrong</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-3/#comment-210797</link>
		<dc:creator>M-A-O-L &#187; Everything You Know About Disks Is Wrong</dc:creator>
		<pubDate>Mon, 30 Aug 2010 21:28:22 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-210797</guid>
		<description>[...] &#8217;07, and brought back a few interesting impressions. The summary of them in Everything You Know About Disks Is Wrong goes as follows: [...] these results validate the Google File System’s central redundancy [...]</description>
		<content:encoded><![CDATA[<p>[...] &#8217;07, and brought back a few interesting impressions. The summary of them in Everything You Know About Disks Is Wrong goes as follows: [...] these results validate the Google File System’s central redundancy [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tim</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-3/#comment-208996</link>
		<dc:creator>Tim</dc:creator>
		<pubDate>Mon, 12 Apr 2010 13:08:13 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-208996</guid>
		<description>Further to Tracy Valleau

The industry is moving towards using AFR (Annual Failure Rate).  The reason is that MTBF is really confusing, and AFR gives the consumer a better idea of what the number means.  An AFR of 0.87% is equivalent to an MTBF of 1,000,000 hours.  The equation is AFR = 1 - exp(-8760/MTBF), where 8760 is the number of hours in a year and MTBF is in hours.

Both of these measures are POPULATION statistics.  One would expect from a large population that a small fraction might be faulty or break earlier than expected.  Most people can intuitively understand that about 1% of disks might fail in a single year, or that there is a 1% chance of a disk failing in a year.  They also do not link this failure rate with the disk&#039;s lifetime.  As such, AFR is a much more sensible metric for this type of information.  And an AFR of 0.87% is exactly the same as an MTBF of 1,000,000 hours.

This statistic also in no way defines how long a disk will last.  That is the useful life value (say 30,000 POH (power on hours)).  This will be linked to the warranty period, wear-out etc.

On a slightly different note....  The paper did not measure disk failures but rather &quot;disk replacements&quot;.  There is a difference between the two, namely mis-diagnosis.  This may also help explain why she got an autocorrelation.  If I replace a disk that I have wrongly diagnosed as faulty, I still leave the root cause of the problem in place, and am likely to repeat the same mistake a week or so later.... hence the autocorrelation result.

My hypothesis is that the autocorrelation seen is caused by mis-diagnosis.  Unfortunately I do not have the data to prove/disprove that hypothesis.</description>
		<content:encoded><![CDATA[<p>Further to Tracy Valleau</p>
<p>The industry is moving towards using AFR (Annual Failure Rate).  The reason is that MTBF is really confusing, and AFR gives the consumer a better idea of what the number means.  An AFR of 0.87% is equivalent to an MTBF of 1,000,000 hours.  The equation is AFR = 1 - exp(-8760/MTBF), where 8760 is the number of hours in a year and MTBF is in hours.</p>
<p>Both of these measures are POPULATION statistics.  One would expect from a large population that a small fraction might be faulty or break earlier than expected.  Most people can intuitively understand that about 1% of disks might fail in a single year, or that there is a 1% chance of a disk failing in a year.  They also do not link this failure rate with the disk&#8217;s lifetime.  As such, AFR is a much more sensible metric for this type of information.  And an AFR of 0.87% is exactly the same as an MTBF of 1,000,000 hours.</p>
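<p>As a quick check of the conversion, here is a minimal Python sketch of the equation AFR = 1 - exp(-8760/MTBF) and its inverse. The function names are illustrative, and the formula itself assumes a constant failure rate over the year.</p>

```python
import math

HOURS_PER_YEAR = 8760


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annual failure rate as a fraction, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse of afr_from_mtbf: MTBF in hours for a given AFR fraction."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


print(f"{afr_from_mtbf(1_000_000):.2%}")  # prints 0.87%
```

<p>For large MTBF values the result is very close to the simpler ratio 8760/MTBF, which is why the two are often used interchangeably.</p>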
<p>This statistic also in no way defines how long a disk will last.  That is the useful life value (say 30,000 POH (power on hours)).  This will be linked to the warranty period, wear-out etc.</p>
<p>On a slightly different note&#8230;.  The paper did not measure disk failures but rather &#8220;disk replacements&#8221;.  There is a difference between the two, namely mis-diagnosis.  This may also help explain why she got an autocorrelation.  If I replace a disk that I have wrongly diagnosed as faulty, I still leave the root cause of the problem in place, and am likely to repeat the same mistake a week or so later&#8230;. hence the autocorrelation result.</p>
<p>My hypothesis is that the autocorrelation seen is caused by mis-diagnosis.  Unfortunately I do not have the data to prove/disprove that hypothesis.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: &#187; RAID limitations &#8211; an interesting read</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-3/#comment-205687</link>
		<dc:creator>&#187; RAID limitations &#8211; an interesting read</dc:creator>
		<pubDate>Fri, 02 Oct 2009 17:38:59 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-205687</guid>
		<description>[...] http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/ [...]</description>
		<content:encoded><![CDATA[<p>[...] <a href="http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/" rel="nofollow">http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/</a> [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Should There Be A Tape Backup Drive in Your Future? ~ Revelations From An Unwashed Brain</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-3/#comment-201680</link>
		<dc:creator>Should There Be A Tape Backup Drive in Your Future? ~ Revelations From An Unwashed Brain</dc:creator>
		<pubDate>Tue, 19 May 2009 21:02:18 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-201680</guid>
		<description>[...] his own website, Mr. Harris attempts to give the reader a quick education on the problems of drives, and what you think you might know is probably [...]</description>
		<content:encoded><![CDATA[<p>[...] his own website, Mr. Harris attempts to give the reader a quick education on the problems of drives, and what you think you might know is probably [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tracy Valleau</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-2/#comment-199321</link>
		<dc:creator>Tracy Valleau</dc:creator>
		<pubDate>Fri, 13 Feb 2009 05:00:40 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-199321</guid>
		<description>I often get asked about MTBF (Mean Time Between Failure) and it&#039;s amazing how many &quot;industry people&quot; don&#039;t understand it.

And for those who have already figured out that their 1.5M MTBF drives don&#039;t last 150 years, but are not sure what that MTBF thing is... here&#039;s a quickie:

Why your hard drive doesn&#039;t last 150 years.

(There are about 8,760 hours in a year, but to make this example simple, let&#039;s call it 10,000.)

Here&#039;s how MTBF works: it&#039;s an aggregate of many units based on expected life of a single unit.

Let&#039;s say you have a hard drive that is warranted to last 3 years, or 30,000 hours.

You put it in a server, and behold, it lasts 3 years. You take it out and put in a new one, and that also lasts 3 years. So you replace it with a new one, and that too.... well, you get it.

Let&#039;s say you keep doing that and finally, on the 50th unit, only two years into its life, it breaks.

You now have 3 years or 30,000 hours per unit, times 50 units = 1,500,000.

And that&#039;s your MTBF.

So anyone who says &quot;Wow! MTBF of 1.5 million hours! That means this thing will last (1.5M / 10000) 150 years!&quot; -clearly- doesn&#039;t know what they&#039;re talking about.

(MTBF is more complex than my example, including &quot;infant mortality&quot; and &quot;wear out&quot; phases; &quot;theoretical&quot; vs &quot;operational&quot; MTBF and so on, but the gist of what&#039;s here is correct.)

Cordially,

Tracy Valleau

&quot;Don&#039;t believe everything you think.&quot;</description>
		<content:encoded><![CDATA[<p>I often get asked about MTBF (Mean Time Between Failure) and it&#8217;s amazing how many &#8220;industry people&#8221; don&#8217;t understand it.</p>
<p>And for those who have already figured out that their 1.5M MTBF drives don&#8217;t last 150 years, but are not sure what that MTBF thing is&#8230; here&#8217;s a quickie:</p>
<p>Why your hard drive doesn&#8217;t last 150 years.</p>
<p>(There are about 8,760 hours in a year, but to make this example simple, let&#8217;s call it 10,000.)</p>
<p>Here&#8217;s how MTBF works: it&#8217;s an aggregate of many units based on expected life of a single unit.</p>
<p>Let&#8217;s say you have a hard drive that is warranted to last 3 years, or 30,000 hours.</p>
<p>You put it in a server, and behold, it lasts 3 years. You take it out and put in a new one, and that also lasts 3 years. So you replace it with a new one, and that too&#8230;. well, you get it.</p>
<p>Let&#8217;s say you keep doing that and finally, on the 50th unit, only two years into its life, it breaks.</p>
<p>You now have 3 years or 30,000 hours per unit, times 50 units = 1,500,000.</p>
<p>And that&#8217;s your MTBF.</p>
<p>So anyone who says &#8220;Wow! MTBF of 1.5 million hours! That means this thing will last (1.5M / 10000) 150 years!&#8221; -clearly- doesn&#8217;t know what they&#8217;re talking about.</p>
<p>(MTBF is more complex than my example, including &#8220;infant mortality&#8221; and &#8220;wear out&#8221; phases; &#8220;theoretical&#8221; vs &#8220;operational&#8221; MTBF and so on, but the gist of what&#8217;s here is correct.)</p>
<p>Cordially,</p>
<p>Tracy Valleau</p>
<p>&#8220;Don&#8217;t believe everything you think.&#8221;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kmann</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-2/#comment-197240</link>
		<dc:creator>Kmann</dc:creator>
		<pubDate>Fri, 22 Aug 2008 18:01:30 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-197240</guid>
		<description>The Bianca Schroeder paper is excellent, but I saw something very interesting in the paper that seems to have gone unnoticed here:

Table 2. -- &quot;Node outages that were attributed to hardware problems broken down by the responsible hardware component.&quot;

Component (HPC1)        
CPU              44%
Memory           29%
Hard drive       16%
PCI motherboard   9%
Power supply      2%

Fully 82% of the failures were related to &quot;solid state&quot; components.

This is in spite of the fact that the system population included 3,406 disks and 784 servers. DRAM was almost twice as likely as a hard drive to cause a failure, and the CPUs were nearly three times as likely to cause an outage. Moreover, 784 motherboards produced 9% of failures while 3,400 disks produced only 16%.

And this is a very high-end system, presumably &quot;top-shelf&quot; DRAM, CPU and motherboard components.

Also, from the text:

&quot;...we have analyzed failure data covering any type of node outage, including those caused by hardware, software, network problems, environmental problems, or operator mistakes. The data was collected over a period of 9 years on more than 20 HPC clusters and contains detailed root cause information. We found that, for most HPC systems in this data, more than 50% of all outages are attributed to hardware problems... Consistent with the data in Table 2, the two most common hardware components to cause a node outage are memory and CPU.&quot;

So much for the myth of &quot;solid state&quot; reliability.

For some perspective, while CPU makers stopped publishing MTBF many years ago, and DRAM manufacturers have to my knowledge never published them, most motherboard manufacturers do publish -- typically in the 100,000 hour range. So...if 784 motherboards produced 9% of failures, and 3,400 disks only produced 16%, then it seems that perhaps the numbers published by the disk drive makers are, in relative terms, not so wildly off the mark. It would appear (from a system/sub-system perspective) that disks are relatively much more reliable than the &quot;solid state&quot; components. 

I wonder how people would react if they actually knew the MTBF numbers on stuff like DRAM and CPUs? Perhaps we should all remember that silicon DOES &quot;wear out&quot; (in a manner of speaking).

All this makes me wonder why everyone assumes that Flash SSD is going to be so much more reliable than other silicon. Are we to believe the ridiculous MTBF claims of the SSD makers (Intel sez 2,000,000 hrs), given the numbers on DRAM?

It will be interesting to see the results on the first large-scale deployments of flash-SSD. Unfortunately, the &quot;free ride&quot; for SSD will probably continue for five or more years before folks begin to realize that solid state is not necessarily more reliable than mechanical disks...and very frequently (in the case of DRAM and CPUs) less reliable!</description>
		<content:encoded><![CDATA[<p>The Bianca Schroeder paper is excellent, but I saw something very interesting in the paper that seems to have gone unnoticed here:</p>
<p>Table 2. &#8212; &#8220;Node outages that were attributed to hardware problems broken down by the responsible hardware component.&#8221;</p>
<p>Component (HPC1)<br />
CPU              44%<br />
Memory           29%<br />
Hard drive       16%<br />
PCI motherboard   9%<br />
Power supply      2%</p>
<p>Fully 82% of the failures were related to &#8220;solid state&#8221; components.</p>
<p>This is in spite of the fact that the system population included 3,406 disks and 784 servers. DRAM was almost twice as likely as a hard drive to cause a failure, and the CPUs were nearly three times as likely to cause an outage. Moreover, 784 motherboards produced 9% of failures while 3,400 disks produced only 16%.</p>
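<p>To put the motherboard and disk figures on a per-unit basis, here is a rough Python sketch. It uses only the shares and counts quoted above and assumes one motherboard per server (784 in total), which the comment implies but the table does not state.</p>

```python
# Share of hardware-attributed node outages (HPC1, Table 2 as quoted above).
outage_share = {"motherboard": 0.09, "hard drive": 0.16}

# Population sizes; one motherboard per server is an assumption.
unit_count = {"motherboard": 784, "hard drive": 3406}

# Outage share attributable to a single unit of each component type.
per_unit = {name: outage_share[name] / unit_count[name] for name in outage_share}

ratio = per_unit["motherboard"] / per_unit["hard drive"]
print(f"A single motherboard accounts for about {ratio:.1f}x the outage share of a single disk")
```

<p>This compares shares of node outages only. It says nothing about absolute failure rates, since the total number of outages is not given here.</p>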
<p>And this is a very high-end system, presumably &#8220;top-shelf&#8221; DRAM, CPU and motherboard components.</p>
<p>Also, from the text:</p>
<p>&#8220;&#8230;we have analyzed failure data covering any type of node outage, including those caused by hardware, software, network problems, environmental problems, or operator mistakes. The data was collected over a period of 9 years on more than 20 HPC clusters and contains detailed root cause information. We found that, for most HPC systems in this data, more than 50% of all outages are attributed to hardware problems&#8230; Consistent with the data in Table 2, the two most common hardware components to cause a node outage are memory and CPU.&#8221;</p>
<p>So much for the myth of &#8220;solid state&#8221; reliability.</p>
<p>For some perspective, while CPU makers stopped publishing MTBF many years ago, and DRAM manufacturers have to my knowledge never published them, most motherboard manufacturers do publish &#8212; typically in the 100,000 hour range. So&#8230;if 784 motherboards produced 9% of failures, and 3,400 disks only produced 16%, then it seems that perhaps the numbers published by the disk drive makers are, in relative terms, not so wildly off the mark. It would appear (from a system/sub-system perspective) that disks are relatively much more reliable than the &#8220;solid state&#8221; components. </p>
<p>I wonder how people would react if they actually knew the MTBF numbers on stuff like DRAM and CPUs? Perhaps we should all remember that silicon DOES &#8220;wear out&#8221; (in a manner of speaking).</p>
<p>All this makes me wonder why everyone assumes that Flash SSD is going to be so much more reliable than other silicon. Are we to believe the ridiculous MTBF claims of the SSD makers (Intel sez 2,000,000 hrs), given the numbers on DRAM?</p>
<p>It will be interesting to see the results on the first large-scale deployments of flash-SSD. Unfortunately, the &#8220;free ride&#8221; for SSD will probably continue for five or more years before folks begin to realize that solid state is not necessarily more reliable than mechanical disks&#8230;and very frequently (in the case of DRAM and CPUs) less reliable!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jered Floyd</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-2/#comment-197222</link>
		<dc:creator>Jered Floyd</dc:creator>
		<pubDate>Wed, 20 Aug 2008 21:08:37 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-197222</guid>
		<description>Robin,

A bit of a late comment here, but I think what&#039;s even more interesting than bogus MTBFs for drives is the difference in bit error rate between SCSI/FC and SATA drives.  I just wrote an article on this, &lt;a href=&quot;http://permabit.wordpress.com/2008/08/20/are-fibre-channel-and-scsi-drives-more-reliable/&quot; rel=&quot;nofollow&quot;&gt;Are Fibre Channel and SCSI Drives More Reliable?&lt;/a&gt;  It turns out that they are, at least for RAID, and not for the reason you might suspect!  I think there&#039;s a false market segmentation going on here...

Jered Floyd
CTO, Permabit Technology Corp.</description>
		<content:encoded><![CDATA[<p>Robin,</p>
<p>A bit of a late comment here, but I think what&#8217;s even more interesting than bogus MTBFs for drives is the difference in bit error rate between SCSI/FC and SATA drives.  I just wrote an article on this, <a href="http://permabit.wordpress.com/2008/08/20/are-fibre-channel-and-scsi-drives-more-reliable/" rel="nofollow">Are Fibre Channel and SCSI Drives More Reliable?</a>  It turns out that they are, at least for RAID, and not for the reason you might suspect!  I think there&#8217;s a false market segmentation going on here&#8230;</p>
<p>Jered Floyd<br />
CTO, Permabit Technology Corp.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: wgh</title>
		<link>http://storagemojo.com/2007/02/20/everything-you-know-about-disks-is-wrong/comment-page-2/#comment-109191</link>
		<dc:creator>wgh</dc:creator>
		<pubDate>Fri, 24 Aug 2007 04:47:37 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=383#comment-109191</guid>
		<description>Joe Claborn said (on February 21st, 2007 at 6:41 am):  Is this right? A MTBF of ‘only’ 300,000 hours translates to 34 years. Our disk drives seem to last about 3 years. Why the difference? 
---
I&#039;ve skimmed the above thread but didn&#039;t see anyone note that MTBF (and to a degree MTTF) should be divided by the number of drives in your environment to estimate how often you&#039;ll see a single drive within the environment fail. Yes, as you&#039;ve mentioned, the MTBF numbers suggest 34 years to failure for one drive, but if you have 10 drives in your environment you can expect one of them to fail in about 3.4 years. Just as when you have 10 men working construction, there&#039;s 10 times the probability of one of them getting sick on any given day. When working in a &quot;big iron&quot; shop with thousands of RAID devices, this is (usually) taken into account.

Those who say triplicate the data instead of using RAID appear to me not to face the need for up-to-date, accurate data available in one location, with no time available (due to SLAs) to restore or even to fail over to a separate set of drives. Many in mainframe environments have come to rely heavily on having no downtime to restore or fail over to other drives, unless the situation is very dire (of a disaster type). If one were to &quot;simply&quot; have three copies, as someone suggested above, then which one do you update? All three? Doing so and waiting for validation of completion of I/O would typically cause response times on heavily I/O burdened systems to degrade beyond acceptability. Not waiting on validation opens a window to potential corruption of any copies that were not being synchronously updated (synchronous updates are expensive). Thus RAID.

Yes, drives will fail and drives will be replaced. But a well laid out RAID array will still give the needed response times during failures, even at peak transaction time... again, I said if they&#039;re &quot;well laid out&quot;.  And yes, if the data is mission critical, such RAID arrays should be copied to another location... for the event of a disaster (including, at a minimum, lightning).</description>
		<content:encoded><![CDATA[<p>Joe Claborn said (on February 21st, 2007 at 6:41 am):  Is this right? A MTBF of ‘only’ 300,000 hours translates to 34 years. Our disk drives seem to last about 3 years. Why the difference?<br />
&#8212;<br />
I&#8217;ve skimmed the above thread but didn&#8217;t see anyone note that MTBF (and to a degree MTTF) should be divided by the number of drives in your environment to estimate how often you&#8217;ll see a single drive within the environment fail. Yes, as you&#8217;ve mentioned, the MTBF numbers suggest 34 years to failure for one drive, but if you have 10 drives in your environment you can expect one of them to fail in about 3.4 years. Just as when you have 10 men working construction, there&#8217;s 10 times the probability of one of them getting sick on any given day. When working in a &#8220;big iron&#8221; shop with thousands of RAID devices, this is (usually) taken into account.</p>
<p>Those who say triplicate the data instead of using RAID appear to me not to face the need for up-to-date, accurate data available in one location, with no time available (due to SLAs) to restore or even to fail over to a separate set of drives. Many in mainframe environments have come to rely heavily on having no downtime to restore or fail over to other drives, unless the situation is very dire (of a disaster type). If one were to &#8220;simply&#8221; have three copies, as someone suggested above, then which one do you update? All three? Doing so and waiting for validation of completion of I/O would typically cause response times on heavily I/O burdened systems to degrade beyond acceptability. Not waiting on validation opens a window to potential corruption of any copies that were not being synchronously updated (synchronous updates are expensive). Thus RAID.</p>
<p>Yes, drives will fail and drives will be replaced. But a well laid out RAID array will still give the needed response times during failures, even at peak transaction time&#8230; again, I said if they&#8217;re &#8220;well laid out&#8221;.  And yes, if the data is mission critical, such RAID arrays should be copied to another location&#8230; for the event of a disaster (including, at a minimum, lightning).</p>
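<p>The divide-by-the-number-of-drives rule can be sketched in Python as follows. The function name is illustrative, and the rule assumes independent drives with a constant failure rate, which is what the MTBF figure itself presumes.</p>

```python
HOURS_PER_YEAR = 8760


def fleet_hours_between_failures(mtbf_hours: float, drives: int) -> float:
    """Expected hours between failures somewhere in a fleet of `drives` units.

    Assumes independent drives with a constant failure rate, so the fleet
    failure rate is simply `drives` times the single-drive rate.
    """
    return mtbf_hours / drives


# 300,000-hour MTBF drives in fleets of different sizes.
for fleet_size in (1, 10, 1000):
    hours = fleet_hours_between_failures(300_000, fleet_size)
    print(f"{fleet_size:>5} drives: one failure every {hours / HOURS_PER_YEAR:.2f} years")
```

<p>With 10 drives this gives roughly the 3.4 years mentioned above. With 1,000 drives the expected interval drops to a couple of weeks.</p>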
]]></content:encoded>
	</item>
</channel>
</rss>

