<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: NetApp&#8217;s research offensive</title>
	<atom:link href="http://storagemojo.com/2008/02/26/netapps-research-offensive/feed/" rel="self" type="application/rss+xml" />
	<link>http://storagemojo.com/2008/02/26/netapps-research-offensive/</link>
	<description>Data storage info &#38; analysis</description>
	<pubDate>Fri, 08 Aug 2008 20:39:58 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
		<item>
		<title>By: Interesting tidbits from around the storage blogosphere &#8212; Storage Soup</title>
		<link>http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-180422</link>
		<dc:creator>Interesting tidbits from around the storage blogosphere &#8212; Storage Soup</dc:creator>
		<pubDate>Fri, 14 Mar 2008 18:47:13 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-180422</guid>
		<description>[...] NetApp&#8217;s rebranding has it trying to downplay its geekitude a bit, but StorageMojo&#8217;s Robin Harris posted recently about some interesting whitepapers it released at FAST on storage subsystem failures. These were in response to a challenge from Harris for storage vendors to respond to last year&#8217;s well-publicized research from Google and Carnegie Mellon University on disk-drive failures. NetApp&#8217;s response showed some interesting results, according to Harris: The cynical, myself among them, might be tempted to dismiss the work as exercise in self-justification. The studies find disk scrubbing useful in eliminating silent data corruption, a result any half-awake SE will use to their advantage. [...]</description>
		<content:encoded><![CDATA[<p>[...] NetApp&#8217;s rebranding has it trying to downplay its geekitude a bit, but StorageMojo&#8217;s Robin Harris posted recently about some interesting whitepapers it released at FAST on storage subsystem failures. These were in response to a challenge from Harris for storage vendors to respond to last year&#8217;s well-publicized research from Google and Carnegie Mellon University on disk-drive failures. NetApp&#8217;s response showed some interesting results, according to Harris: The cynical, myself among them, might be tempted to dismiss the work as exercise in self-justification. The studies find disk scrubbing useful in eliminating silent data corruption, a result any half-awake SE will use to their advantage. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter honeyman</title>
		<link>http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-177187</link>
		<dc:creator>peter honeyman</dc:creator>
		<pubDate>Sun, 02 Mar 2008 16:01:53 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-177187</guid>
		<description>All of the FAST papers are online at http://www.usenix.org/events/fast08/tech/</description>
		<content:encoded><![CDATA[<p>All of the FAST papers are online at <a href="http://www.usenix.org/events/fast08/tech/" rel="nofollow">http://www.usenix.org/events/fast08/tech/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bill Todd</title>
		<link>http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-176389</link>
		<dc:creator>Bill Todd</dc:creator>
		<pubDate>Fri, 29 Feb 2008 05:21:31 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-176389</guid>
		<description>'Parity Lost/Regained' leaves a bit to be desired:

a) It observes that parental checksums can't avoid problems with parity pollution *if the parity mechanism doesn't coordinate with the checksum verification*, but fails to note that in ZFS they *do* so coordinate (RAID-Z is still brain-damaged, but not in that respect - and if it changed its mechanism to create stripes out of serial blocks within the same file rather than out of each individual file block it would be just fine).  So coming to the conclusion that block checksums are the best base from which to move forward is indeed somewhat self-(NetApp-)serving (though it's possible that this was merely the result of experiential myopia rather than an intentional distortion).

b) Furthermore, not only is the contention that "the tendency of scrubs to pollute parity increases the chances of data loss when only one error occurs" (your quote above) incorrect when using such a ZFS-style approach, but it's incorrect *in general*.  The situation that they describe leading to data loss when parent checksums are used independently of the parity mechanism *does not occur* during conventional scrubbing (which simply verifies that each disk sector can be read without error):  it only occurs (and indeed *any* effect of scrubbing on parity pollution can only occur) when scrubbing also verifies (and if necessary corrects) the parity (and it's not clear why that would be a good thing to do for precisely this reason, though flagging any inconsistency for human analysis would be reasonable:  as they state elsewhere, scrubbing is primarily aimed at preventing latent sector errors from combining with a second error later on to cause data loss, and that has nothing to do with verifying/correcting parity information).

c) Write-verify, while it does protect against lost writes, doesn't protect against torn writes at all, at least if they're due to power loss (the example given in the paper):  if power is lost, the verify never happens and the tear remains a tear (unless something like a log gets replayed on restart, but in that case it takes care of torn writes without any need for write-verify).

d) In view of point (a) above, all you need to reduce the 'chance of data loss' to zero (at least within the scope of their analysis) is parity-based redundancy plus in-parent checksums that coordinate with it (plus scrubbing to detect latent errors before they can combine with another error to cause data loss) - with no need at all for write-verify, version-mirroring, logical/physical identity, or in-sector/in-block checksums.

That NetApp finds parity-based redundancy particularly interesting is hardly surprising given the evolution of their product.  But that doesn't excuse the amount of myopia evident in the paper:  there's little reason to use parity-based redundancy save to economize on disk-space use, which means that you can use it only for large files (which account for most disk-space use in the vast majority of installations) without significantly compromising its effectiveness and thus *can* use it only within individual files where it can not only be easily coordinated with parent-checksum mechanisms but provide entirely acceptable run-time performance (all the validation checksums are already in memory) plus reasonably efficient scrubbing (even while following the metadata paths).

The information that 19% of 'nearline' disks develop unreadable sectors within 2 years (presumably including those detected and revectored before they become unreadable, which is where scrubbing makes a major difference) was interesting (perhaps the enterprise-class disks are only about 1/10th as prone to this at least in part due to lower recording densities), as was the observation that the incidence of lost or misdirected writes was as high as about 0.03% per year for nearline disks (or about 0.005% per year for enterprise disks); the information about torn writes was less so, since that's just something any good system knows it has to deal with (or just pass on up to let applications do so).

But that's starting to get into the territory covered by "Data Corruption in the Storage Stack", where the emphasis on 'silent data corruption' suggests that NetApp feels some need to respond to the ZFS hoopla in this area.  Unfortunately, that paper proved disappointing:  while it may offer some insights for system administrators into whether to replace a disk after a particular kind of error, in general it added little to conventional understanding of error modes.

Not only did NetApp pioneer stellar technology a decade and a half ago that the competition is only now even beginning to catch up with, but it's still got some of the world's best and most innovative file system engineers on tap.  These papers just don't reflect such excellence:  they smell a lot more like PR aimed at countering Sun's attempt to position ZFS as a better solution in that market space.

- bill</description>
		<content:encoded><![CDATA[<p>&#8216;Parity Lost/Regained&#8217; leaves a bit to be desired:</p>
<p>a) It observes that parental checksums can&#8217;t avoid problems with parity pollution *if the parity mechanism doesn&#8217;t coordinate with the checksum verification*, but fails to note that in ZFS they *do* so coordinate (RAID-Z is still brain-damaged, but not in that respect - and if it changed its mechanism to create stripes out of serial blocks within the same file rather than out of each individual file block it would be just fine).  So coming to the conclusion that block checksums are the best base from which to move forward is indeed somewhat self-(NetApp-)serving (though it&#8217;s possible that this was merely the result of experiential myopia rather than an intentional distortion).</p>
<p>b) Furthermore, not only is the contention that &#8220;the tendency of scrubs to pollute parity increases the chances of data loss when only one error occurs&#8221; (your quote above) incorrect when using such a ZFS-style approach, but it&#8217;s incorrect *in general*.  The situation that they describe leading to data loss when parent checksums are used independently of the parity mechanism *does not occur* during conventional scrubbing (which simply verifies that each disk sector can be read without error):  it only occurs (and indeed *any* effect of scrubbing on parity pollution can only occur) when scrubbing also verifies (and if necessary corrects) the parity (and it&#8217;s not clear why that would be a good thing to do for precisely this reason, though flagging any inconsistency for human analysis would be reasonable:  as they state elsewhere, scrubbing is primarily aimed at preventing latent sector errors from combining with a second error later on to cause data loss, and that has nothing to do with verifying/correcting parity information).</p>
<p>c) Write-verify, while it does protect against lost writes, doesn&#8217;t protect against torn writes at all, at least if they&#8217;re due to power loss (the example given in the paper):  if power is lost, the verify never happens and the tear remains a tear (unless something like a log gets replayed on restart, but in that case it takes care of torn writes without any need for write-verify).</p>
<p>d) In view of point (a) above, all you need to reduce the &#8216;chance of data loss&#8217; to zero (at least within the scope of their analysis) is parity-based redundancy plus in-parent checksums that coordinate with it (plus scrubbing to detect latent errors before they can combine with another error to cause data loss) - with no need at all for write-verify, version-mirroring, logical/physical identity, or in-sector/in-block checksums.</p>
<p>That NetApp finds parity-based redundancy particularly interesting is hardly surprising given the evolution of their product.  But that doesn&#8217;t excuse the amount of myopia evident in the paper:  there&#8217;s little reason to use parity-based redundancy save to economize on disk-space use, which means that you can use it only for large files (which account for most disk-space use in the vast majority of installations) without significantly compromising its effectiveness and thus *can* use it only within individual files where it can not only be easily coordinated with parent-checksum mechanisms but provide entirely acceptable run-time performance (all the validation checksums are already in memory) plus reasonably efficient scrubbing (even while following the metadata paths).</p>
<p>The information that 19% of &#8216;nearline&#8217; disks develop unreadable sectors within 2 years (presumably including those detected and revectored before they become unreadable, which is where scrubbing makes a major difference) was interesting (perhaps the enterprise-class disks are only about 1/10th as prone to this at least in part due to lower recording densities), as was the observation that the incidence of lost or misdirected writes was as high as about 0.03% per year for nearline disks (or about 0.005% per year for enterprise disks); the information about torn writes was less so, since that&#8217;s just something any good system knows it has to deal with (or just pass on up to let applications do so).</p>
<p>But that&#8217;s starting to get into the territory covered by &#8220;Data Corruption in the Storage Stack&#8221;, where the emphasis on &#8216;silent data corruption&#8217; suggests that NetApp feels some need to respond to the ZFS hoopla in this area.  Unfortunately, that paper proved disappointing:  while it may offer some insights for system administrators into whether to replace a disk after a particular kind of error, in general it added little to conventional understanding of error modes.</p>
<p>Not only did NetApp pioneer stellar technology a decade and a half ago that the competition is only now even beginning to catch up with, but it&#8217;s still got some of the world&#8217;s best and most innovative file system engineers on tap.  These papers just don&#8217;t reflect such excellence:  they smell a lot more like PR aimed at countering Sun&#8217;s attempt to position ZFS as a better solution in that market space.</p>
<p>- bill</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Magda</title>
		<link>http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-175982</link>
		<dc:creator>David Magda</dc:creator>
		<pubDate>Wed, 27 Feb 2008 23:49:06 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-175982</guid>
		<description>So when will more companies add checksums to their storage offerings and file systems? It's much harder to miss corruption (and create parity pollution) when everything is 'secured' via a Merkle (hash) tree.

The IEEE encryption standard (1619) may actually help in this regard somewhat since you have MAC authentication built into the encryption.</description>
		<content:encoded><![CDATA[<p>So when will more companies add checksums to their storage offerings and file systems? It&#8217;s much harder to miss corruption (and create parity pollution) when everything is &#8216;secured&#8217; via a Merkle (hash) tree.</p>
<p>The IEEE encryption standard (1619) may actually help in this regard somewhat since you have MAC authentication built into the encryption.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Wes Felter</title>
		<link>http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-175955</link>
		<dc:creator>Wes Felter</dc:creator>
		<pubDate>Wed, 27 Feb 2008 21:59:36 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-175955</guid>
		<description>NetApp uses 11% of the space on SATA disks as checksum protection (separate from RAID and the disks' internal ECC). Disk space has really become cheap.</description>
		<content:encoded><![CDATA[<p>NetApp uses 11% of the space on SATA disks as checksum protection (separate from RAID and the disks&#8217; internal ECC). Disk space has really become cheap.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Allen C</title>
		<link>http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-175922</link>
		<dc:creator>Allen C</dc:creator>
		<pubDate>Wed, 27 Feb 2008 19:15:40 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-175922</guid>
		<description>Yea, disk scrubbing can spread array corruption.  It works if you have bad blocks on a single drive.  If the scrubbing programs were smart enough to not spread corruption...

Thanks--Allen</description>
		<content:encoded><![CDATA[<p>Yea, disk scrubbing can spread array corruption.  It works if you have bad blocks on a single drive.  If the scrubbing programs were smart enough to not spread corruption&#8230;</p>
<p>Thanks&#8211;Allen</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brad Collins</title>
		<link>http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-175889</link>
		<dc:creator>Brad Collins</dc:creator>
		<pubDate>Wed, 27 Feb 2008 17:28:19 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/2008/02/26/netapps-research-offensive/#comment-175889</guid>
		<description>Is the paper available online?</description>
		<content:encoded><![CDATA[<p>Is the paper available online?</p>
]]></content:encoded>
	</item>
</channel>
</rss>

