<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Cool stuff at SNW</title>
	<atom:link href="http://storagemojo.com/2007/04/26/cool-stuff-at-snw/feed/" rel="self" type="application/rss+xml" />
	<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/</link>
	<description>Data storage info &#38; analysis</description>
	<lastBuildDate>Tue, 07 Feb 2012 16:02:02 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: dirkmeister.de &#187; Blog Archive &#187; Compression vs. Data Deduplication</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-198328</link>
		<dc:creator>dirkmeister.de &#187; Blog Archive &#187; Compression vs. Data Deduplication</dc:creator>
		<pubDate>Tue, 04 Nov 2008 21:32:37 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-198328</guid>
		<description>[...] the storage blogs &#8220;Backup Central&#8221; and &#8220;StorageMojo&#8221;. A StorageMojo author says:  I still don’t get why the industry refers to “de-duplication” rather than compression - why [...]</description>
		<content:encoded><![CDATA[<p>[...] the storage blogs &#8220;Backup Central&#8221; and &#8220;StorageMojo&#8221;. A StorageMojo author says:  I still don’t get why the industry refers to “de-duplication” rather than compression &#8211; why [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Robin Harris</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-69233</link>
		<dc:creator>Robin Harris</dc:creator>
		<pubDate>Sun, 27 May 2007 20:36:31 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-69233</guid>
		<description>Over on &lt;a href=&quot;http://www.backupcentral.com/content/view/111/47/&quot; target=&quot;_blank&quot; rel=&quot;nofollow&quot;&gt;Backup Central&lt;/a&gt; W. Curtis Preston gently takes me to task for calling de-dupe compression. He makes some good points about how de-dupe works so it is worth looking at, even if I remain unpersuaded.

Here is my response:

Curtis,

Good stuff! I realize that the de-dupe ship has sailed and no one is going to call de-dupe compression. My interest is the marketing of new technology: how do you communicate to maximize uptake? My point is that by inventing the term de-dupe, the companies hurt themselves.

Other markets aren&#039;t such purists. MPEG-4 is my favorite example, since it is popularly known as compression, and it is a toolbox of compression techniques, not a single algorithm, which share a lot of similarities with de-dupe technology. De-dupe has more in common with image compression than text compression.

Nor is de-dupe implemented the same way by the vendors, so it isn&#039;t a single algorithm either. Data Domain has a patent on a technique for figuring out how to split the data into the chunks they use. Diligent does it differently, and if it figures a block is similar enough they&#039;ll delta the two and store the differences. In either case, both techniques look like out-of-order MPEG-4 compression.

The technology aside, I believe the de-dupe folks set themselves back 12-18 months by inventing a new term for buyers to learn. De-dupe has some wrinkles that you&#039;ve ably pointed out, yet from the perspective of accelerating product uptake, they were hardly worth the confusion the industry created for itself.

Great technology, lousy marketing. I&#039;ll link to you from my post on StorageMojo.

Robin</description>
		<content:encoded><![CDATA[<p>Over on <a href="http://www.backupcentral.com/content/view/111/47/" target="_blank" rel="nofollow">Backup Central</a> W. Curtis Preston gently takes me to task for calling de-dupe compression. He makes some good points about how de-dupe works so it is worth looking at, even if I remain unpersuaded.</p>
<p>Here is my response:</p>
<p>Curtis,</p>
<p>Good stuff! I realize that the de-dupe ship has sailed and no one is going to call de-dupe compression. My interest is the marketing of new technology: how do you communicate to maximize uptake? My point is that by inventing the term de-dupe, the companies hurt themselves.</p>
<p>Other markets aren&#8217;t such purists. MPEG-4 is my favorite example, since it is popularly known as compression, and it is a toolbox of compression techniques, not a single algorithm, which share a lot of similarities with de-dupe technology. De-dupe has more in common with image compression than text compression.</p>
<p>Nor is de-dupe implemented the same way by the vendors, so it isn&#8217;t a single algorithm either. Data Domain has a patent on a technique for figuring out how to split the data into the chunks they use. Diligent does it differently, and if it figures a block is similar enough they&#8217;ll delta the two and store the differences. In either case, both techniques look like out-of-order MPEG-4 compression.</p>
<p>The technology aside, I believe the de-dupe folks set themselves back 12-18 months by inventing a new term for buyers to learn. De-dupe has some wrinkles that you&#8217;ve ably pointed out, yet from the perspective of accelerating product uptake, they were hardly worth the confusion the industry created for itself.</p>
<p>Great technology, lousy marketing. I&#8217;ll link to you from my post on StorageMojo.</p>
<p>Robin</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bill Todd</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-60428</link>
		<dc:creator>Bill Todd</dc:creator>
		<pubDate>Wed, 02 May 2007 19:13:12 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-60428</guid>
		<description>Well, there&#039;s a reasonable argument that in at least many cases losing *any* copy of the data is bad, since the application or user that lost that copy does not necessarily know where to go to find another.  In that case, there&#039;s no safety in numbers at all.

Even for situations in which that&#039;s not the case, there&#039;s a very easy solution:  replicate the single copy a bit more than usual.  You don&#039;t need 5,000 copies of a datum to make it secure beyond any reasonable doubt; you don&#039;t even need 5:  3 or at the very most 4 will do nicely - and if you want, you can compress the overhead of a large segment down to that of little more than a single copy by using double- or triple-parity RAID to store it.

As for complexity, there&#039;s really not noticeably more than has existed in Unix (and VMS, and RSX) file systems since time immemorial:  multiple pointers to a single copy of data is precisely what hard links are all about (and Unix even handles it right by using link counts, though it took VMS a lot longer to do so, since it wasn&#039;t intended to be a generally-used feature there).

Deduping only on the backup stream by definition relegates the facility to backup-only use.  I strongly suspect that conventional backup mechanisms may go the way of the dodo within not all that many years, but that deduping may even increase in importance as larger and larger objects get routinely stored (and potentially duplicated).  In any event, deduping your on-line storage (as well as any backups) has significant benefit.

- bill</description>
		<content:encoded><![CDATA[<p>Well, there&#8217;s a reasonable argument that in at least many cases losing *any* copy of the data is bad, since the application or user that lost that copy does not necessarily know where to go to find another.  In that case, there&#8217;s no safety in numbers at all.</p>
<p>Even for situations in which that&#8217;s not the case, there&#8217;s a very easy solution:  replicate the single copy a bit more than usual.  You don&#8217;t need 5,000 copies of a datum to make it secure beyond any reasonable doubt; you don&#8217;t even need 5:  3 or at the very most 4 will do nicely &#8211; and if you want, you can compress the overhead of a large segment down to that of little more than a single copy by using double- or triple-parity RAID to store it.</p>
<p>As for complexity, there&#8217;s really not noticeably more than has existed in Unix (and VMS, and RSX) file systems since time immemorial:  multiple pointers to a single copy of data is precisely what hard links are all about (and Unix even handles it right by using link counts, though it took VMS a lot longer to do so, since it wasn&#8217;t intended to be a generally-used feature there).</p>
<p>Deduping only on the backup stream by definition relegates the facility to backup-only use.  I strongly suspect that conventional backup mechanisms may go the way of the dodo within not all that many years, but that deduping may even increase in importance as larger and larger objects get routinely stored (and potentially duplicated).  In any event, deduping your on-line storage (as well as any backups) has significant benefit.</p>
<p>- bill</p>
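<p>The link-count idea mentioned above can be sketched in a few lines. This is a toy illustration only, not how any file system or de-dupe product implements it; the class name <code>SharedCopyStore</code> and its methods are hypothetical.</p>

```python
class SharedCopyStore:
    """Toy store: one physical copy per key, plus a count of links to it."""

    def __init__(self):
        self._blobs = {}  # key -> the single stored copy of the data
        self._links = {}  # key -> number of names pointing at that copy

    def add_link(self, key, data):
        # The first link stores the data; later links only bump the count.
        if key not in self._blobs:
            self._blobs[key] = data
            self._links[key] = 0
        self._links[key] += 1

    def remove_link(self, key):
        # The copy is freed only when the last link goes away.
        self._links[key] -= 1
        if self._links[key] == 0:
            del self._blobs[key]
            del self._links[key]

    def link_count(self, key):
        return self._links.get(key, 0)

    def physical_copies(self):
        return len(self._blobs)


store = SharedCopyStore()
for _ in range(3):
    store.add_link("report", b"same bytes every time")
links_after_add = store.link_count("report")
copies_after_add = store.physical_copies()

store.remove_link("report")
store.remove_link("report")
copies_with_one_link = store.physical_copies()

store.remove_link("report")
copies_with_no_links = store.physical_copies()
```

<p>Three links share one physical copy, and the copy disappears only when the last link is removed, which is the bookkeeping the link count provides.</p>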
]]></content:encoded>
	</item>
	<item>
		<title>By: Robin Harris</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-60151</link>
		<dc:creator>Robin Harris</dc:creator>
		<pubDate>Tue, 01 May 2007 23:52:41 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-60151</guid>
		<description>Most of the de-dupe products I&#039;ve seen work on the byte-stream emitted from a backup application. They have no file system metadata. The magic comes in very rapidly splitting up the stream into likely segments, comparing those segments to all those already stored and, if you do it this way, doing a compare and storing the differences. All at wire speed.

The technology is very impressive. 

To Joshua&#039;s point about the clarity of the term de-duplication: it is a matter of timing, isn&#039;t it? With first use, the term needs to be explained, so it is less clear. Reading through the technically astute commenters here, one sees that the clarity is arrived at through time and discussion, since de-duplication has been used before. 

Then there is the question of intent: given that you&#039;ve now spent precious time and energy imparting the meaning of the word de-duplication, how does this affect the customer&#039;s buying behavior? Does it change the business value of your product? Does it create valuable differentiation? Will they buy sooner? Buy more? 

Now I think the folks at Data Domain might say that since they&#039;ve explained de-duplication they can now explain why they are better at it. As their marketing implicitly recognizes, de-dupe is riskier, since you are now relying on one copy, plus pointers, plus an index, plus a bunch of software, instead of a bunch of copies. Those copies are slow and expensive to create and difficult to track and read, but there is safety in numbers.

This is turning into another post, so I&#039;ll stop here. 

Robin</description>
		<content:encoded><![CDATA[<p>Most of the de-dupe products I&#8217;ve seen work on the byte-stream emitted from a backup application. They have no file system metadata. The magic comes in very rapidly splitting up the stream into likely segments, comparing those segments to all those already stored and, if you do it this way, doing a compare and storing the differences. All at wire speed.</p>
<p>The technology is very impressive. </p>
<p>To Joshua&#8217;s point about the clarity of the term de-duplication: it is a matter of timing, isn&#8217;t it? With first use, the term needs to be explained, so it is less clear. Reading through the technically astute commenters here, one sees that the clarity is arrived at through time and discussion, since de-duplication has been used before. </p>
<p>Then there is the question of intent: given that you&#8217;ve now spent precious time and energy imparting the meaning of the word de-duplication, how does this affect the customer&#8217;s buying behavior? Does it change the business value of your product? Does it create valuable differentiation? Will they buy sooner? Buy more? </p>
<p>Now I think the folks at Data Domain might say that since they&#8217;ve explained de-duplication they can now explain why they are better at it. As their marketing implicitly recognizes, de-dupe is riskier, since you are now relying on one copy, plus pointers, plus an index, plus a bunch of software, instead of a bunch of copies. Those copies are slow and expensive to create and difficult to track and read, but there is safety in numbers.</p>
<p>This is turning into another post, so I&#8217;ll stop here. </p>
<p>Robin</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bill Todd</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-60145</link>
		<dc:creator>Bill Todd</dc:creator>
		<pubDate>Tue, 01 May 2007 22:54:39 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-60145</guid>
		<description>Joshua&#039;s excellent explanation demonstrated but failed to call out explicitly the major difference between deduplication and compression as typically implemented:  the former works by eliminating duplicate objects (files), while the latter works *within a single* object (file) to eliminate redundancy therein (normally by collapsing duplicate bit- or byte-sequences, though one could imagine sufficiently advanced mechanisms that could collapse other bit- or byte-sequences - e.g., those which could be recreated by a discoverable algorithm whose storage requirements were smaller than their size).

Now, one might argue that if you look at all your backing storage as a single very large object the two mechanisms would *then* be identical, but this is not the case, because in-object compression works within relatively small scopes (say, a few KB for a file system using compression, because it has to be able to reconstruct data in a reasonable amount of time, which means it can&#039;t afford to read in large numbers of disk sectors before it can extract the information requested; compression in sequentially-accessed files can have somewhat larger scope, because it can, within reason, be assured that all earlier parts of the file will be available for decompressing the next part).

Deduplication, by contrast, works *only* on relatively large, identical byte sequences:  a) the sequences must be sufficiently large - and therefore sufficiently few in number - that indexing them across the entire system (with fast access if you want to dedupe synchronously rather than as a background operation) is feasible, and b) the sequences must be sufficiently large that going elsewhere to access one (e.g., if you&#039;re deduping at the block level and part of one file is duplicated in another) won&#039;t seriously slow down access (though when *entire* small files are so deduped that&#039;s not a problem, since the entire access gets revectored).

- bill</description>
		<content:encoded><![CDATA[<p>Joshua&#8217;s excellent explanation demonstrated but failed to call out explicitly the major difference between deduplication and compression as typically implemented:  the former works by eliminating duplicate objects (files), while the latter works *within a single* object (file) to eliminate redundancy therein (normally by collapsing duplicate bit- or byte-sequences, though one could imagine sufficiently advanced mechanisms that could collapse other bit- or byte-sequences &#8211; e.g., those which could be recreated by a discoverable algorithm whose storage requirements were smaller than their size).</p>
<p>Now, one might argue that if you look at all your backing storage as a single very large object the two mechanisms would *then* be identical, but this is not the case, because in-object compression works within relatively small scopes (say, a few KB for a file system using compression, because it has to be able to reconstruct data in a reasonable amount of time, which means it can&#8217;t afford to read in large numbers of disk sectors before it can extract the information requested; compression in sequentially-accessed files can have somewhat larger scope, because it can, within reason, be assured that all earlier parts of the file will be available for decompressing the next part).</p>
<p>Deduplication, by contrast, works *only* on relatively large, identical byte sequences:  a) the sequences must be sufficiently large &#8211; and therefore sufficiently few in number &#8211; that indexing them across the entire system (with fast access if you want to dedupe synchronously rather than as a background operation) is feasible, and b) the sequences must be sufficiently large that going elsewhere to access one (e.g., if you&#8217;re deduping at the block level and part of one file is duplicated in another) won&#8217;t seriously slow down access (though when *entire* small files are so deduped that&#8217;s not a problem, since the entire access gets revectored).</p>
<p>- bill</p>
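<p>A minimal sketch of the block-level case described above, assuming fixed-size blocks and a SHA-256 digest as the index key. Both choices are assumptions made for illustration; the function names and the 4096-byte block size are hypothetical, and real products pick their own chunking and indexing schemes.</p>

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for the example


def dedupe_blocks(streams, block_size=BLOCK_SIZE):
    """Store each distinct block once and describe each stream by digests."""
    store = {}    # digest -> block bytes: the system-wide index
    recipes = []  # one list of digests per stream: the "pointers"
    for data in streams:
        recipe = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).digest()
            store.setdefault(digest, block)  # keep only the first copy
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes


def restore(store, recipe):
    """Rebuild a stream by following its digests back into the store."""
    return b"".join(store[digest] for digest in recipe)


# Two files that share one block ("B") out of two each.
file_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
file_b = b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE

store, recipes = dedupe_blocks([file_a, file_b])
stored_bytes = sum(len(block) for block in store.values())
```

<p>Four logical blocks end up as three stored blocks, and each file is still recoverable from its list of digests. Smaller blocks would find more duplicates but enlarge the index, which is the trade-off between sequence size and indexing cost discussed above.</p>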
]]></content:encoded>
	</item>
	<item>
		<title>By: Red</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-60109</link>
		<dc:creator>Red</dc:creator>
		<pubDate>Tue, 01 May 2007 18:03:12 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-60109</guid>
		<description>I see the confusion -- allow me to explain.  From my archaic mainframe perspective I was referring to deduplicating records within a file.  Even common sorting utilities such as DFSORT and SYNCSORT will do that.  And since that reduces the size of the file, I can see why you might call it a form of compression.  Once a file has been deduped, it may be formally compressed.  But the second operation I refer to would use an algorithm such as Lempel-Ziv compression.</description>
		<content:encoded><![CDATA[<p>I see the confusion &#8212; allow me to explain.  From my archaic mainframe perspective I was referring to deduplicating records within a file.  Even common sorting utilities such as DFSORT and SYNCSORT will do that.  And since that reduces the size of the file, I can see why you might call it a form of compression.  Once a file has been deduped, it may be formally compressed.  But the second operation I refer to would use an algorithm such as Lempel-Ziv compression.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joshua Sargent</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-59784</link>
		<dc:creator>Joshua Sargent</dc:creator>
		<pubDate>Mon, 30 Apr 2007 15:13:08 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-59784</guid>
		<description>Assume you have 10 x 1GB files that were exactly the same.  If you use compression to reduce their size (assuming 50% compression rate), you&#039;d end up with 10 x 500MB files, or 5GB of data.  

If you use de-duplication, the system will keep one copy and leave pointers for the other nine copies.  Hence, you end up with 1GB of data vs. 5GB of data.  Add compression on the back end and you end up with 500MB....10% of what you would have ended up with using only compression.

Obviously de-duplication implementations aren&#039;t this simple, but this should illustrate the difference sufficiently for most.  In -=essence=-, de-duplication does the exact same function as compression...but in -=practice=- that function is implemented quite differently.  The difference between the two is more dramatic than the difference between, say, bzip and gzip.  

So I don&#039;t really have a problem with the new term - it more clearly defines the function.  Clarity is a good thing.</description>
		<content:encoded><![CDATA[<p>Assume you have 10 x 1GB files that were exactly the same.  If you use compression to reduce their size (assuming 50% compression rate), you&#8217;d end up with 10 x 500MB files, or 5GB of data.  </p>
<p>If you use de-duplication, the system will keep one copy and leave pointers for the other nine copies.  Hence, you end up with 1GB of data vs. 5GB of data.  Add compression on the back end and you end up with 500MB&#8230;.10% of what you would have ended up with using only compression.</p>
<p>Obviously de-duplication implementations aren&#8217;t this simple, but this should illustrate the difference sufficiently for most.  In -=essence=-, de-duplication does the exact same function as compression&#8230;but in -=practice=- that function is implemented quite differently.  The difference between the two is more dramatic than the difference between, say, bzip and gzip.  </p>
<p>So I don&#8217;t really have a problem with the new term &#8211; it more clearly defines the function.  Clarity is a good thing.</p>
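<p>For anyone who wants to check the numbers above, here is the same arithmetic as a short script. It assumes 1GB = 1000MB for round figures, takes the 50% compression rate as given, and ignores the space used by pointers; the variable names are made up for the example.</p>

```python
MB_PER_GB = 1000  # assumption: decimal units, to keep the figures round

num_files = 10
file_size_mb = 1 * MB_PER_GB
compression_ratio = 0.5  # compressed size / original size, as assumed above

# Compression only: each of the ten copies is compressed on its own.
compressed_only_mb = num_files * file_size_mb * compression_ratio

# De-duplication only: one stored copy (pointer overhead ignored).
deduped_mb = file_size_mb

# De-duplication, then compression of the single remaining copy.
deduped_then_compressed_mb = deduped_mb * compression_ratio

# Fraction of the compression-only footprint that is left.
fraction_of_compressed_only = deduped_then_compressed_mb / compressed_only_mb
```

<p>With these inputs the three results come out to 5GB, 1GB and 500MB, and the last is one tenth of the compression-only figure.</p>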
]]></content:encoded>
	</item>
	<item>
		<title>By: Robin Harris</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-58808</link>
		<dc:creator>Robin Harris</dc:creator>
		<pubDate>Sun, 29 Apr 2007 00:17:55 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-58808</guid>
		<description>Uh, it reduces the size of large collections of bytes.

Explain how mpeg4 isn&#039;t compression?</description>
		<content:encoded><![CDATA[<p>Uh, it reduces the size of large collections of bytes.</p>
<p>Explain how mpeg4 isn&#8217;t compression?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: TimC</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-58732</link>
		<dc:creator>TimC</dc:creator>
		<pubDate>Sat, 28 Apr 2007 22:23:11 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-58732</guid>
		<description>Ummm... they don&#039;t call de-duplication compression because it isn&#039;t compression...</description>
		<content:encoded><![CDATA[<p>Ummm&#8230; they don&#8217;t call de-duplication compression because it isn&#8217;t compression&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Robin Harris</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-57983</link>
		<dc:creator>Robin Harris</dc:creator>
		<pubDate>Sat, 28 Apr 2007 00:12:36 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-57983</guid>
		<description>Red,

Would I be correct in assuming that real computers actually remove duplicate files during de-duplication rather than, say, doing compares and storing the deltas?

MPEG4 compression does exactly what the moderns call &quot;de-duplication&quot;, with one tiny difference: de-duplication does out-of-order MPEG4 compression.

This kind of hair-splitting sales prevention is what happens when you let techies, like Prof. Li at Data Domain and Neville Yates at Diligent, do end-user marketing.

I remember engineers grousing about calling 100mbit ethernet ethernet, since it wasn&#039;t CSMA/CD like &quot;real&quot; ethernet. I have yet to meet a single actual customer who a) cared, or b) knew the difference.

I estimate they set themselves back at least 12 months by insisting on a new word. But I&#039;m willing to look on the bright side: maybe their stuff wasn&#039;t working and they needed the extra time. I&#039;ve seen that happen too.

Robin</description>
		<content:encoded><![CDATA[<p>Red,</p>
<p>Would I be correct in assuming that real computers actually remove duplicate files during de-duplication rather than, say, doing compares and storing the deltas?</p>
<p>MPEG4 compression does exactly what the moderns call &#8220;de-duplication&#8221;, with one tiny difference: de-duplication does out-of-order MPEG4 compression.</p>
<p>This kind of hair-splitting sales prevention is what happens when you let techies, like Prof. Li at Data Domain and Neville Yates at Diligent, do end-user marketing.</p>
<p>I remember engineers grousing about calling 100mbit ethernet ethernet, since it wasn&#8217;t CSMA/CD like &#8220;real&#8221; ethernet. I have yet to meet a single actual customer who a) cared, or b) knew the difference.</p>
<p>I estimate they set themselves back at least 12 months by insisting on a new word. But I&#8217;m willing to look on the bright side: maybe their stuff wasn&#8217;t working and they needed the extra time. I&#8217;ve seen that happen too.</p>
<p>Robin</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Red</title>
		<link>http://storagemojo.com/2007/04/26/cool-stuff-at-snw/comment-page-1/#comment-57905</link>
		<dc:creator>Red</dc:creator>
		<pubDate>Fri, 27 Apr 2007 22:06:42 +0000</pubDate>
		<guid isPermaLink="false">http://storagemojo.com/?p=442#comment-57905</guid>
		<description>Maybe with these fancy open system thingies deduplication and compression are the same thing.  But on a real system (mainframe) one may dedupe and compress data as separate actions.  One may even dedupe and then compress the same data.  Oh well, I&#039;ll never understand these open systems thingies.  Time to retire and go trout fishing.</description>
		<content:encoded><![CDATA[<p>Maybe with these fancy open system thingies deduplication and compression are the same thing.  But on a real system (mainframe) one may dedupe and compress data as separate actions.  One may even dedupe and then compress the same data.  Oh well, I&#8217;ll never understand these open systems thingies.  Time to retire and go trout fishing.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

