Some /. readers (OK, most) expressed enormous skepticism about 25x data compression. So StorageMojo has dug deeper, looking at some patents, to get a better handle on the technology.
Let’s start with why so many people believe 25x is impossible. Though not often cited by name, it appeared that many readers were recalling Shannon’s statement in A Mathematical Theory of Communication, or something like it:
The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols will be called its relative entropy. This is the maximum compression possible when we encode into the same alphabet. One minus the relative entropy is the redundancy. The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%.
From this is derived the idea that compression of English language text is mathematically limited to something far less than 25x and much closer to 2x.
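To see where the roughly 2x figure comes from, here’s a quick back-of-the-envelope sketch. The numbers are illustrative, not Shannon’s exact measurements:

```python
import math

# Shannon: relative entropy = H(source) / log2(alphabet size);
# redundancy = 1 - relative entropy. He put English redundancy near 50%
# when structure beyond ~8 letters is ignored.
alphabet_bits = math.log2(26)       # ~4.7 bits/letter if letters were random
entropy_per_letter = 2.3            # rough estimate for English text

relative_entropy = entropy_per_letter / alphabet_bits
redundancy = 1 - relative_entropy
max_compression = 1 / relative_entropy   # best ratio a text-only coder can approach

print(f"redundancy ~ {redundancy:.0%}, max compression ~ {max_compression:.1f}:1")
# -> redundancy ~ 51%, max compression ~ 2.0:1 -- nowhere near 25:1
```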
However, what Data Domain, Diligent, Avamar and others are doing is NOT English language text compression. Instead, they are all looking at the byte-level data streams emitted by backup software, i.e. a byte-stream that knows nothing about language, file type or file system.
So what these guys do is, in principle, not unlike MPEG-4. Given similar frames MPEG-4 stores the differences to achieve great compression. The big difference from video: they compare a new data segment to ALL segments, not just the prior one.
In enterprise backup there are three kinds of data:
1. The files that haven’t changed since the last backup.
2. The files floating about that are similar but not exactly the same — think of PowerPoints that are all the same except for different presenter or customer names, or databases that have gained another few dozen megabytes while many gigabytes are unchanged.
3. Files made up of entirely new data, like the new 3rd-shift sysadmin’s MP3 collection, which soon looks like (1) above.
Thus backup streams include lots of redundant data, even in a single stream, and much more so if the byte stream is compared to prior backups, leading to what Shannon would term a low relative entropy.
The goal of these products is to segment and compare these byte streams and store only one copy of each unique stream. Diligent does some things a little differently (and a little smarter) that I’ll cover later.
As Data Domain’s co-founder, Princeton CompSci professor and CTO Kai Li states in the patent:
[The invention] segments the incoming data stream, uniquely identifies these data segments, and then compares them to segments previously stored. If an incoming data segment is a duplicate of what has already been stored, the segment is not stored again but a reference created for it. If the segment is deemed to be unique, it is then further compressed with conventional ZIP style algorithms for an additional average 2:1 reduction, and stored to disk. This process operates at a very low level of granularity — the average segment size is 8K — to identify as much redundancy as possible.
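In rough code terms, the pipeline looks something like the sketch below. This is my simplification, not Data Domain’s implementation: fixed 8K segments and SHA-1 fingerprints are stand-ins, and zlib stands in for the “conventional ZIP style” pass on unique segments:

```python
import hashlib
import zlib

SEGMENT_SIZE = 8 * 1024      # ~8K average segment size, per the patent
store = {}                   # fingerprint -> compressed unique segment (the "disk")

def write_stream(recipe, data):
    """Deduplicate a backup byte stream; 'recipe' collects references for restore."""
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]
        fp = hashlib.sha1(segment).digest()   # unique segment identifier
        if fp not in store:                   # unique: compress once and store
            store[fp] = zlib.compress(segment)
        recipe.append(fp)                     # duplicate or not, just keep a reference

def read_stream(recipe):
    """Rebuild the original byte stream from the stored segments."""
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)
```

The fixed-offset segmentation above is the naive version: insert one byte at the front of a file and every later segment shifts, which is exactly the problem the variable-length, content-defined segmentation discussed below solves.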
Identifying duplicate segments is not rocket science. The difficulty is tracking and comparing billions of segments at wire speed. Thus DD’s patent focuses more on rapid segment comparison than anything else:
There have been attempts to prevent redundant copying of data that stay the same between backups. . . .
While these approaches achieve some efficiency gains by not copying the same data twice, it incurs significant latency due to disk input/output (I/O) overhead as a result of constantly accessing the disk to search for the data segments. It would be desirable to have a backup system that could reduce the latency while eliminating unnecessary data replication.
It looks like DD uses hardware in its backup gear to speed the process.
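How do you check billions of fingerprints without hitting disk for every one? I’m guessing at the shape here, not describing DD’s actual design, but one standard trick is a compact in-memory summary (a Bloom filter) of every fingerprint ever stored, so most lookups never touch the disk index at all:

```python
import hashlib

class FingerprintSummary:
    """In-memory Bloom filter over stored segment fingerprints (illustrative only)."""
    def __init__(self, size_bits=1 << 27, hashes=4):   # 16 MB of bits
        self.size, self.hashes = size_bits, hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, fp):
        for i in range(self.hashes):
            h = int.from_bytes(hashlib.sha1(bytes([i]) + fp).digest()[:8], "big")
            yield h % self.size

    def add(self, fp):
        for pos in self._positions(fp):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, fp):
        # False means "definitely never stored": skip the disk lookup entirely.
        # True means "probably stored": only then consult the on-disk index.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(fp))
```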
Of course, knowing where to break up a data stream into segments that are exact duplicates is also a trick. Thus another patent, this one by Avamar:
. . .determination of data sequences using sticky byte factoring to determine breakpoints in digital sequences.
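The general idea behind content-defined breakpoints, sketched generically below (this is the textbook rolling-hash approach, not Avamar’s patented sticky-byte algorithm), is to cut wherever a hash of the most recent few bytes hits a chosen pattern, so boundaries follow the data rather than fixed offsets:

```python
MASK = (1 << 13) - 1                    # expect an average segment near 8K
MIN_SEG, MAX_SEG = 2 * 1024, 64 * 1024  # keep segment sizes sane

def chunk(data):
    """Cut `data` wherever a hash of the last ~32 bytes hits the 'sticky' pattern."""
    segments, start, h = [], 0, 0
    for i in range(len(data)):
        h = ((h << 1) + data[i]) & 0xFFFFFFFF   # only recent bytes influence h
        if i - start + 1 < MIN_SEG:
            continue
        if (h & MASK) == 0 or i - start + 1 >= MAX_SEG:
            segments.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        segments.append(data[start:])
    return segments
```

Because the cut decision depends only on a small window of recent bytes, inserting or deleting data near the front of a file shifts offsets but leaves most downstream breakpoints, and therefore most fingerprints, unchanged.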
But all this seems like a lot of work to me, which is why I like Diligent’s approach. Diligent’s method doesn’t require identical data segments: close enough will do. As with MPEG-4, if the segments are similar enough, they store the differences. No sticky byte factoring. No hunting for exact duplicates. It also sounds as if it could be more resilient to back-end data rot. And they do it all in software, so they get faster every year without lifting a finger. If they do everything else about as well as DD, that looks like a significant advantage.
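Here is my sketch of the general shape of that idea: a cheap similarity fingerprint per segment, and a delta against the closest match when a new segment is merely “close enough.” This is not Diligent’s algorithm; Python’s difflib stands in for a real delta coder:

```python
from difflib import SequenceMatcher

segments = {}      # seg_id -> ("full", bytes) or ("delta", base_id, ops)
signatures = {}    # seg_id -> cheap similarity sketch

def signature(seg, k=4, sample=16):
    """Similarity sketch: the `sample` smallest k-byte shingles in the segment."""
    return frozenset(sorted({seg[i:i + k] for i in range(len(seg) - k + 1)})[:sample])

def delta(base, new):
    """Encode `new` as copy-from-base ranges plus literal bytes (the differences)."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, base, new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # range shared with the similar base
        else:
            ops.append(("data", new[j1:j2]))  # bytes unique to the new segment
    return ops

def store(seg_id, seg):
    sig = signature(seg)
    # Find the most similar segment already stored (linear scan, for clarity only).
    best = max(signatures, key=lambda s: len(sig & signatures[s]), default=None)
    if best is not None and len(sig & signatures[best]) >= 12:   # "close enough"
        segments[seg_id] = ("delta", best, delta(rebuild(best), seg))
    else:
        segments[seg_id] = ("full", seg)      # genuinely new data, store whole
    signatures[seg_id] = sig

def rebuild(seg_id):
    """Reassemble a segment, following delta references back to full copies."""
    entry = segments[seg_id]
    if entry[0] == "full":
        return entry[1]
    _, base_id, ops = entry
    base, out = rebuild(base_id), bytearray()
    for op in ops:
        out += base[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)
```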
While I love this technology, I have to comment on the business end as well. IMHO, none of these guys has a sustainable business model. Backup compression is a feature, not a product. And who in the tight-knit and profitable world of data storage is going to willingly step up and offer to sell customers a lot fewer storage arrays for D2D backup?
Their problem will be finding an acquirer who believes that selling fewer terabytes in the short run will enable them to sell more terabytes in the long run. Given the general level of complacency and short-sightedness among the largest players in the storage industry, they may have quite a long wait.
Heh, you expect the Slashdot crowd to know what they’re talking about? If it was 25x compression in the Linux kernel they’d believe it without question!
I can do 25x lossless with just 8 to 14 passes on all data types (my tests are with 16-bit wide random integers, working down in binary).
I have maths to prove the possibility of this.
Oh, and I can break all cellular automata down to simple recursive rules, and can prove that too.
25x? That’s nothing.
x 0,1,0,1,2,1,0,1,2,3 …
y 1,1,2,2,1,3,3,4,2,1 …
1,2,3,4,5,6,7,8,9,10 …
Hi Robin,
I took a quick look at the /. comments, and I think the skepticism among that crowd arises because “compression” in the computer science context means a compression ==algorithm==, which is something very different from deduplication technology in the enterprise backup context.
Strictly speaking, the idea of a =compression algorithm= that can do 25X is worthy of ridicule (both here and on /.) because it is simply not possible with a single-instance dataset — you will NEVER, for example, back up your personal laptop and get more than 3:1 even if you used the most sophisticated deduplication AND compression techniques that exist or are imaginable.
Computer science guys get this but Data Storage guys mostly don’t.
To really distinguish compression from deduplication, consider: if my ad agency’s server contains a single volume full of (unique) JPEG files, then that volume cannot be compressed OR de-duplicated by even 1.1x much less 25x. But if those JPEGs happen to be a store of stock photos that have been distributed across hundreds of user stores and likewise embedded into thousands of project files across the enterprise, then “de-duplication” ratios of 25:1 are quite believable — even though there is absolutely NO further compression going on with any of that image data.
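To put rough numbers on that scenario (mine, purely illustrative):

```python
# Illustrative only: incompressible JPEGs that appear in many places.
unique_photos_gb = 40        # one physical copy of each stock photo
copies_per_photo = 25        # scattered across user stores and project files
logical_gb = unique_photos_gb * copies_per_photo   # what the backup stream contains
stored_gb = unique_photos_gb                       # what the dedup store keeps

print(f"dedup ratio ~ {logical_gb / stored_gb:.0f}:1")   # 25:1, no recompression of image data
```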
Re: “The difficulty is tracking and comparing billions of segments at wire speed.”
Absolutely…and there’s the key differentiator. Compression is extremely CPU intensive while deduplication is extremely I/O and memory intensive.
Here’s how IBM gets 40:1 deduplication, using circa-2004 technology they acquired from Diligent:
http://www.haifa.ibm.com/conferences/systor2009/papers/2_1_1.ppt