Some /. readers (OK, most) expressed enormous skepticism about 25x data compression. So StorageMojo has dug deeper, looking at some patents, to get a better handle on the technology.

Let’s start with why so many people believe 25x is impossible. Though not often cited by name, it appears that many readers recalled Shannon’s statement in A Mathematical Theory of Communication, or something similar:

The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols will be called its relative entropy. This is the maximum compression possible when we encode into the same alphabet. One minus the relative entropy is the redundancy. The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%.

From this comes the idea that compression of English-language text is mathematically limited to something far less than 25x and much closer to 2x.
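You can sanity-check that 2x intuition yourself: a gzip-style pass over ordinary English prose typically lands in the 2-3x range. A throwaway script (point it at any plain text file you have handy):

```python
import sys
import zlib

# Quick sanity check of the ~2x claim: run this against a plain English
# text file. Ordinary prose usually compresses roughly 2-3x, nowhere near 25x.
data = open(sys.argv[1], "rb").read()
packed = zlib.compress(data, level=9)
print(f"{len(data):,} bytes -> {len(packed):,} bytes, "
      f"ratio {len(data) / len(packed):.1f}x")
```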

However, what Data Domain, Diligent, Avamar and others are doing is NOT English language text compression. Instead, they are all looking at the byte-level data streams emitted by backup software, i.e. a byte-stream that knows nothing about language, file type or file system.

So what these guys do is, in principle, not unlike MPEG-4. Given similar frames, MPEG-4 stores only the differences to achieve great compression. The big difference from video: these products compare a new data segment against ALL previously stored segments, not just the prior one.

In enterprise backup there are three kinds of data:

  1. The files that haven’t changed since the last backup.
  2. The many similar-but-not-identical files floating about — think of PowerPoints that are identical except for the presenter or customer names, or databases that have gained another few dozen megabytes while many gigabytes remain unchanged.
  3. Files made up of entirely new data, like the new 3rd shift sysadmin’s MP3 collection, which soon looks like (1) above.

Thus backup streams include lots of redundant data, even in a single stream, and much more so if the byte stream is compared to prior backups, leading to what Shannon would term a low relative entropy.

The goal of these products is to segment and compare these byte streams and store only one copy of each unique segment. Diligent does some things a little differently (and a little smarter) that I’ll cover later.

As Data Domain’s co-founder, Princeton CompSci professor and CTO Kai Li states in the patent:

[The invention] segments the incoming data stream, uniquely identifies these data segments, and then compares them to segments previously stored. If an incoming data segment is a duplicate of what has already been stored, the segment is not stored again but a reference created for it. If the segment is deemed to be unique, it is then further compressed with conventional ZIP style algorithms for an additional average 2:1 reduction, and stored to disk. This process operates at a very low level of granularity — the average segment size is 8K — to identify as much redundancy as possible.
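The mechanics are easy to sketch. Here is a toy, fixed-size-segment version of that paragraph in Python. The names, the SHA-1 fingerprints and the naive in-memory index are mine, so treat it as an illustration of the idea rather than Data Domain’s implementation:

```python
import hashlib
import zlib

SEGMENT_SIZE = 8 * 1024   # DD cites an ~8K average segment size

store = {}     # fingerprint -> compressed unique segment
recipe = []    # ordered list of fingerprints needed to rebuild the stream

def ingest(stream: bytes) -> None:
    """Segment a backup byte stream and store each unique segment once."""
    for i in range(0, len(stream), SEGMENT_SIZE):
        segment = stream[i:i + SEGMENT_SIZE]
        fp = hashlib.sha1(segment).digest()     # unique identifier
        if fp not in store:                     # new segment: compress and keep it
            store[fp] = zlib.compress(segment)
        recipe.append(fp)                       # duplicates cost only a reference

def restore() -> bytes:
    """Rebuild the stream from stored segments and references."""
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)
```

Feed it two nearly identical backup streams and the second costs almost nothing in new capacity; every repeated 8K segment becomes a 20-byte reference. Real products use content-defined rather than fixed boundaries (more on that below), and the hard part is that an index like `store` cannot fit in RAM once you have billions of segments.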

Identifying duplicate segments is not rocket science. The difficulty is tracking and comparing billions of segments at wire speed. Thus DD’s patent focuses more on rapid segment comparison than anything else:

There have been attempts to prevent redundant copying of data that stay the same between backups. . . .

While these approaches achieve some efficiency gains by not copying the same data twice, it incurs significant latency due to disk input/output (I/O) overhead as a result of constantly accessing the disk to search for the data segments. It would be desirable to have a backup system that could reduce the latency while eliminating unnecessary data replication.

It looks like DD uses hardware in its backup gear to speed the process.
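Whatever the hardware contributes, the textbook way to dodge that disk-lookup latency is a compact in-memory summary of the fingerprints, so “have I seen this segment?” can usually be answered without touching disk. A Bloom filter is the classic version; the sizing numbers below are arbitrary and I’m not claiming this is DD’s actual design:

```python
import hashlib

class BloomFilter:
    """Compact in-memory summary of fingerprints already on disk.
    Answers 'definitely new' or 'maybe seen' without any disk I/O."""

    def __init__(self, size_bits: int = 8 * 1024 * 1024, hashes: int = 4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive several bit positions from the segment fingerprint.
        for i in range(self.hashes):
            h = hashlib.sha1(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, fingerprint: bytes) -> None:
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def maybe_seen(self, fingerprint: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))
```

If `maybe_seen()` returns False the segment is definitely new and goes straight to storage; only the occasional “maybe” has to fall through to the on-disk index.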

Of course, knowing where to break up a data stream into segments that are exact duplicates is also a trick. Thus another patent, this one by Avamar:

. . .determination of data sequences using sticky byte factoring to determine breakpoints in digital sequences.
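Sticky byte factoring is Avamar’s name for what is generically called content-defined chunking: pick segment boundaries from the data itself rather than at fixed offsets, so inserting a few bytes near the front of a file doesn’t shift every boundary behind it. A crude sketch; the checksum, mask and size limits are mine, and a real implementation would use a proper Rabin-style rolling fingerprint:

```python
MASK = (1 << 13) - 1                    # ~8K average segment if hash bits look random
MIN_SEG, MAX_SEG = 2 * 1024, 64 * 1024  # keep segments within sane bounds

def segments(data: bytes):
    """Cut the stream wherever a cheap rolling checksum hits a 'sticky' value,
    so boundaries follow the content instead of fixed offsets."""
    start, rolling = 0, 0
    for i, byte in enumerate(data):
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF   # toy checksum, not Rabin
        length = i - start + 1
        if (length >= MIN_SEG and (rolling & MASK) == MASK) or length >= MAX_SEG:
            yield data[start:i + 1]
            start, rolling = i + 1, 0
    if start < len(data):
        yield data[start:]
```

Insert ten bytes near the top of a file and only the segment containing the edit (and perhaps its neighbor) changes; the boundaries downstream resynchronize, so those segments still hash to the same fingerprints, which is what lets duplicate detection actually find duplicates across backups.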

But all this seems like a lot of work to me, which is why I like Diligent’s approach. Diligent’s method doesn’t require identical data segments; close enough will do. As with MPEG-4, if the segments are similar enough they store only the differences. No sticky byte factoring. No hunting for exact duplicates. It sounds as if it could even be more resilient to back-end data rot as well. And they do it all in software, so they get faster every year without lifting a finger. If they do everything else about as well as DD, that looks like a significant advantage.
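For flavor, here is what storing a near-duplicate as a delta against a similar stored segment looks like in principle. This is generic delta encoding built on Python’s difflib, not Diligent’s algorithm:

```python
import difflib

def make_delta(base: bytes, new: bytes) -> list:
    """Encode `new` as copy/insert instructions against a similar `base`."""
    ops = []
    sm = difflib.SequenceMatcher(None, base, new, autojunk=False)
    for tag, b1, b2, n1, n2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", b1, b2 - b1))     # reuse bytes already stored in base
        else:
            ops.append(("insert", new[n1:n2]))    # store only the bytes that differ
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the new segment from the base segment plus the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

If the two segments are mostly the same, the result is a short list of copy instructions plus the handful of bytes that actually changed.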

While I love this technology, I have to comment on the business end as well. IMHO, none of these guys has a sustainable business model. Backup compression is a feature, not a product. And who in the tight-knit and profitable world of data storage is going to willingly step up and sell customers far fewer storage arrays for D2D backup?

Their problem will be finding an acquirer who believes that selling fewer terabytes in the short run will enable them to sell more terabytes in the long run. Given the general level of complacency and short-sightedness among the largest players in the storage industry, they may have quite a long wait.