StorageMojo’s /.’d post on 25x data compression invited much derision and a few knowledgeable comments. One of those comments pointed to Data Domain who claim up to 50x compression on “. . . data sets in certain use cases . . . .” Obviously the folks at Diligent Technologies are only half as good since they claim 25x. Expect heated “mine is smaller than yours!” arguments at the next SNW. Despite the hype, the technology appears to work. Just don’t expect 50x. 15x to 20x is more realistic, and YMMV.

In a messaging misfire, Data Domain calls their technology “Capacity Optimized Storage” instead of something self-explanatory like, say, “20x Backup Compressor”. So how does it work?

COS segments the incoming data stream, uniquely identifies these data segments, and then compares them to segments previously stored. If an incoming data segment is a duplicate of one already stored, the segment is not stored again; instead, a reference to the existing copy is created. If the segment is deemed to be unique, it is further compressed with a conventional ZIP-style algorithm for an additional average 2:1 reduction and stored to disk. The process operates at a fine granularity (the average segment size is 8 KB) to identify as much redundancy as possible.
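
Here is a minimal sketch of that segment-and-reference idea, assuming fixed-size 8 KB segments, SHA-256 fingerprints, and an in-memory dictionary as the store; real products use variable-size segmentation and their own index structures, and the ingest/restore helpers are hypothetical names, not anything from Data Domain.

```python
# Hedged sketch of segment-level deduplication, not Data Domain's actual code.
import hashlib
import zlib

SEGMENT_SIZE = 8 * 1024  # the average segment size cited for COS

segment_store = {}   # fingerprint -> compressed segment bytes
backup_recipe = []   # ordered list of fingerprints needed for a full restore

def ingest(stream):
    """Split the backup byte stream into segments and store only unique ones."""
    while True:
        segment = stream.read(SEGMENT_SIZE)
        if not segment:
            break
        fingerprint = hashlib.sha256(segment).hexdigest()
        if fingerprint not in segment_store:
            # Unique segment: apply conventional ZIP-style compression
            # (roughly the extra ~2:1 the vendors describe) and store it.
            segment_store[fingerprint] = zlib.compress(segment)
        # Duplicate or not, the recipe records a reference, never a second copy.
        backup_recipe.append(fingerprint)

def restore():
    """Reassemble the original byte stream from the stored references, in order."""
    return b"".join(zlib.decompress(segment_store[f]) for f in backup_recipe)
```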

What both companies are talking about is backup compression. The goal is to keep only the data needed for a complete restore instead of endless copies of the same stuff. Neither appears to work on files; rather, both work with the byte stream from the backup software, so there are no filesystem or data-format issues.

What Data Domain and Diligent do is, in principle, not unlike MPEG-4. MPEG-4 is a toolbox of techniques for efficiently compressing a variety of video and audio inputs, invoked as appropriate based on the content. In one case you have frame after frame that is nearly identical to the last, and the algorithm stores only the differences to achieve great compression. It just isn’t the kind of compression Shannon developed the math for. If you really want to get into the details of what Data Domain does, take a look at the patent from their co-founder and CTO, Princeton comp sci professor Kai Li.

In the patent, Li contrasts DD’s method with other approaches:

There have been attempts to prevent redundant copying of data that stay the same between backups. One approach is to divide the data streams from the data sources into segments and store the segments in a hash table on disk. During subsequent backup operations, the data streams are again segmented and the segments are looked up in the hash table to determine whether a data segment was already stored previously. If an identical segment is found, the data segment is not stored again; otherwise, the new data segment is stored. Other alternative approaches include storing the segments in a binary tree and determining whether an incoming segment should be stored by searching in the binary tree.

While these approaches achieve some efficiency gains by not copying the same data twice, they incur significant latency due to disk input/output (I/O) overhead as a result of constantly accessing the disk to search for the data segments. It would be desirable to have a backup system that could reduce the latency while eliminating unnecessary data replication.
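
The latency complaint is about hitting disk for every fingerprint lookup. As a generic illustration of how that I/O can be cut, and not a description of Data Domain’s patented mechanism, one option is to keep a compact in-memory summary such as a Bloom filter in front of the on-disk index; the BloomFilter and is_duplicate names below are my own.

```python
# Hedged illustration: an in-memory Bloom filter answers "definitely new"
# without any disk I/O; only possible duplicates pay for a disk lookup.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 24, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive several bit positions from the segment fingerprint.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, fingerprint: bytes):
        # Call this whenever a new fingerprint is written to the on-disk index.
        for pos in self._positions(fingerprint):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, fingerprint: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(fingerprint))

def is_duplicate(fingerprint: bytes, bloom: BloomFilter, on_disk_index) -> bool:
    if not bloom.might_contain(fingerprint):
        return False                      # definitely new: skip the disk entirely
    return fingerprint in on_disk_index   # possible duplicate: one disk lookup
```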

If you think of a typical enterprise, there are two kinds of backup compression issues:

  1. There are simply the files that haven’t changed since the last backup.
  2. There will be hundreds of similar, but not identical, documents floating about (think of PowerPoint presos that are the same except for different presenter or customer names).

Where Diligent’s technique appears to differ is that it doesn’t require segments to be identical: if two segments are merely similar, it stores only the differences. If Diligent does everything else about as well as DD, that sounds like a significant advantage.
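
Here is a hedged sketch of that store-the-differences idea under my own assumptions. The make_delta/apply_delta helpers are hypothetical, and Python’s generic difflib stands in for whatever matching Diligent actually uses.

```python
# Hedged sketch of similarity-based storage, not Diligent's actual algorithm:
# when an incoming segment is close to one already on disk, keep only a delta
# (copy instructions plus the changed bytes) instead of a second full copy.
from difflib import SequenceMatcher

def make_delta(base: bytes, new: bytes) -> list:
    """Encode `new` as copy ranges from `base` plus literal inserted bytes."""
    ops = []
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # bytes already stored in `base`
        else:
            ops.append(("insert", new[j1:j2]))  # only the bytes that differ
    return ops

def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the newer segment from the stored base segment and its delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out.extend(base[op[1]:op[2]])
        else:
            out.extend(op[1])
    return bytes(out)

# Two near-identical presentation blurbs differ only in the presenter's name;
# the stored delta is a handful of bytes rather than a whole duplicate segment.
base = b"Q3 revenue review deck. Presenter: Alice Smith. Customer: Acme Corp."
new = base.replace(b"Alice Smith", b"Bob Jones")
delta = make_delta(base, new)
assert apply_delta(base, delta) == new
```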

Both of these companies are trying to make their money as appliance vendors. In the long run I don’t think either has a sustainable business model, since backup data compression is simply a feature. Someone who wants to make a lot of money selling disk-to-disk backup will buy this technology and make their product that much more competitive. There is no reason both of these companies won’t be purchased at handsome multiples of their funding once they overcome the marketing problem of convincing the IT world that 20x backup data compression really works.

In addition, both Data Domain and Diligent are making a marketing mistake by not calling it compression. To anyone but a purist it is a type of compression, just as MPEG-4 is.

And if there is a way to make this a software-only solution that could run on Windows, Linux and Unix, I believe either company could significantly accelerate its uptake by marketing it as a utility rather than a product.