So here I am in a cloudy and rainy San Diego, visiting Storage Networking World. This is a show for big data center types. Typical opening question: “So how many data centers do you have?” But there is frequently some interesting stuff presented amidst the vendor-driven chaff that might have meaning for the SMB market.
With a 25x data compression factor, the winner yesterday was Diligent Technologies (are all the good names taken?). They claim their technology enables data volume compression that is over 10x what ordinary data compression achieves — a real breakthrough. Common compression algorithms are lucky to get 2x compression.
So if you have 100 GB to back up, their product, Protectier (see name comment above), can turn it into 4GB, something you could burn onto a DVD in a few minutes. All in all, a wonderful product for SMBs — but they aren’t selling it to SMBs (good marketers must be scarce too).
Having spent some time looking at compression algorithms in my misspent youth, I was very skeptical of the 25x reduction claim. I was gradually cornering the charming but less technical Melissa when up walked Neville Yates, Diligent’s CTO, whose movie-star good looks and English accent give no clue to his technical chops, which are impressive.
The way Diligent achieves its exceptional compression ratio is by comparing all incoming data to the data that has already arrived. When it finds an incoming stream of bytes similar to an existing series of bytes, it compares the two and stores only the differences. The magic comes in a couple of areas, as near as I can make out given Neville’s natural reticence on the “how” of the technology.
First, one has to be smart about how big the series of bytes needs to be before trying to compress it, since if it’s too short there won’t be much, or any, compression. Second, the system needs a very fast and efficient method of knowing what it has already received, so it can tell when it is receiving something similar. And it all has to be optimized to run in-line at data-rate speeds on a standard server box — which runs the cool and reliable Linux OS.
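Neville wouldn’t say how ProtecTIER actually does it, so the sketch below is strictly my own guess at the general shape of the idea: fingerprint incoming blocks, keep an index of what has already arrived, and store only what is new. The block size, the SHA-1 fingerprint, and the in-memory index are all illustrative assumptions on my part, not anything Diligent has confirmed.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed fixed block size; a real product would tune this carefully

class DedupStore:
    """Toy byte-stream de-duplicator: each unique block is stored only once."""

    def __init__(self):
        self.index = {}      # block fingerprint -> stored block
        self.logical = 0     # bytes the backup application sent us
        self.physical = 0    # bytes we actually had to store

    def ingest(self, stream):
        for i in range(0, len(stream), BLOCK_SIZE):
            block = bytes(stream[i:i + BLOCK_SIZE])
            fp = hashlib.sha1(block).digest()   # fast fingerprint of the block
            self.logical += len(block)
            if fp not in self.index:            # only never-seen data costs space
                self.index[fp] = block
                self.physical += len(block)

    def factor(self):
        return self.logical / max(self.physical, 1)

# A first "full backup" of unique data, then a nearly identical second full backup.
first = os.urandom(8 * 1024 * 1024)
second = bytearray(first)
second[0:4] = b"EDIT"           # a tiny change somewhere in the data

store = DedupStore()
store.ingest(first)
store.ingest(second)
print(f"reduction factor so far: {store.factor():.2f}x")  # ~2x after two backups
```

Even this naive version makes the second, nearly identical backup almost free; the hard part Diligent isn’t talking about is doing the lookup and the difference-matching at wire speed.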
The big plus of this technology, besides the compression ratio, is its reliability. Since there is no assumption that two files are the same just because their metadata is, the problem of not backing up something you mistakenly thought was already backed up (a problem with file-based de-duplication software) is eliminated. Further, since the software operates on byte streams, it can compress anything: email, databases, archives, MP3s, encrypted data, or whatever weird data format your favorite program uses.
So naturally I am a bit disappointed that this wonderful technology is targeted at large data centers, even though I understand Diligent’s thinking. A viral-marketing, disruptive-technology approach would be to release a consumer version that maybe offers just 10x compression, but proves to hundreds of thousands of people in a few months that the technology really works. Then the data center guys — the smart ones, anyway — will be calling Diligent.
If you read their website, they do not claim to offer 25x compression. They only claim to enable an effective capacity increase of disk systems of 25 times or more. If you watch their demo, they explain that this 25x is accomplished by only storing data that does not already exist. They are just a better backup solution, not better compression per se. Your article is very misleading.
Don’t be fooled. True 25x compression is a pipe dream, a mathematical impossibility, plain and simple. People come along all the time claiming some incredible breakthrough in compression technology that blows the doors off of standard information theory. Interestingly, they never seem to ship an actual product.
Diligent seems not to be in that category; rather, they are more in the category of questionable marketing spinners. Apparently the claim of 25x compression does not apply to all your data, just to changes you make moving forward, and to data with a lot of redundancy. In other words, it’s some sort of incremental backup system with some sort of global duplicate detection. This was discovered by posters over at Slashdot, where your article has been picked up.
You can download their white paper here, which makes it clear:
http://www.diligent.com/pdf/diligent_wp_ProtecTIER%200106.pdf
So don’t expect to be backing up 100GB of data onto a DVD any time soon.
Well, I’m looking at a document from them that says “Reduce Required Backup Storage Capacity by 25X With 100% Data Integrity.” Whether that is better compression, or better backup, I’ll leave to others to decide. But if they can really do it, even if it is only 10x in practice, it is still huge compared to existing technology.
Robin,
The quote you provided from them shows that they do not claim anything in regard to data compression. Lossless data compression depends entirely on the data type and can be quite effective, achieving results much higher than the 2x you claim. You are comparing two completely different things and drawing conclusions that are incorrect. You have given no evidence that Diligent Technologies has made any claims about its data compression capabilities.
They are not the first to do this. Data Domain was first to market with this (by almost two years) and has multiple appliances installed at large customer accounts that use a similar technology. I personally installed one recently that is exceeding 25x aggregate overall compression, with 169x as the highest single-job rate seen to date. They can also put their gateway box in front of existing Fibre Channel storage instead of using their own, if a customer has pre-existing disk they want to use.
Michael,
I understand the elements of data compression from working with a tape drive team that used LZ compression, and you are absolutely correct. That is why I found it amazing and interesting that they claim an “up to” number (in marketing speak that usually means “guaranteed not to exceed”). The specific example they gave me is a customer with about 880TB in primary storage that backs it up to 40TB. It isn’t straight LZ-style compression, but it will look like magic to your average IT director.
This is not a primary storage technology. As I hope I made clear, Diligent is talking about a product, which they announced a couple of months ago and which they say is in production use, for reducing the size of backups. If it weren’t an amazing claim it wouldn’t be the coolest thing at SNW, now would it?
Robin,
I do appreciate the differences and can see how their claims are quite confusing. My main point is that they are not claiming to have a new compression technology. They are claiming that their backup methods store less data than traditional methods. Many backup systems store lots of redundant data, mainly because they are inefficient: the same file may be backed up many different times when it only needs to happen once. Their claim is about this, not compression. That is why I feel your article is misleading.
Cough….DataDomain….cough…..
Granted, Diligent claims to scale big. Really, really big. But if you’re giving away the award for the coolest idea, then give it to the people who’ve been executing on the idea for several years now.
Also, this type of technology doesn’t work like this:
“So if you have 100 GB to back up, their product, Protectier (see name comment above) can turn it into 4GB, something you could burn onto a DVD in a few minutes.”
Content optimization and single-instance storage wouldn’t achieve 25x on the first backup of anything. It would mean that your first backup of 100GB would probably be around 50GB after LZ, and then subsequent backups of the same data might be as small as 4GB, or they might not, depending upon your data structure and change rate.
And don’t ask about what happens when you encrypt your backups.
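To put rough numbers on that, here is a back-of-the-envelope model. The 2:1 LZ ratio and the 2% change rate are illustrative assumptions, not vendor figures:

```python
def aggregate_ratio(full_gb, backups, change_rate, lz_ratio=2.0):
    """Crude model of cumulative data reduction across repeated full backups."""
    logical = full_gb * backups                      # what the backup app sends
    stored = full_gb                                 # first backup: everything is new
    stored += full_gb * change_rate * (backups - 1)  # later backups: only the changes
    stored /= lz_ratio                               # ordinary compression on top
    return logical / stored

# First full backup alone: you only get the LZ ratio, nothing like 25x.
print(f"{aggregate_ratio(100, 1, 0.02):.1f}x")   # 2.0x
# Thirty retained full backups at a 2% change rate: this is where 25x+ shows up.
print(f"{aggregate_ratio(100, 30, 0.02):.1f}x")  # ~38x
```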
In the emerging tradition of “always on” interactive journalism, I went and talked to the product manager for ProtecTIER. He had two comments about my post:
However, there is no currently accepted alternative term for this kind of data volume reduction. I’ve always gagged on the term SAN (Storage Area Network) because no one could ever show me a storage area. But someone coined the term, it stuck, and there we are. I think it is likely that as this technology gains critical mass some term will likewise appear, sensible or not, and we’ll start to use it. As a guy who has marketed a lot of cutting-edge technology over the years, I suspect “compression” may very well end up being that term. Does someone have a better idea? Please comment.
Hmm.. Data Domain makes the same type of device, with similar “compression” claims, in a product that is actually shipping, with real customers.
This type of compression won’t work for many types of data; for example, digital video files won’t see much, if any, compression, unless there are lots of duplicate clips.
I actually have created a means of lossless compression of random binary data. 25x… as if that was the limit of what I can do.
Open challenge. The Excel sheet is there; do the math yourself. It works.
http://www.security1.free2host.net/Compress.php
Warning: it IS memory intensive!
I just had a quick comment on compression of encrypted data:
Generally speaking, if an encryption algorithm is any good at all, you will not be able to compress it any better than 1:1. This is because encrypted data (encrypted with a good algorithm) is indistinguishable from random data. Since random data has no redundancies, it is impossible to compress this data by attempting to remove redundancies (which is what compression does).
That being said, there are some poor cryptographic methods, such as “Electronic Code Book” (ECB) mode, that do leave redundancy in the data. With ECB mode, the encryption algorithm takes the plaintext (unencrypted) data, divides it into blocks (generally either 8 or 16 bytes), then encrypts each block with a ‘block cipher’. With this method, it is not possible to decrypt a ciphertext (encrypted data) block without knowledge of the encryption key. However, if any blocks of plaintext are identical, they will produce identical ciphertext blocks. In this way, the ciphertext will potentially contain many identical blocks. However, this is a very poor cryptographic mode, because it is possible to infer much about the plaintext based solely on which ciphertext blocks are identical. If an attacker already knows one plaintext block, it is then possible to use duplicate ciphertext blocks to identify identical plaintext blocks.
Even with ECB encryption, much of the byte-level redundancy gets removed, so you should expect the compression ratio to go down substantially. A good system will always compress the data before attempting any kind of encryption.
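You can see this for yourself with a few lines of Python, using random bytes as a stand-in for well-encrypted data (good ciphertext is statistically indistinguishable from random bytes):

```python
import os
import zlib

# Redundant plaintext compresses well; random bytes -- a stand-in for good
# ciphertext -- do not compress at all, and the zlib framing adds a little overhead.
plaintext = b"quarterly sales report, region 7\n" * 10_000
random_like = os.urandom(len(plaintext))

for label, data in (("plaintext", plaintext), ("encrypted-like", random_like)):
    ratio = len(data) / len(zlib.compress(data, level=9))
    print(f"{label:>15}: {ratio:.2f}x")
```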
I have been using ProtecTier for over a year now. Compression is not how I would describe the product, hence the coined term “Hyperfactor”. It is a very neat product that is growing in features. Typical unstructured file system data yields about 6 to 10 for factoring. Structured data like Exchange, Oracle, and MS SQL will yield about 8 to 60, depending on change rate. Development Oracle and MS SQL boxes that have limited writes consistently yielded nearly “free” backups after the initial backup. Heavily written databases were consistently in the single-digit range. Using products like NetBackup’s FlashBackup, backing up raw partitions, will consistently yield factoring in the 40 to 60 range, unless you defrag the filesystem. This is because the streams are nearly identical during each backup. Oracle logs, which are nothing but changed data, will yield about 1.5 to 3 for Hyperfactoring.
One commonly missed fact is that you cannot just drop it into your environment and sit back and watch. There are changes, some subtle and some not, that you have to make in order to enjoy higher Hyperfactoring numbers.
The magic bullet behind Hyperfactoring is stored hashes of blocks of user data. Anyone discounting 25x+ compression needs to think outside the box. It is a little different thinking than the typical 2:1 tape compression. There is no fooling about it: if you structure the data properly for ProtecTier, you can see over 25x. Notice that I didn’t say easily…
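That “structure the data properly” caveat is the whole game. Here is a toy illustration of why alignment matters, using generic fixed-block hashing rather than anything Diligent has documented about Hyperfactor: a single inserted byte shifts every later block boundary, and naive block hashes stop matching, which is why nearly identical, well-aligned streams (like raw-partition backups) factor so much better.

```python
import hashlib
import os

BLOCK = 4096  # assumed block size for this toy example

def block_hashes(data):
    """Hash every fixed-size block of a byte stream."""
    return [hashlib.sha1(bytes(data[i:i + BLOCK])).digest()
            for i in range(0, len(data), BLOCK)]

baseline = os.urandom(4 * 1024 * 1024)   # "yesterday's" backup stream

aligned = bytearray(baseline)
aligned[100:104] = b"EDIT"               # in-place change, block boundaries intact

shifted = b"X" + baseline                # one inserted byte shifts everything after it

for label, todays in (("aligned change", aligned), ("one-byte insert", shifted)):
    seen = set(block_hashes(baseline))
    new = block_hashes(todays)
    matched = sum(1 for h in new if h in seen)
    print(f"{label}: {matched}/{len(new)} blocks match yesterday's backup")
```

The aligned change dedups almost perfectly; the shifted stream matches nothing. Shift-tolerant matching is presumably where the real product earns its keep, and it is also why the tuning work described above is not optional.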