NAB made a big impression on me
I’ve been putting off writing about SNW because I’ve been busy. I still am. But the successful procrastinator has to use every excuse possible, and writing about SNW is my last one.
Paul Calleja
Paul is responsible for delivering HPC services at Cambridge University. He has an interesting take on HPC issues because he not only follows the technology, but he is also responsible for figuring out how to make it pay for itself under a government edict. Somebody paid to bring him over to SNW, and he told me who, but I didn’t write it down and now I can’t remember. So he may be shamelessly plugging someone and I wouldn’t know it.
Paul started using HPC for molecular modeling 14 years ago and has been in charge of the Cambridge HPC center for 18 months.
Some observations – as I typed them, so they may differ from what he actually said:
- His budget:
  - £2 million capital
  - £2 million operating cost
  - over a 3-year life
- MPI (Message Passing Interface) programming is a crock – you’re lucky to get 20-25% of peak performance; 10-15% is common
- Big problem is storage: how do you connect thousands of processors to hundreds of TB?
The good news: Paul believes that Microsoft and Intel will figure out how to make parallel programming work on CPUs like the 80-core monster Intel demo’d a few months ago.
He’s also interested in the fact that GPU power is doubling every 12 months, while CPU power is doubling every 18 months. That means that in three years CPU power will be up 4x while GPU power will be up 8x. In six years those numbers are 16x and 64x.
Power factors will always bite you.
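For the arithmetic-minded, here’s a quick back-of-the-envelope sketch – mine, not Paul’s – that just compounds the two doubling periods:

```python
# Back-of-the-envelope: compound growth from two doubling periods.
# Assumes clean doubling every 12 months (GPU) and 18 months (CPU) -
# a simplification of the original observation.

def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiple of today's performance after `years`."""
    return 2 ** (years / doubling_period_years)

for years in (3, 6):
    cpu = growth_factor(years, 1.5)   # CPU doubles every 18 months
    gpu = growth_factor(years, 1.0)   # GPU doubles every 12 months
    print(f"{years} years: CPU {cpu:.0f}x, GPU {gpu:.0f}x")

# 3 years: CPU 4x, GPU 8x
# 6 years: CPU 16x, GPU 64x
```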
Unrated storage blogger picture!
Anil Gupta tried to get a group of storage bloggers together, which I missed because of NAB, but I did meet Anil, a very nice fellow, as well as Tony Pearson and Clark Hodge. Scroll down for the picture. That glazed look in my eye: I’d just had a couple of martinis with an i-banker.
Sun posse ambushes naive storage blogger
I’d scheduled a meeting with Nigel Dessau, Sun’s new storage sacrificial lamb marketing VP. My reputation preceded me, as it was 6 against 1.
The good news was that I didn’t recognize a single one of them, which meant some housecleaning had occurred. To put them at ease I noted that Sun storage had lost market share for 10 years straight under at least three GMs. So what would be different now?
Nigel responded after only a few milliseconds of a clenched jaw. His story, in bullets:
- The Sun/StorageTek merger has led to a simplification of the storage organization:
  - The IO stack and its related software engineers have been moved to Solaris
  - Device management is owned by the storage group
- Since the IO stack is attached to Solaris, and Solaris is open source, Sun is moving to open source storage
- Nigel then outlined the first three of Sun’s four-part open source strategy:
  - Solaris picked as a general-purpose software platform
  - Now, make Solaris the best choice for running any storage, regardless of what the applications are running on
  - Make that software downloadable open source
  - Monetize by TBD – or maybe TBA; my notes are unclear
And he noted that Sun’s QFS acquisition turned out well.
I was favorably impressed by the group, and the fact that a couple of the women play poker with ZFS’s architect. So I’ll ratchet down the Sun storage (self-inflicted) threat meter from RED to ORANGE level. I hope I don’t regret it.
Data Domain
The charming Beth White and equally charming Ed Reidenbach took time away from more important things to meet with me as well. Data Domain is on a roll and well they should be since they are going public. I may yet take a look at their S1, but no promises.
DD now has 750 customers and has shipped over 2200 systems: disk-based backup appliances and diskless gateways.
I still don’t get why the industry refers to “de-duplication” rather than compression – why use a well-understood term when you can invent a new one? – but they did make the point that compression rates depend on your data types, backup policies and retention policies. Basically, the more your data stays the same, the higher your backup compression rate.
So the 20x – 50x – 300x backup compression number arguments are a bit silly. It sounds like DD has some good technology, like their Data Invulnerability Architecture – does that come with a warranty? – as well as some up-to-the-minute features like the ability to search email archives – handy for figuring out who knew what, and when, about stock-option backdating or fired US attorneys.
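To see why policy matters so much, here’s a toy model – mine, not Data Domain’s – of weekly full backups where only a small fraction of the data changes each week and everything unchanged dedupes away:

```python
# Toy model of backup "compression" (dedupe) ratios - mine, not Data Domain's.
# Assume weekly full backups of the same data set; only `change_rate` of the
# data is new each week, and everything unchanged is stored only once.

def dedupe_ratio(backups: int, change_rate: float) -> float:
    """Logical data backed up divided by unique data actually stored."""
    logical = backups * 1.0                      # each full backup = 1 unit
    unique = 1.0 + (backups - 1) * change_rate   # first full + weekly changes
    return logical / unique

for change in (0.01, 0.05, 0.20):
    print(f"{change:.0%} weekly change, 52 fulls retained: "
          f"{dedupe_ratio(52, change):.0f}x")

# 1% weekly change, 52 fulls retained: 34x
# 5% weekly change, 52 fulls retained: 15x
# 20% weekly change, 52 fulls retained: 5x
```

Long retention and low change rates produce the headline numbers; churny data doesn’t.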
Any readers using DD kit? I’d love to hear about your experience with it.
Update: W. Curtis Preston takes me to task for confusing de-dupe with compression. He didn’t change my mind, but he makes some good points in a well-written post.
The StorageMojo take
The coolest thing at SNW was that I heard from a number of up-and-coming vendors that sales have really started to move – things like 100% year-over-year growth. That says to me that some of the new paradigms are finding traction with buyers – the reason we do all this stuff. And that is the best news of all.
Comments, corrections, clarifications welcome. And a cool storage product is coming out next week. Come by Monday for the details – or as many as I was able to weasel out of the CEO.
Maybe with these fancy open system thingies deduplication and compression are the same thing. But on a real system (mainframe) one may dedupe and compress data as separate actions. One may even dedupe and then compress the same data. Oh well, I’ll never understand these open systems thingies. Time to retire and go trout fishing.
Red,
Would I be correct in assuming that real computers actually remove duplicate files during de-duplication and not, say, do compares and store the deltas?
MPEG4 compression does exactly what the moderns call “de-duplication”, with one tiny difference: de-duplication does out-of-order MPEG4 compression.
This kind of hair-splitting sales prevention is what happens when you let techies, like Prof. Li at Data Domain and Neville Yates at Diligent, do end-user marketing.
I remember engineers grousing about calling 100mbit ethernet “ethernet”, since it wasn’t CSMA/CD like “real” ethernet. I have yet to meet a single actual customer who a) cared, or b) knew the difference.
I estimate they set themselves back at least 12 months by insisting on a new word. But I’m willing to look on the bright side: maybe their stuff wasn’t working and they needed the extra time. I’ve seen that happen too.
Robin
Ummm… they don’t call de-duplication compression because it isn’t compression…
Uh, it reduces the size of large collections of bytes.
Explain how mpeg4 isn’t compression?
Assume you have 10 x 1GB files that were exactly the same. If you use compression to reduce their size (assuming 50% compression rate), you’d end up with 10 x 500MB files, or 5GB of data.
If you use de-duplication, the system will keep one copy and leave pointers for the other nine copies. Hence, you end up with 1GB of data vs. 5GB of data. Add compression on the back end and you end up with 500MB….10% of what you would have ended up with using only compression.
Obviously de-duplication implementations aren’t this simple, but this should illustrate the difference sufficiently for most. In -=essence=-, de-duplication does the exact same function as compression…but in -=practice=- that function is implemented quite differently. The difference between the two is more dramatic than the difference between, say, bzip and gzip.
So I don’t really have a problem with the new term – it more clearly defines the function. Clarity is a good thing.
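Here’s a minimal sketch of that 10 x 1GB example – toy code, not any vendor’s implementation – with dedupe keyed on a content hash and compression applied to whatever survives:

```python
# Toy illustration of the example above - not any vendor's implementation.
import hashlib
import zlib

def compress_only(files: list[bytes]) -> int:
    """Compress every file independently, duplicates and all."""
    return sum(len(zlib.compress(data)) for data in files)

def dedupe_then_compress(files: list[bytes]) -> int:
    """Store each unique file once (keyed by content hash), then compress it."""
    store: dict[str, bytes] = {}
    for data in files:
        key = hashlib.sha256(data).hexdigest()   # identical content -> one copy
        if key not in store:
            store[key] = zlib.compress(data)
    return sum(len(blob) for blob in store.values())

# Ten identical "files", scaled down from 1GB to 1MB to keep the toy quick.
files = [bytes(1_000_000)] * 10
print("compression only :", compress_only(files), "bytes")
print("dedupe + compress:", dedupe_then_compress(files), "bytes")
```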
I see the confusion — allow me to explain. From my archaic mainframe perspective I was referring to deduplicating records within a file. Even common sorting utilities such as DFSORT and SYNCSORT will do that. And since that reduces the size of the file, I can see why you might call it a form of compression. Once a file has been deduped, it may be formally compressed. But the second operation I refer to would use an algorithm such as Lempel-Ziv compression.
Joshua’s excellent explanation demonstrated, but failed to call out explicitly, the major difference between deduplication and compression as typically implemented: the former works by eliminating duplicate objects (files), while the latter works *within a single* object (file) to eliminate redundancy therein (normally by collapsing duplicate bit- or byte-sequences, though one could imagine sufficiently advanced mechanisms that could collapse other bit- or byte-sequences – e.g., those which could be recreated by a discoverable algorithm whose storage requirements were smaller than their size).
Now, one might argue that if you look at all your backing storage as a single very large object the two mechanisms would *then* be identical, but this is not the case, because in-object compression works within relatively small scopes (say, a few KB for a file system using compression, because it has to be able to reconstruct data in a reasonable amount of time, which means it can’t afford to read in large numbers of disk sectors before it can extract the information requested; compression in sequentially-accessed files can have somewhat larger scope, because it can, within reason, be assured that all earlier parts of the file will be available for decompressing the next part).
Deduplication, by contrast, works *only* on relatively large, identical byte sequences: a) the sequences must be sufficiently large – and therefore sufficiently few in number – that indexing them across the entire system (with fast access if you want to dedupe synchronously rather than as a background operation) is feasible, and b) the sequences must be sufficiently large that going elsewhere to access one (e.g., if you’re deduping at the block level and part of one file is duplicated in another) won’t seriously slow down access (though when *entire* small files are so deduped that’s not a problem, since the entire access gets revectored).
– bill
Most of the de-dupe products I’ve seen work on the byte-stream emitted from a backup application. They have no file system metadata. The magic comes in very rapidly splitting up the stream into likely segments, comparing those segments to all those already stored and, if you do it this way, doing a compare and storing the differences. All at wire speed.
The technology is very impressive.
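A much-simplified sketch of that segment store – my own toy, nowhere near wire speed, and not how any particular vendor does it – fingerprints each segment and stores only the ones it hasn’t seen:

```python
# Toy segment store - not any vendor's implementation. Real products split
# the stream at content-defined boundaries (rolling hashes) so segments line
# up even when data shifts; this toy just uses fixed 8KB chunks to show the
# fingerprint-and-store-once idea.
import hashlib
import random

SEGMENT = 8192

def store(stream: bytes, index: dict[str, bytes]) -> int:
    """Add a backup stream to the segment index; return bytes actually stored."""
    new_bytes = 0
    for off in range(0, len(stream), SEGMENT):
        seg = stream[off:off + SEGMENT]
        fingerprint = hashlib.sha1(seg).hexdigest()
        if fingerprint not in index:      # store only segments we haven't seen
            index[fingerprint] = seg
            new_bytes += len(seg)
    return new_bytes

random.seed(0)
first = bytes(random.getrandbits(8) for _ in range(50 * SEGMENT))
second = first[:-2 * SEGMENT] + bytes(random.getrandbits(8) for _ in range(2 * SEGMENT))

index: dict[str, bytes] = {}
print("first backup, bytes stored :", store(first, index))    # everything is new
print("second backup, bytes stored:", store(second, index))   # only the changed tail
```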
To Joshua’s point about the clarity of the term de-duplication: it is a matter of timing, isn’t it? With first use, the term needs to be explained, so it is less clear. Reading through the technically astute commenters here, one sees that the clarity is arrived at through time and discussion, since de-duplication has been used before.
Then there is the question of intent: given that you’ve now spent precious time and energy imparting the meaning of the word de-duplication, how does this affect the customer’s buying behavior? Does it change the business value of your product? Does it create valuable differentiation? Will they buy sooner? Buy more?
Now I think the folks at Data Domain might say that since they’ve explained de-duplication they can now explain why they are better at it. Because as their marketing implicitly recognizes, de-dup is riskier, since you are now relying on one copy, plus pointers, plus an index, plus a bunch of software, instead of a bunch of copies – copies that are slow and expensive to create and difficult to track and read – but there is safety in numbers.
This is turning into another post, so I’ll stop here.
Robin
Well, there’s a reasonable argument that in at least many cases losing *any* copy of the data is bad, since the application or user that lost that copy does not necessarily know where to go to find another. In that case, there’s no safety in numbers at all.
Even for situations in which that’s not the case, there’s a very easy solution: replicate the single copy a bit more than usual. You don’t need 5,000 copies of a datum to make it secure beyond any reasonable doubt, you don’t even need 5: 3 or at the very most 4 will do nicely – and if you want, you can compress the overhead of a large segment down to that of little more than a single copy by using double- or triple-parity RAID to store it.
As for complexity, there’s really not noticeably more than has existed in Unix (and VMS, and RSX) file systems for time immemorial: multiple pointers to a single copy of data is precisely what hard links are all about (and Unix even handles it right by using link counts, though it took VMS a lot longer to do so, since it wasn’t intended to be a generally-used feature there).
Deduping only on the backup stream by definition relegates the facility to backup-only use. I strongly suspect that conventional backup mechanisms may go the way of the dodo within not all that many years, but that deduping may even increase in importance as larger and larger objects get routinely stored (and potentially duplicated). In any event, deduping your on-line storage (as well as any backups) has significant benefit.
– bill
Over on Backup Central, W. Curtis Preston gently takes me to task for calling de-dupe compression. He makes some good points about how de-dupe works, so it is worth looking at, even if I remain unpersuaded.
Here is my response:
Curtis,
Good stuff! I realize that the de-dupe ship has sailed and no one is going to call de-dupe compression. My interest is the marketing of new technology: how do you communicate to maximize uptake? My point is that by inventing the term de-dup, the companies hurt themselves.
Other markets aren’t such purists. MPEG-4 is my favorite example: it is popularly known as compression, yet it is a toolbox of compression techniques, not a single algorithm, and those techniques share a lot of similarities with de-dupe technology. De-dupe has more in common with image compression than text compression.
Nor is de-dupe implemented the same way by the vendors, so it isn’t a single algorithm either. Data Domain has a patent on a technique for figuring out how to split the data into the chunks they use. Diligent does it differently: if it figures a block is similar enough, it will delta the two and store the differences. Either way, both techniques look like out-of-order MPEG-4 compression.
The technology aside, I believe the de-dupe folks set themselves back 12-18 months by inventing a new term for buyers to learn. De-dupe has some wrinkles that you’ve ably pointed out, yet from the perspective of accelerating product uptake they are hardly worth the confusion the industry created for itself.
Great technology, lousy marketing. I’ll link to you from my post on StorageMojo.
Robin