All de-dup works

by Robin Harris on Tuesday, 3 May, 2011

Forget the flame wars over moving window versus fixed block de-duplication. A recent paper, A Study of Practical Deduplication (pdf) from William J. Bolosky of Microsoft Research and Dutch T. Meyer of the University of British Columbia found that whole file deduplication achieves about 75% of the space savings of the most aggressive block level de-dup for live filesystems and 87% of the savings for backup images.

The paper was presented at FAST 11, where it won a “Best Paper” award. The researchers scanned the file systems of 857 Microsoft desktop computers over 4 weeks, after asking permission to install rather invasive scanning software.

The scanner took a snapshot using Windows’ Volume Shadow Copy Service (VSS) and then recorded metadata about the file system itself. It recorded each file’s metadata, retrieval and allocation pointers, as well as the computer’s hardware and system configuration. The researchers excluded the pagefile, hibernation file, the scanner itself and the VSS snapshots the scanner created.


During scanning each file was broken into chunks using both fixed-block chunking and Rabin fingerprinting. The researchers also identified whole file duplicates.

Rabin uses dynamically variable block sizes to maximize compression. Figuring out where to break the file adds to the overhead.
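To see why variable-size chunks can pay for their overhead, here is a minimal sketch that compares fixed blocks with content-defined chunks after a single byte is inserted at the front of a file. It is not the paper's code: the rolling hash is a simple polynomial hash standing in for true Rabin fingerprinting, and the names `fixed_chunks`, `content_defined_chunks` and `shared`, the window size and the divisor are all assumptions chosen for illustration. Real chunkers also enforce minimum and maximum chunk sizes, which this sketch leaves out.

```python
import hashlib
import random

BASE = 257            # hypothetical parameters for the stand-in rolling hash
MOD = 1_000_000_007

def fixed_chunks(data, size=64):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, window=16, divisor=64):
    """Cut wherever the hash of the last `window` bytes hits a target value."""
    top = pow(BASE, window, MOD)   # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD                   # slide the new byte in
        if i >= window:
            h = (h - data[i - window] * top) % MOD    # slide the old byte out
        if i + 1 >= window and h % divisor == divisor - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                   # trailing partial chunk
    return chunks

def shared(a, b):
    """Count distinct chunks that appear in both lists, compared by SHA-256."""
    fingerprint = lambda chunk: hashlib.sha256(chunk).digest()
    return len({fingerprint(c) for c in a} & {fingerprint(c) for c in b})

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"!" + original    # one byte inserted at the very start

fixed_shared = shared(fixed_chunks(original), fixed_chunks(edited))
cdc_orig = content_defined_chunks(original)
cdc_shared = shared(cdc_orig, content_defined_chunks(edited))

print("fixed blocks still shared:", fixed_shared)
print("content-defined chunks still shared:", cdc_shared, "of", len(cdc_orig))
```

The insertion shifts every fixed block by one byte, so none of them match any more, while the content-defined boundaries move with the data and almost every chunk is found again. The price is the per-byte hash computation, which is the overhead mentioned above.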

The resulting data set was 4.1 TB compressed – too large to import into a database – and was further groomed to lose unneeded data.

De-dup issues
De-duplication is expensive. You’re giving up direct access to the data to save capacity.

The expense is in I/Os and CPU cycles. Comparing each chunk’s fingerprint to all other chunks is nontrivial. De-duplication indirection adds to I/O latency. A file’s chunks are scattered around, requiring small and expensive random I/Os to read.
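A toy model makes the indirection visible. The sketch below is an assumption-laden illustration, not any vendor's design: `ChunkStore`, its fixed 4-byte chunk size and the in-memory dictionaries are hypothetical. Each file becomes a "recipe" of fingerprints, and every read has to chase those fingerprints back to chunks that may live anywhere.

```python
import hashlib

class ChunkStore:
    """Hypothetical in-memory block-level dedup store."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}    # fingerprint -> chunk bytes (stored once)
        self.recipes = {}   # file name -> ordered list of fingerprints

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()   # CPU cost on every write
            self.chunks.setdefault(fp, chunk)        # index lookup per chunk
            recipe.append(fp)
        self.recipes[name] = recipe

    def read(self, name):
        # One lookup per chunk: on disk these would be scattered random reads.
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())

store = ChunkStore(chunk_size=4)
store.write("a.txt", b"AAAABBBBCCCC")
store.write("b.txt", b"AAAAXXXXCCCC")

print(store.read("b.txt"))      # reassembled from shared and unique chunks
print(store.stored_bytes())     # bytes actually kept for 24 logical bytes
```

Two 12-byte files share two of their three chunks, so only four chunks are stored. Every write pays for hashing plus an index lookup, and every read pays for one lookup per chunk.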

Older techniques, such as sparse files and Single Instance Storage, are more economical even if their compression ratios aren’t as high. Fewer CPU cycles, less indirection and good compression.
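The comparison the paper makes can be sketched on toy data. The files, the 4-byte block size and the helper names `unique_bytes` and `split` below are made up for illustration, so the resulting ratio says nothing about the paper's 75% and 87% figures. It only shows how "whole file savings as a fraction of block savings" is computed.

```python
import hashlib

def unique_bytes(blobs):
    """Bytes left after keeping one copy of each distinct blob."""
    seen = {}
    for blob in blobs:
        seen.setdefault(hashlib.sha256(blob).digest(), len(blob))
    return sum(seen.values())

def split(data, size):
    """Fixed-size blocks; the last one may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]

# Hypothetical file set: one exact duplicate, one near duplicate.
files = {
    "report.doc": b"AAAABBBBCCCC",
    "report_copy.doc": b"AAAABBBBCCCC",
    "report_v2.doc": b"AAAABBBBDDDD",
}

logical = sum(len(data) for data in files.values())
whole_saved = logical - unique_bytes(files.values())
block_saved = logical - unique_bytes(
    block for data in files.values() for block in split(data, 4)
)
ratio = whole_saved / block_saved

print(f"whole file saves {whole_saved} bytes, block level saves {block_saved}")
print(f"whole file captures {ratio:.0%} of the block-level savings")
```

Whole file dedup only catches the exact copy, while block-level dedup also catches the shared blocks in the near duplicate. The whole file pass needs one hash per file and no per-chunk index, which is where the CPU and indirection savings come from.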

The StorageMojo take
If capacity is expensive – read “enterprise” – and I/Os cheap – SSD or NVRAM in the mix – fancy dedup can make sense. It is at the margin of capacity cost and I/O availability that the value prop gets dicey.

Low duty cycle storage – SOHO – with plenty of excess CPU and light transactions could use deduped primary storage. But with 10 TB of data to back up, most users wouldn’t notice the difference between whole file and 8KB Rabin.

It’s the price tag and user reviews the SOHO/SMB crowd will be looking at.

Courteous comments welcome, of course. The paper also included some interesting historical data about Windows file systems that I covered on ZDNet.

12 comments

Sajeev Aravindan May 4, 2011 at 1:42 am

To quote from the paper – “For four weeks of full backups, whole file deduplication (where a new backup image contains a reference to a duplicate file in an old backup) achieves 87% of the savings of block-based.”

In my opinion the results would have been different if incremental backups were considered. That would have resulted in much lower savings from file deduplication. The block level deduplication ratio in that case could have been much higher depending on the incremental backup block size and the deduplication block size used.

Dirk Meister May 4, 2011 at 2:06 am

Well, other people have come to other conclusions than the MS people with regards to deduplication in comparison to full file duplicates.

Two years ago I did a very similar study (in part, even the figure titles are the same as in the Microsoft paper). I observed the central home directory server of the university over a few months. OK, the data set was much, much smaller (only around 500 GB), but given the massive number of “.h” files in the MS dataset, critics claiming that neither my dataset nor the MS one is representative of enterprises have a valid point.

I found significant advantages of looking deeper into the files. The study can be found here: http://portal.acm.org/citation.cfm?id=1534541 (if this shameless plug is allowed).

Jacob Marley May 5, 2011 at 3:07 pm

A few years ago, when dedup was first pitched as the next big thing for backup disk storage, I was sold on the idea.

This report and a few other sources of information make me wonder…

1. How much extra space savings can fixed or variable block de-dup achieve over and above synthetic full backups?

If the answer to question 1 is that the space savings correlate closely to the amount of duplicate files that exist across the data set, then…

2. How much extra savings can fixed or variable block de-dup achieve over and above backup systems that do single-instance file backup?

EricE May 7, 2011 at 8:34 am

I’m still on the fence about de-dupe for the enterprise for general purpose storage. Over the last couple of years, I think we have achieved more by automatizing our storage and implementing electronic records retention schedules that catalog, archive and most importantly destroy data that is no longer required. It’s not as easy as slapping a nifty feature like “de dupe” on the network and declaring victory, but I think it’s far more realistic for the long term. The best way to optimize the storage of data is to not store it if you don’t have to :)

The most impressive example of de-dupe I have seen lately is for backup of workstations. I was rather surprised at just how effective the de-dupe for the workstation backup feature of Windows Home Server is! It makes sense as there are tons of duplicate files on the average pool of workstations. I’m looking forward to testing it out on larger networks with the new SBS Essentials offering that will back up 25 machines instead of just 10. I realize this is more of a small/medium business issue, but in such environments that typically lack formal IT support, having backups of workstations where the files on them or the workstations themselves can be recovered quickly is extremely valuable. De dupe coupled with bare metal restore is very effective in these smaller environments, and I wish MS offered a version that went above 25 workstations for some of the larger environments I help out with. As it is, the product is cost effective enough I can justify running multiple instances, but that detracts from the efficiency quite a bit.

mrperl May 7, 2011 at 11:32 pm

If all de-dupe works, then I wonder why de-dupe companies are worth billions of dollars to acquirers.

Even when I tell storage vendors that I don’t have an application for de-dupe, I still can’t stop them from pitching me on it.

John (other John) May 17, 2011 at 9:28 am

Thanks, EricE, for turning me on to Home Server having dedupe! We have a designer just joined, who is going to be working from home, young kid to keep up with and forget our insane hours, and i was going spare thinking about backing up her prodigious output of multi gig PS files over WAN at a reasonable cost. Shows how blinkered i can be by complex “enterprisey” solutions. (and in this context, the media features of Home Server will be pretty handy).

I’m with you on getting files herded, too. One of the very first arguments i had with my business partner (may he RIP, sadly) was on limitations and document destruction (which was largely shredding then!). Don’t quote me on this, but i think the massive tome which is the ’06 Companies Act in the UK allows small (really small, to be honest) companies to shred after 3 years. I’d not advocate that though. Limitations Act runs 6 years from the *discovery* of tort, and in some cases 12 years from discovery, depending on what caused inaction. So, aggressive classification is the key and something i teach absolutely everyone to learn. I’ve sent hires on days off to sit and watch cases which depend on these distinctions. Funnily, when i talk about this to people selling me data management tools (lifecycle management, anyone, gah) i get blank faces. The only time to apply metadata is at the time of creation. One neat Windows feature i started to use was libraries. 5 icons on the desktop, color coded, choose 1-5 for what we Brits would term “cockup factor”, but on internal descriptions is “scrutiny desirability”! So anything in Red 5 (another brit pun, for one good racing driver we had*) gets eyeballs, and 3 thru 5 hit WORM tape offsite.

What’s cool about Eric’s take, is it’s plain sensible. Lead on!

Just amusing myself with this thought: what economic revolution might happen if we got small business to grok the reality of the incessant inchoate and irritant IT pitch, and use common sense? There’s an initiative for you. No, for us all to start. Hmm, when i strike up these conversations, i have problems coming across as not being part of the problem. Hopefully that’s just me. Guess what we need is 20 Robs, publishing far more widely. I totally forget his name, but one guy in the 80s over here talked the early “digital revolution” game and got very geek discussion syndicated. There seems to be a converse problem, in the media, where once a perceptually arcane interest was treated with some reverence, digital ubiquity has forced the LCD on so called reportage.

cheers all,

– john

*Extra cool, because “Our Nige” won Indy, too :)

John (other John) May 17, 2011 at 9:56 am

Dirk Meister,

I would love to read your paper, which is widely cited. But – and this is commercially close to me since i can remember – i really have problems paying ACM for what I imagine are publicly funded papers. Spotted your blog, and it looks excellent. Will read properly. Not a personal comment at all, i just vote with my feet on things which i think are artificial barriers to innovation.

. . .

Silos.

This is what i think about dedup:

it’s another way to sell a software layer on a disk pool which exists to feed a tape array. And then that is a way to sell commodity hardware at silly markups.

I’ve said as much before, but i can do pretty well on our backup software plan, and throw commodity WE.4s and Quantum LTOs at the problem, Spectra for anything which needs to be kept well indexed for posterity on the physical side. I’m thinking of how the amazing Gaumont / Losey Don Giovanni had to be remastered and the slight hitch was finding 16track 2″ reels in 7 miles of Sony storage in New Jersey, when they lost the index card.

Eric touched on something, to my view nailed it, pointing out the workstation side of dedupe. Run the darned thing overnight! OK, we don’t have much of a window for that. But what was the SUN distributed compute loader for workstations? I’m thinking something like that. Local cloud. Ugh, sorry for saying “cloud”. And my apols for my love-in with Eric, just a really helpful comment :)

. . .

Rob,

they did it anyhow, 30 years later, for profiteering reasons, and an early kind of quantitative easing, but in i think ’54, my pop, no mean exec, went to the yearly conference, and – keynote – said to the obviously vested audience “let’s disband the Thrifts, we got no more social purpose”. Sometimes i think your take on the storage industry is very close to that, i.e. cut out the nonsense, get to the data, storage is just a function, and the industry is peripheral to business aims. I may misread you, and i tell you, my pop nearly trashed his career in a permanent way which would have meant poverty so i do not recommend, but what amazes me in my personal study of business is how effective we all are about saying an industry has gone to the wall, but it pops up in another guise. Only in software do whole categories of business get rinsed away. This is my alternative take on the Microsoft anti-trust cases (which i read in full) and can be imagined by anyone who ever deflated a .zip under DOS. The result has been phenomenal growth*. My abject failing is inability to grasp how to apply the same economies more generally.

best to all & thanks guys for the discussion,

– john

*can someone please “dedupe” “web 2.0”, pretty please!

DrDedupe May 23, 2011 at 9:29 am

Interesting paper from Microsoft, but (from my eyes) a bit myopic.

Fixed block, variable length, and single instancing all have their pros and cons. Microsoft’s viewpoint that SIS is the most “practical” is flawed for a few reasons:

1. Microsoft jettisoned SIS from Exchange 2010 because it was too compute intense. No form of dedupe is penalty-free.

2. SIS provides reasonable saving for file system data, but what about all the other data on primary and secondary storage? How many different dedupe processes do you want to manage?

3. Storage functions like dedupe are probably best left to the storage systems themselves, and not to servers that have their hands full already.

I’m a little biased but still believe that NetApp deduplication provides the best balance of space savings and low performance impact across all SAN and NAS data types.

Thanks,

DrDedupe (a NetApp employee)

Jacob Marley May 23, 2011 at 10:27 pm

@DrDedupe

If the application/OS does either of the following…
a) single instance storage
b) compressed storage of data
Storage level dedupe fails entirely or returns marginal benefits for the required computational overhead.

I would argue, that the application can make the best decisions about what can be deduped, well, within reasonable CPU resources.

Perhaps the time for application level dedupe isn’t here yet but look at how dedupe backup appliances compare to backup software with builtin dedupe.

DrDedupe May 24, 2011 at 9:59 am

@Jacob Marley

Ken Olsen once said “the nice thing about standards is there are so many of them.” I guess the same could be said of dedupe: everyone seems to think their way is the best way to Scrooge out every last bit of storage capacity. Dedupe is one way to do this, as is thin provisioning, compression, snapshots, clones and many other storage technologies. I am a believer that storage systems, not servers, are the most practical place to apply this muscle and invoke space-savings technologies holistically where the data rests.

Thanks

DrDedupe

Abdul Rasheed May 24, 2011 at 7:01 pm

Great debate here! Storage savings from deduplication need not automatically translate to reduced TCO. A deduplication solution with expensive upfront cost and maintenance cost (this is where Rabin can be expensive in terms of watts required per GB of storage saved) can diminish overall benefits. Deduplication is efficient when the technology is application aware, hence it makes more sense to leave it up to a backup agent that already knows how to identify the file boundaries and offsets and decide segments. It is a good compromise between processor intensive variable block method and traditional fixed block method. Deduplication at primary storage is great, but that is not where you want to start. Deduplicate at source or target during backups. What is the point in deduplicating at primary storage layer if backup is going to rehydrate everyday for sending it to secondary storage?

Disclaimer: I work for Symantec, my views need not represent those of my employer.

Joe July 19, 2011 at 3:55 pm

EricE,
Disappointing but true, the only other way to get SIS from Microsoft is ‘Storage Server’ (which is OEM only). I just got off the phone with MS and according to a SBS sales tech it is not included in any version of SBS 2011.
