De-duplicating primary storage

by Robin Harris on Tuesday, 30 September, 2008

NetApp is announcing a deal today: use their de-dup software with a new NetApp filer for VMware storage and they guarantee that you’ll need a minimum of 50% less storage. You can be sure that NetApp considers 50% a low bar – 80% is more like it.

Why not for most storage?
In a world of rarely accessed unstructured data, de-duplication of primary storage is an obvious next step. A recent post discussed the findings of a joint NetApp/UC Santa Cruz study.

A quick recap of some of the study’s findings:

  • Files rarely re-opened. Over 66% are re-opened once and 95% fewer than 5 times.
  • Over 60% of file re-opens are within a minute of the first open.
  • Less than 1% of clients account for 50% of requests.
  • Infrequent file sharing. Over 76% of files are opened by just 1 client.
  • Concurrent file sharing very rare. As the prior point suggests, only 5% of files are opened by multiple clients and 90% of those are read only.
  • Most file types have no common access pattern.

And there’s this: over 90% of the active storage was untouched during the study.

Is it real?
Some commenters were dubious about the results of the study, citing sample size and atypical workload concerns. But the corporate overhead – marketing, finance, HR etc. – part of the workload felt right to me.

A lot of stuff comes in and gets saved “just in case.” Most of it never gets looked at, but when you need a particular file, you need it.

I’m less clear on engineering workloads – I suspect there are major differences among disciplines – but again it didn’t seem unreasonable. But let’s leave the engineers out of the equation.

How important is performance?
The big knock against de-dup for primary storage is the performance hit. Some vendors claim in-line de-dup at wire speed, while others optimize for backup windows and de-dup in the background. Maybe the latter are more efficient.

But given that 90% of the active storage was untouched and less than 1% of the clients account for 50% of the requests, how important is performance? Cherry-picking the low-access users – i.e. road warriors whose notebook is their primary I/O bucket – shouldn’t be hard.

So what percentage de-dup compression of unstructured data is feasible? That is the key to understanding the economic basis of primary storage de-duplication of unstructured data.
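
To put the percentages in capacity terms, here is a minimal Python sketch of the arithmetic. The function name `capacity_multiplier` is hypothetical, and the only assumption is that "X% less storage" means X% of the logical data no longer needs physical disk.

```python
def capacity_multiplier(savings_fraction: float) -> float:
    """Logical data stored per unit of physical disk for a given space saving.

    Hypothetical helper for illustration: a saving of s means each logical
    byte consumes (1 - s) physical bytes, so capacity scales by 1 / (1 - s).
    """
    if not 0 <= savings_fraction < 1:
        raise ValueError("savings_fraction must be in [0, 1)")
    return 1.0 / (1.0 - savings_fraction)


# The two figures mentioned in the post: the 50% guarantee and the 80% "more like it".
for pct in (50, 80):
    multiplier = capacity_multiplier(pct / 100)
    print(f"{pct}% saved -> {multiplier:.1f}x logical data per physical TB")
```

The relationship is not linear: going from 50% to 80% savings takes you from 2x to 5x the logical data on the same disks, which is why the exact percentage matters so much to the economics.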

Academics, start your engines!

The StorageMojo take
Primary storage de-dup could be the next big win for IT shops. We just don’t have the data that can tell us how big the win could be.

NetApp (disclosure: I’ve done a minuscule amount of work for them in the last year and accepted their annual analyst junket) is well positioned. Their de-dup software license is free on their NearStore/FAS boxes.

NetApp tells me that they’ve got 13,000 systems running de-dup. Maybe some of those people are using it for primary storage and can tell us how well it works.

If the feature is free, de-duping some primary storage will be standard practice in most data centers within 5 years. As the de-dup technology improves and Moore’s Law drives performance, more and more unstructured data will be de-dup’d as a matter of course.

Courteous comments welcome, of course.

{ 8 comments… read them below or add one }

TylerB September 30, 2008 at 3:12 pm

(disclaimer: I work for an NTAP Partner)
This does work and we have a ton of customers using it. While unstructured data is decent (30% is common), VMware is THE killer app for primary storage dedupe. We have plenty of customers at 70, 80, and even 90% dedupe rates. The beauty of it is that since it’s post-process, it has no noticeable effect on the live data.
Basically we’ve either been installing new NetApp arrays or fronting older ones with v-series all over the place.

Steven Schwartz September 30, 2008 at 3:26 pm

Come on Robin, did you read the NetApp release? Everyone has written about it already. They never claim a 50% reduction in required storage due to deduplication alone; the claim rests on several things… I posted a silly but funny corollary on my blog.

open systems storage guy October 1, 2008 at 1:10 pm

I’ve used it. It’s not for all workloads, but it’s a nice feature for low-use file systems and whatnot. I wouldn’t suggest it on anything that really hits the controllers heavily, because every time a write is done, a process running on the filer hashes the data, which creates something like a 5% processor overhead. During idle times the overhead goes up considerably, as the algorithm does a byte-for-byte comparison of all suspected duplicate data chunks before pointing both sections of the volume to the same chunk.
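
The two-phase process described in this comment (fingerprint on write, verify and share later) can be sketched in a few lines of Python. This is a toy model built only from the description above, not NetApp's implementation; the class `DedupStore`, its methods, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Toy post-process de-dup store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = []    # physical blocks; None once a block is freed
        self.logical = []   # logical block index -> physical block index
        self.by_hash = {}   # fingerprint -> physical indices already kept
        self.pending = []   # (logical index, fingerprint) awaiting the idle pass

    def write(self, data: bytes) -> int:
        """Write path: store the block and record only its fingerprint."""
        self.blocks.append(data)
        self.logical.append(len(self.blocks) - 1)
        logical_idx = len(self.logical) - 1
        self.pending.append((logical_idx, hashlib.sha256(data).digest()))
        return logical_idx

    def dedup_pass(self) -> int:
        """Idle-time pass: byte-compare suspected duplicates, then share them."""
        freed = 0
        for logical_idx, fingerprint in self.pending:
            phys = self.logical[logical_idx]
            for candidate in self.by_hash.get(fingerprint, []):
                # A matching hash only marks a suspect; confirm byte for byte.
                if self.blocks[candidate] == self.blocks[phys]:
                    self.logical[logical_idx] = candidate  # point at shared block
                    self.blocks[phys] = None               # release the duplicate
                    freed += 1
                    break
            else:
                # No verified match: this block becomes a candidate for later ones.
                self.by_hash.setdefault(fingerprint, []).append(phys)
        self.pending.clear()
        return freed

    def read(self, logical_idx: int) -> bytes:
        return self.blocks[self.logical[logical_idx]]


store = DedupStore()
first = store.write(b"x" * 4096)
second = store.write(b"x" * 4096)   # duplicate of the first block
third = store.write(b"y" * 4096)
print("blocks freed:", store.dedup_pass())
```

The split mirrors the cost profile in the comment: the write path pays only for a hash, while the expensive byte-for-byte comparison is deferred to the background pass.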

Netapp filers use the overhead everyone’s been complaining about to save space in the end. If you have to clone databases, can thin provision, take snapshots, and have heavily duplicated files, you’ll probably end up with more data stuffed into your filer than you could get in an equivalent traditional disk box. If you don’t, however, then you’ll need more disks in your filer than you would otherwise.

max October 1, 2008 at 3:30 pm


We have ASIS running w/ ESX on a (primary storage w/ ASIS). In my experience, 50% is a very low bar for NetApp with this setup in a hosted ESX environment (~400 VMs).

Ausmith1 October 2, 2008 at 5:18 pm

Here is the sanitized output from ‘df -s -h’ on one of our filers. It houses about 250 Windows-based ESX development VMs on VMFS volumes.
Filesystem    used      saved     %saved
/vol/vol0/    648MB     0MB       0%
/vol/vol1/    731GB     1230GB    63%
/vol/vol2/    356GB     299GB     46%
/vol/vol3/    9639MB    10GB      53%
/vol/vol4/    108GB     1302GB    92%
/vol/vol5/    158GB     500GB     76%
/vol/vol6/    176GB     903GB     84%
/vol/vol7/    186GB     290GB     61%
/vol/vol8/    148GB     36GB      20%
/vol/vol9/    71GB      53GB      43%
/vol/vola/    150GB     236GB     61%
/vol/volb/    268GB     397GB     60%
/vol/volc/    146GB     42GB      22%

That makes 2.5TB of disk space used and 5.3TB saved by my count.
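
For anyone who wants to check the count, here is a small Python sketch that totals the table above. It assumes decimal units (1 GB = 1000 MB, 1 TB = 1000 GB), which may not match how the filer rounds, and that %saved is saved / (used + saved), which agrees with the per-volume figures shown. The names `ROWS` and `to_gb` are hypothetical.

```python
# (volume, used, saved) copied from the 'df -s -h' output above.
ROWS = [
    ("/vol/vol0/", "648MB", "0MB"),
    ("/vol/vol1/", "731GB", "1230GB"),
    ("/vol/vol2/", "356GB", "299GB"),
    ("/vol/vol3/", "9639MB", "10GB"),
    ("/vol/vol4/", "108GB", "1302GB"),
    ("/vol/vol5/", "158GB", "500GB"),
    ("/vol/vol6/", "176GB", "903GB"),
    ("/vol/vol7/", "186GB", "290GB"),
    ("/vol/vol8/", "148GB", "36GB"),
    ("/vol/vol9/", "71GB", "53GB"),
    ("/vol/vola/", "150GB", "236GB"),
    ("/vol/volb/", "268GB", "397GB"),
    ("/vol/volc/", "146GB", "42GB"),
]

# Assumption: decimal units.
GB_PER_UNIT = {"MB": 0.001, "GB": 1.0}


def to_gb(size: str) -> float:
    """Convert a string such as '648MB' or '731GB' to gigabytes."""
    return float(size[:-2]) * GB_PER_UNIT[size[-2:]]


used_gb = sum(to_gb(used) for _, used, _ in ROWS)
saved_gb = sum(to_gb(saved) for _, _, saved in ROWS)
overall_saved = saved_gb / (used_gb + saved_gb)

print(f"used:  {used_gb / 1000:.1f} TB")
print(f"saved: {saved_gb / 1000:.1f} TB")
print(f"overall saved: {overall_saved:.0%}")
```

Under those assumptions the totals come to about 2.5 TB used and 5.3 TB saved, or roughly 68% saved across the whole set of volumes.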

There are some volumes that ASIS is not enabled on, therefore I have not included them in this output. The only reason that ASIS is not enabled on them is that they are large (>2TB) volumes created before ASIS was freely available. Enabling ASIS on a volume is dependent on the size of the volume relative to the RAM available in the filer. i.e. the largest volume this particular filer can handle is 2TB. A 6000 series filer can handle 16TB ASIS volumes.

Joe Kraska October 5, 2008 at 7:48 am

We have NetApp systems running dedup on primary storage in our environment. This doesn’t slow things down in any appreciable manner at all. I believe NetApp is saying that the 7.2.4 release will contain changes to facilitate dup’s and cache hits, which could very well end up providing performance *increases* in a highly duplicative environment, as with VMWare.

I only wish I’d known about the 2TB limit long ago. We have some >2TB volumes, and migrating off of them would be… painful.

Joe Kraska

Joe Kraska October 11, 2008 at 7:21 pm

The guarantee is mostly there to provide comfort to buyers. Most of our virtual machine volumes are at or near 80% recoup rates from NetApp’s dedup.


Jeremy October 27, 2008 at 10:21 am

I was involved in a project evaluating dedupe for backup, but we ended up moving in the direction of DataDomain’s inline deduplication. In a proof of concept using DataDomain, we were able to get their advertised 1TB/hr rate. We experimented with direct database backups even though DataDomain usually seems to target VTL solutions. We chatted about deduped primary storage but I haven’t personally been involved in any projects yet to actually try it. And NetApp probably has a better proposition for that; I’m just guessing, but inline dedupe is probably too computationally expensive at the moment to be feasible.
