NetApp is announcing a deal today: use their de-dup software with a new NetApp filer for VMware storage and they guarantee you’ll need at least 50% less storage. You can be sure that NetApp considers 50% a low bar – 80% is more like it.
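
For a sense of scale, those percentages translate directly into de-dup ratios: saving 50% means storing the same data in half the space, a 2:1 ratio, while 80% savings is 5:1. A quick sketch of the arithmetic (nothing NetApp-specific assumed, just the percentages above):

    # Convert a claimed space-savings percentage into a de-dup ratio.
    # Plain arithmetic on the guarantee figures - no vendor math assumed.
    def savings_to_ratio(savings_pct):
        return 1.0 / (1.0 - savings_pct / 100.0)

    print(savings_to_ratio(50))   # 2.0 -> a 2:1 ratio
    print(savings_to_ratio(80))   # 5.0 -> a 5:1 ratio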

Why not for most storage?
In a world of unstructured data that is rarely accessed, de-duplication of primary storage is an obvious next step. A recent post discussed the findings of a joint NetApp/UC Santa Cruz study.

A quick recap of some of the study’s findings:

  • Files rarely re-opened. Over 66% are re-opened only once, and 95% are re-opened fewer than 5 times.
  • Over 60% of file re-opens are within a minute of the first open.
  • Less than 1% of clients account for 50% of requests.
  • Infrequent file sharing. Over 76% of files are opened by just 1 client.
  • Concurrent file sharing very rare. As the prior point suggests, only 5% of files are opened by multiple clients and 90% of those are read only.
  • Most file types have no common access pattern.

And there’s this: over 90% of the active storage was untouched during the study.

Is it real?
Some commenters were dubious about the results of the study, citing sample size and atypical workload concerns. But the corporate overhead – marketing, finance, HR, etc. – part of the workload felt right to me.

A lot of stuff comes in and gets saved “just in case.” Most of it never gets looked at, but when you need a particular file, you need it.

I’m less clear on engineering workloads – I suspect there are major differences among disciplines – but again it didn’t seem unreasonable. Either way, let’s leave the engineers out of the equation.

How important is performance?
The big knock against de-dup for primary storage is the performance hit. Some vendors claim in-line de-dup at wire speed, while others optimize for backup windows and de-dup in the background. Maybe the latter are more efficient.
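
For readers who haven’t looked under the hood, here’s a toy sketch of the two approaches (my own simplification for illustration, not any vendor’s design). In-line de-dup fingerprints every block in the write path so duplicates never reach disk; post-process de-dup writes at full speed and folds duplicates together later, outside the hot path.

    import hashlib

    # Toy block store contrasting in-line vs. post-process de-dup.
    # Illustration only - not any shipping product's design. Data is bytes.
    class BlockStore:
        def __init__(self):
            self.blocks = {}     # fingerprint -> unique block data
            self.file_map = {}   # filename -> list of fingerprints
            self.staging = {}    # filename -> raw data awaiting background de-dup

        def write_inline(self, name, data, block_size=4096):
            # In-line: hash every block as it is written, so duplicates are
            # never stored - but every write pays the fingerprinting cost.
            fps = []
            for i in range(0, len(data), block_size):
                block = data[i:i + block_size]
                fp = hashlib.sha256(block).hexdigest()
                self.blocks.setdefault(fp, block)
                fps.append(fp)
            self.file_map[name] = fps

        def write_postprocess(self, name, data):
            # Post-process: accept the write at full speed now...
            self.staging[name] = data

        def dedup_pass(self, block_size=4096):
            # ...and de-dup later, when the system is idle.
            for name, data in self.staging.items():
                self.write_inline(name, data, block_size)
            self.staging.clear()

The trade-off is the one the vendors argue about: in-line pays the hashing cost on every write, while post-process defers it but holds the duplicate blocks until the background pass runs.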

But given that 90% of the active storage was untouched and less than 1% of the clients account for 50% of the requests, how important is performance? Cherry-picking the low-access users – i.e. road warriors whose notebook is their primary I/O bucket – shouldn’t be hard.
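
Finding that cold data isn’t exotic either. A minimal sketch, assuming only that the filesystem records last-access times (some mounts use noatime, so check first); the 90-day cutoff is my arbitrary placeholder, not a number from the study:

    import os
    import time

    # Walk a tree and flag files untouched for a while as de-dup candidates.
    # The 90-day cutoff is arbitrary; noatime mounts will make st_atime stale.
    def cold_files(root, days=90):
        cutoff = time.time() - days * 86400
        for dirpath, _, filenames in os.walk(root):
            for fn in filenames:
                path = os.path.join(dirpath, fn)
                try:
                    if os.stat(path).st_atime < cutoff:
                        yield path
                except OSError:
                    pass  # file vanished or unreadable; skip it

    # Example: candidates = list(cold_files("/shares/corporate"))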

So what percentage de-dup compression is feasible for unstructured data? That is the key to understanding the economic basis of primary storage de-duplication.
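
While we wait for real numbers, the economics are easy to rough out once you pick a savings figure. A back-of-the-envelope sketch (every input below is a placeholder I chose for illustration, not a measured result):

    # Rough economics of de-duping the cold portion of primary storage.
    # All inputs are illustrative placeholders, not study results.
    unstructured_tb = 100    # unstructured primary data, in TB
    cold_fraction = 0.90     # portion cold enough to de-dup (echoing the study's 90%)
    savings = 0.60           # assumed de-dup savings on that cold data
    cost_per_tb = 5000       # assumed fully-burdened cost per TB, in dollars

    tb_saved = unstructured_tb * cold_fraction * savings
    print(f"Capacity avoided: {tb_saved:.0f} TB")               # 54 TB
    print(f"Dollars avoided:  ${tb_saved * cost_per_tb:,.0f}")  # $270,000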

Academics, start your engines!

The StorageMojo take
Primary storage de-dup could be the next big win for IT shops. We just don’t have the data to tell us how big that win could be.

NetApp (disclosure: I’ve done a minuscule amount of work for them in the last year and accepted their annual analyst junket) is well positioned. Their de-dup software license is free on their NearStore/FAS boxes.

NetApp tells me that they’ve got 13,000 systems running de-dup. Maybe some of those people are using it for primary storage and can tell us how well it works.

If the feature is free, de-duping some primary storage will be standard practice in most data centers within 5 years. As the de-dup technology improves and Moore’s Law drives performance, more and more unstructured data will be de-dup’d as a matter of course.

Courteous comments welcome, of course.