Forget the flame wars over moving window versus fixed block de-duplication. A recent paper, A Study of Practical Deduplication (pdf), from William J. Bolosky of Microsoft Research and Dutch T. Meyer of the University of British Columbia, found that whole-file deduplication achieves about 75% of the space savings of the most aggressive block-level de-dup for live filesystems and 87% of the savings for backup images.

The paper was presented at FAST 11, where it won a “Best Paper” award. The researchers looked at file systems from 857 Microsoft desktop computers over 4 weeks, asking permission first to install rather invasive scanning software.

The scanner took a snapshot using Windows’ Volume Shadow Copy Service (VSS) and then recorded metadata about the file system itself. It also recorded each file’s metadata, retrieval and allocation pointers, as well as the computer’s hardware and system configuration. The researchers excluded the pagefile, the hibernation file, the scanner itself and the VSS snapshots the scanner created.

During scanning, each file was broken into chunks using both fixed-block chunking and Rabin fingerprinting. The researchers also identified whole-file duplicates.

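Rabin fingerprinting sets chunk boundaries based on the content itself, so block sizes vary dynamically and duplicate data can still be found after an insertion shifts everything that follows. Figuring out where to break the file adds to the overhead.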
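
To make the difference concrete, here is a minimal Python sketch, not the scanner the researchers used. It swaps in a simple polynomial rolling hash as a stand-in for true Rabin fingerprinting, and the function names, 16-byte window, 64-byte block size and bit mask are all assumptions chosen for illustration. It inserts one byte at the front of a buffer and counts how many chunks each scheme can still match.

```python
import hashlib
import random


def fixed_chunks(data, size=64):
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, mask=0x3F):
    """Cut after any byte where a rolling hash of the last `window` bytes
    has all of the `mask` bits set. A simplified stand-in for Rabin
    fingerprinting; parameters are illustrative assumptions."""
    base, mod = 257, (1 << 31) - 1
    drop = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * drop) % mod  # slide the window forward
        h = (h * base + byte) % mod
        if i + 1 >= window and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks


def shared_chunks(a, b):
    """Count distinct chunks (by SHA-256 digest) present in both lists."""
    def digests(chunks):
        return {hashlib.sha256(c).digest() for c in chunks}
    return len(digests(a) & digests(b))


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
edited = b"X" + original  # a one-byte insert shifts every later offset

fixed_shared = shared_chunks(fixed_chunks(original), fixed_chunks(edited))
cdc_shared = shared_chunks(content_defined_chunks(original),
                           content_defined_chunks(edited))
print("fixed-block chunks still matching:", fixed_shared)
print("content-defined chunks still matching:", cdc_shared)
```

The fixed-block boundaries all move by one byte, so almost nothing lines up, while the content-defined boundaries follow the data and most chunks match again. Production systems also enforce minimum and maximum chunk sizes, which this sketch leaves out.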

The resulting data set was 4.1 TB compressed – too large to import into a database – and was further groomed to lose unneeded data.

De-dup issues
De-duplication is expensive. You’re giving up direct access to the data to save capacity.

The expense is in I/Os and CPU cycles. Comparing each chunk’s fingerprint to all other chunks is nontrivial. De-duplication indirection adds to I/O latency. A file’s chunks are scattered around, requiring small and expensive random I/Os to read.
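
A toy in-memory store shows where that cost comes from. This is a rough sketch under stated assumptions, not any product’s design: the ChunkStore class, its method names and the tiny 4-byte block size are all hypothetical. Every block written needs a fingerprint and an index lookup, and every read has to chase a list of fingerprints back to the chunks.

```python
import hashlib


class ChunkStore:
    """Toy block-level dedup store (hypothetical, for illustration only):
    unique chunks are kept once, and each file is a recipe of fingerprints."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.chunks = {}   # fingerprint -> chunk bytes, stored once
        self.recipes = {}  # file name -> ordered list of fingerprints

    def put(self, name, data):
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            fp = hashlib.sha256(block).digest()  # CPU cost on every block
            self.chunks.setdefault(fp, block)    # index lookup on every block
            recipe.append(fp)
        self.recipes[name] = recipe

    def get(self, name):
        # One extra lookup per chunk: on disk these would be scattered reads.
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = ChunkStore(block_size=4)
store.put("a.txt", b"AAAABBBBCCCC")
store.put("b.txt", b"AAAAXXXXCCCC")
print(len(store.chunks), "unique chunks,", store.stored_bytes(), "bytes stored")
```

Two 12-byte files share two of their three blocks, so the store holds four unique chunks instead of six. In a real system the index is far too large for a Python dict in RAM, which is exactly why the lookups hurt.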

Older techniques, such as sparse files and Single Instance Storage, are more economical even if their compression ratios aren’t as high. They offer fewer CPU cycles, less indirection and good compression.
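
Whole-file de-dup, the idea behind Single Instance Storage, is simple enough to sketch in a few lines of Python. This is an illustration under assumptions rather than how Windows implements it: the whole_file_dedup function and the sample file names are made up. It needs one hash per file and one level of indirection per file, not per chunk.

```python
import hashlib
from collections import defaultdict


def whole_file_dedup(files):
    """files maps name -> bytes. Returns (groups, logical_bytes, stored_bytes),
    where groups maps a content hash to the names sharing that content.
    Hypothetical helper for illustration."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    logical = sum(len(data) for data in files.values())
    # Keep one copy per distinct content hash.
    stored = sum(len(files[names[0]]) for names in groups.values())
    return dict(groups), logical, stored


report = b"quarterly numbers" * 10
files = {
    "report.doc": report,
    "copy of report.doc": report,  # byte-for-byte duplicate
    "notes.txt": b"todo list",
}
groups, logical, stored = whole_file_dedup(files)
print(f"{logical} logical bytes, {stored} stored after whole-file de-dup")
```

The catch is that a one-byte change makes the whole file unique again, which is the gap block-level schemes close at a higher cost.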

The StorageMojo take
If capacity is expensive – read “enterprise” – and I/Os cheap – SSD or NVRAM in the mix – fancy dedup can make sense. It is at the margin of capacity cost and I/O availability that the value prop gets dicey.

Low duty cycle storage – SOHO – with plenty of excess CPU and light transactions could use deduped primary storage. But with 10 TB of data to back up, most users wouldn’t notice the difference between whole file and 8KB Rabin.

It’s the price tag and user reviews the SOHO/SMB crowd will be looking at.

Courteous comments welcome, of course. The paper also included some interesting historical data about Windows file systems that I covered on ZDNet.