Deduplication has been accepted as an enterprise-class compression technology. Is it time for data compression to be a standard feature of primary storage?
I’ve been doing some work for Nimble Storage, a cool Valley startup. When I talked to co-founder Varun Mehta, he mentioned that Nimble’s storage/backup/archive appliance does data compression on all data, all the time.
That’s right, primary storage on Nimble’s box is always compressed. Not only that, all their performance numbers are quoted with compressed data.
They aren’t kidding.
Data compression is one of the oldest computer storage technologies around. Bell Labs mathematician Claude Shannon published A Mathematical Theory of Communication in 1948 which, among other things, laid out the math behind compression.
The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols will be called its relative entropy. This is the maximum compression possible when we encode into the same alphabet. One minus the relative entropy is the redundancy. The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%.
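Shannon’s definitions translate directly into a few lines of arithmetic. Here’s a minimal sketch (my own illustration, not from the post) that computes first-order letter entropy against the 26-letter alphabet’s maximum and derives the redundancy — note Shannon’s ~50% figure also counts statistical structure across runs of letters, which a single-letter model misses:

```python
from collections import Counter
from math import log2

def redundancy(text: str) -> float:
    """First-order redundancy: 1 - H(source) / H(max) over the same alphabet."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    n = len(letters)
    # Shannon entropy of the observed letter distribution, bits per symbol
    h = -sum((c / n) * log2(c / n) for c in counts.values())
    h_max = log2(26)  # maximum entropy: all 26 letters equally likely
    return 1 - h / h_max

print(redundancy("the quick brown fox jumps over the lazy dog"))
```

Redundancy is exactly what a compressor squeezes out: the higher it is, the better the achievable compression ratio.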
Inline compression has been part of every enterprise tape drive for decades. The algorithms – Lempel-Ziv was big 20 years ago – have been tuned to a fare-thee-well.
Compression is as thoroughly wrung out as any technology in the data center.
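How wrung out? The Lempel-Ziv family is sitting in every standard library today. A quick sketch using Python’s zlib – whose DEFLATE format pairs LZ77 matching with Huffman coding, the same lineage as the tape-drive codecs – shows the round trip on repetitive data (illustrative numbers, not a benchmark):

```python
import zlib

# DEFLATE (zlib) pairs Lempel-Ziv 77 matching with Huffman coding --
# the same algorithm family tape drives have used for decades.
data = b"All work and no play makes Jack a dull boy. " * 100

compressed = zlib.compress(data, level=6)
ratio = len(data) / len(compressed)
print(f"{len(data)} -> {len(compressed)} bytes ({ratio:.1f}:1)")

# Lossless: decompression restores the original exactly
assert zlib.decompress(compressed) == data
```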
So why don’t we use it everywhere, like Nimble?
Not about capacity
The doubling of capacity from compression is not the big win. The larger benefit is that it more than doubles the internal bandwidth of the array – because bandwidth is more expensive than capacity.
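The back-of-envelope math is simple: every physical byte moved to or from disk carries the compression ratio’s worth of logical bytes. A one-function sketch, with hypothetical numbers (not Nimble’s actual specs):

```python
def effective_bandwidth(physical_mbps: float, compression_ratio: float) -> float:
    """Logical bandwidth delivered when data is stored compressed.

    Each physical byte moved carries `compression_ratio` logical bytes,
    so effective bandwidth scales linearly with the ratio.
    """
    return physical_mbps * compression_ratio

# Hypothetical: 1000 MB/s of raw spindle bandwidth, 2:1 inline compression
print(effective_bandwidth(1000, 2.0))
```

The same spindles, the same interconnect – but the array moves twice the logical data per second.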
And bandwidth is more important than capacity. As John von Neumann noted in his First Draft of a Report on the EDVAC (pdf):
This result deserves to be noted. It shows in a most striking way where the real difficulty, the main bottleneck, of an automatic very high speed computing device lies: At the memory.
Varun reports that Nimble’s comdec operates at wire speed on a multicore CPU, no ASIC or FPGA required. It must add some latency, but given Nimble’s focus on full-stripe writes, the gain in bandwidth must more than make up for it.
The StorageMojo take
Since it is possible to perform wire-speed compression/decompression with a commodity CPU, why not everywhere?
Will RAID controllers stumble reconstructing compressed data? Is compressed data more prone to corruption? Is bandwidth so cheap that we don’t need more?
I don’t think so, but I’m open to dissenting opinions. With disk capacity growth slowing, comdec everywhere is a good way to increase performance, reduce $/GB, and have something new to show customers.
Courteous comments welcome, of course. StorageMojo dove into this 5 years ago in 25x data compression made simple.