Geeky computer guy that I am, I have my machine instrumented with programs (Mac users: MenuMeters) that tell me all kinds of useless information. Network usage, memory usage, CPU load and, of course, disk activity.

Mostly all this stuff just tells me that the machine hasn’t crashed. But sometimes it tells me something surprising.

Cache out, laid-off, says he’s got a bad cough, wants to get it paid off – look out kid
Like my virtual memory page usage: pageins, pageouts, page faults, copy-on-writes, and cache hits and misses.

Get this: 5,292,427 cache lookups and only 32,860 cache hits – a measly 0.6% hit rate. Why bother?
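If you want to check my arithmetic, it’s a one-liner in Python:

```python
# Hit rate from the meter readings quoted above.
lookups, hits = 5_292_427, 32_860
print(f"{hits / lookups:.1%}")   # -> 0.6%
```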

What is “virtual memory” anyway?
If you know the answer, skip ahead.

Back 30 years ago, when RAM cost over $1,000 per MB, people were particular about how much they bought, even on big machines. Virtual memory extends physical RAM with disk capacity. Typically, least-used memory pages are swapped out to disk. If a document’s memory pages are sitting on disk, they get swapped into physical RAM once you start editing it again.
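If you want to see the mechanism in miniature, here’s a toy sketch – not how any real OS implements paging, just the swap-out/swap-in idea, with a made-up ToyPager class and least-recently-used eviction:

```python
from collections import OrderedDict

# Toy pager: a handful of RAM frames backed by "disk" (a plain dict).
# Names and sizes are illustrative only, not how a real OS does it.
class ToyPager:
    def __init__(self, ram_frames=4):
        self.ram = OrderedDict()   # page -> data, in LRU order (oldest first)
        self.disk = {}             # pages that have been swapped out
        self.ram_frames = ram_frames
        self.pageins = self.pageouts = 0

    def touch(self, page, data=None):
        """Access a page, faulting it in from disk if needed."""
        if page in self.ram:                       # hit: just refresh LRU order
            self.ram.move_to_end(page)
        else:                                      # page fault
            if page in self.disk:
                data = self.disk.pop(page)
                self.pageins += 1                  # page-in from disk
            if len(self.ram) >= self.ram_frames:   # RAM full: evict LRU page
                victim, vdata = self.ram.popitem(last=False)
                self.disk[victim] = vdata
                self.pageouts += 1                 # page-out to disk
            self.ram[page] = data
        return self.ram[page]

pager = ToyPager(ram_frames=2)
for p in ["doc", "mail", "web", "doc", "mail"]:    # re-open the document later
    pager.touch(p, data=f"{p}-bytes")
print(pager.pageins, pager.pageouts)               # pages swapped in / pushed out
```

With only two RAM frames, re-opening the document forces a page-in from disk and pushes something else out – exactly the dance those meters are counting.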

Data dynamics have changed
“Locality of reference” is the behavior that gives caches – and virtual memory – their power. Locality of reference is the empirical observation that once a piece of data is accessed, it tends to be accessed again several times, maybe even hundreds of times. So it makes great sense to keep that piece of data close to the action until demand for it falls off.

That’s the theory. Yet if data accesses are near-random, you’ll see what I see: almost no cache hits. Which means the overhead of cache management is buying nada.
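To make that concrete, here’s a small simulation – purely illustrative, with made-up item counts and a plain LRU cache, not numbers from my machine – comparing a workload with a hot set against a near-random one:

```python
import random
from collections import OrderedDict

def lru_hit_rate(accesses, cache_size):
    """Fraction of accesses served from a simple LRU cache."""
    cache, hits = OrderedDict(), 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)              # refresh recency
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)       # evict least recently used
    return hits / len(accesses)

random.seed(1)
N_ITEMS, N_ACCESSES, CACHE_SIZE = 100_000, 200_000, 1_000  # cache = 1% of items

# Workload with locality: 90% of accesses go to a hot 1% of the items.
skewed = [random.randrange(N_ITEMS // 100) if random.random() < 0.9
          else random.randrange(N_ITEMS)
          for _ in range(N_ACCESSES)]

# Near-random workload: every item equally likely, no locality at all.
uniform = [random.randrange(N_ITEMS) for _ in range(N_ACCESSES)]

print(f"skewed workload hit rate:  {lru_hit_rate(skewed, CACHE_SIZE):.1%}")
print(f"uniform workload hit rate: {lru_hit_rate(uniform, CACHE_SIZE):.1%}")
```

The skewed workload earns its cache; the uniform one settles near cache size divided by item count – about 1% here – the same “why bother?” territory as my meter readings.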

“Locality” doesn’t matter if you don’t “reference”
Data is cooling. Vast amounts of data are being stored as storage prices decline, and the number of data accesses per megabyte is steadily dropping. And that’s a good thing, since the accesses disks can deliver per megabyte of capacity are dropping too.

What I hadn’t thought about, and I haven’t seen discussed anywhere else, is the impact this change must have on system architecture. Much effort has gone into making cache mechanisms, including second and third level caches, virtual memory, system caches and disk caches, fast and efficient. Yet, if you use your system the way I do mine, much of this effort and overhead is wasted.

Expensive array assets are becoming less valuable
Many applications, such as databases, do exhibit high levels of locality of reference, and they probably always will. But for unstructured data, how valuable is it to spend good money on costly caches and the associated engineering for a resource that may return very little value?

The StorageMojo take
As scale-out storage architectures continue to evolve, engineers will need to look at the workloads they are designing for to determine the most cost-effective means of supporting them. “Cache everywhere” architectures – disk, network, system, and more – may actually hurt performance while adding cost and complexity. It is another nail in the coffin of the traditional disk array.

It’s something worth thinking about the next time you lay down cold, hard cache cash.

Comments welcome, as always. Comments moderated, because moderation is a virtue, except in the defense of liberty.

I updated this article by shortening it and adding a gratuitous Subterranean Homesick Blues reference. My apologies to Mr. Dylan.