StorageMojo has long held the view that our storage workloads are changing: more file storage, less block storage; larger file sizes; and cooler data. While all the indicators said this was happening, it's good to find a study that confirms the intuition.

In Measurement and Analysis of Large-Scale Network File System Workloads (pdf), researchers Andrew W. Leung and Ethan L. Miller of UC Santa Cruz and Shankar Pasupathy and Garth Goodson of NetApp measured two large file servers for 4 months. Their results are worth reviewing, since so many of the optimizations in storage infrastructures rely on workload assumptions.

Unstudied CIFS
The authors point out that there have been no major studies of the CIFS protocol, which is odd since it is the default file-sharing protocol on Windows systems. Furthermore, the last major study of network file system workloads was performed in 2001 – seven years ago – an interval in which average disk drive sizes have gone from 20 GB to 500 GB and network speeds from 100 Mbit/s to 1 Gbit/s.

Most surprising, however, is that no published study has ever analyzed large-scale enterprise file system workloads. Researchers have studied workloads closer to home: university and engineering workloads.

Enterprise workloads
One was a midrange file server with 3 TB of capacity – almost all of it used – serving over 1,000 marketing, sales, and finance employees. The second was a high-end NetApp filer with 28 TB of capacity – 19 TB used – supporting 500 engineering employees.

Yes, marketers, engineers get the good toys. You can cry about it over your next 3 martini lunch.

Some significant differences from prior studies:

  • Workloads more write-oriented. Read/write byte ratios are now only 2:1, compared to the 4:1 or higher ratios reported earlier.
  • Workloads less read-centric. Read-write access patterns are now 30x more common than in earlier studies.
  • Most bytes transferred sequentially. These runs are 10x the length found in the old studies.
  • Files 10x bigger.
  • Files live 10x longer. Fewer than half are deleted within a day of creation.

Cool new findings

  • Files rarely re-opened. Over 66% are re-opened once and 95% fewer than 5 times.
  • Over 60% of file re-opens are within a minute of the first open.
  • Less than 1% of clients account for 50% of requests.
  • Infrequent file sharing. Over 76% of files are opened by just 1 client.
  • Concurrent file sharing very rare. As the prior point suggests, only 5% of files are opened by multiple clients and 90% of those are read only.
  • Most file types have no common access pattern.

And there’s this: over 90% of the active storage was untouched during the study. That makes it official: data is getting cooler.

Another interesting finding: 91% of VMware Virtual Disk (vmdk) file accesses were small sequential reads – not the larger sequential accesses I’d expect.

The StorageMojo take
The authors rightly suggest that, given how rarely files are read after creation, it makes sense to migrate files to cheap storage sooner rather than later.

Perhaps primary file storage should be thought of as a large FIFO buffer – tossing 3-month-old files to an archive for long-term storage. A data flow architecture instead of a series of ever-larger buckets.
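
To make the FIFO idea concrete, here is a minimal sketch in Python of an age-based sweep: anything untouched for 90 days gets demoted from the primary share to an archive tree. The paths, the 90-day cutoff, and the use of access/modification times are illustrative assumptions on my part, not anything from the paper.

    #!/usr/bin/env python3
    """Hypothetical age-based sweep: demote files untouched for N days
    from a primary share to an archive tree (illustration only)."""

    import shutil
    import time
    from pathlib import Path

    PRIMARY = Path("/srv/primary")   # assumed primary file share
    ARCHIVE = Path("/srv/archive")   # assumed cheap archive tier
    AGE_DAYS = 90                    # the "3-month-old" threshold above

    def sweep(primary: Path, archive: Path, age_days: int) -> None:
        cutoff = time.time() - age_days * 86400
        for path in primary.rglob("*"):
            if not path.is_file():
                continue
            st = path.stat()
            # Treat the later of access and modification time as "last touched".
            if max(st.st_atime, st.st_mtime) >= cutoff:
                continue
            # Preserve the relative directory layout in the archive tree.
            dest = archive / path.relative_to(primary)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(dest))  # FIFO-style demotion

    if __name__ == "__main__":
        sweep(PRIMARY, ARCHIVE, AGE_DAYS)

A real implementation would presumably leave a stub or link behind so users can still find demoted files, and would have to allow for volumes mounted with noatime, where access times aren't trustworthy.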

Kudos to NetApp and UCSC for this work. It seems like NetApp has been doing the best job of leveraging academic researchers lately. I’d like to see them get more marketing mileage out of their good work.

Courteous comments welcome, of course.