Our changing file workloads

by Robin Harris on Tuesday, 9 September, 2008

StorageMojo has long held the view that our storage workloads are changing: more file storage, less block storage; larger file sizes; and cooler data. While all the indicators said this was happening, it’s good to find a study that confirms the intuition.

In Measurement and Analysis of Large-Scale Network File System Workloads (pdf), researchers Andrew W. Leung and Ethan L. Miller of UC Santa Cruz and Shankar Pasupathy and Garth Goodson of NetApp measured 2 large file servers for 4 months. Their results are worth reviewing, since so many of the optimizations in storage infrastructures rely on workload assumptions.

Unstudied CIFS
The authors point out that there have been no major studies of the CIFS protocol – odd, since it is the default file-sharing protocol on Windows systems. Furthermore, the last major study of network file workloads was performed in 2001 – seven years ago – an interval in which average disk drive sizes have gone from 20 GB to 500 GB and network speeds from 100 Mb/s to 1 Gb/s.

Most surprising, however, is that no published study has ever analyzed large-scale enterprise file system workloads. Researchers have studied workloads closer to home: university and engineering workloads.

Enterprise workloads
One was a midrange file server with 3 TB of capacity – almost all of it used – serving over 1000 marketing, sales and finance employees. The second was a high-end NetApp filer with 28 TB of capacity – 19 TB used – supporting 500 engineering employees.

Yes, marketers, engineers get the good toys. You can cry about it over your next 3 martini lunch.

Some significant differences from prior studies:

  • Workloads are more write-oriented. Read/write byte ratios are now only 2:1, compared to the 4:1 or higher ratios reported earlier.
  • Workloads are less read-centric. Mixed read/write access patterns are now 30x more common.
  • Most bytes are transferred sequentially. These sequential runs are 10x the length found in the old studies.
  • Files are 10x bigger.
  • Files live 10x longer. Fewer than half are deleted within a day of creation.

Cool new findings

  • Files are rarely re-opened. Over 66% are re-opened only once, and 95% are re-opened fewer than 5 times.
  • Over 60% of file re-opens are within a minute of the first open.
  • Less than 1% of clients account for 50% of requests.
  • Infrequent file sharing. Over 76% of files are opened by just 1 client.
  • Concurrent file sharing is very rare. As the prior point suggests, only 5% of files are opened by multiple clients, and 90% of that sharing is read-only.
  • Most file types have no common access pattern.

And there’s this: over 90% of the active storage was untouched during the study. That makes it official: data is getting cooler.

Another interesting finding: 91% of VMware Virtual Disk (vmdk) file accesses were small sequential reads – not the larger sequential accesses I’d expect.

The StorageMojo take
The authors rightly suggest that, given the rarity of file reads after creation, it makes sense to migrate files to cheap storage sooner rather than later.

Perhaps primary file storage should be thought of as a large FIFO buffer – tossing 3-month-old files to an archive for long-term storage. A data flow architecture instead of a series of ever-larger buckets.
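As a rough sketch of what that FIFO-style policy might look like in practice (a hypothetical migration pass – the paths, the 90-day threshold, and the use of last-access time as the "heat" signal are my own assumptions, not anything from the paper):

```python
import os
import shutil
import time

def migrate_cold_files(primary, archive, max_age_days=90):
    """Move files untouched for max_age_days from primary to archive,
    preserving relative paths -- a simple FIFO-style eviction pass."""
    cutoff = time.time() - max_age_days * 86400
    moved = []
    for dirpath, _, filenames in os.walk(primary):
        for name in filenames:
            src = os.path.join(dirpath, name)
            # Use the last-access time as the coldness signal.
            if os.stat(src).st_atime < cutoff:
                rel = os.path.relpath(src, primary)
                dst = os.path.join(archive, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.move(src, dst)
                moved.append(rel)
    return moved
```

A real tiering product would leave a stub or symlink behind so cold files remain addressable, but the flow – scan, age test, evict to the next tier – is the same.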

Kudos to NetApp and UCSC for this work. It seems like NetApp has been doing the best job of leveraging academic researchers lately. I’d like to see them get more marketing mileage out of their good work.

Courteous comments welcome, of course.

{ 11 comments… read them below or add one }

Jonathan Morgan September 10, 2008 at 3:52 am

> Files rarely re-opened. Over 66% are re-opened once and 95% fewer than 5 times.
> Over 60% of file re-opens are within a minute of the first open

My suspicion would be that even fewer files are “really” re-opened for use. E.g., take that lovely application, Word. When it saves a file, it actually saves a bit. Reopens the file, adds a bit more, and closes it. Reopens the file, amends a few bits and closes it again. Etc. What you think is a single save is in fact not a sequential save action at all. The survey indicates 13% of files were .doc and .xls files.

Wow! Even fewer files than indicated are really reused.

Computer data storage is moving more and more toward the ultimate ILM storage system known to man – brain memory storage – for that surely files away far more memories than are actually ever recalled!!

Open Systems Storage Guy September 10, 2008 at 5:35 am

Interesting data- I wish someone would do this for Oracle, SQL, Exchange, and some other heavy hitting DB applications…

Steve Todd September 10, 2008 at 6:38 am

Robin,
Thanks for sharing the pointer to the paper. I find it interesting from an industry and product perspective. The SNIA XAM initiative (www.snia.org/forums/xam) is an industry effort that deals with exactly this issue (managing fixed or unchanging content). At EMC we’re also seeing strong demand for products that silently migrate untouched data, whether it’s performed at the server, in the network, or in the filer.
Steve

Matthew Emery September 10, 2008 at 6:40 am

Thanks for this, I have found the study interesting.

One recommendation though has been left out from the end of section 5

“While access to file metadata should be fast, this indicates much file data can be compressed, de-duplicated, or placed on low power storage, improving utilization and power consumption, without significantly impacting performance. In addition, our observation that file re-accesses are temporally correlated (see Section 4.5) means there are opportunities for intelligent migration scheduling decisions.”

… or data can be deleted.

Ernst Lopes Cardozo September 10, 2008 at 2:21 pm

To me these observations suggest a different policy: write new files to backup storage first, then copy them to primary storage when requested. Primary storage becomes a cache that does not need backup. Effectively, the backup process is spread over the day. Motivation: all new files need to be backed up, but only a fraction will be accessed. Backup is to disk and needs to be replicated to disk or tape at a DR site. The capacity of primary storage becomes flexible: there will be no ‘disk full’ conditions, only sub-optimal performance when the primary storage (the cache) is too small.

Ryan Malayter September 11, 2008 at 6:03 am

Funny, we just did a similar study last year internally on all of our storage, including DB and mail servers. The results were startlingly similar: 2:1 read:write ratio, 12 KB average IO size, and very few files are ever touched beyond 30 days of creation.

That said, I do not think a study of two NAS installations is statistically significant. What is really needed is a long-term study of hundreds of disparate organizations. Something only a vendor could do, I suppose, with remote monitoring and opt-in from customers.

Also, the fact that such studies are “blind” to access patterns inside databases and virtual machine files is quite a flaw.

Jered Floyd September 11, 2008 at 3:26 pm

Robin,

I saw this paper, and while I appreciate the data, I find it unfruitful for drawing conclusions. The particular customer involved appears to make extraordinarily little use of their storage. Their needs could likely be met with a 386 running Linux with drives from the same era.

I’d really like to see aggregate data from a wide variety of customers, and I hope the authors will go on to do so. I’ll say for a fact that zero of the conclusions you can draw from that report apply to our development file server, and I bet that’s likely true for many others!

Regards,
Jered Floyd
CTO, Permabit Technology Corp.

Steve Jones September 12, 2008 at 5:37 am

There are no major surprises here. As the available storage capacity for a given amount of money increases on an exponential curve, the tendency will be to store more and more low-access, low-value data. The costs of managing this storage (by the individual creating it) do not decline the same way – at least not without automation.
Measured another way, it’s just as well that the access frequency of storage is on the decline. The rates of increase of I/O bandwidth and IOPS are on much lower curves (especially the latter). Quite simply, if the access density per GB of today’s data looked anything like it did a decade ago, the storage systems would be unusable.

Martin G September 16, 2008 at 8:11 am

The 76% of files only accessed by a single user makes you wonder why we provide shared file systems for most users. There’s got to be a better way of doing this.

Brian October 1, 2008 at 3:43 pm

Why not virtualize the NAS? Use the workflow and the data in a way that takes advantage of all the storage capacity and resources available. A product like Acopia can do this across NetApp, EMC, and other network attached storage.

That would be a better way.

Bobby Moulton October 16, 2008 at 9:23 am

Good article Robin –

This data is important. This is the kind of information our company – Seven10 – has used to promote a data management paradigm change. And while I think you are almost there in your view of how data should be managed, I still think you are putting too much emphasis on migration.

Data placement across tiers with retention policies is the key.

Take a look at our EAS product – a tiered storage offering supporting disk, CAS, and tape (as an archive) – and I think you will see that there is a better approach to managing the growth of fixed content.

We see an end of days for back-up as an archive – it simply is too costly to manage.

I look forward to hearing from you.

Bobby Moulton
President
Seven10
bmoulton@seventenstorage.com
