From the category archives:

Enterprise

Flash and the new storage pyramid

by Robin Harris on Thursday, 4 December, 2008

I got a note from David Flynn, co-founder and CTO of Fusion-io (disclosure: I’ve done work for them) in response to The new storage pyramid. He makes several points about the nature of the array model that I wish I’d made.

Well worth the read.

David Flynn’s note:
Great analysis, Robin.

And, great comments.

My $.02 ….

I think it’s not just about the proprietary nature, the somewhat better performance and features, and the high markups that differentiates “storage arrays” from “clustered storage”.

It’s actually more to do with the vertically integrated nature of the business model of the companies in the array building business. This leads to proprietary architectures, higher margins and, true, somewhat better performance and features.

Let me explain through an analogy…

We used to get graphics workstations from SGI, Apollo, and other vertically integrated vendors, who sold everything end-to-end, down to the monitors and their own proprietary OS’s. These guys commanded HUGE margins – partly to reward their risky investment in solving a worthy, complex problem.

Similarly, the military (and the few others who could afford a million-dollar price tag) used to get flight simulators from Evans & Sutherland, who were also vertically integrated and insanely expensive. You even had niche vendors like Intergraph doing 3D graphics information systems who could justify their own proprietary architectures.

At least for a while.

They were all doing 3D graphics in one form or another. And, now, they are all GONE – thanks to the emergence of a component, the 3D graphics card.

With enough capability to be applicable across all of these different verticals, the 3D graphics accelerator has now shattered the benefit of running a vertically integrated business.

Today, there are myriads of “integrators” who make graphics workstations, flight simulators, GIS systems, etc. at very low margin by comparison. And, they do it by pulling together off-the-shelf components – all commoditized down to the software that provides even the high-value features.

They might have been inferior to the proprietary solutions at first, but not anymore.

Now, what happens when you introduce to the storage industry a component, NAND flash, that commoditizes and trivializes the linchpin reason for expensive proprietary disk arrays, namely the caching tier?

Once anyone can easily get the performance across any use case (OLTP, OLAP, Data Warehousing, BI, VOD, content caching, etc. etc.) you no longer need vertical specific, highly tuned, proprietary solutions from vertically integrated companies.

Every capability that doesn’t migrate into the component itself becomes nothing but commoditized software to be layered on top by any number of interchangeable integrators. Things like replication, disaster recovery, backup, dedup, and so on just become commoditized software that can run anywhere.

This is a classic Adam Smithian market evolution. What used to be a single, vertically integrated provider becomes a layered market where some people build the components, others integrate them (with some bit of value add), and you go to having many players competing on many levels.

And prices go down.

But, thankfully, (for those of us in the business of creating this componentized building-block) volume, productivity, and efficiencies all go up.

So, actually everyone wins. Including society as a whole.

Well, almost everyone wins. Everyone, that is, except for the proprietary array vendors who get caught by the innovator’s dilemma and a business model that used to be the correct one, but no longer is.

This generally makes them the slowest to simplify their proprietary infrastructures around the commoditized component – to help justify their investment into their heroic proprietary solutions.

In an effort to protect their margins, they endeavor to make things seem as complicated as possible. They do this, say, by preferring that NAND be forced to pretend to be an HDD and be put into HDD drive bays behind HDD protocols, where it has little ability to simplify things or get much additional performance.

They are the last to come out and say it can be simplified. Instead they’ll tell you you must have features X, Y, Z. And, see, those aren’t as good as with our proven architecture.

Let’s take high availability as an example. They aren’t going to tell you about a “shared nothing” strategy, where two separate RDBMS servers, each with terabytes of direct-attached NAND inside, use off-the-shelf log shipping for asynchronous replication (or query replication to do it synchronously) to get fault tolerance.

No, they aren’t going to tell you that it’s actually simpler, more cost effective, and, here’s the real kicker… more fault tolerant to share nothing, than to use shared storage – no matter how fault tolerant they claim their monolithic storage array is, it’s still shared.

I’m not saying this market transformation is going to happen by tomorrow. But, given the geometric growth of the performance gap between processors and storage, and the geometric decline in cost of NAND flash – leading to a “Moore’s Law Squared” effect in the benefit to cost ratio – it is going to happen faster than people would think. Even considering the “stodgy” nature of storage folks who are in the business of obsessively caring for precious bits.

It doesn’t hurt that in this global recession companies are looking for ways to reduce costs while still needing to grow throughput. So, there’s more of a willingness to look at different, innovative ways to skin the cat.

I agree with you Robin. It will be a fait accompli by 2015.

David Flynn
CTO, Fusion-io

The StorageMojo take
Technology diffusion is a complex mashup of secular trends, technology development, individual creativity and happenstance. But the current direction of the high-end storage market points to the greatest change we’ve seen since the early 90’s and the advent of arrays.

The “Moore’s Law Squared” effect is particularly intriguing. Humans are terrible at estimating the impact of power functions, so this one is likely to be even more surprising than we dream.

Courteous comments welcome, of course.

{ 6 comments }

The new storage pyramid

by Robin Harris on Tuesday, 2 December, 2008

OK, it is still a pyramid
Predictions of the storage array’s death struck some commenters as premature. Commenters raised a host of issues:

  • Cost. Low-end storage arrays are cheaper than clusters.
  • Complexity. The complexity of clustered hardware – all those cables and boxes – increases management costs.
  • Functionality. “Unless the cluster storage also provides the same reliability, scalability, and supportability as the larger monolithic arrays. . . ” it won’t supplant traditional arrays.
  • Cost pt. II: Lower-cost modular arrays, combined with a software layer that knits them into a seamless whole, could provide a full-service storage infrastructure complementing today’s virtual servers.

History repeats itself
The issues are similar to the mainframe vs everybody arguments of the last 40 years. Within living memory mainframes from IBM and the 7 dwarfs – Burroughs, Sperry Rand, NCR, RCA, Honeywell, CDC and GE – went through the same process monolithic storage arrays will.

Mainframes faced the same negatives: costly; complex management; inflexible; limited applications; and optimized for batch computing in an interactive world. Proponents argued the positives: reliability; scalability; efficiency; security; and control.

Reinventing the wheel – without end
Mainframes were expensive because they were a) low-volume products and b) had high (60%+) gross margins. Each mainframe architecture had its own processors, peripheral interconnects, networks, OS, application software and sales and support groups.

Every mainframe company had to solve all the problems every other mainframe company did – at enormous cost.

Mainframes today
Mainframes are far from dead, but they are very different today. There are fewer vendors; they use commodity processors, networks and interconnects; run open source software such as Linux; and adjusted for inflation they are much cheaper.

That is the future of big monolithic arrays.

Monolithic arrays tomorrow
That we still have as many large arrays and vendors is due to the fact that the vendors have already gone far down the mainframe path. Commodity server motherboards, Linux, SATA drives and Xyratex enclosures are all common in high-end arrays, helping cut costs.

But at a fast approaching point, cutting costs isn’t enough. Vendors have to give customers good reasons to keep buying the big iron. The traditional mantra of availability, performance, scalability and supportability won’t hold customers forever.

Why?

Moore’s Law keeps moving the tiers
The industry has been pushing tiered storage in multiple guises for decades: HSM; ILM; and now, cloud storage. But customers embrace tiers out of necessity, not love.

The powerful visual picture of the layer-cake storage pyramid is deceptive. The x and y axes are cost and capacity, but they are only proxies for the application requirements of the layer above.

Array vendors want to believe that there will always be an “array layer” in the storage pyramid. But why should there be?

As Moore’s Law keeps moving commodity server performance up, the performance envelope of commodity-based storage systems will enlarge. With the commoditization of 10GigE, flash, 6 and 12 Gbit SAS and a 10x increase in areal density, the bandwidth to exploit higher CPU performance will push today’s “archive” cluster storage into monolithic array territory. At a lower price, too.

The software that ties commodity hardware together will improve, weakening the availability argument. If performance is bandwidth driven, pNFS will close the deal for clusters. Scalability goes to clusters today and will only improve with time. Supportability isn’t owned by hardware companies – plenty of software-only companies have cracked the code.

Here’s the future storage pyramid

The storage pyramid in 2015

Won’t arrays disappear?
No, but they will change. For example, they’ll look a lot more like cluster storage under the sheetmetal and GUI. Flash will be an integral part of the architecture – and not as a disk drive. There will be less add-on software, because more will be built in.

Arrays will continue to support legacy interconnects, such as FC and FCoE – remember, this is the future we’re talking about – and legacy OS’s that commodity-based storage won’t. Storage is a conservative part of IT and arrays won’t disappear.

The StorageMojo take
I was at DEC when the company was growing fat selling VAXen. Many predicted that PCs would be the death of the minicomputer companies, but it took 8 years to hit DEC.

There is life after arrays. Minicomputers still exist – and are selling more than ever – but the business model is totally different. The loss of 30 gross margin points forced the issue.

Storage requirements will keep growing. But the days of 60%+ gross margins are drawing to a close. Survivors will follow classic military strategy: concentration of force; short supply chains; and clear objectives.

Courteous comments welcome, of course.

{ 5 comments }

Economic crisis and the storage industry

by Robin Harris on Wednesday, 19 November, 2008

Yes, Virginia, the storage industry will survive the crisis
Economists and business leaders generally agree that the current, as yet unofficial, recession will be the worst we have seen since the Great Depression. The credit bubble has popped and we are facing global de-leveraging that will take years to unwind.

De-leveraging is a fancy term for “a lot less money rolling around.” The computer industry started after the Great Depression, so these will be the worst times we’ve ever seen.

How bad will it get for storage?
Storage is a special case. Disk drives underlie everything we do and they show no sign of slowing their capacity increases and price drops.

Data growth rates are a little less certain – contracting businesses produce less data – but the economic advantages of online data continue to grow as cost per gigabyte drops. Even in the financial sector someone is going to have to unravel all of those credit derivative swaps and synthetic securities that the “rocket scientists” – heckuva job, guys! – developed.

Where will this impact IT operations? Right in the heart of the array business.

A little smarter, a lot cheaper
Assume 80% of all business data is unstructured. And suppose 80% of that data is stored on storage arrays that are optimized for transactional data.

If RAID arrays average $6/GB today and cluster storage averages $2/GB we can begin to estimate the potential impact. In a perfect world 64% – 80% of 80% – of all corporate data could be migrated from high cost storage arrays to much lower cost storage clusters.

If the storage array business is $21 billion a year today, that means there is roughly a total available market of $13 billion of IT spend that could go to storage clusters. If storage clusters are 1/3 the price of storage arrays, that suggests a total storage cluster business of $4 billion a year.

That ignores, of course, the traditional impact of sharply lower storage costs: a rapid increase in the amount of data stored. Online and easily searched data is much more valuable than data stored on paper or tape. A first-order guess is that in today’s market there is the potential for an $8 billion a year storage cluster IT spend.
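For readers who want to check the arithmetic, here is a minimal sketch of the estimate above. The percentages and dollar figures come from the post. The function and constant names are made up for illustration, and the $8 billion first-order guess is not derived here because the post gives no formula for it.

```python
# Back-of-the-envelope inputs taken from the post; names are illustrative.
UNSTRUCTURED_SHARE = 0.80      # share of business data that is unstructured
ON_ARRAYS_SHARE = 0.80         # share of that data stored on transactional arrays
ARRAY_MARKET_BILLIONS = 21.0   # annual storage array spend, in $ billions
ARRAY_COST_PER_GB = 6.0        # $/GB for RAID arrays
CLUSTER_COST_PER_GB = 2.0      # $/GB for cluster storage


def cluster_market_estimate():
    """Return (migratable share, addressable $B, cluster spend $B)."""
    # 80% of 80% of corporate data could move off arrays.
    migratable = UNSTRUCTURED_SHARE * ON_ARRAYS_SHARE
    # Array spend that the migratable data represents today.
    addressable = ARRAY_MARKET_BILLIONS * migratable
    # The same capacity bought at cluster prices (1/3 of array prices).
    cluster_spend = addressable * (CLUSTER_COST_PER_GB / ARRAY_COST_PER_GB)
    return migratable, addressable, cluster_spend


if __name__ == "__main__":
    share, addressable, spend = cluster_market_estimate()
    print(f"migratable share: {share:.0%}")
    print(f"addressable array spend: ${addressable:.1f}B")
    print(f"cluster spend at 1/3 the price: ${spend:.1f}B")
```

Unrounded, the numbers come out a little above the post's $13 billion and $4 billion, which is well within the precision of the inputs.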

That’s the theory, anyway. The reality is that most IT professionals will not give up the storage arrays they know and love without a fight. But the economic pressure will be unrelenting.

Winners and losers
This won’t be a rapid process. The early not-very-good storage arrays came out in 1990 and took 8 years before sales reached 50% of the capacity of enterprise storage. The economic advantages of cluster storage are greater and the pressure to contain costs much stronger today. It will be 6 years before half of all enterprise storage capacity sales are in storage clusters.

The winners will be those companies that embrace and extend the capability of storage clusters the soonest. Among large companies HP and EMC appear to have the lead. Among the small companies several will be purchased while others will continue to grow as independent entities.

The losers? IBM appears to have no discernible strategy. NetApp is bogged down in its efforts to integrate the GX global namespace with the contradictory requirements of its traditional Data OnTap code base.

Sun has good building blocks but will fail if they lead with Lustre. HDS will wait until the market is defined to start moving – but that may be too late. This is a software play in more ways than one.

Smaller companies in the array business have a steep learning curve with cluster storage. Expect most of them to fade over time. There will be opportunities for OEM suppliers to the mid-tier vendors.

The StorageMojo take
The age of the RAID array is coming to an end. They won’t disappear any more than mainframes have. But they will become much less common. The array business will see single-digit sales drops and general long-term stagnation. The storage cluster business will show robust growth.

The race for storage cluster dominance is still young. There are many variables where newcomers and existing players can find or fumble important advantages. Can storage clusters be effectively productized? Or will integration requirements favor service-oriented companies? How will flash be best integrated into storage clusters? How will the SMB market be cracked?

The economic crisis does not create new trends. It accelerates existing ones. IT professionals should not underestimate the power and impact of the current crisis on once sacrosanct IT budgets.

IT likes to talk about “business partnership.” Now is the time for action. Show the CFO that you know how to do more with less and you’ll be a partner. Insistence on business as usual is the wide road to a pink slip.

Courteous comments welcome, of course. Disclosure: I’ve recently done some work for HP on their announced but not-quite-shipping Extreme Data Storage 9100. I was impressed.

{ 25 comments }

The computer science behind EMC’s cloud storage

by Robin Harris on Wednesday, 12 November, 2008

EMC has announced Hulk/Maui, now known as Atmos. I’m flying to Boston today and don’t have access to EMC’s announcement documents.

But I have something better: the papers that provide the theoretical underpinning for Atmos. They provide an in-depth background that isn’t often available for new products.

These papers have too many interesting details to summarize them all. Here are some points that strike my fancy. YMMV.

If you want to understand Atmos these papers are essential. Details of EMC’s implementation will differ of course, but the underlying architectural trade-offs and management issues remain.

A 10 trillion file store
In 2000 a UC Berkeley paper OceanStore: An Architecture for Global-Scale Persistent Storage, authored by John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao, laid out the architecture of what is now Atmos. EMC provided funding for the research and Patrick Eaton went to work for EMC a couple of years ago.

The abstract says:

OceanStore is a utility infrastructure designed to span the globe and provide continuous access to persistent information. Since this infrastructure is comprised of untrusted servers, data is protected through redundancy and cryptographic techniques. To improve performance, data is allowed to be cached anywhere, anytime. Additionally, monitoring of usage patterns allows adaptation to regional outages and denial of service attacks; monitoring also enhances performance through proactive movement of data.

The design center: 1 billion users; each storing 10,000 files. 10 trillion files. Utility storage indeed!

A cluster of clusters
OceanStore is a software layer that creates a global storage cluster. While the paper simply refers to servers, the servers can be clusters as well.

EMC’s engineers chose to use a 3rd party cluster product – IBRIX I think – for the local data stores so they could focus on the layer that glues the sites together. Each local store can itself be a petabyte or more.

Update: several commenters assure us that IBRIX is not the local cluster file system. EMC is using some open source software in Atmos. End update.

Untrusted infrastructure
A key goal of the paper and its prototype was to assume untrusted infrastructure – a phrase that fairly sums up today’s Internet. Only clients are trusted with cleartext – all stored content is encrypted – but most servers are assumed to be working correctly and to help maintain file consistency.

Nomadic data
A global storage system has a unique requirement for locality. But it also needs to be able to store data anywhere, anytime to maintain persistence in the face of outages and catastrophes. Thus data has to be separated from its physical location.

Files are encrypted at the source and stored as persistent objects named by globally unique identifiers (GUIDs). OceanStore has no knowledge of a file’s objects, so it relies on introspection, a mechanism that notes correlations among objects.

Thus the system moves highly correlated objects together, reducing the latency problems that a non-introspective object store faces in a global infrastructure.

Ciphertext
The paper notes that restricting OceanStore to ciphertext limits what can be done with the data. But there is more flexibility than you might suppose.

The operations compare-version, compare-size, compare-block, and search are all possible. In addition there are several feasible update operations, such as replace-block, insert-block, delete-block and append.

Applications
Multi-petabyte data stores for scientific, security or commercial applications are obvious applications. But telcos and ISPs are most interested in mobile apps.

The authors call out email as an apt OceanStore application.

OceanStore alleviates the need for clients to implement their own locking and security mechanisms, while enabling powerful features such as nomadic email collections and disconnected operation. Introspection permits a user’s email to migrate closer to his client, reducing the round trip time to fetch messages from a remote server. OceanStore enables disconnected operation through its optimistic concurrency model—users can operate on locally cached email even when disconnected from the network; modifications are automatically disseminated upon reconnection.

APIs
OceanStore offered its own API. But the authors also developed facades for the base API that emulated a Unix file system, a transactional database and a World Wide Web gateway.

Replication
OceanStore used erasure codes, not unlike the mechanism Cleversafe uses for its distributed data store system. Replica management is a major task for a global system and the paper goes into some detail on their solutions.
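To make the idea of an erasure code concrete, here is a minimal sketch of the simplest one: a single XOR parity fragment, which lets you rebuild any one lost fragment from the survivors. This is a stand-in chosen for brevity, not the coding scheme from the OceanStore paper, which tolerates many more lost fragments. The function names are illustrative.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(fragments):
    """Return the data fragments plus one XOR parity fragment.

    All fragments are assumed to have the same length.
    """
    parity = bytes(len(fragments[0]))  # all-zero starting value
    for frag in fragments:
        parity = xor_bytes(parity, frag)
    return list(fragments) + [parity]


def recover(stored, missing_index):
    """Rebuild the single fragment at missing_index from all the others."""
    length = len(next(f for f in stored if f is not None))
    result = bytes(length)
    for i, frag in enumerate(stored):
        if i != missing_index:
            result = xor_bytes(result, frag)
    return result


if __name__ == "__main__":
    stored = encode([b"abcd", b"efgh", b"ijkl"])
    stored[1] = None  # simulate losing one server's fragment
    print(recover(stored, 1))
```

With one parity fragment, the stored data grows by only 1/n, compared with 100% for a full mirror. That storage efficiency is why erasure codes appeal to a global-scale store.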

The 2nd paper
A 2nd paper, Antiquity: Exploiting a Secure Log for Wide-Area Distributed Storage (available at the same link above) published last year, expands on the OceanStore work.

. . . the secure log interface implemented by Antiquity is a result of breaking OceanStore into layers. In particular, a component of OceanStore was a primary replica implemented as a Byzantine Agreement process. This primary replica serialized and cryptographically signed all updates. Given this total order of all updates, the question was how to durably store and maintain the order? . . . The secure log structure assists the storage system in durably maintaining the order over time. The append-only interface allows a client to consistently add more data to the storage system over time. Finally, when data is read from the storage system at a later time, the interface and protocols ensure that data will be returned and that returned data is the same as stored.

Finally, self-verifying structures such as a secure log lend themselves well to distributed repair techniques. The integrity of a replica can be checked locally or in a distributed fashion. In particular, we implemented a quorum repair protocol where the storage server replicas used the self-verifying structure. The structure and protocol provided proof of the contents of the latest replicated state and ensured that the state was copied to a new configuration.
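The self-verifying, append-only log described in the excerpt can be illustrated with a hash chain. This is a minimal sketch under loud assumptions: it leaves out the signatures, Byzantine agreement and quorum repair that Antiquity relies on, and the class and method names are hypothetical.

```python
import hashlib

GENESIS = b"\x00" * 32  # assumed starting value for the chain


class HashChainedLog:
    """Append-only log where each entry's digest covers the previous digest."""

    def __init__(self):
        self.entries = []  # list of (payload, digest) tuples

    @staticmethod
    def _digest(prev_digest: bytes, payload: bytes) -> bytes:
        return hashlib.sha256(prev_digest + payload).digest()

    def append(self, payload: bytes) -> bytes:
        """Add a payload and return the new head digest."""
        prev = self.entries[-1][1] if self.entries else GENESIS
        digest = self._digest(prev, payload)
        self.entries.append((payload, digest))
        return digest

    def verify(self) -> bool:
        """Recompute the chain and check every stored digest."""
        prev = GENESIS
        for payload, digest in self.entries:
            if self._digest(prev, payload) != digest:
                return False
            prev = digest
        return True


if __name__ == "__main__":
    log = HashChainedLog()
    log.append(b"update 1")
    head = log.append(b"update 2")
    print(log.verify(), head.hex()[:16])
```

Any replica can run `verify()` locally, which is the property that makes distributed repair practical. A client would also need to keep the head digest somewhere trusted. Otherwise a malicious server could rewrite the payloads and recompute the whole chain.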

The StorageMojo take
Bravo! EMC is taking cutting edge computer science and turning it into a product. I’ll comment on the specifics of Atmos later.

New storage paradigms are rare. To have so many academic papers on the underlying technology is rarer still.

EMC would never provide this much information themselves – it would slow down the sales cycle. But these papers – and the couple of dozen others on the OceanStore site – provide implementors with a wealth of technical background.

Comments welcome, of course. Anybody want to comment on what these papers mean for the patentability of Atmos?

{ 12 comments }

Axxana fixes the speed of light

by Robin Harris on Monday, 13 October, 2008

Or a reasonable facsimile thereof
If you are interested in Disaster Recovery, check out Axxana. They solve the limited synchronous data copy distance problem with a black box designed for data. The concept is simple but getting the details right is hard.

The problem
Synchronous replication requires that apps wait until the remote site completes the write. Given the speed of light, that means that synch sites can’t be very far away. Certainly not the 300 miles the SEC would like to see for financial institutions – we still have a few of those, don’t we?
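A rough sketch of why distance hurts: the code below computes the minimum round-trip delay for a synchronous write. It assumes light travels through fiber at about 200,000 km/s, roughly two-thirds of its speed in a vacuum, and ignores switch, protocol and disk latency. Real numbers will be worse. The function names are illustrative.

```python
MILES_TO_KM = 1.609344
LIGHT_IN_FIBER_KM_PER_S = 200_000.0  # assumption: about 2/3 of c in a vacuum


def sync_round_trip_ms(distance_miles: float) -> float:
    """Minimum round-trip propagation delay over fiber, in milliseconds."""
    distance_km = distance_miles * MILES_TO_KM
    return 2 * distance_km / LIGHT_IN_FIBER_KM_PER_S * 1000.0


def max_serial_sync_writes_per_s(distance_miles: float) -> float:
    """Upper bound on strictly serialized synchronous writes per second."""
    return 1000.0 / sync_round_trip_ms(distance_miles)


if __name__ == "__main__":
    for miles in (30, 100, 300):
        rtt = sync_round_trip_ms(miles)
        rate = max_serial_sync_writes_per_s(miles)
        print(f"{miles:>4} miles: {rtt:.2f} ms round trip, "
              f"at most {rate:.0f} serialized writes/s")
```

At 300 miles the propagation delay alone is close to 5 ms per write, before any equipment adds its share. An application that waits on each write in turn is capped at a couple of hundred writes per second.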

Axxana’s answer
No matter what happens in a plane crash, they always seem to be able to recover the “black box” that tells them what the plane was doing shortly before the crash. Axxana has developed a black box for data centers.

Here’s how they describe it:

The Phoenix Black Box is located near the storage system at the primary data center and records a synchronous data stream from the storage. At the same time, an asynchronous data replication system is moving data to a secondary data center (the remote recovery site). The Phoenix Black Box has to protect only the Gigabytes of data that would have been lost in a typical asynchronous replication scenario. Data is protected inside the Black Box during the course of the disaster and can be immediately extracted.

Data extraction is achieved either by:

  • Physically locating the system by tracking the homing signal and connecting a laptop with an Axxana software component to the Phoenix System™ at the disaster site, or
  • The self sufficient and well protected system transferring the data to the secondary site using highly resilient cellular broadband technology.

Your data phones home after a disaster.

Compelling economics
It will take a while to suss out all the implications, but one simple scenario is a company with 3 data centers around the world could in-source their DR strategy with the equivalent of synchronous data recovery. How much would that save?

Distribution
They are working with as many of the major vendors as they can to get the product to you through people you already deal with. Expect to see some announcements.

The StorageMojo take
They are in contention for StorageMojo’s “coolest new product at SNW” award. It looks like they can handle anything up to an A-bomb blast. If that happens even synchronous data replication may not work. Besides, a dirty bomb is much more likely. Happy thoughts, eh?

Comments welcome, of course. Guys, sorry if I jumped the gun. But when I saw the web site was up . . . .

{ 4 comments }

HP/LeftHand: cluster market shapes up

by Robin Harris on Wednesday, 8 October, 2008

Hewlett-Packard’s acquisition of LeftHand Networks shows how cluster storage is going mainstream – and how HP plans to be right in the middle of it. First PolyServe and now LeftHand.

This is about commodity-based clusters
Not iSCSI or GigE or 10 GigE as a storage interconnect. Fibre Channel’s failure to move downmarket – and Infiniband’s similar problem – means GigE is the only game in town.

Reaching the huge, not currently imploding, SMB market requires meeting people where they live. SMBs don’t live in Fibre Channel glass houses. GigE isn’t ideal, but it’s cheap and it works.

Did HP overpay?
$360 million isn’t pocket change, but it is only about 4x the $86 million investors put in. The investors get some nice coin, but it isn’t the 10-bagger they were hoping for.

Once the Lefties go through the interminable internal HP meat grinder, sales will grow rapidly. I suspect they weren’t up to Isilon’s $100M in sales – maybe $70M – but LeftHand was much closer to profitability. Net net: the price looks fair for a market leader in a high-growth market.

HP vs EMC
Battle of the competing cluster storage visions. PolyServe handles files; LeftHand blocks. EMC’s Maui is aimed at large-scale distributed file storage, a utility that ISPs might resell to SMBs, but nothing an SMB would implement on their own.

Which will win – and there’s room for both – rests on the answer to the question: Are there economies of scale in storage? If there are, small-scale cluster sales will suffer and Maui should win.

The StorageMojo take
This is cluster storage market skirmishing, not a pitched battle. That will come but right now everyone is feeling their way, coming into the market from different directions, waiting to see what clicks.

Right now though, HP seems to have the strongest position. XIV is too new; Maui even newer; Lustre too complex; Isilon is digging out of a big hole. HP has the pole position with implementable products today and the services to back them up. Should be a powerful combination.

Courteous comments welcome, of course. Disclosure: I’ve done some work for HP, Isilon and Sun.

{ 4 comments }

De-duplicating primary storage

by Robin Harris on Tuesday, 30 September, 2008

NetApp is announcing a deal today: use their de-dup software with a new NetApp filer for VMware storage and they guarantee that you’ll need a minimum of 50% less storage. You can be sure that NetApp considers 50% a low bar – 80% is more like it.

Why not for most storage?
In a world of unstructured data that is rarely accessed, de-duplication of primary storage is an obvious next step. A recent post discussed the findings of a joint NetApp/UC Santa Cruz study.

A quick recap of some of the study’s findings:

  • Files rarely re-opened. Over 66% are re-opened once and 95% fewer than 5 times.
  • Over 60% of file re-opens are within a minute of the first open.
  • Less than 1% of clients account for 50% of requests.
  • Infrequent file sharing. Over 76% of files are opened by just 1 client.
  • Concurrent file sharing very rare. As the prior point suggests, only 5% of files are opened by multiple clients and 90% of those are read only.
  • Most file types have no common access pattern.

And there’s this: over 90% of the active storage was untouched during the study.

Is it real?
Some commenters were dubious about the results of the study, citing sample size and atypical workload concerns. But the corporate overhead – marketing, finance, HR etc. – part of the workload felt right to me.

A lot of stuff comes in and gets saved “just in case.” Most of it never gets looked at, but when you need a particular file, you need it.

I’m less clear on engineering workloads – I suspect there are major differences among disciplines – but again it didn’t seem unreasonable. But let’s leave the engineers out of the equation.

How important is performance?
The big knock against de-dup for primary storage is the performance hit. Some vendors claim in-line de-dup at wire speed, while others optimize for backup windows and de-dup in the background. Maybe the latter are more efficient.

But given that 90% of the active storage was untouched and 1% of the clients account for 50% of the requests, how important is performance? Cherry-picking the low-access users – i.e. road warriors whose notebook is their primary I/O bucket – shouldn’t be hard.

So what percentage de-dup compression of unstructured data is feasible? That is the key to understanding the economic basis of primary storage de-duplication of unstructured data.

Academics, start your engines!
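In the meantime, anyone can get a first-order feel for the feasible de-dup ratio on their own data. The sketch below assumes fixed-size 4 KB chunks identified by SHA-256 hashes. That is an illustrative simplification, not a description of how NetApp or any other vendor implements de-dup. The function name is hypothetical.

```python
import hashlib


def dedup_savings(blobs, chunk_size=4096):
    """Fraction of bytes saved if each distinct fixed-size chunk is stored once.

    blobs is an iterable of bytes objects, e.g. the contents of files.
    """
    total_bytes = 0
    unique_chunks = {}  # chunk hash -> chunk length
    for data in blobs:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            total_bytes += len(chunk)
            unique_chunks.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    if total_bytes == 0:
        return 0.0
    return 1.0 - sum(unique_chunks.values()) / total_bytes


if __name__ == "__main__":
    file_a = b"A" * 8192
    file_b = b"A" * 4096 + b"B" * 4096
    print(f"savings: {dedup_savings([file_a, file_b]):.0%}")
```

Fixed-size chunking misses duplicates that are shifted by even a single byte, so treat the result as a floor on what a variable-size chunking scheme could find.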

The StorageMojo take
Primary storage de-dup could be the next big win for IT shops. We just don’t have the data that can tell us how big the win could be.

NetApp (disclosure: I’ve done a minuscule amount of work for them in the last year and accepted their annual analyst junket) is well positioned. Their de-dup software license is free on their NearStore/FAS boxes.

NetApp tells me that they’ve got 13,000 systems running de-dup. Maybe some of those people are using it for primary storage and can tell us how well it works.

If the feature is free, de-duping some primary storage will be standard practice in most data centers within 5 years. As the de-dup technology improves and Moore’s Law drives performance, more and more unstructured data will be de-dup’d as a matter of course.

Courteous comments welcome, of course.

{ 13 comments }

Our changing file workloads

by Robin Harris on Tuesday, 9 September, 2008

StorageMojo has long held the view that our storage workloads are changing: more file storage, less block storage; larger file sizes; and cooler data. While all the indicators said this was happening it’s good to find a study that confirmed this intuition.

In the Measurement And Analysis Of Large-Scale Network File System Workloads (pdf) researchers Andrew W. Leung and Ethan L. Miller from UC Santa Cruz and Shankar Pasupathy and Garth Goodson of NetApp measured 2 large file servers for 4 months. Their results are worth reviewing, since so many of the optimizations in storage infrastructures rely on workload assumptions.

Unstudied CIFS
The authors point out that there have been no major studies of the CIFS protocol, odd since it is the default on Windows systems. Furthermore, the last major study of network file loads was performed in 2001 – seven years ago – an interval in which average disk drive sizes have gone from 20 GB to 500 GB and network speeds from 100 Mb/s to 1 Gb/s.

Most surprising, however, is that no published study has ever analyzed large-scale enterprise file system workloads. Researchers have studied workloads closer to home: university and engineering workloads.

Enterprise workloads
One was a midrange file server with 3 TB of capacity with almost 3 TB used by over 1000 marketing, sales and finance employees. The second server was a high-end NetApp filer with 28 TB capacity – 19 TB used – supporting 500 engineering employees.

Yes, marketers, engineers get the good toys. You can cry about it over your next 3 martini lunch.

Some significant differences from prior studies:

  • Workloads more write oriented. Read/write byte ratios are now only 2 to 1 compared to the 4 to 1 or higher ratios reported earlier.
  • Workloads less read-centric. Read/write workloads are now 30x more common.
  • Most bytes transferred sequentially. These runs are 10x the length found in the old studies.
  • Files 10x bigger.
  • Files live 10x longer. Less than half are deleted within a day of creation.

Cool new findings

  • Files rarely re-opened. Over 66% are re-opened once and 95% fewer than 5 times.
  • Over 60% of file re-opens are within a minute of the first open.
  • Less than 1% of clients account for 50% of requests.
  • Infrequent file sharing. Over 76% of files are opened by just 1 client.
  • Concurrent file sharing very rare. As the prior point suggests, only 5% of files are opened by multiple clients and 90% of those are read only.
  • Most file types have no common access pattern.

And there’s this: over 90% of the active storage was untouched during the study. That makes it official: data is getting cooler.

Another interesting finding: 91% of VMware Virtual Disk (vmdk) file accesses were small sequential reads – not the larger sequential accesses I’d expect.

The StorageMojo take
The writers rightly suggest that given the rarity of file reads after creation it makes sense to migrate files to cheap storage sooner rather than later.

Perhaps primary file storage should be thought of as a large FIFO buffer – tossing 3 month old files to an archive for long-term storage. A data flow architecture instead of a series of ever-larger buckets.
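The FIFO idea is simple enough to sketch. The code below assumes a catalog that maps file paths to creation times and treats "3 months" as 90 days. Both the catalog and the function name are hypothetical, and a real policy would also have to handle files that are still being read.

```python
from datetime import datetime, timedelta


def select_for_archive(created, now, max_age_days=90):
    """Return paths created more than max_age_days ago, oldest first.

    created maps a file path to its creation datetime.
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = [(ts, path) for path, ts in created.items() if ts < cutoff]
    # Oldest files leave primary storage first, as in a FIFO queue.
    return [path for ts, path in sorted(stale)]


if __name__ == "__main__":
    catalog = {
        "budget.xls": datetime(2008, 1, 15),
        "launch-plan.ppt": datetime(2008, 8, 20),
        "q1-forecast.doc": datetime(2008, 4, 2),
    }
    print(select_for_archive(catalog, now=datetime(2008, 9, 9)))
```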

Kudos to NetApp and UCSC for this work. It seems like NetApp has been doing the best job of leveraging academic researchers lately. I’d like to see them get more marketing mileage out of their good work.

Courteous comments welcome, of course.

{ 15 comments }