StorageMojo





Robin Harris    


De-duplicating primary storage

September 30th, 2008 by Robin Harris in Architecture, Enterprise, Future Tech

NetApp is announcing a deal today: use their de-dup software with a new NetApp filer for VMware storage and they guarantee that you’ll need a minimum of 50% less storage. You can be sure that NetApp considers 50% a low bar - 80% is more like it.

Why not for most storage?
In a world of unstructured data that is rarely accessed, de-duplication of primary storage is an obvious next step. A recent post discussed the findings of a joint NetApp/UC Santa Cruz study.

A quick recap of some of the study’s findings:

  • Files rarely re-opened. Over 66% are re-opened once and 95% fewer than 5 times.
  • Over 60% of file re-opens are within a minute of the first open.
  • Less than 1% of clients account for 50% of requests.
  • Infrequent file sharing. Over 76% of files are opened by just 1 client.
  • Concurrent file sharing very rare. As the prior point suggests, only 5% of files are opened by multiple clients and 90% of those are read only.
  • Most file types have no common access pattern.

And there’s this: over 90% of the active storage was untouched during the study.

Is it real?
Some commenters were dubious about the results of the study, citing sample size and atypical workload concerns. But the corporate overhead - marketing, finance, HR etc. - part of the workload felt right to me.

A lot of stuff comes in and gets saved “just in case.” Most of it never gets looked at, but when you need a particular file, you need it.

I’m less clear on engineering workloads - I suspect there are major differences among disciplines - but again it didn’t seem unreasonable. But let’s leave the engineers out of the equation.

How important is performance?
The big knock against de-dup for primary storage is the performance hit. Some vendors claim in-line de-dup at wire speed, while others optimize for backup windows and de-dup in the background. Maybe the latter are more efficient.

But given that 90% of the active storage was untouched and 1% of the clients account for 50% of the requests, how important is performance? Cherry-picking the low-access users - i.e. road warriors whose notebook is their primary I/O bucket - shouldn’t be hard.

So what percentage de-dup compression of unstructured data is feasible? That is the key to understanding the economic basis of primary storage de-duplication of unstructured data.
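One rough way to get a first answer on your own data is to hash fixed-size blocks and count how many are repeats. The sketch below assumes 4 KB blocks and SHA-256 fingerprints; both are illustrative choices of mine, not a description of how NetApp or any other vendor does it, and the function name `dedup_savings` is made up for this example.

```python
import hashlib


def dedup_savings(blobs, block_size=4096):
    """Estimate the fraction of blocks that are duplicates.

    blobs: iterable of bytes objects (e.g. file contents).
    block_size: assumed fixed block size in bytes (illustrative).
    Returns a float between 0.0 and 1.0.
    """
    seen = set()   # fingerprints of unique blocks
    total = 0      # total number of blocks examined
    for data in blobs:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            total += 1
            seen.add(hashlib.sha256(block).digest())
    if total == 0:
        return 0.0
    return 1 - len(seen) / total


if __name__ == "__main__":
    sample = [b"a" * 8192, b"a" * 4096, b"b" * 4096]
    print(f"estimated savings: {dedup_savings(sample):.0%}")
```

Fixed-size blocks keep the sketch short; real products may use variable-size chunking or other tricks, so treat the number as a ballpark.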

Academics, start your engines!

The StorageMojo take
Primary storage de-dup could be the next big win for IT shops. We just don’t have the data that can tell us how big the win could be.

NetApp (disclosure: I’ve done a minuscule amount of work for them in the last year and accepted their annual analyst junket) is well positioned. Their de-dup software license is free on their NearStore/FAS boxes.

NetApp tells me that they’ve got 13,000 systems running de-dup. Maybe some of those people are using it for primary storage and can tell us how well it works.

If the feature is free, de-duping some primary storage will be standard practice in most data centers within 5 years. As the de-dup technology improves and Moore’s Law drives performance, more and more unstructured data will be de-dup’d as a matter of course.

Courteous comments welcome, of course.

YottaNotta

September 22nd, 2008 by Robin Harris in Future Tech

StorageMojo has been informed that YottaYotta, a storage networking company that EMC invested in a couple of years ago, has shut down. EMC reportedly scooped up the IP and key employees. I once worked there and hold stock in the company.

The YY web site is not opening. Reportedly CEO Barton Shigemura, CTO Wayne Karpoff and a number of development and test managers have been let go. An estimated 40 former YY’ers now have EMC badges.

Update: A second source has confirmed the YY shutdown and EMC’s uptake of the IP and about half the people. End update.

The Maui connection
As StorageMojo noted in April, 2006:

EMC’s GM of the Grid & Utility Computing, Ian Baird, mentioned at EMC World in Boston this week that EMC had invested in distributed caching technology developed by YottaYotta, a Canadian startup, for their “Grid Storage” strategic direction.

Distributed caching technology is crucial to creating WAN-based storage infrastructures that operate as if local, despite being spread over thousands of miles, where normal network latency would cripple response times. YottaYotta, an $80M startup based in Edmonton, Alberta, has been working on the technology since its founding in early 2000.

The YottaYotta system was a network-based RAID controller. The controller’s backplane was a network - Infiniband or GigE - so the controller could be physically distributed. The coordination of the distributed controller boards through wide-area cluster software is the company’s key IP.

The StorageMojo take
A coup for EMC. For a few million dollars EMC got the benefit of $80 million in R&D and some fine engineers. I speculate that EMC will use the YY block technology behind filer heads to provide fast data replication and access across dispersed data centers in the Maui infrastructure.

YY’s fate was sealed 5 years ago when the company abandoned the storage systems market in favor of attempting to market only the RAID controllers. That market could never justify the $80 million invested in the company.

Comments welcome, of course.

StorageMojo in San Diego Monday

September 19th, 2008 by Robin Harris in Off-Topic

Sorry for the short notice, but I’ll be in San Diego Monday the 22nd. If you have a company or a storage research project you’d like to show me, please send me a note or leave a comment. I like to see what people are up to.

The StorageMojo take
With the UCSD storage research lab and the supercomputer center, plus a number of startups, San Diego is getting to be an important player in storage. I’d like to see a company there hit it big.

Courteous comments welcome, of course.

Are there economies of scale in storage?

September 18th, 2008 by Robin Harris in Architecture, Future Tech

The assumption that underlies much of the interest in cloud computing is that there are economies of scale. If there are not, the extra costs of bandwidth and latency will make cloud computing too costly.

Ever since Google demonstrated that massive infrastructures could be built from commodity hardware and open source software, system architects have sought similar advantages at lesser scale. People tend to ignore the fact that Google’s infrastructure is optimized for a few very specific applications.

The Google filesystem and the Google storage system, BigTable, are designed to handle the massive amounts of data that Google acquires and searches every day. Each Google rack only contains 120 disk drives, which is low density compared to most commodity servers.

Google has shown us a way to build massive infrastructures, but not the way. They have built a warehouse sized search appliance.

What makes storage cheaper?
Here is a list:

  • Commodity drives. Cheap drives make for cheap storage.
  • Wide fan out. Amortizing interconnect costs across more drives will further lower costs. Performance may suffer depending on workload.
  • Free software. Linux, OpenSolaris, Hadoop and other products are among the candidates.
  • Low cost networking. Unmanaged Ethernet switches.
  • Self management. When the rest of the infrastructure is either cheap or free, people costs will rapidly become the dominant factor.
  • Low entry cost. Cloud storage has a definite advantage. Faster setup and lower capital costs are tangible benefits.

Other than fan out, none of these factors are very sensitive to scale. Of course there are other issues: network costs; data center costs; and power costs.
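To see why fan out is the one scale-sensitive item, here is a back-of-the-envelope sketch. The dollar figures are invented purely for illustration, and `cost_per_drive` is a hypothetical helper, not anything from a vendor price list.

```python
def cost_per_drive(interconnect_cost, drive_cost, drives):
    """Effective cost of one drive once the shared interconnect
    (controller, switch ports, cabling) is spread across all drives."""
    return drive_cost + interconnect_cost / drives


if __name__ == "__main__":
    # Assumed numbers: $1,200 of interconnect, $100 per commodity drive.
    for drives in (12, 48):
        print(drives, "drives:", cost_per_drive(1200, 100, drives))
```

The interconnect share shrinks as drives are added, but the drive cost itself does not, so the curve flattens quickly.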

Where are the economies?
But once you get above a dozen or so racks, what other economies of storage scale are there? I’m asking the question, so feel free to provide answers.

The StorageMojo take
People may be the most important economy of scale in storage. If one infrastructure requires 1 admin for 100 TB and another only 1 for 500 TB, it is obvious who will win, at least in the United States.
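The arithmetic is simple enough to sketch. The salary below is an assumed round number for illustration only; the 5x gap in per-TB admin cost holds whatever salary you plug in.

```python
def admin_cost_per_tb(salary_per_year, tb_per_admin):
    """Yearly admin cost attributable to each managed terabyte."""
    return salary_per_year / tb_per_admin


if __name__ == "__main__":
    salary = 100_000  # assumed fully loaded yearly cost, hypothetical
    small = admin_cost_per_tb(salary, 100)   # 1 admin per 100 TB
    large = admin_cost_per_tb(salary, 500)   # 1 admin per 500 TB
    print(f"100 TB/admin: ${small:,.0f} per TB per year")
    print(f"500 TB/admin: ${large:,.0f} per TB per year")
    print(f"ratio: {small / large:.0f}x")
```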

This suggests that cloud storage will need unique services to win. Online backup is an example of a service where users are buying more than capacity.

Then the problem becomes, at least for consumer services, designing offers that are attractive enough to get consumers to sign up and are profitable for the provider. And that means a marriage of marketing, finance and technology. Competing purely on price will be a fool’s game.

Courteous comments welcome, of course.

SNIA reprise online

September 16th, 2008 by Robin Harris in Off-Topic

I gave a keynote at the last SNIA Symposium about the impact of 5 new datacenter technologies:

  • 2.5″ drives
  • flash SSDs
  • guaranteed uptime storage
  • 10 gigE
  • Cloud storage

Then last week I gave a reprise of the presentation to the SNIA end-user council.

Whelming demand
In response to overwhelming demand - 2 requests (I’m easily overwhelmed) - here is the presentation and a recording of the session. You’ve been warned.

You can download the slides (0.5 MB pdf) and now the audio (15 MB mp3).

Update: A commenter suggested a smaller file so I’ve replaced the 7 MB pdf with a 0.5 MB version. Not as sharp or clear, but the content is there - faster! FWIW, the audio file is heavily compressed and you may notice some compression artifacts. End update.

The StorageMojo take
I’ve never gotten into the podcast thing, but then I no longer commute 45 minutes a day.

EMC guys will certainly want to blow an hour listening to my comments on the Hopkinton giant. Make a pitcher of 3-2-1 Margaritas, put your feet up, and “go to work” courtesy of StorageMojo.

Courteous comments welcome, of course.

Our changing file workloads

September 9th, 2008 by Robin Harris in Architecture, Enterprise, NAS, IP, iSCSI

StorageMojo has long held the view that our storage workloads are changing: more file storage, less block storage; larger file sizes; and cooler data. While all the indicators said this was happening, it’s good to find a study that confirms this intuition.

In the Measurement And Analysis Of Large-Scale Network File System Workloads (pdf), researchers Andrew W. Leung and Ethan L. Miller from UC Santa Cruz and Shankar Pasupathy and Garth Goodson of NetApp measured 2 large file servers for 4 months. Their results are worth reviewing, since so many of the optimizations in storage infrastructures rely on workload assumptions.

Unstudied CIFS
The authors point out that there have been no major studies of the CIFS protocol, which is odd since it is the default on Windows systems. Furthermore, the last major study of network file loads was performed in 2001 - seven years ago - an interval in which average disk drive sizes have gone from 20 GB to 500 GB and network speeds from 100 Mb/s to 1 Gb/s.

Most surprising, however, is that no published study has ever analyzed large-scale enterprise file system workloads. Researchers have studied workloads closer to home: university and engineering workloads.

Enterprise workloads
The first server was a midrange file server with 3 TB of capacity, almost all 3 TB of it used, serving over 1000 marketing, sales and finance employees. The second server was a high end NetApp filer with 28 TB capacity - 19 TB used - supporting 500 engineering employees.

Yes, marketers, engineers get the good toys. You can cry about it over your next 3 martini lunch.

Some significant differences from prior studies:

  • Workloads more write oriented. Read/write byte ratios are now only 2 to 1 compared to the 4-1 or higher ratios reported earlier.
  • Workloads less read-centric. Read/write workloads are now 30x more common.
  • Most bytes transferred sequentially. These runs are 10x the length found in the old studies.
  • Files 10x bigger.
  • Files live 10x longer. Less than half are deleted within a day of creation.

Cool new findings

  • Files rarely re-opened. Over 66% are re-opened once and 95% fewer than 5 times.
  • Over 60% of file re-opens are within a minute of the first open.
  • Less than 1% of clients account for 50% of requests.
  • Infrequent file sharing. Over 76% of files are opened by just 1 client.
  • Concurrent file sharing very rare. As the prior point suggests, only 5% of files are opened by multiple clients and 90% of those are read only.
  • Most file types have no common access pattern.

And there’s this: over 90% of the active storage was untouched during the study. That makes it official: data is getting cooler.

Another interesting finding: 91% of VMware Virtual Disk (vmdk) file accesses were small sequential reads - not the larger sequential accesses I’d expect.

The StorageMojo take
The writers rightly suggest that, given the rarity of file reads after creation, it makes sense to migrate files to cheap storage sooner rather than later.

Perhaps primary file storage should be thought of as a large FIFO buffer - tossing 3 month old files to an archive for long-term storage. A data flow architecture instead of a series of ever-larger buckets.
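A minimal sketch of that FIFO idea, assuming "3 months" means 90 days since creation and that a file catalog is available as (name, creation time) pairs. The function name and the catalog shape are hypothetical, chosen just for this example.

```python
from datetime import datetime, timedelta


def split_by_age(files, now, max_age=timedelta(days=90)):
    """Partition files into those that stay on primary storage and
    those old enough to move to the archive.

    files: iterable of (name, created) pairs, created being a datetime.
    Returns (keep, archive) as two lists of names, in input order.
    """
    keep, archive = [], []
    for name, created in files:
        if now - created > max_age:
            archive.append(name)   # older than the cutoff: age it out
        else:
            keep.append(name)      # still inside the FIFO window
    return keep, archive


if __name__ == "__main__":
    catalog = [
        ("q3-forecast.xls", datetime(2008, 9, 1)),
        ("q1-budget.xls", datetime(2008, 5, 1)),
    ]
    print(split_by_age(catalog, now=datetime(2008, 9, 9)))
```

A real policy would probably look at last access as well as creation time, but age alone matches the FIFO picture.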

Kudos to NetApp and UCSC for this work. It seems like NetApp has been doing the best job of leveraging academic researchers lately. I’d like to see them get more marketing mileage out of their good work.

Courteous comments welcome, of course.

StorageMojo live! Today!

September 4th, 2008 by Robin Harris in Off-Topic

In glorious lo-fi monaural!
A sterling opportunity to blow another precious hour of your work life in the guise of “continuing education.” Dial in for a reprise of the keynote I gave at the SNIA summer symposium in San Jose.

5 technologies in search of a data center
That’s the title. The event is the Monthly General Meeting of the SNIA End User Council. I’m not 100% sure what they do either.

The dial-in number is 888-896-4477, Bridge #773423

The call is at 4pm Eastern Daylight Time. 1pm Pacific Daylight Time.

The 7 MB pdf presentation is here. Guaranteed SFW. Nice pictures of red rocks. Not a lot of bullets - the preso isn’t that informative if you don’t listen too.

The StorageMojo take
The presentation looks at some technologies that will be shaping the data center in the next 5 years. It starts small with 2.5″ drives and goes up to cloud storage. Includes SSDs and 10 GigE.

As usual I try to look past the hype into what the real effects of these technologies can be. I think there will be time for questions too.

Courteous comments welcome, of course. If you listen to the preso, comment here.


