StorageMojo




Robin Harris    


StorageMojo NPI

October 29th, 2007 by Robin Harris in Clusters, Enterprise, NAS, IP, iSCSI

New Product Introduction
As part of my campaign to increase the world’s consumption of disk capacity - see yesterday’s post - I’ve developed a new capacity-gobbling product. For lack of a better term I call it a video white paper.

The impetus? No one reads anymore. Especially white papers. We’d much rather watch videos.

Enter Gear6
When I looked around for a launch customer, Gear6 came to mind. Their marketing VP, Gary Orenstein, has one of the few marketing blogs, Thoughtput, with real content instead of “aren’t we wonderful” happy talk. He’s done a number of podcasts as well. He’s a new-media, large-file-size kind of guy.

Happily, he agreed to be the launch customer.

Here’s your part
Gear6 and Gary responded favorably to this new product and now I would like to hear from all of you. I am continuing to enhance the concept with the goal of bringing more value to everyone that views it.

What I’d like you to do is to watch the 4.5 minute video and tell me what you think. What works, what doesn’t work. What you’d like to see more of and what you’d like to see less of.

Meet Nisha Talagala, CTO

Nisha is not just really smart - smarter than the average Silicon Valley CTO - she is also a very nice person. I was impressed.

Yes, I was paid for this. And I’d like to be paid to do more of them! But only if they are worthwhile for you. So help me figure out how to make that happen.

Comments welcome - more than ever. My goal is to create something that is genuinely useful for information seekers in a 3-5 minute package. Tell me how well you think it works. How would *you* get more valuable content into 3-5 minutes in a way that people will watch?

Update: I tweaked the wording a bit. Same video.

pNFS technical intro

October 15th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, NAS, IP, iSCSI

I don’t normally link and run but this is a good article on the Next Big Thing in NFS v4.1.

Written by 3 NetApp engineers, Garth Goodson, Sai Susarla, and Rahul Iyer, Standardizing Storage Clusters offers a good overview of what’s new. It’s on the ACM Queue web site.

If paragraphs like

protocol operations

The pNFS protocol adds a total of six operations to NFSv4. Four of them support layouts (i.e., getting, returning, recalling, and committing changes to file metadata). The two other operations aid in the naming of data servers (i.e., translating a data server ID into an address and getting a list of data servers). All the new operations are designed to be independent of the type of layouts and data-server names used. This is key to pNFS’s ability to support diverse back-end storage architectures.

get you interested, the article is well worth a read.
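For readers who think better in code, here is a minimal Python sketch of the flow that paragraph describes: a client asks a metadata server for a file's layout, translates data-server IDs into servers, then reads the stripes straight from the data servers. Every class and method name is hypothetical, the round-robin striping is an assumption, and none of this is the real NFSv4.1 wire protocol.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    """Hypothetical layout: where a file's stripes live."""
    device_ids: list   # IDs of the data servers holding the stripes
    stripe_size: int   # bytes per stripe
    length: int        # file length in bytes


class DataServer:
    """Stand-in for a back-end data server that stores stripes."""

    def __init__(self):
        self.stripes = {}  # (path, stripe index) -> bytes

    def write(self, path, index, data):
        self.stripes[(path, index)] = data

    def read(self, path, index):
        return self.stripes[(path, index)]


class MetadataServer:
    """Stand-in for the metadata server that hands out layouts."""

    def __init__(self, devices):
        self.devices = devices  # device ID -> DataServer
        self.layouts = {}       # path -> Layout

    def create(self, path, data, stripe_size):
        # Setup shortcut for the demo: stripe the data round-robin
        # across the data servers and record the layout.
        ids = sorted(self.devices)
        for offset in range(0, len(data), stripe_size):
            index = offset // stripe_size
            server = self.devices[ids[index % len(ids)]]
            server.write(path, index, data[offset:offset + stripe_size])
        self.layouts[path] = Layout(ids, stripe_size, len(data))

    def layout_get(self, path):
        # "Getting" a layout, one of the layout operations.
        return self.layouts[path]

    def get_device_info(self, device_id):
        # Translating a data server ID into something addressable.
        return self.devices[device_id]


def pnfs_read(mds, path):
    """Read a whole file by going directly to the data servers."""
    layout = mds.layout_get(path)
    servers = [mds.get_device_info(d) for d in layout.device_ids]
    stripe_count = -(-layout.length // layout.stripe_size)  # ceiling division
    return b"".join(
        servers[i % len(servers)].read(path, i) for i in range(stripe_count)
    )


mds = MetadataServer({"ds-a": DataServer(), "ds-b": DataServer()})
mds.create("/video/demo.mov", b"0123456789abcdef!", stripe_size=4)
print(pnfs_read(mds, "/video/demo.mov"))
```

The point of the sketch is the division of labor: the metadata server is consulted once for the layout, and the bulk data never passes through it.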

The StorageMojo take
pNFS is going to commoditize parallel data access. In 5 years we won’t know how we got along without it.

Parascale’s CTO on what’s different about Parascale

October 4th, 2007 by Robin Harris in Architecture, Clusters, Future Tech

Is Parascale new or old?
There were many good reader questions about Parascale’s announcement. Even though I’ve done some work for them I didn’t know the answers so I invited their CTO, Cameron Bahar, to respond. He sent me a text only email, which I’ve decorated with some HTML to improve readability.

CTO Cameron Bahar:

Hi Robin,

We are delighted by the interest shown in both the file management challenges that Parascale seeks to address…and in our newly-announced solution. Your readers bring up many important issues, especially in regards to how existing solutions compare to Parascale. Permit me to try to group these questions into categories and to highlight how Parascale is different.

HPC solutions. High Performance Computing (HPC) solutions are typically implemented with kernel code and employ custom client-side software to achieve high bandwidth. For example, Lustre has been successful at many national labs as mentioned in one post. Parascale is targeting a different market. Parascale is all about industry standards. We support NFS, HTTP, and FTP protocols because we don’t expect our customers to recompile their applications. We want our software to be simple to use, as well as to scale in capacity and bandwidth for our target digital content applications.

Archival solutions. Several companies, including Archivas, have delivered archival systems. These solutions are generally WORM (write once read many) systems and disallow updates to existing files. By comparison, Parascale is POSIX-compliant and designed to support large read/write bandwidth—not always a requirement for archiving. Finally, if a large vendor has acquired these technologies (e.g. HDS-Archivas), they’re usually shipped as a rack of pre-installed appliances, limiting choice of hardware provider and hardware configuration.

Clustered file systems. Shared-disk clustered file systems such as Red Hat GFS have the characteristics of traditional distributed file systems, such as tight cache coherency, distributed lock management, and symmetric topology. Scalability of these file systems is generally limited to 16 or 32 nodes due to heavy cache coherency traffic and message passing between nodes.

Members of our engineering team have written several clustered file systems in previous undertakings. From that experience we elected to adopt a very different architecture for Parascale. For starters, we elected to adopt a loosely-coupled architecture for scalability. Further, we chose not to write a new file system. File systems are very delicate (as we know by having written them in the past) and they take 5-7 years to fully stabilize and stop corrupting data. We simply aggregate existing file systems to present a “virtual file system” layer to clients/applications over standard protocols.

Appliances versus software. NAS appliances are ideal for many markets, like SMBs and enterprise workgroups, that need simplicity of installation and for which scalability in volume and bandwidth are not key requirements. Appliances generally employ hardware highly-customized for serving files, including hardware features like NVRAM to boost write-performance and RAID controllers for data redundancy.

Parascale seeks to solve a different problem, that of managing large digital content repositories. Think of video on demand, photo archives, medical imaging, seismic data, and genomics data. Don’t fault us for being inappropriate as secondary storage for an RDBMS. We didn’t design Parascale for block storage because many excellent products already address this market.

We’ve constrained our solution to run as an application (with no kernel code) on industry-standard servers, as qualified only by Red Hat. We want our customers to enjoy the very latest advances in server hardware (motherboards, processors, memory, disks) available from Dell, HP and others. And we want our customers to be able to buy servers from their “regular hardware vendor.”

Parascale’s software-only solution lets our customers tune the disk capacity, CPU, RAM, I/O and network bandwidth independently—as required by the application at hand. Growth can be incremental—one disk drive or server at a time. You never have to discard hardware or licenses. Another useful benefit of a software-only solution is that other applications can coexist on the Parascale storage nodes, allowing data mining, trans-coding, encryption, or compression on the servers where the data resides. This is not possible with closed appliances.

What qualifies as a “software-only” file storage solution? Our perspective is, first, that the software has to support standard network file access protocols like NFS, HTTP, or CIFS. You can store files in an RDBMS, but that doesn’t make it a software-only file management solution. Second, the disk drives must be direct-attached to the servers. Shared disk distributed or parallel filesystems (over SAN) are software products, but don’t qualify because they require specialized SAN hardware on the back end.

Finally, because all our engineering resources are focused on software, we’ve been able to innovate (with patents to prove it) and to deliver features like transparent, automated file migration (to eliminate server hot spots) and replication (to raise read bandwidth). And our roadmap promises a lot more innovation to follow!

Asked another way, where does Parascale fit in the market? Choose us if:

  • You want industry-standard hardware (e.g. because you want to run applications on the storage nodes, or because you have corporate hardware standards).
  • You need more bandwidth than one server/head can provide.
  • You need the benefits of data mobility across servers (e.g. migration to balance data and eliminate hot spots, replication to increase read bandwidth, smart load balancing to optimize system performance).

Lastly, Parascale aspires to be new and modern in its business model. When our product goes production, we plan to allow you to download our software to try it out at no cost. We’re confident you’ll like it. Our pricing is per-spindle, so you never have to deploy or pay for storage capacity before you need it. And if a drive fails, replace it with a new drive in the manufacturers’ current sweet-spot; we’re not trying to make money on advances by the disk drive manufacturers.

Hope I’ve addressed some of the questions posted. I applaud the thoughtful discussion that your post has prompted.

Best,

Cameron

Comments welcome, of course.

Sun adds Lustre to supercomputing

September 26th, 2007 by Robin Harris in Clusters, Information Management

What about Sun’s acquisition of Cluster File Systems, Inc.?
Yawn. CFSI was going out of business. Sun bought the assets, not the company.

Good for CFSI employees
They get a paycheck from a solvent company. They may even get some sensible marketing. Hey, it could happen.

What is Lustre?
Arguably the highest-end parallel file system. At the Seattle Conference on Scalability, founder Peter Braam spoke about current 25,000 node Lustre clusters and plans to 10x that number in the next 5 years.
Update: It appears the Lustre.org and the Lustreusers.org sites are suspended. Hm-m-m? Update II: They are back up.

Cool, huh?

So why aren’t they rich?
CFSI was a tech playpen, not a company. Like Formula 1 racing. Instead of Ferrari, CFSI had the national labs backing them. Great stuff, except nobody else has the problems the national labs have, so it limits the market.

Lustre will be facing some serious competition from pNFS once it gets baked into Linux and other operating systems. The fast-growing commercial HPC market will eat pNFS clusters up. Lustre isn’t part of that.

The StorageMojo take
Sun bought a hook into a customer base that, when budgets are good, can be very profitable. They also bought a technical team that is very knowledgeable about fabric interconnects, which in the shift to cluster storage and grids will be a very good thing for Sun.

Comments welcome, as always. OK, Lustre proponents, tell me where I’m wrong.

Parascale launches Google-like storage software

September 25th, 2007 by Robin Harris in Architecture, Clusters, Future Tech

Yay!
Parascale (parallel scale) launched its beta Virtual Storage Network this week. I’ve done some consulting for them so I won’t pretend to be objective. I’m a big fan of software-based storage clusters no matter who makes them.

GFS-like is more accurate
You can read about VSN architecture here. AFAIK it is the first GFS-type software-only storage product intended for enterprises.

Here’s a diagram from their architecture page.

[Diagram from Parascale’s architecture page]

It should be pretty solid for a beta. Blue Coat, which used to be CacheFlow, has been using it for 18 months as a FIFO buffer to stage backups. They set it up and it has worked flawlessly.

The StorageMojo take
With EMC edging into storage clusters next year the credibility of the concept will take a giant leap forward. Parascale is well-positioned to take advantage of the interest, especially if, as I suspect, someone buys them.

Comments welcome, as always. If you decide to test VSN, let me know how it goes.

Update: This post was not my finest hour: I got Parascale’s acronym wrong - it is Virtual Storage Network, not Virtual File Network. Memory like a sieve. And many have taken me to task for calling VSN “. . . the first software-only commercial storage product.” I was trying to point to the GFS-style architecture for the enterprise and did not word it well. So I re-worded it.

In researching some of the suggestions from the comments I noticed that not every vendor talks about their architecture. Interesting. End update.

EMC buys leader in telekinetic security

September 25th, 2007 by Robin Harris in Backup, Clusters, Enterprise

Time to get serious, guys
TechCrunch and VNUnet are reporting that EMC is buying online backup provider Mozy for $76 million. Neither EMC nor Mozy has issued any confirmation, so who knows if it is real. But let’s assume it is.

Puzzled?
This is a good fit for EMC on several levels.

  • SMB branding. For reasons I still don’t get, Dell has happily handed EMC tens of millions of dollars worth of SMB branding that EMC could never have bought on its own. Mozy gives EMC a nice brand extension for servicing that market.
  • F1000 notebook backup. Mozy’s huge GE deal is just the first of many once EMC’s sales force gets its marching orders.
  • Margin enhancement. Mozy charges a flat $50/yr for unlimited personal backup. Storage prices fall every year. Margins grow automatically.
  • Grid storage. Mozy wasn’t using Symms and I doubt they’ll start now. EMC is buying expertise in the new storage paradigm.

The StorageMojo take
IBM, HP, NetApp: listen up. EMC is definitely moving into storage clusters as commercial products. If you guys don’t want to see a three-peat of the mid-90’s, when EMC rolled IBM big time, it is time to get serious.

EMC’s storage cluster strategy is hardly bulletproof. Yet playing catch-up isn’t where you want to be when this party starts. If you liked EMC in the 90’s you’ll love them in the ‘teens!

Comments welcome, of course. BTW, Mozy’s beta for Mac is still buggy. The Windows client is much better. And yes, Mozy did - once - claim they could protect your data from “. . . potential telekinetic security breaches.”

EMC’s coming strategic shift

September 10th, 2007 by Robin Harris in Clusters, Enterprise, NAS, IP, iSCSI

I’m always curious about the context of the communications as well as the content. The Bush administration, for example, has been very disciplined in releasing bad news late Friday in the reasonable expectation that most people won’t ever hear about it.

Why some enterprising editor doesn’t have a Monday morning front-page box: “What the White House doesn’t want you to know” is beyond me.

So I get an EMC press release on Friday . . .
Nothing nefarious, since the release actually went out Thursday and didn’t get reported until Friday. The two major bits are:

  • “. . . former Dell and Bain executive Louise O’Brien has joined the company as Executive Vice President, Corporate Strategy and Development. O’Brien will report to Joe Tucci, EMC Chairman, President and CEO, and be responsible for overseeing EMC’s corporate strategy, mergers and acquisitions, Office of the Chief Technology Officer (CTO), and New Ventures Group.”
  • “. . . EMC promoted three of its senior executives to President. Mark Lewis, 45, has been named President of EMC’s Content Management and Archiving (CMA) business, after having served most recently as Chief Development Officer (CDO); David Donatelli, 42, has been named President, EMC Storage Division; and Howard Elias, 50, has been named President, EMC Global Services.”

Let’s see, the CTO reports to a sales, marketing and strategy person
Hm-m-m? Nothing new for EMC, whose CTO has typically been tasked with making a dog’s breakfast product portfolio look good to customers. EMC has always been a sales company where technology is secondary. Just an observation, folks.

No, this is the interesting part
Putting Lewis in charge of Content Management and Archiving. He’s been an internal advocate for storage clusters and grids, which horrifies the Symm folks, but there is no doubt that clusters are coming and EMC has to do something.

So how to thread the needle, i.e. keep up the high-margin Symm sales while slowly introducing scalable storage clusters without inducing a mass migration? Simple. Sell storage clusters as the place where your data goes to die.

Massive, cheap - compared to Symms, but the gross margin remains sacred - capacity with some nifty lock-ins to keep customers coming back for more. Archive meta-data?

ILM rises from the dead
ESG, always a reliable indicator of and cheerleader for EMC thinking, is pushing the model of dynamic data and persistent data. There is a lot more persistent data so you’ll need a lot more capacity but without the performance requirements of database apps.

Enter the grid
EMC has been seeding money among innovative startups for years with a special emphasis on network-based storage. Now it is harvest time. I expect they’ll be buying, under Ms. O’Brien’s watchful eye, some of the grid/cluster/network companies they’ve invested in.

Persistent storage is actually much harder than dynamic storage because it is anti-entropic. And if you can get a customer to buy enough you will have an annuity business for many years.

The StorageMojo take
Yet the competition isn’t standing still. The movement of companies into cluster-based storage isn’t over by a long shot. The line between persistent and dynamic data is drawn by Moore’s Law and the system architects, not by the data itself.

Oracle’s adoption of direct NFS and the coming pNFS standard both point to a world where massive capacity clusters are also capable of massive IOPS with low latency. And archival storage based on open standards will have an intuitive appeal that even EMC’s high-commission sales force won’t have much luck fending off.

Nonetheless, expect EMC to introduce their first cluster product by year-end ‘08. They’re hoping to get it out before June but I don’t think they’ll make it. This stuff is always harder than it looks.

Also, I tap Ms. O’Brien as the next CEO of EMC. She may look like a dark horse, but those Bain alums are smooth operators.

Comments welcome, of course. Also, I am formally abandoning my promised Part III of EMC has Ph.D’s Pt I and Part II. Every time I looked at the list of EMC patents I’d have terminal brain cramp. Maybe an ardent EMC’er will do part III instead.

And it’s a bit pointless anyway. EMC buys its innovation if I’ve heard them correctly.

Long-haul Infiniband

July 25th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, SAN, FC

I’ve liked Infiniband ever since I learned about it at YottaYotta in 2000. The switches are fast and cheap, the latency very low and the bandwidth - 6 GB/sec full-duplex at 12x - stunning. (Cisco has an excellent technical overview here.)

One thing it didn’t do, though, was handle distance. Even fiber-based IB was limited to a few hundred meters. A great computer room interconnect, but not so good for the disaster-tolerant configurations that YottaYotta’s cluster-based RAID controller was hoping to address.

YY made do with gigE links, and managed some impressive demonstrations of terabyte long-distance data transfers. Just the thing for a long weekend at the lake.

Of course, there is a downside
Infiniband was designed to be more a fixed resource like Fibre Channel than an easy-come, easy-go WAN like Ethernet. Five years ago the management was less than optimal. Some 3rd-party tools were available from Voltaire - hey, guess who’s going public! - but most folks ended up writing their own management. But if you want an “always on” network this isn’t a big problem.

Putting all one’s eggs in one basket was something that always concerned me. A single data center, no matter how well-built, is asking for trouble. I mocked this up to dramatize the issue:

[Image: eggs]

Ideally, Infiniband would at least offer metro area networking for redundancy. I don’t think you can buy it yet, but long-haul I-band may be coming.

Enter Obsidian Research
Meanwhile, up in northern Alberta, one of YY’s former whizzes, David Southwell, formed Obsidian Research, dedicated to taking I-band long-haul. The company says

Longbow XR allows arbitrarily distant InfiniBand fabrics to communicate at full bandwidth through 10Gbits/s Wide Area Networks. The WAN connection is managed out of band, and except for flight time induced latency is transparent to the InfiniBand hardware, stacks, operating systems and applications.

XR achieves flow control by shaping WAN traffic and managing buffer credits to ensure extremely high efficiency bulk data transfers — including RDMAs — making the system a highly effective transport mechanism for very large data sets between geographically separated InfiniBand equipment.

In switch mode, Longbow XR looks like a 2-port switch to the InfiniBand subnet manager. A point-to-point WAN link presents as a pair of serially connected 2-port InfiniBand switches spanning the conventional InfiniBand fabrics at each site. A single subnet spans the Wide Area Network connection, unifying what were separate subnets at each site.

Longbow XR also provides an InfiniBand router mode — improving global system manageability, scalability and robustness. In this mode, the sites remain separate subnets, with independent subnet managers, easing possible security and performance concerns associated with remote subnet management. 4x SDR InfiniBand provides just 8Gbits/s of data payload bandwidth; two totally independent Gigabit Ethernet links are also encapsulated across the WAN link to make full use of the extra bandwidth.

Longbow XR communicates over IPv6 Packet Over SONET (POS), ATM, and 10Gb Ethernet, as well as dark fiber applications.

Southwell is one of the smartest hardware engineers I’ve ever worked with. If he says he can do this, I’m willing to believe he can, given enough time. And if he’ll stop “improving” it and just ship.

The StorageMojo take
I-band has knocked about the industry for some time, a solution looking for that special problem that would provide volume and profits. With the growth of clusters - compute and storage - I believe it has found its niche. Long-haul I-band doesn’t solve distance latency problems, but it sure can move boatloads of data. As Google and others reach for 100x scaling, long-haul I-band could be a helpful tool.
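To put rough numbers on the latency and buffering point, here is a back-of-the-envelope Python sketch. It assumes light travels about 200,000 km/s in fiber (roughly two-thirds of c), uses the 8 Gbit/s 4x SDR payload figure from the Obsidian quote, and ignores switching and protocol overhead. The function names are made up for illustration and say nothing about how Longbow XR actually sizes its buffers.

```python
# Assumption: propagation speed in optical fiber, about 2/3 the speed of light.
FIBER_KM_PER_S = 200_000.0

# From the vendor text quoted above: 4x SDR InfiniBand payload bandwidth.
IB_4X_SDR_PAYLOAD_BPS = 8e9


def round_trip_s(distance_km):
    """Round-trip flight time over a fiber run of the given length."""
    return 2 * distance_km / FIBER_KM_PER_S


def bytes_in_flight(distance_km, payload_bps=IB_4X_SDR_PAYLOAD_BPS):
    """Bandwidth-delay product: data outstanding while waiting one round trip."""
    return payload_bps * round_trip_s(distance_km) / 8


for km in (0.3, 100, 1000):
    rtt_ms = round_trip_s(km) * 1000
    mb = bytes_in_flight(km) / 1e6
    print(f"{km:>7} km  RTT {rtt_ms:8.4f} ms  in flight {mb:10.4f} MB")
```

Under these assumptions a 1,000 km link has a 10 ms round trip and needs about 10 MB in flight to stay full, which is why credit-based flow control tuned for a few hundred meters does not stretch across a WAN without help.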

Comments welcome, of course. What is the state of Infiniband today?

Has Crosswalk closed its doors?

July 20th, 2007 by Robin Harris in Clusters

I received a tip from a reader yesterday that Crosswalk, a company I wrote about last year, has closed its doors.

I’ve called Crosswalk to confirm. I hope the tip is wrong, but many of the early players in the company have moved on already, which is rarely a good sign.

After describing their clustered NAS head for HPC I concluded with:

The StorageMojo take
Crosswalk, founded by Jack McDonnell, who had good success with McData, with CTO Raju Bopardikar, formerly of ill-fated Cereva, certainly has the bones for success. They’ve done a number of important things right: no host software; no custom silicon; commodity hardware; partnering where possible; horizontal scaling. This puts them ahead of getting-long-in-the-tooth startup BlueArc.

The High Performance Computing (HPC) focus is questionable. My experience is that folks who start with HPC stay there, because each HPC customer has so many interesting requirements that engineers love to solve and that will never make a dime for the company. Performance-driven customers ask for all kinds of enhancements that most commercial customers will never notice. So I wish them luck expanding past that market.

Another concern: the Denver location. STK culture - mainframe, big iron, slow to adapt - looms so large in storage circles there that there really haven’t been many successful storage startups. Jack overcame that at McData, although you might recall that McData sold mainframe ESCON directors to IBM for years before getting into, and being largely outmaneuvered in, the Fibre Channel market. Does Crosswalk really want to go after the big NetApp and EMC NAS boxes?

Crosswalk has the potential to upset the current NAS players. Yet I think they’ll need a stronger cost argument in addition to cool technology. Fortunately their architecture gives them lots of options. I wish them luck.

Comments welcome. Anyone have any first-hand knowledge of what happened?

All (almost) Seattle Conference on Scalability videos now online

July 10th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, Information Management

An alert reader sent this in as a comment this morning. Thank you!

As of Jul 10, 1:00am PDT, 10 of the talks have been published (including the Lustre and Verisign ones). Searching for “seattle conference on scalability” on google video seems to return most, but not all of them. Weird. Anyway here is a complete list of links:

Building a Scalable Resource Mgmt System for Grid Computing (Khalid Ahmed, Platform Computing)

Lustre File System (Peter Braam, Cluster File Systems)

Abstractions for Handling Large Datasets (Jeff Dean, Google)

Scalable Test Selection Using Source Code Deltas (Ryan Gerard, Symantec Corporation)

Lessons In Building Scalable Systems (Reza Behforooz, Google)

Using MapReduce on Large Geographic Datasets (Barry Brumitt, Google)

YouTube Scalability (Cuong Do Cuong, Youtube)

Scaling Google for Every User (Marissa Mayer, Google)

SCTP’s Reliability and Fault Tolerance (Brad Penoff, Mike Tsai, Alan Wagner, UBC)

VeriSign’s Global DNS Infrastructure (Scott Courtney, Pat Quaid, VeriSign)

I know how I’ll be spending an hour today.
I’m going to watch the YouTube talk, which was on at the same time as Amazon.

Still waiting for the Amazon talk. Hope it arrives soon. Even if it doesn’t you can read about it below.

Update: Dan Creswell reminded me that Amazon has a paper coming out in the first half of August. So maybe the video is waiting on that. I hope to review the paper once it ships.

Seattle Conference on Scalability videos

July 5th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, Information Management

The wily Googlers fooled me
I thought the videos were supposed to be on YouTube - the video service they bought for $1.6 billion a few months ago.

But NO!
They’re on Google Video. I just figured that out.

The good news: better quality on Google Video.

The bad news: I don’t see either the YouTube or the Amazon presentations up, so they probably won’t be. They were on at the same time and I chose the Amazon presentation. Who would have thought that a Google subsidiary wouldn’t give permission to publish their talk at a Google-sponsored conference? It isn’t on YouTube either.

Weird. Update: The redoubtable Dan Creswell, who also blogged about the Amazon talk, says that they are just a bit slow getting them up. Marissa Mayer’s afternoon keynote is now up. So let’s wait and see. Patience, grasshopper.

Anyone who attended the YouTube session want to trade notes?

Here are the links:
This links to Barry Brumitt’s entertaining and informative presentation on using MapReduce on large geographic data sets.

This is Jeff Dean’s excellent talk about abstractions for handling large data sets, but don’t let the title fool you, it covers a lot of ground on Google infrastructure.

And this is Reza Behforooz’s talk about integrating GoogleTalk with two large existing services.

There is a fourth talk by the founder of Platform Computing on Building a Scalable Resource Mgmt System for Grid Computing. I attended the first few minutes until my ADD kicked in. If you watch it, send me anything interesting you hear.

Comments welcome.

Seattle Scalability Conference, Pt II

June 28th, 2007 by Robin Harris in Architecture, Clusters, Future Tech

Building REALLY big clusters
You may be surprised to learn that Google DOESN’T build the world’s largest clusters. That honor goes to the government agencies who are Cluster File Systems Inc. customers. CFSI produces the Lustre File System, today’s high-end cluster file system, which is also available as an open source project.

Lustre stores data as objects on object storage servers which are managed by metadata servers which can also be a cluster for scale and uptime. This architecture is not unlike the pNFS proposal before the IETF.

How high is high?
Peter Braam, founder and CEO of CFSI, stated that they have clusters with over 25,000 nodes doing stuff that CFSI employees aren’t cleared to know. That is about 3x the biggest published Google cluster size.

They also have clusters that support 25,000 clients. For Google that’s a rounding error.

This line intentionally left blank
With such a monster file system would you expect networking to occupy half the code? Me neither. But that’s the word from Dr. Braam. Turns out that really high-end clusters might use any of some 10 networks. Let’s see: Ethernet, Fibre Channel, Myrinet, Infiniband, Quadrics - man, there must be a lot of high-end networks I’ve never heard of.

Double your pleasure
Storage tidbit: Peter reports that with 2000 disks he sees double disk failures every two months. And he thinks ZFS is “beautiful”. So beautiful that he is planning to support Lustre on Solaris with ZFS.

The pace is accelerating
With a Petabyte FS, Peter says Lustre can do 100 GB/sec sustained I/O supporting 25,000 clients. That is a lot of iTunes video.

He’s expecting to see the first Petaflop system in 1-2 years and 1 TB/sec growing to 10 TB/sec a few years later.

By 2020 - just over 12 years away - he expects to see Exascale computing:

  • 250 million cores
  • 2 million CPUs - 125 core CPUs
  • 250 TB/sec sustained bandwidth

With Terabit Ethernet and a really big switch fabric, I suppose you could. 10 Tb Ethernet would make it more manageable.
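A quick Python sanity check on those Exascale numbers, for anyone who wants to play with them. It assumes decimal units (1 TB = 10^12 bytes), zero protocol overhead, and perfectly even load, none of which a real fabric would deliver; the variable and function names are just for illustration.

```python
import math

# Figures from the talk as reported above.
CORES = 250_000_000
CPUS = 2_000_000
BANDWIDTH_TB_PER_S = 250

# 250 million cores spread over 2 million CPUs.
cores_per_cpu = CORES // CPUS

# Sustained bandwidth divided evenly across every core, in MB/sec.
per_core_mb_per_s = BANDWIDTH_TB_PER_S * 10**12 / CORES / 10**6

# The same aggregate bandwidth expressed in terabits per second.
bandwidth_tbit_per_s = BANDWIDTH_TB_PER_S * 8


def links_needed(link_tbit_per_s):
    """Fully loaded links needed to carry the aggregate bandwidth."""
    return math.ceil(bandwidth_tbit_per_s / link_tbit_per_s)


print("cores per CPU:", cores_per_cpu)
print("MB/sec per core:", per_core_mb_per_s)
print("1 Tb Ethernet links:", links_needed(1))
print("10 Tb Ethernet links:", links_needed(10))
```

The core count and the 125-core CPUs agree with each other, and the per-core share works out to about 1 MB/sec. The link counts show why 10 Tb Ethernet looks more manageable: an order of magnitude fewer fully loaded links for the same aggregate bandwidth.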

This is for you, ZFS team
With such large clusters, disconnection and subsequent reintegration of cluster nodes is a serious problem. Peter recommends that versioning become a standard part of cluster file systems because it helps keep everyone coordinated. I’d just like to have versioning so I know what I sent to people, or backed up, or just lost. Most people aren’t familiar with the concept, but I love it.

The StorageMojo take
After his informative and well-delivered talk I asked Peter if he expected pNFS to displace Lustre in the market. At the low-end, yes, once adoption gets under way. But he is confident that CFSI and Lustre will continue to own the high-end. They will support pNFS anyway, so they’ll be playing there as well.

Clearly, Lustre has some very high-end capabilities that will continue to make it attractive to the very high end. Yet CFSI is missing an opportunity to build a volume business by not going after the sub-100 node cluster market, which will become much more common in the enterprise over the next several years.

Comments welcome. More on the conference coming soon.

Seattle Conference on Scalability, Pt. I

June 26th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, Information Management

I survived Seattle’s “summer” weather
And the Google-sponsored Seattle Conference on Scalability. It was like spending 10 hours trying to drink from a fire hose. Great stuff.

I took notes on four of the sessions I attended. I would have taken more, but since Apple hasn’t shipped a notebook with a ten hour battery life I had to stop to recharge. It’s been so long since I wrote anything by hand that I can’t even read my handwriting any more.

This is a highly idiosyncratic account of the conference: I’m just talking about what I found interesting. Fortunately Google video’d the event and will put it up on YouTube. When I get the URL I’ll update this post.

Jeff Dean, senior architect at Google
Jeff is the architect of virtually every large scale system at Google. He kicked off the event with a keynote on scalability at Google. As I suspected, Google is looking for new ideas on scaling another 100x over the next few years. That would mean clusters of 500,000 to over 800,000 nodes - or at least cores.

Jeff noted that BigTable, Google’s storage system that runs on top of GFS, has about 500 cells, the largest of which holds up to 3000 terabytes of data.

The benefits of massive scale
Jeff talked about the impact of scale on machine translation, which is a major effort inside Google. The goal is to enable someone to ask a question in Urdu and get access to relevant documents no matter what language they are written in: the query is machine-translated into many languages, and the results are machine-translated back into Urdu.

The translation model is probabilistic rather than dictionary-based, so the more examples the system has to work with the better the translation. The MT team has found that translation accuracy increases 0.5% with each doubling of the training content. That means a *lot* of storage.
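To put a number on "a lot of storage", here is a back-of-the-envelope sketch. It assumes the 0.5% figure is a fixed gain per doubling that simply adds up, which is my reading of the talk and not something Jeff spelled out. The function names are made up for illustration.

```python
import math

# Gain per doubling of training data, as quoted in the talk.
# Assumption: gains add linearly, one fixed step per doubling.
GAIN_PER_DOUBLING = 0.5


def doublings_needed(target_gain_pct: float) -> int:
    """Number of doublings of training data needed to reach the target gain."""
    return math.ceil(target_gain_pct / GAIN_PER_DOUBLING)


def data_multiplier(target_gain_pct: float) -> int:
    """How many times larger the training corpus must become."""
    return 2 ** doublings_needed(target_gain_pct)


for gain in (1, 2, 5):
    print(f"+{gain}% accuracy: {doublings_needed(gain)} doublings, "
          f"{data_multiplier(gain):,}x the training data")
```

Under that assumption a 5% improvement takes ten doublings, or roughly a thousand times the training data.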

And a lot of I/O: over a million lookups per second. A lot of that is cached and it is still a lot of data.

Today’s Google rack
Jeff showed a picture of the current Google datacenter rack, which appeared to consist of 20 mobo’s, each with two dual-core Intel processors for a total of 80 cores per rack. There is a 4U gap in the middle of the rack, which I assume has the DC power distribution unit. It looked very neat and tidy, unlike the pictures of Google’s early racks.

MapReduce
I’ve meant to write about MapReduce, but I couldn’t quite get a handle on it. Jeff spent a fair amount of time describing the advantages of MapReduce, so now I have that handle.

MapReduce is essentially a programming model that abstracts the messy details of programming a large cluster. The Map piece extracts the data that one wants to work on into essentially a big spreadsheet or table, while the Reduce piece massages the data into the final form. With this tool a program of 50 lines can put thousands of compute nodes to work.
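For the curious, here is a toy word count in the MapReduce style. It is a single-process sketch of the idea, not Google's library: the function names are my own, and the real system spreads the map and reduce calls over many machines and handles failures for you.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (key, value) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, the step a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into the final result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(docs)
print(counts["the"], counts["fox"])  # prints "3 2"
```

The programmer writes only the map and reduce functions. Everything in between is the framework's job, which is how 50 lines can drive thousands of nodes.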

Google’s scalability challenges
Google is pretty happy with their tools, but it is American to want something better. And what they’d like is a single global namespace so that data can be accessed from anywhere. So the scalability number I offered at the beginning of this post may be way low. Instead of scaling a single cluster 100x, Google would actually like to scale and interconnect their entire cluster population - which I estimate is now over 4 million cores - 100x.

The StorageMojo take
Wow! More tomorrow as I continue the report on the conference.

Comments welcome, as always.

Google Seattle scalability conference

June 11th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, Information Management

I’m pumped!
Next week I’m flying to Seattle to attend a one day conference on scalability hosted by Google’s Kirkland office.

It is a great set of presentations with leading edge practitioners. Here’s the agenda, presenters and edited descriptions of the topics:

  • Keynote I: MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets - Jeff Dean
    Search is one of the most important applications used on the internet, but it also poses some of the most interesting challenges in computer science. Providing high-quality search requires understanding across a wide range of computer science disciplines, from lower-level systems issues like computer architecture and distributed systems to applied areas like information retrieval, machine learning, data mining, and user interface design. In this talk, I’ll highlight some of the behind-the-scenes pieces of infrastructure that we’ve built in order to operate Google’s services.
    • Breakout I: Lustre File System - Peter Braam
      This lecture will explain the Lustre architecture and then focus on how scalability was achieved. We will address many aspects of scalability mostly from the field and some from future requirements, from having 25,000 clients in the Red Storm computer to offering exabytes of storage. Performance is an important focus and we will discuss how Lustre serves up over 100GB/sec today going to 100TB/sec in the coming years. It will deliver millions of metadata operations per second in a cluster and, write 10’s of thousands of small files per second on a single node. If you like big numbers (but less than a Gogol) please come to this talk.

    • Breakout I: Building A Scalable Resource Management Layer for Grid Computing - Khalid Ahmed
      We will show how to build a centralized dynamic load information collection service that can handle up to 5000 nodes/20,000 cpus in a single cluster. The service is able to gather a variety of system level metrics and is extensible to collect up to 256 dynamic or static attributes of a node and actively feed them to a centralized master. A built-in election algorithm ensures timely failover of the master service ensuring high-availability without the need for specialized interconnects.

      This building block is extended to multiple clusters that can be organized hierarchically to support a single resource management domain that can span multiple data centers. We believe the current architecture could scale to 100,000 nodes/400,000 cpus. Additional services such as a distributed process execution service, and a policy-based resource allocation engine which leverage this core scale-out clustering service are described. The protocols, communication overheads, and various design tradeoffs that were made during the development of these services will be presented along with experimental results from various tests, simulations and production environments.

    • Breakout II: VeriSign’s Global DNS Infrastructure - Patrick Quaid, Scott Courtney
      VeriSign’s global network of nameservers for the .com and .net domains sees 500,000 DNS queries per second during its daily peak, and ten times that or more during attacks. By adding new servers and bandwidth, we’ve recently increased capacity to handle many times that query volume. Name and address changes are distributed to these nameservers every 15 seconds — from a provisioning system that routinely receives one million domain updates in an hour. In this presentation we describe VeriSign’s production DNS implementation as a context for discussing our approach to highly scalable, highly reliable architectures. We will talk about the underlying Advanced Transactional Lookup and Signaling software, which is used to handle database extraction, validation, distribution and name resolution. We also will show the central heads-up display that rolls up statistics reported from each component in the infrastructure.

    • Breakout II: Using MapReduce on Large Geographic Datasets & Google Talk: Lessons in Building Scalable Systems - Barry Brumitt, Reza Behforooz
      MapReduce is a programming model and library designed to simplify distributed processing of huge datasets on large clusters of computers. This is achieved by providing a general mechanism which largely relieves the programmer from having to handle challenging distributed computing problems such as data distribution, process coordination, fault tolerance, and scaling.

      Since launching Google Talk in the summer of 2005, we have integrated the service with two large existing products: Gmail and orkut. Each of these integrations provided unique scalability challenges as we had to handle a sudden big increase in the number of users.

  • Keynote II: Description TBD - Marissa Mayer
    • Breakout III: Stream Control Transmission Protocol’s Additional Reliability and Fault Tolerance - Brad Penoff, Mike Tsai, and Alan Wagner
      The Stream Control Transmission Protocol (SCTP) is a newly standardized transport protocol that provides additional mechanisms for reliability beyond that of TCP. The added reliability and fault tolerance of SCTP may function better for MapReduce-like distributed applications on large commodity clusters.

      SCTP has the following features that provide additional levels of reliability and fault tolerance. Selective acknowledgment (SACK) is built-in to the protocol with the ability to express larger gaps than TCP; as a result, SCTP outperforms TCP under loss. For cluster nodes with multiple interfaces, SCTP supports multihoming, which transparently provides failover in the event of network path failure. SCTP has the stronger CRC32c checksum which is necessary with high data rates and large scale systems. SCTP also allows multiple streams within a single connection, providing a solution to the head-of-line blocking problem present in TCP-based farming applications like Google’s MapReduce. Like TCP, SCTP provides a reliable data stream by default, but unlike TCP, messages can optionally age or reliability can be disabled altogether. The SCTP API provides both a one-to-one (like TCP) and a one-to-many (like UDP) socket style; use of a one-to-many style socket can reduce the number of file descriptors required by an application, making it more scalable.

      The additional scalability and fault tolerance come at a cost. The CRC32c checksum calculation currently is not off-loaded to any NIC available on the market, so it must be performed by the host CPU. In high bandwidth environments with no loss, SACK processing may become a burden on the host CPU.

    • Breakout III: Scalable Test Selection Using Source Code Deltas - Ryan Gerard
      As the number of automated regression tests increases, running all of them in a reasonable amount of time becomes more and more difficult, and simply doesn’t scale. Since we are looking for regressions, it is useful to home in on the parts of the code that have changed since the last run to help select a small subset of tests that are likely to find the regression. In this way we run only the tests that need to be run as the system gets larger and the number of possible tests scales outward. We have devised a method to select a subset of tests from an existing test set for scalable regression testing based on source code changes, or deltas.

    • Breakout IV: YouTube Scalability - Cuong Do
      This talk will discuss some of the scalability challenges that have arisen during YouTube’s short but extraordinary history. YouTube has grown incredibly rapidly despite having had only a handful of people responsible for scaling the site. Topics of discussion will include hardware scalability, software scalability, and database scalability.

    • Breakout IV: Challenges in Building an Infinite Scalable Datastore - Swami Sivasubramanian, Werner Vogels
      In this talk, we will present the design of one of our internal datastores, HASS. HASS is designed to be “always” available, i.e., it will always accept read/write requests even if disks are failing, routes are flapping or if datacenters are being destroyed by tornados. HASS is designed for incremental scalability where adding or removing nodes can be done easily and the load gets evenly distributed among the nodes uniformly without requiring any operator intervention. In this talk, we will focus on a single and one of the most crucial ideas in HASS’s design: its ability to partition data. HASS uses consistent hashing to partition its data across its storage nodes. The basic consistent hashing algorithm is well understood in the academic literature and several research systems have been designed using it. In this talk, we will discuss our experiences with using the basic consistent hashing algorithm and the optimizations we performed to achieve more uniform load distribution and ease of operation.
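The last abstract leans on consistent hashing, so here is a minimal sketch of the basic algorithm. This is the textbook version with virtual nodes, not HASS's implementation or its optimizations, which the abstract does not describe. The `HashRing` class, the node names and the choice of SHA-256 are all assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Textbook consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) tuples
        for node in nodes:
            self.add(node)

    @staticmethod
    def _position(label):
        # Map any string to a large integer position on the ring.
        return int(hashlib.sha256(label.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Each physical node owns many points, which evens out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._position(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def lookup(self, key):
        # A key belongs to the first point at or after its own position,
        # wrapping around to the start of the ring.
        if not self._ring:
            raise LookupError("ring is empty")
        index = bisect.bisect(self._ring, (self._position(key),)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"object-{i}" for i in range(1000)]
before = {key: ring.lookup(key) for key in keys}

ring.add("node-d")
after = {key: ring.lookup(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

The point of the technique shows up in the last few lines: adding a node moves only the keys that the new node takes over, so the rest of the data stays put. That is what makes incremental scaling practical.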

    Which ones should I attend?
    I’m torn between a couple of the breakout options. Lustre vs. scalable resource management. YouTube vs. infinitely scalable datastore.

    I know some of you folks are intimately involved with these topics, so I’d appreciate your suggestions, not only for which to attend, but what questions you’d like to see addressed. If some of you are also going to be there I’d also be pleased to meet f2f as well.

    That last breakout session is a really tough choice. How can I be in two places at once?

    While I’m up there I’m also hoping to tour Isilon’s lab and see their gear in action.

    Comments and suggestions welcome. Last I heard the conference was full with a waiting list.

Hot new 10Gb switch will shake up storage networks

April 17th, 2007 by Robin Harris in Clusters, Enterprise, SAN, FC

This morning Woven Systems announced their new 10 Gbit Ethernet switch. I named Woven “coolest hardware” at last year’s Datacenter Ventures conference. Harry Quackenboss, their CEO, promised they’d have the switch working in six months. Well, here it is a mere seven months later, and they’ve done it. My hat’s off to the engineering team.

Now let’s get into Woven’s Mojo.

I’d rather switch than fight
The switch is unique in several respects:

  • 10 Gigabit ethernet only
  • Up to 144 non-blocking ports on a single switch
  • Up to 4,000 non-blocking ports in a fabric of Woven switches
  • Built from commodity parts - with one vital exception
  • Low-cost
  • The killer feature: active congestion management
  • Uses standard ethernet protocols

What is it going to kill?
It shouldn’t be a surprise that fibre channel has some features that storage systems find really useful. After all, FC was developed as a storage interconnect. So it has bandwidth, flow control, low latency and rapid failover.

Gigabit ethernet lacks in all these areas: limited bandwidth; lost packets in congested networks; high IP latency; and failover that is too slow for storage drivers to manage.

It looks like Woven has solved 3 of the 4
Woven’s secret sauce is built into an ASIC that sits in front of the commodity 24 port ethernet chip (picture helpfully provided by Woven).

Woven switch blade

The vScale Packet Processor - I don’t know what the “v” stands for - inserts low-overhead probe packets into the data stream, which the vPP at the other end of the stream, be it in the same switch or one across a fabric, bounces back, so the originating vPP has a real-time view of path latency. In milliseconds or less. It works across a fabric of up to 4,000 ports, ensuring QoS even as the fabric grows.

That’s pretty cool, but the coolest thing is this:
When path latency is too high, the vPP has two tools it uses to manage the latency.

  • It can change to a less congested data path in less than 10ms
  • It can pause the HBA using a standard ethernet protocol

I know what you are thinking:
Wow, path failover in 10ms - drivers won’t even notice.
And
Pausing HBAs when congestion strikes is flow control for ethernet - a process FC handles with buffer credits.
All done using standard ethernet protocols, albeit creatively.
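Here is how I picture the decision logic, as a rough sketch. Woven hasn't published the algorithm, so the ordering (reroute first, pause only when every path is congested), the latency limit and all the names below are my assumptions.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    path: str
    pause_sender: bool


def manage_latency(current_path, probe_rtt_us, limit_us):
    """Choose an action from the latest probe round-trip times (microseconds).

    Hypothetical policy: stay put if the current path is healthy, otherwise
    move to the fastest path, and pause the sender only if no path is healthy.
    """
    if probe_rtt_us[current_path] <= limit_us:
        return Decision(current_path, pause_sender=False)  # nothing to do
    best_path = min(probe_rtt_us, key=probe_rtt_us.get)
    if probe_rtt_us[best_path] <= limit_us:
        return Decision(best_path, pause_sender=False)  # route around congestion
    return Decision(best_path, pause_sender=True)  # every path is congested


rtts = {"path-1": 900, "path-2": 120, "path-3": 300}
print(manage_latency("path-1", rtts, limit_us=500))
```

With these made-up numbers the traffic moves from path-1 to path-2 and the sender keeps running. Raise every round-trip time above the limit and the sketch falls back to pausing the sender.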

That bell you hear is tolling for Fibre Channel, which is about to meet its toughest competitor yet. Which may be why the FC over ethernet proposals are gathering steam in the T11 committee. Adding FC’s low latency protocol to a very fast and reliable 10 Gb switch adds real value and helps protect existing FC investment. Could be a nice win for all involved.

The StorageMojo take
I’m sure all the usual Internet Data Center suspects are lined up to beta Woven’s switch. Linking several hundred thousand servers via ethernet requires a lot of bandwidth, and 10GigE delivers. For the massive storage clusters it is an even bigger win: lost packets are still a pain even if the cluster can survive them.

If everything works as advertised, FC’s decline may be faster than forecast, at least among the large enterprise base that can use a switch of this size. Woven’s switch will be a shot in the arm for big clusters and the people who build them.

Update: I’d inadvertently left out the fact that you can cross-couple the switches to create a 4,000 port fabric so I’ve added it.

Update II: Harry, Woven’s CEO, helpfully added some budget pricing for all you folks with new fiscal years starting mid-year - like the Cisco tear-down guys - and I couldn’t just leave it buried in the comments.

Pricing will be finalized when general availability is announced (planned for Q3 2007), but a 144 10GE port configuration will be about $1500/10GE port, with fully-redundant fans, power supplies, and management cards.

Compare that to Cisco’s current $23k/port pricing and Riverstone’s very aggressive $10k/port pricing for full speed 10 Gb and the term “disruptive technology” just leaps to mind.
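To see what those per-port numbers mean for a full 144-port configuration, here is the arithmetic as a quick sketch. It assumes the quoted Cisco and Riverstone per-port prices would simply scale to 144 ports, which is my simplification: neither price was quoted for a configuration of that size.

```python
PORTS = 144

# Per-port prices in US dollars, as quoted above.
PRICE_PER_PORT = {
    "Woven (budgetary)": 1_500,
    "Riverstone": 10_000,
    "Cisco": 23_000,
}


def fabric_cost(ports, price_per_port):
    """Total cost of a configuration at a flat per-port price."""
    return ports * price_per_port


woven = fabric_cost(PORTS, PRICE_PER_PORT["Woven (budgetary)"])
for vendor, price in PRICE_PER_PORT.items():
    total = fabric_cost(PORTS, price)
    print(f"{vendor:18} ${total:>9,}  ({total / woven:.1f}x Woven)")
```

That works out to $216,000 for Woven against $1,440,000 and $3,312,000 at the other two price points.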

Comments welcome, of course. I spent six hours at NAB today and drove over 1,000 km, so moderation may be a bit sluggish today. Me too.


