StorageMojo




Robin Harris    


Mission-critical OSS: an oxymoron?

August 1st, 2007 by Robin Harris in Architecture, Future Tech

Think high-end enterprise requirements

  • Mission critical reliability
  • High performance & scalability
  • “One throat to choke” support

Would you expect to find free open-source software crowding out closed proprietary solutions? It is at JPMorgan and other financial services companies.

Enterprise message bus
No, this isn’t storage software, but it is close.

I came across this article Toward a Commodity Enterprise Middleware in ACM Queue (ACM membership required, I think) by John O’Hara, a senior architect and distinguished engineer at JPMorgan. In it he describes the forces and the process that led to the development of AMQP (Advanced Message Queuing Protocol), an open spec EMB with several open-source instantiations which I list later.

I found it interesting because of the parallels I see between EMBs and data storage, especially high-end data storage. The business models of the companies that support AMQP are interesting as well, since enterprises must have good support even for free software.

RAID is middleware between servers and disks
EMB’s are middleware that enable independent applications to work together through message passing. Common protocols include store-and-forward, publish and subscribe and others. They commonly run over Ethernet and Infiniband and are designed to be usable on many operating systems and programming languages.

Big banks are pushing EMB performance thanks to automated trading systems that can generate as many as 500,000 events per second. The reliability of the EMB under heavy load is paramount.

Why an OSS EMB?
The usual reasons. Proprietary EMBs are costly. They don’t play well with other EMBs. And the banks found that their needs often outstripped what the closed vendors could provide. John got tired of waiting for an open source version and led the effort himself.

He notes that to make AMQP happen he felt he had to meet some some requirements. I’ll quote John at length:

This had to be an industry initiative. Home-grown middleware could not thrive in the small market available within a host organization, even the largest host.

It is also notable that pervasive networking standards such as Ethernet, the Internet Protocol, e-mail, and the Web share some traits. They are all royalty-free and unencumbered by patents, they are all publicly specified, and they all shipped with a useful early implementation for free. The combination of freedom and usefulness drives their adoption when predicated on fitness for purpose.

To succeed, AMQP needed to adopt these same characteristics:

  • It needed to be a fully defined, open, royalty-free, unpatented specification to enable anyone to implement a compatible service. Furthermore, the standard specification had to be clearly separate from the implementations; otherwise, it would not be a fair market for commercial entities to enter. AMQP had to be appealing for commercial implementation and exploitation or it would not succeed.
  • AMQP needed to have real implementations of the specification; otherwise, the specification would not be immediately useful or interesting to front-line developers with pressing needs. Ideally, it should have more than one implementation to qualify as a potential IETF draft standard. So, there are real implementations you can run today (as detailed later in the article).
  • AMQP software had to be proven in live systems. Middleware is a critical piece of any system and must be trusted. That trust has to be earned. To this extent, it was clear we would have to deploy an implementation in a high-profile, mission-critical application to assuage the fears of other early adopters. So, a combination of OpenAMQ and Qpid are live at JPMorgan, supporting 2,000 users on five continents and processing 300 million messages per day.
  • Finally, and most importantly, AMQP needed to be a collective effort. Openness to partnership and the ideas of others had to be there from the beginning. To this end, we carefully selected a partner to co-develop the specification and implement the software. We chose iMatix, a boutique European development house that had clearly demonstrated a commitment to open source and sound ethics, and had a strong engineering background and excellent writing abilities.

Because the project was sponsored by a bank, it also had to “wash its own face,” as they say. This was not a research project. Through sheer good luck, there was a need to refresh some large systems with very specific requirements. This provided a tangible return for AMQP investment, so I was able to convince a forward-looking CIO that AMQP was the way to go.

John goes on to make valuable points about other issues such as governance (where Java stumbled), user-driven implementation and architecture, which are important for any product development effort, with governance the special issue for OSS.

Here’s a list of suppliers and resources on AQMP

The StorageMojo take
The emergence of commodity cluster-based storage - as epitomized by the Google FS - has demonstrated the value of an open implementation of a similar product. We have a venture-backed OSS router company, but no storage companies. Are VC’s clueless about storage? Or is storage doomed to remain riven by proprietary products with poor interoperability?

Maybe some of the Google engineers whose options have vested will start the company to commoditize cluster storage. Storage is a conservative realm, but there are enough high-growth, high-capacity apps to create an opening for the right product. And even storage people will change if they have good reason to.

Comments welcome, of course. And what is the name of that GFS-like OSS project? Keywords are no good if you can’t remember them! Update: the first commenter, Jim, got the one I was thinking of: Hadoop DFS. Thanks!

Powering a warehouse-sized computer - part 2

July 31st, 2007 by Robin Harris in Architecture, Enterprise

Cumulative distribution functions
The paper describes three large-scale Google workloads:

  • Websearch: high request throughput and large data processing requirements for each request - IIRC 70-100 MB of data searched per request - with variable demand by time of day.
  • Gmail: A more disk-intensive service whose machines tend to be configured with more disk drives, with each request accessing a relatively small number of servers. Also correlated with TOD.
  • Mapreduce: large, offline batch jobs, running on hundreds or thousands of machines involving terabytes of data.

They picked a sample of 5,000 servers running each of the workloads. Since they want to limit power usage peaks and valleys, they took a close look at power usage at the limit.

Websearch CDF

Websearch
The full distribution diagram shows that rack power usage starts at 45% of peak load and rises to 98% when, presumably, all the machines in the rack are operating near peak power. Most of the time the rack is in the 60-80% range.

Portfolio effect
The other curves show the “PDU” (power distribution unit) curve of about 20 racks or 800 machines, while the “cluster” curve is the CDF for about 5,000 machines. The curves display the smoothing effect of large numbers so the entire cluster never gets above 93%.

Here are the figures for Gmail and Mapreduce:
Webmail

Mapreduce

Google’s own commercial data center?
Google could have stopped with their own highly tuned applications. Instead the included an unnamed data center whose workloads are lower because they are

. . . in development, or simply not highly loaded. For example, machines can be assigned to a service that is not yet fully deployed, or might be in various stages of being repaired or upgraded, etc. Figure 9 shows the power CDF for one such datacenter. We note the same trends as seen in the workload mix, only much more pronounced. Overall dynamic range is very narrow (52 - 72%) and the highest power consumption is only 72% of actual peak power. Using this number to guide deployment would present the opportunity to host a sizable 39% more machines at this datacenter.

datacenter

Is this a Google datacenter for the company’s internal operations, as posited in So what does Google use when they aren’t Googling?

Power save
The point of this study was to evaluate power-saving techniques once power usage was understood. Google evaluated two power saving techniques.

CPU voltage/frequency scaling has been implemented on some AMD and Intel chips. When the processor is less busy it reduces its input voltage and clock frequency to save power.

Modeling CPU v/f scaling under data center loads, Google found that datacenters could see power savings of 15-25%, depending on how aggressively it was used. They also found that I/O bound servers benefitted less than compute-intensive workloads.

Non-peak power efficiency
Another option is to improve non-peak power efficiency. Most “efficiency per watt” metrics are based on peak loads, but Google found that systems spend little time at “peak”. Google found that idle systems power never dropped below 50% of peak load, while ideally an idle system’s consumption would drop to zero.

If idle power were only 10% of peak power, an enterprise data center could save 50% on its power. Even Google’s well-behaved apps would see savings in the 30%+ range.

The StorageMojo take
Google’s been pushing power savings for five years now and it is all to the good. The average data center has so many high-cost items that power costs don’t garner much attention. Google’s unique infrastructure makes power important to them. And since they buy a half million servers a year vendors care.

The important finding of this paper is that the big power savings come from optimizing non-peak power usage. This means optimizations at the motherboard and sub-system levels - like storage - that many companies can contribute to.

Google will continue to push the non-peak power efficiency issue hard, and that is good for all of us. For home and office users the savings could be substantial since typical office tasks rarely stress a system. While many question the validity of Google’s cluster-based architecture for data center use, their focus on power-saving will benefit us all.

Comments welcome, of course. Wes, this is your area of research interest. What is your perspective?

So what does Google use when they aren’t Googling?

July 27th, 2007 by Robin Harris in Architecture, Enterprise

A reader wrote me a note that asks a question that I think is on the minds of many data center folks. He said it well himself, so I’ll quote liberally, starting with the compliment.

I really enjoy reading your blogs!

One thing I’ve notices in your blog, other blogs, and all over the computer media is how Google keeps coming up as an example. It’s how Google leverages commodity servers, drives, custom software (Google filesystem), etc, etc. It’s as if everyone thinks Google should be emulated.

Ok . . .I can buy that . . . but I wonder, how much of what Google does is really applicable to my work. I work for a utility. The vast majority of our IS computer resources go toward transaction processing type systems.

What you _NEVER_ read about Google is what computer resources they use for their internal business processes. You only hear about what resources they use for their products (search, gmail, maps, etc). When I read about how wonderful the Google infrastructure is and how it should be emulated, I always start wondering if the described infrastructure (commodity server, cheap disk, google filesystem and massive parallelism) are also used for their billing, payroll, accounting systems, and whatever? Do they use custom written software for these, or Oracle apps or SAP? Do they use a database, which one? What disk systems do they use with it? How is it laid out?

It always interesting to hear how Google does things, expecially since they are so secret about it, but I’m not convinced that what does come out is useful to us, or is a very complete picture of their infrastructure. I guess I wonder if buried deep in their datacenters is a more normal infrastructure like ours.

Good questions - one’s I’ve often asked myself.
Google the 10,000 person company doesn’t have nearly the clout that Google, the world’s largest internet advertising company and, more importantly, buyer of 500,000 servers a year, does. If they asked IBM to clusterize MVS to win an order they’d get laughed at just like anyone else.

But dangle a few hundred million a year in front of Intel while asking them to do things their engineers want to do anyway and that is much warmer.

I suspect that their offices and internal data centers look a lot like yours, at least for the database business apps - the corporate underwear. But I bet they back up their unstructured data on GFS - why not?

Linux, PCs and Macs
I know they use Macs and PCs and that, at the very least, they outsource some of their IT work to people using Microsoft server products. They may even have Microsoft servers inside the company, though I’ve never seen evidence of it.

However, I have never held up Google’s infrastructure as one that could be used to count money. Check out the StorageMojo take on the Google File System and I said as much.

Amazon is a different story
The more appropriate example is Amazon. They have millions of customers, they count billions of dollars, they customize each web page on the fly and they do it with a services-based distributed architecture based on open source software clusters. They scale well. And they arrived at that architecture only after trying all the “enterprise” products, including a mainframe. They not only built it, they migrated to it from a very large installed base.

If Werner Vogels ever decides to build his own company, that would be the pitch.

Amazon does transaction processing on a cluster. That is the enterprise problem.

Amazon is the company IT architects should be studying. They just don’t publish very much.

Enterprise forever
I don’t believe that “enterprise” hardware and software are going away in my lifetime, any more than the mainframe has or probably will. What will shift is the growth. When the market shifts, the weaker players will fold or consolidate, just as they did in the mainframe market.

But with 85%+ of digital data in ordinary files, even mid-range RAID solutions are overkill. Big blobs of cheap cluster storage would solve all kinds of IT problems. Back up window closing fast? Back up to a storage cluster sized to be a 6-10 week FIFO buffer. I suspect there are many data center applications for cheap cluster storage today if someone offered a reasonable product and notoriously conservative IT managers tried them.

Enterprise growth rate
Moore’s Law is driving up CPU power faster than enterprise application growth rates. The enterprise market share has been shrinking for years, and in the next five years that market’s growth could stall entirely.

The StorageMojo take
Google is a fun story, the way Microsoft was in the 1980’s. They picked up a lot of ideas that folks had been working on for, in some cases, decades and rolled them out in a big way. They’ve produced something we’d never seen before even though much of it was percolating around CompSci departments for years. The antics of the boy billionaires makes good copy.

The real power of Google will be seen when the computer scientists who are now multi-millionaires get tired of working for a big company and decide to see if lightning can strike twice. For most of them it won’t, but what the hey, they didn’t go into for the money anyway. They’re the hot rodders of the digital age, channeling, chopping, stroking and boring the bits to create beauty, handling and speed. With luck, all three.

Comments welcome, of course. Have a good weekend!

Powering a warehouse-sized computer - part 1

July 26th, 2007 by Robin Harris in Architecture, Enterprise

Power is probably the least understood/most widely used technology in computing. We don’t understand it, - what is a ground loop?- we rarely measure it, and its behavior is a mystery. Labels are no help either. My computer spec is “100-240 V alternating current, 12 A (low voltage range) or 6 A (high voltage range), 50-60 Hz”.

Does it really use almost as much power as a hair dryer? I don’t think so.

Google power
Beginning 5 years ago, Google took the lead in making a power consumption an IT vendor issue. Today, Intel’s power hungry NetBurst architecture is history and power-efficient multi-core architectures are all the rage. Google wasn’t the only factor, but their use of free software, cheap hardware and massive scale meant that energy consumption became one of the few places they could cut costs.

The fact that they are purchasing over a half million servers a year didn’t hurt either.

Using the data-intensive methodologies their scale enables, Google has now published results of the their studies of data center power.

Googlers take a long hard look at power
What are the key determinants of data center power consumption? How can data centers maximize the return on their big investments? What, if any, technologies may help data centers become more efficient? Google now gives us one large-scale data point.

Power Provisioning for a Warehouse-sized Computer by Xiaobo Fan, Wolf-Dietrich Weber and Luiz André Barroso sheds light on creating energy efficient data centers and on Google’s operations. Anil Gupta over at the Network Storage blog turned me on to this - thanks Anil - a fact I spaced on while writing this.

Power to the servers
In the usual massive-scale Google style, the paper looks at the power consumption of groups of up to 15,000 servers. That’s a tiny fraction of the server population of a recent Google data center, but large enough to be useful for the rest of us with less exalted infrastructures.

A datacenter costs $10-$20 per deployed watt of peak computer power, excluding cooling and other loads. Ideally you’d build and operate the data center at 99% level all the time. The problem is that equipment power ratings are pretty useless for determining peak load.

This leads to a couple of problems. The cost of the datacenter actually exceeds the cost of power for 10 years of operation. I ran the numbers for the power costs at the new Oregon facility and it works out to about 50 cents per watt/year, or $5 for 10 years. So it maximizes Google’s investment to keep their power consumption pegged.

Not only that, but Google gets charged based on their peak watt/hour power consumption - at least in Oregon. If they do one hour at 100 MW and the rest of the month at 25 MW, they get charged for consuming a 100 MW for a month. That’s a little different than us home users.

Therefore it makes sense for them to utilize the data center’s capacity as fully as possible, and as uniformly as possible, so they don’t overbuild the datacenter or overpay for the power. They really need to understand power consumption.

So what did they figure out?
Well, a whole heck of a lot. Here’s some key findings.

  • The gap between aggregate and spec power can be as great as 40% for a datacenter, though Google’s applications are better behaved
  • Dynamic power management is most useful for preventing overloads
  • Power management is more effective at the datacenter level than at the rack level

Stop back soon for part 2 of Powering a warehouse-sized computer

Long-haul Infiniband

July 25th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, SAN, FC

I’ve liked Infiniband ever since I learned about it at YottaYotta in 2000. The switches are fast and cheap, the latency very low and the bandwidth - 6 GB/sec full-duplex at 12x - stunning. (Cisco has an excellent technical overview introduction here.)

One thing it didn’t do, though, was handle distance. Even fiber-based IB was limited to a few hundred meters. A great computer room interconnect, but not so good for the disaster-tolerant configurations that YottaYotta’s cluster-based RAID controller was hoping to address.

YY made due with gigE links, and managed some impressive demonstrations of terabyte long-distance data transfers. Just the thing for a long weekend at the lake.

Of course, there is a downside
Infiniband was designed to be more a fixed resource like Fibre Channel than an easy-come, easy-go WAN like Ethernet. Five years ago the management was less than optimal. Some 3rd-party tools were available from Voltaire - hey, guess who’s going public! - but most folks ended up writing their own management. But if you want an “always on” network this isn’t a big problem.

Putting all one’s eggs in one basket was something that always concerned me. A single data center, no matter how well-built, is asking for trouble. I mocked up this up to dramatize the issue:

eggs

Ideally, Infiniband would at least offer metro are networking for redundancy. I don’t think you can buy it yet, but long-haul I-band may be coming.

Enter Obsidian Research
Meanwhile, up in northern Alberta, one of YY’s former whizzes, David Southwell, formed Obsidian Research, dedicated to taking I-band long-haul. The company says

Longbow XR allows arbitrarily distant InfiniBand fabrics to communicate at full bandwidth through 10Gbits/s Wide Area Networks. The WAN connection is managed out of band, and except for flight time induced latency is transparent to the InfiniBand hardware, stacks, operating systems and applications.

XR achieves flow control by shaping WAN traffic and managing buffer credits to ensure extremely high efficiency bulk data transfers — including RDMAs — making the system a highly effective transport mechanism for very large data sets between geographically separated InfiniBand equipment.

In switch mode, Longbow XR looks like a 2-port switch to the InfiniBand subnet manager. A point-to- point WAN link presents as a pair of serially connected 2-port InfiniBand switches spanning the conventional InfiniBand fabrics at each site. A single subnet spans the Wide Area Network connection, unifying what were separate subnets at each site.

Longbow XR also provides an InfiniBand router mode — improving global system manageability, scalability and robustness. In this mode, each site remain separate subnets, with independent subnet managers, easing possible security and performance concerns associated with remote subnet management. 4x SDR InfiniBand provides just 8Gbits/s of data payload bandwidth; two totally independent Gigabit Ethernet links are also encapsulated across the WAN link to make full use of the extra bandwidth.

Longbow XR communicates over IPv6 Packet Over SONET (POS), ATM, and 10Gb Ethernet, as well as dark fiber applications.

Southwell is one of the smartest hardware engineers I’ve ever worked with. If he says he can do this, I’m willing to believe he can, given enough time. And if he’ll stop “improving” it and just ship.

The StorageMojo take
I-band has knocked about the industry for some time, a solution looking for that special problem that would provide volume and profits. With the growth of clusters - compute and storage - I believe it has found its niche. Long-haul I-band doesn’t solve distance latency problems, but it sure can move boatloads of data. As Google and others reach for 100x scaling, long-haul I-band could be a helpful tool.

Comments welcome, of course. What is the state of Infiniband today?

Two summer scalability conferences

July 22nd, 2007 by Robin Harris in Architecture, Future Tech

Still wishing you’d made the Seattle Conference on Scalability last month?
There’s a couple of upcoming East Coast meetings that look worthwhile. These are focused on compute intensive cluster computing, not Internet Data Center workloads, but many of the issues overlap.

The Commodity Cluster Symposium 2007 (CCS2007)

. . . addresses challenges associated with the rapidly expanding interest and acceptance of the use of commodity computer clusters for scientific applications.

This year’s focus is using commodity clusters to solve large-scale scientific problems. One of the co-developers of the Beowulf cluster architecture will keynote.

It is in Annapolis, MD and runs from July 23 through the 25th. Sorry for the late notice. You can read more here.

HECIWG FSIO
I hope I’ve decoded the acronyms correctly: the High End Computing Inter-agency Working Group’s (HECIWG) File Systems & I/O (FSIO) 2007 Workshop. August 5-8 in Arlington, VA.

HECIWG is a group devoted to coordinating the US Government’s work on big computing. There are a bunch of agencies that spend a lot of money on computing R&D. Until the working group, they didn’t much talk to each other about it. A little cross-fertilization among agencies, academics and a few companies can’t hurt.

Some interesting topics:

  • Microdata Storage Systems
  • Performance Insulation and Predictability For Shared Cluster Storage
  • Petascale I/O for High End Computing
  • Toward Automated Problem Analysis of Large Scale Storage Systems

Oddly enough, few of these folks seem to publish their stuff on the web so mere mortals can look at it. So I’m thinking about attending.

Centers for high-end computing research
I purloined this list of HEC research sites from the LANL website. It surprised me a little. I didn’t expect to see UCSC on the list, although I’ve looked at some of the research and researchers there. And I’d never heard of anything at Stony Brook. Cool.

All (almost) Seattle Conference on Scalability videos now online

July 10th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, Information Management

An alert reader sent this in as a comment this morning. Thank you!

As of Jul 10, 1:00am PDT, 10 of the talks have been published (including the Lustre and Verisign ones). Searching for “seattle conference on scalability” on google video seems to return most, but not all of them. Weird. Anyway here is a complete list of links:

Building a Scalable Resource Mgmt System for Grid Computing (Khalid Ahmed, Platform Computing)

Lustre File System (Peter Braam, Cluster File Systems)

Abstractions for Handling Large Datasets (Jeff Dean, Google)

Scalable Test Selection Using Source Code Deltas (Ryan Gerard, Symantec Corporation)

Lessons In Building Scalable Systems (Reza Behforooz, Google)

Using MapReduce on Large Geographic Datasets (Barry Brumitt, Google)

YouTube Scalability (Cuong Do Cuong, Youtube)

Scaling Google for Every User (Marissa Mayer, Google)

SCTPs Reliability and Fault Tolerance (Brad Penoff, Mike Tsai, Alan Wagner, UBC)

VeriSign’s Global DNS Infrastructure (Scott Courtney, Pat Quaid, VeriSign)

I know how I’ll be spending an hour today.
I’m going to watch the YouTube talk, which was on at the same time as Amazon.

Still waiting for the Amazon talk. Hope it arrives soon. Even if it doesn’t you can read about it below.

Update: Dan Creswell reminded me that Amazon has a paper coming out in the first half of August. So maybe the video is waiting on that. I hope to review the paper once it ships.

Seattle Conference on Scalability videos

July 5th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, Information Management

The wily Googlers fooled me
I thought the videos were supposed to be on YouTube - the video service they bought for $1.6 billion a few months ago.

But NO!
They’re on Google Video. I just figured that out.

The good news: better quality on Google Video.

The bad news: I don’t see either the YouTube or the Amazon presentations up, so they probably won’t be. They were on at the same time and I choose the Amazon presentation. Who would have thought that a Google subsidiary wouldn’t give permission to publish their talk at a Google sponsored conference. It isn’t on YouTube either.

Weird. Update: The redoubtable Dan Creswell who also blogged about the Amazon talk, says that they are just a bit slow getting them up. Marissa Meyer’s afternoon keynote is now up. So let’s wait and see. Patience, grasshopper.

Anyone who attended the YouTube session want to trade notes?

Here are the links:
This links to Barry Brummit’s entertaining and informative presentation on using MapReduce on large geographic data sets.

This is Jeff Dean’s excellent talk about abstractions for handling large data sets, but don’t let the title fool you, it covers a lot of ground on Google infrastructure.

And this is Reza Behforooz’s talk about integrating GoogleTalk with two large existing services.

There is a fourth talk by the founder of Platform Computing on Building a Scalable Resource Mgmt System for Grid Computing . I attended the first few minutes until my ADD kicked in. If you watch it send me anything interesting you hear.

Comments welcome.

How Yahoo can beat Google

July 5th, 2007 by Robin Harris in Architecture, Enterprise

Google has pummeled Yahoo into near-obscurity: the early search leader - the Google of the 1990’s - Yahoo’s market cap is a fraction of GOOG’s while their search share is a distant second. It is easy to forget that Yahoo is actually a large and highly profitable company by most standards - over $6 billion in sales and more than $700 million in profits in 2006.

But when you’re competing against the baddest internet company around, good isn’t good enough. Which is why ex-CEO Terry Semel was shown the door and a new team, co-founder Jerry Yang and CFO Mary Decker, are now at bat.

Sclerotic decision-making
Analysts point to Yahoo’s inability to make fast decisions to acquire hot properties, like FaceBook and YouTube, as indicative of the company’s malaise. A compelling vision and crisp decisions would surely help.

Yet for all their success, the Google’s top management is hardly more experienced than Yahoo’s new team. It is the battle of the billionaire geeks.

But Yahoo has a much bigger problem than corporate culture. How about some basic blocking and tackling?

Profit = Revenue minus Cost.
Financial analysts are focused on Yahoo’s missed revenue opportunities. What about the cost side?

Bringing a knife to a gun fight
Yahoo’s infrastructure is built like a very large enterprise data center with brand name products. For example, Yahoo’s very successful mail system is run on NetApp filers. (Is it just me or does that NetApp page sound a little off?) Yahoo does use as much free software - FreeBSD, Apache and Perl - as they can, so their problem is hardware capital expense and operating expense.

Google’s infrastructure is built on commodity PC products. Cheap SATA drives velcro’d on quad-core mobo’s instead of high-performance network storage. A software layer that optimizes and manages across the cluster, so people don’t have to. No costly RAID arrays. No RAID at all, just 3x - or more - replication.

Cheaper, better, faster: pick any three
I compared Yahoo and Google IT cost structures a year ago and found that for every dollar Google and Yahoo invest in IT, Google generates 50-60% more revenue with 4,000 fewer people. Since then, I’ve estimated that Google has a 5-8x cost advantage per user I/O over Yahoo. That includes search, mail, and other services.

Google’s cost advantage gives it two huge advantages:

  • Its ROI bar is lower than competitors, so it can afford to make improvements that competitors can only dream of.
  • Its lower costs mean that even if a deep-pocketed competitor like Microsoft wanted to eliminate the profits in AdSense, they’d be bleeding red ink while Google broke even.

The low-cost producer doesn’t always win. But they sure have more maneuvering room than higher-cost producers.

So how can Yahoo win?
This isn’t rocket science.

  • Get IT costs under control. Go to a Google-like infrastructure using commodity products.
  • Defend your market leadership where you have it, like Yahoo mail. Google’s marketing is pretty much MIA - as the $1.6 billion purchase of YouTube reflected. Google is pleased with the growth of Gmail, which is largely a function of people opening multiple email accounts to take advantage of the $10 Google Checkout promotion, but Yahoo has the users.
  • Add capabilities to its new commodity-based infrastructure, such as transaction processing, that Google can’t easily add, and use it to drive new business and traffic.

The StorageMojo take
Google looks invulnerable, but their inability to win outside of search and advertising points to just how weak their management and their market position is. They don’t have the integration skills of a Cisco - look at how poor YouTube’s search function still is months after purchase by the world leader in search - and while a huge bag of money can rent a lot of love, Microsoft has proven that it can’t buy internet success.

Yahoo has an opportunity to catch the next wave of search goodness if they are aggressive about bringing their infrastructure costs down. If not, it doesn’t matter what they do on the revenue side: they will continue a long, slow decline. They’ll be good, but they’ll never be great.

Comments welcome, as always. And as an aside I’m here to report that Brad Bird’s new movie, Ratatouille is a must see - even better than The Iron Giant and The Incredibles. The animated short Lifted is hysterical. One caution: I don’t think the movie is suitable for some pre-schoolers due to disturbing images of swarming vermin. Wait for the DVD.

Seattle Scalability Conference, Pt II

June 28th, 2007 by Robin Harris in Architecture, Clusters, Future Tech

Building REALLY big clusters
You may be surprised to learn that Google DOESN’T build the world’s largest clusters. That honor goes to the government agencies who are Cluster File Systems Inc. customers. CFSI produces the Lustre File System, today’s high-end cluster file system, which is also available as an open source project.

Lustre stores data as objects on object storage servers which are managed by metadata servers which can also be a cluster for scale and uptime. This architecture is not unlike the pNFS proposal before the IETF.

How high is high?
Peter Braam, founder and CEO of CFSI, stated that they have clusters with over 25,000 nodes doing stuff that CFSI employees aren’t cleared to know. That is about 3x the size of the biggest published Google cluster size.

They also have clusters that support 25,000 clients. For Google that’s a rounding error.

This line intentionally left blank
With such a monster file system would you expect networking to occupy half the code? Me neither. But that’s the word from Dr. Braam. Turns out that really high-end clusters might use any of some 10 networks. Let’s see: Ethernet, Fibre Channel, Myrinet, Infiniband, Quadrics - man, there must be a lot of high-end networks I’ve never heard of.

Double your pleasure
Storage tidbit: Peter reports that with 2000 disks he sees double disk failures every two months. And he thinks ZFS is “beautiful”. So beautiful that he is planning to support Lustre on Solaris with ZFS.

The pace is accelerating
With a Petabyte FS, Peter says Lustre can do 100 GB/sec sustained I/O supporting 25,000 clients. That is a lot of iTunes video.

He’s expecting to see the first Peraflop system in 1-2 years and 1 TB/sec growing to 10 TB/sec in a few years later.

By 2020 - just over 12 years away - he expects to see Exascale computing:

  • 250 milion cores
  • 2 million CPUs - 125 core CPUs
  • 250 TB/sec sustained bandwidth

With Terabit Ethernet and a really big switch fabric, I suppose you could. 10 Tb Ethernet would make it more manageable.

This is for you, ZFS team
With such large clusters the problem of disconnections and subsequent reintegration of cluster nodes is a serious problem. Peter recommends that versioning become a standard part of cluster file systems because it helps keep everyone coordinated. I’d just like to have versioning so I know what I sent to people, or backed up, or just lost. Most people aren’t familiar with the concept, but I love it.

The StorageMojo take
After his informative and well-delivered talk I asked Peter if he expected pNFS to displace Lustre in the market. At the low-end, yes, once adoption gets under way. But he is confident that CSFI and Lustre will continue to own the high-end. They will support pNFS anyway, so they’ll be playing there as well.

Clearly, Lustre has some very high-end capabilities that will continue to make it attractive to the very high end. Yet CFSI is missing an opportunity to build a volume business by not going after the sub-100 node cluster market, which will become much more common in the enterprise over the next several years.

Comments welcome. More on the conference coming soon.

Seattle Conference on Scalability, Pt. I

June 26th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, Information Management

I survived Seattle’s “summer” weather
And the Google-sponsored Seattle Conference on Scalability. It was like spending 10 hours trying to drink from a fire hose. Great stuff.

I took notes on four of the sessions I attended. I would have taken more, but since Apple hasn’t shipped a notebook with a ten hour battery life I had to stop to recharge. It’s been so long since I wrote anything by hand that I can’t even read my handwriting any more.

This is a highly idiosyncratic account of the conference: I’m just talking about what i found interesting. Fortunately Google video’d the event and will put it up on YouTube. When I get the URL I’ll update this post.

Jeff Dean, senior architect at Google
Jeff is the architect of virtually every large scale system at Google. He kicked off the event with a key note on scalability at Google. As I suspected, Google is looking for new ideas on scaling another 100x over the next few years. That would mean clusters of 500,000 to over 800,000 nodes - or at least cores.

Jeff noted that BigTable, Google’s storage system that runs on top of GFS has about 500 cells, the largest of which is up to 3000 terabytes of data.

The benefits of massive scale
Jeff talked about the impact of scale on machine translation, which is a major effort inside Google. The goal is to enable a someone to ask a question in Urdu and to get access to relevant documents no matter what language they are written in through machine translation of their query into many languages with machine translation back into Urdu.

The translation model is probabilistic rather than dictionary-based, so the more examples the system has to work with the better the translation. The MT team has found that translation accuracy increases 0.5% with each doubling of the training content. That means a *lot* of storage.

And a lot of I/O: over a million lookups per second. A lot of that is cached and it is still a lot of data.

Today’s Google rack
Jeff showed a picture of the current Google datacenter rack, which appeared to consist of 20 mobo’s, each with two dual-core Intel processors for a total of 80 cores per rack. There is a 4U gap in the middle of the rack, which I assume has the DC power distribution unit. It looked very neat and tidy, unlike the pictures of Google’s early racks.

MapReduce
I’ve meant to write about MapReduce, but I couldn’t quite get a handle on it. Jeff spent a fair amount time describing the advantages of MarReduce, so now I have that handle.

MapReduce is essentially a programming language that abstracts the messy details of programming a large cluster. The Map piece extracts the data that one wants to work on into a essentially a big spreadsheet or table, while the Reduce piece massages the data into the final form. With this tool a program of 50 lines can put thousands of compute nodes to work.

Google’s scalability challenges
Google is pretty happy with their tools, but it is American to want something better. And what they’d like is a single global namespace so that data can be accessed from anywhere. So the scalability number I offered at the beginning of this post may be way low. Instead of scaling a single cluster 100x, Google would actually like to scale and interconnect their entire cluster population - which I estimate is now over 4 million cores - 100x.

The StorageMojo take
Wow! More tomorrow as I continue the report on the conference.

Comments welcome, as always.

Google Seattle scalability conference

June 11th, 2007 by Robin Harris in Architecture, Clusters, Future Tech, Information Management

I’m pumped!
Next week I’m flying to Seattle to attend a one day conference on scalability hosted by Google’s Kirkland office.

It is a great set of presentations with leading edge practitioners. Here’s the agenda, presenters and edited descriptions of the topics:

  • Keynote I: MapReduce, BigTable, and Other Distributed System Abstractions for Handling Large Datasets - Jeff Dean
    Search is one of the most important applications used on the internet, but it also poses some of the most interesting challenges in computer science. Providing high-quality search requires understanding across a wide range of computer science disciplines, from lower-level systems issues like computer architecture and distributed systems to applied areas like information retrieval, machine learning, data mining, and user interface design. In this talk, I’ll highlight some of the behind-the-scenes pieces of infrastructure that we’ve built in order to operate Google’s services.
    • Breakout I: Lustre File System - Peter Braam
      This lecture will explain the Lustre architecture and then focus on how scalability was achieved. We will address many aspects of scalability mostly from the field and some from future requirements, from having 25,000 clients in the Red Storm computer to offering exabytes of storage. Performance is an important focus and we will discuss how Lustre serves up over 100GB/sec today going to 100TB/sec in the coming years. It will deliver millions of metadata operations per second in a cluster and, write 10’s of thousands of small files per second on a single node. If you like big numbers (but less than a Gogol) please come to this talk.

    • Breakout I: Building A Scalable Resource Management Layer for Grid Computing - Khalid Ahmed
      We will show how to build a centralized dynamic load information collection service that can handle up to 5000 nodes/20,000 cpus in a single cluster. The service is able to gather a variety of system level metrics and is extensible to collect up to 256 dynamic or static attributes of a node and actively feed them to a centralized master. A built-in election algorithm ensures timely failover of the master service ensuring high-availability without the need for specialized interconnects.

      This building block is extended to multiple clusters that can be organized hierarchically to support a single resource management domain that can span multiple data centers. We believe the current architecture could scale to 100,000 nodes/400,000 cpus. Additional services such as a distributed process execution service, and a policy-based resource allocation engine which leverage this core scale-out clustering service are described. The protocols, communication overheads, and various design tradeoffs that were made the development of these services will be presented along with experimental results from various tests, simulations and production environments.

    • Breakout II: VeriSign’s Global DNS Infrastructure - Patrick Quaid, Scott Courtney
      VeriSign’s global network of nameservers for the .com and .net domains sees 500,000 DNS queries per second during its daily peak, and ten times that or more during attacks. By adding new servers and bandwidth, we’ve recently increased capacity to handle many times that query volume. Name and address changes are distributed to these nameservers every 15 seconds — from a provisioning system that routinely receives one million domain updates in an hour. In this presentation we describe VeriSign’s production DNS implementation as a context for discussing our approach to highly scalable, highly reliable architectures. We will talk about the underlying Advanced Transactional Lookup and Signaling software, which is used to handle database extraction, validation, distribution and name resolution. We also will show the central heads-up display that rolls up statistics reported from each component in the infrastructure.

    • Breakout II: Using MapReduce on Large Geographic Datasets & Google Talk: Lessons in Building Scalable Systems - Barry Brumitt, Reza Behforooz
    • MapReduce is a programming model and library designed to simplify distributed processing of huge datasets on large clusters of computers. This is achieved by providing a general mechanism which largely relieves the programmer from having to handle challenging distributed computing problems such as data distribution, process coordination, fault tolerance, and scaling.

      Since launching Google Talk in the summer of 2005, we have integrated the service with two large existing products: Gmail and orkut. Each of these integrations provided unique scalability challenges as we had to handle a sudden big increase in the number of users.

  • Keynote II: Description TBD - Marissa Mayer
    • Breakout III: Stream Control Transmission Protocol’s Additional Reliability and Fault Tolerance - Brad Penoff, Mike Tsai, and Alan Wagner
      The Stream Control Transmission Protocol (SCTP) is a newly standardized transport protocol that provides additional mechanisms for reliability beyond that of TCP. The added reliability and fault tolerance of SCTP may function better for MapReduce-like distributed applications on large commodity clusters.

      SCTP has the following features that provide additional levels of reliability and fault tolerance. Selective acknowledgment (SACK) is built-in to the protocol with the ability to express larger gaps than TCP; as a result, SCTP outperforms TCP under loss. For cluster nodes with multiple interfaces, SCTP supports multihoming, which transparently provides failover in the event of network path failure. SCTP has the stronger CRC32c checksum which is necessary with high data rates and large scale systems. SCTP also allows multiple streams within a single connection, providing a solution to the head- of-line blocking problem present in TCP-based farming applications like Google’s MapReduce. Like TCP, SCTP provides a reliable data stream by default, but unlike TCP, messages can optionally age or reliability can be disabled altogether. The SCTP API provides both a one-to-one (like TCP) and a one-to-many (like UDP) socket style; use of a one-to-many style socket can reduce the number of file descriptors required by an application, making it more scalable.

      The additional scalability and fault tolerance come at a cost. The CRC32c checksum calculation currently is not off-loaded to any NIC available on the market, so it must be performed by the host CPU. In high bandwidth environments with no loss, SACK processing may become a burden on the host CPU.

    • Breakout III: Scalable Test Selection Using Source Code Deltas - Ryan Gerard
      As the number of automated regression tests increase, the ability to run all of them in a reasonable amount of time becomes more and more difficult, and simply doesn’t scale. Since we are looking for regressions, it is useful to hone in on the parts of the code that have changed from the last run to help select a small subset of tests that are likely to find the regression. In this way we are only running the tests that need to be run as your system gets larger and the number of possible tests scales outward. We have devised a method to select a subset of tests from an existing test set for scalable regression testing based on source code changes, or deltas.

    • Breakout IV: YouTube Scalability - Cuong Do
      This talk will discuss some of the scalability challenges that have arisen during YouTube’s short but extraordinary history. YouTube has grown incredibly rapidly despite having had only a handful of people responsible for scaling the site. Topics of discussion will include hardware scalability, software scalability, and database scalability.

    • Breakout IV: Challenges in Building an Infinite Scalable Datastore - Swami Sivasubramanian, Werner Vogels
      In this talk, we will present the design of one of our internal datastores, HASS. HASS is designed to be “always” available, i.e., it will always accept read/write requests even if disks are failing, routes are flapping or if datacenters are being destroyed by tornados. HASS is designed for incremental scalability where adding or removing nodes can be done easily and the load gets evenly distributed among the nodes uniformly without requiring any operator intervention. In this talk, we will focus on a single and one of the most crucial ideas in HASS’s design: its ability to partition data. HASS uses consistent hashing to partition its data across its storage nodes. The basic consistent hashing algorithm is well understood in the academic literature and several research systems have been designed using it. In this talk, we will discuss our experiences with using the basic consistent hashing algorithm and the optimizations we performed to achieve more uniform load distribution and ease of operation.

    Which ones should I attend?
    I’m torn between a couple of the breakout options. Lustre vs. scalable resource management. YouTube vs. infinitely scalable datastore.

    I know some of you folks are intimately involved with these topics, so I’d appreciate your suggestions, not only for which to attend, but what questions you’d like to see addressed. If some of you are also going to be there I’d also be pleased to meet f2f as well.

    That last breakout session is a really tough choice. How can I be in two places at once?

    While I’m up there I’m also hoping to tour Isilon’s lab and see their gear in action.

    Comments and suggestions welcome. Last I heard the conference was full with a waiting list.

ZFS On Mac: Now All-But-Official Pt. II

June 6th, 2007 by Robin Harris in Architecture, Future Tech, Information Management, SSD/Flash Disk

All we need now is teh Steve to say it . . .
Thanks to alert reader Petieg, I’ve learned that according to Mac Rumors Sun CEO Jonathan Schwartz said today that

In fact, this week you’ll see that Apple is announcing at their Worldwide Developer Conference that ZFS has become the file system in Mac OS 10.

Jonathan is wrong, of course, but it was sweet of him to say it
Folks tell me that if ZFS is in Leopard it is pretty well hidden. I’ll stick to my prediction that Apple, as with HFS+, will put ZFS on OS X Server first before bringing it out later for the great unwashed.

For one thing it will fix a persistent problem Xserve RAID admins have: pulling out the wrong drive, or scrambling drives, and losing lots of bits. V cool.

Now I’m going to pat myself on the back
As I noted in Bring Me the Head of WinFS:

Can Apple Trump Vista With ZFS?
Apple now has a clear path to trump Vista’s aging data management with a port of ZFS. While not offering a relational database and the promise of a single cross-application data store, ZFS is a modern file/storage management system whose end-to-end data integrity and protection makes it a strong foundation for future innovation. NTFS and Apple’s HFS+ are no match for it. Let’s hope Apple says more at their World Wide Developer Conference in August.

Well, cough, cough, it looks like August 2006 is finally arriving next week.

The NEW news
I finally put two and two together and figured this out: ZFS will be great for flash disks. Unlike today’s Mac OS and Windows, ZFS bunches writes - kind of like NetApp’s WAFL - which is just what flash drives need since their random write performance is even worse than I’d realized.

In fact, it just occurs to me that it could be on the iPhone. Why? Because Bonwick, Moore, et. al. managed to write all this stuff in very little code.

More info coming on flash
I’ve been delving deep into flash disks. Can you say “weird”? My take now is that flash drives are to disk drives what quantum mechanics is to Newtonian physics. I’m planning to have something out next week.

The StorageMojo take
The real importance of ZFS on Mac is that it raises the bar for the entire industry. Journaled file systems are better than not, but as the consumer-driven IT market booms customers need better data protection and recovery tools. And flash drives need a compatible file system. ZFS goes a long way towards meeting both requirements.

Update II: No mention of ZFS in Steve’s keynote or on the Apple website. I doubt we’ll hear much about it until Apple includes it in a release of OS X Server. Maybe in October, maybe not.

Update: Want to know more about ZFS? I’ve been hot on it for over a year. See:

Comments welcome, of course.

What is email?

May 28th, 2007 by Robin Harris in Architecture, Information Management

Mirrorworlds for the masses
David Gelernter’s company, Mirror Worlds Technologies, tried to put an interface on Windows that reflected how people actually remember and link their experiences, rather than some CompSci metadata accommodation.

The company ceased operations three years ago, and at the time I recall wondering why. Last week, on a concall with CEO Ray Bingham of Arcmail, I started thinking about what email *really* is. Arcmail makes email archive servers and they’re announcing something but there’s an embargo on it for some time - days? weeks? - so that’s probably the last you’ll hear of them from me.

Not what I thought
I always thought it was about communication. And it is. Email speaks to me. It isn’t the email. It is the by-product: an organized record of communication.

Why else do I keep 15,000 emails?

Email is my journal, my archive, my most used and reliable search tool. It tracks relationships. Helps keep cryptically named documents associated with something I do understand. It is easy to organize temporally, easy to search.

I’ve used email for 25 years. Yet I never thought about how I used it. Maybe its because I’m now also IM’ing and video chatting - methods with the immediacy that email seemed to have over snail mail - that I’m starting to get it.

I maintain several on-line identities - StorageMojo, Storage Bits, Data Mobility Group and several more email addresses.

My email client is where they all come together.

It is the original online social network
And we treat it like it is email. It is identity. In a very real sense it is who we are.

Email: your personal metadata generator
Email adds value because it adds context - metadata - to raw files and communications. Context that is human readable and human memorable. That fits the relational database in our brains, not our computers. That provides metadata that people use, like names, conversations, topics and words that mean something.

Ray made the point that email servers, like Exchange, are just email servers. They aren’t designed to handle multi-gigabyte mailboxes, frequent individual searches or company-wide searches. Arcmail is.

Fear trumps greed
I pointed out some of the business advantages of big mailboxes as a business tool to Ray. He responded that he agreed, and that he’d tried some of those messages. Yet the chief buying motivator is fear of lawsuits, not the business process advantages.

Thought leadership, anyone?
I’ve come to believe that innovation happens regularly where ever people confront problems. What changes is our individual and cultural receptiveness to innovation. I think we may be getting ready to accept that email is one of the most valuable organizational tools we use and that there could be new ways of extracting business value from it.

The StorageMojo take
Email isn’t electronic “mail,” any more than cars are “horseless carriages.” Email not only goes faster than snail mail, it also provides its own infrastructure that makes it uniquely accessible and valuable.

Instead of thinking of email “clients” how about “communication clients” where email, chat, downloads, uploads and VOIP contacts are logged and are reviewable and searchable. I’d love to be able to go to one place to search my email, Adium and Skype chats, review my downloads - Safari’s 20 download history is inadequate for me - uploads and surfing history. All the information exists, but only in stovepipes. I want it all, on my local machine, always available, with an easy archive function.

Comments welcome. I can’t be the first person to think of this, so has anyone done an open-source comm client?

Intel’s best and worst

May 22nd, 2007 by Robin Harris in Architecture, Future Tech

Over on ZDnet Monday I posted RAM to avoid: hot, expensive and slow. I wrote about Intel’s ill-fated attempt to get Fully Buffered DIMMs accepted as the memory standard for servers and workstations.

I got started on the topic because I’m thinking of buying the FB-DIMM saddled Mac Pro. I was wondering why the DIMMs were so expensive, which led me to the great AnandTech review, which totally demolishes the architectural arguments for FB-DIMMS by looking at their real-world performance, which in the worst case is about 20% of the “theoretical” spec Intel touts. “Pathetic” is too kind by half.

The modest virtues of FB-DIMMs have no chance of overcoming these shortcomings. OK, Hillsboro, back to the drawing board.

Architecture-based advocacy is always suspect
I noted that Intel seems to have a consistent problem with architecture decisions, as evidenced by NetBurst, Itanium, RDRAM and now FB-DIMM. I’m sure I’ve left out a bunch, like Infiniband, that ended up nowhere near what Intel originally intended, but whose real quality has earned it a continuing role. Even the success of USB2 is, IMHO, marred by the insistence on the “480 Mb/sec” spec when the actual performance is about half that.

FB-DIMMs will only be a footnote in any list of Intel failures. Anyone got some other candidates? I’m thinking Itanium has to be #1, but to be honest I haven’t followed Intel all that closely. Maybe some of you have.

’tis better to invent and lose, than never to invent at all
Intel’s willingness to take risks is a Good Thing. Yet the continuing problems with taking smart risks in Hillsboro and Santa Clara raises some interesting questions about Intel culture and decision making.

There is a depressing sameness to how poor and non-bright people screw up. Dissecting how the bright, clever and rich people screw up is much more entertaining. We may even reach some non-obvious insights.

Where Intel architects go wrong

  1. Premature excitation: the process of projecting a negative trend forward to a scary future while under estimating opposing trends. With FB-DIMMs, the trends were increasing CPU power and rising memory bus clock speeds, meaning shorter traces and fewer memory slots. (Il)logical conclusion: we’ll soon have really fast CPUs with no memory capacity.
  2. Boiling the ocean: “let’s solve all of today’s computing problems with an architecture so good that when we finally get it out the door in five years it may even solve some of the new problems.” This is the Itanium problem.
  3. Optimizing for marketing: this is the NetBurst problem. Clock speed sells, so let’s build the fast clocks. There are too many details in optimizing a whole system and, frankly, I don’t want to learn all the new stuff required to do that.

Got any other ideas?

Please nominate your favorite Intel architectures, both bad and good.
I’m picking on Intel because they blemished my likely next machine with FB-DIMMs. But I think we can have some fun with this too. Intel guys welcome, but no whining. One EMC is enough.



« Previous Article
StorageMojo RSS Feed May 2008 April 2008 March 2008 February 2008 January 2008 December 2007 November 2007 October 2007 September 2007 August 2007 July 2007 June 2007 May 2007 April 2007 March 2007