From the category archives:

Clusters

A petascale parallel database

by Robin Harris on Monday, 8 February, 2010

MapReduce and its open source version, Hadoop, are parallel data analysis tools. A few lines of code can drive massive data reductions across thousands of nodes.

Cool.

Powerful though it is, Hadoop isn’t a database. Classic structured data analysis of the model/load/process type isn’t what it was designed for.

That’s where the paper HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads (pdf) comes in. Written by Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz and Alexander Rasin (the first four at Yale, the last at Brown), the paper proposes a method for building an open-source, massively scalable, shared-nothing analytical parallel database on commodity hardware.

What it is
HadoopDB coordinates SQL queries across multiple independent database nodes using Hadoop as the task coordinator and network communication layer. It uses the scheduling and job tracking of Hadoop while it intelligently pushes much of the query processing into the individual database nodes.

There are four components to HadoopDB.

  • Database Connector. Each node has its own independent database. The connector is the interface between the database and Hadoop’s task trackers. A MapReduce job supplies the Connector with an SQL query and other parameters. The Connector executes the SQL query on the database and returns results as key-value pairs. It can be implemented to support a variety of databases.
  • Catalog. The information needed to access the databases and metadata such as cluster data sets, replica locations and data partitions is kept in the catalog.
  • Data loader. The data loader has two jobs. First, it executes a MapReduce job over Hadoop that reads the raw data files and partitions them into as many parts as there are nodes in the cluster. Second, it loads the partitions into the local file system of each node and chunks them according to a system-wide parameter.
  • SQL to MapReduce to SQL planner. The planner provides a parallel database front end to enable SQL queries. It transforms the queries into MapReduce jobs and optimizes the query plans for efficiency. This is the secret sauce of HadoopDB.
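
To make the division of labor concrete, here is a minimal sketch of the idea, not HadoopDB's actual code. It assumes in-memory SQLite databases as stand-ins for the per-node databases, and the `visits` table and every function name are made up for illustration. The "map" side pushes a SQL fragment to each node and gets key-value pairs back; the "reduce" side merges the partial results.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one independent single-node database holding one partition."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    db.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return db


def connector(db, sql):
    """Run SQL on the local database and return (key, value) pairs."""
    return [(row[0], row[1]) for row in db.execute(sql)]


def map_phase(nodes, sql):
    """Push the same SQL fragment to every node and collect the pairs."""
    pairs = []
    for db in nodes:
        pairs.extend(connector(db, sql))
    return pairs


def reduce_phase(pairs):
    """Merge the per-node partial sums by key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Two hypothetical nodes, each holding its own partition of the data.
nodes = [
    make_node([("a.com", 1.0), ("b.com", 2.0)]),
    make_node([("a.com", 3.0), ("c.com", 4.0)]),
]
# Each node does its own GROUP BY; only small partial results travel.
partial_sql = "SELECT url, SUM(revenue) FROM visits GROUP BY url"
result = reduce_phase(map_phase(nodes, partial_sql))
print(result)
```

The point of the sketch is that the aggregation happens inside each database, so only the partial sums have to cross the network.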

HadoopDB complements the Hadoop infrastructure and does not replace it. Analysts have both available as needed.

Heterogeneity
A key issue for Internet-scale systems is the ability to run in a heterogeneous environment where multi-year build-outs and rolling node replacement are the norm. That means that some nodes will be faster than others. HadoopDB breaks the work down into small tasks and moves them from slow to fast nodes automagically.
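
Why small tasks help can be shown with a toy simulation. This is a sketch under simple assumptions (equal-sized tasks, fixed node speeds, zero scheduling overhead), not Hadoop's scheduler, and the function names are mine. It compares handing every node an equal share up front with letting each node pull the next task as soon as it is free.

```python
import heapq


def static_makespan(task_count, task_cost, speeds):
    """Every node gets an equal share up front; the slowest node sets the pace."""
    share = task_count / len(speeds)
    return max(share * task_cost / speed for speed in speeds)


def dynamic_makespan(task_count, task_cost, speeds):
    """Each node pulls the next small task as soon as it becomes free."""
    # Heap of (time the node becomes free, node index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(task_count):
        now, node = heapq.heappop(free_at)
        done = now + task_cost / speeds[node]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, node))
    return finish


# Hypothetical cluster: three current nodes and one older node at 1/4 speed.
speeds = [1.0, 1.0, 1.0, 0.25]
static = static_makespan(120, 1.0, speeds)
dynamic = dynamic_makespan(120, 1.0, speeds)
print(f"static split: {static:.1f}  pull-based: {dynamic:.1f}")
```

With the equal split the whole job waits for the slow node; with pull-based scheduling the fast nodes simply end up doing most of the tasks.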

Results
The authors ran some benchmarks on Amazon’s EC2 to test performance. The HadoopDB load times were about 10x those of Hadoop, but the higher performance of HadoopDB usually justified the longer setup time.

The authors found that HadoopDB was able to approach the performance of parallel database systems on much lower cost hardware and free software. Given the youth of the projects, one can expect higher performance as improvements are made.

The killer app for private clouds?
MapReduce and Hadoop are already in wide use among Internet-scale datacenters. As companies begin to understand and correlate social media, web activity and ad response rates, the demand for large-scale parallel database processing will grow. But will they want to ship it out to Amazon?

Depending on the quantity and sensitivity of the data, many organizations may prefer to keep the processing in-house. Private scale-out Hadoop clusters may become the poor company’s data warehouse of choice.

The StorageMojo take
HadoopDB is more science project than commercial tool today. Yet the project demonstrates the feasibility of using scale-out compute/storage clusters for work that today typically requires proprietary high-end scale-up system architectures.

If capital costs are reduced by two thirds with a commodity/FOSS architecture, companies could afford to hire the expertise required to make it work. The free software/paid support model will prove quite successful in this space.

Courteous comments welcome, of course.


Verari restart

by Robin Harris on Wednesday, 20 January, 2010

Verari Systems is now Verari Technologies. The company’s assets were purchased by the original founder, Dave Driggers, after an attempt last year to get another round of financing foundered.

They’ve had some success with their containerized compute/storage systems. There haven’t been many buyers amidst the Great Recession and the credit crunch didn’t help.

Here are edited comments from their website:

Original Founder Leads Investment Group in Purchase of Verari Systems’ Assets

Founder aims to re-start company with concentration on data center design and optimization services, modular container-based data centers, blade-based storage and high performance computing solutions.

San Diego, Calif. – January 19, 2010 – David Driggers, the original Founder of Verari Systems, Inc., . . . today announced the successful acquisition of substantially all of Verari Systems’ corporate and intellectual property assets by an Investment Group led by Driggers.

Mr. Driggers is re-starting the Verari engine this week. The new company, Verari Technologies, is offering immediate support to past Verari Systems’ customers.

Verari’s award-winning FOREST containers are one of the industry’s best selling portable data center solutions. The containers, as well as Verari’s BladeRack architecture, utilize Verari’s patented Vertical Cooling Technology to increase energy efficiency while reducing a customer’s energy bills.

“You’re going to see a concerted effort on our part to license and promote these unique technologies,” states Mr. Driggers.

Most of the staff was laid off last year because the company couldn’t meet payroll. The new company retains much of the former senior management.

The StorageMojo take
Verari is wise to take a step back from direct competition with HP, SGI and IBM. HP owns the biggest chunk of the blade market, buys over half the world’s disk drives and, in the 9100, has some very dense storage. But HP can’t be all things to all people – and Verari can help fill the gaps.

While the density benefits of blades are undeniable, some question whether they are cost-effective compared to high-volume commodity boxes. Verari’s pricing seemed more aggressive than most blade vendors – perhaps too aggressive – but price is another competitive tool they may choose to wield to the benefit of buyers everywhere.

Courteous comments welcome, of course.


Cloud at Storage Visions 2010

by Robin Harris on Wednesday, 13 January, 2010

I moderated a panel on cloud storage at Tom Coughlin’s Storage Visions 2010 conference. Some good stuff came out of it.

4 companies presented: IBM, Bycast, Cleversafe and Asankya.

IBM, now a services company, talked about the service needs of cloud providers or cloud customers.

Bycast
Bycast, which may have the largest installed base of any cloud software provider, presented on the process that they typically see for private cloud implementation. My interpretation of the process:

  • Edge sites install a gateway node to the central private cloud repository
  • The edge site learns what its local data needs are
  • A local disk cache is added to the gateway node to improve performance
  • A workable balance between local wants and economics is achieved.

It took 3 years for the enterprise to go from pilot to the start of full deployment. Data storage rose from 36 TB at the end of year 1 to 750 TB at the end of year 6.

Cleversafe
Cleversafe may be the leader in implementing advanced erasure codes in storage software. RAID 5 & 6 are both forms of erasure codes, but the math has been refined in the last 20 years. Much higher levels of data availability with lower overhead are now possible.

As disk capacities climb and disk error rates remain constant, the expected annual data loss rises. By 2020 you can expect that a 1,000 disk storage farm will lose over 200 GB of data annually – even with mirrored RAID 6. (RAID 16? The mind boggles).

Advanced erasure codes combined with physically dispersed storage make all that go away. Cleversafe estimates that a dispersed storage infrastructure requiring 10 of 16 nodes to reconstruct the data is 1,000,000 times more reliable than RAID 16 – reducing expected data loss from 200 GB to 200 KB.
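
A back-of-the-envelope sketch shows why a 10-of-16 scheme can beat a mirror on reliability while storing less. It assumes nodes fail independently with the same probability, and the 1% figure is purely illustrative; it is not Cleversafe's model or their numbers. The last line only checks the post's own arithmetic.

```python
from math import comb


def data_loss_probability(n, k, p_fail):
    """Probability that fewer than k of n independent nodes remain readable."""
    min_failures = n - k + 1  # this many failures makes the data unrecoverable
    return sum(
        comb(n, failed) * p_fail ** failed * (1 - p_fail) ** (n - failed)
        for failed in range(min_failures, n + 1)
    )


def storage_overhead(n, k):
    """Raw bytes stored per byte of user data for a k-of-n dispersal."""
    return n / k


p_fail = 0.01  # assumed per-node loss probability, for illustration only
loss_10_of_16 = data_loss_probability(16, 10, p_fail)
loss_mirror = data_loss_probability(2, 1, p_fail)  # plain 2-way mirror

print(f"10-of-16: overhead {storage_overhead(16, 10):.1f}x, loss {loss_10_of_16:.2e}")
print(f"mirror:   overhead {storage_overhead(2, 1):.1f}x, loss {loss_mirror:.2e}")

# The post's arithmetic: 200 GB of expected loss cut by a factor of 1,000,000.
scaled_loss_bytes = 200e9 / 1_000_000  # 200,000 bytes, i.e. 200 KB
```

The dispersal has to lose 7 of its 16 nodes before any data is gone, which is why its loss probability collapses even though it stores only 1.6 bytes per byte of data.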

Asankya
If Bycast has proven private cloud software and Cleversafe has disaster-proof storage, then we’re done, right? Except for the freakin’ network latency that makes “cloud” storage synonymous with “slow” storage. That’s where Asankya comes in.

Their basic insight is this: TCP/IP was built when a 200 nanosecond CPU and a couple of meg of RAM was a Hot Box. What if we were to change the protocol to take advantage of modern resources – could we do better? Well, duh!

They’ve developed the RAPID protocol and an overlay network called RAPIDnet that they claim dramatically improves network performance. How?

  • Multipathing. Instead of tying a session to a single network path, RAPID decides on a per-packet basis the fastest route for that packet.
  • Maximum bandwidth utilization. Multiple paths means more available bandwidth – and RAPID loads each path as full as it can.
  • Network deduplication. Originating nodes keep track of all packets that pass through, so when a duplicate packet shows up it doesn’t resend it.

Net net: by increasing bandwidth and reducing delays, Asankya cuts latency, making cloud storage much more feasible for interactive apps. Cool!
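
I don't have the RAPID internals, so treat the following as a toy illustration of two of the ideas above, per-packet path choice and duplicate suppression, and not as Asankya's protocol. The path names, latency estimates and the use of a SHA-256 digest as a packet reference are all my assumptions, and a real system would need a matching cache on the receiving side.

```python
import hashlib


class Sender:
    """Toy sender: per-packet path choice plus duplicate suppression."""

    def __init__(self, path_latency_ms):
        # Hypothetical latency estimate per path, e.g. from recent probes.
        self.path_latency_ms = dict(path_latency_ms)
        self.seen = set()  # digests of payloads already sent

    def pick_path(self):
        """Choose the path with the lowest current latency estimate."""
        return min(self.path_latency_ms, key=self.path_latency_ms.get)

    def send(self, payload):
        """Return (path, bytes put on the wire) for one packet."""
        digest = hashlib.sha256(payload).digest()
        path = self.pick_path()
        if digest in self.seen:
            # Duplicate payload: send only the short digest as a reference.
            return path, len(digest)
        self.seen.add(digest)
        return path, len(payload)


sender = Sender({"isp_a": 40.0, "isp_b": 25.0})
packet = b"x" * 1200
first = sender.send(packet)              # goes out in full on the faster path
sender.path_latency_ms["isp_b"] = 90.0   # isp_b gets congested
second = sender.send(packet)             # same payload, now via isp_a, reference only
print(first, second)
```

Because the path is chosen per packet, the second send switches routes without tearing down a session, and the duplicate costs 32 bytes instead of 1,200.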

Of course, this all has to work in the Real World. Evidently it does, as they have customers. And the technology came out of Georgia Tech.

The StorageMojo take
The latter 3 companies make an important point about cloud storage and computing: we can do much more to make it economical, safe and fast. That’s a Very Good Thing.

Asankya asks whether network intelligence should be in the core or on the edge. Cisco, of course, prefers a smart core, so Asankya is a clear threat to them. The rest of us might disagree.

Courteous comments welcome, of course. I’m doing some work for Bycast, but, alas, not for the other companies. Thanks to Tom Coughlin for assembling a good group for the panel. I’m hoping I can post links to more info on all of them.


2009’s big STORies

by Robin Harris on Monday, 28 December, 2009

2009 has been an eventful year: the Great Recession has driven big changes in enterprise behavior, opening up the field to many new players. Isilon, for one, is reporting healthy growth and they were on the ropes 2 years ago.

Those changes are reflected in my take on the biggest stories of the year:

(8) Tiny server clusters
Instead of putting many virtual eggs in one power-hungry basket, why not build low-power/low-cost servers that don’t need VM software at all?

Microslice servers achieve availability through cheap redundancy. Of course, no enterprise salesman will sell them, so if their advantages prove out, the efficiency gap between cloud and enterprise shops will only grow.

(7) Nightmare on DIMM street
The finding by Bianca Schroeder et al. that DRAM is hundreds to thousands of times more error-prone than chip vendors said means that every device that claims to be “enterprise” had better have at least SECDED – single error correction/double error detection – ECC.

(6) Apple drops ZFS
A golden opportunity to bring a 21st century file system to millions of people sank without a trace. But if the Sun/Oracle deal gets closed it might be revived.

(5) Data Domain bidding war
An EMC blogger was trashing DD 2 weeks before the bid – and singing their praises after it. So what else is new?

EMC legitimized dedup – and the bastards say welcome.

(4) Cluster-based scale-out storage
HP bought IBRIX and Isilon is growing fast – storage clusters have arrived. EMC will continue to pooh-pooh it until they get Atmos functional – or maybe they’ll bite the bullet and buy someone who already has it working.

(3) Flash
From STEC’s 10x stock leap – and crash – to everyone announcing flash drives and cards and appliances: this is not a flash in the pan. Fusion-io’s big OEM deals and announcements by newcomers say the party is just getting started.

(2) Cisco’s bong-sized cloud
Cisco’s UCS may not be a success, but they have forced everyone to rethink their businesses. Is a new round of verticalization about to begin as big companies seek to drive growth by taking away their former “partners’” markets?

It used to be a commonplace that he who owned the customer’s data owned the business, but the horizontal model of the last 25 years changed that. But if the Oracle/Sun deal completes, Cisco will find that Oracle’s grip is tighter, giving HP and Cisco common cause once again.

(1) Cloud infrastructure
Unlike some other hype-driven IT trends, cloud infrastructure is here to stay because Google, Amazon, Yahoo and Microsoft have proven it makes economic sense. Which is more than client-server had going for it for many years.

Smart IT people looking to demonstrate added-value will figure out how to leverage that for real competitive advantage over less-nimble foes. It isn’t a quick fix though and enterprises will need to think long term – a skill rusty from disuse.

The StorageMojo take
Like a termite-riddled barn after a heavy snow, the Great Recession is seeing old models collapse. We can’t afford to keep doing what we’ve been doing.

As the new models emerge, competition will grow in the hot areas, leading to even more innovation in the next 3 years than we’ve seen in the last 5. More on that in a future post.

Courteous comments welcome, of course.


Tiny server clusters

by Robin Harris on Sunday, 6 December, 2009

Virtual machines (VMs) solve the problem of many tiny servers on a big server. VMs are a logical outgrowth of Moore’s Law: server CPUs got bigger, faster, than the apps required. And Windows Server didn’t handle multiple apps well.

But the growth of 100 megawatt Internet-scale data centers has architects rethinking efficiency-at-scale. As James Hamilton put it in his presentation Internet-Scale Service Infrastructure Efficiency (pdf):

Single dimensional performance measurements are not interesting at scale unless balanced against cost

Therefore: work done per $; per joule; and per rack.

Microslice server
Because CPU performance has grown so much faster than storage – disk and DRAM – over the last 30 years, powerful multicore CPUs are spending much of their time idling. The microslice server idea: build servers from slower, cheaper and much more power-efficient CPUs.

Amazon has done just that. A microslice prototype jointly developed with SGI – formerly Rackable – using a lower power Athlon 4850e CPU handled over 9x the requests per second (RPS) of a rack of conventional servers.

And the server cost just $500, used 1/5th the power and provided about 70% of the performance (RPS) of the much costlier server. Higher density – something like 6 servers per rack unit – provided the rack-level performance.
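
The per-joule claim is easy to check from the two ratios quoted above. This is a quick worked calculation, normalizing the conventional server to 1.0 on both axes; the function name is mine, and I leave out per-dollar and per-rack figures because the post doesn't give the conventional server's price or density.

```python
def relative_efficiency(performance_ratio, resource_ratio):
    """Work done per unit of resource, relative to a baseline of 1.0."""
    return performance_ratio / resource_ratio


# From the post: about 70% of the RPS at 1/5th the power.
rps_ratio = 0.70
power_ratio = 1.0 / 5.0

rps_per_watt_gain = relative_efficiency(rps_ratio, power_ratio)
print(f"RPS per watt vs. the conventional server: {rps_per_watt_gain:.1f}x")
```

So each microslice server does roughly 3.5 times the work per watt, which is the kind of single-number-balanced-against-cost metric Hamilton is asking for.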

Disk Workload from Hell
At October’s 22nd ACM Symposium on Operating Systems Principles (SOSP), David G. Andersen, Jason Franklin, Amar Phanishayee, Lawrence Tan and Vijay Vasudevan, all from Carnegie Mellon University, and Michael Kaminsky (Intel Research Pittsburgh) presented FAWN: A Fast Array of Wimpy Nodes, a Best Paper award winner.

FAWN’s goal: maximizing queries per Joule in a high performance key-value storage system. Key-value stores are seeing increasing use in Internet-scale systems – the key is a unique identifier for the associated value.

The paper explains:

The workloads these systems support share several characteristics: they are I/O, not computation, intensive, requiring random access over large datasets; they are massively parallel, with thousands of concurrent, mostly-independent operations; their high load requires large clusters to support them; and the size of objects stored is typically small, e.g., 1 KB values for thumbnail images, 100s of bytes for wall posts, twitter messages, etc.

The paper describes both the hardware – which uses 500 MHz embedded processors, 256 MB DRAM and 4 GB CF flash – and the software – a log-structured per-node datastore that optimizes flash performance. The net/net: FAWN is over 6x more efficient – on queries per second – than conventional systems.

At 1/5th the cost. And 1/8th the power.
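
For readers who haven't met a log-structured key-value store, here is a minimal sketch of the general technique. It is not FAWN's datastore: the record layout, the class name and the in-memory buffer standing in for flash are all my assumptions. Writes only ever append to the log, which suits flash, and a small in-memory index maps each key to the offset of its latest record.

```python
import io
import struct

HEADER = struct.Struct("<II")  # key length, value length


class LogStore:
    """Append-only log plus an in-memory index of the newest record per key."""

    def __init__(self):
        self.log = io.BytesIO()  # stands in for the flash device
        self.index = {}          # key -> offset of its latest record

    def put(self, key, value):
        """Append a record; never overwrite in place."""
        offset = self.log.seek(0, io.SEEK_END)
        self.log.write(HEADER.pack(len(key), len(value)))
        self.log.write(key)
        self.log.write(value)
        self.index[key] = offset

    def get(self, key):
        """One index lookup, then one read at the recorded offset."""
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)
        key_len, value_len = HEADER.unpack(self.log.read(HEADER.size))
        self.log.seek(key_len, io.SEEK_CUR)  # skip over the stored key
        return self.log.read(value_len)


store = LogStore()
store.put(b"thumb:42", b"old bytes")
store.put(b"thumb:42", b"new bytes")  # the update is appended, not rewritten
print(store.get(b"thumb:42"))
```

The stale record stays in the log until some cleanup pass reclaims it, which this sketch leaves out.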

The StorageMojo take
This is more important than it looks. The Internet guys are optimizing for power, something most businesses ignore. But the low cost and performance of these nodes is attractive to everyone else.

Back in the day, DEC sold a lot of 3 node DSSI VAXclusters. Why? They were cheap(er) and if you lost a node you still had 2/3rds of your system.

In 2010 I expect to see low-end, cluster-based storage systems that offer multi-node resilience at low cost. Not just purchase price either, but service costs as well. A node went down? We’ll overnight you a new one.

The low-end is about to get a lot more interesting.

Courteous comments welcome, of course. The other SOSP best paper is fascinating too: RouteBricks: Exploiting Parallelism to Scale Software Routers. I hope I have time to post on it.

And BTW, Intel is also showing a microslice proto.


Consolidated I/O for virtual data centers

by Robin Harris on Tuesday, 17 November, 2009

Xsigo (see-go) produces an I/O consolidation appliance whose elegance impresses.

I/O clutter
Typical blade servers have several I/O adapters for networks and storage. Today’s multi-CPU – each multi-core – mobos need a lot of bandwidth to stay busy, so configs with 2-4 GigE or 10GigE network ports and 2 or more SAS or FC HBAs are common.

Each HBA/HCA eats slots and power, adds cost and makes I/O a pain to upgrade or replace. Xsigo offers an alternative.

Big cheap pipe
Built on 20 Gb/s DDR Infiniband, Xsigo replaces physical NICs and HBAs with virtual ones configured on the fly. Xsigo says that the Infiniband is not visible in daily operation.

The physical I/O is implemented in Xsigo’s I/O Director, a 15 slot box with 24 non-blocking DDR Iband ports for server connection. The slots support your choice of single-port 10GigE, dual-port 4 Gbit FC or 10-port GigE I/O modules.

Each 10GigE module supports up to 128 vNICs. The FC module supports 128 vHBAs. And the GigE module can support 160 vNICs.
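
To get a feel for what a loaded Director can present, here is a small sizing sketch using only the per-module limits and slot count quoted above. It assumes the limits simply add up across modules, which I haven't confirmed with Xsigo, and the module keys and the example mix are made up.

```python
SLOTS = 15  # I/O module slots in the Director, per the post

# Per-module limits quoted in the post; the dictionary keys are my own labels.
MODULES = {
    "10gige_1port": {"kind": "vNIC", "max_virtual": 128},
    "fc_4g_2port": {"kind": "vHBA", "max_virtual": 128},
    "gige_10port": {"kind": "vNIC", "max_virtual": 160},
}


def virtual_capacity(mix):
    """Total vNICs and vHBAs for a module mix, assuming limits add linearly."""
    if sum(mix.values()) > SLOTS:
        raise ValueError("module mix exceeds the available slots")
    totals = {"vNIC": 0, "vHBA": 0}
    for name, count in mix.items():
        module = MODULES[name]
        totals[module["kind"]] += count * module["max_virtual"]
    return totals


# Hypothetical mix: 4 x 10GigE, 4 x FC and 2 x GigE modules (10 of 15 slots).
example = virtual_capacity({"10gige_1port": 4, "fc_4g_2port": 4, "gige_10port": 2})
print(example)
```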

Xsigo says you can do most anything with the v-adapters that you can do with the real thing: jumbo frames; LUN masking; link aggregation; VLANs; SAN boot; and QOS features like committed information rates.

Here’s the cool part: the v-adapter addresses can dynamically migrate with a specific VM. Big improvement over the default VM-only migration.

The StorageMojo take
Good to see Iband used as a big cheap pipe. Its low latency, cheap switch ports and high bandwidth make it the best choice for this application.

VMware and Hyper-V have serious I/O problems. Xsigo helps manage them.

Courteous comments welcome, of course. Xsigo was one of 10 or so sponsors that brought me and 15 other bloggers to Silicon Valley last week. They probably have some competition, but I couldn’t find them by Googling. Let me know who they are.


Storage weather forecast: much coolness

by Robin Harris on Friday, 13 November, 2009

Spending the week in Silicon Valley catching up on storage progress. Short takes:

  • Hyper-V storage virtualization. Software now in beta that dramatically increases the Microsoft virtualization layer’s storage chops: cheap snapshots; high-performance I/O with multiple VMs; and an almost invisible UI. Snaps into the management bus as a standard VHD with a map magic smart driver behind it.
  • A NAS test appliance that replaces a lab full of equipment with a single server box that can generate millions of NFS connections and drive GB of traffic. CIFS too. Swifttest.
  • Update from Parascale: some vlarge customers seeing compelling economic benefit from an internal scale-out file storage utility. Time is ripe.
  • Quick intro to FOSS NAS – NFS, CIFS, HTTP, WebDAV & more – company Gluster. Metadata server is an architectural problem – so lose it! Want/need a deep dive on this.
  • Rapid growth at Nexenta with their ZFS-based storage server.
  • An informed observer posits that ZFS on Mac may not be dead – if Oracle’s acquisition of Sun goes through in the not-too-distant-future. See = believe.

And there’s more
Today the event I came to town for starts with briefings from several firms and a reception at one of my favorite places, the Computer Museum. Looking forward to visiting PARC and briefings from VMware, Xsigo, MDS Micro, 3PAR, Symantec (Veritas), Ocarina, Nirvanix and Data Robotics.

The StorageMojo take
Storage is not, historically, a fast moving market. But I’m seeing more action today than in years.

And that’s good for customers and the industry.

Courteous comments welcome, of course.


Redundant array of inexpensive servers

by Robin Harris on Sunday, 8 November, 2009

A recent post on the dumb disk fallacy argues that enterprise storage isn’t overpriced. That misses the point: enterprise arrays may not be overpriced – but they overshoot most market requirements.

That’s why there’s so much innovation in the high capacity/low cost end of the market. And why high-end monolithic arrays are the mainframes of tomorrow.

Background
Disk drives are only 5-10% of an array’s cost. Since dual-parity arrays protect against 2 drive failures/read errors, the logical question is “why not just replicate the data 3x for 15-30% of the array’s cost and get better protection?”

Comparing bare drive costs to array costs isn’t realistic. Power supplies, chassis, cabinets, processors, interfaces and firmware all cost money.

Servers have it all
Which is why I prefer to look at server costs – especially servers with lots of disk slots. Like a Supermicro Superchassis 846E1-R710B 4U 24-bay storage server.

For $5500 you get:

  • A 4U storage 24-drive chassis w/ redundant power & cooling, 2 GigE ports
  • Mobo with dual quad-core Xeon 2.5 GHz processors and 16 GB ECC RAM
  • 24 1 TB 7200 rpm SATA drives
  • Dual 12 channel PCIx RAID cards
  • Support for 15k SAS drives if desired

Update: Some folks asked where that pricing came from. It is from Priority Computer & Networks. I have no experience with them, but I like the nifty online configurator. End update.

The drives are 40% of the configuration’s cost. And I’m sure you could do better. Throw Linux or OpenSolaris storage stacks on the box for free, or, for example, Nexenta’s Enterprise Gold Edition for another $4k and you’ve got a nifty NAS/iSCSI array.
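
Here is the arithmetic behind that configuration as a quick sketch. It uses only the figures quoted above, decimal terabytes, and raw capacity before any RAID parity or replication; the variable names are mine.

```python
config_price = 5500.0    # quoted price of the configured server
drive_count = 24
drive_gb = 1000          # 1 TB drives, decimal gigabytes
drive_share = 0.40       # drives as a share of the configuration's cost
nexenta_price = 4000.0   # optional software, "another $4k"

raw_gb = drive_count * drive_gb
drive_spend = config_price * drive_share
price_per_drive = drive_spend / drive_count

raw_cost_per_gb = config_price / raw_gb
raw_cost_per_gb_with_software = (config_price + nexenta_price) / raw_gb

print(f"drives: ${drive_spend:,.0f} total, about ${price_per_drive:.0f} each")
print(f"raw $/GB: {raw_cost_per_gb:.2f} bare, {raw_cost_per_gb_with_software:.2f} with software")
```

Parity or replication pushes the usable cost per GB above these raw figures, so plug in your own protection scheme before comparing against an array quote.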

Buy several and layer HP’s IBRIX or ParaScale software on them and you’ve got a scalable file cluster with redundancy and performance for way less than monolithic enterprise arrays. Or even mid-range modular arrays.

When are enterprise arrays better?
Enterprise arrays aren’t designed to compete with Supermicro boxes and FOSS. They offer benefits way beyond commodity-based storage – but at a price.

  • Performance. Big redundant write caches are perfect for transactional apps – but the corner cases make engineering them a nightmare.
  • Scale-up architectures. Embedded switches, star networks, high-performance FPGA controllers, FC and Infiniband – all the hot-rod, go-fast, low-volume tech gives big arrays a scale-up capability that enterprise IT likes – until it’s gone.
  • Bullet-proof hardware and software. Years of tweaks to the exception-handling and careful drive qual and firmware control make these systems more reliable.
  • Layered software. Database and email backup, LAN and WAN replication and all the other options give enterprise IT a warm feeling and operational flexibility.

So what’s the problem?
On the Supermicro the storage is $0.50/GB, while on the enterprise array it’s $5/GB or more. And enterprises can’t afford that for everything.

As intuition suggests and recent research confirms, data is getting cooler. We store more and more and access it less and less.

Translation: we need cheap capacity, not high performance. And as Google proved years ago, just because it’s cheap doesn’t mean it is slow.

The StorageMojo take
I don’t know if enterprise arrays are “overpriced” – if people buy ‘em, how bad can they be? – but I am confident that they overshoot the performance requirements of most applications by a wide margin. And that performance is very expensive.

But enterprise IT learned long ago that it is better to be overconfigured than to be caught short. And in good times who cares?

But times aren’t good and despite the Q3 numbers we haven’t seen the end of the Great Recession. The good old days aren’t coming back.

The point behind comparing array prices to dumb disks is to remind people that there might be a better way to achieve their goals than spending $4.75 per GB on performance that most of their data doesn’t need.

If it is redundancy you’re after there are better alternatives than RAID 6. With the arrival next year of pNFS and the commoditization of 10GigE we have many more options for high performance at a low cost per GB than ever before.

Courteous comments welcome, of course. I once worked for Sun and have done work for IBRIX and Parascale.
