A petascale parallel database

by Robin Harris on Monday, 8 February, 2010

MapReduce and its open source version, Hadoop, are parallel data analysis tools. A few lines of code can drive massive data reductions across thousands of nodes.

Cool.

Powerful though it is, Hadoop isn’t a database. Classic structured data analysis of the model/load/process type isn’t what it was designed for.

That’s where the paper HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads (pdf) comes in. Written by Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz and Alexander Rasin (the former 4 @Yale, and the latter @Brown) the paper proposes a method for building an open-source, commodity hardware-based massively scalable, shared-nothing, analytical parallel database.

What it is
HadoopDB coordinates SQL queries across multiple independent database nodes using Hadoop as the task coordinator and network communication layer. It uses the scheduling and job tracking of Hadoop while it intelligently pushes much of the query processing into the individual database nodes.

There are four components to HadoopDB.

  • Database Connector. Each node has its own independent database. The connector is the interface between the database and Hadoop’s task trackers. A MapReduce jobs supplies the Connector with an SQL query and other parameters. The Connector executes a SQL query on the database and returns results as key value pairs. It can implemented to support a variety of databases.
  • Catalog. The information needed to access the databases and metadata such as cluster data sets, replica locations and data partitions is kept in the catalog.
  • Data loader. The data loader is responsible for two jobs. First executing a MapReduce job over Hadoop that reads the raw data files and partitions them into as many parts as the number of nodes in the cluster. Second, the partitions are loaded into the local file system of each node and chunked according the system-wide parameter.
  • SQL to MapReduce to SQL planner. The planner provides a parallel database front end to enable SQL queries. The planner transforms the queries into map reduce jobs and optimizes the query plans for efficiency. This is where scratch that this is the secret sauce of HodoopDB.

HadoopDB complements the Hadoop infrastructure and does not replace it. Analysts have both available as needed.

Heterogeneity
A key issue for Internet-scale systems is the ability to run in a heterogenous environment where multi-year build-outs and rolling node replacement are the norm. That means that some nodes will be faster than others. HadoopDB breaks the work down into small tasks and moves them from slow to fast nodes automagically.

Results
The authors ran some benchmarks on Amazon’s EC to to test performance. The HadoopDB load times were about 10x that of Hadoop, but the higher performance of HadoopDB usually justified the longer set up time.

The authors found that HadoopDB was able to approach the performance of parallel database systems on much lower cost hardware and free software. Given the gift of the projects one can expect higher performance as improvements are made.

The killer app for private clouds?
MapReduce and Hadoop are already in wide use among Internet-scale datacenters. As companies begin to understand and correlate social media, web activity and ad response rates, the demand for large-scale parallel database processing will grow. But will they want to ship it out to Amazon?

Depending on the quantity and sensitivity of the data many organizations may prefer to keep the processing in-house. Private scale out Hadoop clusters may become the poor companies data warehouse of choice.

The StorageMojo take
HadoopDB is more science project than commercial tool today. Yet the project demonstrates the feasibility of using scale out compute/storage clusters for work that day typically requires proprietary high-end scale up system architectures.

If capital costs are reduced by two thirds with a commodity/FOSS architecture, companies could afford to hire the expertise required to make it work. The free software/paid support model will prove quite successful in this space.

Courteous comments welcome, of course.

{ 6 comments… read them below or add one }

nate February 9, 2010 at 11:43 am

I was talking to a developer working on a project that will be running on hadoop soon and was interested to hear his comments on hadoop itself, it’s extremely poorly written, apparently Yahoo built it mostly by outsourcing the development overseas to some low quality coders, and the result is some pretty poor code. It can work it’s just not that good.

I find it pretty interesting how much stuff google does internally such as their own file system, mapreduce, server builds, their own switches and routers, their own http server, their own java servlet server.

Meanwhile others struggle to keep up trying to use as much off the shelf stuff as possible because they don’t have the engineering resources internally to even begin to approach doing it themselves, even a Microsoft insider admitted as much recently in an interview http://www.theregister.co.uk/2010/02/03/microsoft_bing_number_two_wannabe/

I suppose the message here is hope & pray you aren’t in a market that google is or might become interested in at some point if your relying on hadoop. Because whatever you can do, they can do 1000x faster with their ~billion servers, and their ~million PhDs.

ryan February 9, 2010 at 3:25 pm

Maybe you could elaborate on why one might choose to use this platform over some of the other options, namely, hbase, pig, sqoop, hive, etc, etc.

juliet February 10, 2010 at 12:37 am

Forget the problem of an application crash or slow data access or response time for an overloaded SAN switch port with Traverse’s service container that can monitor application response time and correlate that with the underlying storage components which are relevant to that application using its Business Container technology.
http://zyrion.com/solutions/server.php

rdp February 10, 2010 at 6:40 am

RE: “google does internally such as their own file system, mapreduce, server builds, their own switches and routers, their own http server, their own java servlet server” all key components of Enterprise Computing and its BIG brother, Cloud Computing”

I have always been a believer in scalability in computing. To this end I have been trying to decide on a Home/SOHO Cloud Computing design that could be implemented in this lifetime. Remember when “SuperComputing” looked so far out of reach of the little people?
My spin on this draws on Robin’s previous post “Why private clouds are part of the future”. So for a Home Cloud you would need:
“your heavily modified file system, Hadoop (mapreduce), your custom
server builds, your hand picked switches and routers, your own http
server, your own java servlet server”.
You could use COTS (Commercial Off The Shelf) components for the Home
Cloud since bandwidth and throughput will not make the difference
between your making a profit or not and surviving. This means that a new
market for Private Cloud components is developing to supply some of
the Google in-house developed components. “New” in the sense that these components have to be highly configurable (easily modified) to your local environment. A set of switches (hardware/software/firmware), and a roadmap for “dummies” to produce the required feature/function set without an in-house staff of programmers. Auto/Self configuring components are a possibility. It is not “rocket science” to do this.
IMHO,YMMV the rise of Private Clouds is a major shift in the computing paradigm.
I want a good cheeseburger (Information on Demand) with fries and a Pepsi (all “Value Add” Services) every time.
It is interesting to me that Robin has written on every topic required to compete with and even surpass the IDCs on a lesser scale. All the key components are outlined in his posts. Has anyone put them together?

Ryan Garrett February 11, 2010 at 5:42 pm

Very interesting article, Sebastian.

Anyone interested in MapReduce, big data management and big data analytics you should check out the Big Data Summit Bay Area next week in Burlingame, CA. You can register at http://bit.ly/5KUX01. This is the premier conference on data warehousing and big data analytics. Learn how leading companies are leveraging technologies like Hadoop and MapReduce to turn data into dollars. Hear from Aster Data customers like Intuit and Mobclix and leading analysts on new technologies and trends in big data management and advanced analytics.

Roger January 31, 2011 at 5:52 am

You are very good at this!
HadoopDB is focused on structured data (rather than unstructured data). HadoopDB can also be applied to other workloads, but the less structured the data gets, the less useful it will be relative to just using Hadoop.
I hope the project HadoopDB lives on…

Leave a Comment

Previous post:

Next post: