MapReduce and its open-source implementation, Hadoop, are parallel data analysis tools. A few lines of code can drive massive data reductions across thousands of nodes.

Cool.

Powerful though it is, Hadoop isn’t a database. Classic structured data analysis of the model/load/process type isn’t what it was designed for.

That’s where the paper HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads (pdf) comes in. Written by Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz and Alexander Rasin (the first four at Yale, the last at Brown), the paper proposes a method for building an open-source, massively scalable, shared-nothing, analytical parallel database on commodity hardware.

What it is
HadoopDB coordinates SQL queries across multiple independent database nodes using Hadoop as the task coordinator and network communication layer. It uses the scheduling and job tracking of Hadoop while it intelligently pushes much of the query processing into the individual database nodes.
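To make the idea of pushing query work down to the nodes concrete, here is a minimal Python sketch. It is not HadoopDB code: in-memory SQLite databases stand in for the independent node databases, and the names make_node, local_query, run_query and the visits table are all hypothetical. Each "node" runs the aggregation locally in SQL and hands back key-value pairs; a reduce-style step then merges the partial results.

```python
import sqlite3
from collections import defaultdict

def make_node(rows):
    """Create a stand-in for one node's independent database (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn

def local_query(conn, sql):
    """Hypothetical per-node interface: run SQL locally, yield key-value pairs."""
    for key, value in conn.execute(sql):
        yield key, value

def run_query(nodes, local_sql):
    """Map side: each node aggregates in SQL. Reduce side: merge partial sums."""
    totals = defaultdict(float)
    for conn in nodes:
        for key, value in local_query(conn, local_sql):
            totals[key] += value
    return dict(totals)

# Two toy nodes, each holding a slice of the data (made-up rows).
nodes = [
    make_node([("a.com", 1.0), ("b.com", 2.0), ("a.com", 3.0)]),
    make_node([("b.com", 4.0), ("c.com", 5.0)]),
]
LOCAL_SQL = "SELECT url, SUM(revenue) FROM visits GROUP BY url"
result = run_query(nodes, LOCAL_SQL)
print(result)
```

The point of the design is that the GROUP BY runs inside each database, so only small partial aggregates cross the network rather than raw rows.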

There are four components to HadoopDB.

  • Database Connector. Each node has its own independent database. The connector is the interface between the database and Hadoop’s task trackers. A MapReduce job supplies the Connector with an SQL query and other parameters. The Connector executes the SQL query on the database and returns the results as key-value pairs. It can be implemented to support a variety of databases.
  • Catalog. The information needed to access the databases and metadata such as cluster data sets, replica locations and data partitions is kept in the catalog.
  • Data loader. The data loader is responsible for two jobs. First, it executes a MapReduce job over Hadoop that reads the raw data files and partitions them into as many parts as there are nodes in the cluster. Second, it loads the partitions into the local file system of each node and chunks them according to a system-wide parameter.
  • SQL to MapReduce to SQL planner. The planner provides a parallel database front end to enable SQL queries. The planner transforms the queries into MapReduce jobs and optimizes the query plans for efficiency. This is the secret sauce of HadoopDB.
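The data loader's two steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's implementation: the hash-on-key partitioning scheme, the function names partition and chunk, and the rows-per-chunk interpretation of the system-wide parameter are all hypothetical.

```python
import zlib

def partition(records, num_nodes, key=lambda r: r[0]):
    """Step 1: split records into one part per node (assumed: hash on a key)."""
    parts = [[] for _ in range(num_nodes)]
    for rec in records:
        # crc32 gives a hash that is stable across runs, unlike built-in hash().
        idx = zlib.crc32(str(key(rec)).encode("utf-8")) % num_nodes
        parts[idx].append(rec)
    return parts

def chunk(part, chunk_size):
    """Step 2: cut a node's partition into chunks of at most chunk_size rows."""
    return [part[i:i + chunk_size] for i in range(0, len(part), chunk_size)]

# Made-up input: ten (key, value) records spread over a three-node cluster.
records = [(f"user{i}", i) for i in range(10)]
parts = partition(records, num_nodes=3)
chunks_per_node = [chunk(p, chunk_size=2) for p in parts]

for node_id, chunks in enumerate(chunks_per_node):
    print(node_id, chunks)
```

Because the partition index depends only on the key, records with the same key always land on the same node, which is what lets later queries do much of their work locally.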

HadoopDB complements the Hadoop infrastructure and does not replace it. Analysts have both available as needed.

Heterogeneity
A key issue for Internet-scale systems is the ability to run in a heterogeneous environment where multi-year build-outs and rolling node replacement are the norm. That means that some nodes will be faster than others. HadoopDB breaks the work down into small tasks and moves them from slow to fast nodes automagically.
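Why small tasks help on mixed hardware can be shown with a toy simulation. The sketch below is a simplification, not Hadoop's actual scheduler: it assumes a pull model in which whichever node is free next takes the next task, and the function name schedule, the node names and the speeds are all made up for illustration.

```python
import heapq

def schedule(num_tasks, node_speeds, task_work=1.0):
    """Simulate pull-based scheduling: each free node takes the next small task."""
    # Heap of (time the node becomes free, node name); all nodes start idle.
    free_at = [(0.0, name) for name in sorted(node_speeds)]
    heapq.heapify(free_at)
    done = {name: 0 for name in node_speeds}
    finish = 0.0
    for _ in range(num_tasks):
        now, name = heapq.heappop(free_at)       # next node to become free
        end = now + task_work / node_speeds[name]
        done[name] += 1
        finish = max(finish, end)
        heapq.heappush(free_at, (end, name))
    return done, finish

# Hypothetical cluster: one node twice as fast as the other.
done, finish = schedule(num_tasks=30, node_speeds={"fast": 2.0, "slow": 1.0})
print(done, finish)
```

In this run the fast node ends up with twice as many tasks as the slow one, and both finish together. Had the 30 tasks been split evenly up front, the slow node would have taken 15 time units while the fast node sat idle for half of that.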

Results
The authors ran some benchmarks on Amazon’s EC2 to test performance. The HadoopDB load times were about 10x those of Hadoop, but the higher query performance of HadoopDB usually justified the longer setup time.

The authors found that HadoopDB was able to approach the performance of parallel database systems while running on much lower cost hardware and free software. Given the youth of the project, one can expect higher performance as improvements are made.

The killer app for private clouds?
MapReduce and Hadoop are already in wide use among Internet-scale datacenters. As companies begin to understand and correlate social media, web activity and ad response rates, the demand for large-scale parallel database processing will grow. But will they want to ship it out to Amazon?

Depending on the quantity and sensitivity of the data, many organizations may prefer to keep the processing in-house. Private scale-out Hadoop clusters may become the poor company’s data warehouse of choice.

The StorageMojo take
HadoopDB is more science project than commercial tool today. Yet the project demonstrates the feasibility of using scale-out compute/storage clusters for work that today typically requires proprietary high-end scale-up system architectures.

If capital costs are reduced by two-thirds with a commodity/FOSS architecture, companies could afford to hire the expertise required to make it work. The free software/paid support model should prove quite successful in this space.

Courteous comments welcome, of course.