Scale-out storage and Hadoop are a great duo for working with masses of data. Wouldn’t it be nice if a Hadoop cluster could also be used for more mundane storage tasks, like block storage?
Well, it can. Some Silicon Valley engineers have produced a software front end for Hadoop that adds an iSCSI interface. The team had 3 goals:
- Create an iSCSI volume as an HDFS file
- Make it interoperate with native iSCSI Initiators on Windows and Linux
- Deliver performance comparable to common NAS appliances
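To make the first goal concrete, here is a minimal sketch of how block addresses could map onto a single backing file. It is not the team's implementation: the 512-byte sector size, the 128 MB HDFS block size, and the names `lba_to_extent`, `hdfs_blocks_touched` and `FileBackedVolume` are all assumptions for illustration, and an in-memory buffer stands in for the HDFS file. HDFS itself does not do in-place random writes, so a real target would need an extra layer that this sketch ignores.

```python
import io

SECTOR_SIZE = 512                      # assumed sector size shown to the initiator
HDFS_BLOCK_SIZE = 128 * 1024 * 1024    # assumed HDFS block size; set per cluster


def lba_to_extent(lba, sector_count, sector_size=SECTOR_SIZE):
    """Translate an LBA range into (byte offset, byte length) in the backing file."""
    if lba < 0 or sector_count <= 0:
        raise ValueError("lba must be >= 0 and sector_count must be > 0")
    return lba * sector_size, sector_count * sector_size


def hdfs_blocks_touched(offset, length, hdfs_block_size=HDFS_BLOCK_SIZE):
    """Return the indices of the HDFS blocks that a byte range overlaps."""
    first = offset // hdfs_block_size
    last = (offset + length - 1) // hdfs_block_size
    return list(range(first, last + 1))


class FileBackedVolume:
    """Hypothetical block volume stored in one seekable file-like object."""

    def __init__(self, backing, size_bytes, sector_size=SECTOR_SIZE):
        self.backing = backing
        self.size_bytes = size_bytes
        self.sector_size = sector_size

    def read(self, lba, sector_count):
        offset, length = lba_to_extent(lba, sector_count, self.sector_size)
        if offset + length > self.size_bytes:
            raise ValueError("read past end of volume")
        self.backing.seek(offset)
        data = self.backing.read(length)
        # Regions never written read back as zeros, like a sparse volume.
        return data.ljust(length, b"\x00")

    def write(self, lba, data):
        if len(data) % self.sector_size != 0:
            raise ValueError("writes must be a whole number of sectors")
        offset, length = lba_to_extent(
            lba, len(data) // self.sector_size, self.sector_size
        )
        if offset + length > self.size_bytes:
            raise ValueError("write past end of volume")
        self.backing.seek(offset)
        self.backing.write(data)


if __name__ == "__main__":
    volume = FileBackedVolume(io.BytesIO(), size_bytes=1024 * 1024)
    volume.write(4, b"A" * SECTOR_SIZE)
    print(lba_to_extent(4, 1), volume.read(4, 1)[:4])
```

The second helper shows why request alignment matters: a read that straddles an HDFS block boundary may touch two blocks, which can live on different data nodes.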
The payback is that clients get a robust, resilient, scale-out infrastructure, at commodity hardware prices. Even small iSCSI arrays can’t compete, assuming, of course, that you’ve got a Hadoop cluster.
Performance?
Hmm-m-m, turning a distributed file system built for huge sequential files into a block device. What could go wrong?
Performance could suck, for one. Surprisingly, though, the untuned prototype delivers disk-level performance.
With optimization it could likely do 25%-50% better. Better yet: put the iSCSI daemon on each node and your bandwidth grows with your cluster.
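A back-of-the-envelope model of that scaling claim, as a sketch only: the function names and the per-daemon throughput figure are hypothetical, not measurements from the prototype. It assumes each node runs its own iSCSI daemon behind its own network link and that clients spread their sessions evenly across nodes.

```python
def link_mb_per_s(gigabits_per_s):
    """Raw link rate in MB/s (decimal units, no protocol overhead)."""
    return gigabits_per_s * 1000 / 8


def aggregate_target_mb_per_s(nodes, daemon_mb_per_s, nic_gbps=10):
    """Upper bound on total target bandwidth with one iSCSI daemon per node.

    Each node is capped by the slower of its daemon and its network link.
    """
    per_node = min(daemon_mb_per_s, link_mb_per_s(nic_gbps))
    return nodes * per_node


if __name__ == "__main__":
    # 100 MB/s per daemon is a made-up placeholder, not a test result.
    for nodes in (3, 6, 12):
        print(nodes, aggregate_target_mb_per_s(nodes, daemon_mb_per_s=100))
```

This is an upper bound: it ignores HDFS replication traffic between data nodes, and any single client is still limited by its own network link.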
Testing
The team has tested the prototype with Hadoop on Ubuntu and a standard Windows 7 client. Everything is off the shelf: the Windows 7 client, a namenode and 3 data nodes, all on a 10Gb Ethernet network.
The test payload includes a single 2 GB binary file; 25.2 GB (~4,200 files) of ~5 MB JPEGs and a few 10+ MB MPEGs in many subfolders (JPEGs and MPEGs don’t compress much); and 10,000 1 KB text files.
Here are the team’s results on the 25GB J/MPEG test. Note that zero is on the graph’s right side and incoming data is the blue line.
And the results for copying 2 streams of J/MPEGs plus the 2GB binary:
The StorageMojo take
So, are we on the verge of creating the scale-out iSCSI market niche? That’s Marketing 101: create a niche and dominate!
Thankfully, no. iSCSI target mode on Hadoop is clearly a feature that should be incorporated into a larger product. And that’s why the engineers contacted StorageMojo.
They’d like to sell or license their IP and software prototype to a company looking to differentiate their product – customers love options – and expand their use cases with a speedy block service.
If you’re interested, please contact StorageMojo by sending mail. After some diligence I’ll put you in touch with the team.
Courteous comments welcome, of course. Readers, what say you? Does it broaden Hadoop’s appeal or meh?
Um, both Ceph and Gluster support being iSCSI targets and handle Hadoop uses. How is this different besides being new? Using Ceph’s RBD for both is a well-tested solution now.
It’s not a new idea, but as the prior poster noted there are lots of ways to skin this cat without going all in on Hadoop and without losing the ability to run Hadoop.
I’ve messed around with having VMware virtualised Hadoop present iSCSI back to the VMware hosts to mount yet more storage from. Convoluted, but it worked, and it was fast enough that I forgot I was going through so many layers of abstraction.