Open source storage array

by Robin Harris on Wednesday, 20 July, 2011

Most business files are only opened a few times, yet remain valuable enough to keep on line, just in case. That cold data is normally stored on high-performance, high-price NAS boxes at $$/GB.

Why?

2 years ago Backblaze, an online backup provider, open-sourced their storage pod design: 45 drives in a box (see Build a RAID 6 array for $100/TB). Now they’re back with v2: 45 3TB drives in a box with higher performance.

Backblaze now has over 16PB of storage pods in production.

Now for the good news
Backblaze isn’t in the box-building business. They designed the storage pod for their backup business and released the plans out of the goodness of their hearts and for the free publicity.

I’ve thought that this could be a viable business for someone who doesn’t want to be the next NetApp or Isilon. Someone happy to build and ship boxes on a cost-plus basis to people who understand and can support a fault-tolerant software layer above the box, but who don’t have time to chase down miscellaneous hardware from vendors who prefer to sell in bulk.

That vendor has emerged: Protocase, the quick-turn enclosure shop that builds Backblaze’s enclosures.

I spoke to Protocase co-founder Doug Milburn – a PhD in mechanical engineering – today. Protocase will announce a complete just-add-drives storage pod: an assembled, tested, software-loaded box. Look for it in 2-4 weeks, priced at ≈$6k. With another $5500 for 3TB drives, it will come in at less than $90 per raw TB.
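As a quick check on that figure, here is a minimal Python sketch of the arithmetic. The prices and drive count are the approximate ones quoted in this post; the function and variable names are just illustrative.

```python
# Approximate figures quoted in the post; names are illustrative only.
POD_PRICE_USD = 6_000      # roughly $6k for the just-add-drives pod
DRIVES_PRICE_USD = 5_500   # roughly $5,500 for a full set of 3TB drives
DRIVE_COUNT = 45
DRIVE_TB = 3


def cost_per_raw_tb(pod_usd, drives_usd, drive_count, drive_tb):
    """Total hardware cost divided by raw (pre-RAID) capacity in TB."""
    raw_tb = drive_count * drive_tb
    return (pod_usd + drives_usd) / raw_tb


per_tb = cost_per_raw_tb(POD_PRICE_USD, DRIVES_PRICE_USD, DRIVE_COUNT, DRIVE_TB)
print(f"{DRIVE_COUNT * DRIVE_TB} TB raw at ${per_tb:.2f} per raw TB")
```

This is raw capacity only; parity and file system overhead are not counted.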

Why no drives? That’s the lion’s share of the cost and also the fastest to decline in price. They don’t need the inventory exposure, and tech-savvy shoppers can probably do better anyway. BTW, Backblaze has had good experience with the Hitachi HDS5C3030ALA630 drive.

The StorageMojo take
This will help energize the private cloud market by reducing the entry price. Amazon and Google don’t use NetApp or EMC. Why should you?

And the savings over renting cloud storage can be substantial as this Backblaze chart suggests:

True, Amazon provides many more services, but if you need petabytes for mini-bucks, this is hard to beat.

Courteous comments welcome, of course. Read about the v2 storage pod at the Backblaze blog post. Or get the shorter version in my ZDnet post “Build a 135TB array for $7,384.”

{ 19 comments… read them below or add one }

David Magda July 20, 2011 at 6:28 pm

Some info on how they set up their RAID and LVM:

At the lowest level there are three RAID groups in each pod. Each RAID group is made of 15 [i.e., 13+2 -- D.M.] drives configured in software RAID 6 with 2 parity drives. This means you can lose 2 drives and the data is entirely safe and intact. If 3 or more drives completely fail simultaneously (not just pop out of the RAID group or power down, but where that drive is lost forever, like it will never power up again) you will lose at least some of the data on that RAID group. Layered on top of the 15 drive RAID group is LVM. The specifics of the LVM config are there is one PV (Physical Volume) spanning the 15 drives, then on top of that is one VG (Volume Group) spanning the same exact 15 drives. Then on top of that are as many LV (Logical Volumes) as it takes to keep each logical volume under the ext4 limit of 16 TB. With 3 TB Hitachi drives, there are 3 separate LV on top of the same exact 15 drives. Finally, there is one ext4 file system sitting on top of each of the LV (one to one with the LV). Disclaimer: I work at Backblaze, but datacenter and pods aren’t my main area of focus.

http://news.ycombinator.com/item?id=2786473
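The layout in that quote can be checked with a little arithmetic. This is a rough Python sketch, not Backblaze's tooling: it takes the 15-drive groups, 2 parity drives, 3 TB drives and 16 TB ext4 limit from the quote, and assumes for simplicity that all sizes are in the same units and that RAID and LVM metadata overhead is negligible.

```python
import math

# Numbers from the quoted comment; names are illustrative.
DRIVES_PER_GROUP = 15
PARITY_DRIVES = 2        # RAID 6
GROUPS_PER_POD = 3
DRIVE_TB = 3
EXT4_LIMIT_TB = 16       # per-file-system limit cited in the quote


def group_layout(drives, parity, drive_tb, fs_limit_tb):
    """Return (usable TB, number of logical volumes) for one RAID group.

    Assumes metadata overhead is negligible and ignores TB vs TiB.
    """
    usable_tb = (drives - parity) * drive_tb
    lv_count = math.ceil(usable_tb / fs_limit_tb)
    return usable_tb, lv_count


usable_tb, lv_count = group_layout(
    DRIVES_PER_GROUP, PARITY_DRIVES, DRIVE_TB, EXT4_LIMIT_TB
)
pod_usable_tb = usable_tb * GROUPS_PER_POD
print(f"per group: {usable_tb} TB usable in {lv_count} logical volumes")
print(f"per pod:   {pod_usable_tb} TB usable across {lv_count * GROUPS_PER_POD} file systems")
```

Under those assumptions each group holds 39 TB after parity, which needs 3 logical volumes to stay under the limit, matching the quote.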

Some other interesting stuff in the HN comments:

http://news.ycombinator.com/item?id=2786066

( Yet Another ) John July 21, 2011 at 5:42 am

I am trying to build a flexible storage array using off-the-shelf components. I will write my own software. What I need is something that can connect 10 to 20 drives to a PC, or a PC with a built-in bus that many drives can plug into. The important thing is the PC should see all the drives INDIVIDUALLY instead of a pool so that my software can manage them according to my needs. Are there such devices readily available at a reasonable price? ( You can see I am cheap, otherwise I wouldn’t be building my own system. ) Thanks for any information.

Andy July 21, 2011 at 9:15 am

Do you have to put 3 TB drives in it or can you fill up v2 with 2 TB drives? Sure, the capacity of a single unit would come down by a third, but the $/TB is less with 2 TB drives. You can buy two 2 TB drives == 4 TB for less than a single 3 TB drive (at least for now).

Robin Harris July 21, 2011 at 9:48 am

Andy, I’ll let Backblaze comment on the 2TB drive issue, though I can’t see why not.

But are you sure 2TB drives are cheaper? I’ve seen 3TB drives for as little as $110 on sale. But you also need to factor in the fixed cost of each drive slot in the pod. Let’s say each slot costs $50 and a 2TB drive is $70 and a 3TB drive is $130. If you add $50 slot cost, you get $120/2 or $180/3 for $60/TB either way. With power and labor being the same, the 3TB drive is cheaper even though it costs almost twice as much.
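The slot-cost argument is easy to play with in a few lines of Python. The prices below are the hypothetical ones from this comment ($50 per slot, $70 for 2TB, $130 for 3TB), not market data, and the helper names are made up for the illustration.

```python
SLOT_COST_USD = 50  # hypothetical fixed cost per drive slot, from the comment


def cost_per_tb(drive_usd, drive_tb, slot_usd=SLOT_COST_USD):
    """Cost per TB once the fixed slot cost is added to the drive price."""
    return (drive_usd + slot_usd) / drive_tb


def break_even_price(other_drive_usd, other_tb, this_tb, slot_usd=SLOT_COST_USD):
    """Price at which a drive of this_tb matches the other drive's $/TB."""
    return cost_per_tb(other_drive_usd, other_tb, slot_usd) * this_tb - slot_usd


two_tb = cost_per_tb(70, 2)
three_tb = cost_per_tb(130, 3)
print(f"2TB drive: ${two_tb:.0f}/TB, 3TB drive: ${three_tb:.0f}/TB")
print(f"3TB break-even price: ${break_even_price(70, 2, 3):.0f}")
```

With these inputs any 3TB price below the break-even makes the larger drive the cheaper choice per TB, before power and labor per slot are counted.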

And if you wait 3 months, they’ll be even cheaper.

Andy July 21, 2011 at 8:40 pm

Wow, I think the cheapest I have seen 3 TB drives is about $140.

In any case, I was merely commenting on the “Industry Sweet Spot” where the raw $/TB is most favorable and which is currently found in 2 TB drives. I wouldn’t be surprised to see the 3 TB drives capture that spot in the very near future.

I expect a 4 or 5 TB drive to be introduced to the market soon which should accelerate that process by taking the premium spot away from the 3 TB drives.

I wonder how much of the “premium price” for newly introduced drives is the result of actual manufacturing costs vs the fact that most of those kinds of drives are first bought for use in large storage arrays where things like power usage, heat generation, and slot prices can greatly influence the buying decision.

Taylor July 22, 2011 at 10:44 am

This is great info and I love that Backblaze are making it open. Their notes on disk failures are interesting. I think even with deployments this large, YMMV still applies. I’ve got > 1000 WD RE3/RE4 1TB drives deployed, idle less than 5% of the time, many of them for 1 to 2 years, and have had maybe 10 failures, so 1%.

In another application, we’ve got a mixed population of 350 Hitachi and WD RE3 2TB drives, under medium write loads for ~1 year, with exactly one disk failure so far.

mother July 22, 2011 at 2:04 pm

I love that they open sourced their design, opened up their components list, and that they are using it to build a good, profitable company. Kudos, Backblaze.

Unfortunately I’m going to have a hard time convincing my workplace that we can build a whole storage system around one of these, even though, of course, we can (though obviously there are many technological hurdles, unless we just drop ZFS on it ;) ) because, even in 2011, I’m still seeing bias against open source software. Some places still want to overpay by 80-500% to get commercial storage, which inevitably is BUILT on open source.

Rocky July 23, 2011 at 10:05 am

The Backblaze v2 blog post has other good info on disk failure rates, costs versus other storage methods (including Amazon S3 at 26x the cost!), and this telling quote:

“If all you need is cheap storage, this may suffice. If you need to build a reliable, redundant, monitored storage system, you’ve got more work ahead of you.”

Correlated drive failures could cause problems even for RAID 6. Consider building these pods with drives from different batches and even different manufacturers.

Filling or emptying this array will take about 7 days running the Hitachis flat out, assuming no other bottlenecks in the system. Backup window? RTO?
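The comment does not say what throughput the 7-day estimate assumes, so this small Python sketch only works backwards from it: it computes the sustained rate implied by moving 135 TB in 7 days. Decimal units (1 TB = 10^12 bytes) are an assumption here, as are the names.

```python
RAW_TB = 135            # 45 x 3TB drives
FILL_DAYS = 7           # estimate from the comment
DRIVE_COUNT = 45
BYTES_PER_TB = 10**12   # assumption: decimal terabytes
SECONDS_PER_DAY = 86_400


def implied_rate_mb_s(capacity_tb, days):
    """Sustained rate in MB/s needed to move capacity_tb in the given days."""
    total_bytes = capacity_tb * BYTES_PER_TB
    return total_bytes / (days * SECONDS_PER_DAY) / 10**6


aggregate_mb_s = implied_rate_mb_s(RAW_TB, FILL_DAYS)
per_drive_mb_s = aggregate_mb_s / DRIVE_COUNT
print(f"aggregate: {aggregate_mb_s:.1f} MB/s, per drive: {per_drive_mb_s:.2f} MB/s")
```

Swapping in a different fill time or capacity shows how the backup window scales.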

TS July 23, 2011 at 11:19 am

Hi, Robin:

It is unfortunate that the Backblaze v2.0 system didn’t correct the design flaws that existed in v1.0. It is just painful to see the component selection and the overall cheapness and wrongness of the design.

1. Syba SATA cards! 3 of them! (What kind of software RAID can survive a 1 HBA failure situation? I am not sure you have Solaris drivers for that card.)

2. i3 processor with 8 GB of non-ECC RAM! Yuk. With 130+TB of storage space, Linux XFS or EXT4 won’t be able to even fit file system metadata in RAM. I know it is a Supermicro motherboard, but seriously, socket 1156? The guy doesn’t understand Intel roadmaps for product segmentation.

3. They still made the same mistake of choosing SATA port multiplier technology over SAS to SATA expander topology.

4. No mirrored boot.

5. No redundant PS.

The whole design is simply a mess. To be honest, I am sure Nufire is a good Newegg shopper, but not a good enough system architect. The Backblaze pod is simply a Thumper using a consumer grade CPU. (Even the Thumper, designed by Andy B., used a SAS to SATA STP topology years ago.) There are better alternatives.

Alternative 1:
Supermicro SYS-6036ST
http://www.supermicro.com.tw/products/system/3U/6036/SYS-6036ST-6LR.cfm
15 bays, dual dual port Westmere-EP controller with ECC RAM, hot swappable, $3K per 15 spindles (45TB)

An even better alternative design is the Facebook Open Compute storage pod: 50 drives, dual active-active controllers using an active-active SAS expander topology. It is hard to find out who OEMs the Facebook Open Compute designs, though.
http://www.theregister.co.uk/2011/06/28/facebook_open_compute_2_preview/page2.html

I personally think the Backblaze pod design is flawed, and overpriced for what it really is, given open market alternatives.

nate July 24, 2011 at 10:11 pm

Amazon and Google don’t use EMC or NetApp, why should you?

For one, support: what happens when (not if) your data system blows up? Unless you’re the one building most of the software that controls it (likely what Amazon and Google do), who are you going to turn to for help? It’s not a switch or a router where you just put in a new piece of equipment and keep going.

How about HDD firmware upgrades? Oh my god I can’t imagine what kind of nightmare that can be. A few years ago I was talking to an Isilon customer who had a couple of their arrays with some buggy disk drives in them that would cause the drive to freeze, and no way to apply firmware updates (I believe this is long fixed now), short of taking the disk out, putting it in a server, booting to DOS and flashing it, rinse & repeat. Many times you don’t need to upgrade firmware, but there are times that you do. On my last big SATA array, for example, we went through at least 3 different firmwares on the majority of disks in the system. I’m sure other systems are similar, but basically support logged into the system and installed the new OS (which included the firmware), then over the next 24 hours or so the drives were individually taken out of service, flashed, and brought online again seamlessly (there were a few hundred of them). We were not suffering any problems at the time that prompted the upgrades, but obviously the vendor/manufacturer thought the upgrade was a good idea in any case. Same goes for HBA firmware, upgraded on several occasions.

A while back my boss’s boss actually suggested we evaluate FreeNAS and Openfiler as potential replacements for our high performance NAS system. The guy clearly had no idea what was going on, did not grasp the concepts of storage in general, the only thing he cared about was trying to make it cheaper by cutting corners that shouldn’t be cut. Pushing upwards of a half a gig of throughput through a 2-node active-active cluster on more than 200TB of raw storage isn’t something you put an Openfiler or FreeNAS to. If you have to ask why, there’s no point in continuing the conversation.

Different products for different needs. By the same token I wouldn’t recommend someone put a $100k storage system as a SOHO file server, when a server with some disks plugged into it will probably do just fine (though if you have a $100k storage system already it may be good to leverage its capabilities).

For two – features. Sure you can go build one/more of these boxes and install an OS on them – then what? put ftp servers on them? nfs? cifs? What about managing multiple systems? Maybe go write your own data management system? Oh maybe install HDFS on them because HDFS solves all problems (yeah been there too).

A lot of work goes into the software that runs those fancy storage systems, and that software usually provides a lot of value. These aren’t appropriate for everyone. But look at Apple even – they apparently bought several (10+ I believe) PB of Isilon storage systems (they seem to have extra $$ laying around). Even at cloud scale Apple saw value in the platform rather than trying to build it themselves.

Extending this further, I know for a fact that both Google and Amazon are massive users of Citrix Netscaler load balancers (which run Intel CPUs), Facebook has a ton (probably literally) of F5 Viprions. With open source proxies and load balancers being out for a while why aren’t they using them? Or have written their own? Both Google and Amazon largely deploy their own access layer switching systems (I suspect they source the same broadcom chips and write some basic software on top of them). I cringe when someone says they use nginx as their load balancer.

I don’t see mention of replication of data between servers, though I hope it is there.

Myself, I recently bought a new server for co-lo (own personal use), in part for off site backups. I looked at the cost to back up 1.5-3TB of data (HD porn takes up a lot of space) from several cloud companies (though not Backblaze, as they didn’t come up in my searches, but I am somewhat wary of someone that advertises something as unlimited), and doing it myself, even with my meager purchasing power and volume measuring in 1, was much better than using any other cloud. My config is pretty basic: single proc, 8GB ECC RAM, single PSU (upgradeable to dual), 4x2TB SAS disks RAID 1+0, 3ware RAID card with BBU and write back cache, ESXi, remote management card with KVM and remote media, co-located at a local bay area hosting facility for around $100/mo (about $3k for the server). It will be my off site backup as well as host my email/web/blog/anything else I can think of.

This pod design certainly looks halfway decent for an organization building something that needs to be really big, provided they have the software development resources to manage it. More often than not I’ve seen PHBs want to leverage the cost structure of the cheap stuff but then not give any internal support resources to it, and somehow expect it to work just as well (I don’t want to be around to see it blow up).

For the other 99% of organizations and customers out there, they are better served with one of the many commercial storage systems on the market, especially if you’re doing random I/O.

Same goes for switches, routers, load balancers, even servers etc.

Erkki July 25, 2011 at 4:48 am

From the article:
“cold data is normally stored on high-performance, high-price NAS boxes at $$/GB.

Why?”

My take: Because in many many businesses it is more important to keep up great values in assets than to get better returns from investments. It is a must in today’s debt driven economy.

However, since I commented, I have to express my gratitude that Robin Harris covered Backblaze’s approach. I find it very refreshing.

@TS
IMHO you fail to understand the whole philosophy in Backblaze design, although nobody is to blame for that, because working out the real philosophy behind the storage pod design takes a lot of guesswork and imagination.

From what I have read, the storage pod is just a storage pod, designed by Backblaze for itself. Backblaze has not published its whole operating philosophy and architecture, which would better reveal why they manage and are successful even with a flawed storage pod design, as you plainly put it. We have to guess the whole picture, and in that sense I feel that saying “you fail to understand” above was a bit harsh, sorry.

As was briefly mentioned in the Backblaze storage pod v1 blog posts, a lot of the smartness, reliability, high availability, parallelism, etc. has been implemented at higher application levels, which apparently aren’t run on the storage pod.

Also, one has to try to guess the characteristics of the storage load the pods handle at Backblaze. One may easily characterize the load as “cold data”. Also the bandwidth and/or RTO for a storage pod, for an LVM volume or for a RAID6 device can not be an issue. Maybe there is some HPC front-end system which takes the backups and restores from clients, slices and dices them, computes parities, and finally lets some backstage storage management system place the n+1 copies on pods at different data centers. This is just guessing, but it makes sense of why the storage pods are optimized for TB per data center space and lack all the common redundancy at the hardware level.

To make this opinion short and simple: it is a mistake to think of the Backblaze storage pod as a general purpose storage solution. Your considerations and alternatives are much better as general purpose storage solutions, at least regarding alternative 1. I should get to know the Facebook open compute storage pod better. Anyway, it is easy to calculate that alternative 1 is 55% worse in TB per data center space than the Backblaze storage pod. Add to that the features that Backblaze (emphasizing Backblaze) probably does not need, like ECC, the horsepower, etc., and the percentage gets even worse, or better for the storage pod in Backblaze’s use.
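One way to reproduce a figure like that 55% is to compare raw TB per rack unit. The Python sketch below is only a guess at the calculation: it assumes the Backblaze pod is a 4U chassis (not stated in this thread) and takes alternative 1 as 45TB in the 3U chassis suggested by its product URL.

```python
def tb_per_rack_unit(raw_tb, rack_units):
    """Raw capacity density in TB per rack unit."""
    return raw_tb / rack_units


POD_RAW_TB = 45 * 3      # 45 drives x 3TB
POD_RACK_UNITS = 4       # assumption: 4U chassis
ALT1_RAW_TB = 45         # 15 spindles, 45TB, as listed for alternative 1
ALT1_RACK_UNITS = 3      # assumption: 3U, read from the product URL

pod_density = tb_per_rack_unit(POD_RAW_TB, POD_RACK_UNITS)
alt1_density = tb_per_rack_unit(ALT1_RAW_TB, ALT1_RACK_UNITS)
shortfall_pct = (1 - alt1_density / pod_density) * 100

print(f"pod: {pod_density:.2f} TB/U, alternative 1: {alt1_density:.2f} TB/U")
print(f"alternative 1 is {shortfall_pct:.1f}% less dense")
```

Under those assumptions the shortfall comes out near the number quoted above.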

What I’m missing in the big picture is why Backblaze is using RAID6 and LVM on the storage pod. There are also interesting questions: how much thought have they put into power usage and power management? What about vibration? Vibration is known to cause spindle storage to malfunction or perform poorly, and at high drive densities, shouldn’t vibration problems show up more easily?

TS July 25, 2011 at 10:14 am

@Erkki

Look, I don’t want to be argumentative, but when you said “fail to understand the whole philosophy in Backblaze design”, I just wanted to add a few statements.

Backblaze “design” isn’t even “design grade”. It is “Newegg legos engineering”. I am a cheap bastard myself, but there is a difference between cheap and dumb. A single dual ported SAS Expander backplane is actually cheaper than multiple (9) SATA multipliers. Instead of letting 5 hard drives go through a single SATA port on a Syba controller, you should have let all of them go through 4x ganged SAS channels (6Gbps per channel now).

The Backblaze “design philosophy” is this:
“Let’s copy the vertical mounting design of the thumper and put the cheapest crap I can find so that Backblaze can convince the venture capitalists that they are technically competent in storage subsystem design.”

Lastly, the difference between v2.0 and v1.0 is that Backblaze can swap out a consumer motherboard and whoops, Hitachi can do 3TB a spindle now. When one of those zippy power supplies goes out, taking out half the hard drives, good luck doing fsck on a 135TB box with multiple 16TB ext4 file systems driven by Sybas.

Eric July 25, 2011 at 12:21 pm

@TS – Agreed, it’s viable for their very narrow application and not as a mainstream storage solution. They should stop comparing their raw costs to storage vendors’ end costs.

When I reviewed it the last time I realized that it had serious problems and it was only for “cost effective, mostly reliable” storage solutions where a layer was written on top to ensure data integrity ( which they gloss over, how much space is lost to the corrective layers? ).

Bob Plankers July 25, 2011 at 3:47 pm

To me, it seems like the best way to use something like these pods would be to run Gluster, or another distributed & replicated filesystem on them, and replicate everything fully between nodes. And yeah, that’s the layer on top that ensures integrity, and also doubles your storage costs.

Yes, there are potentially architectural problems with the setup. The beauty of open source storage, though, is that if it’s not quite the solution you need you can fix it yourself. After all, that’s what the Backblaze guys were doing when they started this whole thing. Personally I think it’d be cool to have a couple of these pods with SAS interconnects, perhaps with some SSD in them in conjunction with the Facebook FlashCache module, front-ended with Gluster.

Jeff Brue July 25, 2011 at 5:42 pm

I have to agree that the Backblaze storage design is poor. The cost to move to server grade components (SAS), when compared to the cost of, say, an Isilon, is really ridiculously low. I’m up to my 78th >100 TB server. So much more intelligence can be put into the software/hardware marriage.

mother July 26, 2011 at 12:53 pm

@TS – They use mdadm to do the RAID, not the software RAID cards.

I like the backblaze stuff. Considering it along with other options for data preservation.

But I’m also considering the option of LSI SAS switches with JBODs, perhaps like the Dataon 60 bay system http://www.dataonstorage.com/dataon-products/6g-sas-jbod/dns-1660-4u-60-bay-6g-35inch-sassata-jbod.html but note that this is all theoretical so far.

ChuckM July 28, 2011 at 12:50 pm

Hi Robin, I found the Backblaze stuff pretty interesting as well having been both at NetApp and at Google and thus being intimately familiar with how each of them ‘do’ storage.

I’ll point out that the Backblaze guys don’t include the cost of the guy who is swapping out replacement drives in their chart (which is included in Amazon’s offering). Not a huge difference but if you’re paying either your hosting facility for some sort of on site drive swap or just hiring a system admin type to do it, you’re looking at somewhere between $50K and $100K/year of expense there. That isn’t part of the slot tax but it is part of the TCO model.

Drives have a 4 – 5% annualized failure rate, so a triply replicated PB built from terabyte drives is 3000 drives, or roughly 120 to 150 drives a year you’ll be replacing (135 at the midpoint). Since drives have transient failures at rates much higher than that, getting down to that number requires some process engineering as well (like auto-reformatting disks that are erroring out).
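For anyone who wants to plug in their own numbers, here is a small Python sketch of that replacement estimate. It uses the figures from this comment (1 PB, triple replication, 1TB drives, 4 to 5% annualized failure rate); the function names are illustrative, and it counts only permanent failures, not the transient ones mentioned above.

```python
def drives_needed(logical_tb, replicas, drive_tb):
    """Drive count for a given logical capacity and replication factor."""
    return logical_tb * replicas // drive_tb


def replacements_per_year(drive_count, annual_failure_rate):
    """Expected permanent drive failures per year at a given AFR."""
    return round(drive_count * annual_failure_rate)


fleet = drives_needed(logical_tb=1000, replicas=3, drive_tb=1)
estimates = {afr: replacements_per_year(fleet, afr) for afr in (0.04, 0.045, 0.05)}

print(f"{fleet} drives in the fleet")
for afr, count in estimates.items():
    print(f"AFR {afr:.1%}: about {count} replacements per year")
```

The 135 figure corresponds to a 4.5% rate, the midpoint of the quoted range.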

thattommyhall October 2, 2011 at 12:31 pm

I have just set up a store selling them in the EU for about the same price as Protocase.

http://www.openstoragesystems.co.uk

The Backblaze pod is the first offering; next we are doing custom machines for Hadoop and storage systems based on Supermicro chassis with a choice of FreeNAS, GlusterFS or Nexenta.

Tom

Nathaniel November 5, 2012 at 10:49 am

I hate to say it, but you’re wrong about Amazon and Google not using NetApp and EMC. I have worked for both companies, and they do have that hardware in their NOCs.
