Enough of Google’s bathtub brew
IBM’s purchase of XIV makes it official: cluster storage is on a roll.
XIV’s website could have been ripped from the webpages of StorageMojo:
. . . enterprise-class storage systems typically comprise proprietary, special-purpose hardware, such as backplanes, shared memory architecture, and disk shelves. Huge amounts of resources are spent on developing and testing these products — with the associated costs passed on to the user. Moreover, special-purpose hardware quickly becomes obsolete, with a long wait time until new-generation processors, switches, and other components are integrated.
It also appears that the Nextra product has done away with RAID 5:
The Nextra system uses innovative RAID-X design, in which each disk is split into small pieces, and each piece is mirrored on a different disk. As a result, when a disk does fail, all disks in the system participate in the rebuild.
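To make that concrete, here's a toy sketch (my own illustration, not XIV's actual algorithm) of how that kind of distributed mirroring spreads a failed disk's rebuild across every surviving spindle:

# Toy model: split each disk into chunks and put every chunk's mirror
# on a different, pseudo-randomly chosen disk. When one disk fails, the
# surviving copies of its chunks are scattered across all the other disks,
# so every disk shares the rebuild work.
import random
from collections import Counter

NUM_DISKS = 12          # hypothetical pool size
CHUNKS_PER_DISK = 1000  # hypothetical chunks per disk

random.seed(42)
mirror_of = {}          # (disk, chunk) -> disk holding the mirror copy
for disk in range(NUM_DISKS):
    for chunk in range(CHUNKS_PER_DISK):
        mirror_of[(disk, chunk)] = random.choice(
            [d for d in range(NUM_DISKS) if d != disk])

failed = 0
load = Counter(mirror_of[(failed, c)] for c in range(CHUNKS_PER_DISK))
print("Chunks each surviving disk contributes to the rebuild:",
      dict(sorted(load.items())))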
They don’t like ILM:
The ILM concept is rendered redundant, saving on ILM-related software license costs, administration efforts, and management attention, and sparing users migration-related downtime and other service issues related to ILM
Payback time?
Moshe Yanai, the executive chairman of XIV, was the chief engineer for the original Symmetrix that EMC used – along with raging incompetence at IBM – to destroy IBM’s lock on enterprise storage. Not only was he the engineer, he also got a percentage of the sales price of each Symm sold, making him a wealthy man.
But when EMC bought Data General to acquire the Clariion storage division, Moshe didn’t like it. After a long fight, CEO Joe Tucci pushed Moshe out of EMC and continued the successful Clariion product line. Moshe went back to Israel and eventually developed the Nextra product for XIV.
Gee, do you think his tie-up with IBM might be aimed at his former employer? A little?
The StorageMojo take
EMC’s Hulk/Maui and IBM’s XIV products are aimed at different parts of the market. IBM doesn’t have a large high-end array business to protect, so the Nextra’s positioning as
A winning new storage paradigm for the enterprise
XIV Ltd., creator of Nextra™, has undertaken to design and produce the next generation of enterprise-class SAN (Storage Area Networks) systems. Nextra was created based on the principle of providing a simple solution for meeting the herculean IT challenges of today and tomorrow.
isn’t the problem for IBM that it is for EMC, desperate to protect the margins and revenue of the Symm line.
But both products are built on (quality) commodity hardware, so if one or the other needs to make mid-course corrections they can do it in software. Positioning Nextra as enterprise storage puts the heat on EMC and the Symm.
It will be interesting to see who gets to a boil first.
Comments welcome, of course.
Robin,
Good coverage, as always. I’ve slipped in my mojo reading…and I’m reminded why I need to RSS your blog!
I still remember how odd it felt to watch a third party get in there and sell storage into IBM shops in the early 1990s…
Robin,
The backend is running over one Gbit Ethernet?
I don’t believe their stated disk rebuild figures. Failure of a 1 TB disk will cause a prolonged ‘replication’ storm… particularly under this scheme… which is based on random distribution of data, over a large number of disks.
Rebuilds will take time or the performance will need to degrade…. while the system runs totally unprotected.
Surely this is not the ‘innovation’ you are looking for.
IBM must be desperate… the irony of the situation wrt Moshe is obvious.
This will never worry EMC… not even Isilon.
Richard,
Ever hear of bonding? There is effectively no limit to the bandwidth that can be achieved using GbE. Using common components, the internal bandwidth of a cluster can easily reach 8 Gb/s times the number of nodes in that cluster.
As for disk load during rebuild: 100 drives using only 5 MB/s of each drive’s bandwidth will rebuild at a total rate of 500 MB/s. This means about 30 minutes to rebuild a 1 TB drive with very little performance penalty. Can anyone beat this?
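For the record, that arithmetic spelled out (a rough sketch using the same assumed numbers):

# 100 surviving drives each give up ~5 MB/s to the rebuild.
drives = 100
per_drive_mb_s = 5
failed_capacity_gb = 1000   # a 1 TB drive to re-protect

total_mb_s = drives * per_drive_mb_s                    # 500 MB/s aggregate
minutes = failed_capacity_gb * 1000 / total_mb_s / 60   # ~33 minutes
print(f"{total_mb_s} MB/s aggregate -> about {minutes:.0f} minutes for 1 TB")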
Hm-m,
Panasas, for one, does fast parallel rebuilds as well.
FWIW, XIV uses two GigE back-end switched networks for redundancy.
But the 15 minute claim? 3 Gbit SATA = 375 MB/s, or 22.5 GB/min, or 337.5 GB in 15 minutes. Seems a tad shy of 500 GB in 15 minutes, let alone 1 TB. Perhaps they “rebuild” a failed drive to 4 other physical drives?
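Spelling that out (and this assumes the full interface rate is actually achievable, which a 7200 RPM SATA drive can’t sustain anyway):

# Single 3 Gbit SATA interface, ignoring encoding overhead.
interface_gbit_s = 3
mb_per_s = interface_gbit_s * 1000 / 8        # 375 MB/s
gb_in_15_min = mb_per_s * 15 * 60 / 1000      # 337.5 GB
print(f"{mb_per_s:.0f} MB/s -> {gb_in_15_min:.1f} GB in 15 minutes")
# 337.5 GB falls short of 500 GB in 15 minutes, let alone 1 TB.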
Anyone?
Robin
From what I see on their website, they claim the 15 min rebuild time for 500 GB disks, not 1 TB ones. But that is still impossible to do with a single 3 Gb interface, not to mention the far more modest real-world transfer rate of a 7200 RPM SATA disk (perhaps 70 MB/s or so). So presumably they do indeed rebuild the failed drive to multiple spares. The tech sheet mentions that net capacity includes 3 disks’ worth of spares.
I think the answer to this riddle is to think “distributed sparing”. Leave a little bit of empty space on every single drive rather than having dedicated spare drives. Thus when rebuilding they get massive bandwidth both for reading and writing. Then the question is perhaps indeed better asked as why the heck does it take as long as 15 minutes?
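A minimal sketch of that distributed-sparing arithmetic (hypothetical parameters, not XIV’s published numbers):

# Reserve a slice of every drive as spare space; a failed drive is then
# rebuilt by reading from, and writing to, all the survivors in parallel.
drives = 120
drive_gb = 500
spare_fraction = 0.05        # spare space reserved per drive (assumed)
per_drive_mb_s = 10          # rebuild bandwidth borrowed per survivor (assumed)

spare_gb = drives * drive_gb * spare_fraction
aggregate_mb_s = (drives - 1) * per_drive_mb_s
minutes = drive_gb * 1000 / aggregate_mb_s / 60
print(f"{spare_gb:.0f} GB of distributed spare space; "
      f"one-drive rebuild in ~{minutes:.0f} minutes")

With numbers anywhere near those, 15 minutes looks conservative rather than impressive.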
BJ,
The interface is 3 Gb/s but a SATA drive has only 30 MB/s of sustained bandwidth.
To avoid performance degradation for other applications they can’t use more than 10–20% of that bandwidth. Perhaps 5 MB/s, as I suggested before.
Doesn’t this distributed block mirroring and distributed spare space look a little like HP’s EVAs?
Here’s how the rebuild works:
http://www.ibm.com/developerworks/blogs/page/InsideSystemStorage?entry=spreading_out_the_re_replication
So yeah, they rebuild a failed drive onto the empty space of many other drives. Also, much of the rebuild traffic could stay inside the bricks, making the interconnect irrelevant.
Anyone have any idea on cost of the Nextra? Particularly interested in $/GB
NFJ..
Yes, it sounds like 3Par’s “chunklets” too
The general RAID-X technology is not new; as pointed out, EVA, 3Par and many others do the same. It’s the overall architecture of the actual ‘boxes’ that’s interesting, with the 1-way functionality in the bricks that makes the difference. Once something has been ‘cloned’ by the RAID, any underlying functions needed are simple to implement. The internal ‘fabric’ may be GigE today, but we have plans in that area.
Robin,
What is this new definition of ‘quality’ commodity boxes… a power-hungry, general-purpose Intel server reference design, made in China… a new marketing term?
Also, their cabling to externally connected switches must be a clutter, and the sheer number of power supplies and fans doesn’t add much to overall system reliability.
It is very clear that such ‘commodity’ general-purpose hardware, even when built for Dell, HP or IBM, is based on limited production runs and constantly ‘new’ chipsets, and becomes obsolete on a yearly basis… i.e. an unsupportable, throw-away replacement mentality, with all of the implications re OS, BIOS, drivers, etc.
It is easier & cheaper to design/build a high-quality, reliable x86 “controller” box, with exactly the hardware functionality and minimum power consumption needed to do the job… much like some of the x86 telco or computing ‘blade’ infrastructure hardware.
One would think that IBM understands this…. even if Moshe & team do not.
RAID-1 protection is both a heavy penalty and a thin armor in a system that professes to scale out to hundreds or thousands of disks. Isilon’s N+M protection scheme seems much more suitable to address expected MTBF rates while providing a scalable file system on top.
Robin,
After taking a good look at their product, it seems that they use a ‘purpose-built’ 15-disk JBOD-style chassis and probably front-end the disks with a commodity Intel motherboard mounted at the rear of the chassis… so it is not the usual 6-disk commodity-motherboard approach as with (say) Google, where the disks are supported by the on-board chipset.
James… we all know about bonding. As noted above, this seems to be an x86 motherboard with a 15-disk backend interface, supporting the 1 Gbit iSCSI protocol, with totally random seek patterns to the disks (a design criterion). They will get around 5 MB/s per disk… but it is cheap, obvious… and probably a ‘good enough’ solution for the time being.
A question… what happens if we lose a complete 15-disk chassis?
It contains a single motherboard, not designed for hot swap. What are their repair logistics & the expected time to completely re-sync the mirrors?
Surely they don’t expect the customer to swap the chassis & disks. Let’s see how it self-heals here.
Another question… the system is based on a minimum configuration of 160 disks. Does this mean that the stripe width is fixed? What is required as a “minimum disk expansion increment”, in terms of hardware?
Probably the reason a large number of disks is needed from the get-go is so the system can’t start out with a bad config given the algorithms used, and it probably also needs to make sure no single brick going down can affect the entire system; so, EMC Centera-like, everything on a brick has to be distributed among the other bricks.
Just my gut feeling, and how I’d do it…
3Par, Compellent and others are pretty similar.
D
Richard, Dimitris,
There is one spare “data module”. Copies of the 1 MB blobs are distributed across data modules. In the event of a data module failure, the spare is brought into action, and all the data is already available on the other modules. Background tasks re-balance the 1 MB blobs when the bad module is replaced.
Robin,
Yes, a spare box is an obvious solution. So how long will it take to rebuild & rebalance the contents of 15 disks while maintaining reasonable performance levels… probably a day… while the system runs unprotected?
They obviously need 120 disks to establish a fairly wide stripe width in order to guarantee some level of performance. What is their stripe width?
Their seek patterns are random and not a lot can be done to improve I/Os per second with cache algorithms (the disks already implement ‘elevator’ seeks). So… all they have left is brute-force caching, and this explains the very large cache in each disk enclosure.
The Nextra datasheet shows that the cabinet contains 8 disk enclosures, i.e. their stated 120-disk minimum configuration. Does this mean that one of the disk enclosures is a ‘spare’?
If so, then with the remaining 105 disks mirrored it seems that they have a ‘stripe width’ of 52 disks, with (say) 1 spare disk.
Some spare space is required on each disk… so the usable space is around 40% of the array capacity… and the same goes for wasted power.
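Rough arithmetic behind that ~40% figure (a sketch using the 500 GB drives, the 120-disk cabinet and the single spare module mentioned above; the 5% distributed-spare allowance is my guess):

drives = 120
drive_gb = 500
spare_module_drives = 15     # one spare data module
spare_fraction = 0.05        # assumed distributed spare space per disk

raw_gb = drives * drive_gb                                   # 60 TB raw
mirrored_gb = (drives - spare_module_drives) * drive_gb / 2  # RAID-1 halves it
usable_gb = mirrored_gb * (1 - spare_fraction)
print(f"~{usable_gb/1000:.1f} TB usable of {raw_gb/1000:.0f} TB raw "
      f"({usable_gb/raw_gb:.0%})")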
It all seems to be ‘brute force’ engineering… it is hard to be impressed.
Hi Robin,
If IBM’s smart, they’ll move fast to finally make inroads on EMC’s turf! Nextra’s SATA drive architecture will give them a cost advantage.
IBM (reach) + XIV (technology) is the magic that provides an edge for this version of the technology over all the others.