Last week my friend and fellow storage blogger Marc Farley took me to task for posting “Desktop RAID is a bad idea.” He wrote in his EqualLogic blog post:
Two posts ago I sent my props to Robin Harris (Storagemojo), now I’m feeling the bloggers biteback. Robin – what are you doing? . . .
My heartburn is with Robin for implying that Jon’s numbers might be relevant for enterprise storage and implying (by reference) that RAID is a problem too. Geez, Robin, you know better. I’m not even going to address the RAID thing because that is so wrong.
Server RAID is a problem
Never one to leave well enough alone, I have to respond to Marc: RAID is a problem. A problem that is getting worse every day. It is a problem for servers and it is a growing problem for enterprise IT.
Don’t get me wrong: RAID works for protecting data and improving storage performance. The problem is that the concept was developed almost 20 years ago and is showing its age.
Why RAID is a growing problem
There are three general problems with RAID:
- Economic
- Managerial
- Architectural
RAID costs too much
Look at any mid-range or larger RAID array of 10 terabytes. It prices out, today, at about $80,000 to $120,000 or more. Call it $10k per terabyte. How much do the disks cost? Good-quality SATA drives may be had for $500/TB, so the raw capacity accounts for only about 5% of the array’s price. Use top-of-the-line “enterprise” drives and it’s maybe 10%.
This is just silly. With cheap power, cheap packaging, and some front-end RAM you could have three complete copies of all your data – and lots of fast reads and writes – for 25% of the cost.
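To make the arithmetic explicit, here’s a back-of-envelope sketch in Python. The $100k array price, the $500/TB SATA figure, and the $10k packaging allowance are the round numbers from above, assumed for illustration rather than quoted from any vendor:

```python
# Back-of-envelope economics for a 10 TB mid-range array,
# using the round numbers cited above (not vendor quotes).

array_price = 100_000        # call the $80k-$120k range $100k
capacity_tb = 10
sata_per_tb = 500            # good-quality SATA, $/TB

disk_cost = capacity_tb * sata_per_tb
print(f"Array price per TB: ${array_price / capacity_tb:,.0f}")
print(f"Raw disk cost:      ${disk_cost:,} "
      f"({disk_cost / array_price:.0%} of the array price)")

# Hypothetical alternative: three complete copies on cheap SATA plus an
# assumed $10k allowance for power, packaging, and front-end RAM.
copies = 3
packaging_and_ram = 10_000
alternative = copies * disk_cost + packaging_and_ram
print(f"Triple-copy build:  ${alternative:,} "
      f"({alternative / array_price:.0%} of the array price)")
```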
Management is based on a broken concept
Why do we manage disks? We don’t manage main memory: virtual memory systems do that for us. With RAID we don’t manage disks either; we manage LUNs – groups of disks virtualized to look like one big disk. Useful, until you have as many LUNs as you once had disks. Now what?
Much of the popularity of NAS, IMHO, is because you manage files instead of disks and the data is network accessible.
Parity RAID is architecturally doomed
Vendors have made much of the limits of single-parity RAID – at least the vendors with dual parity to sell – because as disk capacity grows you are much more likely to encounter an uncorrectable read error during a rebuild. Their answer: another costly parity disk. Good if you sell disks, less so if you buy them. But that’s the small problem.
The big problem with parity RAID is that I/O rates are flat as capacity rises. 20 years ago a 500 MB drive could do 50 I/O per second (IOPS), or 1 IOPS for every 10 megabytes of capacity. Today, a 150 GB, 15k drive, the ne plus ultra of disk technology, is at 1 IOPS for every 750 MB of capacity. Big SATA drives are at 1 IOPS per several gigabytes. And the trend is down.
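A quick script makes the trend concrete. The 500 MB and 150 GB figures are from the paragraph above (the 15k drive’s 200 IOPS follows from them); the big-SATA line is an assumed illustration:

```python
# IOPS per unit of capacity for the drives discussed above.

drives = {
    # name: (capacity in MB, random IOPS)
    "500 MB drive, ca. 1987": (500,      50),
    "150 GB 15k RPM drive":   (150_000, 200),
    "750 GB 7200 RPM SATA":   (750_000,  80),   # assumed big-SATA example
}

for name, (cap_mb, iops) in drives.items():
    print(f"{name}: 1 IOPS per {cap_mb / iops:,.0f} MB of capacity")
```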
With parity RAID, those precious IOPS get squandered on read/modify/write cycles. After a disk failure more get squandered rebuilding the data, effectively cutting that rank’s performance in half. This isn’t sustainable over the next decade.
The StorageMojo take
Yes, Marc, I do think server RAID is a problem. Not an omigawd-stop-buying-RAID problem – although some people have – but an oncoming train wreck that the industry needs to think about averting. In the 1980s DEC and IBM were very happy to make huge margins selling disk mirroring software. Then RAID came along and moved those margins into RAID controllers. Now RAID is coming to a crossroads. I suspect a new group of innovators will take the lead in the new paradigm.
BTW, be sure to check out EqualLogic’s products. A bunch of my readers really like them. One big plus: all software comes bundled with the system, so they don’t nickel-and-dime you to death. That’s pretty revolutionary itself.
Comments welcome. I’m off to NAB and Storage Networking World this week, so stay tuned for posts on more cool stuff.
Robin, I think it’s a little simplistic to blame RAID for the drive makers’ failure to keep the ratio of IOPS to capacity consistent over the years. We’re still using the same protocol to store and retrieve data whether the (SCSI/SATA) devices are directly attached or behind a RAID controller. Perhaps what we need is an improved version of SATA/SCSI that doesn’t present a “normal” view of a HDD but is optimised for reads and writes because it knows it is behind a RAID or other type of controller.
Oh, my – where to begin?
RAID costs too much? When you can get mature open-source software implementations for free? Especially as I suspect that you’re just itching to discuss an alternative like ZFS, which is itself, of course, a software implementation…
Or perhaps you meant that *good hardware RAID implementations* cost too much. I think that Marc did an adequate job of defending their pricing – sophisticated NVRAM optimizations, rigorous drive-level and system-level testing, etc.
In anything resembling a free market, things *never* ‘cost too much’: they cost what they’re worth, or occasionally even less during market-share wars. If buyers didn’t feel that current hardware RAID products were worth what they cost, their prices would come down (there’s clearly at least some room for them to, and indeed some activity on this front with the increased acceptance of ATA disk technology). One could perhaps more reasonably suggest that industry innovation (i.e., competitiveness) has been sluggish in providing lower-cost, at least equally effective alternatives – but that’s a rather different issue.
Management is based on a broken concept? I think not: RAID requires very minimal management – just give it another disk once in a while if it needs one (or if you need more space) and it will take care of the rest, handling the mapping from one view of storage to another just as transparently as virtual-memory systems do for memory.
Don’t confuse support for the LUN abstraction with anything RAID-specific: it’s just a way of continuing support at the RAID level for a pre-existing SCSI single-disk abstraction. Or, if you were referring to using LUNs in a complex multi-RAID array to segregate access and/or allow software striping across multiple internal RAIDs instead of having a single large, somewhat more failure-prone RAID, then (again) this is not a characteristic of RAID itself.
RAID per se simply virtualizes a single-disk abstraction to encompass multiple cooperating disks for increased capacity and/or performance and/or availability. As such, it does so easily and effectively. So perhaps you really meant to suggest that the interface between higher-level software like file systems or databases and the underlying storage (*any* underlying storage – this is hardly specific to RAID) is awkward. If so, don’t blame RAID for supplying what its consumers demand (and demanded long before RAID as such even appeared): blame the consumers for not having adjusted such that they could support something better (because until that happens, anyone developing a better RAID interface won’t have any market at all – which is usually a significant impediment to such innovation from below).
Parity RAID is architecturally doomed? Rubbish – especially as (again) you confuse more general limitations with something specific to parity RAID.
Disks have indeed increased significantly in size to the point where the likelihood of encountering an unreadable sector during reconstruction is far greater than it once was. But that’s just as true for mirroring as for parity RAID, save for a small constant (1 for a mirror copy vs., say, 4 for a 5-disk RAID-5 array): if we’ve been able to tolerate the 2 to 3 orders of magnitude increase in the probability of encountering such rebuild errors as disk sizes grew over the past couple of decades, a mere factor of 4 seems pretty well down in the noise.
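To put a number on that constant, here’s a minimal sketch assuming hypothetical 500 GB drives and the commonly quoted 10^-14 unrecoverable bit-error rate, with errors treated as independent (a standard simplification, not a claim about any particular drive):

```python
# Chance of hitting at least one unreadable sector during a rebuild.
import math

def p_ure(bytes_read, ber=1e-14):
    """Probability of >= 1 unrecoverable read error in bytes_read."""
    # 1 - (1 - ber)^bits, computed stably for tiny ber
    return -math.expm1(bytes_read * 8 * math.log1p(-ber))

disk = 500e9                 # hypothetical 500 GB drives

mirror = p_ure(disk)         # mirror rebuild reads 1 surviving copy
raid5  = p_ure(4 * disk)     # 5-disk RAID-5 rebuild reads 4 survivors

print(f"Mirror rebuild: {mirror:.1%} chance of a URE")
print(f"RAID-5 rebuild: {raid5:.1%} chance (~{raid5 / mirror:.1f}x the mirror)")
```

That ratio is the ‘small constant’ in question: roughly 4x, versus the orders of magnitude we’ve already absorbed.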
Or, to look at it another way, if the risk of this is excessive for RAID-5, it’s really not all that much less for RAID-1, and *neither* will remain adequate for long. So you’ll be left with the choice between using 3 copies of each datum (for 3x the cost of the underlying storage) or using double-parity RAID (for 1.1x – 1.2x the cost of the underlying storage: double-parity-protected stripes can be quite a bit wider and still retain significant MTTDL advantages): some workloads may be sufficiently performance-critical to justify the expense of the former for all their storage, but most installations will find the latter attractive for at least some of their storage.
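The cost comparison reduces to simple arithmetic; the 14-disk stripe below is just one example of the ‘quite a bit wider’ double-parity stripes I mean:

```python
# Raw-storage overhead of the protection schemes compared above.

def overhead(data_disks, redundant_disks):
    """Raw disks purchased per disk's worth of usable capacity."""
    return (data_disks + redundant_disks) / data_disks

print(f"3 full copies:         {overhead(1, 2):.2f}x")   # 3.00x
print(f"5-disk RAID-5:         {overhead(4, 1):.2f}x")   # 1.25x
print(f"14-disk RAID-6 stripe: {overhead(12, 2):.2f}x")  # 1.17x
```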
There is, of course, also evidence that background disk-scrubbing decreases the probability of encountering bad sectors when you least want to by an order of magnitude or more – but this is applicable to all uses, of course, not just to parity arrays.
Your objection to I/O rates is equally flawed, since IOPS/GB have decreased just as much for single-disk or mirrored configurations as they have for parity RAID. Again, if we’ve been able to tolerate a 2-3 order of magnitude decrease in this value over the past couple of decades, tolerating a mere factor of 2 *only* during degraded read operation seems laughably minor by comparison (especially since in large configurations striped across many small RAID-5 arrays only a small percentage of the data will see *any* such degradation when a disk in one small array fails).
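For a sense of scale, a sketch with an assumed pool of twenty 5-disk RAID-5 arrays:

```python
# Fraction of data running degraded after a single disk failure in a
# pool striped across many small RAID-5 arrays (assumed configuration).

small_arrays = 20            # twenty 5-disk RAID-5 arrays
degraded = 1 / small_arrays  # only the failed disk's array rebuilds

print(f"{degraded:.0%} of the data sees degraded reads; "
      f"{1 - degraded:.0%} is unaffected.")
```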
Your suggestion that “those precious IOPS get squandered on read/modify/write cycles” is of course true only for small-write-intensive workloads (full-stripe writes incur no such penalty). Even for small writes, the impact is hardly major (again, compared to the several-orders-of-magnitude decrease in IOPS/GB we’ve tolerated over the last couple of decades): at worst, a small write must first read the preexisting data and parity (two parallel reads, including seek and rotational latency, on separate disks) and then update both (two parallel writes on the same two disks, but without seek latency, though a full rotation is required). One of the reads is often unnecessary, since the original data is still in the disk’s cache. In any event, while using RAID-5 for small-write-intensive workloads may be inadvisable if performance is absolutely critical, there are plenty of workloads where either performance is just *not* that super-critical or small writes just aren’t that frequent, in which case RAID-5 clearly remains a good choice (with the more expensive RAID-1 capacity reserved for workloads that actually *need* it).
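Here’s what the penalty works out to at the rank level, assuming drives that each deliver 200 random IOPS (an illustrative figure, not a benchmark):

```python
# Effective small-write throughput of one 5-disk rank.

drive_iops = 200             # assumed per-drive random IOPS
disks = 5
rank_iops = drive_iops * disks

# Worst-case small RAID-5 write: 4 disk I/Os (read old data and parity,
# write new data and parity). Mirrored write: 2 I/Os. Full-stripe
# writes amortize parity and pay no penalty.
print(f"Raw rank IOPS:           {rank_iops}")
print(f"RAID-5 small writes/sec: {rank_iops // 4}")
print(f"Mirrored writes/sec:     {rank_iops // 2}")
```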
One reason why parity RAID may become at least somewhat relatively less attractive in the future is increasing use of off-site replication, where mirroring between sites is usually the only practical approach. Even then, though, if your data is sufficiently important to be mirrored off-site, it may well be sufficiently important that mirroring alone is insufficient protection (or may have sufficiently high availability requirements that the probability of finding the remote copy unavailable after a local disk or sector has failed is unacceptable) – so using local parity RAID as well as remote mirroring may wind up the most cost-effective choice (use RAID-5 at both sites and you can tolerate the failure of any 3 disks without data loss at little more storage cost than just having the two sites mirrored entails, and without the added performance impact of RAID-6). Or you could use 3-way mirroring when performance is more important and you only feel the need to tolerate the loss of any 2 disks.
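The ‘any 3 disks’ claim is easy to verify by brute force; a small sketch, assuming a 5-disk RAID-5 at each of the two mirrored sites:

```python
# Exhaustive check: two mirrored sites, each a 5-disk RAID-5,
# survive ANY 3 simultaneous disk failures.
from itertools import combinations

disks = [(site, n) for site in "AB" for n in range(5)]

def site_ok(failed, site):
    # Single parity tolerates at most one failed member per array.
    return sum(1 for s, _ in failed if s == site) <= 1

def data_survives(failed):
    # Data is lost only when BOTH sites lose their array.
    return site_ok(failed, "A") or site_ok(failed, "B")

assert all(data_survives(f) for f in combinations(disks, 3))
# Some 4-disk failures (2 at each site) do lose data:
assert not all(data_survives(f) for f in combinations(disks, 4))
print("Any 3 failures: data survives; some 4-disk failures are fatal.")
```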
That pretty much covers my objections, I think. To sum them up, the ‘oncoming train wreck’ that you describe has nothing whatsoever to do with RAID per se, but with the interface between storage (RAIDed or not) and its higher-level clients. ZFS has indeed provided a glimpse of A Better Way to manage this interaction – though I question some of their specific design details (the RAID-Z approach that you describe glowingly elsewhere squanders disk utilization – *every* small write writes an entire stripe, tying up N disks for an I/O apiece, and, worse yet *so does every subsequent read of that data*) and will note that they still use mechanisms similar to RAID-1, RAID-5, and now RAID-6 internally to achieve availability.
By the way, recent Newegg pricing for reputable desktop SATA drives has nearly reached $0.25/GB, with little sign that it’s yet hit rock-bottom. I got my last PATA drive – a 160 GB Seagate – for $0.125/GB, but that was after rebate, whereas Newegg’s prices are out-the-door ones, and last time I checked their prices for ‘near-line’ allegedly more enterprise-quality 7200 rpm SATA drives weren’t a great deal higher.
– bill
And for anyone who feels that I have not already run on at more than sufficient length:
In writing the above I was, I admit, thinking mostly about *file system* use of storage, where the existing lower-level storage interface actively impedes significant potential optimization (as ZFS helps illustrate) and where I see the most potential for dramatic improvement (given that the entire storage stack that lies beneath a file or object abstraction can be rethought).
However, even within the confines of that existing traditional block-level interface there are also possibilities for improvement, and products like Xiotech’s ‘Magnitude’ and perhaps also Compaq/HP’s ‘EVA’ have been exploring them for quite a few years now – primarily by replacing traditional RAID’s static, algorithmic mapping between externally visible logical array addresses and internal disk addresses with an explicit internal mapping that allows more flexible variation in the number and size of disks and more flexible use of the space on them – with considerably less management pain as well. For that matter, throw in HP’s innovative AutoRAID product from the ’90s, though AFAIK it never quite achieved the recognition it might have (did its implementation fail to live up to its potential?).
I still think of such products as ‘RAID’ (well, perhaps ‘RAID V2.0’), but though they share a common abstraction (use of redundancy in the form of mirroring or parity across disks to achieve increased availability) they certainly don’t fit the 1988 definition at the detail level. Perhaps this is more what you had in mind (and I just stumbled upon an article from last month that may as well – http://searchstorage.techtarget.com/columnItem/0,294698,sid5_gci1246543,00.html).
– bill
As an end user, I don’t like RAID, and I don’t get the price tag either.
Server:
I think a clustered file system makes more sense, even if you keep 2 or 3 copies of the same data. Databases can be kept on the cluster as well.
Google proved that this works even with a lot of data, and that it can be fast enough.
Now hopefully such software will get cheap enough to be widely used. Or Google might release their GFS as open source…
Desktop:
You shouldn’t keep your important data there anyway, so performance-wise a RAID 0 on SAS is pretty much all you need (RAID 0 to present 3 SAS drives as 1 TB…).