The virtual machine I/O blender

by Robin Harris on Wednesday, 23 July, 2008

I’m at the SNIA Symposium this morning. Hence the short post.

What is the impact of virtual machines on I/O?
Engineers have spent decades optimizing the OS, drivers, caching, controllers and disks for specific workloads.

Observed behaviors such as locality of reference have informed many strategies, like read-ahead.

A smear
But when you put 25 virtual machines on a single server, what happens to all this hard-won empiricism? It’s gone.

Each of the 25 machines may have predictable I/O behavior. But together all those I/O patterns smear together. One I/O may have nothing to do with the next 10.
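The smearing effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: each VM issues a purely sequential stream inside its own region of the disk, and the streams are interleaved in random order. The function names, region size and I/O counts are hypothetical, chosen for illustration.

```python
import random

def sequential_fraction(trace):
    """Fraction of requests whose block address directly follows the previous one."""
    if len(trace) < 2:
        return 0.0
    hits = sum(1 for prev, cur in zip(trace, trace[1:]) if cur == prev + 1)
    return hits / (len(trace) - 1)

def blended_trace(num_vms, ios_per_vm, region_size=1_000_000, seed=0):
    """Interleave purely sequential per-VM streams in random order.

    Assumption: every VM reads sequentially inside its own disjoint region,
    and the order in which VMs get to issue an I/O is uniformly random.
    """
    rng = random.Random(seed)
    next_block = [vm * region_size for vm in range(num_vms)]
    remaining = [ios_per_vm] * num_vms
    active = list(range(num_vms))
    trace = []
    while active:
        vm = rng.choice(active)
        trace.append(next_block[vm])
        next_block[vm] += 1
        remaining[vm] -= 1
        if remaining[vm] == 0:
            active.remove(vm)
    return trace

single = sequential_fraction(blended_trace(1, 1000))
blended = sequential_fraction(blended_trace(25, 1000))
print(f"1 VM: {single:.2f} sequential, 25 VMs: {blended:.2f} sequential")
```

With one VM every request follows the previous one. With 25 interleaved VMs, only a small fraction of adjacent requests are sequential as seen from below the hypervisor, even though each individual stream is perfectly predictable.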

Fast and stupid
That puts a premium on stupid, but fast storage. Storage that doesn’t think about what you may be trying to do because you aren’t trying to do anything. A jumble of VMs is doing it.

The StorageMojo take
The “stupid vs smart” network debate has been around for decades. In storage we’ve always taken it for granted that smart is better. But now?

Not so much.

Comments welcome, of course. If you are running a lot of VMs what I/O issues have you noticed?

{ 26 comments… read them below or add one }

Jeff Darcy July 23, 2008 at 6:47 am

Do all the I/O patterns smear together? Are all 25 VMs really accessing the same piece of the same LUN simultaneously? In most cases, I’d doubt it. In the cases where they are, the situation’s no worse than if those 25 were on physically separate hosts but still banging on the same array. There are also likely to be things that the VM executive can do to identify separate I/O streams more accurately. As I’ve written elsewhere, adaptive readahead has yielded some impressive results that seem to make the effort worthwhile. Many of those approaches are more difficult to implement in a highly virtualized environment, but difficult is not the same as impossible. “Fast and stupid” does work in some cases, but I wouldn’t be predicting a seismic shift in how customers evaluate storage just yet.

Nik Simpson July 23, 2008 at 7:07 am

For FC, this is why a well-designed NPIV implementation is so important: it gives each VM its own unique WWPN so that array caching strategies work again.

For iSCSI it’s less of a problem, particularly if the VM is using its iSCSI initiator over its own virtual connection (with distinct MAC and IP addresses).

To be honest, I don’t think this is a problem that array designers should be trying to solve. The solution lies higher up the I/O stack, and it’s much easier to do it there than to redesign arrays or throw away useful technologies like read-ahead caching.

Erik July 23, 2008 at 7:11 am

Let’s put it this way. I had no I/O problems until I got it in my head to virtualize. Then I learned waaaay more than I wanted to about latency, i/o elevators, IOPS, etc etc.

Typically what I’d see happen is that the host would run out of i/o bandwidth and kernel panic. If that didn’t happen (separate the vm’s from the storage controller via nfs) the vm’s would get read starved for too long and start experiencing scsi timeouts. Very unpleasant.

Interestingly the VM platform seems to play a significant role in the I/O bottlenecking. Moving from vmware server to xen and vmware esx opened things up for me a lot.

Wes Felter July 23, 2008 at 8:41 am

SAN and NAS controllers have dealt with multiple simultaneous clients from the beginning, so I don’t see how this is different.

Bill Todd July 23, 2008 at 10:00 am

Where sharing the same storage among 25 virtual servers is a problem, any competent hypervisor should allow them to carve up the physical server storage capacity into disjoint sets of disks and use them just as they would have running on independent hardware (modulo the continued necessity of sharing physical server bus and memory bandwidth).

That said, when the servers are not completely saturating their private storage with requests there are efficiencies to be obtained by having them share the storage since they all get to spread out their requests across a far larger set of disks than they would have had independently and the aggregate load tends to become far more uniform (i.e., each request gets to benefit from the fact that not all other virtual servers are requesting data at the same time).

Contrary to your assertion, this puts a premium on *smart* storage. Whereas an independent server box can coordinate efficient storage use within its single operating system, no such coordination exists among multiple independent OS instances (unless it’s provided at the hypervisor level, which might be reasonable) and thus if the storage doesn’t do it, nobody does. (This, incidentally, is hardly a new development: exactly the same problem applies when multiple independent servers share a large external array – and such arrays use the same kinds of intelligence to optimize such use.) The storage array can (and definitely should) still identify sequential access patterns and perform read-ahead: even if the individual OSs are attempting to issue explicit read-ahead requests they don’t have control over the individual drives at that level to reorder their queues appropriately; the array cache (if sized somewhere nearly that of the total cache that the servers would have had if operating separately) can be considerably more efficient in handling temporal locality (in a manner at least somewhat analogous to the smoothing out of requests mentioned earlier: combining independent caches and workloads increases overall throughput as long as the caches are sufficiently intelligent – this may be even more important if physical server RAM was not increased to allow each virtual server to have the same system cache available that it would have had independently); and even some of the availability mechanisms can be realized more efficiently in an aggregate setup than they can be in multiple independent ones (larger RAID groups, fewer total spares, etc. without sacrificing robustness).

Hell, if you don’t include sufficient smarts in the storage to prioritize requests from different servers (when this is desirable), unless the hypervisor handles this (which still doesn’t address the case where multiple independent servers share the same external array) your high-priority operations can get swamped by low-priority requests regardless of how fast your ‘dumb’ storage may be.
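As a rough sketch of the kind of prioritization described here, the queue below always dispatches the most urgent pending request first and keeps arrival order within a priority level. The class name and the priority numbering (lower number means more urgent) are assumptions for illustration, not any array's or hypervisor's actual interface.

```python
import heapq
import itertools

class PriorityIOQueue:
    """Dispatch pending requests by priority, FIFO within a priority level."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that preserves arrival order

    def submit(self, priority, request):
        # Assumption: a lower number means a more urgent request.
        heapq.heappush(self._heap, (priority, next(self._arrival), request))

    def next_request(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

queue = PriorityIOQueue()
for i in range(3):
    queue.submit(5, f"batch-read-{i}")   # low-priority backlog arrives first
queue.submit(0, "redo-log-write")        # high-priority request arrives last

dispatch_order = [queue.next_request() for _ in range(len(queue))]
print(dispatch_order)
```

Without some ordering of this kind at the shared layer, the log write would wait behind the whole batch backlog however fast the underlying devices are.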

All these kinds of issues apply similarly to a single server with a heterogeneous workload, and how to handle them within the OS is well understood. When the storage ceases to be under the direct control of such a single OS, the answer is not to throw up your hands in despair and simply shove fast hardware at the problem in the hope that it will be adequate, it’s to push those already well-understood facilities down into the shared storage where they can continue to do their job effectively. So, as I already observed, you’ve got it backward: when the storage is under the control of a single OS it can afford to be dumb (because all the smarts can be in the OS – and all other things being equal that’s the best place for them, because they’re closer to the point of use), but the more the storage is shared between independent entities the smarter it needs to be to achieve optimal performance.

– bill

Anders Gregersen July 23, 2008 at 12:15 pm

It’s a topic that is often discussed in the Danish VMUG. Most IO is seen as random IO and the number of SAN “clients” is an order of magnitude higher than in environments before virtualization. Some optimization is possible but it will increase complexity (one VM for one or more LUNs). It’s also important to buy SANs that serve the purpose. Netapp has some features that are great for VDI (Virtual Desktop Infrastructure), and I’m not a Netapp customer, whereas other SANs lack these features and will require a lot more capacity to accomplish the same task at the same speed. Some SANs like Equallogic apply their “smarts” at the block level, optimizing for RAID level (deciding where to place the data based on reads and writes on the available RAID types). Perhaps the storage companies will move the “smarts” from the storage system to the virtualization layer instead, selling add-on components for the available virtualization platforms.

Ryan Malayter July 23, 2008 at 2:52 pm

We’ve seen high-IO tasks on one VMware ESX VM cripple IO performance on other VMs hosted on the same VMFS volume. This isn’t surprising, really, as there is no IOPS prioritization possible in ESX 3.5 at the VM level (except for changing the queue depth on a virtual disk, which isn’t QoS so much as rate limiting). You can do some reservation of IO bandwidth, but that is not really the bottleneck for most environments.

So we have most VMs in their own VMFS volume, which makes provisioning a hassle, but still much easier than physical provisioning.

Ultimately, with virtualization, everything becomes a random IO. So lots of cheap dumb spindles is probably the right architecture. Let the Hypervisor take care of IO prioritization (which most cannot do yet, especially in a clustered scenario).

Brainy July 24, 2008 at 1:04 am

@Erik: Can you elaborate the difference between ESX and Xen?

Steve Jones July 24, 2008 at 3:55 am

Shared storage arrays always have a problem with disentangling I/O patterns from multiple servers in order to optimise low level device access (like read-aheads). The simple fact is that the larger the number of active hosts you have, the more difficult it is to distinguish the pattern and the closer it looks to random behaviour. However, some things still work fine – if there’s a large number of small, sequential writes performed (such as is typical of an Oracle redo log) then the array is still able to cache the writes, join up all the bits, and do the stage-out as a much more efficient single, large write (or an efficient RAID-5 full stripe write). Roll-up write optimisation like this, or constant block re-writes, still works fine given sufficient non-volatile cache. Predictive read-ahead is the most difficult.
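The roll-up of small sequential writes can be sketched as a merge of adjacent extents held in cache. This is a simplified illustration, assuming writes are described by hypothetical (offset, length) pairs in bytes; a real array would also track the data itself, cache residency and stripe boundaries.

```python
def coalesce_writes(writes):
    """Merge (offset, length) writes that touch or overlap into larger extents."""
    merged = []
    for offset, length in sorted(writes):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # This write starts inside or right at the end of the previous
            # extent, so extend that extent instead of staging a new one.
            start, prev_len = merged[-1]
            end = max(start + prev_len, offset + length)
            merged[-1] = (start, end - start)
        else:
            merged.append((offset, length))
    return merged

# Hypothetical redo-log style traffic: small writes, mostly back to back,
# arriving slightly out of order, plus one unrelated write elsewhere.
redo_log_writes = [(0, 512), (512, 512), (1024, 512), (8192, 512), (1536, 512)]
staged = coalesce_writes(redo_log_writes)
print(staged)
```

Five cached writes are staged out as two: one 2048-byte extent and the unrelated 512-byte write.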

To the array, it doesn’t really matter if it’s virtual or physical hosts (although, arguably, if the array sees the multiple hosts then it could, in theory, use more sophisticated logic to detect sequential read patterns – that’s more difficult if the VM abstraction layer hides the VM from the array). If the hypervisor is “hiding” the virtual hosts from the array, then it’s possible for it to do some of this work – like pre-fetch and IOP ordering – although I’ve no idea if hypervisors do this in practice.
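A toy detector shows why visibility of individual hosts matters for sequential detection. This is a sketch, not how any particular array works: the host IDs, block numbers and the "extends the previous block by one" rule are all assumptions made for illustration.

```python
def sequential_hits(requests, per_host):
    """Count requests that extend a known stream by exactly one block.

    requests: list of (host_id, block) tuples in arrival order.
    per_host: if True, remember the last block separately for each host_id;
              if False, treat all traffic as one undifferentiated stream.
    """
    last_block = {}
    hits = 0
    for host_id, block in requests:
        key = host_id if per_host else "all"
        if last_block.get(key) == block - 1:
            hits += 1
        last_block[key] = block
    return hits

# Two hypothetical guests, each reading sequentially, strictly alternating.
requests = []
for i in range(100):
    requests.append(("vm-a", i))
    requests.append(("vm-b", 50_000 + i))

visible = sequential_hits(requests, per_host=True)
hidden = sequential_hits(requests, per_host=False)
print(visible, hidden)
```

When the detector can tell the two guests apart it recognizes almost every request as sequential. When both arrive under one identity, it recognizes none.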

On a more general point, VMs have a fundamental problem in that they can, in effect, see elongated service times if they are scheduled out by the hypervisor when the I/O terminates. For example, if a redo write (to a cached array) takes 1ms yet the hypervisor doesn’t let the guest have any CPU for 5ms then the I/O will appear to take 6ms. Unfortunately the one thing that cannot be virtualised is real-time events. This effect was well-known when mainframe domains and virtual machines were introduced in the 1970s and 80s. It was called “I/O elongation” and could have a severe effect on throughput and response times.
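The arithmetic in the example above, and its effect on a writer that issues one synchronous I/O at a time, can be worked through as follows. The function names are hypothetical, and the single-outstanding-write model is an assumption made to keep the sketch simple.

```python
def elongated_service_time_ms(device_ms, wait_for_cpu_ms):
    """Service time as the guest sees it: device time plus time spent descheduled."""
    return device_ms + wait_for_cpu_ms

def max_sync_ios_per_second(service_time_ms):
    """Upper bound for a writer that waits for each I/O before issuing the next."""
    return 1000.0 / service_time_ms

native = elongated_service_time_ms(1, 0)   # 1 ms write, guest runs immediately
virtual = elongated_service_time_ms(1, 5)  # same write, guest waits 5 ms for CPU

print(native, max_sync_ios_per_second(native))
print(virtual, max_sync_ios_per_second(virtual))
```

Under these assumptions the apparent service time goes from 1 ms to 6 ms, and the ceiling for a serialized log writer drops from 1000 to roughly 167 writes per second.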

The view we have is that virtual machines are not suitable for really high I/O activity systems, as contention within the hypervisor has a considerable impact on throughput. Due to the non-deterministic behaviour of the hypervisor, it is inherently unsuited to applications which must respond to real-time events in a timely manner.

For consolidating high activity systems, such as databases, it is much more efficient to develop a proper service based model using shared physical machines. Then the operating system and database system are much better able to schedule I/O than a hypervisor as they can be far more work-aware.

Amy July 24, 2008 at 10:53 am

I can’t say VM’s are my area of expertise and maybe I’m totally missing something, but why not just use very fast storage in the physical servers that are hosting your VM’s? From everything I’ve been reading lately there are a whole lot more choices coming from server vendors like HP and Sun.


Bill Todd July 24, 2008 at 10:59 am

Good point about the storage being able to improve its pattern detection capability if it’s able to resolve individual servers (even better would be the ability to resolve individual request streams on each server, but this would require that the server provide a unique stream ID with each request – not something which I think the current standard interfaces support). The virtual server (which *is* in a position to evaluate individual streams for patterns) can also potentially help out by performing explicit read-ahead requests in groups (or as a single extended request for sequential patterns when the data should be contiguous on disk, at least up to the point that it’s willing to tolerate single lengthy transfers rather than break them up such that the portion not immediately required can be deferred or canceled if a higher-priority request comes in), causing them to appear in close proximity at the storage array (save at breaks between groups) – just as it might choose to do when accessing individual local drives.

But I’m not sure that the problem you describe with ‘elongated service times’ is really I/O-specific: it sounds (especially given a hypervisor that allows priority-ordering of its virtual servers) more like the result of overloading the common processing/memory bandwidth (because the only reason that a virtual server wouldn’t be able to execute as soon as its I/O request completed should be because some other virtual server was using the processor: even among equal-priority virtual servers time-slice quanta should only apply when someone else is ready to run), which mostly suggests an insufficiently robust set of processing capabilities for the aggregate load placed on them (unless the decision to trade performance for lower hardware cost was an intentional one).

That said, I agree that there’s still a potential trade-off between the ability to schedule as close to the point of use as possible and placing a hypervisor between that point and the underlying (not just storage) hardware, though for things like batch-style (as contrasted with, say, interactive) processing it may often work just fine.

– bill

InsaneGeek July 24, 2008 at 12:29 pm

Is the problem really about cache algorithms, and is it a full blender? Does the locality of reference truly change that much?

At least with the VMWare ESX servers that I’ve played with, people normally create “thick” virtual disks, which pre-allocate the virtual disk (thin provisioning/grow on-demand changes things). So if an array proactively caches some data (the drive head is here, why not grab the next couple of bytes) I’d venture there is just as good a chance that a virtual machine will come back and request that same data as if it were on a different lun. If you are preallocating the virtual disks then locality of reference should stay close to the same, as within a short period of time (the next time that virtual machine is scheduled on CPU) it would ask for that cached data. I would definitely agree that it will be harder for the algorithms but I believe that in general it wouldn’t cause a massive headache.

I wonder if some of the performance problems are less about cache algorithms having a much harder time and more about people just not spreading the load across as many spindles. Before, it was very easy for any admin to spread things out (dedicated storage guy or not): put one host on raidgroup 1, the next on 2, the next on 3, and so on. Very obvious. Now people are putting multiple guests onto the same size raidgroup and expecting it to perform the same as it did when it was spread across 5x. If the storage admin has spread the one lun out across lots of disks it shouldn’t be a big deal, but if the storage admin just carved up a larger lun from a single raidgroup… well, you are going to run into problems. That’s where I see the most issues inside the array coming from (config of the host itself is another ball of wax). While some people may think that this makes things really difficult for the storage admin, I don’t really consider it that difficult: whether it’s a bunch of virtual machines making up a total of 1000 iops or a single database making 1000 iops, at the end of the day what you care about is that it needs 1000 iops, so provision accordingly.
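The provisioning point can be put as a small calculation. This is a back-of-the-envelope sketch: the figure of 150 IOPS per spindle and the split of 40 IOPS per guest are assumptions picked for illustration, not measurements, and RAID write penalties are ignored.

```python
import math

def spindles_needed(total_iops, iops_per_spindle):
    """Smallest number of spindles whose combined IOPS covers the demand."""
    return math.ceil(total_iops / iops_per_spindle)

PER_DISK_IOPS = 150           # assumption: a rough per-drive random IOPS figure

vm_iops = [40] * 25           # hypothetical: 25 guests at 40 IOPS each
database_iops = [1000]        # hypothetical: one database at 1000 IOPS

for_vms = spindles_needed(sum(vm_iops), PER_DISK_IOPS)
for_database = spindles_needed(sum(database_iops), PER_DISK_IOPS)
print(for_vms, for_database)
```

Both workloads total 1000 IOPS, so both call for the same number of spindles under these assumptions. Where the demand comes from matters less than making sure the LUN is spread across enough disks to meet it.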

Jason July 24, 2008 at 3:22 pm

VMware is great at managing CPU and memory but not disk. I hope that one day VMware will manage disk in the same way it manages CPU and memory, that is, VMotion a VM when disk IO hits a certain threshold.

Bill Todd July 24, 2008 at 10:43 pm

Amy –

The old racing adage “Speed costs money: how fast can you afford to go?” applies to storage as well: the fastest non-volatile storage is battery-backed RAM, but it’s just not within the budget of most users – nor, for that matter, are the fastest disk drives, even though they cost orders of magnitude less (and it would not be wise to count on currently over-hyped technologies like flash and ZFS to address the issue, though both may eventually help with some aspects of it).

But there are far less costly ways to achieve performance for many workloads than brute-force storage speed, and that’s what we’ve been discussing here (‘smart’ storage really means ‘smart use of the underlying dumb storage by taking advantage of specific characteristics of the workload to achieve far better performance than would otherwise be the case’) – specifically, whether such optimizations remain effective as the storage gets increasingly removed (in terms of software layering) from the application.

– bill

John Lane July 25, 2008 at 7:57 am

The biggest difference we see (from a storage perspective) between a virtualized vs. non-virtualized environment is the amount of data and the number of systems that sit on shared storage. In a non-virtualized environment, we see the OS sit on the physical server while important apps and data sit on the shared storage. In virtualized environments, we see ESX/Xen (or other hypervisor software) installed on the physical server. All of the VMs (OS, apps, and data) sit on the shared storage.

So 100 Virtual Machines does not equal 100 Physical Machines. And the storage requirements go up significantly, but most users do not realize this by themselves. Sometimes customers run into disk thrashing when they implement virtualization without considering this difference.

I find that “smart” storage is the best approach. Personally, I like Pillar’s approach to this problem with their QOS features, but there are solutions from other vendors that we recommend too.

Joe Kraska July 26, 2008 at 4:00 pm

“I can’t say VM’s are my area of expertise and maybe I’m totally missing something, but why not just use very fast storage in the physical servers that are hosting your VM’s…”

VMWare environments typically use a shared storage environment so that they can effectively do automated load balancing: move the vm from server to server in response to point loads on particular servers.

Problem is, basically, that shared storage loads with large VMWare systems are… intractable unless you have a pretty good budget.

The last I looked our own VMWare environment had 475 virtual machines in it. I cringe every time I look, actually. We have plenty of memory and compute. However, our SAN/enterprise NAS is lacking.

This costs a lot, mon.

BTW, a strategy we have been taking to lately is moving some of the heavier IO hogs to direct attached storage, as you say. This has some disadvantages.

BTW, at least one vendor combines direct attached storage with a clustered iSCSI SAN, pretty much with the specific purpose in mind of solving the VMWare disk deployment problem: LeftHand networks.

(Not an endorsement, we don’t have).

Joe Kraska
San Diego CA

Rick White July 26, 2008 at 4:06 pm

First off I’ll admit I’m biased…

Why add a lot of expensive RAM just to avoid going to disk? Worse yet, why add a bunch of disks to make up for the fact that they’re so slow compared to the processors they’re supposed to serve? And does it really make sense to add another “smart” software layer to your system? Isn’t that just more complexity to manage? Is it because this is how it’s always been done?

I understand the industry had to kluge a lot of these pieces together over the years but it boggles my mind that we want to keep doing it. I suppose some folks aren’t prepared for the impact Enterprise Flash Drives are about to have on the way we architect our storage systems for performance. In my opinion a system with an Enterprise Flash solution from one of several server vendors will run virtualized environments just fine. I appreciate “smart” software, new network layers and the old tried and true methods but sometimes you just can’t beat the simplicity of massive brute force from a single drive.

Bill Todd July 26, 2008 at 9:02 pm

The answers to your questions are pretty straight-forward:

It makes sense to add RAM up to the point of (economically) diminishing returns when performance is important (and if it’s not important, this entire discussion is not applicable). For read-dominated workloads, the RAM needn’t be redundant (as long as it has good error detection such that the very rare instances of bad data can be re-fetched from disk) and thus may be very price-competitive with the kind of flash drives that you’re assuming will shortly become available, and for all workloads if the RAM can be placed in the host access to it will be far faster than RAM (let alone flash) out on a storage bus (not to mention being able to be used more flexibly for other purposes when that’s more desirable than dedicating it to cache).

The reason to use more disks (rather than the kind of flash that you assume will shortly become available) is because it’s far cheaper when you’ve got a great deal of data to store: as I suggested earlier, few have the luxury of infinite budgets, and for most this trade-off will continue to make sense. You don’t then have to go through the exercise of segregating data with high performance requirements (and reshuffling when the portion of your data with such requirements varies over time) – and may even not require any more disks than are necessary to house your dataset (because if you disperse the hot data fairly uniformly across the spindles the number of spindles required to hold the entire dataset may also be sufficient to provide acceptable access performance to the hot portion).

Of course it makes sense to add intelligence to increase performance: every single portion of the system does this in one way or another. Yes, it’s complex, but it has also become pretty well understood over the past decades: if you’re afraid of the complexity it’s because you’re not a specialist in the area (but the people who actually design and implement it are, and while they’re still only human suggesting that complexity should therefore be avoided in this area because of the possible risk while leaving the rest of the system rife with it is just silly).

No, it’s not “because this is how it’s always been done” – it’s because this is what’s been demonstrated empirically over the years to be effective. And when someone improves upon it, those improvements get incorporated (after they’ve proved themselves: data is too important to be experimented with cavalierly).

We’re not talking about kludges here: we’re talking about economically extracting the most from what we’ve got to work with (not that bad a definition for engineering in general). If your mind is boggled, it’s because it’s untutored in the subject.

The reason that “some folks aren’t prepared for the impact Enterprise Flash Drives are about to have on the way we architect our storage systems for performance” is because a lot of them (some of whom probably are far better acquainted with the subject than you are) just don’t agree with your assessment. Even those (one might suspect rare) cases that beat heavily upon a relatively small dataset may well find their access patterns sufficiently cacheable that a modicum of cache will provide comparable performance at a lower cost than placing the entire dataset on ‘Enterprise Flash’ (which if experience is any guide will not be priced cheaply), and even workloads with intense random-update activity can be handled by ‘write-anywhere’ approaches that convert the accesses to far more efficient bulk-sequential writes with reasonable latency.
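The ‘write-anywhere’ idea mentioned above can be sketched as an append-only log plus a map from logical block to its latest position. This is a minimal illustration with hypothetical names; it leaves out garbage collection of stale copies, persistence of the map, and everything else a real implementation needs.

```python
class WriteAnywhereLog:
    """Turn random logical-block updates into sequential appends."""

    def __init__(self):
        self.log = []         # physical layout: records appended in arrival order
        self.block_map = {}   # logical block -> position of its latest copy

    def write(self, logical_block, data):
        position = len(self.log)              # always the next sequential slot
        self.log.append((logical_block, data))
        self.block_map[logical_block] = position
        return position

    def read(self, logical_block):
        return self.log[self.block_map[logical_block]][1]

store = WriteAnywhereLog()
# Random-looking logical addresses, including a rewrite of block 17.
positions = [store.write(block, f"v{n}")
             for n, block in enumerate([9000, 17, 4242, 17])]
print(positions, store.read(17))
```

However scattered the logical addresses are, the physical positions come out strictly sequential, and a read of a rewritten block returns its most recent version.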

Fast non-volatile storage with no moving parts is nothing new, and while flash *may* be poised to reduce its cost significantly it’s not at all clear that this reduction (coupled with flash’s limitations) will be sufficient to change the storage landscape dramatically (my own guess is that flash may earliest prove useful for relatively small sequential workloads with critical response requirements like database logs, though even here fronting a small dedicated array of conventional disks with a small NVRAM cache can work just as well, making such a flash approach primarily suitable for installations too small to be using such external storage). We’ll just have to stay tuned and see.

– bill

Rick White July 27, 2008 at 1:08 am

Bill said, “We’re not talking about kludges here: we’re talking about economically extracting the most from what we’ve got to work with (not that bad a definition for engineering in general). If your mind is boggled, it’s because it’s untutored in the subject.”

Bill I apologize if I frustrated you when I used the term, “kludge” and I hope I won’t offend you if my untutored mind has some fun with math…

Let’s start by creating a hypothetical SAN and we’ll call it the Model-40 for kicks. I would suspect it could do about 24,998 SPC-1 IOPS, sustain 204 MB/s bandwidth, with a 24,180-microsecond response time, taking up 38 rack units of space and a total ASU capacity of roughly 8,466 GB. Now I wonder what a SAN like this would cost…

Total TSC Price: $448,435
SPC-1 Price-Performance: $18/SPC-1 IOPS
Price per GB: $53

Now let’s create a hypothetical blade server and let’s say this blade server could hold…I don’t know, let’s say it could hold 32 Enterprise Flash Drives in a single blade chassis. I would suspect this system could do well over 1,000,000 IOPS, sustain 3,990 MB/s bandwidth, with a 357-microsecond response time, taking up 10 rack units of space and a total ASU capacity of roughly 10,208 GB. I wonder what a storage system like this would cost…

Total TSC Price: $340,230
SPC-1 Price-Performance: $0.35/SPC-1 IOPS
Price per GB: $33
*Including a fully configured sixteen-blade server.
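The price-performance figures in the two lists above are simple ratios and can be recomputed from the quoted totals. One assumption is needed: “well over 1,000,000 IOPS” is taken here as exactly 1,000,000, since no precise figure is given.

```python
def price_per_iops(total_price, iops):
    return total_price / iops

def price_per_gb(total_price, capacity_gb):
    return total_price / capacity_gb

# Figures copied from the comment above.
san_price, san_iops, san_gb = 448_435, 24_998, 8_466
flash_price, flash_gb = 340_230, 10_208
# Assumption: "well over 1,000,000 IOPS" is treated as exactly 1,000,000.
flash_iops = 1_000_000

san_per_iops = price_per_iops(san_price, san_iops)
san_per_gb = price_per_gb(san_price, san_gb)
flash_per_iops = price_per_iops(flash_price, flash_iops)
flash_per_gb = price_per_gb(flash_price, flash_gb)

print(f"SAN:   ${san_per_iops:.2f}/IOPS, ${san_per_gb:.2f}/GB")
print(f"Flash: ${flash_per_iops:.2f}/IOPS, ${flash_per_gb:.2f}/GB")
```

The SAN ratios round to the quoted $18 per IOPS and $53 per GB, and the flash price per GB rounds to $33. At exactly 1,000,000 IOPS the flash figure comes to about $0.34 per IOPS, so the quoted $0.35 presumably rests on a slightly different IOPS number.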

If you’re like me you’re probably thinking, “Hmm? Disks are expensive when you aggregate them for performance.” Of course when it’s your only option you make the best of it, but the market is changing and it won’t be our only option for long. I would suspect the server and OS vendors are working on some amazing alternatives right now.

However I do agree that disk arrays make sense for archival storage. But in my untutored and biased opinion I believe in the next few years it will be considered fiscally irresponsible to continue aggregating disks for performance (both CAPEX and OPEX). Of course this is just my untutored and probably ignorant opinion but it should be fun to watch as the market evolves no matter what ends up happening.

Anonymous July 27, 2008 at 6:33 am

Rick White, VP of Marketing for Fusion io, is misrepresenting the facts. Flash has not yet reached the capacity or cost per gigabyte of disk, nor has it reached the speed of RAM. Maybe it will develop to the point where it can displace one or the other some day, but not this day. Flash will continue to be yet another layer in the memory/storage hierarchy, perhaps a revolutionary one but still one adding and not subtracting complexity – both internally (with non-trivial flash translation layers etc.) and externally (with more staging between hierarchy levels). Until latency reaches *zero* read-ahead and write-behind have the potential to be useful with any possible technology. The uninformed can make all the guesses they want about how this or that technology *might* work with virtual machines, but those opinions mean nothing without data and less than nothing when the people stating them fail to disclose their interest in a particular conclusion.

Your company makes a great product with a great value proposition, Rick, and I wish you well. Using FUD and astroturf, though, will only hurt your brand instead of helping it.

Jeff Darcy July 27, 2008 at 8:11 am

Apples to oranges, Rick. First, using the EMC Clariion CX-40 for the comparison is hardly representative because everyone knows they sell the most overpriced storage in the industry. Second, your (unaudited) numbers only work if all of the computation is colocated with the storage – unlikely in the real world. If some separate set of servers are doing the computation, then you have to account for the cost and performance impact of going from those servers to the ones with the Fusion cards. Internal storage is not directly comparable to shared storage; if it were, you’d have to compare not to a Clariion but to much cheaper drives inside each server.

Server and OS vendors are indeed working on some amazing alternatives. Some are based on flash. Some are not.

Joe Kraska July 27, 2008 at 8:23 am

$53/GB for a SAN?

Gentlemen, you seriously need to get better procurement teams, or better yet, stop using price lists… no one, and I mean no one, buys at those prices.

Anyway, I have no doubt that high speed flash will improve storage… I can envision FusionIO’s device as an ideal dedicated journal device, and no doubt technologies like Compellent will leverage flash soon through dynamic automated block-level ILM.

Joe Kraska
San Diego CA

Rick White July 27, 2008 at 12:37 pm

I always find it interesting when someone is willing to post their whole name and not hide who they are yet they get accused of not disclosing who they are by someone using the name “Anonymous”

Anonymous said, “…those opinions mean nothing without data and less than nothing when the people stating them fail to disclose their interest in a particular conclusion.” — I thought I was perfectly clear when I started my response out by saying “First off I’ll admit I’m biased…” I didn’t think it was appropriate to plug my company’s specific products so I spoke vaguely about very REAL solutions entering the market today, which is why I used a SAN model number Bill would be familiar with while not calling it out specifically.

Anonymous said, “Flash has not yet reached the capacity or cost per gigabyte of disk…” — I apologize for my direct response but you’re just wrong on this point and I don’t know how else to say it. If you’re talking about the raw media, a single NAND chip versus a single enterprise hard disk-drive, then you’re right. When you aggregate NAND for performance versus disk an interesting thing happens: disks become more expensive per gigabyte, and I mean a LOT more expensive.

Anonymous said, “…nor has it reached the speed of RAM.” — I never meant to imply that it has. If 357-microsecond response time is close to your RAM’s latency I would suggest buying better RAM because that’s 100-times slower than what I would expect from my system’s RAM.

Anonymous said, “Maybe it will develop to the point where it can displace one or the other some day, but not this day…” — I don’t recall claiming that Enterprise Flash Drives will displace system RAM. Although, I do believe you’ll need less of it, which means lower density chips at a lower price per GB and that’s a good thing for customers.

Also, Rick White said, “However I do agree that disk arrays make sense for archival storage…” — So as I said before I don’t believe Enterprise Flash Drives will displace hard disk-drives altogether, but again, I believe we will use smaller (2.5″), lower power, larger capacity, slower SATA like drives for archival storage versus faster, higher power, lower capacity FC like drives for high performance storage.

Anonymous said, “Flash will continue to be yet another layer in the memory/storage hierarchy, perhaps a revolutionary one but still one adding and not subtracting complexity – both internally (with non-trivial flash translation layers etc.) and externally (with more staging between hierarchy levels)…” — So if my Enterprise Flash Drive, or anyone else’s for that matter, looks like a single drive and functions like a block device I should consider this complex? I would assume just about any level system administrator could manage this type of storage or even a RAID of disks like this with very little effort and zero training but with some amazing performance implications (let’s say more than 50,000 IOPS).

Anonymous said, “Until latency reaches *zero*…” — I’m not sure how we would ever reach *zero* latency but if you say so.

Anonymous said, “…read-ahead and write-behind have the potential to be useful with any possible technology.” — Would you mind elaborating on this?

Anonymous said, “The uninformed can make all the guesses they want about how this or that technology *might* work with virtual machines, but those opinions mean nothing without data…” — I’m not sure how you want me to reply to this because I think we both know where this rabbit hole would take us and I don’t think it’s appropriate for companies to directly plug their specific products in forums like this but that’s just my opinion. Needless to say I wasn’t nearly as hypothetical as I was implying but I was trying to play nice.

Anonymous said, “…and less than nothing when the people stating them fail to disclose their interest in a particular conclusion.” — Which I assume you’re referring to someone else considering I did state who I was (not Anonymous) and that I’m biased.

Anonymous said, “Your company makes a great product with a great value proposition…” — Thank you.

Anonymous said, “…I wish you well.” — If I had any idea who you were or what company you represented I would return the niceties.

Anonymous said, “Using FUD and astroturf, though, will only hurt your brand instead of helping it.” — I wasn’t trying to affect my brand in any way, shape or form; I was merely discussing a new medium and the impact it could have on the way we architect storage in the future. It would appear you are trying to affect my brand by making unfounded and what I consider unfair accusations without even saying who you are. I can only assume you’re a very frightened competitor who sells lots of disks for performance and your world is getting turned upside down by the possibilities about to unfold in the high performance storage market.

In closing, I’m not trying to pick a fight; I was just sharing a different opinion, but if this makes you uncomfortable I apologize. I do feel bad for the potential impact this may or may not have on your company but please don’t take it out on me just because Enterprise Flash Drives are going to have an impact on the high performance storage market…in my opinion.

Bill Todd July 27, 2008 at 4:50 pm

Rick, I can assure you that I’m not at all ‘like you’, but it does help to know that I’m talking with a marketeer with an agenda rather than just with a confused, enthusiastic amateur.

So let’s get right down to the specifics about just how disingenuous you’re really being here:

1. Plugging your performance estimates above into the ‘performance calculator’ on your own Web site indicates that you’re positioning 32 of your io-drive 320 drives in an *unmirrored* configuration against a *mirrored* Clariion configuration – but reporting only the net (not raw) storage capacity of the latter. Hence right off the bat your relative $/GB comparison is off by a factor of over 2.5 (given that you weren’t giving the Clariion any credit for its hot spares either) – unless you’re seriously suggesting that there’s no need to mirror and provide spares for your own storage to guard against failure (in which case we can just stop paying any attention to you right now, since you’re not only completely clueless in general but aren’t even aware of the material on your own Web site which provides both mirrored and parity-based configurations in its performance calculator).

2. While you do quote the actual max measured IOPS value for the Clariion, that’s hardly representative of typical *response times* – because in *typical* use the latter are well under 10 ms. rather than well over 20 ms. (as becomes the case as one approaches complete saturation).

3. As Jeff pointed out you’re comparing local storage of your own (which appears to be the only kind that you offer) against SAN-located storage. Not only does the latter include the cost of separate enclosures to terminate the required external busses and house the controller and the storage to which it connects, but in the case of the Clariion it includes a controller capable of supporting 60% more drives (and over 200% more overall capacity if using 300 GB drives) than were configured in the test you cite (i.e., a lot of the controller value was left unused but still contributed to the overall price that you cited).

4. Jeff also mentioned the relative priciness of EMC storage, but didn’t quite do that subject justice: a quick look around the Web (e.g., at NewEgg and ServersDirect) indicates that comparable internal Seagate and Fujitsu 15krpm drives in the 147 – 450 GB range are available at $1.40 – $1.55/GB – so that’s the *real* apples-to-apples price competition with your own $33/GB internal flash storage (if you don’t wish to include SATA drives at $0.15 – $0.20/GB, that is – though they’re eminently suitable for anything but very demanding IOPS-intensive applications).

5. Such as streaming data, for example – where Seagate’s new 450 GB SAS 15krpm drives sport a maximum data rate of 164 MB/sec *apiece* and bottom out at over 100 MB/sec. Even the older drives in the Clariion you cited should support average rates in the 100 MB/sec area – so it would take just *2* of them to sustain the 204 MB/sec of bandwidth that you so casually claim for the Clariion box (perhaps the result of multiplying its measured IOPS by a small, random request size rather than assessing its streaming bandwidth), and it would take only about 40 of them (at an aggregate cost of under $10,000) to match the claimed 4 GB/sec bandwidth of your $340,000 flash solution (though in this case using around 60 SATA drives at a total cost of well under $5,000 might make more sense – and would offer considerably greater capacity as well).

6. Oh, yeah – to be complete, I should address your product’s single real claim to fame: its supposedly fabulous IOPS/$. If we remove the mirroring and hot-spare overhead from the Clariion box and remove the box itself (since we want to perform an apples-to-apples internal-storage comparison), we find that the actual cost of the 60 or so disks when purchased directly at retail rather than through EMC should be under $14,000 – which works out to under $0.56/SPC-1 IOP (lower than the $18/SPC-1 IOP figure you’d like people to believe by a factor of more than 30, and entirely consistent with measured IOPS for this calibre of disk with the significant queue lengths that you assumed). Thus the claimed $0.35/SPC-1 IOP for your own product, while superior, is not at all *dramatically* superior – especially considering the fact that one has to pay at least 20x the price per GB to obtain it (though to be fair the rare application that is so response-time sensitive that it can’t tolerate sub-10 ms. access latencies might still be tempted to consider it as an option).

In the end your ‘fun with math’ reminds me of the old saying that figures may not lie, but liars sure can figure. Henceforth your company might be better served if you confined your misrepresentations to non-technical venues where they’re less likely to get called out for what they are.

– bill
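
[Editor's aside: a rough check of the arithmetic in Bill's points 4 to 6, as a minimal Python sketch. It uses only the figures quoted in his comment. The variable and function names are made up for illustration, 1 GB/sec is taken as 1000 MB/sec, and the implied IOPS figure is derived from his own $14,000 and $0.56 bounds rather than taken from any benchmark report.]

```python
import math

# Figures quoted in the comment above (prices in US dollars).
DISK_PRICE_PER_GB = (1.40, 1.55)  # 15krpm drives at retail, low and high
FLASH_PRICE_PER_GB = 33.0         # quoted internal flash price
DISK_STREAM_MB_S = 100.0          # conservative per-drive streaming rate
FLASH_BANDWIDTH_MB_S = 4000.0     # claimed 4 GB/sec, assuming 1 GB = 1000 MB
DISK_SET_COST = 14_000.0          # upper bound for the ~60 disks at retail
DISK_COST_PER_IOP = 0.56          # quoted upper bound, $/SPC-1 IOP
ARRAY_COST_PER_IOP = 18.0         # figure attributed to the Clariion box
FLASH_COST_PER_IOP = 0.35         # claimed for the flash product


def drives_for_bandwidth(target_mb_s, per_drive_mb_s):
    """Smallest whole number of drives whose summed streaming rate meets the target."""
    return math.ceil(target_mb_s / per_drive_mb_s)


def price_ratio_range(flash_per_gb, disk_per_gb_range):
    """Return (lowest, highest) flash-to-disk price ratio per GB."""
    low_price, high_price = disk_per_gb_range
    return flash_per_gb / high_price, flash_per_gb / low_price


def implied_iops(total_cost, cost_per_iop):
    """IOPS implied by a total cost and a cost-per-IOP figure."""
    return total_cost / cost_per_iop


if __name__ == "__main__":
    print("drives to match flash bandwidth:",
          drives_for_bandwidth(FLASH_BANDWIDTH_MB_S, DISK_STREAM_MB_S))
    lowest, highest = price_ratio_range(FLASH_PRICE_PER_GB, DISK_PRICE_PER_GB)
    print(f"flash costs {lowest:.1f}x to {highest:.1f}x more per GB")
    print(f"implied SPC-1 IOPS for the bare disks: "
          f"{implied_iops(DISK_SET_COST, DISK_COST_PER_IOP):.0f}")
    print(f"array vs bare-disk $/IOP: {ARRAY_COST_PER_IOP / DISK_COST_PER_IOP:.1f}x")
    print(f"bare-disk vs flash $/IOP: {DISK_COST_PER_IOP / FLASH_COST_PER_IOP:.1f}x")
```

On these inputs the ratios line up with the "about 40 drives", "factor of more than 30" and "at least 20x" claims. Whether the inputs themselves are fair is exactly what the two sides are arguing about.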

Rick White July 27, 2008 at 6:56 pm

So, as much as I would like to address the great questions and the jab or two, I would rather just end it here with one point of clarification. I wasn’t prepared to discuss Fusion’s products in detail but to be fair since there is some confusion, I wasn’t talking about our ioDrives. What I was referring to when I said “hypothetical” was definitely mirrored and networked but not yet announced. So Bill’s assumptions don’t work but he had no way of knowing this so I’ll just leave it at that.

No matter what happens it should be interesting to watch, and if Bill or anyone else wants to get together to share ideas about the storage industry then dinner’s on me at SNW this fall.

Jeff Darcy July 27, 2008 at 7:53 pm

“if my Enterprise Flash Drive, or anyone else’s for that matter, looks like a single drive and functions like a block device I should consider this complex? I would assume just about any level system administrator could manage this type of storage”

Yes, and I’m sure they could manage any of a dozen alternatives too, but that didn’t stop you from presenting their complexity as a problem or a barrier. Why make an exception for your own kind of complexity?

“I wasn’t trying to affect my brand in any way shape or form”

Touting the key differentiating technology in your brand is practically the same as pushing the brand itself. When EMC was touting the advantages of cache-centric RAID, or NetApp was preaching the NAS gospel, everyone knew it was their way of enhancing their own brands. When you try to follow in their footsteps, you are most definitely trying to affect your brand.

As for your “fun with math”, Bill and I are hardly ever on the same page but he’s pretty much right this time. Just adding up your IOPS numbers and comparing the total to a competitor’s measured whole-system result is fishy. It’s like HPC vendors adding up per-CPU GFLOPS, acting as though load can be distributed perfectly and communication won’t matter. Savvy customers in either market know their workloads don’t distribute and scale the way you’d like them to, and if they decide you don’t Get It then you’ve lost.

Also, your product provides lots of IOPS per dollar, but what happens when the customer wants something that they get from enterprise storage besides raw IOPS? Want RAID or multipathing? Set up MD/DM, which is kludgier than any of the dozen array interfaces I’ve used, to get RAID; forget about multipathing. What about disaster recovery or backup? Install more host software (hoping it’s supported on your particular platform) and expect to burn more host cycles running it. Even if your speeds and feeds look good, even if your prices look good, any comparison to enterprise storage is still suspect if you don’t have enterprise-storage features. Does this say “enterprise” to anyone here?

“The Linux driver for this piece of hardware is pretty dodgy. Sub-alpha quality actually. But they seem to be working on it. Also there’s no driver for OpenSolaris, Mac OS X, or Windows right now. In fact there’s not even anything available for Debian or other respectable Linux distros, only Red Hat and its clones.”

Anecdotal, admittedly, but are anecdotes any less reliable than marketing claims? Like Anonymous, I think you guys have a lot to be proud of. Unlike Anonymous, I’m willing to sign my name to what I say. Where you and I differ, I think, is that I see your product as a building block to be combined with existing technologies, whereas you seem to see it as a complete solution that displaces them. Only time will tell which of us is right, but my experience building enterprise storage systems and watching them get sold (or not) by both incumbents and upstarts makes me pretty confident about my guess. I know what “science project” means to those kinds of customers. That’s why I sell my science projects to scientists now. 😉
