A panel discussion on enterprise SSDs at the Flash Memory Summit came to an almost unanimous conclusion: NAND flash is best seen as an extension to DRAM and a layer between DRAM and disk – not as the guts of a disk drive replacement.
I don’t think the guy from Seagate agreed.
Since I was on the panel, my recollections have to be taken with a grain of salt. But I was trying to resist the groupthink that too many panels fall prey to. Yet I agreed with the result.
Price changes everything
StorageMojo has reported at length on the problems of making a big, quirky EEPROM look like a disk. Flash doesn’t look much like DRAM either, but the two are cousins.
In the last few years price has altered the landscape. On today’s spot market a Gbit of DRAM costs 7-10x as much as a Gbit of MLC NAND.
That wasn’t the case 3 years ago, so substituting flash for DRAM made no sense.
Market resistance to flash drives comes from flash costing more than disk. That’s not a problem when flash is augmenting DRAM.
The performance fit
Disks are millisecond devices; DRAM DIMMs are nanosecond devices; and NAND chips are microsecond devices.
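In round numbers (my own order-of-magnitude figures, not anything the panel blessed):

$$
t_{\mathrm{DRAM}} \approx 10^{-7}\,\mathrm{s}, \qquad
t_{\mathrm{NAND}} \approx 10^{-4}\,\mathrm{s}, \qquad
t_{\mathrm{disk}} \approx 10^{-2}\,\mathrm{s}
\;\;\Rightarrow\;\;
\frac{t_{\mathrm{NAND}}}{t_{\mathrm{DRAM}}} \approx 10^{3}, \qquad
\frac{t_{\mathrm{disk}}}{t_{\mathrm{NAND}}} \approx 10^{2}
$$

Flash sits roughly in the middle of the five-orders-of-magnitude gap between DRAM and disk, which is exactly where a new layer earns its keep.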
More than once it was suggested that maybe it is time to bring back the 3600 RPM drive. Optimized for capacity, power and long life, it would be a good complement to servers with several hundred GB of flash.
The StorageMojo take
Flash as a new storage layer between DRAM and disk just sounds more logical than flash-as-a-disk-like product. Let disks be disks!
And flash be flash.
Courteous comments welcome, of course. More on this topic later. Stay tuned.
Would this tier between DRAM and disk be persistent storage like disk, or temporary cache like DRAM? Obviously, it’s not an either/or question, particularly when you’re talking about tens to hundreds of gigs. But with flash still having a far lower ‘write count’ than either DRAM or disk, you risk burning out the flash faster in servers that have to live for 3-5 or more years if you use it for transient/cached storage.
So using it for your swap/pagefile is probably a bad idea, but for ‘stuff read a LOT but written infrequently’ you’d probably have a good case.
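A rough lifetime estimate, using numbers I’m assuming rather than quoting from any datasheet (10,000 program/erase cycles for MLC, ideal wear leveling, no write amplification): a 100 GB device gives you

$$
100\,\mathrm{GB} \times 10{,}000\ \text{cycles} = 1\,\mathrm{PB}\ \text{of lifetime writes},
\qquad
\frac{1\,\mathrm{PB}}{1\,\mathrm{TB/day}} \approx 1{,}000\ \text{days} \approx 2.7\ \text{years}.
$$

Heavy swap traffic plus real-world write amplification could use that up well inside a 3-5 year server life, while a read-mostly working set barely touches it.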
–Jason
I think the question is not which extreme we should run to, but where we should be in the middle. From a kernel-coder’s standpoint: what parts of the current code that deals with block storage still make sense, and what parts don’t? Modern systems are already capable of using block devices that aren’t physical disks, including network block devices and loopback devices. If you don’t feel like running mkfs on your flash device, in what way is running mkswap instead insufficient? I would say that on most such systems this approach would satisfy the “use flash more like memory” use case quite well, except that there doesn’t seem to be any open-source driver for using high-performance flash devices other than through a disk interface. That will change some day, though, probably some day soon. There’s no need to create a whole new abstraction for “flash treated kind of like RAM” when an instance of an existing abstraction is likely to work just as well.
Let disks be disks, and flash be flash. Neither is RAM.
I’m glad we got that sorted out. Now how do we attach that flash to the system?
Yeah!!!
Score a big one for the home team.
I agree with both the consensus and your reserved opinion about flash.
Some Unit of Technology is still desired and required to replace “rotating rust”.
Flash presents really challenging design and implementation work to provide the “glueware” features and functions where its use adds the most value.
In Paul Clifford’s comments flash would be an excellent Unit of Technology addition. The flash feature/functions are, or can be, done in software but a flash layer at the “Lower Metric” level will be a performance enhancer.
Just imagine the Service Level Agreement (SLA) and a few operationally determined, site specific “rules of the Content road” residing in flash and you are off to the “Content Managed” environment. Hint: you can do this by Line of Business (LOB) down to the application. Maybe even into the “Lower Metrics”. Wouldn’t that be interesting? Might make implementing Service-Oriented Architecture (SOA) real easy. Just turn it on and let it learn.
So, this “Enterprise” panel wasn’t the one that EMC participated in, right?
This was the panel moderated by Ryan Floyd (Storm Ventures) and made up of Robin, Mike Cornwell (Sun – server side), Jim Porter (Disk/Trend), Steffan Hellmold (Seagate) and Joel Hagberg (Fujitsu).
Given that none of the participants (save Seagate) were in any way connected to “enterprise storage,” I guess I’m not surprised at your results – from everyone else’s perspective, flash-as-the-disruptor is probably more interesting than the boring old flash-as-the-next-generation-of-storage vision that EMC (and soon IBM and Hitachi) are delivering.
From what I’ve been told by attendees, the “unanimous” allies were Sun (flash-belongs-in-the-server), the VC/moderator, the Disk/Trend guy and yourself.
Seagate, on the other hand, positioned a more evolutionary approach, leveraging that which is available today (disk drive form factor), while Fujitsu’s VP of Business Development claimed that Fujitsu would “be there” once Flash was ready for prime time (which it must not yet be, since Fujitsu doesn’t make any flash drives today).
Hardly a balanced view of “enterprise storage” in the opinions of several people in the audience that I’ve heard from…
But what I don’t understand is why this has to be an either-or discussion in the first place?
The more probable outcome is that we’ll have persistent solid-state storage appearing in lots of places up and down the I/O stack – just as we do with DRAM today. The original model was all RAM in the server and none in the peripherals; today even cheap disk drives have more RAM cache than most computers did back in the 1980s.
I mean, why can’t we have our flash as both cache and permanent storage?
Anarchist,
I don’t think it’s correct to say the “server side” of Sun’s business isn’t in the enterprise business. I know it may seem that way from the outside, but frankly that’s just not how Sun works internally at the moment.
Anyway, yes. This is not an either-or. You will certainly start seeing improved flash in the nonvolatile parts of highly available storage (i.e., the journals, write commits, intent logs, whatever you want to call them) that make failover “safe” – and in large numbers, very, very soon.
As for permanent storage, there are kinks, like a higher incidence of silent corruption than with spinning media. But talk to the Sun guys about this and they will say “ZFS can fix that, and is an ideal file system for flash because of that.” They’re probably right.
Joe
Joe – sorry, I only meant that Sun’s server-side didn’t necessarily represent the perspective of “enterprise storage.”
As to the so-called “silent corruption,” one could argue that this is nothing but FUD. If it can be detected and corrected by ZFS or the operating system of a server, then it can also be detected and corrected by the storage device itself – as the ZeusIOPS drive already does today. In addition, it takes but a few additional guard bits to verify that what was written to a disk is in fact returned – no need to implement complex journals and commit logs. Most enterprise-class storage arrays (including both Symmetrix and CLARiiON) already incorporate such data integrity bit protection today (you can’t trust disks to return good data either, as Robin reported CERN discovered last year). And with T10-DIF, we may soon have end-to-end protection against undetected data corruption. Add simple RAID across multiple flash drives, and it is easy to recover from drive-detected errors through a simple block rebuild, without host operating system, database or file system involvement. I note that FusionIO has already added extra flash to their PCI card specifically to provide RAID for the main capacity.
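The guard-bits idea is simple enough to sketch in a few lines of Python (a toy illustration only, with a plain CRC standing in for the real T10-DIF guard tag and a made-up 512-byte block size):

```python
import zlib

BLOCK = 512  # hypothetical logical block size

def write_block(data: bytes) -> bytes:
    """Append a guard word so the reader can verify what comes back."""
    assert len(data) == BLOCK
    guard = zlib.crc32(data).to_bytes(4, "big")
    return data + guard                      # what actually gets stored

def read_block(stored: bytes) -> bytes:
    """Verify the guard before handing the data back."""
    data, guard = stored[:BLOCK], stored[BLOCK:]
    if zlib.crc32(data).to_bytes(4, "big") != guard:
        # A drive or array that detects the mismatch can rebuild the block
        # from RAID, with no host OS, database or file system involvement.
        raise IOError("guard mismatch: silent corruption detected")
    return data
```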
So it’s not the issues of volatility or silent corruption that’s driving this “better in the server” mentality – or at least, it shouldn’t be.
To Wes:
take a look at the ZFS Level 2 ARC (L2ARC) cache. It has already happened.
And Robin, thanks for an insightful post.
I was asking about how we physically attach the flash if it doesn’t look like a disk. Unless you like sucking your L2ARC data through a 300MB/s straw…
Wes, there are many ways:
http://www.fusionio.com
This approach favors PCIe x8.
Vendors are all over the place, but many of them are starting to do this sort of thing, where individual cards in their systems act, more or less, like a second level of memory or, if you will, high-speed swap.
Joe.
“…how we physically attach the flash if it doesn’t look like a disk”
I am wondering about this myself. In my dreams I am hoping for “blade flash” that has modular, removable Storage units, multiple choice physical interfaces and, most importantly, a user configurable API. This would be a Unit of Technology that provides the most benefits for the $$$. The reality is still short of that.
Some interesting references are:
Adam Leventhal’s Weblog
“Adam and Brendan refer to each other in their Weblog articles.”
Brendan Gregg’s Weblog
“A previous ZFS feature (the ZIL) allowed you to add SSD disks as log devices to improve write performance. This means ZFS provides two dimensions for adding flash memory to the file system stack: the L2ARC for random reads, and the ZIL for writes.”
Adam Leventhal’s ACM “Flash Storage Memory” article
“Can flash memory become the foundation for a new tier in the storage hierarchy?”
“A previous ZFS feature (the ZIL) allowed you to add SSD disks as log devices to improve write performance.”
—-
Yes. And with ext3 you can put the journal on a dedicated flash device too.
For read caching, at the risk of a hypothesis: you might set your swap to be very, very large and put it on a Fusion-io device or some such.
Joe.
ComputerWorld had an article this week on one particular use case that is seeing very rapid uptake of flash drives: replacement of disks used in read-mostly, many-way mirrored sets to get sufficient IOPS. By replacing 4, 8, and even more mirrors with one flash drive, your aggregate space, power, and heat savings become very large; the total power saving alone pays back the cost of the flash quickly.
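With assumed round numbers (say 180 random IOPS and 15 W per 15K RPM spindle, against a flash drive good for tens of thousands of read IOPS at a few watts), the arithmetic is stark:

$$
8 \times 180 = 1{,}440\ \mathrm{IOPS}
\quad\text{and}\quad
8 \times 15\,\mathrm{W} = 120\,\mathrm{W}
\qquad\text{vs.}\qquad
\text{one flash drive at } {\sim}10{,}000+\ \mathrm{IOPS}\ \text{and a few watts}.
$$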
Those of us who’ve been around a long time might remember an earlier “extra tier” of memory: The CDC 6600 and its successors back in the late ’60’s/early ’70’s supported “extended core storage”. This, like the main memory, was magnetic core. It was organized in small blocks (the CDC’s had 60-bit words, and if I remember right, ECS blocks were 8 words) and accessed by doing moves back and forth to main memory. So, of course, ECS was slower, not as good at random access, but cheaper. The software didn’t see it as an I/O device, but had direct access; the hardware let you make ranges of ECS accessible to a particular program. (Actually, these weren’t virtual memory machines even in main memory – it was one contiguous range of real memory per program, I think.) The cycle of idea reincarnation takes another turn….
One can imagine a path to a new storage layer as follows: First, use the memory-mapping interface to directly map files on the flash to memory. Now, replace the back end of the memory mapping for flash with something other than existing disk drivers. This gives general programmers the simplest interface – everything is just memory, if that’s how they want to use it – while hiding the changes needed to treat flash as flash, not as a disk. Of course, then you can make some of that memory transactional – it’s backed by flash, after all – and things start to get really interesting.
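A minimal sketch of that first step in Python, assuming a hypothetical /mnt/flash mount point (the interesting part, replacing the back end underneath the mapping, is exactly the piece that doesn’t exist yet):

```python
import mmap
import os

PATH = "/mnt/flash/data.bin"   # hypothetical file on a flash-backed filesystem
SIZE = 64 * 1024 * 1024        # 64 MB region

fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o600)
os.ftruncate(fd, SIZE)         # make sure the backing file is big enough

buf = mmap.mmap(fd, SIZE)      # from here on it is "just memory"
buf[0:5] = b"hello"            # ordinary loads and stores, no read()/write() calls
buf.flush()                    # msync: push dirty pages back to the device

buf.close()
os.close(fd)
```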
— Jerry
Flash as an intermediate layer between the server and disk just makes it a non-volatile cache (and top-end arrays already have very large non-volatile caches, measured in the 10s or 100s of GB). The problem is the law of diminishing returns – you have to add more and more of the stuff to get less and less real return in improved throughput. On a large OLTP system you eventually end up with a certain number of random access requests which punish the back-end disks unless you spend disproportionate money on this intermediate cache layer (so what you get approaches a solid state disk).
Another approach, of using hierarchical storage, is possible if you have applications which can make use of it. In this case, the flash isn’t a “layer” – it’s a separate storage pool and is used to service (say) requirements for low-latency OLTP systems. In that case it doesn’t make any sense to have a flash layer – it makes sense to have a higher-performance storage pool, and storage systems and databases which can automatically manage data placement according to service requirements.
In the case of the difference between disk and flash, applications talk to file systems, not devices (in general). I’m sure that current file systems are optimised for the performance limitations and characteristics of disk (minimising seeks, avoiding fragmentation etc.). A file system optimised for flash would be a good idea, and I’ve no doubt that many of the known weaknesses of flash on random writes could be addressed (e.g. WAFL-type file systems and roll-up optimisation of writes with the use of a small amount of non-volatile DRAM).
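To make the roll-up idea concrete, here is a toy Python sketch (not WAFL or any real filesystem, just the shape of the technique): random logical-block writes are absorbed in memory, standing in for NVRAM, then flushed as one sequential append to a log, with a block map remembering where each block now lives.

```python
BLOCK = 4096  # assumed logical block size

class LogStructuredStore:
    """Toy log-structured block store: random writes become sequential appends."""

    def __init__(self, path):
        self.log = open(path, "ab+")   # append-only backing file
        self.block_map = {}            # logical block number -> offset in the log
        self.pending = {}              # buffered writes not yet on stable storage

    def write(self, blockno, data):
        assert len(data) == BLOCK
        self.pending[blockno] = data   # random writes land in memory first

    def flush(self):
        # One sequential append replaces many scattered in-place rewrites,
        # which is what flash (and its erase blocks) much prefers.
        self.log.seek(0, 2)            # position at the end of the log
        for blockno, data in self.pending.items():
            self.block_map[blockno] = self.log.tell()
            self.log.write(data)
        self.pending.clear()
        self.log.flush()

    def read(self, blockno):
        if blockno in self.pending:    # newest copy may still be in memory
            return self.pending[blockno]
        self.log.seek(self.block_map[blockno])
        return self.log.read(BLOCK)
```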
As for bringing back 3600 RPM drives, heaven help us. There is already a huge problem that serial I/O speed is failing to keep up with increased capacity, for unavoidable geometric reasons. Disk sequential read rates go up in line with linear bit density whilst capacity goes up with the square. The result is that it now takes 3 or 4 hours to read the whole of a 1TB drive. Rebuild a RAID set (especially a large RAID-5 set) and see how long that takes. Go from 7200 to 3600 and you’ll double that again. The number of random IOPS possible will also drop markedly (perhaps by 30-40% depending on access patterns). This is against a backdrop of increasing capacity, so the number of IOPS per TB stored will continue to go down and down (as it has been doing since disks were invented). I’d also be interested to know if there really would be any substantive difference in reliability, power consumption (or cost) between 3600 and 7200 RPM disks.
Possibly, just possibly, 3600 RPM drives would have a place for really low access rate, low throughput archive data, but it’s difficult enough to make use of 1TB 7,200 RPM SATA drives in the enterprise without hitting performance issues. That will get more and more difficult. The 2TB disk is here, and such a monster at 3600 RPM could take 10 hours to read.
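A back-of-the-envelope check of those read times, assuming roughly 90 MB/s sustained for a 7,200 RPM 1TB drive and that sequential rate scales with spindle speed:

$$
\frac{1\,\mathrm{TB}}{90\,\mathrm{MB/s}} \approx 11{,}000\,\mathrm{s} \approx 3\ \text{hours},
\qquad
\frac{2\,\mathrm{TB}}{45\,\mathrm{MB/s}} \approx 44{,}000\,\mathrm{s} \approx 12\ \text{hours at } 3600\ \mathrm{RPM}.
$$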
NB: I don’t think this is an issue for enterprise storage only – performance on my quad-core, 2GB, three-disk desktop is often crippled by disk contention.
Those 3600RPM drives would be fine in, say, some kind of persistent archive. Supposing that we got additional density for it, and I mean 2X. Or some such.
Anyway, on the subject of flash and caches, consider architectures like Compellent’s or EqualLogic’s. These systems will, when given a mix of storage with heterogeneous performance properties, do various levels of automated block-based ILM internally. I.e., discarding the babble speak, they put your most frequently accessed blocks on the best class of storage in the system, when there is a mix of storage types. You, the user, are required to do nothing.
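Stripped of the babble speak, the mechanism amounts to something like this toy Python sketch (my own illustration, not either vendor’s actual algorithm): count accesses per block and periodically keep the hottest blocks on the flash tier.

```python
from collections import Counter

class AutoTier:
    """Toy block-level auto-tiering: the hottest N blocks live on flash."""

    def __init__(self, flash_capacity_blocks):
        self.capacity = flash_capacity_blocks
        self.heat = Counter()          # block number -> recent access count
        self.on_flash = set()          # blocks currently on the fast tier

    def record_access(self, blockno):
        self.heat[blockno] += 1

    def rebalance(self):
        # Periodically promote the most-accessed blocks and demote the rest;
        # the returned sets are the migrations the array would carry out.
        hottest = {b for b, _ in self.heat.most_common(self.capacity)}
        promote = hottest - self.on_flash
        demote = self.on_flash - hottest
        self.on_flash = hottest
        return promote, demote
```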
I find that to be pretty interesting, particularly if you’re looking forward to a day where you have a tray of storage in your enterprise.
I’m thinking in particular of the virtual machine provisioning use case, where if you really load ’em down, you are starving for IOPS. Well, I may be *able* to put my 40G vmdk files on flash, but man, I really don’t wanna. That would be a waste. If, however, the frequently busy parts were there: win.
Your “cache” discussion made me think of this, because it’s sort of cache-like in its behavior, but… it’s not.
Joe.
There are certainly vendors out there that do have storage solutions which will automatically migrate your data for you, but like all such systems they are predicated on historical access patterns predicting the future. With complex environments subject to ad-hoc and changing workload patterns, that can be very difficult. If your end-of-year batch run finds that critical data sets have been moved to slow archival storage because they’ve not been looked at for a few weeks, it can cripple performance. Of course that one is predictable (at least by humans), but there are lots of others that aren’t – a sudden spate of problem calls due to weather conditions, a sudden rush of orders from unexpected quarters, an unexpected ad-hoc report that has to be run.
I think that maybe what I’m after is the impossible and amounts to a complete storage revolution (I suppose by getting rid of revolutions in the form of disk drives). Personally I feel that, ultimately, disk storage is doomed as a truly high-performance storage medium by the fundamentals of geometry. Essentially, as capacity increases through data density increases, throughput and IOPS per GB inevitably worsen (given there are physical limits to the speed at which disks can be spun). Lots of clever things have been done to try and overcome this – file system organisations, caches at various levels and so on – but they just optimise things and can’t overcome that fundamental issue.
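To put the geometry argument in symbols: if linear bit density improves by a factor $k$ at a fixed spindle speed, then roughly

$$
\text{capacity} \propto k^{2}, \qquad
\text{sequential throughput} \propto k, \qquad
\text{random IOPS} \approx \text{constant}
$$

$$
\Rightarrow\quad
\frac{\mathrm{IOPS}}{\mathrm{GB}} \propto \frac{1}{k^{2}},
\qquad
\text{full-drive read time} \propto k.
$$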
Now flash may not be the answer, but it at least addresses one of the issues (namely random read access). I think that disks, like tapes before them, will gradually get relegated to relatively low-performance, semi-archival roles.
While I didn’t reiterate it in the post, I continue to believe that the near-to-medium term low-hanging fruit in the SSD business is high-performance flash SSDs inside storage arrays – EMC’s strategy.
Why? A couple of reasons. First, the array vendor has complete control of the environment that the SSD lives in and can – if they are smart – ensure that their software uses the SSD to good advantage. By replacing a few dozen short-stroked FC drives the array vendor provides clear power-footprint-performance advantages to the customer without asking them to change anything in their infrastructure.
Second, the array vendor shoulders much of the risk of the new technology for risk averse – aren’t they all? – enterprise customers. Persuading them isn’t a slam dunk, but the inside-the-array strategy supplies the one-throat-to-choke beloved of CIOs.
For the longer term the key question is: if flash had shipped in 1957 instead of the first disk drive, would we be trying to package flash into things that look like SSDs today? I don’t think so.
I welcome everyone – system, array, disk and startup vendors – to the scrum. There will be winners and losers among vendors, but in the end buyers will win. And that will keep them buying even more.
Robin
Good evening Robin,
Do you think it is time to reinvestigate the SSD’s place in the storage hierarchy now that Linus Torvalds is impressed by the new SSD drives coming to the consumer market?
http://torvalds-family.blogspot.com/2008/10/so-i-got-one-of-new-intel-ssds.html
JPh. Papillon
Does anybody use flash SSDs on an SAP system (production environment), for over a year, with a database size of 1TB or above?
Please let us know how it goes.
DRAM-based SSD on SAP is proven. So skip that.
Edgar