ZFS: Threat or Menace? Pt. I

Update: since I wrote this article I’ve written much more about ZFS.

Sadly, Apple has dropped ZFS. But with Oracle’s acquisition of Sun completed there is a chance it will come back. Stay tuned.

Now back to the original article about ZFS:

IMHO, both. In a storage industry where the hardware cost to protect data keeps rising, ZFS represents a software solution to the problem of wobbly disks and data corruption. Thus it is a threat to the hardened disk array model of very expensive engineering on the outside to protect the soft underbelly of ever-cheaper disks on the inside.

It’s the Software Version of the Initiation Rite in A Man Called Horse
Before I jump into the review of ZFS, let me share what I like best about it, from a slide in the modestly titled “ZFS, The Last Word In Filesystems” presentation:

ZFS Test Methodology

  • A Product is only as good as its test suite [amen, brother!]
    • ZFS designed to run in either user or kernel context
    • Nightly “ztest” program does all of the following in parallel:
      • Read, write, create and delete files and directories
      • Create and destroy entire filesystem and storage pools
      • Turn compression on and off (while FS is active)
      • Change checksum algorithm (while FS is active)
      • Add and remove devices (while pool is active)
      • Change I/O caching and scheduling policies (while pool is active)
      • Scribble random garbage on one side of live mirror to test self-healing data
      • Force violent crashes to simulate power loss, then verify pool integrity
    • Probably more abuse in 20 seconds than you’d see in a lifetime
    • ZFS has been subjected to over a million forced, violent crashes without losing data integrity or leaking a single block
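To make the “scribble random garbage on one side of live mirror” item concrete, here is a toy sketch in Python. It is not ztest and not ZFS code: `ToyMirror`, `scribble` and everything else in it are hypothetical names invented for this illustration, and SHA-256 simply stands in for whatever checksum a real pool would use.

```python
import hashlib
import random


class ToyMirror:
    """Hypothetical two-way mirror; checksums are kept apart from the data."""

    def __init__(self):
        self.sides = [{}, {}]      # two copies of every block
        self.checksums = {}        # block id -> expected SHA-256 digest

    def write(self, block_id, data):
        for side in self.sides:
            side[block_id] = data
        self.checksums[block_id] = hashlib.sha256(data).digest()

    def read(self, block_id):
        """Return verified data, repairing any side that fails its checksum."""
        expected = self.checksums[block_id]
        good = None
        for side in self.sides:
            if hashlib.sha256(side[block_id]).digest() == expected:
                good = side[block_id]
                break
        if good is None:
            raise IOError("both copies are corrupt")
        for side in self.sides:
            side[block_id] = good   # self-heal: overwrite any bad copy
        return good


def scribble(mirror, rng):
    """Fault injection: overwrite every block on one side with random bytes."""
    side = mirror.sides[rng.randrange(2)]
    for block_id in side:
        side[block_id] = bytes(rng.randrange(256) for _ in range(16))


rng = random.Random(0)
mirror = ToyMirror()
original = {i: f"block {i}".encode() for i in range(100)}
for block_id, data in original.items():
    mirror.write(block_id, data)

scribble(mirror, rng)   # trash one whole side of the mirror
recovered = {block_id: mirror.read(block_id) for block_id in original}
```

After the run, `recovered` should match `original` and both sides of the mirror should be identical again. The real test suite does this kind of thing in parallel with everything else on the list, which this sketch makes no attempt to imitate.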

Is RAID Hard or Soft?
I start here because, perhaps like you, I’ve always felt safer with hardware (HW) RAID — even though some pretty cruddy HW RAID has shipped. In my case I trace that back to the technologists at Veritas who, IMHO, never really “got” the enterprise — although the OpenVision guys certainly did — and whose software allowed average sysadmins to dig very deep holes that buried more than one of them. And, of course, the HW guys have kvetched about software performance for so long that most people have forgotten that HW RAID is simply software running on a dedicated processor. Processors that are usually two to five years out of date.

The real advantage of HW RAID is that the software sits in a controlled environment: the processor, the OS, the interprocessor links, the interface to the drives, the RAM, everything is specified and tested.

It needs to be: in general, storage systems, including disk drives, are steaming piles of spaghetti code whose authors are long gone. So there is a lot of regression testing to make sure that new features haven’t broken old features. An advantage array makers have over you is that they specify the firmware rev level of disk drives, so they know exactly what they are getting. Since their spaghetti code is no better at recovering from errors than, say, Windows 98, they work hard to make sure no errors happen. You pay through the nose for this, but they do a pretty good job.

It’s Always Something
Which is why I love Google’s GFS model. They assume everything will crash underneath them at the worst possible time and they’ve built the software to handle it. Endlessly patched 20-year-old disk drive firmware? Exploding power supplies? Network outage? Asteroid hit? OK, maybe not the last one, but they are ready for everything else and more using cheap commodity products.

Yet GFS has some major problems: it isn’t, by a long shot, suitable for most enterprise applications. It isn’t open source. Worst of all, it isn’t for sale. GFS is a major competitive advantage for Google and nobody gets it but them.

Which brings us to ZFS, which at one point stood for Zettabyte File System, and now stands for ZFS. It isn’t just a file system, any more than GFS is. It is a complete software environment for protecting, storing and accessing data, designed for the most demanding enterprise environments. Using standard storage components: disk drives, enclosures, adapters, cables. No RAID arrays. No volume managers. No CDP. No fsck. No partitions. No volumes. Almost makes you nostalgic for the good old days, doesn’t it? Like before Novocaine.

I can show you the door, Neo, but you have to walk through it.
ZFS is a total rethink of how to manage data and storage. Its design principles include:

  • Pooled storage
    • No volumes
    • Virtualizes all disks
  • End-to-end data integrity
    • Everything Copy-On-Write
      • No overwrite of live data
      • On-disk state always valid
    • Everything is checksummed
      • No silent data corruption possible
      • No panics due to corrupted metadata
  • Everything transactional
    • All changes occur together
    • No need for journaling
    • High-performance full stripe writes

Many details fall out of these overall design ideas. I’ll deal with some of them today and more of them in Part II.

Performance Anxiety
The biggest single knock against software-based RAID is performance. Mirroring is as fast as a disk write, but parity RAID has to deal with the dreaded “write hole” problem, which is too geeky to bore you with here and which really kills write performance.
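For readers who do want the geeky version, here is a minimal Python sketch of the write hole using single XOR parity. The disk contents and the `xor_blocks` helper are made up for the illustration; no real RAID implementation is being quoted.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across three data disks plus one parity disk.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# While data and parity agree, a lost disk can be rebuilt correctly.
rebuilt_before = xor_blocks([data[0], data[2], parity])

# A small write to disk 0 lands, then power fails before the matching
# parity update reaches its disk.
data[0] = b"ZZZZ"
# parity = xor_blocks(data)   # never happens: this gap is the "hole"

# Later, disk 1 dies and is rebuilt from the survivors plus the stale parity.
rebuilt_after = xor_blocks([data[0], data[2], parity])
```

Here `rebuilt_before` is the original contents of disk 1, while `rebuilt_after` is garbage that the array will happily hand back. Closing the hole in a conventional design means making the data and parity updates atomic, which is where the extra I/O or the NVRAM comes in.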

Since storage arrays are running software RAID, how do they solve this problem? Money. Specifically, your money, plowed into a large and expensive non-volatile memory cache, usually redundant, with battery backup. There is nothing magic about this cache: it simply tells the system that the write is completed as soon as it is in the cache, which takes microseconds, instead of on the drive, which can take many thousands of times longer. It doesn’t even need to be in the array. Several vendors have sold NVRAM caches on I/O cards that improve performance just as much as a storage array does. But they are more of a hassle to manage.

With ZFS RAID-Z there is no RAID write hole problem. All writes are full stripe — high performance — writes. How can this be? ZFS has variable stripe width. Every ZFS block is its own stripe. No one else does this, because reconstructing the data is impossible when all the storage array knows about is blocks and all the file system knows about is files. In ZFS, though, the array and the file system are integrated, so the metadata has all the information needed to recreate the data on a lost disk. This is a very cool answer to a very old problem.
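A rough Python sketch of the “every block is its own stripe” idea follows. It assumes single XOR parity and an in-memory dictionary playing the role of the block pointer; `write_block` and `read_block` are hypothetical names, and the real RAID-Z on-disk layout is considerably more involved than this.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_block(payload, data_disks):
    """Write one logical block as its own full stripe.

    Returns a stand-in 'block pointer': the per-disk chunks (data chunks
    followed by one parity chunk) plus the original payload length.
    """
    width = max(1, min(data_disks, len(payload)))   # small block, narrow stripe
    chunk = -(-len(payload) // width)               # ceiling division
    padded = payload.ljust(chunk * width, b"\0")
    chunks = [padded[i * chunk:(i + 1) * chunk] for i in range(width)]
    return {"length": len(payload), "chunks": chunks + [xor_blocks(chunks)]}


def read_block(pointer, lost=None):
    """Read a block back, rebuilding one lost chunk from the survivors."""
    chunks = list(pointer["chunks"])
    if lost is not None:
        survivors = [c for i, c in enumerate(chunks) if i != lost]
        chunks[lost] = xor_blocks(survivors)
    return b"".join(chunks[:-1])[:pointer["length"]]


small = write_block(b"hi", data_disks=4)                   # 2 data + 1 parity
big = write_block(b"The quick brown fox", data_disks=4)    # 4 data + 1 parity
recovered = read_block(big, lost=2)                        # chunk 2 is gone
```

Data and parity for a block are computed and written together, so there is never an old parity block to read, modify and rewrite. The reconstruction only works because the pointer records how wide each stripe is, which is exactly the knowledge a block-only array does not have.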

The truth? You can’t handle the truth!
Actually, in the storage world, we insist upon it. Data integrity is the sine qua non of data storage. Fast is good, accessible is good, but if it isn’t right, nothing else matters.

All systems use some form of checksum to provide some level of data integrity. Yet that integrity may not be nearly as good as your friendly SE has led you to believe.

Most filesystems rely upon the hardware to detect and report errors. Even if disks were perfect, there are still many ways to damage data en route. In-flight data corruption is a real problem.

In a well-done paper from Dell and EMC the problem is described this way:

System administrators may feel that because they store their data on a redundant disk array and maintain a well-designed tape-backup regimen, their data is adequately protected. However, undetected data corruption can occur between backup periods - backing up corrupted data yields corrupted data when restored. Scenarios that can put data at risk include:

  • Controller failure while data is in cache
  • Power outage of extended duration with data in cache
  • Power outage or controller failure during a write operation
  • Errors reading data from disk
  • Latent disk errors

In Dell | EMC systems, the data and the checksum are stored as a unit and compared inside the array. This effectively ensures that the array is as reliable as a disk, but it has no way of knowing if, for example, stale data is returned to the file system.

In fact, any checksum stored with the data it is supporting can only tell you that this data is uncorrupted. It could be the wrong data and neither it nor the file system would know.
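A small Python sketch of that blind spot, assuming a made-up on-disk unit of data followed by its own SHA-256 digest. The `seal` and `verify` helpers and the block contents are hypothetical; this is not how any particular array formats its sectors.

```python
import hashlib


def seal(data):
    """Store the checksum alongside the data, as one unit."""
    return data + hashlib.sha256(data).digest()


def verify(unit):
    """Check a unit against the checksum embedded in it."""
    data, stored = unit[:-32], unit[-32:]
    return hashlib.sha256(data).digest() == stored


disk = {7: seal(b"balance=100")}

# The file system writes a new version, but the write is silently dropped
# somewhere along the way, so the old unit stays on disk.
intended = seal(b"balance=250")
# disk[7] = intended   # never reaches the platter

stale_passes = verify(disk[7])    # old data, perfectly valid checksum

# Ordinary corruption of the data, by contrast, is caught.
corrupted = b"bXlance=100" + disk[7][-32:]
bit_rot_caught = not verify(corrupted)
```

The stale block verifies cleanly because the data and its checksum are wrong together. Nothing inside the unit says which version of the block it is supposed to be.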

In contrast, a ZFS storage pool is a tree of blocks. ZFS employs a 256-bit checksum for every block. Instead of storing the checksum with the block itself, it stores the checksum in its parent block. Every block contains the checksums for all its child blocks, so the entire pool can validate that the data is both accurate and correct. If the data and the checksum disagree, the checksum can be trusted because it is part of an already validated, higher level block.
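Here is a minimal Python sketch of checksums kept in the parent, assuming a two-level tree and SHA-256 as the 256-bit checksum. `ToyPool` and its `dropped` flag are hypothetical, and in a real pool the parent is itself a block checksummed by its own parent, all the way up to the root.

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).digest()    # a 256-bit checksum


class ToyPool:
    """Hypothetical two-level block tree: one parent, many children."""

    def __init__(self):
        self.disk = {}       # address -> child block contents
        self.parent = {}     # address -> checksum the parent holds for it

    def write(self, addr, data, dropped=False):
        self.parent[addr] = digest(data)    # parent records the new checksum
        if not dropped:
            self.disk[addr] = data          # child block reaches the disk

    def read(self, addr):
        data = self.disk[addr]
        if digest(data) != self.parent[addr]:
            raise IOError(f"checksum mismatch at block {addr}")
        return data


pool = ToyPool()
pool.write(7, b"balance=100")
first_read = pool.read(7)

# The update to block 7 is silently dropped, leaving stale data on disk.
pool.write(7, b"balance=250", dropped=True)
try:
    pool.read(7)
    stale_detected = False
except IOError:
    stale_detected = True
```

The stale block is internally consistent, but it no longer matches what its parent expects, so the read fails instead of returning the wrong data. With a mirror or RAID-Z underneath, that failure is the cue to fetch a good copy and repair the bad one.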

And it does all this in software. No co-processors, no arrays, no fancy disk formatting. It’s the architecture that is smart, not the storage.

Read Part II of ZFS: Threat or Menace?

Note: I’ve borrowed heavily from the publications of the ZFS team to write this post.

{ 6 comments }

Mike Manh Tuesday, 20 February, 2007 at 2:14 pm

So riddle me this: What if I had used a bunch of computers to make a local software Z Raid. Can I use a master computer to connect to those computers over a network, pool the storage, and present a storage pool to other computers looking to use a shared self healing storage device?

let’s say i use a different network interface for that so that the master talking to the other storage computers was doing it over a separate network than everyone else. That way i wouldn’t be bogging down the network interface with twice the traffic. This would be my poor man’s SAN. I don’t understand ZFS well enough to understand if you can pool networked other pools, so I guess that’s my question. Is it possible?

Val Friday, 31 August, 2007 at 4:26 am

Mike

I guess it’s possible… a question of network routing… but giganet is still quite slow, I find. Maybe combining 2 or more gigabit ethernet cards together… or better, Fibre Channel?? [$$$]

Oisín Friday, 20 June, 2008 at 8:44 am

If the checksum of a block is stored in its parent block, doesn’t this mean changing the contents of a node deep in the tree forces its parent to update its checksum, which in turn forces ITS parent to update its checksum, all the way back up the tree? Wouldn’t this take a long time?

Ivan R. Saturday, 21 June, 2008 at 1:50 am

Oisin:

Yes, the new checksums all the way up to the top of the pool will be written. But ZFS is all copy-on-write, so new disk blocks will be used to store them, so the head doesn’t have to seek all over. Also, ZFS batches up I/O operations into transaction groups, so the whole batch will be flushed out to disk in one pass, rather than several smaller writes.

Ken Ow-Wing Friday, 20 November, 2009 at 7:25 pm

Hi Robin:
Can you cite your source? Is it Bonwick, or somebody else? I searched the links to the ZFS eng. blogs but did not see a source for this statement:

ZFS has been subjected to over a million forced, violent crashes without losing data integrity or leaking a single block

Thanks, Ken Ow-Wing

Robin Harris Saturday, 21 November, 2009 at 8:35 pm

Ken, that post is a couple of years old and IIRC – big if – there was a document that Bonwick wrote for a conference – FAST? – that contained the statement. I would not have made that up.

Robin
