ZFS: Threat or Menace? Pt. II

Update: since I wrote this article I’ve written much more about ZFS.

Now back to the original article about ZFS:

Part I discussed performance and some of ZFS’s data integrity features. Now for some more cool features and the StorageMojo.com conclusion.

Physician, Heal Thyself
On-disk bit rot is a real and continuing problem. Data can go wrong for many reasons. The important thing is getting it fixed. As mentioned in Pt I, ZFS stores checksums apart from the blocks they describe, so it can detect both corrupted blocks and blocks that are intact but simply wrong, such as misdirected writes. It also fixes them when they are found in the course of an I/O.

Even better, ZFS maintains a background process that traverses the metadata and verifies the validity of each block. This process is analogous to the ECC memory scrubbing that EMC used in the original Symmetrix line with its large, single point of failure cache.

Whether fixing a single block or replacing a failed disk, ZFS uses the tree-based checksum structure to ensure valid data.
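The background verification described above can also be invoked on demand. A minimal sketch, assuming a pool named tank (the pool name is an example):

```shell
# start a scrub: walk the pool's metadata tree, verify every block
# against its checksum, and repair any bad copies found along the way
zpool scrub tank

# check scrub progress and see any errors found (and repaired)
zpool status -v tank
```

On a healthy pool, status reports “No known data errors”; on a damaged one, it lists the affected files.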

Snapshot Copy, Cheap. Real Cheap.
As part of its data integrity strategy, ZFS never overwrites live data, so the NOW data and the NOW-minus-1 data are always on disk – no incident can leave the data in an indeterminate state. A happy consequence of this strategy is that snapshot copies are easy to make. In fact, it is cheaper, in CPU cycles and I/Os, to never overwrite the old data. So it is easier and faster, at the cost of cheap JBOD space, to keep snapshots of everything than it is to go back and overwrite. And since it is copy-on-write, even the additional disk space usage is minimal. CDP for the rest of us. For free.

User manageable as well, so users can recover their own files with a command as simple as $ cat ~maybee/.zfs/snapshot/wednesday/foo.c.

Scalability: Billions and Billions – Keep Going, Carl – and Billions and Billions and . . .
ZFS is a 128-bit file system, which is a lot of bits. Billions of Yottabytes, and a Yottabyte is a trillion Terabytes. All of mankind’s information is on the order of 20 million TB. So 128 bits is the last address space we’ll ever need.

The bigger issue is address bits. Thirty years ago, when the first 32-bit virtual memory processors were new, a virtual address space of two Gigabytes (31 address bits) seemed impossibly huge. Large programs and datasets had bumped into the extended limits of 16-bit addresses, just as they did with 32-bit addresses several years ago. So at roughly 14 months per bit, you’ll need the 65th bit in 30 years. That seems like an impossibly long time, just as 2000 did in 1970, but it arrived soon enough. ZFS is ready.

Likewise there is no limit on the number of files or directories.

Volumes? We Don’t Need No Stinking Volumes
And thank goodness for that. Volumes were a helpful virtualization tool for physical disks early on. Managing a few volumes instead of dozens of physical disks made sense. Yet as the meteoric rise of virtualization up the hype cycle showed, everyone is ready for something different. The industry just couldn’t figure out how to deliver it.

Just a simple matter of programming in the switch. Or the controller. Or the adapter. Asymmetric out-of-band. Or in-band. Or, well, how about the file system? You’ll miss buying a lot of gear to make it work, but what the heck.

Virtualization Without Tears
Instead, ZFS uses a layer called the Data Management Unit to pool the physical disks into a single storage pool. There are no volumes. Just the pool. Add a disk to the pool with a simple # zpool add tank mirror c2t0d0 c3t0d0 command. Virtualization that works without drama.
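A sketch of the whole lifecycle, assuming the pool name tank; the first device pair comes from the article, the second pair (c4t0d0, c5t0d0) is invented for illustration:

```shell
# create a pool from one mirrored pair of disks
zpool create tank mirror c2t0d0 c3t0d0

# later, grow it by adding a second mirrored pair --
# no volume resizing, no filesystem grow step
zpool add tank mirror c4t0d0 c5t0d0

# every filesystem in the pool sees the new space at once
zpool list tank
```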

Filesystems become a sysadmin tool. Trivial to set up, you can create templates for them that include quotas, backup, snapshot, compression and privileges. It is easy to list filesystems by size, to understand disk usage. Groups of filesystems may be managed as a group, further simplifying admin. And no path names to forget or mistype.
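Because filesystems are that cheap, one-per-user becomes practical. A hypothetical sketch (user names invented, quota value from the examples below):

```shell
# one filesystem per user, each with its own quota;
# children inherit pool-wide settings automatically
for user in alice bob carol; do
    zfs create tank/home/"$user"
    zfs set quota=10g tank/home/"$user"
done

# per-user disk usage falls out of a single listing -- no du(1) crawl
zfs list -r tank/home
```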

Some sample command lines to give you a flavor of just how simple and powerful this is:

  • Create a home directory: # zfs create tank/home/bonwick
  • Give it a 10GB quota: # zfs set quota=10g tank/home/bonwick
  • Take a snapshot: # zfs snapshot tank/home/bonwick@tuesday
  • Rollback to a snapshot: # zfs rollback tank/home/bonwick@tuesday
  • Do a full backup: # zfs backup tank/fs@A >/backup/A
  • Do an incremental backup: # zfs backup -i tank/fs@A tank/fs@B >/backup/B-A
  • Do a remote replication once per minute: # zfs backup -i tank/fs@11:31 tank/fs@11:32 | ssh host zfs restore -d /tank/fs
  • Export an entire pool from an old server and import it on a new one: old# zpool export tank ; new# zpool import tank

Maybe I’m easily impressed, but this seems wicked great.

StorageMojo.com Conclusion
Is v1.0 of ZFS everything we could wish for? No v1.0 product is, but the Sun engineers are off to a great start. Specifically, some things that I hope to see soon:

  • Cluster support so ZFS can be used on a highly available infrastructure
  • Double parity RAID-Z for the truly paranoid among us
  • Ported to Linux, FreeBSD, Mac OS X

Unlike Google’s GFS, ZFS’s design center is the real world of enterprise applications and management. ZFS gets rid of all the expensive hardware gunk that is 90% of the cost of today’s enterprise storage, while simplifying management and improving data integrity. It is a major win for users and the data-intensive applications of the future.

Will big-iron storage arrays go away? No, no more than mainframes have. There will likely always be a place for high-performance storage for exceptionally high-value data. Yet the secular trend is clear: corporate data is growing colder and the economic argument for high-performance storage for *everything* is growing weaker. ZFS provides a powerful, open-source alternative to traditional storage that will hasten rationalization and cost-reduction in the world of data storage.


Donald Ragbirsingh October 17, 2009 at 7:08 pm

ZFS is being adopted in preference to VXFS at my company. We’ve found some issues that demonstrate how a better understanding of ZFS is critical. A particular application was creating and deleting files so fast that the HDS USP had terrible i/o problems on a few 400GB filesystems – the transfer was a paltry 300MB/s when it should have been at least 700MB/s. When we disabled the ZIL – we found the culprit and the i/o went up. This M8000 domain had four Emulex LP11002 cards and four fibre connections interleaving the PCIe bridges within a single IOU. Admittedly – yes – ZFS is an excellent choice, but further testing showed that locating the ZFS Intent Log to a SSD would have bypassed this issue. HDS (and other frames) don’t like ZFS to dictate when they should flush their NVRAM caches – in fact – ZFS takes a sequential write and transforms it into a pure random write. On RAID5 and RAID6 storage which is typically assigned – these luns within the internal RAID group showed high latencies. Careful planning is required, in my case RAID10 was actually better.
