Am @ Interop today – a nice, relaxing 250-mile drive from home – so this isn’t a standard StorageMojo post. Think of it as an expanded tweet.
Part of what Oracle gets with Sun is ZFS. And part of what Chris Mason of Oracle is working on is Btrfs – B-Tree or “butter” FS – seen as a Linux answer to ZFS. With a GPL license.
With many of the same features – such as parent-stored checksums and snapshots – Btrfs provides important new functionality to Linux. But if ZFS is an Oracle property, how hard could it be to change the licensing and open it up to the Linux community?
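As a rough illustration of why parent-stored checksums matter: when a block’s checksum lives in its parent rather than next to the data, a disc that silently corrupts the block can’t also corrupt the evidence. A toy Python sketch of the idea (hypothetical names, not ZFS’s actual on-disc format):

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Block:
    """A data block. Its checksum is stored in its *parent*, not alongside
    the data, so silent corruption of the block can't hide itself."""
    def __init__(self, data: bytes):
        self.data = data

class Parent:
    def __init__(self, children):
        self.children = children
        # the parent records a checksum for each child at write time
        self.child_sums = [checksum(c.data) for c in children]

    def read(self, i: int) -> bytes:
        block = self.children[i]
        if checksum(block.data) != self.child_sums[i]:
            raise IOError(f"checksum mismatch on child {i}: silent corruption detected")
        return block.data

tree = Parent([Block(b"hello"), Block(b"world")])
assert tree.read(0) == b"hello"

# simulate silent on-disc corruption (bit rot)
tree.children[1].data = b"w0rld"
try:
    tree.read(1)
except IOError as e:
    print("caught:", e)
```

A plain in-place checksum would be rewritten along with the data by a buggy controller; the parent-stored one survives and flags the read.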
The StorageMojo take
I’m asking the question, not answering it. License T&C’s are important, but if the bottom line is that CDDL is incompatible with GPL, will Oracle be able to fix that? Will they want to?
Or does Linux really need AZFS – Almost ZFS?
Courteous comments welcome, of course.
Chris Mason says yes:
http://article.gmane.org/gmane.comp.file-systems.btrfs/2880/match=sun+oracle
🙂
NIHFS (not-invented-here FS) would be a better characterization. Somehow OS X and FreeBSD managed to use ZFS.
Although the two filesystems’ feature sets overlap quite a bit, I think that even if the licensing weren’t a problem, integrating ZFS into the Linux kernel would still prove to be a challenge – both technically and politically.
Btrfs got to where it is so quickly because it builds on mature kernel features. For instance, it uses the same device-mapping code for RAID and block abstraction that Linux’s software RAID and LVM are based on, which is why it was merged into the mainline kernel so quickly.
So while the end goal may be similar, at the end of the day Btrfs is a better fit for Linux. That’s not to say that ZFS on Linux will never happen.
yes — as http://thread.gmane.org/gmane.comp.file-systems.btrfs/2880 notes —
‘the core kernel developers have already stated that ZFS is a “rampant layering violation” and otherwise indicated they do not want ZFS in the Linux kernel, whereas BtrFS has gotten a much more positive response. It may well be that on the /Oracle/ side, the political and technical problems with porting ZFS are smaller than those with finishing BtrFS, but if the kernel developers wouldn’t accept it, _any_ money and effort spent on it would be wasted money and effort.’
“ZFS is a ‘rampant layering violation’”
Funny how this is always brought up, when btrfs itself is a “rampant layering violation” by the same definition.
As for AZFS, aka btrfs: it will take many years for btrfs to be trusted. Some people even now don’t fully trust ZFS, and it has been out for several years.
It will take Linux many, many years to get where ZFS is today, and ZFS is not standing still.
There are many projects at Sun/Opensolaris.org that build on top of ZFS. Just look at the COMSTAR project, or the ASM project. There is absolutely nothing comparable over at the Linux camp.
Anyway, I wish btrfs good luck. There are cases where I’m not able to use a professional OS (namely Solaris) and have to use a toy OS, so a better filesystem than the current mess is welcome.
ZFS does great things because it violates layers. The ZFS dev team realized the need to violate layers, and discusses this issue in several interviews and documents. Any attempt to clone ZFS will need to violate layers, too.
If layers are sacred, Linux kernel developers must remove TCP/IP — the protocol violates layers in several places, regardless of implementation. I’ll hold my breath while waiting for that announcement.
I have asked my colleagues about it, and the prevalent opinion is that a port of ZFS to Linux is no easier than finishing and fixing Btrfs to be production-ready. And the result would probably be better (a layered design instead of the monolithic monster ZFS is; have you heard of the Do One Thing And Do It Well philosophy?).
@Matěj Cepl
Of course ZFS is not monolithic. The layers are just in a different place, where they make more sense.
The importance of “layer violations” – or cross-layer optimization, as it is better known – has been recognized at least since WAFL integrated RAID knowledge into the filesystem layer.
RDMA is another example, from the networking world.
In any case, if you’ve dealt with storage performance issues long enough, the benefits of cross-layer optimization don’t take long to get your head around.
I am actually surprised that btrfs uses the software RAID layer already in the kernel, with its inefficiencies under larger loads. If I can find some spare time this week, I’ll post some figures from a study I did last year on the performance of the md module in Linux.
ZFS (and its clones) are interesting in that they are the realization of everything we all wanted in a single node Unix file system ten years ago.
Every feature, pools, tiering, performance for databases, etc.
What’s surprising to me (or am I missing something?) is that there’s all this excitement and buzz about a filesystem that does not have any form of multinode data access (the CFS or DFS model) or a global namespace that spans nodes. Many of the advanced ZFS features that are attractive are aimed at large data sets, and having a filesystem tied to a single node negates a lot of those benefits. A 128-bit namespace, great – are you going to put 1,000 petabytes of storage on a single OpenSolaris (or Linux) server?

This would all be a lot more interesting if it supported even something as limited as 8- or 16-node concurrent, cache-coherent operation across a set of nodes, with transparent failover. There are lots of filesystems for the massive distributed internet data center model (GoogleFS, Hadoop, Mogile, Haystack, etc.), but there has been a notable slowdown in interesting clustered filesystems for the use cases where ZFS seems most likely to be deployed – databases and scale-out nearline (tier 2, petabyte-scale) NAS.
Carter, Sun is porting Lustre to use ZFS. It’s an open question whether Oracle will continue this development, of course.
@Matěj Cepl and others: The “layering violation” objection was addressed by one of the ZFS developers back in May 2007, and as ‘Brainy’ mentions, it’s just that the layers are in a different place than in most systems:
http://blogs.sun.com/bonwick/entry/rampant_layering_violation
Personally, I think the two will continue on even if the merger goes through and Oracle GPLs ZFS. Given the number of filesystems already in the Linux kernel, no one’s going to notice the addition of one or five or twelve more. Some are legacy (ext1, ext2), some serve useful niches (JFFS), but many others are a duplication of effort that seems wasteful.
It would be a big deal if both Btrfs and ZFS end up there.
I have conclusively proven that btrfs is actually a blatant repackaging of reiser4 in a cover up to avoid the political disaster of supporting the code of a convicted murderer. btrfs is 81.56% similar to reiser4. Here are the steps to reproduce. Please spread the word. http://pastebin.com/ff42272d http://pastebin.com/f27912488
Epic trolling penix, I took it seriously for a second there ! 🙂
Linux isn’t Unix, Linux isn’t BSD, and it’s certainly not Mac OS X.
Linux is licensed under the GPL, BSD under the BSD license, and Apple’s products are based on Darwin.
That’s the reason ZFS is well supported by OSes that are not under the GPL – that is, by OSes whose licenses are compatible with the CDDL.
The only way right now to get REAL ZFS support is with products from Sun:
storage solutions, or the open-source OpenSolaris.
ZFS ports for the BSD derivatives, for example, are still in development and not 100% stable, which is “tricky” in terms of storage.
If Oracle acquires Sun – and that’s how it looks, since the US government gave the green light – the chances that they release ZFS under the GPL are pretty good, since Oracle is working on Btrfs, the GPL ZFS clone, which would take years to grow to the reliability of today’s ZFS.
Let’s be honest, Btrfs can’t reach the quality of ZFS.
ZFS is a KILLER FEATURE in UNIX-derived systems.
And that’s why Linux really needs “AZFS”.
Eazydor,
I love how you mix half-truths in with the truths – really quite inspiring. I wish you the best in politics.
Quality isn’t measurable like distance; it’s a matter of how the rest of the system integrates with the filesystem. I don’t think you (or I) are qualified to judge whether Btrfs will “reach” the quality of another filesystem. Most people here are armchair engineering.
Journaled filesystems used to be a killer feature.
Volume snapshotting used to be a killer feature.
Don’t worry, next year there will be another “KILLER-FEATURE” that will come out and this whole story will repeat. With someone else screaming that their stuff is cool.
I don’t believe this is politics; much of it is simply fact. Granted, on really big projects, with several years of deployment and realization, believe me, technology is a really, really small factor, and you’re right in that respect: there are much heavier decisions to make than that – the whole infrastructure, relationships, implementation, and strategic decisions such as partners, featured tech specs, SLAs, commitments, and on and on.
But here it comes: as a technology, at least in my opinion, ZFS is for 95% of users (the armchair engineers) the best freely available choice in storage technology.
Journaling and snapshots are today standard parts of new deployments. They are “killer features” today – or call them what you want: good technologies.
And that there will always be another one with better results is good for the consumer, good for the market, and is called “evolution”.
Sure, I’m just a young project assistant with only 3-5 years of experience, but I’m passionate about all kinds of information technology. If you know of other filesystems, currently available to the vast majority of users, that are technologically better than ZFS, then please let me know – even though I know this won’t last forever 😉
I was doing all kinds of testing on ZFS, and found that while it is quite slow under linux using FUSE (not unexpected) it is quite powerful on OpenSolaris. Very nice! Too bad I can barely run OpenSolaris on any kind of modern hardware – support for storage and LAN controllers is quite spotty.
After finally getting something together that would actually run OpenSolaris I was starting to put it and ZFS through some serious paces and building confidence to begin using it as my storage platform of choice.
Then I decided I would check out the forums, just to make sure I had a good support community available if I had any problems or questions.
Well! I stumbled into all kinds of support threads about people experiencing severe unrecoverable data loss on ZFS, and really very lame responses from Sun to their problems. Basically their response amounted to “if you don’t run professional Sun server hardware, you shouldn’t be surprised that you lose data on ZFS. ZFS makes certain assumptions about hardware quality that you only get with Sun systems.” This seems quite odd, considering Sun actually tries to get people to run OpenSolaris on their “unprofessional” non-Sun systems, and a key feature of OpenSolaris is ZFS. There is a big disconnect here.
So I immediately stopped using ZFS for primary storage, and continue to test it as replicated, secondary storage. There’s a sizeable trust issue that needs to be overcome.
Dave,
I am one of those people. After a clean shutdown and reboot, ZFS declared my JBOD (12 discs, 2 as hot spares, 2 for checksums) utterly unreadable. I thought maybe I’d made a dumb mistake, like powering off before the discs were fully synced.

Not so. A month later, the same scenario. This time I was ultra careful, but it didn’t help; the JBOD was again unreadable. Sun support were completely unable to help or even suggest how it had happened (unimpressive for a pricy support contract).

End result? Lost data. A week of grief retrieving what I could from the backup tapes (the ones that the ZFS literature said I didn’t need).

And no ZFS *ever again* at my employer’s organisation, or anywhere else I get to have a say in it.

What a pity.
Crazy huh? Sure demonstrates the importance of NOT making one’s IT procurement decisions based on Press Releases and Tech Journalism. (No Offense intended Robin)
“This time I was ultra careful, but it didn’t help; the JBOD was again unreadable”
Oh please, troll much…
Dave
It turns out that cheap hardware does not always adhere to the standards. For instance, some discs report to ZFS that they have written data to the disc when in fact they have not – it is still in the cache, which makes performance look good. (Just like some Linux NFS solutions: they cheat and therefore give good benchmarks, but it is not safe. You can mimic this cheating behavior in OpenSolaris NFS now, too – just know that it is fast, not safe.) ZFS is fooled by cheap hardware, and that can cause problems. If you use OK hardware that obeys the standards, then ZFS is safe.
When Sun says ZFS is for cheap hardware, “cheap” means non-SAS and non-SCSI – SAS and SCSI can be very expensive. Instead, Sun talks about ordinary SATA discs (which obey the standards). And surely a SATA solution must be considered cheap in comparison to a SAS solution?
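The cheating behavior described above can be modeled in a few lines. This is a hypothetical toy model, not real disc firmware: an honest disc commits its cache on flush, a cheating disc acknowledges the flush without committing, and a power cut exposes the difference:

```python
class Disc:
    """Toy model of a disc write cache. An honest disc commits cached
    writes to the platter on flush; a cheating disc acks the flush but
    commits nothing, so a power cut loses data the OS believed was safe."""
    def __init__(self, honors_flush: bool):
        self.honors_flush = honors_flush
        self.cache = {}     # volatile write cache
        self.platter = {}   # durable media

    def write(self, addr, data):
        self.cache[addr] = data  # acknowledged immediately: looks fast

    def flush(self):
        if self.honors_flush:
            self.platter.update(self.cache)
            self.cache.clear()
        # the cheater returns success without committing anything

    def power_cut(self):
        self.cache.clear()       # volatile contents are gone

honest = Disc(honors_flush=True)
honest.write(0, b"txn")
honest.flush()
honest.power_cut()
assert honest.platter.get(0) == b"txn"   # data survived

cheat = Disc(honors_flush=False)
cheat.write(0, b"txn")
cheat.flush()                             # ack'd, but a lie
cheat.power_cut()
assert cheat.platter.get(0) is None       # "written" data is gone
```

Both discs look identical in a benchmark; only the power cut tells them apart – which is why a filesystem can do everything right and still lose data on hardware that lies.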
Hi Kebabbert,
I understand what you’re saying, but that doesn’t absolve ZFS. They just need to design it with a bit of paranoia built in. Don’t just blindly trust that the cheap hardware reports back accurately that the data is written. Build in some sensible protection.
I was once stuck in a horrible “updated RAID controller firmware doesn’t match the not-yet-upgraded RAID controller driver” crash loop on a critical Windows 2003 server, after being told there was a critical update that needed to be applied or data loss was a real risk. I couldn’t get the system running long enough to tell it to stop trying to restart SQL Server, so it would crash, over and over. (This all started happening remotely, and it had crashed well over 10 times, hard down, before I got in my car and drove to the data center to fix it in person.)
After going through about 20 crashes and finally getting the Windows driver updated to match the RAID controller firmware, I held my breath and started up SQL, knowing it had been subjected to 10 consecutive nasty OS crashes while it was starting up. And, to my surprise, up it came.
I realized then that Microsoft had been forced to build some half-decent paranoia into SQL because it was accustomed to being “left in the lurch” a lot by an often-crummy underlying OS.
That’s what Sun needs to build into ZFS. Stop pointing fingers.
If ZFS is not really stable on “cheap” real hardware, then better never to use it professionally – the day might come when that machine is turned into a virtualized one, and then it might fail again, depending entirely on how the virtualization software handles things.
@Carter For a distributed system have a look at Ceph http://ceph.newdream.net/about/ it’s very much like Lustre but a much newer effort, which means it gets some stuff right this time I think.
BTW, I think they solved the corruption problem on “cheap hardware” in newer versions (although I can’t find a link right now).
Hey, ZFS is a “never-update-in-place” system, it is “disc write-cache safe”, and it even turns the disc write cache on by default when a whole physical disc is allocated exclusively to ZFS, because it can deal with cache cheating later. Moreover, all data is hash/CRC protected, so errors are detected and don’t propagate further, and a journal is kept for automatic recovery. There is built-in redundancy too, and years of production deployment behind it.
How come these lame “cheap hardware” excuses pop up???
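The never-update-in-place idea mentioned above can be sketched in miniature. This is a hypothetical toy store, not ZFS’s real object layer: updates always write fresh blocks and then swap a root pointer, so an old root doubles as a free, consistent snapshot:

```python
class CowStore:
    """Toy never-update-in-place store. Writes allocate new blocks and
    replace the root map; existing blocks are never overwritten, so any
    saved root remains a valid, consistent view of the past."""
    def __init__(self):
        self.blocks = {}   # block id -> bytes (write-once)
        self.next_id = 0
        self.root = {}     # name -> block id (the "live" view)

    def write(self, name, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data               # never overwrite in place
        self.root = {**self.root, name: bid}  # new root map; old roots untouched

    def snapshot(self):
        return dict(self.root)  # a snapshot is just a copy of the root pointer

    def read(self, root, name):
        return self.blocks[root[name]]

fs = CowStore()
fs.write("f", b"v1")
snap = fs.snapshot()       # effectively free: no data is copied
fs.write("f", b"v2")
assert fs.read(fs.root, "f") == b"v2"   # live view sees the update
assert fs.read(snap, "f") == b"v1"      # snapshot still sees the old data
```

The same property is what makes crash recovery tractable: since the previous root is never destroyed in place, a torn update can fall back to it.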
It’s amazing to me how much has been stolen from IBM over the years. IBM invented and published the ideas for the copy-on-write b-tree FS that Chris stole before working for Oracle. At least Wikipedia states this fact. It’s fitting that he works for Oracle, the company that stole the relational database, also invented by IBM.
What’s more amazing to me is people who like Solaris and think it is a professional OS. I’ll give you three good reasons why AIX is superior to Solaris, from my ten years of enterprise financial-services experience as a senior infrastructure engineer: 1) I defy any Solaris administrator to show me a command to get the server’s serial number. In AIX you have three different ways, one of which is simply ‘uname -u’. If you manage hundreds of servers remotely, this matters to you. 2) Automatic MPIO and seamless external storage integration – if you manage large storage arrays on POWER systems, you know what I’m talking about. 3) Solaris had a kernel patch, 137137-9, that failed to update the boot image in run level 1, requiring a restore from boot -s. Sun support at the time had little to say about it.
Even Fedora Linux is more reliable than that, and Solaris is supposedly mission-critical! Just like the funny kernel memory leak in SunOS 5.9, or the fact that SPARC reboots itself if you look at it funny. It is a design flaw that Solaris is programmed to reboot when it detects an ECC error. I have never seen an IBM POWER system reboot itself. I had Sun UltraSPARC III and IV systems that would reboot themselves like little girls fainting in the hot sun, and the core dump would be inconclusive: the SPARC processor would swear it was the memory, and the memory would swear it was an error on the CPU. Crap!!! IBM is the original and always superior!
As for this discussion on Btrfs and ZFS – hopefully Linux developers will either add snapshot features to ReiserFS, which is already a balanced-tree design, or evolve EXT4 into a complete native b-tree FS with snapshots, and dump ReiserFS (it has bad vibes from the past). I would NEVER use Btrfs, simply because of who is working on it. ZFS could share the fate of OpenSolaris (Oracle will probably move forward with their RHEL rebuild and let Solaris fade away, if their months-long silence toward Sun developers during the acquisition is any indication).
The only reason people used Solaris is because Sun gave their hardware away back in the day. And what ugly hardware it was. It could have been Irix or OSF/1 instead, and in academia it was DEC who was giving hardware away. Solaris and ZFS will be gone soon, unless the awesome Ken Smith keeps it alive. If you work for a company who uses Sun hardware, ask your IBM rep for a “POWER-on-wheels” on-site loaner POWER7 server, especially if you think VMWare software-based virtualization is cool – IBM PowerVM hardware-based virtualization is the real deal.
Support Linux development, let’s improve EXT4 and ReiserFS – don’t trust Oracle.
@Kenneth Salerno,
1. On Solaris x86 and Solaris SPARC, try the “sneep” command. On Solaris x86 you can also use “smbios”, which is similar to Linux’s “dmidecode”. (Yes, I also manage hundreds of servers remotely.)
2. Have you heard of STMS (Storage Traffic Management System)? You’ll love its previous name, MPxIO. I guess Sun stole that from IBM as well, just like Oracle stole the idea of the RDBMS, rofl.
3. Wow, I’m glad my Sun servers are not as easily scared into a reboot as yours. Maybe we could arrange some knowledge transfer on how not to look funny at your sensitive Sun servers.
Enjoy your super-stable IBM workhorses. (Btw, I was also wondering if and when Oracle is going to change the Solaris and ZFS licenses. Now that I’ve seen they are out to do the awful thing of making a profit from things, I doubt that they will.)
@Daniel
The point of my post was that Chris has repeated history for Oracle, and that the Linux community should only support open projects. In the case of filesystems, since Linux is a monolithic kernel design where the filesystem lives in kernel code, it is also imperative that the filesystem design meet certain criteria – and as stated by others here, according to the core kernel developers ZFS does not cut it, while Btrfs is a long way from being ready for general use.
1) You proved my point – sneep is not included with the base operating system. dmidecode also doesn’t count, for the reason you stated: we’re talking about enterprise hardware, not overgrown PCs, so using the Desktop Management Interface (a /dev/mem hack) wouldn’t be possible on SPARC servers, as you said.
2) You missed the part where I said it’s automatic – as in, you boot or just type “cfgdev” and you’re done. Literally. Try using AIX or a Virtual I/O Server before you’re so sure you like Solaris better. As for Oracle, they really did take IBM’s idea for the RDBMS: http://en.wikipedia.org/wiki/E._F._Codd – which is exactly what happened in the case of Chris Mason and Btrfs.
3) You obviously didn’t own a E280R or Sun Fire V490 – lemons!
Keep using Solaris until Oracle cuts support and you’re forced to switch to Red Hat Enterprise Linux or Oracle Linux anyway. You really don’t know what you’re missing. I even suggested a way you could try a POWER7 server for free, at your work site for a month; then you can tell me if you still think Solaris is better than AIX. I should also mention I was previously a Solaris 10, 9, 8, and 2.6 (SunOS 5.10, 5.9, 5.8, 5.6) systems administrator, so I can say from extensive and painful experience that it really is not the best UNIX operating system out there, and Sun hardware is really disappointing and poorly designed – but to each his own, I guess.
@Dave @John et al
Well that’s just great. I thought I had found what I was looking for with ZFS. Now what? What do you use?
I’d like to not give up on ZFS just yet, any further discussion on this?
My plan was to put together commodity stuff – a new Sandy Bridge build – keep it low-power but with enough horsepower for dedup, put OpenIndiana on it, two boxes geo-separated, with rsync or some backup scheme between them.
The embedded Sandy Bridge CPUs have ECC RAM support at ~45 watts. Hopefully.
POWER7? Sorry, Intel has unfortunately but decisively won the architecture war. CMT SPARC processors may sound cool on paper, but the workloads that suit them are few.
AIX – it’s been roughly 23 years since I used it, but back then it was nasty. It came with something like one pty out of the box, and one had to run through no fewer than seven menus to add each additional one. And moving forward to today, even Solaris is becoming increasingly difficult to get stuff to build on (GCC 4.6.0, for example, requires something called MPC, the current version of which won’t build on Solaris 10). I can only imagine how much more of a hassle AIX would be.
Want a serial number? Do the above. Or
ESC ( show /SYS/MB
Sun x64 systems are easily the best I’ve found. No other maker I’ve come across delivers x64 systems with usable serial consoles. On an x4270 M2, e.g., I can configure the service processor and Jumpstart the OS over the network. You can’t do that with systems from HP or Dell, AFAICT.
I know this is an older post, but some of the comments in here are too significant to not address…
Are the posts by Dave, John, etc the product of paid FUD shills? Or are these real sysadmins that experienced spontaneously corrupted ZFS arrays without causal hardware failure?
My money is on them being paid shills… the three-post string of questioning / damning / confirming is just too convenient otherwise.
Can anyone with ZFS filesystem experience comment? I’m interested to know whether these comments are plausible.
@JPorter
You asked for experience…
http://discuss.joyent.com/viewtopic.php?id=19430
I am in no way associated with Joyent, Sun or any other storage vendor.
I have used Linux from the very beginning. I have also used OpenBSD since around 2.4. The problem is that my beloved Linux has become too fat, too bloated, and too crappy. Here is my advice, for what it’s worth: discontinue the use of any commercial or commercially influenced operating system. Use OpenBSD. Leave the bloat behind. Gain the security and lose some of the hardware and feature support – at least it would make hardware vendors reconsider.
ZFS info
http://blogs.oracle.com/video/entry/becoming_a_zfs_ninja
Watch part 2, then part 1 if you like it. Part 2 will show you what ZFS can do and how to make it happen.
I do hope they incorporate ZFS into Linux. But either way, a nice NFS share from a BSD system to gain these features for your Linux storage needs will work fine until it is ready.
btrfs and ZFS have some similar features, but radically different theoretical underpinnings and implementations. There are good uses for both of them.
We are moving systems from Linux (Ubuntu) to FreeBSD. ZFS is a major reason for this change.
Take a look at kFreeBSD before you do.