Ahead of the Curve | Tom Yager » Sun's ZFS is close to perfect, but widely misunderstood

October 24, 2007

Sun's ZFS is close to perfect, but widely misunderstood

You've read about ZFS, the advanced storage management facility baked into Sun Microsystems' Solaris Unix operating system. It is Sun's invention, yet Sun has opened it to the world by including ZFS in the mass of Solaris code that it has open-sourced.

You've read about ZFS, but you may not know as much about it as you think. ZFS has been reported to be much faster than other file systems. That basic tidbit creates a pair of misconceptions about ZFS: First, that it is all about speed, and second, that ZFS is a file system. Neither speaks to the core purpose or advantage of ZFS. Where speed is concerned, ZFS is as fast as your disks, controllers, and device drivers. It is subject to the same hardware bottlenecks as all other means of managing storage. ZFS is also not what the majority of people, even in IT, understand as a file system. In most instances, a file system is a fixed structure laid down on a blank disk that has been split into partitions or slices. One formats a partition to create a file system, and that file system is a prerequisite for the storage of files. Without the file system, a computer wouldn't know where its files live.

ZFS tosses common understanding about file systems out the window. It begins with pools of storage. The definition of a pool is one of ZFS's most confounding aspects, but it is key to ZFS's unlimited flexibility. A ZFS pool is made up of any combination of devices, real or logical, that provide persistent storage. ZFS can take a bunch of raw disks and string them together into a striped RAID pool, with the protection of mirroring or single or double parity. A ZFS pool can be an arbitrary collection of partitions within disks. A pool can be made from an assortment of ordinary files. And a single pool can contain any mix of the device types I've described. If it presents to the system as a disk or a file, ZFS can stuff it in a pool.
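To make those pool layouts a little more concrete, here is a minimal Python sketch that only assembles `zpool create` command lines and prints them; it runs nothing. The pool name `tank`, the Solaris-style device names, and the backing-file paths are placeholders, and the `zpool_create` helper is hypothetical rather than part of any ZFS tooling. The `mirror`, `raidz`, and `raidz2` keywords are the ones `zpool` uses for mirroring, single parity, and double parity.

```python
import shlex

# Hypothetical helper: builds (but never executes) a `zpool create` command.
# `layout` is "" for a plain stripe, or one of the zpool vdev keywords
# "mirror", "raidz" (single parity), "raidz2" (double parity).
def zpool_create(pool, devices, layout=""):
    if layout not in ("", "mirror", "raidz", "raidz2"):
        raise ValueError(f"unsupported layout: {layout!r}")
    argv = ["zpool", "create", pool]
    if layout:
        argv.append(layout)
    argv.extend(devices)
    return argv

# Placeholder device names and file paths; substitute your own.
examples = [
    zpool_create("tank", ["c1t0d0", "c1t1d0"]),                    # striped raw disks
    zpool_create("tank", ["c1t0d0", "c1t1d0"], "mirror"),          # mirrored pair
    zpool_create("tank", ["c1t0d0", "c1t1d0", "c1t2d0"], "raidz"), # single parity
    zpool_create("tank", ["c1t0d0", "c1t1d0", "c1t2d0", "c1t3d0"], "raidz2"),
    zpool_create("tank", ["/export/vdevs/file0", "/export/vdevs/file1"]),  # ordinary files
]

for argv in examples:
    print(shlex.join(argv))
```

Printing the commands instead of running them keeps the sketch safe to try on a machine that has no ZFS and no spare disks.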

A ZFS pool is handled as its name implies. It's a big vat of persistent storage that you can tap to create a hierarchical structure of directories and files, which brings us to the file system part of ZFS. But it isn't a file system as you know it. One doesn't fill buckets of finite size from a ZFS pool the way you would with any other variety of software or hardware RAID. Instead, a ZFS pool is structured like a municipal water system. The reservoir is shared by all consumers, any one of whom can drain the whole pool or use just a bit of it. What you'd associate with a file system is really just a name tag, like a street address, used for accounting and administrative convenience, but you can rename, remove, and add file systems to a ZFS pool at will -- and nondestructively while the storage is in active use. You can add storage to a pool at any time, and it is immediately available to all file systems in the pool. There is no delay for formatting because ZFS doesn't format.
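The reservoir idea can be sketched as a toy accounting model in Python. This is an illustration of the concept only, not of ZFS internals: the `Pool` class and its methods are invented for the example, sizes are whole gigabytes, and redundancy overhead is ignored.

```python
# Toy model only: it mimics the shared-pool accounting idea, not ZFS itself.
class Pool:
    def __init__(self, device_sizes_gb):
        self.capacity_gb = sum(device_sizes_gb)
        self.used_gb = {}  # file system name -> space consumed

    def free_gb(self):
        return self.capacity_gb - sum(self.used_gb.values())

    def create_fs(self, name):
        # A file system is just a name used for accounting; it reserves nothing.
        self.used_gb[name] = 0

    def rename_fs(self, old, new):
        # Renaming moves the label; the stored data is untouched.
        self.used_gb[new] = self.used_gb.pop(old)

    def write(self, name, gb):
        if gb > self.free_gb():
            raise OSError("pool is full")
        self.used_gb[name] += gb

    def add_device(self, size_gb):
        # New capacity is visible to every file system at once.
        self.capacity_gb += size_gb


pool = Pool([100, 100])
pool.create_fs("home")
pool.create_fs("projects")
pool.write("home", 150)      # one consumer may take most of the pool
print(pool.free_gb())        # 50
pool.add_device(200)
pool.write("projects", 220)  # would not have fit before the device was added
print(pool.free_gb())        # 30
```

Neither file system was given a size up front, which is the contrast with carving fixed volumes out of a RAID set.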

It is possible to set up ZFS file systems to work as allocated storage with hard or soft limits imposed on file system nodes and on users that store data on those nodes. If you're particularly fond of limits, ZFS can create file systems that behave exactly like old-fashioned mountable volumes built on formatted partitions, but this is only a mirage. ZFS remains limitlessly flexible even when its capabilities are masked to fit the preferences of the administrator. But there's rarely a need to do this because ZFS is easy enough for a child to manage. Seriously, two commands with blissfully simple syntax run the whole show. See Paul Venezia's screencast of ZFS in action for an example.
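The two commands in question are `zpool` and `zfs`. As a hedged sketch, the Python below builds a few representative command lines and prints them without executing anything. The dataset name `tank/home`, the sizes, and the device names are placeholders. `quota` caps how much a file system may consume and `reservation` guarantees it a minimum; these are the closest matches to the limits described above, though not necessarily the exact settings the column has in mind.

```python
import shlex

# Small hypothetical helper: returns a `zfs` argument list, never runs it.
def zfs(*args):
    return ["zfs", *args]

commands = [
    zfs("create", "tank/home"),                  # new file system, no formatting step
    zfs("set", "quota=10G", "tank/home"),        # upper bound on space consumed
    zfs("set", "reservation=2G", "tank/home"),   # space guaranteed to this file system
    zfs("rename", "tank/home", "tank/users"),    # rename while the pool is in use
    ["zpool", "add", "tank", "mirror", "c2t0d0", "c2t1d0"],  # grow the pool
]

for argv in commands:
    print(shlex.join(argv))
```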

More than anything, ZFS is unlimited. Pools can span practically infinite numbers of devices and quantities of storage. Each pool can hold a practically limitless number of file systems. In this context, the word "practically" refers to the bounds of practicality, such as the number of 3.5-inch hard drives that one can pack into a football stadium.

Explaining ZFS in one article is difficult only because it is so capable. The best way to sum up ZFS is that it makes it possible to do anything, no matter how insane, with persistent storage. I find ZFS to be so remarkable that you can count on it being a frequent subject of discussion here and in my blogs. If you haven't checked out ZFS yet, do, because it will eventually become ubiquitously implemented in IT. It is too brilliant not to be.

Posted by Tom Yager on October 24, 2007 03:00 AM


COMMENTS




Hmmm. ZFS, while indeed very interesting, seems to be accompanied by a significant amount of hype. Let's cut through some of it:

ZFS is not particularly fast (nor particularly slow). It does offer above-average small-to-medium-sized update speed (a characteristic of copy-on-write/batch-write-back implementations), but at the cost of streaming-read performance for material thus written (because the small updates are distributed around the storage rather than organized in physical sequence as is the case in extent-based, update-in-place systems with policies that minimize fragmentation). And it has a 'RAID-Z' implementation that can suffer dramatically by comparison to conventional RAID-5 for workloads characterized by lots of parallel small-to-medium-sized accesses (where RAID-Z performs more like RAID-3).

On the other hand, while more conventional file systems may be somewhat faster for many workloads, ZFS does leverage its copy-on-write approach to provide two atypical features:

1. Its updates are intrinsically 'atomic' - even if the system crashes at an arbitrary moment, they either complete, or seem never to have occurred at all. Conventional file systems don't normally guarantee this by default (some files just don't need it, and others require some form of recovery procedure after a crash anyway which can handle any inconsistencies), and those that offer it as an option must write updated material twice: first to a journal that will protect the update, then to the final location - though the journal can be in NVRAM to make the additional overhead negligible, or NVRAM write-back cache in the storage hardware may effectively substitute for the journal, and when material is being written for the first time rather than updated only the structural pointers to it rather than the data itself need be thus protected to guarantee atomicity.

2. ZFS can detect errors that conventional file systems cannot, by virtue of special checksums that it embeds in its metadata. While the frequency of these errors is too low to be important to the average desktop user, in environments with unusually high reliability requirements it provides an edge.

It's worth noting that the 'WAFL' file system in NetApp's line of products provides similar features to those two described above, leading to some discussion about just how much of ZFS actually *is* "Sun's invention".

ZFS's extremely flexible use of a variety of storage devices may be more unprecedented, and for environments that find conventional RAID management challenging (especially when storage must grow) this may be ZFS's most important feature (though it is a feature that some virtualized arrays can come close to equaling when paired with more conventional file systems). Unfortunately, it doesn't help when storage needs grow to exceed the capacity of a single server, but at least makes growth fairly painless up to that point. And the ability to use space flexibly across multiple file systems within a common space pool (similar to the ability to use underlying space flexibly across multiple LUNs in a virtualized array) is also a manageability benefit.

So while ZFS really isn't all that 'close to perfect', nor as entirely novel as Sun might have one believe, on the whole it does constitute a measurable stride forward in storage and while not an ideal fit for everyone should be an excellent fit for some.

Posted by: Bill Todd at November 3, 2007 02:32 AM

Enjoyed the article and comments on ZFS.
However, little mention was made about file naming conventions and compatibility with and between Unix, DOS, Apple and Windows.

Are file names in a pool automatically restricted to the capabilities - or lack thereof - specific to each of these file systems, or does it provide a flexible universal interface?

To wit, Microsoft has all but given up on short file name compatibility with their 'LFN'.

In our homegrown fast file management system (O'SESAME ffm), we restrict and manage the first four levels within our CRM system in order to provide and maintain an open but controlled 'folder structure' compatible with the above fab four. The user is still free to name the final file whatever they wish, but we suggest users name their documents using our compatibility reference.

Thanks again for the update.
DKWagner

Posted by: DKWagner at November 3, 2007 07:10 AM

Wow. I think Paul Venezia's screencast is the worst ZFS screencast I've ever seen. I think he spent 10 minutes with ZFS before he started it. Embarrassing.

Posted by: pressy at November 5, 2007 09:34 AM

If ZFS is Sun's invention, then why does it resemble the Polycenter Advanced File System so closely?
It looks more like an incomplete evolution of ADVFS.
Which one appeared first, after all?

Posted by: brasilio castilho at November 29, 2007 04:32 AM

ZFS's unlimited flexibility? Ahem, you can't even remove a device from a pool yet.

From the ZFS FAQ (http://opensolaris.org/os/community/zfs/faq/#deviceremoval):

3. Can devices be removed from a ZFS pool?

Removal of a top-level vdev, such as an entire RAID-Z group or a disk in an unmirrored configuration, is not currently supported. This feature is planned for a future release.

Posted by: srevel at March 19, 2008 02:02 AM

I think srevel should have included the last part of the FAQ article as well:

Can devices be removed from a ZFS pool?

Removal of a top-level vdev, such as an entire RAID-Z group or a disk in an unmirrored configuration, is not currently supported. This feature is planned for a future release.

You can remove a device from a mirrored ZFS configuration by using the zpool detach command.

You can replace a device with a device of equivalent size in both a mirrored or RAID-Z configuration by using the zpool replace command.

So, in essence you CAN remove or replace a device, you just cannot do it unless the data is mirrored somewhere else.

How does this differ from LVM on top of a striped RAID? If I remove a disk from the stripe, the entire volume goes off into the weeds immediately.

Posted by: ACMadsen at March 26, 2008 03:15 AM
