ZFS on Linux


ZFS is a "storage platform that encompasses the functionality of traditional filesystems, volume managers, and more, with consistent reliability, functionality and performance."

For years ZFS was available on Linux only as a userspace filesystem via FUSE. That all changed back in March, when Brian Behlendorf of LLNL announced that a native Linux port of ZFS had been deemed stable. With ZFS in the kernel rather than in userland, the filesystem is a viable option for production use.

License incompatibility

Not all is rosy, though. While ZFS is a great piece of tech with numerous appealing features (more below), politics surround this otherwise wonderful union. Sun open sourced ZFS but intentionally released it under a license incompatible with the GPL, and therefore with the Linux kernel. At the time, open sourcing Solaris may have been a Hail Mary attempt at saving the now-extinct Sun, and one can see why the company would make it hard for its biggest competitor to inherit a crown jewel.

Further reading on the license issue can be found in a FAQ entry on the ZFS on Linux site.

License fix: The Solaris Porting Layer

Even though the two projects are license-incompatible, ZFS on Linux uses a shim: the Solaris Porting Layer, or SPL. It is released under the GPL and serves both as a license compatibility layer and as a kernel compatibility layer, translating Solaris kernel calls into Linux kernel calls. All of the hacks needed to make ZFS work on Linux are contained in the SPL, keeping the upstream ZFS code clean and unchanged.

Some feel this type of hack undermines the spirit of the GPL and the Linux kernel's attempt to keep all source code open and free (as in speech). This is certainly a matter of opinion, though in this case both projects have freely available source code open for all to see ...

Why bother

Given the political and licensing issues, why give ZFS so much love when Linux hackers are working hard on btrfs, a comparable filesystem bundled in the kernel? As of September 2012, btrfs is still considered experimental. ZFS has seen real-world production use since 2005, though keep in mind the SPL may not have the same exposure.

Additionally, while ZFS on Linux is relatively new, LLNL is using it in the real world with a boatload of data, 55 petabytes to be exact, albeit through low-level system calls. Considering all of this, ZFS is worth at least a try; and why not, the features are amazing.

ZFS features

ZFS brings a lot to the table, including:

On-the-fly compression

Filesystems can be configured to automatically compress data as it is written to disk and decompress it as it is read back. This increases effective storage capacity and can improve I/O throughput, at a relatively small CPU cost.

Self-healing

When set up with redundant storage, i.e. RAID-1 and above, data is checksummed on write and verified on read. When a checksum does not match, ZFS discards the bad block, retrieves a good copy from another disk in the array, and repairs the bad block on the first disk, all completely transparently to the application.
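
Checksums can also be verified in bulk with a "scrub", which walks every block in the pool and repairs anything that fails verification against a good copy. A minimal sketch, using a placeholder pool name tank:

# kick off a full-pool verification pass, then check on its progress
sudo zpool scrub tank
sudo zpool status tank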

Copy-on-write, snapshots and deduplication

ZFS efficiently stores multiple versions of the same file by storing only the differences between them rather than making full copies. Snapshots, which are near-instant, point-in-time versions of a filesystem, provide fantastic recovery options at very low cost. The zfs-auto-snapshot package provides a great setup out of the box.

ZFS also has built-in deduplication which will detect and re-use identical blocks, though it requires ample amounts of RAM. See "The Cost of Deduplication" in Aaron Toponce's ZFS Administration pages.
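
For a feel of the commands involved, here is a minimal sketch of taking a snapshot by hand and enabling deduplication; tank/data is a hypothetical filesystem, and the RAM caveat above applies:

# take a named, point-in-time snapshot of a filesystem
sudo zfs snapshot tank/data@before-upgrade

# turn on block-level deduplication for new writes (mind the RAM cost)
sudo zfs set dedup=on tank/data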

Like rsync, only better

ZFS can send entire filesystems to remote servers with the zfs send command. Like rsync, it copies data to another system; unlike rsync, it operates on whole filesystems, and incremental changes are sent using snapshots. Because a snapshot inherently knows which blocks have changed, ZFS is extremely efficient at this. Anyone who has used rsync over a high-latency network, on slow disks (EBS ...), or on a very large filesystem with relatively small changes will immediately feel the benefits.
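
As a sketch of what this looks like in practice (the pool, filesystem, snapshot, and host names are all hypothetical, and appropriate privileges on the remote side are assumed), a full copy followed by an incremental update might be:

# initial full replication of a snapshot to a remote pool
sudo zfs send tank/data@snap1 | ssh backuphost zfs receive backup/data

# later, send only the blocks that changed between the two snapshots
sudo zfs send -i tank/data@snap1 tank/data@snap2 | ssh backuphost zfs receive backup/data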

Time to play.

Getting started

This post walks through the installation on Ubuntu. If Ubuntu is not your OS of choice, there is installation documentation for a number of distributions over at the ZFS on Linux website.

Add the PPA

Add the zfs-native PPA, install the packages, and ZFS is ready to use. The commands below also install zfs-auto-snapshot:

# apt-add-repository --yes ppa:zfs-native/stable
# apt-get update
# apt-get install -y ubuntu-zfs zfs-auto-snapshot
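
Optionally, a quick sanity check that the kernel module built and loaded correctly:

# modprobe zfs
# lsmod | grep zfs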

ZFS Pools

There are two main components to ZFS: a "pool" and a "filesystem". Pools are the abstraction between the physical disks and the filesystems. Adding a disk to a pool makes its space available to the filesystems within the pool. For testing, let's make some block devices.

Create some block devices

The following loop creates four 100 MB sparse files and attaches them to loopback devices to back a test pool:

baremetal@baremetal:~/zfstest$ for i in {0..3}; do
    dd if=/dev/zero of=disk${i} bs=1 count=0 seek=100M;
    sudo losetup /dev/loop${i} $(pwd)/disk${i};
done;
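
The attached devices can be double-checked with losetup -a, which lists every active loopback device along with its backing file:

sudo losetup -a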

When done, the devices can be deleted with the command:

baremetal@baremetal:~/zfstest$ for i in {0..3}; do
    sudo losetup -d /dev/loop${i};
    rm disk${i};
done;

Redundancy types

Disks can be added to a pool as simple stripes with no redundancy (RAID-0), mirrored sets (RAID-1), single-parity RAIDZ1 (similar to RAID-5), double-parity RAIDZ2 (similar to RAID-6), or triple-parity RAIDZ3 (no traditional RAID equivalent).

With ZFS, adding disks and choosing the redundancy type are permanent operations. A pool cannot shrink once it has grown, and the redundancy type cannot be changed.

Additionally, growing a pool needs to be done in increments matching the initial redundancy selection. For example, growing a pool built from a single mirrored set requires adding another mirrored set, or ZFS will complain. Let's check out an example.

Creating a ZFS pool

The zpool create command creates a pool; the command below creates a pool with a two-drive mirror:

baremetal@baremetal:~/zfstest$ sudo zpool create ttank mirror /dev/loop0 /dev/loop1

The pool's configuration can be seen using the zpool status command:

baremetal@baremetal:~/zfstest$ sudo zpool status ttank
  pool: ttank
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    ttank       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop0   ONLINE       0     0     0
        loop1   ONLINE       0     0     0

errors: No known data errors

Growing the pool by adding a single, non-mirrored disk results in an error:

baremetal@baremetal:~/zfstest$ sudo zpool add ttank /dev/loop2
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses mirror and new vdev is disk

Adding another mirror is the way to grow:

baremetal@baremetal:~/zfstest$ sudo zpool add ttank mirror /dev/loop2 /dev/loop3
baremetal@baremetal:~/zfstest$ sudo zpool status ttank
  pool: ttank
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    ttank       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop0   ONLINE       0     0     0
        loop1   ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        loop2   ONLINE       0     0     0
        loop3   ONLINE       0     0     0

errors: No known data errors

Destroying a pool is done by running the zpool destroy command:

baremetal@baremetal:~/zfstest$ sudo zpool destroy ttank

ZFS filesystems

Userspace interacts with ZFS at the filesystem level (at least, that's how we will use it). Each pool gets a default filesystem that ZFS mounts under the pool's name; in the example above, the pool would be mounted at /ttank on the system.

There are many options that can be set on filesystems; a full list can be seen by running sudo zfs get all <filesystem name>. Some notable options, a few of which are demonstrated below, are:

  • compression
  • quota
  • NFS & Samba sharing
  • atime toggle
  • dedup
  • mountpoint
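
As a quick sketch of how a couple of these are set (using the ttank/foo filesystem created in the next section; the values are arbitrary, and NFS sharing additionally requires the usual NFS server packages):

# cap the amount of space the filesystem may consume
sudo zfs set quota=10G ttank/foo

# let ZFS export the filesystem over NFS itself
sudo zfs set sharenfs=on ttank/foo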

Creating filesystems

Filesystem creation is easy:

baremetal@baremetal:~$ sudo zfs create ttank/foo

By default, this filesystem is mounted at /ttank/foo. This can be changed by setting the mountpoint option (example below). Make sure that the mountpoint directory is empty or does not exist.

Filesystem deletion is just as easy:

baremetal@baremetal:~$ sudo zfs destroy ttank/foo

Options can be set at filesystem creation. For example, to set compression:

baremetal@baremetal:~$ sudo zfs create -o compression=lzjb ttank/foo

Options can also be changed after the fact using zfs set:

baremetal@baremetal:~$ sudo zfs set compression=lzjb ttank/foo

Note that setting compression on a filesystem that already has files will not compress the existing files. New files will be compressed as they are written.
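
The savings achieved can be checked at any time via the read-only compressratio property:

sudo zfs get compressratio ttank/foo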

Here's an example that sets compression, specifies a custom mountpoint, and disables the atime attribute for performance:

baremetal@baremetal:~$ sudo zfs create -o compression=lzjb -o mountpoint=/tmp/hi -o atime=off ttank/hi

No fstab fiddling

ZFS is intended to be a one-stop shop. It handles disk pooling. It handles filesystem creation. It even handles mounting and unmounting filesystems as they are created and destroyed. On boot, ZFS will mount the filesystems for you, all behind the scenes without having to mess with /etc/fstab entries! If for some reason mounting a filesystem is not desirable, the canmount option can be set to off:

baremetal@baremetal:~/zfstest$ sudo zfs umount ttank/foo
baremetal@baremetal:~/zfstest$ sudo zfs set canmount=off ttank/foo
baremetal@baremetal:~/zfstest$ sudo zfs mount ttank/foo
cannot mount 'ttank/foo': 'canmount' property is set to 'off'
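
Conversely, all mountable filesystems in the imported pools can be (re)mounted in one go, which is essentially what happens at boot:

sudo zfs mount -a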

Snapshots

The zfs list -t snapshot command shows all the snapshots on the system. Using grep, the output can be pared down. Here's an example of snapshots on a filesystem holding Cassandra data:

baremetal@baremetal:~$ sudo zfs list -t snapshot | grep cassandra
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_hourly-2013-10-07-2317      20K      -    33K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_hourly-2013-10-08-0017     344K      -   356K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_hourly-2013-10-08-0117        0      -   105K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_hourly-2013-10-08-0217        0      -   105K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_hourly-2013-10-08-0317     264K      -   320K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_hourly-2013-10-08-0417        0      -   386K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_hourly-2013-10-08-0517        0      -   386K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_frequent-2013-10-08-0545      0      -   386K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_frequent-2013-10-08-0600      0      -   386K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_frequent-2013-10-08-0615      0      -   386K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_hourly-2013-10-08-0617        0      -   386K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_daily-2013-10-08-0625         0      -   386K  -
tank/srv/baremetal/agent_mounts/cassandra@zfs-auto-snap_frequent-2013-10-08-0630      0      -   386K  -

The snapshots above are being created by zfs-auto-snapshot, which maintains 4 sub-hourly, 24 hourly, 31 daily, and 12 monthly snapshots.
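
Getting data back out is just as easy. Every filesystem exposes a hidden .zfs/snapshot directory at its mountpoint for read-only browsing, and zfs rollback reverts the live filesystem to a snapshot. A sketch with placeholder paths and names (note that rolling back past newer snapshots requires -r, which destroys them):

# copy a single file back out of a snapshot, read-only
cp /path/to/mountpoint/.zfs/snapshot/snapshot-name/some/file /tmp/

# or revert the whole filesystem to its most recent snapshot
sudo zfs rollback tank/data@snapshot-name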

Self-healing in action

Self-healing sounds great, but seeing is believing. Below we'll walk through a slight modification of the demo on the Oracle site. It simulates data corruption by writing some data, then using zpool export to flush everything to disk in a consistent state; this operation is similar to migrating drives from one system to another. Once the pool is exported, we'll intentionally corrupt data on one of the devices and watch ZFS work its magic.

Pool creation is the same as above with the exception of the loop devices being 300M instead of 100M.

Prepping and verifying data

Here we just copy a tarball and verify its checksum:

baremetal@baremetal:~/zfstest$ sudo cp /var/cache/server-jre-7u25-linux-x64.tar.gz /ttank/
baremetal@baremetal:~/zfstest$ sudo find /ttank -type f | xargs md5sum
7164bd8619d731a2e8c01d0c60110f80  /ttank/server-jre-7u25-linux-x64.tar.gz

Now export the pool and corrupt!

baremetal@baremetal:~/zfstest$ sudo zpool export -f ttank
baremetal@baremetal:~/zfstest$ sudo dd bs=1024k count=32 conv=notrunc if=/dev/zero of=/dev/loop0
32+0 records in
32+0 records out
33554432 bytes (34 MB) copied, 0.074169 s, 452 MB/s

When the pool is imported, ZFS will tell you something is not right. Because we corrupted the device intentionally, we can simply bring it back online and tell ZFS that everything is okay:

baremetal@baremetal:~/zfstest$ sudo zpool import -d /dev ttank
baremetal@baremetal:~/zfstest$ sudo zpool status ttank
  pool: ttank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    ttank       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        loop0   ONLINE       0     0     6
        loop1   ONLINE       0     0     0

errors: No known data errors
baremetal@baremetal:~/zfstest$ sudo zpool online ttank loop0

And, the moment of truth:

baremetal@baremetal:~/zfstest$ sudo find /ttank -type f |xargs md5sum
7164bd8619d731a2e8c01d0c60110f80  /ttank/server-jre-7u25-linux-x64.tar.gz
baremetal@baremetal:~/zfstest$

Awesome.
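
As the status output above suggests, the checksum error counters can be reset with zpool clear once the device is trusted again:

sudo zpool clear ttank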

Get in touch

Questions, comments, typos, or just random musings? Please let us know. Feel free to email me personally at roberto@baremetal.io.