ZFS

From Hack Sphere Labs Wiki
Revision as of 23:02, 26 April 2013 by Webdawg (talk | contribs) (Replication)

ZFS is a combined file system and logical volume manager created by Sun Microsystems. For more official information see the Wikipedia ZFS entry. This means that ZFS is not only a file system but also functions as software RAID. ZFS has many features and solves problems such as the RAID 5 write hole, yet it is also extremely simple to use. ZFS is easy, fast, flexible, and under active development.

ZFS Usable OSes:

  • Solaris (Sun's main OS; no longer free, only available as a trial version)
  • OpenSolaris (Sun's open-source OS)
  • FreeBSD (ZFS is being ported to this OS and while it is stable it lags behind the Solaris release for obvious reasons)
  • Nexenta (This OS claims to be the Solaris kernel with the Ubuntu (Linux) userland which seems nice but it has no window manager by default)
  • StormOS (This OS is the same as Nexenta but has GNOME installed by default)

OS Support

  • I have been using ZFS and OpenIndiana for over a year now! Works great!!!
  • I have been using ZFS and FreeBSD for a while now! Works great!!!
  • I have not tested on Linux but here is a full port of it: http://zfsonlinux.org/

ZFS Background

During my research I found some of the information on ZFS confusing, and I decided there were two reasons for that. First, ZFS is still relatively new and people have many different questions about it. Second, it seems to have changed a lot over the course of its development. Even when I asked questions in the #ZFS IRC channel on Freenode, some of the people using ZFS gave me opinions rather than solid facts.

From what I have managed to gather, ZFS prefers whole disks and is most often used that way. It can also be used with slices (the term for partitions in BSD and Solaris) and even with files. Files were the most interesting option to me because they let one experiment with and understand ZFS without destroying anything, and they open up some interesting possibilities, such as completely ignoring performance in order to use the full size of every disk. Like traditional RAID, ZFS cannot make full use of different-sized disks in one raidz or mirror group: it takes the size of the smallest device and applies that as the maximum for every device in the group. I think this is a limitation that ZFS should overcome, but one step at a time I suppose.

As far as commands go, try 'man zpool' and 'man zfs', and look at the links at the bottom of the page.
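
The file-backed approach described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name make_file_pool, the vdev file naming and the example pool name are made up for this page, and the files are written with dd because mkfile is Solaris-specific.

```shell
#!/bin/sh
# Hypothetical helper (not part of ZFS): build a throwaway raidz pool
# from plain files so you can experiment without spare disks.
# usage: make_file_pool POOL DIR COUNT SIZE_MB
make_file_pool() {
    pool=$1 dir=$2 count=$3 size_mb=$4
    files=""
    i=1
    while [ "$i" -le "$count" ]; do
        f="$dir/$pool-vdev$i"
        # Write SIZE_MB megabytes of zeros to back one virtual device.
        dd if=/dev/zero of="$f" bs=1048576 count="$size_mb" 2>/dev/null || return 1
        files="$files $f"
        i=$((i + 1))
    done
    # File vdevs must be given as absolute paths, so DIR should be absolute.
    # $files is intentionally unquoted so each path becomes its own argument
    # (this sketch therefore does not support paths containing spaces).
    zpool create "$pool" raidz $files
}
```

For example, make_file_pool scratch /var/tmp 3 100 would create three 100 MB files under /var/tmp and build a raidz pool named scratch from them; zpool destroy scratch removes the pool again, after which the files can be deleted.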

ZFS Possibilities + A Poor Man's Raid

As mentioned above, ZFS can use files and slices, so I could divide drives into whatever sizes I want and use them as I please. With files and slices, though, one loses performance, because ZFS enables the disk write cache by default only when it is given whole disks. You could have two 1 terabyte drives and one 500 GB drive and end up with 1.5 terabytes of usable space: you would divide all the drives into 500 GB slices or files and use the double-parity raidz2 level of ZFS.

RAID 5, or raidz as ZFS calls it. The capacity equation from 'man zpool':

A raidz group with N disks of size X with P parity disks can hold approximately (N-P)*X bytes and can withstand P device(s) failing before data integrity is compromised. The minimum number of devices in a raidz group is one more than the number of parity disks. The recommended number is between 3 and 9 to help increase performance.

Ex:

If you have five 500 GB drives, or two 1 TB drives and one 500 GB drive divided into 500 GB slices or files, a raidz2 group built from them would survive the failure of any one drive.

(5-2)*500GB = 1.5 terabytes

This has the redundancy of two 500 GB devices. So if one of the 1 terabyte drives failed (taking its two slices with it) you would be fine, but you would have no redundancy left.
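
The arithmetic above can be checked with a few lines of shell. This is a minimal sketch; raidz_capacity is a made-up helper name, and the result is only the approximate figure from the man page formula (real pools lose a little more to metadata).

```shell
#!/bin/sh
# Hypothetical helper: approximate usable raidz capacity, (N - P) * X.
# usage: raidz_capacity N_DEVICES PARITY SIZE_PER_DEVICE
raidz_capacity() {
    n=$1 p=$2 x=$3
    # A raidz group needs at least one more device than parity devices.
    if [ "$n" -le "$p" ]; then
        echo "need more than $p devices for parity level $p" >&2
        return 1
    fi
    echo $(( (n - p) * x ))
}

# Five 500 GB slices in raidz2: prints 1500 (GB), i.e. 1.5 terabytes.
raidz_capacity 5 2 500
```

The size is unit-agnostic: pass gigabytes in and the answer comes back in gigabytes.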

You could do this, but ZFS currently would not perform optimally this way (supposedly 3-5x slower), and it may be simpler just to buy more drives. Still, this shows what is possible with ZFS. One could even build a Poor Man's RAID this way and still be safe. The only consideration is the performance of such an array. Slices seem to be easy to use if they are not part of a root ZFS filesystem; with files one has to worry about corruption of the backing files from outside ZFS and about possible configuration issues outside ZFS.

The way ZFS was meant to be used is with whole disks. It enables the disk write cache by default and is extremely simple in most cases. It is as simple as finding out your device names and running 'zpool create mirrorname mirror devicename1 devicename2' or 'zpool create mirrorname mirror filepathname1 filepathname2' or 'zpool create raidmirror raidz file/device/slicename file/device/slicename' on any system with ZFS installed. The pool that ZFS creates is mounted at /poolname (for example /mirrorname or /raidmirror) on the root filesystem.

ZFS Misc Notes

In OpenSolaris, format -e or just format will list your device names. Also, if you use entire drives with ZFS and you are not root, it will not find the devices: su first and then issue the command, and you should be fine.

ZFS also has its own share management: the zfs command can create your NFS or CIFS shares through the sharenfs and sharesmb dataset properties.
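
As a rough illustration of the sharing support, the sketch below wraps the sharenfs and sharesmb dataset properties in small helpers. The function names and the dataset name tank/media mentioned in the note are made up for this example, and whether sharesmb works depends on the platform having an SMB service that ZFS can drive (for example the Solaris kernel CIFS server).

```shell
#!/bin/sh
# Hypothetical helpers around the ZFS share properties.

# Share a dataset over NFS by turning on its sharenfs property.
share_nfs() {
    zfs set sharenfs=on "$1"
}

# Share a dataset over CIFS/SMB by turning on its sharesmb property.
share_smb() {
    zfs set sharesmb=on "$1"
}

# Show the current share settings for a dataset.
show_shares() {
    zfs get sharenfs,sharesmb "$1"
}
```

Running share_nfs tank/media as root would export that dataset over NFS, and show_shares tank/media prints the resulting property values.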

Create ZFS Raidz

  • First find the names of your disks:
format
AVAILABLE DISK SELECTIONS:
      0. c5t0d0 <ATA-ST920217AS-3.01 cyl 2429 alt 2 hd 255 sec 63>
         /pci@0,0/pci8086,202d@1f,2/disk@0,0
      1. c7t50014EE2B1448CFEd0 <ATA-WDCWD20EARX-00P-AB51 cyl 60798 alt 2 hd 255 sec 252>
         /scsi_vhci/disk@g50014ee2b1448cfe
      2. c7t50014EE25C1A7300d0 <ATA-WDCWD20EARX-00P-AB51 cyl 60798 alt 2 hd 255 sec 252>
         /scsi_vhci/disk@g50014ee25c1a7300
      3. c7t50014EE25CC0D3EEd0 <ATA-WDCWD20EARX-32P-AB51 cyl 60798 alt 2 hd 255 sec 252>
         /scsi_vhci/disk@g50014ee25cc0d3ee
      4. c7t50014EE20699B771d0 <ATA-WDCWD20EARX-00P-AB51 cyl 60798 alt 2 hd 255 sec 252>
         /scsi_vhci/disk@g50014ee20699b771
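  • Single parity, survives 1 disk failure (raidz is an alias for raidz1; leave out the raidz keyword entirely for a striped pool with no redundancy)
zpool create tank raidz c7t50014EE2B1448CFEd0 c7t50014EE25C1A7300d0 c7t50014EE25CC0D3EEd0 c7t50014EE20699B771d0
  • Single parity, survives 1 disk failure (spelled out as raidz1)
zpool create tank raidz1 c7t50014EE2B1448CFEd0 c7t50014EE25C1A7300d0 c7t50014EE25CC0D3EEd0 c7t50014EE20699B771d0
  • Double parity, survives 2 disk failures
zpool create tank raidz2 c7t50014EE2B1448CFEd0 c7t50014EE25C1A7300d0 c7t50014EE25CC0D3EEd0 c7t50014EE20699B771d0
  • etc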

ZFS Snapshots

You can make snapshots to back up information. You can destroy them, hold them, and send them across the network to another ZFS system.
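
Holding a snapshot is the least obvious of these operations, so here is a minimal sketch of it. The helper names are invented for this page; the underlying zfs hold, zfs release and zfs holds subcommands need a pool version that supports user holds.

```shell
#!/bin/sh
# Hypothetical helpers around zfs hold/release.
# A hold places a named tag on a snapshot; zfs destroy refuses to remove
# the snapshot until every hold on it has been released.

# usage: protect_snapshot TAG SNAPSHOT
protect_snapshot() {
    zfs hold "$1" "$2"
}

# usage: unprotect_snapshot TAG SNAPSHOT
unprotect_snapshot() {
    zfs release "$1" "$2"
}

# usage: list_holds SNAPSHOT
list_holds() {
    zfs holds "$1"
}
```

For instance, protect_snapshot keep tank/home@friday tags that snapshot with a hold named keep, and unprotect_snapshot keep tank/home@friday removes the tag so the snapshot can be destroyed again.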

Rolling Back

To discard all changes made since a snapshot was taken and revert the filesystem back to its state at the time the snapshot was taken:

# zfs rollback <snapshot_to_roll_back_to>
# zfs rollback test_pool/fs1@monday


Note: if the filesystem you want to roll back is currently mounted, it has to be unmounted and remounted. Use -f to force the unmount.

You can only roll back to the most recent snapshot. If you want to roll back to an earlier snapshot, either delete the snapshots in between or use the -r option, which destroys them for you.

# zfs rollback test_pool/fs1@monday
cannot rollback to 'test_pool/fs1@monday': more recent snapshots exist
use '-r' to force deletion of the following snapshots:
test_pool/fs1@tuesday
test_pool/fs1@wednesday
# zfs rollback -r test_pool/fs1@monday

Basics

Below text copied from: http://docs.oracle.com/cd/E19253-01/819-5461/gbcya/index.html

Creating and Destroying ZFS Snapshots

Snapshots are created by using the zfs snapshot command, which takes as its only argument the name of the snapshot to create. The snapshot name is specified as follows:

filesystem@snapname
volume@snapname

The snapshot name must satisfy the naming requirements in ZFS Component Naming Requirements.

In the following example, a snapshot of tank/home/ahrens that is named friday is created.

# zfs snapshot tank/home/ahrens@friday

You can create snapshots for all descendent file systems by using the -r option. For example:

# zfs snapshot -r tank/home@now
# zfs list -t snapshot
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool/ROOT/zfs2BE@zfs2BE  78.3M      -  4.53G  -
tank/home@now                 0      -    26K  -
tank/home/ahrens@now          0      -   259M  -
tank/home/anne@now            0      -   156M  -
tank/home/bob@now             0      -   156M  -
tank/home/cindys@now          0      -   104M  -

Snapshots have no modifiable properties. Nor can dataset properties be applied to a snapshot. For example:

# zfs set compression=on tank/home/ahrens@now
cannot set compression property for 'tank/home/ahrens@now': snapshot
properties cannot be modified

Snapshots are destroyed by using the zfs destroy command. For example:

# zfs destroy tank/home/ahrens@now

A dataset cannot be destroyed if snapshots of the dataset exist. For example:

# zfs destroy tank/home/ahrens
cannot destroy 'tank/home/ahrens': filesystem has children
use '-r' to destroy the following datasets:
tank/home/ahrens@tuesday
tank/home/ahrens@wednesday
tank/home/ahrens@thursday

In addition, if clones have been created from a snapshot, then they must be destroyed before the snapshot can be destroyed.

For more information about the destroy subcommand, see Destroying a ZFS File System.


Replication

Below text copied from: http://www.markround.com/archives/38-ZFS-Replication.html

ZFS Replication

As I've been investigating ZFS for use on production systems, I've been making a great deal of notes, and jotting down little "cookbook recipes" for various tasks. One of the coolest systems I've created recently utilised the zfs send & receive commands, along with incremental snapshots to create a replicated ZFS environment across two different systems. True, all this is present in the zfs manual page, but sometimes a quick demonstration makes things easier to understand and follow.

While this isn't true filesystem replication (you'd have to look at something like StorageTek AVS for that) it does provide periodic snapshots and incremental updates; these can be run every minute if you're driving this from cron - or, at even more granular intervals if you write your own daemon. Nonetheless, this suffices for disaster recovery and redundancy if you don't need up-to-the second replication between systems.

I've typed up my notes in blog format so you can follow along with this example yourself; all you'll need is a Solaris system running ZFS.

First, as with my last walkthrough, I'll create a couple of files to use for testing purposes. In a real-life scenario, these would most likely be pools of disks in a RAIDZ configuration, and the two pools would also be on physically separate systems. I'm only using 100 MB files for each, as that's all I need for this proof of concept.

   [root@solaris]$ mkfile 100m master
   [root@solaris]$ mkfile 100m slave
   [root@solaris]$ zpool create master $PWD/master
   [root@solaris]$ zpool create slave $PWD/slave
   [root@solaris]$ zpool list
   NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
   master                 95.5M   84.5K   95.4M     0%  ONLINE     -
   slave                  95.5M   52.5K   95.4M     0%  ONLINE     -
   [root@solaris]$ zfs list
   NAME                   USED  AVAIL  REFER  MOUNTPOINT
   master                  77K  63.4M  24.5K  /master
   slave                 52.5K  63.4M  1.50K  /slave

There we go. The naming should be pretty self-explanatory : The "master" is the primary storage pool, which will replicate and push data through to the backup "slave" pool.

Now, I'll create a ZFS filesystem and add something to it. I had a few source tarballs knocking around, so I just unpacked one (GNU grep) to give me a set of files to use as a test :

   [root@solaris]$ zfs create master/data
   [root@solaris]$ cd /master/data/
   [root@solaris]$ gtar xzf ~/grep-2.5.1.tar.gz
   [root@solaris]$ ls
   grep-2.5.1

We can also see from "zfs list" we've now taken up some space :

   [root@solaris]$ zfs list
   NAME                   USED  AVAIL  REFER  MOUNTPOINT
   master                3.24M  60.3M  25.5K  /master
   master/data           3.15M  60.3M  3.15M  /master/data
   slave                 75.5K  63.4M  24.5K  /slave

Now, we'll transfer all this over to the "slave", and start the replication going. We first need to take an initial snapshot of the filesystem, as that's what "zfs send" works on. It's also worth noting here that in order to transfer the data to the slave, I simply piped it to "zfs receive". If you're doing this between two physically separate systems, you'd most likely just pipe this through SSH between the systems and set up keys to avoid the need for passwords. Anyway, enough talk :

   [root@solaris]$ zfs snapshot master/data@1
   [root@solaris]$ zfs send master/data@1 | zfs receive slave/data

This now sent it through to the slave. It's also worth pointing out that I didn't have to recreate the exact same pool or zfs structure on the slave (which may be useful if you are replicating between dissimilar systems), but I chose to keep the filesystem layout the same for the sake of legibility in this example. I also simply used a numeric identifier for each snapshot; in a production system, timestamps may be more appropriate.

Anyway, let's take a quick look at "zfs list", where we'll see the slave has now gained a snapshot utilising exactly the same amount of space as the master :

   [root@solaris]$ zfs list
   NAME                   USED  AVAIL  REFER  MOUNTPOINT
   master                3.25M  60.3M  25.5K  /master
   master/data           3.15M  60.3M  3.15M  /master/data
   master/data@1             0      -  3.15M  -
   slave                 3.25M  60.3M  24.5K  /slave
   slave/data            3.15M  60.3M  3.15M  /slave/data
   slave/data@1              0      -  3.15M  -

Now, here comes a big "gotcha". You now have to set the "readonly" attribute on the slave. I discovered that if this was not set, even just cd-ing into the slave's mountpoints would cause things to break in subsequent replication operations; presumably down to metadata (access times and the like) being altered.

   [root@solaris]$ zfs set readonly=on slave/data
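
To run the replication as a non-root user, the necessary permissions can be delegated with zfs allow (here to username on the pool tank):

   zfs allow username send,receive,clone,create,destroy,hold,mount,promote,rename,rollback,share,snapshot tank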

So, let's look in the slave to see if our files are there :

   [root@solaris]$ ls /slave/data
   grep-2.5.1

Excellent stuff! However, the real coolness starts with the incremental transfers - instead of transferring the whole lot again, we can just send only the bits of data that actually changed - this will drastically reduce bandwidth and the time taken to replicate data, making a "cron" based system of periodic snapshots and transfers feasible. To demonstrate this, I'll unpack another tarball (this time, GNU bison) on the master so I have some more data to send :

   [root@solaris]$ cd /master/data
   [root@solaris]$ gtar xzf ~/bison-2.3.tar.gz

And we'll now make a second snapshot, and transfer differences between this one and the last :

   [root@solaris]$ zfs snapshot master/data@2
   [root@solaris]$ zfs send -i master/data@1 master/data@2 | zfs receive slave/data

Checking to see what's happened, we see the slave has gained another snapshot:

   [root@solaris]$ zfs list
   NAME                   USED  AVAIL  REFER  MOUNTPOINT
   master                10.2M  53.3M  25.5K  /master
   master/data           10.1M  53.3M  10.1M  /master/data
   master/data@1         32.5K      -  3.15M  -
   master/data@2             0      -  10.1M  -
   slave                 10.2M  53.3M  25.5K  /slave
   slave/data            10.1M  53.3M  10.1M  /slave/data
   slave/data@1          32.5K      -  3.15M  -
   slave/data@2              0      -  10.1M  -

And our new data is now there as well :

   [root@solaris]$ ls /slave/data/
   bison-2.3   grep-2.5.1

And that's it. All that remains to turn this into a production system between two hosts is for a periodic cron job to be written that runs at the appropriate intervals (daily, or even every minute if need be) and snapshots the filesystem before transferring it. You'll also likely want to have another job that clears out old snapshots, or maybe archives them off somewhere.
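
The cron job and the clean-up job described above might look roughly like the following. This is a sketch, not the original author's script: the function names, the auto- snapshot prefix and the host and dataset names in the note are all assumptions, and it presumes the newest snapshot on the source has already been received on the destination and that the destination is kept read-only as discussed earlier.

```shell
#!/bin/sh
# Sketch of a cron-driven replication job (hypothetical helper names).

# Print the newest snapshot of exactly this dataset (nothing if none exist).
latest_snapshot() {
    zfs list -H -t snapshot -o name -s creation -r "$1" |
        awk -v p="$1@" 'index($0, p) == 1' | tail -n 1
}

# usage: replicate SRC_DATASET REMOTE_HOST DST_DATASET
replicate() {
    src=$1 host=$2 dst=$3
    prev=$(latest_snapshot "$src")
    new="$src@auto-$(date +%Y%m%d-%H%M%S)"
    zfs snapshot "$new" || return 1
    if [ -n "$prev" ]; then
        # Incremental: send only the changes between prev and new.
        zfs send -i "$prev" "$new" | ssh "$host" zfs receive "$dst"
    else
        # First run: no earlier snapshot, so send the full stream.
        zfs send "$new" | ssh "$host" zfs receive "$dst"
    fi
}

# usage: prune_snapshots DATASET KEEP
# Destroys all but the newest KEEP snapshots of DATASET.
prune_snapshots() {
    ds=$1 keep=$2
    snaps=$(zfs list -H -t snapshot -o name -s creation -r "$ds" |
        awk -v p="$ds@" 'index($0, p) == 1')
    [ -n "$snaps" ] || return 0
    total=$(printf '%s\n' "$snaps" | wc -l)
    excess=$((total - keep))
    [ "$excess" -gt 0 ] || return 0
    # The list is sorted oldest first, so the first lines are the ones to drop.
    printf '%s\n' "$snaps" | head -n "$excess" |
        while read -r snap; do
            zfs destroy "$snap" || exit 1
        done
}
```

A script that defines these functions and then calls replicate master/data backuphost slave/data followed by prune_snapshots master/data 24 could be run from cron at whatever interval suits. Keep at least one snapshot on both sides, because the next incremental send needs a snapshot that exists on the source and the destination.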

Quick Reference

zfs list -t snapshot
zfs snapshot bla/bla@??????
zfs send -v -i bla/bla@?????? bla/bla@?????? | ssh 10.0.0.6 zfs recv bla/bla/blackhole0
( zfs send -v -i bla/bla@older bla/bla@newer | ssh 10.0.0.6 zfs recv bla/bla/blackhole0 )

ZFS Resources

[1] - Great for beginners. Lets one understand how to use ZFS without having any spare hard drives.

[2] - Turning a ZFS mirror into a raidz array.

[3] - Official ZFS Admin Guide Link

[4] - ZFS Command Quick Reference

[5] - ZFS Best Practices Guide

[6] - Open Solaris Mailing Lists

[7] - Understanding the zpool status Output

File:ZFS Command Quick Reference.odt - Sun ZFS Command Quick Reference

File:819-5461.pdf - Solaris ZFS Administration Guide