ZFS
ZFS is a combined file system and logical volume manager created by Sun Microsystems. For more background, see the Wikipedia ZFS entry. This means that ZFS is not only a file system but also functions as software RAID. ZFS has many features and solves problems such as the RAID 5 write hole, yet it is also very simple to use. ZFS is easy, fast, flexible, and under active development.
Operating systems that can use ZFS:
- Solaris (Sun's main OS; no longer free, only available as a trial version)
- OpenSolaris (Sun's open source OS)
- FreeBSD (ZFS is being ported to this OS; it is stable but lags behind the Solaris release)
- Nexenta (claims to be the Solaris kernel with the Ubuntu (Linux) userland, which seems nice, but it has no window manager by default)
- StormOS (the same as Nexenta but with GNOME installed by default)
zfs set sharenfs=rw,anon=uidofuser share/name
ZFS Performance/Slow ZFS Pool
iostat -mX
zpool iostat eh 2
zpool iostat -v
svc_t - average response time of transactions, in milliseconds
Hotplug
- http://docs.oracle.com/cd/E23824_01/html/821-1459/devconfig2-25.html
- https://blogs.oracle.com/sa/entry/hotplugging_sata_drives
cfgadm -a
cfgadm -c unconfigure sata0/1
cfgadm -c configure sata0/1
Mirror rpool
- http://docs.oracle.com/cd/E19253-01/819-5461/gkdep/index.html
- http://www.nickebo.net/making-your-zfs-root-pool-a-mirror-post-installation/
- List disks
format -e c0d1 (or whatever your new disk is called)
- you probably do not need fdisk
>fdisk
>y
- k?
>label
>0 (SMI)
>y
>quit
- Copy the partition table from the existing disk to the new disk
prtvtoc /dev/rdsk/c5t0d0s2 | fmthard -s - /dev/rdsk/c5t1d0s2
- Attach drive
zpool attach -f rpool c5t0d0s0 c5t1d0s0
Make sure to wait until the resilver is done before rebooting.
- Check
zpool status
- Install grub bootloader on 2nd disk:
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c5t1d0s0
OS Support
- I have been using ZFS and OpenIndiana for over a year now! Works great!!!
- I have been using ZFS and FreeBSD for a while now! Works great!!!
- I have not tested on Linux but here is a full port of it: http://zfsonlinux.org/
ZFS Background
During my research I found some of the information on ZFS confusing, and I decided there were two reasons for that. The first is that ZFS is still relatively new and people have many different questions about it. The second is that it seems to have undergone a lot of changes during its development. Even when I asked questions in the #ZFS IRC channel on Freenode, I got opinionated answers rather than solid facts from some of the people using ZFS.
From what I have managed to gather, ZFS prefers, and is most often used with, whole disks. It can also be used with slices (the term for partitions in BSD and Solaris) and even files. Files were the most interesting to me because they allow one to experiment with and understand ZFS without destroying anything, and they open up some interesting opportunities, such as completely ignoring performance in order to utilise the full size of every disk. Like conventional RAID, ZFS cannot make full use of different-sized disks in one redundant group: it takes the size of the smallest disk and applies that as the maximum size for all of the drives. I think this is a limitation that ZFS should overcome, but one step at a time I suppose.
As far as commands go try 'man zpool' etc and look at the links at the bottom of the page.
ZFS Possibilities + A Poor Man's Raid
As I said, though, ZFS can use files and slices. That is, I could divide drives into whatever sizes I want and use them however I please. With files and slices one loses performance, because the drive write cache is enabled by default only when ZFS is given whole disks. You could have two 1 terabyte drives and one 500GB drive and have 1.5 terabytes of usable space. To do that you would use the raidz2 level of ZFS and divide all the drives into 500GB slices or files.
Raid 5, or raidz as ZFS calls it. Equation from 'man zpool':
A raidz group with N disks of size X with P parity disks can hold approximately (N-P)*X bytes and can withstand P device(s) failing before data integrity is compromised. The minimum number of devices in a raidz group is one more than the number of parity disks. The recommended number is between 3 and 9 to help increase performance.
Ex:
If you have five 500GB drives, or two 1TB drives and one 500GB drive divided into 500GB slices or files, you would be fine if any one drive failed.
(5-2)*500GB = 1.5 terabytes
This has the redundancy of two 500GB drives. So if one terabyte drive failed (taking two slices with it) you would be fine, but you would have no redundancy left.
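The (N-P)*X rule quoted above can be checked with a few lines of shell arithmetic. This is only a sketch of the approximation from the man page; the function name raidz_capacity is made up for this example, and the result is in whatever unit the device size is given in.

```shell
#!/bin/sh
# Approximate usable capacity of a raidz group: (N - P) * X
#   N = number of devices, P = number of parity devices, X = size of each device
# raidz_capacity is a hypothetical helper name, not part of ZFS.
raidz_capacity() {
    n=$1
    p=$2
    x=$3
    # A raidz group needs at least one more device than it has parity devices.
    if [ "$n" -le "$p" ]; then
        echo "need more than $p devices for $p parity device(s)" >&2
        return 1
    fi
    echo $(( ($n - $p) * $x ))
}

# Five 500GB slices with two parity devices (raidz2), as in the example above
raidz_capacity 5 2 500    # prints 1500 (GB, i.e. 1.5 terabytes)
```

The same call with P=1 shows what plain raidz would give on the same five slices: more space, but then losing a 1TB drive (two slices) would lose the pool.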
You could do this, but currently ZFS would not perform optimally (supposedly by a factor of 3-5) and it may be simpler just to buy more drives. Still, from reading this one can see what is possible with ZFS. One could even build a Poor Man's Raid this way and still be safe. The only consideration is the performance of such a raid. Slices seem to be easy to use if they are not part of a root ZFS filesystem; with files one has to worry about outside corruption of the backing files and possible outside configuration issues.
The way ZFS was meant to be used is with full disks. It enables hard drive write caching by default and is extremely simple in most cases. It is as simple as finding out your device names and running 'zpool create mirrorname mirror devicename1 devicename2' or 'zpool create mirrorname mirror filepathname1 filepathname2' or 'zpool create raidmirror raidz file/device/slicename file/device/slicename' on any system with ZFS installed. The pool that ZFS creates will be mounted at /mirrororraidname/ on the root filesystem.
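For experimenting with files instead of disks, the commands can be generated first and reviewed before anything is run. The sketch below only prints the commands; it does not create files or pools. The function name practice_pool_cmds, the pool name testpool and the /var/tmp directory are assumptions for illustration. mkfile is the Solaris tool used later on this page; file vdevs need absolute paths.

```shell
#!/bin/sh
# Print (do not run) the commands for a file-backed practice pool.
# practice_pool_cmds is a hypothetical helper; arguments are:
#   pool name, layout (mirror, raidz, raidz2, ...), number of files, directory
practice_pool_cmds() {
    pool=$1
    layout=$2
    count=$3
    dir=$4
    files=""
    i=1
    while [ "$i" -le "$count" ]; do
        # One 100MB backing file per "disk"
        echo "mkfile 100m $dir/$pool-$i"
        files="$files $dir/$pool-$i"
        i=$((i + 1))
    done
    echo "zpool create $pool $layout$files"
}

# Three backing files in a raidz layout
practice_pool_cmds testpool raidz 3 /var/tmp
```

Once the printed commands look right they can be pasted into a root shell, and the whole experiment can be thrown away afterwards with zpool destroy and by deleting the backing files.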
ZFS Misc Notes
In OpenSolaris, format -e or just format will give you your device names. Also, if you are not root when using entire drives with ZFS, it will not find the devices. Use su and then issue the command again, and you should be fine.
ZFS also has its own share maker. That is, the zfs command creates your NFS or CIFS shares through the sharenfs and sharesmb properties.
Create ZFS Raidz
- First find the names of your disks:
format
AVAILABLE DISK SELECTIONS:
0. c5t0d0 <ATA-ST920217AS-3.01 cyl 2429 alt 2 hd 255 sec 63>
   /pci@0,0/pci8086,202d@1f,2/disk@0,0
1. c7t50014EE2B1448CFEd0 <ATA-WDCWD20EARX-00P-AB51 cyl 60798 alt 2 hd 255 sec 252>
   /scsi_vhci/disk@g50014ee2b1448cfe
2. c7t50014EE25C1A7300d0 <ATA-WDCWD20EARX-00P-AB51 cyl 60798 alt 2 hd 255 sec 252>
   /scsi_vhci/disk@g50014ee25c1a7300
3. c7t50014EE25CC0D3EEd0 <ATA-WDCWD20EARX-32P-AB51 cyl 60798 alt 2 hd 255 sec 252>
   /scsi_vhci/disk@g50014ee25cc0d3ee
4. c7t50014EE20699B771d0 <ATA-WDCWD20EARX-00P-AB51 cyl 60798 alt 2 hd 255 sec 252>
   /scsi_vhci/disk@g50014ee20699b771
- No redundancy (plain striped pool; note that raidz on its own is an alias for raidz1, so it is left out here)
zpool create tank c7t50014EE2B1448CFEd0 c7t50014EE25C1A7300d0 c7t50014EE25CC0D3EEd0 c7t50014EE20699B771d0
- 1 parity disk (raidz1)
zpool create tank raidz1 c7t50014EE2B1448CFEd0 c7t50014EE25C1A7300d0 c7t50014EE25CC0D3EEd0 c7t50014EE20699B771d0
- 2 parity disks (raidz2)
zpool create tank raidz2 c7t50014EE2B1448CFEd0 c7t50014EE25C1A7300d0 c7t50014EE25CC0D3EEd0 c7t50014EE20699B771d0
- etc
ZFS Snapshots
Snapshots can be used to back up information. You can destroy them, hold them, and send them across the network to another ZFS system.
Rolling Back
To discard all changes made since a snapshot was taken and revert the filesystem back to its state at the time the snapshot was taken:
# zfs rollback <snapshot_to_roll_back_to>
# zfs rollback test_pool/fs1@monday
Note: if the filesystem you want to roll back is currently mounted, you will need to unmount it and remount it. Use -f to force the unmount.
You can only roll back to the most recent snapshot. If you want to roll back to an earlier snapshot, either delete the snapshots in between or use the -r option.
# zfs rollback test_pool/fs1@monday
cannot rollback to 'test_pool/fs1@monday': more recent snapshots exist
use '-r' to force deletion of the following snapshots:
test_pool/fs1@tuesday
test_pool/fs1@wednesday
# zfs rollback -r test_pool/fs1@monday
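Because rollback -r deletes every snapshot newer than the target, it can help to list those snapshots first. The following is a small sketch, assuming the snapshot names arrive on standard input sorted oldest first; the function name snapshots_newer_than is made up for this example.

```shell
#!/bin/sh
# Read snapshot names (oldest first) on stdin and print the ones that are
# newer than the snapshot given as $1, i.e. the ones "zfs rollback -r"
# would destroy. snapshots_newer_than is a hypothetical helper name.
snapshots_newer_than() {
    target=$1
    # Print every line after the target has been seen; the target itself
    # is not printed because "found" is set only after the print test.
    awk -v t="$target" 'found { print } $0 == t { found = 1 }'
}

# Sample input matching the error message above. On a real system the list
# could come from something like:
#   zfs list -H -t snapshot -o name -s creation -r test_pool/fs1
printf '%s\n' test_pool/fs1@monday test_pool/fs1@tuesday test_pool/fs1@wednesday |
    snapshots_newer_than test_pool/fs1@monday
```

With the sample input this prints the tuesday and wednesday snapshots, the same two that the error message warns about.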
Basics
Below text copied from: http://docs.oracle.com/cd/E19253-01/819-5461/gbcya/index.html
Creating and Destroying ZFS Snapshots
Snapshots are created by using the zfs snapshot command, which takes as its only argument the name of the snapshot to create. The snapshot name is specified as follows:
filesystem@snapname
volume@snapname
The snapshot name must satisfy the naming requirements in ZFS Component Naming Requirements.
In the following example, a snapshot of tank/home/ahrens that is named friday is created.
# zfs snapshot tank/home/ahrens@friday
You can create snapshots for all descendent file systems by using the -r option. For example:
# zfs snapshot -r tank/home@now
# zfs list -t snapshot
NAME                      USED   AVAIL  REFER  MOUNTPOINT
rpool/ROOT/zfs2BE@zfs2BE  78.3M  -      4.53G  -
tank/home@now             0      -      26K    -
tank/home/ahrens@now      0      -      259M   -
tank/home/anne@now        0      -      156M   -
tank/home/bob@now         0      -      156M   -
tank/home/cindys@now      0      -      104M   -
Snapshots have no modifiable properties. Nor can dataset properties be applied to a snapshot. For example:
# zfs set compression=on tank/home/ahrens@now
cannot set compression property for 'tank/home/ahrens@now': snapshot properties cannot be modified
Snapshots are destroyed by using the zfs destroy command. For example:
# zfs destroy tank/home/ahrens@now
A dataset cannot be destroyed if snapshots of the dataset exist. For example:
# zfs destroy tank/home/ahrens
cannot destroy 'tank/home/ahrens': filesystem has children
use '-r' to destroy the following datasets:
tank/home/ahrens@tuesday
tank/home/ahrens@wednesday
tank/home/ahrens@thursday
In addition, if clones have been created from a snapshot, then they must be destroyed before the snapshot can be destroyed.
For more information about the destroy subcommand, see Destroying a ZFS File System.
Replication
Problem?
I was having a problem with receive through ssh. If you get an error such as "zfs: command not found", the solution is to write the full path of zfs on the remote side.
zfs send DV/Sn03 | ssh reski@zfsserver /usr/sbin/zfs receive DV/Sn03
- I always run the send and receive commands in screen.
How
Below text copied from: http://www.markround.com/archives/38-ZFS-Replication.html
ZFS Replication
As I've been investigating ZFS for use on production systems, I've been making a great deal of notes, and jotting down little "cookbook recipes" for various tasks. One of the coolest systems I've created recently utilised the zfs send & receive commands, along with incremental snapshots to create a replicated ZFS environment across two different systems. True, all this is present in the zfs manual page, but sometimes a quick demonstration makes things easier to understand and follow.
While this isn't true filesystem replication (you'd have to look at something like StorageTek AVS for that) it does provide periodic snapshots and incremental updates; these can be run every minute if you're driving this from cron - or, at even more granular intervals if you write your own daemon. Nonetheless, this suffices for disaster recovery and redundancy if you don't need up-to-the-second replication between systems.
I've typed up my notes in blog format so you can follow along with this example yourself; all you'll need is a Solaris system running ZFS.
First, as with my last walkthrough, I'll create a couple of files to use for testing purposes. In a real-life scenario, these would most likely be pools of disks in a RAIDZ configuration, and the two pools would also be on physically separate systems. I'm only using 100MB files for each, as that's all I need for this proof of concept.
[root@solaris]$ mkfile 100m master
[root@solaris]$ mkfile 100m slave
[root@solaris]$ zpool create master $PWD/master
[root@solaris]$ zpool create slave $PWD/slave
[root@solaris]$ zpool list
NAME    SIZE   USED   AVAIL  CAP  HEALTH  ALTROOT
master  95.5M  84.5K  95.4M  0%   ONLINE  -
slave   95.5M  52.5K  95.4M  0%   ONLINE  -
[root@solaris]$ zfs list
NAME    USED   AVAIL  REFER  MOUNTPOINT
master  77K    63.4M  24.5K  /master
slave   52.5K  63.4M  1.50K  /slave
There we go. The naming should be pretty self-explanatory : The "master" is the primary storage pool, which will replicate and push data through to the backup "slave" pool.
Now, I'll create a ZFS filesystem and add something to it. I had a few source tarballs knocking around, so I just unpacked one (GNU grep) to give me a set of files to use as a test :
[root@solaris]$ zfs create master/data
[root@solaris]$ cd /master/data/
[root@solaris]$ gtar xzf ~/grep-2.5.1.tar.gz
[root@solaris]$ ls
grep-2.5.1
We can also see from "zfs list" we've now taken up some space :
[root@solaris]$ zfs list
NAME         USED   AVAIL  REFER  MOUNTPOINT
master       3.24M  60.3M  25.5K  /master
master/data  3.15M  60.3M  3.15M  /master/data
slave        75.5K  63.4M  24.5K  /slave
Now, we'll transfer all this over to the "slave", and start the replication going. We first need to take an initial snapshot of the filesystem, as that's what "zfs send" works on. It's also worth noting here that in order to transfer the data to the slave, I simply piped it to "zfs receive". If you're doing this between two physically separate systems, you'd most likely just pipe this through SSH between the systems and set up keys to avoid the need for passwords. Anyway, enough talk :
[root@solaris]$ zfs snapshot master/data@1
[root@solaris]$ zfs send master/data@1 | zfs receive slave/data
This now sent it through to the slave. It's also worth pointing out that I didn't have to recreate the exact same pool or zfs structure on the slave (which may be useful if you are replicating between dissimilar systems), but I chose to keep the filesystem layout the same for the sake of legibility in this example. I also simply used a numeric identifier for each snapshot; in a production system, timestamps may be more appropriate.
Anyway, let's take a quick look at "zfs list", where we'll see the slave has now gained a snapshot utilising exactly the same amount of space as the master :
[root@solaris]$ zfs list
NAME           USED   AVAIL  REFER  MOUNTPOINT
master         3.25M  60.3M  25.5K  /master
master/data    3.15M  60.3M  3.15M  /master/data
master/data@1  0      -      3.15M  -
slave          3.25M  60.3M  24.5K  /slave
slave/data     3.15M  60.3M  3.15M  /slave/data
slave/data@1   0      -      3.15M  -
Now, here comes a big "gotcha". You now have to set the "readonly" attribute on the slave. I discovered that if this was not set, even just cd-ing into the slave's mountpoints would cause things to break in subsequent replication operations; presumably down to metadata (access times and the like) being altered.
[root@solaris]$ zfs set readonly=on slave/data
- You may want to grant full permissions to username with this command (http://docs.oracle.com/cd/E19082-01/817-2271/gbchv/index.html)
zfs allow username send,receive,clone,create,destroy,hold,mount,promote,rename,rollback,share,snapshot tank
So, let's look in the slave to see if our files are there :
[root@solaris]$ ls /slave/data
grep-2.5.1
Excellent stuff! However, the real coolness starts with the incremental transfers - instead of transferring the whole lot again, we can just send only the bits of data that actually changed - this will drastically reduce bandwidth and the time taken to replicate data, making a "cron" based system of periodic snapshots and transfers feasible. To demonstrate this, I'll unpack another tarball (this time, GNU bison) on the master so I have some more data to send :
[root@solaris]$ cd /master/data
[root@solaris]$ gtar xzf ~/bison-2.3.tar.gz
And we'll now make a second snapshot, and transfer differences between this one and the last :
[root@solaris]$ zfs snapshot master/data@2
[root@solaris]$ zfs send -i master/data@1 master/data@2 | zfs receive slave/data
Checking to see what's happened, we see the slave has gained another snapshot:
[root@solaris]$ zfs list
NAME           USED   AVAIL  REFER  MOUNTPOINT
master         10.2M  53.3M  25.5K  /master
master/data    10.1M  53.3M  10.1M  /master/data
master/data@1  32.5K  -      3.15M  -
master/data@2  0      -      10.1M  -
slave          10.2M  53.3M  25.5K  /slave
slave/data     10.1M  53.3M  10.1M  /slave/data
slave/data@1   32.5K  -      3.15M  -
slave/data@2   0      -      10.1M  -
And our new data is now there as well :
[root@solaris]$ ls /slave/data/
bison-2.3
grep-2.5.1
And that's it. All that remains to turn this into a production system between two hosts is for a periodic cron job to be written that runs at the appropriate intervals (daily, or even every minute if need be) and snapshots the filesystem before transferring it. You'll also likely want to have another job that clears out old snapshots, or maybe archives them off somewhere.
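As a starting point for such a cron job, the sketch below builds the commands for one replication cycle and prints them instead of running them, so they can be checked first. It is an assumption-laden outline rather than a finished script: the function name replication_cmds is made up, the previous snapshot name has to be tracked by the caller, and for two separate hosts the receive side would go through ssh as in the Problem? section above.

```shell
#!/bin/sh
# Print the commands for one replication cycle (nothing is executed).
# replication_cmds is a hypothetical helper; arguments are:
#   source filesystem, destination filesystem,
#   previous snapshot name (empty for the first run), new snapshot name
replication_cmds() {
    src=$1
    dst=$2
    prev=$3
    new=$4
    echo "zfs snapshot $src@$new"
    if [ -z "$prev" ]; then
        # First run: no earlier snapshot exists, so send the whole filesystem
        echo "zfs send $src@$new | zfs receive $dst"
    else
        # Later runs: send only the changes between the two snapshots
        echo "zfs send -i $src@$prev $src@$new | zfs receive $dst"
    fi
}

# The incremental step from the walkthrough above
replication_cmds master/data slave/data 1 2
```

In a real job the new snapshot name would more likely be a timestamp, for example the output of date +%Y%m%d%H%M, as suggested in the walkthrough.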
- Enable Sharing and set sharename
zfs set sharesmb=name=myshare yourpool/shares/bob
- Share Filesystem
zfs set sharesmb=on fsname
- Check if shared
sharemgr show -vp
- Enable "pam_smb_passwd" to make regular OpenIndiana users have smb passwords. To do so, add the following line to the end of the file "/etc/pam.conf":
other password required pam_smb_passwd.so.1 nowarn
Quick Reference
- Initial Send
zfs send tank/data@blabla | ssh remoteserver /usr/sbin/zfs recv tank/data
zfs list -t snapshot
zfs snapshot bla/bla@??????
zfs send -v -i bla/bla@?????? bla/bla@?????? | ssh 10.0.0.6 zfs recv bla/bla/blackhole0
( zfs send -v -i bla/bla@older bla/bla@newer | ssh 10.0.0.6 zfs recv bla/bla/blackhole0 )
freebsd zfs creation
- to list drives
atacontrol list
or
camcontrol devlist
- identify sector size
camcontrol identify ada2 etc (the rest of your drives)
- If it is 4096 create nop devices, to make zfs use 4096 sized sectors
gnop create -S 4096 /dev/ada2
etc (the rest of your drives)
- create raidz pool
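zpool create tank raidz ada2.nop ada3.nop
- create raid0 pool
zpool create tank ada2.nop ada3.nop
- zfs vs zpool: pools are created with zpool; the zfs command manages the datasets inside a pool
- export the zpool, destroy the nop devices, then import the pool again (the pool is named data in this example)
zpool export data
gnop destroy /dev/ada2.nop /dev/ada3.nop
zpool import data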
You can check the configuration of the pool by using the "zdb" command on the pool:
zdb -C data | grep ashift
The ashift should be "12" for 4K alignment (2^12 = 4096). This works because ZFS writes the ashift value into the pool's metadata, so it is kept after the nop devices are destroyed.
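The ashift value is the base-2 logarithm of the sector size, which is why 4096-byte sectors give 12. A minimal sketch of that relationship follows; the function name ashift_for is made up for this example, and it assumes the sector size is a power of two.

```shell
#!/bin/sh
# Compute the ashift value (log2 of the sector size) for a power-of-two
# sector size in bytes. ashift_for is a hypothetical helper name.
ashift_for() {
    size=$1
    bits=0
    # Halve the size until it reaches 1, counting the steps
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        bits=$((bits + 1))
    done
    echo "$bits"
}

ashift_for 4096   # prints 12
ashift_for 512    # prints 9
```

If zdb reports an ashift of 9 on drives with 4096-byte sectors, the pool was created with 512-byte alignment.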
URL Reference
- http://www.mailpile.is/
- https://hakshop.myshopify.com/products/wifi-pineapple
- http://www.i-programmer.info/news/105-artificial-intelligence/6197-anonymouth-hides-identity.html
- https://lavabit.com/
- http://ivoras.sharanet.org/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html (Section 3)
ZFS Resources
[1] - Great for beginners. Lets one understand how to use ZFS without having any spare hard drives.
[2] - Turning a ZFS mirror into a raidz array.
[3] - Official ZFS Admin Guide Link
[4] - ZFS Command Quick Reference
[5] - ZFS Best Practices Guide
[6] - Open Solaris Mailing Lists
[7] - Understanding the zpool status Output
File:ZFS Command Quick Reference.odt - Sun ZFS Command Quick Reference
File:819-5461.pdf - Solaris ZFS Administration Guide
[8] - Shrink rpool
[9] - Fun with ZFS send and receive
[10] - ZFS Send/Rec
[11] - ZFS Rollback Forensics