Tech Notes And Miscellaneous Thoughts
 

Converting to a ZFS rootfs

My main desktop/server machine (running Debian sid) at home has been running XFS on mdadm raid-1 on a pair of SSDs for the last few years. A few days ago, one of the SSDs died.

I’ve been planning to switch to ZFS as the root filesystem for a while now, so instead of just replacing the failed drive, I took the opportunity to convert it.

NOTE: at this point in time, ZFS On Linux does NOT support TRIM for either datasets or zvols on SSD. There’s a patch almost ready (TRIM/Discard support from Nexenta #3656), so I’m betting on that getting merged before it becomes an issue for me.

Here’s the procedure I came up with:

1. Buy new disks, shutdown machine, install new disks, reboot.

The details of this stage are unimportant, and the only thing to note is that I’m switching from mdadm RAID-1 with two SSDs to ZFS with two mirrored pairs (RAID-10) on four SSDs (Crucial MX300 275G – at around $100 AUD each, they’re hard to resist). Buying four 275G SSDs is slightly more expensive than buying two of the 525G models, but will perform a lot better.

When installed in the machine, they ended up as /dev/sdp, /dev/sdq, /dev/sdr, and /dev/sds. I’ll be using the symlinks in /dev/disk/by-id/ for the zpool, but for partitioning and setup, it’s easiest to use the /dev/sd? device nodes.
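
To check which by-id symlink corresponds to which /dev/sd? node:

ls -l /dev/disk/by-id/ata-Crucial_CT275MX300SSD1_* | grep -v part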

2. Partition the disks identically with gpt partition tables, using gdisk and sgdisk.

I need:

  • A small partition (type EF02, 1MB) for grub to install itself in. Needed on GPT disks.
  • A small partition (type EF00, 1GB) for EFI System. I’m not currently booting with UEFI, but I want the option to move to it later.
  • A small partition (type 8300, 2GB) for /boot. I want /boot on a separate partition to make it easier to recover from problems that might occur with future upgrades. 2GB might seem excessive, but as this is my tftp & dhcp server I can’t rely on network boot for rescues, so I want to be able to put rescue ISO images in there and boot them with grub and memdisk. This will be mdadm RAID-1, with 4 copies.
  • A larger partition (type 8200, 4GB) for swap. With 4 identically partitioned SSDs, I’ll end up with 16GB of swap (using zswap for compressed RAM swap backed by these block devices).
  • A large partition (type bf07, 210GB) for my rootfs.
  • A small partition (type bf08, 2GB) to provide ZIL for my HDD zpools.
  • A larger partition (type bf09, 32GB) to provide L2ARC for my HDD zpools.

ZFS On Linux uses partition type bf07 (“Solaris Reserved 1”) natively, but doesn’t seem to care what the partition types are for ZIL and L2ARC. I arbitrarily used bf08 (“Solaris Reserved 2”) and bf09 (“Solaris Reserved 3”) for easy identification. I’ll set these up later, once I’ve got the system booted – I don’t want to risk breaking my existing zpools by taking away their ZIL and L2ARC (and forgetting to zpool remove them, which I might possibly have done once) if I have to repartition.

I used gdisk to interactively set up the partitions:

# gdisk -l /dev/sdp
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdp: 537234768 sectors, 256.2 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 4234FE49-FCF0-48AE-828B-3C52448E8CBD
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 537234734
Partitions will be aligned on 8-sector boundaries
Total free space is 6 sectors (3.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1              40            2047   1004.0 KiB  EF02  BIOS boot partition
   2            2048         2099199   1024.0 MiB  EF00  EFI System
   3         2099200         6293503   2.0 GiB     8300  Linux filesystem
   4         6293504        14682111   4.0 GiB     8200  Linux swap
   5        14682112       455084031   210.0 GiB   BF07  Solaris Reserved 1
   6       455084032       459278335   2.0 GiB     BF08  Solaris Reserved 2
   7       459278336       537234734   37.2 GiB    BF09  Solaris Reserved 3
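
If you’d rather script the partitioning, the same layout can be created non-interactively with sgdisk, along these lines (a sketch using the sizes from the table above, not what I actually ran):

sgdisk -a 8 \
       -n 1:40:2047  -t 1:EF02 -c 1:"BIOS boot partition" \
       -n 2:0:+1G    -t 2:EF00 -c 2:"EFI System" \
       -n 3:0:+2G    -t 3:8300 -c 3:"Linux filesystem" \
       -n 4:0:+4G    -t 4:8200 -c 4:"Linux swap" \
       -n 5:0:+210G  -t 5:BF07 -c 5:"Solaris Reserved 1" \
       -n 6:0:+2G    -t 6:BF08 -c 6:"Solaris Reserved 2" \
       -n 7:0:0      -t 7:BF09 -c 7:"Solaris Reserved 3" \
       /dev/sdp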

I then cloned the partition table to the other three SSDs with this little script:

clone-partitions.sh

#! /bin/bash

# replicate the GPT partition table from the source disk to each target,
# then randomize the GUIDs so every disk is unique.

src='sdp'
targets=( 'sdq' 'sdr' 'sds' )

for tgt in "${targets[@]}"; do
  sgdisk --replicate="/dev/$tgt" "/dev/$src"
  sgdisk --randomize-guids "/dev/$tgt"
done

3. Create the mdadm array for /boot, the zpool, and the root filesystem.

Most rootfs-on-ZFS guides that I’ve seen say to call the pool rpool, then create a dataset called "$(hostname)-1" and then create a ROOT dataset under that, so on my machine that would be rpool/ganesh-1/ROOT. Some reverse the order of the hostname and the rootfs dataset, for rpool/ROOT/ganesh-1.

There might be uses for this naming scheme in other environments, but not in mine – and, to me, it looks ugly. So I’ll just use $(hostname)/root for the rootfs, i.e. ganesh/root.

I wrote a script to automate it, figuring I’d probably have to do it several times in order to optimise performance. Also, I wanted to document the procedure for future reference, and have scripts that would be trivial to modify for other machines.

create.sh

#! /bin/bash

exec &> ./create.log

hn="$(hostname -s)"
base='ata-Crucial_CT275MX300SSD1_'

md='/dev/md0'
md_part=3
md_parts=( $(/bin/ls -1 /dev/disk/by-id/${base}*-part${md_part}) )

zfs_part=5

# 4 disks, so use the top half and bottom half for the two mirrors.
zmirror1=( $(/bin/ls -1 /dev/disk/by-id/${base}*-part${zfs_part} | head -n 2) )
zmirror2=( $(/bin/ls -1 /dev/disk/by-id/${base}*-part${zfs_part} | tail -n 2) )

# create /boot raid array
mdadm "$md" --create \
    --bitmap=internal \
    --raid-devices=4 \
    --level 1 \
    --metadata=0.90 \
    "${md_parts[@]}"

mkfs.ext4 "$md"

# create zpool
zpool create -o ashift=12 "$hn" \
    mirror "${zmirror1[@]}" \
    mirror "${zmirror2[@]}"

# create zfs rootfs
zfs set compression=on "$hn"
zfs set atime=off "$hn"
zfs create "$hn/root"
zpool set bootfs="$hn/root" "$hn"

# mount the new /boot under the zfs root
mkdir -p "/$hn/root/boot"
mount "$md" "/$hn/root/boot"

If you want or need other ZFS datasets (e.g. for /home, /var etc) then create them here in this script. Or you can do that later after you’ve got the system up and running on ZFS.
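
For example, something like this – the dataset names and mountpoints here are just placeholders, not ones I actually created:

zfs create -o mountpoint=/home    ganesh/home
zfs create -o mountpoint=/var/log ganesh/var-log

Note that a dataset created with an explicit mountpoint like /home gets mounted there immediately, over the top of the live directory, so if you create these before the reboot it’s safer to leave setting the final mountpoints until you’re running on the ZFS root.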

If you run mysql or postgresql, read the various tuning guides for how to get the best performance for databases on ZFS (they both need their own datasets with particular recordsize and other settings). If you download Linux ISOs or anything else with bit-torrent, avoid COW fragmentation by setting up a dataset with recordsize=16K to download into, and configure your BT client to move the downloads to another directory on completion.
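
e.g. for the bit-torrent case, something like this (the dataset name and mountpoint are just placeholders):

zfs create -o recordsize=16K -o mountpoint=/home/cas/bt-incoming ganesh/bt-incoming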

I did this after I got my system booted on ZFS. For my db, I stopped the postgres service, renamed /var/lib/postgresql to /var/lib/p, and created the new datasets with:

zfs create -o recordsize=8K -o logbias=throughput -o mountpoint=/var/lib/postgresql \
  -o primarycache=metadata ganesh/postgres

zfs create -o recordsize=128k -o logbias=latency -o mountpoint=/var/lib/postgresql/9.6/main/pg_xlog \
  -o primarycache=metadata ganesh/pg-xlog

followed by rsync and then started postgres again.
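
i.e. roughly something like this (a sketch, not a copy-paste of the exact command):

rsync -aHAXS /var/lib/p/ /var/lib/postgresql/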

4. rsync my current system to it.

Log out all user sessions, and shut down all services that write to the disk (postfix, postgresql, mysql, apache, asterisk, docker, etc). If you haven’t booted into recovery/rescue/single-user mode, then you should be as close to it as possible – everything non-essential should be stopped. I chose not to boot to single-user in case I needed access to the web to look things up while I did all this (this machine is my internet gateway).
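
On a systemd machine, that’s something along the lines of (the exact unit names depend on what you actually run):

systemctl stop postfix postgresql mysql apache2 asterisk docker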

Then:

hn="$(hostname -s)"
time rsync -avxHAXS -h -h --progress --stats --delete / /boot "/$hn/root/"

After the rsync, my 130GB of data from XFS was compressed to 91GB on ZFS with transparent lz4 compression.

Run the rsync again if (as I did) you realise you forgot to shut down postfix (causing newly arrived mail to not be on the new setup) or something.

You can do a (very quick & dirty) performance test now, by running zpool scrub "$hn". Then run watch zpool status "$hn". As there should be no errors to correct, you should get scrub speeds approximating the combined sequential read speed of all vdevs in the pool. In my case, I got around 500-600M/s – I was kind of expecting closer to 800M/s, but that’s good enough… the Crucial MX300s aren’t the fastest drives available (but they’re great for the price), and ZFS is optimised for reliability more than speed. The scrub took about 3 minutes to scan all 91GB. My HDD zpools get around 150 to 250M/s, depending on whether they have mirror or RAID-Z vdevs and on what kind of drives they have.

For real benchmarking, use bonnie++ or fio.
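
e.g. a quick random-read test with fio could look something like this (the target directory and the parameters are only examples, not a tuned benchmark):

fio --name=randread --directory=/ganesh/root/tmp --size=1G \
    --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
    --numjobs=4 --group_reporting --runtime=60 --time_based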

5. Prepare the new rootfs for chroot, chroot into it, edit /etc/fstab and /etc/default/grub.

This script bind mounts /proc, /sys, /dev, and /dev/pts before chrooting:

chroot.sh

#! /bin/sh

hn="$(hostname -s)"

for i in proc sys dev dev/pts ; do
  mount -o bind "/$i" "/${hn}/root/$i"
done

chroot "/${hn}/root"

Change /etc/fstab (on the new zfs root) to have the zfs root and the ext4-on-RAID-1 /boot:

ganesh/root     /         zfs     defaults                                         0  0
/dev/md0        /boot     ext4    defaults,relatime,nodiratime,errors=remount-ro   0  2

I haven’t bothered with setting up the swap at this point. That’s trivial and I can do it after I’ve got the system rebooted with its new ZFS rootfs (which reminds me, I still haven’t done that :).
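
When I do get around to it, it’ll be something along these lines (an untested sketch):

# initialise the 4GB swap partitions (partition 4 on each SSD)
for part in /dev/disk/by-id/ata-Crucial_CT275MX300SSD1_*-part4 ; do
  mkswap "$part"
done

# /etc/fstab: give all four the same priority so the kernel stripes across them, e.g.
#   /dev/disk/by-id/ata-Crucial_CT275MX300SSD1_163313AADD8A-part4  none  swap  sw,pri=1  0  0
# (one line per disk), then:
swapon -a

# and enable zswap by adding something like this to GRUB_CMDLINE_LINUX:
#   zswap.enabled=1 zswap.compressor=lz4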

Add boot=zfs to the GRUB_CMDLINE_LINUX variable in /etc/default/grub. On my system, that’s:

GRUB_CMDLINE_LINUX="iommu=noagp usbhid.quirks=0x1B1C:0x1B20:0x408 boot=zfs"

NOTE: if you end up needing to run the rsync again as in step 4 above, copy the edited /etc/fstab and /etc/default/grub to the old root filesystem first (I suggest as /etc/fstab.zfs and /etc/default/grub.zfs) so they aren’t lost when the rsync overwrites them.

6. Install grub

Here’s where things get a little complicated. Running grub-install on /dev/sd[pqrs] is fine – we created the type ef02 partition for it to install itself into.
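
i.e. something like:

for d in /dev/sd{p,q,r,s} ; do grub-install "$d" ; done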

But running update-grub to generate the new /boot/grub/grub.cfg will fail with an error like this:

/usr/sbin/grub-probe: error: failed to get canonical path of `/dev/ata-Crucial_CT275MX300SSD1_163313AADD8A-part5'.

IMO, that’s a bug in grub-probe – it should look in /dev/disk/by-id/ if it can’t find what it’s looking for in /dev/.

I fixed that problem with this script:

fix-ata-links.sh

#! /bin/sh

# grub-probe expects to find the by-id device names directly under /dev/,
# so create matching symlinks there.
cd /dev
ln -s /dev/disk/by-id/ata-Crucial* .

After that, update-grub works fine.

NOTE: you will have to add udev rules to create these symlinks, or run this script on every boot; otherwise you’ll get that error every time you run update-grub in future.
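
An untested sketch of what such a udev rule might look like, e.g. as /etc/udev/rules.d/90-grub-ata-links.rules (the filename and the rules themselves are just an example):

# create plain /dev/ata-* symlinks for the Crucial SSDs and their partitions,
# mirroring the names grub-probe is looking for
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_BUS}=="ata", ENV{ID_SERIAL}=="Crucial_CT275MX300SSD1_*", ENV{DEVTYPE}=="disk", SYMLINK+="ata-$env{ID_SERIAL}"
ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_BUS}=="ata", ENV{ID_SERIAL}=="Crucial_CT275MX300SSD1_*", ENV{DEVTYPE}=="partition", SYMLINK+="ata-$env{ID_SERIAL}-part%n"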

7. Prepare to reboot

Unmount proc, sys, dev/pts, dev, the new raid /boot, and the new zfs filesystems. Set the mountpoint for the new rootfs to /.

umount-zfs-root.sh

#! /bin/sh

hn="$(hostname -s)"
md="/dev/md0"

for i in dev/pts dev sys proc ; do
  umount "/${hn}/root/$i"
done

umount "$md"

zfs umount "${hn}/root"
zfs umount "${hn}"
zfs set mountpoint=/ "${hn}/root"
zfs set canmount=off "${hn}"

8. Reboot

Remember to configure the BIOS to boot from your new disks.

The system should boot up with the new rootfs, no rescue disk required as in some other guides – the rsync and chroot stuff has already been done.

9. Other notes

  • If you’re adding partition(s) to a zpool for ZIL, remember that ashift is per vdev, not per zpool, so be sure to specify ashift=12 when adding them. e.g.
    zpool add -o ashift=12 export log \
      mirror ata-Crucial_CT275MX300SSD1_163313AAEE5F-part6 \
             ata-Crucial_CT275MX300SSD1_163313AB002C-part6
    

    Check that all vdevs in all pools have the correct ashift value with:

    zdb | grep -E 'ashift|vdev|type' | grep -v disk
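
  • The L2ARC partitions can be added as cache vdevs in the same way – a sketch, reusing the two example disks above (their -part7 partitions); L2ARC devices are simply striped, not mirrored:
    zpool add -o ashift=12 export cache \
        ata-Crucial_CT275MX300SSD1_163313AAEE5F-part7 \
        ata-Crucial_CT275MX300SSD1_163313AB002C-part7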
    

10. Useful references

Reading these made it much easier to come up with my own method. Highly recommended.

4 Comments

  1. Troy

    You state you’ll have 4GB of swap on each disk for 16GB total, which means you’re doing raid-0 for swap. If you care about your system reliability I’d suggest you also use raid-10 for a total of only 8GB of swap.

  2. claudex

    Hello,

    I see that you don’t use raid for the swap; that means if one disk fails with data in the swap, the processes with data in the swap will crash. It seems strange to me to put so much energy into a raid system and still have a SPOF.

  3. cas

    @Troy, @claudex:

    I haven’t even set up the swap yet, it’s not a high priority for me. This machine rarely swaps, so 4x4GB is overkill.

    I’m still in the process of doing more important things. e.g. I just used zfs send to move /var/lib/docker from my HDD “export” pool to this new SSD pool. I expect I’ll be moving most of my KVM VM ZVOLs too, the ones I run frequently, anyway.

    I’m not even sure that I’m going to use any or all of those 4 partitions for swap anyway, I may end up swapping to a ZVOL and keep the partitions spare in case I need a non-ZFS filesystem for any reason.

    I’ve got plenty of disk space. This ZFS root pool, a 4TB RAIDZ-1 “export” pool for general use, and an 8TB mirror “backup” pool for backing up all the machines on my network (also contains other stuff “backed up on teh interwebs” like my local Debian mirror). So “losing” 16GB is no big deal.

    # zpool list
    NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
    backup  7.25T  3.36T  3.89T         -    21%    46%  1.00x  ONLINE  -
    export  3.62T  1.77T  1.85T     16.0E    15%    48%  1.00x  ONLINE  -
    ganesh   416G  95.8G   320G         -    12%    23%  1.00x  ONLINE  -
    

    Now that I’ve got more space on /, I can move some of the stuff back from HDD /export/home/cas to SSD /home/cas where it’ll be faster even without L2ARC.

    As for SPOF and energy invested – it’s impossible to entirely eliminate single points of failure. Whatever you do, you’ll still end up with at least one, so the point is to minimise the probability of them causing significant damage. Effort is relative; this wasn’t terribly difficult. I documented it because I’d never done it before (although I have been using ZFS On Linux since around 2010), and I still have two more machines here to convert when I get some new SSDs for them… that won’t be immediately, so I don’t want to forget and then have to re-invent this process.

  4. cg

    Hello cas,

    thanks a lot for this interesting instruction.
    I’m about to do nearly the same thing (with fewer disks) and I’d like to change to ZFS(oL).
    It will be interesting and fun to enter the zfs world with your kind of concept.

    I’ll try this first with my existing good old crunchbang, to shift it over into a virtual machine – so this process will be kind of double fun (or stress) ;-D …but I think I will learn a lot again.
    After that I will change my workstation to Arch with zfs on the bottom. That will be a bit easier to install because everything will be fresh.
    And I can also use my old OS as a VM for daily use if I run into some time-expensive trouble with my new OS.

    So thanks again.
