My main desktop/server machine (running Debian sid) at home has been running XFS on mdadm raid-1 on a pair of SSDs for the last few years. A few days ago, one of the SSDs died.
I’ve been planning to switch to ZFS as the root filesystem for a while now, so instead of just replacing the failed drive, I took the opportunity to convert it.
NOTE: at this point in time, ZFS On Linux does NOT support TRIM for either datasets or zvols on SSD. There’s a patch almost ready (TRIM/Discard support from Nexenta #3656), so I’m betting on that getting merged before it becomes an issue for me.
Here’s the procedure I came up with:
1. Buy new disks, shut down the machine, install the new disks, reboot.
The details of this stage are unimportant, and the only thing to note is that I’m switching from mdadm RAID-1 with two SSDs to ZFS with two mirrored pairs (RAID-10) on four SSDs (Crucial MX300 275G – at around $100 AUD each, they’re hard to resist). Buying four 275G SSDs is slightly more expensive than buying two of the 525G models, but will perform a lot better.
When installed in the machine, they ended up as /dev/sdp, /dev/sdq, /dev/sdr, and /dev/sds. I’ll be using the symlinks in /dev/disk/by-id/ for the zpool, but for partitioning and setup it’s easiest to use the /dev/sd? device nodes.
2. Partition the disks identically with GPT partition tables, using gdisk and sgdisk.
I need:
- A small partition (type EF02, 1MB) for grub to install itself in. Needed on GPT.
- A small partition (type EF00, 1GB) for EFI System. I’m not currently booting with UEFI, but I want the option to move to it later.
- A small partition (type 8300, 2GB) for /boot. I want /boot on a separate partition to make it easier to recover from problems that might occur with future upgrades. 2GB might seem excessive, but as this is my tftp & dhcp server I can’t rely on network boot for rescues, so I want to be able to put rescue ISO images in there and boot them with grub and memdisk. This will be mdadm RAID-1, with 4 copies.
- A larger partition (type 8200, 4GB) for swap. With 4 identically partitioned SSDs, I’ll end up with 16GB swap (using zswap for block-device backed compressed RAM swap).
- A large partition (type bf07, 210 GB) for my rootfs.
- A small partition (type bf08, 2 GB) to provide ZIL for my HDD zpools.
- A larger partition (type bf09, 32 GB) to provide L2ARC for my HDD zpools.
ZFS On Linux uses partition type bf07 (“Solaris Reserved 1”) natively, but doesn’t seem to care what the partition types are for ZIL and L2ARC. I arbitrarily used bf08 (“Solaris Reserved 2”) and bf09 (“Solaris Reserved 3”) for easy identification. I’ll set these up later, once I’ve got the system booted – I don’t want to risk breaking my existing zpools by taking away their ZIL and L2ARC (and forgetting to zpool remove them, which I might possibly have done once) if I have to repartition.
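For later reference, once the system is booted, adding those partitions to an existing HDD pool looks something like this (just a sketch with placeholder serial numbers, “export” being my HDD pool; note the point about ashift in the notes at the end):
zpool add -o ashift=12 export log \
    mirror /dev/disk/by-id/ata-Crucial_CT275MX300SSD1_SERIAL1-part6 \
           /dev/disk/by-id/ata-Crucial_CT275MX300SSD1_SERIAL2-part6
zpool add -o ashift=12 export cache \
    /dev/disk/by-id/ata-Crucial_CT275MX300SSD1_SERIAL1-part7 \
    /dev/disk/by-id/ata-Crucial_CT275MX300SSD1_SERIAL2-part7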
I used gdisk to interactively set up the partitions:
# gdisk -l /dev/sdp
GPT fdisk (gdisk) version 1.0.1
Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
Found valid GPT with protective MBR; using GPT.
Disk /dev/sdp: 537234768 sectors, 256.2 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 4234FE49-FCF0-48AE-828B-3C52448E8CBD
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 537234734
Partitions will be aligned on 8-sector boundaries
Total free space is 6 sectors (3.0 KiB)
Number Start (sector) End (sector) Size Code Name
1 40 2047 1004.0 KiB EF02 BIOS boot partition
2 2048 2099199 1024.0 MiB EF00 EFI System
3 2099200 6293503 2.0 GiB 8300 Linux filesystem
4 6293504 14682111 4.0 GiB 8200 Linux swap
5 14682112 455084031 210.0 GiB BF07 Solaris Reserved 1
6 455084032 459278335 2.0 GiB BF08 Solaris Reserved 2
7 459278336 537234734 37.2 GiB BF09 Solaris Reserved 3
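If you’d rather script the layout than drive gdisk interactively, an sgdisk invocation along these lines should produce an equivalent table (a sketch only, matching the sizes above; start sectors and the size of the final partition will differ slightly from my gdisk output, so check with gdisk -l before relying on it):
# WARNING: --zap-all wipes the existing partition table
sgdisk --zap-all /dev/sdp
sgdisk -n1:0:+1M   -t1:EF02 -c1:"BIOS boot partition" \
       -n2:0:+1G   -t2:EF00 -c2:"EFI System" \
       -n3:0:+2G   -t3:8300 -c3:"Linux filesystem" \
       -n4:0:+4G   -t4:8200 -c4:"Linux swap" \
       -n5:0:+210G -t5:BF07 -c5:"Solaris Reserved 1" \
       -n6:0:+2G   -t6:BF08 -c6:"Solaris Reserved 2" \
       -n7:0:0     -t7:BF09 -c7:"Solaris Reserved 3" \
       /dev/sdp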
I then cloned the partition table to the other three SSDs with this little script:
clone-partitions.sh
#! /bin/bash
src='sdp'
targets=( 'sdq' 'sdr' 'sds' )
for tgt in "${targets[@]}"; do
sgdisk --replicate="/dev/$tgt" /dev/"$src"
sgdisk --randomize-guids "/dev/$tgt"
done
3. Create the mdadm array for /boot, the zpool, and the root filesystem.
Most rootfs on ZFS guides that I’ve seen say to call the pool rpool, then create a dataset called "$(hostname)-1" and then create a ROOT dataset under that. So on my machine, that would be rpool/ganesh-1/ROOT. Some reverse the order of hostname and the rootfs dataset, for rpool/ROOT/ganesh-1.
There might be uses for this naming scheme in other environments but not in mine. And, to me, it looks ugly. So I’ll use just $(hostname)/root for the rootfs, i.e. ganesh/root.
I wrote a script to automate it, figuring I’d probably have to do it several times in order to optimise performance. Also, I wanted to document the procedure for future reference, and have scripts that would be trivial to modify for other machines.
create.sh
#! /bin/bash
exec &> ./create.log
hn="$(hostname -s)"
base='ata-Crucial_CT275MX300SSD1_'
md='/dev/md0'
md_part=3
md_parts=( $(/bin/ls -1 /dev/disk/by-id/${base}*-part${md_part}) )
zfs_part=5
# 4 disks, so use the top half and bottom half for the two mirrors.
zmirror1=( $(/bin/ls -1 /dev/disk/by-id/${base}*-part${zfs_part} | head -n 2) )
zmirror2=( $(/bin/ls -1 /dev/disk/by-id/${base}*-part${zfs_part} | tail -n 2) )
# create /boot raid array
mdadm "$md" --create \
--bitmap=internal \
--raid-devices=4 \
--level 1 \
--metadata=0.90 \
"${md_parts[@]}"
mkfs.ext4 "$md"
# create zpool
zpool create -o ashift=12 "$hn" \
mirror "${zmirror1[@]}" \
mirror "${zmirror2[@]}"
# create zfs rootfs
zfs set compression=on "$hn"
zfs set atime=off "$hn"
zfs create "$hn/root"
zpool set bootfs="$hn/root" "$hn"
# mount the new /boot under the zfs root
mkdir -p "/$hn/root/boot"
mount "$md" "/$hn/root/boot"
If you want or need other ZFS datasets (e.g. for /home, /var etc) then create them here in this script. Or you can do that later after you’ve got the system up and running on ZFS.
If you run mysql or postgresql, read the various tuning guides for how to get best performance for databases on ZFS (they both need their own datasets with particular recordsize and other settings). If you download Linux ISOs or anything with bit-torrent, avoid COW fragmentation by setting up a dataset to download into with recordsize=16K, and configure your BT client to move the downloads to another directory on completion.
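For example, a torrent download dataset could be created with something like this (the dataset name and mountpoint here are just illustrative, not part of my setup):
zfs create -o recordsize=16K -o mountpoint=/var/tmp/bt-incoming ganesh/bt-incoming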
I did this after I got my system booted on ZFS. For my db, I stopped the postgres service, renamed /var/lib/postgresql to /var/lib/p, and created the new datasets with:
zfs create -o recordsize=8K -o logbias=throughput -o mountpoint=/var/lib/postgresql \
-o primarycache=metadata ganesh/postgres
zfs create -o recordsize=128k -o logbias=latency -o mountpoint=/var/lib/postgresql/9.6/main/pg_xlog \
-o primarycache=metadata ganesh/pg-xlog
followed by rsync, and then started postgres again.
4. rsync my current system to it.
Log out all user sessions, shut down all services that write to the disk (postfix, postgresql, mysql, apache, asterisk, docker, etc). If you haven’t booted into recovery/rescue/single-user mode, then you should be as close to it as possible – everything non-essential should be stopped. I chose not to boot to single-user in case I needed access to the web to look things up while I did all this (this machine is my internet gateway).
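Something like this covers the services listed above (unit names vary between distributions and versions, so treat it as a sketch rather than an exhaustive list):
for svc in postfix postgresql mysql apache2 asterisk docker ; do
    systemctl stop "$svc"
done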
Then:
hn="$(hostname -s)"
time rsync -avxHAXS -h -h --progress --stats --delete / /boot "/$hn/root/"
After the rsync, my 130GB of data from XFS was compressed to 91GB on ZFS with transparent lz4 compression.
Run the rsync again if (as I did) you realise you forgot to shut down postfix (causing newly arrived mail to not be on the new setup) or something.
You can do a (very quick & dirty) performance test now by running zpool scrub "$hn", then running watch zpool status "$hn". As there should be no errors to correct, you should get scrub speeds approximating the combined sequential read speed of all vdevs in the pool. In my case, I got around 500-600M/s – I was kind of expecting closer to 800M/s, but that’s good enough… the Crucial MX300s aren’t the fastest drives available (but they’re great for the price), and ZFS is optimised for reliability more than speed. The scrub took about 3 minutes to scan all 91GB. My HDD zpools get around 150 to 250M/s, depending on whether they have mirror or RAID-Z vdevs and on what kind of drives they have.
For real benchmarking, use bonnie++ or fio.
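e.g. a quick random-read run with fio might look something like this (the target directory and job parameters are arbitrary examples, not what I ran):
mkdir -p /var/tmp/fio-test
fio --name=randread --directory=/var/tmp/fio-test --rw=randread --bs=4k \
    --size=2G --numjobs=4 --ioengine=libaio --runtime=60 --group_reporting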
5. Prepare the new rootfs for chroot, chroot into it, edit /etc/fstab and /etc/default/grub.
This script bind mounts /proc, /sys, /dev, and /dev/pts before chrooting:
chroot.sh
#! /bin/sh
hn="$(hostname -s)"
for i in proc sys dev dev/pts ; do
mount -o bind "/$i" "/${hn}/root/$i"
done
chroot "/${hn}/root"
Change /etc/fstab (on the new zfs root) to have the zfs root and the ext4-on-raid-1 /boot:
ganesh/root / zfs defaults 0 0
/dev/md0 /boot ext4 defaults,relatime,nodiratime,errors=remount-ro 0 2
I haven’t bothered with setting up the swap at this point. That’s trivial and I can do it after I’ve got the system rebooted with its new ZFS rootfs (which reminds me, I still haven’t done that :).
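For the record, when I do get around to it, it would be something along these lines (a sketch only, since I may end up using mdadm or a ZVOL for swap instead):
# format each 4GB swap partition
for d in /dev/sd[pqrs]4 ; do mkswap "$d" ; done
# then one fstab line per partition, all at equal priority so the kernel stripes across them
# (UUIDs from blkid would be more robust than /dev/sd? names):
# /dev/sdp4  none  swap  sw,pri=1  0  0
zswap itself is enabled by adding zswap.enabled=1 to the kernel command line.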
Add boot=zfs to the GRUB_CMDLINE_LINUX variable in /etc/default/grub. On my system, that’s:
GRUB_CMDLINE_LINUX="iommu=noagp usbhid.quirks=0x1B1C:0x1B20:0x408 boot=zfs"
NOTE: If you end up needing to run rsync again as in step 4 above, copy /etc/fstab and /etc/default/grub to the old root filesystem first. I suggest copying them to /etc/fstab.zfs and /etc/default/grub.zfs.
6. Install grub
Here’s where things get a little complicated. Running grub-install on /dev/sd[pqrs] is fine; we created the type EF02 partition for it to install itself into.
But running update-grub to generate the new /boot/grub/grub.cfg will fail with an error like this:
/usr/sbin/grub-probe: error: failed to get canonical path of `/dev/ata-Crucial_CT275MX300SSD1_163313AADD8A-part5'.
IMO, that’s a bug in grub-probe – it should look in /dev/disk/by-id/ if it can’t find what it’s looking for in /dev/.
I fixed that problem with this script:
fix-ata-links.sh
#! /bin/sh
cd /dev
ln -s /dev/disk/by-id/ata-Crucial* .
After that, update-grub works fine.
NOTE: you will have to add udev rules to create these symlinks, or run this script on every boot, otherwise you’ll get that error every time you run update-grub in future.
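A rule file along these lines should do it (an untested sketch, relying on the ID_BUS and ID_SERIAL variables that the standard persistent-storage rules already set; adjust the match to suit your drives):
99-ata-links.rules
SUBSYSTEM=="block", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="Crucial_CT275MX300SSD1_*", SYMLINK+="ata-$env{ID_SERIAL}"
SUBSYSTEM=="block", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="partition", ENV{ID_SERIAL}=="Crucial_CT275MX300SSD1_*", SYMLINK+="ata-$env{ID_SERIAL}-part%n"
Put it in /etc/udev/rules.d/ and run udevadm trigger (or reboot) to create the links.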
7. Prepare to reboot
Unmount proc, sys, dev/pts, dev, the new raid /boot, and the new zfs filesystems. Set the mount point for the new rootfs to /.
umount-zfs-root.sh
#! /bin/sh
hn="$(hostname -s)"
md="/dev/md0"
for i in dev/pts dev sys proc ; do
umount "/${hn}/root/$i"
done
umount "$md"
zfs umount "${hn}/root"
zfs umount "${hn}"
zfs set mountpoint=/ "${hn}/root"
zfs set canmount=off "${hn}"
8. Reboot
Remember to configure the BIOS to boot from your new disks.
The system should boot up with the new rootfs, no rescue disk required as in some other guides – the rsync and chroot stuff has already been done.
9. Other notes
- If you’re adding partition(s) to a zpool for ZIL, remember that ashift is per vdev, not per zpool. So remember to specify ashift=12 when adding them. e.g.
zpool add -o ashift=12 export log \
    mirror ata-Crucial_CT275MX300SSD1_163313AAEE5F-part6 \
           ata-Crucial_CT275MX300SSD1_163313AB002C-part6
- Check that all vdevs in all pools have the correct ashift value with:
zdb | grep -E 'ashift|vdev|type' | grep -v disk
10. Useful references
Reading these made it much easier to come up with my own method. Highly recommended.
You state you’ll have 4GB of swap on each disk for 16GB total, which means you’re doing raid-0 for swap. If you care about your system’s reliability, I’d suggest you also use raid-10, for a total of only 8GB of swap.
Hello,
I see that you don’t use raid for the swap; that means if one disk fails with data in the swap, the process with its data in the swap will crash. It seems strange to me to put so much energy into a raid system and still have a SPOF.
@Troy, @claudex:
I haven’t even set up the swap yet, it’s not a high priority for me. This machine rarely swaps, so 4x4GB is overkill.
I’m still in the process of doing more important things. e.g. I just used zfs send to move /var/lib/docker from my HDD “export” pool to this new SSD pool. I expect I’ll be moving most of my KVM VM ZVOLs too, the ones I run frequently, anyway. I’m not even sure that I’m going to use any or all of those 4 partitions for swap anyway, I may end up swapping to a ZVOL and keep the partitions spare in case I need a non-ZFS filesystem for any reason.
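For anyone curious, that kind of move is roughly: snapshot, send/receive, then switch the mountpoint. The dataset names below are illustrative rather than exactly what I used:
zfs snapshot export/docker@migrate
zfs send export/docker@migrate | zfs receive ganesh/docker
# stop docker, point the mountpoint at the new dataset, then destroy the old one
zfs set mountpoint=/var/lib/docker ganesh/docker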
I’ve got plenty of disk space: this ZFS root pool, a 4TB RAIDZ-1 “export” pool for general use, and an 8TB mirror “backup” pool for backing up all the machines on my network (which also contains other stuff “backed up on teh interwebs”, like my local Debian mirror). So “losing” 16GB is no big deal.
Now that I’ve got more space on /, I can move some of the stuff back from HDD /export/home/cas to SSD /home/cas where it’ll be faster even without L2ARC.
As for SPOF and energy invested – it’s impossible to entirely eliminate single points of failure. Whatever you do, you’ll still end up with at least one, so the point is to minimise the probability of them causing significant damage. Effort is relative, this wasn’t terribly difficult. I documented it because I’d never done it before (although I have been using ZFS On Linux since around 2010), and I still have two more machines here to convert when I get some new SSDs for them….that won’t be immediately, so I don’t want to forget and then have to re-invent this process.
Hello cas,
thanks a lot for this interesting instruction.
I’m about to do nearly the same thing (with fewer disks) and I’d like to change to ZFS(oL).
It will be interesting and fun to enter the zfs world with your kind of concept.
I’ll try this first with my existing good old crunchbang to shift it over into a virtual machine – so this process will be kind of double fun (or stress) ;-D …but I think I will learn a lot again.
After that I will change my workstation to Arch with zfs at the bottom. That will be a bit easier to install because everything will be fresh.
And I can also use my old OS as a VM for daily use if I run into some time-consuming trouble with my new OS.
So thanks again.