zfs support in makefs

From: Mark Johnston <markj_at_freebsd.org>
Date: Wed, 18 May 2022 19:03:17 UTC

Hi,

For the past little while I've been working on ZFS support in makefs(8).
At this point I'm able to create a bootable FreeBSD VM image, using the
standard FreeBSD ZFS layout, and run through the regression test suite
in bhyve.  I've also been able to create and boot an EC2 AMI.

Some background is below for anyone interested, and I would greatly
appreciate feedback on the interface, described further below.

The initial diff is here: https://reviews.freebsd.org/D35248

Comments here or in the review are welcome.

=== Background ===

The goal is to enable creation of ZFS-based VM images, in particular by
release(7).  Currently one can do this by creating a pool on a
file-backed memory disk and populating it with "make installworld", but
this has a few drawbacks:

1. The resulting images are not reproducible.  That is, if one creates
   two ZFS images with identical contents, the images themselves will not
   be byte-identical.  For instance, each pool gets a randomly generated
   GUID, as does each vdev, and there are other sources of
   non-determinism besides.
2. Creating a zpool requires root privileges by default and can't be done
   at all in a jail.
3. Populating the image is a resource-intensive operation: the kernel
   will cache the output files until the pool is exported, etc.

For UFS images we use makefs to solve these problems, so I wanted to try
and take the same approach for ZFS.  I assume that the appeal of using
ZFS as the root filesystem for VMs is obvious.

I initially implemented ZFS support in makefs using libzpool.so, which
is effectively a copy of the OpenZFS kernel code compiled for userspace.
It is mostly used for testing and debugging.  This worked and was
relatively simple to implement, but it only solved problem 2.  Bending
libzpool to satisfy my requirements seemed difficult, and the result
would require continuous maintenance as OpenZFS evolves and its internal
interfaces change.  I spent some time hacking libzpool to limit its
memory and CPU usage and gave up; while it was functional, the result
was painfully slow.

I then looked at the bits used by the loader to load files off of a boot
volume, and implemented the creation of ZFS images from scratch, i.e.,
without reusing OpenZFS code.  This required more effort but I believe
it'll be easier to maintain in the long run, and it solves all three
problems above.
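
As a rough illustration of what "reproducible" means in point 1, here is
a minimal sketch of a byte-for-byte comparison of two images.  The
helper name same_image is made up for this example, and the makefs
lines are left commented out since they assume a makefs built with the
patch from the review and an input tree at ./world.  Whether two runs
really compare equal depends on the implementation and on the input
tree.

```shell
# Hypothetical helper: succeed only if the two files are byte-identical.
same_image() {
    cmp -s "$1" "$2"
}

# Assumed usage (requires a makefs with ZFS support and an input tree):
#   makefs -t zfs -s 10g -o poolname=test a.img ./world
#   makefs -t zfs -s 10g -o poolname=test b.img ./world
#   same_image a.img b.img && echo "images are identical"
```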

The implementation is mostly derived from an old ZFS on-disk format
specification (http://www.giis.co.in/Zfs_ondiskformat.pdf), various blog
posts, and lots of time spent staring at zdb output.  I reused some code
from the boot loader: the nvlist implementation, since the one in
sys/contrib doesn't have some required features, and zfsimpl.h, which
contains C structs describing various on-disk data structures.

ZFS in general is pretty complex so this effort required some
specialization to the problem at hand.  In particular, makefs
- always creates a pool with a single disk vdev with all data written in
  a single transaction group; there are no snapshots, no RAID-Z/dRAID,
  no redundant block copies, no ZIL, no encryption, no gang blocks, no
  zvols, etc.,
- does not implement compression,
- doesn't preserve holes in files,
- always creates pools at version 5000, i.e., all feature flags are off
  and have to be enabled separately,
- does not try to do any clever metaslab placement or sizing, on the
  basis that the pool will likely be expanded upon first boot anyway,
- doesn't use spill blocks and is not particularly clever when it comes
  to choosing block sizes, creating some avoidable internal
  fragmentation (though it doesn't seem too bad relative to OpenZFS
  without compression, maybe 10% overhead in some unscientific tests).

Some of these can be addressed (especially compression and sparse file
support), but I wanted to get some feedback before spending more time on
this.  Really this thing is just intended to do the minimum necessary to
provide ZFS-based VM images.

=== Interface ===

Creating a pool with a single dataset is easy:

$ makefs -t zfs -s 10g -o poolname=test ./zfs.img /path/to/input

Upon importing such a pool, you'll get a dataset named "test" mounted at
/test containing everything under /path/to/input.

It's possible to set properties on the root dataset:

$ makefs -t zfs -s 10g -o poolname=test -o fs=test:setuid=off:atime=on ./zfs.img /path/to/input

It's also possible to create additional datasets:

$ makefs -t zfs -s 10g -o poolname=test -o fs=test/ds1:mountpoint=/test/dir1 ./zfs.img /path/to/input

The parameter syntax is
"-o fs=<dataset name>[:<prop1>=<val1>[:<prop2>=<val2>[:...]]]".  Only a
few properties are supported, at least for now.
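
To make the syntax concrete, here is a small sketch of a shell helper
that joins a dataset name and its properties into the value passed to
-o.  The function name fs_opt is made up for this example and is not
part of makefs; it only does string concatenation and does not check
whether a property is supported.

```shell
# Hypothetical helper: print "fs=<dataset>[:<prop>=<val>...]" for use
# after "-o".  The first argument is the dataset name, the remaining
# arguments are prop=val pairs.
fs_opt() {
    dataset=$1
    shift
    opt="fs=${dataset}"
    for prop in "$@"; do
        opt="${opt}:${prop}"
    done
    printf '%s\n' "$opt"
}

fs_opt zroot/tmp mountpoint=/tmp exec=on setuid=off
# prints: fs=zroot/tmp:mountpoint=/tmp:exec=on:setuid=off
```

It could then be used as, e.g., makefs ... -o "$(fs_opt test/ds1 mountpoint=/test/dir1)" ...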

Dataset mountpoints behave the same as they would if created with the
standard ZFS tools.  So by default the root dataset's mountpoint is
/test, test/ds1's mountpoint is /test/ds1, etc.  If a dataset overrides
its default mountpoint, its children inherit that mountpoint.
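
The inheritance rule can be sketched as a small shell function.  This is
only an illustration of the usual ZFS behaviour (parent mountpoint plus
the last component of the child's name), not code from makefs; the name
inherit_mountpoint is made up, and the handling of "none" is an
assumption based on how the standard tools behave.

```shell
# Hypothetical helper: derive a child dataset's mountpoint from its
# parent's mountpoint ($1) and the child's dataset name ($2).
inherit_mountpoint() {
    parent_mp=$1
    child=${2##*/}      # last component of the dataset name
    case $parent_mp in
    none) printf '%s\n' none ;;                     # nothing to append to
    /)    printf '/%s\n' "$child" ;;                # avoid a double slash
    *)    printf '%s/%s\n' "$parent_mp" "$child" ;;
    esac
}

inherit_mountpoint /test test/ds1            # prints: /test/ds1
inherit_mountpoint /test/dir1 test/ds1/sub   # prints: /test/dir1/sub
```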

makefs builds the output filesystem using a single input directory tree.
Thus, makefs -t zfs requires that at least one of the datasets'
mountpoints map to /path/to/input; that is, there is a "root" mount
point.

The -o rootpath parameter defines this root mount point.  By default it's
"/<poolname>".  All datasets in the pool must have their mountpoints
under this path, and one dataset's mountpoint must be equal to this
path.  To build bootable images, one sets -o rootpath=/.
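
The "under this path" requirement can be sketched as a prefix check on
path components.  The function below is a hypothetical illustration and
not how makefs implements it; the name under_rootpath is made up, and
the sketch does not model datasets with mountpoint=none.

```shell
# Hypothetical check: succeed if mountpoint $2 is equal to, or located
# below, rootpath $1.
under_rootpath() {
    root=${1%/}         # "/" becomes "", "/test/" becomes "/test"
    case $2 in
    "${root:-/}"|"$root"/*) return 0 ;;
    *)                      return 1 ;;
    esac
}

under_rootpath / /usr/home && echo "/usr/home is under /"
under_rootpath /test /tmp || echo "/tmp is not under /test"
```

Matching on "$root"/* rather than on a plain string prefix keeps a path
such as /testing from being treated as if it were under /test.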

Putting it all together, one can build an image using the standard layout
with an invocation like this:

makefs -t zfs -o poolname=zroot -s 20g -o rootpath=/ -o bootfs=zroot/ROOT/default \
    -o fs=zroot:canmount=off:mountpoint=none \
    -o fs=zroot/ROOT:mountpoint=none \
    -o fs=zroot/ROOT/default:mountpoint=/ \
    -o fs=zroot/tmp:mountpoint=/tmp:exec=on:setuid=off \
    -o fs=zroot/usr:mountpoint=/usr:canmount=off \
    -o fs=zroot/usr/home \
    -o fs=zroot/usr/ports:setuid=off \
    -o fs=zroot/usr/src \
    -o fs=zroot/usr/obj \
    -o fs=zroot/var:mountpoint=/var:canmount=off \
    -o fs=zroot/var/audit:setuid=off:exec=off \
    -o fs=zroot/var/crash:setuid=off:exec=off \
    -o fs=zroot/var/log:setuid=off:exec=off \
    -o fs=zroot/var/mail:atime=on \
    -o fs=zroot/var/tmp:setuid=off \
    ${HOME}/tmp/zfs.img ${HOME}/tmp/world

I'll admit this is somewhat clunky, but it doesn't seem worse than what
we have to do otherwise; see poudriere-image, for example:
https://github.com/freebsd/poudriere/blob/master/src/share/poudriere/image_zfs.sh#L79

What do folks think of this interface?  Is there anything missing, or
anything that doesn't make sense?