[zfs] recordsize: unexpected increase of disk usage when increasing it

From: Florent Rivoire <florent_at_rivoire.fr>
Date: Tue, 18 Jan 2022 13:56:46 UTC
TLDR: I rsync-ed the same data twice: once with 128K recordsize and
once with 1M, and the allocated size on disk is ~3% bigger with 1M.
Why not smaller?


Hello,

I would like some help to understand how the disk usage evolves when I
change the recordsize.

I've read several articles/presentations/forums about recordsize in
ZFS, and if I try to summarize, I mainly understood that:
- recordsize is the "maximum" size of the "objects" (i.e. "logical
blocks") that ZFS creates for both data & metadata; each object is
then compressed, allocated to one vdev, split into smaller
(ashift-sized) "physical" blocks and written to disk
- increasing recordsize is usually good when storing large files that
are not modified, because it limits the number of metadata objects
(block pointers), which has a positive effect on performance (see the
rough arithmetic just after this list)
- decreasing recordsize is useful for "database-like" workloads (ie:
small random writes inside existing objects), because it avoids write
amplification (read-modify-write of a large object for a small update)
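For example, a rough back-of-the-envelope calculation (my own
simplification, ignoring indirect-block levels and compression) of how
many data blocks, and therefore block pointers, a 1 GiB file needs at
each recordsize:

# echo $(( 1024*1024*1024 / (128*1024) )) $(( 1024*1024*1024 / (1024*1024) ))
8192 1024

i.e. 8192 block pointers at 128K versus 1024 at 1M for the same file.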

Today, I'm trying to observe the effect of increasing recordsize on
*my* data (because I'm also considering defining special_small_blocks
& using SSDs as a "special" vdev, but that is neither tested nor
discussed here, just recordsize).
So, I'm doing some benchmarks on my "documents" dataset (details in
the "notes" below), but the results are really strange to me.

When I rsync the same data to a freshly-recreated zpool:
A) with recordsize=128K: 226G allocated on disk
B) with recordsize=1M: 232G allocated on disk => bigger than with 128K ?!?

I would clearly expect the opposite: a bigger recordsize generates
less metadata, so disk usage should be smaller, and there shouldn't be
any overhead because 1M is only a maximum, not a size that is forced
for every object.
I don't mind the increased usage (I can live with a few GB more), but
I would like to understand why it happens.
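If it helps the analysis, I could also compare apparent size (du -A)
vs allocated size (plain du) per file on both pools, to see which
files actually grow; the path below is only an example, not a real
file from my dataset:

# du -Ah /bench/path/to/some-file
# du -h /bench/path/to/some-file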

I tried to give all the details of my tests below.
Did I do something wrong? Can you explain the increase?

Thanks !



===============================================
A) 128K
==========

# zpool destroy bench
# zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4

# rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
[...]
sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
total size is 240,982,439,038  speedup is 1.00

# zfs get recordsize bench
NAME   PROPERTY    VALUE    SOURCE
bench  recordsize  128K     default

# zpool list -v bench
NAME                                           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bench                                         2.72T   226G  2.50T        -         -     0%     8%  1.00x    ONLINE  -
  gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   226G  2.50T        -         -     0%  8.10%      -    ONLINE

# zfs list bench
NAME    USED  AVAIL     REFER  MOUNTPOINT
bench   226G  2.41T      226G  /bench

# zfs get all bench |egrep "(used|referenced|written)"
bench  used                  226G                   -
bench  referenced            226G                   -
bench  usedbysnapshots       0B                     -
bench  usedbydataset         226G                   -
bench  usedbychildren        1.80M                  -
bench  usedbyrefreservation  0B                     -
bench  written               226G                   -
bench  logicalused           226G                   -
bench  logicalreferenced     226G                   -

# zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
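Side note: if I read it correctly, the dump also contains a block-size
histogram showing how allocations are distributed; the exact section
title may differ between zdb versions:

# grep -A 40 "Block Size Histogram" zpool-bench-rcd128K.zdb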



===============================================
B) 1M
==========

# zpool destroy bench
# zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
# zfs set recordsize=1M bench

# rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
[...]
sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
total size is 240,982,439,038  speedup is 1.00

# zfs get recordsize bench
NAME   PROPERTY    VALUE    SOURCE
bench  recordsize  1M       local

# zpool list -v bench
NAME                                           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bench                                         2.72T   232G  2.49T        -         -     0%     8%  1.00x    ONLINE  -
  gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   232G  2.49T        -         -     0%  8.32%      -    ONLINE

# zfs list bench
NAME    USED  AVAIL     REFER  MOUNTPOINT
bench   232G  2.41T      232G  /bench

# zfs get all bench |egrep "(used|referenced|written)"
bench  used                  232G                   -
bench  referenced            232G                   -
bench  usedbysnapshots       0B                     -
bench  usedbydataset         232G                   -
bench  usedbychildren        1.96M                  -
bench  usedbyrefreservation  0B                     -
bench  written               232G                   -
bench  logicalused           232G                   -
bench  logicalreferenced     232G                   -

# zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
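To compare the two runs, my plan is simply to diff the two dumps (or
grep the same histogram section in both files) and see where the extra
~6G shows up:

# diff zpool-bench-rcd128K.zdb zpool-bench-rcd1M.zdb | less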



===============================================
Notes:
==========

- the source dataset contains ~50% pictures (raw files and jpg), and
also some music, various archived documents, zips, and videos
- no change on the source dataset while testing (cf. the size logged
by rsync)
- I repeated the tests twice (128K, then 1M, then 128K, then 1M), with
the same results
- probably not important here, but:
/dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
(WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
on another zpool that I never tweaked except for ashift=12 (because it
uses the same model of Red 3TB)

# zfs --version
zfs-2.0.6-1
zfs-kmod-v2021120100-zfs_a8c7652

# uname -a
FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
75566f060d4(HEAD) TRUENAS  amd64