Re: [zfs] recordsize: unexpected increase of disk usage when increasing it

From: Rich <rincebrain_at_gmail.com>
Date: Tue, 18 Jan 2022 14:12:50 UTC
Compression would have made your life better here, and possibly also made
it clearer what's going on.

All records in a file are going to be the same size pre-compression - so if
you set the recordsize to 1M and save a 131.1M file, it's going to take up
132M on disk before compression/raidz overhead/whatnot.
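
For example (a quick sketch with a hypothetical test dataset, compression
left off so the padding stays visible):

  # zfs create -o recordsize=1M -o compression=off bench/demo
  # dd if=/dev/urandom of=/bench/demo/file bs=1m count=131
  # head -c 102400 /dev/urandom >> /bench/demo/file
  # sync; du -h /bench/demo/file

That writes 131 MiB plus a ~0.1 MiB tail, but du should report roughly 132M,
because the tail still occupies a full 1 MiB record.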

Usually compression saves you from the tail padding actually requiring
allocation on disk, which is one reason I encourage everyone to at least
use lz4 (or, if you absolutely cannot for some reason, I guess zle should
also work for this one case...)
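
(Turning it on is just a dataset property, e.g.

  # zfs set compression=lz4 bench
  # zfs get compression,compressratio bench

though it only applies to data written after the property is set, so you'd
need to re-copy the data to see the effect.)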

But I would say the extra ~6G is probably the sum of last-record padding
across the whole dataset, if you don't have compression on.
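
If you want to sanity-check that against your data, here's a rough
back-of-the-envelope estimate of the tail padding at recordsize=1M
(FreeBSD stat syntax; only files larger than one record pay a
full-record tail without compression):

  # find /bench -type f -exec stat -f %z {} + | awk '
      { r = 1048576; if ($1 > r && $1 % r) pad += r - ($1 % r) }
      END { printf "tail padding: %.1f GiB\n", pad / 2^30 }'

Files smaller than one record get a single block sized to the file (rounded
up to the ashift), so they waste far less - the cost is mostly the last
record of every file bigger than 1M.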

- Rich

On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire <florent@rivoire.fr> wrote:

> TLDR: I rsync-ed the same data twice: once with 128K recordsize and
> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
> Why not smaller?
>
>
> Hello,
>
> I would like some help to understand how the disk usage evolves when I
> change the recordsize.
>
> I've read several articles/presentations/forums about recordsize in
> ZFS, and if I try to summarize, I mainly understood that:
> - recordsize is the "maximum" size of "objects" (so "logical blocks")
> that zfs will create for both data & metadata; each object is then
> compressed and allocated to one vdev, split into smaller (ashift
> size) "physical" blocks and written to disk
> - increasing recordsize is usually good when storing large files that
> are not modified, because it limits the number of metadata objects
> (block pointers), which has a positive effect on performance
> - decreasing recordsize is useful for "database-like" workloads (i.e.
> small random writes inside existing objects), because it avoids write
> amplification (read-modify-write of a large object for a small update)
>
> Today, I'm trying to observe the effect of increasing recordsize for
> *my* data (because I'm also considering defining special_small_blocks
> & using SSDs as "special", but that's not tested nor discussed here,
> just recordsize).
> So, I'm doing some benchmarks on my "documents" dataset (details in
> "notes" below), but the results are really strange to me.
>
> When I rsync the same data to a freshly-recreated zpool:
> A) with recordsize=128K : 226G allocated on disk
> B) with recordsize=1M : 232G allocated on disk => bigger than 128K ?!?
>
> I would clearly expect the other way around, because a bigger recordsize
> generates less metadata and thus smaller disk usage, and there shouldn't be
> any overhead because 1M is just a maximum, not a forced size allocated
> for every object.
> I don't mind the increased usage (I can live with a few GB more), but
> I would like to understand why it happens.
>
> I tried to give all the details of my tests below.
> Did I do something wrong? Can you explain the increase?
>
> Thanks!
>
>
>
> ===============================================
> A) 128K
> ==========
>
> # zpool destroy bench
> # zpool create -o ashift=12 bench
> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>
> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
> [...]
> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
> total size is 240,982,439,038  speedup is 1.00
>
> # zfs get recordsize bench
> NAME   PROPERTY    VALUE    SOURCE
> bench  recordsize  128K     default
>
> # zpool list -v bench
> NAME                                           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> bench                                         2.72T   226G  2.50T        -         -     0%     8%  1.00x    ONLINE  -
>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   226G  2.50T        -         -     0%  8.10%      -    ONLINE
>
> # zfs list bench
> NAME    USED  AVAIL     REFER  MOUNTPOINT
> bench   226G  2.41T      226G  /bench
>
> # zfs get all bench |egrep "(used|referenced|written)"
> bench  used                  226G                   -
> bench  referenced            226G                   -
> bench  usedbysnapshots       0B                     -
> bench  usedbydataset         226G                   -
> bench  usedbychildren        1.80M                  -
> bench  usedbyrefreservation  0B                     -
> bench  written               226G                   -
> bench  logicalused           226G                   -
> bench  logicalreferenced     226G                   -
>
> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
>
>
>
> ===============================================
> B) 1M
> ==========
>
> # zpool destroy bench
> # zpool create -o ashift=12 bench
> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
> # zfs set recordsize=1M bench
>
> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
> [...]
> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
> total size is 240,982,439,038  speedup is 1.00
>
> # zfs get recordsize bench
> NAME   PROPERTY    VALUE    SOURCE
> bench  recordsize  1M       local
>
> # zpool list -v bench
> NAME                                           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> bench                                         2.72T   232G  2.49T        -         -     0%     8%  1.00x    ONLINE  -
>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   232G  2.49T        -         -     0%  8.32%      -    ONLINE
>
> # zfs list bench
> NAME    USED  AVAIL     REFER  MOUNTPOINT
> bench   232G  2.41T      232G  /bench
>
> # zfs get all bench |egrep "(used|referenced|written)"
> bench  used                  232G                   -
> bench  referenced            232G                   -
> bench  usedbysnapshots       0B                     -
> bench  usedbydataset         232G                   -
> bench  usedbychildren        1.96M                  -
> bench  usedbyrefreservation  0B                     -
> bench  written               232G                   -
> bench  logicalused           232G                   -
> bench  logicalreferenced     232G                   -
>
> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
>
>
>
> ===============================================
> Notes:
> ==========
>
> - the source dataset contains ~50% pictures (raw files and jpg),
> and also some music, various archived documents, zip files, videos
> - no change was made to the source dataset while testing (cf. the size
> logged by rsync)
> - I repeated the tests twice (128K, then 1M, then 128K, then 1M), and
> got the same results
> - probably not important here, but:
> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
> on another zpool that I never tweaked except ashift=12 (because it uses
> the same model of Red 3TB)
>
> # zfs --version
> zfs-2.0.6-1
> zfs-kmod-v2021120100-zfs_a8c7652
>
> # uname -a
> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
> 75566f060d4(HEAD) TRUENAS  amd64
>