Re: [zfs] recordsize: unexpected increase of disk usage when increasing it

From: Rich <rincebrain_at_gmail.com>
Date: Tue, 18 Jan 2022 14:29:54 UTC
Really? I didn't know it would still trim the tails on files with
compression off.

...

        size    1179648
        parent  34
        links   1
        pflags  40800000004
Indirect blocks:
               0 L1  DVA[0]=<3:c02b96c000:1000> DVA[1]=<3:c810733000:1000>
[L1 ZFS plain file] skein lz4 unencrypted LE contiguous unique double
size=20000L/1000P birth=35675472L/35675472P fill=2
cksum=5cfba24b351a09aa:8bd9dfef87c5b625:906ed5c3252943db:bed77ce51ad540d4
               0  L0 DVA[0]=<2:a0827db4000:100000> [L0 ZFS plain file]
skein uncompressed unencrypted LE contiguous unique single
size=100000L/100000P birth=35675472L/35675472P fill=1
cksum=95b06edf60e5f54c:af6f6950775d0863:8fc28b0783fcd9d3:2e44676e48a59360
          100000  L0 DVA[0]=<2:a0827eb4000:100000> [L0 ZFS plain file]
skein uncompressed unencrypted LE contiguous unique single
size=100000L/100000P birth=35675472L/35675472P fill=1
cksum=62a1f05769528648:8197c8a05ca9f1fb:a750c690124dd2e0:390bddc4314cd4c3

It seems not? The 1,179,648-byte file above still occupies two full 1 MiB
uncompressed records (size=100000L/100000P each), so the tail didn't get
trimmed here.
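
(For anyone who wants to reproduce this check, a rough sketch; the
pool/dataset and file names are placeholders:

    # on a compression=off, recordsize=1M dataset, write a file slightly
    # larger than one record, then dump its block tree by object number
    dd if=/dev/urandom of=/tank/test/tailtest bs=1m count=1
    dd if=/dev/urandom bs=128k count=1 >> /tank/test/tailtest
    zdb -ddddd tank/test $(stat -f %i /tank/test/tailtest)

If the tail's L0 block still shows a full 1 MiB physical size, as above,
nothing was trimmed.)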

- Rich


On Tue, Jan 18, 2022 at 9:23 AM Alan Somers <asomers@freebsd.org> wrote:

> On Tue, Jan 18, 2022 at 7:13 AM Rich <rincebrain@gmail.com> wrote:
> >
> > Compression would have made your life better here, and possibly also
> > made it clearer what's going on.
> >
> > All records in a file are going to be the same size pre-compression - so
> > if you set the recordsize to 1M and save a 131.1M file, it's going to take
> > up 132M on disk before compression/raidz overhead/whatnot.
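> >
> > (The arithmetic, under the assumption that the tail is not trimmed and
> > nothing compresses: 131.1 MiB = 131 full 1 MiB records + a ~0.1 MiB
> > tail that still occupies its own full 1 MiB record, i.e. 132 records
> > of 1 MiB = 132 MiB of logical allocation.)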
>
> Not true.  ZFS will trim the file's tails even without compression enabled.
>
> >
> > Usually compression saves you from the tail padding actually requiring
> > allocation on disk, which is one reason I encourage everyone to at least
> > use lz4 (or, if you absolutely cannot for some reason, I guess zle should
> > also work for this one case...)
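> >
> > (e.g., something along the lines of
> >
> >     zfs set compression=lz4 <pool>/<dataset>
> >
> > before writing the data; changing the property doesn't rewrite
> > existing records, only newly written ones get compressed.)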
> >
> > But I would say it's probably the sum of last-record padding across the
> > whole dataset, if you don't have compression on.
> >
> > - Rich
> >
> > On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire <florent@rivoire.fr>
> > wrote:
> >>
> >> TLDR: I rsync-ed the same data twice: once with 128K recordsize and
> >> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
> >> Why not smaller ?
> >>
> >>
> >> Hello,
> >>
> >> I would like some help to understand how the disk usage evolves when I
> >> change the recordsize.
> >>
> >> I've read several articles/presentations/forums about recordsize in
> >> ZFS, and if I try to summarize, I mainly understood that:
> >> - recordsize is the "maximum" size of the "objects" (i.e. "logical
> >> blocks") that zfs will create for both data & metadata; each object
> >> is then compressed, allocated to one vdev, split into smaller
> >> (ashift-size) "physical" blocks and written to disk
> >> - increasing recordsize is usually good when storing large files that
> >> are not modified, because it limits the number of metadata objects
> >> (block-pointers), which has a positive effect on performance
> >> - decreasing recordsize is useful for database-like workloads (i.e.
> >> small random writes inside existing objects), because it avoids write
> >> amplification (read-modify-write of a large object for a small update)
> >>
> >> Today, I'm trying to observe the effect of increasing recordsize for
> >> *my* data (because I'm also considering defining special_small_blocks
> >> & using SSDs as "special", but that's neither tested nor discussed
> >> here; this is just about recordsize).
> >> So, I'm doing some benchmarks on my "documents" dataset (details in
> >> "notes" below), but the results are really strange to me.
> >>
> >> When I rsync the same data to a freshly-recreated zpool:
> >> A) with recordsize=128K : 226G allocated on disk
> >> B) with recordsize=1M : 232G allocated on disk => bigger than 128K ?!?
> >>
> >> I would clearly expect the other way around, because a bigger
> >> recordsize generates less metadata and therefore less disk usage, and
> >> there shouldn't be any overhead, because 1M is just a maximum and not
> >> a size forced onto every object.
>
> A common misconception.  The 1M recordsize applies to every newly
> created object, and every object must use the same size for all of its
> records (except possibly the last one).  But objects created before
> you changed the recsize will retain their old recsize, and file tails
> have a flexible recsize.
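>
> (One way to check what record size a given file actually ended up with,
> sketched here with a placeholder path, is zdb's per-object dump:
>
>     zdb -dd bench $(stat -f %i /bench/path/to/some-file)
>
> the "dblk" column in the output is that object's data block size.)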
>
> >> I don't mind the increased usage (I can live with a few GB more), but
> >> I would like to understand why it happens.
>
> You might be seeing the effects of sparsity.  ZFS is smart enough not
> to store file holes (and if any kind of compression is enabled, it
> will find long runs of zeroes and turn them into holes).  If your data
> contains any holes that are >= 128 kB but < 1MB, then they can be
> stored as holes with a 128 kB recsize but must be stored as long runs
> of zeros with a 1MB recsize.
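>
> (A quick way to see holes in action, with a placeholder file name:
>
>     truncate -s 1M /bench/holefile   # 1 MiB logical size, all hole
>     ls -l /bench/holefile            # reports the full 1 MiB
>     du -h /bench/holefile            # allocates (nearly) nothing on disk
>
> ZFS never writes any data blocks for the hole.)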
>
> However, I would suggest that you don't bother.  With a 128kB recsize,
> ZFS has something like a 1000:1 ratio of data:metadata.  In other
> words, increasing your recsize can save you at most 0.1% of disk
> space.  Basically, it doesn't matter.  What it _does_ matter for is
> the tradeoff between write amplification and RAM usage.  1000:1 is
> comparable to the disk:ram of many computers.  And performance is more
> sensitive to metadata access times than data access times.  So
> increasing your recsize can help you keep a greater fraction of your
> metadata in ARC.  OTOH, as you remarked, increasing your recsize will
> also increase write amplification.
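>
> (Rough numbers behind that ratio, assuming the usual 128-byte block
> pointer per record:
>
>     128 KiB record / 128 B pointer = 1024:1  -> metadata ~0.1% of data
>     1 MiB record   / 128 B pointer = 8192:1  -> metadata ~0.01% of data
>
> so going from 128K to 1M can reclaim at most on the order of 0.1% of
> the data size.)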
>
> So to summarize:
> * Adjust compression settings to save disk space.
> * Adjust recsize to save RAM.
>
> -Alan
>
> >>
> >> I tried to give all the details of my tests below.
> >> Did I do something wrong ? Can you explain the increase ?
> >>
> >> Thanks !
> >>
> >>
> >>
> >> ===============================================
> >> A) 128K
> >> ==========
> >>
> >> # zpool destroy bench
> >> # zpool create -o ashift=12 bench
> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
> >>
> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
> >> [...]
> >> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45
> >> bytes/sec
> >> total size is 240,982,439,038  speedup is 1.00
> >>
> >> # zfs get recordsize bench
> >> NAME   PROPERTY    VALUE    SOURCE
> >> bench  recordsize  128K     default
> >>
> >> # zpool list -v bench
> >> NAME                                           SIZE  ALLOC   FREE
> >> CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> >> bench                                         2.72T   226G  2.50T
> >>   -         -     0%     8%  1.00x    ONLINE  -
> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   226G  2.50T
> >>   -         -     0%  8.10%      -    ONLINE
> >>
> >> # zfs list bench
> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
> >> bench   226G  2.41T      226G  /bench
> >>
> >> # zfs get all bench |egrep "(used|referenced|written)"
> >> bench  used                  226G                   -
> >> bench  referenced            226G                   -
> >> bench  usedbysnapshots       0B                     -
> >> bench  usedbydataset         226G                   -
> >> bench  usedbychildren        1.80M                  -
> >> bench  usedbyrefreservation  0B                     -
> >> bench  written               226G                   -
> >> bench  logicalused           226G                   -
> >> bench  logicalreferenced     226G                   -
> >>
> >> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
> >>
> >>
> >>
> >> ===============================================
> >> B) 1M
> >> ==========
> >>
> >> # zpool destroy bench
> >> # zpool create -o ashift=12 bench
> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
> >> # zfs set recordsize=1M bench
> >>
> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
> >> [...]
> >> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88
> >> bytes/sec
> >> total size is 240,982,439,038  speedup is 1.00
> >>
> >> # zfs get recordsize bench
> >> NAME   PROPERTY    VALUE    SOURCE
> >> bench  recordsize  1M       local
> >>
> >> # zpool list -v bench
> >> NAME                                           SIZE  ALLOC   FREE
> >> CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> >> bench                                         2.72T   232G  2.49T
> >>   -         -     0%     8%  1.00x    ONLINE  -
> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   232G  2.49T
> >>   -         -     0%  8.32%      -    ONLINE
> >>
> >> # zfs list bench
> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
> >> bench   232G  2.41T      232G  /bench
> >>
> >> # zfs get all bench |egrep "(used|referenced|written)"
> >> bench  used                  232G                   -
> >> bench  referenced            232G                   -
> >> bench  usedbysnapshots       0B                     -
> >> bench  usedbydataset         232G                   -
> >> bench  usedbychildren        1.96M                  -
> >> bench  usedbyrefreservation  0B                     -
> >> bench  written               232G                   -
> >> bench  logicalused           232G                   -
> >> bench  logicalreferenced     232G                   -
> >>
> >> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
> >>
> >>
> >>
> >> ===============================================
> >> Notes:
> >> ==========
> >>
> >> - the source dataset contains ~50% pictures (raw files and jpg), and
> >> also some music, various archived documents, zips, and videos
> >> - no change on the source dataset while testing (cf. the size logged
> >> by rsync)
> >> - I repeated the tests twice (128K, then 1M, then 128K, then 1M), and
> >> got the same results each time
> >> - probably not important here, but:
> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
> >> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
> >> on another zpool that I never tweaked, except for ashift=12 (because
> >> it uses the same model of Red 3TB)
> >>
> >> # zfs --version
> >> zfs-2.0.6-1
> >> zfs-kmod-v2021120100-zfs_a8c7652
> >>
> >> # uname -a
> >> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
> >> 75566f060d4(HEAD) TRUENAS  amd64
>