Re: [zfs] recordsize: unexpected increase of disk usage when increasing it

From: Rich <rincebrain_at_gmail.com>
Date: Tue, 18 Jan 2022 15:33:03 UTC
Yeah, that's consistent with my understanding of the behavior: the first
record gets packed, but as soon as the file exceeds recordsize, all of its
records are (logically, at least) recordsize, and then compression saves
you, or doesn't.
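
A minimal sketch of that rule, assuming compression is off and a 4 KiB
sector size (ashift=12), and ignoring metadata, raidz and any other
overhead. The function name is made up for illustration; this is a rough
model of the behavior described in this thread, not the actual ZFS
allocation code.

```python
def logical_alloc(size, recordsize, sector=4096):
    """Rough model of data-block space for one file with compression off.

    Assumption: a file that fits in one record gets a single block sized
    to the file (rounded up to the sector size); a larger file is stored
    as full recordsize blocks, including the last one.
    """
    if size == 0:
        return 0
    if size <= recordsize:
        return -(-size // sector) * sector      # ceil to sector size
    return -(-size // recordsize) * recordsize  # ceil to recordsize


MiB = 1024 * 1024

# The 1179648-byte test file from this thread:
print(logical_alloc(1179648, 1 * MiB) / MiB)     # 2.0
print(logical_alloc(1179648, 128 * 1024) / MiB)  # 1.125
```

The first figure matches the 2.0M that du reports on the recordsize=1M
dataset. The second is what the same file would take at the default 128K
recordsize, which happens to match the 1.1M figure quoted below, though
the thread does not say what recordsize that dataset used.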

- Rich

On Tue, Jan 18, 2022 at 10:29 AM alan somers <asomers@gmail.com> wrote:

> I think the difference is in whether the file is < 1 record or >= 1
> record.  It looks like the first record is variably-sized but after
> that it's like you say, with compression off it rounds up.
>
> On Tue, Jan 18, 2022 at 8:07 AM Rich <rincebrain@gmail.com> wrote:
> >
> > Nope. I just retried it on my FBSD 13-RELEASE VM, too:
> > # uname -a
> > FreeBSD fbsd13rel 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
> > # zpool version
> > zfs-2.1.99-683_ga967e54c2
> > zfs-kmod-2.1.99-683_ga967e54c2
> > # zpool get all | grep 'feature@' | grep disabled
> > buildpool  feature@edonr                  disabled               local
> > # dd if=/dev/urandom of=/buildpool/testme/2 bs=1179648 count=1
> > 1+0 records in
> > 1+0 records out
> > 1179648 bytes transferred in 0.009827 secs (120041885 bytes/sec)
> > # du -sh /buildpool/testme/2
> > 2.0M    /buildpool/testme/2
> > # zfs get all buildpool/testme | grep -v default
> > NAME              PROPERTY              VALUE                  SOURCE
> > buildpool/testme  type                  filesystem             -
> > buildpool/testme  creation              Tue Jan 18  4:46 2022  -
> > buildpool/testme  used                  4.03M                  -
> > buildpool/testme  available             277G                   -
> > buildpool/testme  referenced            4.03M                  -
> > buildpool/testme  compressratio         1.00x                  -
> > buildpool/testme  mounted               yes                    -
> > buildpool/testme  recordsize            1M                     local
> > buildpool/testme  compression           off                    local
> > buildpool/testme  atime                 off                    inherited from buildpool
> > buildpool/testme  createtxg             15030                  -
> > buildpool/testme  version               5                      -
> > buildpool/testme  utf8only              off                    -
> > buildpool/testme  normalization         none                   -
> > buildpool/testme  casesensitivity       sensitive              -
> > buildpool/testme  guid                  11057815587819738755   -
> > buildpool/testme  usedbysnapshots       0B                     -
> > buildpool/testme  usedbydataset         4.03M                  -
> > buildpool/testme  usedbychildren        0B                     -
> > buildpool/testme  usedbyrefreservation  0B                     -
> > buildpool/testme  objsetid              280                    -
> > buildpool/testme  refcompressratio      1.00x                  -
> > buildpool/testme  written               4.03M                  -
> > buildpool/testme  logicalused           4.01M                  -
> > buildpool/testme  logicalreferenced     4.01M                  -
> >
> > What version are you running?
> >
> > - Rich
> >
> > On Tue, Jan 18, 2022 at 10:00 AM Alan Somers <asomers@freebsd.org> wrote:
> >>
> >> That's not what I get.  Is your pool formatted using a very old
> >> version or something?
> >>
> >> somers@fbsd-head /u/h/somers [1]>
> >> dd if=/dev/random bs=1179648 of=/testpool/food/t/richfile count=1
> >> 1+0 records in
> >> 1+0 records out
> >> 1179648 bytes transferred in 0.003782 secs (311906705 bytes/sec)
> >> somers@fbsd-head /u/h/somers> du -sh  /testpool/food/t/richfile
> >> 1.1M    /testpool/food/t/richfile
> >>
> >> On Tue, Jan 18, 2022 at 7:51 AM Rich <rincebrain@gmail.com> wrote:
> >> >
> >> > 2.1M    /workspace/test1M/1
> >> >
> >> > - Rich
> >> >
> >> > On Tue, Jan 18, 2022 at 9:47 AM Alan Somers <asomers@freebsd.org> wrote:
> >> >>
> >> >> Yeah, it does.  Just check "du -sh <FILENAME>".  zdb there is showing
> >> >> you the logical size of the record, but it isn't showing how many
> >> >> disk blocks are actually allocated.
> >> >>
> >> >> On Tue, Jan 18, 2022 at 7:30 AM Rich <rincebrain@gmail.com> wrote:
> >> >> >
> >> >> > Really? I didn't know it would still trim the tails on files with compression off.
> >> >> >
> >> >> > ...
> >> >> >
> >> >> >         size    1179648
> >> >> >         parent  34
> >> >> >         links   1
> >> >> >         pflags  40800000004
> >> >> > Indirect blocks:
> >> >> >                0 L1  DVA[0]=<3:c02b96c000:1000> DVA[1]=<3:c810733000:1000> [L1 ZFS plain file] skein lz4 unencrypted LE contiguous unique double size=20000L/1000P birth=35675472L/35675472P fill=2 cksum=5cfba24b351a09aa:8bd9dfef87c5b625:906ed5c3252943db:bed77ce51ad540d4
> >> >> >                0  L0 DVA[0]=<2:a0827db4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=95b06edf60e5f54c:af6f6950775d0863:8fc28b0783fcd9d3:2e44676e48a59360
> >> >> >           100000  L0 DVA[0]=<2:a0827eb4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=62a1f05769528648:8197c8a05ca9f1fb:a750c690124dd2e0:390bddc4314cd4c3
> >> >> >
> >> >> > It seems not?
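
For reading the zdb output above: the size= fields are hexadecimal, so
100000 is 1 MiB. The sketch below pulls the logical/physical sizes out of
block-pointer lines and totals the L0 (data) blocks. It assumes only the
"size=<hex>L/<hex>P" field layout visible in that output; the function
names and the shortened sample text are illustrative, not part of zdb.

```python
import re

# One match per block pointer: indirection level, logical size, physical size.
BP_RE = re.compile(r"\bL(\d)\s+DVA.*?size=([0-9a-f]+)L/([0-9a-f]+)P", re.DOTALL)


def block_sizes(zdb_text):
    """Return a list of (level, logical_bytes, physical_bytes) tuples."""
    return [(int(lvl), int(lsz, 16), int(psz, 16))
            for lvl, lsz, psz in BP_RE.findall(zdb_text)]


def data_bytes(zdb_text):
    """Sum the physical size of the level-0 (data) blocks only."""
    return sum(psz for lvl, _, psz in block_sizes(zdb_text) if lvl == 0)


# The three block pointers from this thread, with checksum fields dropped.
SAMPLE = """
0 L1  DVA[0]=<3:c02b96c000:1000> DVA[1]=<3:c810733000:1000> [L1 ZFS plain file] size=20000L/1000P
0  L0 DVA[0]=<2:a0827db4000:100000> [L0 ZFS plain file] size=100000L/100000P
100000  L0 DVA[0]=<2:a0827eb4000:100000> [L0 ZFS plain file] size=100000L/100000P
"""

print(data_bytes(SAMPLE))  # 2097152, i.e. 2 MiB for a 1179648-byte file
```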
> >> >> >
> >> >> > - Rich
> >> >> >
> >> >> >
> >> >> > On Tue, Jan 18, 2022 at 9:23 AM Alan Somers <asomers@freebsd.org> wrote:
> >> >> >>
> >> >> >> On Tue, Jan 18, 2022 at 7:13 AM Rich <rincebrain@gmail.com> wrote:
> >> >> >> >
> >> >> >> > Compression would have made your life better here, and possibly also made it clearer what's going on.
> >> >> >> >
> >> >> >> > All records in a file are going to be the same size pre-compression - so if you set the recordsize to 1M and save a 131.1M file, it's going to take up 132M on disk before compression/raidz overhead/whatnot.
> >> >> >>
> >> >> >> Not true.  ZFS will trim the file's tails even without compression enabled.
> >> >> >>
> >> >> >> >
> >> >> >> > Usually compression saves you from the tail padding actually requiring allocation on disk, which is one reason I encourage everyone to at least use lz4 (or, if you absolutely cannot for some reason, I guess zle should also work for this one case...)
> >> >> >> >
> >> >> >> > But I would say it's probably the sum of last record padding across the whole dataset, if you don't have compression on.
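
As a rough illustration of "the sum of last record padding", the sketch
below adds up the unused tail of the last record for every file. It
assumes compression is off and that only files larger than one record are
padded (the behavior the rest of this thread converges on). The function
names and the list of file sizes are hypothetical.

```python
def tail_padding(size, recordsize):
    """Padding bytes in the last record of one file (compression off).

    Assumption: a file that fits in a single record is not padded.
    """
    if size <= recordsize:
        return 0
    return -size % recordsize  # distance up to the next recordsize multiple


def dataset_padding(sizes, recordsize):
    """Total last-record padding over an iterable of file sizes."""
    return sum(tail_padding(s, recordsize) for s in sizes)


MiB = 1024 * 1024
sizes = [1179648, 50_000, 5 * MiB + 1]  # hypothetical file sizes in bytes

print(dataset_padding(sizes, 1 * MiB))     # 1966079
print(dataset_padding(sizes, 128 * 1024))  # 131071
```

Feeding it the real file sizes of a dataset (for example collected with
os.walk and os.path.getsize) would give an estimate to compare against
the observed difference between the two recordsize settings.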
> >> >> >> >
> >> >> >> > - Rich
> >> >> >> >
> >> >> >> > On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire <florent@rivoire.fr> wrote:
> >> >> >> >>
> >> >> >> >> TLDR: I rsync-ed the same data twice: once with 128K recordsize and
> >> >> >> >> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
> >> >> >> >> Why not smaller?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Hello,
> >> >> >> >>
> >> >> >> >> I would like some help to understand how the disk usage evolves when I
> >> >> >> >> change the recordsize.
> >> >> >> >>
> >> >> >> >> I've read several articles/presentations/forums about recordsize in
> >> >> >> >> ZFS, and if I try to summarize, I mainly understood that:
> >> >> >> >> - recordsize is the "maximum" size of "objects" (so "logical blocks")
> >> >> >> >> that zfs will create for both data & metadata, then each object is
> >> >> >> >> compressed and allocated to one vdev, split into smaller (ashift
> >> >> >> >> size) "physical" blocks and written on disks
> >> >> >> >> - increasing recordsize is usually good when storing large files that
> >> >> >> >> are not modified, because it limits the number of metadata objects
> >> >> >> >> (block-pointers), which has a positive effect on performance
> >> >> >> >> - decreasing recordsize is useful for "database-like" workloads (i.e.
> >> >> >> >> small random writes inside existing objects), because it avoids write
> >> >> >> >> amplification (read-modify-write a large object for a small update)
> >> >> >> >>
> >> >> >> >> Today, I'm trying to observe the effect of increasing recordsize for
> >> >> >> >> *my* data (because I'm also considering defining special_small_blocks
> >> >> >> >> & using SSDs as "special", but not tested nor discussed here, just
> >> >> >> >> recordsize).
> >> >> >> >> So, I'm doing some benchmarks on my "documents" dataset (details in
> >> >> >> >> "notes" below), but the results are really strange to me.
> >> >> >> >>
> >> >> >> >> When I rsync the same data to a freshly-recreated zpool:
> >> >> >> >> A) with recordsize=128K : 226G allocated on disk
> >> >> >> >> B) with recordsize=1M : 232G allocated on disk => bigger than 128K ?!?
> >> >> >> >>
> >> >> >> >> I would clearly expect the other way around, because bigger recordsize
> >> >> >> >> generates less metadata so smaller disk usage, and there shouldn't be
> >> >> >> >> any overhead because 1M is just a maximum and not a forced size to
> >> >> >> >> allocate for every object.
> >> >> >>
> >> >> >> A common misconception.  The 1M recordsize applies to every newly
> >> >> >> created object, and every object must use the same size for all of its
> >> >> >> records (except possibly the last one).  But objects created before
> >> >> >> you changed the recsize will retain their old recsize, and file tails
> >> >> >> have a flexible recsize.
> >> >> >>
> >> >> >> >> I don't mind the increased usage (I can live with a few GB more), but
> >> >> >> >> I would like to understand why it happens.
> >> >> >>
> >> >> >> You might be seeing the effects of sparsity.  ZFS is smart enough not
> >> >> >> to store file holes (and if any kind of compression is enabled, it
> >> >> >> will find long runs of zeroes and turn them into holes).  If your data
> >> >> >> contains any holes that are >= 128 kB but < 1 MB, then they can be
> >> >> >> stored as holes with a 128 kB recsize but must be stored as long runs
> >> >> >> of zeroes with a 1 MB recsize.
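
The hole argument can be sketched as a small calculation: count how many
whole records fall entirely inside a hole, for a given recordsize. This
assumes that only a record completely covered by the hole can be left
unallocated, and that a partly covered record is stored in full, zeroes
included, when compression is off. The function name and the example
offsets are hypothetical.

```python
def hole_records(hole_start, hole_len, recordsize):
    """Number of whole records lying entirely inside a hole.

    hole_start and hole_len are byte offsets/lengths within the file.
    """
    # First record boundary at or after the start of the hole.
    first = -(-hole_start // recordsize)
    # Last record boundary at or before the end of the hole.
    last = (hole_start + hole_len) // recordsize
    return max(0, last - first)


KiB = 1024
# A 512 KiB hole starting 256 KiB into a file:
print(hole_records(256 * KiB, 512 * KiB, 128 * KiB))   # 4 records can be skipped
print(hole_records(256 * KiB, 512 * KiB, 1024 * KiB))  # 0, the zeroes get stored
```

Note that alignment matters as well as length: a hole only saves space
for the records it fully covers.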
> >> >> >>
> >> >> >> However, I would suggest that you don't bother.  With a 128kB recsize,
> >> >> >> ZFS has something like a 1000:1 ratio of data:metadata.  In other
> >> >> >> words, increasing your recsize can save you at most 0.1% of disk
> >> >> >> space.  Basically, it doesn't matter.  What it _does_ matter for is
> >> >> >> the tradeoff between write amplification and RAM usage.  1000:1 is
> >> >> >> comparable to the disk:ram of many computers.  And performance is more
> >> >> >> sensitive to metadata access times than data access times.  So
> >> >> >> increasing your recsize can help you keep a greater fraction of your
> >> >> >> metadata in ARC.  OTOH, as you remarked increasing your recsize will
> >> >> >> also increase write amplification.
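
The "something like 1000:1" figure can be reproduced with a
back-of-the-envelope calculation, assuming one 128-byte block pointer
(the size of ZFS's blkptr_t) per data record and ignoring everything
else: dnodes, higher indirect levels, and the compression of indirect
blocks. The function names are made up for illustration.

```python
BLKPTR_BYTES = 128  # size of one ZFS block pointer (blkptr_t)


def data_to_metadata_ratio(recordsize):
    """Bytes of data described by each byte of level-0 block pointer."""
    return recordsize // BLKPTR_BYTES


def max_metadata_saving(old_rs, new_rs):
    """Upper bound on space saved, as a fraction of the data size,
    when moving from old_rs to new_rs (block pointers only)."""
    return BLKPTR_BYTES / old_rs - BLKPTR_BYTES / new_rs


print(data_to_metadata_ratio(128 * 1024))   # 1024
print(data_to_metadata_ratio(1024 * 1024))  # 8192
print(f"{max_metadata_saving(128 * 1024, 1024 * 1024):.3%}")  # 0.085%
```

Under these assumptions, going from 128K to 1M saves a little under 0.1%
of the data size, in line with the estimate above.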
> >> >> >>
> >> >> >> So to summarize:
> >> >> >> * Adjust compression settings to save disk space.
> >> >> >> * Adjust recsize to save RAM.
> >> >> >>
> >> >> >> -Alan
> >> >> >>
> >> >> >> >>
> >> >> >> >> I tried to give all the details of my tests below.
> >> >> >> >> Did I do something wrong? Can you explain the increase?
> >> >> >> >>
> >> >> >> >> Thanks!
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> ===============================================
> >> >> >> >> A) 128K
> >> >> >> >> ==========
> >> >> >> >>
> >> >> >> >> # zpool destroy bench
> >> >> >> >> # zpool create -o ashift=12 bench
> >> >> >> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
> >> >> >> >>
> >> >> >> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
> >> >> >> >> [...]
> >> >> >> >> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
> >> >> >> >> total size is 240,982,439,038  speedup is 1.00
> >> >> >> >>
> >> >> >> >> # zfs get recordsize bench
> >> >> >> >> NAME   PROPERTY    VALUE    SOURCE
> >> >> >> >> bench  recordsize  128K     default
> >> >> >> >>
> >> >> >> >> # zpool list -v bench
> >> >> >> >> NAME                                           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> >> >> >> >> bench                                         2.72T   226G  2.50T        -         -     0%     8%  1.00x    ONLINE  -
> >> >> >> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   226G  2.50T        -         -     0%  8.10%      -    ONLINE
> >> >> >> >>
> >> >> >> >> # zfs list bench
> >> >> >> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
> >> >> >> >> bench   226G  2.41T      226G  /bench
> >> >> >> >>
> >> >> >> >> # zfs get all bench |egrep "(used|referenced|written)"
> >> >> >> >> bench  used                  226G                   -
> >> >> >> >> bench  referenced            226G                   -
> >> >> >> >> bench  usedbysnapshots       0B                     -
> >> >> >> >> bench  usedbydataset         226G                   -
> >> >> >> >> bench  usedbychildren        1.80M                  -
> >> >> >> >> bench  usedbyrefreservation  0B                     -
> >> >> >> >> bench  written               226G                   -
> >> >> >> >> bench  logicalused           226G                   -
> >> >> >> >> bench  logicalreferenced     226G                   -
> >> >> >> >>
> >> >> >> >> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> ===============================================
> >> >> >> >> B) 1M
> >> >> >> >> ==========
> >> >> >> >>
> >> >> >> >> # zpool destroy bench
> >> >> >> >> # zpool create -o ashift=12 bench
> >> >> >> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
> >> >> >> >> # zfs set recordsize=1M bench
> >> >> >> >>
> >> >> >> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
> >> >> >> >> [...]
> >> >> >> >> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
> >> >> >> >> total size is 240,982,439,038  speedup is 1.00
> >> >> >> >>
> >> >> >> >> # zfs get recordsize bench
> >> >> >> >> NAME   PROPERTY    VALUE    SOURCE
> >> >> >> >> bench  recordsize  1M       local
> >> >> >> >>
> >> >> >> >> # zpool list -v bench
> >> >> >> >> NAME                                           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> >> >> >> >> bench                                         2.72T   232G  2.49T        -         -     0%     8%  1.00x    ONLINE  -
> >> >> >> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   232G  2.49T        -         -     0%  8.32%      -    ONLINE
> >> >> >> >>
> >> >> >> >> # zfs list bench
> >> >> >> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
> >> >> >> >> bench   232G  2.41T      232G  /bench
> >> >> >> >>
> >> >> >> >> # zfs get all bench |egrep "(used|referenced|written)"
> >> >> >> >> bench  used                  232G                   -
> >> >> >> >> bench  referenced            232G                   -
> >> >> >> >> bench  usedbysnapshots       0B                     -
> >> >> >> >> bench  usedbydataset         232G                   -
> >> >> >> >> bench  usedbychildren        1.96M                  -
> >> >> >> >> bench  usedbyrefreservation  0B                     -
> >> >> >> >> bench  written               232G                   -
> >> >> >> >> bench  logicalused           232G                   -
> >> >> >> >> bench  logicalreferenced     232G                   -
> >> >> >> >>
> >> >> >> >> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> ===============================================
> >> >> >> >> Notes:
> >> >> >> >> ==========
> >> >> >> >>
> >> >> >> >> - the source dataset is ~50% pictures (raw files and jpg),
> >> >> >> >> and also some music, various archived documents, zip, videos
> >> >> >> >> - no change on the source dataset while testing (cf size logged by rsync)
> >> >> >> >> - I repeated the tests twice (128K, then 1M, then 128K, then 1M), with
> >> >> >> >> the same results
> >> >> >> >> - probably not important here, but:
> >> >> >> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
> >> >> >> >> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
> >> >> >> >> on another zpool that I never tweaked except ashift=12 (because using
> >> >> >> >> the same model of Red 3TB)
> >> >> >> >>
> >> >> >> >> # zfs --version
> >> >> >> >> zfs-2.0.6-1
> >> >> >> >> zfs-kmod-v2021120100-zfs_a8c7652
> >> >> >> >>
> >> >> >> >> # uname -a
> >> >> >> >> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
> >> >> >> >> 75566f060d4(HEAD) TRUENAS  amd64
>