From: Rich <rincebrain@gmail.com>
Date: Tue, 18 Jan 2022 09:29:54 -0500
Subject: Re: [zfs] recordsize: unexpected increase of disk usage when increasing it
To: Alan Somers <asomers@freebsd.org>
Cc: Florent Rivoire <florent@rivoire.fr>, freebsd-fs
List-Archive: https://lists.freebsd.org/archives/freebsd-fs

Really? I didn't know it would still trim the tails on files with
compression off.

...
        size    1179648
        parent  34
        links   1
        pflags  40800000004
Indirect blocks:
               0 L1  DVA[0]=<3:c02b96c000:1000> DVA[1]=<3:c810733000:1000> [L1 ZFS plain file] skein lz4 unencrypted LE contiguous unique double size=20000L/1000P birth=35675472L/35675472P fill=2 cksum=5cfba24b351a09aa:8bd9dfef87c5b625:906ed5c3252943db:bed77ce51ad540d4
               0  L0 DVA[0]=<2:a0827db4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=95b06edf60e5f54c:af6f6950775d0863:8fc28b0783fcd9d3:2e44676e48a59360
          100000  L0 DVA[0]=<2:a0827eb4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=62a1f05769528648:8197c8a05ca9f1fb:a750c690124dd2e0:390bddc4314cd4c3

It seems not?
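
(For anyone who wants to repeat this check, a rough sketch - the pool,
dataset and file names are made up and the commands are untested as
written; on ZFS the inode number reported by stat should match the object
id that zdb expects:)

# zfs create -o compression=off -o recordsize=1M tank/tailtest
# dd if=/dev/random of=/tank/tailtest/f bs=128k count=9
  (writes 1179648 bytes: one full 1M record plus a 128K tail)
# zfs list -o name,used,logicalused tank/tailtest
# zdb -dddddd tank/tailtest $(stat -f %i /tank/tailtest/f)
  (then compare the size=...L/...P of the last L0 block against the tail size)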

- Rich

On Tue, Jan 18, 2022 at 9:23 AM Alan Somers <asomers@freebsd.org> wrote:
> On Tue, Jan 18, 2022 at 7:13 AM Rich <rincebrain@gmail.com> wrote:
> >
> > Compression would have made your life better here, and possibly also
> > made it clearer what's going on.
> >
> > All records in a file are going to be the same size pre-compression -
> > so if you set the recordsize to 1M and save a 131.1M file, it's going
> > to take up 132M on disk before compression/raidz overhead/whatnot.
>
> Not true.  ZFS will trim the file's tails even without compression enabled.
>
> >
> > Usually compression saves you from the tail padding actually requiring
> > allocation on disk, which is one reason I encourage everyone to at
> > least use lz4 (or, if you absolutely cannot for some reason, I guess
> > zle should also work for this one case...)
> >
> > But I would say it's probably the sum of last record padding across
> > the whole dataset, if you don't have compression on.
> >
> > - Rich
> >
> > On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire <florent@rivoire.fr> wrote:
> >>
> >> TLDR: I rsync-ed the same data twice: once with 128K recordsize and
> >> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
> >> Why not smaller?
> >>
> >> Hello,
> >>
> >> I would like some help to understand how the disk usage evolves when
> >> I change the recordsize.
> >>
> >> I've read several articles/presentations/forums about recordsize in
> >> ZFS, and if I try to summarize, I mainly understood that:
> >> - recordsize is the "maximum" size of "objects" (i.e. "logical
> >> blocks") that ZFS will create for both data & metadata; each object
> >> is then compressed, allocated to one vdev, split into smaller
> >> (ashift-sized) "physical" blocks and written to disk
> >> - increasing recordsize is usually good when storing large files that
> >> are not modified, because it limits the number of metadata objects
> >> (block pointers), which has a positive effect on performance
> >> - decreasing recordsize is useful for database-like workloads (i.e.
> >> small random writes inside existing objects), because it avoids write
> >> amplification (read-modify-write of a large object for a small update)
> >>
> >> Today, I'm trying to observe the effect of increasing recordsize for
> >> *my* data (because I'm also considering defining special_small_blocks
> >> & using SSDs as "special", but that's not tested nor discussed here,
> >> just recordsize).
> >> So, I'm doing some benchmarks on my "documents" dataset (details in
> >> "notes" below), but the results are really strange to me.
> >>
> >> When I rsync the same data to a freshly-recreated zpool:
> >> A) with recordsize=128K : 226G allocated on disk
> >> B) with recordsize=1M : 232G allocated on disk => bigger than 128K ?!?
> >>
> >> I would clearly expect the other way around, because a bigger
> >> recordsize generates less metadata, hence smaller disk usage, and
> >> there shouldn't be any overhead because 1M is just a maximum, not a
> >> forced size to allocate for every object.
>
> A common misconception.  The 1M recordsize applies to every newly
> created object, and every object must use the same size for all of its
> records (except possibly the last one).  But objects created before
> you changed the recsize will retain their old recsize, and file tails
> have a flexible recsize.
>
> >> I don't mind the increased usage (I can live with a few GB more), but
> >> I would like to understand why it happens.
>
> You might be seeing the effects of sparsity.  ZFS is smart enough not
> to store file holes (and if any kind of compression is enabled, it
> will find long runs of zeroes and turn them into holes).  If your data
> contains any holes that are >= 128 kB but < 1 MB, then they can be
> stored as holes with a 128 kB recsize but must be stored as long runs
> of zeros with a 1 MB recsize.
>
> However, I would suggest that you don't bother.  With a 128 kB recsize,
> ZFS has something like a 1000:1 ratio of data:metadata.  In other
> words, increasing your recsize can save you at most 0.1% of disk
> space.  Basically, it doesn't matter.  What it _does_ matter for is
> the tradeoff between write amplification and RAM usage.  1000:1 is
> comparable to the disk:RAM ratio of many computers.  And performance is
> more sensitive to metadata access times than data access times.  So
> increasing your recsize can help you keep a greater fraction of your
> metadata in ARC.  OTOH, as you remarked, increasing your recsize will
> also increase write amplification.
>
> So to summarize:
> * Adjust compression settings to save disk space.
> * Adjust recsize to save RAM.
>
> -Alan
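
(A quick back-of-envelope on that 1000:1 figure, under the assumption that
each record costs roughly one 128-byte block pointer of metadata - an
assumption for illustration, not something measured in this thread:)

# echo 'scale=6; 128 * 100 / (128 * 1024)' | bc
0.097656        <- ~0.1% metadata overhead at 128K records
# echo 'scale=6; 128 * 100 / (1024 * 1024)' | bc
0.012207        <- ~0.01% at 1M records

So even in the best case, the metadata saved by going from 128K to 1M is on
the order of 0.1% of the data - a couple hundred MB on a ~226G dataset -
nowhere near the ~6G difference reported above.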

> >> I tried to give all the details of my tests below.
> >> Did I do something wrong? Can you explain the increase?
> >>
> >> Thanks!
> >>
> >>
> >> ===============================================
> >> A) 128K
> >> ==========
> >>
> >> # zpool destroy bench
> >> # zpool create -o ashift=12 bench
> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
> >>
> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
> >> [...]
> >> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
> >> total size is 240,982,439,038  speedup is 1.00
> >>
> >> # zfs get recordsize bench
> >> NAME   PROPERTY    VALUE   SOURCE
> >> bench  recordsize  128K    default
> >>
> >> # zpool list -v bench
> >> NAME                                          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> >> bench                                        2.72T   226G  2.50T        -         -     0%     8%  1.00x    ONLINE  -
> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   226G  2.50T        -         -     0%  8.10%      -    ONLINE
> >>
> >> # zfs list bench
> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
> >> bench   226G  2.41T      226G  /bench
> >>
> >> # zfs get all bench | egrep "(used|referenced|written)"
> >> bench  used                  226G   -
> >> bench  referenced            226G   -
> >> bench  usedbysnapshots       0B     -
> >> bench  usedbydataset         226G   -
> >> bench  usedbychildren        1.80M  -
> >> bench  usedbyrefreservation  0B     -
> >> bench  written               226G   -
> >> bench  logicalused           226G   -
> >> bench  logicalreferenced     226G   -
> >>
> >> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
> >>
> >>
> >> ===============================================
> >> B) 1M
> >> ==========
> >>
> >> # zpool destroy bench
> >> # zpool create -o ashift=12 bench
> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
> >> # zfs set recordsize=1M bench
> >>
> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
> >> [...]
> >> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
> >> total size is 240,982,439,038  speedup is 1.00
> >>
> >> # zfs get recordsize bench
> >> NAME   PROPERTY    VALUE   SOURCE
> >> bench  recordsize  1M      local
> >>
> >> # zpool list -v bench
> >> NAME                                          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
> >> bench                                        2.72T   232G  2.49T        -         -     0%     8%  1.00x    ONLINE  -
> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   232G  2.49T        -         -     0%  8.32%      -    ONLINE
> >>
> >> # zfs list bench
> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
> >> bench   232G  2.41T      232G  /bench
> >>
> >> # zfs get all bench | egrep "(used|referenced|written)"
> >> bench  used                  232G   -
> >> bench  referenced            232G   -
> >> bench  usedbysnapshots       0B     -
> >> bench  usedbydataset         232G   -
> >> bench  usedbychildren        1.96M  -
> >> bench  usedbyrefreservation  0B     -
> >> bench  written               232G   -
> >> bench  logicalused           232G   -
> >> bench  logicalreferenced     232G   -
> >>
> >> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
> >>
> >>
> >> ===============================================
> >> Notes:
> >> ==========
> >>
> >> - the source dataset contains ~50% pictures (raw files and jpg), and
> >> also some music, various archived documents, zip, videos
> >> - no change on the source dataset while testing (cf. the size logged
> >> by rsync)
> >> - I repeated the tests twice (128K, then 1M, then 128K, then 1M), and
> >> got the same results
> >> - probably not important here, but:
> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
> >> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
> >> on another zpool that I never tweaked except ashift=12 (because it
> >> uses the same model of Red 3TB)
> >>
> >> # zfs --version
> >> zfs-2.0.6-1
> >> zfs-kmod-v2021120100-zfs_a8c7652
> >>
> >> # uname -a
> >> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
> >> 75566f060d4(HEAD) TRUENAS  amd64
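
(And one way to sanity-check the "sum of last record padding" theory against
the real dataset - a sketch only, untested as written, assuming no
compression and ignoring sparse files; /bench is the mountpoint from the
tests above, and the stat/awk usage is FreeBSD-flavored:)

# find /bench -type f -print0 | xargs -0 stat -f %z | \
    awk -v rs=1048576 '$1 > rs && $1 % rs { pad += rs - ($1 % rs) }
        END { printf "%.2f GiB worst-case tail padding\n", pad / 2^30 }'

Run it once with rs=1048576 and once with rs=131072; the difference between
the two totals should land in the same ballpark as the 232G - 226G = 6G
delta if tail padding really is the explanation. The block-size histograms
in zpool-bench-rcd128K.zdb and zpool-bench-rcd1M.zdb (from the zdb -Lbbbs
runs above) should tell the same story.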