From: Alan Somers
Date: Tue, 18 Jan 2022 08:00:08 -0700
Subject: Re: [zfs] recordsize: unexpected increase of disk usage when increasing it
To: Rich
Cc: Florent Rivoire, freebsd-fs
List-Id: Filesystems
List-Archive: https://lists.freebsd.org/archives/freebsd-fs
That's not what I get. Is your pool formatted using a very old
version or something?

somers@fbsd-head /u/h/somers [1]> dd if=/dev/random bs=1179648 of=/testpool/food/t/richfile count=1
1+0 records in
1+0 records out
1179648 bytes transferred in 0.003782 secs (311906705 bytes/sec)
somers@fbsd-head /u/h/somers> du -sh /testpool/food/t/richfile
1.1M    /testpool/food/t/richfile

On Tue, Jan 18, 2022 at 7:51 AM Rich wrote:
>
> 2.1M    /workspace/test1M/1
>
> - Rich
>
> On Tue, Jan 18, 2022 at 9:47 AM Alan Somers wrote:
>>
>> Yeah, it does. Just check "du -sh <file>". zdb there is showing
>> you the logical size of the record, but it isn't showing how many disk
>> blocks are actually allocated.
>>
>> On Tue, Jan 18, 2022 at 7:30 AM Rich wrote:
>> >
>> > Really? I didn't know it would still trim the tails on files with compression off.
>> >
>> > ...
>> >
>> >     size    1179648
>> >     parent  34
>> >     links   1
>> >     pflags  40800000004
>> > Indirect blocks:
>> >      0 L1  DVA[0]=<3:c02b96c000:1000> DVA[1]=<3:c810733000:1000> [L1 ZFS plain file] skein lz4 unencrypted LE contiguous unique double size=20000L/1000P birth=35675472L/35675472P fill=2 cksum=5cfba24b351a09aa:8bd9dfef87c5b625:906ed5c3252943db:bed77ce51ad540d4
>> >      0  L0 DVA[0]=<2:a0827db4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=95b06edf60e5f54c:af6f6950775d0863:8fc28b0783fcd9d3:2e44676e48a59360
>> > 100000  L0 DVA[0]=<2:a0827eb4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=62a1f05769528648:8197c8a05ca9f1fb:a750c690124dd2e0:390bddc4314cd4c3
>> >
>> > It seems not?
>> >
>> > - Rich
>> >
>> >
>> > On Tue, Jan 18, 2022 at 9:23 AM Alan Somers wrote:
>> >>
>> >> On Tue, Jan 18, 2022 at 7:13 AM Rich wrote:
>> >> >
>> >> > Compression would have made your life better here, and possibly also made it clearer what's going on.
>> >> >
>> >> > All records in a file are going to be the same size pre-compression, so if you set the recordsize to 1M and save a 131.1M file, it's going to take up 132M on disk before compression/raidz overhead/whatnot.
>> >>
>> >> Not true. ZFS will trim the file's tails even without compression enabled.
>> >>
>> >> >
>> >> > Usually compression saves you from the tail padding actually requiring allocation on disk, which is one reason I encourage everyone to at least use lz4 (or, if you absolutely cannot for some reason, I guess zle should also work for this one case...)
>> >> >
>> >> > But I would say it's probably the sum of last-record padding across the whole dataset, if you don't have compression on.
>> >> >
>> >> > - Rich
>> >> >
>> >> > On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire wrote:
>> >> >>
>> >> >> TLDR: I rsync-ed the same data twice: once with 128K recordsize and
>> >> >> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
>> >> >> Why not smaller?
>> >> >>
>> >> >>
>> >> >> Hello,
>> >> >>
>> >> >> I would like some help to understand how the disk usage evolves when I
>> >> >> change the recordsize.
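
A handy way to watch that evolution per file while you test is to
compare allocated vs apparent size. A minimal sketch (the path is
hypothetical; FreeBSD du(1) -A reports the apparent size and
stat(1) %b the allocated 512-byte blocks):

# allocated (on-disk) size vs apparent (logical) size of one file
du -sh  /bench/some/file
du -shA /bench/some/file
# number of 512-byte blocks actually allocated, as reported by stat(1)
stat -f "%N: %b blocks allocated" /bench/some/file
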
>> >> >>
>> >> >> I've read several articles/presentations/forums about recordsize in
>> >> >> ZFS, and if I try to summarize, I mainly understood that:
>> >> >> - recordsize is the "maximum" size of "objects" (so "logical blocks")
>> >> >> that zfs will create for both data & metadata; each object is then
>> >> >> compressed and allocated to one vdev, split into smaller (ashift
>> >> >> size) "physical" blocks and written on disks
>> >> >> - increasing recordsize is usually good when storing large files that
>> >> >> are not modified, because it limits the number of metadata objects
>> >> >> (block-pointers), which has a positive effect on performance
>> >> >> - decreasing recordsize is useful for "database-like" workloads (ie:
>> >> >> small random writes inside existing objects), because it avoids write
>> >> >> amplification (read-modify-write of a large object for a small update)
>> >> >>
>> >> >> Today, I'm trying to observe the effect of increasing recordsize for
>> >> >> *my* data (because I'm also considering defining special_small_blocks
>> >> >> & using SSDs as "special", but that is not tested nor discussed here, just
>> >> >> recordsize).
>> >> >> So, I'm doing some benchmarks on my "documents" dataset (details in
>> >> >> "notes" below), but the results are really strange to me.
>> >> >>
>> >> >> When I rsync the same data to a freshly-recreated zpool:
>> >> >> A) with recordsize=128K : 226G allocated on disk
>> >> >> B) with recordsize=1M : 232G allocated on disk => bigger than 128K?!
>> >> >>
>> >> >> I would clearly expect the other way around, because a bigger recordsize
>> >> >> generates less metadata and thus smaller disk usage, and there shouldn't be
>> >> >> any overhead because 1M is just a maximum, not a forced size to
>> >> >> allocate for every object.
>> >>
>> >> A common misconception. The 1M recordsize applies to every newly
>> >> created object, and every object must use the same size for all of its
>> >> records (except possibly the last one). But objects created before
>> >> you changed the recsize will retain their old recsize, and file tails have
>> >> a flexible recsize.
>> >>
>> >> >> I don't mind the increased usage (I can live with a few GB more), but
>> >> >> I would like to understand why it happens.
>> >>
>> >> You might be seeing the effects of sparsity. ZFS is smart enough not
>> >> to store file holes (and if any kind of compression is enabled, it
>> >> will find long runs of zeroes and turn them into holes). If your data
>> >> contains any holes that are >= 128 kB but < 1 MB, then they can be
>> >> stored as holes with a 128 kB recsize but must be stored as long runs
>> >> of zeros with a 1 MB recsize.
>> >>
>> >> However, I would suggest that you don't bother. With a 128 kB recsize,
>> >> ZFS has something like a 1000:1 ratio of data:metadata. In other
>> >> words, increasing your recsize can save you at most 0.1% of disk
>> >> space. Basically, it doesn't matter. What it _does_ matter for is
>> >> the tradeoff between write amplification and RAM usage. 1000:1 is
>> >> comparable to the disk:RAM ratio of many computers. And performance is
>> >> more sensitive to metadata access times than to data access times. So
>> >> increasing your recsize can help you keep a greater fraction of your
>> >> metadata in ARC. OTOH, as you remarked, increasing your recsize will
>> >> also increase write amplification.
>> >>
>> >> So to summarize:
>> >> * Adjust compression settings to save disk space.
>> >> * Adjust recsize to save RAM.
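
To see the hole effect described above first-hand, here is a minimal
sketch (the file name is hypothetical; it assumes compression=off on
the test dataset, as in the benchmarks below):

# 128 KiB of data at offset 0, another 128 KiB at offset 1920 KiB,
# leaving a 1.75 MiB hole in between
dd if=/dev/random of=/bench/holefile bs=128k count=1
dd if=/dev/random of=/bench/holefile bs=128k count=1 oseek=15 conv=notrunc
du -sh /bench/holefile
# with recordsize=128K the hole records are never allocated (~256K on disk);
# with recordsize=1M and no compression, both 1M records are written in
# full (~2M on disk)
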
>> >>
>> >> -Alan
>> >> >>
>> >> >> I tried to give all the details of my tests below.
>> >> >> Did I do something wrong? Can you explain the increase?
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >>
>> >> >> ===============================================
>> >> >> A) 128K
>> >> >> ==========
>> >> >>
>> >> >> # zpool destroy bench
>> >> >> # zpool create -o ashift=12 bench
>> >> >>     /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>> >> >>
>> >> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> >> >> [...]
>> >> >> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
>> >> >> total size is 240,982,439,038  speedup is 1.00
>> >> >>
>> >> >> # zfs get recordsize bench
>> >> >> NAME   PROPERTY    VALUE  SOURCE
>> >> >> bench  recordsize  128K   default
>> >> >>
>> >> >> # zpool list -v bench
>> >> >> NAME                                         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
>> >> >> bench                                       2.72T   226G  2.50T        -         -     0%     8%  1.00x  ONLINE  -
>> >> >> gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   226G  2.50T        -         -     0%  8.10%      -  ONLINE
>> >> >>
>> >> >> # zfs list bench
>> >> >> NAME   USED  AVAIL  REFER  MOUNTPOINT
>> >> >> bench  226G  2.41T   226G  /bench
>> >> >>
>> >> >> # zfs get all bench |egrep "(used|referenced|written)"
>> >> >> bench  used                  226G   -
>> >> >> bench  referenced            226G   -
>> >> >> bench  usedbysnapshots       0B     -
>> >> >> bench  usedbydataset         226G   -
>> >> >> bench  usedbychildren        1.80M  -
>> >> >> bench  usedbyrefreservation  0B     -
>> >> >> bench  written               226G   -
>> >> >> bench  logicalused           226G   -
>> >> >> bench  logicalreferenced     226G   -
>> >> >>
>> >> >> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
>> >> >>
>> >> >>
>> >> >> ===============================================
>> >> >> B) 1M
>> >> >> ==========
>> >> >>
>> >> >> # zpool destroy bench
>> >> >> # zpool create -o ashift=12 bench
>> >> >>     /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>> >> >> # zfs set recordsize=1M bench
>> >> >>
>> >> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> >> >> [...]
>> >> >> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
>> >> >> total size is 240,982,439,038  speedup is 1.00
>> >> >>
>> >> >> # zfs get recordsize bench
>> >> >> NAME   PROPERTY    VALUE  SOURCE
>> >> >> bench  recordsize  1M     local
>> >> >>
>> >> >> # zpool list -v bench
>> >> >> NAME                                         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
>> >> >> bench                                       2.72T   232G  2.49T        -         -     0%     8%  1.00x  ONLINE  -
>> >> >> gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   232G  2.49T        -         -     0%  8.32%      -  ONLINE
>> >> >>
>> >> >> # zfs list bench
>> >> >> NAME   USED  AVAIL  REFER  MOUNTPOINT
>> >> >> bench  232G  2.41T   232G  /bench
>> >> >>
>> >> >> # zfs get all bench |egrep "(used|referenced|written)"
>> >> >> bench  used                  232G   -
>> >> >> bench  referenced            232G   -
>> >> >> bench  usedbysnapshots       0B     -
>> >> >> bench  usedbydataset         232G   -
>> >> >> bench  usedbychildren        1.96M  -
>> >> >> bench  usedbyrefreservation  0B     -
>> >> >> bench  written               232G   -
>> >> >> bench  logicalused           232G   -
>> >> >> bench  logicalreferenced     232G   -
>> >> >>
>> >> >> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
>> >> >>
>> >> >>
>> >> >> ===============================================
>> >> >> Notes:
>> >> >> ==========
>> >> >>
>> >> >> - the source dataset contains ~50% pictures (raw files and jpg),
>> >> >> and also some music, various archived documents, zip, videos
>> >> >> - no change on the source dataset while testing (cf. the size logged by rsync)
>> >> >> - I repeated the tests twice (128K, then 1M, then 128K, then 1M), and
>> >> >> got the same results
>> >> >> - probably not important here, but:
>> >> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
>> >> >> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
>> >> >> on another zpool that I never tweaked except ashift=12 (because it uses
>> >> >> the same model of Red 3TB)
>> >> >>
>> >> >> # zfs --version
>> >> >> zfs-2.0.6-1
>> >> >> zfs-kmod-v2021120100-zfs_a8c7652
>> >> >>
>> >> >> # uname -a
>> >> >> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
>> >> >> 75566f060d4(HEAD) TRUENAS  amd64
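
For what it's worth, the difference should also be visible in the two
zdb captures generated above. A rough way to compare them (the exact
section titles and columns vary between zdb versions, so treat this as
a sketch rather than a recipe):

# per-type block counts and LSIZE/PSIZE/ASIZE for regular file data
grep "ZFS plain file" zpool-bench-rcd128K.zdb
grep "ZFS plain file" zpool-bench-rcd1M.zdb

# the block size histogram, if your zdb prints one at -bbb and above
grep -A 40 "Block Size Histogram" zpool-bench-rcd128K.zdb
grep -A 40 "Block Size Histogram" zpool-bench-rcd1M.zdb

If the 1M capture shows roughly the same LSIZE for plain-file blocks
but noticeably more ASIZE, that would point at the extra ~6G being data
blocks (zero runs that could no longer be stored as holes) rather than
metadata.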