Re: [zfs] recordsize: unexpected increase of disk usage when increasing it
Date: Tue, 18 Jan 2022 14:12:50 UTC
Compression would have made your life better here, and possibly also
made it clearer what's going on.

All records in a file are going to be the same size pre-compression, so
if you set the recordsize to 1M and save a 131.1M file, it's going to
take up 132M on disk before compression/raidz overhead/whatnot, because
the last record is padded out to a full 1M.

Usually compression saves you from that tail padding actually requiring
allocation on disk, which is one reason I encourage everyone to at least
use lz4 (or, if you absolutely cannot for some reason, I guess zle
should also work for this one case...).

But I would say it's probably the sum of last-record padding across the
whole dataset, if you don't have compression on. (A rough way to
estimate that padding is sketched at the bottom of this mail.)

- Rich

On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire <florent@rivoire.fr> wrote:
> TLDR: I rsync-ed the same data twice: once with a 128K recordsize and
> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
> Why not smaller?
>
>
> Hello,
>
> I would like some help to understand how the disk usage evolves when I
> change the recordsize.
>
> I've read several articles/presentations/forums about recordsize in
> ZFS, and if I try to summarize, I mainly understood that:
> - recordsize is the "maximum" size of the "objects" (i.e. "logical
> blocks") that zfs creates for both data & metadata; each object is
> then compressed, allocated to one vdev, split into smaller
> (ashift-sized) "physical" blocks and written to disk
> - increasing recordsize is usually good when storing large files that
> are not modified, because it limits the number of metadata objects
> (block pointers), which has a positive effect on performance
> - decreasing recordsize is useful for database-like workloads (i.e.
> small random writes inside existing objects), because it avoids write
> amplification (read-modify-write of a large object for a small update)
>
> Today, I'm trying to observe the effect of increasing recordsize for
> *my* data (because I'm also considering defining special_small_blocks
> & using SSDs as "special", but that is not tested nor discussed here,
> just recordsize).
> So I'm doing some benchmarks on my "documents" dataset (details in
> "Notes" below), but the results are really strange to me.
>
> When I rsync the same data to a freshly-recreated zpool:
> A) with recordsize=128K: 226G allocated on disk
> B) with recordsize=1M: 232G allocated on disk => bigger than 128K ?!?
>
> I would clearly expect the other way around, because a bigger
> recordsize generates less metadata (so less disk usage), and there
> shouldn't be any overhead because 1M is only a maximum, not a size
> forcibly allocated for every object.
> I don't mind the increased usage (I can live with a few GB more), but
> I would like to understand why it happens.
>
> I tried to give all the details of my tests below.
> Did I do something wrong? Can you explain the increase?
>
> Thanks!
>
>
>
> ===============================================
> A) 128K
> ==========
>
> # zpool destroy bench
> # zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>
> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
> [...]
> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
> total size is 240,982,439,038  speedup is 1.00
>
> # zfs get recordsize bench
> NAME   PROPERTY    VALUE  SOURCE
> bench  recordsize  128K   default
>
> # zpool list -v bench
> NAME                                          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
> bench                                        2.72T   226G  2.50T        -         -     0%     8%  1.00x  ONLINE  -
>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 2.72T   226G  2.50T        -         -     0%  8.10%      -  ONLINE
>
> # zfs list bench
> NAME   USED  AVAIL  REFER  MOUNTPOINT
> bench  226G  2.41T   226G  /bench
>
> # zfs get all bench | egrep "(used|referenced|written)"
> bench  used                  226G   -
> bench  referenced            226G   -
> bench  usedbysnapshots       0B     -
> bench  usedbydataset         226G   -
> bench  usedbychildren        1.80M  -
> bench  usedbyrefreservation  0B     -
> bench  written               226G   -
> bench  logicalused           226G   -
> bench  logicalreferenced     226G   -
>
> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
>
>
>
> ===============================================
> B) 1M
> ==========
>
> # zpool destroy bench
> # zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
> # zfs set recordsize=1M bench
>
> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
> [...]
> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
> total size is 240,982,439,038  speedup is 1.00
>
> # zfs get recordsize bench
> NAME   PROPERTY    VALUE  SOURCE
> bench  recordsize  1M     local
>
> # zpool list -v bench
> NAME                                          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
> bench                                        2.72T   232G  2.49T        -         -     0%     8%  1.00x  ONLINE  -
>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 2.72T   232G  2.49T        -         -     0%  8.32%      -  ONLINE
>
> # zfs list bench
> NAME   USED  AVAIL  REFER  MOUNTPOINT
> bench  232G  2.41T   232G  /bench
>
> # zfs get all bench | egrep "(used|referenced|written)"
> bench  used                  232G   -
> bench  referenced            232G   -
> bench  usedbysnapshots       0B     -
> bench  usedbydataset         232G   -
> bench  usedbychildren        1.96M  -
> bench  usedbyrefreservation  0B     -
> bench  written               232G   -
> bench  logicalused           232G   -
> bench  logicalreferenced     232G   -
>
> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
>
>
>
> ===============================================
> Notes:
> ==========
>
> - the source dataset contains ~50% pictures (raw files and jpg), plus
> some music, various archived documents, zip files and videos
> - no change was made on the source dataset while testing (cf. the
> total size logged by rsync)
> - I repeated the tests twice (128K, then 1M, then 128K, then 1M), with
> the same results
> - probably not important here, but:
> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a WD Red 3TB CMR
> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
> on another zpool that I never tweaked except for ashift=12 (because it
> uses the same model of Red 3TB)
>
> # zfs --version
> zfs-2.0.6-1
> zfs-kmod-v2021120100-zfs_a8c7652
>
> # uname -a
> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
> 75566f060d4(HEAD) TRUENAS amd64
>
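
For anyone who wants to put a number on the tail-padding explanation,
here is a back-of-envelope sketch (not part of the original thread's
tooling; it only reuses the source path and the two recordsizes from
the tests above). For every file larger than the recordsize, the last
record is padded out to a full record before compression, so summing
that padding over the dataset approximates the extra allocation you
would see with compression off. It ignores metadata, ashift rounding
and any raidz overhead, so treat the output as an estimate only:

# find /mnt/tank/docs-florent -type f -exec stat -f %z {} + | \
    awk '$1 > 131072  { p128k += (131072  - $1 % 131072 ) % 131072  }  # pad to 128K
         $1 > 1048576 { p1m   += (1048576 - $1 % 1048576) % 1048576 }  # pad to 1M
         END { printf "estimated tail padding @128K: %.2f GiB\n", p128k / 1024^3;
               printf "estimated tail padding @1M:   %.2f GiB\n", p1m / 1024^3 }'

Files smaller than the recordsize get a single block sized to the file,
so only files larger than the recordsize are counted. If the @1M figure
lands in the same ballpark as the ~6G gap between the two runs (232G vs
226G), that supports the padding explanation; the block size histograms
in the zdb -Lbbbs dumps captured above are the more authoritative place
to confirm it.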
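
And, following Rich's lz4 suggestion, one more benchmark run with
compression enabled before the rsync is a cheap way to see how much of
that padding goes away. This is only a sketch reusing the same pool,
disk and source path as test B) above; properties apply to newly
written blocks, so they are set before copying the data:

# zpool destroy bench
# zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
# zfs set compression=lz4 recordsize=1M bench
# rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
# zfs get compression,compressratio,used,logicalused bench

Even if the data itself is mostly incompressible media, the zeroed tail
of each file's last record should compress away, so "used" ought to
drop back toward the 128K figure while "logicalused" stays around the
1M run's value.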