Re: An attempted test of main's "git: 2ad756a6bbb3" "merge openzfs/zfs@95f71c019" that did not go as planned

From: Mark Millard <marklmi_at_yahoo.com>
Date: Mon, 04 Sep 2023 09:00:43 UTC
On Sep 3, 2023, at 23:35, Mark Millard <marklmi@yahoo.com> wrote:

> On Sep 3, 2023, at 22:06, Alexander Motin <mav@FreeBSD.org> wrote:
> 
>> 
>> On 03.09.2023 22:54, Mark Millard wrote:
>>> After that ^t produced the likes of:
>>> load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k
>> 
>> So the full state is not "tx->tx", but is actually a "tx->tx_quiesce_done_cv", which means the thread is waiting for new transaction to be opened, which means some previous to be quiesced and then synced.
>> 
>>> #0 0xffffffff80b6f103 at mi_switch+0x173
>>> #1 0xffffffff80bc0f24 at sleepq_switch+0x104
>>> #2 0xffffffff80aec4c5 at _cv_wait+0x165
>>> #3 0xffffffff82aba365 at txg_wait_open+0xf5
>>> #4 0xffffffff82a11b81 at dmu_free_long_range+0x151
>> 
>> Here it seems like transaction commit is waited due to large amount of delete operations, which ZFS tries to spread between separate TXGs.
> 
> That fit the context: cleaning out /usr/local/poudriere/data/.m/
> 
>> You should probably see some large and growing number in sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay .
> 
> After the reboot I started a -J64 example. It has avoided the
> early "witness exhausted". Again I ^C'd after about an hours
> after the 2nd builder had started. So: again cleaning out
> /usr/local/poudriere/data/.m/ Only seconds between:
> 
> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042
> 
> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427
> 
> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323
> 
> # sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
> kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027
> 
> I have found a measure of progress: zfs list's USED
> for /usr/local/poudriere/data/.m is decreasing. So
> ztop's d/s was a good classification: deletes.
> 
>>> #5 0xffffffff829a87d2 at zfs_rmnode+0x72
>>> #6 0xffffffff829b658d at zfs_freebsd_reclaim+0x3d
>>> #7 0xffffffff8113a495 at VOP_RECLAIM_APV+0x35
>>> #8 0xffffffff80c5a7d9 at vgonel+0x3a9
>>> #9 0xffffffff80c5af7f at vrecycle+0x3f
>>> #10 0xffffffff829b643e at zfs_freebsd_inactive+0x4e
>>> #11 0xffffffff80c598cf at vinactivef+0xbf
>>> #12 0xffffffff80c590da at vput_final+0x2aa
>>> #13 0xffffffff80c68886 at kern_funlinkat+0x2f6
>>> #14 0xffffffff80c68588 at sys_unlink+0x28
>>> #15 0xffffffff8106323f at amd64_syscall+0x14f
>>> #16 0xffffffff8103512b at fast_syscall_common+0xf8
>> 
>> What we don't see here is what quiesce and sync threads of the pool are actually doing.  Sync thread has plenty of different jobs, including async write, async destroy, scrub and others, that all may delay each other.
>> 
>> Before you rebooted the system, depending how alive it is, could you save a number of outputs of `procstat -akk`, or at least specifically `procstat -akk | grep txg_thread_enter` if the full is hard?  Or somehow else observe what they are doing.
> 
> # procstat -akk > ~/mmjnk00.txt
> # procstat -akk > ~/mmjnk01.txt
> # procstat -akk > ~/mmjnk02.txt
> # procstat -akk > ~/mmjnk03.txt
> # procstat -akk > ~/mmjnk04.txt
> # procstat -akk > ~/mmjnk05.txt
> # grep txg_thread_enter ~/mmjnk0[0-5].txt
> /usr/home/root/mmjnk00.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk00.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk01.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk01.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk02.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk02.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk03.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk03.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk04.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk04.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk05.txt:    6 100881 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe 
> /usr/home/root/mmjnk05.txt:    6 100882 zfskern             txg_thread_enter    mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b _cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe 
> 
> (Hopefully that will be a sufficiently useful start.)
> 
>> `zpool status`, `zpool get all` and `sysctl -a` would also not harm.
> 
> It is a very simple zpool configuration: one partition.
> I only use it for bectl BE reasons, not the general
> range of reasons for using zfs. I created the media with
> my normal content, then checkpointed before doing the
> git fetch to start to set up the experiment.
> 
> # zpool status
>  pool: zamd64
> state: ONLINE
> status: Some supported and requested features are not enabled on the pool.
> The pool can still be used, but some features are unavailable.
> action: Enable all features using 'zpool upgrade'. Once this is done,
> the pool may no longer be accessible by software that does not support
> the features. See zpool-features(7) for details.
> checkpoint: created Sun Sep  3 11:46:54 2023, consumes 2.17M
> config:
> 
> NAME            STATE     READ WRITE CKSUM
> zamd64          ONLINE       0     0     0
>  gpt/amd64zfs  ONLINE       0     0     0
> 
> errors: No known data errors
> 
> There was also a snapshot in place before I did the
> checkpoint operation.
> 
> I deliberately did not use my typical openzfs-2.1-freebsd 
> for compatibility but used defaults when creating the pool:
> 
> # zpool get all
> NAME    PROPERTY                       VALUE                          SOURCE
> zamd64  size                           872G                           -
> zamd64  capacity                       21%                            -
> zamd64  altroot                        -                              default
> zamd64  health                         ONLINE                         -
> zamd64  guid                           4817074778276814820            -
> zamd64  version                        -                              default
> zamd64  bootfs                         zamd64/ROOT/main-amd64         local
> zamd64  delegation                     on                             default
> zamd64  autoreplace                    off                            default
> zamd64  cachefile                      -                              default
> zamd64  failmode                       wait                           default
> zamd64  listsnapshots                  off                            default
> zamd64  autoexpand                     off                            default
> zamd64  dedupratio                     1.00x                          -
> zamd64  free                           688G                           -
> zamd64  allocated                      184G                           -
> zamd64  readonly                       off                            -
> zamd64  ashift                         0                              default
> zamd64  comment                        -                              default
> zamd64  expandsize                     -                              -
> zamd64  freeing                        0                              -
> zamd64  fragmentation                  17%                            -
> zamd64  leaked                         0                              -
> zamd64  multihost                      off                            default
> zamd64  checkpoint                     2.17M                          -
> zamd64  load_guid                      17719601284614388220           -
> zamd64  autotrim                       off                            default
> zamd64  compatibility                  off                            default
> zamd64  bcloneused                     0                              -
> zamd64  bclonesaved                    0                              -
> zamd64  bcloneratio                    1.00x                          -
> zamd64  feature@async_destroy          enabled                        local
> zamd64  feature@empty_bpobj            active                         local
> zamd64  feature@lz4_compress           active                         local
> zamd64  feature@multi_vdev_crash_dump  enabled                        local
> zamd64  feature@spacemap_histogram     active                         local
> zamd64  feature@enabled_txg            active                         local
> zamd64  feature@hole_birth             active                         local
> zamd64  feature@extensible_dataset     active                         local
> zamd64  feature@embedded_data          active                         local
> zamd64  feature@bookmarks              enabled                        local
> zamd64  feature@filesystem_limits      enabled                        local
> zamd64  feature@large_blocks           enabled                        local
> zamd64  feature@large_dnode            enabled                        local
> zamd64  feature@sha512                 enabled                        local
> zamd64  feature@skein                  enabled                        local
> zamd64  feature@edonr                  enabled                        local
> zamd64  feature@userobj_accounting     active                         local
> zamd64  feature@encryption             enabled                        local
> zamd64  feature@project_quota          active                         local
> zamd64  feature@device_removal         enabled                        local
> zamd64  feature@obsolete_counts        enabled                        local
> zamd64  feature@zpool_checkpoint       active                         local
> zamd64  feature@spacemap_v2            active                         local
> zamd64  feature@allocation_classes     enabled                        local
> zamd64  feature@resilver_defer         enabled                        local
> zamd64  feature@bookmark_v2            enabled                        local
> zamd64  feature@redaction_bookmarks    enabled                        local
> zamd64  feature@redacted_datasets      enabled                        local
> zamd64  feature@bookmark_written       enabled                        local
> zamd64  feature@log_spacemap           active                         local
> zamd64  feature@livelist               enabled                        local
> zamd64  feature@device_rebuild         enabled                        local
> zamd64  feature@zstd_compress          enabled                        local
> zamd64  feature@draid                  enabled                        local
> zamd64  feature@zilsaxattr             active                         local
> zamd64  feature@head_errlog            active                         local
> zamd64  feature@blake3                 enabled                        local
> zamd64  feature@block_cloning          enabled                        local
> zamd64  feature@vdev_zaps_v2           active                         local
> zamd64  feature@redaction_list_spill   disabled                       local
> 
> /etc/sysctl.conf does have:
> 
> vfs.zfs.min_auto_ashift=12
> vfs.zfs.per_txg_dirty_frees_percent=5
> 
> The vfs.zfs.per_txg_dirty_frees_percent is from prior
> Mateusz Guzik help, where after testing the change I
> reported:
> 
> Result summary: Seems to have avoided the sustained periods
> of low load average activity. Much better for the context.
> 
> But it was for a different machine (aarch64, 8 cores). But
> it was for poudriere bulk use.
> 
> Turns out the default of 30 was causing sort of like
> what is seen here: I could have presented some of the
> information via the small load average figures here.
> 
> (Note: 5 is the old default, 30 is newer. Other contexts
> have other problems with 5: no single right setting and
> no automated configuration.)
> 
> Other than those 2 items, zfs is untuned (defaults).
> 
> sysctl -a is a lot more output (864930 Bytes) so I'll skip
> it for now.
> 
>> PS: I may be wrong, but USB in "USB3 NVMe SSD storage" makes me shiver. Make sure there is no storage problems, like some huge delays, timeouts, etc, that can be seen, for example, as busy percents regularly spiking far above 100% in your `gstat -spod`.
>> 
> 
> The "gstat -spod" output showed (and shows): around 0.8ms/w to 3ms/w,
> mostly at the lower end of the range. < 98%busy, no spikes to > 100%.
> It is a previously unused Samsung PSSD T7 Touch.

A little more context here: that is for the "kB" figures seen
during the cleanup/delete activity. During port builds into
packages larger "kB" figures are seen and the ms/w figures
will tend to be larger as well. The larger sizes can also
lead to reaching somewhat above 100 %busy some of the time.

I'll also note that I've ended up doing a lot more write
activity exploring than I'd expected.

> I was not prepared to replace the content of a PCIe slot's media
> or M.2 connection's media for the temporary purpose. No spare
> supply for those so no simple swapping for those.




===
Mark Millard
marklmi at yahoo.com