Re: armv8.2-A+ tuned FreeBSD kernels vs. poudriere bulk and USB3 media: tx->tx_quiesce_done_cv related blocking of processes?
- Reply: Mark Millard : "Re: armv8.2-A+ tuned FreeBSD kernels vs. poudriere bulk and USB3 media: tx->tx_quiesce_done_cv related blocking of processes?"
- In reply to: Mark Millard : "Re: armv8.2-A+ tuned FreeBSD kernels vs. poudriere bulk and USB3 media: tx->tx_quiesce_done_cv related blocking of processes?"
Date: Mon, 01 May 2023 00:44:10 UTC
can you redo zfs test with:
sysctl vfs.zfs.per_txg_dirty_frees_percent=5

On 5/1/23, Mark Millard <marklmi@yahoo.com> wrote:
> As the evidence this time is largely independent of the details
> reported previously, I'm top posting this.
>
> The previous ZFS on USB3 results were based on poudriere using
> "USE_TMPFS=data", meaning that almost all file I/O was via ZFS
> to the USB3 media.
>
> The UFS on U2 960GB Optane via USB3 adapter results did not
> suffer the reported problems, despite "USE_TMPFS=data". (I let
> it run to completion.) But this had both a media and a file
> system difference.
>
> This time the results are for just changing poudriere to use
> "USE_TMPFS=all" instead, but back on the original ZFS on
> USB3 media. The implication is that the vast majority of the
> file I/O is not via ZFS to the USB3 media.
>
> The context has 32 GiBytes of RAM and about 118 GiBytes of
> swap/paging space. It would need to page if let run to
> completion.
>
> Observing, the load average is behaving normally: things are
> not stuck waiting. "gstat -spod" indicates the ZFS I/O is
> not sustained (no paging into swap space yet).
>
> First 1 hr: 262 ports built.
>
> But this had both a media and a file system difference again.
>
> I'm stopping after this, in order to set up the next just-ZFS
> experiments.
>
>
> Next experiments:
>
> I expect to move the ZFS context to U2 960GB Optane media
> used with the USB3 adapter and to test both "USE_TMPFS=data"
> and "USE_TMPFS=all", probably letting USE_TMPFS=all run to
> completion.
>
> If the Optane based USE_TMPFS=data context still has the problem,
> even the more performant media would not have been enough to
> avoid what would then appear to be a ZFS problem, the two other
> file systems not having the problem.
>
> The Optane based USE_TMPFS=all context should just handle the
> paging and the rarer ZFS I/O quicker. I do not expect problems
> for this combination, given the UFS on Optane USB3 results
> and the partial USE_TMPFS=all non-Optane USB3 results. Even
> with ZFS working, this likely is the more performant type of
> context for the Windows Dev Kit 2023, given that I'm leaving
> Windows 11 Pro in place on the internal media.
>
>
> Hypothesis for the original problem:
>
> I wonder if ZFS write activity to the USB3 media was largely
> blocking the ZFS read activity to the same media, causing
> lots of things to have to spend much time waiting for data
> instead of making progress, leading to long periods of low
> load averages.
>
>
> Older material:
>
> On Apr 30, 2023, at 00:50, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On Apr 29, 2023, at 19:44, Mark Millard <marklmi@yahoo.com> wrote:
>>
>>> This is based on: main-n262658-b347c2284603-dirty, b347c2284603
>>> being from late Apr 28, 2023 UTC. (The "-dirty" is from some
>>> historical patches that I use.) The build is a non-debug build
>>> (but with symbols not stripped). World and kernel had been built
>>> using:
>>>
>>> -mcpu=cortex-a78c+flagm+nofp16fml
>>>
>>> just for testing purposes. (Worked nicely for -j8 buildworld
>>> buildkernel testing for the 4 cortex-a78c's plus 4 cortex-x1c's
>>> present.)
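A minimal sketch of applying the tunable suggested at the top of this message, both at runtime and persistently; the value 5 is simply the figure suggested above, not a recommended default:

  # set for the running system
  sysctl vfs.zfs.per_txg_dirty_frees_percent=5
  # keep it across reboots
  echo 'vfs.zfs.per_txg_dirty_frees_percent=5' >> /etc/sysctl.conf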
>>>
>>> Monitoring poudriere bulk related activity via top and gstat -spod
>>> I see a lot of the odd result of one process doing something
>>> like:
>>>
>>> CPU4 4 1:39 99.12% /usr/local/sbin/pkg-static create -f tzst -r
>>> /wrkdirs/usr/ports/devel/cmake-core/work/stage
>>>
>>> while other processes sit in the likes of:
>>>
>>> tx->tx
>>> zcq->z
>>> zcw->z
>>> zilog-
>>> select
>>> wait
>>>
>>> But sometimes there is no CPU bound process and the top CPU process is
>>> the likes of:
>>>
>>> 1.24% [usb{usbus0}]
>>>
>>> "gstat -spod" basically shows da0 dedicated to write activity most
>>> of the time.
>>>
>>> After: sysctl kern.tty_info_kstacks=1
>>> Then using ^T at various times, I see a lot of:
>>>
>>> load: 0.48 cmd: sh 93914 [tx->tx_quiesce_done_cv] 7534.91r 11.06u 22.66s
>>> 0% 3800k
>>> #0 0xffff0000004fd564 at mi_switch+0x104
>>> #1 0xffff000000463f40 at _cv_wait+0x120
>>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>>> #4 0xffff000001448254 at zfs_rmnode+0x64
>>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>>> #10 0xffff0000005fc430 at vinactivef+0x180
>>> #11 0xffff0000005fba50 at vput_final+0x200
>>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>>>
>>> load: 0.34 cmd: sh 93914 [tx->tx_quiesce_done_cv] 7566.69r 11.06u 22.66s
>>> 0% 3800k
>>> #0 0xffff0000004fd564 at mi_switch+0x104
>>> #1 0xffff000000463f40 at _cv_wait+0x120
>>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>>> #4 0xffff000001448254 at zfs_rmnode+0x64
>>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>>> #7 0xffff0000005fd6c0 at vgonel+0x450
>>> #8 0xffff0000005fde7c at vrecycle+0x9c
>>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>>> #10 0xffff0000005fc430 at vinactivef+0x180
>>> #11 0xffff0000005fba50 at vput_final+0x200
>>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>>>
>>> load: 0.44 cmd: sh 93914 [tx->tx_quiesce_done_cv] 7693.52r 11.24u 23.08s
>>> 0% 3800k
>>> #0 0xffff0000004fd564 at mi_switch+0x104
>>> #1 0xffff000000463f40 at _cv_wait+0x120
>>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>>> #4 0xffff000001448254 at zfs_rmnode+0x64
>>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>>> #7 0xffff0000005fd6c0 at vgonel+0x450
>>> #8 0xffff0000005fde7c at vrecycle+0x9c
>>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>>> #10 0xffff0000005fc430 at vinactivef+0x180
>>> #11 0xffff0000005fba50 at vput_final+0x200
>>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>>>
>>>
>>> The system (Windows Dev Kit 2023) has 32 GiBytes of RAM.
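The ^T (SIGINFO) output above only includes kernel stacks when kern.tty_info_kstacks is enabled; roughly the same stack can also be pulled for any blocked process with procstat. A minimal sketch, reusing pid 93914 from the output above:

  # make ^T print the kernel stack of the foreground process
  sysctl kern.tty_info_kstacks=1

  # kernel stack of a specific process, by pid
  procstat -kk 93914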
>>> Example output from a top modified to show some "Max[imum]Obs[erved]"
>>> information:
>>>
>>> last pid: 17198; load averages: 0.33, 0.58, 1.06 MaxObs: 15.49, 8.73, 5.75
>>> up 0+20:48:10  19:14:49
>>> 426 threads: 9 running, 394 sleeping, 1 stopped, 22 waiting, 50 MaxObsRunning
>>> CPU: 0.0% user, 0.0% nice, 0.2% system, 0.1% interrupt, 99.7% idle
>>> Mem: 282760Ki Active, 7716Mi Inact, 23192Ki Laundry, 22444Mi Wired,
>>> 2780Ki Buf, 848840Ki Free, 2278Mi MaxObsActive, 22444Mi MaxObsWired,
>>> 22752Mi MaxObs(Act+Wir+Lndry)
>>> ARC: 11359Mi Total, 3375Mi MFU, 5900Mi MRU, 993Mi Anon, 93076Ki Header,
>>> 992Mi Other
>>> 8276Mi Compressed, 19727Mi Uncompressed, 2.38:1 Ratio
>>> Swap: 120832Mi Total, 120832Mi Free, 2301Mi MaxObs(Act+Lndry+SwapUsed),
>>> 22752Mi MaxObs(Act+Wir+Lndry+SwapUsed)
>>>
>>>
>>> The poudriere bulk has 8 builders but has ALLOW_MAKE_JOBS=yes
>>> without any explicit settings for the likes of MAKE_JOBS_NUMBER.
>>> So it is a configuration that allows a high load average compared
>>> to the number of hardware threads (here: cores) in the system.
>>>
>>>
>>> I've rebooted to do a test with filemon not loaded at the time
>>> (here it was loaded from prior buildworld buildkernel activity).
>>> We will see if it still ends up with such problems.
>>
>> It still ends up with things waiting, but the detailed STATE
>> values generally listed are somewhat different.
>>
>> I also tried a chroot into a world from use of -mcpu=cortex-a72
>> and got similar results, suggesting that only the
>> -mcpu=cortex-a78c+flagm+nofp16fml kernel is required to see the
>> issues. This even got some examples like:
>>
>> load: 1.20 cmd: sh 95560 [tx->tx_quiesce_done_cv] 2022.55r 3.21u 9.24s 0%
>> 3852k
>> #0 0xffff0000004fd564 at mi_switch+0x104
>> #1 0xffff000000463f40 at _cv_wait+0x120
>> #2 0xffff000001518a34 at txg_wait_open+0xf4
>> #3 0xffff00000147d0bc at dmu_free_long_range+0x17c
>> #4 0xffff000001421254 at zfs_rmnode+0x64
>> #5 0xffff00000142e7c4 at zfs_freebsd_reclaim+0x34
>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #7 0xffff0000005fd6c0 at vgonel+0x450
>> #8 0xffff0000005fde7c at vrecycle+0x9c
>> #9 0xffff00000142e678 at zfs_freebsd_inactive+0x48
>> #10 0xffff0000005fc430 at vinactivef+0x180
>> #11 0xffff0000005fba50 at vput_final+0x200
>> #12 0xffff00015d8ceab4 at null_reclaim+0x154
>> #13 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #14 0xffff0000005fd6c0 at vgonel+0x450
>> #15 0xffff0000005fde7c at vrecycle+0x9c
>> #16 0xffff00015d8ce8e8 at null_inactive+0x38
>> #17 0xffff0000005fc430 at vinactivef+0x180
>>
>> (The chroot use involves null mounts.)
>>
>> which is sort of analogous to the filemon related
>> backtraces I showed earlier. The common part
>> across the examples looks to be #0..#11:
>>
>> #0 0xffff0000004fd564 at mi_switch+0x104
>> #1 0xffff000000463f40 at _cv_wait+0x120
>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>> #4 0xffff000001448254 at zfs_rmnode+0x64
>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #7 0xffff0000005fd6c0 at vgonel+0x450
>> #8 0xffff0000005fde7c at vrecycle+0x9c
>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>> #10 0xffff0000005fc430 at vinactivef+0x180
>> #11 0xffff0000005fba50 at vput_final+0x200
>>
>> There were a lot more nanslp examples in all
>> the later testing (i.e., those runs avoided filemon.ko
>> being loaded).
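For anyone wanting to reproduce that builder configuration: ALLOW_MAKE_JOBS is a poudriere.conf setting, while a MAKE_JOBS_NUMBER cap would go in the make.conf poudriere hands to the build jails. A minimal sketch, assuming the default /usr/local/etc paths; MAKE_JOBS_NUMBER was deliberately left unset in the runs above and is shown commented out only to indicate where a cap would go:

  # /usr/local/etc/poudriere.conf
  ALLOW_MAKE_JOBS=yes

  # /usr/local/etc/poudriere.d/make.conf
  #MAKE_JOBS_NUMBER=8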
>>
>> Starting from having pkgclean -A'd the ports, the
>> experiments got about the same number of ports built
>> as of the end of the 1st hour.
>>
>>
>>
>> UFS vs. ZFS? Different media types? . . .
>>
>> So I decided to create and try a UFS test context
>> instead of a ZFS one. But the media that was best
>> to update was a U2 960GB Optane in a USB3
>> adapter, something that would perform noticeably
>> better than my normal USB3 NVMe drives, even with
>> USB involved.
>>
>> This combination maintained reasonable load averages
>> (instead of having long periods of <1) and finished
>> building 172 ports in the 1st hour, far more than the
>> around 83 each time I tried the other device/ZFS
>> combination. No evidence of the earlier reported
>> oddities.
>>
>> I should also time a from-scratch buildworld
>> buildkernel.
>>
>> I'll look into setting up another U2 960GB Optane
>> for use in the USB3 adapter, but with ZFS. That
>> should help isolate media vs. file system
>> contributions to the varying behaviors.
>
>
>
> ===
> Mark Millard
> marklmi at yahoo.com
>
>

-- 
Mateusz Guzik <mjguzik gmail.com>
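The planned follow-up amounts to putting a pool on the same kind of USB3-attached Optane that just ran UFS. A minimal sketch of that setup, where the da1 device name and the label/pool names are illustrative assumptions rather than details from the message:

  # assumes the Optane in the USB3 adapter shows up as da1
  gpart create -s gpt da1
  gpart add -t freebsd-zfs -l optane0 da1
  zpool create -O compression=lz4 -O atime=off zoptane gpt/optane0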