Re: ZFS deadlock in 14 [USE_TMPFS=no poudriere messed up from the start, lots of "vlruwk"]
Date: Sun, 20 Aug 2023 00:33:47 UTC
On Aug 19, 2023, at 16:27, Mark Millard <marklmi@yahoo.com> wrote: > On Aug 19, 2023, at 15:41, Mark Millard <marklmi@yahoo.com> wrote: > >> On Aug 19, 2023, at 13:41, Mark Millard <marklmi@yahoo.com> wrote: >> >>> [I forgot to adjust USE_TMPFS for the purpose of the test. >>> So I'll later be starting over.] >>> >>> . . . >> >> I finally got around to starting a from-scratch bulk -a >> again (based on USE_TMPFS=no this time). This is with >> 15107.patch and 15122.patch applied. This is a non-debug >> kernel experiment. >> >> Interstingly it got: >> >> [00:01:34] [01] [00:00:00] Builder starting >> [00:01:57] [01] [00:00:23] Builder started >> [00:01:57] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.4 >> [00:03:09] [01] [00:01:12] Finished ports-mgmt/pkg | pkg-1.20.4: Success >> [00:03:21] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1 >> [00:03:21] [02] [00:00:00] Builder starting >> [00:03:21] [03] [00:00:00] Builder starting >> [00:03:21] [04] [00:00:00] Builder starting >> [00:03:21] [05] [00:00:00] Builder starting >> [00:03:21] [06] [00:00:00] Builder starting >> [00:03:21] [07] [00:00:00] Builder starting >> [00:03:22] [08] [00:00:00] Builder starting >> [00:03:22] [09] [00:00:00] Builder starting >> [00:03:22] [10] [00:00:00] Builder starting >> [00:03:22] [11] [00:00:00] Builder starting >> [00:03:22] [12] [00:00:00] Builder starting >> [00:03:22] [13] [00:00:00] Builder starting >> [00:03:22] [14] [00:00:00] Builder starting >> [00:03:22] [15] [00:00:00] Builder starting >> [00:03:22] [16] [00:00:00] Builder starting >> [00:03:22] [17] [00:00:00] Builder starting >> [00:03:22] [18] [00:00:00] Builder starting >> [00:03:22] [19] [00:00:00] Builder starting >> [00:03:22] [20] [00:00:00] Builder starting >> [00:03:22] [21] [00:00:00] Builder starting >> [00:03:22] [22] [00:00:00] Builder starting >> [00:03:22] [23] [00:00:00] Builder starting >> [00:03:22] [24] [00:00:00] Builder starting >> [00:03:22] [25] [00:00:00] Builder starting >> [00:03:22] [26] [00:00:00] Builder starting >> [00:03:22] [27] [00:00:00] Builder starting >> [00:03:22] [28] [00:00:00] Builder starting >> [00:03:22] [29] [00:00:00] Builder starting >> [00:03:22] [30] [00:00:00] Builder starting >> [00:03:22] [31] [00:00:00] Builder starting >> [00:03:22] [32] [00:00:00] Builder starting >> [00:03:30] [01] [00:00:09] Finished print/indexinfo | indexinfo-0.3.1: Success >> [00:03:31] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22 >> >> and is still that way minutes later. >> >> ^T shows: >> >> [00:03:31] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22 >> load: 13.02 cmd: sh 2187 [vlruwk] 570.19r 0.62u 38.60s 9% 3948k >> #0 0xffffffff80b7701b at mi_switch+0xbb >> #1 0xffffffff80bc976f at sleepq_timedwait+0x2f >> #2 0xffffffff80b76770 at _sleep+0x1d0 >> #3 0xffffffff80c5b435 at vn_alloc_hard+0x2a5 >> #4 0xffffffff80c50b72 at getnewvnode_reserve+0x92 >> #5 0xffffffff829b9b12 at zfs_zget+0x22 >> #6 0xffffffff829a6a8d at zfs_dirent_lookup+0x16d >> #7 0xffffffff829a6b5f at zfs_dirlook+0x7f >> #8 0xffffffff829b6410 at zfs_lookup+0x350 >> #9 0xffffffff829b182a at zfs_freebsd_cachedlookup+0x6a >> #10 0xffffffff80c36a0d at vfs_cache_lookup+0xad >> #11 0xffffffff80c45141 at vfs_lookup+0x581 >> #12 0xffffffff80c44238 at namei+0x238 >> #13 0xffffffff80c63b5e at kern_statat+0xee >> #14 0xffffffff80c64237 at sys_fstatat+0x27 >> #15 0xffffffff81049a79 at amd64_syscall+0x109 >> #16 0xffffffff8101f11b at fast_syscall_common+0xf8 >> [main-amd64-bulk_a-default] [2023-08-19_15h14m10s] [parallel_build:] Queued: 34435 Built: 2 Failed: 0 Skipped: 35 Ignored: 358 Fetched: 0 Tobuild: 34040 Time: 00:10:52 >> ID TOTAL ORIGIN PKGNAME PHASE PHASE TMPFS CPU% MEM% >> [01] 00:07:29 devel/gettext-runtime | gettext-runtime-0.22 build 00:06:32 25.4% 0% >> [00:11:25] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-08-19_15h14m10s >> >> Note the 3:31->11:25 . >> >> Top is showing lots of "vlruwk". For example: >> >> 362 0 root 40 0 27076Ki 13776Ki CPU19 19 4:23 0.00% cpdup -i0 -o ref 32 >> 349 0 root 53 0 27076Ki 13776Ki vlruwk 22 4:20 0.01% cpdup -i0 -o ref 31 >> 328 0 root 68 0 27076Ki 13804Ki vlruwk 8 4:30 0.01% cpdup -i0 -o ref 30 >> 304 0 root 37 0 27076Ki 13792Ki vlruwk 6 4:18 0.01% cpdup -i0 -o ref 29 >> 282 0 root 42 0 33220Ki 13956Ki vlruwk 8 4:33 0.01% cpdup -i0 -o ref 28 >> 242 0 root 56 0 27076Ki 13796Ki vlruwk 4 4:28 0.00% cpdup -i0 -o ref 27 >> >> In other words, it is messed up from the start, not >> just later. >> >> It does suggest that the dbg kernel should not end up with >> resource problems: not that much gets very far. So I'll >> probably stop it and substitute the debug kernel, reboot >> and try again. > > Still for nodbg kernel . . . > > The "vlruwk" processes do occasionally instead show a > CPU?? . Nothing seems stuck in only one STATE. (Live > lock?) > > As for using the dbg kernel instead . . . > > Most of the time that processes are showing CPU?? more > progress is made in building, but basically one builder. > vlruwk dooes show up, gradually showing a larger fracion > of the time. ref 02 .. ref 32 are still in cpdup -i0 -o . > *vnode is showing up some as well. N process looks to > be stuck in just one of those. (Live lock?) > > The debug kernel is not reporting anything during this > so far. > > (some time goes by) > > At this point vlruwk is fairly commonly what mostlt display > for the cpdup's that are not finishing --but none are stuck > in vlruuwk . > > Looks like I should try without the 2 patches (15107 and > 15122). > I restored to eliminate the 2 patches and rebuilt the kernels installed the nodbg one, rebooted, and tried again, still using USE_TMPFS=no . . . The cpdup's get the same sort of vlruwk/CPU??/*vnode sort of general activity. ref 02 .. ref 32 still do not complete, leaving 1 builder active. In other words, the patches did not make a difference that I've noticed for what I'm reporting. === Mark Millard marklmi at yahoo.com