Re: ZFS deadlock in 14 [USE_TMPFS=no poudriere messed up from the start, lots of "vlruwk"]

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sat, 19 Aug 2023 22:41:04 UTC
On Aug 19, 2023, at 13:41, Mark Millard <marklmi@yahoo.com> wrote:

> [I forgot to adjust USE_TMPFS for the purpose of the test.
> So I'll later be starting over.]
> 
> . . .

I finally got around to starting a from-scratch bulk -a
again (based on USE_TMPFS=no this time). This is with
15107.patch and 15122.patch applied. This is a non-debug
kernel experiment.

Interstingly it got:

[00:01:34] [01] [00:00:00] Builder starting
[00:01:57] [01] [00:00:23] Builder started
[00:01:57] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.4
[00:03:09] [01] [00:01:12] Finished ports-mgmt/pkg | pkg-1.20.4: Success
[00:03:21] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
[00:03:21] [02] [00:00:00] Builder starting
[00:03:21] [03] [00:00:00] Builder starting
[00:03:21] [04] [00:00:00] Builder starting
[00:03:21] [05] [00:00:00] Builder starting
[00:03:21] [06] [00:00:00] Builder starting
[00:03:21] [07] [00:00:00] Builder starting
[00:03:22] [08] [00:00:00] Builder starting
[00:03:22] [09] [00:00:00] Builder starting
[00:03:22] [10] [00:00:00] Builder starting
[00:03:22] [11] [00:00:00] Builder starting
[00:03:22] [12] [00:00:00] Builder starting
[00:03:22] [13] [00:00:00] Builder starting
[00:03:22] [14] [00:00:00] Builder starting
[00:03:22] [15] [00:00:00] Builder starting
[00:03:22] [16] [00:00:00] Builder starting
[00:03:22] [17] [00:00:00] Builder starting
[00:03:22] [18] [00:00:00] Builder starting
[00:03:22] [19] [00:00:00] Builder starting
[00:03:22] [20] [00:00:00] Builder starting
[00:03:22] [21] [00:00:00] Builder starting
[00:03:22] [22] [00:00:00] Builder starting
[00:03:22] [23] [00:00:00] Builder starting
[00:03:22] [24] [00:00:00] Builder starting
[00:03:22] [25] [00:00:00] Builder starting
[00:03:22] [26] [00:00:00] Builder starting
[00:03:22] [27] [00:00:00] Builder starting
[00:03:22] [28] [00:00:00] Builder starting
[00:03:22] [29] [00:00:00] Builder starting
[00:03:22] [30] [00:00:00] Builder starting
[00:03:22] [31] [00:00:00] Builder starting
[00:03:22] [32] [00:00:00] Builder starting
[00:03:30] [01] [00:00:09] Finished print/indexinfo | indexinfo-0.3.1: Success
[00:03:31] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22

and is still that way minutes later.

^T shows:

[00:03:31] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22
load: 13.02  cmd: sh 2187 [vlruwk] 570.19r 0.62u 38.60s 9% 3948k
#0 0xffffffff80b7701b at mi_switch+0xbb
#1 0xffffffff80bc976f at sleepq_timedwait+0x2f
#2 0xffffffff80b76770 at _sleep+0x1d0
#3 0xffffffff80c5b435 at vn_alloc_hard+0x2a5
#4 0xffffffff80c50b72 at getnewvnode_reserve+0x92
#5 0xffffffff829b9b12 at zfs_zget+0x22
#6 0xffffffff829a6a8d at zfs_dirent_lookup+0x16d
#7 0xffffffff829a6b5f at zfs_dirlook+0x7f
#8 0xffffffff829b6410 at zfs_lookup+0x350
#9 0xffffffff829b182a at zfs_freebsd_cachedlookup+0x6a
#10 0xffffffff80c36a0d at vfs_cache_lookup+0xad
#11 0xffffffff80c45141 at vfs_lookup+0x581
#12 0xffffffff80c44238 at namei+0x238
#13 0xffffffff80c63b5e at kern_statat+0xee
#14 0xffffffff80c64237 at sys_fstatat+0x27
#15 0xffffffff81049a79 at amd64_syscall+0x109
#16 0xffffffff8101f11b at fast_syscall_common+0xf8
[main-amd64-bulk_a-default] [2023-08-19_15h14m10s] [parallel_build:] Queued: 34435 Built: 2     Failed: 0     Skipped: 35    Ignored: 358   Fetched: 0     Tobuild: 34040  Time: 00:10:52
 ID  TOTAL                   ORIGIN   PKGNAME              PHASE PHASE    TMPFS  CPU% MEM%
[01] 00:07:29 devel/gettext-runtime | gettext-runtime-0.22 build 00:06:32       25.4%   0%
[00:11:25] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-08-19_15h14m10s

Note the 3:31->11:25 .

Top is showing lots of "vlruwk". For example:

  362     0 root         40    0  27076Ki   13776Ki CPU19   19   4:23   0.00% cpdup -i0 -o ref 32
  349     0 root         53    0  27076Ki   13776Ki vlruwk  22   4:20   0.01% cpdup -i0 -o ref 31
  328     0 root         68    0  27076Ki   13804Ki vlruwk   8   4:30   0.01% cpdup -i0 -o ref 30
  304     0 root         37    0  27076Ki   13792Ki vlruwk   6   4:18   0.01% cpdup -i0 -o ref 29
  282     0 root         42    0  33220Ki   13956Ki vlruwk   8   4:33   0.01% cpdup -i0 -o ref 28
  242     0 root         56    0  27076Ki   13796Ki vlruwk   4   4:28   0.00% cpdup -i0 -o ref 27
  
In other words, it is messed up from the start, not
just later.

It does suggest that the dbg kernel should not end up with
resource problems: not that much gets very far. So I'll
probably stop it and substitute the debug kernel, reboot
and try again.

===
Mark Millard
marklmi at yahoo.com