Re: ZFS deadlock in 14

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 22 Aug 2023 18:24:00 UTC
Alexander Motin <mav_at_FreeBSD.org> wrote on
Date: Tue, 22 Aug 2023 16:18:12 UTC :

> I am waiting for final test results from George Wilson and then will 
> request quick merge of both to zfs-2.2-release branch. Unfortunately 
> there are still not many reviewers for the PR, since the code is not 
> trivial, but at least with the test reports Brian Behlendorf and Mark 
> Maybee seem to be OK to merge the two PRs into 2.2. If somebody else 
> have tested and/or reviewed the PR, you may comment on it.

I had written to the list that I tried to test the system by
doing poudriere builds (initially with your patches) using
USE_TMPFS=no, so that ZFS had to deal with all the file I/O.
Instead of a normal build, I got only one builder that ended
up active, the others never reaching "Builder started":

[00:01:34] [01] [00:00:00] Builder starting
[00:01:57] [01] [00:00:23] Builder started
[00:01:57] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.4
[00:03:09] [01] [00:01:12] Finished ports-mgmt/pkg | pkg-1.20.4: Success
[00:03:21] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
[00:03:21] [02] [00:00:00] Builder starting
[00:03:21] [03] [00:00:00] Builder starting
[00:03:21] [04] [00:00:00] Builder starting
[00:03:21] [05] [00:00:00] Builder starting
[00:03:21] [06] [00:00:00] Builder starting
[00:03:21] [07] [00:00:00] Builder starting
[00:03:22] [08] [00:00:00] Builder starting
[00:03:22] [09] [00:00:00] Builder starting
[00:03:22] [10] [00:00:00] Builder starting
[00:03:22] [11] [00:00:00] Builder starting
[00:03:22] [12] [00:00:00] Builder starting
[00:03:22] [13] [00:00:00] Builder starting
[00:03:22] [14] [00:00:00] Builder starting
[00:03:22] [15] [00:00:00] Builder starting
[00:03:22] [16] [00:00:00] Builder starting
[00:03:22] [17] [00:00:00] Builder starting
[00:03:22] [18] [00:00:00] Builder starting
[00:03:22] [19] [00:00:00] Builder starting
[00:03:22] [20] [00:00:00] Builder starting
[00:03:22] [21] [00:00:00] Builder starting
[00:03:22] [22] [00:00:00] Builder starting
[00:03:22] [23] [00:00:00] Builder starting
[00:03:22] [24] [00:00:00] Builder starting
[00:03:22] [25] [00:00:00] Builder starting
[00:03:22] [26] [00:00:00] Builder starting
[00:03:22] [27] [00:00:00] Builder starting
[00:03:22] [28] [00:00:00] Builder starting
[00:03:22] [29] [00:00:00] Builder starting
[00:03:22] [30] [00:00:00] Builder starting
[00:03:22] [31] [00:00:00] Builder starting
[00:03:22] [32] [00:00:00] Builder starting
[00:03:30] [01] [00:00:09] Finished print/indexinfo | indexinfo-0.3.1: Success
[00:03:31] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22
. . .
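
For anyone wanting to check a saved log for the same pattern, the
following Python is a rough sketch that lists the builder IDs that
logged "Builder starting" but never "Builder started". The function
name and the assumed line layout ("[elapsed] [NN] [job-elapsed]
message") are illustrative assumptions based only on the excerpt
above, not anything that poudriere itself provides.

```python
import re

# Assumed line layout, taken from the excerpt above:
#   [00:03:21] [02] [00:00:00] Builder starting
LINE_RE = re.compile(r"^\[(\d\d:\d\d:\d\d)\] \[(\d+)\] \[(\d\d:\d\d:\d\d)\] (.*)$")


def stalled_builders(log_text):
    """Return sorted builder IDs seen 'starting' but never 'started'."""
    starting, started = set(), set()
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that are not in the assumed layout
        builder, message = m.group(2), m.group(4)
        if message == "Builder starting":
            starting.add(builder)
        elif message == "Builder started":
            started.add(builder)
    return sorted(starting - started)


# Hypothetical, shortened sample in the same layout as the log above.
sample = """\
[00:01:34] [01] [00:00:00] Builder starting
[00:01:57] [01] [00:00:23] Builder started
[00:03:21] [02] [00:00:00] Builder starting
[00:03:21] [03] [00:00:00] Builder starting
"""

print(stalled_builders(sample))  # ['02', '03']
```

Run against the full excerpt above, this would report builders 02
through 32, matching what the log shows by eye.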

Top was showing lots of "vlruwk" for the cpdup's. For example:

. . .
 362     0 root         40    0  27076Ki   13776Ki CPU19   19   4:23   0.00% cpdup -i0 -o ref 32
 349     0 root         53    0  27076Ki   13776Ki vlruwk  22   4:20   0.01% cpdup -i0 -o ref 31
 328     0 root         68    0  27076Ki   13804Ki vlruwk   8   4:30   0.01% cpdup -i0 -o ref 30
 304     0 root         37    0  27076Ki   13792Ki vlruwk   6   4:18   0.01% cpdup -i0 -o ref 29
 282     0 root         42    0  33220Ki   13956Ki vlruwk   8   4:33   0.01% cpdup -i0 -o ref 28
 242     0 root         56    0  27076Ki   13796Ki vlruwk   4   4:28   0.00% cpdup -i0 -o ref 27
. . .

But those processes did show CPU?? on occasion, as well as
*vnode less often. None of the cpdup's was stuck in

Removing your patches did not change the behavior.

So far I've not seen any reports similar to these
results, which I got on the ThreadRipper 1950X that
I have access to.

I normally use USE_TMPFS=all but that hides the
problem and is why I've no clue when the behavior
would have started if I'd been using USE_TMPFS=no
instead.

I never got so far as testing for the kinds of
reports I've seen about the deadlock issue.

No one has commented on what I reported or on whether
they have done any USE_TMPFS=no style of testing.
(I also use ALLOW_MAKE_JOBS=yes .)
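
For anyone wanting to try the same style of test, the two settings
mentioned would look roughly like the fragment below in
poudriere.conf (typically /usr/local/etc/poudriere.conf; that
location is an assumption about a default ports/pkg install). Only
these two lines come from this report; nothing else about the
configuration is implied.

```
# Fragment only: the two settings named in this report.
# Keep build work off tmpfs so ZFS handles all the file I/O.
USE_TMPFS=no
# Let each port's build use parallel make jobs.
ALLOW_MAKE_JOBS=yes
```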

The ZFS context is a simple single-partition context.
I use ZFS in order to use bectl BE's, not for other
reasons.

===
Mark Millard
marklmi at yahoo.com