Re: poudriere bulk with ZFS and USE_TMPFS=no on main [14-ALPHA2 based]: extensive vlruwk for cpdup's on new builders after pkg builds in first builder

From: Mark Millard <marklmi_at_yahoo.com>
Date: Thu, 24 Aug 2023 20:07:22 UTC
On Aug 24, 2023, at 00:22, Mark Millard <marklmi@yahoo.com> wrote:

> On Aug 23, 2023, at 22:54, Mateusz Guzik <mjguzik@gmail.com> wrote:
> 
>> On 8/24/23, Mark Millard <marklmi@yahoo.com> wrote:
>>> On Aug 23, 2023, at 15:10, Mateusz Guzik <mjguzik@gmail.com> wrote:
>>> 
>>>> On 8/23/23, Mark Millard <marklmi@yahoo.com> wrote:
>>>>> [Forked off the ZFS deadlock 14 discussion, per feedback.]
>>>>> . . .
>>>> 
>>>> This is a known problem, but it is unclear if you should be running
>>>> into it in this setup.
>>> 
>>> The changed fixed the issue: so I do run into the the issue
>>> for this setup. See below.
>>> 
>>>> Can you try again but this time *revert*
>>>> 138a5dafba312ff39ce0eefdbe34de95519e600d, like so:
>>>> git revert 138a5dafba312ff39ce0eefdbe34de95519e600d
>>>> 
>>>> may want to switch to a different branch first, for example: git
>>>> checkout -b vfstesting
>>> 
>>> # git -C /usr/main-src/ diff sys/kern/vfs_subr.c
>>> diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
>>> index 0f3f00abfd4a..5dff556ac258 100644
>>> --- a/sys/kern/vfs_subr.c
>>> +++ b/sys/kern/vfs_subr.c
>>> @@ -3528,25 +3528,17 @@ vdbatch_process(struct vdbatch *vd)
>>>       MPASS(curthread->td_pinned > 0);
>>>       MPASS(vd->index == VDBATCH_SIZE);
>>> +       mtx_lock(&vnode_list_mtx);
>>>       critical_enter();
>>> -       if (mtx_trylock(&vnode_list_mtx)) {
>>> -               for (i = 0; i < VDBATCH_SIZE; i++) {
>>> -                       vp = vd->tab[i];
>>> -                       vd->tab[i] = NULL;
>>> -                       TAILQ_REMOVE(&vnode_list, vp, v_vnodelist);
>>> -                       TAILQ_INSERT_TAIL(&vnode_list, vp, v_vnodelist);
>>> -                       MPASS(vp->v_dbatchcpu != NOCPU);
>>> -                       vp->v_dbatchcpu = NOCPU;
>>> -               }
>>> -               mtx_unlock(&vnode_list_mtx);
>>> -       } else {
>>> -               for (i = 0; i < VDBATCH_SIZE; i++) {
>>> -                       vp = vd->tab[i];
>>> -                       vd->tab[i] = NULL;
>>> -                       MPASS(vp->v_dbatchcpu != NOCPU);
>>> -                       vp->v_dbatchcpu = NOCPU;
>>> -               }
>>> +       for (i = 0; i < VDBATCH_SIZE; i++) {
>>> +               vp = vd->tab[i];
>>> +               TAILQ_REMOVE(&vnode_list, vp, v_vnodelist);
>>> +               TAILQ_INSERT_TAIL(&vnode_list, vp, v_vnodelist);
>>> +               MPASS(vp->v_dbatchcpu != NOCPU);
>>> +               vp->v_dbatchcpu = NOCPU;
>>>       }
>>> +       mtx_unlock(&vnode_list_mtx);
>>> +       bzero(vd->tab, sizeof(vd->tab));
>>>       vd->index = 0;
>>>       critical_exit();
>>> }
>>> 
>>> Still with:
>>> 
>>> # grep USE_TMPFS= /usr/local/etc/poudriere.conf
>>> # EXAMPLE: USE_TMPFS="wrkdir data"
>>> #USE_TMPFS=all
>>> #USE_TMPFS="data"
>>> USE_TMPFS=no
>>> 
>>> 
>>> That allowed the other builders to eventually reach "Builder started"
>>> and later activity, [00:05:50] [27] [00:02:29] Builder started
>>> being the first non-[01] to do so, no vlruwk's observed in what
>>> I saw in top:
>>> 
>>> . . .
>>> 
>>> Now testing for the zfs deadlock issue should be possible for
>>> this setup.
>>> 
>> 
>> Thanks for testing, I wrote a fix:
>> 
>> https://people.freebsd.org/~mjg/vfs-recycle-fix.diff
>> 
>> Applies to *stock* kernel (as in without the revert).
> 
> I'm going to leave the deadlock test running for when
> I sleep tonight. So it is going to be a while before
> I get to testing this. $ work will likely happen first
> as well. (No deadlock observed yet, by the way. 6+ hrs
> and 3000+ ports built so far.)
> 
> I can easily restore the sys/kern/vfs_subr.c to then
> do normal 14.0-ALPHA2-ish based patching with: so not
> a problem. Thanks.
> 

I stopped the deadlock experiment, cleaned out the partial
bulk -a, put back the modern sys/kern/vfs_subr.c , applied
your patch, built, installed, rebooted, and started another
bulk -a run. It made progress on all the builders to and
past "Builder started":

. . .
[00:01:34] Building 34042 packages using up to 32 builders
[00:01:34] Hit CTRL+t at any time to see build progress and stats
[00:01:34] [01] [00:00:00] Builder starting
[00:01:57] [01] [00:00:23] Builder started
[00:01:57] [01] [00:00:00] Building ports-mgmt/pkg | pkg-1.20.4
[00:03:09] [01] [00:01:12] Finished ports-mgmt/pkg | pkg-1.20.4: Success
[00:03:22] [01] [00:00:00] Building print/indexinfo | indexinfo-0.3.1
[00:03:22] [02] [00:00:00] Builder starting
[00:03:22] [03] [00:00:00] Builder starting
[00:03:22] [04] [00:00:00] Builder starting
[00:03:22] [05] [00:00:00] Builder starting
[00:03:22] [06] [00:00:00] Builder starting
[00:03:22] [07] [00:00:00] Builder starting
[00:03:22] [08] [00:00:00] Builder starting
[00:03:22] [09] [00:00:00] Builder starting
[00:03:22] [10] [00:00:00] Builder starting
[00:03:22] [11] [00:00:00] Builder starting
[00:03:22] [12] [00:00:00] Builder starting
[00:03:22] [13] [00:00:00] Builder starting
[00:03:22] [14] [00:00:00] Builder starting
[00:03:22] [15] [00:00:00] Builder starting
[00:03:22] [16] [00:00:00] Builder starting
[00:03:22] [17] [00:00:00] Builder starting
[00:03:22] [18] [00:00:00] Builder starting
[00:03:22] [19] [00:00:00] Builder starting
[00:03:22] [20] [00:00:00] Builder starting
[00:03:22] [21] [00:00:00] Builder starting
[00:03:22] [22] [00:00:00] Builder starting
[00:03:22] [23] [00:00:00] Builder starting
[00:03:22] [24] [00:00:00] Builder starting
[00:03:22] [25] [00:00:00] Builder starting
[00:03:22] [26] [00:00:00] Builder starting
[00:03:22] [27] [00:00:00] Builder starting
[00:03:22] [28] [00:00:00] Builder starting
[00:03:22] [29] [00:00:00] Builder starting
[00:03:22] [30] [00:00:00] Builder starting
[00:03:22] [31] [00:00:00] Builder starting
[00:03:22] [32] [00:00:00] Builder starting
[00:03:30] [01] [00:00:08] Finished print/indexinfo | indexinfo-0.3.1: Success
[00:03:30] [01] [00:00:00] Building devel/gettext-runtime | gettext-runtime-0.22
[00:04:42] [01] [00:01:12] Finished devel/gettext-runtime | gettext-runtime-0.22: Success
[00:04:48] [01] [00:00:00] Building devel/libtextstyle | libtextstyle-0.22
[00:05:46] [19] [00:02:24] Builder started
[00:05:46] [15] [00:02:24] Builder started
[00:05:46] [19] [00:00:00] Building graphics/libpotrace | libpotrace-1.16
[00:05:46] [15] [00:00:00] Building devel/libdaemon | libdaemon-0.14_1
[00:05:46] [25] [00:02:24] Builder started
[00:05:46] [25] [00:00:00] Building audio/speexdsp | speexdsp-1.2.1
[00:05:46] [29] [00:02:24] Builder started
[00:05:46] [29] [00:00:00] Building devel/opencl | opencl-3.0.14
. . .

Thanks. I'll let it run as another deadlock test. The
prior run built over 9400 in about 18.5 hr before I
stopped it (no deadlocks observed).

===
Mark Millard
marklmi at yahoo.com