Re: ZFS deadlock in 14

From: Cy Schubert <Cy.Schubert_at_cschubert.com>
Date: Sat, 12 Aug 2023 04:55:40 UTC
The poudriere build machine building amd64 packages also panicked. But with:

Dumping 2577 out of 8122 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /opt/src/git-src/sys/amd64/include/pcpu_aux.h:59
59              __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /opt/src/git-src/sys/amd64/include/pcpu_aux.h:59
#1  doadump (textdump=textdump@entry=1)
    at /opt/src/git-src/sys/kern/kern_shutdown.c:407
#2  0xffffffff806c10e0 in kern_reboot (howto=260)
    at /opt/src/git-src/sys/kern/kern_shutdown.c:528
#3  0xffffffff806c15df in vpanic (
    fmt=0xffffffff80b6c5f5 "%s: possible deadlock detected for %p (%s), blocked for %d ticks\n", ap=ap@entry=0xfffffe008e698e90)
    at /opt/src/git-src/sys/kern/kern_shutdown.c:972
#4  0xffffffff806c1383 in panic (fmt=<unavailable>)
    at /opt/src/git-src/sys/kern/kern_shutdown.c:896
#5  0xffffffff8064a5ea in deadlkres ()
    at /opt/src/git-src/sys/kern/kern_clock.c:201
#6  0xffffffff80677632 in fork_exit (callout=0xffffffff8064a2c0 <deadlkres>,
    arg=0x0, frame=0xfffffe008e698f40)
    at /opt/src/git-src/sys/kern/kern_fork.c:1162
#7  <signal handler called>
(kgdb)

This is consistent with PR 271945. Reducing poudriere's -J setting to 1 or 
5:1 works around the panic.
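For reference, here is a sketch of how that workaround maps onto poudriere's knobs. The jail, ports-tree, and list names are placeholders, and the PARALLEL_JOBS option should be checked against poudriere.conf.sample on your version:

```shell
# One-off on the command line: limit bulk to a single builder, or to
# 5 builders with the second figure of -J capping jobs per builder
# (see poudriere(8) for the exact semantics on your version):
#   poudriere bulk -J 1 -j 14amd64 -p default -f pkglist.txt
#   poudriere bulk -J 5:1 -j 14amd64 -p default -f pkglist.txt

# Persistent equivalent in /usr/local/etc/poudriere.conf:
PARALLEL_JOBS=1
```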

This is certainly a different panic from the one experienced on the 
poudriere builder building i386 packages. Both machines run in amd64 mode.
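The watchdog that fired here is deadlkres (frame #5 in the backtrace above), which panics once a thread has been blocked on a lock or held in uninterruptible sleep longer than a threshold. While chasing a hang like this, raising those thresholds can keep the machine alive long enough to gather more state. A sketch, assuming the stock debug.deadlkres sysctls; verify the names with `sysctl debug.deadlkres` before relying on them:

```shell
# /etc/sysctl.conf fragment: give deadlkres more slack while debugging.
# Thresholds are in seconds; the values below are arbitrary examples.
debug.deadlkres.blktime_threshold=900    # blocked-on-lock limit before panic
debug.deadlkres.slptime_threshold=1800   # uninterruptible-sleep limit before panic
```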


-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0


Cy Schubert writes:
> This is new since the two recent ZFS patches. Instead of affecting the 
> machine where poudriere builds amd64 packages, it hit the other machine, 
> where poudriere builds i386 packages.
>
> Don't get me wrong: I believe the two new patches have improved the 
> availability of the poudriere machine building amd64 packages. I doubt 
> they caused this panic, but they may have exposed a problem that will 
> probably be fixed by another patch or two.
>
> Sorry, there was no dump produced by this panic. I'll need to check this 
> machine's config: swap is on a gmirror, which the kernel doesn't like to 
> dump to. Below are the serial console messages captured by conserver.
>
> panic: vm_page_dequeue_deferred: page 0xfffffe00028fb0d0 has unexpected queue state
> cpuid = 3
> time = 1691807572
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c50bc600
> vpanic() at vpanic+0x132/frame 0xfffffe00c50bc730
> panic() at panic+0x43/frame 0xfffffe00c50bc790
> vm_page_dequeue_deferred() at vm_page_dequeue_deferred+0xb2/frame 0xfffffe00c50bc7a0
> vm_page_free_prep() at vm_page_free_prep+0x11b/frame 0xfffffe00c50bc7c0
> vm_page_free_toq() at vm_page_free_toq+0x12/frame 0xfffffe00c50bc7f0
> vm_object_page_remove() at vm_object_page_remove+0xb6/frame 0xfffffe00c50bc850
> vn_pages_remove_valid() at vn_pages_remove_valid+0x48/frame 0xfffffe00c50bc880
> zfs_rezget() at zfs_rezget+0x35/frame 0xfffffe00c50bca60
> zfs_resume_fs() at zfs_resume_fs+0x1c8/frame 0xfffffe00c50bcab0
> zfs_ioc_rollback() at zfs_ioc_rollback+0x157/frame 0xfffffe00c50bcb00
> zfsdev_ioctl_common() at zfsdev_ioctl_common+0x612/frame 0xfffffe00c50bcbc0
> zfsdev_ioctl() at zfsdev_ioctl+0x12a/frame 0xfffffe00c50bcbf0
> devfs_ioctl() at devfs_ioctl+0xd2/frame 0xfffffe00c50bcc40
> vn_ioctl() at vn_ioctl+0xc2/frame 0xfffffe00c50bccb0
> devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe00c50bccd0
> kern_ioctl() at kern_ioctl+0x286/frame 0xfffffe00c50bcd30
> sys_ioctl() at sys_ioctl+0x152/frame 0xfffffe00c50bce00
> amd64_syscall() at amd64_syscall+0x138/frame 0xfffffe00c50bcf30
> fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00c50bcf30
> --- syscall (54, FreeBSD ELF64, ioctl), rip = 0x20938296107a, rsp = 0x209379aeee18, rbp = 0x209379aeee90 ---
> Uptime: 42m33s
> Automatic reboot in 15 seconds - press a key on the console to abort
> Rebooting...
> cpu_reset: Restarting BSP
> cpu_reset_proxy: Stopped CPU 3
>
>
>
>
> Cy Schubert writes:
> > I haven't experienced any problems (yet) either.
> >
> >
> >
> >
> > In message <CAK7dMtDJQtaai3_6VjEkwVwW5JN6e8v=kKTOPffp371xb=ORUg@mail.gmail.com>, Kevin Bowling writes:
> > > The two MFVs on head have improved/fixed stability with poudriere for
> > > me on 48-core bare metal.
> > >
> > > On Thu, Aug 10, 2023 at 6:37 AM Cy Schubert <Cy.Schubert@cschubert.com> wrote:
> > > >
> > > > In message <CAK7dMtDJeuf8rjWbsNEZABUfeqpjUyCHzuOL9AAhKk93sy+PKg@mail.gmail.com>, Kevin Bowling writes:
> > > > > Possibly https://github.com/openzfs/zfs/commit/2cb992a99ccadb78d97049b40bd442eb4fdc549d
> > > > >
> > > > > On Tue, Aug 8, 2023 at 10:08 AM Dag-Erling Smørgrav <des@freebsd.org> wrote:
> > > > > >
> > > > > > At some point between 42d088299c (4 May) and f0c9703301 (26 June), a
> > > > > > deadlock was introduced in ZFS.  It is still present as of 9c2823bae9
> > > > > > (4 August) and is 100% reproducible just by starting poudriere bulk in a
> > > > > > 16-core VM and waiting a few hours until deadlkres kicks in.  In the
> > > > > > latest instance, deadlkres complained about a bash process:
> > > > > >
> > > > > >     #0  sched_switch (td=td@entry=0xfffffe02fb1d8000, flags=flags@entry=259) at /usr/src/sys/kern/sched_ule.c:2299
> > > > > >     #1  0xffffffff80b5a0a3 in mi_switch (flags=flags@entry=259) at /usr/src/sys/kern/kern_synch.c:550
> > > > > >     #2  0xffffffff80babcb4 in sleepq_switch (wchan=0xfffff818543a9e70, pri=64) at /usr/src/sys/kern/subr_sleepqueue.c:609
> > > > > >     #3  0xffffffff80babb8c in sleepq_wait (wchan=<unavailable>, pri=<unavailable>) at /usr/src/sys/kern/subr_sleepqueue.c:660
> > > > > >     #4  0xffffffff80b1c1b0 in sleeplk (lk=lk@entry=0xfffff818543a9e70, flags=flags@entry=2121728, ilk=ilk@entry=0x0, wmesg=wmesg@entry=0xffffffff8222a054 "zfs", pri=<optimized out>, pri@entry=64, timo=timo@entry=6, queue=1) at /usr/src/sys/kern/kern_lock.c:310
> > > > > >     #5  0xffffffff80b1a23f in lockmgr_slock_hard (lk=0xfffff818543a9e70, flags=2121728, ilk=<optimized out>, file=0xffffffff812544fb "/usr/src/sys/kern/vfs_subr.c", line=3057, lwa=0x0) at /usr/src/sys/kern/kern_lock.c:705
> > > > > >     #6  0xffffffff80c59ec3 in VOP_LOCK1 (vp=0xfffff818543a9e00, flags=2105344, file=0xffffffff812544fb "/usr/src/sys/kern/vfs_subr.c", line=3