Re: git: 315ee00fa961 - main - zfs: merge openzfs/zfs@804414aad

From: Cy Schubert <Cy.Schubert_at_cschubert.com>
Date: Thu, 31 Aug 2023 23:44:36 UTC
In message <20230831234023.917DFAB@slippy.cwsent.com>, Cy Schubert writes:
> In message <CANCZdfqLWoQnLkKcLYLa73WOKDOAEfXB2rQX869Qaaqv6z=gKA@mail.gmail.c
> om>
> , Warner Losh writes:
> > On Thu, Aug 31, 2023, 5:32 PM Cy Schubert <Cy.Schubert@cschubert.com> wrote
> =
> > :
> >
> > > In message <20230831223526.DCB701A1@slippy.cwsent.com>, Cy Schubert
> > > writes:
> > > > In message <a5c51f3f-8c7f-8bd5-f718-72bc33fe22ed@FreeBSD.org>,
> > > Alexander
> > > > Motin
> > > > writes:
> > > > > On 31.08.2023 13:53, Cy Schubert wrote:
> > > > > > One thing that circumvents my two problems is reducing poudriere
> > > bulk job
> > > > s
> > > > > > from 8 to 5 on my 4 core machines.
> > > > >
> > > > > Cy, I have no real evidences to think it is related, other than your
> > > > > panics look like some memory corruptions, but could you try is patch:
> > > > > https://github.com/openzfs/zfs/pull/15228 .  If it won't do the
> > > trick,
> > > > > then I am out of ideas without additional input.
> > > >
> > > > So far so good. Poudriere has been running with a decent -J jobs on bot
> =
> > h
> > > > machines for over an hour. I'll let you know if they survive the night.
> > > It
> > > > can take some time before the panics happen though.
> > > >
> > > > The problem is more likely to occur when there are a lot of small
> > > package
> > > > builds than large long running jobs, probably because of the parallel
> > > ZFS
> > > > dataset creations, deletions, and rollbacks.
> > > >
> > > > >
> > > > > Gleb, you may try to add this too, just as a choice between impossibl
> =
> > e
> > > > > and improbable.
> > > > >
> > > > > --
> > > > > Alexander Motin
> > > >
> > > >
> > > > --
> > > > Cheers,
> > > > Cy Schubert <Cy.Schubert@cschubert.com>
> > > > FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
> > > > NTP:           <cy@nwtime.org>    Web:  https://nwtime.org
> > > >
> > > >                       e^(i*pi)+1=3D0
> > > >
> > > >
> > >
> > > One of the two machines is hung.
> > >
> > > cwfw# ping bob
> > > PING bob (10.1.1.7): 56 data bytes
> > > ^C
> > > --- bob ping statistics ---
> > > 2 packets transmitted, 0 packets received, 100.0% packet loss
> > > cwfw# console bob
> > > [Enter `^Ec?' for help]
> > > [halt sent]
> > > KDB: enter: Break to debugger
> > > [ thread pid 31259 tid 100913 ]
> > > Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)
> > > db> bt
> > > Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> > > kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef2d0
> > > uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef310
> > > intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef380
> > > intr_execute_handlers() at intr_execute_handlers+0x63/frame
> > > 0xfffffe00c53ef3b0
> > > Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef3b0
> > > --- interrupt, rip =3D 0xffffffff806d5c70, rsp =3D 0xfffffe00c53ef480, rb
> =
> > p =3D
> > > 0xfffffe00c53ef480 ---
> > > getbinuptime() at getbinuptime+0x30/frame 0xfffffe00c53ef480
> > > arc_access() at arc_access+0x250/frame 0xfffffe00c53ef4d0
> > > arc_buf_access() at arc_buf_access+0xd0/frame 0xfffffe00c53ef4f0
> > > dbuf_hold_impl() at dbuf_hold_impl+0xf3/frame 0xfffffe00c53ef580
> > > dbuf_hold() at dbuf_hold+0x25/frame 0xfffffe00c53ef5b0
> > > dnode_hold_impl() at dnode_hold_impl+0x194/frame 0xfffffe00c53ef670
> > > dmu_bonus_hold() at dmu_bonus_hold+0x20/frame 0xfffffe00c53ef6a0
> > > zfs_zget() at zfs_zget+0x20d/frame 0xfffffe00c53ef750
> > > zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0
> > > zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> > > zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> > > zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame
> > > 0xfffffe00c53ef9e0
> > > vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> > > vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> > > namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> > > vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> > > kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> > > ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> > > int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc
> > > db>
> > >
> > > I'll let it continue. Hopefully the watchdog timer will pop and we get a
> > > dump.
> > >
> >
> >
> > Might also be interesting to see if this moves around or is really hung
> > getting the time. I suspect it's live lock given this traceback.
>
> It's moving around.
>
>
> db> c
> [halt sent]
> KDB: enter: Break to debugger
> [ thread pid 31259 tid 100913 ]
> Stopped at      kdb_break+0x48: movq    $0,0xa1069d(%rip)
> db> bt
> Tracing pid 31259 tid 100913 td 0xfffffe00c4eca000
> kdb_break() at kdb_break+0x48/frame 0xfffffe00c53ef410
> uart_intr() at uart_intr+0xf7/frame 0xfffffe00c53ef450
> intr_event_handle() at intr_event_handle+0x12b/frame 0xfffffe00c53ef4c0
> intr_execute_handlers() at intr_execute_handlers+0x63/frame 
> 0xfffffe00c53ef4f0
> Xapic_isr1() at Xapic_isr1+0xdc/frame 0xfffffe00c53ef4f0
> --- interrupt, rip = 0xffffffff807117c4, rsp = 0xfffffe00c53ef5c0, rbp = 
> 0xfffffe00c53ef5c0 ---
> lock_delay() at lock_delay+0x14/frame 0xfffffe00c53ef5c0
> _sx_xlock_hard() at _sx_xlock_hard+0x1a9/frame 0xfffffe00c53ef660
> _sx_xlock() at _sx_xlock+0xb0/frame 0xfffffe00c53ef6a0
> zfs_zget() at zfs_zget+0x1fa/frame 0xfffffe00c53ef750
> zfs_dirent_lookup() at zfs_dirent_lookup+0x16d/frame 0xfffffe00c53ef7a0
> zfs_dirlook() at zfs_dirlook+0x7f/frame 0xfffffe00c53ef7d0
> zfs_lookup() at zfs_lookup+0x3c0/frame 0xfffffe00c53ef8a0
> zfs_freebsd_cachedlookup() at zfs_freebsd_cachedlookup+0x67/frame 
> 0xfffffe00c53ef9e0
> vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe00c53efa30
> vfs_lookup() at vfs_lookup+0x457/frame 0xfffffe00c53efac0
> namei() at namei+0x2e1/frame 0xfffffe00c53efb20
> vn_open_cred() at vn_open_cred+0x505/frame 0xfffffe00c53efca0
> kern_openat() at kern_openat+0x287/frame 0xfffffe00c53efdf0
> ia32_syscall() at ia32_syscall+0x156/frame 0xfffffe00c53eff30
> int0x80_syscall_common() at int0x80_syscall_common+0x9c/frame 0xffff89dc
> db> c

It seems to have settled there for now.


-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0