Re: ZFS + FreeBSD XEN dom0 panic

From: Roger_Pau_Monné <royger_at_gmail.com>
Date: Sat, 26 Mar 2022 15:56:21 UTC
On Sat, 26 Mar 2022, 15:39, Roger Pau Monné <roger.pau@citrix.com> wrote:

> On Sat, Mar 26, 2022 at 02:08:06PM +0200, Ze Dupsys wrote:
> > On 2022.03.26. 11:11, Roger Pau Monné wrote:
> > >
> > > Hm, do you think you could upload (or attach) your
> > > /usr/lib/debug/boot/kernel/kernel.debug and provide an updated panic
> > > trace using that same exact kernel?
> >
> > Yes, it is just too big for email attachment.
> > Uploaded at: https://files.fm/f/mp3v3qa22
> >
> > This time I starved Dom0 of RAM (2G) to speed the panic up. The panic
> > trace is the same.
> >
> > Trace:
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 2; apic id = 04
> > fault virtual address = 0x22710028
> > fault code            = supervisor read data, page not present
> > instruction pointer   = 0x20:0xffffffff80c6a2b2
> > stack pointer         = 0x28:0xfffffe009e486b30
> > frame pointer         = 0x28:0xfffffe009e486b30
> > code segment          = base 0x0, limit 0xfffff, type 0x1b
> >                       = DPL 0, pres 1, long 1, def32 0, gran 1
> > processor eflags      = interrupt enabled, resume, IOPL = 0
> > current process               = 3995 (devmatch)
> > trap number           = 12
> > panic: page fault
> > cpuid = 2
> > time = 1648293768
> > KDB: stack backtrace:
> > #0 0xffffffff80c7c285 at kdb_backtrace+0x65
> > #1 0xffffffff80c2e2e1 at vpanic+0x181
> > #2 0xffffffff80c2e153 at panic+0x43
> > #3 0xffffffff810c8b97 at trap+0xba7
> > #4 0xffffffff810c8bef at trap+0xbff
> > #5 0xffffffff810c8243 at trap+0x253
> > #6 0xffffffff810a0848 at calltrap+0x8
> > #7 0xffffffff80c86ed1 at rman_is_region_manager+0x241
> > #8 0xffffffff80c3eb41 at sbuf_new_for_sysctl+0x101
> > #9 0xffffffff80c3df8c at kernel_sysctl+0x3ec
> > #10 0xffffffff80c3e603 at userland_sysctl+0x173
> > #11 0xffffffff80c3e44f at sys___sysctl+0x5f
> > #12 0xffffffff810c949c at amd64_syscall+0x10c
> > #13 0xffffffff810a115b at Xfast_syscall+0xfb
> > Uptime: 10m19s
>
> It's weird, because here you get a page fault, but there are also
> traces with:
>
> general protection fault while in kernel mode
> cpuid = 3; apic id = 06
> instruction pointer     = 0x20:0xffffffff810c5d64
> stack pointer           = 0x28:0xfffffe00a20fe990
> frame pointer           = 0x28:0xfffffe00a20fe990
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 8998 (devmatch)
> trap number             = 9
> panic: general protection fault
> cpuid = 3
> time = 1647416577
> KDB: stack backtrace:
> #0 0xffffffff80c7ca05 at kdb_backtrace+0x65
> #1 0xffffffff80c2ea11 at vpanic+0x181
> #2 0xffffffff80c2e883 at panic+0x43
> #3 0xffffffff810c9b97 at trap+0xba7
> #4 0xffffffff810c907b at trap+0x8b
> #5 0xffffffff810a0dc8 at calltrap+0x8
> #6 0xffffffff80c83067 at kvprintf+0x1007
> #7 0xffffffff80c83df9 at snprintf+0x59
> #8 0xffffffff80c8768b at rman_is_region_manager+0x27b
> #9 0xffffffff80c3f271 at sbuf_new_for_sysctl+0x101
> #10 0xffffffff80c3e6bc at kernel_sysctl+0x3ec
> #11 0xffffffff80c3ed33 at userland_sysctl+0x173
> #12 0xffffffff80c3eb7f at sys___sysctl+0x5f
> #13 0xffffffff810ca49c at amd64_syscall+0x10c
> #14 0xffffffff810a16db at Xfast_syscall+0xfb
>
> That shows a general protection fault instead of a page fault.
>
> I've built a hypervisor with debug enabled for you, it's at:
>
> https://people.freebsd.org/~royger/xen-debug
>
> This is the same as the one in ports, just built with debug=y. If you
> can place it in /boot/ and change your xen_kernel to:
>
> xen_kernel="/boot/xen-debug"
>
> It might provide some additional info.
>
> I've also noticed that the process triggering the panic always seems
> to be 'devmatch'.
>
> >
> > cat /tmp/panic.log | sed -Ee 's/^#[0-9]* //' -e 's/ .*//' | \
> >   xargs addr2line -e /usr/lib/debug/boot/kernel/kernel.debug
> > /usr/src/sys/kern/subr_kdb.c:443
> > /usr/src/sys/kern/kern_shutdown.c:0
> > /usr/src/sys/kern/kern_shutdown.c:844
> > /usr/src/sys/amd64/amd64/trap.c:944
> > /usr/src/sys/amd64/amd64/trap.c:0
> > /usr/src/sys/amd64/amd64/trap.c:0
> > /usr/src/sys/amd64/amd64/exception.S:292
> > /usr/src/sys/kern/subr_rman.c:0
>
> I've been able to get a better trace with gdb and your debug symbols,
> and this is:
>
> (gdb) info line *0xffffffff80c6a2b2
> Line 1386 of "/usr/src/sys/kern/subr_bus.c" starts at address
> 0xffffffff80c6a2b2 <device_get_name+18>
>    and ends at 0xffffffff80c6a2b6 <device_get_name+22>.
> (gdb) info line *0xffffffff80c86ed1
> Line 1052 of "/usr/src/sys/kern/subr_rman.c" starts at address
> 0xffffffff80c86ecc <sysctl_rman+540>
>    and ends at 0xffffffff80c86ed5 <sysctl_rman+549>.
>
> The page fault happens exactly at:
>
> https://cgit.freebsd.org/src/tree/sys/kern/subr_bus.c?h=stable/13#n1386
>
> Which is called from
>
> https://cgit.freebsd.org/src/tree/sys/kern/subr_rman.c?h=stable/13#n1052
>
> I'm trying to figure out how the device could be removed or
> disconnected from the rman. I will try to create a patch to catch the
> device that leaves rman regions when destroyed/removed.
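
For anyone repeating the address lookup quoted above, here is a minimal
Python sketch of the same idea. The helper names parse_frames and
addr2line_cmd are hypothetical, made up for illustration. It pulls the frame
addresses out of a KDB backtrace (tolerating "> " quote prefixes) and builds
an addr2line command line, adding -f so that function names are printed next
to file:line. It assumes the debug file belongs to the exact kernel that
panicked.

```python
import re
import subprocess

# Matches KDB backtrace lines such as
#   "#7 0xffffffff80c86ed1 at rman_is_region_manager+0x241"
# optionally preceded by mail quote markers ("> > ").
FRAME_RE = re.compile(r"^(?:>\s*)*#(\d+)\s+(0x[0-9a-fA-F]+)\s+at\s+(\S+)")


def parse_frames(text):
    """Return a list of (frame_number, address, symbol) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((int(m.group(1)), m.group(2), m.group(3)))
    return frames


def addr2line_cmd(frames,
                  debug_file="/usr/lib/debug/boot/kernel/kernel.debug"):
    """Build the addr2line argument list for the given frames.

    -f asks addr2line to print the function name as well as file:line.
    """
    return ["addr2line", "-f", "-e", debug_file] + [a for _, a, _ in frames]


def resolve(frames,
            debug_file="/usr/lib/debug/boot/kernel/kernel.debug"):
    """Run addr2line and return its raw output (needs addr2line installed)."""
    result = subprocess.run(addr2line_cmd(frames, debug_file),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Usage would be something like resolve(parse_frames(open("/tmp/panic.log").read())).
Note that the addresses in the backtrace are return addresses, so the
faulting instruction pointer from the trap header is still worth checking
separately with gdb, as done above.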


Replying from my phone so the format will likely be mangled.

I think I've found at least one issue with blkback leaking resources on
destroy if the ring was not connected. Could you give the following patch a
try? I've only build-tested it, so I can't guarantee it will work.

Thanks, Roger.