Re: ZFS + FreeBSD XEN dom0 panic
Date: Mon, 11 Apr 2022 15:37:27 UTC
On Mon, Apr 11, 2022 at 11:47:50AM +0300, Ze Dupsys wrote:
> On 2022.04.08. 18:02, Roger Pau Monné wrote:
> > On Fri, Apr 08, 2022 at 10:45:12AM +0300, Ze Dupsys wrote:
> > > On 2022.04.05. 18:22, Roger Pau Monné wrote:
> > > > .. Thanks, sorry for the late reply, somehow the message slipped.
> > > >
> > > > I've been able to get the file:line for those, and the trace is
> > > > kind of weird; I'm not sure I know what's going on, TBH. It seems
> > > > to me the backend instance got freed while still being in the
> > > > process of connecting.
> > > >
> > > > I've made some changes that might mitigate this, but not having a
> > > > clear understanding of what's going on makes this harder.
> > > >
> > > > I've pushed the changes to:
> > > >
> > > > http://xenbits.xen.org/gitweb/?p=people/royger/freebsd.git;a=shortlog;h=refs/heads/for-leak
> > > >
> > > > (This is on top of the main branch.)
> > > >
> > > > I'm also attaching the two patches to this email.
> > > >
> > > > Let me know if those make a difference in stabilizing the system.
> > >
> > > Hi,
> > >
> > > Yes, it stabilizes the system, but I think there is still a memory
> > > leak somewhere.
> > >
> > > The system could run the tests for approximately 41 hours; it did
> > > not panic, but then started to OOM-kill everything.
> > >
> > > I did not know how to git clone the given commit, so I just applied
> > > the patches to the 13.0-RELEASE sources.
> > >
> > > The serial logs show nothing unusual, just that at some point the
> > > OOM kills start.
> >
> > Well, I think that's good^W better than before. Thanks again for all
> > the testing.
> >
> > It might be helpful now to start dumping `vmstat -m` periodically
> > while running the stress tests. As there are (hopefully) no more
> > panics now, vmstat might report which subsystem is hogging the
> > memory. It's possible it's blkback (again).
> >
> > Thanks, Roger.
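A trivial loop is enough for the periodic dump; a minimal sketch (the
interval and log path here are arbitrary):

    # append a timestamped `vmstat -m` snapshot every 60 seconds
    while true; do
        date >> /var/log/vmstat-m.log
        vmstat -m >> /var/log/vmstat-m.log
        sleep 60
    done

Diffing the first and last snapshots after the OOM kills start should
show which malloc type keeps growing.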
> Yes, it certainly is better. I applied the patch on my pre-production
> server and have not had any panic since then; still testing, though.
>
> On my stressed lab server it's a bit of a different story. On occasion
> I see a panic with this trace on serial (I can not reliably repeat it;
> sometimes it happens upon starting dom ids 1 and 2, sometimes
> mid-stress-test with dom id > 95).
>
> panic: pmap_growkernel: no memory to grow kernel
> cpuid = 2
> time = 1649485133
> KDB: stack backtrace:
> #0 0xffffffff80c57385 at kdb_backtrace+0x65
> #1 0xffffffff80c09d61 at vpanic+0x181
> #2 0xffffffff80c09bd3 at panic+0x43
> #3 0xffffffff81073eed at pmap_growkernel+0x27d
> #4 0xffffffff80f2d918 at vm_map_insert+0x248
> #5 0xffffffff80f30079 at vm_map_find+0x549
> #6 0xffffffff80f2bda6 at kmem_init+0x226
> #7 0xffffffff80c731a1 at vmem_xalloc+0xcb1
> #8 0xffffffff80c72a9b at vmem_xalloc+0x5ab
> #9 0xffffffff80c724a6 at vmem_alloc+0x46
> #10 0xffffffff80f2ac6b at kva_alloc+0x2b
> #11 0xffffffff8107f0eb at pmap_mapdev_attr+0x27b
> #12 0xffffffff810588ca at nexus_add_irq+0x65a
> #13 0xffffffff81058710 at nexus_add_irq+0x4a0
> #14 0xffffffff810585b9 at nexus_add_irq+0x349
> #15 0xffffffff80c495c1 at bus_alloc_resource+0xa1
> #16 0xffffffff8105e940 at xenmem_free+0x1a0
> #17 0xffffffff80a7e0dd at xbd_instance_create+0x943d
>
> | sed -Ee 's/^#[0-9]* //' -e 's/ .*//' | xargs addr2line -e /usr/lib/debug/boot/kernel/kernel.debug
>
> /usr/src/sys/kern/subr_kdb.c:443
> /usr/src/sys/kern/kern_shutdown.c:0
> /usr/src/sys/kern/kern_shutdown.c:843
> /usr/src/sys/amd64/amd64/pmap.c:0
> /usr/src/sys/vm/vm_map.c:0
> /usr/src/sys/vm/vm_map.c:0
> /usr/src/sys/vm/vm_kern.c:712
> /usr/src/sys/kern/subr_vmem.c:928
> /usr/src/sys/kern/subr_vmem.c:0
> /usr/src/sys/kern/subr_vmem.c:1350
> /usr/src/sys/vm/vm_kern.c:150
> /usr/src/sys/amd64/amd64/pmap.c:0
> /usr/src/sys/x86/x86/nexus.c:0
> /usr/src/sys/x86/x86/nexus.c:449
> /usr/src/sys/x86/x86/nexus.c:412
> /usr/src/sys/kern/subr_bus.c:4620
> /usr/src/sys/x86/xen/xenpv.c:123
> /usr/src/sys/dev/xen/blkback/blkback.c:3010
>
> With a gdb backtrace I think I can get a better trace, though:
>
> #0 __curthread at /usr/src/sys/amd64/include/pcpu_aux.h:55
> #1 doadump at /usr/src/sys/kern/kern_shutdown.c:399
> #2 kern_reboot at /usr/src/sys/kern/kern_shutdown.c:486
> #3 vpanic at /usr/src/sys/kern/kern_shutdown.c:919
> #4 panic at /usr/src/sys/kern/kern_shutdown.c:843
> #5 pmap_growkernel at /usr/src/sys/amd64/amd64/pmap.c:208
> #6 vm_map_insert at /usr/src/sys/vm/vm_map.c:1752
> #7 vm_map_find at /usr/src/sys/vm/vm_map.c:2259
> #8 kva_import at /usr/src/sys/vm/vm_kern.c:712
> #9 vmem_import at /usr/src/sys/kern/subr_vmem.c:928
> #10 vmem_try_fetch at /usr/src/sys/kern/subr_vmem.c:1049
> #11 vmem_xalloc at /usr/src/sys/kern/subr_vmem.c:1449
> #12 vmem_alloc at /usr/src/sys/kern/subr_vmem.c:1350
> #13 kva_alloc at /usr/src/sys/vm/vm_kern.c:150
> #14 pmap_mapdev_internal at /usr/src/sys/amd64/amd64/pmap.c:8974
> #15 pmap_mapdev_attr at /usr/src/sys/amd64/amd64/pmap.c:8990
> #16 nexus_map_resource at /usr/src/sys/x86/x86/nexus.c:523
> #17 nexus_activate_resource at /usr/src/sys/x86/x86/nexus.c:448
> #18 nexus_alloc_resource at /usr/src/sys/x86/x86/nexus.c:412
> #19 BUS_ALLOC_RESOURCE at ./bus_if.h:321
> #20 bus_alloc_resource at /usr/src/sys/kern/subr_bus.c:4617
> #21 xenpv_alloc_physmem at /usr/src/sys/x86/xen/xenpv.c:121
> #22 xbb_alloc_communication_mem at /usr/src/sys/dev/xen/blkback/blkback.c:3010
> #23 xbb_connect at /usr/src/sys/dev/xen/blkback/blkback.c:3336
> #24 xenbusb_back_otherend_changed at /usr/src/sys/xen/xenbus/xenbusb_back.c:228
> #25 xenwatch_thread at /usr/src/sys/dev/xen/xenstore/xenstore.c:1003
> #26 fork_exit at /usr/src/sys/kern/kern_fork.c:1069
> #27 <signal handler called>
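(For reference, the sed fragment quoted above is only the tail of the
pipeline; a complete invocation, assuming the serial backtrace was
saved to a file named backtrace.txt, would look like:

    # resolve each frame's return address to file:line
    grep '^#' backtrace.txt \
        | sed -Ee 's/^#[0-9]* //' -e 's/ .*//' \
        | xargs addr2line -e /usr/lib/debug/boot/kernel/kernel.debug

addr2line needs the kernel.debug that matches the running kernel,
otherwise the resolved lines will be off.)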
> There is some sort of mismatch in the info, because the panic message
> printed "panic: pmap_growkernel: no memory to grow kernel", but the
> gdb backtrace in #5, 0xffffffff81073eed in pmap_growkernel at
> /usr/src/sys/amd64/amd64/pmap.c:208, leads to these lines:
>
>   switch (pmap->pm_type) {
>   ..
>   panic("pmap_valid_bit: invalid pm_type %d", pmap->pm_type)
>
> So either the trace is off the mark, or the message in the serial
> logs is. If this was only memleak related, then it should not happen
> when dom id 1 is started, I suppose.

That's weird. I would rather trust the printed panic message than the
symbol resolution.

It seems to be some kind of memory exhaustion, as the kernel is
failing to allocate a page for use in the kernel page table. I will
try to see what can be done here.

Thanks, Roger.
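PS: since pmap_growkernel fires while growing the kernel page tables,
it might also help to sample the kernel VA sysctls next to the
periodic vmstat dumps; a sketch, assuming vm.kvm_size and vm.kvm_free
are exposed by the amd64 kernel:

    # snapshot kernel virtual address space usage
    sysctl vm.kvm_size vm.kvm_free

If vm.kvm_free keeps shrinking across the stress run, that would fit
the exhaustion theory.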