Re: ZFS + FreeBSD XEN dom0 panic
Date: Mon, 11 Apr 2022 08:47:50 UTC
On 2022.04.08. 18:02, Roger Pau Monné wrote:
> On Fri, Apr 08, 2022 at 10:45:12AM +0300, Ze Dupsys wrote:
>> On 2022.04.05. 18:22, Roger Pau Monné wrote:
>>> ..
>>> Thanks, sorry for the late reply, somehow the message slipped.
>>>
>>> I've been able to get the file:line for those, and the trace is kind
>>> of weird; I'm not sure I know what's going on, TBH. It seems to me the
>>> backend instance got freed while being in the process of connecting.
>>>
>>> I've made some changes that might mitigate this, but not having a
>>> clear understanding of what's going on makes this harder.
>>>
>>> I've pushed the changes to:
>>>
>>> http://xenbits.xen.org/gitweb/?p=people/royger/freebsd.git;a=shortlog;h=refs/heads/for-leak
>>>
>>> (This is on top of the main branch.)
>>>
>>> I'm also attaching the two patches to this email.
>>>
>>> Let me know if those make a difference to stabilize the system.
>>
>> Hi,
>>
>> Yes, it stabilizes the system, but there is still a memleak somewhere, I
>> think.
>>
>> The system could run tests for approximately 41 hours and did not panic,
>> but then started to OOM kill everything.
>>
>> I did not know how to git clone the given commit, so I just applied the
>> patches to the 13.0-RELEASE sources.
>>
>> The serial logs show nothing unusual, just that at some point the OOM
>> kills start.
>
> Well, I think that's good^W better than before. Thanks again for all
> the testing.
>
> It might be helpful now to start dumping `vmstat -m` periodically
> while running the stress tests. As there are (hopefully) no more
> panics now, vmstat might tell us which subsystem is hogging the
> memory. It's possible it's blkback (again).
>
> Thanks, Roger.

Yes, it certainly is better. I applied the patch on my pre-production
server and have not had any panic since then; still testing, though.

On my stressed lab server it's a bit of a different story. On occasion I
see a panic with the trace below on serial (I cannot reliably repeat it:
sometimes it happens upon starting dom ids 1 and 2, sometimes
mid-stress-test with dom id > 95).
panic: pmap_growkernel: no memory to grow kernel
cpuid = 2
time = 1649485133
KDB: stack backtrace:
#0 0xffffffff80c57385 at kdb_backtrace+0x65
#1 0xffffffff80c09d61 at vpanic+0x181
#2 0xffffffff80c09bd3 at panic+0x43
#3 0xffffffff81073eed at pmap_growkernel+0x27d
#4 0xffffffff80f2d918 at vm_map_insert+0x248
#5 0xffffffff80f30079 at vm_map_find+0x549
#6 0xffffffff80f2bda6 at kmem_init+0x226
#7 0xffffffff80c731a1 at vmem_xalloc+0xcb1
#8 0xffffffff80c72a9b at vmem_xalloc+0x5ab
#9 0xffffffff80c724a6 at vmem_alloc+0x46
#10 0xffffffff80f2ac6b at kva_alloc+0x2b
#11 0xffffffff8107f0eb at pmap_mapdev_attr+0x27b
#12 0xffffffff810588ca at nexus_add_irq+0x65a
#13 0xffffffff81058710 at nexus_add_irq+0x4a0
#14 0xffffffff810585b9 at nexus_add_irq+0x349
#15 0xffffffff80c495c1 at bus_alloc_resource+0xa1
#16 0xffffffff8105e940 at xenmem_free+0x1a0
#17 0xffffffff80a7e0dd at xbd_instance_create+0x943d

Piping those frames through

  sed -Ee 's/^#[0-9]* //' -e 's/ .*//' | xargs addr2line -e /usr/lib/debug/boot/kernel/kernel.debug

resolves them to:

/usr/src/sys/kern/subr_kdb.c:443
/usr/src/sys/kern/kern_shutdown.c:0
/usr/src/sys/kern/kern_shutdown.c:843
/usr/src/sys/amd64/amd64/pmap.c:0
/usr/src/sys/vm/vm_map.c:0
/usr/src/sys/vm/vm_map.c:0
/usr/src/sys/vm/vm_kern.c:712
/usr/src/sys/kern/subr_vmem.c:928
/usr/src/sys/kern/subr_vmem.c:0
/usr/src/sys/kern/subr_vmem.c:1350
/usr/src/sys/vm/vm_kern.c:150
/usr/src/sys/amd64/amd64/pmap.c:0
/usr/src/sys/x86/x86/nexus.c:0
/usr/src/sys/x86/x86/nexus.c:449
/usr/src/sys/x86/x86/nexus.c:412
/usr/src/sys/kern/subr_bus.c:4620
/usr/src/sys/x86/xen/xenpv.c:123
/usr/src/sys/dev/xen/blkback/blkback.c:3010

With a gdb backtrace I think I can get a better trace, though:

#0 __curthread at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump at /usr/src/sys/kern/kern_shutdown.c:399
#2 kern_reboot at /usr/src/sys/kern/kern_shutdown.c:486
#3 vpanic at /usr/src/sys/kern/kern_shutdown.c:919
#4 panic at /usr/src/sys/kern/kern_shutdown.c:843
#5 pmap_growkernel at /usr/src/sys/amd64/amd64/pmap.c:208
#6 vm_map_insert at /usr/src/sys/vm/vm_map.c:1752
#7 vm_map_find at /usr/src/sys/vm/vm_map.c:2259
#8 kva_import at /usr/src/sys/vm/vm_kern.c:712
#9 vmem_import at /usr/src/sys/kern/subr_vmem.c:928
#10 vmem_try_fetch at /usr/src/sys/kern/subr_vmem.c:1049
#11 vmem_xalloc at /usr/src/sys/kern/subr_vmem.c:1449
#12 vmem_alloc at /usr/src/sys/kern/subr_vmem.c:1350
#13 kva_alloc at /usr/src/sys/vm/vm_kern.c:150
#14 pmap_mapdev_internal at /usr/src/sys/amd64/amd64/pmap.c:8974
#15 pmap_mapdev_attr at /usr/src/sys/amd64/amd64/pmap.c:8990
#16 nexus_map_resource at /usr/src/sys/x86/x86/nexus.c:523
#17 nexus_activate_resource at /usr/src/sys/x86/x86/nexus.c:448
#18 nexus_alloc_resource at /usr/src/sys/x86/x86/nexus.c:412
#19 BUS_ALLOC_RESOURCE at ./bus_if.h:321
#20 bus_alloc_resource at /usr/src/sys/kern/subr_bus.c:4617
#21 xenpv_alloc_physmem at /usr/src/sys/x86/xen/xenpv.c:121
#22 xbb_alloc_communication_mem at /usr/src/sys/dev/xen/blkback/blkback.c:3010
#23 xbb_connect at /usr/src/sys/dev/xen/blkback/blkback.c:3336
#24 xenbusb_back_otherend_changed at /usr/src/sys/xen/xenbus/xenbusb_back.c:228
#25 xenwatch_thread at /usr/src/sys/dev/xen/xenstore/xenstore.c:1003
#26 fork_exit at /usr/src/sys/kern/kern_fork.c:1069
#27 <signal handler called>

There is some sort of mismatch in the info: the panic message printed
"panic: pmap_growkernel: no memory to grow kernel", but frame #5 of the
gdb backtrace, 0xffffffff81073eed in pmap_growkernel at
/usr/src/sys/amd64/amd64/pmap.c:208, leads to these lines:

  switch (pmap->pm_type) {
  ..
          panic("pmap_valid_bit: invalid pm_type %d", pmap->pm_type);
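One possible explanation worth checking (an assumption on my side, not
verified against this kernel build): pmap_valid_bit() is a small static
inline helper in pmap.c, so the compiler may well inline it (e.g. via
pmap_pdpe()) into pmap_growkernel(), in which case the line table for an
address inside pmap_growkernel() can legitimately point into the inlined
body. addr2line can unwind the inline chain for that one frame:

  addr2line -f -i -e /usr/lib/debug/boot/kernel/kernel.debug 0xffffffff81073eed

-f prints the function name for each level and -i also lists the inlined
call sites; if pmap_growkernel shows up as the outermost entry, the
serial message and the gdb line number would agree after all.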
panic("pmap_valid_bit: invalid pm_type %d", pmap->pm_type) So either trace is off the mark or message in serial logs. If this was only memleak related, then it should not happen when dom id 1 is started, i suppose. I am still gathering more info regarding memleak case, will inform when available. Thanks.