Re: ZFS + FreeBSD XEN dom0 panic

From: Ze Dupsys <zedupsys_at_gmail.com>
Date: Wed, 02 Mar 2022 17:26:18 UTC
Today I managed to crash the lab Dom0 with:
xen_cmdline="dom0_mem=6144M dom0_max_vcpus=2 dom0=pvh,verbose=1
console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all sync_console=1
reboot=no"

I ran 'vmstat -m | sort -k 2 -r' every 120 seconds; the latest output is as
in the attachment. The panic had the same fingerprint as the already
reported one containing the "rman_is_region_manager" line.

The scripts I ran in parallel were generally the same as those attached to
the bug report, just slightly modified:
1) ./libexec.sh zfs_volstress_fast_4g (this just creates new ZVOLs, but
writes 4GB into each created ZVOL with dd if=/dev/zero instead of 2GB)
2) ./test_vm1_zvol_3gb.sh (this loops commands: start first DomU, write
3GB into its /tmp, restart DomU, remove /tmp, repeat)
3) ./test_vm2_zvol_5_on_off.sh (this loops: start second DomU, which has 5
disks attached, turn the DomU off, repeat)
4) monitoring: sleep 120 seconds, print vmstat | sort to serial output.
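The second loop above, as a minimal sketch (the guest name, config path
and ssh target are illustrative, not the actual script contents):

```shell
#!/bin/sh
# Sketch of test_vm1_zvol_3gb.sh: start the DomU, fill 3GB in its /tmp,
# clean up, restart, repeat until something breaks.
CFG=/root/xen/vm1.cfg   # illustrative config path
GUEST=vm1               # illustrative guest name / ssh alias
while true; do
    xl create "$CFG"
    # 3072 blocks of 1m = 3GB written inside the guest's /tmp
    ssh "$GUEST" "dd if=/dev/zero of=/tmp/fill bs=1m count=3072; rm -f /tmp/fill"
    xl shutdown -w "$GUEST"
done
```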

Around dom id 108 the system started to behave suspiciously: xl list showed
the DomUs as created, but they did not really start up, the script timed
out waiting for the ssh connection, and there was no VNC. When I manually
ran xl destroy and then xl create, the system panicked.

I have log files of all serial output; if there is anything useful in them,
I can provide it. The on-disk log files seem to lose the latest messages
due to the crash.


On Wed, Mar 2, 2022 at 3:57 PM Roger Pau Monné <roger.pau@citrix.com> wrote:

> On Wed, Mar 02, 2022 at 10:57:37AM +0200, Ze Dupsys wrote:
> > Hello,
> >
> > I started using XEN on one pre-production machine (with the aim of
> > using it later in production) with 12.2, but since it experienced
> > random crashes I updated to 13.0 in the hope that the errors might
> > disappear.
> >
> > I do not know how much detail to write, so that this email is not too
> > long but still gives enough info.
> >
> > FreeBSD Dom0 is installed on ZFS, a somewhat basic install; IPFW with
> > rules for NATting is used. The zpool is composed of 2 mirrored disks.
> > There is a ZVOL with volmode=dev for each VM and each VM's jail, which
> > are attached as raw devices to the DomU. At the moment the DomUs
> > contain FreeBSD, some 12.0 to 13.0, UFS, with VNET jails, epairs all
> > bridged to the DomU's xn0 interface. On Dom0 I have bridge interfaces,
> > to which DomUs are connected depending on their "zone/network"; those
> > that have allowed outgoing connections are NATted by IPFW on a
> > specific physical NIC and IP.
>
> So from the traces on the ticket:
>
> panic: pmap_growkernel: no memory to grow kernel
> cpuid = 0
> time = 1646123072
> KDB: stack backtrace:
> #0 0xffffffff80c57525 at kdb_backtrace+0x65
> #1 0xffffffff80c09f01 at vpanic+0x181
> #2 0xffffffff80c09d73 at panic+0x43
> #3 0xffffffff81073eed at pmap_growkernel+0x27d
> #4 0xffffffff80f2dae8 at vm_map_insert+0x248
> #5 0xffffffff80f30249 at vm_map_find+0x549
> #6 0xffffffff80f2bf76 at kmem_init+0x226
> #7 0xffffffff80c73341 at vmem_xalloc+0xcb1
> #8 0xffffffff80c72c3b at vmem_xalloc+0x5ab
> #9 0xffffffff80f2bfce at kmem_init+0x27e
> #10 0xffffffff80c73341 at vmem_xalloc+0xcb1
> #11 0xffffffff80c72c3b at vmem_xalloc+0x5ab
> #12 0xffffffff80c72646 at vmem_alloc+0x46
> #13 0xffffffff80f2b616 at kmem_malloc_domainset+0x96
> #14 0xffffffff80f21a2a at uma_prealloc+0x23a
> #15 0xffffffff80f235de at sysctl_handle_uma_zone_cur+0xe2e
> #16 0xffffffff80f1f6af at uma_set_align+0x8f
> #17 0xffffffff82463362 at abd_borrow_buf_copy+0x22
> Uptime: 4m9s
>
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x22710028
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff80c45892
> stack pointer           = 0x28:0xfffffe0096600930
> frame pointer           = 0x28:0xfffffe0096600930
> code segment            = base rx0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 1496 (devmatch)
> trap number             = 12
> panic: page fault
> cpuid = 0
> time = 1646123791
> KDB: stack backtrace:
> #0 0xffffffff80c57525 at kdb_backtrace+0x65
> #1 0xffffffff80c09f01 at vpanic+0x181
> #2 0xffffffff80c09d73 at panic+0x43
> #3 0xffffffff8108b1a7 at trap+0xbc7
> #4 0xffffffff8108b1ff at trap+0xc1f
> #5 0xffffffff8108a85d at trap+0x27d
> #6 0xffffffff81061b18 at calltrap+0x8
> #7 0xffffffff80c62011 at rman_is_region_manager+0x241
> #8 0xffffffff80c1a051 at sbuf_new_for_sysctl+0x101
> #9 0xffffffff80c1949c at kernel_sysctl+0x43c
> #10 0xffffffff80c19b13 at userland_sysctl+0x173
> #11 0xffffffff80c1995f at sys___sysctl+0x5f
> #12 0xffffffff8108baac at amd64_syscall+0x10c
> #13 0xffffffff8106243e at Xfast_syscall+0xfe
>
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 1; apic id = 02
> fault virtual address   = 0x68
> fault code              = supervisor read data, page not present
> instruction pointer     = 0x20:0xffffffff824a599d
> stack pointer           = 0x28:0xfffffe00b1e27910
> frame pointer           = 0x28:0xfffffe00b1e279b0
> code segment            = base rx0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 0 (xbbd7 taskq)
> trap number             = 12
> panic: page fault
> cpuid = 1
> time = 1646122723
> KDB: stack backtrace:
> #0 0xffffffff80c57525 at kdb_backtrace+0x65
> #1 0xffffffff80c09f01 at vpanic+0x181
> #2 0xffffffff80c09d73 at panic+0x43
> #3 0xffffffff8108b1a7 at trap+0xbc7
> #4 0xffffffff8108b1ff at trap+0xc1f
> #5 0xffffffff8108a85d at trap+0x27d
> #6 0xffffffff81061b18 at calltrap+0x8
> #7 0xffffffff8248935a at dmu_read+0x2a
> #8 0xffffffff82456a3a at zvol_geom_bio_strategy+0x2aa
> #9 0xffffffff80a7f214 at xbd_instance_create+0xa394
> #10 0xffffffff80a7b1ea at xbd_instance_create+0x636a
> #11 0xffffffff80c6b1c1 at taskqueue_run+0x2a1
> #12 0xffffffff80c6c4dc at taskqueue_thread_loop+0xac
> #13 0xffffffff80bc7e3e at fork_exit+0x7e
> #14 0xffffffff81062b9e at fork_trampoline+0xe
> Uptime: 1h44m10s
>
> This all looks to me like an Out of Memory condition; can you check
> with `top` what's going on with your memory?
>
> Might also be helpful to record periodic calls to `vmstat -m | sort -k
> 2 -r` to try to figure out what's using so much memory.
>
> Regards, Roger.
>