[Bug 261059] Kernel panic XEN + ZFS volume.
- In reply to: bugzilla-noreply_a_freebsd.org: "[Bug 261059] Kernel panic XEN + ZFS volume."
Date: Tue, 01 Mar 2022 09:46:26 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261059

--- Comment #1 from Janis <zedupsys@gmail.com> ---
I have been digging further into this bug. I found one ZFS problem that I can
reproduce 100% of the time and reported it as bug #262189, though to me it
seems these two bugs might not be related. What I have found is that giving
Dom0 less RAM makes the system panic sooner, so I assume the ZFS stress
script just helped fill memory faster.

Another thing I did was install FreeBSD on a separate UFS disk, with the ZFS
pool on another disk. The system still crashes, but this makes it easier to
try out different combinations. My latest Xen command line parameters are:

xen_cmdline="dom0_mem=2048M cpufreq=dom0-kernel dom0_max_vcpus=2 dom0=pvh,verbose=1 console=vga,com1 com1=9600,8n1 guest_loglvl=all loglvl=all sync_console=1 reboot=no"

With these it seems I can now see more verbose panic messages on the serial
output.

While investigating I have noticed a few things that give me the suspicion
that there is not just a single bug but multiple ones, triggered at different
times. Sometimes the system crashes not when I create DomU instances but when
I destroy them; sometimes, after all DomUs have been destroyed, the system
crashes on the init 0 call.

1. While stressing ZFS, at some point I get messages like these on the
console:

xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
(XEN) d2v0: upcall vector 93
xbbd2: Error 12 Unable to allocate request bounce buffers
xbbd2: Fatal error. Transitioning to Closing State
xbbd5: Error 12 Unable to allocate request bounce buffers
xbbd5: Fatal error. Transitioning to Closing State
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
Mar 1 10:31:55 lab-01 kernel: pid 1117 (qemu-system-i386), jid 0, uid 0, was killed: out of swap space
Mar 1 10:32:59 lab-01 kernel: pid 1264 (qemu-system-i386), jid 0, uid 0, was killed: out of swap space
Mar 1 10:33:06 lab-01 kernel: pid 1060 (zsh), jid 0, uid 0, was killed: out of swap space
Mar 1 10:33:11 lab-01 kernel: pid 1053 (zsh), jid 0, uid 0, was killed: out of swap space

This seems somewhat odd to me; could it be a sign of a memory leak, i.e. that
some resources are not cleaned up after a DomU is destroyed? All the scripts
do is start a DomU, write some data to its disk, and stop the DomU.
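In essence each stress script is just a loop like the following (an
illustrative sketch, not my exact script; the domain name and the sleep time
are placeholders):

    #!/bin/sh
    # Repeatedly create a ZVOL-backed DomU, let it write data, destroy it.
    i=0
    while [ "$i" -lt 100 ]; do
        xl create /service/crash/cfg/xen-vm2-zvol-5.conf
        sleep 60            # give the guest time to boot and write to disk
        xl destroy xen-vm2  # placeholder: the domain name from the config
        i=$((i + 1))
    done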
2. During domain creation I sometimes get an error like this:

Parsing config from /service/crash/cfg/xen-vm2-zvol-5.conf
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51712
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51728
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51744
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51760
libxl: error: libxl_device.c:1111:device_backend_callback: Domain 9:unable to add device with path /local/domain/0/backend/vbd/9/51776
libxl: error: libxl_create.c:1613:domcreate_launch_dm: Domain 9:unable to add disk devices
libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain 9:Non-existant domain
libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain 9:Unable to destroy guest
libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain 9:Destruction of domain failed

Is there a way to learn more about why Dom0 was "unable to add device with
path", perhaps with more verbosity? Was it that ZFS held some locks, or that
a previous DomU was still holding the same ZVOL?

3. I am following the instructions at
https://docs.freebsd.org/en/books/handbook/virtualization/#virtualization-host-xen,
and it seems that the command:

echo 'vm.max_wired=-1' >> /etc/sysctl.conf

is obsolete, because FreeBSD 13.0 has no such sysctl knob ("sysctl: unknown
oid 'vm.max_wired'"), and I do not know which knob is its equivalent. I found
vm.max_user_wired; is setting "vm.max_user_wired=-1" the same thing? Maybe
the Handbook should be updated. Even with this set to -1, qemu-system is
still killed with the out-of-swap-space error. Maybe there is a different
sysctl for that purpose now?
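For the record, this is what I am trying now; treating -vvv as the way to get
extra libxl verbosity and vm.max_user_wired as the successor of vm.max_wired
are my best guesses, not something I have confirmed:

    # Create the guest with increased libxl verbosity (repeat -v for more):
    xl -vvv create /service/crash/cfg/xen-vm2-zvol-5.conf

    # vm.max_wired is gone in 13.0; vm.max_user_wired looks like its successor:
    sysctl vm.max_user_wired=-1
    echo 'vm.max_user_wired=-1' >> /etc/sysctl.conf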
At one point I got an error I had not seen before, but I do not remember what
the system state was or what I had done. It is as follows:

xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status
xnb(xnb_ring2pkt:1526): Unknown extra info type 255. Discarding packet
xnb(xnb_dump_txreq:299): netif_tx_request index =0
xnb(xnb_dump_txreq:300): netif_tx_request.gref =0
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =8
xnb(xnb_dump_txreq:303): netif_tx_request.id =69
xnb(xnb_dump_txreq:304): netif_tx_request.size =1000
xnb(xnb_dump_txreq:299): netif_tx_request index =1
xnb(xnb_dump_txreq:300): netif_tx_request.gref =255
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =0
xnb(xnb_dump_txreq:303): netif_tx_request.id =0
xnb(xnb_dump_txreq:304): netif_tx_request.size =0
xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status
xnb(xnb_ring2pkt:1526): Unknown extra info type 255. Discarding packet
xnb(xnb_dump_txreq:299): netif_tx_request index =0
xnb(xnb_dump_txreq:300): netif_tx_request.gref =0
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =8
xnb(xnb_dump_txreq:303): netif_tx_request.id =69
xnb(xnb_dump_txreq:304): netif_tx_request.size =1000
xnb(xnb_dump_txreq:299): netif_tx_request index =1
xnb(xnb_dump_txreq:300): netif_tx_request.gref =255
xnb(xnb_dump_txreq:301): netif_tx_request.offset=0
xnb(xnb_dump_txreq:302): netif_tx_request.flags =0
xnb(xnb_dump_txreq:303): netif_tx_request.id =0
xnb(xnb_dump_txreq:304): netif_tx_request.size =0
xnb(xnb_rxpkt2rsp:2059): Got error -1 for hypervisor gnttab_copy status

4. Finally, thanks to the better Xen flags, I now get full output for the
panics:

(XEN) d1v0: upcall vector 93
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
(XEN) d2v0: upcall vector 93
xbbd2: Error 12 Unable to allocate request bounce buffers
xbbd2: Fatal error. Transitioning to Closing State
xbbd5: Error 12 Unable to allocate request bounce buffers
xbbd5: Fatal error. Transitioning to Closing State
xnb(xnb_frontend_changed:1391): frontend_state=Connected, xnb_state=InitWait
xnb(xnb_connect_comms:787): rings connected!
panic: pmap_growkernel: no memory to grow kernel
cpuid = 0
time = 1646123072
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff81073eed at pmap_growkernel+0x27d
#4 0xffffffff80f2dae8 at vm_map_insert+0x248
#5 0xffffffff80f30249 at vm_map_find+0x549
#6 0xffffffff80f2bf76 at kmem_init+0x226
#7 0xffffffff80c73341 at vmem_xalloc+0xcb1
#8 0xffffffff80c72c3b at vmem_xalloc+0x5ab
#9 0xffffffff80f2bfce at kmem_init+0x27e
#10 0xffffffff80c73341 at vmem_xalloc+0xcb1
#11 0xffffffff80c72c3b at vmem_xalloc+0x5ab
#12 0xffffffff80c72646 at vmem_alloc+0x46
#13 0xffffffff80f2b616 at kmem_malloc_domainset+0x96
#14 0xffffffff80f21a2a at uma_prealloc+0x23a
#15 0xffffffff80f235de at sysctl_handle_uma_zone_cur+0xe2e
#16 0xffffffff80f1f6af at uma_set_align+0x8f
#17 0xffffffff82463362 at abd_borrow_buf_copy+0x22
Uptime: 4m9s

Here I do not quite understand why pmap_growkernel should panic when it is
out of memory. Couldn't it just report that the DomU could not be created
because memory is exhausted? I do not know the internals, so forgive me if
this question seems foolish.
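One experiment I still want to try, on the theory that with dom0_mem=2048M
the ZFS ARC plus the blkback bounce buffers simply exhaust Dom0's kernel
memory, is capping the ARC (the 512M value is an arbitrary guess, not a
tested recommendation):

    # /boot/loader.conf: limit the ZFS ARC so it cannot eat most of dom0_mem
    vfs.zfs.arc_max="512M"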
Two other panics I have captured look like this:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x22710028
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80c45892
stack pointer = 0x28:0xfffffe0096600930
frame pointer = 0x28:0xfffffe0096600930
code segment = base 0x0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 1496 (devmatch)
trap number = 12
panic: page fault
cpuid = 0
time = 1646123791
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff8108b1a7 at trap+0xbc7
#4 0xffffffff8108b1ff at trap+0xc1f
#5 0xffffffff8108a85d at trap+0x27d
#6 0xffffffff81061b18 at calltrap+0x8
#7 0xffffffff80c62011 at rman_is_region_manager+0x241
#8 0xffffffff80c1a051 at sbuf_new_for_sysctl+0x101
#9 0xffffffff80c1949c at kernel_sysctl+0x43c
#10 0xffffffff80c19b13 at userland_sysctl+0x173
#11 0xffffffff80c1995f at sys___sysctl+0x5f
#12 0xffffffff8108baac at amd64_syscall+0x10c
#13 0xffffffff8106243e at Xfast_syscall+0xfe

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address = 0x68
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff824a599d
stack pointer = 0x28:0xfffffe00b1e27910
frame pointer = 0x28:0xfffffe00b1e279b0
code segment = base 0x0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (xbbd7 taskq)
trap number = 12
panic: page fault
cpuid = 1
time = 1646122723
KDB: stack backtrace:
#0 0xffffffff80c57525 at kdb_backtrace+0x65
#1 0xffffffff80c09f01 at vpanic+0x181
#2 0xffffffff80c09d73 at panic+0x43
#3 0xffffffff8108b1a7 at trap+0xbc7
#4 0xffffffff8108b1ff at trap+0xc1f
#5 0xffffffff8108a85d at trap+0x27d
#6 0xffffffff81061b18 at calltrap+0x8
#7 0xffffffff8248935a at dmu_read+0x2a
#8 0xffffffff82456a3a at zvol_geom_bio_strategy+0x2aa
#9 0xffffffff80a7f214 at xbd_instance_create+0xa394
#10 0xffffffff80a7b1ea at xbd_instance_create+0x636a
#11 0xffffffff80c6b1c1 at taskqueue_run+0x2a1
#12 0xffffffff80c6c4dc at taskqueue_thread_loop+0xac
#13 0xffffffff80bc7e3e at fork_exit+0x7e
#14 0xffffffff81062b9e at fork_trampoline+0xe
Uptime: 1h44m10s

One of these panics happened on an init 0 call at some point (after all
DomUs had been destroyed); unfortunately I did not note down which one. The
version is still 13.0-RELEASE-p7.

--
You are receiving this mail because:
You are the assignee for the bug.