Kernel panics when given a high workload
Date: Mon, 13 Feb 2023 17:15:23 UTC
Hello! Good Monday to the whole FreeBSD community.

I am working on the FreeBSD kernel for my university thesis. The idea is to change the FreeBSD short-term scheduler so that all of its operations are based on Petri nets. We already have a model of the scheduler and the first tests running.

I have run into a problem that has left me out of ideas. Seemingly at random, the kernel hits a page fault, panics, and reboots the OS. My thesis partner and I tried to work out when this happens, but we could not find a pattern that reproduces it. We see it mostly when the processor is heavily loaded, but even then only in some simulations. Below is the information we collected from the logs; any help is appreciated.

Code:
uname -a
FreeBSD pielihueso 13.1-RELEASE FreeBSD 13.1-RELEASE DrudiGoldmanPI/update_petriNetScheduler-13.1.0-n250157-cb2e622cf22d PI_KERNELCONF amd64

Differences between PI_KERNELCONF and GENERIC:

1. We are working on the 4BSD scheduler instead of ULE:

Code:
# options SCHED_ULE # ULE scheduler
options SCHED_4BSD # 4BSD scheduler

2. We added some debugger options:

Code:
options DDB
options GDB
options KDB_UNATTENDED

-------

Code:
cd /usr/obj/usr/src/amd64.amd64/sys/PI_KERNELCONF/
kgdb kernel.debug /var/crash/vmcore.last
GNU gdb (GDB) 12.1 [GDB v12.1 for FreeBSD]
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0xffffffffffffffa8
fault code = supervisor write data, page not present
instruction pointer = 0x20:0xffffffff80ca0822
stack pointer = 0x28:0xfffffe00cd879b60
frame pointer = 0x28:0x0
code segment = base 0x0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 922 (sshd)
trap number = 12
panic: page fault
cpuid = 1
time = 1676286448
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00cd879920
vpanic() at vpanic+0x17f/frame 0xfffffe00cd879970
panic() at panic+0x43/frame 0xfffffe00cd8799d0
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00cd879a30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00cd879a90
calltrap() at calltrap+0x8/frame 0xfffffe00cd879a90
--- trap 0xc, rip = 0xffffffff80ca0822, rsp = 0xfffffe00cd879b60, rbp = 0 ---
kern_select() at kern_select+0x942
Uptime: 34s
Dumping 371 out of 8085 MB:..5%..13%..22%..31%..44%..52%..61%..74%..82%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) where
#0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c2f521 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487
#3 0xffffffff80c2f99e in vpanic (fmt=0xffffffff811dfeea "%s", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4 0xffffffff80c2f7a3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:844
#5 0xffffffff810d7855 in trap_fatal (frame=0xfffffe00cd879aa0, eva=18446744073709551528) at /usr/src/sys/amd64/amd64/trap.c:944
#6 0xffffffff810d78af in trap_pfault (frame=0xfffffe00cd879aa0, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:763
#7 <signal handler called>
#8 0xffffffff80ca0822 in selrescan (td=<error reading variable: Cannot access memory at address 0xffffffffffffffd0>, ibits=<optimized out>, obits=<optimized out>) at /usr/src/sys/kern/sys_generic.c:1325
#9 kern_select (td=<optimized out>, nd=<error reading variable: Cannot access memory at address 0xffffffffffffff90>, fd_in=<optimized out>, fd_ou=<optimized out>, fd_ex=<optimized out>, tvp=<optimized out>, abi_nfdbits=<error reading variable: Cannot access memory at address 0x10>) at /usr/src/sys/kern/sys_generic.c:1206
Backtrace stopped: Cannot access memory at address 0x8

Code:
# nm -n /boot/kernel/kernel | grep 0xffffffff80ca0822
# nm -n /boot/kernel/kernel | grep 0xffffffff80ca0822
# nm -n /boot/kernel/kernel | grep 0xffffffff80c
# nm -n /boot/kernel/kernel | grep 0xffffffff
# nm -n /boot/kernel/kernel | grep 0xfffff
# nm -n /boot/kernel/kernel | grep 0xff
# nm -n /boot/kernel/kernel | grep 0x
ffffffff80388c30 t cam_compat_handle_0x17
ffffffff803891e0 t cam_compat_handle_0x18
ffffffff803895f0 t cam_compat_handle_0x19
ffffffff80389730 t cam_compat_translate_dev_match_0x18
ffffffff80aba6a0 t xl_check_maddr_90xB
ffffffff80aba6f0 t xl_check_maddr_90x
ffffffff80abad90 t xl_txeof_90xB
ffffffff80abb090 t xl_start_90xB_locked
ffffffff80ebac80 t mlx5e_fec_mask_10x_25x_handler
ffffffff80ebb050 t mlx5e_fec_avail_10x_25x_handler
ffffffff80ebb0f0 t mlx5e_fec_mask_50x_handler
ffffffff80ebb4e0 t mlx5e_fec_avail_50x_handler
ffffffff810af6c0 T Xint0x80_syscall_pti
ffffffff810af740 T Xint0x80_syscall
ffffffff810af743 t int0x80_syscall_common
ffffffff817fe180 r db_inst_0f0x
ffffffff8180cc50 r mouse10x14_120
ffffffff8180cd40 r mouse10x16_50
ffffffff8180cd90 r mouse10x16_75
ffffffff8180cde0 r mouse10x16_90
ffffffff8180ce30 r mouse10x16_100
ffffffff8180ce80 r mouse10x16_120
ffffffff8180ced0 r mouse10x16_133

We also tried DTrace, but with no luck. Do you have any other recommendations for how we can keep debugging this issue? We know it is something we broke in the scheduler, because the GENERIC kernel works fine.

P.S.: If anyone is interested in how we implemented the Petri net for the scheduler, contact me by email and I can share the paper we are working on.
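
A note on the nm lookup above: nm prints addresses without a "0x" prefix and lists only the address at which each symbol starts, so grepping for the faulting instruction pointer cannot match (the only hits were symbols that happen to contain "0x" in their names). The usual approach is to find the nearest symbol at or below the address. The following is a minimal sketch of that idea in Python, assuming input in the three-column `nm -n` format shown above; the helper names `parse_nm` and `resolve` are made up for this illustration, and the sample rows are copied from the output above.

```python
import bisect

def parse_nm(lines):
    """Parse `nm -n` style rows ("<hex addr> <type> <name>") into sorted (addr, name) pairs."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # e.g. undefined symbols, which have no address column
        addr, _type, name = parts
        try:
            symbols.append((int(addr, 16), name))
        except ValueError:
            continue  # first column was not a hex address
    symbols.sort()
    return symbols

def resolve(symbols, address):
    """Return (name, offset) for the nearest symbol at or below `address`, or None."""
    starts = [addr for addr, _ in symbols]
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return None  # address lies below every known symbol
    base, name = symbols[i]
    return name, address - base

# Sample rows copied from the nm output in this post.
SAMPLE = """\
ffffffff810af6c0 T Xint0x80_syscall_pti
ffffffff810af740 T Xint0x80_syscall
ffffffff810af743 t int0x80_syscall_common
""".splitlines()

if __name__ == "__main__":
    table = parse_nm(SAMPLE)
    # One byte past the start of Xint0x80_syscall.
    print(resolve(table, 0xffffffff810af741))
```

This only reports the closest preceding symbol; without symbol sizes it cannot tell whether the address really lies inside that function. For this crash the lookup is not strictly needed, since the backtrace already gives kern_select+0x942, and inside kgdb `info symbol 0xffffffff80ca0822` or `list *0xffffffff80ca0822` should give the same answer along with the source line.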