Kernel panics when given a high workload

From: Nicolas Goldman <nicolasgoldman07_at_gmail.com>
Date: Mon, 13 Feb 2023 17:15:23 UTC
Hello! Good Monday to the whole FreeBSD community.
I am working on the FreeBSD kernel for my university thesis. The idea is to
change the FreeBSD short-term scheduler so that all of its operations
are based on Petri nets. We already have a model of the scheduler and the
first tests running.

I am currently running into a problem that has left me out of ideas. At
seemingly random times, the kernel takes a page fault, panics, and reboots
the OS. My thesis partner and I tried to work out when this happens but
could not find a pattern that reproduces it. We see it mostly when the
processor is heavily loaded, but even then only in some simulations.

Below is the information we collected from the logs; any help is appreciated.

Code:

uname -a
FreeBSD pielihueso 13.1-RELEASE FreeBSD 13.1-RELEASE
DrudiGoldmanPI/update_petriNetScheduler-13.1.0-n250157-cb2e622cf22d
PI_KERNELCONF amd64


The differences between PI_KERNELCONF and GENERIC are:

1. We use the 4BSD scheduler instead of ULE:
Code:

# options     SCHED_ULE        # ULE scheduler
options     SCHED_4BSD        # 4BSD scheduler


2. We added some debugger options:
Code:

options        DDB
options        GDB
options        KDB_UNATTENDED

-------
Code:

cd /usr/obj/usr/src/amd64.amd64/sys/PI_KERNELCONF/
kgdb kernel.debug /var/crash/vmcore.last

GNU gdb (GDB) 12.1 [GDB v12.1 for FreeBSD]
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-portbld-freebsd13.1".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /boot/kernel/kernel...
Reading symbols from /usr/lib/debug//boot/kernel/kernel.debug...

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address    = 0xffffffffffffffa8
fault code        = supervisor write data, page not present
instruction pointer    = 0x20:0xffffffff80ca0822
stack pointer            = 0x28:0xfffffe00cd879b60
frame pointer            = 0x28:0x0
code segment        = base 0x0, limit 0xfffff, type 0x1b
            = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags    = interrupt enabled, resume, IOPL = 0
current process        = 922 (sshd)
trap number        = 12
panic: page fault
cpuid = 1
time = 1676286448
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00cd879920
vpanic() at vpanic+0x17f/frame 0xfffffe00cd879970
panic() at panic+0x43/frame 0xfffffe00cd8799d0
trap_fatal() at trap_fatal+0x385/frame 0xfffffe00cd879a30
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00cd879a90
calltrap() at calltrap+0x8/frame 0xfffffe00cd879a90
--- trap 0xc, rip = 0xffffffff80ca0822, rsp = 0xfffffe00cd879b60, rbp = 0 ---
kern_select() at kern_select+0x942
Uptime: 34s
Dumping 371 out of 8085 MB:..5%..13%..22%..31%..44%..52%..61%..74%..82%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55        __asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,

(kgdb) where


#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=1) at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80c2f521 in kern_reboot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:487
#3  0xffffffff80c2f99e in vpanic (fmt=0xffffffff811dfeea "%s", ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4  0xffffffff80c2f7a3 in panic (fmt=<unavailable>) at /usr/src/sys/kern/kern_shutdown.c:844
#5  0xffffffff810d7855 in trap_fatal (frame=0xfffffe00cd879aa0, eva=18446744073709551528) at /usr/src/sys/amd64/amd64/trap.c:944
#6  0xffffffff810d78af in trap_pfault (frame=0xfffffe00cd879aa0, usermode=false, signo=<optimized out>, ucode=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:763
#7  <signal handler called>
#8  0xffffffff80ca0822 in selrescan (td=<error reading variable: Cannot access memory at address 0xffffffffffffffd0>, ibits=<optimized out>, obits=<optimized out>) at /usr/src/sys/kern/sys_generic.c:1325
#9  kern_select (td=<optimized out>, nd=<error reading variable: Cannot access memory at address 0xffffffffffffff90>, fd_in=<optimized out>, fd_ou=<optimized out>, fd_ex=<optimized out>, tvp=<optimized out>, abi_nfdbits=<error reading variable: Cannot access memory at address 0x10>) at /usr/src/sys/kern/sys_generic.c:1206
Backtrace stopped: Cannot access memory at address 0x8
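
A note on the addresses in the trap and backtrace above: read as signed
64-bit values, the fault address and the two 0xffff... "Cannot access
memory" addresses are small negative numbers. Together with the trap frame
reporting rbp = 0, this would be consistent with frame-pointer-relative
accesses made while the frame pointer was zero, but that is only an
interpretation and we have not confirmed it. The arithmetic can be sketched
in Python (the helper name as_signed64 is made up for this illustration and
is not part of our kernel code):

```python
def as_signed64(addr: int) -> int:
    """Interpret an unsigned 64-bit address as a two's-complement signed value."""
    return addr - (1 << 64) if addr & (1 << 63) else addr


# Addresses copied from the panic message and the kgdb backtrace above.
ADDRESSES = {
    "fault virtual address": 0xffffffffffffffa8,
    "selrescan td": 0xffffffffffffffd0,
    "kern_select nd": 0xffffffffffffff90,
}


def signed_offsets(addresses):
    """Map each label to the signed interpretation of its address."""
    return {name: as_signed64(addr) for name, addr in addresses.items()}


if __name__ == "__main__":
    for name, off in signed_offsets(ADDRESSES).items():
        sign = "-" if off < 0 else ""
        print(f"{name}: {off} ({sign}{abs(off):#x})")
```

The eva value that kgdb prints in frame #5 (18446744073709551528) is the
same fault address in decimal, so it decodes to the same offset.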


Code:

# nm -n /boot/kernel/kernel | grep  0xffffffff80ca0822
# nm -n /boot/kernel/kernel | grep  0xffffffff80c
# nm -n /boot/kernel/kernel | grep  0xffffffff
# nm -n /boot/kernel/kernel | grep  0xfffff
# nm -n /boot/kernel/kernel | grep  0xff
# nm -n /boot/kernel/kernel | grep  0x

ffffffff80388c30 t cam_compat_handle_0x17
ffffffff803891e0 t cam_compat_handle_0x18
ffffffff803895f0 t cam_compat_handle_0x19
ffffffff80389730 t cam_compat_translate_dev_match_0x18
ffffffff80aba6a0 t xl_check_maddr_90xB
ffffffff80aba6f0 t xl_check_maddr_90x
ffffffff80abad90 t xl_txeof_90xB
ffffffff80abb090 t xl_start_90xB_locked
ffffffff80ebac80 t mlx5e_fec_mask_10x_25x_handler
ffffffff80ebb050 t mlx5e_fec_avail_10x_25x_handler
ffffffff80ebb0f0 t mlx5e_fec_mask_50x_handler
ffffffff80ebb4e0 t mlx5e_fec_avail_50x_handler
ffffffff810af6c0 T Xint0x80_syscall_pti
ffffffff810af740 T Xint0x80_syscall
ffffffff810af743 t int0x80_syscall_common
ffffffff817fe180 r db_inst_0f0x
ffffffff8180cc50 r mouse10x14_120
ffffffff8180cd40 r mouse10x16_50
ffffffff8180cd90 r mouse10x16_75
ffffffff8180cde0 r mouse10x16_90
ffffffff8180ce30 r mouse10x16_100
ffffffff8180ce80 r mouse10x16_120
ffffffff8180ced0 r mouse10x16_133


We also tried DTrace, without luck. Do you have any other recommendations
for how we can keep debugging this issue? We believe it is something we
broke in the scheduler, because the GENERIC kernel works fine.

P.S.: If anyone is interested in how we implemented the Petri net for the
scheduler, contact me by email and I can share the paper we are working on.