kernel: fatal trap 12 on CURRENT, when using WireGuard

From: Rainer Hurling <rhurlin_at_gwdg.de>
Date: Tue, 09 Jan 2024 20:23:54 UTC
I tried to update my 15.0-CURRENT box from n267335-499e84e16f56 to a 
very recent commit. The build and install went fine. After booting with 
the new base, I got a page fault with the following error:


Kernel page fault with the following non-sleepable locks held:
shared rm netlink lock (netlink lock) r = 0 (0xfffff8005fc8ca20) locked 
@ /usr/src/sys/netlink/netlink_domain.c:241
exclusive rw lle (lle) r = 0 (0xfffff801951dce90) locked @ 
/usr/src/sys/netinet/in.c:1716
stack backtrace:
#0 0xffffffff80bc6c45 at witness_debugger+0x65
#1 0xffffffff80bc7d89 at witness_warn+0x3e9
#2 0xffffffff81056b18 at trap_pfault+0x88
#3 0xffffffff81028708 at calltrap+0x8
#4 0xffffffff80dbd6a2 at nl_send_group+0x1d2
#5 0xffffffff80dc0e27 at _nlmsg_flush+0x37
#6 0xffffffff80dc4fdc at rtnl_lle_event+0x10c
#7 0xffffffff80d15e32 at arp_mark_lle_reachable+0xd2
#8 0xffffffff80d15b43 at arp_check_update_lle+0x293
#9 0xffffffff80d151c5 at arpintr+0xa65
#10 0xffffffff80caaaed at netisr_dispatch_src+0xad
#11 0xffffffff80c8d57a at ether_demux+0x17a
#12 0xffffffff80c8ec53 at ether_nh_input+0x403
#13 0xffffffff80caaaed at netisr_dispatch_src+0xad
#14 0xffffffff80c8d9c9 at ether_input+0xd9
#15 0xffffffff80ca66ac at iflib_rxeof+0xe4c
#16 0xffffffff80ca0b5a at _task_fn_rx+0x7a
#17 0xffffffff80ba0118 at gtaskqueue_run_locked+0xa8

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x30000
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff80dc0a10
stack pointer           = 0x28:0xfffffe006a3a8760
frame pointer           = 0x28:0xfffffe006a3a8790
code segment            = base 0x0, limit 0xfffff, type 0x1b
                         = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 0 (if_io_tqg_0)
rdi: fffffe006a3a8850 rsi: fffffe006a3a86f0 rdx: fffffe006a3a87b0
rcx: fffff80001f88740  r8: ffffffff83210090  r9: 0000000000000000
rax: 0000000000000000 rbx: 0000000000030000 rbp: fffffe006a3a8790
r10: 0000000000000001 r11: 0000000000000000 r12: fffff8005fc8ca00
r13: fffff8005fc8ca20 r14: fffffe006a3a8850 r15: 0000000000000000
trap number             = 12
panic: page fault
cpuid = 0
time = 1704824328
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 
0xfffffe006a3a8430
vpanic() at vpanic+0x131/frame 0xfffffe006a3a8560
panic() at panic+0x43/frame 0xfffffe006a3a85c0
trap_fatal() at trap_fatal+0x40f/frame 0xfffffe006a3a8620
trap_pfault() at trap_pfault+0xae/frame 0xfffffe006a3a8690
calltrap() at calltrap+0x8/frame 0xfffffe006a3a8690
--- trap 0xc, rip = 0xffffffff80dc0a10, rsp = 0xfffffe006a3a8760, rbp = 
0xfffffe006a3a8790 ---
nl_send_one() at nl_send_one+0x20/frame 0xfffffe006a3a8790
nl_send_group() at nl_send_group+0x1d2/frame 0xfffffe006a3a8820
_nlmsg_flush() at _nlmsg_flush+0x37/frame 0xfffffe006a3a8840
rtnl_lle_event() at rtnl_lle_event+0x10c/frame 0xfffffe006a3a88e0
arp_mark_lle_reachable() at arp_mark_lle_reachable+0xd2/frame 
0xfffffe006a3a8930
arp_check_update_lle() at arp_check_update_lle+0x293/frame 
0xfffffe006a3a8a00
arpintr() at arpintr+0xa65/frame 0xfffffe006a3a8b60
netisr_dispatch_src() at netisr_dispatch_src+0xad/frame 0xfffffe006a3a8bc8
ether_demux() at ether_demux+0x17a/frame 0xfffffe006a3a8bf0
ether_nh_input() at ether_nh_input+0x403/frame 0xfffffe006a3a8c40
netisr_dispatch_src() at netisr_dispatch_src+0xad/frame 0xfffffe006a3a8ca0
ether_input() at ether_input+0xd9/frame 0xfffffe006a3a8d00
iflib_rxeof() at iflib_rxeof+0xe4c/frame 0xfffffe006a3a8e00
_task_fn_rx() at _task_fn_rx+0x7a/frame 0xfffffe006a3a8e40
gtaskqueue_run_locked() at gtaskqueue_run_locked+0xa8/frame 
0xfffffe006a3a8ec0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xd3/frame 
0xfffffe006a3a8ef0
fork_exit() at fork_exit+0x82/frame 0xfffffe006a3a8f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe006a3a8f30
--- trap 0xf2b9f109, rip = 0x7afef8a176bef8a5, rsp = 0xddc963edd18963e9, 
rbp = 0x61f64fc36db64fc7 ---
KDB: enter: panic
[ thread pid 0 tid 100067 ]
Stopped at      kdb_enter+0x33: movq    $0,0xe3a582(%rip)
db>


Since the current process 'if_io_tqg_0' and problems with netlink are 
mentioned, I took a closer look at my network setup. I discovered that 
this page fault only occurs when a connection is established with 
WireGuard (wg-quick up wg0). Without WireGuard, the error does not 
occur.

I was able to narrow down the commit at which this behavior starts on my box:
- Up to commit main-n267347-660bd40a598a everything is fine.
- The two following commits n267348-67d9023f07a4 and 
n267349-0ad011ececb9 do not build on my box (module/netlink broken ...).
- From commit n267349-0ad011ececb9 (netlink) onwards this page fault 
occurs when WireGuard is started.
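
For reference, the revision strings above have the form 
n<commit count>-<short hash>, optionally prefixed with the branch name. 
The following is only a minimal sketch for comparing such strings; the 
helper name parse_rev and the regular expression are illustrative 
assumptions, not part of any FreeBSD tool. It shows that the last good 
and the first bad revision listed above are only two commits apart:

```python
import re

# Assumed layout: optional "<branch>-" prefix, then "n<count>-<hash>".
REV_RE = re.compile(
    r"^(?:(?P<branch>[\w./]+)-)?n(?P<count>\d+)-(?P<hash>[0-9a-f]{7,40})$"
)


def parse_rev(rev):
    """Split a revision string into (commit count, short hash).

    Hypothetical helper for illustration only.
    """
    m = REV_RE.match(rev)
    if m is None:
        raise ValueError("unrecognized revision string: %r" % rev)
    return int(m.group("count")), m.group("hash")


# Revisions taken from the list above.
good = parse_rev("main-n267347-660bd40a598a")  # last known good
bad = parse_rev("n267349-0ad011ececb9")        # page fault from here on

# Number of commits between the last good and the first bad revision.
print(bad[0] - good[0])
```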

Any help is greatly appreciated.
CC'ed Gleb Smirnoff due to the affected commits.

Regards,
Rainer Hurling