[Bug 283747] kernel panic after telegraf service restart

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 19 Mar 2025 19:17:23 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283747

--- Comment #41 from Matthew L. Dailey <matthew.l.dailey@dartmouth.edu> ---
Thanks for the explanation, Gleb. It sounds like this is definitely worth
fixing, regardless of whether it's the cause of this specific bug. :-)

FWIW, neither of my test servers (uptime ~8 days with telegraf running
constantly) showed any zombies:
# ps aux | grep Z | grep -v grep
USER   PID  %CPU %MEM     VSZ    RSS TT  STAT STARTED        TIME COMMAND

I was just now able to panic my hardware system (took just over 8 days) and
will send a core the same way I did before. This is obviously from the
unpatched kernel. The last frame before the panic is in crunusebatch() so this
looks promising.

Unread portion of the kernel message buffer:
[691684] panic: crunusebatch: ref -4294967294 not >= 0 on cred 0xfffff8011b43cb00
[691684] cpuid = 17
[691684] time = 1742410811
[691684] KDB: stack backtrace:
[691684] db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d3bfa870
[691684] vpanic() at vpanic+0x131/frame 0xfffffe00d3bfa9a0
[691684] panic() at panic+0x43/frame 0xfffffe00d3bfaa00
[691684] crunusebatch() at crunusebatch+0xfa/frame 0xfffffe00d3bfaa30
[691684] thread_reap_domain() at thread_reap_domain+0x28d/frame 0xfffffe00d3bfaae0
[691684] proc_reap() at proc_reap+0x660/frame 0xfffffe00d3bfab20
[691684] proc_to_reap() at proc_to_reap+0x3c4/frame 0xfffffe00d3bfab70
[691684] kern_wait6() at kern_wait6+0x1a6/frame 0xfffffe00d3bfac10
[691684] sys_wait4() at sys_wait4+0x6b/frame 0xfffffe00d3bfae00
[691684] amd64_syscall() at amd64_syscall+0x158/frame 0xfffffe00d3bfaf30
[691684] fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d3bfaf30
[691684] --- syscall (7, FreeBSD ELF64, wait4), rip = 0x2893da, rsp = 0x821093d18, rbp = 0x821093d80 ---
[691684] KDB: enter: panic
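
For readers who want to sanity-check the numbers in the log above, here is a minimal sketch that decodes them. It assumes the bracketed prefix is uptime in seconds and that "time =" is a Unix epoch timestamp; the variable names are illustrative only. It also shows that the reported ref equals 2 - 2^32, which is plain arithmetic and not a diagnosis of how the counter got there.

```python
from datetime import datetime, timezone

# Values copied from the panic output above.
UPTIME_SECONDS = 691684      # bracketed prefix (assumed: seconds since boot)
PANIC_EPOCH = 1742410811     # "time =" line (assumed: Unix epoch seconds)
REPORTED_REF = -4294967294   # ref value printed in the panic string

# Break the uptime into days, hours, minutes, and seconds.
days, rem = divmod(UPTIME_SECONDS, 86400)
hours, rem = divmod(rem, 3600)
minutes, seconds = divmod(rem, 60)

# Convert the epoch timestamp to a UTC wall-clock time.
panic_utc = datetime.fromtimestamp(PANIC_EPOCH, tz=timezone.utc)

# Offset of the reported ref from -2**32.
ref_plus_2_32 = REPORTED_REF + 2**32

print(f"uptime: {days}d {hours}h {minutes}m {seconds}s")
print(f"panic time: {panic_utc.isoformat()}")
print(f"ref + 2**32 = {ref_plus_2_32}")
```

Under those assumptions the uptime is consistent with the "just over 8 days" mentioned above, and the panic time falls shortly before this comment was posted.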

Restarted both test systems (hardware and VM) with patched DEBUG kernels and
will report back in 10 days or so.
