Re: FreeBSD panics possibly caused by nfs clients

From: Matthew L. Dailey <Matthew.L.Dailey_at_dartmouth.edu>
Date: Wed, 06 Mar 2024 18:56:00 UTC
Posting a few updates on this issue.

I was able to induce a panic on a CURRENT kernel (20240215), built with 
GENERIC-KASAN and running kern.kstack_pages=6 (default) after ~189 
hours. The panic message and backtrace are below - please reach out 
directly if you'd like to have a look at the core. I'm specifically not 
increasing kstack_pages to see what effect this has on the panics.

I have another system running CURRENT (20240215) without any debugging. 
I'm able to regularly panic this (7 panics over two weeks with average 
uptime of ~42 hours) with kstack_pages=4. I set kstack_pages=6, and have 
also induced several panics. Oddly, this seems to happen more quickly (4 
panics over 2 days with average uptime of ~10.5 hours).

Another system running CURRENT (20240208), built with GENERIC-KASAN and 
running kern.kstack_pages=8 has now been running with our hdf5 workload 
non-stop since February 10th with no panics or issues.

 From all this, it seems like increasing kstack_pages to 8 eliminates 
the panics, but obviously doesn't fix the underlying cause of the 
issues. So, although this is an obvious workaround for our production 
system, it would be better to find and fix the underlying cause of the 
panics.

I'm going to continue testing to try to generate more cores with 
kstack_pages<8 on the system with the debug kernel.

Any other ideas to try to narrow down the cause are welcome.

Thanks,
Matt

[680940] Kernel page fault with the following non-sleepable locks held:
[680940] exclusive sleep mutex nfs_state_mutex (nfs_state_mutex) r = 0 
(0xffffffff830498e0) locked @ /usr/src/sys/fs/nfsserver/nfs_nfsdstate.c:6652
[680940] stack backtrace:
[680940] #0 0xffffffff8127958f at witness_debugger+0x13f
[680940] #1 0xffffffff8127b114 at witness_warn+0x674
[680940] #2 0xffffffff81aba0a6 at trap_pfault+0x116
[680940] #3 0xffffffff81ab901c at trap+0x54c
[680940] #4 0xffffffff81a75988 at calltrap+0x8
[680940] #5 0xffffffff80fb4bfa at nfsrv_freestateid+0x23a
[680940] #6 0xffffffff80fd5e3f at nfsrvd_freestateid+0x1df
[680940] #7 0xffffffff80f98d35 at nfsrvd_dorpc+0x2585
[680940] #8 0xffffffff80fbf588 at nfssvc_program+0x1078
[680940] #9 0xffffffff8173fce6 at svc_run_internal+0x1706
[680940] #10 0xffffffff8174094b at svc_thread_start+0xb
[680940] #11 0xffffffff811137a3 at fork_exit+0xa3
[680940] #12 0xffffffff81a769ee at fork_trampoline+0xe
[680940]
[680940]
[680940] Fatal trap 12: page fault while in kernel mode
[680940] cpuid = 3; apic id = 06
[680940] fault virtual address	= 0x7
[680940] fault code		= supervisor read data, page not present
[680940] instruction pointer	= 0x20:0xffffffff80fafd67
[680940] stack pointer	        = 0x28:0xfffffe0153ba2de0
[680940] frame pointer	        = 0x28:0xfffffe0153ba2eb0
[680940] code segment		= base 0x0, limit 0xfffff, type 0x1b
[680940] 			= DPL 0, pres 1, long 1, def32 0, gran 1
[680940] processor eflags	= interrupt enabled, resume, IOPL = 0
[680940] current process		= 55202 (nfsd: service)
[680940] rdi: 0000000000000007 rsi: 0000000000000000 rdx: dffff7c000000000
[680940] rcx: fffffe001b9ec1e8  r8: 0012c43500000002  r9: 0012c43500000002
[680940] rax: fffffe001b9ec1e8 rbx: ffffffffffffffff rbp: fffffe0153ba2eb0
[680940] r10: 0000000000000004 r11: 0000000000000006 r12: 0000000000000007
[680940] r13: fffffe019cd75700 r14: 0000000000001a1a r15: fffffe019cd75708
[680940] trap number		= 12
[680940] panic: page fault
[680940] cpuid = 3
[680940] time = 1709646178
[680940] KDB: stack backtrace:
[680940] db_trace_self_wrapper() at db_trace_self_wrapper+0xa5/frame 
0xfffffe0153ba2550
[680940] kdb_backtrace() at kdb_backtrace+0xc6/frame 0xfffffe0153ba26b0
[680940] vpanic() at vpanic+0x210/frame 0xfffffe0153ba2850
[680940] panic() at panic+0xb5/frame 0xfffffe0153ba2910
[680940] trap_fatal() at trap_fatal+0x65b/frame 0xfffffe0153ba2a10
[680940] trap_pfault() at trap_pfault+0x12b/frame 0xfffffe0153ba2b30
[680940] trap() at trap+0x54c/frame 0xfffffe0153ba2d10
[680940] calltrap() at calltrap+0x8/frame 0xfffffe0153ba2d10
[680940] --- trap 0xc, rip = 0xffffffff80fafd67, rsp = 
0xfffffe0153ba2de0, rbp = 0xfffffe0153ba2eb0 ---
[680940] nfsrv_freelockowner() at nfsrv_freelockowner+0x97/frame 
0xfffffe0153ba2eb0
[680940] nfsrv_freestateid() at nfsrv_freestateid+0x23a/frame 
0xfffffe0153ba2f70
[680940] nfsrvd_freestateid() at nfsrvd_freestateid+0x1df/frame 
0xfffffe0153ba3030
[680940] nfsrvd_dorpc() at nfsrvd_dorpc+0x2585/frame 0xfffffe0153ba3570
[680940] nfssvc_program() at nfssvc_program+0x1078/frame 0xfffffe0153ba3970
[680940] svc_run_internal() at svc_run_internal+0x1706/frame 
0xfffffe0153ba3ee0
[680940] svc_thread_start() at svc_thread_start+0xb/frame 0xfffffe0153ba3ef0
[680940] fork_exit() at fork_exit+0xa3/frame 0xfffffe0153ba3f30
[680940] fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0153ba3f30
[680940] --- trap 0xc, rip = 0x3b4ff896f0da, rsp = 0x3b4ff6a500e8, rbp = 
0x3b4ff6a50380 ---
[680940] KDB: enter: panic

#0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:57
#1  doadump (textdump=textdump@entry=0)
     at /usr/src/sys/kern/kern_shutdown.c:403
#2  0xffffffff805a52f6 in db_dump (dummy=<optimized out>,
     dummy2=<optimized out>, dummy3=<optimized out>, dummy4=<optimized out>)
     at /usr/src/sys/ddb/db_command.c:590
#3  0xffffffff805a512b in db_command (last_cmdp=<optimized out>,
     cmd_table=<optimized out>, dopager=true)
     at /usr/src/sys/ddb/db_command.c:503
#4  0xffffffff805a4b3d in db_command_loop ()
     at /usr/src/sys/ddb/db_command.c:550
#5  0xffffffff805aa209 in db_trap (type=<optimized out>, code=<optimized 
out>)
     at /usr/src/sys/ddb/db_main.c:267
#6  0xffffffff81239f25 in kdb_trap (type=3, code=code@entry=0,
     tf=tf@entry=0xfffffe0153ba25f0) at /usr/src/sys/kern/subr_kdb.c:790
#7  0xffffffff81ab8f88 in trap (frame=0xfffffe0153ba25f0)
     at /usr/src/sys/amd64/amd64/trap.c:606
#8  <signal handler called>
#9  kdb_enter (why=<optimized out>, msg=<optimized out>)
     at /usr/src/sys/kern/subr_kdb.c:556
#10 0xffffffff8118fdf1 in vpanic (fmt=fmt@entry=0xffffffff82491d20 <str> 
"%s",
     ap=ap@entry=0xfffffe0153ba28c0) at 
/usr/src/sys/kern/kern_shutdown.c:961
#11 0xffffffff8118fba5 in panic (fmt=0xffffffff82491d20 <str> "%s")
     at /usr/src/sys/kern/kern_shutdown.c:889
#12 0xffffffff81ab9f8b in trap_fatal (frame=0xfffffe0153ba2d20, eva=7)
     at /usr/src/sys/amd64/amd64/trap.c:950
#13 0xffffffff81aba0bb in trap_pfault (frame=frame@entry=0xfffffe0153ba2d20,
     usermode=false, signo=signo@entry=0x0, ucode=ucode@entry=0x0)
     at /usr/src/sys/amd64/amd64/trap.c:761
#14 0xffffffff81ab901c in trap (frame=0xfffffe0153ba2d20)
     at /usr/src/sys/amd64/amd64/trap.c:440
#15 <signal handler called>
#16 0xffffffff80fafd67 in nfsrv_freelockowner (
     stp=stp@entry=0xfffffe019cd75700, vp=vp@entry=0x0,
     cansleep=cansleep@entry=0, p=p@entry=0xfffffe0153e2c740)
     at /usr/src/sys/fs/nfsserver/nfs_nfsdstate.c:1550
#17 0xffffffff80fb4bfa in nfsrv_freestateid (nd=nd@entry=0xfffffe0153ba3710,
     stateidp=stateidp@entry=0xfffffe0153ba2fc0, 
p=p@entry=0xfffffe0153e2c740)
     at /usr/src/sys/fs/nfsserver/nfs_nfsdstate.c:6681
#18 0xffffffff80fd5e3f in nfsrvd_freestateid (nd=0xfffffe0153ba3710,
     isdgram=<optimized out>, vp=<optimized out>, exp=<optimized out>)
     at /usr/src/sys/fs/nfsserver/nfs_nfsdserv.c:4764
#19 0xffffffff80f98d35 in nfsrvd_compound (nd=0xfffffe0153ba3710, isdgram=0,
     tag=<optimized out>, taglen=0, minorvers=<optimized out>)
     at /usr/src/sys/fs/nfsserver/nfs_nfsdsocket.c:1338
#20 nfsrvd_dorpc (nd=nd@entry=0xfffffe0153ba3710, isdgram=isdgram@entry=0,
     tag=tag@entry=0xfffffe0153ba3670 "", taglen=taglen@entry=0,
     minorvers=<optimized out>)
     at /usr/src/sys/fs/nfsserver/nfs_nfsdsocket.c:633
#21 0xffffffff80fbf588 in nfs_proc (xid=<optimized out>,
     xprt=0xfffffe0199f0c600, nd=<optimized out>, rpp=<optimized out>)
     at /usr/src/sys/fs/nfsserver/nfs_nfsdkrpc.c:464
#22 nfssvc_program (rqst=0xfffffe00f3c07000, xprt=0xfffffe0199f0c600)
     at /usr/src/sys/fs/nfsserver/nfs_nfsdkrpc.c:348
#23 0xffffffff8173fce6 in svc_executereq (rqstp=0xfffffe00f3c07000)
     at /usr/src/sys/rpc/svc.c:1032
#24 svc_run_internal (grp=grp@entry=0xfffffe00f04ec100,
     ismaster=ismaster@entry=0) at /usr/src/sys/rpc/svc.c:1308
#25 0xffffffff8174094b in svc_thread_start (arg=0x7,
     arg@entry=0xfffffe00f04ec100) at /usr/src/sys/rpc/svc.c:1336
#26 0xffffffff811137a3 in fork_exit (
     callout=0xffffffff81740940 <svc_thread_start>, arg=0xfffffe00f04ec100,
     frame=0xfffffe0153ba3f40) at /usr/src/sys/kern/kern_fork.c:1157
#27 <signal handler called>
#28 0x00003b4ff896f0da in ?? ()