Major issues with NFSv4
J David
j.david.lists at gmail.com
Sun Dec 13 08:09:25 UTC 2020
The output of "procstat -kk -a" is extremely large, so I picked out what
looked like four processes that were having problems, ran "procstat -kk"
on all four in a loop for a couple of minutes, appending the output to a
file, and then ran sort | uniq on the file to find common traces.
353 samples were collected; 11 "<running>" samples had no kernel call
stack and were discarded.
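For reference, the sampling loop was roughly the following; the PIDs are
placeholders for the four processes I had picked out, and the one-second
interval is approximate:

    # Append the kernel stacks of the four suspect processes to a file
    # about once a second, for later sort | uniq -c analysis.
    while true; do
        for pid in 1111 2222 3333 4444; do
            procstat -kk $pid >> /tmp/stacks.txt
        done
        sleep 1
    done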
Of the remaining 342 stack samples, here's a frequency analysis of the
functions appearing in the traces (a sketch of the counting pipeline
follows the list):
451 VOP_LOOKUP_APV
310 kern_statat
309 fast_syscall_common
303 amd64_syscall
234 namei
233 lookup
226 null_lookup
225 freebsd11_stat
220 nfs_lookup
189 __mtx_lock_sleep
156 nfscl_nodeleg
155 mi_switch
136 vputx
134 VOP_INACTIVE_APV
134 vinactive
83 sys_fstatat
68 nfsrpc_close
68 ncl_inactive
67 turnstile_wait
66 vrecycle
66 VOP_RECLAIM_APV
66 vgonel
66 null_reclaim
66 null_inactive
55 VOP_GETATTR_APV
55 nfscl_request
55 newnfs_request
55 clnt_reconnect_call
54 clnt_vc_call
53 sleepq_timedwait
53 _sleep
41 nfsrpc_lookup
30 nfs_getattr
26 nfscl_mustflush
26 ncl_getattrcache
25 null_bypass
24 vn_stat
24 null_getattr
19 ast
14 VOP_LOCK1_APV
13 VOP_ACCESS_APV
13 vn_dir_check_exec
13 nfsrpc_accessrpc
13 nfs34_access_otw
13 nfs_access
13 critical_exit_preempt
12 Xipi_intr_bitmap_handler
12 ipi_bitmap_handler
12 doreti_ast
11 thread_lock_flags_
9 sigdeferstop_impl
8 nfscl_doclose
6 _vn_lock
5 null_lock
3 sleepq_wait
3 sleeplk
3 nfscl_deleggetmodtime
3 nfs_lock
2 VOP_READLINK_APV
2 VOP_ISLOCKED_APV
2 sigallowstop_impl
2 nfscl_getcl
2 lockmgr_xlock_hard
1 Xtimerint
1 vtnet_txq_mq_start_locked
1 vtnet_txq_mq_start
1 vget
1 tcp_usr_send
1 tcp_output
1 sosend_generic
1 sosend
1 null_nodeget
1 nfsrpc_getattr
1 nfscl_clientrelease
1 ncl_bioread
1 lockmgr_slock_hard
1 ip_output
1 ether_output_frame
1 ether_output
1 cache_lookup
(VOP_LOOKUP_APV appears twice in many traces: once for the nullfs vnode
and again when null_lookup descends to the underlying NFS vnode.)
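The per-function counts above came from a pipeline along these lines
(the file name is from the loop above, and the awk assumes the first
four fields of each line are the pid/tid/command/thread-name prefix):

    # Drop header and <running> lines, split each stack into frames,
    # strip the +0x... offsets, and count how often each function
    # appears across all samples.
    grep -vE '<running>|KSTACK' /tmp/stacks.txt |
        awk '{ for (i = 5; i <= NF; i++) print $i }' |
        sed 's/+0x.*//' |
        sort | uniq -c | sort -rn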
The most common trace (91 times, or over 1/4 of all observed) is:
__mtx_lock_sleep+0xf8 nfscl_nodeleg+0x207 nfs_lookup+0x314
VOP_LOOKUP_APV+0x75 null_lookup+0x98 VOP_LOOKUP_APV+0x75 lookup+0x451
namei+0x414 kern_statat+0x72 freebsd11_stat+0x30 amd64_syscall+0x387
fast_syscall_common+0xf8
This trace appears roughly uniformly (28/38/34) in 3 of the 4 processes.
The full set of traces (sorted by uniq -c) is here:
https://pastebin.com/HUqkeMri
(This message is already long enough!)
The columns stripped off the front of each line there are consistently:
(pid) (tid) python3 -
As it happens, these were all Python-based jobs. Python seems predisposed
to trigger this, but non-Python jobs trigger it as well. Heavy use of
stat() does seem to be a common element regardless of job type.
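For what it's worth, the stat()-heavy pattern is easy to confirm on a
live process with a syscall-count summary, e.g. (the pid is whichever
process looks stuck):

    # Attach to a suspect process and print per-syscall counts on
    # detach (Ctrl-C).
    truss -c -f -p <pid>

    # Or aggregate stat-family syscalls system-wide with dtrace.
    dtrace -n 'syscall::*stat*:entry { @[execname, probefunc] = count(); }'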
Here's the output of "nfsstat -E -c" taken 60 seconds after zeroing the
counters with -z:
Client Info:
RPC Counts:
      Getattr      Setattr       Lookup     Readlink         Read        Write
         1667          111         6376            0           42          153
       Create       Remove       Rename         Link      Symlink        Mkdir
          111            7           14            0            0            0
        Rmdir      Readdir     RdirPlus       Access        Mknod       Fsstat
            0            0            0         2620            0          160
       FSinfo     pathConf       Commit      SetClId    SetClIdCf         Lock
            0            0          113            0            0           48
        LockT        LockU         Open      OpenCfr
            0           48          320            0
   OpenDownGr        Close
            0          402
    RelLckOwn  FreeStateID    PutRootFH     DelegRet       GetAcl       SetAcl
            0            3            0            0            0            0
   ExchangeId   CreateSess  DestroySess  DestroyClId    LayoutGet   GetDevInfo
            0            0            0            0            0            0
 LayoutCommit LayoutReturn ReclaimCompl    ReadDataS   WriteDataS  CommitDataS
            0            0            0            0            0            0
   OpenLayout CreateLayout
            0            0
    OpenOwner        Opens    LockOwner        Locks       Delegs     LocalOwn
        21175       130439           30            6            0            0
    LocalOpen    LocalLown    LocalLock
            0            0            0
Rpc Info:
     TimedOut      Invalid    X Replies      Retries     Requests
            0            0            0            0        12247
Cache Info:
    Attr Hits  Attr Misses    Lkup Hits  Lkup Misses
      1110054          858      1002829         6361
    BioR Hits  BioR Misses    BioW Hits  BioW Misses
         2000           54          292          153
   BioRL Hits BioRL Misses    BioD Hits  BioD Misses
         6911            0          208            0
    DirE Hits  DirE Misses
          104            0
This does reflect the whole machine, not just those four processes.
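In other words, the counters above cover roughly this window:

    # Zero the client counters, wait a minute, then dump them again.
    nfsstat -E -c -z > /dev/null
    sleep 60
    nfsstat -E -c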
Finally, when I attempted to kill those four processes with ktrace
running on all of them, the system panicked:
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8272d28f
stack pointer = 0x28:0xfffffe008a5f24c0
frame pointer = 0x28:0xfffffe008a5f25f0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 5627 (node)
trap number = 12
panic: page fault
cpuid = 2
time = 1607845622
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe008a5f2180
vpanic() at vpanic+0x17b/frame 0xfffffe008a5f21d0
panic() at panic+0x43/frame 0xfffffe008a5f2230
trap_fatal() at trap_fatal+0x391/frame 0xfffffe008a5f2290
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe008a5f22e0
trap() at trap+0x286/frame 0xfffffe008a5f23f0
calltrap() at calltrap+0x8/frame 0xfffffe008a5f23f0
--- trap 0xc, rip = 0xffffffff8272d28f, rsp = 0xfffffe008a5f24c0, rbp
= 0xfffffe008a5f25f0 ---
null_bypass() at null_bypass+0xaf/frame 0xfffffe008a5f25f0
VOP_ADVLOCK_APV() at VOP_ADVLOCK_APV+0x80/frame 0xfffffe008a5f2620
closef() at closef+0x8f/frame 0xfffffe008a5f26b0
fdescfree_fds() at fdescfree_fds+0x3c/frame 0xfffffe008a5f2700
fdescfree() at fdescfree+0x466/frame 0xfffffe008a5f27c0
exit1() at exit1+0x488/frame 0xfffffe008a5f2820
sigexit() at sigexit+0x159/frame 0xfffffe008a5f2b00
postsig() at postsig+0x2fd/frame 0xfffffe008a5f2bc0
ast() at ast+0x317/frame 0xfffffe008a5f2bf0
doreti_ast() at doreti_ast+0x1f/frame 0x7fffffff67a0
The fsck on reboot ate the ktrace.out file. This system does not have
kernel debug symbols installed. If there's a way to figure out what
null_bypass+0xaf corresponds to using the machine that built this kernel
(which has the kernel, objects, debug symbols, and source), please let
me know. The kernel is built from releng/12.2 r368515.
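My best guess is that something like this on the build machine would
resolve the offset, run against the debug file for whichever object
contains null_bypass (nullfs.ko.debug if nullfs is a module, kernel.debug
if it's compiled in); the path and kernel config name below are guesses:

    # gdb (or kgdb) can resolve a symbol+offset against a debug file
    # without a core dump or access to the crashed machine.
    gdb /usr/obj/usr/src/amd64.amd64/sys/MYKERNEL/kernel.debug
    (gdb) info line *(null_bypass + 0xaf)
    (gdb) list *(null_bypass + 0xaf)

But confirmation from someone who has done this before would be
appreciated.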
Thanks!