Re: Hangs in bacula / NFS? on recent Current

From: Rick Macklem <rmacklem_at_uoguelph.ca>
Date: Fri, 19 Aug 2022 15:36:38 UTC
On 08/18/2022 9:49 am, Larry Rosenman wrote:
> I didn't get all my mail on my bacula backups today (they backup to
> NFS mounted TrueNAS).
> Also a df hangs.
>
> Here are procstat -kk's for all:
> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
> ❯ ps auxxxwww|grep bacula
> bacula       2067    0.0  0.0  63188  13652  -  Is   11:30
> 0:17.49 /usr/local/sbin/bacula-sd -u bacula -g bacula -v -c
> /usr/local/etc/bacula/bacula-sd.conf
> root         2072    0.0  0.0  59280  31276  -  Is   11:30
> 0:00.31 /usr/local/sbin/bacula-fd -u root -g wheel -v -c
> /usr/local/etc/bacula/bacula-fd.conf
> bacula       2075    0.0  0.0  86992  19352  -  Is   11:30
> 0:56.95 /usr/local/sbin/bacula-dir -u bacula -g bacula -v -c
> /usr/local/etc/bacula/bacula-dir.conf
> postgres    50241    0.0  0.1 285764 160244  -  Is   23:05
> 0:00.38 postgres: bacula bacula [local]  (postgres)
> postgres    50244    0.0  0.1 298784  74448  -  Ds   23:05
> 0:00.67 postgres: bacula bacula [local]  (postgres)
> ler         66595    0.0  0.0  12888   2600  3  S+   09:46
> 0:00.00 grep --color=auto bacula
>
At the end, I'll list what options are needed for ps and all of its
output is needed. See the end of the email..

> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
> ❯ sudo procstat -kk 2067
>   PID    TID COMM                TDNAME              KSTACK
>  2067 100742 bacula-sd           -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
> amd64_syscall+0x12e fast_syscall_common+0xf8
>  2067 101036 bacula-sd           -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
> fast_syscall_common+0xf8
>  2067 101038 bacula-sd           -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
> fast_syscall_common+0xf8
>  2067 124485 bacula-sd           -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
> fast_syscall_common+0xf8
>
> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
> ❯ sudo procstat -kk 2072
>   PID    TID COMM                TDNAME              KSTACK
>  2072 100677 bacula-fd           -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
> amd64_syscall+0x12e fast_syscall_common+0xf8
>  2072 101039 bacula-fd           -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
> fast_syscall_common+0xf8
>  2072 101040 bacula-fd           -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
> fast_syscall_common+0xf8
>  2072 124490 bacula-fd           -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
> fast_syscall_common+0xf8
>
> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
> ❯ sudo procstat -kk 2075
>   PID    TID COMM                TDNAME              KSTACK
>  2075 101007 bacula-dir          -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
> _sleep+0x29b umtxq_sleep+0x242 do_wait+0x26b __umtx_op_wait+0x53
> sys__umtx_op+0x7e amd64_syscall+0x12e fast_syscall_common+0xf8
>  2075 101041 bacula-dir          -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
> fast_syscall_common+0xf8
>  2075 101045 bacula-dir          -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
> amd64_syscall+0x12e fast_syscall_common+0xf8
>  2075 101046 bacula-dir          -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
> fast_syscall_common+0xf8
>  2075 101047 bacula-dir          -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
>  2075 124479 bacula-dir          -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
> _cv_wait_sig+0x137 kern_poll_kfds+0x48c kern_poll+0x9f sys_poll+0x50
> amd64_syscall+0x12e fast_syscall_common+0xf8
>  2075 124480 bacula-dir          -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
>  2075 124481 bacula-dir          -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
>  2075 124489 bacula-dir          -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
> fast_syscall_common+0xf8
>  2075 124506 bacula-dir          -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_catch_signals+0x266
> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
>
> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
> ❯ sudo procstat -kk 66390
>   PID    TID COMM                TDNAME              KSTACK
> 66390 101514 df                  -                   mi_switch+0x157
> sleepq_switch+0x107 sleepq_timedwait+0x4b _sleep+0x28e
> clnt_reconnect_call+0x809 newnfs_request+0xa95 nfscl_request+0x5a
> nfsrpc_statfs+0x19d nfs_statfs+0x148 vfs_statfs_sigdefer+0x2e
> kern_getfsstat+0x1f1 sys_getfsstat+0x22 amd64_syscall+0x12e
> fast_syscall_common+0xf8
>
This says that the kernel RPC is trying to establish a new TCP connection
to the server. Now why the old connection stopped working
(which should only happen if there is a network partitioning or
client/server crash), we do not know, from the limited info.

> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
> ❯
>
> this was built yesterday:
> ❯ uname -a
> FreeBSD borg.lerctr.org 14.0-CURRENT FreeBSD 14.0-CURRENT #142
> ler/freebsd-main-changes-n257453-175a127a72f: Wed Aug 17 09:23:32 CDT
> 2022
> root@borg.lerctr.org:/usr/obj/usr/src/amd64.amd64/sys/LER-MINIMAL
> amd64
>
> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
> ❯
>
> What else do we need?
Typically, the output of the following commands (all of it, not just some
processes/threads) can be useful:

ps axHlw
- The MWCHAN field is by far the most useful, since it indicates where
   the process/thread is sleeping. "H" matters since there are nfsiod threads
   that do asynchronous I/O for readaheads/writebehinds and knowing what
   they are up to can be useful.

netstat -a
- To see what state the TCP connection(s) being used by the mount are up to.
   (Since it was trying to do a new connection, this one might have been useful.)

nfsstat -m
- To show exactly what NFS mount options are being used.
--> For example, use of "intr" and/or "soft" for an NFSv4 mount is
      guaranteed to break it. (See BUGS section of "man mount_nfs".)

vmstat -m
vmstat -z
- To see what kernel memory allocations look like. (Basically looking for
  a very large allocation that might indicate a leak or resource exhaustion.)

procstat -kk
- To be honest, for me, this does not usually give me any useful information
   compared with "ps", but it might be useful.

The rest require going into the debugger (if compiled into your kernel).
sysctl debug.kdb.enter=1
- type <ctrl><alt><escape>
then in the debugger, do "show alllocks" and the other commands
listed here:
https://docs.freebsd.org/doc/6.4-RELEASE/usr/share/doc/en/books/developers-handbook/kerneldebug-deadlocks.html

You can type "continue" to get out of the debugger.

Since it was trying to establish a new TCP connection, all I can say is that
something broke the old TCP connection and that could suggest a
problem with your network fabric, such as a broken TSO implementation.
(NFS traffic is really good at generating weird sizes of TSO segments that,
 in turn, are very good at finding flaws in TSO implementations that other
 TCP traffic never finds.)

rick