Re: Hangs in bacula / NFS? on recent Current

From: Larry Rosenman <ler_at_lerctr.org>
Date: Fri, 19 Aug 2022 15:38:48 UTC

On 08/19/2022 10:36 am, Rick Macklem wrote:
> On 08/18/2022 9:49 am, Larry Rosenman wrote:
>> I didn't get all my mail on my bacula backups today (they backup to
>> NFS mounted TrueNAS).
>> Also a df hangs.
>> 
>> Here are procstat -kk's for all:
>> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
>> ❯ ps auxxxwww|grep bacula
>> bacula       2067    0.0  0.0  63188  13652  -  Is   11:30
>> 0:17.49 /usr/local/sbin/bacula-sd -u bacula -g bacula -v -c
>> /usr/local/etc/bacula/bacula-sd.conf
>> root         2072    0.0  0.0  59280  31276  -  Is   11:30
>> 0:00.31 /usr/local/sbin/bacula-fd -u root -g wheel -v -c
>> /usr/local/etc/bacula/bacula-fd.conf
>> bacula       2075    0.0  0.0  86992  19352  -  Is   11:30
>> 0:56.95 /usr/local/sbin/bacula-dir -u bacula -g bacula -v -c
>> /usr/local/etc/bacula/bacula-dir.conf
>> postgres    50241    0.0  0.1 285764 160244  -  Is   23:05
>> 0:00.38 postgres: bacula bacula [local]  (postgres)
>> postgres    50244    0.0  0.1 298784  74448  -  Ds   23:05
>> 0:00.67 postgres: bacula bacula [local]  (postgres)
>> ler         66595    0.0  0.0  12888   2600  3  S+   09:46
>> 0:00.00 grep --color=auto bacula
>> 
> At the end, I'll list what options are needed for ps and all of its
> output is needed. See the end of the email..
> 
>> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
>> ❯ sudo procstat -kk 2067
>>   PID    TID COMM                TDNAME              KSTACK
>>  2067 100742 bacula-sd           -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
>> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
>> amd64_syscall+0x12e fast_syscall_common+0xf8
>>  2067 101036 bacula-sd           -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
>> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
>> fast_syscall_common+0xf8
>>  2067 101038 bacula-sd           -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
>> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
>> fast_syscall_common+0xf8
>>  2067 124485 bacula-sd           -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
>> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
>> fast_syscall_common+0xf8
>> 
>> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
>> ❯ sudo procstat -kk 2072
>>   PID    TID COMM                TDNAME              KSTACK
>>  2072 100677 bacula-fd           -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
>> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
>> amd64_syscall+0x12e fast_syscall_common+0xf8
>>  2072 101039 bacula-fd           -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
>> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
>> fast_syscall_common+0xf8
>>  2072 101040 bacula-fd           -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
>> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
>> fast_syscall_common+0xf8
>>  2072 124490 bacula-fd           -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
>> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
>> fast_syscall_common+0xf8
>> 
>> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
>> ❯ sudo procstat -kk 2075
>>   PID    TID COMM                TDNAME              KSTACK
>>  2075 101007 bacula-dir          -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
>> _sleep+0x29b umtxq_sleep+0x242 do_wait+0x26b __umtx_op_wait+0x53
>> sys__umtx_op+0x7e amd64_syscall+0x12e fast_syscall_common+0xf8
>>  2075 101041 bacula-dir          -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
>> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
>> fast_syscall_common+0xf8
>>  2075 101045 bacula-dir          -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
>> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56
>> amd64_syscall+0x12e fast_syscall_common+0xf8
>>  2075 101046 bacula-dir          -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b
>> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e
>> fast_syscall_common+0xf8
>>  2075 101047 bacula-dir          -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
>> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
>>  2075 124479 bacula-dir          -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9
>> _cv_wait_sig+0x137 kern_poll_kfds+0x48c kern_poll+0x9f sys_poll+0x50
>> amd64_syscall+0x12e fast_syscall_common+0xf8
>>  2075 124480 bacula-dir          -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
>> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
>>  2075 124481 bacula-dir          -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
>> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
>>  2075 124489 bacula-dir          -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c
>> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e
>> fast_syscall_common+0xf8
>>  2075 124506 bacula-dir          -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_catch_signals+0x266
>> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1
>> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8
>> 
>> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
>> ❯ sudo procstat -kk 66390
>>   PID    TID COMM                TDNAME              KSTACK
>> 66390 101514 df                  -                   mi_switch+0x157
>> sleepq_switch+0x107 sleepq_timedwait+0x4b _sleep+0x28e
>> clnt_reconnect_call+0x809 newnfs_request+0xa95 nfscl_request+0x5a
>> nfsrpc_statfs+0x19d nfs_statfs+0x148 vfs_statfs_sigdefer+0x2e
>> kern_getfsstat+0x1f1 sys_getfsstat+0x22 amd64_syscall+0x12e
>> fast_syscall_common+0xf8
>> 
> This says that the kernel RPC is trying to establish a new TCP 
> connection
> to the server. Now why the old connection stopped working
> (which should only happen if there is a network partitioning or
> client/server crash), we do not know, from the limited info.
> 
>> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
>> ❯
>> 
>> this was built yesterday:
>> ❯ uname -a
>> FreeBSD borg.lerctr.org 14.0-CURRENT FreeBSD 14.0-CURRENT #142
>> ler/freebsd-main-changes-n257453-175a127a72f: Wed Aug 17 09:23:32 CDT
>> 2022
>> root@borg.lerctr.org:/usr/obj/usr/src/amd64.amd64/sys/LER-MINIMAL
>> amd64
>> 
>> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️  (us-east-1)
>> ❯
>> 
>> What else do we need?
> Typically, the output of the following commands (all of it, not just 
> some
> processes/threads) can be useful:
> 
> ps axHlw
> - The MWCHAN field is by far the most useful, since it indicates where
>    the process/thread is sleeping. "H" matters since there are nfsiod 
> threads
>    that do asynchronous I/O for readaheads/writebehinds and knowing 
> what
>    they are up to can be useful.
> 
> netstat -a
> - To see what state the TCP connection(s) being used by the mount are 
> up to.
>    (Since it was trying to do a new connection, this one might have
> been useful.)
> 
> nfsstat -m
> - To show exactly what NFS mount options are being used.
> --> For example, use of "intr" and/or "soft" for an NFSv4 mount is
>       guaranteed to break it. (See BUGS section of "man mount_nfs".)
> 
> vmstat -m
> vmstat -z
> - To see what kernel memory allocations look like. (Basically looking 
> for
>   a very large allocation that might indicate a leak or resource 
> exhaustion.)
> 
> procstat -kk
> - To be honest, for me, this does not usually give me any useful 
> information
>    compared with "ps", but it might be useful.
> 
> The rest require going into the debugger (if compiled into your 
> kernel).
> sysctl debug.kdb.enter=1
> - type <ctrl><alt><escape>
> then in the debugger, do "show alllocks" and the other commands
> listed here:
> https://docs.freebsd.org/doc/6.4-RELEASE/usr/share/doc/en/books/developers-handbook/kerneldebug-deadlocks.html
> 
> You can type "continue" to get out of the debugger.
> 
> Since it was trying to establish a new TCP connection, all I can say is 
> that
> something broke the old TCP connection and that could suggest a
> problem with your network fabric, such as a broken TSO implementation.
> (NFS traffic is really good at generating weird sizes of TSO segments 
> that,
>  in turn, are very good at finding flaws in TSO implementations that 
> other
>  TCP traffic never finds.)
> 
> rick


MarkJ pointed out there was a kernel bug that I was probably hitting, 
upgrading past that point fixed it.

-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106