Re: Hangs in bacula / NFS? on recent Current
- In reply to: Rick Macklem : "Re: Hangs in bacula / NFS? on recent Current"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Fri, 19 Aug 2022 15:38:48 UTC
On 08/19/2022 10:36 am, Rick Macklem wrote: > On 08/18/2022 9:49 am, Larry Rosenman wrote: >> I didn't get all my mail on my bacula backups today (they backup to >> NFS mounted TrueNAS). >> Also a df hangs. >> >> Here are procstat -kk's for all: >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ ps auxxxwww|grep bacula >> bacula 2067 0.0 0.0 63188 13652 - Is 11:30 >> 0:17.49 /usr/local/sbin/bacula-sd -u bacula -g bacula -v -c >> /usr/local/etc/bacula/bacula-sd.conf >> root 2072 0.0 0.0 59280 31276 - Is 11:30 >> 0:00.31 /usr/local/sbin/bacula-fd -u root -g wheel -v -c >> /usr/local/etc/bacula/bacula-fd.conf >> bacula 2075 0.0 0.0 86992 19352 - Is 11:30 >> 0:56.95 /usr/local/sbin/bacula-dir -u bacula -g bacula -v -c >> /usr/local/etc/bacula/bacula-dir.conf >> postgres 50241 0.0 0.1 285764 160244 - Is 23:05 >> 0:00.38 postgres: bacula bacula [local] (postgres) >> postgres 50244 0.0 0.1 298784 74448 - Ds 23:05 >> 0:00.67 postgres: bacula bacula [local] (postgres) >> ler 66595 0.0 0.0 12888 2600 3 S+ 09:46 >> 0:00.00 grep --color=auto bacula >> > At the end, I'll list what options are needed for ps and all of its > output is needed. See the end of the email.. > >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ sudo procstat -kk 2067 >> PID TID COMM TDNAME KSTACK >> 2067 100742 bacula-sd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 >> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56 >> amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2067 101036 bacula-sd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2067 101038 bacula-sd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2067 124485 bacula-sd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c >> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ sudo procstat -kk 2072 >> PID TID COMM TDNAME KSTACK >> 2072 100677 bacula-fd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 >> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56 >> amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2072 101039 bacula-fd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2072 101040 bacula-fd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2072 124490 bacula-fd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c >> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ sudo procstat -kk 2075 >> PID TID COMM TDNAME KSTACK >> 2075 101007 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 >> _sleep+0x29b umtxq_sleep+0x242 do_wait+0x26b __umtx_op_wait+0x53 >> sys__umtx_op+0x7e amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 101041 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2075 101045 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 >> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56 >> amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 101046 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2075 101047 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1 >> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 124479 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 >> _cv_wait_sig+0x137 kern_poll_kfds+0x48c kern_poll+0x9f sys_poll+0x50 >> amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 124480 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1 >> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 124481 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1 >> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 124489 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c >> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2075 124506 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1 >> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8 >> >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ sudo procstat -kk 66390 >> PID TID COMM TDNAME KSTACK >> 66390 101514 df - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_timedwait+0x4b _sleep+0x28e >> clnt_reconnect_call+0x809 newnfs_request+0xa95 nfscl_request+0x5a >> nfsrpc_statfs+0x19d nfs_statfs+0x148 vfs_statfs_sigdefer+0x2e >> kern_getfsstat+0x1f1 sys_getfsstat+0x22 amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> > This says that the kernel RPC is trying to establish a new TCP > connection > to the server. Now why the old connection stopped working > (which should only happen if there is a network partitioning or > client/server crash), we do not know, from the limited info. > >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ >> >> this was built yesterday: >> ❯ uname -a >> FreeBSD borg.lerctr.org 14.0-CURRENT FreeBSD 14.0-CURRENT #142 >> ler/freebsd-main-changes-n257453-175a127a72f: Wed Aug 17 09:23:32 CDT >> 2022 >> root@borg.lerctr.org:/usr/obj/usr/src/amd64.amd64/sys/LER-MINIMAL >> amd64 >> >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ >> >> What else do we need? > Typically, the output of the following commands (all of it, not just > some > processes/threads) can be useful: > > ps axHlw > - The MWCHAN field is by far the most useful, since it indicates where > the process/thread is sleeping. "H" matters since there are nfsiod > threads > that do asynchronous I/O for readaheads/writebehinds and knowing > what > they are up to can be useful. > > netstat -a > - To see what state the TCP connection(s) being used by the mount are > up to. > (Since it was trying to do a new connection, this one might have > been useful.) > > nfsstat -m > - To show exactly what NFS mount options are being used. > --> For example, use of "intr" and/or "soft" for an NFSv4 mount is > guaranteed to break it. (See BUGS section of "man mount_nfs".) > > vmstat -m > vmstat -z > - To see what kernel memory allocations look like. (Basically looking > for > a very large allocation that might indicate a leak or resource > exhaustion.) > > procstat -kk > - To be honest, for me, this does not usually give me any useful > information > compared with "ps", but it might be useful. > > The rest require going into the debugger (if compiled into your > kernel). > sysctl debug.kdb.enter=1 > - type <ctrl><alt><escape> > then in the debugger, do "show alllocks" and the other commands > listed here: > https://docs.freebsd.org/doc/6.4-RELEASE/usr/share/doc/en/books/developers-handbook/kerneldebug-deadlocks.html > > You can type "continue" to get out of the debugger. > > Since it was trying to establish a new TCP connection, all I can say is > that > something broke the old TCP connection and that could suggest a > problem with your network fabric, such as a broken TSO implementation. > (NFS traffic is really good at generating weird sizes of TSO segments > that, > in turn, are very good at finding flaws in TSO implementations that > other > TCP traffic never finds.) > > rick MarkJ pointed out there was a kernel bug that I was probably hitting, upgrading past that point fixed it. -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: ler@lerctr.org US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106