From nobody Fri Aug 19 15:38:48 2022 X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4M8Qs56JZCz4ZQV1 for ; Fri, 19 Aug 2022 15:38:49 +0000 (UTC) (envelope-from ler@lerctr.org) Received: from thebighonker.lerctr.org (thebighonker.lerctr.org [IPv6:2602:fcdb:0:10:7ae3:b5ff:fe1b:23b4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "*.lerctr.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4M8Qs50rt8z4QZY; Fri, 19 Aug 2022 15:38:49 +0000 (UTC) (envelope-from ler@lerctr.org) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lerctr.org; s=ler2019; h=Content-Transfer-Encoding:Content-Type:Message-ID:References: In-Reply-To:Subject:Cc:To:From:Date:MIME-Version:Sender:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=WM5EO4PrJHRAJwJzOEBl/KxZeka6xiy+op2jkB2Up44=; b=PTmWMEYomkOp0IYLo0IQDIUMlN K6Cvy4VScEEANsGpQh7wtr1JUtdzMauDiDkmBEmJ/uBeUKIg10V5crT+fYro+rQ0DGudRwyXQDAo4 7bC5BD37VWOPXXgaKp7E8vYl7MfV1fq6hQSce2l1Spa35Geh274CB5ZIBRNwqGh/v0pWHkaY8/8td eYkApU8uzxZ0Ksi/ExCe1lAeIUkCs+qvwkHJbHBlnkeSxX0ezXqPuVLWEDZa2XimPNqoIbWoXVKWm M/nYROzOVOg2Fa3eU6bsVEjTgbW7KD0+ChV3ODVpphGe7ex2Q72H7EvsggOfUHKusHB0ay3y/PMb7 8eivwuiw==; Received-SPF: pass (thebighonker.lerctr.org: domain of lerctr.org designates 2602:fcdb:0:10:7ae3:b5ff:fe1b:23b4 as permitted sender) client-ip=2602:fcdb:0:10:7ae3:b5ff:fe1b:23b4; envelope-from=ler@lerctr.org; helo=webmail.lerctr.org; Received: from thebighonker.lerctr.org ([2602:fcdb:0:10:7ae3:b5ff:fe1b:23b4]:36696 helo=webmail.lerctr.org) by thebighonker.lerctr.org with esmtpsa (TLS1.3) tls TLS_AES_256_GCM_SHA384 (Exim 4.95 (FreeBSD)) (envelope-from ) id 1oP45A-000Fda-ET; Fri, 19 Aug 2022 10:38:48 -0500 Received: from 2600:1700:210:b18f:55e0:4474:655b:795 by webmail.lerctr.org with HTTP (HTTP/1.1 POST); Fri, 19 Aug 2022 10:38:48 -0500 List-Id: Discussions about the use of FreeBSD-current List-Archive: https://lists.freebsd.org/archives/freebsd-current List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-current@freebsd.org MIME-Version: 1.0 Date: Fri, 19 Aug 2022 10:38:48 -0500 From: Larry Rosenman To: Rick Macklem Cc: Freebsd current , Mark Johnston Subject: Re: Hangs in bacula / NFS? on recent Current In-Reply-To: References: <405b3873b709d42feb438b5b954ecdc2@lerctr.org> <9e63f40233fc29c9fbc3fac0cb82acd5@lerctr.org> Message-ID: <3325241f6237fe70129bac18c0938f66@lerctr.org> X-Sender: ler@lerctr.org Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 4M8Qs50rt8z4QZY X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=lerctr.org header.s=ler2019 header.b=PTmWMEYo; dmarc=pass (policy=none) header.from=lerctr.org; spf=pass (mx1.freebsd.org: domain of ler@lerctr.org designates 2602:fcdb:0:10:7ae3:b5ff:fe1b:23b4 as permitted sender) smtp.mailfrom=ler@lerctr.org X-Spamd-Result: default: False [-4.00 / 15.00]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-0.999]; DMARC_POLICY_ALLOW(-0.50)[lerctr.org,none]; R_SPF_ALLOW(-0.20)[+mx:c]; R_DKIM_ALLOW(-0.20)[lerctr.org:s=ler2019]; MIME_GOOD(-0.10)[text/plain]; ASN(0.00)[asn:55103, ipnet:2602:fcdb::/36, country:US]; MIME_TRACE(0.00)[0:+]; FROM_EQ_ENVFROM(0.00)[]; RCVD_TLS_LAST(0.00)[]; MLMMJ_DEST(0.00)[freebsd-current@freebsd.org]; ARC_NA(0.00)[]; TO_DN_ALL(0.00)[]; RCVD_COUNT_TWO(0.00)[2]; FREEFALL_USER(0.00)[ler]; RCVD_VIA_SMTP_AUTH(0.00)[]; DKIM_TRACE(0.00)[lerctr.org:+]; FROM_HAS_DN(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; RCPT_COUNT_THREE(0.00)[3]; TO_MATCH_ENVRCPT_SOME(0.00)[]; SUBJECT_HAS_QUESTION(0.00)[] X-ThisMailContainsUnwantedMimeParts: N On 08/19/2022 10:36 am, Rick Macklem wrote: > On 08/18/2022 9:49 am, Larry Rosenman wrote: >> I didn't get all my mail on my bacula backups today (they backup to >> NFS mounted TrueNAS). >> Also a df hangs. >> >> Here are procstat -kk's for all: >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ ps auxxxwww|grep bacula >> bacula 2067 0.0 0.0 63188 13652 - Is 11:30 >> 0:17.49 /usr/local/sbin/bacula-sd -u bacula -g bacula -v -c >> /usr/local/etc/bacula/bacula-sd.conf >> root 2072 0.0 0.0 59280 31276 - Is 11:30 >> 0:00.31 /usr/local/sbin/bacula-fd -u root -g wheel -v -c >> /usr/local/etc/bacula/bacula-fd.conf >> bacula 2075 0.0 0.0 86992 19352 - Is 11:30 >> 0:56.95 /usr/local/sbin/bacula-dir -u bacula -g bacula -v -c >> /usr/local/etc/bacula/bacula-dir.conf >> postgres 50241 0.0 0.1 285764 160244 - Is 23:05 >> 0:00.38 postgres: bacula bacula [local] (postgres) >> postgres 50244 0.0 0.1 298784 74448 - Ds 23:05 >> 0:00.67 postgres: bacula bacula [local] (postgres) >> ler 66595 0.0 0.0 12888 2600 3 S+ 09:46 >> 0:00.00 grep --color=auto bacula >> > At the end, I'll list what options are needed for ps and all of its > output is needed. See the end of the email.. > >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ sudo procstat -kk 2067 >> PID TID COMM TDNAME KSTACK >> 2067 100742 bacula-sd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 >> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56 >> amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2067 101036 bacula-sd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2067 101038 bacula-sd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2067 124485 bacula-sd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c >> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ sudo procstat -kk 2072 >> PID TID COMM TDNAME KSTACK >> 2072 100677 bacula-fd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 >> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56 >> amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2072 101039 bacula-fd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2072 101040 bacula-fd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2072 124490 bacula-fd - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c >> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ sudo procstat -kk 2075 >> PID TID COMM TDNAME KSTACK >> 2075 101007 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 >> _sleep+0x29b umtxq_sleep+0x242 do_wait+0x26b __umtx_op_wait+0x53 >> sys__umtx_op+0x7e amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 101041 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2075 101045 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 >> _cv_wait_sig+0x137 kern_select+0x9fe sys_select+0x56 >> amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 101046 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d umtxq_sleep+0x242 do_wait+0x26b >> __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2075 101047 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1 >> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 124479 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 sleepq_wait_sig+0x9 >> _cv_wait_sig+0x137 kern_poll_kfds+0x48c kern_poll+0x9f sys_poll+0x50 >> amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 124480 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1 >> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 124481 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1 >> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8 >> 2075 124489 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _cv_timedwait_sig_sbt+0x15c >> kern_poll_kfds+0x457 kern_poll+0x9f sys_poll+0x50 amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> 2075 124506 bacula-dir - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_catch_signals+0x266 >> sleepq_timedwait_sig+0x12 _sleep+0x27d kern_clock_nanosleep+0x1d1 >> sys_nanosleep+0x3b amd64_syscall+0x12e fast_syscall_common+0xf8 >> >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ sudo procstat -kk 66390 >> PID TID COMM TDNAME KSTACK >> 66390 101514 df - mi_switch+0x157 >> sleepq_switch+0x107 sleepq_timedwait+0x4b _sleep+0x28e >> clnt_reconnect_call+0x809 newnfs_request+0xa95 nfscl_request+0x5a >> nfsrpc_statfs+0x19d nfs_statfs+0x148 vfs_statfs_sigdefer+0x2e >> kern_getfsstat+0x1f1 sys_getfsstat+0x22 amd64_syscall+0x12e >> fast_syscall_common+0xf8 >> > This says that the kernel RPC is trying to establish a new TCP > connection > to the server. Now why the old connection stopped working > (which should only happen if there is a network partitioning or > client/server crash), we do not know, from the limited info. > >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ >> >> this was built yesterday: >> ❯ uname -a >> FreeBSD borg.lerctr.org 14.0-CURRENT FreeBSD 14.0-CURRENT #142 >> ler/freebsd-main-changes-n257453-175a127a72f: Wed Aug 17 09:23:32 CDT >> 2022 >> root@borg.lerctr.org:/usr/obj/usr/src/amd64.amd64/sys/LER-MINIMAL >> amd64 >> >> ler in 🌐 borg in ~ via C v14.0.5-clang on ☁️ (us-east-1) >> ❯ >> >> What else do we need? > Typically, the output of the following commands (all of it, not just > some > processes/threads) can be useful: > > ps axHlw > - The MWCHAN field is by far the most useful, since it indicates where > the process/thread is sleeping. "H" matters since there are nfsiod > threads > that do asynchronous I/O for readaheads/writebehinds and knowing > what > they are up to can be useful. > > netstat -a > - To see what state the TCP connection(s) being used by the mount are > up to. > (Since it was trying to do a new connection, this one might have > been useful.) > > nfsstat -m > - To show exactly what NFS mount options are being used. > --> For example, use of "intr" and/or "soft" for an NFSv4 mount is > guaranteed to break it. (See BUGS section of "man mount_nfs".) > > vmstat -m > vmstat -z > - To see what kernel memory allocations look like. (Basically looking > for > a very large allocation that might indicate a leak or resource > exhaustion.) > > procstat -kk > - To be honest, for me, this does not usually give me any useful > information > compared with "ps", but it might be useful. > > The rest require going into the debugger (if compiled into your > kernel). > sysctl debug.kdb.enter=1 > - type > then in the debugger, do "show alllocks" and the other commands > listed here: > https://docs.freebsd.org/doc/6.4-RELEASE/usr/share/doc/en/books/developers-handbook/kerneldebug-deadlocks.html > > You can type "continue" to get out of the debugger. > > Since it was trying to establish a new TCP connection, all I can say is > that > something broke the old TCP connection and that could suggest a > problem with your network fabric, such as a broken TSO implementation. > (NFS traffic is really good at generating weird sizes of TSO segments > that, > in turn, are very good at finding flaws in TSO implementations that > other > TCP traffic never finds.) > > rick MarkJ pointed out there was a kernel bug that I was probably hitting, upgrading past that point fixed it. -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: ler@lerctr.org US Mail: 5708 Sabbia Dr, Round Rock, TX 78665-2106