Re: FreeBSD 13.2 NFS client mount hangs
Date: Sat, 30 Sep 2023 22:06:35 UTC
On Fri, Sep 29, 2023 at 5:50 PM J David <j.david.lists@gmail.com> wrote:
>
> I have noticed a new (to me) hang on FreeBSD NFS client machines
> running 13.2-RELEASE-p2.
>
> It's happened twice this week to Apache processes. It's the root EUID
> process and it appears to happen while the process is starting up or
> reconfiguring. I.e., while it's reading the configs.
>
> The configs are not on NFS storage. But the vhost document roots are.
>
> The process ps looks like this:
>
> 0 19557 19548 3 25 5 25248 12036 nfstry DN - 0:12.85
> /usr/local/apache/2.4/bin/httpd -D FOREGROUND -f
> /usr/local/apache/2.4/conf/httpd.conf
>
> The procstat -kk looks like:
>
> PID TID COMM TDNAME KSTACK
> 19557 100341 httpd - mi_switch+0xc2
> sleepq_timedwait+0x2f _sleep+0x1ce clnt_vc_call+0x866
> clnt_reconnect_call+0x626 newnfs_request+0xc36 nfscl_request+0x5a
> nfsrpc_getattr+0xbb nfs_close+0x489 vop_sigdefer+0x2b
> VOP_CLOSE_APV+0x1c vn_close1+0x16a vn_closefile+0x3d _fdrop+0x11
> closef+0x24b closefp_impl+0x69 amd64_syscall+0x10c
> fast_syscall_common+0xf8
This is just waiting for a reply for the Close RPC.

>
> The process slowly gains CPU time (a few hundredths per minute) but is
> immune to kill -9 so it doesn't seem to be coming out of the kernel at
> any point.
>
> I tried running procstat -kk every few seconds to see if I would get
> anything different to show what it's doing. Most are the same as
> above, but I also got this:
>
> 19557 100341 httpd - mi_switch+0xc2
> sleepq_timedwait+0x2f _sleep+0x1ce nfs_catnap+0x47
> newnfs_request+0x14b3 nfscl_request+0x5a nfsrpc_getattr+0xbb
> nfs_close+0x489 vop_sigdefer+0x2b VOP_CLOSE_APV+0x1c vn_close1+0x16a
> vn_closefile+0x3d _fdrop+0x11 closef+0x24b closefp_impl+0x69
> amd64_syscall+0x10c fast_syscall_common+0xf8
This one is sleeping for a short time before retrying an RPC.
Although I cannot be 100% sure, it is probably one that received
a NFS4ERR_DELAY reply from the server.
Fairly recent versions of the Linux server hand out delegations.
IMHO delegations are pretty useless.
I have seen reports on the linux-nfs@vger.kernel.org mailing list
related to Close and Delegation Recall resulting in repeated
NFS4ERR_DELAY replies.
--> I'd suggest you try to disable delegations. I do not know how
    to do this on the Linux server, but not running the nfscbd(8)
    daemon should stop them from being issued.
    (No nfscbd(8) implies no callbacks, and no callbacks should imply
    no delegations being issued. If the Linux server still issues
    delegations when nfscbd(8) is not running (and was not
    running when the mount was done), it is broken.)

The FreeBSD client currently does not accept NFS4ERR_DELAY for Close.
If the Linux server is replying NFS4ERR_DELAY for Close, all bets are off.

>
> (This differs starting at the newnfs_request after nfscl_request+0x5a.)
>
> I started unmounting NFS filesystems until I hit one where umount
> hung. An ls on that filesystem also hung. However, an ls of that
> filesystem from another client machine worked fine, so it does appear
> to be a client-side issue rather than a server problem. umount -f
> also hung. umount -N did unmount it very quickly and that caused all
> the hanging umounts and the
> httpd process to exit immediately.
Yes, "umount -N <mnt_path>" is the way (and the only way) to get rid
of hung NFS mounts.

>
> I didn't find anything good in the syslog or dmesg. The only thing
> related to nfs are a handful of "nfsv4 err=10068" that look like they
> were way back near when the system booted (about 5 days ago).
Hmm, interesting. 10068 is NFS4ERR_RETRY_UNCACHED_REP.
I have never seen (and do not recall anyone else reporting) this
error return.
- The RFC says it can be returned when a retry of the same RPC with
  the same session slot/seqid is received and the reply is not cached.

To be honest, I do not know how the FreeBSD NFSv4.1/4.2 client will
handle this.
I will have to look at the code to see if this can happen after
a new TCP connection is established and outstanding RPCs are retried.
--> If this can happen, the client code needs to be patched to
    retry the RPC with a different session slot, or with the same
    session slot but an advanced seqid.

Basically, I suspect the FreeBSD client is broken for handling this
case, which I have never seen before.

>
> The mount flags are:
>
> nfsv4,minorversion=2,oneopenown,tcp,resvport,nconnect=1,hard,cto,sec=sys,acdirmin=3,acdirmax=60,acregmin=5,acregmax=60,nametimeo=60,negnametimeo=60,rsize=65536,wsize=65536,readdirsize=65536,readahead=1,wcommitsize=16777216,timeout=120,retrans=2147483647
>
> Is there any other information I could provide or try to catch next
> time that would help debug this?
I'd suggest you first check network connectivity. Both the NFS client
and the NFS server should be able to ping each other.

If that is the case, then I'd suggest you capture packets.
On the FreeBSD end:
# tcpdump -s 0 -w out.pcap host <nfs-server-name>
Let this run for a while, then pull out.pcap into wireshark
and see what traffic is going between the NFS client and server.
(Unlike tcpdump, wireshark does know how to decode NFS properly.)

rick

>
> Thanks!
>
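
For readers following the NFS4ERR_RETRY_UNCACHED_REP discussion above, here is a toy model of the slot/seqid behaviour described there. It is only a sketch under loud assumptions: ToyServerSlot, ToyClientSlot and the lose_first_reply flag are hypothetical names invented for illustration, and this is neither FreeBSD kernel code nor Linux server code. It shows a server slot that does not cache a reply, a client that resends the same seqid after losing the reply, and the "same slot, advanced seqid" recovery suggested above.

```python
NFS4_OK = 0
NFS4ERR_RETRY_UNCACHED_REP = 10068  # error number reported in the thread


class ToyServerSlot:
    """Server-side view of one session slot (toy model, hypothetical)."""

    def __init__(self):
        self.seqid = 0            # seqid of the last request executed
        self.cached_reply = None  # kept only if the client asked for caching

    def handle(self, seqid, op, cache_this):
        if seqid == self.seqid + 1:
            # New request: execute it, cache the reply only on request.
            self.seqid = seqid
            reply = (NFS4_OK, f"done:{op}")
            self.cached_reply = reply if cache_this else None
            return reply
        if seqid == self.seqid:
            # Retry of the last request: replay from the cache if possible.
            if self.cached_reply is not None:
                return self.cached_reply
            return (NFS4ERR_RETRY_UNCACHED_REP, None)
        raise ValueError("misordered seqid (not modelled in this sketch)")


class ToyClientSlot:
    """Client-side view of the same slot (toy model, hypothetical)."""

    def __init__(self, server_slot):
        self.server_slot = server_slot
        self.seqid = 0

    def call(self, op, cache_this=False, lose_first_reply=False):
        self.seqid += 1
        status, result = self.server_slot.handle(self.seqid, op, cache_this)
        if lose_first_reply:
            # Pretend the TCP connection broke before the reply arrived,
            # so the RPC is resent with the same slot and seqid.
            status, result = self.server_slot.handle(self.seqid, op, cache_this)
        if status == NFS4ERR_RETRY_UNCACHED_REP:
            # Recovery: resend as a new request with an advanced seqid.
            # This re-executes the operation, so the sketch assumes the
            # operation is safe to repeat (e.g. a Getattr).
            self.seqid += 1
            status, result = self.server_slot.handle(self.seqid, op, cache_this)
        return status, result


if __name__ == "__main__":
    client = ToyClientSlot(ToyServerSlot())
    print(client.call("GETATTR", lose_first_reply=True), client.seqid)
```

Whether the real client can hit this after a reconnect, and whether re-executing is acceptable for a given operation, is exactly what would need checking in the actual code; the model only illustrates the mechanism.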