[Bug 260011] Unresponsive NFS mount on AWS EFS
- In reply to: bugzilla-noreply@freebsd.org: "[Bug 260011] Unresponsive NFS mount on AWS EFS"
Date: Sun, 22 May 2022 00:39:57 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=260011

--- Comment #13 from Rick Macklem <rmacklem@FreeBSD.org> ---
Created attachment 234101
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=234101&action=edit
mark session slots bad and set session defunct when all slots bad

Ok, I looked at the attachments...
- There were 154 ExchangeIDs. That's about 153 more than there should be.
  These should only occur once when doing the mount, plus once after a
  server reboot.
- Unfortunately, Amazon EFS has a fundamental design flaw: a new TCP
  connection may connect to a different "cluster" (whatever Amazon considers
  a cluster to be), and this different cluster does not know any of the
  open/lock state and acts like a rebooted NFSv4 server. There may be other
  things that cause this as well.

To use it reliably, you need to avoid these ExchangeIDs (recovery cycles).
If you monitor the TCP connection to the server via repeated "netstat -a"
calls and you see the connection change (different client port #), then that
is what is causing the problem (because of the Amazon EFS design flaw). A
small monitoring sketch follows at the end of this message.

Using "soft,intr" is asking for trouble, because an interrupted syscall
leaves the file system in a non-deterministic state. Something called
sessions maintains "exactly once" RPC semantics, and an interrupted RPC
breaks the session slot it was using. Once all the session slots are broken,
the client must do one of these recoveries.
--> It is much better to use hard mounts and "umount -N" if/when a mount
    point is hung (see the example after this message).

For this case, the client is stuck partway through one of these recoveries,
because the "nfscl" thread that does the recovery is stuck waiting for a
session slot. I'm not sure how that can happen, since the session would
normally be marked "defunct" so that "nfscl" would not be waiting for a slot
in that session to become available.

I have attached a patch, which might help. It does two things differently...
- It scans through all sessions looking for a match to mark defunct, instead
  of just doing the first/current one. I cannot think how a new session
  would be created without the previous one being marked defunct, but since
  your "ps axHl" output suggests that happens, this might fix the problem.
- It keeps track of bad slots (caused by a "soft,intr" RPC failing without
  completing) and marks the session defunct when all slots are bad. This
  might make "soft,intr" mounts work better.

If you can try the patch and it improves the situation, it could be
considered for a FreeBSD commit. I doubt it will ever be committed
otherwise, because I have no way of reproducing what you are getting.
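
A minimal sketch of the "repeated netstat -a" monitoring described above,
assuming the EFS mount target's address is known and the connection uses the
standard NFS port 2049; the EFS_IP value and the 30-second polling interval
are placeholders, not anything from the original comment or patch:

    #!/bin/sh
    # Watch the TCP connection to the EFS mount target and report whenever
    # the client-side (local) port changes, i.e. a new connection was made.
    EFS_IP="10.0.0.5"   # placeholder: your EFS mount target address
    prev=""
    while :; do
        # FreeBSD netstat columns: Proto Recv-Q Send-Q Local Foreign (state);
        # pick the established connection to port 2049 on the EFS target.
        cur=$(netstat -an -p tcp | awk -v ip="$EFS_IP" \
            '$5 ~ ip"\\.2049" && $6 == "ESTABLISHED" { print $4 }')
        if [ -n "$prev" ] && [ "$cur" != "$prev" ]; then
            echo "$(date): connection changed: $prev -> $cur"
        fi
        prev="$cur"
        sleep 30
    done

If this prints a change while no server reboot happened, that is the
connection bouncing to a different EFS "cluster" as described above.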
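
And a minimal example of the hard-mount approach recommended above, assuming
an NFSv4.1 mount of an EFS export; the host name and mount point are
placeholders:

    # Hard mount (the FreeBSD default; simply leave out "soft" and "intr").
    mount -t nfs -o nfsv4,minorversion=1 \
        fs-XXXXXXXX.efs.us-east-1.amazonaws.com:/ /mnt/efs

    # If the mount point ever wedges, do a forced NFS dismount instead of
    # relying on soft/intr timeouts:
    umount -N /mnt/efs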