Re: Issues with NFS RPC

From: Rick Macklem <rmacklem_at_uoguelph.ca>
Date: Tue, 06 Jul 2021 22:49:09 UTC
Adam Stylinski wrote:
>Well, I put in a plea in that PR for something to trickle into RELENG, but, if not possible, does that patch you posted in there for -STABLE apply cleanly to the RELENG branch?  It looks like it calls some helper functions but from what I can tell they are fully present in the 13.0 releng branch (CC'ing the list and rscheff on this as well).
Just to clarify...the patch in the PR reverts r367492 (the commit that caused the breakage). It should
apply cleanly to a releng13.0 kernel.

It is not the patch in stable/13 that fixes the problem. That one is on Phabricator as D29690. I do
not know if that patch can be safely applied to a releng13.0 kernel. Presumably rscheff@ will know if it can.

>I used gmail for all of this correspondence so hopefully the mutt or other mail users don't lose their mind with all the top posting.  Sorry in advance for the poor ML etiquette.
Personally, I don't care, and sometimes top posting makes sense when the reply is not addressing individual
comments within a post. Not to mention, Outlook is what I use, and it makes top posting easy (and
expected).

You should do "netstat -a" on the server the next time you see the hang.
The connection will be in ESTABLISHED state with the Recv-Q value non-zero and increasing
over time, if this is the cause of your problem.
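
As a rough illustration of the check described above, here is a minimal Python sketch that scans netstat output for ESTABLISHED TCP connections on the NFS port with a non-zero Recv-Q. It is only a sketch under assumptions: the function name stuck_connections is made up for this example, it assumes the FreeBSD column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, state) with addresses printed as addr.port, and it assumes the output came from "netstat -an" (the -n is an addition here so the port shows up as 2049 instead of a service name).

```python
def stuck_connections(netstat_output, port="2049"):
    """Return (local, foreign, recv_q) tuples for ESTABLISHED TCP
    connections on the given local port whose Recv-Q is non-zero.

    Assumes FreeBSD-style "netstat -an" output; other layouts are skipped.
    """
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows have exactly six columns; header lines do not.
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        _proto, recv_q, _send_q, local, foreign, state = fields
        if state != "ESTABLISHED" or not recv_q.isdigit():
            continue
        # FreeBSD prints addresses as addr.port, so the port is the
        # text after the last dot.
        local_port = local.rpartition(".")[2]
        if local_port == port and int(recv_q) > 0:
            hits.append((local, foreign, int(recv_q)))
    return hits
```

One way to use it would be to save "netstat -an" output to a file on the server a few times during a hang, pass each file's text to the function, and compare the Recv-Q values for the same connection. A value that is non-zero and keeps growing matches the symptom described above.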

rick


On Tue, Jul 6, 2021 at 10:59 AM Rick Macklem <rmacklem@uoguelph.ca> wrote:
Adam Stylinski wrote:
>Yes, I'm using 13.0-RELEASE, with the latest security updates from freebsd-update.  If this is that issue, I'm glad it's known and fixed, but I'd really hoped it'd get backported to -RELEASE :(.
At this time, you need to build a kernel from patched sources. You can either use stable/13 kernel
sources (which, of course, pulls in other changes) or apply the patch that reverts r367492 that is
an attachment on the PR to releng13.0 kernel sources.

You could ask rscheff@freebsd.org (who is the author of r367492 and of the patch that is in stable/13
and is believed to have fixed this problem) if he has considered asking re@freebsd.org w.r.t. doing
the patch as an errata for releng13.0.

>Other info I forgot to include: the firmware on both adapters is the same and the latest for the ConnectX-3 series.  The ring size, channels, offload, and all other settings are the driver defaults.  This issue is not limited to just this workload, it occurs on other files as well.
If you look at r367492, it changed the timing and lock semantics related to TCP socket upcalls.
As such, it probably doesn't matter what network interface, etc, that you are using.
--> Although I would not recommend it, you could use NFSv3 over UDP, since this is a TCP-specific issue.

rick