Re: FreeBSD 12.3/13.1 NFS client hang

From: Rick Macklem <rmacklem_at_uoguelph.ca>
Date: Sat, 28 May 2022 00:12:59 UTC
Andreas Kempe <kempe@lysator.liu.se> wrote:
[stuff snipped]
>
> The one thing we have seen logged are messages along the lines of:
> kernel: newnfs: server 'mail' error: fileid changed. fsid 4240eca6003a052a:0:
> expected fileid 0x22, got 0x2. (BROKEN NFS SERVER OR MIDDLEWARE)
I think this can also happen if a Getattr operation fails with an error at
the server. The client is then left with "default attributes", in which a
default fileid of 0x2 (the root inode #) is what you would expect to see.
So I suspect this is what is happening. In general, failed Getattrs will be
problematic, but I'm not sure whether they can cause hangs?

If you can capture packets when these get logged, we can confirm if a
Getattr operation has failed with an error.
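
For the capture, something along these lines should be enough (interface
name and the 'mail' hostname are placeholders for your setup; port 2049 is
the standard NFS port):

```sh
# Capture full NFS packets to/from the server for later analysis
# in wireshark; -s 0 keeps whole packets rather than truncating.
tcpdump -s 0 -i em0 -w /tmp/nfs.pcap host mail and port 2049
```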

rick

> Also, maybe I'm old fashioned, but I find "ps axHl" useful, since it shows
> where all the processes are sleeping.
> And "procstat -kk" covers all of the locks.
>

I don't know if it is a matter of being old fashioned as much as one
of taste. :) In future dumps, I can provide both ps axHl and procstat -kk.

> > Below are procstat kstack $PID invocations showing where the processes
> > have hung. In the nfsv4_sequencelookup it seems hung waiting for
> > nfsess_slots to have an available slot. In the second nfs_lock case,
> > it seems the processes are stuck waiting on vnode locks.
> >
> > These issues appear seemingly at random, but can also be triggered by
> > operations that open a lot of files or create a lot of file locks. An
> > example that can often provoke a hang is performing a
> > recursive grep through a large file hierarchy like the FreeBSD
> > codebase.
> >
> > The NFS code is large and complicated so any advice is appreciated!
> Yea. I'm the author and I don't know exactly what it all does ;-)
>
> > Cordially,
> > Andreas Kempe
> >
>
> [...]
>
> Not very useful unless you have all the processes and their locks to try and figure out what is holding
> the vnode locks.
>

Yes, I sent this mostly in the hope that it might be something that
someone has seen before. I understand that more verbose information is
needed to track down the lock contention.

I'll switch our machines back to using hard mounts and try to get as
much diagnostic information as possible when the next lockup happens.

Do you have any good suggestions for tracking down the issue? I've
been contemplating enabling WITNESS or building a kernel with debug
information so that I can hook in the kernel debugger.
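
Concretely, I was thinking of a debug kernel config along these lines
(standard FreeBSD options, though I'm unsure which are worth the runtime
overhead for this case):

```
options		WITNESS			# lock order checking
options		INVARIANTS		# kernel consistency checks
options		INVARIANT_SUPPORT
options		DDB			# in-kernel debugger
makeoptions	DEBUG=-g		# debug symbols for kgdb
```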

Thank you very much for your reply!
Cordially,
Andreas Kempe

> rick
>
>