Re: panic: nfsv4root ref cnt cpuid = 1

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Wed, 25 Sep 2024 00:30:06 UTC
On Tue, Sep 24, 2024 at 10:54 AM John F Carr <jfc@mit.edu> wrote:
>
>
>
> > On Sep 24, 2024, at 13:32, J David <j.david.lists@gmail.com> wrote:
> >
> > On Mon, Sep 23, 2024 at 6:38 PM Rick Macklem <rick.macklem@gmail.com> wrote:
> >> If you can easily get the source line# for nfsrpc_lookup+0x87f, that
> >> could be helpful.
> >
> > Sure. I did it via lldb and got nfs_clrpcops.c 1697. Your method gives
> > the same result.
> >
> > According to the github version of 14.1-RELEASE, that's the "if (ndp
> > != NULL) {" after the call to nfscl_openrelease().
> >
> > Per lldb, the actual instruction at that address is:
> >
> > testq  %r13, %r13
> >
> > My knowledge of amd64 assembler is nearly nil, but I *think* this
> > corresponds to checking if ndp is null. And I think that %r13 is a
> > register, so I'm not sure that could cause a page fault. Maybe the
> > trace indicates that that's the line it would have come back to if
> > something in nfscl_openrelease() hadn't gone wrong?
> >
> > Thanks!
> >
>
> The stack dump on the console tends to omit the frame that faulted.
> The fault was probably in nfscl_openrelease or something it called.
> The faulting instruction address 0xffffffff809da260 should be accurate.
> The faulting data address 0x28 corresponds to the offset of field
> nfsow_rwlock in struct nfsclowner.  Perhaps in nfscl_openrelease
> the expression op->nfo_own is NULL and the fault is in one of the
> two function calls in this code block around line 850 of nfs_clstate.c.
>
>         owp = op->nfso_own;
>         if (NFSHASONEOPENOWN(nmp))
>                 nfsv4_relref(&owp->nfsow_rwlock);
>         else
>                 nfscl_lockunlock(&owp->nfsow_rwlock);
Yes, that sounds reasonable and I think it is another symptom of
the same underlying problem.
It is quite possible that this problem caused the other (hang I think?)
that you reported before.

A few years ago, I tried to avoid the extra RPC of doing an Open
after a Lookup, by combining them in the same compound RPC.
It turned out this did not work for multiple open owners, so it was
only enabled for "oneopenown".
(Since few use this mount option, I have not seen issues with it
reported by others and it worked ok for my testing.)

However, I now realize that Open works because a vnode lock
on the vnode being opened guarantees that the Open does not
go away during the VOP_OPEN().
Unfortunately, for VOP_LOOKUP(), it is the directory that is vnode
locked and a vnode lock on the file being looked up is not acquired
until the Lookup reply is processed.
--> To fix this, I think I need to delay the open processing until after
      the vnode lock on the file has been acquired
or
--> Just back this Open in Lookup patch out, since it can only be
     made to work for "oneopenown" and it only reduces each Open
     by one RPC.

The trivial patch I posted just disables the Open in Lookup RPC
optimization, so it should have the same effect as backing the patch
(it is actually several git commits) out.

The good news is that it looks like I finally know why you were
having problems others were not reporting. (I have no idea why 14.1
is crashing when previous versions would hang, etc, but crashes are
much easier to diagnose.)

rick

>
>
> John Carr
>