Re: git: c33509d49a6f - main - gssd: Fix handling of the gssname=<name> NFS mount option

From: Benjamin Kaduk <kaduk_at_mit.edu>
Date: Wed, 11 Jan 2023 22:12:38 UTC
Hi Rick,

On Tue, Jan 10, 2023 at 08:26:23PM -0800, Rick Macklem wrote:
> On Sat, Jan 7, 2023 at 6:04 PM Benjamin Kaduk <bjkfbsd@gmail.com> wrote:
> > This doesn't seem like a good long-term fix.
> > If we're going to have a gssname argument, we should actually make
> > it take effect, rather than silently ignoring it, which is what using GSS_C_NO_NAME
> > does (it indicates the use of "any credential", which ends up meaning the
> > default credential when used on a GSS initiator).
> >
> > It should be possible to inspect the "junk" credential from gss_acquire_cred()
> > and learn more about what happened (perhaps a non-Kerberos mechanism was
> > picked, or the name was in the wrong format) using various gss_inquire_*() calls,
> > as a diagnostic measure.  Unfortunately I don't anticipate having a huge amount of time
> > to put into it anytime soon...
> >
> I found the underlying problem. The upcall RPC from the kernel was timing out
> at 25sec, and the gss_acquire_cred() call had not finished by then.
> (It was close: gss_acquire_cred() took about 27sec.) The kernel code then
> assumed that the gssd(8) daemon had gone away, and closed the upcall socket.
> This caused the gssd(8) daemon to terminate due to a SIGPIPE signal.

Thanks for digging into this more.
Wow, 27 seconds is really long for gss_acquire_cred() in normal operation!
Do you happen to have a packet capture including port 88 traffic during
this behavior?  I kind of have to assume that it is hitting the network to
incur such a long delay, as I'm not thinking of anything purely local that
would take so long.  (Actually, in the original model, this call would not
have hit the network at all, with that being deferred until the
security-context-establishment calls, but some modern extensions have
changed things.)

> Increasing the timeout makes it work.
> 
> I am now "on the fence" w.r.t. leaving this patch in. As I noted, I think it
> is safe to do, since the credential cache used by the gssd(8) daemon should
> only have a TGT for the host-based client credential.
> Without the patch, the mount takes almost 30sec instead of a fraction of a
> second with the patch (assuming the timeout has been increased, which turns
> out to be needed for the case where a user's TGT has expired and they attempt
> to access the mount).
> 
> If you really think it should be reverted, I can do that.

I'm not going to specifically ask for a revert until we've tried a bit more
to figure out the root cause of the issue.  The behavior is pretty weird,
and I'd like to see more data before making a decision.

> Thanks for your comments, rick
> ps: I will be committing a change to increase the timeout.

I see that, thanks for putting in the additional workaround quickly.


Thanks again,

Ben