Re: NFS, intermittent 'RPC struct is bad' errors

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Wed, 19 Jun 2024 14:04:48 UTC
On Tue, Jun 18, 2024 at 11:32 PM Lexi Winter <lexi@le-fay.org> wrote:
>
> hi,
>
> i have a few systems running NFSv4 on FreeBSD, using Kerberos (MIT
> Kerberos KDC), with the server exporting ZFS filesystems.
>
> recently i've noticed intermittent errors of 'RPC struct is bad' when
> writing to the NFS server, which usually resolves itself after retrying.
> for example:
>
> % rsync -iavP /scratch/Star.Trek.Prodigy.S01E* .
> sending incremental file list
> >f++++++++++ Star.Trek.Prodigy.S01E01E02.1080p.WEBRip.x265-KONTRAST.mkv
>          32,768   0%    0.00kB/s    0:00:00  rsync: [receiver] write failed on "/data/public/TV/Star Trek Prodigy/Season 01/Star.Trek.Prodigy.S01E01E02.1080p.WEBRip.x265-KONTRAST.mkv": RPC struct is bad (72)
> rsync error: error in file IO (code 11) at receiver.c(380) [receiver=3.3.0]
>
> rsync: [sender] write error: Broken pipe (32)

The "RPC struct is bad" error just means that an RPC message could not be
decoded, because it was trashed for some reason.

> % rsync -iavP /scratch/Star.Trek.Prodigy.S01E* .
> sending incremental file list
> >f.st....... Star.Trek.Prodigy.S01E01E02.1080p.WEBRip.x265-KONTRAST.mkv
>     912,704,431 100%   96.51MB/s    0:00:09 (xfr#1, to-chk=18/19)
> >f++++++++++ Star.Trek.Prodigy.S01E03.1080p.WEBRip.x265-KONTRAST.mkv
>     477,408,567 100%  100.06MB/s    0:00:04 (xfr#2, to-chk=17/19)
> [...]
>
> the client is running FreeBSD 15.0-CURRENT from around May 24, and the
> server is running a slightly older 15.0-CURRENT from around May 23.

There was an issue fixed in main/current by commits on Apr. 25 (client 8efba70,
server 54c3aa0). If you somehow ended up with the client having the patch and
the server not having it, that could possibly explain it.

Also, about that breakage: I was tricked by wireshark into believing the code
was wrong. It actually turned out that wireshark was broken. On Apr. 25, I put
things back to where the RFCs said they should be.

And this breakage should only occur if delegations are enabled, which will only
happen if you set "vfs.nfsd.issue_delegations=1" on the server (not on
by default).
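
A quick way to rule this out is to look at that sysctl on the server. This is
only a minimal sketch: the describe_delegations helper is made up for
illustration, and it interprets nothing beyond the values 0 and 1.

```shell
#!/bin/sh
# Hypothetical helper: interpret a value of vfs.nfsd.issue_delegations.
describe_delegations() {
    case "$1" in
        0) echo "delegations disabled" ;;
        1) echo "delegations enabled" ;;
        *) echo "unknown" ;;
    esac
}

# Run on the NFS server. If the sysctl cannot be read, the value is
# empty and "unknown" is printed.
describe_delegations "$(sysctl -n vfs.nfsd.issue_delegations 2>/dev/null)"
```

If it reports "delegations disabled", the Apr. 25 change is presumably not
involved.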

I doubt this is what you are seeing.

>
> /etc/exports on the server is pretty standard:
>
> /data/public                    -sec=krb5:krb5i:krb5p   -network 2001:8b0:aab5::/48
> /data/public/Books              -sec=krb5:krb5i:krb5p   -network 2001:8b0:aab5::/48
> /data/public/CalibreLibrary     -sec=krb5:krb5i:krb5p   -network 2001:8b0:aab5::/48
> /data/public/Comics             -sec=krb5:krb5i:krb5p   -network 2001:8b0:aab5::/48
> /data/public/Films              -sec=krb5:krb5i:krb5p   -network 2001:8b0:aab5::/48
> /data/public/Miscellaneous      -sec=krb5:krb5i:krb5p   -network 2001:8b0:aab5::/48
> V4: /data                       -sec=sys:krb5:krb5i:krb5p       -network 2001:8b0:aab5::/48
>
> client mount options:
>
> hemlock.eden.le-fay.org:/public /data/public    nfs     rw,nfsv4,minorversion=2,sec=krb5p,gssname=host,bgnow,proto=tcp6,nconnect=4,rsize=1048576,wsize=1048576,noncontigwr      0 0

You might try getting rid of the "noncontigwr" option to see if it helps,
since I do not test that often.
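
As a sketch of what the trimmed option list would look like, the snippet below
removes that one option from the string in the fstab line quoted above. The
drop_opt helper is hypothetical and not part of any FreeBSD tool.

```shell
#!/bin/sh
# Mount options copied from the fstab line quoted above.
opts='rw,nfsv4,minorversion=2,sec=krb5p,gssname=host,bgnow,proto=tcp6,nconnect=4,rsize=1048576,wsize=1048576,noncontigwr'

# Hypothetical helper: drop one exact option name from a comma-separated list.
drop_opt() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -vxF -- "$2" | paste -sd, -
}

new_opts=$(drop_opt "$opts" noncontigwr)
echo "$new_opts"
```

The result would go in the options field of the fstab entry, presumably
followed by unmounting and mounting the file system again.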

>
> is there anything more i can do investigate this?  would a tcpdump
> capture of the error be useful (considering all the RPC traffic is
> Kerberos-encrypted)?

The only thing that a tcpdump (pulled into wireshark after capture) might
show you is TCP layer issues.

Unless getting rid of "noncontigwr" fixes the problem, it sounds like some
sort of corruption occurring in the network fabric. This might show up in
wireshark as TCP timeouts or similar.
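
For the capture itself, something along these lines might do. It is a sketch
only: capture_cmd is a hypothetical helper that prints the command rather than
running it, "em0" and "nfs.pcap" are placeholders, and port 2049 assumes the
standard NFS port.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a tcpdump command line that captures
# full packets to and from the NFS server on the standard NFS port, 2049.
# Arguments: interface, server host name, output file.
capture_cmd() {
    printf 'tcpdump -i %s -s 0 -w %s host %s and port 2049\n' "$1" "$3" "$2"
}

# "em0" and "nfs.pcap" are placeholders; the host name is from the fstab line.
capture_cmd em0 hemlock.eden.le-fay.org nfs.pcap
```

The printed command would be run as root on the client while reproducing the
failure. After loading the file into wireshark, a display filter such as
tcp.analysis.retransmission should show any retransmitted segments.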

No one else has reported anything like this recently.

rick