Re: panic: nfsv4root ref cnt cpuid = 1

From: J David <j.david.lists_at_gmail.com>
Date: Fri, 27 Sep 2024 15:33:15 UTC
Circling back to whether it's better to NFS mount once and nullfs
mount lots, or NFS mount lots, I've unfortunately gathered some
additional data.

We set up a version of our code that mounts the requisite NFS
filesystem directly for each job/jail root. That worked fine in
small-scale testing.

In a wider deployment, however, disaster ensued. With a few thousand
mounts, we started to observe two separate forms of bad behavior:

- requests from established sessions would hang indefinitely, leading
to process backlogs, with client machines going OOM and becoming
unresponsive en masse.
- the NFS server appeared to be serving empty directories.

The first one is self-explanatory. The second one might bear further
explanation.

The server runs ZFS. There are several datasets that contain job roots.

E.g.:

tank
tank/roots
tank/roots/a
tank/roots/b
tank/roots/c
tank/roots/d

The /etc/exports looks like:

V4: /tank -sec=sys

For client machines using nullfs, there is an /etc/fstab line like:

fs:/roots /roots nfs ro,nfsv4,minorversion=2,tcp,nosuid,noatime,nolockd,noresvport,oneopenown 0 0
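
For reference, the options that actually take effect on a given
client can be double-checked with the stock tools, e.g.:

$ nfsstat -m    # each NFS mount and its effective flags
$ mount -t nfs  # the active NFS mounts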

Under ordinary operation, NFSv4 exports the child datasets correctly.

E.g.:

$ ls /roots/a
bin     etc     lib     net     proc    sbin    usr
dev     home    libexec root    tmp     var

Then a client does:

# for a "Type A" job
/sbin/mount_nullfs -o ro -o nosuid /roots/a /jobs/(job-uuid)
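
In sketch form (simplified, not our actual code), the per-job flow on
a client is roughly:

# /roots is the single NFS mount shared by every job on the machine
job=$(uuidgen)
mkdir -p /jobs/$job
/sbin/mount_nullfs -o ro -o nosuid /roots/a /jobs/$job
# ... run the job in a jail rooted at /jobs/$job ...
umount /jobs/$job && rmdir /jobs/$job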

During the failure, I observed:

$ ls /roots
a b c d
$ ls /roots/a
$ ls /roots/b
$ ls /roots/c
$ ls /roots/d

I.e., the server appeared to have "forgotten" to descend into the
child datasets and behaved as NFSv3 would have done in that situation.

The server in question is FreeBSD 14.1-RELEASE-r5. There were no
console diagnostics, nothing in dmesg, and negligible visible load
(load average below 1.0, nfsd using ~7% of one CPU).
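
For concreteness, those figures are just what the standard tools
report, e.g.:

$ uptime              # load averages
$ top -b | head -20   # per-process CPU usage, including nfsd
$ nfsstat -e -s       # NFSv4 server counters, for anyone digging deeper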

The individual client mounts (the ones that were hanging) were a
little different, because they would go straight to the subdirectory
they wanted:

# for a "Type A" job
/sbin/mount_nfs -o tcp,nfsv4,minorversion=2,noatime -o ro -o nosuid -o noresvport fs:/roots/a /jobs/(job-uuid)

Once all the client machines were restarted in "nullfs mode", the
server returned to normal operation without further intervention, so
the server behavior does appear to be directly related to the number
of client NFS mounts. I couldn't measure it exactly at the time of
the incident, but I would ballpark it at about 5,000 +/- 2,000 NFS
mounts across 28 client machines.
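
For the record, a rough way to tally that after the fact is something
like this (clients.txt being just a list of the client hostnames):

# count NFS mounts on each client, then total across the fleet
while read h; do
  ssh "$h" 'mount -t nfs | wc -l'
done < clients.txt | awk '{sum += $1} END {print sum}'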

FWIW, during the ~48-hour window when we were testing direct NFS
instead of nullfs on a slowly increasing number of machines, no
client using direct NFS experienced the kernel panic we're discussing
here. (That's without the patch.) Contrast that with 2-3 total panics
per day among the machines using nullfs. So it's possible that
indirection through nullfs aggravates that particular bug.

Alas, based on the above, nullfs seems to be necessary for now.
Getting the patch tested & deployed is now top of my list.

Thanks!