kernel: nfsrv_cache_session: no session IPaddr=10.0.0.8, check NFS clients for unique /etc/hostid's

From: J David <j.david.lists_at_gmail.com>
Date: Fri, 27 Sep 2024 15:55:36 UTC
(Posting this separately because, due to timing and conditions, I'm
reasonably sure it's unrelated to the other issue.)

While recovering from the problems earlier today, this was dominating
the syslog on the NFS fileserver.

Sep 27 09:02:07 fs kernel: nfsrv_cache_session: no session
IPaddr=10.0.0.8, check NFS clients for unique /etc/hostid's
Sep 27 09:02:38 fs syslogd: last message repeated 31 times
Sep 27 09:04:39 fs syslogd: last message repeated 121 times
Sep 27 09:14:40 fs syslogd: last message repeated 599 times
Sep 27 09:24:41 fs syslogd: last message repeated 599 times
Sep 27 09:34:43 fs syslogd: last message repeated 600 times
Sep 27 09:44:44 fs syslogd: last message repeated 600 times
Sep 27 09:54:45 fs syslogd: last message repeated 600 times
Sep 27 10:02:05 fs syslogd: last message repeated 439 times

That started during the incident, right about the time I rebooted
10.0.0.8 a second time (to switch it back to "nullfs mode"). From then
on, the server logged "last message repeated 600 times" every ten
minutes, i.e., once per second.

On the client side, it's spewing this with equal frequency:

Sep 27 14:50:01 worker8 kernel: Initiate recovery. If server has not
rebooted, check NFS clients for unique /etc/hostid's

It's just that one client machine out of 28. It happens regardless of
whether jobs are run via nullfs or NFS. And I can absolutely guarantee
that the /etc/hostid files are unique:

$ cluster -p -c job_runners uname -n | wc -l
      28
$ cluster -p -c job_runners cat /etc/hostid | sort -u | wc -l
      28
$ cluster -p -c job_runners sysctl kern.hostid | sort -u | wc -l
      28
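(For anyone without a parallel-exec wrapper like the "cluster" tool
above, here's a rough sketch of the same uniqueness check. The host
names, ssh access, and collection directory are assumptions; this
version fakes three clients locally just to show the comparison.)

```shell
#!/bin/sh
# Hypothetical sketch, not the "cluster" tool used above: gather each
# client's /etc/hostid into one file per host, then compare the number
# of hosts against the number of distinct hostid values.
set -eu
dir=$(mktemp -d)

# Real collection step would be, per host (assumed ssh access):
#   ssh "$h" cat /etc/hostid > "$dir/$h"
# Here we fake three clients, one with a duplicated hostid:
printf 'aaaa-1111\n' > "$dir/worker1"
printf 'bbbb-2222\n' > "$dir/worker2"
printf 'aaaa-1111\n' > "$dir/worker3"   # duplicate of worker1

total=$(ls "$dir" | wc -l)
unique=$(cat "$dir"/* | sort -u | wc -l)

if [ "$total" -eq "$unique" ]; then
    echo "all $total hostids unique"
else
    echo "duplicate hostids: $total hosts, $unique distinct values"
fi
rm -rf "$dir"
```

With genuinely unique hostids, total and unique match, which is what
the three 28-vs-28 checks above demonstrate.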

This continued every second, even hours after the incident, while
everything else appeared to be running normally. I spared that machine
out of the cluster, waited for it to quiesce, and then manually
unmounted its NFS mount from the server. Even so, the messages kept
appearing on both client and server.

Finally, I halted the client machine. It kept at it all the way down:

Uptime: 3h56m13s
Initiate recovery. If server has not rebooted, check NFS clients for
unique /etc/hostid's
Initiate recovery. If server has not rebooted, check NFS clients for
unique /etc/hostid's
Initiate recovery. If server has not rebooted, check NFS clients for
unique /etc/hostid's
Initiate recovery. If server has not rebooted, check NFS clients for
unique /etc/hostid's
Initiate recovery. If server has not rebooted, check NFS clients for
unique /etc/hostid's
uhub0: detached
acpi0: Powering system off

The messages stopped on the server after that, and did not recur once
I restarted the machine and returned it to service.

I don't know what's up with that, but it seems strange. Possibly
rebooting that client twice (~30 minutes apart), during a situation
where not everything was working properly, put its NFS state into an
unhappy condition?