kernel: nfsrv_cache_session: no session IPaddr=10.0.0.8, check NFS clients for unique /etc/hostid's
Date: Fri, 27 Sep 2024 15:55:36 UTC
(Posting this separately because, due to timing and conditions, I'm reasonably sure it's unrelated to the other issue.)

While recovering from the problems earlier today, this was dominating the syslog on the NFS fileserver:

Sep 27 09:02:07 fs kernel: nfsrv_cache_session: no session IPaddr=10.0.0.8, check NFS clients for unique /etc/hostid's
Sep 27 09:02:38 fs syslogd: last message repeated 31 times
Sep 27 09:04:39 fs syslogd: last message repeated 121 times
Sep 27 09:14:40 fs syslogd: last message repeated 599 times
Sep 27 09:24:41 fs syslogd: last message repeated 599 times
Sep 27 09:34:43 fs syslogd: last message repeated 600 times
Sep 27 09:44:44 fs syslogd: last message repeated 600 times
Sep 27 09:54:45 fs syslogd: last message repeated 600 times
Sep 27 10:02:05 fs syslogd: last message repeated 439 times

That started during the incident. It looks like it began right about the time I rebooted 10.0.0.8 a second time (to switch it back to "nullfs mode"), with the server logging "last message repeated 600 times" every ten minutes, i.e. once per second.

On the client side, it's spewing this with equal frequency:

Sep 27 14:50:01 worker8 kernel: Initiate recovery. If server has not rebooted, check NFS clients for unique /etc/hostid's

It's just that one client machine out of 28, and it happens regardless of whether jobs are run via nullfs or NFS. And I can absolutely guarantee that the /etc/hostid files are unique:

$ cluster -p -c job_runners uname -n | wc -l
28
$ cluster -p -c job_runners cat /etc/hostid | sort -u | wc -l
28
$ cluster -p -c job_runners sysctl kern.hostid | sort -u | wc -l
28

This continued every second, even hours after the incident, while everything else appeared to be running normally. I spared that machine out of the cluster, waited for it to quiesce, and then manually unmounted its NFS mount to the server. Even so, the messages continued to be generated on both client and server. Finally, I halted the client machine.
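For anyone without a cluster-exec helper like the `cluster` tool above, the same uniqueness check reduces to "number of hosts equals number of distinct hostids". Here is a minimal local sketch of that comparison; the scratch files simulating three clients' /etc/hostid contents (and their values) are assumptions for illustration, not real cluster data:

```shell
#!/bin/sh
# Sketch: verify that collected hostids are all distinct.
# In practice each file would be gathered from a client, e.g. via
# ssh; here we fabricate three sample files to show the check.
set -eu
tmp=$(mktemp -d)

# Hypothetical per-client hostid snapshots:
printf 'aaaa-1111\n' > "$tmp/worker1"
printf 'bbbb-2222\n' > "$tmp/worker2"
printf 'cccc-3333\n' > "$tmp/worker3"

# Count hosts, then count distinct hostid values across all hosts.
total=$(ls "$tmp" | wc -l | tr -d ' ')
unique=$(cat "$tmp"/* | sort -u | wc -l | tr -d ' ')

if [ "$total" -eq "$unique" ]; then
    echo "OK: all $total hostids unique"
else
    echo "DUPLICATES: only $unique unique of $total hostids"
fi
rm -rf "$tmp"
```

If any two clients shared a hostid, the unique count would drop below the host count and the check would flag it, which is exactly the condition the kernel message is telling you to rule out.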
It kept at it all the way down:

Uptime: 3h56m13s
Initiate recovery. If server has not rebooted, check NFS clients for unique /etc/hostid's
Initiate recovery. If server has not rebooted, check NFS clients for unique /etc/hostid's
Initiate recovery. If server has not rebooted, check NFS clients for unique /etc/hostid's
Initiate recovery. If server has not rebooted, check NFS clients for unique /etc/hostid's
Initiate recovery. If server has not rebooted, check NFS clients for unique /etc/hostid's
uhub0: detached
acpi0: Powering system off

The messages stopped on the server after that, and did not recur once I restarted the client and returned it to service. I don't know what's up with that, but it seems strange. Possibly rebooting it twice (~30 min apart), during a situation where not everything was working properly, put NFS on that client machine into an unhappy state?