NFS: rpc.statd/lockd becomes unresponsive
Josh Beard
josh at signalboxes.net
Fri Mar 30 23:44:15 UTC 2012
Hello,
We've recently set up a FreeBSD 9.0-RELEASE (x64) system to test as an
NFS server for "live" network homes for Mac clients (mostly 10.5 and
10.6 clients).
We're a public school district and normally have around 150-200 users
logged in at a time with network homes. Currently, we're using netatalk
(AFP) on a Linux box, after migrating from an aging Mac OS X server.
Unfortunately, netatalk has some serious performance issues under the
load we put on it, and we'd like to migrate to NFS.
We've tried several Linux distributions and various kernels, and we're
now testing FreeBSD (we also tested FreeNAS) with similar setups.
Unfortunately, they all suffer from the same issue.
As a test, I have a series of scripts that simulate user activity on the
clients (e.g. opening Word, opening a browser, doing some reads/writes
with dd, etc.). After a while, NFS on the server gets into a state where,
as far as I can tell, rpc.statd can't talk to rpc.lockd. Since these are
Mac clients, they all get a rather ugly dialog box stating that their
connection to the server has been lost.
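For reference, each client runs roughly the following against its NFS home
(a simplified sketch of the real scripts, which also drive the GUI apps; the
test user directory and iteration counts here are just placeholders):

#!/bin/sh
# Simplified per-client load loop (sketch only). The home path below is a
# placeholder for one of the 150 test home directories.
HOMEDIR=/net/freenas.dsdk12.schoollocal/mnt/homes/testuser01

while true; do
    # sequential write and read-back with dd, similar to what the apps generate
    dd if=/dev/zero of="$HOMEDIR/ddtest" bs=1m count=32 2>/dev/null
    dd if="$HOMEDIR/ddtest" of=/dev/null bs=1m 2>/dev/null

    # a burst of small-file creates/removes to exercise attribute and lock traffic
    i=0
    while [ $i -lt 50 ]; do
        echo "test $i" > "$HOMEDIR/small.$i"
        i=$((i + 1))
    done
    rm -f "$HOMEDIR"/small.* "$HOMEDIR/ddtest"

    sleep 5
done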
It's worth mentioning that this server is a KVM 'guest' on a Linux
server. I'm aware of some I/O issues there, but I don't have a decent
piece of hardware to really test this on. I allocated 4 CPUs to it and
10GB of RAM. I've tested with the virtio net drivers and without.
Considering I've seen the same symptoms on around 6 Linux distributions,
with various kernels, FreeNAS, and FreeBSD, I wouldn't be surprised to
get the same results if I weren't virtualized.
I haven't really done any tuning on the FreeBSD server; it's fairly vanilla.
We have around ~2600 machines throughout our campus, with limited remote
management capabilities (that's on the big agenda to tackle), so
changing NFS mount options there would be rather difficult. These are
LDAP accounts with the NFS mounts in LDAP as well, for what it's worth.
The clients mount it pretty vanilla (output of 'mount' on client):
freenas.dsdk12.schoollocal:/mnt/homes on /net/freenas.dsdk12.schoollocal/mnt/homes (nfs, nodev, nosuid, automounted, nobrowse)
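If it helps with reproducing outside the automounter, the equivalent manual
mount from a client would be roughly the following (the target directory is
arbitrary; the real mounts come from the LDAP maps):

mkdir -p /tmp/homes-test
sudo mount -t nfs -o nodev,nosuid freenas.dsdk12.schoollocal:/mnt/homes /tmp/homes-test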
On the server, my /etc/exports looks like this:
/srv/homes -alldirs -network 172.30.0.0/16
This export doesn't have a lot of data - it's 150 small home directories
of test accounts. No other activity is being done on this server. The
filesystem is UFS.
/etc/rc.conf on the server:
rpcbind_enable="YES"
nfs_server_enable="YES"
mountd_flags="-r -l"
nfsd_enable="YES"
mountd_enable="YES"
rpc_lockd_enable="YES"
rpc_statd_enable="YES"
nfs_server_flags="-t -n 128"
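lockd and statd are just picking random ports (see the rpcinfo output
further down); I haven't tried pinning them to fixed ports, which as I
understand it would look something like this (the port numbers here are
arbitrary examples):

rpc_lockd_flags="-p 4045"
rpc_statd_flags="-p 4046"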
When this occurs, /var/log/messages starts to fill up with this:
Mar 30 16:35:18 freefs kernel: Failed to contact local NSM - rpc error 5
Mar 30 16:35:20 freefs rpc.statd: unmon request from localhost, no matching monitor
Mar 30 16:35:44 freefs rpc.statd: unmon request from localhost, no matching monitor
-- repeated a few times every few seconds --
Mar 30 16:54:50 freefs rpc.statd: Unsolicited notification from host hs00508s4434.dsdk12.schoollocal
Mar 30 16:55:01 freefs rpc.statd: Unsolicited notification from host hs00520s4539.dsdk12.schoollocal
Mar 30 16:55:10 freefs rpc.statd: Failed to call rpc.statd client at host localhost
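Is restarting the daemons, i.e. something like:

/etc/rc.d/lockd restart
/etc/rc.d/statd restart

the expected way to recover when this happens, or is there something
better? I'd obviously rather find the root cause.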
nfsstat shortly after a failure:
Rpc Info:
TimedOut Invalid X Replies Retries Requests
0 0 0 0 1208
Cache Info:
Attr Hits    Misses   Lkup Hits    Misses   BioR Hits    Misses   BioW Hits    Misses
      177       951         226        28           3         6           0         2
BioRLHits    Misses   BioD Hits    Misses   DirE Hits    Misses   Accs Hits    Misses
       49         3          13         5           9         0         148         9
Server Info:
  Getattr   Setattr    Lookup  Readlink      Read     Write    Create    Remove
   262698    101012   1575347        29   1924761   2172712         0     43792
   Rename      Link   Symlink     Mkdir     Rmdir   Readdir  RdirPlus    Access
    27447         0        21      5596      1691    118073         0   2596146
    Mknod    Fsstat    Fsinfo  PathConf    Commit
        0     83638       108       108    183632
Server Ret-Failed
0
Server Faults
0
Server Cache Stats:
Inprog Idem Non-idem Misses
0 0 0 9172982
Server Write Gathering:
WriteOps WriteRPC Opsaved
2172712 2172712 0
rpcinfo shortly after a failure:
program version netid address service owner
100000 4 tcp 0.0.0.0.0.111 rpcbind superuser
100000 3 tcp 0.0.0.0.0.111 rpcbind superuser
100000 2 tcp 0.0.0.0.0.111 rpcbind superuser
100000 4 udp 0.0.0.0.0.111 rpcbind superuser
100000 3 udp 0.0.0.0.0.111 rpcbind superuser
100000 2 udp 0.0.0.0.0.111 rpcbind superuser
100000 4 tcp6 ::.0.111 rpcbind superuser
100000 3 tcp6 ::.0.111 rpcbind superuser
100000 4 udp6 ::.0.111 rpcbind superuser
100000 3 udp6 ::.0.111 rpcbind superuser
100000 4 local /var/run/rpcbind.sock rpcbind superuser
100000 3 local /var/run/rpcbind.sock rpcbind superuser
100000 2 local /var/run/rpcbind.sock rpcbind superuser
100005 1 udp6 ::.2.119 mountd superuser
100005 3 udp6 ::.2.119 mountd superuser
100005 1 tcp6 ::.2.119 mountd superuser
100005 3 tcp6 ::.2.119 mountd superuser
100005 1 udp 0.0.0.0.2.119 mountd superuser
100005 3 udp 0.0.0.0.2.119 mountd superuser
100005 1 tcp 0.0.0.0.2.119 mountd superuser
100005 3 tcp 0.0.0.0.2.119 mountd superuser
100024 1 udp6 ::.3.191 status superuser
100024 1 tcp6 ::.3.191 status superuser
100024 1 udp 0.0.0.0.3.191 status superuser
100024 1 tcp 0.0.0.0.3.191 status superuser
100003 2 tcp 0.0.0.0.8.1 nfs superuser
100003 3 tcp 0.0.0.0.8.1 nfs superuser
100003 2 tcp6 ::.8.1 nfs superuser
100003 3 tcp6 ::.8.1 nfs superuser
100021 0 udp6 ::.3.248 nlockmgr superuser
100021 0 tcp6 ::.2.220 nlockmgr superuser
100021 0 udp 0.0.0.0.3.202 nlockmgr superuser
100021 0 tcp 0.0.0.0.2.255 nlockmgr superuser
100021 1 udp6 ::.3.248 nlockmgr superuser
100021 1 tcp6 ::.2.220 nlockmgr superuser
100021 1 udp 0.0.0.0.3.202 nlockmgr superuser
100021 1 tcp 0.0.0.0.2.255 nlockmgr superuser
100021 3 udp6 ::.3.248 nlockmgr superuser
100021 3 tcp6 ::.2.220 nlockmgr superuser
100021 3 udp 0.0.0.0.3.202 nlockmgr superuser
100021 3 tcp 0.0.0.0.2.255 nlockmgr superuser
100021 4 udp6 ::.3.248 nlockmgr superuser
100021 4 tcp6 ::.2.220 nlockmgr superuser
100021 4 udp 0.0.0.0.3.202 nlockmgr superuser
100021 4 tcp 0.0.0.0.2.255 nlockmgr superuser
300019 1 tcp 0.0.0.0.2.185 amd superuser
300019 1 udp 0.0.0.0.2.162 amd superuser
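If it's useful, next time this happens I can also probe the daemons
directly on the server to see whether they still answer a null call, e.g.:

rpcinfo -u localhost status      # rpc.statd over UDP
rpcinfo -t localhost nlockmgr    # kernel NLM over TCP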
The load can get fairly high during my 'stress' tests, but not *that*
high. I'm surprised to see symptoms that hit every connected user at the
same time; I would have expected slowdowns rather than the outright
failures I'm seeing.
Any ideas or nudges in the right direction are most welcome. This is
severely plaguing us and our students :\
Thanks,
Josh