Trouble with NFSd under 6.1-Stable, any ideas?

Tue May 23 01:10:49 PDT 2006

On Mon, May 22, 2006 at 05:43:32PM -0400, Rong-en Fan wrote:
> On 5/14/06, Kris Kennaway <kris at obsecurity.org> wrote:
> >On Sun, May 14, 2006 at 02:28:55PM -0400, Howard Leadmon wrote:
> >>
> >>    Hello All,
> >>
> >>  I have been running FBSD a long while, and actually running since the
> >> 5.x releases on the server I am having troubles with.   I basically have
> >> a small network and just use NIS/NFS to link my various FBSD and Solaris
> >> machines together.
> >>
> >>  This has all been running fine up till a few days ago, when all of a
> >> sudden NFS came to a crawl, and CPU usage so high the box appears to
> >> freeze almost.  When I had 6.1-RC running all seemed well, then came the
> >> announcement for the official 6.1 release, so I did the cvs updates, made
> >> world, kernel, and ran mergemaster to get everything up to the 6.1 stable
> >> version.
> >>
> >>  Now after doing this, something is wrong with NFS.   It works, it will
> >> return information and open files, just it's very very slow, and while
> >> performing a request the CPU spike is astounding.  A simple du of my home
> >> directory can take minutes, and machine all but locks up if the request
> >> is done over NFS.  Here is top snip:
> >>
> >>   PID USERNAME   THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
> >>   497 root         1   4    0  1252K   780K -      2  50:42 188.48% nfsd
> >>
> >>
> >>  This is a nice IBM eServer with dual P4-XEON's and a couple GB of RAM on
> >> a disk array, and locally it screams, heck NFS used to scream till I
> >> updated.  I am not really sure what info would be useful in debugging, so
> >> won't post tons of misc junk in this eMail, but if anyone has any ideas
> >> as to how best to figure out and resolve this issue it would sure be
> >> appreciated...
> >
> >Use tcpdump and related tools to find out what traffic is being sent.
> >
> >Also verify that you did not change your system configuration in any
> >way: there have been no changes to NFS since the release, so it is
> >unclear why an update would cause the problem to suddenly occur.
> >
> >Kris
> 
> Hi Kris and Howard,
> 
> As I posted a few days ago, I have problems similar to Howard's
> (some details in the thread "6.1-RELEASE, em0 high interrupt rate
> and nfsd eats lots of cpu" on stable@). After binary searching
> the source tree, I found that
> 
> RELENG_6_1, 2006.04.30.03.57 ok
> RELENG_6_1, 2006.04.30.04.00 bad
> 
> The only commit is kern/vfs_lookup.c, an MFC of rev 1.90 and 1.91.
> With the 04.30 03.57 source + a manually patched vfs_lookup.c rev 1.90,
> the same problem occurs.
> 
> Let me recap the problems I'm seeing:
> 
> 1. a client (either Linux 2.6.16 or FreeBSD 6.1) runs du on
>   an NFS directory
> 2. on the server side, nfsd starts to eat lots of CPU
> 3. the du finishes
> 4. on the server side, nfsd still eats lots of CPU, but there is no
>   NFS traffic. Wait for 5 minutes, and you can still see that nfsd is
>   "running" and eating lots of CPU.
> 
> The FreeBSD 6.1R client uses a UDP mount with fstab options like
> "rw,-L,nosuid,bg,nodev". The Linux client uses a UDP mount with
> fstab options like "defaults,udp,hard,intr,nfsvers=3,rsize=8192,wsize=8192".
> The server's kernel conf is at
> 
> http://www.rafan.org/FreeBSD/nfs/KERNEL
> 
> Some related configuration files:
> 
> /etc/exports
>  /export/dir1 host1 host2...
>  /export/dir2 host1 host2...
> 
> /etc/rc.conf
> nfs_server_enable="YES"
> nfs_server_flags="-u -t -n 16"
> mountd_enable="YES"
> mountd_flags="-r -l -n"
> rpc_lockd_enable="YES"
> rpc_statd_enable="YES"
> rpcbind_enable="YES"
> 
> /etc/fstab:
> /dev/...  /export/dir1 ufs rw,nosuid,noexec 2 2
> /dev/...  /export/dir2 ufs rw,nosuid,noexec,userquota 2 2
> 
> The NFS server is also using amd to mount some backup directories
> from another NFS server. The amd.conf is
> 
> [global]
> browsable_dirs = yes
> map_type = file
> mount_type = nfs
> auto_dir = /nfs
> fully_qualified_hosts = no
> log_file = syslog
> nfs_proto = udp
> nfs_allow_insecure_port = no
> nfs_vers = 3
> # plock = yes
> selectors_on_default = yes
> restart_mounts = yes
> 
> [/backup]
> map_options = type:=direct
> map_name = /etc/amd.direct
> 
> /etc/amd.direct:
> /defaults
> opts:=rw,grpid,resvport,vers=3,proto=udp,nosuid,nodev,rsize=8192,wsize=8192
> backup          type:=nfs;rhost:=nfs2;rfs:=/nfs2/${host}
> 
> 
> If there is anything I can provide to help track this down, please
> let me know. By the way, I tried truss/kdump to see what happens
> when nfsd eats lots of CPU, but in vain. They do not return anything.
> 
I tried your recipe on 7-CURRENT with a locally exported fs, remounted
over NFS. I did not get the behaviour you described.

Could you please provide a backtrace for the nfsd that
eats the CPU (from ddb)? I think it would be helpful to get several
backtraces (i.e., bt <nfsd pid>, cont, bt <nfsd pid> ...) to
see where it is running.
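
A rough sketch of that sequence, under some assumptions that are not
stated above: the server kernel is built with "options KDB" and
"options DDB", the commands are typed on the system console, and the
debug.kdb.enter sysctl is used to drop into the debugger. <nfsd pid>
is a placeholder for the pid of the nfsd that is spinning; the lines
starting with # are annotations, not commands.

```
# from a root shell on the console: enter the debugger (assumed method)
sysctl debug.kdb.enter=1

# at the ddb prompt: take a backtrace, then resume the system
db> bt <nfsd pid>
db> cont

# let the box run for a few seconds, enter ddb again, and repeat
db> bt <nfsd pid>
db> cont
```

If successive traces keep landing in the same few functions, that is
probably where nfsd is looping.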

Also, just in case, does the filesystem that is exported and shows the problem
have quotas enabled? One line of your fstab has userquota, the other does not.