Re: system stalled, no I/O but 100% CPU from nfs

From: Rick Macklem <rick.macklem_at_gmail.com>
Date: Mon, 06 Jan 2025 13:53:38 UTC
On Sun, Jan 5, 2025 at 8:45 PM Peter 'PMc' Much
<pmc@citylink.dinoex.sub.org> wrote:
>
> Cheers,
>
>  This doesn't look good. It goes on for hours. What can be done about it?
> (13.4 client & server)
>
>
> 44 processes:  4 running, 39 sleeping, 1 waiting
> CPU:  0.4% user,  0.0% nice, 99.6% system,  0.0% interrupt,  0.0% idle
> Mem: 21M Active, 198M Inact, 1190M Wired, 278M Buf, 3356M Free
> ARC: 418M Total, 39M MFU, 327M MRU, 128K Anon, 7462K Header, 43M Other
>      332M Compressed, 804M Uncompressed, 2.42:1 Ratio
> Swap: 15G Total, 15G Free
>
>   PID USERNAME    THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
>   417 root          4  52    0    12M  2148K RUN     20:55  99.12% nfscbd
Do you have delegations enabled on your server (that is, is
vfs.nfsd.issue_delegations non-zero)?
If you do not, I have no idea why the server would be doing callbacks,
which is what nfscbd handles.

Also, "nfsstat -m" on the client shows the mount options in use.
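
A minimal sketch of the server-side check, assuming a POSIX sh. The helper
name delegation_state is hypothetical. It only interprets the value of
vfs.nfsd.issue_delegations, where 0 means delegations are not issued.

```shell
# Interpret the value of vfs.nfsd.issue_delegations.
# delegation_state is a hypothetical helper, not part of FreeBSD.
delegation_state() {
    case "$1" in
        '') echo "unknown" ;;    # sysctl returned nothing (not an NFS server?)
        0)  echo "disabled" ;;   # server never hands out delegations
        *)  echo "enabled" ;;    # non-zero: server may issue delegations
    esac
}

# On the server:
#   delegation_state "$(sysctl -n vfs.nfsd.issue_delegations)"
# On the client, the mount options come from:
#   nfsstat -m
```

If this prints "disabled" while nfscbd is still busy on the client, the
callbacks are coming from somewhere unexpected and a packet capture is the
next step.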

>     0 root         65 -16    -     0B  1040K swapin   0:17   0.64% kernel
> 11054 root          1  52    0    18M  7664K RUN      0:04   0.10% bsdtar
>    11 root         15 -56    -     0B   240K WAIT     0:15   0.05% intr
>    16 root          1 -16    -     0B    16K -        0:01   0.03% racctd
> 11062 root          1  20    0    14M  3804K RUN      0:00   0.03% top
>     7 root          3 -16    -     0B    48K psleep   0:00   0.01% pagedaemon
> 11056 root          1  20    0    21M    10M select   0:00   0.01% sshd
>     6 root          1 -16    -     0B    16K -        0:00   0.01% rand_harvest
>
>
>       Interface           Traffic               Peak                Total
>          vtnet0  in      5.380 KB/s          9.113 KB/s          781.439 MB
>                  out     4.012 KB/s          8.002 KB/s          674.294 MB
>
>
> # nfsstat -zc > /dev/null ; sleep 1 ; nfsstat -c
Adding -E makes it show all RPC counts. (Without -E you just get the
"old Sun compatible" output.)

> Rpc Counts:
>       Getattr      Setattr       Lookup     Readlink         Read        Write       Create       Remove
>             1            2            5            0            0            0            0            0
>        Rename         Link      Symlink        Mkdir        Rmdir      Readdir     RdirPlus       Access
>             0            0            0            0            0            1            0            1
>         Mknod       Fsstat       Fsinfo     PathConf       Commit
>             0            0            0            0            0
> Rpc Info:
>      TimedOut      Invalid    X Replies      Retries     Requests
>             0            0            0            0           11
> Cache Info:
>     Attr Hits  Attr Misses    Lkup Hits  Lkup Misses    BioR Hits  BioR Misses    BioW Hits  BioW Misses
>            11            1            2            5            0            0            0            0
>    BioRL Hits BioRL Misses    BioD Hits  BioD Misses    DirE Hits  DirE Misses    Accs Hits  Accs Misses
>             0            0            1            1            1            0            8            1
>
>
The above suggests that there is still some activity on the client, but the
information is limited.

If the client is still in this state, you can collect more info by running
the following for a little while:
# tcpdump -s 0 -w out.pcap host <nfs-server>
The out.pcap file needs to be looked at in wireshark (tcpdump is useless
at decoding NFS). If there is nothing secret in it, you can email it to
me as an attachment, so I can take a look.
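
One way to bound the capture time is to wrap tcpdump in timeout(1). This is
only a sketch: the function name capture_nfs, the TCPDUMP variable, and the
60-second default are assumptions for illustration, and it presumes timeout(1)
is available on the client.

```shell
# Capture all traffic to/from the NFS server for a fixed number of seconds.
# TCPDUMP can be overridden (e.g. for a dry run); it defaults to tcpdump.
TCPDUMP=${TCPDUMP:-tcpdump}

capture_nfs() {
    cap_server=$1                 # NFS server host name or address
    cap_seconds=${2:-60}          # capture duration (assumed default)
    cap_outfile=${3:-out.pcap}    # file to open in wireshark afterwards

    # Same options as the manual command: full packets, raw pcap output,
    # filtered by host only, so no port number is assumed.
    timeout "$cap_seconds" "$TCPDUMP" -s 0 -w "$cap_outfile" host "$cap_server"
}

# Example (run as root on the client):
#   capture_nfs <nfs-server> 60 out.pcap
```

timeout(1) exits with status 124 when the time limit is reached, so a
non-zero status here does not by itself mean the capture failed.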

Running
# ps axHl
repeatedly gets a lot more info about the NFS-related threads.
(I'll admit I doubt the info is useful for this case.)
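
The ps output can be narrowed to the NFS threads with a small filter. This is
a sketch under the assumption that the relevant threads have "nfs" somewhere
on their line (as nfscbd does). The name filter_nfs_threads is hypothetical.

```shell
# Keep the ps header line plus every line that mentions "nfs".
# Reads ps output on stdin; matching on "nfs" is an assumption about
# how the NFS threads are named.
filter_nfs_threads() {
    awk 'NR == 1 || /nfs/'
}

# Example: one snapshot every 5 seconds until interrupted.
#   while :; do ps axHl | filter_nfs_threads; sleep 5; done
```

Comparing a few snapshots shows whether the nfscbd threads change state or
wait channel at all while the CPU is pegged.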

Also run
# nfsstat -E -c -z
repeatedly, as above.

If you just want to get rid of the mount
# umount -N <mnt-path>
should work, although it can take a couple of minutes.

Either not running "nfscbd" on the client or disabling delegations by
setting vfs.nfsd.issue_delegations=0 on the server (assuming you
have them enabled) might/should avoid the problem.

rick