Re: zfs/nfsd performance limiter

From: Rick Macklem <rmacklem_at_uoguelph.ca>
Date: Sun, 22 May 2022 14:12:46 UTC
Adam Stylinski <kungfujesus06@gmail.com> wrote:
> jwd wrote:
> > What is your server system? Make/model/ram/etc.
> Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (6 cores, a little starved
> on the clock but the load at least is basically zero during this test)
> 128GB of memory
>
> > top -aH
> During the copy load (for brevity, only did the real top contenders
> for CPU here):
>
> last pid: 15560;  load averages:  0.25,  0.39,  0.27    up 4+15:48:54  09:17:38
> 98 threads:    2 running, 96 sleeping
> CPU:  0.0% user,  0.0% nice, 19.1% system,  5.6% interrupt, 75.3% idle
> Mem: 12M Active, 4405M Inact, 8284K Laundry, 115G Wired, 1148M Buf, 4819M Free
> ARC: 98G Total, 80G MFU, 15G MRU, 772K Anon, 1235M Header, 1042M Other
>     91G Compressed, 189G Uncompressed, 2.09:1 Ratio
> Swap: 5120M Total, 5120M Free
>
>  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
> 3830 root         20    0    12M  2700K rpcsvc   2   1:16  53.26% nfsd: server (nfsd){nfsd: service}
> 3830 root         20    0    12M  2700K CPU5     5   5:42  52.96% nfsd: server (nfsd){nfsd: master}
> 15560 adam         20    0    17M  5176K CPU2     2   0:00   0.12% top -aH
>  1493 root         20    0    13M  2260K select   3   0:36   0.01% /usr/sbin/powerd
>  1444 root         20    0    75M  2964K select   5   0:19   0.01% /usr/sbin/mountd -r /etc/exports /etc/zfs/exports
>  1215 uucp         20    0    13M  2820K select   5   0:27   0.01% /usr/local/libexec/nut/usbhid-ups -a cyberpower
> 93424 adam         20    0    21M  9900K select   0   0:00   0.01% sshd: adam@pts/0 (sshd)
>
> > ifconfig -vm
> mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
>     options=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
>     capabilities=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
> ether 00:02:c9:35:df:20
> inet 10.5.5.1 netmask 0xffffff00 broadcast 10.5.5.255
> media: Ethernet autoselect (40Gbase-CR4 <full-duplex,rxpause,txpause>)
> status: active
> supported media:
> media autoselect
> media 40Gbase-CR4 mediaopt full-duplex
> media 10Gbase-CX4 mediaopt full-duplex
> media 10Gbase-SR mediaopt full-duplex
> media 1000baseT mediaopt full-duplex
> nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> plugged: QSFP+ 40GBASE-CR4 (No separable connector)
> vendor: Mellanox PN: MC2207130-002 SN: MT1419VS07971 DATE: 2014-06-06
> module temperature: 0.00 C voltage: 0.00 Volts
> lane 1: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> lane 2: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> lane 3: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> lane 4: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
>
> > - What are your values for:
> >
> > -- kern.ipc.maxsockbuf
> > -- net.inet.tcp.sendbuf_max
> > -- net.inet.tcp.recvbuf_max
> >
> > -- net.inet.tcp.sendspace
> > -- net.inet.tcp.recvspace
> >
> > -- net.inet.tcp.delayed_ack
> kern.ipc.maxsockbuf: 16777216
> net.inet.tcp.sendbuf_max: 16777216
> net.inet.tcp.recvbuf_max: 16777216
> net.inet.tcp.sendspace: 32768  # This is interesting?  I'm not sure why the discrepancy here
> net.inet.tcp.recvspace: 4194304
> net.inet.tcp.delayed_ack: 0
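A note on the sendspace question above: sendspace/recvspace are only the initial
socket buffer sizes. With net.inet.tcp.sendbuf_auto and net.inet.tcp.recvbuf_auto
enabled (the FreeBSD default), the buffers grow on demand up to the *_max values,
so the small sendspace is probably not the limiter by itself. To rule it out,
something along these lines (the 256K value is only an example):

  sysctl net.inet.tcp.sendbuf_auto net.inet.tcp.recvbuf_auto   # confirm autotuning is on
  sysctl net.inet.tcp.sendspace=262144                         # persist in /etc/sysctl.conf if it helps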
>
> > netstat -i
> Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
> igb0   9000 <Link#1>      ac:1f:6b:b0:60:bc 18230625     0     0 24178283     0     0
> igb1   9000 <Link#2>      ac:1f:6b:b0:60:bc 14341213     0     0  8447249     0     0
> lo0   16384 <Link#3>      lo0                 367691     0     0   367691     0     0
> lo0       - localhost     localhost               68     -     -       68     -     -
> lo0       - fe80::%lo0/64 fe80::1%lo0              0     -     -        0     -     -
> lo0       - your-net      localhost           348944     -     -   348944     -     -
> mlxen  9000 <Link#4>      00:02:c9:35:df:20 13138046     0    12 26308206     0     0
> mlxen     - 10.5.5.0/24   10.5.5.1          11592389     -     - 24345184     -     -
> vm-pu  9000 <Link#6>      56:3e:55:8a:2a:f8     7270     0     0   962249   102     0
> lagg0  9000 <Link#5>      ac:1f:6b:b0:60:bc 31543941     0     0 31623674     0     0
> lagg0     - 192.168.0.0/2 nasbox            27967582     -     - 41779731     -     -
> 
> > What threads/irq are allocated to your NIC? 'vmstat -i'
>
> Doesn't seem perfectly balanced but not terribly imbalanced, either:
>
> interrupt                          total       rate
> irq9: acpi0                            3          0
> irq18: ehci0 ehci1+               803162          2
> cpu0:timer                      67465114        167
> cpu1:timer                      65068819        161
> cpu2:timer                      65535300        163
> cpu3:timer                      63408731        157
> cpu4:timer                      63026304        156
> cpu5:timer                      63431412        157
> irq56: nvme0:admin                    18          0
> irq57: nvme0:io0                  544999          1
> irq58: nvme0:io1                  465816          1
> irq59: nvme0:io2                  487486          1
> irq60: nvme0:io3                  474616          1
> irq61: nvme0:io4                  452527          1
> irq62: nvme0:io5                  467807          1
> irq63: mps0                     36110415         90
> irq64: mps1                    112328723        279
> irq65: mps2                     54845974        136
> irq66: mps3                     50770215        126
> irq68: xhci0                     3122136          8
> irq70: igb0:rxq0                 1974562          5
> irq71: igb0:rxq1                 3034190          8
> irq72: igb0:rxq2                28703842         71
> irq73: igb0:rxq3                 1126533          3
> irq74: igb0:aq                         7          0
> irq75: igb1:rxq0                 1852321          5
> irq76: igb1:rxq1                 2946722          7
> irq77: igb1:rxq2                 9602613         24
> irq78: igb1:rxq3                 4101258         10
> irq79: igb1:aq                         8          0
> irq80: ahci1                    37386191         93
> irq81: mlx4_core0                4748775         12
> irq82: mlx4_core0               13754442         34
> irq83: mlx4_core0                3551629          9
> irq84: mlx4_core0                2595850          6
> irq85: mlx4_core0                4947424         12
> Total                          769135944       1908
>
> > Are the above threads floating or mapped? 'cpuset -g ...'
>
> I suspect I was supposed to run this with a pid as the argument, maybe
> nfsd's?  Here's the output without an argument:
>
> pid -1 mask: 0, 1, 2, 3, 4, 5
> pid -1 domain policy: first-touch mask: 0
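To check a specific process or interrupt, cpuset can be pointed at a pid or an
irq; using the nfsd pid and the mlx4 irq from your output above (purely
illustrative, adjust to your box):

  cpuset -g -p 3830    # CPU mask the nfsd threads may run on
  cpuset -g -x 81      # CPU mask for irq81 (mlx4_core0)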
>
> > Disable nfs tcp drc
>
> This is the first I've ever seen a duplicate request cache mentioned.
> It seems counter-intuitive that it would help, but maybe I'll try
> doing that.  What exactly is the benefit?
The DRC improves correctness for NFSv3 and NFSv4.0 mounts, at a performance
cost. However, for a read-mostly load it won't add much overhead. Turning it
off increases the likelihood of data corruption due to retried non-idempotent
RPCs, but such failures should be rare over TCP.

If your mount is NFSv4.1 or 4.2, the DRC is not used, so don't worry about it.
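If you do want to try it, the server-side knob is a sysctl; this is from
memory, so double-check the name with "sysctl -d vfs.nfsd" on your release:

  sysctl vfs.nfsd.cachetcp=0    # stop using the DRC for TCP mounts
  # add vfs.nfsd.cachetcp=0 to /etc/sysctl.conf to make it persistent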

> > What is your atime setting?
>
> Disabled at both the file system and the client mounts.
>
> > You also state you are using a Linux client. Are you using the MLX affinity
> > scripts, buffer sizing suggestions, etc, etc. Have you swapped the Linux
> > system for a fbsd system?
> I've not, though I do vaguely recall mellanox supplying some scripts
> in their documentation that fixed interrupt handling on specific cores
> at one point.  Is this what you're referring to?  I could give that a
> try.  I don't at present have any FreeBSD client systems with enough
> PCI express bandwidth to swap things out for a Linux vs FreeBSD test.
If you have not already done so, do a "nfsstat -m" on the client to find
out what options it is actually using (works on both Linux and FreeBSD).

If the Linux client has a way of manually adjusting readahead, then try
increasing it. (FreeBSD has a "readahead" mount option, but I can't recall
if Linux has one?)
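As far as I know, Linux exposes NFS readahead as a per-BDI sysfs tunable rather
than a mount option; a rough sketch (the mount point and value here are just
examples):

  BDI=$(mountpoint -d /mnt/nas)                    # prints the mount's bdi id, e.g. 0:53
  echo 16384 > /sys/class/bdi/$BDI/read_ahead_kb   # raise readahead to 16MB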

You can try mounting the server on the server, but that will use lo0 and not
the mellanox, so it might be irrelevant.

Also, I don't know how many queues the mellanox driver uses. You'd want
"nconnect" at least as high as the number of queues, since each TCP
connection is serviced by a single queue, which caps the bandwidth of any
one connection.
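On the Linux side that would look something like this (export path and mount
point are placeholders; nconnect needs kernel 5.3 or newer):

  mount -t nfs -o vers=4.2,nconnect=8,rsize=1048576,wsize=1048576 \
      10.5.5.1:/pool/share /mnt/nas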

However, in general, NFS performance is determined by RPC RTT rather than by
the raw I/O rate of a bulk file read/write.
Btw, writing is a very different story from reading, largely because data and
metadata must be committed to stable storage while writing.
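If you do get to write testing, it's worth checking up front how the dataset
will handle those commits, e.g. (dataset name is a placeholder):

  zfs get sync,logbias pool/dataset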

I can't help w.r.t. ZFS or high performance nets (my fastest is 1Gbps), rick

> > You mention iperf. Please post the options you used when invoking iperf and its output.

> Setting up the NFS client as the iperf "server" (the terminology is a bit
> flipped with iperf), here's the output:

> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 10.5.5.1, port 11534
> [  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
> [  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
> [  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
> [  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
> [  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
> [  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec                  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
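(The iperf invocation itself didn't make it into the paste; from the output it
looks like a plain 10-second TCP test, i.e. roughly "iperf3 -s" on the Linux
client and "iperf3 -c 10.5.5.4" on the FreeBSD server, but that is inferred
rather than quoted.)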

On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
>
> ----- Adam Stylinski's Original Message -----
> > Hello,
> >
> > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > mode.  They have their MTU's maxed at 9000, their ring buffers maxed
> > at 8192, and I can hit around 36 gbps with iperf.
> >
> > When using an NFS client (client = linux, server = freebsd), I see a
> > maximum rate of around 20gbps.  The test file is fully in ARC.  The
> > test is performed with an NFS mount nconnect=4 and an rsize/wsize of
> > 1MB.
> >
> > Here's the flame graph of the kernel of the system in question, with
> > idle stacks removed:
> >
> > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> >
> > The longest functions seems like maybe it's the ERMS aware memcpy
> > happening from the ARC?  Is there maybe a missing fast path that could
> > take fewer copies into the socket buffer?
>
> Hi Adam -
>
>    Some items to look at and possibly include for more responses....
>
> - What is your server system? Make/model/ram/etc. What is your
>   overall 'top' cpu utilization 'top -aH' ...
>
> - It looks like you're using a 40gb/s card. Posting the output of
>   'ifconfig -vm' would provide additional information.
>
> - Are the interfaces running cleanly? 'netstat -i' is helpful.
>
> - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
>
> - Inspect 'netstat -m'. Denied? Delayed?
>
>
> - You mention iperf. Please post the options you used when
>   invoking iperf and its output.
>
> - You appear to be looking for throughput vs low-latency. Have
>   you looked at window-size vs the amount of memory allocated to the
>   streams? These values vary based on the bit-rate of the connection.
>   TCP connections require outstanding un-ack'd data to be held.
>   Affects the values below.
>
>
> - What are your values for:
>
> -- kern.ipc.maxsockbuf
> -- net.inet.tcp.sendbuf_max
> -- net.inet.tcp.recvbuf_max
>
> -- net.inet.tcp.sendspace
> -- net.inet.tcp.recvspace
>
> -- net.inet.tcp.delayed_ack
>
> - What threads/irq are allocated to your NIC? 'vmstat -i'
>
> - Are the above threads floating or mapped? 'cpuset -g ...'
>
> - Determine best settings for LRO/TSO for your card.
>
> - Disable nfs tcp drc
>
> - What is your atime setting?
>
>
>    If you really think you have a ZFS/Kernel issue, and your
> data fits in cache, dump ZFS, create a memory backed file system
> and repeat your tests. This will purge a large portion of your
> graph.  LRO/TSO changes may do so also.
>
>    You also state you are using a Linux client. Are you using
> the MLX affinity scripts, buffer sizing suggestions, etc, etc.
> Have you swapped the Linux system for a fbsd system?
>
>    And as a final note, I regularly use Chelsio T62100 cards
> in dual home and/or LACP environments in Supermicro boxes with 100's
> of nfs boot (Bhyve, QEMU, and physical system) clients per server
> with no network starvation or cpu bottlenecks.  Clients boot, perform
> their work, and then remotely request image rollback.
>
>
>    Hopefully the above will help and provide pointers.
>
> Cheers
>