Re: zfs/nfsd performance limiter

From: Adam Stylinski <kungfujesus06_at_gmail.com>
Date: Fri, 27 May 2022 16:39:56 UTC
     rx_pause: 0
     rx_pause_duration: 26
     rx_pause_transition: 2
     tx_pause: 4
     tx_pause_duration: 0
     tx_pause_transition: 0

The system's been up for about 95 days, though, so it's hard to say
whether the 4 in that counter actually occurred during the tests.
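
For reference, a minimal sketch of how those counters could be watched live
during a transfer instead of relying on lifetime totals (enp3s0 is just a
placeholder for whatever the Mellanox port is called on the Linux client):

    # poll the pause counters once a second while the copy runs
    watch -n 1 'ethtool -S enp3s0 | grep -i pause'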

On Wed, May 25, 2022 at 11:37 PM John <jwd@freebsd.org> wrote:
>
> Hi Adam,
>
>    Sorry for the late reply to your original reply. Many good replies
> from Rick cc'd (Hi Rick :-)
>
>    1 item pops out to me below.
>
> ----- Adam Stylinski's Original Message -----
>
> > > What is your server system? Make/model/ram/etc.
> > Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (6 cores, a little starved
> > on the clock but the load at least is basically zero during this test)
> > 128GB of memory
> >
> > > top -aH
> > During the copy load (for brevity, only the real top CPU contenders are shown here):
> > last pid: 15560;  load averages:  0.25,  0.39,  0.27 up 4+15:48:54 09:17:38
> > 98 threads:    2 running, 96 sleeping
> > CPU:  0.0% user,  0.0% nice, 19.1% system,  5.6% interrupt, 75.3% idle
> > Mem: 12M Active, 4405M Inact, 8284K Laundry, 115G Wired, 1148M Buf, 4819M Free
> > ARC: 98G Total, 80G MFU, 15G MRU, 772K Anon, 1235M Header, 1042M Other
> >      91G Compressed, 189G Uncompressed, 2.09:1 Ratio
> > Swap: 5120M Total, 5120M Free
> >
> >   PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
> >  3830 root         20    0    12M  2700K rpcsvc   2   1:16  53.26% nfsd: server (nfsd){nfsd: service}
> >  3830 root         20    0    12M  2700K CPU5     5   5:42  52.96% nfsd: server (nfsd){nfsd: master}
> > 15560 adam         20    0    17M  5176K CPU2     2   0:00   0.12% top -aH
> >  1493 root         20    0    13M  2260K select   3   0:36   0.01% /usr/sbin/powerd
> >  1444 root         20    0    75M  2964K select   5   0:19   0.01% /usr/sbin/mountd -r /etc/exports /etc/zfs/exports
> >  1215 uucp         20    0    13M  2820K select   5   0:27   0.01% /usr/local/libexec/nut/usbhid-ups -a cyberpower
> > 93424 adam         20    0    21M  9900K select   0   0:00   0.01% sshd: adam@pts/0 (sshd)
> >
> > > ifconfig -vm
> > mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
> > options=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
> > capabilities=ed07bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
> > ether 00:02:c9:35:df:20
> > inet 10.5.5.1 netmask 0xffffff00 broadcast 10.5.5.255
> > media: Ethernet autoselect (40Gbase-CR4 <full-duplex,rxpause,txpause>)
>
>    Something worth checking:
>
>    The media line indicates that rxpause/txpause are active for this
> interface. I don't have an mlx-based Linux system handy, so the following
> is from memory. On the Linux side, please try:
>
> ethtool -S $interface | grep pause
>
>    The output should include the number of times a pause frame was
> transmitted (meaning the Linux system's NIC receive queue was (almost)
> full) or received (meaning the FreeBSD system's NIC receive queue was
> (almost) full). If the values are 0, this is not an issue.
>
>    If rx/tx pause frames are being sent/received, this will slow down
> the link (sometimes rather dramatically). With only rx/tx pause (vs
> DCBX), a system spewing pause frames can bring a set of connected
> systems to a slow crawl.
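>
>    If the counters are in fact climbing during a transfer, one quick
> experiment is to disable flow control on the Linux side for a test
> run. A minimal sketch, assuming the client interface is named enp3s0
> (a placeholder):
>
> ethtool -a enp3s0                            # show current pause settings
> ethtool -A enp3s0 autoneg off rx off tx off  # turn pause frames off for a test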
>
> -John
>
> > status: active
> > supported media:
> > media autoselect
> > media 40Gbase-CR4 mediaopt full-duplex
> > media 10Gbase-CX4 mediaopt full-duplex
> > media 10Gbase-SR mediaopt full-duplex
> > media 1000baseT mediaopt full-duplex
> > nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> > plugged: QSFP+ 40GBASE-CR4 (No separable connector)
> > vendor: Mellanox PN: MC2207130-002 SN: MT1419VS07971 DATE: 2014-06-06
> > module temperature: 0.00 C voltage: 0.00 Volts
> > lane 1: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> > lane 2: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> > lane 3: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> > lane 4: RX power: 0.00 mW (-inf dBm) TX bias: 0.00 mA
> >
> > > - What are your values for:
> > >
> > > -- kern.ipc.maxsockbuf
> > > -- net.inet.tcp.sendbuf_max
> > > -- net.inet.tcp.recvbuf_max
> > >
> > > -- net.inet.tcp.sendspace
> > > -- net.inet.tcp.recvspace
> > >
> > > -- net.inet.tcp.delayed_ack
> > kern.ipc.maxsockbuf: 16777216
> > net.inet.tcp.sendbuf_max: 16777216
> > net.inet.tcp.recvbuf_max: 16777216
> > net.inet.tcp.sendspace: 32768 # This is interesting?  I'm not sure why the discrepancy here
> > net.inet.tcp.recvspace: 4194304
> > net.inet.tcp.delayed_ack: 0
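> >
> > For what it's worth, 32768 is (I believe) just the stock FreeBSD
> > default for net.inet.tcp.sendspace, so it looks like only the receive
> > side got tuned. With send-buffer auto-tuning it probably matters less,
> > but a minimal sketch of bringing the send default in line with
> > recvspace, purely as an experiment:
> >
> > sysctl net.inet.tcp.sendspace=4194304
> > # add the same line to /etc/sysctl.conf to persist it across reboots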
> >
> > > netstat -i
> > Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
> > igb0   9000 <Link#1>      ac:1f:6b:b0:60:bc 18230625     0     0 24178283     0     0
> > igb1   9000 <Link#2>      ac:1f:6b:b0:60:bc 14341213     0     0  8447249     0     0
> > lo0   16384 <Link#3>      lo0                 367691     0     0   367691     0     0
> > lo0       - localhost     localhost               68     -     -       68     -     -
> > lo0       - fe80::%lo0/64 fe80::1%lo0              0     -     -        0     -     -
> > lo0       - your-net      localhost           348944     -     -   348944     -     -
> > mlxen  9000 <Link#4>      00:02:c9:35:df:20 13138046     0    12 26308206     0     0
> > mlxen     - 10.5.5.0/24   10.5.5.1          11592389     -     - 24345184     -     -
> > vm-pu  9000 <Link#6>      56:3e:55:8a:2a:f8     7270     0     0   962249   102     0
> > lagg0  9000 <Link#5>      ac:1f:6b:b0:60:bc 31543941     0     0 31623674     0     0
> > lagg0     - 192.168.0.0/2 nasbox            27967582     -     - 41779731     -     -
> >
> > > What threads/irq are allocated to your NIC? 'vmstat -i'
> >
> > Doesn't seem perfectly balanced but not terribly imbalanced, either:
> >
> > interrupt                          total       rate
> > irq9: acpi0                            3          0
> > irq18: ehci0 ehci1+               803162          2
> > cpu0:timer                      67465114        167
> > cpu1:timer                      65068819        161
> > cpu2:timer                      65535300        163
> > cpu3:timer                      63408731        157
> > cpu4:timer                      63026304        156
> > cpu5:timer                      63431412        157
> > irq56: nvme0:admin                    18          0
> > irq57: nvme0:io0                  544999          1
> > irq58: nvme0:io1                  465816          1
> > irq59: nvme0:io2                  487486          1
> > irq60: nvme0:io3                  474616          1
> > irq61: nvme0:io4                  452527          1
> > irq62: nvme0:io5                  467807          1
> > irq63: mps0                     36110415         90
> > irq64: mps1                    112328723        279
> > irq65: mps2                     54845974        136
> > irq66: mps3                     50770215        126
> > irq68: xhci0                     3122136          8
> > irq70: igb0:rxq0                 1974562          5
> > irq71: igb0:rxq1                 3034190          8
> > irq72: igb0:rxq2                28703842         71
> > irq73: igb0:rxq3                 1126533          3
> > irq74: igb0:aq                         7          0
> > irq75: igb1:rxq0                 1852321          5
> > irq76: igb1:rxq1                 2946722          7
> > irq77: igb1:rxq2                 9602613         24
> > irq78: igb1:rxq3                 4101258         10
> > irq79: igb1:aq                         8          0
> > irq80: ahci1                    37386191         93
> > irq81: mlx4_core0                4748775         12
> > irq82: mlx4_core0               13754442         34
> > irq83: mlx4_core0                3551629          9
> > irq84: mlx4_core0                2595850          6
> > irq85: mlx4_core0                4947424         12
> > Total                          769135944       1908
> >
> > > Are the above threads floating or mapped? 'cpuset -g ...'
> >
> > I suspect I was supposed to run this with a pid as the argument,
> > maybe nfsd's?  Here's the output without an argument:
> >
> > pid -1 mask: 0, 1, 2, 3, 4, 5
> > pid -1 domain policy: first-touch mask: 0
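> >
> > If a per-process mask is what you were after, a minimal sketch (the
> > pgrep selection of the nfsd master process is just a guess at the
> > right pid):
> >
> > cpuset -g -p $(pgrep -o nfsd)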
> >
> > > Disable nfs tcp drc
> >
> > This is the first time I've seen a duplicate request cache mentioned.
> > It seems counter-intuitive that disabling it would help, but maybe
> > I'll try doing that.  What exactly is the benefit?
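> >
> > If I'm reading things right, the knob in question is probably the
> > vfs.nfsd.cachetcp sysctl (my assumption), which controls whether
> > requests arriving over TCP get entered into the DRC at all:
> >
> > sysctl vfs.nfsd.cachetcp=0    # skip the DRC for TCP mounts, as an experiment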
> >
> > > What is your atime setting?
> >
> > Disabled at both the file system and the client mounts.
> >
> > > You also state you are using a Linux client. Are you using the MLX affinity scripts, buffer sizing suggestions, etc, etc. Have you swapped the Linux system for a fbsd system?
> > I've not, though I do vaguely recall Mellanox supplying some scripts
> > in their documentation that pinned interrupt handling to specific cores
> > at one point.  Is this what you're referring to?  I could give that a
> > try.  I don't at present have any FreeBSD client systems with enough
> > PCI express bandwidth to swap things out for a Linux vs FreeBSD test.
> >
> > >  You mention iperf. Please post the options you used when invoking iperf and its output.
> >
> > Setting up the NFS client as the iperf "server" (since the
> > terminology is a little bit flipped with iperf), here's the output:
> >
> > -----------------------------------------------------------
> > Server listening on 5201 (test #1)
> > -----------------------------------------------------------
> > Accepted connection from 10.5.5.1, port 11534
> > [  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
> > [  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
> > [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
> > [  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
> > [  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
> > [  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
> > [  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
> > [  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
> > [  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
> > [  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
> > [  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate
> > [  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec                  receiver
> > -----------------------------------------------------------
> > Server listening on 5201 (test #2)
> > -----------------------------------------------------------
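> >
> > The exact client-side options aren't shown above; purely for
> > illustration, a plain single-stream run against that listener would
> > look roughly like the following (add -P <n> to mimic multiple
> > nconnect streams):
> >
> > iperf3 -c 10.5.5.4 -t 10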
> >
> > On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
> > >
> > > ----- Adam Stylinski's Original Message -----
> > > > Hello,
> > > >
> > > > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > > > mode.  They have their MTU's maxed at 9000, their ring buffers maxed
> > > > at 8192, and I can hit around 36 gbps with iperf.
> > > >
> > > > When using an NFS client (client = linux, server = freebsd), I see a
> > > > maximum rate of around 20gbps.  The test file is fully in ARC.  The
> > > > test is performed with an NFS mount nconnect=4 and an rsize/wsize of
> > > > 1MB.
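> > > >
> > > > For reference, the mount is roughly of this shape (the export path
> > > > here is just a placeholder):
> > > >
> > > > mount -t nfs -o nconnect=4,rsize=1048576,wsize=1048576 \
> > > >     10.5.5.1:/pool/share /mnt/share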
> > > >
> > > > Here's the flame graph of the kernel of the system in question, with
> > > > idle stacks removed:
> > > >
> > > > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> > > >
> > > > The longest functions seems like maybe it's the ERMS aware memcpy
> > > > happening from the ARC?  Is there maybe a missing fast path that could
> > > > take fewer copies into the socket buffer?
> > >
> > > Hi Adam -
> > >
> > >    Some items to look at and possibly include for more responses....
> > >
> > > - What is your server system? Make/model/ram/etc. What is your
> > >   overall 'top' cpu utilization 'top -aH' ...
> > >
> > > - It looks like you're using a 40gb/s card. Posting the output of
> > >   'ifconfig -vm' would provide additional information.
> > >
> > > - Are the interfaces running cleanly? 'netstat -i' is helpful.
> > >
> > > - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
> > >
> > > - Inspect 'netstat -m'. Denied? Delayed?
> > >
> > >
> > > - You mention iperf. Please post the options you used when
> > >   invoking iperf and its output.
> > >
> > > - You appear to be looking for throughput vs low latency. Have
> > >   you looked at window size vs the amount of memory allocated to the
> > >   streams? These values vary based on the bit rate of the connection.
> > >   TCP connections require outstanding un-ack'd data to be held,
> > >   which affects the values below.
> > >
> > >
> > > - What are your values for:
> > >
> > > -- kern.ipc.maxsockbuf
> > > -- net.inet.tcp.sendbuf_max
> > > -- net.inet.tcp.recvbuf_max
> > >
> > > -- net.inet.tcp.sendspace
> > > -- net.inet.tcp.recvspace
> > >
> > > -- net.inet.tcp.delayed_ack
> > >
> > > - What threads/irq are allocated to your NIC? 'vmstat -i'
> > >
> > > - Are the above threads floating or mapped? 'cpuset -g ...'
> > >
> > > - Determine best settings for LRO/TSO for your card (see the sketch below this list).
> > >
> > > - Disable nfs tcp drc
> > >
> > > - What is your atime setting?
> > >
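> > >   A minimal sketch of the LRO/TSO experiment referenced above,
> > > assuming the mlx4 driver honours the standard ifconfig flags:
> > >
> > > ifconfig mlxen0 -lro -tso    # disable both for a test run
> > > ifconfig mlxen0 lro tso      # re-enable afterwards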
> > >
> > >    If you really think you have a ZFS/kernel issue, and your
> > > data fits in cache, dump ZFS, create a memory-backed file system
> > > and repeat your tests. This will purge a large portion of your
> > > graph.  LRO/TSO changes may do so also.
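> > >
> > >    A minimal sketch of that experiment (the size and paths are
> > > placeholders):
> > >
> > > mdmfs -s 40g md /mnt/ramtest      # swap/RAM-backed scratch file system
> > > cp /pool/testfile /mnt/ramtest/   # stage the test file in memory
> > >
> > > then export /mnt/ramtest over NFS and repeat the read test.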
> > >
> > >    You also state you are using a Linux client. Are you using
> > > the MLX affinity scripts, buffer sizing suggestions, etc, etc.
> > > Have you swapped the Linux system for a fbsd system?
> > >
> > >    And as a final note, I regularly use Chelsio T62100 cards
> > > in dual-homed and/or LACP environments in Supermicro boxes with hundreds
> > > of NFS-boot (Bhyve, QEMU, and physical system) clients per server
> > > with no network starvation or CPU bottlenecks.  Clients boot, perform
> > > their work, and then remotely request image rollback.
> > >
> > >
> > >    Hopefully the above will help and provide pointers.
> > >
> > > Cheers
> > >