Re: zfs/nfsd performance limiter
- In reply to: Adam Stylinski : "Re: zfs/nfsd performance limiter"
Date: Wed, 25 May 2022 20:04:35 UTC
Adam Stylinski <kungfujesus06@gmail.com> wrote:
> Hmm, I don't know that the presence of jumbo 9k mbufs is indicative
> that the mellanox drivers are using them or not, given that I have a
> link aggregation on a different (1gbps) NIC that also could be the
> cause of that:
>
> mbuf:              256, 52231134,  49500,  25931, 1956138424,  0,  0,  0
> mbuf_cluster:     2048,  8161114,   2794,   4352,  700435355,  0,  0,  0
> mbuf_jumbo_page:  4096,  4080557,  12288,   3977,  155289291,  0,  0,  0
> mbuf_jumbo_9k:    9216,  1609044,  32772,   4174,   35785053,  0,  0,  0
> mbuf_jumbo_16k:  16384,   680092,      0,      0,          0,  0,  0,  0
>
> Early on, 9k MTUs did show significant advantages for throughput from
> what I remember. But of course, this is before trying any of the
> aforementioned changes for multiplexing the connection.
It may give you better performance for your test runs, but if you
fragment the mbuf cluster pool you can get hit pretty hard.

Buyer beware, as they say, rick

ps: It doesn't matter what is using them. Any mixing of 2K, 4K and 9K
clusters can result in fragmentation of the pool such that an allocation
cannot happen until mbuf clusters get freed.
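For what it's worth, a quick way to tell whether the 9k zone is actually in
play (and whether allocations are starting to fail) is to watch the mbuf
zones and the failure counters over time. The interface name below assumes
the mlx4 port is mlxen0:

# vmstat -z | egrep 'mbuf_cluster|mbuf_jumbo'   # FAIL/SLEEP columns show starved allocations
# netstat -m | egrep 'denied|delayed'           # requests denied/delayed by the mbuf system
# ifconfig mlxen0 mtu 4000                      # an MTU at/below page size generally keeps
                                                # receives in 4k clusters instead of 9k ones

This is only a rough check, not a guaranteed fix; the point is to compare
the 9k zone and the denied/delayed counters before and after an MTU change.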
On Wed, May 25, 2022 at 11:41 AM Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> Adam Stylinski <kungfujesus06@gmail.com> wrote:
> [stuff snipped]
> >
> > ifconfig -vm
> > mlxen0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
> Just in case you (or someone else reading this) is not aware of it,
> use of 9K jumbo clusters causes fragmentation of the memory pool
> clusters are allocated from and, therefore, their use is not recommended.
>
> Now, it may be that the mellanox driver doesn't use 9K clusters (it could
> put the received frame in multiple smaller clusters), but if it does, you
> should consider reducing the mtu.
> If you:
> # vmstat -z | fgrep mbuf_jumbo_9k
> it will show you if they are being used.
>
> rick
>
> netstat -i
> Name    Mtu Network        Address               Ipkts Ierrs Idrop    Opkts Oerrs Coll
> igb0   9000 <Link#1>       ac:1f:6b:b0:60:bc  18230625     0     0 24178283     0    0
> igb1   9000 <Link#2>       ac:1f:6b:b0:60:bc  14341213     0     0  8447249     0    0
> lo0   16384 <Link#3>       lo0                  367691     0     0   367691     0    0
> lo0       - localhost      localhost                68     -     -       68     -    -
> lo0       - fe80::%lo0/64  fe80::1%lo0               0     -     -        0     -    -
> lo0       - your-net       localhost            348944     -     -   348944     -    -
> mlxen  9000 <Link#4>       00:02:c9:35:df:20  13138046     0    12 26308206     0    0
> mlxen     - 10.5.5.0/24    10.5.5.1           11592389     -     - 24345184     -    -
> vm-pu  9000 <Link#6>       56:3e:55:8a:2a:f8      7270     0     0   962249   102    0
> lagg0  9000 <Link#5>       ac:1f:6b:b0:60:bc  31543941     0     0 31623674     0    0
> lagg0     - 192.168.0.0/2  nasbox             27967582     -     - 41779731     -    -
>
> > What threads/irq are allocated to your NIC? 'vmstat -i'
>
> Doesn't seem perfectly balanced but not terribly imbalanced, either:
>
> interrupt                          total       rate
> irq9: acpi0                            3          0
> irq18: ehci0 ehci1+               803162          2
> cpu0:timer                      67465114        167
> cpu1:timer                      65068819        161
> cpu2:timer                      65535300        163
> cpu3:timer                      63408731        157
> cpu4:timer                      63026304        156
> cpu5:timer                      63431412        157
> irq56: nvme0:admin                    18          0
> irq57: nvme0:io0                  544999          1
> irq58: nvme0:io1                  465816          1
> irq59: nvme0:io2                  487486          1
> irq60: nvme0:io3                  474616          1
> irq61: nvme0:io4                  452527          1
> irq62: nvme0:io5                  467807          1
> irq63: mps0                     36110415         90
> irq64: mps1                    112328723        279
> irq65: mps2                     54845974        136
> irq66: mps3                     50770215        126
> irq68: xhci0                     3122136          8
> irq70: igb0:rxq0                 1974562          5
> irq71: igb0:rxq1                 3034190          8
> irq72: igb0:rxq2                28703842         71
> irq73: igb0:rxq3                 1126533          3
> irq74: igb0:aq                         7          0
> irq75: igb1:rxq0                 1852321          5
> irq76: igb1:rxq1                 2946722          7
> irq77: igb1:rxq2                 9602613         24
> irq78: igb1:rxq3                 4101258         10
> irq79: igb1:aq                         8          0
> irq80: ahci1                    37386191         93
> irq81: mlx4_core0                4748775         12
> irq82: mlx4_core0               13754442         34
> irq83: mlx4_core0                3551629          9
> irq84: mlx4_core0                2595850          6
> irq85: mlx4_core0                4947424         12
> Total                          769135944       1908
>
> > Are the above threads floating or mapped? 'cpuset -g ...'
>
> I suspect I was supposed to run this against the argument of a pid,
> maybe nfsd? Here's the output without an argument:
>
> pid -1 mask: 0, 1, 2, 3, 4, 5
> pid -1 domain policy: first-touch mask: 0
>
> > Disable nfs tcp drc
>
> This is the first I've ever seen a duplicate request cache mentioned.
> It seems counter-intuitive that it would help, but maybe I'll try
> doing that. What exactly is the benefit?
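A brief aside on that last point, since the question of what the DRC buys
you is left hanging here: the duplicate request cache holds replies to
recently executed non-idempotent RPCs so that a retransmitted request is
not executed a second time. Over TCP such retransmits are rare, so at high
request rates the cache lookups and locking can end up as pure overhead.
Assuming a reasonably recent FreeBSD nfsd, the TCP side of the cache can be
disabled (or its trim threshold raised) with something along these lines,
though the exact sysctl names are worth confirming with 'sysctl -a | grep nfsd':

# sysctl vfs.nfsd.cachetcp=0            # stop caching replies for TCP mounts
# sysctl vfs.nfsd.tcphighwater=100000   # or: keep the cache but trim it less aggressively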
> > What is your atime setting?
>
> Disabled at both the file system and the client mounts.
>
> > You also state you are using a Linux client. Are you using the MLX
> > affinity scripts, buffer sizing suggestions, etc, etc. Have you
> > swapped the Linux system for a fbsd system?
>
> I've not, though I do vaguely recall mellanox supplying some scripts
> in their documentation that fixed interrupt handling on specific cores
> at one point. Is this what you're referring to? I could give that a
> try. I don't at present have any FreeBSD client systems with enough
> PCI express bandwidth to swap things out for a Linux vs FreeBSD test.
>
> > You mention iperf. Please post the options you used when invoking
> > iperf and its output.
>
> Setting up the NFS client as a "server", since it seems that the
> terminology is a little bit flipped with iperf, here's the output:
>
> -----------------------------------------------------------
> Server listening on 5201 (test #1)
> -----------------------------------------------------------
> Accepted connection from 10.5.5.1, port 11534
> [  5] local 10.5.5.4 port 5201 connected to 10.5.5.1 port 43931
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-1.00   sec  3.81 GBytes  32.7 Gbits/sec
> [  5]   1.00-2.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   2.00-3.00   sec  4.18 GBytes  35.9 Gbits/sec
> [  5]   3.00-4.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   4.00-5.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   5.00-6.00   sec  4.21 GBytes  36.2 Gbits/sec
> [  5]   6.00-7.00   sec  4.10 GBytes  35.2 Gbits/sec
> [  5]   7.00-8.00   sec  4.20 GBytes  36.1 Gbits/sec
> [  5]   8.00-9.00   sec  4.21 GBytes  36.1 Gbits/sec
> [  5]   9.00-10.00  sec  4.20 GBytes  36.1 Gbits/sec
> [  5]  10.00-10.00  sec  7.76 MBytes  35.3 Gbits/sec
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.00  sec  41.5 GBytes  35.7 Gbits/sec  receiver
> -----------------------------------------------------------
> Server listening on 5201 (test #2)
> -----------------------------------------------------------
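For the record, the listing above looks like an iperf3 receiver started with
default options; the corresponding pair of commands would have been roughly
the following, with 10.5.5.4 being the Linux NFS client acting as the iperf
"server", and with a parallel-stream run added for comparison:

$ iperf3 -s                        # on the Linux client (10.5.5.4)
$ iperf3 -c 10.5.5.4 -t 10         # on the FreeBSD box: single TCP stream
$ iperf3 -c 10.5.5.4 -t 10 -P 4    # 4 parallel streams, closer to the nconnect=4 NFS test

The -P 4 run is the more interesting comparison, since the NFS test is
multiplexed over four TCP connections.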
> On Sun, May 22, 2022 at 3:45 AM John <jwd@freebsd.org> wrote:
> >
> > ----- Adam Stylinski's Original Message -----
> > > Hello,
> > >
> > > I have two systems connected via ConnectX-3 mellanox cards in ethernet
> > > mode. They have their MTUs maxed at 9000, their ring buffers maxed
> > > at 8192, and I can hit around 36 gbps with iperf.
> > >
> > > When using an NFS client (client = linux, server = freebsd), I see a
> > > maximum rate of around 20gbps. The test file is fully in ARC. The
> > > test is performed with an NFS mount nconnect=4 and an rsize/wsize of
> > > 1MB.
> > >
> > > Here's the flame graph of the kernel of the system in question, with
> > > idle stacks removed:
> > >
> > > https://gist.github.com/KungFuJesus/918c6dcf40ae07767d5382deafab3a52#file-nfs_fg-svg
> > >
> > > The longest function seems like maybe it's the ERMS aware memcpy
> > > happening from the ARC? Is there maybe a missing fast path that could
> > > take fewer copies into the socket buffer?
> >
> > Hi Adam -
> >
> > Some items to look at and possibly include for more responses....
> >
> > - What is your server system? Make/model/ram/etc. What is your
> > overall 'top' cpu utilization 'top -aH' ...
> >
> > - It looks like you're using a 40gb/s card. Posting the output of
> > 'ifconfig -vm' would provide additional information.
> >
> > - Are the interfaces running cleanly? 'netstat -i' is helpful.
> >
> > - Inspect 'netstat -s'. Duplicate pkts? Resends? Out-of-order?
> >
> > - Inspect 'netstat -m'. Denied? Delayed?
> >
> > - You mention iperf. Please post the options you used when
> > invoking iperf and its output.
> >
> > - You appear to be looking for throughput vs low latency. Have
> > you looked at window-size vs the amount of memory allocated to the
> > streams? These values vary based on the bit-rate of the connection.
> > Tcp connections require outstanding un-ack'd data to be held.
> > This affects the values below.
> >
> > - What are your values for:
> >
> > -- kern.ipc.maxsockbuf
> > -- net.inet.tcp.sendbuf_max
> > -- net.inet.tcp.recvbuf_max
> >
> > -- net.inet.tcp.sendspace
> > -- net.inet.tcp.recvspace
> >
> > -- net.inet.tcp.delayed_ack
> >
> > - What threads/irq are allocated to your NIC? 'vmstat -i'
> >
> > - Are the above threads floating or mapped? 'cpuset -g ...'
> >
> > - Determine best settings for LRO/TSO for your card.
> >
> > - Disable nfs tcp drc
> >
> > - What is your atime setting?
> >
> > If you really think you have a ZFS/Kernel issue, and your
> > data fits in cache, dump ZFS, create a memory backed file system
> > and repeat your tests. This will purge a large portion of your
> > graph. LRO/TSO changes may do so also.
> >
> > You also state you are using a Linux client. Are you using
> > the MLX affinity scripts, buffer sizing suggestions, etc, etc.
> > Have you swapped the Linux system for a fbsd system?
> >
> > And as a final note, I regularly use Chelsio T62100 cards
> > in dual home and/or LACP environments in Supermicro boxes with 100s
> > of nfs boot (Bhyve, QEMU, and physical system) clients per server
> > with no network starvation or cpu bottlenecks. Clients boot, perform
> > their work, and then remotely request image rollback.
> >
> > Hopefully the above will help and provide pointers.
> >
> > Cheers
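As a closing note on John's suggestion of taking ZFS out of the picture: a
quick memory-backed export on the FreeBSD server could look roughly like the
sketch below, using a swap-backed md(4) device (the size, mount point, test
file path, and export line are placeholders to adapt):

# mdconfig -a -t swap -s 32g                # creates e.g. /dev/md0
# newfs -U /dev/md0
# mkdir -p /mnt/ramtest
# mount /dev/md0 /mnt/ramtest
# cp /pool/testfile /mnt/ramtest/           # copy the test file out of the ZFS pool
# echo '/mnt/ramtest -ro -network 10.5.5.0/24' >> /etc/exports
# service mountd reload

If reads from that export still top out around 20gbps, the bottleneck is in
the network/NFS path rather than in ZFS/ARC; if they run closer to line rate,
the copy path out of the ARC is back under suspicion.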