RE: FreeBSD TCP (with iperf3) comparison with Linux
Date: Wed, 26 Jul 2023 19:08:35 UTC
Hi Murali,

Can you please confirm the FreeBSD version you are using, and any driver versions in use? Is this a GENERIC build of MAIN or STABLE13?

From: Murali Krishnamurthy <muralik1@vmware.com>
Sent: Wednesday, 26 July 2023 07:40
To: Cheng Cui <cc@freebsd.org>
Cc: Scheffenegger, Richard <rscheff@freebsd.org>; FreeBSD Transport <freebsd-transport@freebsd.org>
Subject: Re: FreeBSD TCP (with iperf3) comparison with Linux

Cheng,

We tried increasing the Rx/Tx ring buffer sizes on our BSD VM, but it is not working for us. What is the right configuration for the tx/rx pages and buffer sizes for vmx? The same image works fine on bare metal (not as a VM), so we are sure there is some configuration we are missing for the BSD VM. Could you please provide input?

Attached is a screenshot of ifconfig from the VM. We are trying to increase the buffer sizes on vmx0. We tried the three approaches below, but we are not sure which is the right one, and we did not see any improvement in performance.

1. Increased the rx and tx queue descriptor values from 512 to 4096:

   dev.vmx.0.rxq0.debug.comp_ndesc=4096
   dev.vmx.0.rxq0.debug.cmd1_ndesc=4096
   dev.vmx.0.rxq0.debug.cmd0_ndesc=4096
   dev.vmx.0.txq0.debug.comp_ndesc=4096
   dev.vmx.0.txq0.debug.cmd_ndesc=4096

   We put these values in /boot/loader.conf and rebooted the VM, but they were not reflected, as shown below, and therefore there was no improvement in performance:

   sysctl dev.vmx.0.rxq0.debug.comp_ndesc=1024
   sysctl dev.vmx.0.rxq0.debug.cmd1_ndesc=4096
   sysctl dev.vmx.0.rxq0.debug.cmd0_ndesc=4096
   sysctl dev.vmx.0.txq0.debug.comp_ndesc=1024
   sysctl dev.vmx.0.txq0.debug.cmd_ndesc=4096

2. Increased override_ntxds and override_nrxds to override the number of TX and RX descriptors for each queue:

   dev.vmx.0.iflib.override_ntxds="0,4096"
   dev.vmx.0.iflib.override_nrxds="0,2048,0"

   Initially the value was 0,0 for tx and 0,0,0 for rx. We changed the values with the sysctl command and the change took effect, but there was no improvement in performance.

3. Increased txndesc and rxndesc, the number of transmit and receive descriptors allocated by the driver:

   hw.vmx.txndesc=4096
   hw.vmx.rxndesc=4096

   We put these values in /boot/loader.conf.local and rebooted the VM.

Regards
Murali
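A possible explanation for approach 2 not helping, offered as a hedged sketch: as far as I know, the iflib override_ntxds/override_nrxds knobs are only consumed when the driver attaches, so changing them with sysctl on a running system updates the stored value without resizing the rings. Setting them as boot-time tunables should take effect. The values below are illustrative, keeping the one-entry-per-descriptor-ring list format from the attempts above:

# /boot/loader.conf - read at boot, before vmx(4) attaches;
# one list entry per descriptor ring (two TX rings, three RX rings)
dev.vmx.0.iflib.override_ntxds="0,4096"
dev.vmx.0.iflib.override_nrxds="0,2048,0"

# after the reboot, verify what the driver picked up
sysctl dev.vmx.0.iflib.override_ntxds dev.vmx.0.iflib.override_nrxds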
From: Cheng Cui <cc@freebsd.org>
Date: Tuesday, 4 July 2023 at 1:54 AM
To: Murali Krishnamurthy <muralik1@vmware.com>
Cc: Scheffenegger, Richard <rscheff@freebsd.org>, FreeBSD Transport <freebsd-transport@freebsd.org>
Subject: Re: FreeBSD TCP (with iperf3) comparison with Linux

I see. Sorry about the overly brief description in my previous email.

If the iperf3 report shows poor throughput and increasing numbers in the "Retr" field, and "netstat -sp tcp" shows retransmitted packets but no SACK recovery episodes (SACK is enabled by default), then you are likely hitting the problem I described, and the root cause is TX queue drops. A tcpdump trace will not show any packet retransmissions, and the peer will not be aware of the packet loss, because this is a local problem.

cc@s1:~ % netstat -sp tcp | egrep "tcp:|retrans|SACK"
tcp:
        139 data packets (300416 bytes) retransmitted   <<
        0 data packets unnecessarily retransmitted
        3 retransmit timeouts
        0 retransmitted
        0 SACK recovery episodes   <<
        0 segment rexmits in SACK recovery episodes
        0 byte rexmits in SACK recovery episodes
        0 SACK options (SACK blocks) received
        0 SACK options (SACK blocks) sent
        0 SACK retransmissions lost
        0 SACK scoreboard overflow

Local packet drops due to the TX queue being full can be seen with this command, for example:

cc@s1:~ % netstat -i -I bce4 -nd
Name    Mtu Network       Address              Ipkts Ierrs Idrop   Opkts Oerrs  Coll  Drop
bce4   1500 <Link#5>      00:10:18:56:94:d4   286184     0     0  148079     0     0    54  <<
bce4      - 10.1.1.0/24   10.1.1.2            286183     -     -  582111     -     -     -
cc@s1:~ %

I hope the above stats help you toward a better root-cause analysis. Also, increasing the TX queue size is a workaround and is specific to a particular NIC, but you get the idea.

Best Regards,
Cheng Cui

On Mon, Jul 3, 2023 at 11:34 AM Murali Krishnamurthy <muralik1@vmware.com> wrote:

Cheng,

Thanks for your inputs. Sorry, I am not familiar with this area. A few queries.

You wrote: "I believe the default values for the bce tx/rx pages are 2. I ran into this problem before: when the tx queue was full, the driver would not enqueue packets and started returning errors, and these errors were treated by the TCP layer as packet loss, triggering retransmissions."

Could you please elaborate on what is misunderstood by TCP here? Loss of packets should lead to retransmissions anyway. Could you point to some stats where I can see such drops due to the queue getting full?

I have a vmx interface in my VM, and I have attached a screenshot of the ifconfig output for it. Can we tell anything from that?

Will your suggestion of increasing tx_pages=4 and rx_pages=4 work for this interface? If so, I assume the names would be hw.vmx.tx_pages=4 and hw.vmx.rx_pages=4?

Regards
Murali
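To watch for the local drops Cheng describes on a vmx interface, a minimal sketch (assuming the interface is vmx0, as in the attached screenshot) is to run the following while an iperf3 test is in progress:

# cumulative counters; a growing Oerrs/Drop column means local TX-queue drops
netstat -i -I vmx0 -nd

# running per-second counters; -d adds the dropped-packets column
netstat -I vmx0 -dw 1

# retransmits without SACK recovery episodes point at local drops
netstat -sp tcp | egrep "tcp:|retrans|SACK"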
From: Cheng Cui <cc@freebsd.org>
Date: Friday, 30 June 2023 at 10:02 PM
To: Murali Krishnamurthy <muralik1@vmware.com>
Cc: Scheffenegger, Richard <rscheff@freebsd.org>, FreeBSD Transport <freebsd-transport@freebsd.org>
Subject: Re: FreeBSD TCP (with iperf3) comparison with Linux

I used an emulation testbed from Emulab.net, with a Dummynet traffic shaper adding 100 ms of RTT between two nodes. The link capacity is 1 Gbps and both nodes run FreeBSD 13.2.

cc@s1:~ % ping -c 3 r1
PING r1-link1 (10.1.1.3): 56 data bytes
64 bytes from 10.1.1.3: icmp_seq=0 ttl=64 time=100.091 ms
64 bytes from 10.1.1.3: icmp_seq=1 ttl=64 time=99.995 ms
64 bytes from 10.1.1.3: icmp_seq=2 ttl=64 time=99.979 ms

--- r1-link1 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 99.979/100.022/100.091/0.049 ms

cc@s1:~ % iperf3 -c r1 -t 10 -i 1 -C cubic
Connecting to host r1, port 5201
[  5] local 10.1.1.2 port 56089 connected to 10.1.1.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.19 MBytes  35.2 Mbits/sec    0   1.24 MBytes
[  5]   1.00-2.00   sec  56.5 MBytes   474 Mbits/sec    6   2.41 MBytes
[  5]   2.00-3.00   sec  58.6 MBytes   492 Mbits/sec   18   7.17 MBytes
[  5]   3.00-4.00   sec  65.6 MBytes   550 Mbits/sec   14    606 KBytes
[  5]   4.00-5.00   sec  60.8 MBytes   510 Mbits/sec   18   7.22 MBytes
[  5]   5.00-6.00   sec  62.1 MBytes   521 Mbits/sec   12   7.86 MBytes
[  5]   6.00-7.00   sec  60.9 MBytes   512 Mbits/sec   14   3.43 MBytes
[  5]   7.00-8.00   sec  62.8 MBytes   527 Mbits/sec   16    372 KBytes
[  5]   8.00-9.00   sec  59.3 MBytes   497 Mbits/sec   14   1.77 MBytes
[  5]   9.00-10.00  sec  57.0 MBytes   477 Mbits/sec   18   7.13 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   548 MBytes   459 Mbits/sec  130             sender
[  5]   0.00-10.10  sec   540 MBytes   449 Mbits/sec                  receiver

iperf Done.

cc@s1:~ % ifconfig bce4
bce4: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
        ether 00:10:18:56:94:d4
        inet 10.1.1.2 netmask 0xffffff00 broadcast 10.1.1.255
        media: Ethernet 1000baseT <full-duplex>
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

I believe the default values for the bce tx/rx pages are 2. I ran into this problem before: when the tx queue was full, the driver would not enqueue packets and started returning errors, and these errors were treated by the TCP layer as packet loss, triggering retransmissions.

After adding hw.bce.tx_pages=4 and hw.bce.rx_pages=4 in /boot/loader.conf and rebooting:

cc@s1:~ % iperf3 -c r1 -t 10 -i 1 -C cubic
Connecting to host r1, port 5201
[  5] local 10.1.1.2 port 20478 connected to 10.1.1.3 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  4.15 MBytes  34.8 Mbits/sec    0   1.17 MBytes
[  5]   1.00-2.00   sec  83.1 MBytes   697 Mbits/sec    0   12.2 MBytes
[  5]   2.00-3.00   sec   112 MBytes   939 Mbits/sec    0   12.2 MBytes
[  5]   3.00-4.00   sec   113 MBytes   944 Mbits/sec    0   12.2 MBytes
[  5]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0   12.2 MBytes
[  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec    0   12.2 MBytes
[  5]   6.00-7.00   sec   112 MBytes   938 Mbits/sec    0   12.2 MBytes
[  5]   7.00-8.00   sec   113 MBytes   944 Mbits/sec    0   12.2 MBytes
[  5]   8.00-9.00   sec   112 MBytes   938 Mbits/sec    0   12.2 MBytes
[  5]   9.00-10.00  sec   113 MBytes   947 Mbits/sec    0   12.2 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   985 MBytes   826 Mbits/sec    0             sender
[  5]   0.00-10.11  sec   982 MBytes   815 Mbits/sec                  receiver

iperf Done.

Best Regards,
Cheng Cui
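A quick cross-check on these numbers, using only figures from the run above: the bandwidth-delay product of the emulated path is

  BDP = 1 Gbit/s x 0.100 s RTT = 100 Mbit ~= 12.5 MBytes

which is right where the congestion window settles (12.2 MBytes) in the second run. Once the larger tx/rx pages stop the local drops, cubic can hold cwnd near the BDP and the 1 Gbps link stays saturated at ~940 Mbits/sec.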
On Fri, Jun 30, 2023 at 12:26 PM Murali Krishnamurthy <muralik1@vmware.com> wrote:

Richard,

Appreciate the useful inputs you have shared so far. We will try to figure out where the packet drops happen.

Regarding HyStart: I see the BSD code base also has support for this. May I know when we can expect it in a release, if it is not already available?

Regarding this point: "Switching to other cc modules may give some more insights. But again, I suspect that momentary (microsecond) burstiness of BSD may be causing this significantly higher loss rate." Is there some information somewhere that would help me understand this in more detail?

Regards
Murali

On 30/06/23, 9:35 PM, owner-freebsd-transport@freebsd.org wrote:

Hi Murali,

> Q. Since you mention two hypervisors - what is the physical network topology between these two servers? What theoretical link rates would be attainable?
>
> Here is the topology. The iperf endpoints are on two different hypervisors:
>
>   ___________   _______________        ___________   _______________
>  | Linux VM1 | | BSD 13 VM 1   |      | Linux VM2 | | BSD 13 VM 2   |
>  |___________| |_______________|      |___________| |_______________|
>  ______________________________       ______________________________
> |      ESX Hypervisor 1        |     |      ESX Hypervisor 2        |
> |______________________________|     |______________________________|
>                |   10G link connected via L2 switch  |
>                |_____________________________________|
>
> The NIC is of 10G capacity on both ESX servers, and it has the below config.

So, when both VMs run on the same hypervisor, maybe with another VM to simulate the 100 ms delay, can you attain a lossless baseline scenario?

> BDP for a 16 MB socket buffer: 16 MB * 8 bits * (1000 ms / 100 ms latency) / 1024 = 1.25 Gbps
>
> So theoretically we should see close to 1.25 Gbps of bitrate, and we see Linux reaching close to this number.

Under no loss, yes.

> But BSD is not able to do that.
>
> Q. Did you run iperf3? Did the transmitting endpoint report any retransmissions between the Linux or FreeBSD hosts?
>
> Yes, we used iperf3. Linux does far fewer retransmissions than BSD.
> On BSD, the best performance was around 600 Mbps, with about 32K retransmissions.
> On Linux, the best performance was around 1.15 Gbps, with only about 2K retransmissions.
> So, as you pointed out, the number of retransmissions on BSD could be the real issue here.

There are other cc modules available, but I believe one major deviation is that Linux can perform mechanisms like HyStart, ACK every packet when the client detects slow start, and perform pacing to achieve more uniform packet transmissions.

I think the next step would be to find out at which queue those packet discards are happening (external switch? delay generator? vSwitch? Ethernet stack inside the VM?). Alternatively, provide your ESX hypervisors with vastly more link speed, to rule out any L2-induced packet drops - provided your delay generator is not the source when momentarily overloaded.

> Is there a way to reduce this packet loss by fine-tuning some parameters w.r.t. the ring buffer or any other areas?

Finding where these drops arise (by looking at queue and port counters) would be the next step. But this is not really my specific area of expertise beyond high-level, vendor-independent observations.

Switching to other cc modules may give some more insights. But again, I suspect that momentary (microsecond) burstiness of BSD may be causing this significantly higher loss rate.

TCP RACK would be another option. That stack has pacing, more fine-grained timing, the RACK loss recovery mechanisms, etc. Maybe that helps reduce the packet drops observed by iperf and, consequently, yields a higher overall throughput.
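For completeness, a rough sketch of how one might try the alternatives mentioned above on FreeBSD 13.x. This assumes the cc modules are available and, for RACK, a kernel built with the extra TCP stacks (the TCPHPTS option / WITH_EXTRA_TCP_STACKS); it is an illustration, not a tested recipe.

# try a different congestion-control module system-wide
kldload cc_cubic                       # if not already loaded or compiled in
sysctl net.inet.tcp.cc.available       # list the loaded cc algorithms
sysctl net.inet.tcp.cc.algorithm=cubic

# load and select the RACK stack for new connections
kldload tcp_rack
sysctl net.inet.tcp.functions_available
sysctl net.inet.tcp.functions_default=rack

iperf3 can also select the congestion algorithm per test with -C, as in the runs earlier in the thread.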