Re: FreeBSD TCP (with iperf3) comparison with Linux
Date: Mon, 03 Jul 2023 20:24:00 UTC
I see. Sorry for the terse description in my previous email. If the iperf3
report shows poor throughput and an increasing count in the "Retr" column,
and "netstat -sp tcp" shows retransmitted packets but no SACK recovery
episodes (SACK is enabled by default), then you are likely hitting the
problem I described, and the root cause is TX queue drops. A tcpdump trace
file won't show any packet retransmissions and the peer won't be aware of
the packet loss, as this is a local problem.

cc@s1:~ % netstat -sp tcp | egrep "tcp:|retrans|SACK"
tcp:
        139 data packets (300416 bytes) retransmitted   <<
        0 data packets unnecessarily retransmitted
        3 retransmit timeouts
        0 retransmitted
        0 SACK recovery episodes   <<
        0 segment rexmits in SACK recovery episodes
        0 byte rexmits in SACK recovery episodes
        0 SACK options (SACK blocks) received
        0 SACK options (SACK blocks) sent
        0 SACK retransmissions lost
        0 SACK scoreboard overflow

Local packet drops due to the TX queue being full can be found with this
command, for example:

cc@s1:~ % netstat -i -I bce4 -nd
Name   Mtu Network      Address            Ipkts Ierrs Idrop  Opkts Oerrs Coll Drop
bce4  1500 <Link#5>     00:10:18:56:94:d4 286184     0     0 148079     0    0   54   <<
bce4     - 10.1.1.0/24  10.1.1.2          286183     -     - 582111     -    -    -
cc@s1:~ %

Hope the above stats can help you toward a better root cause analysis.
Also, note that increasing the TX queue size is a workaround and is
specific to a particular NIC. But you get the idea.

Best Regards,
Cheng Cui

On Mon, Jul 3, 2023 at 11:34 AM Murali Krishnamurthy <muralik1@vmware.com> wrote:
> Cheng,
>
> Thanks for your inputs.
>
> Sorry, I am not familiar with this area.
>
> A few queries:
>
> "I believe the default values for bce tx/rx pages are 2. And I happened
> to find this problem before: when the tx queue was full, it would not
> enqueue packets and started returning errors. And this error was
> misunderstood by the TCP layer as retransmission."
>
> Could you please elaborate on what is misunderstood by TCP here?
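The two signals described above (a climbing retransmit counter with zero SACK recovery episodes, plus a nonzero Drop column from netstat -i) can be checked together from captured output. A rough sketch in plain sh; the sample text is trimmed from the stats in this thread, and on a live box you would capture it with `netstat -sp tcp` and `netstat -i -I bce4 -nd` (the `bce4` name and field positions are just this thread's example, not a general rule):

```shell
# Sketch: flag the "local TX queue drop" signature from captured netstat output.
# Sample captures are inlined as variables so the logic is self-contained.

tcp_stats='139 data packets (300416 bytes) retransmitted
3 retransmit timeouts
0 SACK recovery episodes'

# One data line of `netstat -i -I bce4 -nd`; the last column is Drop.
if_stats='bce4 1500 <Link#5> 00:10:18:56:94:d4 286184 0 0 148079 0 0 54'

retrans=$(printf '%s\n' "$tcp_stats" | awk '/data packets .*retransmitted/ {print $1}')
sack=$(printf '%s\n' "$tcp_stats"    | awk '/SACK recovery episodes/ {print $1}')
txdrop=$(printf '%s\n' "$if_stats"   | awk '{print $NF}')

# Retransmits without any SACK episodes, plus interface-level drops:
# the loss is local, so the peer never saw the "lost" packets.
if [ "$retrans" -gt 0 ] && [ "$sack" -eq 0 ] && [ "$txdrop" -gt 0 ]; then
    echo "local TX queue drops likely: retrans=$retrans sack_episodes=$sack tx_drop=$txdrop"
fi
```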
> Loss of packets should anyway lead to retransmissions.
>
> Could you point me to some stats where I can see such drops due to the
> queue getting full?
>
> I have a vmx interface in my VM and I have attached a screenshot of the
> ifconfig command for it. Anything we can understand from that?
>
> Will your suggestion of increasing tx_pages=4 and rx_pages=4 work for
> this? If so, I assume the names would be hw.vmx.tx_pages=4 and
> hw.vmx.rx_pages?
>
> Regards
> Murali
>
> From: Cheng Cui <cc@freebsd.org>
> Date: Friday, 30 June 2023 at 10:02 PM
> To: Murali Krishnamurthy <muralik1@vmware.com>
> Cc: Scheffenegger, Richard <rscheff@freebsd.org>, FreeBSD Transport <freebsd-transport@freebsd.org>
> Subject: Re: FreeBSD TCP (with iperf3) comparison with Linux
>
> I used an emulation testbed from Emulab.net, with a Dummynet traffic
> shaper adding 100 ms RTT between two nodes. The link capacity is 1 Gbps
> and both nodes are running FreeBSD 13.2.
> cc@s1:~ % ping -c 3 r1
> PING r1-link1 (10.1.1.3): 56 data bytes
> 64 bytes from 10.1.1.3: icmp_seq=0 ttl=64 time=100.091 ms
> 64 bytes from 10.1.1.3: icmp_seq=1 ttl=64 time=99.995 ms
> 64 bytes from 10.1.1.3: icmp_seq=2 ttl=64 time=99.979 ms
>
> --- r1-link1 ping statistics ---
> 3 packets transmitted, 3 packets received, 0.0% packet loss
> round-trip min/avg/max/stddev = 99.979/100.022/100.091/0.049 ms
>
> cc@s1:~ % iperf3 -c r1 -t 10 -i 1 -C cubic
> Connecting to host r1, port 5201
> [  5] local 10.1.1.2 port 56089 connected to 10.1.1.3 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.19 MBytes  35.2 Mbits/sec    0   1.24 MBytes
> [  5]   1.00-2.00   sec  56.5 MBytes   474 Mbits/sec    6   2.41 MBytes
> [  5]   2.00-3.00   sec  58.6 MBytes   492 Mbits/sec   18   7.17 MBytes
> [  5]   3.00-4.00   sec  65.6 MBytes   550 Mbits/sec   14    606 KBytes
> [  5]   4.00-5.00   sec  60.8 MBytes   510 Mbits/sec   18   7.22 MBytes
> [  5]   5.00-6.00   sec  62.1 MBytes   521 Mbits/sec   12   7.86 MBytes
> [  5]   6.00-7.00   sec  60.9 MBytes   512 Mbits/sec   14   3.43 MBytes
> [  5]   7.00-8.00   sec  62.8 MBytes   527 Mbits/sec   16    372 KBytes
> [  5]   8.00-9.00   sec  59.3 MBytes   497 Mbits/sec   14   1.77 MBytes
> [  5]   9.00-10.00  sec  57.0 MBytes   477 Mbits/sec   18   7.13 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec   548 MBytes   459 Mbits/sec  130          sender
> [  5]   0.00-10.10  sec   540 MBytes   449 Mbits/sec               receiver
>
> iperf Done.
>
> cc@s1:~ % ifconfig bce4
> bce4: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>       options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
>       ether 00:10:18:56:94:d4
>       inet 10.1.1.2 netmask 0xffffff00 broadcast 10.1.1.255
>       media: Ethernet 1000baseT <full-duplex>
>       status: active
>       nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
>
> I believe the default values for bce tx/rx pages are 2.
> And I happened to find this problem before: when the tx queue was full,
> it would not enqueue packets and started returning errors. And this
> error was misunderstood by the TCP layer as retransmission.
>
> After adding hw.bce.tx_pages=4 and hw.bce.rx_pages=4 in /boot/loader.conf
> and rebooting:
>
> cc@s1:~ % iperf3 -c r1 -t 10 -i 1 -C cubic
> Connecting to host r1, port 5201
> [  5] local 10.1.1.2 port 20478 connected to 10.1.1.3 port 5201
> [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
> [  5]   0.00-1.00   sec  4.15 MBytes  34.8 Mbits/sec    0   1.17 MBytes
> [  5]   1.00-2.00   sec  83.1 MBytes   697 Mbits/sec    0   12.2 MBytes
> [  5]   2.00-3.00   sec   112 MBytes   939 Mbits/sec    0   12.2 MBytes
> [  5]   3.00-4.00   sec   113 MBytes   944 Mbits/sec    0   12.2 MBytes
> [  5]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0   12.2 MBytes
> [  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec    0   12.2 MBytes
> [  5]   6.00-7.00   sec   112 MBytes   938 Mbits/sec    0   12.2 MBytes
> [  5]   7.00-8.00   sec   113 MBytes   944 Mbits/sec    0   12.2 MBytes
> [  5]   8.00-9.00   sec   112 MBytes   938 Mbits/sec    0   12.2 MBytes
> [  5]   9.00-10.00  sec   113 MBytes   947 Mbits/sec    0   12.2 MBytes
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-10.00  sec   985 MBytes   826 Mbits/sec    0          sender
> [  5]   0.00-10.11  sec   982 MBytes   815 Mbits/sec               receiver
>
> iperf Done.
>
> Best Regards,
> Cheng Cui
>
> On Fri, Jun 30, 2023 at 12:26 PM Murali Krishnamurthy <muralik1@vmware.com> wrote:
>
> Richard,
>
> Appreciate the useful inputs you have shared so far. Will try to figure
> out where the packet drops are happening.
>
> Regarding HyStart, I see even the BSD code base has support for this.
> May I know when we can expect that in a release, if it is not already
> available?
>
> Regarding this point: "Switching to other cc modules may give some more
> insights.
> But again, I suspect that momentary (microsecond) burstiness of BSD may
> be causing this significantly higher loss rate."
>
> Is there some info somewhere where I can understand more on this in
> detail?
>
> Regards
> Murali
>
> On 30/06/23, 9:35 PM, "owner-freebsd-transport@freebsd.org"
> <owner-freebsd-transport@freebsd.org> wrote:
>
> > Hi Murali,
> >
> > > > Q. Since you mention two hypervisors - what is the physical
> > > > network topology in between these two servers? What theoretical
> > > > link rates would be attainable?
> > >
> > > Here is the topology. The iperf endpoints are on 2 different
> > > hypervisors:
> > >
> > >  +-----------+ +-------------+      +-----------+ +-------------+
> > >  | Linux VM1 | | BSD 13 VM 1 |      | Linux VM2 | | BSD 13 VM 2 |
> > >  +-----------+ +-------------+      +-----------+ +-------------+
> > >       |              |                   |              |
> > >  +----------------------+          +----------------------+
> > >  |   ESX Hypervisor 1   |----------|   ESX Hypervisor 2   |
> > >  +----------------------+ 10G link +----------------------+
> > >                     connected via an L2 switch
> > >
> > > The NIC is of 10G capacity on both ESX servers and it has the below
> > > config.
> >
> > So, when both VMs run on the same hypervisor, maybe with another VM
> > to simulate the 100 ms delay, can you attain a lossless baseline
> > scenario?
> >
> > > BDP-limited rate for a 16 MB socket buffer over the 100 ms RTT:
> > > 16 MB * 8 bits/byte / 100 ms = 1.25 Gbps
> > >
> > > So theoretically we should see close to 1.25 Gbps of bitrate, and
> > > we see Linux reaching close to this number.
> >
> > Under no loss, yes.
> >
> > > But BSD is not able to do that.
> > >
> > > > Q. Did you run iperf3? Did the transmitting endpoint report any
> > > > retransmissions between Linux or FBSD hosts?
> > >
> > > Yes, we used iperf3. I see Linux doing a smaller number of
> > > retransmissions compared to BSD.
> > > On BSD, the best performance was around 600 Mbps bitrate, and the
> > > number of retransmissions seen at that rate is around 32K.
> > > On Linux, the best performance was around 1.15 Gbps bitrate, and
> > > the number of retransmissions seen at that rate is only 2K.
> > > So, as you pointed out, the number of retransmissions in BSD could
> > > be the real issue here.
> >
> > There are other cc modules available; but I believe one major
> > deviation is that Linux can perform mechanisms like HyStart; ACK
> > every packet when the client detects slow start; and perform pacing
> > to achieve more uniform packet transmissions.
> >
> > I think the next step would be to find out which queue those packet
> > discards are coming from (external switch? delay generator? vSwitch?
> > Ethernet stack inside the VM?).
> >
> > Or alternatively, provide your ESX hypervisors with vastly more link
> > speed, to rule out any L2-induced packet drops - provided your delay
> > generator is not the source when momentarily overloaded.
> >
> > > Is there a way to reduce this packet loss by fine-tuning some
> > > parameters w.r.t. the ring buffer or any other areas?
> >
> > Finding where these drops arise (looking at queue and port counters)
> > would be the next step. But this is not really my specific area of
> > expertise beyond the high-level, vendor-independent observations.
> >
> > Switching to other cc modules may give some more insights. But
> > again, I suspect that momentary (microsecond) burstiness of BSD may
> > be causing this significantly higher loss rate.
> >
> > TCP RACK would be another option. That stack has pacing, more
> > fine-grained timing, the RACK loss recovery mechanisms, etc. Maybe
> > that helps reduce the packet drops observed by iperf and,
> > consequently, yields a higher overall throughput.
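As a footnote to the buffer arithmetic quoted in the thread: the 1.25 Gbps ceiling for a 16 MB socket buffer over 100 ms is just buffer divided by RTT, since at most one buffer's worth of data can be in flight per round trip. A quick restatement of that calculation (only the numbers already given above; the Gbit here is binary, 2^30 bits, which is how the thread's 1.25 figure comes out):

```shell
# Buffer-limited rate = socket buffer / RTT, for a 16 MiB buffer and 100 ms RTT.
gbps=$(awk 'BEGIN { printf "%.2f", 16 * 1024 * 1024 * 8 / 0.100 / (1024 * 1024 * 1024) }')
echo "buffer-limited rate: $gbps Gbit/s"
# prints: buffer-limited rate: 1.25 Gbit/s
```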