Re: TSO + ECN

From: Scheffenegger, Richard <rscheff_at_freebsd.org>
Date: Fri, 22 Dec 2023 12:19:43 UTC

Thanks, Michael.

Having looked at that document, I believe the bit masks there are incorrect.

Under RFC 3168, the CWR bit is supposed to be sent only once (and ideally as 
early as possible). The documented bitmasks for the First, Mid and Last 
segments don't make sense in that case:

0xFF6 0xFF6 0xF7F

These masks would allow the CWR bit in the first and any middle segment, 
only clearing it in the last - where PSH and FIN would be allowed to be 
sent... (Also, why the SYN and RST bits aren't similarly masked out 
escapes me).
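
To make the effect of those masks concrete, here is a minimal sketch of how I read them. It assumes the masks apply to the 12-bit TCP flags field with FIN as the lowest bit, and that each segment's flags are simply the header flags ANDed with the mask for its position. The TCPF_* names, the struct and the helper are made up for illustration; they are not taken from any driver.

```c
#include <assert.h>
#include <stdint.h>

/* 12-bit TCP flags field, FIN in the lowest bit (assumed layout). */
#define TCPF_FIN 0x001
#define TCPF_SYN 0x002
#define TCPF_RST 0x004
#define TCPF_PSH 0x008
#define TCPF_ACK 0x010
#define TCPF_URG 0x020
#define TCPF_ECE 0x040
#define TCPF_CWR 0x080

/* Hypothetical container for the three per-position masks. */
struct tso_flag_masks {
	uint16_t first;
	uint16_t mid;
	uint16_t last;
};

/*
 * Flags carried by segment idx (0-based) out of nsegs, assuming the
 * offload engine ANDs the header flags with the mask for that position.
 */
static uint16_t
tso_segment_flags(const struct tso_flag_masks *m, uint16_t hdr_flags,
    unsigned idx, unsigned nsegs)
{
	/* A TSO burst is assumed to produce at least two segments. */
	assert(nsegs >= 2 && idx < nsegs);
	if (idx == 0)
		return (hdr_flags & m->first);
	if (idx == nsegs - 1)
		return (hdr_flags & m->last);
	return (hdr_flags & m->mid);
}
```

With the masks 0xFF6 0xFF6 0xF7F and a header carrying ACK|PSH|CWR, the first and middle segments come out as ACK|CWR and only the last one as ACK|PSH, which is the opposite of sending CWR once and early.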


I also checked how the vmxnet3 driver behaves when TSO is active - and 
found that it will leave the CWR bit unchanged on any of the TSO segments.

Finally (and this is where the question came from), the virtio driver 
discards TSO mbufs with ENOTSUP when it encounters the CWR bit and the host 
did not indicate that its TSO capability would "properly" support 
ECN. That leads to massive performance degradation: TSO remains 
enabled, but every time a segment carrying the CWR bit is to be sent, the 
cwnd has to collapse to 1 MSS before a transmission succeeds. This 
typically takes an RTO...


Ultimately we also need to consider the upcoming changes in the semantics of 
these ECN-related bits with AccECN (which does *NOT* require any special 
handling of these bits on the TX path any longer).


I decided to create D43166 to fix this in tcp_output(),
and D43167 to no longer stop TSO transmissions when encountering CWR on 
"unsupporting" hosts.


The change restructures some of the ECN handling so that, whenever the CWR 
bit is scheduled to be sent, the TSO TX path is bypassed completely.
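
As a rough illustration of that decision (a simplified sketch only, not the code in D43166), the check could look like the following. The function name, its parameters and the TCPF_CWR constant are hypothetical and do not match the real tcp_output() variables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TCPF_CWR 0x080	/* CWR bit in the TCP flags field */

/*
 * Decide whether a pending send may be handed to the TSO path.
 * tso_capable: the interface offers TSO for this flow (assumed input).
 * flags:       TCP flags scheduled for this send.
 * len, mss:    payload length of the send and the current segment size.
 */
static bool
may_use_tso(bool tso_capable, uint16_t flags, uint32_t len, uint32_t mss)
{
	assert(mss > 0);
	/* Nothing to offload without TSO or with at most one segment. */
	if (!tso_capable || len <= mss)
		return (false);
	/* A send carrying CWR skips the TSO path entirely. */
	if ((flags & TCPF_CWR) != 0)
		return (false);
	return (true);
}
```

The CWR-carrying data is then sent without offload, and TSO can be used again once no CWR is pending.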

For 3168 ECN - where only a single segment per RTT would be expected to 
have the CWR bit set, I believe this is an acceptable compromise - to 
bypass the various broken or misbehaving TSO implementations when it 
comes to ECN.

For AccECN, where long flights of data could easily have the CWR bit (as 
part of the ACE counter) set, a more performant solution would be needed.

I imagine the simplest one would be to remove any error branch for 
special handling of CWR, even in older TSO drivers where ECN is not 
supported, and to reprogram the header bitmasks in "ECN-aware" TSO offload 
hardware to send the CWR bit unobstructed for the entire TSO:

0xFF6 0xFF6 0xFFF

and once that is all in place, allow TSO only for AccECN enabled 
sessions when the CWR bit is encountered...
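
A small sketch of how one might check whether a given set of masks is usable for AccECN: all three masks have to pass the ACE bits (AE, CWR, ECE) untouched. The bit positions assume the 12-bit flags layout with FIN as the lowest bit, and the TCPF_* names and the helper are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TCPF_ECE 0x040
#define TCPF_CWR 0x080
#define TCPF_AE  0x100
/* The three bits that AccECN uses together as the ACE counter. */
#define TCPF_ACE (TCPF_AE | TCPF_CWR | TCPF_ECE)

static_assert(TCPF_ACE == 0x1C0, "ACE spans three adjacent flag bits");

/*
 * True if the first, middle and last segment masks all leave the ACE
 * bits unmodified, so every segment of a TSO burst carries the same
 * ACE value as the original header.
 */
static bool
masks_preserve_ace(uint16_t first, uint16_t mid, uint16_t last)
{
	return ((first & mid & last & TCPF_ACE) == TCPF_ACE);
}
```

Under these assumptions, 0xFF6 0xFF6 0xF7F fails the check because the last-segment mask clears CWR, while 0xFF6 0xFF6 0xFFF passes.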



I would like to gather some feedback from those who work on the various 
network drivers (intel, mlx, virtio, ...) on whether that sounds like a 
viable plan to rectify the sad state of ECN support with TSO, while 
becoming future-proof.


 > On Dec 20, 2023, at 12:15, Scheffenegger, Richard <rscheff@freebsd.org> wrote:
 >
 > Hi,
 >
 > I am curious if anyone here has experience with the handling of ECN
 > in TSO-enabled drivers/hardware...

Here is one data point, if I read the specification correctly.
Have a look at the specification of the 10 Gbit/s card (ix):
https://cdrdv2-public.intel.com/331520/82599-datasheet-v3-4.pdf

According to sections 7.2.4, 8.2.3.9.3 and 8.2.3.9.4, the
* first segment gets all flags except PSH and FIN.
* middle segments get all flags except PSH and FIN.
* last segment gets all flags except the CWR.

I think you should be able to change the masks.

Best regards
Michael

 >
 > The other day I found that the virtio driver would bail out with
 > ENOTSUP when encountering the TCP CWR header bit on a TSO-enabled flow,
 > when the host does not also claim ECN-support for TSO.
 >
 > But this made me wonder what the expected behavior is.
 >
 > Presumably, this means that the hardware (or driver) would clear the
 > CWR bit after the first packet is sent, correct?
 >
 > However, in light of the upcoming AccECN signalling protocol, that is
 > not what TSO should be doing (with AccECN, all segments should retain
 > the exact same header flags, maybe except PSH).
 >
 > Probably "non-ECN" capable TSO offload would actually work better
 > with AccECN - and if the above behavior is what ECN-aware TSO is doing,
 > AccECN sessions would need to somehow work around that (e.g.
 > spoon-feeding any segment with CWR set individually - e.g. bypassing the
 > TSO capabilities in tcp_output)?
 >
 >
 > Would appreciate any feedback around this...
 >
 > Best regards,
 > Richard