Re: TSO + ECN
Date: Fri, 22 Dec 2023 12:19:43 UTC
Thanks Michael.

Having looked at that document, the bit masks there are incorrect. In RFC3168, the CWR bit is supposed to be sent once only (and ideally as early as possible). The documented bitmasks for the First, Mid and Last segments don't make sense in that case:

0xFF6 0xFF6 0xF7F

These masks would allow the CWR bit in the first and any middle segment, only clearing it in the last - where PSH and FIN would be allowed to be sent... (Also, why the SYN and RST bits aren't similarly masked out escapes me).

I also checked how the vmxnet3 driver behaves when TSO is active - and found that it will leave the CWR bit unchanged on all of the TSO segments.

Finally (and this is where this came from), the virtio driver discards TSO mbufs with ENOTSUP when encountering the CWR bit, if the host didn't indicate that its TSO capability would "properly" support ECN. That leads to massive performance degradation: TSO remains enabled, but every time a CWR bit is to be sent, the cwnd has to collapse to 1 MSS before the transmission can succeed. This typically takes an RTO...

Ultimately we also need to consider the upcoming changes in semantics of these ECN-related bits with AccECN (which does *NOT* require any special handling on the TX path for these bits any longer).

I decided to create D43166 to fix this in tcp_output(), and D43167 to no longer stop TSO transmissions when encountering CWR on "unsupporting" hosts.

By restructuring some of the ECN handling, whenever the CWR bit is scheduled to be sent, that transmission bypasses the TSO TX path completely. For 3168 ECN - where only a single segment per RTT would be expected to have the CWR bit set - I believe this is an acceptable compromise to bypass the various broken or misbehaving TSO implementations when it comes to ECN. For AccECN, where long flights of data could easily have the CWR bit (as part of the ACE counter) set, a more performant solution would be needed.
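
As a rough illustration of the bypass described above (not the code in D43166 or D43167), the decision can be sketched as a small predicate. Everything here is hypothetical: the struct conn_sketch type, its fields and the may_use_tso() helper are invented for this example, and the cwr_safe_tso field stands in for "the offload is known to copy CWR unchanged to every segment".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard TCP header flag bit values. */
#define TH_ACK 0x010
#define TH_CWR 0x080

/* Hypothetical, simplified per-connection state (not the real struct tcpcb). */
struct conn_sketch {
	bool tso_capable;  /* path/interface offers TSO */
	bool accecn;       /* session negotiated AccECN */
	bool cwr_safe_tso; /* assumed: offload passes CWR unchanged on all segments */
};

/*
 * Decide whether a send of 'len' bytes may be handed to TSO.
 * A send carrying CWR goes out as an ordinary segment unless the session
 * uses AccECN and the offload is known to leave CWR untouched.
 */
static bool
may_use_tso(const struct conn_sketch *c, uint16_t flags, uint32_t len,
    uint32_t mss)
{
	assert(mss > 0);
	/* Nothing to offload if TSO is unavailable or one segment suffices. */
	if (!c->tso_capable || len <= mss)
		return (false);
	/* CWR pending: bypass TSO except for the AccECN-safe case. */
	if ((flags & TH_CWR) != 0 && !(c->accecn && c->cwr_safe_tso))
		return (false);
	return (true);
}
```

With cwr_safe_tso left false everywhere, this models the current compromise (any CWR bypasses TSO); setting it for AccECN sessions models the later step where TSO is allowed again once the drivers pass CWR through.
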

I imagine the simplest solution for AccECN would be to:

- remove any error branch for special handling of CWR - even on older TSO drivers, where ECN is not supported;
- reprogram the header bitmasks in "ECN-aware" TSO offload hardware to send the CWR bit unobstructed for the entire TSO:

  0xFF6 0xFF6 0xFFF

- and once that is all in place, allow TSO only for AccECN enabled sessions when the CWR bit is encountered...

I would like to gather some feedback from those who work on the various network drivers (intel, mlx, virtio, ...) on whether that sounds like a viable plan to rectify the sad state of ECN support with TSO - while becoming future-proof.

> On Dec 20, 2023, at 12:15, Scheffenegger, Richard <rscheff@freebsd.org> wrote:
>
> Hi,
>
> I am curious if anyone here has experience with the handling of ECN in TSO-enabled drivers/hardware...

Some data pointers, if I read the specification correctly. Have a look at the specification of the 10GBit/sec card ix:
https://cdrdv2-public.intel.com/331520/82599-datasheet-v3-4.pdf

According to sections 7.2.4, 8.2.3.9.3 and 8.2.3.9.4, the
* first segment gets all flags except PSH and FIN.
* middle segments get all flags except PSH and FIN.
* last segment gets all flags except the CWR.

I think you should be able to change the masks.

Best regards
Michael

>
> The other day I found that the virtio driver would bail out with ENOTSUP when encountering the TCP CWR header bit on a TSO-enabled flow, when the host does not also claim ECN-support for TSO.
>
> But this made me wonder what the expected behavior is.
>
> Presumably, this means that the hardware (or driver) would clear the CWR bit after the first packet is sent, correct?
>
> However, in light of the upcoming AccECN signalling protocol, that is not what TSO should be doing (with AccECN, all segments should retain the exact same header flags, maybe except PSH).
>
> Probably "non-ECN" capable TSO offload would actually work better with AccECN - and if the above behavior is what ECN-aware TSO is doing, AccECN sessions would need to somehow work around that (e.g. spoon-feeding any segment with CWR set individually - e.g. bypassing the TSO capabilities in tcp_output)?
>
>
> Would appreciate any feedback around this...
>
> Best regards,
> Richard
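
To make the First/Mid/Last masks discussed in this thread easier to check, here is a small sketch that applies them to a set of TCP header flags. It assumes, as the thread implies, that a 1 bit in a mask lets the corresponding flag through and a 0 bit clears it. The names struct tso_masks, masks_documented, masks_proposed and seg_flags() are made up for this example and do not come from any driver or datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* Standard TCP header flag bit values (low 12 bits of the flags field). */
#define TH_FIN 0x001
#define TH_SYN 0x002
#define TH_RST 0x004
#define TH_PSH 0x008
#define TH_ACK 0x010
#define TH_CWR 0x080

enum seg_pos { SEG_FIRST, SEG_MID, SEG_LAST };

/* Hypothetical container for the three per-position flag masks. */
struct tso_masks {
	uint16_t first, mid, last;
};

/* Masks as documented (CWR cleared only on the last segment). */
static const struct tso_masks masks_documented = { 0xFF6, 0xFF6, 0xF7F };
/* Masks as proposed in this thread (CWR kept on every segment). */
static const struct tso_masks masks_proposed = { 0xFF6, 0xFF6, 0xFFF };

/* Return the flags a segment at position 'pos' would carry. */
static uint16_t
seg_flags(const struct tso_masks *m, enum seg_pos pos, uint16_t flags)
{
	assert((flags & ~0xFFF) == 0); /* only the 12 header flag bits */
	switch (pos) {
	case SEG_FIRST:
		return (flags & m->first);
	case SEG_MID:
		return (flags & m->mid);
	default:
		return (flags & m->last);
	}
}
```

For a TSO send with ACK, PSH and CWR set, the documented masks give ACK|CWR on the first and middle segments and ACK|PSH on the last, which matches the complaint that CWR is repeated rather than sent once. The proposed last-segment mask 0xFFF leaves all three flags on the final segment. SYN and RST pass through 0xFF6 unchanged in both sets.
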