kern/167325: [netinet] [patch] sosend sometimes return EINVAL with TSO and VLAN on 82599 NIC

YongHyeon PYUN pyunyh at gmail.com
Fri Sep 14 04:52:00 UTC 2012


On Fri, Sep 07, 2012 at 05:44:48PM -0400, Jeremiah Lott wrote:
> On Apr 27, 2012, at 2:07 AM, linimon at FreeBSD.org wrote:
> 
> > Old Synopsis: sosend sometimes return EINVAL with TSO and VLAN on 82599 NIC
> > New Synopsis: [netinet] [patch] sosend sometimes return EINVAL with TSO and VLAN on 82599 NIC
> 
> > http://www.freebsd.org/cgi/query-pr.cgi?pr=167325
> 
> I did an analysis of this pr a while back and figured I'd share.  It definitely looks like a real problem, but at least in 8.2 it is difficult to hit.  First off, vlan tagging is not required to hit this: the code in question does not account for any amount of link-level header, so you can reproduce the bug even without vlans.
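
For reference, the check being discussed is the tso length clamp in
tcp_output(); paraphrased from memory rather than quoted exactly, it
amounts to something like:

    if (len > IP_MAXPACKET - hdrlen) {
        /*
         * Limit the tso burst so that ip->ip_len does not
         * overflow.  Note hdrlen covers only the tcp/ip header
         * and options; nothing is reserved for the link-level
         * header, so the finished frame can still exceed 64K.
         */
        len = IP_MAXPACKET - hdrlen;
        sendalot = 1;
    }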
> 
> In order to trigger it, the tcp stack must choose to send a tso "packet" with a total size (including the tcp+ip header and options, but not the link-level header) between 65522 and 65535 bytes, because adding the 14-byte link-level header then exceeds the 64K limit.  In 8.1, the tcp stack only chooses to send tso bursts that will result in full mtu-size on-wire packets.  To achieve this, it truncates the tso packet size to a multiple of the mss, not including the header and tcp options.  The check has been relaxed a little in head, but the same basic check is still there.  None of the "normal" mtus have multiples falling in this range; to reproduce it I used an mtu of 1445.  When timestamps are in use, every packet has a 40-byte tcp/ip header plus 10 bytes for the timestamp option plus 2 bytes of padding.  You can get a packet length of 65523 as follows:
> 
> 65523 - (40 + 10 + 2) = 65471 (size of tso packet data)
> 65471 / 47 = 1393 (size of data per on-wire packet, with 47 packets in the burst)
> 1393 + (40 + 10 + 2) = 1445 (mtu is data + header + options + pad)
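
The numbers are easy to double-check; a throwaway userland program
re-deriving the example above (the constants are just the ones from
the example):

    #include <stdio.h>

    int
    main(void)
    {
        int hdr = 40 + 10 + 2;    /* tcp/ip header + timestamp option + pad */
        int mtu = 1445;
        int mss = mtu - hdr;      /* 1393 data bytes per on-wire packet */
        int tso = 47 * mss + hdr; /* 47-segment tso burst as seen by the driver */

        /* prints 65523, inside the bad 65522..65535 window; +14 gives 65537 */
        printf("tso len %d, with ethernet header %d\n", tso, tso + 14);
        return (0);
    }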
> 
> Once you set your mtu to 1445, you need a program that can get the stack to send a maximum-sized packet.  Thanks to the congestion window, that can be more difficult than it sounds.  I used some python that sends enough data to open the window, sleeps long enough to drain all outstanding data but not long enough for the congestion window to go stale and close again, and then sends a bunch more data.  It also helps to turn off delayed acks on the receiver; sometimes you will not drain the entire send buffer because an ack for the final chunk is still delayed when you start the second transmit.  When the problem described in the pr hits, the EINVAL from bus_dmamap_load_mbuf_sg bubbles right up to userspace.
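
A rough C equivalent of that python, for the record (the address,
port, buffer size and sleep are placeholders that will need tuning
per setup):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <err.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        static char buf[1024 * 1024];
        struct sockaddr_in sin;
        int s;

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(9999);                  /* placeholder */
        sin.sin_addr.s_addr = inet_addr("10.0.0.2"); /* placeholder */

        if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
            err(1, "socket");
        if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
            err(1, "connect");

        /* open the congestion window, then let the data drain */
        if (write(s, buf, sizeof(buf)) == -1)
            err(1, "write");
        usleep(200 * 1000);  /* long enough to drain, short enough that
                                the window doesn't go stale */

        /* second burst; when the bug hits, write(2) fails with EINVAL */
        if (write(s, buf, sizeof(buf)) == -1)
            err(1, "write");

        close(s);
        return (0);
    }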
> 
> At first I thought this was a driver bug rather than a stack bug.  The code in question does what its comment says it does (limit the tso packet so that ip->ip_len does not overflow).  However, it also seems reasonable for the driver to limit its dma tag at 64K (do we really want it allocating another whole page just for the 14-byte link-level header?).  Perhaps the tcp stack should instead ensure that the tso packet + max_linkhdr is < 64K.  Comments?
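
If the stack-side route were taken, the clamp sketched earlier would
just need to reserve room for the link-level header as well;
hypothetically:

    /* hypothetical stack-side fix, not a tested patch */
    if (len > IP_MAXPACKET - hdrlen - max_linkhdr) {
        len = IP_MAXPACKET - hdrlen - max_linkhdr;
        sendalot = 1;
    }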

Hmm, I think it's a driver bug.  The upper stack may not know
whether L2 includes a VLAN tag.  Almost all drivers in the tree
include the L2 header size in the DMA tag.  If the ethernet
hardware can handle these oversized frames (64KB + L2 header)
with TSOv4/TSOv6, I think there is no reason not to support it.
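
To make the driver-side version concrete: the TX DMA tag just needs
its maxsize padded by the worst-case L2 header instead of stopping at
64K.  An illustrative fragment (the names, segment count and softc
field are made up; this is not ixgbe's actual code):

    error = bus_dma_tag_create(
        bus_get_dma_tag(dev),            /* parent */
        1, 0,                            /* alignment, boundary */
        BUS_SPACE_MAXADDR,               /* lowaddr */
        BUS_SPACE_MAXADDR,               /* highaddr */
        NULL, NULL,                      /* filter, filterarg */
        65535 + sizeof(struct ether_vlan_header), /* maxsize: TSO + L2 */
        32,                              /* nsegments (illustrative) */
        PAGE_SIZE,                       /* maxsegsize */
        0,                               /* flags */
        NULL, NULL,                      /* lockfunc, lockfuncarg */
        &sc->tx_dma_tag);                /* hypothetical softc field */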

> 
> As an aside, the patch attached to the pr is also slightly wrong.  Taking max_linkhdr into account when rounding the packet to a multiple of the mss does not make sense; it should only be taken into account when calculating the max tso length.
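
In other words, the corrected shape would be roughly the following
sketch (not the actual patch), with max_linkhdr appearing only in the
length cap; tp->t_maxopd - optlen is the per-segment payload the
stack already uses for tso_segsz:

    /* cap: the finished frame, link-level header included, stays <= 64K */
    if (len > IP_MAXPACKET - hdrlen - max_linkhdr)
        len = IP_MAXPACKET - hdrlen - max_linkhdr;

    /* round down to a whole number of segments; no max_linkhdr here */
    mss = tp->t_maxopd - optlen;
    if (len > mss)
        len -= len % mss;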
> 
>   Jeremiah Lott
>   Avere Systems

