64K NFS I/O generates a 34mbuf list for TCP which breaks TSO
Rick Macklem
rmacklem at uoguelph.ca
Thu Jan 30 03:56:31 UTC 2014
For some time, I've been seeing reports of NFS related issues
that get resolved by the user either disabling TSO or reducing
the rsize/wsize to 32K.
I now think I know why this is happening, although the evidence
is just coming in. (I have no hardware/software that does TSO,
so I never see these problems during testing.)
A 64K NFS read reply, readdir reply or write request results in
the krpc handing the TCP socket an mbuf list with 34 entries via
sosend(). (The data is built in 2K mbuf clusters, so 64K of data
takes 32 clusters, plus a couple of mbufs for the RPC header,
giving the 34.) Now, I am really rusty w.r.t. TCP, but it looks
like this will result in a TCP/IP header + 34 data mbufs being
handed to the network device driver, if if_hw_tsomax has the
default setting of 65535 (the max IP datagram size).
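(For concreteness, the count that matters is just the length of
the m_next chain, which the driver's transmit path effectively
has to walk; mbuf_chain_len() here is an illustrative helper,
not existing code:)

	#include <sys/param.h>
	#include <sys/mbuf.h>

	/* Count the entries in an mbuf chain; this is the number
	 * that must fit in the driver's transmit s/g list. */
	static int
	mbuf_chain_len(struct mbuf *m)
	{
		int cnt;

		for (cnt = 0; m != NULL; m = m->m_next)
			cnt++;
		return (cnt);
	}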
At a glance, many drivers use a scatter/gather list of around
32 elements for transmission. If the mbuf list doesn't fit in
this scatter/gather list (which looks to me like it will be the
case here), the driver calls either m_defrag() or m_collapse()
to try to fix the problem.
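The pattern I'm referring to looks roughly like this (a sketch
only; sc, txd, tx_tag and DRV_MAXSEGS are hypothetical
driver-local names, with DRV_MAXSEGS standing in for the ~32
element scatter/gather limit):

	error = bus_dmamap_load_mbuf_sg(sc->tx_tag, txd->map, m,
	    segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/* Chain too long for the s/g list; try to shorten it.
		 * Some drivers use m_defrag(m, M_NOWAIT) here instead. */
		struct mbuf *n = m_collapse(m, M_NOWAIT, DRV_MAXSEGS);

		if (n == NULL) {
			/* Transmit is lost; TCP has to time out
			 * and retransmit to get going again. */
			m_freem(m);
			return (ENOBUFS);
		}
		m = n;
		error = bus_dmamap_load_mbuf_sg(sc->tx_tag, txd->map,
		    m, segs, &nsegs, BUS_DMA_NOWAIT);
	}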
This seems like a serious problem to me:
1 - If m_collapse()/m_defrag() fails, the transmit doesn't happen
and things wedge until a TCP retransmit timeout gets things
going again. m_defrag() looks less likely to fail, but it
generates a lot of overhead; m_collapse() has less overhead,
but seems less likely to succeed.
(Since m_defrag() is called with M_NOWAIT, it can fail when
cluster allocation fails. I'm not sure if it will fail otherwise?)
So, how to fix this?
1 - Change NFS to use 4K clusters for these 64K reads/writes, reducing
the mbuf list from 34->18. Preliminary patches for this are being
tested. (A sketch of the allocation change follows this list.)
--> However, this seems to be more of a work-around than a fix.
2 - As soon as a driver needs to call m_defrag() or m_collapse()
because the TSO transmit mbuf list is too long, reduce
if_hw_tsomax by a significant amount to try to get
tcp_output() to generate shorter mbuf lists. (There's a sketch
of this after the list, too.)
Not great, but at least better than calling m_defrag()/m_collapse()
over and over and over again.
--> As a starting point, instrumenting the device drivers with
counts of # of calls to m_defrag()/m_collapse() and counts of
failed calls would help to confirm how serious this problem is.
(A sample counter sketch is below as well.)
3 - ??? Any ideas from folk familiar with TSO and these drivers.
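For (1), the allocation change amounts to something like this
(a minimal sketch, not the actual patch; nfs_data_cluster() is
a made-up name):

	#include <sys/param.h>
	#include <sys/mbuf.h>

	/* Allocate one data cluster for the 64K read/write path.
	 * MJUMPAGESIZE is 4K on most platforms, so 64K of data takes
	 * 16 clusters instead of 32 2K (MCLBYTES) ones, shrinking
	 * the chain handed to sosend() from ~34 to ~18 entries. */
	static struct mbuf *
	nfs_data_cluster(void)
	{

		return (m_getjcl(M_WAITOK, MT_DATA, 0, MJUMPAGESIZE));
	}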
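For (2), what I have in mind in a driver's transmit path is
roughly this (again only a sketch; whether a driver can simply
store the field like this, and how far to back off, are open
questions):

	if (error == EFBIG) {
		/* Back off if_hw_tsomax so tcp_output() generates
		 * shorter mbuf chains for subsequent TSO segments. */
		if (ifp->if_hw_tsomax > 32 * MCLBYTES)
			ifp->if_hw_tsomax -= 4 * MCLBYTES;
		/* ... then fall back on m_collapse()/m_defrag()
		 * as above ... */
	}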
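And the instrumentation could be as simple as a couple of
read-only sysctls per driver (hypothetical names, hung off hw.
just for illustration), bumped next to each m_defrag()/
m_collapse() call and each NULL return:

	#include <sys/sysctl.h>

	static u_long drv_collapse_calls, drv_collapse_fail;

	SYSCTL_ULONG(_hw, OID_AUTO, drv_collapse_calls, CTLFLAG_RD,
	    &drv_collapse_calls, 0, "# of m_collapse()/m_defrag() calls");
	SYSCTL_ULONG(_hw, OID_AUTO, drv_collapse_fail, CTLFLAG_RD,
	    &drv_collapse_fail, 0, "# of failed m_collapse()/m_defrag() calls");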
rick
ps: Until this gets resolved, please tell anyone with serious NFS
performance/reliability issues to try either disabling TSO or
doing client mounts with "-o rsize=32768,wsize=32768".
I'm not sure how many believe me when I tell them, but at least
I now have a theory as to why it can help a lot.
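(On a FreeBSD client that's something along the lines of
"mount -t nfs -o rsize=32768,wsize=32768 server:/export /mnt",
with server:/export as a placeholder.)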