64K NFS I/O generates a 34mbuf list for TCP which breaks TSO
Rick Macklem
rmacklem at uoguelph.ca
Thu Jan 30 03:56:31 UTC 2014
For some time, I've been seeing reports of NFS related issues
that get resolved by the user either disabling TSO or reducing
the rsize/wsize to 32K.
I now think I know why this is happening, although the evidence
is just coming in. (I have no hardware/software that does TSO,
so I never see these problems during testing.)
A 64K NFS read reply, readdir reply or write request results in
the krpc handing the TCP socket an mbuf list with 34 entries via
sosend(). (The data is built in 2K mbuf clusters, so 64K of data
takes 32 clusters, plus a couple of mbufs for the RPC header,
giving the 34.) Now, I am really rusty w.r.t. TCP, but it looks
like this will result in a TCP/IP header + 34 data mbufs being
handed to the network device driver, if if_hw_tsomax has the
default setting of 65535 (the max IP datagram size).
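(For concreteness, the count that matters is just the length of
the m_next chain, which the driver's transmit path effectively
has to walk; mbuf_chain_len() here is an illustrative helper,
not existing code:)

	#include <sys/param.h>
	#include <sys/mbuf.h>

	/* Count the entries in an mbuf chain; this is the number
	 * that must fit in the driver's transmit s/g list. */
	static int
	mbuf_chain_len(struct mbuf *m)
	{
		int cnt;

		for (cnt = 0; m != NULL; m = m->m_next)
			cnt++;
		return (cnt);
	}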
At a glance, many drivers use a scatter/gather list of around
32 elements for transmission. If the mbuf list doesn't fit in
this scatter/gather list (which looks to me like it will be the
case here), the driver calls either m_defrag() or m_collapse()
to try to fix the problem.
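The pattern I'm referring to looks roughly like this (a sketch
only; sc, txd, tx_tag and DRV_MAXSEGS are hypothetical
driver-local names, with DRV_MAXSEGS standing in for the ~32
element scatter/gather limit):

	error = bus_dmamap_load_mbuf_sg(sc->tx_tag, txd->map, m,
	    segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/* Chain too long for the s/g list; try to shorten it.
		 * Some drivers use m_defrag(m, M_NOWAIT) here instead. */
		struct mbuf *n = m_collapse(m, M_NOWAIT, DRV_MAXSEGS);

		if (n == NULL) {
			/* Transmit is lost; TCP has to time out
			 * and retransmit to get going again. */
			m_freem(m);
			return (ENOBUFS);
		}
		m = n;
		error = bus_dmamap_load_mbuf_sg(sc->tx_tag, txd->map,
		    m, segs, &nsegs, BUS_DMA_NOWAIT);
	}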
This seems like a serious problem to me:
1 - If m_collapse()/m_defrag() fails, the transmit doesn't happen
and things wedge until a TCP retransmit timeout gets things
going again. m_defrag() looks less likely to fail, but it
generates a lot of overhead; m_collapse() has less overhead,
but seems less likely to succeed.
(Since m_defrag() is called with M_NOWAIT, it can fail when
cluster allocation fails. I'm not sure if it will fail otherwise?)
So, how to fix this?
1 - Change NFS to use 4K clusters for these 64K reads/writes, reducing
the mbuf list from 34->18. Preliminary patches for this are being
tested. (A sketch of the allocation change follows this list.)
--> However, this seems to be more of a work-around than a fix.
2 - As soon as a driver needs to call m_defrag() or m_collapse()
because the TSO transmit mbuf list is too long, reduce
if_hw_tsomax by a significant amount to try to get
tcp_output() to generate shorter mbuf lists. (There's a sketch
of this after the list, too.)
Not great, but at least better than calling m_defrag()/m_collapse()
over and over and over again.
--> As a starting point, instrumenting the device drivers with
counts of # of calls to m_defrag()/m_collapse() and counts of
failed calls would help to confirm how serious this problem is.
(A sample counter sketch is below as well.)
3 - ??? Any ideas from folk familiar with TSO and these drivers.
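For (1), the allocation change amounts to something like this
(a minimal sketch, not the actual patch; nfs_data_cluster() is
a made-up name):

	#include <sys/param.h>
	#include <sys/mbuf.h>

	/* Allocate one data cluster for the 64K read/write path.
	 * MJUMPAGESIZE is 4K on most platforms, so 64K of data takes
	 * 16 clusters instead of 32 2K (MCLBYTES) ones, shrinking
	 * the chain handed to sosend() from ~34 to ~18 entries. */
	static struct mbuf *
	nfs_data_cluster(void)
	{

		return (m_getjcl(M_WAITOK, MT_DATA, 0, MJUMPAGESIZE));
	}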
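For (2), what I have in mind in a driver's transmit path is
roughly this (again only a sketch; whether a driver can simply
store the field like this, and how far to back off, are open
questions):

	if (error == EFBIG) {
		/* Back off if_hw_tsomax so tcp_output() generates
		 * shorter mbuf chains for subsequent TSO segments. */
		if (ifp->if_hw_tsomax > 32 * MCLBYTES)
			ifp->if_hw_tsomax -= 4 * MCLBYTES;
		/* ... then fall back on m_collapse()/m_defrag()
		 * as above ... */
	}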
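And the instrumentation could be as simple as a couple of
read-only sysctls per driver (hypothetical names, hung off hw.
just for illustration), bumped next to each m_defrag()/
m_collapse() call and each NULL return:

	#include <sys/sysctl.h>

	static u_long drv_collapse_calls, drv_collapse_fail;

	SYSCTL_ULONG(_hw, OID_AUTO, drv_collapse_calls, CTLFLAG_RD,
	    &drv_collapse_calls, 0, "# of m_collapse()/m_defrag() calls");
	SYSCTL_ULONG(_hw, OID_AUTO, drv_collapse_fail, CTLFLAG_RD,
	    &drv_collapse_fail, 0, "# of failed m_collapse()/m_defrag() calls");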
rick
ps: Until this gets resolved, please tell anyone with serious NFS
performance/reliability issues to try either disabling TSO or
doing client mounts with "-o rsize=32768,wsize=32768".
I'm not sure how many believe me when I tell them, but at least
I now have a theory as to why it can help a lot.
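(On a FreeBSD client that's something along the lines of
"mount -t nfs -o rsize=32768,wsize=32768 server:/export /mnt",
with server:/export as a placeholder.)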