Terrible NFS performance under 9.2-RELEASE?
Rick Macklem
rmacklem at uoguelph.ca
Tue Jan 28 01:46:23 UTC 2014
pyunyh at gmail.com wrote:
> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> > pyunyh at gmail.com wrote:
> > > On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> > > > Adam McDougall wrote:
> > > > > Also try rsize=32768,wsize=32768 in your mount options; it
> > > > > made a huge difference for me. I've noticed slow file
> > > > > transfers on NFS in 9 and finally did some searching a couple
> > > > > months ago. Someone suggested it and they were on to
> > > > > something.
> > > > >
> > > > I have a "hunch" that might explain why 64K NFS reads/writes
> > > > perform poorly in some network environments.
> > > > A 64K NFS read reply/write request consists of a list of 34
> > > > mbufs when passed to TCP via sosend(), with a total data length
> > > > of around 65680 bytes. Looking at a couple of drivers (virtio
> > > > and ixgbe), they seem to expect no more than 32-33 mbufs in a
> > > > list for a 65535 byte TSO xmit. I think (I don't have anything
> > > > that does TSO to confirm this) that NFS will pass a list that
> > > > is longer (34 plus a TCP/IP header).
> > > > At a glance, it appears that the drivers call m_defrag() or
> > > > m_collapse() when the mbuf list won't fit in their scatter
> > > > table (32 or 33 elements) and, if this fails, just silently
> > > > drop the data without sending it.
> > > > If I'm right, there would be considerable overhead from
> > > > m_defrag()/m_collapse() and near disaster if they fail to fix
> > > > the problem and the data is silently dropped instead of xmited.
> > > >
> > >
> > > I think the actual number of DMA segments allocated for the mbuf
> > > chain is determined by bus_dma(9). bus_dma(9) will coalesce the
> > > current segment with the previous one if possible.
> > >
Btw, I looked at ixgbe.c and it uses bus_dmamap_load_mbuf_sg(), which
seems to use the fixed-size scatter/gather list provided as an argument.
> > Ok, I'll have to take a look, but I thought that an array sized
> > by "num_segs" is passed in as an argument. (And num_segs is set to
> > either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
> > It looked to me that the ixgbe driver called itself ix, so it
> > isn't obvious to me which one we are talking about. (I know that
> > Daniel Braniss had an ix0 and ix1, which were fixed for NFS by
> > disabling TSO.)
> >
>
> It's ix(4). ixgb(4) is a different driver.
>
Ok, well I was looking at ixgbe.c and that one seems like it
might have the problem, for the 82599 case.
> > I'll admit I mostly looked at virtio's network driver, since that
> > was the one being used by J David.
> >
> > Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have
> > been cropping up for quite a while, and I am just trying to find
> > out why. (I have no hardware/software that exhibits the problem,
> > so I can only look at the sources and ask others to try testing
> > stuff.)
> >
> > > I'm not sure whether you're referring to ixgbe(4) or ix(4), but
> > > I see the total length of all segments in ix(4) is 65535, so it
> > > has no room for the ethernet/VLAN header of the mbuf chain. The
> > > driver should be fixed to transmit a 64KB datagram.
> > Well, if_hw_tsomax is set to 65535 by the generic code (the driver
> > doesn't set it) and the code in tcp_output() seems to subtract the
> > size of a tcp/ip header from that before passing data to the
> > driver, so I think the mbuf chain passed to the driver will fit in
> > one ip datagram. (I'd assume all sorts of stuff would break for
> > TSO enabled drivers if that wasn't the case?)
>
> I believe the generic code is doing the right thing. I'm under the
> impression the non-working TSO indicates a bug in the driver. Some
> drivers didn't account for the additional ethernet/VLAN header, so
> the total size of the DMA segments exceeded 65535. I've attached a
> diff for ix(4). It wasn't tested at all as I don't have the
> hardware to test.
>
I agree that if my hunch is correct, the drivers aren't correct.
But since the problem seems to have shown up a lot and it is
always reported as an NFS issue, I really want to get to the
bottom of it. And, if changing to 4K clusters is a useful
work-around for any breakage in the drivers, then that might
be useful.
If the problem isn't the number of mbufs in the mbuf chain,
then changing to 4K clusters won't have any effect, since the
total data length in the chain remains the same. That would
tell us that the problem is something else.
> >
> > > I think the use of m_defrag(9) in TSO is suboptimal. All TSO
> > > capable controllers are able to handle multiple TX buffers, so
> > > it should have used m_collapse(9) rather than copying the entire
> > > chain with m_defrag(9).
> > >
> > I haven't looked at these closely yet (plan on doing so to-day),
> > but even m_collapse() looked like it copied data between mbufs,
> > and that is certainly suboptimal, imho. I don't see why a driver
> > can't split the mbuf list, if there are too many entries for the
> > scatter/gather list, and do it in two iterations (much like
> > tcp_output() does already, since the data length exceeds 65535 -
> > tcp/ip header size).
> >
>
> It can split the mbuf list if the controller supports an increased
> number of TX buffers. Because the controller consumes a DMA
> descriptor for each buffer in the mbuf list, drivers tend to impose
> a limit on the number of TX buffers to save resources.
>
> > However, at this point, I just want to find out if the long chain
> > of mbufs is why TSO is problematic for these drivers, since I'll
> > admit I'm getting tired of telling people to disable TSO (and I
> > suspect some don't believe me and never try it).
> >
>
> TSO capable controllers tend to have various limitations (the first
> TX buffer should have the complete ethernet/IP/TCP header, ip_len
> of the IP header should be reset to 0, the TCP pseudo checksum
> should be recomputed, etc.) and cheap controllers need more
> assistance from the driver to let their firmware know the various
> IP/TCP header offset locations in the mbuf. Because this requires
> IP/TCP header parsing, it's error prone and very complex.
>
> > > > Anyhow, I have attached a patch that makes NFS use
> > > > MJUMPAGESIZE clusters, so the mbuf count drops from 34 to 18.
> > > >
> > >
> > > Could we make it conditional on size?
> > >
> > Not sure what you mean? If you mean "the size of the read/write",
> > that would be possible for NFSv3, but less so for NFSv4. (The
> > read/write is just one Op. in the compound for NFSv4 and there is
> > no way to predict how much more data is going to be generated by
> > subsequent Ops.)
> >
>
> Sorry, I should have been clearer. You already answered my
> question. Thanks.
>
> > If by "size" you mean the amount of memory in the machine then,
> > yes, it certainly could be conditional on that. (I plan to try
> > and look at the allocator to-day as well, but if others know of
> > disadvantages with using MJUMPAGESIZE instead of MCLBYTES, please
> > speak up.)
> >
> > Garrett Wollman already alluded to the MCLBYTES case being
> > pre-allocated, but I'll admit I have no idea what the implications
> > of that are at this time.
> >
> > > > If anyone has a TSO scatter/gather enabled net interface and
> > > > can test this patch on it with NFS I/O (default of 64K
> > > > rsize/wsize) when TSO is enabled and see what effect it has,
> > > > that would be appreciated.
> > > >
> > > > Btw, thanks go to Garrett Wollman for suggesting the change to
> > > > MJUMPAGESIZE clusters.
> > > >
> > > > rick
> > > > ps: If the attachment doesn't make it through and you want the
> > > > patch, just email me and I'll send you a copy.
> > > >
>
More information about the freebsd-net mailing list