Terrible NFS performance under 9.2-RELEASE?
Daniel Braniss
danny at cs.huji.ac.il
Mon Feb 3 07:06:36 UTC 2014
On Feb 2, 2014, at 6:15 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:
> Daniel Braniss wrote:
>> hi Rick, et al.
>>
>> tried your patch but it didn’t help, the server is still stuck.
> Oh well. I was hoping that was going to make TSO work reliably.
> Just to confirm it, this server works reliably when TSO is disabled?
>
absolutely; with TSO disabled there is no problem, and it’s slightly faster.
The host is ‘server class’; a PC might behave differently.
cheers,
danny
> Thanks for doing the testing, rick
>
>> Just for fun, I tried a different client/host; this one has a Broadcom
>> NetXtreme II that was MFC’ed recently. The results are worse than with
>> the Intel (about 5 hours instead of 4), but again faster without TSO.
>>
>> with TSO enabled and bs=32k:
>> 5.09hs 18325.62 real 1109.23 user 4591.60 sys
>>
>> without TSO:
>> 4.75hs 17120.40 real 1114.08 user 3537.61 sys
>>
>> So what is the advantage of using TSO? (No complaint here, just
>> curious.)
>>
>> I’ll try to see whether it has the same TSO-related issues when acting as a server.
>>
>> cheers,
>> danny
>>
>> On Jan 28, 2014, at 3:51 AM, Rick Macklem <rmacklem at uoguelph.ca>
>> wrote:
>>
>>> Jack Vogel wrote:
>>>> That header file is for the VF driver :) which I don't believe is
>>>> being used in this case.
>>>> The driver is capable of handling 256K, but it's limited by the stack
>>>> to 64K (look in ixgbe.h), so it's not a few bytes off due to the VLAN
>>>> header.
>>>>
>>>> The scatter size is not an arbitrary one; it's due to hardware
>>>> limitations in Niantic (82599). Turning off TSO in the 10G environment
>>>> is not practical; you will have trouble getting good performance.
>>>>
>>>> Jack
>>>>
>>> Well, if you look at this thread, Daniel got much better
>>> performance
>>> by turning off TSO. However, I agree that this is not an ideal
>>> solution.
>>> http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B
>>>
>>> rick
>>>
>>>>
>>>>
>>>> On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN <pyunyh at gmail.com>
>>>> wrote:
>>>>
>>>>> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
>>>>>> pyunyh at gmail.com wrote:
>>>>>>> On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
>>>>>>>> Adam McDougall wrote:
>>>>>>>>> Also try rsize=32768,wsize=32768 in your mount options,
>>>>>>>>> made a
>>>>>>>>> huge
>>>>>>>>> difference for me. I've noticed slow file transfers on NFS
>>>>>>>>> in 9
>>>>>>>>> and
>>>>>>>>> finally did some searching a couple months ago, someone
>>>>>>>>> suggested
>>>>>>>>> it
>>>>>>>>> and
>>>>>>>>> they were on to something.
>>>>>>>>>
>>>>>>>> I have a "hunch" that might explain why 64K NFS reads/writes
>>>>>>>> perform
>>>>>>>> poorly for some network environments.
>>>>>>>> A 64K NFS read reply/write request consists of a list of 34
>>>>>>>> mbufs
>>>>>>>> when
>>>>>>>> passed to TCP via sosend(), with a total data length of around
>>>>>>>> 65680 bytes.
>>>>>>>> Looking at a couple of drivers (virtio and ixgbe), they seem
>>>>>>>> to
>>>>>>>> expect
>>>>>>>> no more than 32-33 mbufs in a list for a 65535 byte TSO xmit.
>>>>>>>> I
>>>>>>>> think
>>>>>>>> (I don't have anything that does TSO to confirm this) that
>>>>>>>> NFS will
>>>>>>>> pass
>>>>>>>> a list that is longer (34 plus a TCP/IP header).
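
(Back-of-the-envelope arithmetic for those numbers -- illustrative only,
not taken from the NFS code; the macro names below are made up:)

    #define NFS_IOSIZE   (64 * 1024)             /* 64K rsize/wsize */
    #define DATA_MBUFS   (NFS_IOSIZE / MCLBYTES) /* 65536 / 2048 = 32 */
    #define HDR_MBUFS    2                       /* approx. RPC/NFS header mbufs */
    /* DATA_MBUFS + HDR_MBUFS ~= 34 mbufs carrying ~65680 bytes of data,
     * one or two more than the 32-33 segments the drivers expect. */
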
>>>>>>>> At a glance, it appears that the drivers call m_defrag() or
>>>>>>>> m_collapse()
>>>>>>>> when the mbuf list won't fit in their scatter table (32 or 33
>>>>>>>> elements)
>>>>>>>> and if this fails, just silently drop the data without
>>>>>>>> sending it.
>>>>>>>> If I'm right, there would be considerable overhead from
>>>>>>>> m_defrag()/m_collapse()
>>>>>>>> and near disaster if they fail to fix the problem and the
>>>>>>>> data is
>>>>>>>> silently
>>>>>>>> dropped instead of xmited.
>>>>>>>>
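
(For anyone following along, the failure mode being described looks
roughly like the sketch below. It is a simplified, hypothetical transmit
path, not the actual ixgbe or virtio code; the function name and the
32-entry segment array are illustrative:)

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>
    #include <sys/mbuf.h>
    #include <machine/bus.h>

    static int
    example_tso_encap(bus_dma_tag_t tag, bus_dmamap_t map, struct mbuf **m_head)
    {
        bus_dma_segment_t segs[32];     /* scatter/gather limit, e.g. 82599 */
        int error, nsegs;

        error = bus_dmamap_load_mbuf_sg(tag, map, *m_head, segs, &nsegs,
            BUS_DMA_NOWAIT);
        if (error == EFBIG) {
            struct mbuf *m;

            /* Too many segments: copy the whole chain into fewer mbufs. */
            m = m_defrag(*m_head, M_NOWAIT);
            if (m == NULL) {
                /* Defrag failed: the packet is freed and never sent. */
                m_freem(*m_head);
                *m_head = NULL;
                return (ENOBUFS);
            }
            *m_head = m;
            error = bus_dmamap_load_mbuf_sg(tag, map, *m_head, segs,
                &nsegs, BUS_DMA_NOWAIT);
        }
        /* On success, segs[0..nsegs-1] would be handed to the TX ring. */
        return (error);
    }
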
>>>>>>>
>>>>>>> I think the actual number of DMA segments allocated for the
>>>>>>> mbuf
>>>>>>> chain is determined by bus_dma(9). bus_dma(9) will coalesce the
>>>>>>> current segment with the previous one if possible.
>>>>>>>
>>>>>> Ok, I'll have to take a look, but I thought that an array sized by
>>>>>> "num_segs" is passed in as an argument. (And num_segs is set to
>>>>>> either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
>>>>>> It looked to me that the ixgbe driver called itself ix, so it
>>>>>> isn't
>>>>>> obvious to me which we are talking about. (I know that Daniel
>>>>>> Braniss
>>>>>> had an ix0 and ix1, which were fixed for NFS by disabling TSO.)
>>>>>>
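
(Those two constants are in sys/dev/ixgbe/ixgbe.h; quoting the values
mentioned above, with my own comment on how they come into play:)

    #define IXGBE_82598_SCATTER    100
    #define IXGBE_82599_SCATTER    32
    /* The transmit path sizes its bus_dma_segment_t array from the
     * scatter value for the adapter, so a chain needing more segments
     * than that makes bus_dmamap_load_mbuf_sg() fail with EFBIG. */
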
>>>>>
>>>>> It's ix(4). ixgbe(4) is a different driver.
>>>>>
>>>>>> I'll admit I mostly looked at virtio's network driver, since
>>>>>> that
>>>>>> was the one being used by J David.
>>>>>>
>>>>>> Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have
>>>>>> been
>>>>>> cropping up for quite a while, and I am just trying to find out
>>>>>> why.
>>>>>> (I have no hardware/software that exhibits the problem, so I can
>>>>>> only look at the sources and ask others to try testing stuff.)
>>>>>>
>>>>>>> I'm not sure whether you're referring to ixgbe(4) or ix(4), but I
>>>>>>> see the total length of all segments in ix(4) is limited to 65535,
>>>>>>> so it has no room for the ethernet/VLAN header of the mbuf chain.
>>>>>>> The driver should be fixed so it can transmit a full 64KB datagram.
>>>>>> Well, if_hw_tsomax is set to 65535 by the generic code (the driver
>>>>>> doesn't set it), and the code in tcp_output() seems to subtract the
>>>>>> size of a TCP/IP header from that before passing data to the driver,
>>>>>> so I think the mbuf chain passed to the driver will fit in one IP
>>>>>> datagram. (I'd assume all sorts of stuff would break for TSO-enabled
>>>>>> drivers if that wasn't the case?)
>>>>>
>>>>> I believe the generic code is doing the right thing. I'm under the
>>>>> impression that the non-working TSO indicates a bug in the driver.
>>>>> Some drivers didn't account for the additional ethernet/VLAN header,
>>>>> so the total size of the DMA segments exceeded 65535. I've attached a
>>>>> diff for ix(4). It wasn't tested at all, as I don't have hardware to
>>>>> test with.
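
(To spell out the arithmetic behind that -- this is only an illustration
of the problem, not the attached diff, and the macro names are mine:)

    /* tcp_output() keeps each TSO burst within the 65535-byte IP
     * datagram limit (payload + TCP/IP headers <= 65535).  The link
     * layer then prepends an ethernet (+ optional VLAN) header before
     * the driver DMA-maps the chain: */
    #define TSO_IP_MAXLEN   65535                       /* if_hw_tsomax default */
    #define LINK_HDR_MAX    (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN) /* 14 + 4 */
    /* So the frame handed to the driver can be up to 65535 + 18 = 65553
     * bytes, and a bus_dma tag whose maxsize is exactly 65535 comes up a
     * few bytes short unless the driver adds room for the link header. */
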
>>>>>
>>>>>>
>>>>>>> I think the use of m_defrag(9) in TSO is suboptimal. All
>>>>>>> TSO-capable controllers are able to handle multiple TX buffers, so
>>>>>>> drivers should use m_collapse(9) rather than copying the entire
>>>>>>> chain with m_defrag(9).
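
(For reference, the two KPIs being compared -- the prototypes are from
mbuf(9); the comments are my own summary:)

    /* m_defrag() copies the entire chain into as few mbufs/clusters as
     * it can; m_collapse() only tries to squeeze the chain down to at
     * most 'maxfrags' mbufs, typically copying less data: */
    struct mbuf *m_defrag(struct mbuf *m0, int how);
    struct mbuf *m_collapse(struct mbuf *m0, int how, int maxfrags);

    /* e.g. with a 32-entry scatter table (illustrative): */
    m = m_collapse(m, M_NOWAIT, 32);
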
>>>>>>>
>>>>>> I haven't looked at these closely yet (plan on doing so to-day),
>>>>>> but
>>>>>> even m_collapse() looked like it copied data between mbufs and
>>>>>> that
>>>>>> is certainly suboptimal, imho. I don't see why a driver can't split
>>>>>> the mbuf list when there are too many entries for the scatter/gather
>>>>>> table and transmit it in two iterations (much like tcp_output()
>>>>>> already does when the data length exceeds 65535 - TCP/IP header
>>>>>> size).
>>>>>>
>>>>>
>>>>> It can split the mbuf list if the controller supports an increased
>>>>> number of TX buffers. But because the controller consumes as many DMA
>>>>> descriptors as there are TX buffers in the mbuf list, drivers tend to
>>>>> impose a limit on the number of TX buffers to save resources.
>>>>>
>>>>>> However, at this point, I just want to find out if the long
>>>>>> chain
>>>>>> of mbufs is why TSO is problematic for these drivers, since I'll
>>>>>> admit I'm getting tired of telling people to disable TSO (and I
>>>>>> suspect some don't believe me and never try it).
>>>>>>
>>>>>
>>>>> TSO-capable controllers tend to have various limitations (the first
>>>>> TX buffer should contain the complete ethernet/IP/TCP headers, the
>>>>> ip_len field of the IP header should be reset to 0, the TCP pseudo
>>>>> checksum should be recomputed, etc.), and cheap controllers need more
>>>>> assistance from the driver to let their firmware know the various
>>>>> IP/TCP header offset locations in the mbuf. Because this requires
>>>>> IP/TCP header parsing, it's error prone and very complex.
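
(The kind of header fix-up being described looks something like the
following in several in-tree drivers; this is a generic sketch, not code
from any particular driver, and 'm' / 'ehdrlen' are assumed to be the
mbuf chain and link-header length already known to the caller:)

    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <netinet/tcp.h>
    #include <machine/in_cksum.h>

    struct ip *ip = (struct ip *)(mtod(m, char *) + ehdrlen);
    struct tcphdr *th = (struct tcphdr *)((char *)ip + (ip->ip_hl << 2));

    /* The hardware fills in per-segment lengths, so ip_len is cleared,
     * and th_sum is seeded with the pseudo-header checksum (no length). */
    ip->ip_len = 0;
    th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr,
        htons(IPPROTO_TCP));
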
>>>>>
>>>>>>>> Anyhow, I have attached a patch that makes NFS use
>>>>>>>> MJUMPAGESIZE
>>>>>>>> clusters,
>>>>>>>> so the mbuf count drops from 34 to 18.
>>>>>>>>
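
(The count works out as expected -- again just illustrative arithmetic,
plus an example of how page-size clusters are obtained, not the patch
itself:)

    /* 2K clusters:        65536 / MCLBYTES (2048)     = 32 (+ ~2 header mbufs) = 34 */
    /* page-size clusters: 65536 / MJUMPAGESIZE (4096) = 16 (+ ~2 header mbufs) = 18 */
    /* Page-size clusters are allocated with m_getjcl(9), e.g.: */
    m = m_getjcl(M_WAITOK, MT_DATA, M_PKTHDR, MJUMPAGESIZE);
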
>>>>>>>
>>>>>>> Could we make it conditional on size?
>>>>>>>
>>>>>> Not sure what you mean? If you mean "the size of the
>>>>>> read/write",
>>>>>> that would be possible for NFSv3, but less so for NFSv4. (The
>>>>>> read/write
>>>>>> is just one Op. in the compound for NFSv4 and there is no way to
>>>>>> predict how much more data is going to be generated by
>>>>>> subsequent
>>>>>> Ops.)
>>>>>>
>>>>>
>>>>> Sorry, I should have been clearer. You already answered my
>>>>> question. Thanks.
>>>>>
>>>>>> If by "size" you mean the amount of memory in the machine then, yes,
>>>>>> it
>>>>>> certainly could be conditional on that. (I plan to try and look
>>>>>> at
>>>>>> the allocator to-day as well, but if others know of
>>>>>> disadvantages
>>>>>> with
>>>>>> using MJUMPAGESIZE instead of MCLBYTES, please speak up.)
>>>>>>
>>>>>> Garrett Wollman already alluded to the MCLBYTES case being
>>>>>> pre-allocated,
>>>>>> but I'll admit I have no idea what the implications of that are
>>>>>> at this
>>>>>> time.
>>>>>>
>>>>>>>> If anyone has a TSO scatter/gather enabled net interface and
>>>>>>>> can
>>>>>>>> test this
>>>>>>>> patch on it with NFS I/O (default of 64K rsize/wsize) when
>>>>>>>> TSO is
>>>>>>>> enabled
>>>>>>>> and see what effect it has, that would be appreciated.
>>>>>>>>
>>>>>>>> Btw, thanks go to Garrett Wollman for suggesting the change
>>>>>>>> to
>>>>>>>> MJUMPAGESIZE
>>>>>>>> clusters.
>>>>>>>>
>>>>>>>> rick
>>>>>>>> ps: If the attachment doesn't make it through and you want
>>>>>>>> the
>>>>>>>> patch, just
>>>>>>>> email me and I'll send you a copy.
>>>>>>>>
>>>>>
>>