Major performance hit with ToS setting
Lawrence Stewart
lstewart at freebsd.org
Fri Jun 1 02:16:12 UTC 2012
On 05/31/12 13:33, Kevin Oberman wrote:
> On Fri, May 25, 2012 at 6:27 AM, Andrew Gallatin <gallatin at cs.duke.edu> wrote:
>> On 05/24/12 18:55, Kevin Oberman wrote:
>>
>>>
>>> This is, of course, on a 10G interface. On 7.3 there is little
>>
>>
>> Hi Kevin,
>>
>>
>> What you're seeing looks almost like a checksum is bad, or
>> there is some other packet damage. Do you see any
>> error counters increasing if you run netstat -s before
>> and after the test & compare the results?
>>
>> Thinking that, perhaps, this was a bug in my mxge(4), I attempted
>> to reproduce it this morning between 8.3 and 9.0 boxes and
>> failed to see the bad behavior..
>>
>> % nuttcp-6.1.2 -c32t -t diablo1-m < /dev/zero
>> 9161.7500 MB / 10.21 sec = 7526.5792 Mbps 53 %TX 97 %RX 0 host-retrans
>> 0.11 msRTT
>> % nuttcp-6.1.2 -t diablo1-m < /dev/zero
>> 9140.6180 MB / 10.21 sec = 7509.8270 Mbps 53 %TX 97 %RX 0 host-retrans
>> 0.11 msRTT
>>
>>
>> However, I don't have any 8.2-r box handy, so I cannot
>> exactly repro your experiment...
>
> Drew and Bjorn,
>
> At this point the flying fickle finger of fate (oops, just dated
> myself) is pointing to a bug in the CUBIC congestion control, which we
> run. But it's really weird in several ways.
I can't find any evidence (yet) of cubic being responsible for the
problems you're seeing.
> I built another system from the same source files and it works fine,
> unlike all of the existing systems. I need to confirm that all systems
> have identical hardware including the Myricom firmware. I suspect some
> edge case is biting only in unusual cases.
>
> I used SIFTR at the suggestion of Lawrence Stewart, who headed the
> project to bring pluggable congestion control algorithms to FreeBSD, and
> found really odd congestion behavior. First, I do see a triple ACK, but
> the congestion window suddenly drops from 73K to 8K. If I understand
> CUBIC, it should halve the congestion window, which is not what is
> happening. It then increases slowly (in slow start) to 82K. While the
> slow-start bytes are INCREASING, the congestion window again goes to 8K
> while the SS size moves from 36K up to 52K. It just continues to bound
> wildly between 8K (always the low point) and somewhere between 64K and
> 82K. The swings start at 83K and, over the first few seconds, the peaks
> drop to about 64K.
>
> I am trying to think of any way that anything other than the CC
> algorithm could do this, but so far have not. I will try
> installing Hamilton and see how it works. On the other hand, how could
> changing the ToS bits trigger this behavior?
As mentioned to you privately but I'll repeat here for the record, HD
and CHD are not good algorithms to compare cubic's behaviour against,
because they're delay-based and will reduce cwnd in response to
observing changes in RTT, i.e. their cwnd plots will look jumpy even
though no losses are occurring. Comparing cubic against newreno or htcp
would be more appropriate.
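For what it's worth, switching algorithms for such a comparison is
straightforward with the modular CC framework: "sysctl
net.inet.tcp.cc.available" lists what's currently loaded, and something
along the lines of "kldload cc_htcp" followed by "sysctl
net.inet.tcp.cc.algorithm=htcp" (or "=newreno", which is there by
default) should change the default algorithm used by new connections.
Treat the exact module name as a sketch and check what your tree
actually ships.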
> I have sent all of my data to Lawrence Stewart and I expect to hear
> from him soon, but I'd appreciate it if you can provide any other ideas
> on what could cause this.
Ok, so here are some thoughts from looking at the siftr data you
forwarded me privately:
- Column 11 shows the snd_wnd for the flow, which mirrors the rcv_wnd
advertised by the receiver. It starts at 65k (default for
net.inet.tcp.recvspace) and fairly early on jumps to above 1MB. This
implies socket buffer autotuning on the receiver is working correctly
and that your connection is not rcv_wnd limited - good on this front.
- Column 16 shows the MSS. It's 8k, so you're using jumbo frames. Jumbo
frames are notorious for triggering interesting bugs at lower layers in
the stack. Are you keeping track of your jumbo cluster usage via "netstat
-m"? Can you trigger the bug if you switch to a 1500 byte MTU?
- Column 17 shows the smoothed RTT estimate for the flow, in weird
units. You chopped off the siftr enable log line which reports the data
necessary to correctly decode the SRTT value, but assuming you've left
HZ at 1000 and TCP_RTT_SCALE is the default of 32, the kernel thinks the
receiver is ~56ms away, and the delay fluctuates up to ~70ms, most
likely attributable to a build-up of queueing delays along the path.
(There's a quick conversion sketch for this value below.)
- Column 19 shows the TCP control block flags for the connection. If you
AND the number in this field with flags from <netinet/tcp_var.h>, you
can figure out if a particular flag is set. It's also easy to paste the
number into a calculator and switch to hex or binary mode to figure out
which bits are set. 'cat bwctl.siftr | cut -f19 -d "," | less' will let
you easily see the range of flag combinations. Most of the interesting
flags are in the higher order bits. The value 16778208 early on
indicates TF_TSO is set. The value 554697696 which appears periodically
in the trace indicates TF_CONGRECOVERY|TF_FASTRECOVERY i.e. 3 dupacks
triggered a fast recovery episode. 1092625377 indicates
TF_WASCRECOVERY|TF_WASFRECOVERY i.e. an RTO happened. Value 1630544864
is interesting, as it indicates
TF_WASCRECOVERY|TF_CONGRECOVERY|TF_WASFRECOVERY|TF_FASTRECOVERY, which I
think is a minor bug (the TF_WAS flags should have been cleared), though
it shouldn't affect your connection. (There's a small flag-decoding
sketch below as well.)
- Columns 21 and 22 show the size and occupancy of the socket send
buffer respectively. The size grows showing autotuning is kicking in,
and the occupancy is kept high which means the app is ensuring the send
pipeline doesn't stall - good on this front.
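As an aside on the column 17 decoding: here's a trivial userland sketch
of the conversion, assuming the kern.hz=1000 and TCP_RTT_SCALE=32
defaults mentioned above (the raw value 1792 is just a hypothetical
column 17 sample, not taken from your trace):

    /* srtt_decode.c: convert a raw SIFTR SRTT sample to milliseconds. */
    #include <stdio.h>

    #define ASSUMED_HZ      1000    /* assumed kern.hz; the siftr enable log line records the real value */
    #define TCP_RTT_SCALE   32      /* default fixed-point scale from <netinet/tcp_var.h> */

    int
    main(void)
    {
            int raw = 1792;         /* hypothetical column 17 value */
            double ms = ((double)raw / TCP_RTT_SCALE) * (1000.0 / ASSUMED_HZ);

            printf("srtt = %.1f ms\n", ms);     /* 1792 / 32 = 56 ticks = 56.0 ms at hz=1000 */
            return (0);
    }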
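And for the column 19 flags, a throwaway decoder along these lines
saves the calculator round trip. The defines are what I believe the
relevant <netinet/tcp_var.h> values to be; verify them against your own
tree before trusting the output:

    /* tf_decode.c: print the recovery-related TCP control block flags. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Believed to match <netinet/tcp_var.h>; check your tree. */
    #define TF_FASTRECOVERY 0x00100000
    #define TF_WASFRECOVERY 0x00200000
    #define TF_TSO          0x01000000
    #define TF_CONGRECOVERY 0x20000000
    #define TF_WASCRECOVERY 0x40000000

    int
    main(int argc, char **argv)
    {
            /* Default to one of the values discussed above. */
            unsigned long flags = (argc > 1) ? strtoul(argv[1], NULL, 0) : 554697696UL;

            printf("0x%08lx:%s%s%s%s%s\n", flags,
                (flags & TF_TSO) ? " TSO" : "",
                (flags & TF_FASTRECOVERY) ? " FASTRECOVERY" : "",
                (flags & TF_WASFRECOVERY) ? " WASFRECOVERY" : "",
                (flags & TF_CONGRECOVERY) ? " CONGRECOVERY" : "",
                (flags & TF_WASCRECOVERY) ? " WASCRECOVERY" : "");
            return (0);
    }

Fed the values from your trace it should agree with the decoding above,
e.g. 554697696 comes out as TSO FASTRECOVERY CONGRECOVERY.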
Without pcap trace data it's hard to obtain further insight into what
exactly is going on, but it would appear you are suffering from frequent
packet loss, quite possibly internal to the stack on the sender. More
digging required, and I would suggest you first look at trying to
reproduce with a 1500 byte MTU.
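If you can arrange it, even a headers-only capture on the sending host
during a test run would do - something along the lines of "tcpdump -s
128 -w bwctl.pcap -i <iface> host <receiver>", with the interface and
receiver names filled in for your setup - as that would show whether the
dupack episodes correspond to genuine loss on the wire or to something
like reordering.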
Cheers,
Lawrence