Major performance hit with ToS setting
Lawrence Stewart
lstewart at freebsd.org
Fri Jun 1 02:16:12 UTC 2012
On 05/31/12 13:33, Kevin Oberman wrote:
> On Fri, May 25, 2012 at 6:27 AM, Andrew Gallatin <gallatin at cs.duke.edu> wrote:
>> On 05/24/12 18:55, Kevin Oberman wrote:
>>
>>>
>>> This is, of course, on a 10G interface. On 7.3 there is little
>>
>>
>> Hi Kevin,
>>
>>
>> What you're seeing looks almost like a checksum is bad, or
>> there is some other packet damage. Do you see any
>> error counters increasing if you run netstat -s before
>> and after the test & compare the results?
>>
>> Thinking that, perhaps, this was a bug in my mxge(4), I attempted
>> to reproduce it this morning between 8.3 and 9.0 boxes and
>> failed to see the bad behavior..
>>
>> % nuttcp-6.1.2 -c32t -t diablo1-m < /dev/zero
>> 9161.7500 MB / 10.21 sec = 7526.5792 Mbps 53 %TX 97 %RX 0 host-retrans
>> 0.11 msRTT
>> % nuttcp-6.1.2 -t diablo1-m < /dev/zero
>> 9140.6180 MB / 10.21 sec = 7509.8270 Mbps 53 %TX 97 %RX 0 host-retrans
>> 0.11 msRTT
>>
>>
>> However, I don't have any 8.2-r box handy, so I cannot
>> exactly repro your experiment...
>
> Drew and Bjorn,
>
> At this point the flying fickle finger of fate (oops, just dated
> myself) is pointing to a bug in the CUBIC congestion control, which we
> run. But it's really weird in several ways.
I can't find any evidence (yet) of cubic being responsible for the
problems you're seeing.
> I built another system from the same source files and it works fine,
> unlike all of the existing systems. I need to confirm that all systems
> have identical hardware including the Myricom firmware. I suspect some
> edge case is biting only in unusual cases.
>
> I used SIFTR at the suggestion of Lawrence Stewart, who headed the
> project to bring pluggable congestion control algorithms to FreeBSD, and
> found really odd congestion behavior. First, I do see a triple ACK, but
> the congestion window suddenly drops from 73K to 8K. If I understand
> CUBIC, it should halve the congestion window, which is not what is
> happening. It then increases slowly (in slow start) to 82K. While the
> slow-start bytes are INCREASING, the congestion window again goes to 8K
> while the SS size moves from 36K up to 52K. It just continues to bound
> wildly between 8K (always the low point) and somewhere between 64K and
> 82K. The swings start at 83K and, over the first few seconds, the peaks
> drop to about 64K.
>
> I am trying to think of any way that anything other than the CC
> algorithm could do this, but so far have not. I will try
> installing Hamilton and see how it works. On the other hand, how could
> changing the ToS bits trigger this behavior?
As mentioned to you privately but I'll repeat here for the record, HD
and CHD are not good algorithms to compare cubic's behaviour against,
because they're delay-based and will reduce cwnd in response to
observing changes in RTT, i.e. their cwnd plots will look jumpy even
though no losses are occurring. Comparing cubic against newreno or htcp
would be more appropriate.
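For what it's worth, switching algorithms for such a comparison is
straightforward with the modular CC framework: "sysctl
net.inet.tcp.cc.available" lists what's currently loaded, and something
along the lines of "kldload cc_htcp" followed by "sysctl
net.inet.tcp.cc.algorithm=htcp" (or "=newreno", which is there by
default) should change the default algorithm used by new connections.
Treat the exact module name as a sketch and check what your tree
actually ships.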
> I have sent all of my data to Lawrence Stewart and I expect to hear
> from him soon, but I'd appreciate it if you can provide any other ideas
> on what could cause this.
Ok, so here are some thoughts from looking at the siftr data you
forwarded me privately:
- Column 11 shows the snd_wnd for the flow, which mirrors the rcv_wnd
advertised by the receiver. It starts at 65k (default for
net.inet.tcp.recvspace) and fairly early on jumps to above 1MB. This
implies socket buffer autotuning on the receiver is working correctly
and that your connection is not rcv_wnd limited - good on this front.
- Column 16 shows the MSS. It's 8k, so you're using jumbo frames. Jumbo
frames are notorious for triggering interesting bugs at lower layers in
the stack. Are you keeping track of your jumbo cluster usage via "netstat
-m"? Can you trigger the bug if you switch to a 1500 byte MTU?
- Column 17 shows the smoothed RTT estimate for the flow, in weird
units. You chopped off the siftr enable log line which reports the data
necessary to correctly decode the SRTT value, but assuming you've left
HZ at 1000 and TCP_RTT_SCALE is the default of 32, the kernel thinks the
receiver is ~56ms away, and the delay fluctuates up to ~70ms, most
likely attributable to a build-up of queueing delays along the path.
(There's a quick conversion sketch for this value below.)
- Column 19 shows the TCP control block flags for the connection. If you
AND the number in this field with flags from <netinet/tcp_var.h>, you
can figure out if a particular flag is set. It's also easy to paste the
number into a calculator and switch to hex or binary mode to figure out
which bits are set. 'cat bwctl.siftr | cut -f19 -d "," | less' will let
you easily see the range of flag combinations. Most of the interesting
flags are in the higher order bits. The value 16778208 early on
indicates TF_TSO is set. The value 554697696 which appears periodically
in the trace indicates TF_CONGRECOVERY|TF_FASTRECOVERY i.e. 3 dupacks
triggered a fast recovery episode. 1092625377 indicates
TF_WASCRECOVERY|TF_WASFRECOVERY i.e. an RTO happened. Value 1630544864
is interesting, as it indicates
TF_WASCRECOVERY|TF_CONGRECOVERY|TF_WASFRECOVERY|TF_FASTRECOVERY, which I
think is a minor bug (the TF_WAS flags should have been cleared), though
it shouldn't affect your connection. (There's a small flag-decoding
sketch below as well.)
- Columns 21 and 22 show the size and occupancy of the socket send
buffer respectively. The size grows showing autotuning is kicking in,
and the occupancy is kept high which means the app is ensuring the send
pipeline doesn't stall - good on this front.
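As an aside on the column 17 decoding: here's a trivial userland sketch
of the conversion, assuming the kern.hz=1000 and TCP_RTT_SCALE=32
defaults mentioned above (the raw value 1792 is just a hypothetical
column 17 sample, not taken from your trace):

    /* srtt_decode.c: convert a raw SIFTR SRTT sample to milliseconds. */
    #include <stdio.h>

    #define ASSUMED_HZ      1000    /* assumed kern.hz; the siftr enable log line records the real value */
    #define TCP_RTT_SCALE   32      /* default fixed-point scale from <netinet/tcp_var.h> */

    int
    main(void)
    {
            int raw = 1792;         /* hypothetical column 17 value */
            double ms = ((double)raw / TCP_RTT_SCALE) * (1000.0 / ASSUMED_HZ);

            printf("srtt = %.1f ms\n", ms);     /* 1792 / 32 = 56 ticks = 56.0 ms at hz=1000 */
            return (0);
    }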
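And for the column 19 flags, a throwaway decoder along these lines
saves the calculator round trip. The defines are what I believe the
relevant <netinet/tcp_var.h> values to be; verify them against your own
tree before trusting the output:

    /* tf_decode.c: print the recovery-related TCP control block flags. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Believed to match <netinet/tcp_var.h>; check your tree. */
    #define TF_FASTRECOVERY 0x00100000
    #define TF_WASFRECOVERY 0x00200000
    #define TF_TSO          0x01000000
    #define TF_CONGRECOVERY 0x20000000
    #define TF_WASCRECOVERY 0x40000000

    int
    main(int argc, char **argv)
    {
            /* Default to one of the values discussed above. */
            unsigned long flags = (argc > 1) ? strtoul(argv[1], NULL, 0) : 554697696UL;

            printf("0x%08lx:%s%s%s%s%s\n", flags,
                (flags & TF_TSO) ? " TSO" : "",
                (flags & TF_FASTRECOVERY) ? " FASTRECOVERY" : "",
                (flags & TF_WASFRECOVERY) ? " WASFRECOVERY" : "",
                (flags & TF_CONGRECOVERY) ? " CONGRECOVERY" : "",
                (flags & TF_WASCRECOVERY) ? " WASCRECOVERY" : "");
            return (0);
    }

Fed the values from your trace it should agree with the decoding above,
e.g. 554697696 comes out as TSO FASTRECOVERY CONGRECOVERY.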
Without pcap trace data it's hard to obtain further insight into what
exactly is going on, but it would appear you are suffering from frequent
packet loss, quite possibly internal to the stack on the sender. More
digging required, and I would suggest you first look at trying to
reproduce with a 1500 byte MTU.
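If you can arrange it, even a headers-only capture on the sending host
during a test run would do - something along the lines of "tcpdump -s
128 -w bwctl.pcap -i <iface> host <receiver>", with the interface and
receiver names filled in for your setup - as that would show whether the
dupack episodes correspond to genuine loss on the wire or to something
like reordering.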
Cheers,
Lawrence