tuning routing using cxgbe and T580-CR cards?
Navdeep Parhar
nparhar at gmail.com
Mon Jul 14 19:03:56 UTC 2014
Use UDP if you want more control over your experiments.
- It's easier to directly control the frame size on the wire. No TSO,
LRO, segmentation to worry about.
- UDP has no flow control so the transmitters will not let up even if a
frame goes missing. TCP will go into recovery. Lack of protocol
level flow control also means the transmitters cannot be influenced by
the receivers in any way.
- Frames go only in the direction you want them to. With TCP you have
the receiver transmitting all the time too (ACKs).
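
For example, a quick UDP run with iperf3 might look something like this
(the flags and numbers are only illustrative; adjust the datagram size,
rate, and stream count to whatever you're trying to measure):

  # on each receiver
  iperf3 -s

  # on each sender: 4 UDP streams, ~1400-byte datagrams, unthrottled
  iperf3 -c <server> -u -b 0 -l 1400 -P 4 -t 60
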
Regards,
Navdeep
On 07/14/14 07:57, John Jasem wrote:
> The two physical CPUs are:
> Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz (2400.05-MHz K8-class CPU)
>
> Hyperthreading, at least from initial appearances, seems to offer no
> benefits or drawbacks.
>
> I tested with iperf3, using a packet generator on each subnet, each
> sending 4 streams to a server on another subnet.
>
> Maximum segment sizes of 128 and 1460 were used (iperf3 -M), with little
> variance in the results.
>
> A snapshot of netstat -d -b -w1 -W -h is included. Midway through, the
> numbers dropped. This coincides with when I launched 16 more streams:
> 4 new clients and 4 new servers on different nets, 4 streams each.
>
> input (Total) output
> packets errs idrops bytes packets errs bytes colls drops
> 1.6M 0 514 254M 1.6M 0 252M 0 5
> 1.6M 0 294 244M 1.6M 0 246M 0 6
> 1.6M 0 95 255M 1.5M 0 236M 0 6
> 1.4M 0 0 216M 1.5M 0 224M 0 3
> 1.5M 0 0 225M 1.4M 0 219M 0 4
> 1.4M 0 389 214M 1.4M 0 216M 0 1
> 1.4M 0 270 207M 1.4M 0 207M 0 1
> 1.4M 0 279 210M 1.4M 0 209M 0 2
> 1.4M 0 12 207M 1.3M 0 204M 0 1
> 1.4M 0 303 206M 1.4M 0 214M 0 2
> 1.3M 0 2.3K 190M 1.4M 0 212M 0 1
> 1.1M 0 1.1K 175M 1.1M 0 176M 0 1
> 1.1M 0 1.6K 176M 1.1M 0 175M 0 1
> 1.1M 0 830 176M 1.1M 0 174M 0 0
> 1.2M 0 1.5K 187M 1.2M 0 187M 0 0
> 1.2M 0 1.1K 183M 1.2M 0 184M 0 1
> 1.2M 0 1.5K 197M 1.2M 0 196M 0 2
> 1.3M 0 2.2K 199M 1.2M 0 196M 0 0
> 1.3M 0 2.8K 200M 1.3M 0 202M 0 4
> 1.3M 0 1.5K 199M 1.2M 0 198M 0 1
>
>
> vmstat output is also included. You can see similar drops in the
> faults columns.
>
>
> procs    memory      page                     disks    faults        cpu
> r b w   avm   fre  flt re pi po  fr  sr mf0 cd0     in  sy     cs us sy id
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 188799 224 387419 0 74 26
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 207447 150 425576 0 72 28
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 205638 202 421659 0 75 25
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 200292 150 411257 0 74 26
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 200338 197 411537 0 77 23
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 199289 156 409092 0 75 25
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 200504 200 411992 0 76 24
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 165042 152 341207 0 78 22
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 171360 200 353776 0 78 22
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 197557 150 405937 0 74 26
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 170696 204 353197 0 78 22
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 174927 150 361171 0 77 23
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 153836 200 319227 0 79 21
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 159056 150 329517 0 78 22
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 155240 200 321819 0 78 22
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 166422 156 344184 0 78 22
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 162065 200 335215 0 79 21
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 172857 150 356852 0 78 22
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 81267 197 176539 0 92 8
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 82151 150 177434 0 91 9
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 73904 204 160887 0 91 9
> 0 0 0 574M 15G 2 0 0 0 8 6 0 0 73820 150 161201 0 91 9
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 73926 196 161850 0 92 8
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 77215 150 166886 0 91 9
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 77509 198 169650 0 91 9
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 69993 156 154783 0 90 10
> 0 0 0 574M 15G 82 0 0 0 0 6 0 0 69722 199 153525 0 91 9
> 0 0 0 574M 15G 2 0 0 0 0 6 0 0 66353 150 147027 0 91 9
> 0 0 0 550M 15G 102 0 0 0 101 6 0 0 67906 259 149365 0 90 10
> 0 0 0 550M 15G 0 0 0 0 0 6 0 0 71837 125 157253 0 92 8
> 0 0 0 550M 15G 80 0 0 0 0 6 0 0 73508 179 161498 0 92 8
> 0 0 0 550M 15G 0 0 0 0 0 6 0 0 72673 125 159449 0 92 8
> 0 0 0 550M 15G 80 0 0 0 0 6 0 0 75630 175 164614 0 91 9
>
>
>
>
> On 07/11/2014 03:32 PM, Navdeep Parhar wrote:
>> On 07/11/14 10:28, John Jasem wrote:
>>> In testing two Chelsio T580-CR dual port cards with FreeBSD 10-STABLE,
>>> I've been able to use a collection of clients to generate approximately
>>> 1.5-1.6 million TCP packets per second sustained, and routinely hit
>>> 10GB/s, both measured by netstat -d -b -w1 -W (I usually use -h for the
>>> quick read, accepting the loss of granularity).
>> When forwarding, the pps rate is often more interesting, and almost
>> always the limiting factor, as compared to the total amount of data
>> being passed around. 10GB at this pps probably means 9000 MTU. Try
>> with 1500 too if possible.
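>> (If the ports are at 9000 now, something along the lines of
>> "ifconfig cxl0 mtu 1500" per port is enough for a quick comparison;
>> the interface name here is just an example.)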
>>
>> "netstat -d 1" and "vmstat 1" for a few seconds when your system is
>> under maximum load would be useful. And what kind of CPU is in this system?
>>
>>> While performance has so far been stellar, and I'm honestly speculating
>>> I will need more CPU depth and horsepower to get much faster, I'm
>>> curious if there is any gain to tweaking performance settings. I'm
>>> seeing, under multiple streams, with N targets connecting to N servers,
>>> interrupts on all CPUs peg at 99-100%, and I'm curious if tweaking
>>> configs will help, or it's a free clue to get more horsepower.
>>>
>>> So, far, except for temporarily turning off pflogd, and setting the
>>> following sysctl variables, I've not done any performance tuning on the
>>> system yet.
>>>
>>> /etc/sysctl.conf
>>> net.inet.ip.fastforwarding=1
>>> kern.random.sys.harvest.ethernet=0
>>> kern.random.sys.harvest.point_to_point=0
>>> kern.random.sys.harvest.interrupt=0
>>>
>>> a) One of the first things I did in prior testing was to turn
>>> hyperthreading off. I presume this is still prudent, as HT doesn't help
>>> with interrupt handling?
>> It is always worthwhile to try your workload with and without
>> hyperthreading.
>>
>>> b) I briefly experimented with using cpuset(1) to stick interrupts to
>>> physical CPUs, but it offered no performance enhancements, and indeed,
>>> appeared to decrease performance by 10-20%. Has anyone else tried this?
>>> What were your results?
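>>> (The general approach being something like "cpuset -l <cpu> -x <irq>"
>>> per queue interrupt, with the IRQ numbers taken from vmstat -i; the
>>> exact CPU/IRQ pairings here are placeholders.)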
>>>
>>> c) the defaults for the cxgbe driver appear to be 8 rx queues, and N tx
>>> queues, with N being the number of CPUs detected. For a system running
>>> multiple cards, routing or firewalling, does this make sense, or would
>>> balancing tx and rx be more ideal? And would reducing queues per card
>>> based on NUMBER-CPUS and NUM-CHELSIO-PORTS make sense at all?
>> The defaults are nrxq = min(8, ncores) and ntxq = min(16, ncores). The
>> man page mentions this. The reason for 8 vs. 16 is that tx queues are
>> "cheaper" as they don't have to be backed by rx buffers. It only needs
>> some memory for the tx descriptor ring and some hardware resources.
>>
>> It appears that your system has >= 16 cores. For forwarding it probably
>> makes sense to have nrxq = ntxq. If you're left with 8 or fewer cores
>> after disabling hyperthreading you'll automatically get equal rx and tx
>> queues. Otherwise you'll have to fiddle with the hw.cxgbe.nrxq10g and
>> ntxq10g tunables (documented in the man page).
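>> For example, in /boot/loader.conf (the tunable names are from cxgbe(4);
>> the values are only illustrative for a 16-core box):
>>
>> hw.cxgbe.nrxq10g="16"
>> hw.cxgbe.ntxq10g="16"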
>>
>>
>>> d) dev.cxl.$PORT.qsize_rxq: 1024 and dev.cxl.$PORT.qsize_txq: 1024.
>>> These appear to not be writeable when if_cxgbe is loaded, so I speculate
>>> they are not to be messed with, or are loader.conf variables? Is there
>>> any benefit to messing with them?
>> Can't change them after the port has been administratively brought up
>> even once. This is mentioned in the man page. I don't really recommend
>> changing them anyway.
>>
>>> e) dev.t5nex.$CARD.toe.sndbuf: 262144. These are writeable, but messing
>>> with values did not yield an immediate benefit. Am I barking up the
>>> wrong tree by trying?
>> The TOE tunables won't make a difference unless you have enabled TOE,
>> the TCP endpoints lie on the system, and the connections are being
>> handled by the TOE on the chip. This is not the case on your systems.
>> The driver does not enable TOE by default and the only way to use it is
>> to switch it on explicitly. There is no possibility that you're using
>> it without knowing that you are.
>>
>>> f) based on prior experiments with other vendors, I tried tweaks to
>>> net.isr.* settings, but did not see any benefits worth discussing. Am I
>>> correct in this speculation, based on others' experience?
>>>
>>> g) Are there other settings I should be looking at, that may squeeze out
>>> a few more packets?
>> The pps rates that you've observed are at least an order of magnitude
>> below the chip's hardware limits. Tuning the kernel rather than the
>> driver may be the best bang for your buck.
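>> If you do revisit the net.isr knobs, the usual candidates are the
>> net.isr.maxthreads and net.isr.bindthreads loader tunables and the
>> net.isr.dispatch sysctl, e.g. in /boot/loader.conf (values shown are
>> just an example, not a recommendation):
>>
>> net.isr.maxthreads="8"
>> net.isr.bindthreads="1"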
>>
>> Regards,
>> Navdeep
>>
>>> Thanks in advance!
>>>
>>> -- John Jasen (jjasen at gmail.com)
>