tuning routing using cxgbe and T580-CR cards?
John Jasem
jjasen at gmail.com
Mon Jul 14 14:57:39 UTC 2014
The two physical CPUs are:
Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz (2400.05-MHz K8-class CPU)
Hyperthreading, at least from initial appearances, seems to make no
measurable difference either way.
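For what it's worth, I've been toggling hyperthreading from the loader
rather than the BIOS, so the with/without comparison is quick to repeat.
If I'm reading smp(4) correctly, something like this in /boot/loader.conf
keeps the logical cores offline at boot (shown here set to the "off"
case):

machdep.hyperthreading_allowed="0"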
I tested with iperf3, using a packet generator on each subnet, each
sending 4 streams to a server on another subnet. Maximum segment sizes
of 128 and 1460 were used (via iperf3 -M), with little variance between
the two.
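For reference, the client invocations were along these lines; the
addresses and duration here are placeholders rather than the exact
values I used:

iperf3 -c 10.0.1.10 -P 4 -M 1460 -t 300   # 4 streams, MSS 1460
iperf3 -c 10.0.2.10 -P 4 -M 128 -t 300    # 4 streams, MSS 128

with a plain "iperf3 -s" listening on each server.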
A snapshot of netstat -d -b -w1 -W -h is included below. Midway through,
the numbers dropped; this coincides with my launching 16 more streams
(4 new clients and 4 new servers on different nets, 4 streams each).
input (Total) output
packets errs idrops bytes packets errs bytes colls drops
1.6M 0 514 254M 1.6M 0 252M 0 5
1.6M 0 294 244M 1.6M 0 246M 0 6
1.6M 0 95 255M 1.5M 0 236M 0 6
1.4M 0 0 216M 1.5M 0 224M 0 3
1.5M 0 0 225M 1.4M 0 219M 0 4
1.4M 0 389 214M 1.4M 0 216M 0 1
1.4M 0 270 207M 1.4M 0 207M 0 1
1.4M 0 279 210M 1.4M 0 209M 0 2
1.4M 0 12 207M 1.3M 0 204M 0 1
1.4M 0 303 206M 1.4M 0 214M 0 2
1.3M 0 2.3K 190M 1.4M 0 212M 0 1
1.1M 0 1.1K 175M 1.1M 0 176M 0 1
1.1M 0 1.6K 176M 1.1M 0 175M 0 1
1.1M 0 830 176M 1.1M 0 174M 0 0
1.2M 0 1.5K 187M 1.2M 0 187M 0 0
1.2M 0 1.1K 183M 1.2M 0 184M 0 1
1.2M 0 1.5K 197M 1.2M 0 196M 0 2
1.3M 0 2.2K 199M 1.2M 0 196M 0 0
1.3M 0 2.8K 200M 1.3M 0 202M 0 4
1.3M 0 1.5K 199M 1.2M 0 198M 0 1
vmstat output is also included; you can see a similar drop in the faults
columns (interrupts and context switches).
procs memory page disks faults cpu
r b w avm fre flt re pi po fr sr mf0 cd0 in sy cs us sy id
0 0 0 574M 15G 82 0 0 0 0 6 0 0 188799 224 387419 0 74 26
0 0 0 574M 15G 2 0 0 0 0 6 0 0 207447 150 425576 0 72 28
0 0 0 574M 15G 82 0 0 0 0 6 0 0 205638 202 421659 0 75 25
0 0 0 574M 15G 2 0 0 0 0 6 0 0 200292 150 411257 0 74 26
0 0 0 574M 15G 82 0 0 0 0 6 0 0 200338 197 411537 0 77 23
0 0 0 574M 15G 2 0 0 0 0 6 0 0 199289 156 409092 0 75 25
0 0 0 574M 15G 82 0 0 0 0 6 0 0 200504 200 411992 0 76 24
0 0 0 574M 15G 2 0 0 0 0 6 0 0 165042 152 341207 0 78 22
0 0 0 574M 15G 82 0 0 0 0 6 0 0 171360 200 353776 0 78 22
0 0 0 574M 15G 2 0 0 0 0 6 0 0 197557 150 405937 0 74 26
0 0 0 574M 15G 82 0 0 0 0 6 0 0 170696 204 353197 0 78 22
0 0 0 574M 15G 2 0 0 0 0 6 0 0 174927 150 361171 0 77 23
0 0 0 574M 15G 82 0 0 0 0 6 0 0 153836 200 319227 0 79 21
0 0 0 574M 15G 2 0 0 0 0 6 0 0 159056 150 329517 0 78 22
0 0 0 574M 15G 82 0 0 0 0 6 0 0 155240 200 321819 0 78 22
0 0 0 574M 15G 2 0 0 0 0 6 0 0 166422 156 344184 0 78 22
0 0 0 574M 15G 82 0 0 0 0 6 0 0 162065 200 335215 0 79 21
0 0 0 574M 15G 2 0 0 0 0 6 0 0 172857 150 356852 0 78 22
0 0 0 574M 15G 82 0 0 0 0 6 0 0 81267 197 176539 0 92 8
0 0 0 574M 15G 2 0 0 0 0 6 0 0 82151 150 177434 0 91 9
0 0 0 574M 15G 82 0 0 0 0 6 0 0 73904 204 160887 0 91 9
0 0 0 574M 15G 2 0 0 0 8 6 0 0 73820 150 161201 0 91 9
0 0 0 574M 15G 82 0 0 0 0 6 0 0 73926 196 161850 0 92 8
0 0 0 574M 15G 2 0 0 0 0 6 0 0 77215 150 166886 0 91 9
0 0 0 574M 15G 82 0 0 0 0 6 0 0 77509 198 169650 0 91 9
0 0 0 574M 15G 2 0 0 0 0 6 0 0 69993 156 154783 0 90 10
0 0 0 574M 15G 82 0 0 0 0 6 0 0 69722 199 153525 0 91 9
0 0 0 574M 15G 2 0 0 0 0 6 0 0 66353 150 147027 0 91 9
0 0 0 550M 15G 102 0 0 0 101 6 0 0 67906 259 149365 0 90 10
0 0 0 550M 15G 0 0 0 0 0 6 0 0 71837 125 157253 0 92 8
0 0 0 550M 15G 80 0 0 0 0 6 0 0 73508 179 161498 0 92 8
0 0 0 550M 15G 0 0 0 0 0 6 0 0 72673 125 159449 0 92 8
0 0 0 550M 15G 80 0 0 0 0 6 0 0 75630 175 164614 0 91 9
On 07/11/2014 03:32 PM, Navdeep Parhar wrote:
> On 07/11/14 10:28, John Jasem wrote:
>> In testing two Chelsio T580-CR dual port cards with FreeBSD 10-STABLE,
>> I've been able to use a collection of clients to generate approximately
>> 1.5-1.6 million TCP packets per second sustained, and routinely hit
>> 10GB/s, both measured by netstat -d -b -w1 -W (I usually use -h for the
>> quick read, accepting the loss of granularity).
> When forwarding, the pps rate is often more interesting, and almost
> always the limiting factor, as compared to the total amount of data
> being passed around. 10GB at this pps probably means 9000 MTU. Try
> with 1500 too if possible.
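For the 1500 MTU run, my plan is simply to drop the MTU on each Chelsio
port (and on the clients) before repeating the tests; the interface
names below are the cxl ports on this box, assuming I set it by hand
rather than via rc.conf:

ifconfig cxl0 mtu 1500
ifconfig cxl1 mtu 1500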
>
> "netstat -d 1" and "vmstat 1" for a few seconds when your system is
> under maximum load would be useful. And what kind of CPU is in this system?
>
>> While performance has so far been stellar, and I'm honestly speculating
>> I will need more CPU depth and horsepower to get much faster, I'm
>> curious if there is any gain to tweaking performance settings. I'm
>> seeing, under multiple streams, with N targets connecting to N servers,
>> interrupts on all CPUs peg at 99-100%, and I'm curious if tweaking
>> configs will help, or whether it's a free clue to get more horsepower.
>>
>> So far, except for temporarily turning off pflogd and setting the
>> following sysctl variables, I've not done any performance tuning on the
>> system yet.
>>
>> /etc/sysctl.conf
>> net.inet.ip.fastforwarding=1
>> kern.random.sys.harvest.ethernet=0
>> kern.random.sys.harvest.point_to_point=0
>> kern.random.sys.harvest.interrupt=0
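Side note: while experimenting, these can also be flipped at runtime
with sysctl(8) before being made permanent in /etc/sysctl.conf, for
example:

sysctl net.inet.ip.fastforwarding=1
sysctl kern.random.sys.harvest.ethernet=0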
>>
>> a) One of the first things I did in prior testing was to turn
>> hyperthreading off. I presume this is still prudent, as HT doesn't help
>> with interrupt handling?
> It is always worthwhile to try your workload with and without
> hyperthreading.
>
>> b) I briefly experimented with using cpuset(1) to stick interrupts to
>> physical CPUs, but it offered no performance enhancements, and indeed,
>> appeared to decrease performance by 10-20%. Has anyone else tried this?
>> What were your results?
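For reference, what I tried was along these lines, pinning each card's
interrupts to CPUs with cpuset(1); the IRQ numbers below are made up
for illustration, the real ones came from vmstat -i on this box:

vmstat -i | grep t5nex
cpuset -l 2 -x 264
cpuset -l 3 -x 265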
>>
>> c) the defaults for the cxgbe driver appear to be 8 rx queues, and N tx
>> queues, with N being the number of CPUs detected. For a system running
>> multiple cards, routing or firewalling, does this make sense, or would
>> balancing tx and rx queues be better? And would reducing queues per card
>> based on NUMBER-CPUS and NUM-CHELSIO-PORTS make sense at all?
> The defaults are nrxq = min(8, ncores) and ntxq = min(16, ncores). The
> man page mentions this. The reason for 8 vs. 16 is that tx queues are
> "cheaper" as they don't have to be backed by rx buffers. It only needs
> some memory for the tx descriptor ring and some hardware resources.
>
> It appears that your system has >= 16 cores. For forwarding it probably
> makes sense to have nrxq = ntxq. If you're left with 8 or fewer cores
> after disabling hyperthreading you'll automatically get 8 rx and tx
> queues. Otherwise you'll have to fiddle with the hw.cxgbe.nrxq10g and
> ntxq10g tunables (documented in the man page).
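If I'm reading cxgbe(4) right, that would mean something like the
following in /boot/loader.conf; the value 8 here is only an example,
not a recommendation:

hw.cxgbe.nrxq10g="8"
hw.cxgbe.ntxq10g="8"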
>
>
>> d) dev.cxl.$PORT.qsize_rxq: 1024 and dev.cxl.$PORT.qsize_txq: 1024.
>> These appear to not be writeable when if_cxgbe is loaded, so I speculate
>> they are not to be messed with, or are loader.conf variables? Is there
>> any benefit to messing with them?
> Can't change them after the port has been administratively brought up
> even once. This is mentioned in the man page. I don't really recommend
> changing them anyway.
>
>> e) dev.t5nex.$CARD.toe.sndbuf: 262144. These are writeable, but messing
>> with values did not yield an immediate benefit. Am I barking up the
>> wrong tree, trying?
> The TOE tunables won't make a difference unless you have enabled TOE,
> the TCP endpoints lie on the system, and the connections are being
> handled by the TOE on the chip. This is not the case on your systems.
> The driver does not enable TOE by default and the only way to use it is
> to switch it on explicitly. There is no possibility that you're using
> it without knowing that you are.
>
>> f) based on prior experiments with other vendors, I tried tweaks to
>> net.isr.* settings, but did not see any benefits worth discussing. Am I
>> correct in this speculation, based on others' experience?
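For concreteness, the sort of thing I tried was along these lines, set
as boot tunables in /boot/loader.conf; the values are illustrative
rather than exactly what I used:

net.isr.maxthreads=4
net.isr.bindthreads=1
net.isr.dispatch=deferred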
>>
>> g) Are there other settings I should be looking at, that may squeeze out
>> a few more packets?
> The pps rates that you've observed are within the chip's hardware limits
> by at least an order of magnitude. Tuning the kernel rather than the
> driver may be the best bang for your buck.
>
> Regards,
> Navdeep
>
>> Thanks in advance!
>>
>> -- John Jasen (jjasen at gmail.com)