tuning routing using cxgbe and T580-CR cards?

John Jasem jjasen at gmail.com
Mon Jul 14 14:57:39 UTC 2014


The two physical CPUs are: 
Intel(R) Xeon(R) CPU E5-4610 0 @ 2.40GHz (2400.05-MHz K8-class CPU)

Hyperthreading, at least from initial appearances, seems to offer
neither benefit nor drawback.

I tested with iperf3, using a packet generator on each subnet, each
sending 4 streams to a server on another subnet.

Maximum segment sizes of 128 and 1460 were used (iperf3 -M), with
little variance in the results.
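
For anyone reproducing the setup, each run amounted to something like
the following (the address, stream count placement, and duration here
are made up for illustration):

iperf3 -s                                # on each server host
iperf3 -c 10.1.1.10 -P 4 -M 1460 -t 60   # 4 streams from each client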

A snapshot of netstat -d -b -w1 -W -h is included below. Midway
through, the numbers dropped; this coincides with launching 16 more
streams: 4 new clients and 4 new servers on different nets, with 4
streams each.

            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls drops
      1.6M     0   514       254M       1.6M     0       252M     0     5
      1.6M     0   294       244M       1.6M     0       246M     0     6
      1.6M     0    95       255M       1.5M     0       236M     0     6
      1.4M     0     0       216M       1.5M     0       224M     0     3
      1.5M     0     0       225M       1.4M     0       219M     0     4
      1.4M     0   389       214M       1.4M     0       216M     0     1
      1.4M     0   270       207M       1.4M     0       207M     0     1
      1.4M     0   279       210M       1.4M     0       209M     0     2
      1.4M     0    12       207M       1.3M     0       204M     0     1
      1.4M     0   303       206M       1.4M     0       214M     0     2
      1.3M     0  2.3K       190M       1.4M     0       212M     0     1
      1.1M     0  1.1K       175M       1.1M     0       176M     0     1
      1.1M     0  1.6K       176M       1.1M     0       175M     0     1
      1.1M     0   830       176M       1.1M     0       174M     0     0
      1.2M     0  1.5K       187M       1.2M     0       187M     0     0
      1.2M     0  1.1K       183M       1.2M     0       184M     0     1
      1.2M     0  1.5K       197M       1.2M     0       196M     0     2
      1.3M     0  2.2K       199M       1.2M     0       196M     0     0
      1.3M     0  2.8K       200M       1.3M     0       202M     0     4
      1.3M     0  1.5K       199M       1.2M     0       198M     0     1


vmstat output is also included. You can see similar drops in the
faults columns (interrupts and context switches).


 procs      memory      page                    disks     faults         cpu
 r b w     avm    fre   flt  re  pi  po    fr  sr mf0 cd0   in   sy   cs us sy id
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 188799  224 387419  0 74 26
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 207447  150 425576  0 72 28
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 205638  202 421659  0 75 25
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 200292  150 411257  0 74 26
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 200338  197 411537  0 77 23
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 199289  156 409092  0 75 25
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 200504  200 411992  0 76 24
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 165042  152 341207  0 78 22
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 171360  200 353776  0 78 22
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 197557  150 405937  0 74 26
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 170696  204 353197  0 78 22
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 174927  150 361171  0 77 23
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 153836  200 319227  0 79 21
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 159056  150 329517  0 78 22
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 155240  200 321819  0 78 22
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 166422  156 344184  0 78 22
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 162065  200 335215  0 79 21
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 172857  150 356852  0 78 22
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 81267  197 176539  0 92  8
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 82151  150 177434  0 91  9
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 73904  204 160887  0 91  9
 0 0 0    574M    15G     2   0   0   0     8   6   0   0 73820  150 161201  0 91  9
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 73926  196 161850  0 92  8
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 77215  150 166886  0 91  9
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 77509  198 169650  0 91  9
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 69993  156 154783  0 90 10
 0 0 0    574M    15G    82   0   0   0     0   6   0   0 69722  199 153525  0 91  9
 0 0 0    574M    15G     2   0   0   0     0   6   0   0 66353  150 147027  0 91  9
 0 0 0    550M    15G   102   0   0   0   101   6   0   0 67906  259 149365  0 90 10
 0 0 0    550M    15G     0   0   0   0     0   6   0   0 71837  125 157253  0 92  8
 0 0 0    550M    15G    80   0   0   0     0   6   0   0 73508  179 161498  0 92  8
 0 0 0    550M    15G     0   0   0   0     0   6   0   0 72673  125 159449  0 92  8
 0 0 0    550M    15G    80   0   0   0     0   6   0   0 75630  175 164614  0 91  9




On 07/11/2014 03:32 PM, Navdeep Parhar wrote:
> On 07/11/14 10:28, John Jasem wrote:
>> In testing two Chelsio T580-CR dual port cards with FreeBSD 10-STABLE,
>> I've been able to use a collection of clients to generate approximately
>> 1.5-1.6 million TCP packets per second sustained, and routinely hit
>> 10GB/s, both measured by netstat -d -b -w1 -W (I usually use -h for the
>> quick read, accepting the loss of granularity).
> When forwarding, the pps rate is often more interesting, and almost
> always the limiting factor, as compared to the total amount of data
> being passed around.  10GB at this pps probably means 9000 MTU.  Try
> with 1500 too if possible.
>
> "netstat -d 1" and "vmstat 1" for a few seconds when your system is
> under maximum load would be useful.  And what kind of CPU is in this system?
>
>> While performance has so far been stellar, and I'm honestly speculating
>> I will need more CPU depth and horsepower to get much faster, I'm
>> curious if there is any gain to tweaking performance settings. I'm
>> seeing, under multiple streams, with N targets connecting to N servers,
>> interrupt load on all CPUs peg at 99-100%, and I'm curious if tweaking
>> configs will help, or if it's a sign that I simply need more horsepower.
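
As an aside, the pegged interrupt load is visible with the stock tools:

vmstat -i   # per-device interrupt counts and rates
top -SHP    # system threads with a per-CPU breakdown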
>>
>> So far, except for temporarily turning off pflogd and setting the
>> following sysctl variables, I've not done any performance tuning on the
>> system yet.
>>
>> /etc/sysctl.conf
>> net.inet.ip.fastforwarding=1
>> kern.random.sys.harvest.ethernet=0
>> kern.random.sys.harvest.point_to_point=0
>> kern.random.sys.harvest.interrupt=0
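
Those take effect at boot via /etc/sysctl.conf; the same settings can
be flipped on a running system with sysctl(8):

sysctl net.inet.ip.fastforwarding=1
sysctl kern.random.sys.harvest.ethernet=0
sysctl kern.random.sys.harvest.point_to_point=0
sysctl kern.random.sys.harvest.interrupt=0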
>>
>> a) One of the first things I did in prior testing was to turn
>> hyperthreading off. I presume this is still prudent, as HT doesn't help
>> with interrupt handling?
> It is always worthwhile to try your workload with and without
> hyperthreading.
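
Side note: instead of a BIOS round trip, I believe HT can also be kept
away from the scheduler on 10.x with a loader tunable:

# /boot/loader.conf
machdep.hyperthreading_allowed="0"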
>
>> b) I briefly experimented with using cpuset(1) to stick interrupts to
>> physical CPUs, but it offered no performance enhancements, and indeed,
>> appeared to decrease performance by 10-20%. Has anyone else tried this?
>> What were your results?
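
For reference, my pinning attempt used cpuset(1) roughly as follows;
the IRQ and CPU numbers below are only illustrative:

vmstat -i | grep t5nex     # find the IRQs used by the card's queues
cpuset -x 264 -l 2         # pin one IRQ to one CPU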
>>
>> c) the defaults for the cxgbe driver appear to be 8 rx queues, and N tx
>> queues, with N being the number of CPUs detected. For a system running
>> multiple cards, routing or firewalling, does this make sense, or would
>> balancing tx and rx queues be better? And would reducing queues per card
>> based on NUMBER-CPUS and NUM-CHELSIO-PORTS make sense at all?
> The defaults are nrxq = min(8, ncores) and ntxq = min(16, ncores).  The
> man page mentions this.  The reason for 8 vs. 16 is that tx queues are
> "cheaper" as they don't have to be backed by rx buffers.  A tx queue only
> needs some memory for its descriptor ring and some hardware resources.
>
> It appears that your system has >= 16 cores.  For forwarding it probably
> makes sense to have nrxq = ntxq.  If you're left with 8 or fewer cores
> after disabling hyperthreading you'll automatically get 8 rx and tx
> queues.  Otherwise you'll have to fiddle with the hw.cxgbe.nrxq10g and
> ntxq10g tunables (documented in the man page).
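
For later readers, that fiddling amounts to /boot/loader.conf entries
along these lines (values illustrative):

hw.cxgbe.nrxq10g="16"
hw.cxgbe.ntxq10g="16"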
>
>
>> d) dev.cxl.$PORT.qsize_rxq: 1024 and dev.cxl.$PORT.qsize_txq: 1024.
>> These appear not to be writable when if_cxgbe is loaded, so I speculate
>> they are not to be messed with, or are loader.conf variables? Is there
>> any benefit to messing with them?
> Can't change them after the port has been administratively brought up
> even once.  This is mentioned in the man page.  I don't really recommend
> changing them anyway.
>
>> e) dev.t5nex.$CARD.toe.sndbuf: 262144. These are writable, but messing
>> with the values did not yield an immediate benefit. Am I barking up the
>> wrong tree in trying?
> The TOE tunables won't make a difference unless you have enabled TOE,
> the TCP endpoints lie on the system, and the connections are being
> handled by the TOE on the chip.  This is not the case on your systems.
> The driver does not enable TOE by default and the only way to use it is
> to switch it on explicitly.  There is no possibility that you're using
> it without knowing that you are.
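
For completeness, my understanding is that switching TOE on explicitly
would involve loading the TOE module and enabling the capability per
port, something like the following (untested here):

kldload t4_tom      # TOE module for T4/T5 cards
ifconfig cxl0 toe   # enable the TOE capability on the port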
>
>> f) based on prior experiments with other vendors, I tried tweaks to
>> net.isr.* settings, but did not see any benefits worth discussing. Am I
>> correct in this speculation, based on others' experience?
>>
>> g) Are there other settings I should be looking at that might squeeze
>> out a few more packets?
> The pps rates that you've observed are within the chip's hardware limits
> by at least an order of magnitude.  Tuning the kernel rather than the
> driver may be the best bang for your buck.
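
For the record, the net.isr tweaks from (f) were along these lines in
/boot/loader.conf (values illustrative):

net.isr.maxthreads="8"    # more netisr threads, up to one per core
net.isr.bindthreads="1"   # bind netisr threads to CPUs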
>
> Regards,
> Navdeep
>
>> Thanks in advance!
>>
>> -- John Jasen (jjasen at gmail.com)


