Chelsio NETMAP performance

Tue Feb 4 16:20:08 UTC 2020

On Mon, Feb 03, 2020 at 02:39:03PM -0800, Navdeep Parhar wrote:

> On 2/3/20 2:23 PM, Slawa Olhovchenkov wrote:
> > On Mon, Feb 03, 2020 at 01:39:52PM -0800, Navdeep Parhar wrote:
> > 
> >> On 2/3/20 12:17 PM, Slawa Olhovchenkov wrote:
> >>> I am trying to use a Chelsio T540-CR in netmap mode and see poor
> >>> performance (compared to Intel 82599ES).
> >>
> >> What approximate FreeBSD version is this?
> > 
> > 12.1-STABLE
> > 
> >>>
> >>> The same application can receive only about 8.9Mpps, compared to
> >>> 12.5Mpps on Intel.
> >>>
> >>> The pmc profile shows most of the time is spent in:
> >>>
> >>> 49.76%  [17802]    service_nm_rxq @ /boot/kernel/if_cxgbe.ko
> >>>  100.0%  [17802]     t4_vi_intr
> >>>   100.0%  [17802]      ithread_loop @ /boot/kernel/kernel
> >>>    100.0%  [17802]       fork_exit
> >>>
> >>>
> >>> To be exact, at the line
> >>>
> >>>         while ((d->rsp.u.type_gen & F_RSPD_GEN) == nm_rxq->iq_gen) {
> >>>
> >>> Is this the maximum for this vendor?
> >>
> >> No, a T540 should be able to sink full 10Gbps (14.88Mpps) on a single rx
> >> queue.  Try adding this to your loader.conf:
> >>
> >> hw.cxgbe.toecaps_allowed="0"
> >>
> >> Then try simple netmap "pkt-gen -f rx" instead of any custom app and see
> >> how many pps it's able to sink.
> > 
> > Thanks! `hw.cxgbe.toecaps_allowed="0"` allows receiving 14Mpps with my
> > application too!
> > 
> > Now I get only 10% less performance compared to Intel, which shows up as
> > higher Chelsio interrupt CPU time (top shows about 30% for every
> > interrupt handler). Is this normal? Is it possible to optimize?
> 
> Try changing the interrupt holdoff timer for the netmap rx queues.
> 
> This shows the list of timers available (in microseconds):
> # sysctl dev.t5nex.0.holdoff_timers
> 
> nm_holdoff_tmr_idx is a 0-based index into the list above.  So if the
> tmr idx is 0 you are using the 0th (first) value from the list of
> timers.  Try increasing nm_holdoff_tmr_idx and see if that brings
> the interrupt rate under control.
> 
> # sysctl hw.cxgbe.nm_holdoff_tmr_idx=3/4/5

OK, the interrupt rate went down, but the interrupt time is about the same.
(Interrupt time for the Intel card is about 0, compared to 25% for Chelsio.)
Most of the time is spent in service_nm_rxq(), in the while() check.
Is it possible to do some prefetching?
A trivial `__builtin_prefetch(64+(char*)d);` in the body of the loop doesn't
change anything.

Is it possible to do a batch prefetch before the loop?
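
To make the batch-prefetch idea concrete, here is a minimal sketch, not the
cxgbe code. It assumes a ring of 64-byte descriptors whose last byte carries
a generation bit; `struct desc`, `struct ring`, `GEN_BIT`, `PREFETCH_BATCH`,
`prefetch_batch()` and `drain_ring()` are all hypothetical names made up for
illustration. A real driver would also need the appropriate read barrier
after the generation check, and whether prefetching helps at all when the
device is writing those lines via DMA would have to be measured.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_SIZE      64    /* assumed: one descriptor per cache line */
#define GEN_BIT        0x80  /* hypothetical generation flag */
#define PREFETCH_BATCH 8     /* hypothetical batch size, tune by measurement */

struct desc {
	uint8_t payload[DESC_SIZE - 1];
	uint8_t type_gen;        /* generation bit lives in the last byte */
};

struct ring {
	struct desc *descs;
	uint32_t size;           /* number of descriptors in the ring */
	uint32_t cidx;           /* consumer index */
	uint8_t gen;             /* generation value expected at cidx */
};

/* Issue prefetches for the next n descriptors, wrapping at the ring end. */
static void
prefetch_batch(const struct ring *r, uint32_t n)
{
	uint32_t idx = r->cidx;

	for (uint32_t i = 0; i < n; i++) {
		__builtin_prefetch(&r->descs[idx], 0, 0);
		if (++idx == r->size)
			idx = 0;
	}
}

/* Consume up to budget ready descriptors; returns how many were consumed. */
static uint32_t
drain_ring(struct ring *r, uint32_t budget)
{
	uint32_t done = 0;

	assert(r->size > 0);
	while (done < budget) {
		/* Prefetch a whole batch ahead instead of one line per pass. */
		if (done % PREFETCH_BATCH == 0)
			prefetch_batch(r, PREFETCH_BATCH);

		struct desc *d = &r->descs[r->cidx];
		if ((d->type_gen & GEN_BIT) != r->gen)
			break;   /* descriptor not written by the producer yet */

		/* ... process the descriptor here ... */
		done++;
		if (++r->cidx == r->size) {
			r->cidx = 0;
			r->gen ^= GEN_BIT;  /* expected generation flips on wrap */
		}
	}
	return (done);
}
```

The batch is issued once every PREFETCH_BATCH descriptors, so the loads for
later descriptors are already in flight while the earlier ones are being
checked and processed.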