POWER9 NICs failing at 100Gbps

From: Ali Mashtizadeh <mashtizadeh_at_gmail.com>
Date: Tue, 02 May 2023 17:45:22 UTC

Hello Everyone,

We've been testing FreeBSD 13.2 on PowerPC64LE with an LC922 and a Raptor,
using 100Gbps Chelsio T6 and Mellanox ConnectX-6 NICs, and whichever NIC we
use fails as soon as we saturate it.  We can trigger this bug immediately by
running a few iperf3 instances simultaneously.
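
A reproduction along these lines can be sketched as follows. This is only a
minimal sketch, not the exact commands used: the peer address 192.0.2.10,
the base port 5201, the four clients, and the 60-second duration are all
assumptions, and it presumes an iperf3 server is already listening on each
port of the peer.

```python
import subprocess


def build_iperf3_commands(host, base_port=5201, instances=4, seconds=60):
    """Return one iperf3 client command per instance.

    Each client gets its own port because an iperf3 server handles only
    one test at a time. All defaults here are illustrative assumptions.
    """
    return [
        ["iperf3", "-c", host, "-p", str(base_port + i), "-t", str(seconds)]
        for i in range(instances)
    ]


def run_simultaneously(commands):
    """Start every client at once, then wait for all and return exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

Calling `run_simultaneously(build_iperf3_commands("192.0.2.10"))` would start
four clients at the same time against the hypothetical peer.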

I've included the log for the Chelsio NIC below.  Is this a known issue?

cc0: link state changed to UP
t6nex0: command 0x16 in mbox 4 timed out (0x4014c010).
t6nex0: mbox 4 cmdsent 16a0094400000001 2328f70000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000
t6nex0: mbox 4 current 16a0094400000001 2328f70000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000
t6nex0: encountered fatal error, adapter stopped (1).
cc0: set_rxmode (1) failed: 60
t6nex0: CIM debug regs1 00000000 00000000 00000000 00000000 00000000
t6nex0: CIM debug regs2 00000000 00000000 00000000 00000000 00330000
t6nex0: CIM LA dump follows.
Status   Inst    Data      PC     LS0Stat  LS0Addr  LS0Data  LS1Stat  LS1Addr  LS1Data
  3c   00003003 1fffeedf 1fffeedf 00a00028 1fff0850 1fff3400 00b00020 1ffce2e8 00000000
  3c   00003008 1fffeee2 1fffeee2 00a00028 1fff06a4 1ffce200 00b00020 1ffce2e8 00000000
  3c   00003008 1fffeeea 1fffeeea 00a00028 1fff084c 1fff2f0c 00b00020 1ffce2e8 00000000
  3c   00003008 1fffeef2 1fffeef2 00a00020 1fff084c 00000000 00b00020 1ffce2e8 00000000
  3c   00003002 1fffeefa 1fffeefa 00a00020 1fff084c 00000000 00b00020 1ffce2e8 00000000
  3c   00003002 1fffeefc 1fffeefc 00a00020 1fff084c 00000000 00b00020 1ffce2e8 00000000
  3c   00003008 1fffeefe 1fffeefe 00a00005 1fff328b 0000000f 00b00025 1ffce2e8 00000000
....
t6nex0: device log follows.
....
        46       2968294087    NOTICE      PORT  port[0:0x11:0x0b]: l1cfg, 1G/10G can't be advertised for this port type. mcaps 0x339f007e acaps 0x20970078 rcaps 0xb3007e
        47       2968386457      INFO      PORT  port_link_state_handler[0] powering up
        48       2968386460      INFO      PORT  port[0] update (flowcid 40236 rc 0)
        49       2968685971      INFO      PORT  bean_fsm[0] : state START (count = 1)
        50       2968695782      INFO      PORT  hw_mac_init_port[0], ptype 0x11, speed 0x4, lanes 0xf, fec 0x800000
        51       2968696059      INFO      PORT  bean_fsm[0] : entering state BASEP_HANDLE
        52       2969235973      INFO      PORT  bean_fsm[0] : entering state NXP_HANDLE
        53       2969245973      INFO      PORT  bean_fsm[0] : entering state EXT_NXP_HANDLE
        54       2969255973      INFO      PORT  consortium_fec[0]: local 0x7, remote 0x3, negotiated 0x800000
        55       2969255973      INFO      PORT  bean_fsm[0] : entering state WAIT_FOR_NULL_PAGE
        56       2969285973      INFO      PORT  bean_fsm[0] : entering state WAIT_COMPLETE
        57       2969285974      INFO      PORT  bean_fsm[0] : tech ability local 0x710, remote 0x715 cr-s 0, local fec_ability 0x1
        58       2969285974      INFO      PORT  bean_fsm[0] : IEEE speed 0x40, FEC remote 0x4, negotiated 0x800000
        59       2969285975      INFO      PORT  bean_fsm[0] : state DONE
        60       2969285976      INFO      PORT  bean_fsm[0] : FEC local 0x1, negotiated 0x800000
        61       2969286976      INFO      PORT  hw_mac_init_port[0], ptype 0x11, speed 0x40, lanes 0xf, fec 0x800000
        62       2969287972      INFO      PORT  port[0] negotiated speed 0x40, lanes 0xf:0xf, fec 0x800000
        63       2969287974      INFO      PORT  aec_fsm[0] : state START (sigdet 0xf)
        64       2969288111      INFO      PORT  aec_fsm[0] : transitioning to TRAINING
        65       2969651045      INFO      PORT  aec_fsm[0] : TRAINING_COMPLETE
        66       2969651046      INFO      PORT  aec_fsm[0] : COEFFICIENT TAP OVERRIDE 1:2:3 :: 0x7e:0x1b:0x75
        67       2969651046      INFO      PORT  aec_fsm[0] : COEFFICIENT TAP OVERRIDE 1:2:3 :: 0x7e:0x1b:0x75
        68       2969651047      INFO      PORT  aec_fsm[0] : COEFFICIENT TAP OVERRIDE 1:2:3 :: 0x7e:0x1b:0x75
        69       2969651047      INFO      PORT  aec_fsm[0] : COEFFICIENT TAP OVERRIDE 1:2:3 :: 0x7e:0x1b:0x75
        70       2969651905      INFO      PORT  aec_fsm[0] : Remote fault while waiting for link status 0x29
        71       2975239314      INFO      PORT  aec_fsm[0]: aec training completed, link timed out lstatus 0x5
        72       2975239314      INFO      PORT  aec_fsm[0] Link timed out after training complete, Link Status 0x5
        73       2975335992      INFO      PORT  bean_fsm[0] : state START (count = 1)
        74       2975345863      INFO      PORT  hw_mac_init_port[0], ptype 0x11, speed 0x4, lanes 0xf, fec 0x800000
        75       2975346140      INFO      PORT  bean_fsm[0] : entering state BASEP_HANDLE
        76       2975415994      INFO      PORT  bean_fsm[0] : entering state NXP_HANDLE
        77       2975425994      INFO      PORT  bean_fsm[0] : entering state EXT_NXP_HANDLE
        78       2975435994      INFO      PORT  consortium_fec[0]: local 0x7, remote 0x3, negotiated 0x800000
        79       2975435994      INFO      PORT  bean_fsm[0] : entering state WAIT_FOR_NULL_PAGE
        80       2975465994      INFO      PORT  bean_fsm[0] : entering state WAIT_COMPLETE
        81       2975465995      INFO      PORT  bean_fsm[0] : tech ability local 0x710, remote 0x715 cr-s 0, local fec_ability 0x1
        82       2975465995      INFO      PORT  bean_fsm[0] : IEEE speed 0x40, FEC remote 0x4, negotiated 0x800000
        83       2975465996      INFO      PORT  bean_fsm[0] : state DONE
        84       2975465996      INFO      PORT  bean_fsm[0] : FEC local 0x1, negotiated 0x800000
        85       2975466997      INFO      PORT  hw_mac_init_port[0], ptype 0x11, speed 0x40, lanes 0xf, fec 0x800000
        86       2975467993      INFO      PORT  port[0] negotiated speed 0x40, lanes 0xf:0xf, fec 0x800000
        87       2975467994      INFO      PORT  aec_fsm[0] : state START (sigdet 0xf)
        88       2975468131      INFO      PORT  aec_fsm[0] : transitioning to TRAINING
        89       2975837289      INFO      PORT  aec_fsm[0] : TRAINING_COMPLETE
        90       2975837289      INFO      PORT  aec_fsm[0] : COEFFICIENT TAP OVERRIDE 1:2:3 :: 0x7e:0x1b:0x75
        91       2975837290      INFO      PORT  aec_fsm[0] : COEFFICIENT TAP OVERRIDE 1:2:3 :: 0x7e:0x1b:0x75
        92       2975837290      INFO      PORT  aec_fsm[0] : COEFFICIENT TAP OVERRIDE 1:2:3 :: 0x7e:0x1b:0x75
        93       2975837291      INFO      PORT  aec_fsm[0] : COEFFICIENT TAP OVERRIDE 1:2:3 :: 0x7e:0x1b:0x75
        94       2975838184      INFO      PORT  aec_fsm[0] : Remote fault while waiting for link status 0x29
        95       2981015970      INFO      PORT  hw_mac_link_status[0] int_cause 0x17011b4, link_status 0x22
        96       2981015970      INFO      PORT  aec_fsm[0] : Remote fault cleared while waiting for link status 0x22
        97       2981015973      INFO      PORT  aec_fsm[0] : DONE
        98       2981015973      INFO      PORT  bean/aec complete (retry: 1)
        99       2981015974      INFO      PORT  port_hss_sigdet[0]: hss_sigdet changed to 0xf
       100       2981106013      INFO      PORT  port[0] link up (1) (speed 0x40 acaps 0x20970078 lpcaps 0x10007e)
       101       2981106015      INFO      PORT  port[0] set PAUSE PARAMS: pppen 0 txpe 0 rxpe 0
       102       2981106018      INFO      PORT  port[0] update (flowcid 40236 rc 0)

Best,
Ali