sfxge, lagg, cannot flush Tx/Rx queue and disconnects

Mon Aug 10 18:26:01 UTC 2020

Hi,

Apologies, the first email went out as HTML. Resending as plain text.

I have a FreeBSD 12.1 system with a Solarflare SFN8522 network 
controller. Everything works perfectly fine until at some point I lose 
connectivity to the server: it stops responding to pings for some 
time, then starts responding again and keeps working for a long time.

lagg0 is configured like this in /etc/rc.conf:

ifconfig_sfxge0="up mtu 9000"
ifconfig_sfxge1="up mtu 9000"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto failover laggport sfxge0 laggport sfxge1 xxx.xxx.xxx.xxx/24"

Output of pciconf -lv:

sfxge0@pci0:133:0:0: class=0x020000 card=0x80171924 chip=0x0a031924 rev=0x02 hdr=0x00
    vendor = 'Solarflare Communications'
    device = 'SFC9220 10/40G Ethernet Controller'
    class = network
    subclass = ethernet
sfxge1@pci0:133:0:1: class=0x020000 card=0x80171924 chip=0x0a031924 rev=0x02 hdr=0x00
    vendor = 'Solarflare Communications'
    device = 'SFC9220 10/40G Ethernet Controller'
    class = network
    subclass = ethernet

The simplest fix is to reboot the server, after which everything works 
as before, but this isn't the best option. When I tried to restart 
networking during one of the troubleshooting sessions 
(/etc/rc.d/netif restart), the process got stuck and I saw several 
messages in the logs:

kernel: sfxge0: Cannot flush Tx queue 23
kernel: sfxge0: Cannot flush Tx queue 15
kernel: sfxge0: Cannot flush Rx queue 23
kernel: sfxge0: Cannot flush Rx queue 15

I don't have access to the switch to see what's going on, but from what I 
hear they don't see anything suspicious, which rules out a switch issue.

The latest troubleshooting step was to disable TSO4, TSO6 and LRO by running

ifconfig sfxge0 -tso4 -tso6 -lro

Not sure yet whether that helped.
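
If the change does turn out to help, it would presumably need to be made 
persistent, since the ifconfig command above only lasts until the next 
reboot. A minimal, untested sketch of how the rc.conf entries shown 
earlier might look with the same flags appended. Applying them to sfxge1 
as well is an assumption on my part, on the grounds that both ports are 
lagg members; so far I have only run the command on sfxge0:

```shell
# /etc/rc.conf fragment (sketch, not verified on the affected system)
# Same settings as before, with TSO4, TSO6 and LRO disabled on each port.
ifconfig_sfxge0="up mtu 9000 -tso4 -tso6 -lro"
# Assumption: the second lagg member gets the same treatment.
ifconfig_sfxge1="up mtu 9000 -tso4 -tso6 -lro"
```

The cloned_interfaces and ifconfig_lagg0 lines would stay unchanged.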

Any help would be appreciated.

Thanks!