NFE adapter 'hangs'

Melissa Jenkins melissa-freebsd at littlebluecar.co.uk
Thu Sep 2 08:56:25 UTC 2010


Hiya,

I've been having trouble with two different machines (FBSD 8.0p3 & FBSD 7.0p5) using the nfe(4) network adapter. The machines are, respectively, a Sun X2200 (AMD64) and a Sun X2100 M2 (AMD64), and both are running the amd64 kernel.

Basically, what appears to happen is that traffic stops flowing through the interface and 'No buffer space available' errors are produced when trying to send ICMP packets. All established connections appear to hang.
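
When it's in this state, the obvious places I know to look for drops are just the standard counters, e.g.:

netstat -d -i -I nfe0              # interface counters, including dropped packets
netstat -s -p ip | grep -i drop    # IP-level output drops ("no bufs", etc.)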

The machines are running as packet routers, and nfe0 is acting as the LAN side. PF is being used for filtering, NAT, BINAT and RDR. The same PF configuration works correctly on two other servers using different network adapters; one of them is configured with pfsync & CARP, but the other one isn't.
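
(If anyone wants to ask about specific rules, the rulesets are easy to compare between the working and failing boxes with the standard pfctl options:)

pfctl -sr    # loaded filter rules
pfctl -sn    # NAT / BINAT / RDR rules
pfctl -si    # general counters and state-table info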

The problem seems to happen under a fairly light session load (< 100 active states in PF), though the more states there are the quicker it occurs. It is possible it's related to packet rates, as putting high-bandwidth clients on seems to produce the problem very quickly (within several minutes). This is reinforced by the fact that the problem first manifested when we upgraded one of the leased lines.
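
(The state counts above are just pf's own counters, i.e. something like:)

pfctl -si          # state-table size and insertion/removal rates
pfctl -ss | wc -l  # number of active states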

Executing ifconfig nfe0 down && ifconfig nfe0 up will restart traffic flow.  
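
(The same workaround could obviously be wrapped in a crude watchdog - untested sketch, using the LAN host I ping further down as the test address:)

#!/bin/sh
# Bounce nfe0 if the LAN-side test host stops answering.
if ! ping -c 3 -t 5 172.31.3.129 > /dev/null 2>&1; then
    logger "nfe0 watchdog: no ping response, cycling interface"
    ifconfig nfe0 down && ifconfig nfe0 up
fi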

Neither box is very highly loaded, generally around 1.5 Mbit/s. The problem doesn't appear to be related simply to the volume of traffic, as I have tried re-routing 95% of the traffic around the server without any improvement. The traffic profile is fairly random - a mix of TCP and UDP, mostly flowing OUT of nfe0. It is all L3, and there are fewer than 5 hosts on the segment attached to the nfe interface.
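
(If it's useful, those rates are easy to watch per interface:)

netstat -w 1 -I nfe0    # per-second packet/byte counters for nfe0
systat -ifstat 1        # or the live curses view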

The two boxes are in different locations and are connected to different models of Cisco switch. Both appear to autonegotiate correctly, and the switch ports show no status changes.
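
(The negotiated media on the FreeBSD side is just what ifconfig reports, e.g.:)

ifconfig nfe0 | grep -E 'media|status'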

It appears that pfsync, CARP & a GRE tunnel work correctly over the nfe interface for long periods of time (weeks+), and that it is something to do with adding other traffic to the mix that results in the interface 'hanging'.

If I move the traffic from the nfe interface to the other bge interface (the one shared with the LOM), everything is stable and works correctly. I have not been able to reproduce the problem using test loads, and the interface worked correctly with iperf testing prior to deployment. Unfortunately I can't (for legal reasons) provide a traffic trace up to the time it occurs, though everything leading up to it looks normal to me.
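
(The pre-deployment testing was plain iperf - roughly this sort of run, exact options from memory and addresses omitted:)

iperf -s                               # on the router, listening on the nfe0 side
iperf -c <router-lan-ip> -t 300 -P 4   # sustained TCP from a LAN host
iperf -c <router-lan-ip> -u -b 50M     # plus a fixed-rate UDP run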

The FreeBSD 7 X2100 lists the following from pciconf -lv:
nfe0 at pci0:0:8:0:        class=0x068000 card=0x534c108e chip=0x037310de rev=0xa3 hdr=0x00
   vendor     = 'Nvidia Corp'
   device     = 'MCP55 Ethernet'
   class      = bridge
nfe1 at pci0:0:9:0:        class=0x068000 card=0x534c108e chip=0x037310de rev=0xa3 hdr=0x00
   vendor     = 'Nvidia Corp'
   device     = 'MCP55 Ethernet'
   class      = bridge

The FreeBSD 8 X2200 lists essentially the same thing (only the card/subsystem ID differs):
nfe0 at pci0:0:8:0:        class=0x068000 card=0x534b108e chip=0x037310de rev=0xa3 hdr=0x00
   vendor     = 'Nvidia Corp'
   device     = 'MCP55 Ethernet'
   class      = bridge
nfe1 at pci0:0:9:0:        class=0x068000 card=0x534b108e chip=0x037310de rev=0xa3 hdr=0x00
   vendor     = 'Nvidia Corp'
   device     = 'MCP55 Ethernet'
   class      = bridge


Here are the two obvious tests (both from the FreeBSD 7 box); the ICMP response & the mbuf stats are very much the same on both boxes.

ping 172.31.3.129
PING 172.31.3.129 (172.31.3.129): 56 data bytes
ping: sendto: No buffer space available
ping: sendto: No buffer space available
^C

--- 172.31.3.129 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss

netstat -m
852/678/1530 mbufs in use (current/cache/total)
818/448/1266/25600 mbuf clusters in use (current/cache/total/max)
817/317 mbuf+clusters out of packet secondary zone in use (current/cache)
0/362/362/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
1879K/2513K/4392K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/0/0 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

From the other machine, after the problem has occurred & an ifconfig down/up cycle has been done (i.e. when the interface is working):
vmstat -z
(columns are: ITEM, SIZE, LIMIT, USED, FREE, REQUESTS, FAILURES)
mbuf_packet:              256,        0,     1033,     1783, 330792410,        0
mbuf:                     256,        0,        5,     1664, 395145472,        0
mbuf_cluster:            2048,    25600,     2818,     1690, 13234653,        0
mbuf_jumbo_page:         4096,    12800,        0,      336,   297749,        0
mbuf_jumbo_9k:           9216,     6400,        0,        0,        0,        0
mbuf_jumbo_16k:         16384,     3200,        0,        0,        0,        0
mbuf_ext_refcnt:            4,        0,        0,        0,        0,        0


Although I failed to keep a copy, I don't believe there is a kmem problem.
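
(For next time, the kmem side is easy enough to capture, e.g.:)

vmstat -m                               # per-malloc-type kernel memory usage
sysctl vm.kmem_size vm.kmem_size_max    # configured kmem limits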

I'm at a complete loss as to what to try next :(  

All suggestions very gratefully received!!! The 7.0 box is live so can't really be played with, but I can occasionally run tests on the other box.
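
(On the test box, the only things I can think of trying myself are turning off the offload features and, failing that, the MSI tunables that I believe nfe(4) documents - names from memory, so please correct me if they're wrong:)

ifconfig nfe0 -txcsum -rxcsum -tso     # disable checksum offload and TSO

# and/or in /boot/loader.conf, then reboot:
hw.nfe.msi_disable="1"
hw.nfe.msix_disable="1"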

Thank you :)
Mel



