A flood of bacula traffic causes igb interface to go offline.

Tue Feb 1 21:19:10 UTC 2011

Hey net@,

I have a FreeBSD 8.2-RC2 system running on a HP DL180 G6, using the 
onboard Intel controller, and it is our primary Bacula storage node and 
director node.

We have 96 clients that are scheduled to run at 8:30pm. After about 9-10 
minutes of activity (mrtg graphs show about 50-60 MB/sec of incoming 
traffic), the igb1 interface is no longer able to communicate with the 
Cisco switch.

The interesting part is, the interface is still "up", there is nothing 
in the kernel message buffer, and nothing relevant in the log file (just 
syslogd and ldap errors because they cannot reach their respective 
network servers). The system does not respond to the network until I 
either reboot, or run 'ifconfig igb1 down ; ifconfig igb1 up'. There is 
no firewall loaded/configured.

Thankfully, I have a KVM over IP, so when this happens I can at least 
run script(1) and capture some useful information.
ifconfig igb1
igb1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
     options=1bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4>
     ether 1c:c1:de:e9:fb:af
     inet 128.15.136.105 netmask 0xffffff00 broadcast 128.15.136.255
     inet 128.15.136.108 netmask 0xffffff00 broadcast 128.15.136.255
     inet 128.15.136.102 netmask 0xffffff00 broadcast 128.15.136.255
     media: Ethernet autoselect (1000baseT <full-duplex>)
     status: active

I can ping the internal IP (but I realize that is probably a useless
test...)
root at write /etc]> ping 128.15.136.105
PING 128.15.136.105 (128.15.136.105): 56 data bytes
64 bytes from 128.15.136.105: icmp_seq=0 ttl=64 time=0.024 ms
64 bytes from 128.15.136.105: icmp_seq=1 ttl=64 time=0.015 ms
^C
--- 128.15.136.105 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.015/0.019/0.024/0.005 ms

Attempting to ping the router:
root at write /etc]> ping 128.15.136.254
PING 128.15.136.254 (128.15.136.254): 56 data bytes
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
^C
--- 128.15.136.254 ping statistics ---
9 packets transmitted, 0 packets received, 100.0% packet loss

The only thing that seems to solve this problem is to either reboot, or
do an "ifconfig down/up":

root at write /etc]> ifconfig igb1 down
root at write /etc]> ifconfig igb1 up
root at write /etc]> ping 128.15.136.254
PING 128.15.136.254 (128.15.136.254): 56 data bytes
64 bytes from 128.15.136.254: icmp_seq=1 ttl=255 time=1.015 ms
64 bytes from 128.15.136.254: icmp_seq=2 ttl=255 time=0.217 ms
64 bytes from 128.15.136.254: icmp_seq=3 ttl=255 time=0.278 ms
64 bytes from 128.15.136.254: icmp_seq=4 ttl=255 time=0.238 ms
^C
--- 128.15.136.254 ping statistics ---
5 packets transmitted, 4 packets received, 20.0% packet loss
round-trip min/avg/max/stddev = 0.217/0.437/1.015/0.334 ms
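
As a stopgap, the down/up workaround above could be scripted. The following 
is only a minimal sketch, not something I have running: the variable names, 
the function names, and the "igb-watchdog" log tag are made up for 
illustration, and it assumes that losing the router (128.15.136.254) is a 
good enough sign that the interface has wedged. It only uses ping(8), 
ifconfig(8), logger(1) and sleep(1).

```sh
#!/bin/sh
# Hypothetical watchdog: bounce the interface when the gateway stops answering.
# IFACE and GATEWAY default to the values from this report; override as needed.
IFACE="${IFACE:-igb1}"
GATEWAY="${GATEWAY:-128.15.136.254}"

# Succeeds only if at least one of three echo requests is answered.
gateway_reachable() {
    ping -c 3 "$GATEWAY" >/dev/null 2>&1
}

# The manual workaround: take the interface down, wait briefly, bring it up.
bounce_interface() {
    logger -t igb-watchdog "gateway $GATEWAY unreachable, bouncing $IFACE"
    ifconfig "$IFACE" down
    sleep 2
    ifconfig "$IFACE" up
}

# One check; prints "ok" or "bounced" so the caller can see what happened.
check_once() {
    if gateway_reachable; then
        echo "ok"
    else
        bounce_interface
        echo "bounced"
    fi
}

# Nothing happens unless the script is invoked with the argument "run".
if [ "${1:-}" = "run" ]; then
    check_once
fi
```

Run from cron every minute or so with the "run" argument, this would only 
hide the symptom; it does nothing about the underlying cause.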

I was able to run tcpdump during all of this, and it showed *nothing* 
between the system and the switch until I ran ifconfig igb1 down/up, 
after which the CDP and Spanning Tree traffic reappeared.

The networking team here has told me there are no errors on the switch, 
or the port I am on, and they even moved me from one port to another, 
but this is still happening on a fairly regular basis now that I've 
added more backup clients.

Is this possibly a bug in the Intel driver, or a problem with my 
hardware? I have a pcap file and more system details that might help, 
but I don't want to send those out to a mailing list.
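
If it helps anyone looking at this, the extra system information could be 
gathered from the KVM console while the interface is wedged, with something 
like the sketch below. The function names, the IFACE and OUTDIR variables, 
and the output file name are only illustrative, and I am assuming that 
"sysctl dev.igb.1" exposes useful driver state on this release. The other 
commands are plain ifconfig(8), netstat(1) and vmstat(8).

```sh
#!/bin/sh
# Hypothetical snapshot collector for a wedged interface.
IFACE="${IFACE:-igb1}"
OUTDIR="${OUTDIR:-/var/tmp}"

# Run one command under a header line; a failing or missing command is
# noted in the output instead of aborting the whole snapshot.
snapshot_cmd() {
    echo "=== $* ==="
    "$@" 2>&1 || echo "(command failed or unavailable: $1)"
    echo
}

# Interface state, per-interface counters, mbuf usage, interrupt counts,
# and the driver's sysctl tree (dev.igb.<unit>, unit taken from IFACE).
collect_snapshot() {
    snapshot_cmd date
    snapshot_cmd ifconfig "$IFACE"
    snapshot_cmd netstat -i
    snapshot_cmd netstat -m
    snapshot_cmd vmstat -i
    snapshot_cmd sysctl "dev.igb.${IFACE#igb}"
}

# Nothing is written unless the script is invoked with the argument "run".
if [ "${1:-}" = "run" ]; then
    out="$OUTDIR/${IFACE}-snapshot-$(date +%Y%m%d-%H%M%S).txt"
    collect_snapshot > "$out"
    echo "wrote $out"
fi
```

Taking one snapshot while the interface is stuck and another right after 
the down/up would make it easy to diff the two.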