A flood of bacula traffic causes igb interface to go offline.

Wed Feb 2 18:07:38 UTC 2011

On Tue, 2011-02-01 at 12:50 -0800, Mike Carlson wrote:
> Hey net@,
> 
> I have a FreeBSD 8.2-RC2 system running on a HP DL180 G6, using the 
> onboard Intel controller, and it is our primary Bacula storage node and 
> director node.
> 
> We have 96 clients that are scheduled to run at 8:30pm. After about 9 - 
> 10 minutes of activity (mrtg graphs show about 50-60MB/sec incoming 
> traffic), the igb1 interface is no longer able to communicate with the 
> Cisco switch.
> 
> The interesting part is that the interface is still "up", there is nothing
> in the kernel message buffer, and nothing relevant in the log file (just
> syslogd and LDAP errors because they cannot reach their respective
> network servers). The system does not respond on the network until I either
> reboot or run 'ifconfig igb1 down; ifconfig igb1 up'. There is no
> firewall loaded/configured.
> 
> Thankfully, I have a KVM over IP, so when this happens I can at least 
> run script(1) and capture some useful information.
> ifconfig igb1
> igb1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>      options=1bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4>
>      ether 1c:c1:de:e9:fb:af
>      inet 128.15.136.105 netmask 0xffffff00 broadcast 128.15.136.255
>      inet 128.15.136.108 netmask 0xffffff00 broadcast 128.15.136.255
>      inet 128.15.136.102 netmask 0xffffff00 broadcast 128.15.136.255
>      media: Ethernet autoselect (1000baseT <full-duplex>)
>      status: active
> 
> I can ping the internal IP (but I realize that is probably a useless
> test...)
> root at write /etc]> ping 128.15.136.105
> PING 128.15.136.105 (128.15.136.105): 56 data bytes
> 64 bytes from 128.15.136.105: icmp_seq=0 ttl=64 time=0.024 ms
> 64 bytes from 128.15.136.105: icmp_seq=1 ttl=64 time=0.015 ms
> ^C
> --- 128.15.136.105 ping statistics ---
> 2 packets transmitted, 2 packets received, 0.0% packet loss
> round-trip min/avg/max/stddev = 0.015/0.019/0.024/0.005 ms
> 
> Attempting to ping the router:
> root at write /etc]> ping 128.15.136.254
> PING 128.15.136.254 (128.15.136.254): 56 data bytes
> ping: sendto: Host is down
> ping: sendto: Host is down
> ping: sendto: Host is down
> ping: sendto: Host is down
> ^C
> --- 128.15.136.254 ping statistics ---
> 9 packets transmitted, 0 packets received, 100.0% packet loss
> 
> 
> The only thing that seems to solve this problem is to either reboot, or
> do an "ifconfig down/up":
> 
> root at write /etc]> ifconfig igb1 down
> root at write /etc]> ifconfig igb1 up
> root at write /etc]> ping 128.15.136.254
> PING 128.15.136.254 (128.15.136.254): 56 data bytes
> 64 bytes from 128.15.136.254: icmp_seq=1 ttl=255 time=1.015 ms
> 64 bytes from 128.15.136.254: icmp_seq=2 ttl=255 time=0.217 ms
> 64 bytes from 128.15.136.254: icmp_seq=3 ttl=255 time=0.278 ms
> 64 bytes from 128.15.136.254: icmp_seq=4 ttl=255 time=0.238 ms
> ^C
> --- 128.15.136.254 ping statistics ---
> 5 packets transmitted, 4 packets received, 20.0% packet loss
> round-trip min/avg/max/stddev = 0.217/0.437/1.015/0.334 ms
> 
> I was able to run tcpdump during all of this, and it showed *nothing*
> between the system and the switch until I ran ifconfig igb1 down/up, and
> then you see the CDP and Spanning Tree traffic.
> 
> The networking team here has told me there are no errors on the switch, 
> or the port I am on, and they even moved me from one port to another, 
> but this is still happening on a fairly regular basis now that I've 
> added more backup clients.
> 
> Is this a possible bug with my hardware and the intel driver? I have a 
> pcap file and more system information that might provide a lot more 
> information, but I don't want to send that out to a mailing list.

You may want to pay attention to the current discussions regarding the
Intel drivers (em and igb).

Can you post the output of lspci -vvv ?
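
If lspci is not installed (on FreeBSD it comes from the pciutils port),
pciconf -lv from the base system gives similar information. Below is a
minimal sketch of a script you could run from the KVM console the next
time the interface hangs. The function names and the choice of extra
commands (sysctl dev.igb.N, netstat -m, vmstat -i) are only my assumptions
about what might be useful, so adjust as needed.

```sh
#!/bin/sh
# Sketch only: gather NIC diagnostics while the interface is wedged.
# Function names and the report layout are illustrative assumptions.

# Print the first command name from the arguments that exists on this system.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Print a header, then the command's output; keep going if the command fails.
capture() {
    echo "=== $* ==="
    "$@" 2>&1 || echo "(command failed or unavailable)"
    echo
}

# Collect PCI, interface, driver, mbuf, and interrupt information.
collect() {
    iface="$1"
    unit="${iface#igb}"        # igb1 -> 1, used for the dev.igb.N sysctl tree
    case "$(first_available lspci pciconf)" in
        lspci)   capture lspci -vvv ;;
        pciconf) capture pciconf -lv ;;
        *)       echo "no PCI listing tool found" ;;
    esac
    capture ifconfig "$iface"
    capture sysctl "dev.igb.$unit"
    capture netstat -m
    capture vmstat -i
}

# Only collect when an interface name is given on the command line.
if [ "$#" -ge 1 ]; then
    collect "$1"
fi
```

Saved as, say, collect.sh (a hypothetical name), it could be run as
"sh collect.sh igb1 > igb1-report.txt" before you bounce the interface,
so the report reflects the hung state.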

Sean