A flood of bacula traffic causes igb interface to go offline.
Mike Carlson
carlson39 at llnl.gov
Tue Feb 1 21:19:10 UTC 2011
Hey net@,
I have a FreeBSD 8.2-RC2 system running on an HP DL180 G6, using the
onboard Intel controller, and it is our primary Bacula storage node and
director node.
We have 96 clients that are scheduled to run at 8:30pm. After about 9-10
minutes of activity (mrtg graphs show about 50-60 MB/sec of incoming
traffic), the igb1 interface is no longer able to communicate with the
Cisco switch.
The interesting part is, the interface is still "up", there is nothing
in the kernel message buffer, and nothing relevant in the log file (just
syslogd and ldap errors because they cannot reach their respective
network servers). The system does not respond to the network until I
either reboot or run 'ifconfig igb1 down ; ifconfig igb1 up'. There is no
firewall loaded/configured.
Thankfully, I have a KVM over IP, so when this happens I can at least
run script(1) and capture some useful information.
ifconfig igb1
igb1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=1bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4>
ether 1c:c1:de:e9:fb:af
inet 128.15.136.105 netmask 0xffffff00 broadcast 128.15.136.255
inet 128.15.136.108 netmask 0xffffff00 broadcast 128.15.136.255
inet 128.15.136.102 netmask 0xffffff00 broadcast 128.15.136.255
media: Ethernet autoselect (1000baseT <full-duplex>)
status: active
I can ping the internal IP (but I realize that is probably a useless
test...)
root at write /etc]> ping 128.15.136.105
PING 128.15.136.105 (128.15.136.105): 56 data bytes
64 bytes from 128.15.136.105: icmp_seq=0 ttl=64 time=0.024 ms
64 bytes from 128.15.136.105: icmp_seq=1 ttl=64 time=0.015 ms
^C
--- 128.15.136.105 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.015/0.019/0.024/0.005 ms
Attempting to ping the router:
root at write /etc]> ping 128.15.136.254
PING 128.15.136.254 (128.15.136.254): 56 data bytes
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
ping: sendto: Host is down
^C
--- 128.15.136.254 ping statistics ---
9 packets transmitted, 0 packets received, 100.0% packet loss
The only thing that seems to solve this problem is to either reboot, or
do an "ifconfig down/up":
root at write /etc]> ifconfig igb1 down
root at write /etc]> ifconfig igb1 up
root at write /etc]> ping 128.15.136.254
PING 128.15.136.254 (128.15.136.254): 56 data bytes
64 bytes from 128.15.136.254: icmp_seq=1 ttl=255 time=1.015 ms
64 bytes from 128.15.136.254: icmp_seq=2 ttl=255 time=0.217 ms
64 bytes from 128.15.136.254: icmp_seq=3 ttl=255 time=0.278 ms
64 bytes from 128.15.136.254: icmp_seq=4 ttl=255 time=0.238 ms
^C
--- 128.15.136.254 ping statistics ---
5 packets transmitted, 4 packets received, 20.0% packet loss
round-trip min/avg/max/stddev = 0.217/0.437/1.015/0.334 ms
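As a stopgap until the root cause is found, the manual down/up workaround
could be automated. This is only a rough sketch, not something I have run
in production: the function names (gateway_reachable, bounce_interface,
check_once) and the *_CMD variables are made up for illustration, and the
ping flags assume FreeBSD's ping(8), where -t is a timeout in seconds.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: if the gateway stops answering pings,
# bounce the interface the same way as the manual workaround.
# All names below are illustrative assumptions, not an existing tool.

IFACE="${IFACE:-igb1}"
GATEWAY="${GATEWAY:-128.15.136.254}"
# FreeBSD ping(8): -c is the packet count, -t is an overall timeout in seconds.
PING_CMD="${PING_CMD:-ping -c 3 -t 5}"
IFCONFIG_CMD="${IFCONFIG_CMD:-ifconfig}"
LOG_CMD="${LOG_CMD:-logger -t igb-watchdog}"

# Succeeds (exit 0) if the gateway answers at least one ping.
gateway_reachable() {
    $PING_CMD "$GATEWAY" >/dev/null 2>&1
}

# Take the interface down and bring it back up.
bounce_interface() {
    $IFCONFIG_CMD "$IFACE" down
    sleep 1
    $IFCONFIG_CMD "$IFACE" up
}

# One check: returns 0 if the link was fine, 1 if a bounce was needed.
check_once() {
    if gateway_reachable; then
        return 0
    fi
    $LOG_CMD "$GATEWAY unreachable via $IFACE, bouncing interface"
    bounce_interface
    return 1
}
```

Calling check_once from a root cron job once a minute would shorten the
outage, and the logged timestamps might also help correlate the hangs
with the backup schedule. It does nothing to explain why the interface
stops passing traffic in the first place.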
I was able to run tcpdump during all of this, and it shows *nothing*
between the system and the switch until I run ifconfig igb1 down/up, and
then you see the CDP and Spanning Tree traffic.
The networking team here has told me there are no errors on the switch,
or the port I am on, and they even moved me from one port to another,
but this is still happening on a fairly regular basis now that I've
added more backup clients.
Is this a possible bug with my hardware and the Intel driver? I have a
pcap file and more system information that might help with diagnosis,
but I don't want to send that out to a mailing list.