Watchdog timeout em driver 8.2-R

Wed Apr 18 18:15:14 UTC 2012

Hi,

I first posted the following to the -stable list but got no
reply. Maybe someone here has some advice for me.

Switch: HP ProCurve 2910al
        The switch does passive LACP

Motherboard: Supermicro X8DTN+-F

NIC: Quad Port Card, e.g. em1:
     em1 at pci0:6:0:1: class=0x020000 card=0x125e15d9 chip=0x105e8086 rev=0x06 hdr=0x00
         vendor     = 'Intel Corporation'
         device     = 'HP NC360T PCIe DP Gigabit Server Adapter (n1e5132)'
         class      = network
         subclass   = ethernet
         bar   [10] = type Memory, range 32, base 0xfb9e0000, size 131072, enabled
         bar   [14] = type Memory, range 32, base 0xfb9c0000, size 131072, enabled
         bar   [18] = type I/O Port, range 32, base 0xcc00, size 32, enabled
         cap 01[c8] = powerspec 2  supports D0 D3  current D0
         cap 05[d0] = MSI supports 1 message, 64 bit enabled with 1 message
         cap 10[e0] = PCI-Express 1 endpoint max data 256(256) link x4(x4)
     ecap 0001[100] = AER 1 0 fatal 1 non-fatal 0 corrected
     ecap 0003[140] = Serial 1 002590ffff0484d8

I use CAT 6 cables and the switch and server are in the same cabinet.

OS: FreeBSD 8.2-RELEASE

rc.conf:
  ifconfig_em0="up"
  ifconfig_em1="up"
  ifconfig_em2="up"
  ifconfig_em3="up"
  cloned_interfaces="lagg0"
  ifconfig_lagg0="laggproto lacp laggport em0 laggport em1 laggport em2 laggport em3"
  ipv4_addrs_lagg0="192.168.80.20/24"

Which sysctls might be relevant?
I have set:
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendspace=65536
net.inet.tcp.recvspace=131072
kern.ipc.nmbclusters=230400
kern.maxvnodes=250000
kern.maxfiles=65536
kern.maxfilesperproc=32768
vfs.read_max=32

loader.conf: contains only ZFS-related settings

Except for swap, the whole system uses ZFS; swap is on a GEOM mirror.

Once in a while I see these messages in /var/log/messages:

   Apr 13 08:53:07 san02 kernel: em1: Watchdog timeout -- resetting
   Apr 13 08:53:07 san02 kernel: em1: Queue(0) tdh = 232, hw tdt = 190
   Apr 13 08:53:07 san02 kernel: em1: TX(0) desc avail = 31,Next TX to Clean = 221
   Apr 13 08:53:07 san02 kernel: em1: Link is Down
   Apr 13 08:53:07 san02 kernel: em1: link state changed to DOWN
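
To get a feel for how often each interface is affected, the resets can be
counted per interface. This is only a minimal sketch: it assumes the syslog
line layout shown in the messages above, and the function name
count_watchdog is made up for illustration.

```sh
#!/bin/sh
# Count "Watchdog timeout" resets per em(4) interface in a syslog file.
# count_watchdog is a hypothetical helper name, not an existing tool.
count_watchdog() {
    # Assumed line layout:
    #   Mon DD HH:MM:SS host kernel: emN: Watchdog timeout -- resetting
    # so the interface name (with a trailing colon) is field 6.
    awk '/kernel: em[0-9]+: Watchdog timeout/ {
             ifname = $6
             sub(/:$/, "", ifname)   # strip the trailing colon
             n[ifname]++
         }
         END { for (i in n) print i, n[i] }' "$1" | sort
}

# Usage: count_watchdog /var/log/messages
```

Each output line is an interface name followed by its number of watchdog
resets, which makes it easy to confirm whether only certain ports show up.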

Sometimes nothing happens for days; sometimes it happens under high network
load (NFSv3), sometimes multiple times a day. It always affects the same two
of the four interfaces (em1 and em3).

After that the NIC no longer has the ACTIVE flag; running ifconfig em1 up
solves the issue. But why does it lose the ACTIVE state, and why does the
NIC reset itself in the first place?
On the switch I see that the port matching em1 on the server has left
the trunk, so the missing ACTIVE flag is not lying 8-/
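
A quick way to spot ports that have dropped out of the lagg is to filter the
ifconfig output for member ports without the ACTIVE flag. This is a rough
sketch only: it assumes that ifconfig lagg0 prints one line per member in the
form "laggport: em1 flags=...", and inactive_laggports is a hypothetical name.

```sh
#!/bin/sh
# Print lagg member ports whose flag list does not include ACTIVE.
# Reads ifconfig(8) output on stdin.
# inactive_laggports is a hypothetical helper name, not an existing tool.
inactive_laggports() {
    # Assumed member line layout:
    #   laggport: em0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
    #   laggport: em1 flags=0<>
    awk '$1 == "laggport:" && $3 !~ /ACTIVE/ { print $2 }'
}

# Usage (on the server): ifconfig lagg0 | inactive_laggports
```

Any port it prints is a candidate for the ifconfig emN up workaround
described above.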

Googling turned up many postings about the same problem, and one site suggested
that this might be an ACPI problem, but there was nothing concrete, and the
postings I found were mostly about FreeBSD 7 and older.

Any pointers would be appreciated.
Thank you

   --lars

_______________________________________________
freebsd-stable at freebsd.org mailing list