em0 watchdog timeout 7-stable
Greg Byshenk
freebsd at byshenk.net
Wed May 13 16:42:11 UTC 2009
As a followup to my own previous message, I continue to have annoying
problems with "em?: watchdog timeout" on one of my machines (now running
7.2-STABLE as of 2009-05-08).
I have discontinued using the on-board (em, copper) NICs, and replaced
the original fibre NIC with a newer model, but the problem persists.
I've also set
hw.pci.enable_msix=0
hw.pci.enable_msi=0
hw.em.rxd=1024
hw.em.txd=1024
net.inet.tcp.tso=0
...as suggested in some discussions of this problem, and set the em1
interface to 'polling', all to no avail. Frequently, though irregularly
(once or twice a day), the console begins to display
em1: watchdog timeout -- resetting
em1: watchdog timeout -- resetting
em1: watchdog timeout -- resetting
the network is down, and the machine locks up.
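For anyone trying the same workarounds, here is a minimal sketch of where these settings would normally live on a stock FreeBSD 7.x system. The post does not say how the values were applied, so the file placement is an assumption, and the address in the rc.conf line is a placeholder. The hw.* entries are boot-time loader tunables, while net.inet.tcp.tso is a runtime sysctl.

```
# /boot/loader.conf  (read at boot; takes effect after a reboot)
hw.pci.enable_msix=0
hw.pci.enable_msi=0
hw.em.rxd=1024
hw.em.txd=1024

# /etc/sysctl.conf  (runtime sysctl; can also be set with sysctl(8))
net.inet.tcp.tso=0

# /etc/rc.conf  (polling needs a kernel built with "options DEVICE_POLLING")
# The address below is a placeholder, not taken from the post.
ifconfig_em1="inet 192.0.2.10 netmask 255.255.255.0 polling"
```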
[Note: I am getting 'em1' now instead of 'em0' as previously, but this
is due to changing all of the NICs, which led to a different numbering;
the timeout is still occurring on the (main) interface, the fibre
gigabit connection.]
What is particularly perverse (IMO) is that, since changing the NIC to
the newer model (and updating the kernel), I can no longer break to the
debugger when the lockup occurs (there is no response to the break) --
but I _can_ shut the machine down cleanly via hardware (a touch of the
power switch sends 'shutdown', and the machine shuts down cleanly --
after killing off processes waiting on network i/o).
The machine is running nfs and samba (3.2.10, from ports), and pretty
much nothing else.
Anyone have any ideas about this...? I'm going mad with this.
-greg byshenk
# pciconf -lvb
[...]
em1 at pci0:7:1:0: class=0x020000 card=0x10028086 chip=0x10118086 rev=0x01 hdr=0x00
vendor = 'Intel Corporation'
device = '82545EM Gigabit Ethernet Controller (Fiber)'
class = network
subclass = ethernet
bar [10] = type Memory, range 64, base 0xda300000, size 131072, enabled
bar [20] = type I/O Port, range 32, base 0x5000, size 64, enabled
[...]
# vmstat -i
interrupt                          total       rate
irq4: sio0                          1666          0
irq6: fdc0                            10          0
irq14: ata0                           58          0
irq16: skc0 em0                  1437801         98
irq18: twa0                       846981         57
irq24: em1                       4378650        299
cpu0: timer                     29258004       1999
cpu1: timer                     29249758       1999
cpu3: timer                     29249816       1999
cpu7: timer                     29249779       1999
cpu2: timer                     29249729       1999
cpu4: timer                     29249852       1999
cpu6: timer                     29249851       1999
cpu5: timer                     29249814       1999
Total                          240671769      16450
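
One way to check whether interrupts on em1 stall before the watchdog fires would be to sample these counters periodically and log them. A minimal sketch in sh/awk follows. The function name irq_rate is hypothetical, and it assumes the column layout shown above: device names first, then total and rate as the last two fields.

```sh
#!/bin/sh
# irq_rate DEVICE: read `vmstat -i` output on stdin and print the
# "total rate" pair for the interrupt line that lists DEVICE.
# Assumes the last two fields of each irq line are total and rate.
irq_rate() {
    awk -v dev="$1" '
        /^irq/ {
            # Fields 2 .. NF-2 are device names (an IRQ may be shared).
            for (i = 2; i <= NF - 2; i++)
                if ($i == dev) print $(NF - 1), $NF
        }'
}
```

Run as `vmstat -i | irq_rate em1`, for example from a loop that also prints `date`, so that the last samples before a lockup survive in a log on local disk.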
On Sun, Apr 26, 2009 at 02:50:08PM +0200, Greg Byshenk wrote:
> I have one machine that is seeing watchdog timeouts on em0, running 7-STABLE
> amd64 as of 2009.04.19, and also some other more perverse errors.
>
> Twice now in the last 48 hours, this machine has become unreachable via the
> network, and connecting to the console shows an endless string of
>
> [...]
> em0: watchdog timeout -- resetting
> em0: watchdog timeout -- resetting
> em0: watchdog timeout -- resetting
>
> messages. The machine is almost locked up. That is, I can get a login
> prompt, but can go no further than typing in a username; after the
> username, no password prompt, and nothing further. The only option is
> to hard reset the machine or to drop to debugger and reboot.
>
> Now the "perverse" part. After restarting, the system partition is no
> more.
>
> Background detail: the machine is a fileserver, with a 3Ware 9650SE-16ML
> SATA controller, connected to 16 1TB SATA drives, this configured as
> a 14-drive RAID10 array (+ 2 hot spares), with a 50GB system partition
> and 6.5TB data partition. The system partition is configured as da0,
> with one slice and more or less standard partitions for / /var /tmp, etc.
> (the data partition of the array is sliced with gpt).
>
> The issue here is that, upon restart, all partition information on da0
> seems to have disappeared, and restarting results in a "no operating
> system found" message, and a failure to boot (obviously).
>
> But all of the data is still present. If I boot into rescue mode,
> recreate da0s1, mark it bootable, and restore the bsdlabel, then
> everything works again. I can restart the machine, and it comes back
> up normally (it requires an fsck of everything on da0, but after that
> everything is back to normal).
>
> I don't know if this is two unrelated problems, or one problem with
> two symptoms, or something else. I think that I can safely say that
> it is not a problem with the 3Ware controller itself, as I replaced
> the controller with a spare (identical model), and the problem
> recurred. Additionally, I have an almost-identical configuration on
> four other machines, none of which are experiencing any problems.
> One thing that is different is that the other machines use
> Intel PRO/1000 PF (pci-e) NICs.
>
> Is there some known problem with the Intel 2572 fibre NIC? Or some
> potential interaction of it with the 3ware RAID controller?
>
> For the moment, I've set hw.pci.enable_msi=0 (as discussed in the
> threads on 7.2/bge), and am building a new kernel/world from sources
> csup'd one hour ago, but I'd really like to hear any ideas about this
> -- particularly the wiping of the label.
>
> Some information about the system:
>
>
> # /dev/da0s1:
> 8 partitions:
> #         size    offset    fstype   [fsize bsize bps/cpg]
>   a:   2097152         0    4.2BSD        0     0     0
>   b:   8388608   2097152      swap
>   c: 104856192         0    unused        0     0         # "raw" part, don't edit
>   d:   8388608  10485760    4.2BSD        0     0     0
>   e:   2097152  18874368    4.2BSD        0     0     0
>   f:  41943040  20971520    4.2BSD        0     0     0
>   g:  41941632  62914560    4.2BSD        0     0     0
>
>
> em0 at pci0:4:1:0: class=0x020000 card=0x10038086 chip=0x10018086 rev=0x02 hdr=0x00
> vendor = 'Intel Corporation'
> device = '2572 10/100/1000 Ethernet Controller (Fiber)'
> class = network
> subclass = ethernet
> bar [10] = type Memory, range 32, base 0xda000000, size 131072, enabled
> bar [14] = type Memory, range 32, base 0xda020000, size 65536, enabled
>
> twa0 at pci0:9:0:0: class=0x010400 card=0x100413c1 chip=0x100413c1 rev=0x01 hdr=0x00
> device = '9650SE Series PCI-Express SATA2 Raid Controller'
> class = mass storage
> subclass = RAID
> bar [10] = type Prefetchable Memory, range 64, base 0xd8000000, size 33554432, enabled
> bar [18] = type Memory, range 64, base 0xda300000, size 4096, enabled
> bar [20] = type I/O Port, range 32, base 0x3000, size 256, enabled
> cap 01[40] = powerspec 2 supports D0 D1 D2 D3 current D0
> cap 05[50] = MSI supports 32 messages, 64 bit
> cap 10[70] = PCI-Express 1 legacy endpoint
>
--
greg byshenk - gbyshenk at byshenk.net - Leiden, NL