recent stability problems with fxp driver
John Polstra
jdp at polstra.com
Fri Sep 12 13:52:00 PDT 2003
On 12-Sep-2003 Mike Tancsa wrote:
> At 12:26 PM 12/09/2003, Info Account wrote:
>>I've spent the past four days or so updating machines here to 4.8/9-stable via
>>cvsup, and have done a complete make buildworld/kernel on each machine (some
>>SMP, some single processor). Something seems to be broken in the latest fxp
>>driver: on every machine (different motherboards and hardware configs), heavy
>>network traffic through the fxp NICs causes timeouts and random kernel panics.
>
> I have a few boxes pushing over 50Mb with fxp cards and haven't seen this
> problem. What type of fxp cards do you have? What does
> pciconf -v -l
> show for the Intel types?
>
> Also, I have found in the past that I would see this behavior if I changed
> NICs and didn't do a PCI config reset in the motherboard BIOS. There is
> something about Intel NICs, Adaptec cards, and 3ware cards that particularly
> requires this. Also, make sure that you don't have any duplex mismatches on
> the NICs. I have seen cases where excessive errors combined with high traffic
> cause panics.
>
> Also, please post the actual error messages on each of the machines.
The problem is real, at least on some hardware. I had to give up on
using the two integrated fxp devices on my Dell 1550 -- which is a
real bummer, since it's a 1U box that only has two PCI slots. With
the latest -stable driver, I couldn't fetch a 560 MB file from
another machine on the LAN using FTP without killing the fxp device.
The messages vary in detail, but this will give you the general idea:
Sep 12 10:18:22 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x90 0x0
Sep 12 10:18:31 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x90 0x0
Sep 12 10:18:32 thin su: jdp to root on /dev/ttyp1
Sep 12 10:18:39 thin /kernel: fxp0: DMA timeout
Sep 12 10:18:39 thin last message repeated 2 times
Sep 12 10:18:49 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:18:51 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:18:54 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:18:56 thin /kernel: fxp0: device timeout
Sep 12 10:18:56 thin /kernel: fxp0: DMA timeout
Sep 12 10:19:10 thin last message repeated 5 times
Sep 12 10:19:10 thin /kernel: fxp0: SCB timeout: 0x1 0x20 0x80 0x0
Sep 12 10:19:13 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:19:14 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:19:15 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:19:16 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:19:36 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:19:38 thin /kernel: fxp0: device timeout
Sep 12 10:19:38 thin /kernel: fxp0: DMA timeout
Sep 12 10:19:38 thin last message repeated 2 times
Sep 12 10:19:52 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:19:54 thin /kernel: fxp0: device timeout
Sep 12 10:19:54 thin /kernel: fxp0: DMA timeout
Sep 12 10:19:54 thin last message repeated 2 times
Sep 12 10:20:00 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:20:21 thin /kernel: fxp0: device timeout
Sep 12 10:20:21 thin /kernel: fxp0: DMA timeout
Sep 12 10:20:21 thin last message repeated 2 times
Sep 12 10:20:29 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:20:35 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:20:35 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:21:04 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x90 0x0
Sep 12 10:21:09 thin /kernel: fxp0: device timeout
Sep 12 10:21:09 thin /kernel: fxp0: DMA timeout
Sep 12 10:21:09 thin last message repeated 2 times
Sep 12 10:21:09 thin /kernel: fxp0: command queue timeout
Sep 12 10:21:12 thin shutdown: reboot by jdp:
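For anyone comparing logs like the one above across machines, here is a minimal Python sketch that tallies fxp kernel messages by type. It assumes the BSD syslog line layout shown above and treats a "last message repeated N times" line as N extra copies of the line just before it. The function and regular expressions are hypothetical helpers, not part of FreeBSD.

```python
import re
from collections import Counter

# Hypothetical patterns, assuming the layout seen above:
#   "Mon DD HH:MM:SS host /kernel: fxp0: <message>[: <status bytes>]"
KERNEL_MSG = re.compile(r"/kernel: (fxp\d+): (.+?)(?::\s*0x.*)?$")
REPEATED = re.compile(r"last message repeated (\d+) times?$")


def tally_fxp_messages(lines):
    """Count fxp kernel messages by (device, message kind).

    The hex status bytes after "SCB timeout:" are dropped, so all SCB
    timeouts land in one bucket. A "last message repeated N times" line
    adds N more copies of the previous fxp message, if there was one.
    """
    counts = Counter()
    last = None  # key of the previous line, if it was an fxp message
    for line in lines:
        line = line.rstrip()
        rep = REPEATED.search(line)
        if rep:
            if last is not None:
                counts[last] += int(rep.group(1))
            continue
        m = KERNEL_MSG.search(line)
        if m:
            last = (m.group(1), m.group(2))
            counts[last] += 1
        else:
            # A non-fxp line (su, shutdown, ...) means a following
            # "repeated" line refers to that line, not to fxp.
            last = None
    return counts
```

It can be fed a log file directly, for example `tally_fxp_messages(open("/var/log/messages"))`, since file iteration yields lines and the trailing newline is stripped.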
This morning I tried regressing the driver to earlier versions in an
attempt to find the commit that broke it. Not good news:
RELENG_4_8_0_RELEASE bad
RELENG_4_7_0_RELEASE bad
RELENG_4_6_0_RELEASE bad
RELENG_4_4_0_RELEASE bad
RELENG_4_2_0_RELEASE bad
RELENG_4_1_0_RELEASE bad
The problem is easier to reproduce in recent versions of the
driver than in older versions. With the current -stable driver, I
can almost always kill the chips with a single transfer of that 560
MB file. With the 4.7.0 driver, it takes about 5 transfers before
it fails. With the 4.2.0 driver, it took 15+ transfers.
The devices are Intel 82559 chips. Here's their pciconf output:
none0@pci0:1:0: class=0x020000 card=0x00da1028 chip=0x12298086 rev=0x08 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82557/8/9 EtherExpress PRO/100(B) Ethernet Adapter'
    class    = network
    subclass = ethernet
none1@pci0:2:0: class=0x020000 card=0x00da1028 chip=0x12298086 rev=0x08 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82557/8/9 EtherExpress PRO/100(B) Ethernet Adapter'
    class    = network
    subclass = ethernet
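As a sketch of how to read the chip= and card= words above: pciconf packs the device ID into the high 16 bits and the vendor ID into the low 16 bits of chip=, and packs the subsystem ID and subsystem vendor ID the same way into card=. The small Python helper below (hypothetical names, with a vendor table covering only these two IDs) splits them apart, showing Intel chips (vendor 0x8086) on a Dell-branded board (subvendor 0x1028).

```python
# Hypothetical lookup table covering only the two IDs in the output above.
VENDORS = {0x8086: "Intel", 0x1028: "Dell"}


def split_pci_id(word):
    """Split a 32-bit pciconf ID word into (high 16 bits, low 16 bits).

    chip= is (device << 16) | vendor; card= is (subdevice << 16) | subvendor.
    """
    value = int(word, 16)  # int() accepts the "0x" prefix when base is 16
    return (value >> 16) & 0xFFFF, value & 0xFFFF


def describe(chip, card):
    device, vendor = split_pci_id(chip)
    subdevice, subvendor = split_pci_id(card)
    return (f"chip: vendor 0x{vendor:04x} ({VENDORS.get(vendor, '?')}), "
            f"device 0x{device:04x}; "
            f"card: subvendor 0x{subvendor:04x} ({VENDORS.get(subvendor, '?')}), "
            f"subdevice 0x{subdevice:04x}")


print(describe("0x12298086", "0x00da1028"))
# prints: chip: vendor 0x8086 (Intel), device 0x1229; card: subvendor 0x1028 (Dell), subdevice 0x00da
```

Since none0 and none1 report identical IDs, one call covers both ports.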
Maybe the problem really is in the Dell 1550. I have various flavors
of fxp card in several other machines, and I never have trouble with
them. I did check my firmware and BIOS versions, though, and they're
fully up to date. I have a suspicion that our driver may not be
dealing properly with Dell's power management or IPMI stuff, but it's
just a vague suspicion without any real evidence.
John
More information about the freebsd-stable mailing list