XL driver checksum producing corrupted but checksum-correct packets
Robert Watson
rwatson at freebsd.org
Sat Jan 24 08:09:34 PST 2004
On Fri, 23 Jan 2004, Matthew Dillon wrote:
> I tracked down an occasional buildworld failure on DragonFly to my
> XL driver, which is synchronized to 4.x's XL driver.
It would be very helpful if you could do the following:
(1) See if you can reproduce this using something other than NFS --
perhaps netperf using UDP_STREAM or the like, between that machine and
another machine. This would give us a more reproducible workload
than "builds", and hopefully one that is less sensitive to things like
context switching, etc.
(2) See if you can reproduce this with a stock 4.9-RELEASE kernel (or
4-STABLE). While the drivers are similar between 4.x and DFBSD, there
are actually quite a few structural changes in the DFBSD version.
Maybe it would make sense to try backing out the local DFBSD changes
to the base FreeBSD version, even if not trying a completely FreeBSD
system, to see if they are the cause. It's difficult to diff the two
because of reorganization and style changes.
> xl0 at pci1:6:0: class=0x020000 card=0x764610b7 chip=0x764610b7 rev=0x30 hdr=0x00
Does this card have a product name, or is it one of those chips embedded
in a motherboard without a separate name?
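For reference, the chip= and card= values in the pciconf line above pack a 16-bit device ID in the upper half and a 16-bit vendor ID in the lower half, so 0x764610b7 splits into vendor 0x10b7 (3Com) and device 0x7646. A minimal sketch of that decoding follows; split_pci_id and struct pci_id are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical container for the two halves of a pciconf ID value. */
struct pci_id {
    uint16_t vendor;   /* low 16 bits of chip=/card= */
    uint16_t device;   /* high 16 bits of chip=/card= */
};

/* Split a 32-bit chip= or card= value into its vendor and device parts. */
static struct pci_id split_pci_id(uint32_t id)
{
    struct pci_id r;

    r.vendor = (uint16_t)(id & 0xffffu);
    r.device = (uint16_t)(id >> 16);
    return r;
}
```

The device half is the number to look for in a driver's device table when checking whether a quirk is keyed to this particular part.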
I took a look through the xl cards/chips on my various machines, and was
unable to find anything that had remotely the same card or chip ID. I did
some high-volume packet flows between them with hardware checksumming
disabled and didn't see any corrupted UDP packets, but the workloads I'm
using sound pretty different. Knowing whether it can be reproduced using a
simpler workload (and the specifics) would be good.
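Since the problem packets pass the UDP checksum, a test workload has to verify the payload itself to notice them. One possible approach, sketched below under my own assumptions, is to have the sender append a CRC-32 of each datagram's contents and have the receiver recompute it. The names fill_datagram and check_datagram and the layout (4-byte sequence number, pattern bytes, 4-byte CRC trailer) are hypothetical and not part of netperf or any existing tool.

```c
#include <stddef.h>
#include <stdint.h>

#define SEQ_LEN     4   /* big-endian sequence number at the front */
#define TRAILER_LEN 4   /* big-endian CRC-32 at the end */

/* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected form). */
static uint32_t crc32_buf(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Sender side: write the sequence number, a sequence-dependent byte
 * pattern, and a CRC-32 trailer covering everything before it.
 * Returns 0 on success, -1 if the buffer is too short.
 */
static int fill_datagram(uint8_t *buf, size_t len, uint32_t seq)
{
    if (len < SEQ_LEN + TRAILER_LEN)
        return -1;
    put_be32(buf, seq);
    for (size_t i = SEQ_LEN; i < len - TRAILER_LEN; i++)
        buf[i] = (uint8_t)(seq + i);
    put_be32(buf + len - TRAILER_LEN, crc32_buf(buf, len - TRAILER_LEN));
    return 0;
}

/*
 * Receiver side: recompute the CRC-32 and compare it with the trailer.
 * Returns 0 if the datagram is intact, -1 if it is short or corrupted.
 */
static int check_datagram(const uint8_t *buf, size_t len)
{
    if (len < SEQ_LEN + TRAILER_LEN)
        return -1;
    if (crc32_buf(buf, len - TRAILER_LEN) != get_be32(buf + len - TRAILER_LEN))
        return -1;
    return 0;
}
```

A sender would call fill_datagram() before each sendto(), and the receiver would call check_datagram() on whatever recvfrom() returns, logging the sequence number and length of any failure. Varying the datagram length would also show whether small or large packets are more likely to be hit.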
FYI, I checked the Linux driver for these cards, and didn't see mention of
any quirks for the particular chips/card you're using. The only thing of
note in the Linux driver was the following:
    /* Check the PCI latency value.  On the 3c590 series the latency timer
       must be set to the maximum value to avoid data corruption that occurs
       when the timer expires during a transfer.  This bug exists the Vortex
       chip only. */
    if (pdev) {
        u8 pci_latency;
        u8 new_latency = (drv_flags & IS_VORTEX) ? 248 : 32;

        pci_read_config_byte(pdev, PCI_LATENCY_TIMER, &pci_latency);
        if (pci_latency < new_latency) {
            printk(KERN_INFO "%s: Overriding PCI latency"
                   " timer (CFLT) setting of %d, new value is %d.\n",
                   dev->name, pci_latency, new_latency);
            pci_write_config_byte(pdev, PCI_LATENCY_TIMER, new_latency);
        }
    }
The rate at which you have failures sounds like it could be a similar
issue, however -- an occasional collision between a timer and DMA. NFS is
often a mix of small RPCs handling lookups and attributes, and larger RPCs
carrying data. Using netperf or a related tool might help you identify
whether one of those is more likely to cause the failure.
Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org    Senior Research Scientist, McAfee Research