XL driver checksum producing corrupted but checksum-correct packets

Sat Jan 24 08:09:34 PST 2004

On Fri, 23 Jan 2004, Matthew Dillon wrote:

>     I tracked down an occasional buildworld failure on DragonFly to my
>     XL driver, which is synchronized to 4.x's XL driver.

It would be very helpful if you could do the following:

(1) See if you can reproduce this using something other than NFS --
    perhaps netperf using UDP_STREAM or the like, between that machine and
    another machine.  This would give us a more reproducible workload
    than "builds", and hopefully one that is less sensitive to things like
    context switching, etc.

(2) See if you can reproduce this with a stock 4.9-RELEASE kernel (or
    4-STABLE).  While the drivers are similar between 4.x and DFBSD, there
    are actually quite a few structural changes in the DFBSD version.
    Maybe it would make sense to try backing out the local DFBSD changes
    to the base FreeBSD version, even if you don't try a complete FreeBSD
    system, to see if they are the cause.  It's difficult to diff the two
    because of reorganization and style changes.
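
For (1), the point of a synthetic workload is that the receiver can tell
a damaged datagram from a good one even when the UDP checksum says it is
fine.  A minimal sketch of one way to do that, assuming a test tool of
your own rather than netperf: each payload carries a sequence number
followed by bytes generated from it, so the receiver can regenerate the
expected bytes and report where they first differ.  The names
fill_payload() and check_payload() and the 4-byte header layout are
hypothetical, made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header: a 32-bit big-endian sequence number. */
#define HDR_LEN ((size_t)4)

/* One step of the xorshift32 generator; state must be nonzero. */
static uint32_t
xorshift32(uint32_t *state)
{
	uint32_t x = *state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	*state = x;
	return (x);
}

/* Derive a nonzero generator seed from the sequence number. */
static uint32_t
seed_for(uint32_t seq)
{
	return ((seq * 2654435761u) | 1u);
}

/* Sender side: write the header, then a pattern derived from seq. */
static void
fill_payload(uint8_t *buf, size_t len, uint32_t seq)
{
	uint32_t state = seed_for(seq);
	size_t i;

	assert(len >= HDR_LEN);
	buf[0] = (uint8_t)(seq >> 24);
	buf[1] = (uint8_t)(seq >> 16);
	buf[2] = (uint8_t)(seq >> 8);
	buf[3] = (uint8_t)seq;
	for (i = HDR_LEN; i < len; i++)
		buf[i] = (uint8_t)xorshift32(&state);
}

/*
 * Receiver side: return -1 if the payload matches the pattern implied
 * by its own header, otherwise the offset of the first bad byte.  If
 * the header itself was damaged, the reported offset will usually be
 * just past the header rather than inside it.
 */
static long
check_payload(const uint8_t *buf, size_t len)
{
	uint32_t seq, state;
	size_t i;

	if (len < HDR_LEN)
		return (0);
	seq = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	    ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
	state = seed_for(seq);
	for (i = HDR_LEN; i < len; i++)
		if (buf[i] != (uint8_t)xorshift32(&state))
			return ((long)i);
	return (-1);
}
```

The sender would call fill_payload() before each sendto(), and the
receiver would call check_payload() after each recvfrom() and log the
offset and datagram length on any mismatch.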

> xl0 at pci1:6:0:   class=0x020000 card=0x764610b7 chip=0x764610b7 rev=0x30 hdr=0x00

Does this card have a product name, or is it one of those chips embedded
in a motherboard without a separate name?

I took a look through the xl cards/chips on my various machines, and was
unable to find anything that had remotely the same card or chip ID.  I did
some high-volume packet flows between them with hardware checksumming
disabled and didn't see any corrupted UDP packets, but the workloads I'm
using sound pretty different.  Knowing it could be reproduced using a
simpler workload (and the specifics) would be good.

FYI, I checked the Linux driver for these cards, and didn't see mention of
any quirks for the particular chips/card you're using.  The only thing of
note in the Linux driver was the following:

	/* Check the PCI latency value.  On the 3c590 series the latency timer
	   must be set to the maximum value to avoid data corruption that occurs
	   when the timer expires during a transfer.  This bug exists the Vortex
	   chip only. */
	if (pdev) {
		u8 pci_latency;
		u8 new_latency = (drv_flags & IS_VORTEX) ? 248 : 32;

		pci_read_config_byte(pdev, PCI_LATENCY_TIMER, &pci_latency);
		if (pci_latency < new_latency) {
			printk(KERN_INFO "%s: Overriding PCI latency"
				   " timer (CFLT) setting of %d, new value is %d.\n",
				   dev->name, pci_latency, new_latency);
			pci_write_config_byte(pdev, PCI_LATENCY_TIMER, new_latency);
		}
	}

The rate at which you have failures sounds like it could be a similar
issue, however: an occasional collision between a timer and DMA.  NFS is
often a mix of small RPCs handling lookups and attributes, and larger RPCs
carrying data.  Using netperf or a related tool might help you identify
whether one of those is more likely to cause the failure.
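
To make the small-versus-large comparison concrete, a test receiver could
keep separate counts for each size class.  This is only a sketch: the
struct and function names are hypothetical, and the 1472-byte threshold
assumes IPv4 without options over a 1500-byte Ethernet MTU, i.e. the
largest UDP payload that fits in one unfragmented frame.  The
"corrupted" flag is whatever verdict the receiver's own payload check
produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed threshold: 1500 MTU - 20 (IPv4 header) - 8 (UDP header). */
#define SINGLE_FRAME_MAX ((size_t)1472)

/* Index 0 counts single-frame datagrams, index 1 fragmented ones. */
struct size_tally {
	unsigned long seen[2];
	unsigned long bad[2];
};

static int
size_class(size_t len)
{
	return (len > SINGLE_FRAME_MAX);
}

/* Record one received datagram and whether it failed verification. */
static void
tally_record(struct size_tally *t, size_t len, int corrupted)
{
	int c = size_class(len);

	assert(t != NULL);
	t->seen[c]++;
	if (corrupted)
		t->bad[c]++;
}

/* Print per-class totals so the two failure rates can be compared. */
static void
tally_print(const struct size_tally *t)
{
	printf("single-frame: %lu bad of %lu\n", t->bad[0], t->seen[0]);
	printf("fragmented:   %lu bad of %lu\n", t->bad[1], t->seen[1]);
}
```

If nearly all of the bad datagrams land in one class, that would narrow
down which kind of transfer to look at first.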

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org      Senior Research Scientist, McAfee Research