svn commit: r323516 - in head/sys: dev/bnxt dev/e1000 kern net sys

Sat Sep 16 09:41:30 UTC 2017

On Sat, 16 Sep 2017, Alexander Leidinger wrote:

> Quoting Bruce Evans <brde at optusnet.com.au> (from Sat, 16 Sep 2017 13:46:37 
> +1000 (EST)):
>
>> It gives lesser breakage here:
>> - with an old PCI em, an error that occurs every few makeworlds over nfs now
>>   hangs the hardware.  It used to be recovered from after about 10 seconds.
>>   This only happened once.  I then applied my old fix which ignores the
>>   error better so as to recover from it immediately.  This seems to work as
>>   before.
>
> As I also have an em device which switches into non-working state: what's the 
> patch you have for this? I would like to see if your change also helps my 
> device to get back into working shape again.

X Index: em_txrx.c
X ===================================================================
X --- em_txrx.c	(revision 323636)
X +++ em_txrx.c	(working copy)
X @@ -640,9 +640,20 @@
X 
X  		/* Make sure bad packets are discarded */
X  		if (errors & E1000_RXD_ERR_FRAME_ERR_MASK) {
X +#if 0
X  			adapter->dropped_pkts++;
X -			/* XXX fixup if common */
X  			return (EBADMSG);
X +#else
X +			/*
X +			 * XXX the above error handling is worse than none.
X +			 * First it drops 'i' packets before the current
X +			 * one and doesn't count them.  Then it returns an
X +			 * error.  iflib can't really handle this error.
X +			 * It just resets, and this usually drops many more
X +			 * packets (without counting them) and much time.
X +			 */
X +			printf("lem: frame error: ignored\n");
X +#endif
X  		}
X 
X  		ri->iri_frags[i].irf_flid = 0;

This is for old em.  nfs doesn't seem to notice the dropped packet(s) after
this.

I think the comment "fixup if common" means "this error should actually
be handled if it occurs enough to matter".

I removed the increment of the dropped packet count because with the change
no packets are dropped directly here.  I think the error applies only to the
current packet, but returning in the old code might drop more than 1 packet.
Debugging code seems to show no more than 1 packet at a time having an error.
I think returning drops good packets after the bad one and also leaves the
state inconsistent, so that it takes almost a reset to recover.
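
As a rough illustration of the difference between returning the error and
ignoring it, here is a toy model.  It is not the driver code: all of the
toy_* names are made up for this sketch, and the "reset" is modelled simply
as losing the rest of the ring without counting it, which is an assumption
about the effect rather than a description of what iflib actually does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real e1000/iflib types. */
struct toy_desc {
	int frame_err;	/* nonzero if the hardware flagged a frame error */
	int eop;	/* nonzero on the last descriptor of a packet */
};

struct toy_stats {
	unsigned delivered;	/* packets handed up the stack */
	unsigned counted_drops;	/* drops that were accounted for */
	unsigned resets;	/* times the caller gave up and reset */
};

/*
 * Consume descriptors for one packet.  With 'bail' set, mimic the old
 * handling: count one drop and return EBADMSG on a frame error.  With
 * 'bail' clear, mimic the patch: ignore the error and keep going.
 */
static int
toy_rx_one(const struct toy_desc *ring, size_t n, size_t *pos, int bail,
    struct toy_stats *st)
{
	while (*pos < n) {
		const struct toy_desc *d = &ring[(*pos)++];

		if (d->frame_err && bail) {
			st->counted_drops++;
			return (EBADMSG);
		}
		if (d->eop) {
			st->delivered++;
			return (0);
		}
	}
	return (0);	/* ran out of descriptors mid-packet */
}

/* Process a whole ring; any error is modelled as a reset (assumption). */
static void
toy_rx_ring(const struct toy_desc *ring, size_t n, int bail,
    struct toy_stats *st)
{
	size_t pos = 0;

	assert(ring != NULL || n == 0);
	while (pos < n) {
		if (toy_rx_one(ring, n, &pos, bail, st) != 0) {
			/* The rest of the ring is lost and not counted. */
			st->resets++;
			return;
		}
	}
}
```

With a ring of 4 single-descriptor packets where only the second has a frame
error, the bail variant in this model delivers 1 packet, counts 1 drop and
silently loses the 2 good packets behind it, while the ignore variant passes
all 4 up and leaves the bad one for the upper layers to reject.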

X @@ -703,8 +714,12 @@
X 
X  		/* Make sure bad packets are discarded */
X  		if (staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK) {
X +#if 0
X  			adapter->dropped_pkts++;
X  			return EBADMSG;
X +#else
X +			printf("em: frame error: ignored\n");
X +#endif
X  		}
X 
X  		ri->iri_frags[i].irf_flid = 0;

This is for newer em.  I haven't noticed any problems with that (except it
has 27 usec higher latency).

Bruce