[Fwd: Re: bge Ierr rate increase from 5.3R -> 6.1R]

Bruce Evans bde at zeta.org.au
Sat Dec 30 05:44:25 PST 2006


[cc changed from developers to net]

On Wed, 13 Dec 2006, Bruce Evans wrote:

> On Tue, 12 Dec 2006, Doug Barton wrote:
>
>> This guy's first message about this problem was very detailed, and he
>> seems highly motivated. Anyone want to help him out?
>
> This might be because bge now actually reports error statistics correctly
> (so the larger counts are correct), or because the fixes in -current
> aren't all in RELENG_6 (so the larger and smaller counts may both be
> incorrect).

I now think that this is a bug in mii (brgphy_service()) introduced
or enlarged since FreeBSD-5.early.  Under loads that can be handled,
my 5701 often gets a small number of input errors every second, and
returning immediately from brgphy_service() fixes these.  bge uses the
same logic as most NIC drivers for mii_tick(), and this is bad for
interrupt latency, but the problem here seems to be mangling of packets,
not interrupt latency itself (high loads just make it more likely that a
packet is in flight for brgphy_service() to mangle?).
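
For reference, the experiment amounts to an early return at the top of
the PHY service routine (shown against the standard mii PHY service
signature).  This is only a diagnostic hack -- it also kills link state
handling -- not a proposed fix:

%%%
static int
brgphy_service(struct mii_softc *sc, struct mii_data *mii, int cmd)
{
	/*
	 * XXX diagnostic hack, not a fix: skip all PHY servicing, so
	 * that mii_tick() effectively becomes a no-op for this PHY.
	 * With this in place the small per-second ierr counts on the
	 * 5701 go away, which is what points at brgphy_service().
	 */
	return (0);

	/* ... normal MII_POLLSTAT / MII_MEDIACHG / MII_TICK handling ... */
}
%%%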

> I can easily generate hundreds of thousands of input errors per second
> by flooding my '5701 and '5705 bge devices with hundreds of thousands
> more packets per second than they can handle.  The error counts are 0
> in ~5.2 but are correct in -current (dropped packets are counted
> as input errors).  The '5705 was most broken and hasn't been fixed in
> RELENG_6.

These happen under loads that can't be handled, and generally cause
thousands of input errors every second.  The hardware records dropped
packets separately from other input errors, but unfortunately all types
of input errors are counted together in if_ierrors, and I haven't done
more than muck around in ddb to separate them.
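
To see why the counts can't be separated from netstat alone, here is a
simplified paraphrase of the rx path (not the exact bge_rxeof() code):
frame errors and refill failures bump the same if_ierrors counter, and
the discards that the hardware counts in its statistics block get folded
into that counter as well.

%%%
	/* Simplified paraphrase of the std-ring rx path; index handling
	 * and mbuf recycling are omitted. */
	if (cur_rx->bge_flags & BGE_RXBDFLAG_ERROR) {
		ifp->if_ierrors++;	/* bad frame: CRC error, runt, ... */
		/* recycle the old mbuf into the slot and continue */
	} else if (bge_newbuf_std(sc, rxidx, NULL) == ENOBUFS) {
		ifp->if_ierrors++;	/* dropped: no mbuf to refill the slot */
		/* recycle the old mbuf into the slot and continue */
	}
%%%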

I have coalescence tuning sysctls (a sketch of the kind of thing follows
the list below) that among other things allow tuning to the threshold
where packets are dropped, keeping the load large and constant.
constant.  Playing with these showed that bge is not using its full
rx ring to reduce its latency requirements.  bge shouldn't drop packets
on input unless it gets more than BGE_STD_RX_RING_CNT = 512 rx
descriptors behind, but instead it drops packets when it gets the
following numbers behind:
- 20 under combined rx+tx load.  I can't explain this.  Apparently some
   other hardware resource is running out.
- 192 under rx-only load.  256 rx slots are lost by not allocating mbufs
   for them, since allocating mbufs would cost 1MB of memory and 1MB was
   considered large.  See bge_init_rx_ring_std() and the bogus SSLOTS macro.
   There used to be similar bogusness for jumbo buffers and JSLOTS, but now
   JSLOTS is unused garbage -- mbufs are now allocated for the full jumbo
   ring without worrying that this takes much more than 1MB.  I can't
   explain the remaining missing 64, but suspect BGE_CMD_RING_COUNT.
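
The coalescence tuning sysctls mentioned above aren't shown in this
mail; a rough sketch of the kind of thing, hung off the device's sysctl
tree, is below.  The bge_rx_coal_ticks and bge_rx_max_coal_bds softc
fields are real, but the helper and its name are only illustrative, and
new values only reach the host coalescing block when it is next
reprogrammed (e.g. on the next bge_init()).

%%%
/*
 * Sketch only: expose the host coalescing parameters as read-write
 * sysctls under dev.bge.N.  The softc fields exist; this helper is
 * illustrative and the values only take effect when the host
 * coalescing block is reprogrammed (e.g. on the next bge_init()).
 */
static void
bge_add_coal_sysctls(struct bge_softc *sc, device_t dev)
{
	struct sysctl_ctx_list *ctx = device_get_sysctl_ctx(dev);
	struct sysctl_oid_list *children =
	    SYSCTL_CHILDREN(device_get_sysctl_tree(dev));

	SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "rx_coal_ticks",
	    CTLFLAG_RW, &sc->bge_rx_coal_ticks, 0,
	    "usecs to wait before an rx interrupt");
	SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "rx_max_coal_bds",
	    CTLFLAG_RW, &sc->bge_rx_max_coal_bds, 0,
	    "rx descriptors to coalesce before an interrupt");
}
%%%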

With the limit of 192, polling at 1000 Hz breaks at an input rate of
192 kpps (192 buffered packets drained once per 1 ms poll).  Allocating
256 more std rx ring slots increases this limit to (192 + 256) * 1000 =
448 kpps, which is close to the maximum that I can test.  The extra 256
shouldn't be necessary, but they are useful for avoiding various latency
bugs.

bge also has a "mini" rx ring which FreeBSD doesn't use.  I don't really
understand this or the interaction of the separate rx rings, but hope
that the mini ring can be used to handle small packets and would only
need an mbuf (not an mbuf cluster) for each packet.  With it and the
jumbo ring, the total hardware buffering would be 1024 (mini) +
512 (std) + 256 (jumbo) descriptors, with the 1024-entry event ring only
used to communicate with the host, so its size doesn't really limit
buffering.  Meeting a latency requirement of 1024+512 tiny-packet times
is much easier than meeting one of 192 or 20 tiny-packet times.  (I only
actually saw the limits of 20 and 192 for full-sized (non-jumbo)
packets.)
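
To make the "mbuf, not mbuf cluster" point concrete, here is a purely
hypothetical refill routine for the mini ring, by analogy with
bge_newbuf_std().  FreeBSD's bge has no mini-ring support, so the
bge_rx_mini_ring softc member and the omitted dma plumbing are
assumptions, not real code; only the descriptor layout and the
MGETHDR-instead-of-cluster idea are the point.

%%%
/*
 * Hypothetical sketch (no such code exists in the driver): refill one
 * mini-ring slot with a plain header mbuf instead of a 2K cluster.
 * The bge_rx_mini_ring array and the dma map handling are assumed.
 */
static int
bge_newbuf_mini(struct bge_softc *sc, int i)
{
	struct mbuf *m;
	struct bge_rx_bd *r;

	MGETHDR(m, M_DONTWAIT, MT_DATA);	/* ~MHLEN bytes, no cluster */
	if (m == NULL)
		return (ENOBUFS);
	m->m_len = m->m_pkthdr.len = MHLEN;

	r = &sc->bge_ldata.bge_rx_mini_ring[i];	/* assumed softc member */
	/* dma load and host address setup omitted ... */
	r->bge_flags = BGE_RXBDFLAG_END | BGE_RXBDFLAG_MINI_RING;
	r->bge_len = m->m_len;
	r->bge_idx = i;
	return (0);
}
%%%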

Fix for most of the 192 limit and nearby bogusness:

%%%
Index: if_bge.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/bge/if_bge.c,v
retrieving revision 1.172
diff -u -2 -r1.172 if_bge.c
--- if_bge.c	26 Dec 2006 18:33:55 -0000	1.172
+++ if_bge.c	28 Dec 2006 18:15:44 -0000
@@ -811,10 +814,4 @@
  }

-/*
- * The standard receive ring has 512 entries in it. At 2K per mbuf cluster,
- * that's 1MB or memory, which is a lot. For now, we fill only the first
- * 256 ring entries and hope that our CPU is fast enough to keep up with
- * the NIC.
- */
  static int
  bge_init_rx_ring_std(struct bge_softc *sc)
@@ -822,8 +819,8 @@
  	int i;

-	for (i = 0; i < BGE_SSLOTS; i++) {
+	for (i = 0; i < BGE_STD_RX_RING_CNT; i++) {
  		if (bge_newbuf_std(sc, i, NULL) == ENOBUFS)
  			return (ENOBUFS);
-	};
+	}

  	bus_dmamap_sync(sc->bge_cdata.bge_rx_std_ring_tag,
@@ -866,5 +863,5 @@
  		if (bge_newbuf_jumbo(sc, i, NULL) == ENOBUFS)
  			return (ENOBUFS);
-	};
+	}

  	bus_dmamap_sync(sc->bge_cdata.bge_rx_jumbo_ring_tag,
Index: if_bgereg.h
===================================================================
RCS file: /home/ncvs/src/sys/dev/bge/if_bgereg.h,v
retrieving revision 1.65
diff -u -2 -r1.65 if_bgereg.h
--- if_bgereg.h	22 Dec 2006 02:59:58 -0000	1.65
+++ if_bgereg.h	30 Dec 2006 13:35:32 -0000
@@ -2327,13 +2329,7 @@

  /*
- * Memory management stuff. Note: the SSLOTS, MSLOTS and JSLOTS
- * values are tuneable. They control the actual amount of buffers
- * allocated for the standard, mini and jumbo receive rings.
+ * Memory management stuff.
   */

-#define BGE_SSLOTS	256
-#define BGE_MSLOTS	256
-#define BGE_JSLOTS	384
-
  #define BGE_JRAWLEN (BGE_JUMBO_FRAMELEN + ETHER_ALIGN)
  #define BGE_JLEN (BGE_JRAWLEN + (sizeof(uint64_t) - \
%%%

Bruce
