[Fwd: Re: bge Ierr rate increase from 5.3R -> 6.1R]
Bruce Evans
bde at zeta.org.au
Tue Jan 2 18:33:34 PST 2007
On Tue, 2 Jan 2007, David Christensen wrote:
>> These happen under loads that can't be handled, and generally cause
>> thousands of input errors every second.  The hardware records dropped
>> packets separately from other input errors, but unfortunately all
>> types of input errors are counted together in if_ierrors, and I
>> haven't done more than muck around in ddb to separate them.
>
> There are a couple places I can suggest you look to understand if there
> are any resource limitations being encountered by the controller.
I mostly understand the resource limits but not errors caused by mii
accesses. The following debugging code:
% diff -c2 ./amd64/amd64/io_apic.c~ ./amd64/amd64/io_apic.c
% *** ./amd64/amd64/io_apic.c~ Wed Nov 22 03:09:32 2006
% --- ./amd64/amd64/io_apic.c Sun Dec 24 06:50:14 2006
% ***************
% *** 2841,2845 ****
% --- 2961,2968 ----
% stdcnt++;
% if (cur_rx->bge_flags & BGE_RXBDFLAG_ERROR) {
% + if (bge_errsrc & 1)
% ifp->if_ierrors++;
% + if (bge_errsrc & 8)
% + printf("errflag %#x\n", cur_rx->bge_error_flag);
% bge_newbuf_std(sc, sc->bge_std, m);
% continue;
gives (except under loads so high that the interrupt handler can't keep up):
- no input errors except the ones recorded here
- the ones recorded here are always 0x04 (LINK_LOSS). These happen under
high loads if and only if at least 1 device register is read in mii code.
They are easy to avoid by not calling mii_tick().
Why is FreeBSD polling link status in bge_tick() anyway? I think the
interrupt for link status change works on most devices.
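Something like the following is what I have in mind for avoiding the poll
(a sketch only: hw.bge.link_poll and bge_link_poll are made-up names, not
existing driver knobs; the rest of bge_tick() stays as it is):

static int bge_link_poll = 0;		/* 0: rely on the link change interrupt */
TUNABLE_INT("hw.bge.link_poll", &bge_link_poll);

	/* In bge_tick(), for copper PHYs, instead of polling unconditionally: */
	if (bge_link_poll) {
		mii = device_get_softc(sc->bge_miibus);
		mii_tick(mii);
	}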
> The first is register 0x4450 (bits 15:0 for 5700 to 5704 and bits 8:0
> for other controllers) which shows the number of internal free RX MBUFs
> on the controller.  (In this case an MBUF is an internal 128 byte data
> structure used by the controller.)  This value is especially important
> on 5700 to 5704 devices since those devices share a pool of MBUFs
> between the RX and TX data paths, unlike the 5705 and later devices
> which use separate MBUF pools for RX and TX data.  As the value
> approaches 0, the controller will either start to generate pause frames
> or simply drop packets as specified by the values in registers 0x4410,
> 0x4414, and 0x4418.  If you see ifInDiscards incrementing then this is
> a definite possibility.  You would also see bit 4 of register 0x4400
> set if this occurs.
Thanks, I'll check that. Contention between RX and TX would explain some
other behaviour that I saw and want to avoid. I want to set the parameters
to large values under high loads. Nearly 512 descriptors should work for
RX, but on a 5701 there is a magic limit of just 20 when TX is combined
with RX, and a not so magic limit of 64 less than the number of host RX
mbufs for RX only. The magic 20 is remarkably independent of other
coalescing parameters, but I think it depends on the details of the loads
and the interaction of all the layers of buffering. ifInDiscards shows this
error.
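A rough register-dump helper along these lines should be enough for the
checking (a sketch only: the offsets and the meaning of bit 4 of 0x4400 are
taken from your description above, bge_dump_bman is a made-up name, and
CSR_READ_4()/device_get_nameunit() are used as in if_bge.c):

static void
bge_dump_bman(struct bge_softc *sc)
{

	/*
	 * 0x4450 holds the free internal RX MBUF count (bits 15:0 on
	 * 5700-5704, bits 8:0 on later chips); bit 4 of 0x4400 is
	 * reportedly set once the mbuf pool low watermark has been hit.
	 */
	printf("%s: bman mode %#x, free mbufs %#x, watermarks %#x/%#x/%#x\n",
	    device_get_nameunit(sc->bge_dev),
	    CSR_READ_4(sc, 0x4400), CSR_READ_4(sc, 0x4450),
	    CSR_READ_4(sc, 0x4410), CSR_READ_4(sc, 0x4414),
	    CSR_READ_4(sc, 0x4418));
}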
Hmm, my original reply to this thread may have covered the main point.
ifInDiscards wasn't added to if_ierrors until last January, so the
sometimes-huge numbers of errors for packet drops were not reported in
RELENG_5.
> I also noticed on -CURRENT that register 0x2c18 was set to a rather
> high value (0x40).  This register controls how many bge_rx_bd's will be
> fetched at one time.  A high value will reduce PCI bus utilization as
> it will fetch more BD's in a single transaction, but it will increase
> latency and could potentially deplete the available internal MBUFs as
> the controller waits for the driver to populate additional bge_rx_bd's.
> Try using a smaller value such as 0x08.
This seems to be from misreading the data sheet where it says to use
a value of <ring size>/8. The old data sheet that I have emphasizes
that the value should be low for the std rx ring and gives a value of
25. The N/8 is for something nearby. Decreasing this to 8 or so
changes the above magic 20 to about 28 (not nearly as much increase
as the reduction). I couldn't understand what this parameter does
from the data sheet -- why is it called a threshold when (as you
described it above) it is a burst size?
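For reference, this is roughly how the threshold gets programmed in
bge_blockinit() (macro names as in if_bgereg.h; BGE_STD_RX_RING_CNT is 512,
so the ring-size/8 formula gives the 0x40 you saw), together with the
smaller fixed value you suggest trying -- an experiment, not something I
have validated:

	CSR_WRITE_4(sc, BGE_RBDI_STD_REPL_THRESH, BGE_STD_RX_RING_CNT / 8);

	/* Suggested experiment: a small fixed replenish threshold/burst. */
	CSR_WRITE_4(sc, BGE_RBDI_STD_REPL_THRESH, 8);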
> It would be really interesting if you could add a sysctl that would
> bring out the hardware statistics maintained by the device, similar to
> what exists in the "bce" driver.  With this information we could focus
> on individual blocks to see where the packet loss is occurring and may
> be able to come up with more ideas on tuning the controller.
I might get around to that if no one else does. I saw a few more
statistics with a Broadcom Windows utility for another machine with a
5705. Not the ones I really wanted to see of course, but they indicated
that the FreeBSD side (with a 5701) is working OK, and the WinXP side
(with a 5705) is working better in some respects than the 5705 under
FreeBSD -- under WinXP the 5705 can receive at a higher rate, until
this activity crashes the utility and WinXP. Pause frames don't
seem to be working right -- apparently, only WinXP generates them,
and I had to turn them off for WinXP to get anywhere near the maximum
packet rate.
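The sort of debug sysctl you are suggesting could start out as simple as
this (a sketch: bge_sysctl_debug_reg and the bge_debug_reg_off softc member
are invented names; the sysctl(9) and CSR_READ_4() usage is standard),
exposing one raw register such as 0x4450 so the interesting counters can at
least be watched from userland:

static int
bge_sysctl_debug_reg(SYSCTL_HANDLER_ARGS)
{
	struct bge_softc *sc = arg1;
	uint32_t val;

	BGE_LOCK(sc);
	val = CSR_READ_4(sc, sc->bge_debug_reg_off);	/* assumed softc field */
	BGE_UNLOCK(sc);
	return (sysctl_handle_int(oidp, &val, 0, req));
}

	/* Registered from bge_attach(): */
	SYSCTL_ADD_PROC(device_get_sysctl_ctx(sc->bge_dev),
	    SYSCTL_CHILDREN(device_get_sysctl_tree(sc->bge_dev)),
	    OID_AUTO, "debug_reg", CTLTYPE_UINT | CTLFLAG_RD, sc, 0,
	    bge_sysctl_debug_reg, "IU", "raw device register (debug)");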
While I'm here I'll ask you why coalescing doesn't work at all on my
5705 (rev A3) under FreeBSD. I use sysctls to tune it on the 5701
and am now writing dynamic tuning.
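The dynamic tuning amounts to reprogramming the host coalescing registers
at runtime, roughly like the following (a sketch: bge_set_coal() is a
made-up helper; the register and bit names are the ones in if_bgereg.h, and
whether the engine really has to be paused around the writes -- and why
none of this has any visible effect on the 5705 A3 -- is part of what I
still have to work out):

static void
bge_set_coal(struct bge_softc *sc, uint32_t rx_ticks, uint32_t rx_bds)
{

	BGE_LOCK_ASSERT(sc);

	/* Pause the host coalescing engine while changing its parameters. */
	BGE_CLRBIT(sc, BGE_HCC_MODE, BGE_HCCMODE_ENABLE);
	CSR_WRITE_4(sc, BGE_HCC_RX_COAL_TICKS, rx_ticks);
	CSR_WRITE_4(sc, BGE_HCC_RX_MAX_COAL_BDS, rx_bds);
	BGE_SETBIT(sc, BGE_HCC_MODE, BGE_HCCMODE_ENABLE);

	sc->bge_rx_coal_ticks = rx_ticks;
	sc->bge_rx_max_coal_bds = rx_bds;
}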
Bruce