vr(4) troubles for AMD Geode CS5536 chipset

Mon Sep 3 02:41:03 UTC 2012

On Fri, Aug 31, 2012 at 12:45:53PM +0700, Eugene Grosbein wrote:
> In my previous letter I described my attempts to try vr(4) from HEAD.
> Now I'd like to explain why I tried it.
> 
> The problem is that the stock vr(4) from 8.3-STABLE/i386 has serious issues on my system.
> I have a home router with two vr interfaces: vr0 is for LAN (IPoE) and vr1 is for WAN (PPPoE/mpd).
> 
> Presently, every day my WAN vr interface stops working correctly.
> Sometimes it stops receiving packets altogether: tcpdump shows none of them.
> Sometimes it receives some, but with a great delay of up to 10 seconds (not milliseconds)
> or even more; tcpdump shows that the delay occurs on the receive path.
> Sometimes it even reorders packets: tcpdump shows that some incoming ICMP echo requests
> with lower sequence numbers arrive later than already answered higher-numbered requests.

Hmm, it seems the driver's RX path consumer/producer indices were
corrupted.

> 
> ifconfig vr1 down/up revives the interface completely until the next morning.
> sysctl net.inet.ip.fw.enable=0 does not solve the problem.
> 
> I control the WAN switching/routing network and can assure you it runs just fine.
> However, I can't guarantee it has no "soft" anomalies like short storms or some silly broadcasts.
> 
> I tried to generate an incoming UDP flood with ng_source(4) at a 100M rate
> for 60 seconds, but failed to reproduce the problem artificially.
> 
> I also moved the WAN from vr1 to vr0, and the problem moved to vr0 too.
> My LAN has very little traffic and the corresponding vr interface exhibits no problems.
> 
> This router also routinely runs transmission (a torrent client from ports)
> serving torrents from a USB-attached HDD, which causes heavy CPU load, so I suspect
> the problem may be related to CPU load.
> 
> I've also checked mbuf/mbuf clusters usage and they are all right:
> 
> # netstat -m
> 1539/2076/3615 mbufs in use (current/cache/total)
> 1200/1278/2478/65536 mbuf clusters in use (current/cache/total/max)
> 1200/306 mbuf+clusters out of packet secondary zone in use (current/cache)
> 318/181/499/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
> 0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
> 0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
> 4056K/3799K/7855K bytes allocated to network (current/cache/total)
> 0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
> 0/0/0 requests for jumbo clusters denied (4k/9k/16k)
> 0/4/6656 sfbufs in use (current/peak/max)
> 0 requests for sfbufs denied
> 0 requests for sfbufs delayed
> 0 requests for I/O initiated by sendfile
> 0 calls to protocol drain routines
> 
> # vmstat -z | egrep -i 'ITEM|mbuf'
> ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
> mbuf_packet:              256,        0,     1429,       77, 112854470,        0
> mbuf:                     256,        0,      489,     1620, 369073316,        0
> mbuf_cluster:            2048,    65536,     1506,      604,  5401864,        0
> mbuf_jumbo_page:         4096,    12800,      469,      158,  8306777,        0
> mbuf_jumbo_9k:           9216,     6400,        0,        0,        0,        0
> mbuf_jumbo_16k:         16384,     3200,        0,        0,        0,        0
> mbuf_ext_refcnt:            4,        0,        0,        0,        0,        0
> NetGraph items:            36,     4130,        1,      117,   263123,        0
> NetGraph data items:       36,      531,        0,      295, 106663377,        0
> 
> While ifconfig vr1 down/up solves the problem completely (for a long time),
> taking the link down/up at the switch solves it only "in half": the huge packet delays disappear
> and turn into 25% packet loss occurring at regular short intervals, about once a second.
> 
> ifconfig down/up clears this mess too.
> 
> Please help me to debug this, it's pretty annoying.

By chance, did vr(4) print any diagnostic messages to the
console?  If I remember correctly, vr(4) automatically restarts the
controller and reports an error when it detects an abnormal
condition. Abnormal conditions for vr(4) would be:
 - TX/RX MAC stuck
 - RX MAC stop due to FIFO overflow or no RX buffers
 - PCI bus errors
 - TX abort
 - TX underrun
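
A quick way to look for such messages is to filter the kernel log for
lines tagged with a vr(4) instance name. The following is only a minimal
sketch: the helper name vr_diag is hypothetical, and it assumes the
driver prefixes its messages with the device name (for example "vr1: "),
as kernel device messages normally do. It does not know the exact wording
of any particular error.

```shell
#!/bin/sh
# Print only log lines emitted by a vr(4) instance, e.g. "vr1: ...".
# Reads log text on stdin. The helper name vr_diag is hypothetical.
# The pattern accepts the tag at the start of a line (dmesg output) or
# after whitespace (syslog lines such as "... kernel: vr1: ...").
vr_diag() {
    grep -E '(^|[[:space:]])vr[0-9]+: '
}

# Typical use on the router (standard FreeBSD locations):
#   dmesg | vr_diag
#   vr_diag < /var/log/messages
```

Note that the probe/attach lines printed at boot will match too, so the
interesting entries are the ones logged after the interface came up.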

> I had hoped the new vr(4) driver would help, but it takes my system down under average load
> and is unusable.
> 
> Here is the start of dmesg.boot:
> 
> Copyright (c) 1992-2012 The FreeBSD Project.
> Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>         The Regents of the University of California. All rights reserved.
> FreeBSD is a registered trademark of The FreeBSD Foundation.
> FreeBSD 8.3-STABLE #1: Wed Aug 29 22:49:45 NOVT 2012
>     root@grosbein.pp.ru:/usr/local/obj/nanobsd.gw/i386/usr/local/src/sys/GW i386
> Timecounter "i8254" frequency 1193182 Hz quality 0
> CPU: Geode(TM) Integrated Processor by AMD PCS (499.91-MHz 586-class CPU)
>   Origin = "AuthenticAMD"  Id = 0x5a2  Family = 5  Model = a  Stepping = 2
>   Features=0x88a93d<FPU,DE,PSE,TSC,MSR,CX8,SEP,PGE,CMOV,CLFLUSH,MMX>
>   AMD Features=0xc0400000<MMX+,3DNow!+,3DNow!>
> real memory  = 1065025536 (1015 MB)
> avail memory = 1032929280 (985 MB)
> K6-family MTRR support enabled (2 registers)
> 
> I must also note that this system runs with ACPI disabled in /boot/loader.conf:
> hint.acpi.0.disabled=1
> 
> Otherwise, its timekeeping becomes broken.
> 
> Eugene Grosbein