Panic in 6.2-PRERELEASE with bge on amd64
Sven Willenberger
sven at dmv.com
Tue Jan 9 14:30:56 UTC 2007
On Tue, 2007-01-09 at 12:50 +1100, Bruce Evans wrote:
> On Mon, 8 Jan 2007, Sven Willenberger wrote:
>
> > On Mon, 2007-01-08 at 16:06 +1100, Bruce Evans wrote:
> >> On Sun, 7 Jan 2007, Sven Willenberger wrote:
>
> >>> The short and dirty of the dump:
> >>> ...
> >>> --- trap 0xc, rip = 0xffffffff801d5f17, rsp = 0xffffffffb371ab50, rbp = 0xffffffffb371aba0 ---
> >>> bge_rxeof() at bge_rxeof+0x3b7
> >>
> >> What is the instruction here?
> >
> > I will do my best to ferret out the information you need. For the
> > bge_rxeof() at bge_rxeof+0x3b7 line, the instruction is:
> >
> > 0xffffffff801d5f17 <bge_rxeof+951>: mov %r15,0x28(%r14)
> > ...
> >> Looks like a null pointer panic anyway. I guess the instruction is
> >> movl to/from 0x28(%reg) where %reg is a null pointer.
> >>
> >
> > from the above lines, apparently %r14 is null then.
>
> Yes. It's a bit surprising that the access is a write.
>
> >>> ...
> >>> #8 0xffffffff801db818 in bge_intr (xsc=0x0) at /usr/src/sys/dev/bge/if_bge.c:2707
> >>
> >> What is the statement here? It presumably follows a null pointer and only
> >> the expression for the pointer is interesting. xsc is already null but
> >> that is probably a bug in gdb, or the result of excessive optimization.
> >> Compiling kernels with -O2 has little effect except to break debugging.
> >
> > the block of code from if_bge.c:
> >
> > 2705 if (ifp->if_drv_flags & IFF_DRV_RUNNING) {
> > 2706 /* Check RX return ring producer/consumer. */
> > 2707 bge_rxeof(sc);
> > 2708
> > 2709 /* Check TX ring producer/consumer. */
> > 2710 bge_txeof(sc);
> > 2711 }
>
> Oops. I should have asked for the statement in bge_rxeof().
#7 0xffffffff801d5f17 in bge_rxeof (sc=0xffffffff8836b000) at /usr/src/sys/dev/bge/if_bge.c:2528
2528 m->m_pkthdr.len = m->m_len = cur_rx->bge_len - ETHER_CRC_LEN;
(where m is defined as:
2449 struct mbuf *m = NULL;
)
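
To make the offset arithmetic concrete: a store through a null struct
pointer touches an address equal to the member's offset, so an access of
the form 0x28(%reg) with a null %reg points at whichever member sits 0x28
bytes into the structure. The C sketch below uses a made-up structure
(sketch_pkt, with explicit padding so that its rcvif-like member lands at
0x28). The layout and all names are assumptions for illustration only,
not the real struct mbuf from sys/mbuf.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical packet header, NOT the real struct mbuf.
 * Explicit padding pins the last member at offset 0x28.
 */
struct sketch_pkt {
	uint64_t next;		/* offset 0x00 */
	uint64_t nextpkt;	/* offset 0x08 */
	uint64_t data;		/* offset 0x10 */
	int32_t  len;		/* offset 0x18 */
	int32_t  flags;		/* offset 0x1c */
	int16_t  type;		/* offset 0x20 */
	uint8_t  pad[6];	/* offset 0x22, pads up to 0x28 */
	uint64_t rcvif;		/* offset 0x28 */
};

/* Effective address of a store to a member field_off bytes past base. */
static uintptr_t
store_address(uintptr_t base, size_t field_off)
{
	return (base + field_off);
}

/* Address touched by "p->rcvif = ..." when p is NULL (base 0). */
static uintptr_t
null_store_address(void)
{
	assert(offsetof(struct sketch_pkt, rcvif) == 0x28);
	return (store_address(0, offsetof(struct sketch_pkt, rcvif)));
}
```

Running offsetof() against the real structure on the affected kernel
would be the way to confirm which member the faulting store targets.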
>
> > By default -O2 is passed to CC (I don't use any custom make flags; I
> > only define CPUTYPE in my /etc/make.conf).
>
> -O2 is unfortunately the default for COPTFLAGS for most arches in
> sys/conf/kern.pre.mk. All of my machines and most FreeBSD cluster
> machines override this default in /etc/make.conf.
>
> With the override overridden for RELENG_6 amd64, gcc inlines bge_rxeof(),
> so your environment must be a little different to get even the above
> info. I think gdb can show the correct line numbers but not the call
> frames (since there is no call). ddb and the kernel stack trace can
> only show the call frames for actual calls.
>
> With -O1, I couldn't find any instruction similar to the mov to the
> null pointer + 28. 28 is a popular offset in mbufs
If you have a suggestion for an /etc/make.conf line, I can recompile the
kernel accordingly assuming it still panics or locks up after the change
of interface noted below.
>
> > The short of it is that this interface sees pretty much non-stop traffic
> > as this is a mailserver (final destination) and is constantly being
> > delivered to (direct disk access) and mail being retrieved (remote
> > machine(s) with nfs mounted mail spools). If a momentary down of the
> > interface is enough to completely panic the driver and then the kernel,
> > this hardly seems "robust" if, in fact, this is what is happening. So
> > the question arises as to what would be causing the down/up of the
> > interface; I could start looking at the cable, the switch it's connected
> > to and ... any other ideas? (I don't have watchdog enabled or anything
> > like that, for example).
>
> I don't think down/up can occur in normal operation, since it takes ioctls
> or a watchdog timeout to do it. Maybe some ioctls other than a full
> down/up can cause problems... bge_init() is called for the following
> ioctls:
> - mtu changes
> - some near down/up (possibly only these)
> Suspend/resume and of course detach/attach do much the same things as
> down/up.
>
> BTW, I added some sysctls and found it annoying to have to do down/up
> to make the sysctls take effect. Sysctls in several other NIC drivers
> require the same, since doing a full reinitialization is easiest.
> Since I am tuning using sysctls, I got used to doing down/up too much.
>
> Similarly for the mtu ioctl. I think a full reinitialization is used
> for mtu changes mainly in cases the change switches on/off support for
> jumbo buffers. Then there is a lot of buffer reallocation to be
> done, and interfaces have to be stopped to ensure that the buffers
> being deallocated are not in use, etc.
>
> Bruce
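
The point about stopping the interface before buffers are reallocated can
be pictured with a small C sketch: a receive loop that refuses to run
while the ring is being rebuilt and that checks each slot for NULL before
dereferencing it. Everything here (sketch_ring, sketch_buf, sketch_rxeof,
the ring size) is hypothetical and only illustrates the idea. It is not
the bge driver's code and not a proposed patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RING_CNT 8	/* hypothetical ring size */

struct sketch_buf {
	int len;		/* received length, including CRC */
};

struct sketch_ring {
	struct sketch_buf *slot[SKETCH_RING_CNT]; /* NULL = no buffer */
	int running;		/* cleared while the ring is rebuilt */
	unsigned skipped;	/* empty slots seen by the consumer */
};

/* Consume slots [cons, prod); returns the number of buffers handled. */
static unsigned
sketch_rxeof(struct sketch_ring *r, unsigned cons, unsigned prod, int crc_len)
{
	unsigned handled = 0;

	assert(r != NULL);
	if (!r->running)	/* ring is down: touch nothing */
		return (0);
	while (cons != prod) {
		struct sketch_buf *b = r->slot[cons];

		r->slot[cons] = NULL;
		cons = (cons + 1) % SKETCH_RING_CNT;
		if (b == NULL) {	/* slot emptied by a re-init */
			r->skipped++;
			continue;
		}
		b->len -= crc_len;	/* strip the CRC from the length */
		handled++;
	}
	return (handled);
}
```

A check like this only hides the symptom. If a slot can be empty while
the consumer is running, the real question is still what emptied it.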
As this was connected to a gigE switch with mtu left at 1500, I suppose
it is possible that some mtu discovery/change may have been
happening on the switch but that seems a bit out in left field. For now
I am using the fxp interface connected to the same switch to see if the
issue continues (the change of interface was driven by a hard lockup
yesterday where I could not even type anything on the term).
Sven