freebsd-5.4-stable panics
Robert Watson
rwatson at FreeBSD.org
Tue Sep 27 12:43:47 PDT 2005
On Tue, 27 Sep 2005, Rob Watt wrote:
> Thanks for your quick response and suggestions. We have now experienced
> an additional type of crash. Type 3 is from 6.0-BETA5, it did not enter
> the debugger at all and we could not generate a core.
Is this an SMP box? If so, could you try compiling options KDB_STOP_NMI
into your kernel -- you'll also need to set debug.kdb.stop_cpus_with_nmi=1
either in loader.conf or at runtime with sysctl. This will probably
become the default at some point -- in the meantime, the default when
entering the debugger on one CPU is to send an IPI to the other CPUs
telling them "go into the debugger". This works fine unless a CPU has
interrupts disabled, such as when it's holding a spin lock in the
scheduler, in which case the system will deadlock because that CPU won't
acknowledge the IPI. With the above option, a non-maskable interrupt is
used to signal the other CPUs instead, which gets them into the debugger
much more reliably.
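Roughly, the pieces described above would look like the following. This is
a sketch only: the file locations are the usual defaults and are an
assumption, and the kernel has to be rebuilt and installed after the
config change.

```shell
# 1. Kernel configuration file (assumed location: /usr/src/sys/<arch>/conf/<KERNCONF>).
#    Add this line, then rebuild and install the kernel:
#
#        options KDB_STOP_NMI
#
# 2. Either set the knob at boot by adding this line to /boot/loader.conf:
#
#        debug.kdb.stop_cpus_with_nmi=1
#
# 3. ...or set it at runtime, as root, on the running system:
sysctl debug.kdb.stop_cpus_with_nmi=1
```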
The trap information you've provided indicates that it is likely a data
NULL pointer dereference in the kernel (faulting address is a small
increment above NULL). The instruction pointer looks valid -- if you have
a debugging copy of the kernel, could you load it into gdb and show me
what line number / piece of code it's in? You can use "l
*0xffffffff803b88ca" to generate that, even without a live debugger session
or core. If you can get into DDB with the above, a good starting set of
debugging information (ideally gathered over a serial console) is:
    trace               # current thread trace
    show pcpu           # current CPU data
    show pcpu 0         # CPU 0 data
    show pcpu 1         # CPU 1 data
    ...                 # Any other CPUs
    ps                  # process listing
    show lockedvnods    # VFS locking information
If you have WITNESS compiled in, also:
    show alllocks
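For the gdb lookup of the faulting instruction pointer mentioned earlier, a
session might look like the sketch below. The path to the debugging kernel
is an assumption (it depends on where and how the kernel was built), and
MYKERNEL is a placeholder for your kernel config name. gdb parses a bare
hex string as a symbol name, so the address is written with a 0x prefix.

```shell
# Load the debugging copy of the kernel (built with debug symbols).
# The path below is hypothetical; substitute your own build directory.
gdb /usr/obj/usr/src/sys/MYKERNEL/kernel.debug

# Then, at the (gdb) prompt, list the source around the trap address:
#
#     (gdb) list *0xffffffff803b88ca
```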
> Unfortunately the 6-BETA crash was completely different from everything
> we've seen so far. The panic was related to a page fault and 'top' was
> the active process. We are trying again to run our tests on 6.0, but if
> we keep encountering other bugs, then those other bugs may prevent us
> from determining if multicast is the problem.
Let's get this first bug you're hitting fixed, and then see if we can get
back to the original problems.
> We also ran our applications in 5-STABLE without reading from or writing
> to disk (ie we ran the multicast data streams on a remote machine, and
> we told our listener/rebroadcaster apps not to write to disk). In this
> configuration we were able to run for 4 days without crashing. A few
> hours before the crash we had introduced disk activity (bonnie in a
> constant loop with 1G test file size). This crash was a type 1, and we
> were not able to save a core. The longest we had gone before without a
> crash was 6 hours, so it is possible that either load, or disk activity
> help trigger the bugs we have seen.
I'm heading off on a vacation for two days, and will be offline for that
period, but if we can't easily get through solving 6.x problems on the
host, I can backport a subset of the multicast fixes to 5.x and we can see
if that fixes things up. It may make sense to do this anyway, but I may
not have an opportunity to go through the development and testing on that
until after 6.0 is out the door.
> files attached:
> kernel-conf.txt (6.0 kernel)
> type3-core.txt (copy of panic output to console)
>
> We will update you with more info from our 6.0 tests when we have it.
>
> We are in a bind right now. All modern hardware (ie EM64T/amd64) only
> seems to work with versions of freebsd that aren't stable when running
> our applications. Many vendors do not even sell server hardware that is
> purely i386. We never encountered these types of problems on freebsd
> 4.x, and many of our 120+ i386 class machines that are running 4.x are
> showing their age and need to be replaced. Assuming that the problems we
> are experiencing are purely related to the OS, we now don't have an OS
> to run on the newer hardware we've been buying. We really need to find a
> way to patch these problems or find a version of freebsd that supports
> our platform and is stable. Obviously we appreciate the hard work that
> all of you on the freebsd team do, and we are happy to do whatever we
> can to help squash these bugs.
Hopefully we can get this fixed up as soon as possible.
Do you have a testbed or set of test hosts set up so you can
non-disruptively test change sets, btw?
Thanks,
Robert N M Watson