A different 32-bit powerpc head -r317820 panic on old PowerMac G5: dual backtraces from "timeout stopping cpus" (dump failed though): any comments?

Tue Jun 6 18:16:01 UTC 2017

On 2017-Jun-6, at 3:46 AM, Mark Millard <markmi at dsl-only.net> wrote:

> I'm not sure what to make of this. May be someone has
> some idea.
> 
> One of the bt's is for "irq53: smudoorbell0" (tid 100055) the other is for usb.
> 
> spinlock 0xefbbf00 (sched lock 1) held by 0x593d6c0 (tid 100055) too long
> spinlock 0xxf2e1f8 (sleepq chain) help by 0x17606c0 (tid 100049) too long
> timeout stopping cpus
> 
> usb. . .
> (I omit any "at kdb_backtrace" lines)
> 
> 0xe56636d0: at vpanic+0x210
> 0xe5663740: at panic+0x54
> 0xe5663790: at _mtx_lock_spin_failed_+0x54
> 0xe56637b0: at _mtx_lock_spin_cookie+0x19c
> 0xe5663810: at sched_add+0x140
> 0xe5663850: at sched_wakeup+0xa4
> 0xe5663870: at setrunnable+0x9c
> 0xe5663890: at sleepq_resume_thread+0x17c
> 0xe56638c0: at sleepq_signal+0xa4
> 0xe56638f0: at cv_signal+0x94
> 0xe5663910: at usbd_do_request_callback+0x80
> 0xe5663930: at usb_request_callback_0x28
> 0xe5663950: at usbd_callback_wrapper+0xa98
> 0xe5663990: at usb_command_wrapper+0x14c
> 0xe56639d0: at usb_callback_proc+0x190
> 0xe5663a00: at usb_process+0x1d0
> 0xe5663a50: at fork_exit+0xf4
> 0xe5663a80: at fork_trampoline+0xc
> 
> irq53: smudoorbell0. . .
> 
> 0xe56e4460: at vpanic+0x22c
> 0xe56e44d0: at panic+0x54
> 0xe56e4520: at _mtx_lock_spin_failed+0x54
> 0xe56e4540: at thread_lock_flags_0x1bc
> 0xe56e45b0: at sleepq_timeout+0x44
> 0xe56e45f0: at softclock_call_cc+0x1bc
> 0xe56e4660: at callout_process+0x27c
> 0xe56e46c0: at handleevents+0x2ac
> 0xe56e4710: at timercb+0x4c4
> 0xe56e4790: at decr_intr+0xf0
> 0xe56e47b0: at powerpc_intrerrupt+0xf4
> 0xe56e47e0: at kernel DECR trap
>               by _mtx_lock_spin_cookie+0x194
>               srr1= 0x9032
>               r1=   0xe56e48a0
>               cr=   0x20000002
>               xer=  0
>               ctr=  0
> 0xe56e48a0: at _mtx_lock+_spin_cookie+0x190
> 0xe56e4900: at sleepq_lock+0xac
> 0xe56e4930: at wakeup+0x24
> 0xe56e4950: at smu_doorbell_intr+0x128
> 0xe56e4980: at intr_event_execute_handler+0x220
> 0xe56e49f0: at ithread_loop+0xf0
> 0xe56e4a50: at fork_exit+0xf4
> 0xe56e4a80: at fork_trampoline+0xc
> 
> For reference acttrace also reported
> (that looks odd after the first 2
> stack lines):
> 
> powerpd pid 960 tid 100097 td 0x5ba1999 (CPU 2)
> 0xd2418510: at uma_zalloc_arg+0x178
> 0xd2418580: at mmu_unmapdev+desc+0x8
> 0x00fcf9dc: at primes+0x68
> 0x00d026b0: at moe64_page_exists_quick+0x1d8
> 0x00ce0ff8: at mmu_unmapdev_desc+0x8
> 
> (CPU3 had an empty backtrace for pid 11's
> tid 100006.)
> 
> 
> I had another example of sorts but it did not
> get far before hanging up:
> 
> spinlock 0xefbbf00 (sched lock 1) held by 0x146d360 (tid 100004) too long
> spinlock 0xefbbf00 (sched lock 1) held by 0x146d360 (tid 100004) too long
> 
> (Note: Two different colored lines above but the
> same otherwise.)
> 
> timeout stopping cpus
> 
> Note: The partial backtrace was:
> ("cpuid = 2", all in the first color above)
> 
> 0xdf6bb5c0: at vpanic+0x210
> 0xdf6bb630: at panic++0x54
> 0xdf6bb680: at _mtx_lock_spin_failed+0x54
> 0xdf6bb6a0: at _mtx_lock_spin_cookie+0x190
> 0xdf6bb700: at sched_add+0x140
> 0xdf6bb740: at sched_wakeup+0xa4
> 0xdf6bb760: at 
> 
> (and that is as far as it got for the backtrace(s))

FYI: I'm currently doing an approximate
binary search for localizing part of the panic problem.

This is based on the classic panics that are instead
from jumping to a non-code area. . .

At a given point in my other experiments I was
getting:

srr0=0x90a0f0 etext+0xb8fc

Adding (unused) code somewhat before that etext
(so increasing etext) got:

srr0=0x90a0f0 etext+0xb8a8
(The additional code was larger than I now use.)

But instead adding some code earlier (by around
0x100000 in this example) got:

srr0=0x90a110 etext+0xb8fc

So comparing to the starting conditions in
each case:

The bad-address accessed in one case stayed
constant but the etext offset decreased: in essence
the only thing that happened is etext increased
(matching the offset decrease).

In the other case the etext offset stayed constant
but the bad-address and etext increased by the
same amount.

Unfortunately this will take a while:

A) It typically takes hours for a failure to happen,
   I've had up to between 7 hours and 8 hours.
   (I've had minutes-to-failure but it is rare.)

B) I'd let it go 24 hours before concluding that
   a test combination was not likely to produce
   an obvious form of failure (such as the panic).

(I've had (B) in some earlier forms of testing
variations of things.)

Another time-taker is that I need to be around
because I avoid leaving the PowerMac G5 panicked
with the fans going fill-speed for long periods.

Currently I'm adding code by adding:

void HACKISH_EXTRA_CODE(void) {}

to one .c file from /usr/src/sys/. . . based which
file gets to within a ballpark of a more accurate
binary search position. (Large binary search
jumps currently: I'm not being picky about where
in the .c the addition is made yet.)

===
Mark Millard
markmi at dsl-only.net