TARGET_ARCH=powerpc head -r317820 production-style kernel: periodic panics always in pid=11 (the Idle threads)
Mark Millard
markmi at dsl-only.net
Sat May 20 09:01:59 UTC 2017
On 2017-May-19, at 9:42 PM, Mark Millard <markmi at dsl-only.net> wrote:
> On 2017-May-9, at 2:00 PM, Mark Millard <markmi at dsl-only.net> wrote:
>
> . . .
>> fatal kernel trap:
>> exception = 0x903a64e (unknown)
>> srr0 = 0x7ff760
>> srr1 = 0xc1007c
>> lr = 0x907f
>> curthread = 0x147d6c0
>> pid = 11, comm = idle: cpu0
>> [ thread pid 11 tid 100003 ]
>> Stopped at ffs_truncate+0x1080: stw r11, 0xf8(r31)
>>
>> 1 contains (cpu1 instead of cpu0, so different tid):
>>
>> fatal kernel trap:
>> exception = 0x903a64e (unknown)
>> srr0 = 0x7ff760
>> srr1 = 0xc1007c
>> lr = 0x907f
>> curthread = 0x147d360
>> pid = 11, comm = idle: cpu1
>> [ thread pid 11 tid 100004 ]
>> Stopped at ffs_truncate+0x1080: stw r11, 0xf8(r31)
>>
>> 1 contains:
>
> I've discovered where to find the trapframe
> in the vmcore.* files for these specific
> examples with 0x903a64e as the exception
> and such.
>
> In the vmcore the memory image starts at
> byte offset 0x1000.
>
> To see the values reported the only
> place in the image file to start that
> produces those values at the offsets
> for in side the powerpc trapframe is:
>
> offset 0x1001 in the vmcore.* file.
>
> So memory address 0x1 is being used
> as the trapframe address when that
> odd exception information is being
> displayed. Yep: misaligned.
>
> The decoding is not of the actual
> trapframe: it is garbage that is
> not to be believed.
>
>
> Note: I lucked out after the above and
> got a somewhat different odd trap information
> that lead to actually getting a backtrace
> that included the actual pid 11 cpu 1 kernel
> thread stack bt associated with that odd
> information display.
Typo: That should have been "cpu 2".
> I'll send a separate reply for that information
> as it will take some transcribing from camera
> pictures and such.
As indicated, I got a different odd trap report
that gave a backtrace. . .
fatal user trap
exception = 0x4210000 (unknown)
srr0 = 0xc1007c09
srr1 = 0x3a64e80
lr = 0xc0807fc9
curthread = 0x147d000
pid = 11, comm = idle: cpu 2
Now at this point it attempted to
db_print_loc_and_inst and got
another exception (at offset +0x60
in the routine).
So the backtrace has both the
consequences of that and what lead
up to that: an EXI trap was
attempting to report trap frame
information but was using a bad
address for the supposed frame.
The details of the backtrace:
panic: data storage interrupt trap
cpuid = 2
time = 145187154
KDB: stack backtrace
0xdf5ef2c0: at kdb_backtrace+0x5c
0xdf5ef3a0: at panic+0x54
0xdf5ef3f0: at trap_fatal+0x1cc
0xdf5ef420: at powerpc_interrupt+0x180
0xdf5ef5c0: kernel DSI read trap @ 0xc1007c09
by db_disasm+0x30:
srr1=0x1032
r1 =0xdf5ef6b0
cr =0x24009022
xer =0
ctr =0x1852cc
sr =0x40000000
0xdf5ef6b0: at 0x1007480
0xdf5ef6d0: at db_print_loc_and_inst+0x60
0xdf5ef700: at db_trap+0x104
0xdf5ef790: at kdb_trap+0x1bc
0xdf5ef810: at trap_fatal+0x1b0
0xdf5ef840: at trap+0x1184
0xdf5ef870: kernel EXI trap
by cpu_idle_60x+0x88:
srr1=0x1032
r1 =0xdf5ef930
cr =0x40000042
xer =0x20000000
ctr =0x8e3bd8
saved LR(0x2) is invalid.
So an EXI trap was attempting to
report a trap frame.
(Note: the LR's for pid 11 cpu threads
normally report an invalid LR in ddb.)
The actual EXI trapframe starts at 013f0878 in
vmcore.5:
013f0870 df 5e f9 30 00 10 08 f8 00 04 90 32 df 5e f9 30 |.^.0.......2.^.0|
013f0880 01 47 d0 00 00 00 00 00 25 94 48 3f 00 00 00 00 |.G......%.H?....|
013f0890 25 94 48 3f 00 4a a9 c8 00 00 00 00 00 00 00 44 |%.H?.J.........D|
013f08a0 01 fc a0 55 00 00 90 32 df 5d 1d 00 00 00 00 00 |...U...2.]......|
013f08b0 00 d4 bd ec 00 cb 98 98 00 c9 66 bc 00 c4 5d 08 |..........f...].|
013f08c0 00 c9 66 bc 00 d4 c5 3c df 5e f9 e0 00 eb a7 80 |..f....<.^......|
013f08d0 00 c9 66 bc 01 47 d0 00 df 5e f9 8c 00 00 00 06 |..f..G...^......|
013f08e0 00 00 00 06 00 eb b5 80 00 00 00 00 00 8e 3b d8 |..............;.|
013f08f0 00 d2 6b f0 df 5e f9 30 00 8e 3b f4 40 00 00 42 |..k..^.0..;. at ..B|
013f0900 20 00 00 00 00 8e 3b d8 00 8e 3c 60 00 00 90 32 | .....;...<`...2|
013f0910 00 00 05 00 41 a1 d5 d4 42 00 00 00 00 00 00 00 |....A...B.......|
So:
r0 = 0x00049032
r1 = 0xdf5ef930
r2 = 0x0147d000
r3 = 0x00000000
r4 = 0x2594483f
r5 = 0x00000000
r6 = 0x2594483f
r7 = 0x004aa9c8
r8 = 0x00000000
r9 = 0x00000044
r10 = 0x01fca055
r11 = 0x00009032
r12 = 0xdf5d1d00
r13 = 0x00000000
r14 = 0x00d4bdec
r15 = 0x00cb9898
r16 = 0x00c966bc
r17 = 0x00c45d08
r18 = 0x00c966bc
r19 = 0x00d4c53c
r20 = 0xdf5ef9e0
r21 = 0x00eba780
r22 = 0x00c966bc
r23 = 0x1047d000
r24 = 0xdf5ef98c
r25 = 0x00000006 (this value shows up later in a bad spot)
r26 = 0x00000006 (this value shows up next to that)
r27 = 0x00ebb580
r28 = 0x00000000
r29 = 0x008e3bd8
r30 = 0x00d26bf0
r31 = 0xdf5ef930
lr = 0x008e3bf4
cr = 0x40000042
xer = 0x20000000
ctr = 0x008e3bd8
srr0 = 0x008e3c60
srr1 = 0x00009032
exc = 0x00000500
dar = 0x41a1d5d4
dsisr = 0x42000000
Other elements of the stack leading to this are:
013f0920 45 8b a7 b5 b1 fd 96 be 00 00 00 00 00 00 00 04 |E...............|
013f0930 df 5e f9 50 00 00 00 06 00 00 00 06 00 eb b5 80 |.^.P............|
(odd lr value; value repeats
above, not ; here. r25/r26
a multiple ; pair saved
of 4 even. ; in those?
matches r25 ; matches r26
in trapframe; in trapframe)
013f0940 00 00 00 00 00 d4 ca 34 00 d2 6b f0 df 5e f9 50 |.......4..k..^.P|
013f0950 df 5e f9 70 00 8e 31 7c 00 00 00 02 00 eb b5 80 |.^.p..1|........|
013f0960 00 f2 d5 fc 00 00 00 06 00 d1 ca ac df 5e f9 70 |.............^.p|
013f0970 df 5e fa 50 00 53 6e 58 df 5e f9 80 00 00 00 00 |.^.P.SnX.^......|
. . .
013f0a50 df 5e fa 80 00 4a 3c b4 df 5e fa 60 fa 50 05 af |.^...J<..^.`.P..|
013f0a60 df 5e fa 80 00 00 00 00 00 00 00 00 00 00 00 00 |.^..............|
013f0a70 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
013f0a80 00 00 00 00 00 8f 18 90 00 53 69 84 00 00 00 00 |.........Si.....|
From which I get from the stack
backpointer/lr-value pairs(via objdump
on the kernel):
00000006: ????????????????????
008e317c: cpu_idle+0x58
00536e58: scheduletd+0x4d4
004a3cb4: fork_exit+0xf8
008f1890: fork_trampoline+0x10
But cpu_idle+0x58 is the bl ,cpu_activeclock>
in the below and the prior bctrl after
the bl <cpu_idleclock> is what made the
call.
008e3124 <cpu_idle> stwu r1,-32(r1)
008e3128 <cpu_idle+0x4> mflr r0
008e312c <cpu_idle+0x8> stw r29,20(r1)
008e3130 <cpu_idle+0xc> stw r30,24(r1)
008e3134 <cpu_idle+0x10> stw r31,28(r1)
008e3138 <cpu_idle+0x14> stw r0,36(r1)
008e313c <cpu_idle+0x18> mr r31,r1
008e3140 <cpu_idle+0x1c> bcl- 20,4*cr7+so,008e3144 <cpu_idle+0x20>
008e3144 <cpu_idle+0x20> mflr r30
008e3148 <cpu_idle+0x24> lwz r0,-36(r30)
008e314c <cpu_idle+0x28> add r30,r0,r30
008e3150 <cpu_idle+0x2c> lwz r29,-32768(r30)
008e3154 <cpu_idle+0x30> lwz r0,0(r29)
008e3158 <cpu_idle+0x34> cmpwi cr7,r0,0
008e315c <cpu_idle+0x38> beq- cr7,008e3198 <cpu_idle+0x74>
008e3160 <cpu_idle+0x3c> cmpwi cr7,r3,0
008e3164 <cpu_idle+0x40> bne- cr7,008e3188 <cpu_idle+0x64>
008e3168 <cpu_idle+0x44> bl 005002a4 <critical_enter>
008e316c <cpu_idle+0x48> bl 008ad894 <cpu_idleclock>
008e3170 <cpu_idle+0x4c> lwz r29,0(r29)
008e3174 <cpu_idle+0x50> mtctr r29
008e3178 <cpu_idle+0x54> bctrl
008e317c <cpu_idle+0x58> bl 008ad794 <cpu_activeclock>
But ctr was reported as: 0x8e3bd8
which is cpu_idle_60x.
(So no surprises so far.)
The code through the reported cpu_idle_60x+0x88
is:
008e3bd8 <cpu_idle_60x> stwu r1,-32(r1)
008e3bdc <cpu_idle_60x+0x4> mflr r0
008e3be0 <cpu_idle_60x+0x8> stw r30,24(r1)
008e3be4 <cpu_idle_60x+0xc> stw r31,28(r1)
008e3be8 <cpu_idle_60x+0x10> stw r0,36(r1)
008e3bec <cpu_idle_60x+0x14> mr r31,r1
008e3bf0 <cpu_idle_60x+0x18> bcl- 20,4*cr7+so,008e3bf4 <cpu_idle_60x+0x1c>
008e3bf4 <cpu_idle_60x+0x1c> mflr r30
008e3bf8 <cpu_idle_60x+0x20> lwz r0,-32(r30)
008e3bfc <cpu_idle_60x+0x24> add r30,r0,r30
008e3c00 <cpu_idle_60x+0x28> lwz r9,-32756(r30)
008e3c04 <cpu_idle_60x+0x2c> lwz r0,0(r9)
008e3c08 <cpu_idle_60x+0x30> cmpwi cr7,r0,0
008e3c0c <cpu_idle_60x+0x34> beq- cr7,008e3c78 <cpu_idle_60x+0xa0>
008e3c10 <cpu_idle_60x+0x38> mfmsr r11
008e3c14 <cpu_idle_60x+0x3c> mfsprg r0,7
008e3c18 <cpu_idle_60x+0x40> rlwinm r9,r0,16,16,31
008e3c1c <cpu_idle_60x+0x44> cmpwi cr7,r9,68
008e3c20 <cpu_idle_60x+0x48> beq- cr7,008e3c4c <cpu_idle_60x+0x74>
008e3c24 <cpu_idle_60x+0x4c> cmplwi cr7,r9,68
008e3c28 <cpu_idle_60x+0x50> bgt- cr7,008e3c40 <cpu_idle_60x+0x68>
008e3c2c <cpu_idle_60x+0x54> cmpwi cr7,r9,57
008e3c30 <cpu_idle_60x+0x58> beq- cr7,008e3c4c <cpu_idle_60x+0x74>
008e3c34 <cpu_idle_60x+0x5c> cmpwi cr7,r9,60
008e3c38 <cpu_idle_60x+0x60> bne+ cr7,008e3c64 <cpu_idle_60x+0x8c>
008e3c3c <cpu_idle_60x+0x64> b 008e3c4c <cpu_idle_60x+0x74>
008e3c40 <cpu_idle_60x+0x68> addi r0,r9,-32768
008e3c44 <cpu_idle_60x+0x6c> cmplwi cr7,r0,4
008e3c48 <cpu_idle_60x+0x70> bgt- cr7,008e3c64 <cpu_idle_60x+0x8c>
008e3c4c <cpu_idle_60x+0x74> oris r0,r11,4
008e3c50 <cpu_idle_60x+0x78> dssall
008e3c54 <cpu_idle_60x+0x7c> sync
008e3c58 <cpu_idle_60x+0x80> mtmsr r0
008e3c5c <cpu_idle_60x+0x84> isync
008e3c60 <cpu_idle_60x+0x88> b 008e3c78 <cpu_idle_60x+0xa0>
So that is the context for the EXI trap.
The "mtsmr r0" merged in PSL_POW to the
msr value when it was originally not
set. (r11 vs. r0 value comparison in
the trapframe.)
Why it would be going from without
POW for so long and then merging in
POW I do not know. (Or may be I just
did not find code that turns POW
back off on occasion.)
Still setting POW in msr seems to be
what is unique to the failure context.
Mkes me wonder if smu_doorbell_intr
and its irq is somehow involved.
There is no obvious, local tie to
r25 and r26 in/for the spots identified
earlier for the 013f0930 line of the
thread stack area.
Interestingly when I looked up POW in
mtsmsr use what I found reported:
Synchronization Required Prior: implementation dependent
Synchronization Required After: implementation dependent
The source code looks like:
static void
cpu_idle_60x(sbintime_t sbt)
{
register_t msr;
uint16_t vers;
if (!powerpc_pow_enabled)
return;
msr = mfmsr();
vers = mfpvr() >> 16;
#ifdef AIM
switch (vers) {
case IBM970:
case IBM970FX:
case IBM970MP:
case MPC7447A:
case MPC7448:
case MPC7450:
case MPC7455:
case MPC7457:
__asm __volatile("\
dssall; sync; mtmsr %0; isync"
:: "r"(msr | PSL_POW));
break;
default:
powerpc_sync();
mtmsr(msr | PSL_POW);
isync();
break;
}
#endif
}
I've not yet figured out how to check
that the details are right here for
IBM970MP used via 32-bit FreeBSD.
===
Mark Millard
markmi at dsl-only.net
More information about the freebsd-ppc
mailing list