lr=u_trap+0x10 and srr0=k_trap+0x28 for "stopped at 0 illegal instruction 0" before-copyright hang on PowerMac G5's
Mark Millard
markmi at dsl-only.net
Sat Sep 27 05:18:45 UTC 2014
The first send of this was big enough for the moderator to be involved. So I canceled and am sending with less history included.
[I'll note that I seem to have trouble typing 0xdbb290 vs. 0xbdd290. The actual value is 0xdbb290. The references to the incorrect typing should say 0xbdd290, which is the wrong value. But I've had both types of references listing the wrong text... in various notes.]
===
Mark Millard
markmi at dsl-only.net
On Sep 26, 2014, at 10:11 PM, Mark Millard <markmi at dsl-only.net> wrote:
The openfirmware peer crash (i.e., the before Copyright notice crash) happens during/just-after the MMU setup and the peer pfwcall is the first ofwcall where pmap_bootstrapped is non-zero at the time. In other words: the very first ofwcall in the new context fails.
And this failure involves some of the same code area that I got a backtrace for and reported as a separate crash (with the trace listed). As a reminder for that backtrace that has a difference failure point:
.pvo_vaddr_compare+0x14, instruction ld r0, r4, 0x58 [or ld r0,88(r4) in an alternate notation]
.pvo_tree_RB_FIND+0x38
.moea64_dev_direct_mapped_0x90
.pmap_dev_direct_mapped+0x84 ("_dev" was missing in earlier note)
.bs_remap_earlyboot_0x6c
.moea64_late_bootstrap+0x178
.moea64_bootstrap_native+0x120
.pmap_bootstrap+0xac
.powerpc_init+0x514
btext+0xa8
As for the sequence of ofwcall's that I reported: starting at the last OF_finddevice before the OF_instance_to_package that I reported in the sequence of ofwcall's from quiesce until the crash...
moea64_late_bootstrap does
chosen = OF_finddevice("/chosen");
if (chosen != -1 && OF_getprop(chosen, "mmu", &mmui, 4) != -1) {
mmu = OF_instance_to_package(mmui);
if (mmu == -1 || (sz = OF_getproplen(mmu, "translations")) == -1)
sz = 0;
if (sz > 6144 /* tmpstksz - 2 KB headroom */)
panic("moea64_bootstrap: too many ofw translations");
if (sz > 0)
moea64_add_ofw_mappings(mmup, mmu, sz);
}
with moea64_add_ofw_mappings called. Then...
moea64_add_ofw_mappings does...
bzero(translations, sz);
OF_getprop(OF_finddevice("/"), "#address-cells", &acells,
sizeof(acells));
if (OF_getprop(mmu, "translations", trans_cells, sz) == -1)
panic("moea64_bootstrap: can't get ofw translations");
And it is the next ofwcall after that last OF_getprop that fails. (It happens to be a peer request.) Adding a dump of the pmap_bootstrapped value with the ofwcall name in my hack for reporting things about the crash confirmed that peer ofwcall as the first with pmap_bootstrapped non-zero.
I will note here that it is somewhat later than the above code that pvo_vaddr_compare ends up executing via bs_remap_earlyboot. That earlier moea64_late_bootstrap code continues after the } from the first if above with:
/*
* Calculate the last available physical address.
*/
for (i = 0; phys_avail[i + 2] != 0; i += 2)
;
Maxmem = powerpc_btop(phys_avail[i + 1]);
/*
* Initialize MMU and remap early physical mappings
*/
MMU_CPU_BOOTSTRAP(mmup,0);
mtmsr(mfmsr() | PSL_DR | PSL_IR);
pmap_bootstrapped++;
bs_remap_earlyboot();
(and more). I've not found the peer call yet but it may well be after the pvo_vaddr_compare shown above as far as execution order goes.
===
Mark Millard
markmi at dsl-only.net
On Sep 25, 2014, at 2:41 PM, Mark Millard <markmi at dsl-only.net> wrote:
The first boot after make -8 kernel without quiesce also died during peer, I'd guess the same one.
Looks like quiesce does not matter for the issue. (But it is handy for identifying which peer fails.)
===
Mark Millard
markmi at dsl-only.net
On Sep 25, 2014, at 2:08 PM, Nathan Whitehorn <nwhitehorn at freebsd.org> wrote:
Can you comment out the call to quiesce? It may not be necessary on your system.
-Nathan
On 09/25/14 13:17, Mark Millard wrote:
> The "before copyright" hang/exception is during the first openfirmware "peer" after "quiesce". The ofw_restore_trap_vec(save_trap_init) completes fine, the ofwcall(args) is made but it does not return normally.
>
> Ignoring the ofwcall's from before quiesce, the sequence of ofwcall's is:
>
> quiesce
> finddevice
> parent
> getprop
> getprop
> getprop
> finddevice
> getprop
> instance-to-package
> getproplen
> finddevice
> getprop
> getprop
> peer
>
> And when the boot fails before the copyright that ofwcall for peer ends up resulting in the register dump with no register pointing to the kernel's normal stack area.
>
> I still have no clue what is happening during peer. ofw_restore_trap_vec(save_trap_init) is being called and is returning before ofwcall is used. For all I know some uses of peer could require not being quiesce'd in order for peer to be reliable.
>
> In the form of my display indicating what executed the text reported ends in:
>
> <peer>^
>
> where the ^ indicates the stage that last completed in the call sequence inside openfirmware_core. This information is displayed by the
>
> x/s ofw_name_history
>
> in the automatically created default script for DDB. I read the sequence backwards from the end marker (here ^), following the wraparound if there is that much text and if I care to go back that far.
>
> FreeBSD FBSDG5M1 10.1-BETA2 FreeBSD 10.1-BETA2 #11 r271944M: Thu Sep 25 12:14:05 PDT 2014 root at FBSDG5M1:/usr/obj/usr/src/sys/GENERIC64 powerpc
>
> My current hacks to get this information are:
>
> Index: /usr/src/sys/ddb/db_script.c
> ===================================================================
> --- /usr/src/sys/ddb/db_script.c (revision 271944)
> +++ /usr/src/sys/ddb/db_script.c (working copy)
> @@ -319,10 +319,25 @@
> {
> char scriptname[DB_MAXSCRIPTNAME];
>
> + /* HACK!!! : Additional lines to force a basic default script to exist.
> + * Will dump information even if ddb input is not available for early crash.
> + * Used to get more information about PowerMac G5 "before Copyright" hangs.
> + */
> + struct ddb_script *dsp = db_script_lookup(DB_SCRIPT_KDBENTER_DEFAULT);
> + if (!dsp) db_script_set(DB_SCRIPT_KDBENTER_DEFAULT, "show registers; bt; x/s ofw_name_history");
> +
> snprintf(scriptname, sizeof(scriptname), "%s.%s",
> DB_SCRIPT_KDBENTER_PREFIX, eventname);
> if (db_script_exec(scriptname, 0) == ENOENT)
> (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
> +
> + /* HACK!!! : Additional lines to always use the default script,
> + * even if scriptname existed and was executed.
> + * Will dump information even if ddb input is not available for early crash.
> + * Used to get more information about PowerMac G5 "before Copyright" hangs.
> + */
> + else
> + (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
> }
>
> /*-
> Index: /usr/src/sys/powerpc/conf/GENERIC64
> ===================================================================
> --- /usr/src/sys/powerpc/conf/GENERIC64 (revision 271944)
> +++ /usr/src/sys/powerpc/conf/GENERIC64 (working copy)
> @@ -76,6 +76,8 @@
> # Debugging support. Always need this:
> options KDB # Enable kernel debugger support.
> options KDB_TRACE # Print a stack trace for a panic.
> +options DDB
> +options GDB
>
> # Make an SMP-capable kernel by default
> options SMP # Symmetric MultiProcessor Kernel
> Index: /usr/src/sys/powerpc/ofw/ofw_machdep.c
> ===================================================================
> --- /usr/src/sys/powerpc/ofw/ofw_machdep.c (revision 271944)
> +++ /usr/src/sys/powerpc/ofw/ofw_machdep.c (working copy)
> @@ -324,6 +324,12 @@
> openfirmware(&args);
> }
>
> +/* Part of HACK to have record of ofw call names */
> +#define ofw_name_history_record_size 256
> +char ofw_name_history[ofw_name_history_record_size+1] = {}; /* Initially: automatically '\0' filled */
> +char * ofw_name_history_pos = ofw_name_history;
> +/* End Part of HACK */
> +
> static int
> openfirmware_core(void *args)
> {
> @@ -330,6 +336,42 @@
> int result;
> register_t oldmsr;
>
> + { /* HACK to have record of ofw call names */
> + struct argtype_prefix {
> + cell_t name;
> + };
> +
> + char *name = (char*) (uintptr_t) (((struct argtype_prefix*)args)->name);
> +
> + int i;
> +
> + *ofw_name_history_pos = '<';
> +
> + for(i=0; (*name) && i!=20; i++) {
> + ofw_name_history_pos++;
> + if (ofw_name_history_pos == &ofw_name_history[ofw_name_history_record_size]) {
> + ofw_name_history_pos = ofw_name_history;
> + }
> + *ofw_name_history_pos = *name;
> +
> + name++;
> + }
> +
> + ofw_name_history_pos++;
> + if (ofw_name_history_pos == &ofw_name_history[ofw_name_history_record_size]) {
> + ofw_name_history_pos = ofw_name_history;
> + }
> + *ofw_name_history_pos = '>';
> +
> + ofw_name_history_pos++;
> + if (ofw_name_history_pos == &ofw_name_history[ofw_name_history_record_size]) {
> + ofw_name_history_pos = ofw_name_history;
> + }
> + *ofw_name_history_pos = '@';
> +
> + ofw_name_history[ofw_name_history_record_size] = '\0'; /* Paranoia */
> + } /* HACK end */
> +
> /*
> * Turn off exceptions - we really don't want to end up
> * anywhere unexpected with PCPU set to something strange
> @@ -337,14 +379,22 @@
> */
> oldmsr = intr_disable();
>
> + *ofw_name_history_pos = '#'; /* HACK */
> +
> ofw_sprg_prepare();
>
> + *ofw_name_history_pos = '$'; /* HACK */
> +
> /* Save trap vectors */
> ofw_save_trap_vec(save_trap_of);
>
> + *ofw_name_history_pos = '%'; /* HACK */
> +
> /* Restore initially saved trap vectors */
> ofw_restore_trap_vec(save_trap_init);
>
> + *ofw_name_history_pos = '^'; /* HACK */
> +
> #if defined(AIM) && !defined(__powerpc64__)
> /*
> * Clear battable[] translations
> @@ -357,13 +407,21 @@
>
> result = ofwcall(args);
>
> + *ofw_name_history_pos = '&'; /* HACK */
> +
> /* Restore trap vecotrs */
> ofw_restore_trap_vec(save_trap_of);
>
> + *ofw_name_history_pos = '*'; /* HACK */
> +
> ofw_sprg_restore();
>
> + *ofw_name_history_pos = '~'; /* HACK */
> +
> intr_restore(oldmsr);
>
> + *ofw_name_history_pos = '!'; /* HACK */
> +
> return (result);
> }
>
>
>
>
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
> On Sep 25, 2014, at 3:46 AM, Mark Millard <markmi at dsl-only.net> wrote:
>
> One source code oddity that I notice is the following mixed use of ofw_real_mode: always tested vs. never tested (#if 0 ... #endif) ...
>
>> /*
>> * Saved SPRG0-3 from OpenFirmware. Will be restored prior to the callback.
>> */
>> register_t ofw_sprg0_save;
>>
>> static __inline void
>> ofw_sprg_prepare(void)
>> {
>> if (ofw_real_mode)
>> return;
>>
>> /*
>> * Assume that interrupt are disabled at this point, or
>> * SPRG1-3 could be trashed
>> */
>> __asm __volatile("mfsprg0 %0\n\t"
>> "mtsprg0 %1\n\t"
>> "mtsprg1 %2\n\t"
>> "mtsprg2 %3\n\t"
>> "mtsprg3 %4\n\t"
>> : "=&r"(ofw_sprg0_save)
>> : "r"(ofmsr[1]),
>> "r"(ofmsr[2]),
>> "r"(ofmsr[3]),
>> "r"(ofmsr[4]));
>> }
>>
>> static __inline void
>> ofw_sprg_restore(void)
>> {
>> #if 0
>> if (ofw_real_mode)
>> return;
>> #endif
>>
>> /*
>> * Note that SPRG1-3 contents are irrelevant. They are scratch
>> * registers used in the early portion of trap handling when
>> * interrupts are disabled.
>> *
>> * PCPU data cannot be used until this routine is called !
>> */
>> __asm __volatile("mtsprg0 %0" :: "r"(ofw_sprg0_save));
>> }
>
> It would seem that for ofw_real_mode != 0 that ofw_sprg_prepare would never set up ofw_sprg0_save (via mfsprg0) for the later ofw_sprg_restore's always-executed mtsprg0 that is based on ofw_sprg0_save.
>
> register_t seems to trace back to __int64_t --and that would leave ofw_sprg0_save initialized to zero as a global and that would have to be okay as the SPRG0 value to restore in such a case. (I have not tracked down what any of the per-processor values for SPRG0 are/should-be.)
>
>
>
>
>
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
> On Sep 25, 2014, at 2:12 AM, Mark Millard <markmi at dsl-only.net> wrote:
>
> The register dump that has no kernel stack addresses in any registers does have register contents suggesting a ofwcall use, matching up reasonably with the code I looked at that is related to ofwcall. ofwcall is only reached via openfirmware_core from what I can tell. (If there are other paths into openfirmware than via ofwcall then the register dump suggests that they are not in use around the crash.)
>
> And openfirmware_core has logic for exception vector swapping, going both directions:
>
>> static int
>> openfirmware_core(void *args)
>> {
>> int result;
>> register_t oldmsr;
>>
>> /*
>> * Turn off exceptions - we really don't want to end up
>> * anywhere unexpected with PCPU set to something strange
>> * or the stack pointer wrong.
>> */
>> oldmsr = intr_disable();
>>
>> ofw_sprg_prepare();
>>
>> /* Save trap vectors */
>> ofw_save_trap_vec(save_trap_of);
>>
>> /* Restore initially saved trap vectors */
>> ofw_restore_trap_vec(save_trap_init);
>>
>> #if defined(AIM) && !defined(__powerpc64__)
>> /*
>> * Clear battable[] translations
>> */
>> if (!(cpu_features & PPC_FEATURE_64))
>> __asm __volatile("mtdbatu 2, %0\n"
>> "mtdbatu 3, %0" : : "r" (0));
>> isync();
>> #endif
>>
>> result = ofwcall(args);
>>
>> /* Restore trap vecotrs */
>> ofw_restore_trap_vec(save_trap_of);
>>
>> ofw_sprg_restore();
>>
>> intr_restore(oldmsr);
>>
>> return (result);
>> }
>
> In turn openfirmware_core is used only by ofw_rendezvous_dispatch and in turn that is used only by openfirmware: only PCPU_GET(cpuid) == 0 does the above. save_trap_init is initialized by powerpc_init using ofw_save_trap_vec.
>
> [Note that ofw_restore_trap_vec uses __syncicache which does not use dcbf after the bcopy but instead uses dcbst: That is part of what lead my investigation into the distinction --and so to my more overall dcbst vs. dcbf use questions after proving dcbf would not be sufficient for a fix to the specific boot issue.]
>
> Unless the initialization of save_trap_init ends up with the wrong contents for openfirmware it would appear that the exception vectors are kept tracking by the above code. But the above does assume that the openfirmware vectors are unchanged after save_trap_init is initialized: there is no attempt at tracking of any potential updates to the openfirmware exception vectors.
>
> I would infer then that after ofw_restore_trap_vec(save_trap_of) is executed is when the exception that DDB reports happened: That is when FreeBSD's exception vectors are again in place. But a stack pointer into the kernel stack is not then in place in any register (based on DDB's register dump): stack handling is messed up already by the point of the reported exception. And that may actually be why an illegal instruction at address zero was reached: an incorrect stack context used to get an address to execute at.
>
>
>
>
> ===
> Mark Millard
> markmi at dsl-only.net
>
> On Sep 24, 2014, at 8:36 AM, Nathan Whitehorn <nwhitehorn at freebsd.org> wrote:
>
> There shouldn't be any exceptions at that point, nested or otherwise. What I suspect is happening is that Open Firmware has turned them on for some bizarre reason, taken one, and ended up in the kernel's handlers but with the Open Firmware environment. Saving and restoring the OF interrupt vectors would be a possible solution; flattening the device tree in loader so that the kernel doesn't call Open Firmware at all would be another. I think Justin may have tried the first at some point.
> -Nathan
>
> On 09/24/14 02:04, Mark Millard wrote:
>> Now that I've had a kernel/boot crash with a successful DDB bt and show registers (a different submittal) it makes for a good comparison/contrast with what DDB reports for this "before copyright" crash.
>>
>> Something unique to the "before copyright" context is...
>>
>> No registers are reported to have values that point into the range between tmpstk and esym.
>>
>> In other words: There is no valid stack pointer reported as far as I can tell. r1 has the value 0 instead of being a handling a valid stack address. tmpstk=0xbd7000 and esym=0xbdb000 (example for one of my WITH_DEBUG_FILES= and options DDB and GDB builds of 10.1-BETA2). That at least gives a ball park on the range to expect for pointing into the stack even with some build variation.
>>
>> It leaves me wondering if the DDB report is for a nested exception handling. That could explain why lr points to u_trap+0x10 and srr0 points to k_trap+0x28 when normally srr0 would point to the the failing instruction (or the instruction after) and lr to where that routine would normally return to.
>>
>> The register values that are reported for my 10.1-BETA2 builds that crash before the copyright notice are:
>>
>> r0: 0
>> r1: 0
>> r2: 0xc81538 vop_unlock_desc
>> r3: 0xd18868
>> r4: 0x894b58
>> r5: 0
>> r6: 0xc1dee0 M_AUDITBSM
>> r7: 0xe3f818 ofw_real_mode
>> r8: 0x1
>> r9: 0xe0f580 __pcpu
>> r10: 0x1c35ec0
>> r11: 0
>> r12: 0x10000000
>> r13: 0xdbb290 thread0 (Note: another submittal has this mistyped as 0xdbb290.)
>> r14-r19: all 0
>> r20: 0x10c1000
>> r21: 0x4
>> r22: 0x180abd4
>> r23: 0x1803a28
>> r24: 0xc000000000008760
>> r25: 0xcc89b8 smp_no...
>> r26: 0xcea108 ofw_rend...
>> r27: 0x894b58 ofwcall+0xa8
>> r28: 0x894b58 ofwcall+0xa8
>> r29: 2400022
>> r30: 9000000000001032
>> r31: 0xbb7d38
>>
>> srr0: 0x102720 k_trap+0x28
>> srr1: 9000000000001032
>> lr: 0x1026f0 u_trap+0x10
>> ctr: 0xff846d78
>> cr: 2000deb0
>> xer: 0
>> dar: f...d50 (lots of f's)
>> dsisr: 42000000
>>
>>
>>
>>
>>
>>
>> ===
>> Mark Millard
>> markmi at dsl-only.net
More information about the freebsd-ppc
mailing list