My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [current workaround]
Mark Millard
markmi at dsl-only.net
Sun Oct 19 21:12:45 UTC 2014
I should be explicit about some of what I've not done in my investigations...
I've not been brave enough to try recording the likes of, say, the slbtrap and kern_slbtrap execution paths (which do not use powerpc_interrupt) alongside the ofwcall history to see if/when they happen relative to ofwcall usage.
For example, I've never done anything explicit to force the memory address range that holds the simple history text to have invariant addressing across the contexts: (A) mmu-disabled, (B) mmu-enabled but virtual-memory-disabled, (C) mmu and virtual-memory active. Nor have I done anything explicit to force that memory range to stay in RAM so that it could not cause interrupts itself.
When I had powerpc_interrupt recording in the history I greatly limited what I did with such a kernel build. (I keep a separate /boot/good/kernel around that is not replaced by make -j 8 kernel experiments.) I was really only interested in the early boot information and what I did was sufficient for that.
So, for example, I've not eliminated slbtrap exceptions as happening in the problem context, at least not by what can be known from the recorded history. But looking at that code, it is far from obvious how it would contribute to the %r1/%r3 corruption (%r1=%r3=%r2+0x400, with %r2 having the before-openfirmware value that is preserved into the after-openfirmware context).
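To make the corruption signature concrete, here is a minimal C sketch of a predicate for it. The function name and the idea of passing the three register values as plain integers are hypothetical and purely for illustration; nothing like this exists in the patch below, and the sample values in the demo are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when the after-openfirmware register values
 * match the observed pattern %r1 == %r3 == %r2 + 0x400.
 */
static bool matches_observed_corruption(uint64_t r1, uint64_t r2, uint64_t r3)
{
    return r1 == r3 && r1 == r2 + 0x400u;
}

/* Illustrative values only; they are not taken from a real boot. */
static void corruption_signature_demo(void)
{
    assert(matches_observed_corruption(0x1400u, 0x1000u, 0x1400u));
    assert(!matches_observed_corruption(0x2000u, 0x1000u, 0x1400u));
    assert(!matches_observed_corruption(0x1400u, 0x1000u, 0u));
}
```

A check like this only recognizes the one pattern seen so far; any other net change of %r1 would still need the simpler before/after comparison.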
And my earlier reference to alternate code being safer is tied to not depending on the garbage access via the corrupted %r1 to cause a panic/hang in order to avoid processing other possibly garbage bits. It is true that I mixed that with trying to continue instead of forcing things to stop. There are contexts where detecting either the %r1 or the %r3 corruption should lead to things stopping for fear of what else might be messed up. I've not checked all the code to know whether ofwcall returning -1 in %r3 would always stop or not.
===
Mark Millard
markmi at dsl-only.net
On Oct 19, 2014, at 12:43 AM, Mark Millard <markmi at dsl-only.net> wrote:
Short of extracting and analyzing the openfirmware code and its behavior directly I've run out of ideas for investigation of the %r1 and %r3 corruptions during openfirmware calls on the PowerMac G5's.
So my next investigative direction will probably be to hack %r1 and %r3 validation into the powerpc/GENERIC ofwcall 32-bit code and have it report if it finds anything odd. This may take a while for me to get to, and some more time to conclude that nothing is being found if nothing is found.
I believe that, given the known problems and observed %r1 and %r3 corruptions, the FreeBSD ofwcall code for powerpc64 on PowerMacs would be safer if ofwcall were changed to have the following properties (at least on/for powerpc64 PowerMacs):
A) Check whether %r3 ends up being neither 0 nor -1 and, if so, change it to -1 for what is returned overall. In other words: do not presume that information returned in other ways (fields of the struct pointed to by the argument) is okay unless the openfirmware status returned in %r3 is exactly zero. Otherwise have ofwcall return the openfirmware error indicator (-1).
[Do all openfirmware implementations have the one's complement Boolean style return values (0 vs. -1) that PowerMac G5's seem to have? If not, the check in (A) would fail to be very general.]
B) Similarly, check whether %r1 had a net change (a corruption) and, if so, restore the known/recorded before-value and set %r3 to -1 so that the code calling ofwcall gets a failure status.
C) Possibly have one automatic retry of the openfirmware call if (A) or (B) type problems happen before having such a failure (-1) return. Re-setup %r1 and %r3 first for such a retry if such is attempted. Handle retry-failure as in (A) and (B) above.
[This comes from my investigation only finding one-time-failures in the sequence of ofwcall's: after a failure later calls from the same boot sequence and until shutdown worked without observed corruptions of %r1 or %r3.]
D) As paranoia for now: have a general bias toward not depending on most registers being preserved across the openfirmware call, since bad register values are part of the observed problem. Probably be biased to mostly use the registers that ofwcall already explicitly saves and restores (non-volatile registers that openfirmware should also explicitly save and restore), but use separate storage to save and then recover values across any calls into openfirmware.
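As a rough illustration of how (A) through (C) fit together, here is a small C model of the control flow. It is only a sketch under stated assumptions: struct ofw_regs, ofw_entry_fn, ofw_call_checked and the two fake firmware functions are hypothetical names invented for this example, and the real change has to live in the assembly of ofwcall64.S because it manipulates %r1 directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two registers of interest across the call. */
struct ofw_regs {
    uint32_t r1;    /* openfirmware-context stack pointer */
    uint32_t r3;    /* in: argument pointer, out: status (0 or -1) */
};

/* Hypothetical stand-in for the branch into openfirmware. */
typedef void (*ofw_entry_fn)(struct ofw_regs *regs);

/* (A): only 0 and -1 (all ones) are accepted as status values. */
static int ofw_status_plausible(uint32_t r3)
{
    return r3 == 0u || r3 == UINT32_MAX;
}

/* (A)+(B)+(C): validate %r1 and %r3, retry once, otherwise report -1. */
static int32_t ofw_call_checked(ofw_entry_fn entry, uint32_t r1_before,
    uint32_t r3_arg)
{
    assert(entry != NULL);
    for (int attempt = 0; attempt < 2; attempt++) {
        /* Re-setup %r1 and %r3 before every attempt. */
        struct ofw_regs regs = { r1_before, r3_arg };
        entry(&regs);
        if (regs.r1 == r1_before && ofw_status_plausible(regs.r3))
            return regs.r3 == 0u ? 0 : -1;
    }
    return -1;  /* corruption on both attempts: report failure */
}

/* Fake firmware that corrupts both registers on its first call only. */
static int fake_calls;
static void fake_ofw_corrupts_once(struct ofw_regs *regs)
{
    if (fake_calls++ == 0) {
        regs->r1 ^= 0x400u;
        regs->r3 = regs->r1;
        return;
    }
    regs->r3 = 0u;
}

/* Fake firmware that corrupts both registers on every call. */
static void fake_ofw_always_corrupts(struct ofw_regs *regs)
{
    regs->r1 ^= 0x400u;
    regs->r3 = regs->r1;
}

static void ofw_retry_demo(void)
{
    fake_calls = 0;
    assert(ofw_call_checked(fake_ofw_corrupts_once, 0x1000u, 0x2000u) == 0);
    assert(fake_calls == 2);
    assert(ofw_call_checked(fake_ofw_always_corrupts, 0x1000u, 0x2000u) == -1);
}
```

The single retry mirrors the one-time failures noted in (C); a firmware that misbehaves on both attempts simply surfaces as the usual -1 failure to the caller.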
However, such changes would mean that such PowerMac builds would not be generic FreeBSD code unless such things were tolerable for the other powerpc64 contexts that use ofwcall from ofwcall64.S.
My code for this below certainly qualifies as a personal hack based on information specific to PowerMac G5's. I have also left in place the early restore of the FreeBSD sprg0 value that allowed the original exception to have a proper value to use during my investigations. (Those specific exceptions should no longer be possible in my code.) ofw_sprg0_save is accessible and used from both ofw_machdep.c and ofwcall64.S because I left this paranoia item in place.
I also have DDB/GDB option additions in GENERIC64 and ddb hacks such that early crashes tend to "bt; show registers" before hanging. (There is also the PS3 disable and the addition of sc.)
My context is still 10.1-RC1 based. /etc/make.conf has WITH_DEBUG_FILES= , WITHOUT_CLANG= , WITH_DEBUG= , and WORKDIRPREFIX assigned. I tend to have verbose_loading="YES" in /boot/loader.conf . kern.vty depends on which video hardware is involved. Panic dumps are effectively disabled because they attempt larger dma transfers than are actually supported; the size mismatch ends up reported instead.
root@FBSDG5M1:/usr/home/markmi # svnlite diff /usr/src/sys/
Index: /usr/src/sys/ddb/db_main.c
===================================================================
--- /usr/src/sys/ddb/db_main.c (revision 272558)
+++ /usr/src/sys/ddb/db_main.c (working copy)
@@ -46,6 +46,9 @@
#include <ddb/db_command.h>
#include <ddb/db_sym.h>
+/* HACK: part of dealing with lack of input for early boot time */
+#include <ddb/db_output.h>
+
SYSCTL_NODE(_debug, OID_AUTO, ddb, CTLFLAG_RW, 0, "DDB settings");
static dbbe_init_f db_init;
@@ -210,6 +213,9 @@
watchpt = IS_WATCHPOINT_TRAP(type, code);
if (db_stop_at_pc(&bkpt)) {
+ /* HACK: part of early boot handling: no input possible */
+ db_disable_pager();
+
if (db_inst_count) {
db_printf("After %d instructions (%d loads, %d stores),\n",
db_inst_count, db_load_count, db_store_count);
Index: /usr/src/sys/ddb/db_script.c
===================================================================
--- /usr/src/sys/ddb/db_script.c (revision 272558)
+++ /usr/src/sys/ddb/db_script.c (working copy)
@@ -319,10 +319,25 @@
{
char scriptname[DB_MAXSCRIPTNAME];
+ /* HACK!!! : Additional lines to force a basic default script to exist.
+ * Will dump information even if ddb input is not available for early crash.
+ * Used to get more information about PowerMac G5 "before Copyright" hangs.
+ */
+ struct ddb_script *dsp = db_script_lookup(DB_SCRIPT_KDBENTER_DEFAULT);
+ if (!dsp) db_script_set(DB_SCRIPT_KDBENTER_DEFAULT, "bt; show registers");
+
snprintf(scriptname, sizeof(scriptname), "%s.%s",
DB_SCRIPT_KDBENTER_PREFIX, eventname);
if (db_script_exec(scriptname, 0) == ENOENT)
(void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
+
+ /* HACK!!! : Additional lines to always use the default script,
+ * even if scriptname existed and was executed.
+ * Will dump information even if ddb input is not available for early crash.
+ * Used to get more information about PowerMac G5 "before Copyright" hangs.
+ */
+ else
+ (void)db_script_exec(DB_SCRIPT_KDBENTER_DEFAULT, 0);
}
/*-
Index: /usr/src/sys/powerpc/conf/GENERIC64
===================================================================
--- /usr/src/sys/powerpc/conf/GENERIC64 (revision 272558)
+++ /usr/src/sys/powerpc/conf/GENERIC64 (working copy)
@@ -28,7 +28,7 @@
# Platform support
options POWERMAC #NewWorld Apple PowerMacs
-options PS3 #Sony Playstation 3
+#options PS3 #Sony Playstation 3 HACK!!! to allow sc
options MAMBO #IBM Mambo Full System Simulator
options PSERIES #PAPR-compliant systems (e.g. IBM p)
@@ -76,6 +76,12 @@
# Debugging support. Always need this:
options KDB # Enable kernel debugger support.
options KDB_TRACE # Print a stack trace for a panic.
+options DDB # HACK!!! to dump early crash info
+options GDB # HACK!!! ...
+#options KTR
+#options KTR_MASK=KTR_TRAP
+#options KTR_CPUMASK=0xF
+#options KTR_VERBOSE
# Make an SMP-capable kernel by default
options SMP # Symmetric MultiProcessor Kernel
@@ -115,6 +121,14 @@
device vt # Core console driver
device kbdmux
+# HACK!!! to allow sc for 2560x1440 display on Radeon X1950 that vt mishandled
+# syscons is a console driver, resembling an SCO console
+device sc
+#device kbdmux # HACK: already listed by vt
+options SC_OFWFB # OFW frame buffer
+options SC_DFLT_FONT # compile font in
+makeoptions SC_DFLT_FONT=cp437
+
# Serial (COM) ports
device scc
device uart
Index: /usr/src/sys/powerpc/ofw/ofw_machdep.c
===================================================================
--- /usr/src/sys/powerpc/ofw/ofw_machdep.c (revision 272558)
+++ /usr/src/sys/powerpc/ofw/ofw_machdep.c (working copy)
@@ -94,6 +94,11 @@
/*
* Saved SPRG0-3 from OpenFirmware. Will be restored prior to the callback.
*/
+/* HACK: ofw_sprg0_save storage defined in ofwcall
+ * for use in very early FreeBSD sprg0 restore
+ * as part of ready-for-possible-exception paranoia.
+ */
+extern
register_t ofw_sprg0_save;
static __inline void
Index: /usr/src/sys/powerpc/ofw/ofwcall64.S
===================================================================
--- /usr/src/sys/powerpc/ofw/ofwcall64.S (revision 272558)
+++ /usr/src/sys/powerpc/ofw/ofwcall64.S (working copy)
@@ -52,6 +52,20 @@
GLOBAL(rtas_entry)
.llong 0 /* RTAS entry point */
+ /* HACK: part of dealing with openfirmware %r1, %r3 corruptions */
+ofw_entry_addr: /* accessed under ofw msr */
+ .space 4
+ofw_r1_for_retry: /* accessed under ofw msr */
+ .space 4
+ofw_r3_for_retry: /* accessed under ofw msr */
+ .space 4
+
+ /* HACK: part of having FreeBSD sprg0 in place for potential exceptions */
+ofwsprg0save: /* accessed under ofw msr */
+ .space 8 /* sizeof(register_t) */
+GLOBAL(ofw_sprg0_save) /* accessed under FreeBSD msr */
+ .llong 0
+
/*
* Open Firmware Real-mode Entry Point. This is a huge pain.
*/
@@ -90,50 +104,121 @@
std %r30,192(%r1)
std %r31,200(%r1)
+ /* HACK: Avoid depending much on preserved registers
+ * and be biased to use the ones saved above
+ */
+
/* Record the old MSR */
- mfmsr %r6
+ mfmsr %r14
/* read client interface handler */
- lis %r4,openfirmware_entry@ha
- ld %r4,openfirmware_entry@l(%r4)
+ lis %r15,openfirmware_entry@ha
+ ld %r15,openfirmware_entry@l(%r15)
+ /* HACK: part of having FreeBSD's sprg0 in place for exceptions.
+ * Paranoia code at this point since corrupted %r1 values are
+ * avoided by forcing the before-openfirmware value.
+ */
+ lis %r16,ofw_sprg0_save@ha
+ ld %r16,ofw_sprg0_save@l(%r16)
+
/*
* Set the MSR to the OF value. This has the side effect of disabling
* exceptions, which is important for the next few steps.
+ * NOTE: The call chain may well have already disabled such in FreeBSD's
+ * msr.
*/
- lis %r5,ofmsr@ha
- ld %r5,ofmsr@l(%r5)
- mtmsrd %r5
+ lis %r17,ofmsr@ha
+ ld %r17,ofmsr@l(%r17)
+ mtmsrd %r17
isync
/*
* Set up OF stack. This needs to be accessible in real mode and
* use the 32-bit ABI stack frame format. The pointer to the current
- * kernel stack is placed at the very top of the stack along with
- * the old MSR so we can get them back later.
+ * kernel stack is placed below the effective ofw-stack along with the
+ * active FreeBSD TOC and FreeBSD MSR so we can get them back later.
*/
- mr %r5,%r1
+ mr %r18,%r1
lis %r1,(ofwstk+OFWSTKSZ-32)@ha
addi %r1,%r1,(ofwstk+OFWSTKSZ-32)@l
- std %r5,8(%r1) /* Save real stack pointer */
- std %r2,16(%r1) /* Save old TOC */
- std %r6,24(%r1) /* Save old MSR */
- li %r5,0
- stw %r5,4(%r1)
- stw %r5,0(%r1)
+ std %r18,8(%r1) /* Save FreeBSD stack pointer */
+ std %r2,16(%r1) /* Save FreeBSD TOC */
+ std %r14,24(%r1) /* Save FreeBSD MSR */
+ li %r19,0
+ stw %r19,4(%r1)
+ stw %r19,0(%r1)
+ /* HACK: Avoid depending much on preserved registers */
+
+ /* HACK: recording openfirmware entry address for use in possible retry */
+ lis %r20,ofw_entry_addr@ha
+ stw %r15,ofw_entry_addr@l(%r20)
+
+ /* HACK: recording %r1 before openfirmware for use in possible retry
+ * and also for testing for corruption (net-change)
+ */
+ lis %r21,ofw_r1_for_retry@ha
+ stw %r1,ofw_r1_for_retry@l(%r21)
+
+ /* HACK: recording %r3 before openfirmware for use in possible retry */
+ lis %r22,ofw_r3_for_retry@ha
+ stw %r3,ofw_r3_for_retry@l(%r22)
+
+ /* HACK: part of having FreeBSD's sprg0 in place for exceptions.
+ * Paranoia code at this point since corrupted %r1 values are
+ * avoided by forcing the before-openfirmware value.
+ */
+ lis %r23,ofwsprg0save@ha
+ std %r16,ofwsprg0save@l(%r23)
+
/* Finally, branch to OF */
- mtctr %r4
+ mtctr %r15
bctrl
- /* Reload stack pointer and MSR from the OFW stack */
- ld %r6,24(%r1)
+ /* HACK: check if %r1 was corrupted (had a net-change) */
+ lis %r21,ofw_r1_for_retry@ha
+ lwz %r24,ofw_r1_for_retry@l(%r21)
+ cmpw %r24,%r1
+ bne 2f /* stack pointer corrupted so go retry once */
+
+ /* HACK: %r1 okay but check %r3 for being 0 or -1 vs. anything else */
+ xoris %r25,%r3,0
+ cmpw %r25,%r3
+ bne 2f /* %r3 was neither 0 nor -1 so corruption: go retry once */
+
+ /* HACK: here both %r1 and %r3 appeared to be okay:
+ * so sequential flow was for "no problems"
+ */
+
+1: /* HACK status: continue/return from whatever status,
+ * trying to get back cleanly to the FreeBSD context
+ */
+
+ /* HACK: part of having FreeBSD's sprg0 in place for any exception
+ * during return.
+ * Paranoia code at this point since corrupted %r1 values are
+ * avoided by forcing the before-openfirmware value.
+ * NOTE: Calling code also deals with this but too late for the
+ * original exceptions after openfirmware returned to this code.
+ */
+ lis %r23,ofwsprg0save@ha
+ ld %r16,ofwsprg0save@l(%r23)
+ mtsprg0 %r16
+
+ /* Reload FreeBSD stack pointer and MSR
+ * from the bottom of the (i.e., below the effective) OFW stack
+ *
+ * HACK note: %r1 may have been forced to the before-openfirmware value
+ * (to avoid garbage results and the resulting exceptions)
+ */
+ ld %r26,24(%r1)
ld %r2,16(%r1)
ld %r1,8(%r1)
- /* Now set the real MSR */
- mtmsrd %r6
+ /* Now set the FreeBSD MSR */
+ mtmsrd %r26
isync
/* Sign-extend the return value from OF */
@@ -168,6 +253,43 @@
mtlr %r0
blr
+/* HACK: code for %r1 and/or %r3 corruption's single-retry */
+/* Still under openfirmware's msr, sprg0, stack values */
+
+2: /* HACK: corruption observed so retry, restoring %r1 and %r3 first */
+ lis %r20,ofw_entry_addr@ha
+ lwz %r15,ofw_entry_addr@l(%r20)
+ lis %r21,ofw_r1_for_retry@ha
+ lwz %r1,ofw_r1_for_retry@l(%r21)
+ lis %r22,ofw_r3_for_retry@ha
+ lwz %r3,ofw_r3_for_retry@l(%r22)
+ mtctr %r15
+ bctrl
+
+ /* HACK: check if %r1 was corrupted (had a net-change) */
+ lis %r21,ofw_r1_for_retry@ha
+ lwz %r24,ofw_r1_for_retry@l(%r21)
+ cmpw %r24,%r1
+ bne 3f /* retry corrupted %r1
+ * so go give up with %r3 being -1 and %r1 forced-good
+ */
+
+ /* HACK: %r1 okay but check %r3 for being 0 or -1 vs. anything else */
+ xoris %r25,%r3,0
+ cmpw %r25,%r3
+ beq 1b /* %r3 also was 0 or -1 so no corruption observed on retry
+ * so go do a normal return
+ */
+
+3: /* Either %r1 had a net change after retry
+ * or %r3 was not one of 0,-1 after retry
+ * so force %r1 and have %r3 be -1 then go return
+ */
+ lis %r21,ofw_r1_for_retry@ha
+ lwz %r1,ofw_r1_for_retry@l(%r21)
+ li %r3,-1 /* the openfirmware failure return value */
+ b 1b
+
/*
* RTAS 32-bit Entry Point. Similar to the OF one, but simpler (no separate
* stack)
===
Mark Millard
markmi at dsl-only.net