My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [important typos fixed]

Mark Millard markmi at dsl-only.net
Tue Oct 14 22:19:00 UTC 2014


For openfirmware: is %r3 on return anything more than a failed vs. not-failed flag with a particular failed-value? Is there any way to validate that %r3 values for non-failure look reasonable vs. not looking reasonable? (For all I know %r3 could also be corrupt.)

I do not have any documentation for the PowerMac G5 openfirmware API that is in use or the associated ABI as far as I remember. I do not know if it strictly followed Darwin's/Mac OS X's ABI on PowerMac G5's vs. if there was some conversion going back and forth (as there is for FreeBSD, at least for powerpc64). For openfirmware I derive properties from what I see in FreeBSD's code (which has to be more explicit than when a compiler's code generation happens to match at least large parts of an ABI directly).

As I vaguely remember, Apple did not use the TOC for Darwin's/Mac OS X's ABI but FreeBSD does. If true I do not know what other differences there might be (even ignoring the 32 bit vs. 64 bit issues for the kernels). But the point would be an existence proof of at least one difference. My understanding is that %r1 was as in FreeBSD.

I vaguely seem to remember that for Darwin/Mac OS X some register was volatile in leaf functions but non-volatile otherwise, or at least when nested functions were involved. And that brings to mind that the condition code sets in cr might have had a mix of volatile and non-volatile status despite being in one register? Did Darwin/Mac OS X have something special for register usage for Thread-Specific Storage? Position Independent Code? Indirect Calls? Frame Pointers?

I may have some Darwin/Mac OS X information around but I doubt that it is complete, especially for the 64-bit ABI or for privileged contexts. For the 32-bit ABI (non-privileged) I likely have the information about the above possible ABI properties.

I assume that openfirmware avoids the FPU and other such facilities, but I do not know. But it is privileged code.
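
One hedged data point on the FPU question: the two MSR values quoted later in this message (the FreeBSD msr to restore, 0x9000000000001032, and ofmsr[0], 0x1000000000003030) can be decoded into named bits. The sketch below is only illustrative. The mask values are my understanding of the 64-bit PowerPC MSR layout and are an assumption to check against the architecture manual or FreeBSD's psl.h; the names MSR_BITS and decode_msr are made up for this example.

```python
# Assumed 64-bit PowerPC MSR bit masks (to be verified against real docs).
MSR_BITS = [
    ("SF", 1 << 63),   # 64-bit mode
    ("HV", 1 << 60),   # hypervisor state
    ("VEC", 1 << 25),  # vector unit available
    ("EE", 1 << 15),   # external interrupts enabled
    ("PR", 1 << 14),   # problem (user) state
    ("FP", 1 << 13),   # floating point available
    ("ME", 1 << 12),   # machine check enabled
    ("FE0", 1 << 11),  # FP exception mode 0
    ("SE", 1 << 10),   # single-step trace
    ("BE", 1 << 9),    # branch trace
    ("FE1", 1 << 8),   # FP exception mode 1
    ("IR", 1 << 5),    # instruction relocation
    ("DR", 1 << 4),    # data relocation
    ("RI", 1 << 1),    # recoverable interrupt
]


def decode_msr(value):
    """Return (names of set bits, leftover bits not covered by MSR_BITS)."""
    names = [name for name, mask in MSR_BITS if value & mask]
    known = 0
    for _, mask in MSR_BITS:
        known |= mask
    return names, value & ~known


if __name__ == "__main__":
    for label, msr in (("FreeBSD msr", 0x9000000000001032),
                       ("ofmsr[0]", 0x1000000000003030)):
        names, leftover = decode_msr(msr)
        print(f"{label}: {msr:#018x} -> {' '.join(names)} (other bits: {leftover:#x})")
```

Under those assumed masks the FreeBSD value decodes as SF HV ME IR DR RI and ofmsr[0] as HV FP ME IR DR. So FP appears to be enabled in the MSR used while openfirmware runs, and SF is clear (32-bit mode). That only says the FPU is available to the firmware, not that the firmware uses it.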

Are there any known sources of at least some of the information for the PowerMac G5 openfirmware ABI(s)? What are good references for the FreeBSD PowerPC ABI(s) (32 bit and 64 bit, privileged vs. not)?

[I cut off some of the older history.]

===
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 10:18 AM, Nathan Whitehorn <nwhitehorn at freebsd.org> wrote:

r1 *must be* preserved by the standard and for anything to work. It's being corrupted somehow (Mark's comment about r3 is illuminating), and if r1 is being corrupted, you can't rely on anything. I suspect it might be an exception handling issue since it's non-deterministic, but it's hard to tell. It could also be triggered by the way we've set up the OF stack frame. It would be good to check if that makes sense.
-Nathan

On 10/14/14 09:53, Justin Hibbits wrote:
> Interesting.  Perhaps, instead of using %r1, and relying purely on the
> stack we use yet another (non-volatile) register to hold the MSR.
> Once we reload the MSR we can get back the saved registers, because
> the stack will be valid again.
> 
> Nathan, thoughts?
> 
> - Justin
> 
> On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard <markmi at dsl-only.net> wrote:
>> Additional notes from additional experiments... (So far from one G5.)
>> 
>> I got back trace, show registers, and my openfirmware-history list going for failure reporting based on explicit before vs. after tests of %r1 values. (Explicit breakpoint call for unequal, being careful to save/restore %r3 around the call.) I filled several registers with potentially interesting values that would otherwise have had zero as a value (%r15-%r19, although %r15 is redundant with %r6 currently).
>> 
>> An interesting property resulted: every time %r1 had changed from having the before-value (stack pointer value) %r1 instead ended up with a value equal to what openfirmware put in %r3.
>> 
>> And more than that: For builds with the same ofwstk position the %r3 value involved was fixed for the failures, for example when 0x30400=ofwstk+0xfe0 (%r1 before) was reported %r3 and %r1 end up as 0xd23450 for the failures. When 0x31400=ofwstk+0xfe0: %r3 and %r1 ended up for failure as 0xd24450 instead. Yep: offset by the same amount as ofwstk.
>> 
>> And I got one example where the openfirmware %r1-value-change failure was instead much later in the boot, well after pmap_bootstrapped went true: It was just after the message lines...
>> 
>> vgapci0: Boot video device ...
>> pcib1: <IBM CPC9X5 Hypertransport tunnel> ...
>> 
>> with back trace (from OF_peer down):
>> 
>> .OF_peer+0x8c
>> .cpcht_attach+0x884
>> .device_attach+0x3ac
>> .device_probe_and_attach+0x3c
>> .bus_generic_new_pass+0x12c
>> .bus_generic_new_pass+0x114
>> .bus_generic_new_pass+0x114 (yep: listed twice)
>> .bus_set_pass+0xc0
>> .root_bus_configure+0x14
>> .mi_startup+0x10c
>> btext+0xbc
>> 
>> %r1 before: 0xc30400 ofwstk+0xfe0
>> %r1 after:  0xd23450
>> %r3 after:  0xd23450
>> FreeBSD msr to restore: 0x9000000000001032
>> ofmsr[0]  to restore:   0x1000000000003030
>> 
>> The same after-openfirmware %r1 and %r3 values that had been showing up for the before-copyright examples of ofwcall failures.
>> 
>> And note that it again was a peer request. All the ofwcall-tied boot-failures have been for peer requests as far as I remember.
>> 
>> I later did some experiments where I had it report but not stop when the after-value was different from the before-value for %r1. When this happened for these types of tests it seemed to be an isolated example: later calls normally have the stack pointer value still in %r1 after openfirmware returns. In more detail: At most one report was made for such a boot, the rest of the boot went fine. (Of course to get that far my hacked ofwcall code avoids using the after-openfirmware %r1 value to extract the 3 saved values to be restored from the bottom of ofwstk.)
>> 
>> 
>> 
>> I was not successful at using "capture on" in DDB for this early-boot context. (It hangs things after the first report.) So I've been limited to one screen's report and only when I have it stop at the end of the report (so it does not scroll away). (No input to DDB available that early.) Otherwise the information just scrolls by rather quickly for reading any detail. Still it was useful to see that other reports were not produced after the first (when there was a first). (I can not claim multiple are impossible. It just appears at least infrequent.)
>> 
>> I have not yet investigated making analogous powerpc/GENERIC code and builds.
>> 
>> Nor have I dealt with having it report more detail about the peer requests that fail.
>> 
>> Nor have I seen examples of what "not failing/%r1-unchanged" looks like overall.
>> 
>> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next, or that different peer requests in the sequence can be where the problem happens.
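
The "offset by the same amount as ofwstk" observation in the quoted notes can be checked with a few lines of arithmetic. This is a minimal sketch using only the values reported there for the two builds; the names REPORTS, STACK_OFFSET and deltas are hypothetical and exist only for this illustration.

```python
# Values as quoted for two builds with different ofwstk positions.
REPORTS = [
    {"r1_before": 0x30400, "r3_after": 0xd23450},
    {"r1_before": 0x31400, "r3_after": 0xd24450},
]

# %r1 before the call was reported as ofwstk + 0xfe0 in both builds.
STACK_OFFSET = 0xfe0


def deltas(a, b):
    """Return (change in ofwstk, change in the failing %r3 value) from a to b."""
    ofwstk_a = a["r1_before"] - STACK_OFFSET
    ofwstk_b = b["r1_before"] - STACK_OFFSET
    return ofwstk_b - ofwstk_a, b["r3_after"] - a["r3_after"]


if __name__ == "__main__":
    d_stk, d_r3 = deltas(REPORTS[0], REPORTS[1])
    print(f"ofwstk moved by {d_stk:#x}, failing %r3 moved by {d_r3:#x}")
```

Both differences come out as 0x1000, which is consistent with the failing %r3/%r1 value being at a fixed distance from ofwstk rather than an unrelated constant. It does not by itself say what that address is used for.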


