My PowerMac G5's no longer crash at boot: PowerMac G5 specific ofwcall changes with justifying evidence [%r3 corrupted too]

Mark Millard markmi at dsl-only.net
Wed Oct 15 08:40:15 UTC 2014


More information on the odd %r1 and %r3 value...

The current and recent kernels that I've built get 0xd23450 for the corrupted values in %r1 and %r3 after openfirmware returns.

So I decided to look up what that might be...

objdump -h /boot/kernel/kernel shows (.got: the "global offset table") ...

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
...
 35 .got          0002f5c0  0000000000cfb248  0000000000cfb248  00bfb248  2**3
                  CONTENTS, ALLOC, LOAD, DATA
 36 .dynamic      000000d0  0000000000d2a808  0000000000d2a808  00c2a808  2**3
                  CONTENTS, ALLOC, LOAD, DATA
...

and objdump -s -j .got /boot/kernel/kernel shows...

 d23438 00000000 00bbfd48 00000000 00bbfd60  .......H.......`
 d23448 00000000 00bbfd90 00000000 00bbfdb0  ................
 d23458 00000000 00bbfdf0 00000000 00e17dd0  ..............}.
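
As a cross-check of reading the entry by eye, here is a minimal Python sketch that decodes the three dump lines above into 64-bit big-endian values. The helper names (parse_dump, read_u64_be) are made up for illustration, and it assumes each line carries exactly four 32-bit hex words after the address, as in the lines shown.

```python
# Lines copied from the `objdump -s -j .got` output above (ASCII column dropped).
GOT_DUMP = """\
 d23438 00000000 00bbfd48 00000000 00bbfd60
 d23448 00000000 00bbfd90 00000000 00bbfdb0
 d23458 00000000 00bbfdf0 00000000 00e17dd0
"""

def parse_dump(text):
    """Map each dumped address to the byte stored there."""
    memory = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        base = int(fields[0], 16)
        # Assumption: four 32-bit hex words follow the address on every line.
        data = bytes.fromhex("".join(fields[1:5]))
        for offset, byte in enumerate(data):
            memory[base + offset] = byte
    return memory

def read_u64_be(memory, addr):
    """Read a 64-bit big-endian value, as a powerpc64 kernel stores it."""
    raw = bytes(memory[addr + i] for i in range(8))
    return int.from_bytes(raw, "big")

memory = parse_dump(GOT_DUMP)
print(hex(read_u64_be(memory, 0xd23450)))
```

The 8 bytes at 0xd23450 are the second half of the d23448 line, which is where the 0xbbfdb0 used below comes from.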

Then for 0xbbfdb0 from the above: objdump -h /boot/kernel/kernel shows...

  6 .rodata.str1.8 000834a8  0000000000b4ddf8  0000000000b4ddf8  00a4ddf8  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  7 set_sysinit_set 00002538  0000000000bd12a0  0000000000bd12a0  00ad12a0  2**3
                  CONTENTS, ALLOC, LOAD, READONLY, DATA

and objdump -s -j .rodata.str1.8 /boot/kernel/kernel shows...

 bbfda8 6f756e74 00000000 436f756e 74206f66  ount....Count of
 bbfdb8 2074696d 65732074 68726f74 746c696e   times throttlin
 bbfdc8 67206261 73656420 6f6e2072 65717565  g based on reque
 bbfdd8 73742073 70616365 20686173 206f6363  st space has occ
 bbfde8 75727265 64000000 25733a20 6d617374  urred...%s: mast
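
The string can also be pulled out of the dump mechanically. This is a rough Python sketch with made-up helper names (dump_bytes, c_string_at). It assumes the dump lines are contiguous and that each holds four 32-bit hex words, as in the lines shown above.

```python
# Lines copied from the `objdump -s -j .rodata.str1.8` output above
# (ASCII column dropped).
RODATA_DUMP = """\
 bbfda8 6f756e74 00000000 436f756e 74206f66
 bbfdb8 2074696d 65732074 68726f74 746c696e
 bbfdc8 67206261 73656420 6f6e2072 65717565
 bbfdd8 73742073 70616365 20686173 206f6363
 bbfde8 75727265 64000000 25733a20 6d617374
"""

def dump_bytes(text):
    """Return (base address, bytes) for a contiguous run of dump lines."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    base = int(rows[0][0], 16)
    data = b"".join(bytes.fromhex("".join(row[1:5])) for row in rows)
    return base, data

def c_string_at(base, data, addr):
    """Return the NUL-terminated ASCII string that starts at addr."""
    start = addr - base
    end = data.index(b"\x00", start)
    return data[start:end].decode("ascii")

base, data = dump_bytes(RODATA_DUMP)
print(c_string_at(base, data, 0xbbfdb0))
```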

So 0xd23450 appears to be the address of a .got entry that points to the string "Count of times throttling based on request space has occurred". Alternatively, some offset from 0xd23450 may be what is used, indirectly reaching something else through the .got section. The string that I quoted is from /usr/src/sys/rpc/svc.c:

SVCPOOL*
svcpool_create(const char *name, struct sysctl_oid_list *sysctl_base)
{
...

                SYSCTL_ADD_INT(&pool->sp_sysctl, sysctl_base, OID_AUTO,
                    "request_space_throttle_count", CTLFLAG_RD,
                    &pool->sp_space_throttle_count, 0,
                    "Count of times throttling based on request space has occurred");
        }

        return pool;
}

(I have not done this lookup sequence across various FreeBSD updates and rebuilds that also get 0xd23450 in %r1 and %r3. Nor with FreeBSD builds that get some other corruption value. I do not know that the indirect lookup would have always gotten to that same string.)
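
To repeat the lookup on other builds, the first step is deciding which section an address falls in. Below is a small Python sketch using only the VMA and size values from the objdump -h listings above. The SECTIONS table and section_of() are illustrative names, and another build would need its own numbers.

```python
# (name, VMA, size) copied from the `objdump -h` listings above.
SECTIONS = [
    (".rodata.str1.8",  0xb4ddf8, 0x834a8),
    ("set_sysinit_set", 0xbd12a0, 0x02538),
    (".got",            0xcfb248, 0x2f5c0),
    (".dynamic",        0xd2a808, 0x000d0),
]

def section_of(addr):
    """Return the name of the listed section containing addr, else None."""
    for name, vma, size in SECTIONS:
        if vma <= addr < vma + size:
            return name
    return None

print(section_of(0xd23450))  # the corrupted %r1/%r3 value
print(section_of(0xbbfdb0))  # the value stored in that .got entry
```

With these numbers, .got ends exactly where .dynamic begins (0xcfb248 + 0x2f5c0 = 0xd2a808), so 0xd23450 is well inside .got rather than at an edge.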



===
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 11:30 PM, Mark Millard <markmi at dsl-only.net> wrote:

I added including after-ofwcall %r1 and %r3 values to my ofwcall history buffer that I have ddb report when there is a problem.

This makes it apparent that %r3 has also been corrupted when %r1 has been.

I say that because the usual/normal %r3 value is 0 in what the code records and reports and I gather from the FreeBSD source code that the error indicator is -1. But all along I've been reporting %r3 values for the crashes that look more like 0xd18868 or other such. Never a 0 or -1 (0xfff...). And the %r3 crash values even move around when the ofwstk changes place from build to build.

(This "usual"/"error-check" mix suggests %r3 from openfirmware is a multi-bit representation of a Boolean value, with one's complemented alternative values and zero as one of the two bit patterns --when %r3 is not corrupted.)
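
The distinction above can be written down as a tiny classifier. This is only a Python sketch of the reasoning, not the kernel code. classify_r3() is a made-up name, and the 64-bit register width is an assumption (a 32-bit all-ones value would need width=32).

```python
def classify_r3(value, width=64):
    """Classify an after-ofwcall %r3 value as described in the text."""
    mask = (1 << width) - 1
    value &= mask
    if value == 0:
        return "ok"          # the usual value seen in the recorded history
    if value == mask:
        return "error"       # -1, which is also the one's complement of 0
    return "unexpected"      # neither 0 nor -1: looks corrupted

for sample in (0x0, 0xffffffffffffffff, 0xd23450, 0xd18868):
    print(hex(sample), classify_r3(sample))
```

The crash-time values reported so far (0xd23450, 0xd18868) all land in the "unexpected" bucket.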



I also got an example of a somewhat later than normal ofwcall failure: about 23 ofwcalls later than normal. It was not a peer request:

...
OF_finddevice+0x90
powermac_smp_get_bsp+0x20
platform_smp_get_bsp+0x78
cpu_mp_start+0x24
mp_startup+0x7c
mi_startup+0x10c
btext+0xbc

So pmap_bootstrapped had been true for a while by this point. Available memory had been displayed as of when this example stopped to report the %r1 change.




===
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 3:18 PM, Mark Millard <markmi at dsl-only.net> wrote:

For openfirmware: is %r3 on return any more than a failed vs. not-failed flag with a particular failed-value? Is there any way to validate that %r3 values for non-failure look reasonable vs. not looking reasonable? (For all I know %r3 could also be corrupt.)

I do not have any documentation for the PowerMac G5 openfirmware API that is in use or the associated ABI as far as I remember. I do not know if it strictly followed Darwin's/Mac OS X's ABI on PowerMac G5's vs. if there was some conversion going back and forth (as there is for FreeBSD, at least for powerpc64). For openfirmware I derive properties from what I see in FreeBSD's code (which has to be more explicit than when a compiler's code generation happens to match at least large parts of an ABI directly).

As I vaguely remember, Apple did not use the TOC for Darwin's/Mac OS X's ABI but FreeBSD does. If true, I do not know what other differences there might be (even ignoring the 32 bit vs. 64 bit issues for the kernels). But the point would be an existence proof of at least one difference. My understanding is that %r1 was as in FreeBSD.

I vaguely seem to remember that for Darwin/Mac OS X some register was volatile in leaf functions but non-volatile otherwise, or at least when nested functions were involved. And that brings to mind that the condition code sets in cr might have had a mix of volatile and non-volatile status despite being in one register? Did Darwin/Mac OS X have something special for register usage for Thread-Specific Storage? Position Independent Code? Indirect Calls? Frame Pointers?

I may have some Darwin/Mac OS X information around but I doubt that it is complete, especially for the 64-bit ABI or for privileged contexts. For the 32-bit ABI (non-privileged) I likely have the information about the above possible ABI properties.

I assume that openfirmware avoids the FPU and other such --but I do not know. But it is privileged code.

Are there any known sources of at least some of the information for the PowerMac G5 openfirmware ABI(s)? What are good references for the FreeBSD PowerPC ABI(s) (32 bit and 64 bit, privileged vs. not)?

[I cut off some of the older history.]

===
Mark Millard
markmi at dsl-only.net

On Oct 14, 2014, at 10:18 AM, Nathan Whitehorn <nwhitehorn at freebsd.org> wrote:

r1 *must be* preserved by the standard and for anything to work. It's being corrupted somehow (Mark's comment about r3 is illuminating), and if r1 is being corrupted, you can't rely on anything. I suspect it might be an exception handling issue since it's non-deterministic, but it's hard to tell. It could also be triggered by the way we've set up the OF stack frame. It would be good to check if that makes sense.
-Nathan

On 10/14/14 09:53, Justin Hibbits wrote:
> Interesting.  Perhaps, instead of using %r1, and relying purely on the
> stack we use yet another (non-volatile) register to hold the MSR.
> Once we reload the MSR we can get back the saved registers, because
> the stack will be valid again.
> 
> Nathan, thoughts?
> 
> - Justin
> 
> On Tue, Oct 14, 2014 at 9:14 AM, Mark Millard <markmi at dsl-only.net> wrote:
>> Additional notes from additional experiments... (So far from one G5.)
>> 
>> I got back trace, show registers, and my openfirmware-history list going for failure reporting based on explicit before vs. after tests of %r1 values. (Explicit breakpoint call for unequal, being careful to save/restore %r3 around the call.) I filled several registers with potentially interesting values that would otherwise have had zero as a value (%r15-%r19, although %r15 is redundant with %r6 currently).
>> 
>> An interesting property resulted: every time %r1 had changed from having the before-value (stack pointer value) %r1 instead ended up with a value equal to what openfirmware put in %r3.
>> 
>> And more than that: for builds with the same ofwstk position the %r3 value involved was fixed for the failures. For example, when 0x30400=ofwstk+0xfe0 (%r1 before) was reported, %r3 and %r1 ended up as 0xd23450 for the failures. When 0x31400=ofwstk+0xfe0, %r3 and %r1 ended up as 0xd24450 for the failures instead. Yep: offset by the same amount as ofwstk.
>> 
>> And I got one example where the openfirmware %r1-value-change failure was instead much later in the boot, well after pmap_bootstrapped went true: It was just after the message lines...
>> 
>> vgapci0: Boot video device ...
>> pcib1: <IBM CPC9X5 Hypertransport tunnel> ...
>> 
>> with back trace (from OF_peer down):
>> 
>> .OF_peer+0x8c
>> .cpcht_attach+0x884
>> .device_attach+0x3ac
>> .device_probe_and_attach+0x3c
>> .bus_generic_new_pass+0x12c
>> .bus_generic_new_pass+0x114
>> .bus_generic_new_pass+0x114 (yep: listed twice)
>> .bus_set_pass+0xc0
>> .root_bus_configure+0x14
>> .mi_startup+0x10c
>> btext+0xbc
>> 
>> %r1 before: 0xc30400 ofwstk+0xfe0
>> %r1 after:  0xd23450
>> %r3 after:  0xd23450
>> FreeBSD msr to restore: 0x9000000000001032
>> ofmsr[0]  to restore:   0x1000000000003030
>> 
>> The same after-openfirmware %r1 and %r3 values that had been showing up for the before-copyright examples of ofwcall failures.
>> 
>> And note that it again was a peer request. All the ofwcall-tied boot-failures have been for peer requests as far as I remember.
>> 
>> I later did some experiments where I had it report but not stop when the after-value was different from the before-value for %r1. When this happened in these tests it seemed to be an isolated example: later calls normally had the stack pointer value still in %r1 after openfirmware returned. In more detail: at most one report was made for such a boot, and the rest of the boot went fine. (Of course to get that far my hacked ofwcall code avoids using the after-openfirmware %r1 value to extract the 3 saved values to be restored from the bottom of ofwstk.)
>> 
>> 
>> 
>> I was not successful at using "capture on" in DDB for this early-boot context. (It hangs things after the first report.) So I've been limited to one screen's report and only when I have it stop at the end of the report (so it does not scroll away). (No input to DDB available that early.) Otherwise the information just scrolls by rather quickly for reading any detail. Still it was useful to see that other reports were not produced after the first (when there was a first). (I can not claim multiple are impossible. It just appears at least infrequent.)
>> 
>> I have not yet investigated making analogous powerpc/GENERIC code and builds.
>> 
>> Nor have I dealt with having it report more detail about the peer requests that fail.
>> 
>> Nor have I seen examples of what "not failing/%r1-unchanged" looks like overall.
>> 
>> I still have no examples of unstable/incomplete initialization(s) or race condition(s) to explain why both ways can and do occur from one attempt to the next --or that different peer requests in the sequence can be where the problem happens.
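
The "offset by the same amount as ofwstk" observation in the quoted notes can be checked with a few lines of arithmetic. This is a Python sketch using only the two (%r1 before, %r1/%r3 after) pairs reported there. The variable names are illustrative.

```python
# (%r1 before = ofwstk+0xfe0, corrupted %r1/%r3 after) for the two builds.
observations = [
    (0x30400, 0xd23450),
    (0x31400, 0xd24450),
]

# If the corrupted value tracks ofwstk, after - before is the same for both.
deltas = [after - before for before, after in observations]
stack_shift = observations[1][0] - observations[0][0]
value_shift = observations[1][1] - observations[0][1]

print([hex(d) for d in deltas])
print(hex(stack_shift), hex(value_shift))
```

Both shifts come out as 0x1000, so for these two builds the corrupted value sits at a constant distance from the ofwstk-relative stack pointer.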




