PowerMac G5 powerpc64: new context where repeatedly booting varies between failing and working

Mark Millard markmi at dsl-only.net
Sun Feb 15 02:13:52 UTC 2015


I found a time frame before the TOC corruption. The corruption happens during SI_SUB_TUNABLES processing, not before and not after in any of my examples so far.

With my current extra displays of the address calculations a normal boot now starts out (dmesg -a prefix) as below:

> powerpc_init end: &authnone_private: 0xe313a8
> mi_startup start &authnone_private: 0xe313a8
> 
> 
> 
> sysinit: 0xbd9c00 *sysinit: 0xc3c8a8
> &authnone_private: 0xe313a8
> 
> &authnone_private before subsystem: 0xe313a8
>  subsystem 700000
> &authnone_private before subsystem: 0xe313a8
>  subsystem 800001
> Copyright (c) 1992-2015 The FreeBSD Project.
> ...


But when the boots fail the before "subsystem 700000" &authnone_private figure above is unchanged but after that things look like (picking an example bad value that has occurred and manually typing it):

> &authnone_private before subsystem: 0x2400004200e313a8
>  subsystem 800001
> Copyright (c) 1992-2015 The FreeBSD Project.
> ...

and all later displays of the calculation agree with the displayed bad value until it crashes. I've never seen the value change at any other stage so far.


The code for mi_startup displaying the values as above is:


> root at FBSDG5M1:/usr/src # svnlite diff sys/kern/init_main.c
> Index: sys/kern/init_main.c
> ===================================================================
> --- sys/kern/init_main.c	(revision 278443)
> +++ sys/kern/init_main.c	(working copy)
> @@ -91,6 +91,11 @@
>  #include <ddb/ddb.h>
>  #include <ddb/db_sym.h>
>  
> +#if defined(VERBOSE_SYSINIT)
> +/* HACK! */
> +extern void* authnone_create(void);
> +#endif
> +
>  void mi_startup(void);				/* Should be elsewhere */
>  
>  /* Components of the first process -- never freed. */
> @@ -207,6 +212,8 @@
>  	int verbose;
>  #endif
>  
> +printf("mi_startup start &authnone_private: %p\n\n", authnone_create());
> +
>  	if (boothowto & RB_VERBOSE)
>  		bootverbose++;
>  
> @@ -215,7 +222,12 @@
>  		sysinit_end = SET_LIMIT(sysinit_set);
>  	}
>  
> +
>  restart:
> +
> +printf("\n\nsysinit: %p *sysinit: %p\n", sysinit, *sysinit);
> +printf("&authnone_private: %p\n\n", authnone_create());
> +
>  	/*
>  	 * Perform a bubble sort of the system initialization objects by
>  	 * their subsystem (primary key) and order (secondary key).
> @@ -234,6 +246,8 @@
>  
>  #if defined(VERBOSE_SYSINIT)
>  	last = SI_SUB_COPYRIGHT;
> +/* HACK */
> +	last = SI_SUB_DUMMY;
>  	verbose = 0;
>  #if !defined(DDB)
>  	printf("VERBOSE_SYSINIT: DDB not enabled, symbol lookups disabled.\n");
> @@ -254,7 +268,11 @@
>  
>  #if defined(VERBOSE_SYSINIT)
>  		if ((*sipp)->subsystem > last) {
> +printf("&authnone_private before subsystem: %p\n ", authnone_create());
> +
>  			verbose = 1;
> +/* HACK */
> +verbose = 0;
>  			last = (*sipp)->subsystem;
>  			printf("subsystem %x\n", last);
>  		}


I have also observed a new wildly different bad value: 0 instead of 0x2400004200e313a8.

The kernel runs much farther in this case but eventually dies for another large bad address. But the 0 means that some stomping on low memory occurred, such as 24(r29) indicating address 24 (0x18 hex) in the instruction that fails for r29=0x2400004200e313a8.





===
Mark Millard
markmi at dsl-only.net

On 2015-Feb-14, at 02:21 PM, Mark Millard <markmi at dsl-only.net> wrote:

I've been able to show that the TOC entry that authnone_init accesses is messed up and it is so from very early on.

I took advantage of sys/rpc/auth_none.c exposing the static variable's address calculation result, in fact the same one that the crash happened for:

AUTH *
authnone_create()
{
       struct authnone_private *ap = &authnone_private;

       return (&ap->no_client);
}

The no_client even happens to be the first field of the struct pointed to by ap.

So I put calls of that routine where it would periodically monitor the calculation during the early part of booting:

root at FBSDG5M1:/usr/src # svnlite diff sys/kern/init_main.c 
Index: sys/kern/init_main.c
===================================================================
--- sys/kern/init_main.c	(revision 278443)
+++ sys/kern/init_main.c	(working copy)
@@ -91,6 +91,11 @@
#include <ddb/ddb.h>
#include <ddb/db_sym.h>

+#if defined(VERBOSE_SYSINIT)
+/* HACK! */
+extern void* authnone_create(void);
+#endif
+
void mi_startup(void);				/* Should be elsewhere */

/* Components of the first process -- never freed. */
@@ -282,7 +287,9 @@

#if defined(VERBOSE_SYSINIT)
		if (verbose)
+{printf(" authnone_private address generation check: %p ", authnone_create());
			printf("done.\n");
+}
#endif

So when it boots successfully it reports messages like:

  malloc_init(&M_JFREEFRAG)...  authnone_private address generation check: 0xe313a8 done.

When the boots fails the very first such message of that form shows the 0x2400002200e313a8 value, as do all the later ones. When the boot works it always shows 0xe313a8.

[I have since shortened the text with:
printf(" &authnone_private: %p ", authnone_create());]

It would appear that the TOC entry generation/update is the source of the variations in value that are observed that can lead to a crash.



===
Mark Millard
markmi at dsl-only.net

On 2015-Feb-14, at 01:53 AM, Mark Millard <markmi at dsl-only.net> wrote:

FreeBSD context: PowerMac G5 Quad-Core with 16GB RAM.

root at FBSDG5M1:/usr/src # freebsd-version -ku ; uname -a
10.1-STABLE
10.1-STABLE
FreeBSD FBSDG5M1 10.1-STABLE FreeBSD 10.1-STABLE #10 r278443M: Fri Feb 13 03:26:27 PST 2015     root at FBSDG5M1:/usr/obj/usr/src/sys/GENERIC64vtsc  powerpc

Other configuration/context details for /boot/kernel10.1S/kernel are given late in this message. But I will here mention the use of:

options         DDB
options         GDB
options         VERBOSE_SYSINIT
options         BOOTVERBOSE=1
options         BOOTHOWTO=RB_VERBOSE



I've got a new context where repeatedly booting via power-off then power-on varies between failing and working, always failing in the same way/place when it does.


Here are 3 addresses (&...: ...) from a successful boot of kernel10.1S, where the first one will be  different for the failing boots (this is from dmesg -a):

 authnone_init(0)...  &authnone_private: 0xe313a8, &_null_auth: 0xe31608, &authnone_ops: 0xc31f80 done.

where the extra output is from the added printf in:

static void
authnone_init(void *dummy)
{
      struct authnone_private *ap = &authnone_private;
      XDR xdrs;

      ap->no_client.ah_cred = ap->no_client.ah_verf = _null_auth;
      ap->no_client.ah_ops = &authnone_ops;
printf(" &authnone_private: %p, &_null_auth: %p, &authnone_ops: %p ", ap, &_null_auth, &authnone_ops);
      xdrmem_create(&xdrs, ap->mclient, MAX_MARSHAL_SIZE, XDR_ENCODE);
      xdr_opaque_auth(&xdrs, &ap->no_client.ah_cred);
      xdr_opaque_auth(&xdrs, &ap->no_client.ah_verf);
      ap->mcnt = XDR_GETPOS(&xdrs);
      XDR_DESTROY(&xdrs);
}
SYSINIT(authnone_init, SI_SUB_KMEM, SI_ORDER_ANY, authnone_init, NULL);

The authnone_init code for through its first store into ap->... is:

00000000007a3ea4 <.authnone_init> mflr    r0
00000000007a3ea8 <.authnone_init+0x4> std     r29,-24(r1)
00000000007a3eac <.authnone_init+0x8> std     r30,-16(r1)
00000000007a3eb0 <.authnone_init+0xc> std     r31,-8(r1)
00000000007a3eb4 <.authnone_init+0x10> std     r0,16(r1)
00000000007a3eb8 <.authnone_init+0x14> stdu    r1,-192(r1)
00000000007a3ebc <.authnone_init+0x18> mr      r31,r1
00000000007a3ec0 <.authnone_init+0x1c> ld      r29,304(r2)
00000000007a3ec4 <.authnone_init+0x20> ld      r9,312(r2)
00000000007a3ec8 <.authnone_init+0x24> ld      r0,0(r9)
00000000007a3ecc <.authnone_init+0x28> ld      r11,8(r9)
00000000007a3ed0 <.authnone_init+0x2c> ld      r9,16(r9)
00000000007a3ed4 <.authnone_init+0x30> std     r0,24(r29)

When the boots fail they fail on that last std: std r0,24(r29) , doing so based on

r2: 0xd2da20
r29: 0x2400002200e313a8
bad virtual address: 0x2400002200e313c0

(These are from a boot-crash time register display, so hand copied off screen as it is too soon for interaction with DDB. I've got a default ddb script in place that does the display.)

When it boots okay r29 = 0x00e313a8 and the address accessed is 0x00e313c0 instead: see the first address that I started with above (for &authnone_private).

In other words: The difference is the upper half of r29. I've no evidence that r2 is corrupt for failing boots for this code.

So either 304(r2) accesses different values from one time to the next for booting or the r29 register is corrupted somehow between

00000000007a3ec0 <.authnone_init+0x1c> ld      r29,304(r2)
and
00000000007a3ed4 <.authnone_init+0x30> std     r0,24(r29)

(such as an interrupt not restoring the 64bit-ABI's register value fully).

At this point I've no clue if variability in the TOC contents that 304(r2) references makes any sense or not. I've yet to figure out how it is established.


More FreeBSD configuration details:

10.1-STABLE's buildworld kernel and installworld were all done from the PowerMac G5 itself.

root at FBSDG5M1:/usr/src # more sys/powerpc/conf/GENERIC64
GENERIC64      GENERIC64vtsc  
root at FBSDG5M1:/usr/src # more sys/powerpc/conf/GENERIC64vtsc 
include GENERIC64
ident   GENERIC64vtsc

nooptions       PS3                     #Sony Playstation 3               HACK!!! to allow sc

options         DDB                     # HACK!!! to dump early crash info (but 11.0-CURRENT already has it)
options         GDB                     # HACK!!! ...
options         VERBOSE_SYSINIT
options         BOOTVERBOSE=1
options         BOOTHOWTO=RB_VERBOSE
#options        KTR
#options        KTR_MASK=KTR_TRAP
#options        KTR_CPUMASK=0xF
#options        KTR_VERBOSE

# HACK!!! to allow sc for 2560x1440 display on Radeon X1950 that vt historically mishandled during booting
device          sc
#device          kbdmux         # HACK: already listed by vt
options         SC_OFWFB        # OFW frame buffer
options         SC_DFLT_FONT    # compile font in
makeoptions     SC_DFLT_FONT=cp437


# Disable extra checking typically used for FreeBSD 11.0-CURRENT:
nooptions       DEADLKRES               #Enable the deadlock resolver
nooptions       INVARIANTS              #Enable calls of extra sanity checking
nooptions       INVARIANT_SUPPORT       #Extra sanity checks of internal structures, required by INVARIANTS
nooptions       WITNESS                 #Enable checks to detect deadlocks and cycles
nooptions       WITNESS_SKIPSPIN        #Don't run witness on spinlocks for speed
nooptions       MALLOC_DEBUG_MAXZONES   # Separate malloc(9) zones

root at FBSDG5M1:/usr/src # svnlite status
?       .snap
M       sys/ddb/db_main.c
M       sys/ddb/db_script.c
M      sys/powerpc/conf
?       sys/powerpc/conf/GENERIC64vtsc
M       sys/powerpc/ofw/ofw_machdep.c
M       sys/powerpc/ofw/ofwcall64.S
M       sys/powerpc/powermac/powermac_thermal.c
M       sys/rpc/auth_none.c

root at FBSDG5M1:/usr/src # more /boot/loader.conf 
#kernel="kernel"
#kernel="kernel10.1RE"
kernel="kernel10.1S"
#kernel="kernel11C"
verbose_loading="YES"
kern.vty=vt

root at FBSDG5M1:/usr/src # more /etc/make.conf 
WRKDIRPREFIX=/usr/obj/portswork
WITH_DEBUG=
MALLOC_PRODUCTION=

root at FBSDG5M1:/usr/src # more /etc/src.conf
CFLAGS+=-DELF_VERBOSE
#WITH_DEBUG_FILES=
#WITHOUT_CLANG=

Other than powermac_thermal.c (from Justin H.) the source changes are for investigation of various early-boot problems for PowerMac G5's. The PowerMac G5 that powermac_thermal.c was put in place to experiment with is no longer around but I've not yet removed the powermac_thermal.c update from my environment.



===
Mark Millard
markmi at dsl-only.net





More information about the freebsd-ppc mailing list