PowerMac G5 powerpc64: new context where repeatedly booting varies between failing and working
Mark Millard
markmi at dsl-only.net
Sun Feb 15 02:13:52 UTC 2015
I found a time frame before the TOC corruption. The corruption happens during SI_SUB_TUNABLES processing, not before and not after in any of my examples so far.
With my current extra displays of the address calculations a normal boot now starts out (dmesg -a prefix) as below:
> powerpc_init end: &authnone_private: 0xe313a8
> mi_startup start &authnone_private: 0xe313a8
>
>
>
> sysinit: 0xbd9c00 *sysinit: 0xc3c8a8
> &authnone_private: 0xe313a8
>
> &authnone_private before subsystem: 0xe313a8
> subsystem 700000
> &authnone_private before subsystem: 0xe313a8
> subsystem 800001
> Copyright (c) 1992-2015 The FreeBSD Project.
> ...
But when the boots fail the before "subsystem 700000" &authnone_private figure above is unchanged but after that things look like (picking an example bad value that has occurred and manually typing it):
> &authnone_private before subsystem: 0x2400004200e313a8
> subsystem 800001
> Copyright (c) 1992-2015 The FreeBSD Project.
> ...
and all later displays of the calculation agree with the displayed bad value until it crashes. I've never seen the value change at any other stage so far.
The code for mi_startup displaying the values as above is:
> root at FBSDG5M1:/usr/src # svnlite diff sys/kern/init_main.c
> Index: sys/kern/init_main.c
> ===================================================================
> --- sys/kern/init_main.c (revision 278443)
> +++ sys/kern/init_main.c (working copy)
> @@ -91,6 +91,11 @@
> #include <ddb/ddb.h>
> #include <ddb/db_sym.h>
>
> +#if defined(VERBOSE_SYSINIT)
> +/* HACK! */
> +extern void* authnone_create(void);
> +#endif
> +
> void mi_startup(void); /* Should be elsewhere */
>
> /* Components of the first process -- never freed. */
> @@ -207,6 +212,8 @@
> int verbose;
> #endif
>
> +printf("mi_startup start &authnone_private: %p\n\n", authnone_create());
> +
> if (boothowto & RB_VERBOSE)
> bootverbose++;
>
> @@ -215,7 +222,12 @@
> sysinit_end = SET_LIMIT(sysinit_set);
> }
>
> +
> restart:
> +
> +printf("\n\nsysinit: %p *sysinit: %p\n", sysinit, *sysinit);
> +printf("&authnone_private: %p\n\n", authnone_create());
> +
> /*
> * Perform a bubble sort of the system initialization objects by
> * their subsystem (primary key) and order (secondary key).
> @@ -234,6 +246,8 @@
>
> #if defined(VERBOSE_SYSINIT)
> last = SI_SUB_COPYRIGHT;
> +/* HACK */
> + last = SI_SUB_DUMMY;
> verbose = 0;
> #if !defined(DDB)
> printf("VERBOSE_SYSINIT: DDB not enabled, symbol lookups disabled.\n");
> @@ -254,7 +268,11 @@
>
> #if defined(VERBOSE_SYSINIT)
> if ((*sipp)->subsystem > last) {
> +printf("&authnone_private before subsystem: %p\n ", authnone_create());
> +
> verbose = 1;
> +/* HACK */
> +verbose = 0;
> last = (*sipp)->subsystem;
> printf("subsystem %x\n", last);
> }
I have also observed a new wildly different bad value: 0 instead of 0x2400004200e313a8.
The kernel runs much farther in this case but eventually dies for another large bad address. But the 0 means that some stomping on low memory occurred, such as 24(r29) indicating address 24 (0x18 hex) in the instruction that fails for r29=0x2400004200e313a8.
===
Mark Millard
markmi at dsl-only.net
On 2015-Feb-14, at 02:21 PM, Mark Millard <markmi at dsl-only.net> wrote:
I've been able to show that the TOC entry that authnone_init accesses is messed up and it is so from very early on.
I took advantage of sys/rpc/auth_none.c exposing the static variable's address calculation result, in fact the same one that the crash happened for:
AUTH *
authnone_create()
{
struct authnone_private *ap = &authnone_private;
return (&ap->no_client);
}
The no_client even happens to be the first field of the struct pointed to by ap.
So I put calls of that routine where it would periodically monitor the calculation during the early part of booting:
root at FBSDG5M1:/usr/src # svnlite diff sys/kern/init_main.c
Index: sys/kern/init_main.c
===================================================================
--- sys/kern/init_main.c (revision 278443)
+++ sys/kern/init_main.c (working copy)
@@ -91,6 +91,11 @@
#include <ddb/ddb.h>
#include <ddb/db_sym.h>
+#if defined(VERBOSE_SYSINIT)
+/* HACK! */
+extern void* authnone_create(void);
+#endif
+
void mi_startup(void); /* Should be elsewhere */
/* Components of the first process -- never freed. */
@@ -282,7 +287,9 @@
#if defined(VERBOSE_SYSINIT)
if (verbose)
+{printf(" authnone_private address generation check: %p ", authnone_create());
printf("done.\n");
+}
#endif
So when it boots successfully it reports messages like:
malloc_init(&M_JFREEFRAG)... authnone_private address generation check: 0xe313a8 done.
When the boots fails the very first such message of that form shows the 0x2400002200e313a8 value, as do all the later ones. When the boot works it always shows 0xe313a8.
[I have since shortened the text with:
printf(" &authnone_private: %p ", authnone_create());]
It would appear that the TOC entry generation/update is the source of the variations in value that are observed that can lead to a crash.
===
Mark Millard
markmi at dsl-only.net
On 2015-Feb-14, at 01:53 AM, Mark Millard <markmi at dsl-only.net> wrote:
FreeBSD context: PowerMac G5 Quad-Core with 16GB RAM.
root at FBSDG5M1:/usr/src # freebsd-version -ku ; uname -a
10.1-STABLE
10.1-STABLE
FreeBSD FBSDG5M1 10.1-STABLE FreeBSD 10.1-STABLE #10 r278443M: Fri Feb 13 03:26:27 PST 2015 root at FBSDG5M1:/usr/obj/usr/src/sys/GENERIC64vtsc powerpc
Other configuration/context details for /boot/kernel10.1S/kernel are given late in this message. But I will here mention the use of:
options DDB
options GDB
options VERBOSE_SYSINIT
options BOOTVERBOSE=1
options BOOTHOWTO=RB_VERBOSE
I've got a new context where repeatedly booting via power-off then power-on varies between failing and working, always failing in the same way/place when it does.
Here are 3 addresses (&...: ...) from a successful boot of kernel10.1S, where the first one will be different for the failing boots (this is from dmesg -a):
authnone_init(0)... &authnone_private: 0xe313a8, &_null_auth: 0xe31608, &authnone_ops: 0xc31f80 done.
where the extra output is from the added printf in:
static void
authnone_init(void *dummy)
{
struct authnone_private *ap = &authnone_private;
XDR xdrs;
ap->no_client.ah_cred = ap->no_client.ah_verf = _null_auth;
ap->no_client.ah_ops = &authnone_ops;
printf(" &authnone_private: %p, &_null_auth: %p, &authnone_ops: %p ", ap, &_null_auth, &authnone_ops);
xdrmem_create(&xdrs, ap->mclient, MAX_MARSHAL_SIZE, XDR_ENCODE);
xdr_opaque_auth(&xdrs, &ap->no_client.ah_cred);
xdr_opaque_auth(&xdrs, &ap->no_client.ah_verf);
ap->mcnt = XDR_GETPOS(&xdrs);
XDR_DESTROY(&xdrs);
}
SYSINIT(authnone_init, SI_SUB_KMEM, SI_ORDER_ANY, authnone_init, NULL);
The authnone_init code for through its first store into ap->... is:
00000000007a3ea4 <.authnone_init> mflr r0
00000000007a3ea8 <.authnone_init+0x4> std r29,-24(r1)
00000000007a3eac <.authnone_init+0x8> std r30,-16(r1)
00000000007a3eb0 <.authnone_init+0xc> std r31,-8(r1)
00000000007a3eb4 <.authnone_init+0x10> std r0,16(r1)
00000000007a3eb8 <.authnone_init+0x14> stdu r1,-192(r1)
00000000007a3ebc <.authnone_init+0x18> mr r31,r1
00000000007a3ec0 <.authnone_init+0x1c> ld r29,304(r2)
00000000007a3ec4 <.authnone_init+0x20> ld r9,312(r2)
00000000007a3ec8 <.authnone_init+0x24> ld r0,0(r9)
00000000007a3ecc <.authnone_init+0x28> ld r11,8(r9)
00000000007a3ed0 <.authnone_init+0x2c> ld r9,16(r9)
00000000007a3ed4 <.authnone_init+0x30> std r0,24(r29)
When the boots fail they fail on that last std: std r0,24(r29) , doing so based on
r2: 0xd2da20
r29: 0x2400002200e313a8
bad virtual address: 0x2400002200e313c0
(These are from a boot-crash time register display, so hand copied off screen as it is too soon for interaction with DDB. I've got a default ddb script in place that does the display.)
When it boots okay r29 = 0x00e313a8 and the address accessed is 0x00e313c0 instead: see the first address that I started with above (for &authnone_private).
In other words: The difference is the upper half of r29. I've no evidence that r2 is corrupt for failing boots for this code.
So either 304(r2) accesses different values from one time to the next for booting or the r29 register is corrupted somehow between
00000000007a3ec0 <.authnone_init+0x1c> ld r29,304(r2)
and
00000000007a3ed4 <.authnone_init+0x30> std r0,24(r29)
(such as an interrupt not restoring the 64bit-ABI's register value fully).
At this point I've no clue if variability in the TOC contents that 304(r2) references makes any sense or not. I've yet to figure out how it is established.
More FreeBSD configuration details:
10.1-STABLE's buildworld kernel and installworld were all done from the PowerMac G5 itself.
root at FBSDG5M1:/usr/src # more sys/powerpc/conf/GENERIC64
GENERIC64 GENERIC64vtsc
root at FBSDG5M1:/usr/src # more sys/powerpc/conf/GENERIC64vtsc
include GENERIC64
ident GENERIC64vtsc
nooptions PS3 #Sony Playstation 3 HACK!!! to allow sc
options DDB # HACK!!! to dump early crash info (but 11.0-CURRENT already has it)
options GDB # HACK!!! ...
options VERBOSE_SYSINIT
options BOOTVERBOSE=1
options BOOTHOWTO=RB_VERBOSE
#options KTR
#options KTR_MASK=KTR_TRAP
#options KTR_CPUMASK=0xF
#options KTR_VERBOSE
# HACK!!! to allow sc for 2560x1440 display on Radeon X1950 that vt historically mishandled during booting
device sc
#device kbdmux # HACK: already listed by vt
options SC_OFWFB # OFW frame buffer
options SC_DFLT_FONT # compile font in
makeoptions SC_DFLT_FONT=cp437
# Disable extra checking typically used for FreeBSD 11.0-CURRENT:
nooptions DEADLKRES #Enable the deadlock resolver
nooptions INVARIANTS #Enable calls of extra sanity checking
nooptions INVARIANT_SUPPORT #Extra sanity checks of internal structures, required by INVARIANTS
nooptions WITNESS #Enable checks to detect deadlocks and cycles
nooptions WITNESS_SKIPSPIN #Don't run witness on spinlocks for speed
nooptions MALLOC_DEBUG_MAXZONES # Separate malloc(9) zones
root at FBSDG5M1:/usr/src # svnlite status
? .snap
M sys/ddb/db_main.c
M sys/ddb/db_script.c
M sys/powerpc/conf
? sys/powerpc/conf/GENERIC64vtsc
M sys/powerpc/ofw/ofw_machdep.c
M sys/powerpc/ofw/ofwcall64.S
M sys/powerpc/powermac/powermac_thermal.c
M sys/rpc/auth_none.c
root at FBSDG5M1:/usr/src # more /boot/loader.conf
#kernel="kernel"
#kernel="kernel10.1RE"
kernel="kernel10.1S"
#kernel="kernel11C"
verbose_loading="YES"
kern.vty=vt
root at FBSDG5M1:/usr/src # more /etc/make.conf
WRKDIRPREFIX=/usr/obj/portswork
WITH_DEBUG=
MALLOC_PRODUCTION=
root at FBSDG5M1:/usr/src # more /etc/src.conf
CFLAGS+=-DELF_VERBOSE
#WITH_DEBUG_FILES=
#WITHOUT_CLANG=
Other than powermac_thermal.c (from Justin H.) the source changes are for investigation of various early-boot problems for PowerMac G5's. The PowerMac G5 that powermac_thermal.c was put in place to experiment with is no longer around but I've not yet removed the powermac_thermal.c update from my environment.
===
Mark Millard
markmi at dsl-only.net
More information about the freebsd-ppc
mailing list