Some evidence about the PowerMac G5 multiprocessor boot hang ups with the modern VM_MAX_KERNEL_ADDRESS value

Fri Feb 15 22:09:48 UTC 2019

On Fri, 15 Feb 2019 14:01:18 -0800
Mark Millard <marklmi at yahoo.com> wrote:

> On 2019-Feb-15, at 13:17, Justin Hibbits <chmeeedalf at gmail.com> wrote:
> 
> > On Fri, 15 Feb 2019 11:51:26 -0800
> > Mark Millard via freebsd-ppc <freebsd-ppc at freebsd.org> wrote:
> >   
> >> Old: 0xe0000000013ff0??
> >> New: 0xe000000087fd20??  
> > 
> > The addresses are pretty inconsequential, since they're virtual
> > addresses.  It would be nice to be able to profile how far a CPU
> > gets in its launch (writing a value to a well-known address,
> > 0xc000000000000010, or such, anywhere in the bottom 256 bytes of
> > space really).  If you add writes to that address, we can track the
> > progress at panic time.  I'd presume all APs would behave the same
> > way, so no need for lock management, or isolation between them.  
> 
> Thanks for the note.
> 
> Just to be sure, was the 0xc prefix a typo
> (vs. 0xe as a prefix)?:
> 
> 0xc000000000000010
> vs.
> 0xe000000000000010

No, 0xc is correct.  0xc... is the base address of the DMAP (direct
map), and it so happens that the upper bits are ignored in real mode,
simply because they are not placed onto the address bus.  We take
advantage of that elsewhere as well.  So writing to 0xc000....10
actually writes to 0x0000...10, both in real mode and translated mode.
By writing to this address at various points while the AP is starting
up, we can see just how far into the boot it gets.
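
To make the address arithmetic concrete, here is a minimal sketch of
how the 0xc prefix drops away.  The macro and function names are made
up for this example (they are not the kernel's own), and the mask only
models the DMAP prefix; which bits the hardware really ignores in real
mode is CPU-specific.

```c
#include <stdint.h>

/* Assumed for illustration: the DMAP base address named in this thread. */
#define EXAMPLE_DMAP_BASE 0xc000000000000000ULL

/*
 * Hypothetical helper: clear the DMAP prefix bits, leaving the low
 * address that the write actually lands on.
 */
static inline uint64_t
example_dmap_to_low(uint64_t va)
{
	return (va & ~EXAMPLE_DMAP_BASE);
}
```

With this model, example_dmap_to_low(0xc000000000000010ULL) gives
0x10, which sits inside the bottom 256 bytes mentioned earlier.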

> 
> The hangs do not produce panics so I'd have to
> induce one someplace/somehow if a panic is to be
> involved.
> 
> Since boots hang only sometimes, a fixed panic point
> does not seem appropriate.
> 
> Part of the issue is that this is before ddb user input
> works as far as I can tell. (I do not have a serial
> debug connection.) I'm unable to enter ddb via keyboard
> sequences when it is hung up.
> 
> Classically I've dealt with this sort of issue by building
> in a ddb script that automatically executes, dumping some
> information. But that still requires inducing the ddb
> session somehow. Historically I was investigating
> panics.
> 
> But since CPU 0 does complete its CPU 3 sequence and starts
> attempting CPU 2, I might get CPU 0 to print value(s)
> for the CPU 3 case before it tries for CPU 2.
> 
> In summary:
> 
> I've been pondering what to do for earlier evidence of why:
> 
> A) CPU 0 never sees pc->pc_awake become non zero for CPU
>    3 in the examples. (The 2 (void)(*rstvec) complete and
>    kicking CPU 2 starts to be attempted.)
> 
> B) CPU 0 never completes the sequence of 2 (void)(*rstvec)
>    for kicking CPU 2 in the examples.
> 
> (It has been some time since I've seen only one Waiting for
> CPU message: there have been 2 for hangs in recent times.)
> 
> Writing to appropriate memory and reading it later should
> help with that.

You can simply assume that it will hang (obviously keep another kernel
handy that boots, or make sure this kernel also boots with SMP
disabled): in cpu_mp_unleash(), replace the while (ap_awake < smp_cpus)
loop with a sleep of a few seconds followed by a panic.  You may need
to throw in a sync after your writes; I'm not 100% sure, since it
depends on how coherency is handled in real mode.
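
A rough userland sketch of the progress-marker idea, just to show the
shape of it.  Everything named here is hypothetical: the stage values,
the helper names, and the static variable that stands in for the
low-memory word (in the kernel it would be a pointer to something like
0xc000000000000010).  __sync_synchronize() is the GCC/Clang full
barrier builtin; I'd expect it to come out as a sync on PowerPC, but
check the generated code before relying on that.

```c
#include <stdint.h>

/* Hypothetical stage numbers; pick whatever points you instrument. */
enum example_ap_stage {
	EXAMPLE_AP_NONE = 0,	/* nothing written yet */
	EXAMPLE_AP_ENTERED = 1,	/* AP reached its entry code */
	EXAMPLE_AP_MIDWAY = 2,	/* some later point of interest */
	EXAMPLE_AP_AWAKE = 3	/* AP is about to report itself awake */
};

/* Stand-in for the well-known low-memory word (assumption). */
static volatile uint32_t example_progress_word;

/* Record the latest stage reached, then force the store out. */
static inline void
example_ap_mark(enum example_ap_stage stage)
{
	example_progress_word = (uint32_t)stage;
	__sync_synchronize();	/* full barrier after the write */
}

/* Read back the last recorded stage, e.g. just before the panic. */
static inline uint32_t
example_ap_progress(void)
{
	__sync_synchronize();	/* full barrier before the read */
	return (example_progress_word);
}
```

The boot CPU would then print example_ap_progress() right before the
panic, so the message shows the last stage the AP recorded.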

- Justin