An experimental hack that appears to allow old PowerMacG5 4-core (system total) system to boot reliably (head -r343884 based context)

Tue Feb 26 21:11:57 UTC 2019

[I explicitly note that my hack is racy. It appears that
I've finally had an example.]

On 2019-Feb-24, at 13:50, Mark Millard <marklmi at yahoo.com> wrote:

> On 2019-Feb-24, at 13:07, Justin Hibbits <chmeeedalf at gmail.com> wrote:
> 
>> On Sat, Feb 23, 2019 at 1:36 PM Mark Millard <marklmi at yahoo.com> wrote:
>>> 
>>> For sys/powerpc/aim/mp_cpudep.c 's cpudep_ap_bootstrap I added as shown below:
>>> 
>>> +extern void hack_into_slb_if_needed(void* vap); // HACK!!!
>>> +
>>> uintptr_t
>>> cpudep_ap_bootstrap(void)
>>> {
>>> . . .
>>> +       hack_into_slb_if_needed(pcpup->pc_curpcb); // HACK!!!
>>> +
>>>       sp = pcpup->pc_curpcb->pcb_sp;

In the above, after the implicit slb_insert_kernel, but before
the pcpup->pc_curpcb-> attempt, the slb entry could be replaced
again. There are, after all, other threads in operation before
SI_SUB_SMP starts:

        SI_SUB_KTHREAD_INIT     = 0xe000000,    /* init process*/
        SI_SUB_KTHREAD_PAGE     = 0xe400000,    /* pageout daemon*/
        SI_SUB_KTHREAD_VM       = 0xe800000,    /* vm daemon*/
        SI_SUB_KTHREAD_BUF      = 0xea00000,    /* buffer daemon*/
        SI_SUB_KTHREAD_UPDATE   = 0xec00000,    /* update daemon*/
        SI_SUB_KTHREAD_IDLE     = 0xee00000,    /* idle procs*/
#ifndef EARLY_AP_STARTUP
        SI_SUB_SMP              = 0xf000000,    /* start the APs*/
#endif

I've finally had one boot hang-up, apparently from this happening.
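
The window described above is a check-then-use race. As a rough
illustration only, here is a user-space model of it. Everything in it
(the model_* names, the 4-slot cache, the "valid" bit) is hypothetical
and is not the kernel's code: it only shows that an entry confirmed
present by the check can be gone by the time of the use if some other
insertion picks the same slot in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_N_SLOTS	4		/* hypothetical; not a real SLB size */
#define MODEL_VALID	(1ULL << 63)	/* hypothetical valid bit, model only */

static uint64_t model_cache[MODEL_N_SLOTS];

/* Return true if a valid entry for esid is currently in the model cache. */
static bool
model_present(uint64_t esid)
{
	for (int i = 0; i < MODEL_N_SLOTS; i++)
		if (model_cache[i] == (esid | MODEL_VALID))
			return (true);
	return (false);
}

/* Overwrite one slot, as a replacement done by any thread would. */
static void
model_replace(int victim, uint64_t esid)
{
	assert(victim >= 0 && victim < MODEL_N_SLOTS);
	model_cache[victim] = esid | MODEL_VALID;
}

/* The hack's step in miniature: insert only when the entry is missing. */
static void
model_ensure(uint64_t esid, int victim)
{
	if (!model_present(esid))
		model_replace(victim, esid);
}

/*
 * Check, then let another insertion happen, then use. Returns true when
 * the use would still find the entry. other_victim is the slot replaced
 * by the other insertion inside the window.
 */
static bool
model_check_then_use(uint64_t pcb_esid, int my_victim,
    uint64_t other_esid, int other_victim)
{
	model_ensure(pcb_esid, my_victim);		/* check (maybe insert) */
	model_replace(other_victim, other_esid);	/* the racy window */
	return (model_present(pcb_esid));		/* use */
}
```

In this model the use only fails when other_victim happens to be the
slot holding the pcb's entry, which fits the hang-up being rare rather
than absent.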

>>> and in src/sys/powerpc/aim/slb.c I added an implementation:
>>> 
>>> +void hack_into_slb_if_needed(void* vap); // HACK!!!
>>> +void hack_into_slb_if_needed(void* vap) // HACK!!!
>>> +{ // HACK!!!
>>> +       struct slb *cache= PCPU_GET(aim.slb);
>>> +       vm_offset_t va=    (vm_offset_t)vap;
>>> +       uint64_t    slbv=  kernel_va_to_slbv(va);
>>> +       uint64_t    esid=  va>>ADDR_SR_SHFT;
>>> +       uint64_t    slbe=  (esid<<SLBE_ESID_SHIFT) | SLBE_VALID;
>>> +       int i;
>>> +
>>> +       for (i = 0; i < n_slbs; i++) {
>>> +               if (i == USER_SLB_SLOT)
>>> +                       continue;
>>> +               if (cache[i].slbe == (slbe | i))
>>> +                       break;
>>> +       }
>>> +
>>> +       if (i==n_slbs)
>>> +               slb_insert_kernel(slbe,slbv);
>>> +} // HACK!!!
>>> +
>>> 
>>> So far I've not had any boot hang-ups after this.
>>> 
>>> Given the random nature of the hang-ups it will be a
>>> while before I conclude for sure how reliable this
>>> change makes booting, but so far so good.
>>> 
>>> (I recognize that the "break" could be "return"
>>> and then the "if (i==n_slbs)" would not be
>>> needed.)
>>> 
>>> 
>>> Other issues not fixed by this:
>>> 
>>> This does not change the buf*daemon* randomly getting
>>> hung up (and so timing out on shutdown). This appears
>>> to be the same issue that leads to the fans sometimes
>>> starting to run full-rate because of pmac_thermal
>>> being hung up.
>>> 
>>> For buf*daemon* "top -SHIopid" before shutdown shows
>>> just the ones that will not hang-up. The same goes for
>>> seeing beforehand for pmac_thermal vs. the fans.
>>> 
>>> ===
>>> Mark Millard
>> 
>> Hi Mark,
>> 
>> Fantastic work tracking this down!  So the problem is we now can fault
>> when accessing KVA space.  I think we should allow this, otherwise we
>> can hamper performance with reduced KVA size.  I'll have to think
>> about how best to do this.  Would you be willing to test patches I
>> come up with?
> 
> I'll try to test whatever updates you want but there may be some
> issues with timeliness.
> 
> 
> 
> The reason for the "sometimes" boot-failure is that the entry in the
> slb for the PCB/stack for the CPU being added has sometimes been
> replaced already before the CPU the pcb is for has been sufficiently
> configured to allow automatic handling --and other times has not
> yet been replaced: the random slb replacement mechanism.
> 
> There already is code to handle slb entry replacements but it does
> not work for a CPU still being set up (at the stage of the
> sometimes failure). At least that is what I expect for:
> 
> # grep -r "handle_kernel_slb_spill" /usr/src/sys/powerpc/
> /usr/src/sys/powerpc/aim/trap_subr64.S:	bl	handle_kernel_slb_spill
> /usr/src/sys/powerpc/powerpc/trap.c:       void	handle_kernel_slb_spill(int, register_t, register_t);
> /usr/src/sys/powerpc/powerpc/trap.c:handle_kernel_slb_spill(int type, register_t dar, register_t srr0)
> 
> So my hack was to separately do the potential replacement in that
> early time frame to allow the configuration for the CPU to get
> far enough along for the existing mechanism to work. (At least
> that is what I expect that I did.)
> 
> So far I've had no boot failures of any kind with the hack.
> I've removed the hacks for reporting information and things
> still work.
> 
> But I've not tried anything extensive after booting because
> things like buf*daemon* threads and pmac_thermal are randomly
> hanging up in/at:
> 
> mi_switch+0x134 sleepq_switch+0x2ec sleepq_timedwait+0x48 _sleep+0x41c
> (mi_switch seems to have called sched_switch based on the
> "+0x134" and the code in that area --but sched_switch is not
> listed)
> 
> I've no clue what is safe when one or more buf*daemon* threads
> make no progress.
> 
> For shutdown that frequently leads to timeouts for stopping some
> buf*daemon* threads (when all 8 time out it takes about 8 minutes).
> The buf*daemon* that fail are the ones that "top -SHIopid" no
> longer shows.
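
The "sometimes" in the quoted explanation can be put into rough numbers
with a small sketch. It ASSUMES each kernel slb insertion replaces a
uniformly random slot among the eligible ones. That is my simplification
for illustration, not something measured, and model_survival is a
hypothetical helper rather than anything in the tree. Under that
assumption an entry survives one replacement with probability
(slots-1)/slots, so it survives k replacements with that ratio raised
to the k-th power:

```c
#include <assert.h>

/*
 * Probability that one particular entry is still present after
 * "replacements" insertions, ASSUMING each insertion evicts a
 * uniformly random one of "eligible_slots" slots (a modelling
 * assumption only).
 */
static double
model_survival(unsigned eligible_slots, unsigned replacements)
{
	double p = 1.0;

	assert(eligible_slots > 0);
	for (unsigned k = 0; k < replacements; k++)
		p *= (double)(eligible_slots - 1) / (double)eligible_slots;
	return (p);
}
```

For example, model_survival(2, 1) is 0.5. The more insertions that
happen before the new CPU can handle its own slb faults, the lower the
chance that the pcb's entry is still there, which would match the
failure being intermittent.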

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)