Starting APs earlier during boot

Wed Feb 17 09:42:51 UTC 2016

On Tue, Feb 16, 2016 at 12:50:22PM -0800, John Baldwin wrote:
> Currently the kernel bootstraps the non-boot processors fairly early in the
> SI_SUB_CPU SYSINIT.  The APs then spin waiting to be "released".  We currently
> release the APs as one of the last steps at SI_SUB_SMP.  On the one hand this
> removes much of the need for synchronization while SYSINITs are running since
> SYSINITs basically assume they are single-threaded.  However, it also enforces
> some odd quirks.  Several places that deal with per-CPU resources have to
> split initialization up so that the BSP init happens in one SYSINIT and the
> initialization of the APs happens in a second SYSINIT at SI_SUB_SMP.
> 
> Another issue that is becoming more prominent on x86 (and probably will also
> affect other platforms if it isn't already) is that to support working
> interrupts for interrupt config hooks we bind all interrupts to the BSP during
> boot and only distribute them among other CPUs near the end at SI_SUB_SMP. 
> This is especially problematic with drivers for modern hardware allocating
> num(CPUs) interrupts (hoping to use one per CPU).  On x86 we have about 190
> IDT vectors available for device interrupts, so in theory we should be able to
> tolerate a lot of drivers doing this (e.g. 60 drivers could allocate 3
> interrupts for every CPU and we should still be fine).  However, if you have,
> say, 32 cores in a system, then you can only handle about 5 drivers doing
> this before you run out of vectors on CPU 0.
> 
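
The vector arithmetic in the paragraph above can be checked with a small
user-space sketch. The figure of 190 vectors comes from the quoted message;
the function names are made up for illustration and do not exist in the
kernel.

```c
#include <assert.h>

/* Approximate number of IDT vectors usable for device interrupts per CPU,
 * as quoted in the message (an approximation, not an exact constant). */
#define DEVICE_VECTORS_PER_CPU 190

/*
 * Hypothetical helper: while every interrupt is bound to the BSP, a driver
 * that allocates irqs_per_cpu interrupts for each of ncpus CPUs consumes
 * irqs_per_cpu * ncpus vectors on CPU 0 alone.
 */
int
max_drivers_bound_to_bsp(int ncpus, int irqs_per_cpu)
{
	assert(ncpus > 0 && irqs_per_cpu > 0);
	return (DEVICE_VECTORS_PER_CPU / (irqs_per_cpu * ncpus));
}

/*
 * Hypothetical helper: once interrupts are spread out, each CPU only has
 * to hold irqs_per_cpu vectors per driver, whatever the CPU count.
 */
int
max_drivers_distributed(int irqs_per_cpu)
{
	assert(irqs_per_cpu > 0);
	return (DEVICE_VECTORS_PER_CPU / irqs_per_cpu);
}
```

With 32 CPUs and one interrupt per CPU, max_drivers_bound_to_bsp(32, 1)
gives 5, while max_drivers_distributed(3) gives 63, which is in line with
the "about 5" and "60 drivers" figures in the message.
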
> Longer term we would also like to eventually have most drivers attach in the 
> same environment during boot as during post-boot.  Right now post-boot is 
> quite different as all CPUs are running, interrupts work, etc.  One of the 
> goals of multipass support for new-bus is to help us get there by probing 
> enough hardware to get timers working and starting the scheduler before 
> probing the rest of the devices.  That goal isn't quite realized yet.
> 
> However, we can run a slightly simpler version of our scheduler before
> timers are working.  In fact, sleep/wakeup work just fine fairly early (we
> allocate the necessary structures at SI_SUB_KMEM which is before the APs
> are even started).  Once idle threads are created and ready we could in
> theory let the APs start up and run other threads.  You just don't have working 
> timeouts.  OTOH, you can sort of simulate timeouts if you modify the scheduler 
> to yield the CPU instead of blocking the thread for a sleep with a timeout.  
> The effect would be for threads that do sleeps with a timeout to fall back to 
> polling before timers are working.  In practice, all of the early kernel 
> threads use sleeps without timeouts when idle so this doesn't really matter.
I understand that timeouts can be somewhat simulated this way.

But I do not quite understand how generic scheduling can work without
(timer) interrupts. Suppose that we have two threads 1 and 2 of the same
priority, both runnable, and due to some event thread 2 preempted thread
1. If thread 2 just runs without calling the preempt functions like
msleep, what would guarantee that thread 1 eventually gets its CPU slice?

E.g. there might be no interrupts set up yet; if the idle thread on a UP
system then gets on the CPU, the whole boot process could deadlock.
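
To make the concern concrete, here is a toy user-space model (not kernel
code) of two equal-priority threads where the running one never yields.
The function name, the step count and the quantum are all assumptions made
up for the illustration; a "timer tick" here simply forces a switch after a
fixed number of steps.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: thread 2 (index 1) has just preempted thread 1 (index 0) and
 * never calls a sleep or yield function.  ran[i] counts the steps each
 * thread spent on the CPU.  Without a timer tick nothing ever forces a
 * context switch; with one, the threads alternate every `quantum` steps.
 */
void
simulate_two_threads(int steps, bool timer_tick, int quantum, int ran[2])
{
	int current = 1;	/* thread 2 is on the CPU */
	int used = 0;		/* steps consumed in the current slice */

	assert(steps >= 0 && quantum > 0);
	ran[0] = ran[1] = 0;
	for (int i = 0; i < steps; i++) {
		ran[current]++;
		used++;
		if (timer_tick && used == quantum) {
			current = 1 - current;	/* forced round-robin switch */
			used = 0;
		}
	}
}
```

Running it for 100 steps without the tick leaves thread 1 with zero steps;
with the tick and a quantum of 10, each thread gets 50.
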

> 
> I've implemented these changes and tested them for x86.  For x86 at least
> AP startup needed some bits of the interrupt infrastructure in place, so
> I moved SI_SUB_SMP up to after SI_SUB_INTR but before SI_SUB_SOFTINTR.  I
> modified the *sleep() and cv_*wait*() routines to not always bail if cold
> is true.  Instead, sleeps without a timeout are permitted to sleep
> "normally".  Sleeps with a timeout drop their interlock and yield the
> CPU (but remain runnable).  Since APs are now fully running this means
> interrupts are now routed to all CPUs from the get-go, removing the need for 
> the post-boot shuffle.  This also resolves the issue of running out of IDT 
> vectors on the boot CPU.
> 
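
The sleep behaviour described above can be summarised as a small decision
function. This is only a sketch of the policy as stated in the message, not
the code from the patch; the enum and function names are hypothetical. It
assumes the usual msleep() convention that a timo of 0 means "no timeout".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, used only for this illustration. */
enum early_sleep_action {
	EARLY_SLEEP_BLOCK,		/* block until wakeup, no timer needed */
	EARLY_SLEEP_BLOCK_TIMED,	/* block with a timeout armed */
	EARLY_SLEEP_YIELD		/* drop interlock, yield, stay runnable */
};

/*
 * cold: true while the system is still booting and timers do not work.
 * timo: timeout in ticks, 0 meaning "sleep until woken".
 */
enum early_sleep_action
early_sleep_policy(bool cold, int timo)
{
	assert(timo >= 0);
	if (timo == 0)
		return (EARLY_SLEEP_BLOCK);	/* allowed even when cold */
	if (cold)
		return (EARLY_SLEEP_YIELD);	/* caller falls back to polling */
	return (EARLY_SLEEP_BLOCK_TIMED);
}
```

The point of the split is that only the second case changes during early
boot: a thread that asked for a timeout keeps getting rescheduled and
re-checks its condition instead of waiting on a timer that does not exist
yet.
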
> I believe that adapting other platforms to this change should be relatively
> simple, but we should do that before committing the full patch.  I do think
> that some parts of the patch (such as the changes to the sleep routines, and
> using SI_SUB_LAST instead of SI_SUB_SMP as a catch-all SYSINIT) can be 
> committed now without breaking anything.
> 
> However, I'd like feedback on the general idea and if it is acceptable I'd
> like to coordinate testing with other platforms so this can go into the
> tree.
> 
> The current changes are in the 'ap_startup' branch at github/bsdjhb/freebsd.
> You can view them here:
> 
> https://github.com/bsdjhb/freebsd/compare/master...bsdjhb:ap_startup
> 
> -- 
> John Baldwin
> _______________________________________________
> freebsd-arch at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-arch
> To unsubscribe, send any mail to "freebsd-arch-unsubscribe at freebsd.org"