Re: git: 03f868b163ad - main - x86: Add a required store-load barrier in cpu_idle()

From: Mark Johnston <markj@FreeBSD.org>
Date: Sun, 17 Jul 2022 10:16:24 UTC
On Sun, Jul 17, 2022 at 11:31:43AM +0200, Tijl Coosemans wrote:
> On Thu, 14 Jul 2022 14:47:37 GMT Mark Johnston <markj@FreeBSD.org> wrote:
> > The branch main has been updated by markj:
> > 
> > URL: https://cgit.FreeBSD.org/src/commit/?id=03f868b163ad46d6f7cb03dc46fb83ca01fb8f69
> > 
> > commit 03f868b163ad46d6f7cb03dc46fb83ca01fb8f69
> > Author:     Mark Johnston <markj@FreeBSD.org>
> > AuthorDate: 2022-07-14 14:24:25 +0000
> > Commit:     Mark Johnston <markj@FreeBSD.org>
> > CommitDate: 2022-07-14 14:28:01 +0000
> > 
> >     x86: Add a required store-load barrier in cpu_idle()
> >     
> >     ULE's tdq_notify() tries to avoid delivering IPIs to the idle thread.
> >     In particular, it tries to detect whether the idle thread is running.
> >     There are two mechanisms for this:
> >     - tdq_cpu_idle, an MI flag which is set prior to calling cpu_idle().  If
> >       tdq_cpu_idle == 0, then no IPI is needed;
> >     - idle_state, an x86-specific state flag which is updated after
> >       cpu_idleclock() is called.
> >     
> >     The implementation of the second mechanism is racy; the race can cause a
> >     CPU to go to sleep with pending work.  Specifically, cpu_idle_*() set
> >     idle_state = STATE_SLEEPING, then check for pending work by loading the
> >     tdq_load field of the CPU's runqueue.  These operations can be reordered
> >     so that the idle thread observes tdq_load == 0, and tdq_notify()
> >     observes idle_state == STATE_RUNNING.
> >     
> >     Some counters indicate that the idle_state check in tdq_notify()
> >     frequently elides an IPI.  So, fix the problem by inserting a fence
> >     after the store to idle_state, immediately before idling the CPU.
> >     
> >     PR:             264867
> >     Reviewed by:    mav, kib, jhb
> >     MFC after:      1 month
> >     Sponsored by:   The FreeBSD Foundation
> >     Differential Revision:  https://reviews.freebsd.org/D35777
> > ---
> >  sys/x86/x86/cpu_machdep.c | 103 ++++++++++++++++++++++++++++------------------
> >  1 file changed, 62 insertions(+), 41 deletions(-)
> > 
> > diff --git a/sys/x86/x86/cpu_machdep.c b/sys/x86/x86/cpu_machdep.c
> > index fa11f64e2779..040438043c73 100644
> > --- a/sys/x86/x86/cpu_machdep.c
> > +++ b/sys/x86/x86/cpu_machdep.c
> > @@ -52,6 +52,7 @@ __FBSDID("$FreeBSD$");
> >  #include "opt_maxmem.h"
> >  #include "opt_mp_watchdog.h"
> >  #include "opt_platform.h"
> > +#include "opt_sched.h"
> >  #ifdef __i386__
> >  #include "opt_apic.h"
> >  #endif
> > @@ -532,32 +533,25 @@ static int	idle_mwait = 1;		/* Use MONITOR/MWAIT for short idle. */
> >  SYSCTL_INT(_machdep, OID_AUTO, idle_mwait, CTLFLAG_RWTUN, &idle_mwait,
> >      0, "Use MONITOR/MWAIT for short idle");
> >  
> > -static void
> > -cpu_idle_acpi(sbintime_t sbt)
> > +static bool
> > +cpu_idle_enter(int *statep, int newstate)
> >  {
> > -	int *state;
> > +	KASSERT(atomic_load_int(statep) == STATE_RUNNING,
> > +	    ("%s: state %d", __func__, atomic_load_int(statep)));
> >  
> > -	state = &PCPU_PTR(monitorbuf)->idle_state;
> > -	atomic_store_int(state, STATE_SLEEPING);
> > -
> > -	/* See comments in cpu_idle_hlt(). */
> > -	disable_intr();
> > -	if (sched_runnable())
> > -		enable_intr();
> > -	else if (cpu_idle_hook)
> > -		cpu_idle_hook(sbt);
> > -	else
> > -		acpi_cpu_c1();
> > -	atomic_store_int(state, STATE_RUNNING);
> > -}
> > -
> > -static void
> > -cpu_idle_hlt(sbintime_t sbt)
> > -{
> > -	int *state;
> > -
> > -	state = &PCPU_PTR(monitorbuf)->idle_state;
> > -	atomic_store_int(state, STATE_SLEEPING);
> > +	/*
> > +	 * A fence is needed to prevent reordering of the load in
> > +	 * sched_runnable() with this store to the idle state word.  Without it,
> > +	 * cpu_idle_wakeup() can observe the state as STATE_RUNNING after having
> > +	 * added load to the queue, and elide an IPI.  Then, sched_runnable()
> > +	 * can observe tdq_load == 0, so the CPU ends up idling with pending
> > +	 * work.  tdq_notify() similarly ensures that a prior update to tdq_load
> > +	 * is visible before calling cpu_idle_wakeup().
> > +	 */
> > +	atomic_store_int(statep, newstate);
> > +#if defined(SCHED_ULE) && defined(SMP)
> > +	atomic_thread_fence_seq_cst();
> > +#endif
> 
> Can't you combine the store and the fence with
> atomic_store_explicit(memory_order_seq_cst) or, since this is
> x86-specific code, atomic_swap_int?
> 
> #if defined(SCHED_ULE) && defined(SMP)
> 	atomic_swap_int(statep, newstate);
> #else
> 	atomic_store_int(statep, newstate);
> #endif

Yes, I believe so.  I wrote it that way initially, but decided it was
nicer to use the same fence as the MI code with which this synchronizes.
What's the advantage of using xchg instead, given that the CPU is about
to idle?
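
Either way, what's needed on both sides is the classic store-buffering
handshake.  For the archives, here's a rough user-space model using C11
atomics (the names are stand-ins for the kernel's; this is a sketch,
not the actual implementation):

#include <stdatomic.h>
#include <stdbool.h>

enum { STATE_RUNNING, STATE_SLEEPING };

static _Atomic int idle_state = STATE_RUNNING;
static _Atomic int tdq_load = 0;

/* Idle side: publish STATE_SLEEPING, then check for pending work. */
static bool
idler_may_sleep(void)
{
	atomic_store_explicit(&idle_state, STATE_SLEEPING,
	    memory_order_relaxed);
	/*
	 * Store-load barrier.  On x86 the store and barrier could be
	 * folded into a single exchange instead, since xchg implies a
	 * full barrier.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&tdq_load,
	    memory_order_relaxed) == 0);
}

/* Waker side: publish work, then decide whether an IPI is needed. */
static bool
waker_needs_ipi(void)
{
	atomic_fetch_add_explicit(&tdq_load, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&idle_state,
	    memory_order_relaxed) == STATE_SLEEPING);
}

With a seq_cst fence between the store and the load on each side, at
least one side must observe the other's write: either the idler sees
the new load and stays awake, or the waker sees STATE_SLEEPING and
sends the IPI.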

> >  	/*
> >  	 * Since we may be in a critical section from cpu_idle(), if
> > @@ -576,35 +570,62 @@ cpu_idle_hlt(sbintime_t sbt)
> >  	 * interrupt.
> >  	 */
> >  	disable_intr();
> > -	if (sched_runnable())
> > +	if (sched_runnable()) {
> >  		enable_intr();
> > -	else
> > -		acpi_cpu_c1();
> > -	atomic_store_int(state, STATE_RUNNING);
> > +		atomic_store_int(statep, STATE_RUNNING);
> > +		return (false);
> > +	} else {
> > +		return (true);
> > +	}
> >  }
> >  
> >  static void
> > -cpu_idle_mwait(sbintime_t sbt)
> > +cpu_idle_exit(int *statep)
> > +{
> > +	atomic_store_int(statep, STATE_RUNNING);
> > +}
> 
> > +static void
> > +cpu_idle_hlt(sbintime_t sbt)
> > +{
> > +	int *state;
> > +
> > +	state = &PCPU_PTR(monitorbuf)->idle_state;
> > +	if (cpu_idle_enter(state, STATE_SLEEPING)) {
> > +		acpi_cpu_c1();
> >  		atomic_store_int(state, STATE_RUNNING);
> 
> Should this call be replaced with cpu_idle_exit(state), just for
> consistency?

Yes, that would be a bit better.
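I.e., the tail of cpu_idle_hlt() would become something like this
(untested sketch, derived from the hunk quoted above):

static void
cpu_idle_hlt(sbintime_t sbt)
{
	int *state;

	state = &PCPU_PTR(monitorbuf)->idle_state;
	if (cpu_idle_enter(state, STATE_SLEEPING)) {
		acpi_cpu_c1();
		cpu_idle_exit(state);
	}
}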

> > -		enable_intr();
> > -		return;
> >  	}
> > +}