proposed smp_rendezvous change
Max Laier
max at love2party.net
Mon May 16 20:30:48 UTC 2011
On Monday 16 May 2011 14:21:27 John Baldwin wrote:
> On Monday, May 16, 2011 1:46:33 pm Max Laier wrote:
> > I'd like to propose that we move forward with fixing the race, before
> > redesigning the API - please.
> >
> > We agree that there is a problem with the exit rendezvous. Let's fix
> > that. Do we agree that the generation approach proposed by NetApp is the
> > right way to go? Can we check it in, then? Again, I'd like some
> > comments in the code if at all possible.
>
> How about this:
>
> Index: subr_smp.c
> ===================================================================
> --- subr_smp.c (revision 221939)
> +++ subr_smp.c (working copy)
> @@ -111,6 +111,7 @@ static void (*volatile smp_rv_action_func)(void *a
> static void (*volatile smp_rv_teardown_func)(void *arg);
> static void *volatile smp_rv_func_arg;
> static volatile int smp_rv_waiters[3];
> +static volatile int smp_rv_generation;
>
> /*
> * Shared mutex to restrict busywaits between smp_rendezvous() and
> @@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
> void
> smp_rendezvous_action(void)
> {
> - void* local_func_arg = smp_rv_func_arg;
> - void (*local_setup_func)(void*) = smp_rv_setup_func;
> - void (*local_action_func)(void*) = smp_rv_action_func;
> - void (*local_teardown_func)(void*) = smp_rv_teardown_func;
> + void *local_func_arg;
> + void (*local_setup_func)(void*);
> + void (*local_action_func)(void*);
> + void (*local_teardown_func)(void*);
> + int generation;
>
> /* Ensure we have up-to-date values. */
> atomic_add_acq_int(&smp_rv_waiters[0], 1);
> while (smp_rv_waiters[0] < smp_rv_ncpus)
> cpu_spinwait();
>
> - /* setup function */
> + /* Fetch rendezvous parameters after acquire barrier. */
> + local_func_arg = smp_rv_func_arg;
> + local_setup_func = smp_rv_setup_func;
> + local_action_func = smp_rv_action_func;
> + local_teardown_func = smp_rv_teardown_func;
> + generation = smp_rv_generation;
> +
> + /*
> + * If requested, run a setup function before the main action
> + * function. Ensure all CPUs have completed the setup
> + * function before moving on to the action function.
> + */
> if (local_setup_func != smp_no_rendevous_barrier) {
> if (smp_rv_setup_func != NULL)
> smp_rv_setup_func(smp_rv_func_arg);
> -
> - /* spin on entry rendezvous */
> atomic_add_int(&smp_rv_waiters[1], 1);
> while (smp_rv_waiters[1] < smp_rv_ncpus)
> cpu_spinwait();
> }
>
> - /* action function */
> if (local_action_func != NULL)
> local_action_func(local_func_arg);
>
> - /* spin on exit rendezvous */
> + /*
> + * Signal that the main action has been completed. If a
> + * full exit rendezvous is requested, then all CPUs will
> + * wait here until all CPUs have finished the main action.
> + *
> + * Note that the write by the last CPU to finish the action
> + * may become visible to different CPUs at different times.
> + * As a result, the CPU that initiated the rendezvous may
> + * exit the rendezvous and drop the lock allowing another
> + * rendezvous to be initiated on the same CPU or a different
> + * CPU. In that case the exit sentinel may be cleared before
> + * all CPUs have noticed causing those CPUs to hang forever.
> + * Workaround this by using a generation count to notice when
> + * this race occurs and to exit the rendezvous in that case.
> + */
Maybe also assert this here:

	MPASS(generation == smp_rv_generation);
> atomic_add_int(&smp_rv_waiters[2], 1);
> if (local_teardown_func == smp_no_rendevous_barrier)
> return;
> - while (smp_rv_waiters[2] < smp_rv_ncpus)
> + while (smp_rv_waiters[2] < smp_rv_ncpus &&
> + generation == smp_rv_generation)
> cpu_spinwait();
>
> - /* teardown function */
> if (local_teardown_func != NULL)
> local_teardown_func(local_func_arg);
> }
> @@ -374,10 +398,11 @@ smp_rendezvous_cpus(cpumask_t map,
> if (ncpus == 0)
> panic("ncpus is 0 with map=0x%x", map);
>
> - /* obtain rendezvous lock */
> mtx_lock_spin(&smp_ipi_mtx);
>
> - /* set static function pointers */
> + atomic_add_acq_int(&smp_rv_generation, 1);
> +
> + /* Pass rendezvous parameters via global variables. */
> smp_rv_ncpus = ncpus;
> smp_rv_setup_func = setup_func;
> smp_rv_action_func = action_func;
> @@ -387,18 +412,25 @@ smp_rendezvous_cpus(cpumask_t map,
> smp_rv_waiters[2] = 0;
> atomic_store_rel_int(&smp_rv_waiters[0], 0);
>
> - /* signal other processors, which will enter the IPI with interrupts off */
> + /*
> + * Signal other processors, which will enter the IPI with
> + * interrupts off.
> + */
> ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
>
> /* Check if the current CPU is in the map */
> if ((map & (1 << curcpu)) != 0)
> smp_rendezvous_action();
>
> + /*
> + * If the caller did not request an exit barrier to be enforced
> + * on each CPU, ensure that this CPU waits for all the other
> + * CPUs to finish the rendezvous.
> + */
> if (teardown_func == smp_no_rendevous_barrier)
> while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
> cpu_spinwait();
>
> - /* release lock */
> mtx_unlock_spin(&smp_ipi_mtx);
> }
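For reference, this is roughly how the consumers drive the API; the names
below (example_action, example_counter) are made up for illustration and
not from the tree:

	static int example_counter;

	/* Hypothetical action function, run on each CPU in the map. */
	static void
	example_action(void *arg)
	{

		atomic_add_int((int *)arg, 1);
	}

	...
	/*
	 * With a NULL teardown the slave CPUs still spin on the exit
	 * rendezvous (smp_rv_waiters[2]), which is the wait that the
	 * generation count above bounds if a new rendezvous starts
	 * before they notice the final count.
	 */
	smp_rendezvous(NULL, example_action, NULL, &example_counter);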
>
> > A second commit should take care of the entry sync - which may or may not
> > be required on some architectures. If we decide that we don't need it,
> > we should just kill it. Otherwise, we must move all the assignments from
> > global smp_rv_* to the stack in smp_rendezvous_action to after the sync.
> > In any case, it would be nice to clean this up.
>
> The patch also fixes this in the cautious fashion (not removing the early
> rendezvous).
Good idea for now. We can always remove the sync later, and the assignment at
declaration is a style(9) violation anyway.
> > After that, I have one more bugfix for rmlock(9) - one of the three
> > consumers of this API (that I am aware of). As it uses a spinlock
> > inside its IPI action, we have to mark smp_rendezvous_action as a critical
> > section, otherwise the spin unlock might yield ... with the expected
> > consequences.
>
> Yes, we need to fix that. Humm, it doesn't preempt when you do a
> critical_exit() though? Or do you use a hand-rolled critical exit that
> doesn't do a deferred preemption?
Right now I just did a manual td_critnest++/--, but I guess ...
> Actually, I'm curious how the spin unlock inside the IPI could yield the
> CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is
> ok as long as the critical_exit() just defers the preemption to the end of
> the IPI handler.
... the earliest point where it is safe to preempt is after doing the
atomic_add_int(&smp_rv_waiters[2], 1);
so that we can start other IPIs again. However, since we don't accept new
IPIs until we signal EOI in the MD code (on amd64), this might still not be a
good place to do the yield?!?
The spin unlock boils down to a critical_exit(), and unless we did a
critical_enter() at some point during the rendezvous setup, we will yield() if
owepreempt is set. I'm not quite sure how that can happen, but it seems like
there is a path that allows the scheduler to set it from a foreign CPU.
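To be clear, the "manual version" I mean is just bumping td_critnest
directly, which - unlike critical_exit() - never runs the deferred
preemption, e.g. (sketch only):

	struct thread *td = curthread;

	td->td_critnest++;
	/* ... IPI work that may spin unlock ... */
	td->td_critnest--;	/* no owepreempt check, unlike critical_exit() */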
> > It is arguable if you should be allowed to use spinlocks in the rendevous
> > at all, but that's beside the point.
>
> I'm less convinced this is bad since we don't use NMIs for these, so we
> know the interrupted thread doesn't have any spin locks in the scheduler,
> etc.
Agreed. I'll send a patch as soon as this one's in, but it really is a simple
matter of adding critical_enter/exit (or the manual version thereof) to the
right places.
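Roughly what I have in mind (sketch only - the exact placement of
critical_exit() is the open question given the EOI concern above):

	void
	smp_rendezvous_action(void)
	{
		/* ... local copies and entry sync as in the patch above ... */

		/*
		 * Make sure a spin unlock inside the setup/action/teardown
		 * functions (e.g. rmlock's IPI action) cannot yield this CPU;
		 * any deferred preemption is then taken in critical_exit().
		 */
		critical_enter();

		/* ... setup, action, exit sync, and teardown as above ... */

		critical_exit();
	}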
Thanks,
Max