proposed smp_rendezvous change
Max Laier
max at love2party.net
Mon May 16 20:30:48 UTC 2011
On Monday 16 May 2011 14:21:27 John Baldwin wrote:
> On Monday, May 16, 2011 1:46:33 pm Max Laier wrote:
> > I'd like to propose that we move forward with fixing the race, before
> > redesigning the API - please.
> >
> > We agree that there is a problem with the exit rendezvous. Let's fix
> > that. Do we agree that the generation approach proposed by NetApp is the
> > right way to go? Can we check it in, then? Again, I'd like some
> > comments in the code if at all possible.
>
> How about this:
>
> Index: subr_smp.c
> ===================================================================
> --- subr_smp.c (revision 221939)
> +++ subr_smp.c (working copy)
> @@ -111,6 +111,7 @@ static void (*volatile smp_rv_action_func)(void *a
> static void (*volatile smp_rv_teardown_func)(void *arg);
> static void *volatile smp_rv_func_arg;
> static volatile int smp_rv_waiters[3];
> +static volatile int smp_rv_generation;
>
> /*
> * Shared mutex to restrict busywaits between smp_rendezvous() and
> @@ -311,39 +312,62 @@ restart_cpus(cpumask_t map)
> void
> smp_rendezvous_action(void)
> {
> - void* local_func_arg = smp_rv_func_arg;
> - void (*local_setup_func)(void*) = smp_rv_setup_func;
> - void (*local_action_func)(void*) = smp_rv_action_func;
> - void (*local_teardown_func)(void*) = smp_rv_teardown_func;
> + void *local_func_arg;
> + void (*local_setup_func)(void*);
> + void (*local_action_func)(void*);
> + void (*local_teardown_func)(void*);
> + int generation;
>
> /* Ensure we have up-to-date values. */
> atomic_add_acq_int(&smp_rv_waiters[0], 1);
> while (smp_rv_waiters[0] < smp_rv_ncpus)
> cpu_spinwait();
>
> - /* setup function */
> + /* Fetch rendezvous parameters after acquire barrier. */
> + local_func_arg = smp_rv_func_arg;
> + local_setup_func = smp_rv_setup_func;
> + local_action_func = smp_rv_action_func;
> + local_teardown_func = smp_rv_teardown_func;
> + generation = smp_rv_generation;
> +
> + /*
> + * If requested, run a setup function before the main action
> + * function. Ensure all CPUs have completed the setup
> + * function before moving on to the action function.
> + */
> if (local_setup_func != smp_no_rendevous_barrier) {
> if (smp_rv_setup_func != NULL)
> smp_rv_setup_func(smp_rv_func_arg);
> -
> - /* spin on entry rendezvous */
> atomic_add_int(&smp_rv_waiters[1], 1);
> while (smp_rv_waiters[1] < smp_rv_ncpus)
> cpu_spinwait();
> }
>
> - /* action function */
> if (local_action_func != NULL)
> local_action_func(local_func_arg);
>
> - /* spin on exit rendezvous */
> + /*
> + * Signal that the main action has been completed. If a
> + * full exit rendezvous is requested, then all CPUs will
> + * wait here until all CPUs have finished the main action.
> + *
> + * Note that the write by the last CPU to finish the action
> + * may become visible to different CPUs at different times.
> + * As a result, the CPU that initiated the rendezvous may
> + * exit the rendezvous and drop the lock allowing another
> + * rendezvous to be initiated on the same CPU or a different
> + * CPU. In that case the exit sentinel may be cleared before
> + * all CPUs have noticed causing those CPUs to hang forever.
> + * Workaround this by using a generation count to notice when
> + * this race occurs and to exit the rendezvous in that case.
> + */
Maybe also assert this here:

	MPASS(generation == smp_rv_generation);
> atomic_add_int(&smp_rv_waiters[2], 1);
> if (local_teardown_func == smp_no_rendevous_barrier)
> return;
> - while (smp_rv_waiters[2] < smp_rv_ncpus)
> + while (smp_rv_waiters[2] < smp_rv_ncpus &&
> + generation == smp_rv_generation)
> cpu_spinwait();
>
> - /* teardown function */
> if (local_teardown_func != NULL)
> local_teardown_func(local_func_arg);
> }
> @@ -374,10 +398,11 @@ smp_rendezvous_cpus(cpumask_t map,
> if (ncpus == 0)
> panic("ncpus is 0 with map=0x%x", map);
>
> - /* obtain rendezvous lock */
> mtx_lock_spin(&smp_ipi_mtx);
>
> - /* set static function pointers */
> + atomic_add_acq_int(&smp_rv_generation, 1);
> +
> + /* Pass rendezvous parameters via global variables. */
> smp_rv_ncpus = ncpus;
> smp_rv_setup_func = setup_func;
> smp_rv_action_func = action_func;
> @@ -387,18 +412,25 @@ smp_rendezvous_cpus(cpumask_t map,
> smp_rv_waiters[2] = 0;
> atomic_store_rel_int(&smp_rv_waiters[0], 0);
>
> - /* signal other processors, which will enter the IPI with interrupts off */
> + /*
> + * Signal other processors, which will enter the IPI with
> + * interrupts off.
> + */
> ipi_selected(map & ~(1 << curcpu), IPI_RENDEZVOUS);
>
> /* Check if the current CPU is in the map */
> if ((map & (1 << curcpu)) != 0)
> smp_rendezvous_action();
>
> + /*
> + * If the caller did not request an exit barrier to be enforced
> + * on each CPU, ensure that this CPU waits for all the other
> + * CPUs to finish the rendezvous.
> + */
> if (teardown_func == smp_no_rendevous_barrier)
> while (atomic_load_acq_int(&smp_rv_waiters[2]) < ncpus)
> cpu_spinwait();
>
> - /* release lock */
> mtx_unlock_spin(&smp_ipi_mtx);
> }
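For reference, this is roughly how the consumers drive the API; the names
below (example_action, example_counter) are made up for illustration and
not from the tree:

	static int example_counter;

	/* Hypothetical action function, run on each CPU in the map. */
	static void
	example_action(void *arg)
	{

		atomic_add_int((int *)arg, 1);
	}

	...
	/*
	 * With a NULL teardown the slave CPUs still spin on the exit
	 * rendezvous (smp_rv_waiters[2]), which is the wait that the
	 * generation count above bounds if a new rendezvous starts
	 * before they notice the final count.
	 */
	smp_rendezvous(NULL, example_action, NULL, &example_counter);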
>
> > A second commit should take care of the entry sync - which may or may not
> > be required on some architectures. If we decide that we don't need it,
> > we should just kill it. Otherwise, we must move all the assignments from
> > global smp_rv_* to the stack in smp_rendezvous_action to after the sync.
> > In any case, it would be nice to clean this up.
>
> The patch also fixes this in the cautious fashion (not removing the early
> rendezvous).
Good idea for now. We can always remove the sync later, and the assignment at
declaration is a style(9) violation anyway.
> > After that, I have one more bugfix for rmlock(9) - one of the three
> > consumers of this API (that I am aware of). As it uses a spinlock
> > inside its IPI action, we have to mark smp_rendezvous_action as a critical
> > section, otherwise the spin unlock might yield ... with the expected
> > consequences.
>
> Yes, we need to fix that. Humm, it doesn't preempt when you do a
> critical_exit() though? Or do you use a hand-rolled critical exit that
> doesn't do a deferred preemption?
Right now I just did a manual td_critnest++/--, but I guess ...
> Actually, I'm curious how the spin unlock inside the IPI could yield the
> CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I guess that is
> ok as long as the critical_exit() just defers the preemption to the end of
> the IPI handler.
... the earliest point where it is safe to preempt is after doing the
atomic_add_int(&smp_rv_waiters[2], 1);
so that we can start other IPIs again. However, since we don't accept new
IPIs until we signal EOI in the MD code (on amd64), this might still not be a
good place to do the yield?!?
The spin unlock boils down to a critical_exit(), and unless we did a
critical_enter() at some point during the rendezvous setup, we will yield() if
owepreempt is set. I'm not quite sure how that can happen, but it seems like
there is a path that allows the scheduler to set it from a foreign CPU.
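To be clear, the "manual version" I mean is just bumping td_critnest
directly, which - unlike critical_exit() - never runs the deferred
preemption, e.g. (sketch only):

	struct thread *td = curthread;

	td->td_critnest++;
	/* ... IPI work that may spin unlock ... */
	td->td_critnest--;	/* no owepreempt check, unlike critical_exit() */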
> > It is arguable if you should be allowed to use spinlocks in the rendevous
> > at all, but that's beside the point.
>
> I'm less convinced this is bad since we don't use NMIs for these, so we
> know the interrupted thread doesn't have any spin locks in the scheduler,
> etc.
Agreed. I'll send a patch as soon as this one's in, but it really is a simple
matter of adding critical_enter/exit (or the manual version thereof) to the
right places.
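Roughly what I have in mind (sketch only - the exact placement of
critical_exit() is the open question given the EOI concern above):

	void
	smp_rendezvous_action(void)
	{
		/* ... local copies and entry sync as in the patch above ... */

		/*
		 * Make sure a spin unlock inside the setup/action/teardown
		 * functions (e.g. rmlock's IPI action) cannot yield this CPU;
		 * any deferred preemption is then taken in critical_exit().
		 */
		critical_enter();

		/* ... setup, action, exit sync, and teardown as above ... */

		critical_exit();
	}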
Thanks,
Max