proposed smp_rendezvous change
Andriy Gapon
avg at FreeBSD.org
Sun May 15 08:13:53 UTC 2011
on 15/05/2011 07:33 Max Laier said the following:
> On Saturday 14 May 2011 11:25:36 John Baldwin wrote:
>> On 5/13/11 9:43 AM, Andriy Gapon wrote:
>>> This is a change in the vein of what I've been doing in the xcpu branch,
>>> and it's supposed to fix the issue exposed by the recent commit that
>>> (probably unintentionally) stress-tests smp_rendezvous in the TSC code.
>>>
>>> Non-essential changes:
>>> - ditch initial, and in my opinion useless, pre-setup rendezvous in
>>> smp_rendezvous_action()
>>
>> As long as IPIs ensure all data is up to date (I think this is certainly
>> true on x86) that is fine. Presumably sending an IPI has an implicit
>> store barrier on all other platforms as well?
>
> Note that if the barrier is required on any platform, then we have to move all
> the initializations after the _acq, otherwise it is pointless indeed.
Yeah, I still would like to get an explanation of why that synchronization is
needed.  I can understand the "better safe than sorry" mentality, I often stick
to it myself.  But for the benefit of improving objective knowledge it would be
useful to understand why that spin-waiting could be useful, especially given
your observation about where the assignments are made.
>>> Essential changes (the fix):
>>> - re-use freed smp_rv_waiters[2] to indicate that a slave/target is
>>> really fully done with rendezvous (i.e. it's not going to access any
>>> members of smp_rv_* pseudo-structure)
>>> - spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not
>>> re-use the smp_rv_* pseudo-structure too early
>>
>> Hmmm, so this is not actually sufficient. NetApp ran into a very
>> similar race with virtual CPUs in BHyVe. In their case because virtual
>> CPUs are threads that can be preempted, they have a chance at a longer
>> race.
>>
>> The problem that they see is that even though the values have been
>> updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2]
>> to zero before one of the other CPUs notices that it has finished.
>>
>> A more concrete example:
>>
>> Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator.
>>
>> All three CPUs run the rendezvous down to the point of increasing
>> smp_rv_waiters[2]. CPU 1 is the last one to rendezvous for some reason
>> so he notices smp_rv_waiters[2] being correct first and exits the
>> rendezvous. It immediately does a new rendezvous. Even with your
>> change, it will see that smp_rv_waiters[2] is correct and will proceed
>> to clear it before starting the next rendezvous.
>>
>> Meanwhile one of the other CPUs never sees the final update by CPU 1 to
>> smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g. if CPU
>> 1's two writes were coalesced, or in the case of the hypervisor, CPU 2
>> or 3's backing thread is preempted and both writes have posted before
>> the thread backing CPU 2 or 3 gets to run again).
>>
>> At this point that CPU (call it CPU 2) will spin forever. When CPU 1
>> sends a second rendezvous IPI it will be held in the local APIC of CPU 2
>> because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump
>> smp_rv_waiters[2] and the entire system will deadlock.
>>
>> NetApp's solution is to add a monotonically increasing generation count
>> to the rendezvous data set, which is cached in the rendezvous handler
>> and to exit the last synchronization point if either smp_rv_waiters[2]
>> is high enough, or the generation count has changed.
>>
>> I think this would also handle the case the TSC changes have provoked.
>> I'm not sure if this would be sufficient for the error case Max Laier
>> has encountered.
>
> It seems to address the issue, albeit a bit difficult to understand why this
> is correct. I'd like to see some comments in the code instead of just the
> commit log, but otherwise ... please go ahead with this.
I think it works approximately the same as what I suggested.
The difference is that my patch forces the next master CPU to wait until all
slave CPUs are fully done with the previous rendezvous (and aware of that
fact), while this patch provides a "side-channel" to tell late slave CPUs that
their rendezvous has completed once a new master CPU starts a new rendezvous.
>> NetApp's patch:
>>
>> Extra protection for consecutive smp_rendezvous() calls.
>>
>> We need this because interrupts are not really disabled when
>> executing the smp_rendezvous_action() when running inside a
>> virtual machine.
>>
>> In particular it is possible for the last cpu to increment
>> smp_rv_waiters[2] so that the exit rendezvous condition is
>> satisfied and then get interrupted by a hardware interrupt.
>>
>> When the execution of the interrupted vcpu continues it is
>> possible that one of the other vcpus that did *not* get
>> interrupted exited the old smp_rendezvous() and started a
>> new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is
>> again reset to 0. This would mean that the vcpu that got
>> interrupted will spin forever on the exit rendezvous.
>>
>> We protect this by spinning on the exit rendezvous only if
>> the generation of the smp_rendezvous() matches what we
>> started out with before incrementing 'smp_rv_waiters[2]'.
>>
>> Differences ...
>>
>> ==== //private/xxxx/sys/kern/subr_smp.c#3 (text) ====
>>
>> @@ -127,6 +127,7 @@
>> static void (*volatile smp_rv_teardown_func)(void *arg);
>> static void * volatile smp_rv_func_arg;
>> static volatile int smp_rv_waiters[3];
>> +static volatile int smp_rv_generation;
>>
>> /*
>> * Shared mutex to restrict busywaits between smp_rendezvous() and
>> @@ -418,6 +419,7 @@
>> void
>> smp_rendezvous_action(void)
>> {
>> + int generation;
>> void* local_func_arg = smp_rv_func_arg;
>> void (*local_setup_func)(void*) = smp_rv_setup_func;
>> void (*local_action_func)(void*) = smp_rv_action_func;
>> @@ -457,11 +459,14 @@
>> if (local_action_func != NULL)
>> local_action_func(local_func_arg);
>>
>> + generation = atomic_load_acq_int(&smp_rv_generation);
>> +
>> /* spin on exit rendezvous */
>> atomic_add_int(&smp_rv_waiters[2], 1);
>> if (local_teardown_func == smp_no_rendevous_barrier)
>> return;
>> - while (smp_rv_waiters[2] < smp_rv_ncpus)
>> + while (smp_rv_waiters[2] < smp_rv_ncpus &&
>> + generation == smp_rv_generation)
>> cpu_spinwait();
>>
>> /*
>> @@ -565,6 +570,8 @@
>> /* obtain rendezvous lock */
>> mtx_lock_spin(&smp_ipi_mtx);
>>
>> + atomic_add_acq_int(&smp_rv_generation, 1);
>> +
>> /* set static function pointers */
>> smp_rv_ncpus = ncpus;
>> smp_rv_setup_func = setup_func;
--
Andriy Gapon