proposed smp_rendezvous change
Andriy Gapon
avg at FreeBSD.org
Sun May 15 08:13:53 UTC 2011
on 15/05/2011 07:33 Max Laier said the following:
> On Saturday 14 May 2011 11:25:36 John Baldwin wrote:
>> On 5/13/11 9:43 AM, Andriy Gapon wrote:
>>> This is a change in the vein of what I've been doing in the xcpu branch,
>>> and it's supposed to fix the issue exposed by the recent commit that
>>> (probably unintentionally) stress-tests smp_rendezvous in the TSC code.
>>>
>>> Non-essential changes:
>>> - ditch initial, and in my opinion useless, pre-setup rendezvous in
>>> smp_rendezvous_action()
>>
>> As long as IPIs ensure all data is up to date (I think this is certainly
>> true on x86) that is fine. Presumably sending an IPI has an implicit
>> store barrier on all other platforms as well?
>
> Note that if the barrier is required on any platform, then we have to move all
> the initializations after the _acq, otherwise it is pointless indeed.
Yeah, I still would like to get an explanation of why that synchronization is
needed.  I can understand the "better safe than sorry" mentality, I often stick
to it myself.  But for the benefit of improving objective knowledge it would be
useful to understand why that spin-waiting could be useful, especially given
your observation about where the assignments are made.
>>> Essential changes (the fix):
>>> - re-use freed smp_rv_waiters[2] to indicate that a slave/target is
>>> really fully done with rendezvous (i.e. it's not going to access any
>>> members of smp_rv_* pseudo-structure)
>>> - spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not
>>> re-use the smp_rv_* pseudo-structure too early
>>
>> Hmmm, so this is not actually sufficient. NetApp ran into a very
>> similar race with virtual CPUs in BHyVe. In their case because virtual
>> CPUs are threads that can be preempted, they have a chance at a longer
>> race.
>>
>> The problem that they see is that even though the values have been
>> updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2]
>> to zero before one of the other CPUs notices that it has finished.
>>
>> A more concrete example:
>>
>> Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator.
>>
>> All three CPUs run the rendezvous down to the point of increasing
>> smp_rv_waiters[2]. CPU 1 is the last one to rendezvous for some reason
>> so he notices smp_rv_waiters[2] being correct first and exits the
>> rendezvous. It immediately does a new rendezvous. Even with your
>> change, it will see that smp_rv_waiters[2] is correct and will proceed
>> to clear it before starting the next rendezvous.
>>
>> Meanwhile one of the other CPUs never sees the final update by CPU 1 to
>> smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g. if CPU
>> 1's two writes were coalesced, or in the case of the hypervisor, CPU 2
>> or 3's backing thread is preempted and both writes have posted before
>> the thread backing CPU 2 or 3 gets to run again).
>>
>> At this point that CPU (call it CPU 2) will spin forever. When CPU 1
>> sends a second rendezvous IPI it will be held in the local APIC of CPU 2
>> because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump
>> smp_rv_waiters[2] and the entire system will deadlock.
>>
>> NetApp's solution is to add a monotonically increasing generation count
>> to the rendezvous data set, which is cached in the rendezvous handler
>> and to exit the last synchronization point if either smp_rv_waiters[2]
>> is high enough, or the generation count has changed.
>>
>> I think this would also handle the case the TSC changes have provoked.
>> I'm not sure if this would be sufficient for the error case Max Laier
>> has encountered.
>
> It seems to address the issue, albeit a bit difficult to understand why this
> is correct. I'd like to see some comments in the code instead of just the
> commit log, but otherwise ... please go ahead with this.
I think it works approximately the same as what I suggested.
The difference is that my patch forces the next master CPU to wait until all
slave CPUs are fully done with the previous rendezvous (and aware of that
fact), while this patch provides a "side-channel" to tell late slave CPUs that
their rendezvous has completed once a new master CPU starts a new rendezvous.
>> NetApp's patch:
>>
>> Extra protection for consecutive smp_rendezvous() calls.
>>
>> We need this because interrupts are not really disabled when
>> executing the smp_rendezvous_action() when running inside a
>> virtual machine.
>>
>> In particular it is possible for the last cpu to increment
>> smp_rv_waiters[2] so that the exit rendezvous condition is
>> satisfied and then get interrupted by a hardware interrupt.
>>
>> When the execution of the interrupted vcpu continues it is
>> possible that one of the other vcpus that did *not* get
>> interrupted exited the old smp_rendezvous() and started a
>> new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is
>> again reset to 0. This would mean that the vcpu that got
>> interrupted will spin forever on the exit rendezvous.
>>
>> We protect this by spinning on the exit rendezvous only if
>> the generation of the smp_rendezvous() matches what we
>> started out with before incrementing 'smp_rv_waiters[2]'.
>>
>> Differences ...
>>
>> ==== //private/xxxx/sys/kern/subr_smp.c#3 (text) ====
>>
>> @@ -127,6 +127,7 @@
>> static void (*volatile smp_rv_teardown_func)(void *arg);
>> static void * volatile smp_rv_func_arg;
>> static volatile int smp_rv_waiters[3];
>> +static volatile int smp_rv_generation;
>>
>> /*
>> * Shared mutex to restrict busywaits between smp_rendezvous() and
>> @@ -418,6 +419,7 @@
>> void
>> smp_rendezvous_action(void)
>> {
>> + int generation;
>> void* local_func_arg = smp_rv_func_arg;
>> void (*local_setup_func)(void*) = smp_rv_setup_func;
>> void (*local_action_func)(void*) = smp_rv_action_func;
>> @@ -457,11 +459,14 @@
>> if (local_action_func != NULL)
>> local_action_func(local_func_arg);
>>
>> + generation = atomic_load_acq_int(&smp_rv_generation);
>> +
>> /* spin on exit rendezvous */
>> atomic_add_int(&smp_rv_waiters[2], 1);
>> if (local_teardown_func == smp_no_rendevous_barrier)
>> return;
>> - while (smp_rv_waiters[2] < smp_rv_ncpus)
>> + while (smp_rv_waiters[2] < smp_rv_ncpus &&
>> + generation == smp_rv_generation)
>> cpu_spinwait();
>>
>> /*
>> @@ -565,6 +570,8 @@
>> /* obtain rendezvous lock */
>> mtx_lock_spin(&smp_ipi_mtx);
>>
>> + atomic_add_acq_int(&smp_rv_generation, 1);
>> +
>> /* set static function pointers */
>> smp_rv_ncpus = ncpus;
>> smp_rv_setup_func = setup_func;
--
Andriy Gapon