proposed smp_rendezvous change
John Baldwin
jhb at FreeBSD.org
Tue May 17 12:16:33 UTC 2011
On 5/16/11 6:05 PM, Max Laier wrote:
> On Monday 16 May 2011 17:54:54 Attilio Rao wrote:
>> 2011/5/16 Max Laier<max at love2party.net>:
>>> On Monday 16 May 2011 16:46:03 John Baldwin wrote:
>>>> On Monday, May 16, 2011 4:30:44 pm Max Laier wrote:
>>>>> On Monday 16 May 2011 14:21:27 John Baldwin wrote:
>>>>>> Yes, we need to fix that. Humm, it doesn't preempt when you do a
>>>>>> critical_exit() though? Or do you use a hand-rolled critical exit
>>>>>> that doesn't do a deferred preemption?
>>>>>
>>>>> Right now I just did a manual td_critnest++/--, but I guess ...
>>>>
>>>> Ah, ok, so you would "lose" a preemption. That's not really ideal.
>>>>
>>>>>> Actually, I'm curious how the spin unlock inside the IPI could yield
>>>>>> the CPU. Oh, is rmlock doing a wakeup inside the IPI handler? I
>>>>>> guess that is ok as long as the critical_exit() just defers the
>>>>>> preemption to the end of the IPI handler.
>>>>>
>>>>> ... the earliest point where it is safe to preempt is after doing the
>>>>>
>>>>> atomic_add_int(&smp_rv_waiters[2], 1);
>>>>>
>>>>> so that we can start other IPIs again. However, since we don't accept
>>>>> new IPIs until we signal EOI in the MD code (on amd64), this might
>>>>> still not be a good place to do the yield?!?
>>>>
>>>> Hmm, yeah, you would want to do the EOI before you yield. However, we
>>>> could actually move the EOI up before calling the MI code so long as we
>>>> leave interrupts disabled for the duration of the handler (which we do).
>>>>
>>>>> The spin unlock boils down to a critical_exit() and unless we did a
>>>>> critical_enter() at some point during the redenvouz setup, we will
>>>>> yield() if we owepreempt. I'm not quite sure how that can happen, but
>>>>> it seems like there is a path that allows the scheduler to set it from
>>>>> a foreign CPU.
>>>>
>>>> No, it is only set on curthread by curthread. This is actually my main
>>>> question. I've no idea how this could happen unless the rmlock code is
>>>> actually triggering a wakeup or sched_add() in its rendezvous handler.
>>>>
>>>> I don't see anything in rm_cleanIPI() that would do that however.
>>>>
>>>> I wonder if your original issue was really fixed just by the first
>>>> patch you had which fixed the race in smp_rendezvous()?
>>>
>>> I found the stack that lead me to this patch in the first place:
>>>
>>> #0 sched_switch (td=0xffffff011a970000, newtd=0xffffff006e6784b0,
>>> flags=4) at src/sys/kern/sched_ule.c:1939
>>> #1 0xffffffff80285c7f in mi_switch (flags=6, newtd=0x0) at
>>> src/sys/kern/kern_synch.c:475
>>> #2 0xffffffff802a2cb3 in critical_exit () at
>>> src/sys/kern/kern_switch.c:185 #3 0xffffffff80465807 in spinlock_exit
>>> () at
>>> src/sys/amd64/amd64/machdep.c:1458
>>> #4 0xffffffff8027adea in rm_cleanIPI (arg=<value optimized out>) at
>>> src/sys/kern/kern_rmlock.c:180
>>> #5 0xffffffff802b9887 in smp_rendezvous_action () at
>>> src/sys/kern/subr_smp.c:402
>>> #6 0xffffffff8045e2a4 in Xrendezvous () at
>>> src/sys/amd64/amd64/apic_vector.S:235
>>> #7 0xffffffff802a2c6e in critical_exit () at
>>> src/sys/kern/kern_switch.c:179 #8 0xffffffff804365ba in uma_zfree_arg
>>> (zone=0xffffff009ff4b5a0, item=0xffffff000f34cd40,
>>> udata=0xffffff000f34ce08) at
>>> src/sys/vm/uma_core.c:2370
>>> .
>>> .
>>> .
>>>
>>> and now that I look at it again, it is clear that critical_exit() just
>>> isn't interrupt safe. I'm not sure how to fix that, yet ... but this:
>>>
>>>
>>> if (td->td_critnest == 1) {
>>> td->td_critnest = 0;
>>> if (td->td_owepreempt) {
>>> td->td_critnest = 1;
>>>
>>> clearly doesn't work.
>>
>> I'm sorry if I didn't reply to the whole rendezvous thread, but as
>> long as there is so many people taking care of it, I'll stay hidden in
>> my corner.
>>
>> I just wanted to tell that I think you are misunderstanding what
>> critical section is supposed to do.
>>
>> When an interrupt fires, it goes on the old "interrupt/kernel context"
>> which means it has not a context of his own. That is the reason why we
>> disable interrupts on spinlocks (or similar workaround for !x86
>> architectures) and this is why spinlocks are the only protection
>> usable in code that runs in interrupt context.
>>
>> Preempting just means another thread will be scheduler in the middle
>> of another thread execution path.
>>
>> This code is perfectly fine if you consider curthread won't be
>> descheduled while it is executing.
>
> Well, no - it is not. With this you can end up with a curthread that has
> td_critnest=0 and td_owepreempt=1 in interrupt context. If you use a spinlock
> on such a thread, it will do the preemption at the point where you drop the
> spinlock, this is bad in some circumstances. One example is the smp_rendevous
> case we are discussing here.
Yes, the 'owepreempt' flag can leak because we allow it to be set while
td_critnest is 0. That is the problem, and I think we added this as an
optimization in the last round of changes to this routine to avoid
excessive context switches. However, I think we will just have to
revert that optimization and prevent that state from occurring.
Hmm, the log message for the change said it was to avoid races.
http://svnweb.freebsd.org/base?view=revision&revision=144777
That is what introduced the bug of mixing those two states. Hmmm, the
reason for moving the td_critnest == 0 up according to my old e-mails
was to avoid a problem with losing preemptions. I think the best way to
fix this is to clear owepreempt earlier while td_critnest is still 1 and
cache it locally in critical_exit(). So something like this:
Index: kern/sched_ule.c
===================================================================
--- kern/sched_ule.c (revision 221536)
+++ kern/sched_ule.c (working copy)
@@ -1785,7 +1785,7 @@
td->td_oncpu = NOCPU;
if (!(flags & SW_PREEMPT))
td->td_flags &= ~TDF_NEEDRESCHED;
- td->td_owepreempt = 0;
+ MPASS(td->td_owepreempt == 0);
tdq->tdq_switchcnt++;
/*
* The lock pointer in an idle thread should never change. Reset it
Index: kern/kern_switch.c
===================================================================
--- kern/kern_switch.c (revision 221536)
+++ kern/kern_switch.c (working copy)
@@ -192,15 +192,22 @@
critical_exit(void)
{
struct thread *td;
- int flags;
+ int flags, owepreempt;
td = curthread;
KASSERT(td->td_critnest != 0,
("critical_exit: td_critnest == 0"));
if (td->td_critnest == 1) {
+ owepreempt = td->td_owepreempt;
+ td->td_owepreempt = 0;
+ /*
+ * XXX: Should move compiler_memory_barrier() from
+ * rmlock to a header.
+ */
+ __asm __volatile("":::"memory");
td->td_critnest = 0;
- if (td->td_owepreempt) {
+ if (owepreempt) {
td->td_critnest = 1;
thread_lock(td);
td->td_critnest--;
Index: kern/kern_synch.c
===================================================================
--- kern/kern_synch.c (revision 221536)
+++ kern/kern_synch.c (working copy)
@@ -400,9 +400,7 @@
if (!TD_ON_LOCK(td) && !TD_IS_RUNNING(td))
mtx_assert(&Giant, MA_NOTOWNED);
#endif
- KASSERT(td->td_critnest == 1 || (td->td_critnest == 2 &&
- (td->td_owepreempt) && (flags & SW_INVOL) != 0 &&
- newtd == NULL) || panicstr,
+ KASSERT(td->td_critnest == 1 || panicstr,
("mi_switch: switch in a critical section"));
KASSERT((flags & (SW_INVOL | SW_VOL)) != 0,
("mi_switch: switch must be voluntary or involuntary"));
Index: kern/sched_4bsd.c
===================================================================
--- kern/sched_4bsd.c (revision 221536)
+++ kern/sched_4bsd.c (working copy)
@@ -947,7 +947,7 @@
td->td_lastcpu = td->td_oncpu;
if (!(flags & SW_PREEMPT))
td->td_flags &= ~TDF_NEEDRESCHED;
- td->td_owepreempt = 0;
+ MPASS(td->td_owepreempt == 0);
td->td_oncpu = NOCPU;
/*
Hmm, I'm also not sure if I really need the compiler barrier since
td_owepreempt and td_critnest are marked volatile?
--
John Baldwin
More information about the freebsd-current
mailing list