ule+smp: small optimization for turnstile priority lending
Attilio Rao
attilio at freebsd.org
Tue Nov 6 11:02:07 UTC 2012
On 10/29/12, Attilio Rao <attilio at freebsd.org> wrote:
> On Mon, Oct 29, 2012 at 4:56 PM, Attilio Rao <attilio at freebsd.org> wrote:
>> On Wed, Oct 3, 2012 at 3:05 PM, Andriy Gapon <avg at freebsd.org> wrote:
>>> on 20/09/2012 16:14 Attilio Rao said the following:
>>>> On 9/20/12, Andriy Gapon <avg at freebsd.org> wrote:
>>> [snip]
>>>>> The patch works well as far as I can tell. Thank you!
>>>>> There is one warning with full witness enabled but it appears to be
>>>>> harmless (so far):
>>>>
>>>> Andriy,
>>>> thanks a lot for your testing and the reports you have made so far.
>>>> Unfortunately I'm going away for 2 weeks now and I won't be working on
>>>> FreeBSD in that timeframe; I will get back to this when I return.
>>>> If you want to play more with this idea feel free to extend/fix/etc.
>>>> this patch.
>>>
>>> Unfortunately I haven't found time to work on this further, but I have
>>> some additional thoughts.
>>>
>>> First, I think that the witness message was not benign: it actually
>>> warns about the same kind of deadlock that I originally had.
>>> The problem is that sched_lend_prio() is called with the target thread's
>>> td_lock held, which is the lock of the tdq on that thread's CPU. Then,
>>> in your patch, we acquire the current tdq's lock to modify its load.
>>> But if two tdq locks are to be held at the same time, they must be
>>> locked using tdq_lock_pair(), so that lock order is maintained. With
>>> the patch no tdq lock order can be maintained (it can be arbitrary),
>>> and thus it is possible to run into a deadlock.
>>
>> Indeed. I realized this after thinking about this problem while I was
>> on holiday.
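
[Editorial note: ULE already deals with this pairing problem elsewhere;
sched_ule.c's tdq_lock_pair() takes two tdq locks in a consistent global
order. A minimal sketch of that approach, reproduced from memory rather
than verbatim (TDQ_LOCK, TDQ_LOCK_FLAGS and MTX_DUPOK are the existing
FreeBSD primitives):

        static void
        tdq_lock_pair(struct tdq *one, struct tdq *two)
        {

                /*
                 * Always lock the lower-addressed tdq first so that any
                 * two CPUs agree on the order; MTX_DUPOK tells witness
                 * that holding two locks of this class is intentional.
                 */
                if (one < two) {
                        TDQ_LOCK(one);
                        TDQ_LOCK_FLAGS(two, MTX_DUPOK);
                } else {
                        TDQ_LOCK(two);
                        TDQ_LOCK_FLAGS(one, MTX_DUPOK);
                }
        }]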
>>
>>>
>>> I see two possibilities here, but I don't rule out that there are more.
>>>
>>> 1. Do something like my patch does. That is, manipulate the current
>>> thread's tdq in advance, before any other thread/tdq locks are acquired
>>> (or before td_lock is changed to point to a different lock and the
>>> current tdq is unlocked). The API can be made more generic in nature.
>>> E.g. it can look like this:
>>> void
>>> sched_thread_block(struct thread *td, int inhibitor)
>>> {
>>>         struct tdq *tdq;
>>>
>>>         THREAD_LOCK_ASSERT(td, MA_OWNED);
>>>         KASSERT(td == curthread,
>>>             ("sched_thread_block: only curthread is supported"));
>>>         tdq = TDQ_SELF();
>>>         TDQ_LOCK_ASSERT(tdq, MA_OWNED);
>>>         MPASS(td->td_lock == TDQ_LOCKPTR(tdq));
>>>         TD_SET_INHIB(td, inhibitor);
>>>         tdq_load_rem(tdq, td);
>>>         tdq_setlowpri(tdq, td);
>>> }
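
[Editorial note: a hypothetical call site, for illustration only; the
turnstile path and the TDI_LOCK inhibitor are assumptions about how this
API would be used, not part of the proposal above:

        /*
         * Hypothetical blocking path: with curthread's lock (the self
         * tdq lock) held, mark it blocked on a lock and fix up the
         * self tdq's accounting before any other tdq lock is acquired.
         */
        thread_lock(td);
        sched_thread_block(td, TDI_LOCK);]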
>>>
>>>
>>> 2. Try to do things from sched_lend_prio() based on curthread's state.
>>> This, as it seems, would require completely lock-less manipulation of
>>> the current tdq. E.g. for tdq_load we could use atomic operations (it
>>> is already accessed locklessly, but not modified). But for tdq_lowpri a
>>> more elaborate trick would be required, like having a separate field
>>> for a temporary value.
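
[Editorial note: a sketch of what the atomic tdq_load part of option 2
could look like, assuming tdq_load were changed to a type accepted by
atomic(9); the helper name is hypothetical:

        /*
         * Hypothetical lock-less load accounting for the current tdq:
         * the blocking path decrements tdq_load atomically instead of
         * taking the tdq lock; cpu_search() keeps reading the field
         * locklessly, as it already does today.
         */
        static __inline void
        tdq_load_dec_lockless(struct tdq *tdq)
        {

                atomic_subtract_int(&tdq->tdq_load, 1);
        }]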
>>
>> No, this is not going to work because tdq_lowpri and tdq_load are
>> updated and used in the same critical path (and possibly other tdq*
>> members too, in tdq_runqueue_add() for example, but I didn't check
>> that).
>
> Ok, so here I wasn't completely right, because tdq_lowpri and tdq_load
> are not read in the same critical path (cpu_search() and sched_pickcpu()
> in the self case).
> However, I was over-estimating one factor in this patch: I don't think
> we need to fix the real tdq_load for self in that case before
> cpu_search(), because self is handled in a very special way, with its
> own comparison. I then think that a patch local to sched_pickcpu(), on
> the self path, should be enough.
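
[Editorial note: a sketch of such a local change, purely hypothetical;
the variable names and the surrounding structure of sched_pickcpu() are
assumptions, and only td_priority, tdq_load, tdq_lowpri, TDQ_SELF() and
TDQ_ID() are existing names:

        /*
         * Hypothetical self-path comparison in sched_pickcpu(): the
         * blocking curthread is still accounted in the self tdq's
         * load, so discount it before comparing against the best
         * remote load found so far.
         */
        tdq = TDQ_SELF();
        load = tdq->tdq_load - (td == curthread ? 1 : 0);
        if (load <= minload && td->td_priority < tdq->tdq_lowpri)
                return (TDQ_ID(tdq));]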
>
> This brings us to another, bigger issue, however: tdq_lowpri handling
> is not perfect in that patch. Think about the case where the thread to
> be switched out is, in fact, the lowest-priority one. In that case, what
> we should do is recalculate tdq_lowpri, which is usually a very
> expensive operation and requires the self TDQ lock to be held for it to
> be done correctly.
I think that the easy fix for this is to implement a second field,
called tdq_sec_lowpri, where we store the next thread to be scheduled
after tdq_lowpri. This does not really cost anything, because we get it
for free when we already calculate tdq_lowpri, and it will let us
implement a definitive logic that helps in the case you describe.
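
[Editorial note: a minimal sketch of this idea; tdq_sec_lowpri is the
proposed new field and the helper name is hypothetical, but the update
would naturally sit where tdq_lowpri is already computed, e.g. in
tdq_setlowpri():

        /*
         * Sketch: track the two best (numerically lowest) priorities
         * on the tdq, so that when the tdq_lowpri thread itself blocks
         * we can fall back to tdq_sec_lowpri instead of rescanning the
         * runqueues under the tdq lock.
         */
        static __inline void
        tdq_update_lowpri(struct tdq *tdq, u_char pri)
        {

                if (pri < tdq->tdq_lowpri) {
                        tdq->tdq_sec_lowpri = tdq->tdq_lowpri;
                        tdq->tdq_lowpri = pri;
                } else if (pri < tdq->tdq_sec_lowpri)
                        tdq->tdq_sec_lowpri = pri;
        }]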
However, I still want to think more about the basic idea behind the
current algorithm in the self/likely CPU pickup.
Attilio
--
Peace can only be achieved by understanding - A. Einstein