cvs commit: src/sys/kern kern_mutex.c
John Baldwin
jhb at freebsd.org
Mon Jun 11 18:22:33 UTC 2007
On Friday 08 June 2007 10:20:19 pm Bruce Evans wrote:
> On Fri, 8 Jun 2007, John Baldwin wrote:
>
> > On Friday 08 June 2007 10:50:15 am Bruce Evans wrote:
> >> On Thu, 7 Jun 2007, John Baldwin wrote:
>
> >>> Index: vm_zeroidle.c
> >>> ===================================================================
> >>> RCS file: /usr/cvs/src/sys/vm/vm_zeroidle.c,v
> >>> retrieving revision 1.45
> >>> diff -u -r1.45 vm_zeroidle.c
> >>> --- vm_zeroidle.c 18 May 2007 07:10:50 -0000 1.45
> >>> +++ vm_zeroidle.c 7 Jun 2007 14:56:02 -0000
> >>> @@ -147,8 +147,10 @@
> >>> #ifndef PREEMPTION
> >>> if (sched_runnable()) {
> >>> mtx_lock_spin(&sched_lock);
> >>> + mtx_unlock(&vm_page_queue_free_mtx);
> >>> mi_switch(SW_VOL, NULL);
> >>> mtx_unlock_spin(&sched_lock);
> >>> + mtx_lock(&vm_page_queue_free_mtx);
> >>> }
> >>> #endif
> >>> } else {
> >>
> >> The sched_locks are now of course thread_locks. I put the vm unlock
> >> before the thread lock since the above order seems to risk a LOR. That
> >> may have been a mistake -- we would prefer not to be switched after
> >> deciding to do it ourselves.
> >
> > Actually, the order is on purpose so that we don't waste time doing a
> > preemption when we're about to switch anyway.
>
> OK, changed back (not tested with it changed back).
>
> What about after returning from mi_switch()? The spinlock must be dropped
> before acquiring the sleep lock, but that wastes some time if preemption
> occurs. Here we expect to have more work to do and it seems best to
> ensure doing at least one page of work per context switch if possible.
Well, when we drop the spin lock we might enable a pending interrupt, in which
case we really should go handle that right away. If I have to starve
something, I'd rather starve this process than ithreads.
> >>> Index: vm_zeroidle.c
> >>> ===================================================================
> >>> RCS file: /usr/cvs/src/sys/vm/vm_zeroidle.c,v
> >>> retrieving revision 1.45
> >>> diff -u -r1.45 vm_zeroidle.c
> >>> --- vm_zeroidle.c 18 May 2007 07:10:50 -0000 1.45
> >>> +++ vm_zeroidle.c 7 Jun 2007 14:58:39 -0000
> >>> @@ -110,8 +110,10 @@
> >>> if (m != NULL && (m->flags & PG_ZERO) == 0) {
> >>> vm_pageq_remove_nowakeup(m);
> >>> mtx_unlock(&vm_page_queue_free_mtx);
> >>> + critical_exit();
> >>> pmap_zero_page_idle(m);
> >>> mtx_lock(&vm_page_queue_free_mtx);
> >>> + critical_enter();
> >>> m->flags |= PG_ZERO;
> >>> vm_pageq_enqueue(PQ_FREE + m->pc, m);
> >>> ++vm_page_zero_count;
> >>
> >> Next I will try this. I put the critical_exit() before the vm unlock.
> >> mtx_unlock() should be allowed to switch if it wants. However, we
> >> would prefer to keep context switches disabled in the above -- just drop
> >> the lock so that other CPUs can proceed.
> >
> > Again, the order here is on purpose to make sure we don't preempt while
> > holding the lock.
>
> critical_enter(9) says that critical regions should not be interlocked
> with other synchronization primitives like this. It makes an exception
> for spin locks, but vm_page_queue_free_mtx is a sleep lock.
Mostly that is to prevent foot shooting. It can be done if one is careful.
Critical sections can be thought of as basically deferring preemptions.
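To make the ordering concrete, the pattern being discussed looks roughly like
this (just a sketch of the interlocking, not the committed code, assuming the
loop did a critical_enter() before taking the free-queue mutex):

	/*
	 * A waiter woken by mtx_unlock() cannot preempt us here because
	 * we are still inside the critical section; the deferred
	 * preemption fires at critical_exit(), after the lock has
	 * already been dropped.
	 */
	mtx_unlock(&vm_page_queue_free_mtx);
	critical_exit();
	pmap_zero_page_idle(m);		/* zero the page outside the lock */
	mtx_lock(&vm_page_queue_free_mtx);
	critical_enter();		/* defer preemption until the next exit */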
> I think this is as good a place to preempt as any. In fact, the code
> can be simplified by preempting here and not in the main loop. The
> vm mutex is already dropped here, so extra code isn't needed to drop
> in the main loop. In the !PREEMPTION case, check sched_runnable()
> here. In the PREEMPTION case, can you explain why involuntary preemption
> never worked right with SMP for SCHED_4BSD? I thought that it was
> because SCHED_4BSD doesn't do enough IPIs, so it doesn't preempt idle
> threads running on other CPUs, but shouldn't preemption occur when the
> vm mutex is unlocked in the above, if sched_runnable() is true and
> there is something better to run than pagezero?
Preemption happens when a thread is made runnable, i.e. usually when an
interrupt comes in, but also when releasing locks or doing
wakeup/cv_broadcast, etc. The only thing the idle thread does other than
interrupts is release the lock. One side effect of using a critical section
here is that we are potentially adding interrupt latency, so this may be a
rather bad thing. :-/ Probably you are seeing the number of context switches
go down because we are "batching" up switches. Say you get two interrupts
during an iteration of the main loop for this process; previously you'd get 4
context switches out of that: zeroidle -> first ithread -> zeroidle, and then
later zeroidle -> second ithread -> zeroidle. Now you block those switches
until the critical exit, at which point both interrupt threads are pending,
so you do: zeroidle -> first ithread -> second ithread -> zeroidle, i.e. 3
context switches. However, the cost is increased interrupt latency and that
may explain why the "real" time went up. Lots of context switches is fairly
lame, but if the box is idle, then I'm less worried about wasting time in a
context switch (well, I'm concerned because it means we can't put the CPU to
sleep since we are executing something still, but extra overhead in an idle
thread doing a background task is not nearly as worrisome as overhead
in "important" work like ithreads).
As to why preemption doesn't work for SMP, a thread only knows to preempt if
it makes a higher priority thread runnable. This happens in mtx_unlock when
we wakeup a thread waiting on the lock, in wakeup, or when an interrupt
thread is scheduled (the interrupted thread "sees" the ithread being
scheduled). If another thread on another CPU makes a thread runnable, the
thread on the first CPU has no idea unless the second CPU explicitly sends a
message (i.e. IPI) to the first CPU asking it to yield instead.
Specifically, suppose you have threads A, B, and C with priorities A < B < C.
Suppose A is running on CPU 0, and C is running on CPU 1. If C does a wakeup
that awakens B, C isn't going to preempt to B because C is more important
(assume > means more important, even though priority values are opposite
that, which is annoying). If A had awakened B it would have preempted
though, so in theory C should look at the other CPUs, notice that A < B, and
send an IPI to CPU 0 to ask it to preempt from A to B. One thing is that
IPI_PREEMPT should set td_owepreempt if the target A is in a critical
section; I haven't checked to see if we do that currently.
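Very roughly, the missing piece would be something like the sketch below.
cpu_curthread() and send_preempt_ipi() are made-up helpers standing in for the
real per-CPU and IPI machinery, and remember that in the kernel a lower
numeric priority is the more important one, the opposite of the A < B < C
notation above:

	static void
	maybe_preempt_remote(struct thread *newtd)
	{
		struct thread *running;
		int cpu;

		for (cpu = 0; cpu <= mp_maxid; cpu++) {
			if (CPU_ABSENT(cpu))
				continue;
			running = cpu_curthread(cpu);	/* hypothetical */
			/* Is that CPU running something less important? */
			if (running->td_priority > newtd->td_priority) {
				/*
				 * Ask that CPU to yield.  If it is in a
				 * critical section, the IPI handler should
				 * just set td_owepreempt so the switch
				 * happens at its critical_exit().
				 */
				send_preempt_ipi(cpu);	/* e.g. IPI_PREEMPT */
				break;
			}
		}
	}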
> >> I added the necessary critical_exit/enter() calls here.
> >
> > They aren't needed in the !PREEMPTION case since the context switch is
> > explicit. The critical sections are only needed for the PREEMPTION case
> > really.
>
> They are needed to pass the KASSERT(td_critnest == 1 || ...) in
> mi_switch().
Ah, right.
> PREEMPTION ifdefs on the critical sections would be messy.
Agreed.
> Next I will try moving the PREEMPTION code to where the vm mutex is dropped
> naturally. I will try the following order:
I like this idea a lot.
>
> // Stay in critical section.
> drop vm mutex // though critical_enter(9) says this is wrong
> zero the page // intentionally in critical section
> #ifdef PREEMPTION(?) // even more needed now that the critical
> // section prevents a switch on the mutex drop
> if (sched_runnable()) {
> thread_lock(...)
> critical_exit() // if using critical sections with !PRE*
> mi_switch(...)
> thread_unlock(...)
> } else
> #endif
> critical_exit()
Actually, I would keep the sched_runnable(), etc. as #ifndef PREEMPTION, since
the critical_exit() already does that check (modulo concerns above). Also, I
originally wanted to not hold the critical section while zeroing the page, so
as not to impede interrupts during that operation. I was actually trying to just
avoid preempting while holding the lock. However, given my comments about
how this harms interrupt latency, maybe this is a bad idea and we should just
let priority propagation handle that for us. Moving the context switch is
probably a good idea though.
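For concreteness, the loop body might then read something like this (an
untested sketch combining your proposed ordering with the #ifndef PREEMPTION
check, assuming the critical section is entered before the mutex at the top of
the loop; not a committed change):

	mtx_unlock(&vm_page_queue_free_mtx);	/* dropped naturally here */
	pmap_zero_page_idle(m);			/* still in the critical section */
	#ifndef PREEMPTION
	if (sched_runnable()) {
		thread_lock(curthread);
		critical_exit();		/* satisfy mi_switch()'s KASSERT */
		mi_switch(SW_VOL, NULL);
		thread_unlock(curthread);
	} else
	#endif
		critical_exit();	/* with PREEMPTION this preempts if needed */
	/* reacquire before touching the free queues again */
	mtx_lock(&vm_page_queue_free_mtx);
	critical_enter();
	m->flags |= PG_ZERO;
	vm_pageq_enqueue(PQ_FREE + m->pc, m);
	++vm_page_zero_count;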
The reason I wanted to avoid preempting while holding the lock is that you can get
this case:
zeroidle -> some ithread -> some top-half thread in kernel which needs the
vm page queue mtx -> zeroidle (with top-half thread's priority; until
mtx_unlock) -> top-half thread in kernel -> zeroidle
which can be many context switches. By not switching while holding the lock,
one can reduce this to:
zeroidle -> some ithread -> some top-half thread -> zeroidle
but at the cost of adding latency to "some ithread" and "some top-half thread"
--
John Baldwin