sched_pin() versus PCPU_GET
John Baldwin
jhb at freebsd.org
Thu Aug 5 19:46:33 UTC 2010
On Thursday, August 05, 2010 11:59:37 am mdf at freebsd.org wrote:
> On Wed, Aug 4, 2010 at 11:55 AM, John Baldwin <jhb at freebsd.org> wrote:
> > On Wednesday, August 04, 2010 12:20:31 pm mdf at freebsd.org wrote:
> >> On Wed, Aug 4, 2010 at 2:26 PM, John Baldwin <jhb at freebsd.org> wrote:
> >> > On Tuesday, August 03, 2010 9:46:16 pm mdf at freebsd.org wrote:
> >> >> On Fri, Jul 30, 2010 at 2:31 PM, John Baldwin <jhb at freebsd.org> wrote:
> >> >> > On Friday, July 30, 2010 10:08:22 am John Baldwin wrote:
> >> >> >> On Thursday, July 29, 2010 7:39:02 pm mdf at freebsd.org wrote:
> >> >> >> > We've seen a few instances at work where witness_warn() in ast()
> >> >> >> > indicates the sched lock is still held, but the place where it
> >> >> >> > claims the lock was acquired sometimes cannot still be holding
> >> >> >> > it, like:
> >> >> >> >
> >> >> >> > thread_lock(td);
> >> >> >> > td->td_flags &= ~TDF_SELECT;
> >> >> >> > thread_unlock(td);
> >> >> >> >
> >> >> >> > What I was wondering is, even though the assembly I see in
> >> >> >> > objdump -S for witness_warn has the increment of td_pinned
> >> >> >> > before the PCPU_GET:
> >> >> >> >
> >> >> >> > ffffffff802db210: 65 48 8b 1c 25 00 00 mov %gs:0x0,%rbx
> >> >> >> > ffffffff802db217: 00 00
> >> >> >> > ffffffff802db219: ff 83 04 01 00 00 incl 0x104(%rbx)
> >> >> >> > 	 * Pin the thread in order to avoid problems with thread migration.
> >> >> >> > 	 * Once that all verifies are passed about spinlocks ownership,
> >> >> >> > 	 * the thread is in a safe path and it can be unpinned.
> >> >> >> > 	 */
> >> >> >> > 	sched_pin();
> >> >> >> > 	lock_list = PCPU_GET(spinlocks);
> >> >> >> > ffffffff802db21f: 65 48 8b 04 25 48 00 mov %gs:0x48,%rax
> >> >> >> > ffffffff802db226: 00 00
> >> >> >> > if (lock_list != NULL && lock_list->ll_count != 0) {
> >> >> >> > ffffffff802db228: 48 85 c0 test %rax,%rax
> >> >> >> > 	 * Pin the thread in order to avoid problems with thread migration.
> >> >> >> > 	 * Once that all verifies are passed about spinlocks ownership,
> >> >> >> > 	 * the thread is in a safe path and it can be unpinned.
> >> >> >> > 	 */
> >> >> >> > 	sched_pin();
> >> >> >> > 	lock_list = PCPU_GET(spinlocks);
> >> >> >> > ffffffff802db22b: 48 89 85 f0 fe ff ff  mov    %rax,-0x110(%rbp)
> >> >> >> > ffffffff802db232: 48 89 85 f8 fe ff ff  mov    %rax,-0x108(%rbp)
> >> >> >> > 	if (lock_list != NULL && lock_list->ll_count != 0) {
> >> >> >> > ffffffff802db239: 0f 84 ff 00 00 00     je     ffffffff802db33e <witness_warn+0x30e>
> >> >> >> > ffffffff802db23f: 44 8b 60 50           mov    0x50(%rax),%r12d
> >> >> >> >
> >> >> >> > Is it possible for the hardware to do any re-ordering here?
> >> >> >> >
> >> >> >> > The reason I'm suspicious is not just that the code doesn't have
> >> >> >> > a lock leak at the indicated point, but that in one instance I
> >> >> >> > can see in the dump that the lock_list local from witness_warn
> >> >> >> > is from the pcpu structure for CPU 0 (and I was warned about
> >> >> >> > sched lock 0), while the CPU id in panic_cpu is 2.  So clearly
> >> >> >> > the thread was being migrated right around panic time.
> >> >> >> >
> >> >> >> > This is the amd64 kernel on stable/7.  I'm not sure exactly
> >> >> >> > what kind of hardware; it's a 4-way Intel chip from about 3 or
> >> >> >> > 4 years ago IIRC.
> >> >> >> >
> >> >> >> > So... do we need some kind of barrier in the code for
> >> >> >> > sched_pin() for it to really do what it claims?  Could the
> >> >> >> > hardware have re-ordered the "mov %gs:0x48,%rax" PCPU_GET to
> >> >> >> > before the sched_pin() increment?
> >> >> >>
> >> >> >> Hmmm, I think it might be able to, because the two accesses refer
> >> >> >> to different locations.
> >> >> >>
> >> >> >> Note this rule in Section 8.2.2 of the Intel SDM, Volume 3A:
> >> >> >>
> >> >> >> • Reads may be reordered with older writes to different locations
> >> >> >> but not with older writes to the same location.
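> >> >> >>
> >> >> >> (That is the classic store-buffering pattern.  A minimal sketch,
> >> >> >> with x and y both initially 0 and each CPU running one column:
> >> >> >>
> >> >> >> 	/* CPU 0 */		/* CPU 1 */
> >> >> >> 	x = 1;			y = 1;
> >> >> >> 	r1 = y;			r2 = x;
> >> >> >>
> >> >> >> On x86 each read may pass the older write to the other location,
> >> >> >> so r1 == 0 && r2 == 0 is a permitted outcome.  The incl of
> >> >> >> td_pinned followed by the load of pc_spinlocks is exactly a write
> >> >> >> followed by a read of a different location.)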
> >> >> >>
> >> >> >> It is certainly true that sparc64 could reorder with RMO.  I
> >> >> >> believe ia64 could reorder as well.  Since sched_pin/unpin are
> >> >> >> frequently used to provide this sort of synchronization, we could
> >> >> >> use memory barriers in pin/unpin like so:
> >> >> >>
> >> >> >> sched_pin()
> >> >> >> {
> >> >> >> 	/* Acquire: the load of td_pinned completes before any later access. */
> >> >> >> 	td->td_pinned = atomic_load_acq_int(&td->td_pinned) + 1;
> >> >> >> }
> >> >> >>
> >> >> >> sched_unpin()
> >> >> >> {
> >> >> >> 	/* Release: all earlier accesses complete before td_pinned is stored. */
> >> >> >> 	atomic_store_rel_int(&td->td_pinned, td->td_pinned - 1);
> >> >> >> }
> >> >> >>
> >> >> >> We could also just use atomic_add_acq_int() and
> >> >> >> atomic_subtract_rel_int(), but they are slightly more heavyweight,
> >> >> >> though it would be more clear what is happening, I think.
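> >> >> >>
> >> >> >> That heavier variant would just be (a sketch, using the atomic(9)
> >> >> >> read-modify-write ops):
> >> >> >>
> >> >> >> sched_pin()
> >> >> >> {
> >> >> >> 	/* Atomic increment with acquire semantics. */
> >> >> >> 	atomic_add_acq_int(&td->td_pinned, 1);
> >> >> >> }
> >> >> >>
> >> >> >> sched_unpin()
> >> >> >> {
> >> >> >> 	/* Atomic decrement with release semantics. */
> >> >> >> 	atomic_subtract_rel_int(&td->td_pinned, 1);
> >> >> >> }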
> >> >> >
> >> >> > However, to actually get a race you'd have to have an interrupt
> >> >> > fire and migrate you so that the speculative read was from the
> >> >> > other CPU.  But I don't think the speculative read would be
> >> >> > preserved in that case.  The CPU has to return to a specific PC
> >> >> > when it returns from the interrupt, and it has no way of storing
> >> >> > the state for whatever speculative reordering it might be doing,
> >> >> > so presumably it is thrown away?  I suppose it is possible that it
> >> >> > actually retires both instructions (but reordered) and then
> >> >> > returns to the PC value after the read of lock_list after the
> >> >> > interrupt.  However, in that case the scheduler would not migrate,
> >> >> > as it would see td_pinned != 0.  To get the race you have to have
> >> >> > the interrupt take effect prior to modifying td_pinned, so I think
> >> >> > the processor would have to discard the reordered read of
> >> >> > lock_list so it could safely resume execution at the 'incl'
> >> >> > instruction.
> >> >> >
> >> >> > The other nit there, on x86 at least, is that the incl instruction
> >> >> > is doing both a read and a write, and another rule in Section 8.2.2
> >> >> > is this:
> >> >> >
> >> >> > • Reads are not reordered with other reads.
> >> >> >
> >> >> > That would seem to prevent the read of lock_list from passing the
> >> >> > read of td_pinned in the incl instruction on x86.
> >> >>
> >> >> I wonder how that's interpreted in the microcode, though?  I.e. if
> >> >> the incl instruction decodes to load, add, store, does the h/w allow
> >> >> the later reads to pass the final store?
> >> >
> >> > Well, the architecture is defined in terms of the ISA, not the
> >> > microcode, per se, so I think it would have to treat the read for the
> >> > incl as being an earlier read than 'spinlocks'.
> >> >
> >> >> I added the following:
> >> >>
> >> >> sched_pin();
> >> >> lock_list = PCPU_GET(spinlocks);
> >> >> if (lock_list != NULL && lock_list->ll_count != 0) {
> >> >> + /* XXX debug for bug 67957 */
> >> >> + mfence();
> >> >> + lle = PCPU_GET(spinlocks);
> >> >> + if (lle != lock_list) {
> >> >> + panic("Bug 67957: had lock list %p, now %p\n",
> >> >> + lock_list, lle);
> >> >> + }
> >> >> + /* XXX end debug */
> >> >> sched_unpin();
> >> >>
> >> >> /*
> >> >>
> >> >> ... and the panic triggered. I think it's more likely that some
> >> >> barrier is needed in sched_pin() than that %gs is getting corrupted
> >> >> but can always be dereferenced.
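> >> >>
> >> >> (For reference, the mfence() used here just needs to issue the amd64
> >> >> mfence instruction; assume a minimal inline along these lines.)
> >> >>
> >> >> static __inline void
> >> >> mfence(void)
> >> >> {
> >> >> 	/* Full barrier: no load or store may cross it in either direction. */
> >> >> 	__asm __volatile("mfence" : : : "memory");
> >> >> }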
> >> >
> >> > Actually, I would beg to differ in that case. If PCPU_GET(spinlocks)
> >> > returns non-NULL, then it means that you hold a spin lock,
> >>
> >> ll_count is 0 for the "correct" pc_spinlocks and non-zero for the
> >> "wrong" one, though. So I think it can be non-NULL but the current
> >> thread/CPU doesn't hold a spinlock.
> >
> > Hmm, does the 'lock_list' pointer value in the dump match 'lock_list'
> > from another CPU?
>
> Yes:
>
> (gdb) p panic_cpu
> $9 = 2
> (gdb) p dumptid
> $12 = 100751
> (gdb) p cpuhead.slh_first->pc_allcpu.sle_next->pc_curthread->td_tid
> $14 = 100751
>
> (gdb) p *cpuhead.slh_first->pc_allcpu.sle_next
> $6 = {
> pc_curthread = 0xffffff00716d6960,
> pc_cpuid = 2,
> pc_spinlocks = 0xffffffff80803198,
>
> (gdb) p lock_list
> $2 = (struct lock_list_entry *) 0xffffffff80803fb0
>
> (gdb) p *cpuhead.slh_first->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_allcpu.sle_next
> $8 = {
> pc_curthread = 0xffffff0005479960,
> pc_cpuid = 0,
> pc_spinlocks = 0xffffffff80803fb0,
>
> I.e. we're dumping on CPU 2, but the lock_list pointer that was saved
> in the dump matches that of CPU 0.
Can you print out the tids for the two curthreads?  It's not impossible that
the thread migrated after calling panic.  In fact, we force threads to CPU 0
during shutdown.
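
Something like this, reusing the pointer chains from your dump, should
show them:

(gdb) p cpuhead.slh_first->pc_allcpu.sle_next->pc_curthread->td_tid
(gdb) p cpuhead.slh_first->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_allcpu.sle_next->pc_curthread->td_tid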
--
John Baldwin