please review, patch for lost camisr

Konstantin Belousov kostikbel at gmail.com
Wed May 29 07:40:43 UTC 2013


On Wed, May 29, 2013 at 12:27:46AM -0700, Alfred Perlstein wrote:
> On 5/29/13 12:16 AM, Konstantin Belousov wrote:
> > On Tue, May 28, 2013 at 10:31:40PM -0700, Alfred Perlstein wrote:
> >> On 5/28/13 10:08 PM, Konstantin Belousov wrote:
> >>> On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote:
> >>>> [[  moved to hackers@ from private mail. ]]
> >>>>
> >>>> On 5/28/13 1:13 PM, John Baldwin wrote:
> >>>>> On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote:
> >>>>>> On 5/28/13 9:04 AM, John Baldwin wrote:
> >>>>>>> On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:
> >>>>>>>> Hey folks,
> >>>>>>>>
> >>>>>>>> I had a talk with Nathan Whitehorn about the camisr issue.  We mostly
> >>>>>>>> understand the issue we are seeing, but to summarize: we are losing the
> >>>>>>>> camisr signal and the camisr is not being run.
> >>>>>>>>
> >>>>>>>> I gave him a summary of what we have been seeing and pointed him to the
> >>>>>>>> code I am concerned about here:
> >>>>>>>> http://pastebin.com/tLKr7mCV  (this is inside of kern_intr.c).
> >>>>>>>>
> >>>>>>>> What I think is happening is that the setting of it_need to 0
> >>>>>>>> inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled
> >>>>>>>> correctly and is being delayed until AFTER the call to
> >>>>>>>> ithread_execute_handlers() right below the atomic_store_rel_int().
> >>>>>>> This seems highly unlikely, to the extent that if this were true all our
> >>>>>>> locking primitives would be broken.  The store_rel is actually a release
> >>>>>>> barrier which acts more like a read/write fence.  No memory accesses (read
> >>>>>>> or write) from before the release can be scheduled after the associated
> >>>>>>> store, either by the compiler or the CPU.  That is what Konstantin is
> >>>>>>> referring to in his commit when he says "release semantics".
> >>>>>> Yes, that makes sense; however, does it specify that the writes *must*
> >>>>>> occur at that *point*?  If it only enforces ordering then we may have
> >>>>>> some issue, specifically because the setting of it_need to '1' inside of
> >>>>>> intr_event_schedule_thread has no barrier other than the acq semantics
> >>>>>> of the thread lock.  I am wondering what is forcing out the '1' there.
> >>>>> Nothing ever forces writes.  You would have to invalidate the cache to do
> >>>>> that, and that is horribly expensive.  It is always only about ordering and
> >>>>> knowing that if you can complete another operation on the same "cookie"
> >>>>> variable with acquire semantics, then earlier writes will be visible.
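
A minimal sketch of the acquire/release pairing described above, using
the atomic(9) primitives (flag, shared_data, and use() are illustrative
names, not from the actual code):

	static u_int flag;
	static int shared_data;

	/* Writer: publish the data, then set the flag with release
	 * semantics, so the data store cannot pass the flag store. */
	shared_data = 42;
	atomic_store_rel_int(&flag, 1);

	/* Reader: test the flag with acquire semantics, so the read
	 * below cannot pass the flag load; if this observes flag == 1,
	 * it is guaranteed to also observe shared_data == 42. */
	if (atomic_load_acq_int(&flag) == 1)
		use(shared_data);
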
> >>>> By cookie, you mean a specific memory address, basically a lock?  This is
> >>>> starting to reinforce my suspicions, as the setting of it_need is done
> >>>> with release semantics while the acq on the other CPU is done on the
> >>>> thread lock.  Maybe that is irrelevant.  We will find out shortly.
> >>>>
> >>>>>> See below as I think we have proof that this is somehow happening.
> >>>>> Having ih_need of 1 and it_need of 0 is certainly busted.  The simplest fix
> >>>>> is probably to stop using atomics on it_need and just grab the thread lock
> >>>>> in the main ithread loop and hold it while checking and clearing it_need.
> >>>>>
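
Roughly, that suggestion would look like the following (a sketch only:
it omits the IT_DEAD and ie_count handling of the real ithread_loop(),
and it assumes the it_need = 1 store in intr_event_schedule_thread() is
likewise done with the thread lock held):

	for (;;) {
		thread_lock(td);
		while (ithd->it_need) {
			/* Check and clear it_need while holding the
			 * thread lock, so it cannot race with the
			 * setter. */
			ithd->it_need = 0;
			thread_unlock(td);
			ithread_execute_handlers(p, ie);
			thread_lock(td);
		}
		/* Nothing pending; block until rescheduled. */
		TD_SET_IWAIT(td);
		mi_switch(SW_VOL | SWT_IWAIT, NULL);
		thread_unlock(td);
	}
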
> >>>> OK, we have some code that will either prove this, or perturb the memory
> >>>> ordering enough to make the bug go away, or prove this assertion wrong.
> >>>>
> >>>> We will update on our findings hopefully in the next few days.
> >>> IMO the read of it_need in the 'while (ithd->it_need)' should
> >>> have acquire semantics; otherwise the later reads in
> >>> ithread_execute_handlers(), in particular of ih_need, could pass
> >>> the read of it_need and cause the situation you reported.  I do not
> >>> see any acquire barrier between the condition in the while() statement
> >>> and the read of ih_need in the execute_handlers().
> >>>
> >>> It is probably true that the issue you see was caused by r236456, in the
> >>> sense that the implicitly locked xchgl instruction on x86 has full barrier
> >>> semantics.  As a result, the store_rel() was actually an acquire too,
> >>> making this reordering impossible.  I argue that this is not a bug in
> >>> r236456, but an issue in kern_intr.c.
> >> If I remember the code correctly, that would probably explain why we see
> >> it only on 9.1 systems.
> >>> On the other hand, John's suggestion to move the manipulations of
> >>> it_need under the lock is probably the best approach anyway.
> >>>
> >> I was wondering if it would be lower latency to keep maintaining it_need,
> >> but to also keep another variable, it_needlocked, under the thread lock.
> >> This would result in potentially superfluous interrupts; however, under
> >> load it would allow the ithread to loop some number of times without
> >> taking the thread lock.
> >>
> >> I am not sure the optimization is really worth it (especially since it
> >> can result in superfluous interrupts), but it may reduce latency, and
> >> that might be important.
> >>
> >> Are there some people I can pass the patch on to for help with
> >> performance testing once we confirm that this is the actual bug?  We can
> >> do internal testing, but I am worried about regressing the performance
> >> of any form of I/O in the kernel.
> >>
> >> I'll show the patch soon.
> >>
> >> Thank you for the information.  This is promising.
> > Well, if you and I are right, the minimal patch should be
> >
> > diff --git a/sys/kern/kern_intr.c b/sys/kern/kern_intr.c
> > index 8d63c9b..7c21015 100644
> > --- a/sys/kern/kern_intr.c
> > +++ b/sys/kern/kern_intr.c
> > @@ -1349,7 +1349,7 @@ ithread_loop(void *arg)
> >   		 * we are running, it will set it_need to note that we
> >   		 * should make another pass.
> >   		 */
> > -		while (ithd->it_need) {
> > +		while (atomic_load_acq_int(&ithd->it_need)) {
> >   			/*
> >   			 * This might need a full read and write barrier
> >   			 * to make sure that this write posts before any
> >
> 
> OK we can try this.
> 
> I've been pretty good at locking when using mutexes, but when we get 
> into the atomic ops like this it gets a little tough for me to follow 
> without extensive research.  I know that the signalling thread 
> (swi_sched caller) does not use any atomic ops... is this OK?

The sequence of actions there is
	atomic_store_rel_int(&ih->ih_need, 1);
later in intr_event_schedule_thread():
	it->it_need = 1;
	thread_lock(td);
	sched_add(td);
	thread_unlock(td);

There are two things to worry about, as I see it:
1. The possibility of seeing it_need == 1 but ih_need == 0.  This could
   happen on some architectures which allow write reordering, so it
   might make sense to move the store_rel from ih_need to it_need.
   Effectively, we would need to have rel in both places, since the
   call to intr_event_schedule_thread() is conditional (see the sketch
   after this list).

   But the reordering is impossible on x86.

2. The possibility of scheduling the interrupt thread without the CPU
   running it noticing the it_need = 1 write.  Since the thread lock is
   locked and released around sched_add(), the release barrier is
   applied, so the write should become visible before the thread is
   scheduled on a processor.
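
For point 1, having rel in both places would change the sequence quoted
above to roughly (a sketch paraphrasing that sequence, not the actual
tree):

	atomic_store_rel_int(&ih->ih_need, 1);
later in intr_event_schedule_thread():
	/* A release store instead of the plain it->it_need = 1, so a
	 * reader that observes it_need == 1 also observes ih_need == 1
	 * even on architectures that reorder writes.  The rel on
	 * ih_need above must stay, since this call is conditional. */
	atomic_store_rel_int(&it->it_need, 1);
	thread_lock(td);
	sched_add(td);
	thread_unlock(td);
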

I would start with only the addition of the load_acq() for now and see
if it helps.