please review, patch for lost camisr
Alfred Perlstein
bright at mu.org
Wed May 29 05:31:44 UTC 2013
On 5/28/13 10:08 PM, Konstantin Belousov wrote:
> On Tue, May 28, 2013 at 01:35:01PM -0700, Alfred Perlstein wrote:
>> [[ moved to hackers@ from private mail. ]]
>>
>> On 5/28/13 1:13 PM, John Baldwin wrote:
>>> On Tuesday, May 28, 2013 3:29:41 pm Alfred Perlstein wrote:
>>>> On 5/28/13 9:04 AM, John Baldwin wrote:
>>>>> On Tuesday, May 28, 2013 2:13:32 am Alfred Perlstein wrote:
>>>>>> Hey folks,
>>>>>>
>>>>>> I had a talk with Nathan Whitehorn about the camisr issue. We mostly
>>>>>> understand the issue we are seeing, but to summarize: we are losing the
>>>>>> camisr signal and the camisr is not being run.
>>>>>>
>>>>>> I gave him a summary of what we have been seeing and pointed him to the
>>>>>> code I am concerned about here:
>>>>>> http://pastebin.com/tLKr7mCV (this is inside of kern_intr.c).
>>>>>>
>>>>>> What I think is happening is that the setting of it_need to 0
>>>>>> inside of sys/kern/kern_intr.c:ithread_loop() is not being scheduled
>>>>>> correctly and is being delayed until AFTER the call to
>>>>>> ithread_execute_handlers() right below the atomic_store_rel_int().
>>>>> This seems highly unlikely, to the extent that if this were true all our
>>>>> locking primitives would be broken. The store_rel is actually a release
>>>>> barrier which acts more like a read/write fence. No memory accesses (read or
>>>>> write) from before the release can be scheduled after the associated store,
>>>>> either by the compiler or the CPU. That is what Konstantin is referring to
>>>>> in his commit when he says "release semantics".
>>>> Yes, that makes sense; however, does it specify that the writes *must*
>>>> occur at that *point*? If it only enforces ordering, then we may have
>>>> an issue, specifically because the setting of it_need to '1' inside of
>>>> intr_event_schedule_thread() has no barrier other than the acq semantics
>>>> of the thread lock. I am wondering what is forcing out the '1' there.
>>> Nothing ever forces writes out. You would have to invalidate the cache to do
>>> that, and that is horribly expensive. It is always only about ordering and
>>> knowing that if you complete another operation on the same "cookie" variable
>>> with acquire semantics, the earlier writes will be visible.
>> By cookie, you mean a specific memory address, basically a lock? This is
>> starting to reinforce my suspicions as the setting of it_need is done
>> with release semantics, however the acq on the other CPU is done on the
>> thread lock. Maybe that is irrelevant. We will find out shortly.
>>
>>>> See below as I think we have proof that this is somehow happening.
>>> Having ih_need of 1 and it_need of 0 is certainly busted. The simplest fix
>>> is probably to stop using atomics on it_need and just grab the thread lock
>>> in the main ithread loop and hold it while checking and clearing it_need.
>>>
>> OK, we have some code that will either prove this, or perturb the memory
>> ordering enough to make the bug go away, or prove this assertion wrong.
>>
>> We will update on our findings hopefully in the next few days.
> IMO the read of it_need in the 'while (ithd->it_need)' should
> have acquire semantics, otherwise the later reads in
> ithread_execute_handlers(), in particular of ih_need, could pass
> the read of it_need and cause the situation you reported. I do not
> see any acquire barrier between the condition in the while() statement
> and the read of ih_need in the execute_handlers().
>
> It is probably true that the issue you see was caused by r236456, in the
> sense that the implicitly locked xchgl instruction on x86 has full barrier
> semantics. As a result, the store_rel() was actually an acquire too, making
> this reordering impossible. I argue that this is not a bug in r236456,
> but an issue in kern_intr.c.
If I remember the code correctly, that would probably explain why we see
it only on 9.1 systems.
>
> On the other hand, John's suggestion to move the manipulations of
> it_need under the lock is probably the best anyway.
>
I was wondering if it would be lower latency to keep it_need as-is, but
also maintain another variable, it_needlocked, under the thread lock.
This could result in the occasional superfluous interrupt, but under
load it would allow the ithread to loop some number of times without
taking the thread lock.
I am not sure this optimization is really worth it (especially since it
can result in superfluous interrupts), but it may reduce latency, and
that might be important.
Are there people I can pass the patch on to for help with performance
testing once we confirm that this is the actual bug? We can do internal
testing, but I am worried about regressing the performance of any form
of I/O in the kernel.
I'll show the patch soon.
Thank you for the information. This is promising.
-Alfred