mfi panic on recursed on non-recursive mutex MFI I/O lock

Steven Hartland killing at multiplay.co.uk
Fri Nov 16 23:39:26 UTC 2012


----- Original Message ----- 
From: "Steven Hartland" <killing at multiplay.co.uk>

>> Sounds like you have made some good progress.  I looked at your prior locking
>> changes and they look good.  Haven't had time to go through the queue changes
>> yet.
> 
> Just to update people on this: it's taken quite some time to track down the
> random issues causing panics, but I believe I made a breakthrough last night.
> 
> It seems that the cleanup interaction between mfi_cmd's and tbolt_cmd's is
> flawed, meaning it's possible that tbolt commands are processed after the caller
> has already received a response, cleaned up and returned the mfi_cmd to the
> free queue.
> 
> This means that it's anyone's guess what the result of the tbolt cleanup is, as
> it could well be operating on a mfi_cmd that's either now in the free queue or,
> even worse, has already been reused.
> 
> It's also possible this was the underlying issue you may well have been seeing,
> which caused you to add the mfi_tbolt_complete_cmd calls to
> mfi_tbolt_send_frame in r242681.
> 
> If this is correct then I believe the correct fix is to ensure that
> mfi_tbolt_return_cmd is only ever called from mfi_release_command thus ensuring
> completion ordering is always correct. I'm testing fixes for this theory now
> but initial debug has had good results.
> 
> The patch of fixes is really growing, so definitely going to need someone to
> review in detail when I'm done.
> 
> What do you think of the above, does it make sense? Would you be willing to
> review the patch when I'm done, before I commit it, Doug?

Ok I think I'm done.

The good news is I've managed to fix all panics and cases of commands being
processed incorrectly that we've seen here. The bad news is the patch is now
really quite large, as there were a lot of issues found while debugging the
core problems.

The main fixes are:-
1. Ensure that IO lock is not dropped during tbolt ISR processing, as this
can cause some very nasty issues when two threads end up processing the same
tbolt cmd.

2. Ensure that the interaction between mfi_cmd's and tbolt_cmd's is correct,
specifically in their cleanup, total number and range checks, as if this isn't
done then again some very nasty issues can occur.

3. Ensure that tbolt init doesn't break MFI indexing by assuming it always
gets the first mfi command structure.
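
To illustrate point 2 and the idea quoted above (mfi_tbolt_return_cmd only
ever being called from mfi_release_command), here is a minimal, self-contained
sketch in C. It is not code from the patch: the sketch_* names and the
simplified structures are hypothetical stand-ins for the real driver types,
and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real mfi(4) structures. */
struct sketch_tbolt_cmd {
	int in_use;			/* 1 while owned by an mfi command */
};

struct sketch_mfi_cmd {
	int on_free_queue;		/* 1 once returned to the free queue */
	struct sketch_tbolt_cmd *tbolt;	/* paired tbolt command, or NULL */
};

/* Return a tbolt command to its pool. Only sketch_release() calls this. */
static void
sketch_tbolt_return(struct sketch_tbolt_cmd *tc)
{
	assert(tc->in_use);		/* catches a double return */
	tc->in_use = 0;
}

/*
 * Single release path: the paired tbolt command is cleaned up first and
 * unlinked, and only then is the mfi command marked as free. Nothing can
 * touch the tbolt side after the mfi command has gone back to the queue.
 */
static void
sketch_release(struct sketch_mfi_cmd *cm)
{
	assert(!cm->on_free_queue);	/* catches a double release */
	if (cm->tbolt != NULL) {
		sketch_tbolt_return(cm->tbolt);
		cm->tbolt = NULL;
	}
	cm->on_free_queue = 1;
}
```

With only one function allowed to return the tbolt command, the ordering
problem described above (tbolt cleanup running against an mfi_cmd that is
already free or reused) cannot arise in this toy model; the real driver also
needs the IO lock held around these steps.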

The rest of the fixes are for things like potential NULL pointer dereferences,
locks not being dropped in error cases etc. Full details of all the fixes
are in the patch which can be found here:-
http://blog.multiplay.co.uk/dropzone/freebsd/zz-mfi-queue.patch

It should be noted that while the changes now make the driver functionally
correct, the promotion of the IO lock to the upper layers isn't ideal and
could do with optimising.

    Regards
    Steve



More information about the freebsd-stable mailing list