mfi panic on recursed on non-recursive mutex MFI I/O lock

Steven Hartland killing at multiplay.co.uk
Thu Nov 8 00:35:20 UTC 2012


----- Original Message ----- 
From: "Steven Hartland"
>> On Tue, Nov 06, 2012 at 12:09:42AM -0000, Steven Hartland wrote:
>> | Thanks Doug, actually just finished another test run with some more
>> | debugging in and I believe I've found the reason for the non-recursive
>> | lock and at least some of the queuing issues.
>> | 
>> | The non-recursive lock is due to the mfi_tbolt_reset calling
>> | mfi_process_fw_state_chg_isr with mfi_io_lock held which in turn calls
>> | mfi_tbolt_init_MFI_queue which tries to acquire mfi_io_lock hence
>> | the problem.
>> | 
>> | mfi-lock.txt attached I believe fixes this as well as what appears
>> | to be an invalid call to mtx_unlock(&sc->mfi_io_lock) in mfi_attach
>> | which never acquires the lock as far as I can see, possibly a cut and
>> | paste error.
>> 
>> I don't seem to see the attachment.
> 
> Yer seems like some mail fail by me there, but I've had some more locking
> panics during today's tests anyway, requiring additional fixes. Will update
> and post when I'm happy with it.

OK, two patches attached:
== zz-mfi-lock.patch ==
Fixes mfi panic on recursed on non-recursive mutex MFI I/O lock

Removes an mtx_unlock call for mfi_io_lock, which is never acquired
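
The recursion described above can be modelled in userland. The following is a minimal sketch, not the patch itself: toy_mtx, toy_lock, reset_path and the two init_queue_* helpers are hypothetical names, and the toy lock returns -1 at the point where the kernel would panic. It shows a caller holding a non-recursive lock while its callee tries to take the same lock, and one common fix where the callee instead requires the lock to be held already.

```c
#include <stdbool.h>

/* Toy, single-threaded stand-in for a non-recursive mutex (hypothetical). */
struct toy_mtx {
	bool held;
};

/* Returns 0 on success, -1 where the kernel would panic on recursion. */
static int
toy_lock(struct toy_mtx *m)
{
	if (m->held)
		return (-1);	/* "recursed on non-recursive mutex" */
	m->held = true;
	return (0);
}

static void
toy_unlock(struct toy_mtx *m)
{
	m->held = false;
}

/* Broken shape: the callee takes the lock its caller already holds. */
static int
init_queue_relocking(struct toy_mtx *io_lock)
{
	if (toy_lock(io_lock) != 0)
		return (-1);
	/* ... queue setup would go here ... */
	toy_unlock(io_lock);
	return (0);
}

/* One possible fix: the callee requires the caller to hold the lock. */
static int
init_queue_locked(struct toy_mtx *io_lock)
{
	if (!io_lock->held)
		return (-1);	/* kernel code would mtx_assert(..., MA_OWNED) */
	/* ... queue setup would go here ... */
	return (0);
}

/* Models the reset path: take the lock, then call the given helper. */
static int
reset_path(struct toy_mtx *io_lock, int (*init_queue)(struct toy_mtx *))
{
	int error;

	if (toy_lock(io_lock) != 0)
		return (-1);
	error = init_queue(io_lock);
	toy_unlock(io_lock);
	return (error);
}
```

Calling reset_path() with init_queue_relocking fails at the inner toy_lock(), while passing init_queue_locked succeeds because no second acquisition is attempted.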

== zz-mfi-queue.patch ==
Fixes queuing issues where mfi_release_command blindly sets the cm_flags = 0
without first removing the command from the relevant queue.

This was causing panics in the queue functions which check to ensure a command
is not on another queue.
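
As a rough illustration of the release problem (a sketch only, assuming a flag-per-queue scheme; struct cmd, the CMD_ON_* flags and the helper names are hypothetical and not the driver's), a release routine can unlink the command from whichever queue its flags say it is on before clearing them:

```c
#include <stddef.h>
#include <sys/queue.h>

#define CMD_ON_FREE	0x1	/* hypothetical flag values */
#define CMD_ON_BUSY	0x2

struct cmd {
	TAILQ_ENTRY(cmd) link;
	int flags;
};
TAILQ_HEAD(cmd_queue, cmd);

struct softc {
	struct cmd_queue free_q;
	struct cmd_queue busy_q;
};

static void
softc_init(struct softc *sc)
{
	TAILQ_INIT(&sc->free_q);
	TAILQ_INIT(&sc->busy_q);
}

/* Put a command on the busy queue; refuses if it is already queued. */
static int
enqueue_busy(struct softc *sc, struct cmd *cm)
{
	if (cm->flags & (CMD_ON_FREE | CMD_ON_BUSY))
		return (-1);	/* a debug kernel would panic here instead */
	TAILQ_INSERT_TAIL(&sc->busy_q, cm, link);
	cm->flags |= CMD_ON_BUSY;
	return (0);
}

/* Release: unlink from the current queue first, only then reset flags. */
static void
release_cmd(struct softc *sc, struct cmd *cm)
{
	if (cm->flags & CMD_ON_BUSY)
		TAILQ_REMOVE(&sc->busy_q, cm, link);
	else if (cm->flags & CMD_ON_FREE)
		TAILQ_REMOVE(&sc->free_q, cm, link);
	cm->flags = 0;
	TAILQ_INSERT_HEAD(&sc->free_q, cm, link);
	cm->flags |= CMD_ON_FREE;
}
```

If the flags were cleared without the TAILQ_REMOVE, the command would still be linked into the busy queue while claiming to be on none, so the flags and the list would disagree from then on.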

Also fixed some cases where the error from mfi_mapcmd was lost and where the
command was never released / dequeued in error cases.

Ensured that all mfi_mapcmd failures are logged
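
The error-handling shape described in the last two items might look roughly like this. It is a sketch under assumptions: map_cmd, release_cmd and issue_cmd are hypothetical stand-ins, and the failure is simulated with a parameter. The point is that a mapping failure is logged, the command is still released, and the error reaches the caller.

```c
#include <errno.h>
#include <stdio.h>

struct cmd {
	int in_use;
	int error;
};

/* Hypothetical stand-in for mapping a command; fails when asked to. */
static int
map_cmd(struct cmd *cm, int simulate_failure)
{
	(void)cm;
	return (simulate_failure ? EBUSY : 0);
}

static void
release_cmd(struct cmd *cm)
{
	cm->in_use = 0;
}

/*
 * Issue a command. On a mapping failure the error is logged, the command
 * is released, and the error is returned rather than dropped.
 */
static int
issue_cmd(struct cmd *cm, int simulate_failure)
{
	int error;

	cm->in_use = 1;
	error = map_cmd(cm, simulate_failure);
	if (error != 0) {
		fprintf(stderr, "issue_cmd: map_cmd failed, error %d\n", error);
		goto out;
	}
	/* ... wait for completion would go here ... */
	error = cm->error;
out:
	release_cmd(cm);
	return (error);
}
```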

Fixed possible NULL pointer dereference in mfi_aen_setup if mfi_get_log_state
failed.

Fixed mfi_parse_entries & mfi_aen_setup not returning possible errors

Corrected MFI_DUMP_CMDS calls with invalid vars SC vs sc

Commands which have timed out now set cm_error to ETIMEDOUT and call
mfi_complete which prevents them getting stuck in the busy queue forever.
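
A minimal sketch of that timeout handling, assuming a simple busy queue and a timestamp per command (struct cmd, complete_cmd and expire_busy are hypothetical names, not the driver's):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/queue.h>

struct cmd {
	TAILQ_ENTRY(cmd) link;
	long timestamp;		/* when the command was issued */
	int error;
	int completed;
};
TAILQ_HEAD(cmd_queue, cmd);

/* Hypothetical completion hook: take the command off the busy queue. */
static void
complete_cmd(struct cmd_queue *busy, struct cmd *cm)
{
	TAILQ_REMOVE(busy, cm, link);
	cm->completed = 1;
}

/* Fail every busy command at least `timeout` old; returns how many. */
static int
expire_busy(struct cmd_queue *busy, long now, long timeout)
{
	struct cmd *cm, *next;
	int expired = 0;

	/* Fetch the successor first, since complete_cmd() unlinks cm. */
	for (cm = TAILQ_FIRST(busy); cm != NULL; cm = next) {
		next = TAILQ_NEXT(cm, link);
		if (now - cm->timestamp < timeout)
			continue;
		cm->error = ETIMEDOUT;
		complete_cmd(busy, cm);
		expired++;
	}
	return (expired);
}
```

Without the completion step, an expired command would stay linked on the busy queue indefinitely.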

Fixed possible use of NULL pointer in mfi_tbolt_get_cmd

Changed output formats to be more easily recognisable when debugging.

A few style(9) fixes e.g. braced single line conditions and double blank
lines
----------

I've just had another panic, trace below, but it doesn't seem to be related
to my changes so I'd appreciate your feedback on them as they are for now.

While the lock patch fixes the problems I've seen, it's not clear to me
why mfi_tbolt_reset is acquiring the lock and hence requiring
mfi_process_fw_state_chg_isr to jump through hoops to ensure locking
around queue manipulation is done correctly. Given what it's doing
(resetting the entire adapter) I wouldn't be surprised if it should
really be acquiring the config lock.

Other things I've noticed / questions
* Should mfi_abort sleep even if its call to mfi_mapcmd fails?
* Should mfi_get_controller_info really ignore the error from mfi_mapcmd?
* Do these controllers not support non-512-byte requests? Currently
all syspd requests are done assuming 512 byte sectors, which the disk may
not have. This will either reduce performance or potentially break totally
if the firmware isn't translating it under the surface correctly.
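
To make the sector-size concern concrete, here is a small arithmetic sketch (offset_to_lba and bytes_to_blocks are hypothetical helpers, not driver functions). The same byte offset maps to a different block address depending on the sector size used, so a hard-coded 512 only lines up with the disk if the firmware translates.

```c
#include <stdint.h>

/* Convert a byte offset to a logical block address for a sector size. */
static uint64_t
offset_to_lba(uint64_t byte_offset, uint32_t sector_size)
{
	return (byte_offset / sector_size);
}

/* Convert a byte count to a block count; -1 if not a whole number. */
static int64_t
bytes_to_blocks(uint64_t byte_count, uint32_t sector_size)
{
	if (byte_count % sector_size != 0)
		return (-1);
	return ((int64_t)(byte_count / sector_size));
}
```

For a 1 MiB offset, offset_to_lba() gives block 2048 with 512 byte sectors but block 256 with 4096 byte sectors, and a 512 byte transfer is not a whole number of 4096 byte blocks.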

Anyway the new panic manually transcribed is:-
panic: Bad link elm 0xffffff0069b0fc0 next->prev != elm
...
mfi_tbolt_get_cmd()
mfi_build_mpt_pass_thru()
mfi_tbolt_build_mpt_cmd()
mfi_tbolt_send_frame()
bus_dmamap_load()
mfi_mapcmd()
mfi_startio()
mfi_syspd_strategy()
g_disk_start()
g_io_schedule_down()
g_down_proc_body()
fork_exit()
fork_trampoline()

Looks like mfi_cmd_tbolt_tqh has become corrupt somehow, but as far as I
can tell all manipulation is done using the TAILQ macros and under mfi_io_lock,
so it's not obvious to me at this time why this is. Any ideas?
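
For reference, the consistency condition behind that panic message can be sketched in userland as follows. This is modelled on the debug check in the TAILQ macros rather than copied from the kernel, struct cmd and next_link_ok are hypothetical names, and it says nothing about what actually corrupted the queue here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

struct cmd {
	TAILQ_ENTRY(cmd) link;
	int index;
};
TAILQ_HEAD(cmd_queue, cmd);

/*
 * True when elm's successor (if any) points back at elm. A debug kernel
 * reports "Bad link elm ... next->prev != elm" when this does not hold.
 */
static bool
next_link_ok(struct cmd *elm)
{
	struct cmd *next = TAILQ_NEXT(elm, link);

	return (next == NULL || next->link.tqe_prev == &elm->link.tqe_next);
}
```

One way the condition can fail is an element that has been unlinked but is still treated as queued: its old successor now points back at a different predecessor, so the check typically trips on the stale element.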

There was an obvious error in mfi_tbolt_get_cmd, now fixed in the
queue patch, where cmd could be used even if the queue was empty and
TAILQ_FIRST returned NULL, but I can't see this causing this panic.
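
The guard in question is simple; a minimal sketch of the pattern (struct tb_cmd and get_cmd are hypothetical names, not the patched function):

```c
#include <stddef.h>
#include <sys/queue.h>

struct tb_cmd {
	TAILQ_ENTRY(tb_cmd) next;
	int index;
};
TAILQ_HEAD(tb_cmd_queue, tb_cmd);

/* Take the first free command, or return NULL if the pool is exhausted. */
static struct tb_cmd *
get_cmd(struct tb_cmd_queue *free_q)
{
	struct tb_cmd *cmd;

	cmd = TAILQ_FIRST(free_q);
	if (cmd == NULL)
		return (NULL);	/* empty queue: nothing to unlink */
	TAILQ_REMOVE(free_q, cmd, next);
	return (cmd);
}
```

Callers then have to handle a NULL return, for example by deferring the request until a command is freed.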

This is running with a debug kernel with:-
options         WITNESS
options         INVARIANTS
options         INVARIANT_SUPPORT
options         DDB
options         GDB
options         PRINTF_BUFR_SIZE=2048
options         MFI_DEBUG

Unfortunately I've only got this hardware till Friday, so any
ideas would be most appreciated so I can get testing done before then.

    Regards
    Steve

-------------- next part --------------
A non-text attachment was scrubbed...
Name: zz-mfi-lock.patch
Type: application/octet-stream
Size: 1312 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20121108/f8e5745b/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: zz-mfi-queue.patch
Type: application/octet-stream
Size: 10580 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20121108/f8e5745b/attachment-0001.obj>