Re: iSCSI target: Handling in-flight requests during ctld shutdown

In reply to: Alexander Motin : "Re: iSCSI target: Handling in-flight requests during ctld shutdown"

From: John Baldwin <jhb_at_FreeBSD.org>
Date: Thu, 30 Dec 2021 19:29:20 UTC

On 12/29/21 1:57 PM, Alexander Motin wrote:
> On 29.12.2021 16:39, John Baldwin wrote:
>> One of the tests Chelsio QA has been running against our iSCSI stack
>> with cxgbei offload enabled is to run a bunch of iozone instances on an
>> initiator while running a script on the target that keeps stopping
>> ctld (for a minute or so), then starting it again and letting it run
>> for about 5 minutes before stopping it again.
>>
>> One of the errors found last night is that the target reported the
>> following error to the initiator:
>>
>> (da7:iscsi10:0:0:0): CAM status: SCSI Status Error
>> (da7:iscsi10:0:0:0): SCSI status: Check Condition
>> (da7:iscsi10:0:0:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure)
>> (da7:iscsi10:0:0:0): Actual Retry Count: 44
>> (da7:iscsi10:0:0:0): Error 5, Unretryable error
>> g_vfs_done():da7[WRITE(offset=9797632, length=32768)]error = 6
>> UFS: forcibly unmounting /dev/da7 from /ISCSI8
> 
> 
>> So I think my question is: what is the expected behavior?  Is the
>> internal error really expected to make it onto the wire and be sent to
>> the other side?  Since the connection is shutting down, should we just
>> discard the reply altogether rather than reporting an internal error?
>> If we discarded the reply, the initiator in this particular test would
>> have retried the original request once ctld was restarted and continued
>> running without an error.
> 
> The HARDWARE ERROR is obviously not expected by the initiator.  It
> should not be leaked once we have decided to kill the connection.
> The initiator may retry the command and keep working happily after
> reconnecting, but it would be cleaner not to rely on that.
> cfiscsi_session_terminate_tasks() aborts all running commands with
> CTL_TASK_I_T_NEXUS_RESET, which makes them not return statuses to the
> initiator, but I suppose this is the other side of the race now.
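One way to picture that race, as a minimal sketch assuming it comes down to a completion slipping in ahead of the reset: struct mock_task, MOCK_FLAG_ABORT (a stand-in for CTL_FLAG_ABORT with an arbitrary value) and both functions are hypothetical simplifications, not the real CTL or cfiscsi code. The reset only marks tasks that have not completed yet, so anything that finished first has already reported its status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for CTL_FLAG_ABORT; the value is arbitrary. */
#define MOCK_FLAG_ABORT	0x1u

/* Hypothetical, simplified task; not a real CTL structure. */
struct mock_task {
	unsigned flags;
	bool completed;		/* backend already finished this task */
};

/* Model of an I_T nexus reset: mark every still-pending task as aborted. */
void
mock_nexus_reset(struct mock_task *tasks, size_t n)
{
	size_t i;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (!tasks[i].completed)
			tasks[i].flags |= MOCK_FLAG_ABORT;
}

/* Returns true if this task's completion reports a status to the initiator. */
bool
mock_completion_sends_status(const struct mock_task *t)
{
	assert(t != NULL);
	return ((t->flags & MOCK_FLAG_ABORT) == 0);
}
```

In this model the first task below wins the race: it completed before the reset, so its status (possibly an error) still goes out, while the second one is silenced.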

Hmm, I wonder if we should be setting CTL_FLAG_ABORT instead of setting
port_status when aborting an I/O?  The comment in ctl_frontend_iscsi.c claims
the backends check port_status, but I don't see any checks of port_status
at all in the backends.  I do see checks for CTL_FLAG_ABORT, and the handler
for CTL_TASK_I_T_NEXUS_RESET does set CTL_FLAG_ABORT on pending requests.

For the tasks in cfiscsi_session_terminate_tasks(), those should already have
CTL_FLAG_ABORT set anyway, but it wouldn't hurt if it were set again by
cfiscsi_data_wait_abort().  For the cfiscsi_task_management_done() case I'm
less certain, but I suspect that there, too, returning an internal error
status back to the initiator is not expected, and that it would be better to
just set CTL_FLAG_ABORT and drop any response?

-- 
John Baldwin