Re: iSCSI target: Handling in-flight requests during ctld shutdown
Date: Thu, 30 Dec 2021 23:06:13 UTC
On 30.12.2021 17:41, John Baldwin wrote: > On 12/30/21 12:10 PM, Alexander Motin wrote: >> The checks were in all backends, until in 2c7dc6bae9fd I've moved them >> to ctl_datamove_done_process(). Backends don't have any special >> knowledge now to process specific data move error codes. It was >> complete code duplication between backends and I've tried to organize >> that a bit. Unfortunately we do not have formalized data move error >> statuses in CTL to do more. It would be much more interesting (and >> complicated ;)) if iSCSI (for example) could report data transfer error >> without killing connection, but that area in general is not very >> specified in SCSI, since usually each transport has own means of >> recovery. > > Hmmm, but this commit seems to have actually broken this case. That is, > now the possibly uninitialized data is written to the backing store: > > /* > - * We set status at this point for read commands, and write > - * commands with errors. > + * We set status at this point for read and compare commands. > */ > - if (io->io_hdr.flags & CTL_FLAG_ABORT) { > - ; > - } else if ((io->io_hdr.port_status != 0) && > - ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE || > - (io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS)) { > - ctl_set_internal_failure(&io->scsiio, /*sks_valid*/ 1, > - /*retry_count*/ io->io_hdr.port_status); > - } else if (io->scsiio.kern_data_resid != 0 && > - (io->io_hdr.flags & CTL_FLAG_DATA_MASK) == CTL_FLAG_DATA_OUT && > - ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE || > - (io->io_hdr.status & CTL_STATUS_MASK) == CTL_SUCCESS)) { > - ctl_set_invalid_field_ciu(&io->scsiio); > - } else if ((io->io_hdr.port_status == 0) && > - ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE)) { > + if ((io->io_hdr.flags & CTL_FLAG_ABORT) == 0 && > + (io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE) { > lbalen = ARGS(beio->io); > if (lbalen->flags & CTL_LLF_READ) { > ctl_set_success(&io->scsiio); > > As ctl_datamove_done_process() will call ctl_set_internal_failure(), but > ctl_datamove_done() will still call be_move_done which runs the code above. > The code above doesn't check port_status anywhere, so it will still happily > write the stale data to the backend unless CTL_FLAG_ABORT is set. Do the > backends need to be checking port_status == 0 again in addition to > CTL_FLAG_ABORT then? No. Backends check for ((io->io_hdr.status & CTL_STATUS_MASK) == CTL_STATUS_NONE) . Notice that you've received that HARDWARE ERROR status, not garbage. >>> I do see checks for CTL_FLAG_ABORT, and the handler >>> for >>> the CTL_TASK_I_T_NEXUS_RESET does set CTL_FLAG_ABORT on pending >>> requests. >>> >>> For the tasks in sciscsi_session_terminate_tasks(), those should already >>> have >>> CTL_FLAG_ABORT set anyway, but it wouldn't hurt if it were set again by >>> cfiscsi_data_wait_abort(). For the the cfiscsi_task_management_done >>> case I'm >>> less certain, but I suspect there too that returning an internal error >>> status >>> back to the initiator is not expected and that it would be better to >>> just set >>> CTL_FLAG_ABORT and drop any response? >> >> cfiscsi_data_wait_abort() does only one specific thing -- it aborts >> waiting for data and completes the data move operation, allowing backend >> to continue (or at least free resources, etc). It is only >> cfiscsi_session_terminate_tasks() knows that it should abort data moves >> while it aborts tasks, since it knows that the connection is dead. >> cfiscsi_task_management_done() is another special case, since iSCSI >> explicitly mentions that there will be no further exchanges related to >> that task, so the data move should explicitly return. But in general >> case I think it would be nice to not combine those two facts of aborting >> task and aborting data transfer, since the last may not necessary imply >> the first, as the first does not imply the last. >> >> It is tempting to just set the flag, but I suspect it may cause more >> problems (I am worrying about HA peer update). I'd prefer the race >> inside the transport to be handled inside the transport. May be we >> could just add another flag to be set under the session lock inside >> cfiscsi_session_terminate_tasks() after it aborted all the tasks, but >> before it start to abort transfers itself and use it inside >> cfiscsi_datamove_out() instead of cs_terminating. > > I think though that even that flag wouldn't quite help as you would still > have the problem that the internal_error response would still be sent on > the wire if we don't abort the entire task vs aborting just the move. No. cfiscsi_datamove_out() called before the new flag is set would still try to send R2T over the dying connection to be aborted by the cfiscsi_session_terminate_tasks() few milliseconds later. cfiscsi_data_wait_abort() would only be needed if cfiscsi_session_terminate_tasks() has already passed through the data waiters list and waiting for the last tasks completion. > It may be that all I really need to do is fix the one case in > cfiscsi_datamove_out() to set CTL_FLAG_ABORT rather than changing > cfiscsi_data_move_abort()? I still think it won't be needed with the new flag I proposed, that would be cleaner IMO. -- Alexander Motin