From: John Baldwin
To: scsi@FreeBSD.org
Cc: Alexander Motin, Edward Tomasz Napierała
Subject: iSCSI target: Handling in-flight requests during ctld shutdown
Date: Wed, 29 Dec 2021 13:39:36 -0800
List-Id: SCSI subsystem
List-Archive: https://lists.freebsd.org/archives/freebsd-scsi

One of the tests Chelsio QA has been running against our iSCSI stack
with cxgbei offload enabled is to run a bunch of iozone instances on an
initiator while running a script on the target that keeps stopping ctld
(for a minute or so), then starting it again and letting it run for
about 5 minutes until stopping it again.

One of the errors found last night is that the target reported the
following error to the initiator:

    (da7:iscsi10:0:0:0): CAM status: SCSI Status Error
    (da7:iscsi10:0:0:0): SCSI status: Check Condition
    (da7:iscsi10:0:0:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure)
    (da7:iscsi10:0:0:0): Actual Retry Count: 44
    (da7:iscsi10:0:0:0): Error 5, Unretryable error
    g_vfs_done():da7[WRITE(offset=9797632, length=32768)]error = 6
    UFS: forcibly unmounting /dev/da7 from /ISCSI8

The retry count of 44 is the breadcrumb to find the corresponding error
in the ctl code.
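
As background on why a retry count can serve as a breadcrumb: in
fixed-format SCSI sense data, a HARDWARE ERROR sense key can carry an
"actual retry count" in the sense-key-specific bytes (15 to 17). The
sketch below builds and decodes such a sense buffer. It is a minimal
illustration, not the CTL code: the helper names are hypothetical, and
the idea that the target stores the front end's abort status in this
field is an assumption that this message does not show directly. Note
also that "asc:44" appears to be printed in hex (ASC 0x44, internal
target failure), so its match with the decimal retry count looks like a
coincidence.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SENSE_LEN                   18    /* fixed-format sense through byte 17 */
#define SK_HARDWARE_ERROR           0x04  /* sense key: HARDWARE ERROR */
#define ASC_INTERNAL_TARGET_FAILURE 0x44  /* ASC 0x44, ASCQ 0x00 */

/*
 * Hypothetical helper: fill in fixed-format sense data for an internal
 * target failure and record 'retry_count' in the sense-key-specific
 * field.
 */
static void
build_internal_failure(uint8_t sense[SENSE_LEN], uint16_t retry_count)
{
	memset(sense, 0, SENSE_LEN);
	sense[0] = 0x70;                        /* current error, fixed format */
	sense[2] = SK_HARDWARE_ERROR;
	sense[7] = SENSE_LEN - 8;               /* additional sense length */
	sense[12] = ASC_INTERNAL_TARGET_FAILURE;
	sense[13] = 0x00;                       /* ASCQ */
	sense[15] = 0x80;                       /* SKSV: bytes 15-17 are valid */
	sense[16] = (uint8_t)(retry_count >> 8);   /* big-endian retry count */
	sense[17] = (uint8_t)(retry_count & 0xff);
}

/*
 * Hypothetical helper: return the actual retry count, or -1 if the
 * sense-key-specific field is not marked valid.
 */
static int
actual_retry_count(const uint8_t sense[SENSE_LEN])
{
	if ((sense[15] & 0x80) == 0)
		return (-1);
	return ((sense[16] << 8) | sense[17]);
}
```

Calling build_internal_failure(sense, 44) and then
actual_retry_count(sense) gives back 44, which is the kind of value an
initiator would report on its "Actual Retry Count" line.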
In this case the 44 comes from cfiscsi_datamove_out() in
ctl_frontend_iscsi.c:

    static void
    cfiscsi_datamove_out(union ctl_io *io)
    {
    ...
            CFISCSI_SESSION_LOCK(cs);
            if (cs->cs_terminating) {
                    CFISCSI_SESSION_UNLOCK(cs);
                    cfiscsi_data_wait_abort(cs, cdw, 44);
                    return;
            }
            TAILQ_INSERT_TAIL(&cs->cs_waiting_for_data_out, cdw, cdw_next);
            CFISCSI_SESSION_UNLOCK(cs);
    ...
    }

I added this check recently (September) to fix a deadlock I encountered
during similar testing:

    commit 0cd6e85e242bb07a33df9a6314e90bcb0ba99576
    Author: John Baldwin
    Date:   Wed Sep 15 13:25:30 2021 -0700

        iscsi: Abort data-out tasks queued on a terminating session.

        cfiscsi_datamove_out() can race with
        cfiscsi_session_terminate_tasks() and enqueue a new task after
        the latter function has aborted existing tasks.  This could
        result in a deadlock as cfiscsi_session_terminate_tasks() waited
        forever for this task to complete.

        Reviewed by:    mav
        Sponsored by:   Chelsio Communications
        Differential Revision:  https://reviews.freebsd.org/D31892

Note that if ctld is shut down just slightly later, we would abort the
request similarly in cfiscsi_session_terminate_tasks(), just with an
error code of 42:

            CFISCSI_SESSION_LOCK(cs);
            while ((cdw = TAILQ_FIRST(&cs->cs_waiting_for_data_out)) != NULL) {
                    TAILQ_REMOVE(&cs->cs_waiting_for_data_out, cdw, cdw_next);
                    CFISCSI_SESSION_UNLOCK(cs);
                    cfiscsi_data_wait_abort(cs, cdw, 42);
                    CFISCSI_SESSION_LOCK(cs);
            }
            CFISCSI_SESSION_UNLOCK(cs);

So my question, I think, is: what is the expected behavior?  Is the
internal error really expected to make it onto the wire and be sent to
the other side?  Since the connection is shutting down, should we just
discard the reply altogether rather than reporting an internal error?
If we discarded the reply, the initiator in this particular test would
have retried the original request once ctld was restarted and continued
running without an error.

-- 
John Baldwin
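
The race described in the commit message, and the two abort paths that
produce 44 and 42, can be modelled in a small userland sketch. This is
only an illustration under simplifying assumptions: the struct and
function names are hypothetical stand-ins for the cfiscsi ones, a
pthread mutex replaces the session lock, the terminating flag is set
inside the drain function for brevity, and aborting a waiter just
records the status instead of completing an I/O.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a queued data-out wait. */
struct waiter {
	TAILQ_ENTRY(waiter) link;
	int status;             /* 0 while pending, nonzero once aborted */
};

/* Hypothetical stand-in for the session state guarded by one lock. */
struct session {
	pthread_mutex_t lock;
	bool terminating;
	TAILQ_HEAD(, waiter) waiting;
};

static void
session_init(struct session *s)
{
	pthread_mutex_init(&s->lock, NULL);
	s->terminating = false;
	TAILQ_INIT(&s->waiting);
}

/* Record a nonzero status so the abort reads as a failure. */
static void
waiter_abort(struct waiter *w, int status)
{
	assert(status != 0);
	w->status = status;
}

/*
 * Queue a waiter, unless the session is already terminating.  The flag
 * is tested under the same lock that protects the queue, so a waiter
 * cannot be inserted after the drain loop below has finished.
 */
static bool
datamove_out(struct session *s, struct waiter *w)
{
	w->status = 0;
	pthread_mutex_lock(&s->lock);
	if (s->terminating) {
		pthread_mutex_unlock(&s->lock);
		waiter_abort(w, 44);
		return (false);
	}
	TAILQ_INSERT_TAIL(&s->waiting, w, link);
	pthread_mutex_unlock(&s->lock);
	return (true);
}

/* Mark the session terminating and abort everything already queued. */
static int
terminate_tasks(struct session *s)
{
	struct waiter *w;
	int aborted = 0;

	pthread_mutex_lock(&s->lock);
	s->terminating = true;
	while ((w = TAILQ_FIRST(&s->waiting)) != NULL) {
		TAILQ_REMOVE(&s->waiting, w, link);
		/* Drop the lock around the abort, as the quoted loop does. */
		pthread_mutex_unlock(&s->lock);
		waiter_abort(w, 42);
		aborted++;
		pthread_mutex_lock(&s->lock);
	}
	pthread_mutex_unlock(&s->lock);
	return (aborted);
}
```

In this model, a waiter queued before terminate_tasks() runs ends up
with status 42, and one that arrives afterwards is rejected with 44.
Either way the request fails with a status that differs only by which
side of the shutdown it landed on, which is the behavior the question
above is about.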