Message-ID: <fd383f6f-5a19-e2bb-5383-e559271eb3cd@FreeBSD.org>
Date: Wed, 29 Dec 2021 13:39:36 -0800
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-scsi
List-Help: <mailto:scsi+help@freebsd.org>
List-Post: <mailto:scsi@freebsd.org>
List-Subscribe: <mailto:scsi+subscribe@freebsd.org>
List-Unsubscribe: <mailto:scsi+unsubscribe@freebsd.org>
Sender: owner-freebsd-scsi@freebsd.org
X-BeenThere: freebsd-scsi@freebsd.org
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:91.0)
 Gecko/20100101 Thunderbird/91.4.1
Content-Language: en-US
To: scsi@FreeBSD.org
Cc: Alexander Motin <mav@FreeBSD.org>,
 Edward Tomasz Napierała <trasz@freebsd.org>
From: John Baldwin <jhb@FreeBSD.org>
Subject: iSCSI target: Handling in-flight requests during ctld shutdown
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

One of the tests Chelsio QA has been running against our iSCSI stack
with cxgbei offload enabled runs several iozone instances on an
initiator while a script on the target repeatedly stops ctld (for a
minute or so), starts it again, and lets it run for about 5 minutes
before stopping it again.

One of the failures found last night was the target reporting the
following error to the initiator:

(da7:iscsi10:0:0:0): CAM status: SCSI Status Error
(da7:iscsi10:0:0:0): SCSI status: Check Condition
(da7:iscsi10:0:0:0): SCSI sense: HARDWARE FAILURE asc:44,0 (Internal target failure)
(da7:iscsi10:0:0:0): Actual Retry Count: 44
(da7:iscsi10:0:0:0): Error 5, Unretryable error
g_vfs_done():da7[WRITE(offset=9797632, length=32768)]error = 6
UFS: forcibly unmounting /dev/da7 from /ISCSI8

The retry count of 44 is a breadcrumb that identifies where in the ctl
code the error was raised.  In this case it is here in ctl_frontend_iscsi.c:

static void
cfiscsi_datamove_out(union ctl_io *io)
{
	...
	CFISCSI_SESSION_LOCK(cs);
	if (cs->cs_terminating) {
		CFISCSI_SESSION_UNLOCK(cs);
		cfiscsi_data_wait_abort(cs, cdw, 44);
		return;
	}
	TAILQ_INSERT_TAIL(&cs->cs_waiting_for_data_out, cdw, cdw_next);
	CFISCSI_SESSION_UNLOCK(cs);
	...
}

I added this check recently (September) to fix a deadlock I encountered
during similar testing:

commit 0cd6e85e242bb07a33df9a6314e90bcb0ba99576
Author: John Baldwin <jhb@FreeBSD.org>
Date:   Wed Sep 15 13:25:30 2021 -0700

     iscsi: Abort data-out tasks queued on a terminating session.
     
     cfiscsi_datamove_out() can race with cfiscsi_session_terminate_tasks()
     and enqueue a new task after the latter function has aborted existing
     tasks.  This could result in a deadlock as
     cfiscsi_session_terminate_tasks() waited forever for this task to
     complete.
     
     Reviewed by:    mav
     Sponsored by:   Chelsio Communications
     Differential Revision:  https://reviews.freebsd.org/D31892
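
To make the race concrete, here is a rough user-space model of it.  This
is only a sketch, not the kernel code: the model_* names, the pthread
mutex standing in for the session lock, and setting the terminating flag
inside the drain function are all assumptions made for illustration.  The
abort codes mirror the breadcrumb values used in ctl_frontend_iscsi.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-ins for the kernel structures (not the real types). */
struct model_cdw {
	TAILQ_ENTRY(model_cdw) link;
	int abort_code;		/* 0 = not aborted, else breadcrumb value */
};

struct model_session {
	pthread_mutex_t lock;
	bool terminating;
	TAILQ_HEAD(, model_cdw) waiting_for_data_out;
};

static void
model_session_init(struct model_session *s)
{
	pthread_mutex_init(&s->lock, NULL);
	s->terminating = false;
	TAILQ_INIT(&s->waiting_for_data_out);
}

/*
 * Mark the session as terminating and abort everything already queued.
 * Returns the number of tasks aborted.  (Simplification: in the model the
 * flag is set here rather than by a separate caller.)
 */
static int
model_terminate_tasks(struct model_session *s)
{
	struct model_cdw *cdw;
	int aborted = 0;

	pthread_mutex_lock(&s->lock);
	s->terminating = true;
	while ((cdw = TAILQ_FIRST(&s->waiting_for_data_out)) != NULL) {
		TAILQ_REMOVE(&s->waiting_for_data_out, cdw, link);
		pthread_mutex_unlock(&s->lock);
		cdw->abort_code = 42;	/* aborted by the drain loop */
		aborted++;
		pthread_mutex_lock(&s->lock);
	}
	pthread_mutex_unlock(&s->lock);
	return (aborted);
}

/*
 * Queue a data-out wait.  Returns false (and aborts the task with code 44)
 * if the session is already terminating, so that nothing is enqueued
 * after the drain loop has run.
 */
static bool
model_datamove_out(struct model_session *s, struct model_cdw *cdw)
{
	cdw->abort_code = 0;
	pthread_mutex_lock(&s->lock);
	if (s->terminating) {
		pthread_mutex_unlock(&s->lock);
		cdw->abort_code = 44;	/* aborted at enqueue time */
		return (false);
	}
	TAILQ_INSERT_TAIL(&s->waiting_for_data_out, cdw, link);
	pthread_mutex_unlock(&s->lock);
	return (true);
}
```

Without the terminating check, a task passed to model_datamove_out()
after model_terminate_tasks() has returned would sit on the queue with
nothing left to abort it, which is the model's analogue of the deadlock.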

Note that if ctld were shut down just slightly later, the request would be
aborted the same way in cfiscsi_session_terminate_tasks(), just with an
error code of 42:

	CFISCSI_SESSION_LOCK(cs);
	while ((cdw = TAILQ_FIRST(&cs->cs_waiting_for_data_out)) != NULL) {
		TAILQ_REMOVE(&cs->cs_waiting_for_data_out, cdw, cdw_next);
		CFISCSI_SESSION_UNLOCK(cs);
		cfiscsi_data_wait_abort(cs, cdw, 42);
		CFISCSI_SESSION_LOCK(cs);
	}
	CFISCSI_SESSION_UNLOCK(cs);
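
For anyone wanting to pull these breadcrumb values out of raw sense data,
here is a small sketch.  It assumes fixed-format sense data (response code
0x70 or 0x71) with the layout from SPC, where the sense-key-specific bytes
15-17 carry an SKSV bit and, for the HARDWARE ERROR sense key, a 16-bit
Actual Retry Count.  The model_* helpers are made up for this example and
are not ctl or CAM functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_SENSE_LEN			18
#define MODEL_KEY_HARDWARE_ERROR	0x04

/* Build minimal fixed-format sense data carrying a breadcrumb value. */
static void
model_build_sense(uint8_t sense[MODEL_SENSE_LEN], uint16_t breadcrumb)
{
	memset(sense, 0, MODEL_SENSE_LEN);
	sense[0] = 0x70;			/* current error, fixed format */
	sense[2] = MODEL_KEY_HARDWARE_ERROR;	/* sense key */
	sense[7] = 10;				/* additional length: bytes 8..17 */
	sense[12] = 0x44;			/* ASC: internal target failure */
	sense[13] = 0x00;			/* ASCQ */
	sense[15] = 0x80;			/* SKSV: bytes 16-17 are valid */
	sense[16] = (uint8_t)(breadcrumb >> 8);	/* retry count, big-endian */
	sense[17] = (uint8_t)(breadcrumb & 0xff);
}

/* Extract the Actual Retry Count; returns false if it is not present. */
static bool
model_get_retry_count(const uint8_t sense[MODEL_SENSE_LEN], uint16_t *countp)
{
	uint8_t code = sense[0] & 0x7f;

	if (code != 0x70 && code != 0x71)
		return (false);		/* not fixed-format sense data */
	if ((sense[2] & 0x0f) != MODEL_KEY_HARDWARE_ERROR)
		return (false);
	if ((sense[15] & 0x80) == 0)
		return (false);		/* SKSV clear: field not valid */
	*countp = (uint16_t)((sense[16] << 8) | sense[17]);
	return (true);
}
```

Note that the ASC of 0x44 and the breadcrumb of 44 (decimal) in the log
above are unrelated values that happen to print alike.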

So my question is: what is the expected behavior?  Is the internal error
really expected to make it onto the wire and be sent to the initiator?  Since
the connection is shutting down, should we just discard the reply altogether
rather than report an internal error?  If we discarded the reply, the
initiator in this particular test would have retried the original request once
ctld was restarted and continued running without an error.
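
The two options can be spelled out as a toy model.  This is purely
illustrative: the enums and functions below are hypothetical, and the
initiator behavior encoded here (a Check Condition with HARDWARE FAILURE
fails the write, while a lost connection leads to a retry after reconnect)
is an assumption based on what this test showed, not a description of the
actual initiator code.

```c
#include <assert.h>
#include <stdbool.h>

/* What the target does with an aborted task on a terminating session. */
enum model_policy {
	MODEL_REPORT_INTERNAL_FAILURE,	/* current behavior */
	MODEL_DISCARD_REPLY		/* the alternative being asked about */
};

/* What the initiator observes for the outstanding write. */
enum model_event {
	MODEL_EV_CHECK_CONDITION,	/* error response arrived on the wire */
	MODEL_EV_CONNECTION_LOST	/* no response; the session went away */
};

enum model_result {
	MODEL_WRITE_FAILED,
	MODEL_WRITE_COMPLETED
};

/* Target side: map the policy to what the initiator ends up seeing. */
static enum model_event
model_target_abort(enum model_policy policy)
{
	if (policy == MODEL_DISCARD_REPLY)
		return (MODEL_EV_CONNECTION_LOST);
	return (MODEL_EV_CHECK_CONDITION);
}

/* Initiator side: assumed handling of each event. */
static enum model_result
model_initiator_handle(enum model_event ev, bool target_restarted)
{
	switch (ev) {
	case MODEL_EV_CHECK_CONDITION:
		/* Treated as unretryable: the error reaches the filesystem. */
		return (MODEL_WRITE_FAILED);
	case MODEL_EV_CONNECTION_LOST:
		/* The request is reissued once the target is back. */
		return (target_restarted ? MODEL_WRITE_COMPLETED :
		    MODEL_WRITE_FAILED);
	}
	return (MODEL_WRITE_FAILED);
}
```

In this model the only difference between the two policies is whether the
initiator gets a chance to retry, which is the crux of the question.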


-- 
John Baldwin