git: e6d3ba4be27d - main - nvme: Lock when processing an abort completion command.

From: Warner Losh <imp_at_FreeBSD.org>
Date: Tue, 23 Jul 2024 23:03:39 UTC
The branch main has been updated by imp:

URL: https://cgit.FreeBSD.org/src/commit/?id=e6d3ba4be27d86c0ade250b52cf9af380f7b4c34

commit e6d3ba4be27d86c0ade250b52cf9af380f7b4c34
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2024-07-23 23:01:57 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-07-23 23:04:02 +0000

    nvme: Lock when processing an abort completion command.
    
    When processing an abort completion command, we have to lock. But we
    have to lock the qpair of the original transaction (not the abort we're
    completing). We do this to avoid races with checking the completion id
    to tr mapping array, as well as to manually complete it.
    
    Note: we don't handle the completion status of 'Asked to abort too many
    transactions at once.' That will be fixed on subsequent commits. Add a
    note to that effect for now since it's a harder problem to solve.
    
    Sponsored by: Netflix
    Differential Revision:  https://reviews.freebsd.org/D46025
---
 sys/dev/nvme/nvme_qpair.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)
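
The locking rule described in the log message can be illustrated outside the kernel. What follows is a minimal userland sketch, not the driver's code: `struct queue`, `struct tracker`, `queue_init`, `manual_complete_locked` and `abort_complete` are hypothetical stand-ins, and a pthread mutex takes the place of the qpair's `mtx`. It shows only the shape of the change: take the lock of the queue the original command was issued on, test bit 0 of cdw0 rather than the whole dword, and check the active-tracker slot and complete it manually under that same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 8	/* arbitrary size chosen for the sketch */

struct queue;

/* Hypothetical stand-in for a tracker: one outstanding command. */
struct tracker {
	struct queue	*q;		/* queue the ORIGINAL command was issued on */
	uint16_t	cid;		/* command id, index into q->active */
	int		aborted;	/* set when completed manually as aborted */
};

/* Hypothetical stand-in for a queue pair. */
struct queue {
	pthread_mutex_t	lock;			/* protects active[] */
	struct tracker	*active[QUEUE_DEPTH];	/* cid to tracker mapping */
};

static void
queue_init(struct queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	for (size_t i = 0; i < QUEUE_DEPTH; i++)
		q->active[i] = NULL;
}

/* Complete a tracker by hand. The caller must hold tr->q->lock. */
static void
manual_complete_locked(struct tracker *tr)
{
	assert(tr->q->active[tr->cid] == tr);
	tr->q->active[tr->cid] = NULL;
	tr->aborted = 1;
}

/*
 * Completion handler for an abort request. 'tr' is the tracker of the
 * command being aborted, so tr->q is that command's queue, not the queue
 * the abort itself travelled on. Returns 1 if the command had to be
 * completed manually, 0 otherwise.
 */
static int
abort_complete(struct tracker *tr, uint32_t cdw0)
{
	int done = 0;

	pthread_mutex_lock(&tr->q->lock);
	/* Bit 0 set: the abort failed. Non-NULL slot: still outstanding. */
	if ((cdw0 & 1) == 1 && tr->q->active[tr->cid] != NULL) {
		manual_complete_locked(tr);
		done = 1;
	}
	pthread_mutex_unlock(&tr->q->lock);
	return (done);
}
```

In this sketch, a second call for the same tracker finds the slot already NULL and does nothing, which is the race the lock is meant to make safe to check.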

diff --git a/sys/dev/nvme/nvme_qpair.c b/sys/dev/nvme/nvme_qpair.c
index 8d9fb4d647c6..e4286b61a3fc 100644
--- a/sys/dev/nvme/nvme_qpair.c
+++ b/sys/dev/nvme/nvme_qpair.c
@@ -1007,22 +1007,35 @@ nvme_abort_complete(void *arg, const struct nvme_completion *status)
 	struct nvme_tracker     *tr = arg;
 
 	/*
-	 * If cdw0 == 1, the controller was not able to abort the command
-	 *  we requested.  We still need to check the active tracker array,
-	 *  to cover race where I/O timed out at same time controller was
-	 *  completing the I/O.
+	 * If cdw0 bit 0 == 1, the controller was not able to abort the command
+	 * we requested.  We still need to check the active tracker array, to
+	 * cover race where I/O timed out at same time controller was completing
+	 * the I/O. An abort command always is on the admin queue, but affects
+	 * either an admin or an I/O queue, so take the appropriate qpair lock
+	 * for the original command's queue, since we'll need it to avoid races
+	 * with the completion code and to complete the command manually.
 	 */
-	if (status->cdw0 == 1 && tr->qpair->act_tr[tr->cid] != NULL) {
+	mtx_lock(&tr->qpair->lock);
+	if ((status->cdw0 & 1) == 1 && tr->qpair->act_tr[tr->cid] != NULL) {
 		/*
-		 * An I/O has timed out, and the controller was unable to
-		 *  abort it for some reason.  Construct a fake completion
-		 *  status, and then complete the I/O's tracker manually.
+		 * An I/O has timed out, and the controller was unable to abort
+		 * it for some reason.  And we've not processed a completion for
+		 * it yet. Construct a fake completion status, and then complete
+		 * the I/O's tracker manually.
 		 */
 		nvme_printf(tr->qpair->ctrlr,
 		    "abort command failed, aborting command manually\n");
 		nvme_qpair_manual_complete_tracker(tr,
 		    NVME_SCT_GENERIC, NVME_SC_ABORTED_BY_REQUEST, 0, ERROR_PRINT_ALL);
 	}
+	/*
+	 * XXX We don't check status for the possible 'Could not abort because
+	 * excess aborts were submitted to the controller'. We don't prevent
+	 * that, either. Document for the future here, since the standard is
+	 * squishy and only says 'may generate' but implies anything is possible
+	 * including hangs if you exceed the ACL.
+	 */
+	mtx_unlock(&tr->qpair->lock);
 }
 
 static void