git: bb7f7d5b5201 - main - nvme: Warn if there's system interrupt issues.

From: Warner Losh <imp_at_FreeBSD.org>
Date: Tue, 23 Jul 2024 23:03:42 UTC

The branch main has been updated by imp:

URL: https://cgit.FreeBSD.org/src/commit/?id=bb7f7d5b5201cfe569fce79b0f325bec2cf38ad2

commit bb7f7d5b5201cfe569fce79b0f325bec2cf38ad2
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2024-07-23 23:02:33 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2024-07-23 23:04:03 +0000

    nvme: Warn if there's system interrupt issues.
    
    Issue a warning if we have system interrupt issues. If you get this
    warning, then we submitted a request, it timed out without an interrupt
    being posted, but when we polled the card's completion, we found
    completion events. This indicates that we're missing interrupts, and to
    date all the times I've helped people track issues like this down it has
    been a system issue, not an NVMe driver issue.
    
    Sponsored by:           Netflix
    Reviewed by:            gallatin
    Differential Revision:  https://reviews.freebsd.org/D46031
---
 share/man/man4/nvme.4       | 9 +++++++++
 sys/dev/nvme/nvme_private.h | 1 +
 sys/dev/nvme/nvme_qpair.c   | 9 +++++++--
 3 files changed, 17 insertions(+), 2 deletions(-)
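
The warn-once behavior described in the commit message can be sketched
outside the kernel as follows. This is only an illustrative userland model,
not the driver code: struct toy_ctrlr, toy_process_completions() and
toy_timeout_poll() are hypothetical stand-ins for struct nvme_controller,
_nvme_qpair_process_completions() and the timeout path in nvme_qpair.c, and
the "pending" counter is an assumption used to simulate completion records
that were posted without an interrupt being delivered.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct nvme_controller (only the fields used here). */
struct toy_ctrlr {
	int	unit;		/* device unit number, as in nvme%d */
	bool	isr_warned;	/* true once the warning has been printed */
	int	pending;	/* simulated completions posted but not yet reaped */
};

/*
 * Hypothetical stand-in for _nvme_qpair_process_completions(): reap whatever
 * is pending and report whether any completion was found.
 */
static bool
toy_process_completions(struct toy_ctrlr *c)
{
	bool found = c->pending > 0;

	c->pending = 0;
	return (found);
}

/*
 * Simulated timeout path: a request's deadline has passed without an
 * interrupt, so poll the completions. Finding some means an interrupt was
 * missed; warn, but only once per controller. Returns true if the warning
 * was printed by this call.
 */
static bool
toy_timeout_poll(struct toy_ctrlr *c)
{
	/* The poll runs unconditionally; only the warning is gated by the flag. */
	if (toy_process_completions(c) && !c->isr_warned) {
		printf("nvme%d: System interrupt issues?\n", c->unit);
		c->isr_warned = true;
		return (true);
	}
	return (false);
}
```

Because the poll is the first operand of the condition, completions are
still reaped on every timeout even after the flag has been set; the flag
only suppresses repeated messages.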

diff --git a/share/man/man4/nvme.4 b/share/man/man4/nvme.4
index 011ff483c839..dcd2ec86f5fa 100644
--- a/share/man/man4/nvme.4
+++ b/share/man/man4/nvme.4
@@ -239,6 +239,15 @@ detects that the AHCI device supports RST and when it is enabled.
 See
 .Xr ahci 4
 for more details.
+.Sh DIAGNOSTICS
+.Bl -diag
+.It "nvme%d: System interrupt issues?"
+The driver found a timed-out transaction had a pending completion record,
+indicating an interrupt had not been delivered.
+The system is either not configuring interrupts properly, or the system drops
+them under load.
+This message will appear at most once per boot per controller.
+.El
 .Sh SEE ALSO
 .Xr nda 4 ,
 .Xr nvd 4 ,
diff --git a/sys/dev/nvme/nvme_private.h b/sys/dev/nvme/nvme_private.h
index ff08f6581db5..05b5f3189eb2 100644
--- a/sys/dev/nvme/nvme_private.h
+++ b/sys/dev/nvme/nvme_private.h
@@ -303,6 +303,7 @@ struct nvme_controller {
 
 	bool				is_failed;
 	bool				is_dying;
+	bool				isr_warned;
 	STAILQ_HEAD(, nvme_request)	fail_req;
 
 	/* Host Memory Buffer */
diff --git a/sys/dev/nvme/nvme_qpair.c b/sys/dev/nvme/nvme_qpair.c
index c917b34dbe43..0c3a36d4d76f 100644
--- a/sys/dev/nvme/nvme_qpair.c
+++ b/sys/dev/nvme/nvme_qpair.c
@@ -1145,9 +1145,14 @@ do_reset:
 		/*
 		 * There's a stale transaction at the start of the queue whose
 		 * deadline has passed. Poll the competions as a last-ditch
-		 * effort in case an interrupt has been missed.
+		 * effort in case an interrupt has been missed. Warn the user if
+		 * transactions were found of possible interrupt issues, but
+		 * just once per controller.
 		 */
-		_nvme_qpair_process_completions(qpair);
+		if (_nvme_qpair_process_completions(qpair) && !ctrlr->isr_warned) {
+			nvme_printf(ctrlr, "System interrupt issues?\n");
+			ctrlr->isr_warned = true;
+		}
 
 		/*
 		 * Now that we've run the ISR, re-rheck to see if there's any