git: 25df691800f0 - stable/14 - nvme: Fix hotplug on one of the amazon platforms

From: Colin Percival <cperciva_at_FreeBSD.org>
Date: Sun, 30 Mar 2025 23:45:36 UTC
The branch stable/14 has been updated by cperciva:

URL: https://cgit.FreeBSD.org/src/commit/?id=25df691800f08756228d3ac52ac64d80e9fe998d

commit 25df691800f08756228d3ac52ac64d80e9fe998d
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2025-02-25 16:29:14 +0000
Commit:     Colin Percival <cperciva@FreeBSD.org>
CommitDate: 2025-03-30 23:44:57 +0000

    nvme: Fix hotplug on one of the amazon platforms
    
    Amazon EC2 m7i cloud instances use PCI hotplug rather than ACPI
    hotplug. The card is removed and detach is called to remove the drive
    from the system. The hardware is no longer present at this point, but
    the bridge doesn't translate the now-missing hardware reads to all ff's
    leading us to conclude the hardware is there and we need to do a proper
    shutdown of it. Fix this oversight by asking the bridge if the device is
    still present as well. We need both tests since on some systems one can
    remove the card w/o a hotplug event and we want to fail-safe in those
    cases.
    
    Convert gone to a bool while I'm here and update a comment about
    shutting down the controller and why that's important.
    
    Tested by: cperciva
    Sponsored by: Netflix
    
    (cherry picked from commit dc95228d98474aba940e3885164912b419c5579d)
---
 sys/dev/nvme/nvme_ctrlr.c | 31 +++++++++++++++++++------------
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/sys/dev/nvme/nvme_ctrlr.c b/sys/dev/nvme/nvme_ctrlr.c
index 5a825c10f584..6f5d6ae74add 100644
--- a/sys/dev/nvme/nvme_ctrlr.c
+++ b/sys/dev/nvme/nvme_ctrlr.c
@@ -1513,7 +1513,8 @@ nvme_ctrlr_construct(struct nvme_controller *ctrlr, device_t dev)
 void
 nvme_ctrlr_destruct(struct nvme_controller *ctrlr, device_t dev)
 {
-	int	gone, i;
+	int	i;
+	bool	gone;
 
 	ctrlr->is_dying = true;
 
@@ -1523,10 +1524,16 @@ nvme_ctrlr_destruct(struct nvme_controller *ctrlr, device_t dev)
 		goto noadminq;
 
 	/*
-	 * Check whether it is a hot unplug or a clean driver detach.
-	 * If device is not there any more, skip any shutdown commands.
+	 * Check whether it is a hot unplug or a clean driver detach.  If device
+	 * is not there any more, skip any shutdown commands.  Some hotplug
+	 * bridges will return zeros instead of ff's when the device is
+	 * departing, so ask the bridge if the device is gone. Some systems can
+	 * remove the drive w/o the bridge knowing it's gone (they don't really
+	 * do hotplug), so failsafe with detecting all ff's (impossible with
+	 * this hardware) as the device being gone.
 	 */
-	gone = (nvme_mmio_read_4(ctrlr, csts) == NVME_GONE);
+	gone = bus_child_present(dev) == 0 ||
+	    (nvme_mmio_read_4(ctrlr, csts) == NVME_GONE);
 	if (gone)
 		nvme_ctrlr_fail(ctrlr);
 	else
@@ -1554,17 +1561,17 @@ nvme_ctrlr_destruct(struct nvme_controller *ctrlr, device_t dev)
 	nvme_admin_qpair_destroy(&ctrlr->adminq);
 
 	/*
-	 *  Notify the controller of a shutdown, even though this is due to
-	 *   a driver unload, not a system shutdown (this path is not invoked
-	 *   during shutdown).  This ensures the controller receives a
-	 *   shutdown notification in case the system is shutdown before
-	 *   reloading the driver.
+	 * Notify the controller of a shutdown, even though this is due to a
+	 * driver unload, not a system shutdown (this path is not invoked during
+	 * shutdown).  This ensures the controller receives a shutdown
+	 * notification in case the system is shutdown before reloading the
+	 * driver. Some NVMe drives need this to flush their cache to stable
+	 * media and consider it a safe shutdown in SMART stats.
 	 */
-	if (!gone)
+	if (!gone) {
 		nvme_ctrlr_shutdown(ctrlr);
-
-	if (!gone)
 		nvme_ctrlr_disable(ctrlr);
+	}
 
 noadminq:
 	if (ctrlr->taskqueue)
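
The two-part presence test described in the commit message can be modeled outside the kernel. The following is a minimal userland sketch, not the driver code: struct sketch_dev, sketch_ctrlr_gone() and SKETCH_NVME_GONE are hypothetical names introduced for illustration, and the all-ff's sentinel value is assumed to match the driver's NVME_GONE. The two fields stand in for the results of bus_child_present(dev) and nvme_mmio_read_4(ctrlr, csts).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: same all-ff's value the driver uses for NVME_GONE. */
#define SKETCH_NVME_GONE	0xffffffffu

/* Hypothetical stand-in for what the driver can observe at detach time. */
struct sketch_dev {
	bool		bridge_says_present;	/* models bus_child_present(dev) != 0 */
	uint32_t	csts;			/* models nvme_mmio_read_4(ctrlr, csts) */
};

/*
 * Treat the controller as gone if either test says so: the bridge
 * reports the child absent, or the status register reads as all ff's.
 */
static bool
sketch_ctrlr_gone(const struct sketch_dev *d)
{
	return (!d->bridge_says_present || d->csts == SKETCH_NVME_GONE);
}
```

In this model, a bridge that reports the device absent while the register read returns something other than all ff's (the case the commit message describes) is still classified as gone. A system where the bridge never notices the removal but reads return all ff's is classified as gone as well, which is the fail-safe the commit message asks for.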