[Bug 262969] NVMe - Resetting controller due to a timeout and possible hot unplug

From: <bugzilla-noreply_at_freebsd.org>
Date: Fri, 01 Apr 2022 05:45:42 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262969

            Bug ID: 262969
           Summary: NVMe - Resetting controller due to a timeout and
                    possible hot unplug
           Product: Base System
           Version: 13.1-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: ibrennan@netgrade.com

I am seeing very unstable NVMe on TrueNAS 12 and 13.  I'm using Western Digital
Black NVMe m.2 SSD's.  The exact version I'm using is WD_BLACK SN770 250GB &
2TB, firmware version 731030WD.

This is on TrueNAS version: TrueNAS-12.0-U8 / FreeBSD: 12.2-RELEASE-p12 (I
later upgraded to 13 Beta, see below)

I set hw.nvme.per_cpu_io_queues=0, and it did not fix the problem, in fact it
seems to have made it much more frequent, although I'm not 100% sure about
that, need to test again.

I also tried using the nvd driver with hw.nvme.use_nvd=0, which doesn't seem to
make a difference, however it had slightly different results in the log when
the issue happened again.  See logs below, would be grateful if somebody can
help with this problem.


Mar 29 21:42:25 truenas nvme5: Resetting controller due to a timeout and
possible hot unplug.
Mar 29 21:42:25 truenas nvme5: resetting controller
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:12 cid:120 nsid:1 lba:1497544880
len:16
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:12 cid:120
cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:12 cid:123 nsid:1 lba:198272936 len:16
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:12 cid:123
cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:13 cid:121 nsid:1 lba:431014528 len:24
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:13 cid:121
cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:15 cid:127 nsid:1 lba:864636432 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:15 cid:127
cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:16 cid:126 nsid:1 lba:2445612184 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:16 cid:126
cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:16 cid:120 nsid:1 lba:430503600 len:8
Mar 29 21:42:25 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:16 cid:120
cdw0:0
Mar 29 21:42:25 truenas nvme5: failing outstanding i/o
Mar 29 21:42:25 truenas nvme5: READ sqid:18 cid:123 nsid:1 lba:1499051024 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:18 cid:123
cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: WRITE sqid:18 cid:124 nsid:1 lba:1990077368
len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:18 cid:124
cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:19 cid:122 nsid:1 lba:1237765696 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:19 cid:122
cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:19 cid:125 nsid:1 lba:180758264 len:16
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:19 cid:125
cdw0:0
Mar 29 21:42:26 truenas nvme5: failing outstanding i/o
Mar 29 21:42:26 truenas nvme5: READ sqid:20 cid:121 nsid:1 lba:2445612192 len:8
Mar 29 21:42:26 truenas nvme5: ABORTED - BY REQUEST (00/07) sqid:20 cid:121
cdw0:0
Mar 29 21:42:26 truenas nvd5: detached





nvme3: Resetting controller due to a timeout and possible hot unplug.
nvme3: resetting controller
nvme3: failing outstanding i/o
nvme3: READ sqid:7 cid:127 nsid:1 lba:419546528 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:7 cid:127 cdw0:0
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0
cdw=1901c5a0 0 7 0 0 0
failing outstanding i/o
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
nvme3: READ sqid:11 cid:127 nsid:1 lba:782841288 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:11 cid:127 cdw0:0
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0
cdw=2ea935c8 0 7 0 0 0
failing outstanding i/o
nvme3: READ sqid:11 cid:123 nsid:1 lba:704576056 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:11 cid:123 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:12 cid:127 nsid:1 lba:1016402352 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:12 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:12 cid:125 nsid:1 lba:1824854760 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:12 cid:125 cdw0:0
nvme3: failing outstanding i/o
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
WRITE sqid:13 cid:124 nsid:1 lba:1008638008 len:64
nvme3: ABORTED - BY REQUEST (00/07) sqid:13 cid:124 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:13 cid:125 nsid:1 lba:1008638152 len:56
nvme3: ABORTED - BY REQUEST (00/07) sqid:13 cid:125 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:15 cid:127 nsid:1 lba:783188688 len:8
nvme3: (nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0
cdw=29fefa38 0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c9511b0
0 7 0 0 0
ABORTED - BY REQUEST (00/07) sqid:15 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: WRITE sqid:15 cid:123 nsid:1 lba:1008553080 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:15 cid:123 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:16 cid:124 nsid:1 lba:147012776 len:8
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6cc512e8 0
7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1e9838
0 3f 0 0 0
ABORTED - BY REQUEST (00/07) sqid:16 cid:124 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:16 cid:127 nsid:1 lba:2881895592 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:16 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:17 cid:127 nsid:1 lba:2574392744 len:16
nvme3: ABORTED - BY REQUEST (00/07) sqid:17 cid:127 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:18 cid:126 nsid:1 lba:155895056 len:8
nvme3: (nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1e98c8
0 37 0 0 0
ABORTED - BY REQUEST (00/07) sqid:18 cid:126 cdw0:0
nvme3: failing outstanding i/o
nvme3: READ sqid:19 cid:125 nsid:1 lba:151377120 len:8
nvme3: ABORTED - BY REQUEST (00/07) sqid:19 cid:125 cdw0:0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=2eae82d0 0
7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=3c1d4c78
0 7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=8c33ca8 0
7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=abc63ca8 0
7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=99721da8 0
f 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=94ac510 0
7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
(nda3:nvme3:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=905d4e0 0
7 0 0 0
(nda3:nvme3:0:0:1): CAM status: CCB request completed with an error
(nda3:nvme3:0:0:1): Error 5, Retries exhausted
nda3 at nvme3 bus 0 scbus13 target 0 lun 1
nda3: <WD_BLACK SN770 2TB 731030WD 21513C800057>
 s/n 21513C800057 detached
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file
xptioctl: pass driver is not in the kernel
xptioctl: put "device pass" in your kernel config file


After Upgrading to 13.0-STABLE I see this:

nvme4: RECOVERY_START 166765538822 vs 166329346273
nvme4: Controller in fatal status, resetting
nvme4: Resetting controller due to a timeout and possible hot unplug.
nvme4: RECOVERY_WAITING
nvme4: resetting controller
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:127 nsid:1 lba:32 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:127 cdw0:0
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:126 nsid:1 lba:544 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:126 cdw0:0
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:125 nsid:1 lba:3907028000 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:125 cdw0:0
nvme4: failing outstanding i/o
nvme4: READ sqid:18 cid:124 nsid:1 lba:3907028512 len:224
nvme4: ABORTED - BY REQUEST (00/07) sqid:18 cid:124 cdw0:0
nvd4: detached

-- 
You are receiving this mail because:
You are the assignee for the bug.