Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)
Date: Thu, 08 Jun 2023 06:24:55 UTC
On Wed, Jun 7, 2023 at 11:12 PM Rebecca Cran <rebecca@bsdio.com> wrote:

> I got a seemingly random nvme data transfer error on my new arm64 Ampere
> Altra machine, which has a Samsung PM1735 PCIe AIC NVMe drive.
>
> Since it's a new drive and smartctl doesn't show any errors I thought it
> might be worth mentioning here.
>
> I'm running 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263139-baef3a5b585f.
>
>
> dmesg contains:
>
> nvme0: WRITE sqid:16 cid:126 nsid:1 lba:2550684560 len:8
> nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:0 dnr:0 sqid:16 cid:126 cdw0:0
> (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0
> cdw=98085b90 0 7 0 0 0
> (nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
> (nda0:nvme0:0:0:1): Error 5, Retries exhausted
>
>
> nvmecontrol identify nvme0 shows:
>
> Vendor ID:                   144d
> Subsystem Vendor ID:         144d
> Model Number:                SAMSUNG MZPLJ6T4HALA-00007
> Firmware Version:            EPK9CB5Q
> Recommended Arb Burst:       8
> IEEE OUI Identifier:         00 25 38
> Multi-Path I/O Capabilities: Multiple controllers, Multiple ports
> Max Data Transfer Size:      131072 bytes
> Sanitize Crypto Erase:       Supported
> Sanitize Block Erase:        Supported
> Sanitize Overwrite:          Not Supported
> Sanitize NDI:                Not Supported
> Sanitize NODMMAS:            Undefined
> Controller ID:               0x0041
> Version:                     1.3.0
>

PCIe 3 or PCIe 4?

The only documented reason for this error is that we set up the memory
wrong, such that the drive couldn't start a transfer from the specified
address. That seems weird to me... But the prior paragraph talks about
other types of aborts that need software intervention.

If this is a transient error, then maybe we should retry it as part of the
data recovery, unless the Do Not Retry (DNR) bit is set, which it isn't
(dnr:0 above). I wonder whether this was actually retried 5 times before
generating the error...

Warner
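
For anyone who wants to pick apart the quoted dmesg lines, here is a minimal
Python sketch of how the fields decode. It is illustrative only, not the
FreeBSD nvme(4)/nda(4) code: the function names (decode_completion,
lba_and_len, should_retry) and the retry limit are hypothetical, and the retry
rule is just the "retry unless DNR is set" idea from above. It assumes that
the "(00/04)" pair is status code type / status code in hex, and that the
"cdw=" values are command dwords 10 through 15 in hex.

```python
import re

# Small subset of the generic command status codes (status code type 0).
GENERIC_STATUS = {
    0x00: "SUCCESSFUL COMPLETION",
    0x04: "DATA TRANSFER ERROR",
    0x06: "INTERNAL DEVICE ERROR",
}

# Matches the "(sct/sc) crd:N m:N dnr:N" part of a completion line.
COMPLETION_RE = re.compile(
    r"\((?P<sct>[0-9a-fA-F]+)/(?P<sc>[0-9a-fA-F]+)\)\s+"
    r"crd:(?P<crd>[0-9a-fA-F]+)\s+m:(?P<m>[0-9a-fA-F]+)\s+"
    r"dnr:(?P<dnr>[0-9a-fA-F]+)"
)


def decode_completion(line):
    """Return the status fields of a completion line as integers."""
    match = COMPLETION_RE.search(line)
    if match is None:
        raise ValueError("not a completion line: %r" % line)
    return {name: int(value, 16) for name, value in match.groupdict().items()}


def lba_and_len(cdw10, cdw11, cdw12):
    """Decode a read/write command: 64-bit starting LBA and block count."""
    lba = (cdw11 << 32) | cdw10      # cdw11 holds the upper 32 bits
    nblocks = (cdw12 & 0xFFFF) + 1   # block count is stored zero-based
    return lba, nblocks


def should_retry(dnr, attempts, limit):
    """Hypothetical policy: retry while DNR is clear and attempts remain."""
    return dnr == 0 and attempts < limit


status = decode_completion(
    "nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:0 dnr:0 sqid:16 cid:126 cdw0:0"
)
print(GENERIC_STATUS.get(status["sc"], "unknown"), status)
print(lba_and_len(0x98085B90, 0x0, 0x7))
print(should_retry(status["dnr"], attempts=0, limit=5))
```

Decoding cdw=98085b90 0 7 this way gives LBA 2550684560 and a length of 8
blocks, which matches the "lba:2550684560 len:8" in the WRITE line, so the
command the drive rejected is the one the driver logged.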