Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)

From: Warner Losh <imp_at_bsdimp.com>
Date: Thu, 08 Jun 2023 06:24:55 UTC
On Wed, Jun 7, 2023 at 11:12 PM Rebecca Cran <rebecca@bsdio.com> wrote:

> I got a seemingly random nvme data transfer error on my new arm64 Ampere
> Altra machine, which has a Samsung PM1735 PCIe AIC NVMe drive.
>
> Since it's a new drive and smartctl doesn't show any errors I thought it
> might be worth mentioning here.
>
> I'm running 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263139-baef3a5b585f.
>
>
> dmesg contains:
>
> nvme0: WRITE sqid:16 cid:126 nsid:1 lba:2550684560 len:8
> nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:0 dnr:0 sqid:16 cid:126 cdw0:0
> (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0
> cdw=98085b90 0 7 0 0 0
> (nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
> (nda0:nvme0:0:0:1): Error 5, Retries exhausted
>
>
> nvmecontrol identify nvme0 shows:
>
> Vendor ID:                   144d
> Subsystem Vendor ID:         144d
> Model Number:                SAMSUNG MZPLJ6T4HALA-00007
> Firmware Version:            EPK9CB5Q
> Recommended Arb Burst:       8
> IEEE OUI Identifier:         00 25 38
> Multi-Path I/O Capabilities: Multiple controllers, Multiple ports
> Max Data Transfer Size:      131072 bytes
> Sanitize Crypto Erase:       Supported
> Sanitize Block Erase:        Supported
> Sanitize Overwrite:          Not Supported
> Sanitize NDI:                Not Supported
> Sanitize NODMMAS:            Undefined
> Controller ID:               0x0041
> Version:                     1.3.0
>

PCIe 3 or PCIe 4?

So the only documented reason for this error is that we set up the memory
wrong, such that the drive couldn't start a transfer from the specified
address. This seems weird to me... But the prior paragraph talks about
other types of aborts that need software intervention. If this is a
transient error, then maybe we should retry it as part of the data
recovery, unless the Do Not Retry bit is set, which it isn't (dnr:0 in the
dmesg output above). I wonder whether this is retried 5 times or not before
generating the error...
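
To make the fields in that dmesg line concrete, here is a rough sketch of
how the 16-bit completion status word splits into the sct/sc pair printed
as (00/04) and the crd, m and dnr bits. The bit positions follow the NVMe
completion queue entry layout. The function names and the may_retry()
policy are hypothetical, written only for illustration; this is not the
driver's or CAM's actual retry logic.

```python
# Status values from the NVMe base spec's generic command status table.
SCT_GENERIC = 0x0
SC_DATA_TRANSFER_ERROR = 0x04


def encode_status(sct, sc, crd=0, m=0, dnr=0, p=0):
    """Pack fields into the 16-bit status word (upper half of CQE dword 3)."""
    return (
        (p & 0x1)
        | ((sc & 0xFF) << 1)
        | ((sct & 0x7) << 9)
        | ((crd & 0x3) << 12)
        | ((m & 0x1) << 14)
        | ((dnr & 0x1) << 15)
    )


def decode_status(status):
    """Split the 16-bit status word into its named fields."""
    return {
        "p": status & 0x1,             # phase tag
        "sc": (status >> 1) & 0xFF,    # status code
        "sct": (status >> 9) & 0x7,    # status code type
        "crd": (status >> 12) & 0x3,   # command retry delay index
        "m": (status >> 14) & 0x1,     # more information available
        "dnr": (status >> 15) & 0x1,   # do not retry
    }


def may_retry(fields, retries_left):
    """Hypothetical policy: retry only if DNR is clear and retries remain."""
    return fields["dnr"] == 0 and retries_left > 0


# The completion from the report: sct 00, sc 04, crd:0 m:0 dnr:0.
fields = decode_status(encode_status(SCT_GENERIC, SC_DATA_TRANSFER_ERROR))
print("(%02x/%02x) crd:%x m:%x dnr:%x" % (
    fields["sct"], fields["sc"], fields["crd"], fields["m"], fields["dnr"]))
```

With dnr clear, a policy like this would allow the write to be retried,
which is why it matters whether the retries actually happened before the
error was reported.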

Warner