From: Warner Losh
Date: Thu, 8 Jun 2023 00:24:55 -0600
Subject: Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)
To: Rebecca Cran
Cc: FreeBSD CURRENT
In-Reply-To: <5b52fc08-fb5a-900e-b98c-817a4ab79846@bsdio.com>
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current

On Wed, Jun 7, 2023 at 11:12 PM Rebecca Cran wrote:

> I got a seemingly random nvme data transfer error on my new arm64 Ampere
> Altra machine, which has a Samsung PM1735 PCIe AIC NVMe drive.
>
> Since it's a new drive and smartctl doesn't show any errors I thought it
> might be worth mentioning here.
>
> I'm running 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263139-baef3a5b585f.
>
>
> dmesg contains:
>
> nvme0: WRITE sqid:16 cid:126 nsid:1 lba:2550684560 len:8
> nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:0 dnr:0 sqid:16 cid:126 cdw0:0
> (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0
> cdw=98085b90 0 7 0 0 0
> (nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
> (nda0:nvme0:0:0:1): Error 5, Retries exhausted
>
>
> nvmecontrol identify nvme0 shows:
>
> Vendor ID:                   144d
> Subsystem Vendor ID:         144d
> Model Number:                SAMSUNG MZPLJ6T4HALA-00007
> Firmware Version:            EPK9CB5Q
> Recommended Arb Burst:       8
> IEEE OUI Identifier:         00 25 38
> Multi-Path I/O Capabilities: Multiple controllers, Multiple ports
> Max Data Transfer Size:      131072 bytes
> Sanitize Crypto Erase:       Supported
> Sanitize Block Erase:        Supported
> Sanitize Overwrite:          Not Supported
> Sanitize NDI:                Not Supported
> Sanitize NODMMAS:            Undefined
> Controller ID:               0x0041
> Version:                     1.3.0
>

PCIe 3 or PCIe 4?

So the only documented reason for this error is if we set up the memory
wrong, such that the drive couldn't start a transfer from the specified
address. This seems weird to me... But the prior paragraph talks about
other types of aborts that need software intervention. If this is a
transient error, then maybe we should retry it as part of the data
recovery, unless the Do Not Retry bit is set, which it isn't (dnr:0).
I wonder whether this is retried 5 times or not before the error is
generated...

Warner
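
For anyone who wants to check the fields in the dmesg output above by hand, here is a minimal sketch in Python. It is not the FreeBSD driver code. The names `NvmeStatus`, `decode_status`, `may_retry` and `decode_rw` are made up for illustration. It assumes the 16-bit completion status layout from the NVMe base specification (phase tag in bit 0, then SC, SCT, CRD, M, DNR), and it assumes the six values printed after "cdw=" are CDW10 through CDW15. The retry rule shown is only the one discussed above (retry unless Do Not Retry is set), not necessarily what the driver does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NvmeStatus:
    sc: int     # Status Code, 8 bits
    sct: int    # Status Code Type, 3 bits
    crd: int    # Command Retry Delay, 2 bits
    more: bool  # "More" bit: further detail in the error log page
    dnr: bool   # Do Not Retry


def decode_status(status: int) -> NvmeStatus:
    """Split a 16-bit completion status word (phase tag in bit 0)."""
    return NvmeStatus(
        sc=(status >> 1) & 0xFF,
        sct=(status >> 9) & 0x7,
        crd=(status >> 12) & 0x3,
        more=bool((status >> 14) & 0x1),
        dnr=bool((status >> 15) & 0x1),
    )


def may_retry(st: NvmeStatus) -> bool:
    """Hypothetical policy: retry any failure that lacks the DNR bit."""
    failed = (st.sct, st.sc) != (0, 0)
    return failed and not st.dnr


def decode_rw(cdw10: int, cdw11: int, cdw12: int) -> tuple[int, int]:
    """Return (starting LBA, block count) of a READ/WRITE command."""
    lba = (cdw11 << 32) | cdw10
    nblocks = (cdw12 & 0xFFFF) + 1  # the NLB field is zero-based
    return lba, nblocks


if __name__ == "__main__":
    # SCT 0 (generic), SC 0x04 (Data Transfer Error), all other bits clear.
    st = decode_status(0x04 << 1)
    print(f"({st.sct:02x}/{st.sc:02x}) crd:{st.crd} dnr:{int(st.dnr)}",
          "retry" if may_retry(st) else "no retry")
    # Values taken from the "cdw=98085b90 0 7 0 0 0" line.
    print(decode_rw(0x98085B90, 0, 7))
```

With the values from the report, the status decodes to SCT 0 / SC 0x04 with DNR clear, and the command words decode to LBA 2550684560 and 8 blocks, which agrees with the "lba:2550684560 len:8" line that nvme0 printed.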