Re: nvme device errors & zfs

From: Warner Losh <imp_at_bsdimp.com>
Date: Tue, 05 Nov 2024 01:14:43 UTC
On Mon, Nov 4, 2024 at 10:31 AM Dave Cottlehuber <dch@freebsd.org> wrote:

> What's the best way to see error counters or states on an nvme
> device?
>

Sadly, I think dmesg | grep nvme and/or trolling through
/var/log/messages is about the best you can do. NVMe drives don't
generally keep good error counters...
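
For example, something along these lines (a rough sketch; adjust the
pattern and log paths to your box) is a reasonable first pass, since the
nda(4) peripheral driver logs errors as well as nvme(4):

$ dmesg | grep -E 'nvme|nda'
$ grep -E 'nvme|nda' /var/log/messages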


> I have a typical mirrored nvme zpool, that reported enough errors
> in a burst last week, that 1 drive dropped off the bus [1].
>
> After a reboot, it resilvered, I cleared the errors, and it seems
> fine according to repeated scrubs and a few days of use.
>
> I was unable to see any errors from the nvme drive itself, but
> as it's (just) in warranty for 2 more weeks, I'd like to know
> if I should return it.
>
> I installed ports `sysutils/nvme-cli` and didn't see anything
> of note there either:
>
> $ doas nvme smart-log /dev/nvme1
> 0xc0484e41: opc: 0x2 fuse: 0 cid 0 nsid:0xffffffff cmd2: 0 cmd3: 0
>           : cdw10: 0x7f0002 cdw11: 0 cdw12: 0 cdw13: 0
>           : cdw14: 0 cdw15: 0 len: 0x200 is_read: 0
> <--- 0 cid: 0 status 0
> Smart Log for NVME device:nvme1 namespace-id:ffffffff
> critical_warning                    : 0
> temperature                         : 39 C
> available_spare                     : 100%
> available_spare_threshold           : 10%
> percentage_used                     : 3%
> data_units_read                     : 121681067
> data_units_written                  : 86619659
> host_read_commands                  : 695211450
> host_write_commands                 : 2187823697
> controller_busy_time                : 2554
> power_cycles                        : 48
> power_on_hours                      : 6342
> unsafe_shutdowns                    : 38
> media_errors                        : 0
> num_err_log_entries                 : 0
> Warning Temperature Time            : 0
> Critical Composite Temperature Time : 0
>

This suggests that the only 'badness' is the 38 unsafe shutdowns (likely
power failures): either there were a bunch all at once (maybe when
installing) or you've had power-off events roughly every week...

There have been no reported media errors either (or the drive hasn't done
a good job of remembering them, though most NVMe drives are better than
most other storage at that).


> Temperature Sensor 1                : 39 C
> Temperature Sensor 2                : 43 C
> Thermal Management T1 Trans Count   : 0
> Thermal Management T2 Trans Count   : 0
> Thermal Management T1 Total Time    : 0
> Thermal Management T2 Total Time    : 0
>

There's been no time when the drive overheated either. That's good.


> [1]: zpool status
> status: One or more devices are faulted in response to persistent errors.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
>         repaired.
>   scan: scrub repaired 0B in 00:17:59 with 0 errors on Thu Oct 31 16:24:36 2024
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         zroot         DEGRADED     0     0     0
>           mirror-0    DEGRADED     0     0     0
>             gpt/zfs0  ONLINE       0     0     0
>             gpt/zfs1  FAULTED      0     0     0  too many errors
>

I'm not sure how to reconcile this in the face of the above. I'd have to
see the dmesg / messages logs for any non-boot messages for nvme / nda.
For bad drives at work, I typically see something like:

/var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Resetting controller due to a timeout.
/var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Waiting for reset to complete
/var/log/messages.0.bz2:Nov  3 02:49:05 c001 kernel: nvme2: controller ready did not become 0 within 10500 ms

for drives that just 'hang', which would cause ZFS to drop them out. I'd
see if there's new firmware available, or return the drive.
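
If you want to troll the rotated logs for that pattern, something like
this (a rough sketch; newsyslog's rotation/compression setup may differ
on your box) should turn up any reset or timeout messages:

# bzcat /var/log/messages.*.bz2 | grep -E 'nvme[0-9]+: (Resetting|controller ready|Waiting for reset)'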

I also see:
nvme8: READ sqid:3 cid:117 nsid:1 lba:1875786352 len:1024
nvme8: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 prp1:0x40defd000 prp2:0x1395a2400
nvme8: cdw10: 0x6fce3a70 cdw11:0 cdw12:0x3ff cdw13:0 cdw14:0 cdw15:0
nvme8: UNRECOVERED READ ERROR (02/81) crd:0 m:1 dnr:1 p:1 sqid:3 cid:117 cdw0:0
(nda8:nvme8:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6fce3a70 0 3ff 0 0 0
(nda8:nvme8:0:0:1): CAM status: NVME Status Error
(nda8:nvme8:0:0:1): Error 5, Retries exhausted
g_vfs_done():nda8p8[READ(offset=960402063360, length=1048576)]error = 5

when there's a media error. But the brand of NVMe drives we buy reports
this as an error:

c029.for002.ix# nvmecontrol logpage -p 2 nvme8
SMART/Health Information Log
============================
Critical Warning State:         0x04
 Available spare:               0
 Temperature:                   0
 Device reliability:            1
 Read only:                     0
 Volatile memory backup:        0
[[... but this says the drive has lost data ]]
Power cycles:                   106
Power on hours:                 30250
Unsafe shutdowns:               19
Media errors:                   3
No. error info log entries:     3
Warning Temp Composite Time:    0
Error Temp Composite Time:      0
Temperature 1 Transition Count: 0
Temperature 2 Transition Count: 0
Total Time For Temperature 1:   0
Total Time For Temperature 2:   0

so there are 3 media errors. I can read the error log page to find the
LBA too (I'm working on enhancing the errors we report for NVMe to
include the LBA of the first error as well, but that's not there yet).
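
If you want to pull that yourself, the NVMe error information log is log
page 1, so something like this (a sketch; the exact fields printed vary a
bit between nvmecontrol versions) should list the logged error entries,
including the LBA of each failed command:

# nvmecontrol logpage -p 1 nvme8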

But since you don't have any media errors, I'd check the history to see
if the nvme drives are resetting (either successfully or not). I don't
know how to get that data from just the drive logs, though.
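
On the ZFS side, something like this (a sketch; note the event buffer
lives in the kernel, so it won't show anything from before your reboot,
but it's handy going forward) will list the I/O error and fault events
that kicked the vdev out, with timestamps:

# zpool events -v zroot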

Warner