Re: nvme device errors & zfs

From: Dave Cottlehuber <dch_at_freebsd.org>
Date: Tue, 05 Nov 2024 09:14:59 UTC
On Tue, 5 Nov 2024, at 01:14, Warner Losh wrote:
> On Mon, Nov 4, 2024 at 10:31 AM Dave Cottlehuber <dch@freebsd.org> wrote:
>> What's the best way to see error counters or states on an nvme
>> device?
>
> Sadly, I think dmesg | grep nvme and/or trolling through /var/log/messages.
> Nvme drives don't generally keep good counters of errors...

Thanks Warner,

Good to know. brb, writing all my critical data onto stone tablets
and vellum... mirrored vdevs from different batches & vendors...
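
In case it helps anyone else digging through the archives: the grep I
ended up using to fish the messages out (a rough sketch, assuming the
stock syslog layout with rotated logs compressed to .bz2) was:

$ grep -E 'nvme|nda' /var/log/messages
$ bzcat /var/log/messages.*.bz2 | grep -E 'nvme|nda'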

These are Samsung 990 PROs, mainly chosen for their low price at the time:

nda0: <Samsung SSD 990 PRO 2TB 0B2QJXG7 S7DNNJ0WC12665P>
nda1: <Samsung SSD 990 PRO 2TB 0B2QJXG7 S7DNNJ0WC12664X>

https://gist.github.com/dch/6523082557066f7d95af82000be74a8c is a
more readable version of the logs below.

I forgot to mention the dmesg output leading up to this:

Oct 31 16:11:05 wintermute kernel[9406]: nvme1: Resetting controller due to a timeout.
Oct 31 16:11:05 wintermute kernel[9406]: nvme1: event="start"
Oct 31 16:11:05 wintermute kernel[9406]: nvme1: Waiting for reset to complete
Oct 31 16:11:05 wintermute kernel[9406]: nvme1: Waiting for reset to complete
... repeated x400

Then the vdev is yanked by ZFS:
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30294]: vdev I/O failure, zpool=zroot path=/dev/gpt/zfs1 offset=270336 size=8192 error=6
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30305]: vdev state changed, pool_guid=16468628842577781486 vdev_guid=1873129772219000696
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30309]: vdev is removed, pool_guid=16468628842577781486 vdev_guid=1873129772219000696
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30313]: vdev state changed, pool_guid=16468628842577781486 vdev_guid=1873129772219000696
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: controller ready did not become 0 within 20500 ms
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: event="timed_out"
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: failing outstanding i/o
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: READ sqid:3 cid:126 nsid:1 lba:667089368 len:2048
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:126 cdw0:0
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: WRITE sqid:3 cid:121 nsid:1 lba:2533579032 len:256
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:121 cdw0:0
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: WRITE sqid:3 cid:127 nsid:1 lba:2533579456 len:256
Oct 31 16:11:26 wintermute kernel[9426]: (nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=27c2f9d8 0 7ff 0 0 0
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:127 cdw0:0
Oct 31 16:11:26 wintermute kernel[9426]: (nda1:nvme1:0:0:1): CAM status: NVME Status Error
Oct 31 16:11:26 wintermute kernel[9426]: (nda1:nvme1:0:0:1): Error 5, Retries exhausted
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: failing outstanding i/o

This also repeated a few times.
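
The same removal also shows up via `zpool events` (OpenZFS keeps a ring
of recent events in the kernel), which is handy once messages has
rotated; roughly:

$ zpool events -v zroot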

>> I have a typical mirrored nvme zpool, that reported enough errors
>> in a burst last week, that 1 drive dropped off the bus [1].
>> 
>> After a reboot, it resilvered, I cleared the errors, and it seems
>> fine according to repeated scrubs and a few days of use.
>> 
>> I was unable to see any errors from the nvme drive itself, but
>> as its (just) in warranty for 2 more weeks I'd like to know
>> if I should return it.
>> 
>> I installed ports `sysutils/nvme-cli` and didn't see anything 
>> of note there either:
>> 
>> $ doas nvme smart-log /dev/nvme1

[snip]
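
(For the archives: the recovery above was just the usual dance, roughly

$ zpool status -v zroot
$ zpool clear zroot gpt/zfs1
$ zpool scrub zroot

once the reboot brought the device back and the resilver had finished.)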

`nvmecontrol logpage -p 2 nvme1` returns the same info as the ports
nvme-cli tool, but it's in base already.

I will add this example to the manpage, as I tried logpage 0 & 1
without getting any info back.
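
For anyone else poking at this: per the NVMe spec, log page 1 is the
Error Information log and page 2 is the SMART/Health page, so the two
worth checking are roughly:

$ nvmecontrol logpage -p 1 nvme1    # Error Information (empty here, num_err_log_entries is 0)
$ nvmecontrol logpage -p 2 nvme1    # SMART / Health Information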

>> 0xc0484e41: opc: 0x2 fuse: 0 cid 0 nsid:0xffffffff cmd2: 0 cmd3: 0
>>           : cdw10: 0x7f0002 cdw11: 0 cdw12: 0 cdw13: 0
>>           : cdw14: 0 cdw15: 0 len: 0x200 is_read: 0
>> <--- 0 cid: 0 status 0
>> Smart Log for NVME device:nvme1 namespace-id:ffffffff
>> critical_warning                    : 0
>> temperature                         : 39 C
>> available_spare                     : 100%
>> available_spare_threshold           : 10%
>> percentage_used                     : 3%
>> data_units_read                     : 121681067
>> data_units_written                  : 86619659
>> host_read_commands                  : 695211450
>> host_write_commands                 : 2187823697
>> controller_busy_time                : 2554
>> power_cycles                        : 48
>> power_on_hours                      : 6342
>> unsafe_shutdowns                    : 38
>> media_errors                        : 0
>> num_err_log_entries                 : 0
>> Warning Temperature Time            : 0
>> Critical Composite Temperature Time : 0
>
> This suggests that the only 'badness' is 38 unsafe shutdowns (likely
> power failures), since either there were a bunch all at once (maybe
> when installing) or you've had power off events every week...

Could those be panics? This box typically runs for a fortnight and then
gets CURRENT bumped. That also feels like a lot more panics than I would
have had, though; I don't know if these counters were zero on purchase.
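
To at least have a baseline going forward, I'll start snapshotting the
counters; a minimal sketch via root's crontab (the log path is just my
choice):

# append a timestamped SMART/health snapshot once a day
0 3 * * * (date; /sbin/nvmecontrol logpage -p 2 nvme1) >> /var/log/nvme1-smart.log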

> There have been no reported media errors (or the drive hasn't done a
> good job of remembering them, though most NVMe drives are better than
> most at that).
> 
>> Temperature Sensor 1                : 39 C
>> Temperature Sensor 2                : 43 C
>> Thermal Management T1 Trans Count   : 0
>> Thermal Management T2 Trans Count   : 0
>> Thermal Management T1 Total Time    : 0
>> Thermal Management T2 Total Time    : 0
>
> There's been no time where the drive overheated either. That's good.
> 
>> [1]: zpool status
>> status: One or more devices are faulted in response to persistent errors.
>>         Sufficient replicas exist for the pool to continue functioning in a
>>         degraded state.
>> action: Replace the faulted device, or use 'zpool clear' to mark the device
>>         repaired.
>>   scan: scrub repaired 0B in 00:17:59 with 0 errors on Thu Oct 31 16:24:36 2024
>> config:
>> 
>>         NAME          STATE     READ WRITE CKSUM
>>         zroot         DEGRADED     0     0     0
>>           mirror-0    DEGRADED     0     0     0
>>             gpt/zfs0  ONLINE       0     0     0
>>             gpt/zfs1  FAULTED      0     0     0  too many errors
>
> I'm not sure how to reconcile this in the face of the above. I'd have to see the
> dmesg / messages logs for any non-boot messages for nvme / nda.  For bad drives
> at work, I typically see something like:
>
> /var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Resetting controller due to a timeout.
> /var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Waiting for reset to complete
> /var/log/messages.0.bz2:Nov  3 02:49:05 c001 kernel: nvme2: controller ready did not become 0 within 10500 ms
>
> for drives that just 'hang', which would cause ZFS to drop them out.
> I'd see if there's new firmware, or return the drive.

I will give this a go if the issue returns; it's the only one I've had
in a year so far.
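
For checking the firmware revision from base (a sketch, assuming
`nvmecontrol identify` still prints a Firmware Version line):

$ nvmecontrol identify nvme1 | grep -i firmware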

> I also see:
> nvme8: READ sqid:3 cid:117 nsid:1 lba:1875786352 len:1024
> nvme8: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 prp1:0x40defd000 prp2:0x1395a2400
> nvme8: cdw10: 0x6fce3a70 cdw11:0 cdw12:0x3ff cdw13:0 cdw14:0 cdw15:0
> nvme8: UNRECOVERED READ ERROR (02/81) crd:0 m:1 dnr:1 p:1 sqid:3 cid:117 cdw0:0
> (nda8:nvme8:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6fce3a70 0 3ff 0 0 0
> (nda8:nvme8:0:0:1): CAM status: NVME Status Error
> (nda8:nvme8:0:0:1): Error 5, Retries exhausted
> g_vfs_done():nda8p8[READ(offset=960402063360, length=1048576)]error = 5
>
> when there's a media error. But the brand of NVMe drives we buy
> reports this as an error:
>
> c029.for002.ix# nvmecontrol logpage -p 2 nvme8
> SMART/Health Information Log
> ============================
> Critical Warning State:         0x04
>  Available spare:               0
>  Temperature:                   0
>  Device reliability:            1
>  Read only:                     0
>  Volatile memory backup:        0
> [[... but this says the drive has lost data ]]
> Power cycles:                   106
> Power on hours:                 30250
> Unsafe shutdowns:               19
> Media errors:                   3
> No. error info log entries:     3
> Warning Temp Composite Time:    0
> Error Temp Composite Time:      0
> Temperature 1 Transition Count: 0
> Temperature 2 Transition Count: 0
> Total Time For Temperature 1:   0
> Total Time For Temperature 2:   0
>
> so there are 3 media errors. I can read the log page to find the LBA
> too (I'm working on enhancing the errors we report for NVMe to include
> the LBA of the first error too, but that's not there yet).
>
> But since you don't have any media errors, I'd check history to see if
> the nvme drives are resetting (either successfully or not). But I don't
> know how to get that data from just the drive logs.
>
> Warner

There's nothing at all prior to last week, and I assume the unsafe
shutdowns are from unrelated panics while running CURRENT over the past
year.

I'll cross my fingers and hope this is a transient bus problem.

thanks!
Dave
———
O for a muse of fire, that would ascend the brightest heaven of invention!