Re: nvme device errors & zfs
Date: Tue, 05 Nov 2024 01:14:43 UTC
On Mon, Nov 4, 2024 at 10:31 AM Dave Cottlehuber <dch@freebsd.org> wrote:
> What's the best way to see error counters or states on an nvme
> device?

Sadly, I think dmesg | grep nvme and/or trolling through
/var/log/messages. NVMe drives don't generally keep good counters of
errors...

> I have a typical mirrored nvme zpool, that reported enough errors
> in a burst last week, that 1 drive dropped off the bus [1].
>
> After a reboot, it resilvered, I cleared the errors, and it seems
> fine according to repeated scrubs and a few days of use.
>
> I was unable to see any errors from the nvme drive itself, but
> as it's (just) in warranty for 2 more weeks I'd like to know
> if I should return it.
>
> I installed ports `sysutils/nvme-cli` and didn't see anything
> of note there either:
>
> $ doas nvme smart-log /dev/nvme1
> 0xc0484e41: opc: 0x2 fuse: 0 cid 0 nsid:0xffffffff cmd2: 0 cmd3: 0
>           : cdw10: 0x7f0002 cdw11: 0 cdw12: 0 cdw13: 0
>           : cdw14: 0 cdw15: 0 len: 0x200 is_read: 0
> <--- 0 cid: 0 status 0
> Smart Log for NVME device:nvme1 namespace-id:ffffffff
> critical_warning                    : 0
> temperature                         : 39 C
> available_spare                     : 100%
> available_spare_threshold           : 10%
> percentage_used                     : 3%
> data_units_read                     : 121681067
> data_units_written                  : 86619659
> host_read_commands                  : 695211450
> host_write_commands                 : 2187823697
> controller_busy_time                : 2554
> power_cycles                        : 48
> power_on_hours                      : 6342
> unsafe_shutdowns                    : 38
> media_errors                        : 0
> num_err_log_entries                 : 0
> Warning Temperature Time            : 0
> Critical Composite Temperature Time : 0

This suggests that the only 'badness' is the 38 unsafe shutdowns
(likely power failures): either there were a bunch all at once (maybe
when installing) or you've had power-off events every week... There
have been no reported media errors (or the drive hasn't done a good
job of remembering them, though NVMe drives are better than most at
that).

> Temperature Sensor 1                : 39 C
> Temperature Sensor 2                : 43 C
> Thermal Management T1 Trans Count   : 0
> Thermal Management T2 Trans Count   : 0
> Thermal Management T1 Total Time    : 0
> Thermal Management T2 Total Time    : 0

There's been no time when the drive overheated, either. That's good.

> [1]: zpool status
> status: One or more devices are faulted in response to persistent errors.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
>         repaired.
>   scan: scrub repaired 0B in 00:17:59 with 0 errors on Thu Oct 31 16:24:36 2024
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         zroot         DEGRADED     0     0     0
>           mirror-0    DEGRADED     0     0     0
>             gpt/zfs0  ONLINE       0     0     0
>             gpt/zfs1  FAULTED      0     0     0  too many errors

I'm not sure how to reconcile this in the face of the above. I'd have
to see the dmesg / messages logs for any non-boot messages for nvme /
nda. For bad drives at work, I typically see something like:

/var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Resetting controller due to a timeout.
/var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Waiting for reset to complete
/var/log/messages.0.bz2:Nov  3 02:49:05 c001 kernel: nvme2: controller ready did not become 0 within 10500 ms

for drives that just 'hang', which would cause ZFS to drop them out.
I'd see if there's new firmware or return the drive.
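To check both of those quickly, something like the following should
work (a sketch, untested here; nvme1 is your unit from the smart-log
above, and the rotated-log names depend on your newsyslog setup):

  nvmecontrol identify nvme1 | grep -i firmware   # running firmware revision
  grep -E 'nvme|nda' /var/log/messages            # current log
  bzgrep -E 'nvme|nda' /var/log/messages.*.bz2    # rotated logs

Any 'Resetting controller' / timeout lines like the ones above would
point at the drive (or its firmware) rather than at ZFS.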
I also see:

nvme8: READ sqid:3 cid:117 nsid:1 lba:1875786352 len:1024
nvme8: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 prp1:0x40defd000 prp2:0x1395a2400
nvme8: cdw10: 0x6fce3a70 cdw11:0 cdw12:0x3ff cdw13:0 cdw14:0 cdw15:0
nvme8: UNRECOVERED READ ERROR (02/81) crd:0 m:1 dnr:1 p:1 sqid:3 cid:117 cdw0:0
(nda8:nvme8:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6fce3a70 0 3ff 0 0 0
(nda8:nvme8:0:0:1): CAM status: NVME Status Error
(nda8:nvme8:0:0:1): Error 5, Retries exhausted
g_vfs_done():nda8p8[READ(offset=960402063360, length=1048576)]error = 5

when there's a media error. But the brand of NVMe drives we buy
reports this as an error:

c029.for002.ix# nvmecontrol logpage -p 2 nvme8
SMART/Health Information Log
============================
Critical Warning State:         0x04
 Available spare:               0
 Temperature:                   0
 Device reliability:            1
 Read only:                     0
 Volatile memory backup:        0
[[... but this says the drive has lost data ]]
Power cycles:                   106
Power on hours:                 30250
Unsafe shutdowns:               19
Media errors:                   3
No. error info log entries:     3
Warning Temp Composite Time:    0
Error Temp Composite Time:      0
Temperature 1 Transition Count: 0
Temperature 2 Transition Count: 0
Total Time For Temperature 1:   0
Total Time For Temperature 2:   0

so there are 3 media errors. I can read the log page to find the LBA
too. (I'm working on enhancing the errors we report for NVMe to
include the LBA of the first error, but that's not there yet.)

Since you don't have any media errors, though, I'd check the history
to see if the nvme drives are resetting (either successfully or not).
But I don't know how to get that data from just the drive logs.

Warner
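P.S. Reading the error log page to get the LBA looks roughly like this
(a sketch; log page 1 is the NVMe Error Information log, and nvme8 is
the same example device as above):

  nvmecontrol logpage -p 1 nvme8        # FreeBSD base tool
  doas nvme error-log /dev/nvme8        # sysutils/nvme-cli equivalent

Each decoded entry should include the LBA of the command that failed.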