Re: nvme controller reset failures on recent -CURRENT
- Reply: Don Lewis : "Re: nvme controller reset failures on recent -CURRENT"
- In reply to: Don Lewis : "nvme controller reset failures on recent -CURRENT"
Date: Tue, 13 Feb 2024 03:03:29 UTC
On Mon, Feb 12, 2024 at 04:28:10PM -0800, Don Lewis wrote:
> I just upgraded my package build machine to:
> FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
> from:
> FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
> and I've had two nvme-triggered panics in the last day.
>
> nvme is being used for swap and L2ARC. I'm not able to get a crash
> dump, probably because the nvme device has gone away and I get an error
> about not having a dump device. It looks like a low-memory panic
> because free memory is low and zfs is calling malloc().
>
> This shows up in the log leading up to the panic:
> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.

Are you by chance using the drive mentioned here?
https://github.com/openzfs/zfs/discussions/14793

I was bitten by that and ended up replacing the drive with a different
model. The crash manifested exactly as you describe, though I didn't
have L2ARC or swap enabled on it.

> The device looks healthy to me:
> SMART/Health Information Log
> ============================
> Critical Warning State:         0x00
>  Available spare:               0
>  Temperature:                   0
>  Device reliability:            0
>  Read only:                     0
>  Volatile memory backup:        0
> Temperature:                    312 K, 38.85 C, 101.93 F
> Available spare:                100
> Available spare threshold:      10
> Percentage used:                3
> Data units (512,000 byte) read: 5761183
> Data units written:             29911502
> Host read commands:             471921188
> Host write commands:            605394753
> Controller busy time (minutes): 32359
> Power cycles:                   110
> Power on hours:                 19297
> Unsafe shutdowns:               14
> Media errors:                   0
> No. error info log entries:     0
> Warning Temp Composite Time:    0
> Error Temp Composite Time:      0
> Temperature 1 Transition Count: 5231
> Temperature 2 Transition Count: 0
> Total Time For Temperature 1:   41213
> Total Time For Temperature 2:   0