Re: nvme controller reset failures on recent -CURRENT
- In reply to: Mark Johnston : "Re: nvme controller reset failures on recent -CURRENT"
Date: Tue, 13 Feb 2024 04:06:21 UTC
On 12 Feb, Mark Johnston wrote:
> On Mon, Feb 12, 2024 at 04:28:10PM -0800, Don Lewis wrote:
>> I just upgraded my package build machine to:
>>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
>> from:
>>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
>> and I've had two nvme-triggered panics in the last day.
>>
>> nvme is being used for swap and L2ARC.  I'm not able to get a crash
>> dump, probably because the nvme device has gone away and I get an error
>> about not having a dump device.  It looks like a low-memory panic
>> because free memory is low and zfs is calling malloc().
>>
>> This shows up in the log leading up to the panic:
>> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
>> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
>> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
>> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
>> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
>
> Are you by chance using the drive mentioned here?
> https://github.com/openzfs/zfs/discussions/14793
>
> I was bitten by that and ended up replacing the drive with a different
> model.  The crash manifested exactly as you describe, though I didn't
> have L2ARC or swap enabled on it.

Nope:
nda0 at nvme0 bus 0 scbus9 target 0 lun 1
nda0: <INTEL SSDPEKNW512G8 002C BTNH940617WE512A>
nda0: Serial Number BTNH940617WE512A
nda0: nvme version 1.3
nda0: 488386MB (1000215216 512 byte sectors)

I'm not seeing super high I/O rates.  I happened to have iostat running
when the machine panicked:

   0   584  88.4   31   2.68  65.8  112   7.18  68.2  107   7.13  80  0 20  0  0
   0   565  99.1   32   3.06  27.9   74   2.01  30.5   70   2.08  80  0 20  0  0
   0   612  92.8   31   2.77  18.9  148   2.74  18.9  148   2.73  86  0 14  0  0
   0   618  88.6   13   1.17  25.0   59   1.44  24.2   61   1.44  89  0 11  0  0
   0   586  45.4    5   0.22  31.4   55   1.70  30.8   57   1.70  84  0 16  0  0
   0   598  12.7    3   0.03  38.1   64   2.40  37.1   66   2.40  84  0 16  0  0
   0   675  36.1    6   0.21  23.7  156   3.62  22.7  164   3.63  88  0 12  0  0
   0   641   6.9    6   0.04  25.7  243   6.10  25.3  246   6.08  71  0 29  0  0
   0   737  20.1    9   0.18  36.4  148   5.24  37.2  144   5.24  78  0 22  0  0
   0   578  44.7   23   1.03  25.1  164   4.01  25.5  161   3.99  86  0 14  0  0
   0   608  70.3   15   1.06  51.1   64   3.19  51.3   64   3.19  89  0 11  0  0
   0   624  38.6    9   0.35  32.3  121   3.80  32.2  121   3.79  90  0 10  0  0
   0   577  80.6   16   1.28  37.8   66   2.44  36.5   69   2.46  90  0 10  0  0
       tty       nda0              ada0              ada1              cpu
 tin  tout  KB/t  tps   MB/s  KB/t  tps   MB/s  KB/t  tps   MB/s  us ni sy in id
   0   566  87.7   16   1.39  27.2   60   1.60  25.3   66   1.62  87  0 13  0  0
   0   599  77.2   11   0.83  17.4  391   6.66  17.3  395   6.66  74  0 26  0  0
   0   660  45.0    7   0.31  18.7  575  10.51  18.6  578  10.49  76  0 24  0  0
   0   615  37.7    8   0.31  24.0  303   7.11  24.0  303   7.11  58  0 42  0  0
Fssh_packet_write_wait: ... port 22: Broken pipe

ada* are old and slow spinning rust.

That report does mention something else that could also be a cause.  I
upgraded the motherboard BIOS around the same time.

When I get a chance, I'll drop back to the older FreeBSD version and see
if the problem goes away.