Re: nvme controller reset failures on recent -CURRENT
Date: Tue, 13 Feb 2024 05:31:46 UTC
On Mon, Feb 12, 2024 at 9:15 PM Don Lewis <truckman@freebsd.org> wrote:

> On 12 Feb, Maxim Sobolev wrote:
> > Might be an overheating. Today's nvme drives are notoriously flaky if you
> > run them without proper heat sink attached to it.
>
> I don't think it is a thermal problem. According to the drive health
> page, the device temperature has never reached Temperature 2, whatever
> that is. The room temperature is around 65F. The system was stable
> last summer when the room temperature spent a lot of time in the 80-85F
> range. The device temperature depends a lot on the I/O rate, and the
> last panic happened when the I/O rate had been below 40tps for quite a
> while.
>

It did reach temperature 1, though. That's the 'Warning: this drive is
too hot' temperature. The drive has spent 41213 minutes of its 19297
hours of uptime there, an average of about 2 minutes per hour. That's
too much. Temperature 2 is the critical error: the drive is about to
shut down completely because it is too hot. In many drives it's only a
couple of degrees below the hardware power-off temperature. Some really
cheap ones don't implement it at all. On my card with the bad heat
sink, the warning temp is 70C and critical is 75C, while IIRC thermal
shutdown is 78C or 80C. I don't think we report these values in
nvmecontrol identify, but you can do a raw dump with -x and look at
bytes 266:267 for warning and 268:269 for critical.

In contrast, I have a few dozen drives, all of which have been abused
in various ways, and only one of them has any heat issues. That one is
an engineering special / sample with what I think is a damaged heat
sink. If your card has no heat sink, this could well be what's going on.

This panic means "the nvme card lost its mind and stopped talking to
the host". Its status registers read 0xff's, which means that the card
isn't decoding bus signals. Usually this means that the firmware on the
card has faulted and rebooted.
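
As an aside on the raw dump mentioned above: once the identify data is
available as a raw byte buffer, the two thresholds can be pulled out
with a few lines of Python. This is only a sketch. It assumes the
buffer is the 4096-byte Identify Controller structure, that the two
fields are little-endian 16-bit values in Kelvin, and that a value of 0
means the drive does not report the threshold. The function name and
the sample buffer are made up for illustration, and turning the -x hex
output into bytes is not shown.

```python
import struct

# Byte offsets into the Identify Controller data, as given in the text above.
WCTEMP_OFFSET = 266  # warning temperature threshold (bytes 266:267)
CCTEMP_OFFSET = 268  # critical temperature threshold (bytes 268:269)


def decode_thresholds(identify: bytes) -> dict:
    """Decode warning/critical thresholds (hypothetical helper).

    Assumes little-endian 16-bit Kelvin values; 0 is treated as
    "not reported" and yields None for the Celsius value.
    """
    if len(identify) < CCTEMP_OFFSET + 2:
        raise ValueError("identify buffer too short")
    warning_k, critical_k = struct.unpack_from("<HH", identify, WCTEMP_OFFSET)

    def to_celsius(kelvin: int):
        return None if kelvin == 0 else round(kelvin - 273.15, 2)

    return {
        "warning_k": warning_k,
        "warning_c": to_celsius(warning_k),
        "critical_k": critical_k,
        "critical_c": to_celsius(critical_k),
    }


# Synthetic buffer for illustration only; not a dump from any real drive.
sample = bytearray(4096)
struct.pack_into("<HH", sample, WCTEMP_OFFSET, 343, 348)
print(decode_thresholds(bytes(sample)))
```

Comparing the decoded thresholds with the temperature in the health log
shows how much margin the drive has before it starts complaining.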
If the card is overheating, then this could well be what's happening.

There's a tiny chance that this could be something more exotic, but my
money is on hardware gone bad after 2 years of service. I don't think
this is 'wear out' of the NAND: it's only 15TB written, though it could
be if this drive has really crappy NAND (first generation QLC maybe, but
it seems too new). It might also be a connector problem that's developed
over time. There might be a few other things too, but I don't think this
is a U.2 drive with funky cables.

Warner

> > On Mon, Feb 12, 2024, 4:28 PM Don Lewis <truckman@freebsd.org> wrote:
> >
> >> I just upgraded my package build machine to:
> >>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
> >> from:
> >>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
> >> and I've had two nvme-triggered panics in the last day.
> >>
> >> nvme is being used for swap and L2ARC. I'm not able to get a crash
> >> dump, probably because the nvme device has gone away and I get an error
> >> about not having a dump device. It looks like a low-memory panic
> >> because free memory is low and zfs is calling malloc().
> >>
> >> This shows up in the log leading up to the panic:
> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
> >> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
> >>
> >> The device looks healthy to me:
> >> SMART/Health Information Log
> >> ============================
> >> Critical Warning State:         0x00
> >>  Available spare:               0
> >>  Temperature:                   0
> >>  Device reliability:            0
> >>  Read only:                     0
> >>  Volatile memory backup:        0
> >> Temperature:                    312 K, 38.85 C, 101.93 F
> >> Available spare:                100
> >> Available spare threshold:      10
> >> Percentage used:                3
> >> Data units (512,000 byte) read: 5761183
> >> Data units written:             29911502
> >> Host read commands:             471921188
> >> Host write commands:            605394753
> >> Controller busy time (minutes): 32359
> >> Power cycles:                   110
> >> Power on hours:                 19297
> >> Unsafe shutdowns:               14
> >> Media errors:                   0
> >> No. error info log entries:     0
> >> Warning Temp Composite Time:    0
> >> Error Temp Composite Time:      0
> >> Temperature 1 Transition Count: 5231
> >> Temperature 2 Transition Count: 0
> >> Total Time For Temperature 1:   41213
> >> Total Time For Temperature 2:   0
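
The figures quoted from the SMART log above can be cross-checked with a
little arithmetic. The sketch below uses only numbers from that log. It
keeps the reply's reading of "Total Time For Temperature 1" as minutes,
which is an assumption; if the drive counts that field in another unit,
the ratios are in that unit instead. The function and variable names
are made up for illustration.

```python
# Values copied from the quoted SMART/Health Information Log.
POWER_ON_HOURS = 19297
TEMP1_TOTAL_TIME = 41213       # unit assumed to be minutes, as in the reply
TEMP1_TRANSITIONS = 5231
DATA_UNITS_WRITTEN = 29911502
DATA_UNIT_BYTES = 512_000      # "Data units (512,000 byte)" in the log


def thermal_summary(total_time, transitions, power_on_hours):
    """Return (time above Temperature 1 per power-on hour,
    average time per Temperature 1 excursion)."""
    per_hour = total_time / power_on_hours
    per_excursion = total_time / transitions if transitions else 0.0
    return per_hour, per_excursion


def terabytes_written(data_units, unit_bytes=DATA_UNIT_BYTES):
    """Convert SMART data units to decimal terabytes (10**12 bytes)."""
    return data_units * unit_bytes / 1e12


per_hour, per_excursion = thermal_summary(
    TEMP1_TOTAL_TIME, TEMP1_TRANSITIONS, POWER_ON_HOURS
)
print(f"time above Temperature 1 per power-on hour: {per_hour:.2f}")
print(f"average time per excursion: {per_excursion:.2f}")
print(f"data written: {terabytes_written(DATA_UNITS_WRITTEN):.1f} TB")
```

The per-hour ratio is the "about 2 per hour" figure from the reply, and
the data-written conversion is where the 15TB number comes from.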