Re: nvme controller reset failures on recent -CURRENT

Reply: Don Lewis : "Re: nvme controller reset failures on recent -CURRENT"
In reply to: Don Lewis : "Re: nvme controller reset failures on recent -CURRENT"
Go to: [ bottom of page ] [ top of archives ] [ this month ]
From: Warner Losh <imp_at_bsdimp.com>
Date: Tue, 13 Feb 2024 05:31:46 UTC
On Mon, Feb 12, 2024 at 9:15 PM Don Lewis <truckman@freebsd.org> wrote:

> On 12 Feb, Maxim Sobolev wrote:
> > Might be an overheating. Today's nvme drives are notoriously flaky if you
> > run them without proper heat sink attached to it.
>
> I don't think it is a thermal problem.  According to the drive health
> page, the device temperature has never reached Temperature 2, whatever
> that is.  The room temperature is around 65F.  The system was stable
> last summer when the room temperature spent a lot of time in the 80-85F
> range.  The device temperature depends a lot on the I/O rate, and the
> last panic happened when the I/O rate had been below 40tps for quite a
> while.
>

It did reach temperature 1, though. That's the 'Warning this drive is too
hot' temperature. It has spent 41213 minutes of your 19297 hours of up
time, or an average of 2 minutes per hour. That's too much. Temperature
2 is critical error: we are about to shut down completely due to it
being too hot. It's only a couple degrees below hardware power off
due to temperature in many drives. Some really cheap ones don't really
implement it at all. On my card with the bad heat sink, Warning temp is
70C while critical is 75C while IIRC thermal shutdown is 78C or 80C.

I don't think we report these values in nvmecontrol identify. But you can
do a raw dump with -x look at bytes 266:267 for warning and 268:269
for critical.

In contrast, the few dozen drives that I have, all of which have been
abused in various ways, And only one of them has any heat issues,
and that one is an engineering special / sample with what I think is
a damaged heat sink. If your card has no heat sink, this could well
be what's going on.

This panic means "the nvme card lost its mind and stopped talking
to the host". Its status registers read 0xff's, which means that the card
isn't decoding bus signals. Usually this means that the firmware on the
card has faulted and rebooted. If the card is overheating, then this could
well be what's happening.

There's a tiny chance that this could be something more exotic,
but my money is on hardware gone bad after 2 years of service. I don't think
this is 'wear out' of the NAND (it's only 15TB written, but it could be if
this
drive is really really crappy nand: first generation QLC maybe, but it seems
too new). It might also be a connector problem that's developed over time.
There might be a few other things too, but I don't think this is a U.2 drive
with funky cables.

Warner


> > On Mon, Feb 12, 2024, 4:28 PM Don Lewis <truckman@freebsd.org> wrote:
> >
> >> I just upgraded my package build machine to:
> >>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
> >> from:
> >>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
> >> and I've had two nvme-triggered panics in the last day.
> >>
> >> nvme is being used for swap and L2ARC.  I'm not able to get a crash
> >> dump, probably because the nvme device has gone away and I get an error
> >> about not having a dump device.  It looks like a low-memory panic
> >> because free memory is low and zfs is calling malloc().
> >>
> >> This shows up in the log leading up to the panic:
> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a
> >> timeout a
> >> nd possible hot unplug.
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a
> >> timeout a
> >> nd possible hot unplug.
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
> >> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping
> watchdog
> >> ti
> >> meout.
> >>
> >> The device looks healthy to me:
> >> SMART/Health Information Log
> >> ============================
> >> Critical Warning State:         0x00
> >>  Available spare:               0
> >>  Temperature:                   0
> >>  Device reliability:            0
> >>  Read only:                     0
> >>  Volatile memory backup:        0
> >> Temperature:                    312 K, 38.85 C, 101.93 F
> >> Available spare:                100
> >> Available spare threshold:      10
> >> Percentage used:                3
> >> Data units (512,000 byte) read: 5761183
> >> Data units written:             29911502
> >> Host read commands:             471921188
> >> Host write commands:            605394753
> >> Controller busy time (minutes): 32359
> >> Power cycles:                   110
> >> Power on hours:                 19297
> >> Unsafe shutdowns:               14
> >> Media errors:                   0
> >> No. error info log entries:     0
> >> Warning Temp Composite Time:    0
> >> Error Temp Composite Time:      0
> >> Temperature 1 Transition Count: 5231
> >> Temperature 2 Transition Count: 0
> >> Total Time For Temperature 1:   41213
> >> Total Time For Temperature 2:   0
> >>
> >>
> >>
>
>
>