[Bug 262969] NVMe - Resetting controller due to a timeout and possible hot unplug

From: <bugzilla-noreply_at_freebsd.org>
Date: Mon, 26 Jun 2023 17:25:42 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262969

--- Comment #14 from Timothy Guo <firemeteor@users.sourceforge.net> ---
(In reply to crb from comment #11)

I would like to share my follow-up experience with this issue.

In short, the problem magically went away after I wiped the disk and recreated
the pool from backup. The same system (hardware and software) has been working
without issue for about half a year now. Unfortunately, I couldn't identify a
conclusive culprit during the entire procedure.

One thing I would also like to note is the 3.3V rail of the PSU. While I was
still suffering from the issue, I also discovered an under-voltage on the 3.3V
rail, probably thanks to the hint from @crb's bug. I first read the
out-of-range voltage value in the BIOS, and then confirmed it by measuring
directly at the PSU pin-out with a voltmeter. So it's true that the issue could
really be power related. Unfortunately, I can't tell which is the culprit: is
the NVMe drawing too much power due to a firmware bug? Or is a failing PSU
leading to NVMe failure?

I contacted my PSU vendor and was told that the wire connector may have aged,
increasing its resistance. Maybe my voltage measuring attempt fixed the wiring
connection; maybe the wipe and rebuild worked around a potential firmware bug.
The issue went away as suddenly as it came. (Note: I can't recall any
re-assembly of the hardware around the time it first appeared, though.)

The only thing I'm sure of is that the power failure is real and highly
related. A stronger PSU might have simply avoided the problem altogether?
