[Bug 264141] nvme(4): Heavy load to SSD wedges 13.1 system: Controller in fatal status, resetting ... Resetting controller due to a timeout and possible hot unplug.

From: <bugzilla-noreply_at_freebsd.org>
Date: Sun, 22 May 2022 05:50:25 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264141

--- Comment #8 from crb <crb@ChrisBowman.com> ---
Replacing nvme with nda results in similar-looking messages from both nvme0 and
nda0 (these didn't show up in a remote ssh session, so I couldn't cut and
paste them).

I don't think the cards get too hot.  The machine has 3 fans that spin up with
CPU temperature and, as I mentioned earlier, the card has a heat sink.  When I
link while building world with 32 jobs I do hear the fans ramp up ever so
slightly, but mostly they're quiet.

I doubt it's cabling, as these SSDs were inserted directly into an M.2 slot and
I seated the last one securely a few days ago.

It could be power; this is a bit of a hacked system (I gutted a Sun Ultra 40
and replaced the contents with this, reusing the power supply), but I don't
have a way to eliminate power as a possibility right now.  Theoretically this
system should be able to deliver 1000W, and I only have the motherboard,
processor, 64 GB of memory, the SSD, 2 Ethernet cards (one a Mellanox CX3 using
fiber) and 6 spinning drives, which are basically quiet.  Power seems unlikely,
as the system seems otherwise rock solid under load except when hitting the SSD
hard.

This (unfortunately) seems to be completely repeatable now, simply by copying a
couple of repos over 10G Ethernet from a remote NFS machine to the local SSD
while the machine is otherwise completely idle.
