Re: nvme controller reset failures on recent -CURRENT

From: Pete Wright <pete_at_nomadlogic.org>
Date: Tue, 13 Feb 2024 19:56:52 UTC
>> There's a tiny chance that this could be something more exotic,
>> but my money is on hardware gone bad after 2 years of service. I don't think
>> this is 'wear out' of the NAND (it's only 15TB written, but it could be if this
>> drive is really really crappy nand: first generation QLC maybe, but it seems
>> too new). It might also be a connector problem that's developed over time.
>> There might be a few other things too, but I don't think this is a U.2 drive
>> with funky cables.
> The system was probably idle the majority of those two years of power on
> time.
>
> It's one of these:
> https://www.techpowerup.com/ssd-specs/intel-660p-512-gb.d437
> I've seen comments that these generally don't need cooling.
>
> I just ordered a heatsink with some nice big fins, but it will take a
> week or more to arrive.


just wanted to add another data point to this discussion.  i had a
Crucial NVMe drive in my workstation that recently started showing similar
problems.  after much debugging i came to the same conclusion: it
was getting too hot.  i went ahead and purchased a Sabrent NVMe drive
that came with a heat sink.  since then i've started making much heavier
use of my workstation (and the disk subsystem) and have had zero issues.

so lessons learnt:

1. M.2 NVMe really does need proper cooling, much more so than
traditional SATA/SAS/SCSI drives.

2. not all vendors do a great job reporting the health of their devices.
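
for anyone who wants to keep an eye on temperature themselves, here is a
rough sketch (not something i ran against the drives above) that decodes
the first few bytes of the NVMe SMART / Health Information log page (log
page 0x02).  per the NVMe spec, byte 0 is the critical warning bitfield
and bytes 1-2 are the composite temperature in kelvin, little-endian.
the function names and the 70 C limit are my own assumptions, so pick a
limit from your drive's datasheet.  how you get the raw page is up to
you; on FreeBSD i believe "nvmecontrol logpage -p 2 -b nvme0" dumps it in
binary, but check the man page before relying on that.

```python
import struct

# Hypothetical alert limit in Celsius; NOT taken from any vendor datasheet.
DEFAULT_LIMIT_C = 70


def parse_health_page(page: bytes) -> dict:
    """Decode a few fields from a raw NVMe SMART/Health log page (0x02)."""
    if len(page) < 6:
        raise ValueError("health log page is too short")
    critical_warning = page[0]
    # Composite temperature: 16-bit little-endian value in kelvin at offset 1.
    (temp_k,) = struct.unpack_from("<H", page, 1)
    return {
        "temp_c": temp_k - 273,  # spec reports whole kelvins
        "spare_low": bool(critical_warning & 0x01),     # bit 0
        "temp_warning": bool(critical_warning & 0x02),  # bit 1
        "available_spare_pct": page[3],
        "percentage_used": page[5],
    }


def too_hot(page: bytes, limit_c: int = DEFAULT_LIMIT_C) -> bool:
    """True if the drive flags a temperature warning or meets our own limit."""
    info = parse_health_page(page)
    return info["temp_warning"] or info["temp_c"] >= limit_c
```

run from cron against a saved copy of the page, something like this would
at least catch a drive that is cooking before the controller starts
resetting.  it does nothing for lesson 2, though: if the firmware reports
bogus numbers, the script will happily trust them.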

-pete

-- 
Pete Wright
pete@nomadlogic.org