Re: nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS

From: Warner Losh <imp_at_bsdimp.com>
Date: Sat, 17 Jul 2021 15:46:06 UTC
On Sat, Jul 17, 2021 at 6:33 AM Graham Perrin <grahamperrin@gmail.com>
wrote:

> When the file system is stress-tested, it seems that the device (an
> internal drive) is lost.
>

This is most likely a drive problem. Netflix pushes half a dozen different
lower-end models of NVMe drives to their physical limits w/o seeing issues
like this.

That said, our screening process rejects several low-quality drives that
just lose their minds from time to time.


> A recent photograph:
>
> <https://photos.app.goo.gl/wB7gZKLF5PQzusrz7>
>
> Transcribed manually:
>
> nvme0: Resetting controller due to a timeout.
> nvme0: resetting controller
> nvme0: controller ready did not become 0 within 5500 ms
>

Here the controller failed hard. We were unable to reset it within the
5.5-second timeout (the 5500 ms in the message above). One might be able to
tweak the timeouts to cope with the drive better. Do you have to power cycle
to get it to respond again?


> nvme0: failing outstanding i/o
> nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64
> nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:115 cdw0:0
> g_vfs_done():nvd0p2[WRITE(offset=151370924032, length=32768)]error = 6
> UFS: forcibly unmounting /dev/nvd0p2 from /
> nvme0: failing outstanding i/o
>
> … et cetera.
>
> Is this a sure sign of a hardware problem? Or must I do something
> special to gain reliability under stress?
>

It's most likely a hardware problem. That said, I've been working on
patches to improve recovery when errors like this happen.
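
For anyone cross-checking the transcribed log: the nvme0 WRITE line and the
g_vfs_done() line appear to describe the same 32 KiB write. A minimal sketch
of that arithmetic follows, assuming 512-byte logical sectors (the log does
not state the sector size; it is inferred from len:64 matching length=32768).
The helper names are hypothetical, and the status and errno labels are the
standard NVMe generic status 07h and errno 6 names rather than anything taken
from the driver source.

```python
import errno

# Assumption: 512-byte logical sectors (not stated anywhere in the log).
SECTOR_SIZE = 512

# Values copied from the transcribed console output.
nvme_lba = 296178856          # "WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64"
nvme_len_blocks = 64
geom_offset = 151370924032    # "g_vfs_done():nvd0p2[WRITE(offset=..., length=32768)]"
geom_length = 32768
geom_error = 6                # "error = 6"
status = (0x00, 0x07)         # "(00/07)": status code type / status code

# Generic command status 07h in the NVMe specification.
NVME_GENERIC_STATUS = {0x07: "Command Abort Requested"}


def blocks_to_bytes(blocks, sector_size=SECTOR_SIZE):
    """Convert a block count or LBA to bytes."""
    return blocks * sector_size


def implied_partition_start(disk_lba, part_offset, sector_size=SECTOR_SIZE):
    """Return the LBA at which the partition would have to start for the
    whole-disk LBA and the in-partition byte offset to be the same write."""
    delta = blocks_to_bytes(disk_lba, sector_size) - part_offset
    if delta < 0 or delta % sector_size:
        raise ValueError("offsets are not consistent with this sector size")
    return delta // sector_size


same_length = blocks_to_bytes(nvme_len_blocks) == geom_length
start_lba = implied_partition_start(nvme_lba, geom_offset)
errno_name = errno.errorcode[geom_error]
sct, sc = status
status_name = NVME_GENERIC_STATUS[sc] if sct == 0 else "unknown"

print(f"lengths match: {same_length}")
print(f"implied start of nvd0p2: LBA {start_lba}")
print(f"error = {geom_error} -> {errno_name}")
print(f"status ({sct:02x}/{sc:02x}) -> {status_name}")
```

Under the 512-byte assumption the two lines agree on length, and nvd0p2
would start at LBA 532520. Error 6 is ENXIO, which is consistent with the
device having gone away rather than with a media error on that block.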


> I don't know how to interpret parts of the manual page for nvme(4). There's
> direction to include this line in loader.conf(5):
>
> nvme_load="YES"
>
> – however when I used kldload(8), it seemed that the module was already
> loaded, or in kernel.
>

Yes. If you are using the drive at all, you already have the driver, either
loaded as a module or compiled into the kernel.


> Using StressDisk:
>
> <https://github.com/ncw/stressdisk>
>
> – failures typically occur after around six minutes of testing.
>

Do you have a number of these drives, or is it just this one bad apple?


> The drive is very new, less than 2 TB written:
>
> <https://bsd-hardware.info/?probe=7138e2a9e7&log=smartctl>
>
> I do suspect a hardware problem, because two prior installations of
> Windows 10 became non-bootable.
>

That's likely a huge red flag.


> Also: I find peculiarities with use of fsck_ffs(8), which I can describe
> later. Maybe to be expected, if there's a problem with the drive.
>

You can ask Kirk, but if data isn't written to the drive when the firmware
crashes, then there may be data loss.

Warner