nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS: clean-then-dirty
- In reply to: Warner Losh : "Re: nvme(4) losing control, and subsequent use of fsck_ffs(8) with UFS"
Date: Sat, 17 Jul 2021 19:12:33 UTC
On 17/07/2021 16:46, Warner Losh wrote:

> On Sat, Jul 17, 2021 at 6:33 AM Graham Perrin
> <grahamperrin@gmail.com> wrote:
>
> > When the file system is stress-tested, it seems that the device (an
> > internal drive) is lost.
>
> This is most likely a drive problem. Netflix pushes half a dozen
> different lower-end models of NVMe drives to their physical limits
> w/o seeing issues like this.
>
> That said, our screening process screens out several low-quality
> drives that just lose their minds from time to time.
>
> > A recent photograph:
> >
> > <https://photos.app.goo.gl/wB7gZKLF5PQzusrz7>
> >
> > Transcribed manually:
> >
> > nvme0: Resetting controller due to a timeout.
> > nvme0: resetting controller
> > nvme0: controller ready did not become 0 within 5500 ms
>
> Here the controller failed hard. We were unable to reset it within 5
> seconds. One might be able to tweak the timeouts to cope with the
> drive better. Do you have to power cycle to get it to respond again?

More recently, testing with FreeBSD 14.0-CURRENT installed to a mobile
hard disk drive, with the one partition of the NVMe drive used entirely
for test data:

* the NVMe drive is not found following a restart of FreeBSD

* the NVMe drive is found when (for example) I key F9 for HP's startup
  manager, and then I can boot (from the mobile HDD) and FreeBSD does
  find the drive again.

> > nvme0: failing outstanding i/o
> > nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64
> > nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:115 cdw0:0
> > g_vfs_done():nvd0p2[WRITE(offset=151370924032, length=32768)]error = 6
> > UFS: forcibly unmounting /dev/nvd0p2 from /
> > nvme0: failing outstanding i/o
> >
> > … et cetera.
> >
> > Is this a sure sign of a hardware problem? Or must I do something
> > special to gain reliability under stress?
>
> It's most likely a hardware problem. That said, I've been working on
> patches to make the recovery when errors like this happen better.

Smart. Thanks.

> > I don't know how to interpret parts of the manual page for nvme(4).
> > There's direction to include this line in loader.conf(5):
> >
> > nvme_load="YES"
> >
> > – however when I used kldload(8), it seemed that the module was
> > already loaded, or in kernel.
>
> Yes. If you are using it at all, you have the driver.
>
> > Using StressDisk:
> >
> > <https://github.com/ncw/stressdisk>
> >
> > – failures typically occur after around six minutes of testing.
>
> Do you have a number of these drives, or is it just this one bad apple?
>
> > The drive is very new, less than 2 TB written:
> >
> > <https://bsd-hardware.info/?probe=7138e2a9e7&log=smartctl>
> >
> > I do suspect a hardware problem, because two prior installations of
> > Windows 10 became non-bootable.
>
> That's likely a huge red flag.

The computer (not mine) will be in my hands for the next thirty-six
hours or so. Then it will be seen by the assigned hardware specialist,
who will decide how to proceed. Whether it will be taken away for a
bench test diagnosis, I don't know. In due course I'll follow up, to the
list, with a final outcome.

> > Also: I find peculiarities with use of fsck_ffs(8), which I can
> > describe later. Maybe to be expected, if there's a problem with the
> > drive.
>
> You can ask Kirk, but if data isn't written to the drive when the
> firmware crashes, then there may be data loss.
>
> Warner

Blind cc Kirk on this occasion.

Re: the attached typescript file, a first run of fsck performed repairs
and marked the file system clean. A subsequent run performed repairs and
marked the file system dirty. I understand that with a probable hardware
problem, all bets are off :-) but still:

* clean-then-dirty raises an eyebrow.

The version.txt file (Thursday 2021-07-15 16:12:28 BST) relates to a
disk image that was provided to me, from which I performed the
installation of FreeBSD that I'm currently using to test. NB the patch
at the time.

Thanks all
Graham
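
A side note on the transcribed log, offered as a rough sketch rather than a diagnosis: the nvme0 WRITE line and the g_vfs_done() line look like two views of the same failed request, one in whole-disk sectors and one in bytes relative to nvd0p2. The check below assumes 512-byte logical sectors (an assumption; the sector size is not stated anywhere in this thread), and the helper names are made up for illustration.

```python
# Cross-check two lines from the manually transcribed log.
# ASSUMPTION: 512-byte logical sectors; the thread does not state the sector size.
SECTOR_SIZE = 512

# From: nvme0: WRITE sqid:2 cid:115 nsid:1 lba:296178856 len:64
nvme_lba = 296178856
nvme_len_sectors = 64

# From: g_vfs_done():nvd0p2[WRITE(offset=151370924032, length=32768)]error = 6
vfs_offset_bytes = 151370924032
vfs_length_bytes = 32768


def sectors_to_bytes(sectors, sector_size=SECTOR_SIZE):
    """Convert a sector count to bytes (hypothetical helper)."""
    return sectors * sector_size


def implied_partition_start(disk_lba, partition_offset_bytes,
                            sector_size=SECTOR_SIZE):
    """Sector at which the partition would have to start if both log
    lines describe the same write (hypothetical helper)."""
    sector_in_partition, remainder = divmod(partition_offset_bytes, sector_size)
    if remainder:
        raise ValueError("offset is not a whole number of sectors")
    return disk_lba - sector_in_partition


same_length = sectors_to_bytes(nvme_len_sectors) == vfs_length_bytes
start = implied_partition_start(nvme_lba, vfs_offset_bytes)

print("lengths agree:", same_length)
print("implied start sector of nvd0p2:", start)
```

Under that assumption the lengths agree (64 sectors is 32768 bytes) and nvd0p2 would begin at sector 532520 of the disk. Comparing that figure with the output of `gpart show nvd0` would confirm or refute the reading. For reference, error 6 on FreeBSD is ENXIO ("Device not configured"), which fits a device that has gone away.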