Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT analyzing the same file system that resulted from the 13-STABLE crash)
- Reply: bob prohaska : "Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT analyzing the same file system that resulted from the 13-STABLE crash)"
- In reply to: Mark Millard : "Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT analyzing the same file system that resulted from the 13-STABLE crash)"
Date: Mon, 20 Feb 2023 11:47:30 UTC
> On Feb 20, 2023, at 01:00, Mark Millard <marklmi@yahoo.com> wrote:
>
> On Feb 19, 2023, at 21:50, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On Feb 19, 2023, at 20:45, bob prohaska <fbsd@www.zefox.net> wrote:
>>
>>> On Sun, Feb 19, 2023 at 02:35:15PM -0800, Mark Millard wrote:
>>>>
>>>> Kirk likely monitors the freebsd-fs list.
>>>
>>> I didn't notice there was such a list 8-\
>>>
>>>> Kirk likely does not monitor the freebsd-arm list.
>>>> None of us thought to switch to freebsd-fs at the
>>>> time. The only part of your context that ended up
>>>> to be arm specific was original buildworld crash.
>>>> You definitely started in an appropriate place
>>>> (freebsd-arm). After the crash, the rest was more
>>>> general relative to platforms and more specific
>>>> relative to file system handling (UFS support).
>>>>
>>>> I do not see any reason for any of this exchange
>>>> to go to any lists, given the current status.
>>>
>>> Alas, the story's not over yet 8-(
>>>
>>> After getting the disk fsck'd and booting once more,
>>> an attempt to buildworld using a fresh /usr/src
>>> and empty /usr/obj crashed again,
>>
>> I'm confused. The original crash was reported to be
>> on a RPi2B using a armv7 kernel, or so I thought.
>> (The RPi3B was for later fsck_ffs activity for the
>> media's UFS.)
>>
>> This new material indicates a RPi3B arm64 (aarch64)
>> context for this buildworld failure. Is it the same
>> media as for the prior buildworld failure?
>>
>>> in I think the
>>> same way. This time some notes have been collected
>>> at
>>> http://www.zefox.net/~fbsd/rpi3/scsi_status_error/readme
>>>
>>> To a casual glance, it looks like a hardware error.
>>> But, the machine seems to work fine until it's running
>>> buildworld, and then crashes during a relatively easy
>>> part of buildworld. The initial error message is:
>>>
>>> bob@pelorus:/usr/src % (da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 43 29 d6 40 00 00 40 00
>>> (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
>>> (da0:umass-sim0:0:0:0): SCSI status: Check Condition
>>> (da0:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
>>> (da0:umass-sim0:0:0:0): Error 5, Unretryable error
>>
>> A description of "Media Error" from seagate is:
>>
>> Medium Error - Indicates the command terminated with a nonrecovered error condition, probably caused by a flaw in the medium or an error in the recorded data.
>>
>> To compare/contrast with other alternatives, see:
>>
>> https://www.seagate.com/support/kb/scsi-sense-key-chart-196259en/
>>
>> A more extensive list with asc/ascq involved as well is at:
>>
>> https://en.wikipedia.org/wiki/Key_Code_Qualifier/
>>
>> Allowing more comparison/contrast with other classifications.
>>
>> It indicates:
>>
>> 3 11 00 Medium Error - unrecovered read error
>>
>> (matching the reported text).
>>
>>> SCSI errors are not unknown, but they usually succeed on retry.
>>> It's not obvious why this is treated as un-retryable.
>>
>> Because that is what the "3 11 00" combination involved
>> means. The drive is reporting that. It is not a FreeBSD
>> driver choice of handling.
>>
>> (I'm not expert at drive internals, so I take it at face
>> value.)
>>
>>> Are there any simple tests that might help decide what's wrong?
>>> It's likely that re-running buildworld will reproduce the crash.
>>
>> See the https://en.wikipedia.org/wiki/Key_Code_Qualifier/
>> description material for some background information?
>>
>>> I've placed the results of smartctl -a at the end of the notes.
>>> The interpretation isn't self evident, hopefully someone else
>>> can lend an eye. I'll try smartctl -t after a good night's sleep.
>>
>> man smartctl reports:
>>
>> UNC: UNCorrectable Error in Data
>>
>> The 3 examples of:
>>
>> After command completion occurred, registers were:
>> ER ST SC SN CL CH DH
>> -- -- -- -- -- -- --
>> 40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
>>
>> indicate UNC. All 3 list the same LBA value.
>
> Turns out that the LBA value is likely garbage, given the
> size of your drive (> 128 GiBytes):

But we do have an address from the failed SCSI command:

READ(10). CDB: 28 00 43 29 d6 40 00 00 40 00

Decoded, that says: read, starting at block 0x4329d640 (1126815296),
length 0x40 (64) blocks. If the block size is 512 bytes, that is about
577 GB, a bit over half a terabyte, into the disk.

This shell command should replicate the read:

# dd if=/dev/da0 of=/dev/null bs=32768 count=1 skip=17606489

The device name (if=) comes from the error message "da0:umass-sim0:0:0:0".
The block size (bs=) matches the size of the failed read request:
64 blocks * 512 bytes = 32768 bytes.
The skip count is 0x4329d640 (disk block) / 64 (number of disk blocks
per dd block) = 17606489.

If you can reproduce the error with dd, you can then try a binary search
over the 64-block range until you find the block that failed.
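
For what it's worth, the decoding and the binary search can be sketched as a few sh functions. This is only a rough sketch under stated assumptions: the function names (decode_read10, dd_args, read_ok, find_bad) are made up for illustration, the logical block size is assumed to be 512 bytes, and the bisection assumes exactly one unreadable block in a range that has already failed as a whole. A marginal sector may read fine on a later attempt, so the result may not be conclusive.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; 512-byte logical blocks assumed.

SECTOR=512               # assumed logical block size in bytes
DEV=${DEV:-/dev/da0}     # device name taken from the error message

# decode_read10 B0 ... B9: print "LBA LENGTH" (decimal) for a READ(10) CDB
# given as ten two-digit hex bytes. Bytes 2-5 hold the big-endian start
# block, bytes 7-8 the transfer length in blocks.
decode_read10() {
    [ "$1" = "28" ] || { echo "not a READ(10) CDB" >&2; return 1; }
    echo "$(( 0x$3$4$5$6 )) $(( 0x$8$9 ))"
}

# dd_args LBA LENGTH: print a dd command that repeats the read in one
# transfer. skip= is counted in bs-sized units, hence LBA / LENGTH; that
# only lands on the right offset when LBA is a multiple of LENGTH.
dd_args() {
    [ $(( $1 % $2 )) -eq 0 ] || { echo "LBA not a multiple of length" >&2; return 1; }
    echo "dd if=$DEV of=/dev/null bs=$(( $2 * SECTOR )) count=1 skip=$(( $1 / $2 ))"
}

# read_ok START COUNT: true if COUNT blocks starting at START read cleanly.
read_ok() {
    dd if="$DEV" of=/dev/null bs="$SECTOR" skip="$1" count="$2" 2>/dev/null
}

# find_bad START COUNT: bisect a range already known to fail as a whole,
# assuming it contains exactly one unreadable block; prints its number.
find_bad() {
    lo=$1 n=$2
    while [ "$n" -gt 1 ]; do
        half=$(( n / 2 ))
        if read_ok "$lo" "$half"; then
            lo=$(( lo + half )); n=$(( n - half ))   # lower half is clean
        else
            n=$half                                  # failure is in the lower half
        fi
    done
    echo "$lo"
}

# Usage (the last step needs read access to the raw device):
#   set -- $(decode_read10 28 00 43 29 d6 40 00 00 40 00)
#   dd_args "$1" "$2"
#   find_bad "$1" "$2"
```

The functions only print or probe; nothing is written to the disk. Reading with bs=512 makes each probe as narrow as the bisection asks for, at the cost of more commands sent to the drive.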