Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT analyzing the same file system that resulted from the 13-STABLE crash)

From: Mark Millard <marklmi_at_yahoo.com>
Date: Mon, 20 Feb 2023 06:00:35 UTC
On Feb 19, 2023, at 21:50, Mark Millard <marklmi@yahoo.com> wrote:

> On Feb 19, 2023, at 20:45, bob prohaska <fbsd@www.zefox.net> wrote:
> 
>> On Sun, Feb 19, 2023 at 02:35:15PM -0800, Mark Millard wrote:
>>> 
>>> Kirk likely monitors the freebsd-fs list.
>> 
>> I didn't notice there was such a list 8-\
>> 
>>> Kirk likely does not monitor the freebsd-arm list.
>>> None of us thought to switch to freebsd-fs at the
>>> time. The only part of your context that ended up
>>> being arm specific was the original buildworld crash.
>>> You definitely started in an appropriate place
>>> (freebsd-arm). After the crash, the rest was more
>>> general relative to platforms and more specific
>>> relative to file system handling (UFS support).
>>> 
>>> I do not see any reason for any of this exchange
>>> to go to any lists, given the current status.
>> 
>> Alas, the story's not over yet 8-(  
>> 
>> After getting the disk fsck'd and booting once more,
>> an attempt to buildworld using a fresh /usr/src
>> and empty /usr/obj crashed again,
> 
> I'm confused. The original crash was reported to be
> on a RPi2B using an armv7 kernel, or so I thought.
> (The RPi3B was for later fsck_ffs activity for the
> media's UFS.)
> 
> This new material indicates a RPi3B arm64 (aarch64)
> context for this buildworld failure. Is it the same
> media as for the prior buildworld failure?
> 
>> in I think the
>> same way. This time some notes have been collected
>> at
>> http://www.zefox.net/~fbsd/rpi3/scsi_status_error/readme
>> 
>> To a casual glance, it looks like a hardware error.
>> But, the machine seems to work fine until it's running
>> buildworld, and then crashes during a relatively easy
>> part of buildworld. The initial error message is:
>> 
>> bob@pelorus:/usr/src % (da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 43 29 d6 40 00 00 40 00 
>> (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
>> (da0:umass-sim0:0:0:0): SCSI status: Check Condition
>> (da0:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
>> (da0:umass-sim0:0:0:0): Error 5, Unretryable error
> 
> A description of "Medium Error" from Seagate is:
> 
> Medium Error - Indicates the command terminated with a nonrecovered error condition, probably caused by a flaw in the medium or an error in the recorded data.
> 
> To compare/contrast with other alternatives, see:
> 
> https://www.seagate.com/support/kb/scsi-sense-key-chart-196259en/
> 
> A more extensive list with asc/ascq involved as well is at:
> 
> https://en.wikipedia.org/wiki/Key_Code_Qualifier/
> 
> That allows more comparison/contrast with other classifications.
> 
> It indicates:
> 
> 3 11 00 Medium Error - unrecovered read error
> 
> (matching the reported text).
> 
>> SCSI errors are not unknown, but they usually succeed on retry.
>> It's not obvious why this is treated as un-retryable. 
> 
> Because that is what the "3 11 00" combination involved
> means. The drive is reporting that. It is not a FreeBSD
> driver choice of handling.
> 
> (I'm no expert on drive internals, so I take it at face
> value.)
> 
>> Are there any simple tests that might help decide what's wrong?
>> It's likely that re-running buildworld will reproduce the crash.
> 
> See the https://en.wikipedia.org/wiki/Key_Code_Qualifier/
> description material for some background information?
> 
>> I've placed the results of smartctl -a at the end of the notes. 
>> The interpretation isn't self evident, hopefully someone else
>> can lend an eye. I'll try smartctl -t after a good night's sleep. 
> 
> man smartctl reports:
> 
>                 UNC:   UNCorrectable Error in Data
> 
> The 3 examples of:
> 
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
> 
> indicate UNC. All 3 list the same LBA value.

It turns out that the LBA value shown in those log entries
is likely garbage, given the size of your drive (> 128 GiBytes):

Quoting the smartctl man page:

              (Because of the limitations of the SMART error log, if the
              LBA is greater than 0xfffffff, then either no error log
              entry will be made, or the error log entry will have an
              incorrect LBA.  This may happen for drives with a capacity
              greater than 128 GiB or 137 GB.)
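
As a rough check of that limit, here is a small sketch of the
arithmetic. It assumes 512-byte logical sectors (not confirmed for
this drive) and the standard READ(10) layout (opcode 0x28, big-endian
LBA in bytes 2-5, transfer length in bytes 7-8). The function names
are made up for illustration. It shows where the 128 GiB / 137 GB
figure comes from and that the LBA in the failing READ(10) CDB from
the console output is beyond what a 28-bit log field can hold:

```python
SECTOR_BYTES = 512          # assumption: 512-byte logical sectors
MAX_LBA28 = 0x0FFFFFFF      # largest LBA a 28-bit field can record


def lba28_capacity_bytes(sector_bytes=SECTOR_BYTES):
    """Bytes addressable with 28-bit LBAs (LBA 0 through MAX_LBA28)."""
    return (MAX_LBA28 + 1) * sector_bytes


def read10_lba(cdb_hex):
    """Return (lba, block_count) from a READ(10) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: LBA
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length
    return lba, blocks


if __name__ == "__main__":
    cap = lba28_capacity_bytes()
    print(cap / 2**30, "GiB")   # binary units
    print(cap / 10**9, "GB")    # decimal units
    lba, blocks = read10_lba("28 00 43 29 d6 40 00 00 40 00")
    print(hex(lba), blocks, lba > MAX_LBA28)
```

If those assumptions hold, the failing read is well past the 28-bit
range, which fits the log entries all showing 0x0fffffff instead of
a real address.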

Also, the man page's more detailed description of UNC is:

              UNC (UNCorrectable): data is uncorrectable.  This refers to data
              which has been read from the disk, but for which the Error
              Checking and Correction (ECC) codes are inconsistent.  In
              effect, this means that the data can not be read.
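
To tie that back to the register dump quoted earlier, here is a
sketch that decodes one of those lines. The bit assignments are my
understanding of the usual ATA register layout (Error register bit 6
= UNC, Status register bit 0 = ERR, 28-bit LBA built from the low
nibble of DH plus CH, CL and SN), so treat them as assumptions. The
function name is hypothetical:

```python
def decode_ata_error_registers(line):
    """Decode the 'ER ST SC SN CL CH DH' hex columns of a smartctl
    error-log line. Bit meanings are assumed, not taken from the thread."""
    names = ["ER", "ST", "SC", "SN", "CL", "CH", "DH"]
    values = [int(tok, 16) for tok in line.split()[:7]]
    regs = dict(zip(names, values))
    lba28 = (
        ((regs["DH"] & 0x0F) << 24)   # LBA bits 27..24
        | (regs["CH"] << 16)          # LBA bits 23..16
        | (regs["CL"] << 8)           # LBA bits 15..8
        | regs["SN"]                  # LBA bits 7..0
    )
    return {
        "unc": bool(regs["ER"] & 0x40),   # uncorrectable data error
        "err": bool(regs["ST"] & 0x01),   # command ended in error
        "lba28": lba28,
    }


if __name__ == "__main__":
    entry = "40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455"
    print(decode_ata_error_registers(entry))
```

Under those assumptions the decoded LBA is the all-ones 28-bit value,
matching what smartctl printed, so the field looks saturated rather
than pointing at a real sector.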

> 
> Error 4 occurred at disk power-on lifetime: 11121 hours (463 days + 9 hours)
> Error 3 occurred at disk power-on lifetime: 11098 hours (462 days + 10 hours)
> Error 2 occurred at disk power-on lifetime: 11096 hours (462 days + 8 hours)
> 
> So the errors are spread over a little more than a day overall
> (25 hours), with errors 2 and 3 only 2 hours apart.
> 
> It suggests to me that the drive is no longer usable.
> But I'm no expert.



===
Mark Millard
marklmi at yahoo.com