Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT analyzing the same file system that resulted from the 13-STABLE crash)

From: bob prohaska <fbsd_at_www.zefox.net>
Date: Sun, 26 Feb 2023 02:56:54 UTC

On Sun, Feb 19, 2023 at 09:50:45PM -0800, Mark Millard wrote:
> On Feb 19, 2023, at 20:45, bob prohaska <fbsd@www.zefox.net> wrote:
> 
> > 
> > To a casual glance, it looks like a hardware error.
> > But, the machine seems to work fine until it's running
> > buildworld, and then crashes during a relatively easy
> > part of buildworld. The initial error message is:
> > 
> > bob@pelorus:/usr/src % (da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 43 29 d6 40 00 00 40 00 
> > (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
> > (da0:umass-sim0:0:0:0): SCSI status: Check Condition
> > (da0:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> > (da0:umass-sim0:0:0:0): Error 5, Unretryable error
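For reference, the CDB bytes in the first line of that message can be decoded by hand. This is a minimal Python sketch, assuming the standard 10-byte READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8). The function name decode_read10 is made up for illustration, and the block size is not known from the log, so only block counts are shown.

```python
def decode_read10(cdb_hex: str) -> dict:
    """Decode a READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    return {
        # Bytes 2..5: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8: number of blocks requested, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


info = decode_read10("28 00 43 29 d6 40 00 00 40 00")
print(hex(info["lba"]), info["lba"], info["blocks"])
```

Under those assumptions the failed request was a read of 64 blocks starting at LBA 0x4329d640.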
> 
> A description of "Medium Error" from Seagate is:
> 
> Medium Error - Indicates the command terminated with a nonrecovered error condition, probably caused by a flaw in the medium or an error in the recorded data.
> 
> To compare/contrast with other alternatives, see:
> 
> https://www.seagate.com/support/kb/scsi-sense-key-chart-196259en/
> 
> A more extensive list with asc/ascq involved as well is at:
> 
> https://en.wikipedia.org/wiki/Key_Code_Qualifier/
> 
> Allowing more comparison/contrast with other classifications.
> 
> It indicates:
> 
> 3 11 00 Medium Error - unrecovered read error
> 
> (matching the reported text).
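The "3 11 00" form can be derived from the console line itself. This is a rough Python sketch, assuming the "SCSI sense: <KEY> asc:<asc>,<ascq>" text format seen in the log above with asc/ascq printed in hex. The helper name sense_triple and the small sense-key table are illustrative, not part of any FreeBSD tool.

```python
import re

# Standard SCSI sense key names and their numeric values (subset).
SENSE_KEYS = {
    "NO SENSE": 0x0,
    "RECOVERED ERROR": 0x1,
    "NOT READY": 0x2,
    "MEDIUM ERROR": 0x3,
    "HARDWARE ERROR": 0x4,
    "ILLEGAL REQUEST": 0x5,
}


def sense_triple(line: str) -> str:
    """Return 'key asc ascq' in hex for a CAM 'SCSI sense:' log line."""
    m = re.search(r"SCSI sense: ([A-Z ]+?) asc:([0-9a-f]+),([0-9a-f]+)", line)
    if m is None:
        raise ValueError("no SCSI sense information found")
    key = SENSE_KEYS[m.group(1)]
    asc = int(m.group(2), 16)
    ascq = int(m.group(3), 16)
    return f"{key:X} {asc:02X} {ascq:02X}"


line = ("(da0:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR "
        "asc:11,0 (Unrecovered read error)")
triple = sense_triple(line)
print(triple)
```

The resulting string is the key to look up in the tables linked above.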
> 
> > SCSI errors are not unknown, but they usually succeed on retry.
> > It's not obvious why this is treated as un-retryable. 
> 
> Because that is what the "3 11 00" combination means.
> The drive itself is reporting that. It is not a FreeBSD
> driver choice of handling.
> 
> (I'm not expert at drive internals, so I take it at face
> value.)
> 
> > Are there any simple tests that might help decide what's wrong?
> > It's likely that re-running buildworld will reproduce the crash.
> 
> The https://en.wikipedia.org/wiki/Key_Code_Qualifier/
> description material may give some background information.
> 
> > I've placed the results of smartctl -a at the end of the notes. 
> > The interpretation isn't self-evident; hopefully someone else
> > can lend an eye. I'll try smartctl -t after a good night's sleep. 
> 
> man smartctl reports:
> 
>                  UNC:   UNCorrectable Error in Data
> 
> The 3 examples of:
> 
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
> 
> indicate UNC. All 3 list the same LBA value.
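The register dump can be unpacked the same way. This is a small Python sketch, assuming the classic ATA task-file layout: ER and ST are bit fields, and a 28-bit LBA is assembled from SN, CL, CH and the low nibble of DH. The bit names are the traditional ones and may differ slightly between ATA revisions. The function names are illustrative.

```python
# Traditional bit names for the ATA error and status registers.
ERROR_BITS = {7: "ICRC", 6: "UNC", 5: "MC", 4: "IDNF",
              3: "MCR", 2: "ABRT", 1: "NM", 0: "AMNF"}
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 4: "DSC",
               3: "DRQ", 2: "CORR", 1: "IDX", 0: "ERR"}


def flags(value: int, names: dict) -> list:
    """List the names of the bits set in value, highest bit first."""
    return [names[bit] for bit in range(7, -1, -1) if value & (1 << bit)]


def decode_registers(regs_hex: str):
    """Decode 'ER ST SC SN CL CH DH' as printed by smartctl."""
    er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in regs_hex.split())
    # 28-bit LBA: low nibble of DH is the top 4 bits, SN the lowest byte.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags(er, ERROR_BITS), flags(st, STATUS_BITS), lba


err, status, lba = decode_registers("40 51 00 ff ff ff 0f")
print(err, status, hex(lba), lba)
```

Note that 0x0fffffff is the largest value a 28-bit LBA field can hold, so the number logged may not be the address of the sector that actually failed.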
> 
> Error 4 occurred at disk power-on lifetime: 11121 hours (463 days + 9 hours)
> Error 3 occurred at disk power-on lifetime: 11098 hours (462 days + 10 hours)
> Error 2 occurred at disk power-on lifetime: 11096 hours (462 days + 8 hours)
> 
> So the errors are spread over a little more than a day overall,
> with errors 2 and 3 a couple of hours apart.
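The timing arithmetic is easy to check. This is a short Python sketch using only the three power-on-hours figures quoted from the smartctl log; the variable names are made up for illustration.

```python
# Power-on hours at which each logged error occurred (from smartctl -a).
hours = {4: 11121, 3: 11098, 2: 11096}

for number, h in hours.items():
    days, rem = divmod(h, 24)
    print(f"Error {number}: {h} hours = {days} days + {rem} hours")

span_all = hours[4] - hours[2]  # first to last logged error
span_2_3 = hours[3] - hours[2]  # between errors 2 and 3
print(span_all, span_2_3)
```

That gives 25 hours from error 2 to error 4, and 2 hours between errors 2 and 3.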
> 
> It suggests to me that the drive is no longer usable.
> But I'm no expert.

You were correct. After a few re-installations the
disk failed in an obvious way, reporting 395-odd errors. All the
while, SMART seemed to claim the disk "passed" its self-tests.

I was baffled, since the experiments with dd failed to replicate
the error. Evidently there was more to the failure than met the eye.

Thanks for writing!

bob prohaska