Re: Error detection for microSD-based swap, buildworld failures on pi3

From: Mark Millard <marklmi_at_yahoo.com>
Date: Wed, 02 Feb 2022 01:25:33 UTC
On 2022-Feb-1, at 16:47, MJ <mafsys1234@gmail.com> wrote:

> On 2/02/2022 3:18 am, bob prohaska wrote:
>> [new subject, different emphasis, old problem]
>> On Mon, Jan 31, 2022 at 03:06:01PM -0800, Mark Millard wrote:
>>> 
>>> One thing that could fit the behavior is if small part(s)
>>> of the system c++ compiler (or libraries it uses) were
>>> corrupted on that specific media. In that case, nothing
>>> elsewhere would replicate the failures but a lot might
>>> work without using the corrupted part(s), making the
>>> failures not random.
>> [spaced for emphasis]
>>> Checking on that is part of why
>>> I'd hoped to get a lldb report for a .sh/.cpp pair
>>> leading to failure on your RPi3* in question.
>>> 
>> If/when the stable/13 Pi3 finishes its -j1 single-user
>> build/install cycle I'll make a point of trying the
>> .sh/.cpp test under lldb.
>> For most of their operational history both troublesome Pi3
>> systems have had some of their swap on microSD. If there
>> is no error detection at all for microSD-based storage
> 
> Is this true? I would have thought it used some form of error detection in the firmware or in
> the controller.

The type of error and the stage at which it occurs both matter.
The firmware cannot cover all issues that lead to corrupted
content on media.

>> then undetected corruption of data from swap is a real
>> possibility. I expected that storage errors would be
>> reported but maybe not, especially outside file systems.
> 
> If indeed your suppositions are correct, would a file for swap be more prudent as it has to
> go through the file system (UFS/VFS) to read/write to swap?

No. See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206048 and
its comments #7 and #8.

>> Mechanical disks have some internal error detection and
>> report explicitly when data can't be retrieved. As I think
>> back on it at least one flash device (a USB thumb drive)
>> failed silently, no reported errors but also no-write.
>> That was on a filesystem, so the OS noticed and so did I.
> 
> But this could "simply" be because one of the NAND blocks has failed, not that it could not
> detect an error. Is there a lack of error detection in the driver handling USB thumb drives and reported back to the kernel? I do not know.

In Bob's context the failures reproduce at the same places,
compiling the same files, across buildworld runs with
varying -jN figures, differing prior history since boot,
use of the .sh/.cpp files that the compiler saves, and
across reboots.

Such reproducibility is unlikely to come from hitting the
same problem page(s) in the swap space in each of those ways
of running things.

>> Is there any error detection/correction employed by the
>> virtual memory system as it reads and writes mass storage?
> 
> You would think there should be.

Any corruption on media would more likely be in the compiler's
file(s) for Bob's context. (Source code that fails to compile
in Bob's specific RPi3* context compiles fine when copied to
other machines.)
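
One way to look for that kind of on-media corruption is to
compare checksums of the compiler's files against a copy from
a machine where the same sources compile. The following is
only a sketch, assuming both systems were installed from the
same build so that identical files are expected. The function
names are hypothetical, not from any existing tool.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file under root (relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks so each real file is hashed once.
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def diff_manifests(good, suspect):
    """Return relative paths that are missing or differ between the two."""
    paths = good.keys() | suspect.keys()
    return sorted(p for p in paths if good.get(p) != suspect.get(p))
```

The manifest from the known-good machine would be built there
and carried over for the comparison. File reads can be
satisfied from cached pages rather than from the media, so a
run shortly after a fresh boot is more likely to exercise what
is actually stored.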

If there is such a corruption (unknown), the memory content
to be written might have already been corrupt before it was
queued to be written out. At this point we do not know if
there is any corruption involved.
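
To illustrate the point about memory content already being
corrupt: a checksum computed at write time only protects what
was handed to it. The toy Python sketch below is an assumption
made for illustration and does not model the FreeBSD swap
pager or any real driver. It flips one bit in a buffer before
the "write" and shows that the later verification still passes.

```python
import zlib


def write_with_checksum(buf):
    """Pretend to store buf, returning the stored bytes and their CRC-32."""
    return bytes(buf), zlib.crc32(buf)


def verify(stored, crc):
    """Check stored bytes against the CRC-32 recorded at write time."""
    return zlib.crc32(stored) == crc


original = bytearray(b"int main() { return 0; }")

# Simulate a bit flip in RAM before the write is queued.
in_memory = bytearray(original)
in_memory[4] ^= 0x01

stored, crc = write_with_checksum(in_memory)

# The checksum matches what was written, not what was intended.
checksum_ok = verify(stored, crc)
matches_original = stored == bytes(original)
```

Here checksum_ok ends up True while matches_original is False,
so error detection at the storage layer would report nothing.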

===
Mark Millard
marklmi at yahoo.com