Re: Error detection for microSD-based swap, buildworld failures on pi3

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 01 Feb 2022 17:58:38 UTC

On 2022-Feb-1, at 08:18, bob prohaska <fbsd@www.zefox.net> wrote:

> [new subject, different emphasis, old problem]
> 
> On Mon, Jan 31, 2022 at 03:06:01PM -0800, Mark Millard wrote:
>> 
>> One thing that could fit the behavior is if small part(s)
>> of the system c++ compiler (or libraries it uses) were
>> corrupted on that specific media. In that case, nothing
>> elsewhere would replicate the failures but a lot might
>> work without using the corrupted part(s), making the
>> failures not random. 
> 
> [spaced for emphasis]
> 
>> Checking on that is part of why
>> I'd hoped to get a lldb report for a .sh/.cpp pair
>> leading to failure on your RPi3* in question.
>> 
> 
> If/when the stable/13 Pi3 finishes its -j1 single-user
> build/install cycle I'll make a point of trying the 
> .sh/.cpp test under lldb.  
> 
> For most of their operational history both troublesome Pi3
> systems have had some of their swap on microSD. If there
> is no error detection at all for microSD-based storage
> then undetected corruption of data from swap is a real
> possibility.

Getting a systematic error (SEGV) at a specific point in a
compile across many attempts, with various prior histories,
reboots, etc. involved, is not likely to be from somehow
hitting the same bad page in the swap space each time.
The attempts have used varying -jN figures, which can lead
to variations in which compile gets an error first. But
when it is a specific file that gets the failure, the
detail seems repeatable.

This is true of the .sh/.cpp pairs that fail reliably
for you as well, especially given that they work for
me, even without swap enabled: the 1 GiByte of RAM is
enough. (Swap is required for running under lldb.)
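
One way to put a number on "fails reliably" is to run the
same command several times and tally the exit statuses. The
following is only a minimal sketch: the function name and the
example command line are illustrative assumptions, not
something from the build system.

```python
import subprocess
import sys
from collections import Counter


def tally_exit_codes(command, runs=5):
    """Run `command` (a list of arguments) `runs` times and count
    how often each return code shows up."""
    counts = Counter()
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        counts[result.returncode] += 1
    return counts


if __name__ == "__main__":
    if len(sys.argv) < 2:
        # Hypothetical invocation: python3 tally.py sh ./reproducer.sh
        print("usage: tally.py command [args...]")
    else:
        print(tally_exit_codes(sys.argv[1:]))
```

A single non-zero status across all runs would fit a
systematic failure; a mix of statuses would point more
toward something random.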

If the problem is a corruption, it would most likely
be in some file in use by the compiler (possibly its
own file): a file in the UFS file system.
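
A possible way to look for such a corrupted file is to hash
every file under a directory tree on the suspect media and
compare against the same tree from a known-good install of
the same build. This is a sketch under assumptions: the
function names are made up for illustration, and two
independently built systems need not be bit-for-bit
identical, so a mismatch is only a hint about where to look.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under `root` (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            rel = os.path.relpath(path, root)
            try:
                manifest[rel] = sha256_of(path)
            except OSError as err:
                # A read error is itself worth recording.
                manifest[rel] = "unreadable: %s" % err
    return manifest


def diff_manifests(suspect, reference):
    """List paths present in both manifests whose digests differ."""
    common = suspect.keys() & reference.keys()
    return sorted(p for p in common if suspect[p] != reference[p])
```

The two manifests would be built on the two systems (or with
the suspect media mounted elsewhere) and then compared with
diff_manifests().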

> I expected that storage errors would be
> reported but maybe not, especially outside file systems.  

Not likely to be a swap space issue.

> Mechanical disks have some internal error detection and
> report explicitly when data can't be retrieved. As I think
> back on it at least one flash device (a USB thumb drive)
> failed silently, no reported errors but also no-write.
> That was on a filesystem, so the OS noticed and so did I.

Storage media generally cannot detect whether the data being
written was already corrupt before it was written.
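
As a toy illustration of that point (not a model of how any
real microSD controller works): a device-side check computed
at write time faithfully protects whatever bytes arrived,
corrupt or not. Only a checksum taken at the source, before
the corruption, notices the difference. All names here are
hypothetical.

```python
import zlib


def store(block):
    """Model a device that stores what it is given and attaches
    its own CRC, computed at write time."""
    data = bytes(block)
    return {"data": data, "device_crc": zlib.crc32(data)}


def device_read_ok(stored):
    """The device's own check: does the data match the CRC it recorded?"""
    return zlib.crc32(stored["data"]) == stored["device_crc"]


original = b"page contents" * 16
source_crc = zlib.crc32(original)   # checksum taken at the source

corrupted = bytearray(original)
corrupted[5] ^= 0x01                # bit flip before the write reaches the device

stored = store(corrupted)
print(device_read_ok(stored))                    # True: device sees nothing wrong
print(zlib.crc32(stored["data"]) == source_crc)  # False: end-to-end check catches it
```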

> Is there any error detection/correction employed by the
> virtual memory system as it reads and writes mass storage? 
> 

No separate one that I know of.  But getting a systematic
error at a systematic point across a wide variety of
histories is not likely to be a swap I/O problem.

If I understand correctly, the normal recommendation is
to avoid using a microSD card for heavily used swap
space, reliability over time being an issue. (But I'm
no expert.)


For reference, as I understand it, the following is a
repeatable part of the failure notice for compiling
contrib/googletest/googletest/src/gtest-all.cc :

1.      /usr/obj/usr/src/arm64.aarch64/tmp/usr/include/private/gtest/internal/gtest-type-util.h:806:37: current parser token '{'
2.      /usr/obj/usr/src/arm64.aarch64/tmp/usr/include/private/gtest/internal/gtest-type-util.h:58:1: parsing namespace 'testing'

But we know that when I make a copy of the .cpp/.sh
pair and execute the .sh, the compile works fine.
This is evidence that the source code being compiled
is not corrupt.

===
Mark Millard
marklmi at yahoo.com