Re: Error detection for microSD-based swap, buildworld failures on pi3
Date: Wed, 02 Feb 2022 02:52:10 UTC
On 2/02/2022 12:25 pm, Mark Millard wrote:
> On 2022-Feb-1, at 16:47, MJ <mafsys1234@gmail.com> wrote:
>
>> On 2/02/2022 3:18 am, bob prohaska wrote:
>>> [new subject, different emphasis, old problem]
>>> On Mon, Jan 31, 2022 at 03:06:01PM -0800, Mark Millard wrote:
>>>>
>>>> One thing that could fit the behavior is if small part(s)
>>>> of the system c++ compiler (or libraries it uses) were
>>>> corrupted on that specific media. In that case, nothing
>>>> elsewhere would replicate the failures but a lot might
>>>> work without using the corrupted part(s), making the
>>>> failures not random.
>>> [spaced for emphasis]
>>>> Checking on that is part of why
>>>> I'd hoped to get a lldb report for a .sh/.cpp pair
>>>> leading to failure on your RPi3* in question.
>>>>
>>> If/when the stable/13 Pi3 finishes its -j1 single-user
>>> build/install cycle I'll make a point of trying the
>>> .sh/.cpp test under lldb.
>>> For most of their operational history both troublesome Pi3
>>> systems have had some of their swap on microSD. If there
>>> is no error detection at all for microSD-based storage
>>
>> Is this true? I would have thought it used some form of error detection in the firmware or in
>> the controller.
>
> The type of error and stage at which the error occurs matters.
> The firmware can not cover all issues that lead to corrupted
> content on media.

I did not say it covers all corruption. However, I would be very surprised if any SD card controller did no error checking at all, whether generic ECC or BCH specifically. That remains my point.

>
>>> then undetected corruption of data from swap is a real
>>> possibility. I expected that storage errors would be
>>> reported but maybe not, especially outside file systems.
>>
>> If indeed your suppositions are correct, would a file for swap be more prudent as it has to
>> go through the file system (UFS/VFS) to read/write to swap?
>
> No. See https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206048 and
> its comments #7 and #8.
>

That bug seems to address potential memory over-use caused by a swap file, not the safety of a swap file compared with a swap partition. I still contend that the UFS file system offers better protection against corruption than a raw partition labelled as swap. If Bob's requirement is a "safer" swap, then a file would be the answer. Whether there are other issues to contend with is likely outside the scope of this particular discussion.

>>> Mechanical disks have some internal error detection and
>>> report explicitly when data can't be retrieved. As I think
>>> back on it at least one flash device (a USB thumb drive)
>>> failed silently, no reported errors but also no-write.
>>> That was on a filesystem, so the OS noticed and so did I.
>>
>> But this could "simply" be because one of the NAND blocks has failed, not that it could not
>> detect an error. Is there a lack of error detection in the driver handling USB thumb drives and reported back to the kernel? I do not know.
>
> Bob's context is reproducible at the same places in

No, he was talking about a "failed silently" event, and that is what I was replying to. I am not up to date with the previous discussion on the llvm/clang failures.

>
> Such is unlikely for hitting the same problem page(s)
> in the swap space each way things are run.

I couldn't agree more. The chances would seem remote, unless that partition sits on a part of the SD card/USB drive that is failing and the USB driver is not picking up the errors reported by the controller.

MJ