Re: aarch64(?) poudiere-devel based builds seem to get fairly-rare corrupted files after recent system update(s?)

From: Mark Millard via freebsd-current <freebsd-current_at_freebsd.org>
Date: Thu, 18 Nov 2021 20:15:24 UTC
On 2021-Nov-17, at 11:17, Mark Millard <marklmi@yahoo.com> wrote:

> On 2021-Nov-15, at 15:43, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> On 2021-Nov-15, at 13:13, Mark Millard <marklmi@yahoo.com> wrote:
>> 
>>> On 2021-Nov-15, at 12:51, Mark Millard <marklmi@yahoo.com> wrote:
>>> 
>>>> On 2021-Nov-15, at 11:31, Mark Millard <marklmi@yahoo.com> wrote:
>>>> 
>>>>> I updated from (shown a system that I've not updated yet):
>>>>> 
>>>>> # uname -apKU
>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #18 main-n250455-890cae197737-dirty: Thu Nov  4 13:43:17 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1400040 1400040
>>>>> 
>>>>> to:
>>>>> 
>>>>> # uname -apKU
>>>>> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #19 main-n250667-20aa359773be-dirty: Sun Nov 14 02:57:32 PST 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1400042 1400042
>>>>> 
>>>>> and then updated /usr/ports/ and started poudriere-devel based builds of
>>>>> the ports I'd set up to use. However, my last round of port builds from
>>>>> a general update of /usr/ports/ was on 2021-10-23, before either of the
>>>>> above.
>>>>> 
>>>>> I've had at least two files that seem to be corrupted, where a later part
>>>>> of the build hits problematical file(s) from earlier build activity. For
>>>>> example:
>>>>> 
>>>>> /usr/local/include/X11/extensions/XvMC.h:1:1: warning: null character ignored [-Wnull-character]
>>>>> <U+0000> 
>>>>> ^
>>>>> /usr/local/include/X11/extensions/XvMC.h:1:2: warning: null character ignored [-Wnull-character]
>>>>> <U+0000><U+0000>
>>>>>    ^
>>>>> /usr/local/include/X11/extensions/XvMC.h:1:3: warning: null character ignored [-Wnull-character]
>>>>> <U+0000><U+0000><U+0000> 
>>>>>            ^   
>>>>> /usr/local/include/X11/extensions/XvMC.h:1:4: warning: null character ignored [-Wnull-character]
>>>>> <U+0000><U+0000><U+0000><U+0000>
>>>>>                    ^
>>>>> . . .
>>>>> 
>>>>> Removing the xorgproto-2021.4 package and rebuilding via
>>>>> poudriere-devel did not get a failure of any ports dependent
>>>>> on it.
>>>>> 
>>>>> This was from a use of:
>>>>> 
>>>>> # poudriere jail -j13_0R-CA7 -i
>>>>> Jail name:         13_0R-CA7
>>>>> Jail version:      13.0-RELEASE-p5
>>>>> Jail arch:         arm.armv7
>>>>> Jail method:       null
>>>>> Jail mount:        /usr/obj/DESTDIRs/13_0R-CA7-poud
>>>>> Jail fs:           
>>>>> Jail updated:      2021-11-04 01:48:49
>>>>> Jail pkgbase:      disabled
>>>>> 
>>>>> but another not-investigated example was from:
>>>>> 
>>>>> # poudriere jail -j13_0R-CA72 -i
>>>>> Jail name:         13_0R-CA72
>>>>> Jail version:      13.0-RELEASE-p5
>>>>> Jail arch:         arm64.aarch64
>>>>> Jail method:       null
>>>>> Jail mount:        /usr/obj/DESTDIRs/13_0R-CA72-poud
>>>>> Jail fs:           
>>>>> Jail updated:      2021-11-04 01:48:01
>>>>> Jail pkgbase:      disabled
>>>>> 
>>>>> (so no 32-bit COMPAT involved). The apparent corruption
>>>>> was in a different port (autoconf, noticed by the
>>>>> build of automake failing via config reporting
>>>>> /usr/local/share/autoconf-2.69/autoconf/autoconf.m4f
>>>>> being rejected).
>>>>> 
>>>>> /usr/obj/DESTDIRs/13_0R-CA7-poud/ and
>>>>> /usr/obj/DESTDIRs/13_0R-CA72-poud/ and the like track the
>>>>> system versions.
>>>>> 
>>>>> The media is an Optane 960 in the PCIe slot of a HoneyComb
>>>>> (16 Cortex-A72's). The context is a root on ZFS one, ZFS
>>>>> used in order to have bectl, not redundancy.
>>>>> 
>>>>> The ThreadRipper 1950X (so amd64) port builds did not give
>>>>> evidence of such problems based on the updated system. (Also
>>>>> Optane media in a PCIe slot, also root on ZFS.) But the
>>>>> errors seem rare enough to not be able to conclude much.
>>>> 
>>>> For aarch64 targeting aarch64 there was also this
>>>> explicit corruption notice during the poudriere(-devel)
>>>> bulk build:
>>>> 
>>>> . . .
>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3: .........
>>>> pkg-static: Fail to extract /usr/local/libexec/gcc/arm-none-eabi/8.4.0/lto1 from package: Lzma library error: Corrupted input data
>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3... done
>>>> 
>>>> Failed to install the following 1 package(s): /packages/All/arm-none-eabi-gcc-8.4.0_3.pkg
>>>> *** Error code 1
>>>> Stop.
>>>> make: stopped in /usr/ports/sysutils/u-boot-orangepi-plus-2e
>>>> 
>>>> I'm not yet to the point of retrying after removing
>>>> arm-none-eabi-gcc-8.4.0_3 : other things are being built.
>>> 
>>> 
>>> Another context with my prior general update of /usr/ports/
>>> and the matching port builds: Back then I used USE_TMPFS=all
>>> but the failure is based on USE_TMPFS="data" instead. So:
>>> lots more I/O.
>>> 
>> 
>> None of the 3 corruptions repeated during bulk builds that
>> retried the builds that generated the files. All of the
>> ports that failed by hitting the corruptions in what they
>> depended on built fine in the retries.
>> 
>> For reference:
>> 
>> I'll note that, back when I was using USE_TMPFS=all , I also
>> did some separate bulk -a test runs, both aarch64 (Cortex-A72)
>> native and Cortex-A72 targeting Cortex-A7 (armv7). None of
>> those showed evidence of file corruptions. In general I've
>> not had previous file corruptions with this system. (There
>> was a little more than 245 GiBytes swap, which covered the
>> tmpfs needs when they were large.)
> 
> 
> I set up a contrasting test context and got no evidence of
> corruptions in that context. (Note: the 3 bulk builds
> total to around 24 hrs of activity for the 3 examples
> of 460+ ports building.) So, for the Cortex-A72 system,

I set up a UFS on Optane (U.2 via M.2 adapter) context and
also got no evidence of corruptions in that context (same
hardware and a copy of the USB3 SSD based system). The
sequence of 3 bulks took somewhat over 18 hrs using the
Optane.

> root on UFS on portable USB3 SSD:   no evidence of corruptions
Also:
root on UFS on Optane U.2 via M.2:  no evidence of corruptions
> vs.:
> root on ZFS on Optane in PCIe slot: solid evidence of 3 known corruptions
> 
> Both had USE_TMPFS="data" in use. The same system build
> had been installed and booted for both tests.
> 
> The evidence of corruptions is rare enough for this not to
> be determinative, but it is suggestive.
> 
> Unfortunately, ZFS vs. UFS and Optane-in-PCIe vs. USB3 are
> not differentiated by this test result.
> 
> There is also the result that I've not seen evidence of
> corruptions on the ThreadRipper 1950X (amd64) system.
> Again, not determinative, but suggestive, given how rare
> the corruptions seem to be.

So far the only things unique to the observed corruptions are:

root on ZFS context (vs. root on UFS)
and:
Optane in a PCIe slot (but no contrasting ZFS case tested)

The PCIe slot does not seem to me to be likely to be contributing.
So this seems to be suggestive of a ZFS problem.

A contributing point might be that the main [so: 14] system was
built via -mcpu=cortex-a72 for execution on a Cortex-A72 system.

[I previously ran into a USB subsystem mishandling of keeping
things coherent for the weak memory ordering in this sort of
context. That issue was fixed. But back then I was lucky enough
to be able to demonstrate fails vs. works by adding an
appropriate instruction to FreeBSD in a few specific places
(more than necessary as it turned out). Someone else determined
where the actual mishandling was that covered all required
places. My generating that much information in this context
seems unlikely.]
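
For anyone wanting to look for similar damage in an installed tree,
here is a minimal sketch. It assumes the corruption looks like the
XvMC.h case quoted above, that is, a file that starts with a run of
NUL bytes. The function name find_nul_prefixed and the min_nuls
threshold are hypothetical, picked only for illustration; this is
not a tool from FreeBSD, pkg, or poudriere, and it would miss
corruption of any other shape (such as the damaged package data
behind the Lzma error).

```python
import os


def find_nul_prefixed(root, min_nuls=4):
    """Return sorted paths of regular files under root whose first
    min_nuls bytes are all NUL (hypothetical corruption heuristic)."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    head = f.read(min_nuls)
            except OSError:
                # Unreadable files are skipped rather than reported.
                continue
            # Files shorter than min_nuls bytes are never reported.
            if len(head) == min_nuls and head == b"\0" * min_nuls:
                hits.append(path)
    return sorted(hits)
```

Calling find_nul_prefixed("/usr/local/include") would list candidate
files to inspect by hand. Some binary formats may legitimately begin
with NUL bytes, so a hit is only a hint, not proof of corruption.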


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)