Re: aarch64(?) poudiere-devel based builds seem to get fairly-rare corrupted files after recent system update(s?)

From: Mark Millard via freebsd-current <freebsd-current_at_freebsd.org>
Date: Sat, 20 Nov 2021 06:20:40 UTC
On 2021-Nov-18, at 12:15, Mark Millard <marklmi@yahoo.com> wrote:

> On 2021-Nov-17, at 11:17, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> On 2021-Nov-15, at 15:43, Mark Millard <marklmi@yahoo.com> wrote:
>> 
>>> On 2021-Nov-15, at 13:13, Mark Millard <marklmi@yahoo.com> wrote:
>>> 
>>>> On 2021-Nov-15, at 12:51, Mark Millard <marklmi@yahoo.com> wrote:
>>>> 
>>>>> On 2021-Nov-15, at 11:31, Mark Millard <marklmi@yahoo.com> wrote:
>>>>> 
>>>>>> I updated from (shown a system that I've not updated yet):
>>>>>> 
>>>>>> # uname -apKU
>>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #18 main-n250455-890cae197737-dirty: Thu Nov  4 13:43:17 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 
>>>>>> 1400040 1400040
>>>>>> 
>>>>>> to:
>>>>>> 
>>>>>> # uname -apKU
>>>>>> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #19 main-n250667-20aa359773be-dirty: Sun Nov 14 02:57:32 PST 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1400042 1400042
>>>>>> 
>>>>>> and then updated /usr/ports/ and started poudriere-devel based builds of
>>>>>> the ports I'd set up to use. However, my last round of port builds from
>>>>>> a general update of /usr/ports/ were on 2021-10-23 before either of the
>>>>>> above.
>>>>>> 
>>>>>> I've had at least two files that seem to be corrupted, where a later part
>>>>>> of the build hits problematical file(s) from earlier build activity. For
>>>>>> example:
>>>>>> 
>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:1: warning: null character ignored [-Wnull-character]
>>>>>> <U+0000> 
>>>>>> ^
>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:2: warning: null character ignored [-Wnull-character]
>>>>>> <U+0000><U+0000>
>>>>>>   ^
>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:3: warning: null character ignored [-Wnull-character]
>>>>>> <U+0000><U+0000><U+0000> 
>>>>>>           ^   
>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:4: warning: null character ignored [-Wnull-character]
>>>>>> <U+0000><U+0000><U+0000><U+0000>
>>>>>>                   ^
>>>>>> . . .
>>>>>> 
>>>>>> Removing the xorgproto-2021.4 package and rebuilding via
>>>>>> poudriere-devel did not get a failure of any ports dependent
>>>>>> on it.
>>>>>> 
>>>>>> This was from a use of:
>>>>>> 
>>>>>> # poudriere jail -j13_0R-CA7 -i
>>>>>> Jail name:         13_0R-CA7
>>>>>> Jail version:      13.0-RELEASE-p5
>>>>>> Jail arch:         arm.armv7
>>>>>> Jail method:       null
>>>>>> Jail mount:        /usr/obj/DESTDIRs/13_0R-CA7-poud
>>>>>> Jail fs:           
>>>>>> Jail updated:      2021-11-04 01:48:49
>>>>>> Jail pkgbase:      disabled
>>>>>> 
>>>>>> but another not-investigated example was from:
>>>>>> 
>>>>>> # poudriere jail -j13_0R-CA72 -i
>>>>>> Jail name:         13_0R-CA72
>>>>>> Jail version:      13.0-RELEASE-p5
>>>>>> Jail arch:         arm64.aarch64
>>>>>> Jail method:       null
>>>>>> Jail mount:        /usr/obj/DESTDIRs/13_0R-CA72-poud
>>>>>> Jail fs:           
>>>>>> Jail updated:      2021-11-04 01:48:01
>>>>>> Jail pkgbase:      disabled
>>>>>> 
>>>>>> (so no 32-bit COMPAT involved). The apparent corruption
>>>>>> was in a different port (autoconf, noticed by the
>>>>>> build of automake failing via config reporting
>>>>>> /usr/local/share/autoconf-2.69/autoconf/autoconf.m4f
>>>>>> being rejected).
>>>>>> 
>>>>>> /usr/obj/DESTDIRs/13_0R-CA7-poud/ and
>>>>>> /usr/obj/DESTDIRs/13_0R-CA72-poud/ and the like track the
>>>>>> system versions.
>>>>>> 
>>>>>> The media is an Optane 960 in the PCIe slot of a HoneyComb
>>>>>> (16 Cortex-A72's). The context is a root on ZFS one, ZFS
>>>>>> used in order to have bectl, not redundancy.
>>>>>> 
>>>>>> The ThreadRipper 1950X (so amd64) port builds did not give
>>>>>> evidence of such problems based on the updated system. (Also
>>>>>> Optane media in a PCIe slot, also root on ZFS.) But the
>>>>>> errors seem rare enough to not be able to conclude much.
>>>>> 
>>>>> For aarch64 targeting aarch64 there was also this
>>>>> explicit corruption notice during the poudriere(-devel)
>>>>> bulk build:
>>>>> 
>>>>> . . .
>>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3: .........
>>>>> pkg-static: Fail to extract /usr/local/libexec/gcc/arm-none-eabi/8.4.0/lto1 from package: Lzma library error: Corrupted input data
>>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3... done
>>>>> 
>>>>> Failed to install the following 1 package(s): /packages/All/arm-none-eabi-gcc-8.4.0_3.pkg
>>>>> *** Error code 1
>>>>> Stop.
>>>>> make: stopped in /usr/ports/sysutils/u-boot-orangepi-plus-2e
>>>>> 
>>>>> I'm not yet to the point of retrying after removing
>>>>> arm-none-eabi-gcc-8.4.0_3 : other things are being built.
>>>> 
>>>> 
>>>> Another context with my prior general update of /usr/ports/
>>>> and the matching port builds: Back then I used USE_TMPFS=all
>>>> but the failure is based on USE_TMPFS="data" instead. So:
>>>> lots more I/O.
>>>> 
>>> 
>>> None of the 3 corruptions repeated during bulk builds that
>>> retried the builds that generated the files. All of the
>>> ports that failed by hitting the corruptions in what they
>>> depended on, built fine in the retries.
>>> 
>>> For reference:
>>> 
>>> I'll note that, back when I was using USE_TMPFS=all , I also
>>> did some separate bulk -a test runs, both aarch64 (Cortex-A72)
>>> native and Cortex-A72 targeting Cortex-A7 (armv7). None of
>>> those showed evidence of file corruptions. In general I've
>>> not had previous file corruptions with this system. (There
>>> was a little more than 245 GiBytes swap, which covered the
>>> tmpfs needs when they were large.)
>> 
>> 
>> I set up a contrasting test context and got no evidence of
>> corruptions in that context. (Note: the 3 bulk builds
>> total to around 24 hrs of activity for the 3 examples
>> of 460+ ports building.) So, for the Cortex-A72 system,
> 
> I set up a UFS on Optane (U.2 via M.2 adapter) context and
> also got no evidence of corruptions in that context (same
> hardware and a copy of the USB3 SSD based system). The
> sequence of 3 bulks took somewhat over 18 hrs using the
> Optane.
> 
>> root on UFS on portable USB3 SSD:   no evidence of corruptions
> Also:
> root on UFS on Optane U.2 via M.2:  no evidence of corruptions
>> vs.:
>> root on ZFS on Optane in PCIe slot: solid evidence of 3 known corruptions
>> 
>> Both had USE_TMPFS="data" in use. The same system build
>> had been installed and booted for both tests.
>> 
>> The evidence of corruptions is rare enough for this not to
>> be determinative, but it is suggestive.
>> 
>> Unfortunately, ZFS vs. UFS and Optane-in-PCIe vs. USB3 are
>> not differentiated by this test result.
>> 
>> There is also the result that I've not seen evidence of
>> corruptions on the ThreadRipper 1950X (amd64) system.
>> Again, not determinative, but suggestive, given how rare
>> the corruptions seem to be.
> 
> So far the only things unique to the observed corruptions are:
> 
> root on ZFS context (vs. root on UFS)
> and:
> Optane in a PCIe slot (but no contrasting ZFS case tested)
> 
> The PCIe slot does not seem to me to be likely to be contributing.
> So this seems to be suggestive of a ZFS problem.
> 
> A contributing point might be that the main [so: 14] system was
> built via -mcpu=cortex-a72 for execution on a Cortex-A72 system.
> 
> [I previously ran into a USB subsystem mishandling of keeping
> things coherent for the weak memory ordering in this sort of
> context. That issue was fixed. But back then I was lucky enough
> to be able to demonstrate fails vs. works by adding an
> appropriate instruction to FreeBSD in a few specific places
> (more than necessary as it turned out). Someone else determined
> where the actual mishandling was that covered all required
> places. My generating that much information in this context
> seems unlikely.]


I started a retry of root-on-ZFS with the Optane-in-PCIe-slot media
and it got its first corruption (in a different place, 2nd bulk
build this time). The use of the corrupted file reports:

configure:13269: cc -o conftest -Wall -Wextra -fsigned-char -Wdeclaration-after-statement -O2 -pipe -mcpu=cortex-a53  -g -fstack-protector-strong -fno-strict-aliasing  -DUSE_MEMORY_H -I/usr/local/include -mcpu=cortex-a53  -fstack-protector-strong  conftest.c  -L/usr/local/lib -logg >&5
In file included from conftest.c:27:
In file included from /usr/local/include/ogg/ogg.h:24:
In file included from /usr/local/include/ogg/os_types.h:154:
/usr/local/include/ogg/config_types.h:1:1: warning: null character ignored [-Wnull-character]
<U+0000>
^
/usr/local/include/ogg/config_types.h:1:2: warning: null character ignored [-Wnull-character]
<U+0000><U+0000>
        ^
/usr/local/include/ogg/config_types.h:1:3: warning: null character ignored [-Wnull-character]
<U+0000><U+0000><U+0000>
                ^
. . .
/usr/local/include/ogg/config_types.h:1:538: warning: null character ignored [-Wnull-character]
. . . (nulls) . . .

So: 538 such null bytes.

Thus, another example of something like a page of nulls being
written out when ZFS is in use.
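
For anyone wanting to look for the same symptom without waiting for a
dependent port to trip over it, here is a minimal sketch in Python. It
was not used to produce any of the results above. The function names
and the min_run threshold are illustrative assumptions, and it only
detects NUL runs at the very start of a file, which is where the
warnings above (all on line 1) place them:

```python
import os


def leading_nul_count(path, chunk_size=65536):
    """Return how many consecutive NUL bytes the file at path starts with."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # End of file: everything read so far was NUL.
                return count
            stripped = chunk.lstrip(b"\x00")
            count += len(chunk) - len(stripped)
            if stripped:
                # Hit the first non-NUL byte.
                return count


def scan_tree(root, min_run=16):
    """Yield (path, count) for regular files under root that start with
    at least min_run NUL bytes. min_run is an arbitrary threshold."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                n = leading_nul_count(path)
            except OSError:
                continue  # unreadable file: ignore in this sketch
            if n >= min_run:
                yield path, n
```

Something like list(scan_tree("/usr/local/include")) run inside the
builder's environment would list candidate files. Text headers should
never start with NULs, but some binary formats legitimately do, so
hits outside of text files need a manual look.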

audio/gstreamer1-plugins-ogg also failed via referencing the file
during its build.

(The bulk run is still going and there is one more bulk run to go.)


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)