Re: aarch64(?) poudiere-devel based builds seem to get fairly-rare corrupted files after recent system update(s?)

From: Mark Millard via freebsd-current <freebsd-current_at_freebsd.org>
Date: Tue, 23 Nov 2021 08:43:11 UTC
On 2021-Nov-21, at 07:50, Mark Millard <marklmi@yahoo.com> wrote:

> On 2021-Nov-20, at 11:54, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> On 2021-Nov-19, at 22:20, Mark Millard <marklmi@yahoo.com> wrote:
>> 
>>> On 2021-Nov-18, at 12:15, Mark Millard <marklmi@yahoo.com> wrote:
>>> 
>>>> On 2021-Nov-17, at 11:17, Mark Millard <marklmi@yahoo.com> wrote:
>>>> 
>>>>> On 2021-Nov-15, at 15:43, Mark Millard <marklmi@yahoo.com> wrote:
>>>>> 
>>>>>> On 2021-Nov-15, at 13:13, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>> 
>>>>>>> On 2021-Nov-15, at 12:51, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>>> 
>>>>>>>> On 2021-Nov-15, at 11:31, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>>>> 
>>>>>>>>> I updated from (as shown on a system that I've not yet updated):
>>>>>>>>> 
>>>>>>>>> # uname -apKU
>>>>>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #18 main-n250455-890cae197737-dirty: Thu Nov  4 13:43:17 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1400040 1400040
>>>>>>>>> 
>>>>>>>>> to:
>>>>>>>>> 
>>>>>>>>> # uname -apKU
>>>>>>>>> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #19 main-n250667-20aa359773be-dirty: Sun Nov 14 02:57:32 PST 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1400042 1400042
>>>>>>>>> 
>>>>>>>>> and then updated /usr/ports/ and started poudriere-devel based builds of
>>>>>>>>> the ports I'd set up to use. However, my last round of port builds from
>>>>>>>>> a general update of /usr/ports/ was on 2021-10-23, before either of the
>>>>>>>>> above.
>>>>>>>>> 
>>>>>>>>> I've had at least two files that seem to be corrupted, where a later part
>>>>>>>>> of the build hits problematical file(s) from earlier build activity. For
>>>>>>>>> example:
>>>>>>>>> 
>>>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:1: warning: null character ignored [-Wnull-character]
>>>>>>>>> <U+0000> 
>>>>>>>>> ^
>>>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:2: warning: null character ignored [-Wnull-character]
>>>>>>>>> <U+0000><U+0000>
>>>>>>>>> ^
>>>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:3: warning: null character ignored [-Wnull-character]
>>>>>>>>> <U+0000><U+0000><U+0000> 
>>>>>>>>>        ^   
>>>>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:4: warning: null character ignored [-Wnull-character]
>>>>>>>>> <U+0000><U+0000><U+0000><U+0000>
>>>>>>>>>                ^
>>>>>>>>> . . .
>>>>>>>>> 
>>>>>>>>> Removing the xorgproto-2021.4 package and rebuilding it via
>>>>>>>>> poudriere-devel did not produce a failure in any of the ports
>>>>>>>>> dependent on it.
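>>>>>>>>> 
>>>>>>>>> (The rebuild was along these lines; the invocation here is
>>>>>>>>> illustrative rather than a transcript, with the jail name taken
>>>>>>>>> from the output below and the ports tree name assumed to be the
>>>>>>>>> default one:
>>>>>>>>> 
>>>>>>>>> # poudriere bulk -j 13_0R-CA7 -p default -C x11/xorgproto
>>>>>>>>> 
>>>>>>>>> Here -C has poudriere delete and rebuild the listed package
>>>>>>>>> before anything depending on it is built.)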
>>>>>>>>> 
>>>>>>>>> This was from a use of:
>>>>>>>>> 
>>>>>>>>> # poudriere jail -j13_0R-CA7 -i
>>>>>>>>> Jail name:         13_0R-CA7
>>>>>>>>> Jail version:      13.0-RELEASE-p5
>>>>>>>>> Jail arch:         arm.armv7
>>>>>>>>> Jail method:       null
>>>>>>>>> Jail mount:        /usr/obj/DESTDIRs/13_0R-CA7-poud
>>>>>>>>> Jail fs:           
>>>>>>>>> Jail updated:      2021-11-04 01:48:49
>>>>>>>>> Jail pkgbase:      disabled
>>>>>>>>> 
>>>>>>>>> but another not-investigated example was from:
>>>>>>>>> 
>>>>>>>>> # poudriere jail -j13_0R-CA72 -i
>>>>>>>>> Jail name:         13_0R-CA72
>>>>>>>>> Jail version:      13.0-RELEASE-p5
>>>>>>>>> Jail arch:         arm64.aarch64
>>>>>>>>> Jail method:       null
>>>>>>>>> Jail mount:        /usr/obj/DESTDIRs/13_0R-CA72-poud
>>>>>>>>> Jail fs:           
>>>>>>>>> Jail updated:      2021-11-04 01:48:01
>>>>>>>>> Jail pkgbase:      disabled
>>>>>>>>> 
>>>>>>>>> (so no 32-bit COMPAT involved). The apparent corruption
>>>>>>>>> was in a different port (autoconf, noticed by the
>>>>>>>>> build of automake failing, with configure reporting
>>>>>>>>> /usr/local/share/autoconf-2.69/autoconf/autoconf.m4f
>>>>>>>>> as being rejected).
>>>>>>>>> 
>>>>>>>>> /usr/obj/DESTDIRs/13_0R-CA7-poud/ and
>>>>>>>>> /usr/obj/DESTDIRs/13_0R-CA72-poud/ and the like track the
>>>>>>>>> system versions.
>>>>>>>>> 
>>>>>>>>> The media is an Optane 960 in the PCIe slot of a HoneyComb
>>>>>>>>> (16 Cortex-A72's). The context is a root on ZFS one, ZFS
>>>>>>>>> used in order to have bectl, not redundancy.
>>>>>>>>> 
>>>>>>>>> The ThreadRipper 1950X (so amd64) port builds, based on the
>>>>>>>>> updated system, did not give evidence of such problems. (Also
>>>>>>>>> Optane media in a PCIe slot, also root on ZFS.) But the
>>>>>>>>> errors seem rare enough that not much can be concluded.
>>>>>>>> 
>>>>>>>> For aarch64 targeting aarch64 there was also this
>>>>>>>> explicit corruption notice during the poudriere(-devel)
>>>>>>>> bulk build:
>>>>>>>> 
>>>>>>>> . . .
>>>>>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3: .........
>>>>>>>> pkg-static: Fail to extract /usr/local/libexec/gcc/arm-none-eabi/8.4.0/lto1 from package: Lzma library error: Corrupted input data
>>>>>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3... done
>>>>>>>> 
>>>>>>>> Failed to install the following 1 package(s): /packages/All/arm-none-eabi-gcc-8.4.0_3.pkg
>>>>>>>> *** Error code 1
>>>>>>>> Stop.
>>>>>>>> make: stopped in /usr/ports/sysutils/u-boot-orangepi-plus-2e
>>>>>>>> 
>>>>>>>> I'm not yet to the point of retrying after removing
>>>>>>>> arm-none-eabi-gcc-8.4.0_3: other things are still being built.
>>>>>>> 
>>>>>>> 
>>>>>>> Another difference from my prior general update of /usr/ports/
>>>>>>> and the matching port builds: back then I used USE_TMPFS=all,
>>>>>>> but the failing runs used USE_TMPFS="data" instead. So:
>>>>>>> lots more I/O.
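>>>>>>> 
>>>>>>> (For context, USE_TMPFS is the poudriere.conf knob involved.
>>>>>>> Going by the sample poudriere.conf, the two settings contrast
>>>>>>> roughly as:
>>>>>>> 
>>>>>>> USE_TMPFS=all     # prior runs: whole build, wrkdirs included, in tmpfs
>>>>>>> USE_TMPFS="data"  # these runs: only poudriere's cache/temp build data in tmpfs
>>>>>>> 
>>>>>>> so with "data" the ports' work directories land on the actual
>>>>>>> file system, hence the extra I/O.)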
>>>>>>> 
>>>>>> 
>>>>>> None of the 3 corruptions repeated during bulk builds that
>>>>>> retried the builds that had generated the files. All of the
>>>>>> ports that had failed by hitting the corruptions in what they
>>>>>> depended on built fine in the retries.
>>>>>> 
>>>>>> For reference:
>>>>>> 
>>>>>> I'll note that, back when I was using USE_TMPFS=all, I also
>>>>>> did some separate bulk -a test runs, both aarch64 (Cortex-A72)
>>>>>> native and Cortex-A72 targeting Cortex-A7 (armv7). None of
>>>>>> those showed evidence of file corruptions. In general I've
>>>>>> not had file corruptions with this system before. (There
>>>>>> was a little more than 245 GiBytes of swap, which covered the
>>>>>> tmpfs needs when they were large.)
>>>>> 
>>>>> 
>>>>> I set up a contrasting test context and got no evidence of
>>>>> corruptions in that context. (Note: the 3 bulk builds
>>>>> totaled around 24 hrs of activity for the 3 examples
>>>>> of 460+ ports building.) So, for the Cortex-A72 system,
>>>> 
>>>> I set up a UFS on Optane (U.2 via M.2 adapter) context and
>>>> also got no evidence of corruptions in that context (same
>>>> hardware and a copy of the USB3 SSD based system). The
>>>> sequence of 3 bulks took somewhat over 18 hrs using the
>>>> Optane.
>>>> 
>>>>> root on UFS on portable USB3 SSD:   no evidence of corruptions
>>>> Also:
>>>> root on UFS on Optane U.2 via M.2:  no evidence of corruptions
>>>>> vs.:
>>>>> root on ZFS on Optane in PCIe slot: solid evidence of 3 known corruptions
>>>>> 
>>>>> Both had USE_TMPFS="data" in use. The same system build
>>>>> had been installed and booted for both tests.
>>>>> 
>>>>> The evidence of corruptions is rare enough for this not to
>>>>> be determinative, but it is suggestive.
>>>>> 
>>>>> Unfortunately, ZFS vs. UFS and Optane-in-PCIe vs. USB3 are
>>>>> not differentiated by this test result.
>>>>> 
>>>>> There is also the result that I've not seen evidence of
>>>>> corruptions on the ThreadRipper 1950X (amd64) system.
>>>>> Again, not determinative, but suggestive, given how rare
>>>>> the corruptions seem to be.
>>>> 
>>>> So far the only things unique to the observed corruptions are:
>>>> 
>>>> root on ZFS context (vs. root on UFS)
>>>> and:
>>>> Optane in a PCIe slot (but no contrasting ZFS case tested)
>>>> 
>>>> The PCIe slot does not seem likely to me to be contributing.
>>>> So this seems suggestive of a ZFS problem.
>>>> 
>>>> A contributing point might be that the main [so: 14] system was
>>>> built via -mcpu=cortex-a72 for execution on a Cortex-A72 system.
>>>> 
>>>> [I previously ran into a USB subsystem mishandling of keeping
>>>> things coherent for the weak memory ordering in this sort of
>>>> context. That issue was fixed. But back then I was lucky enough
>>>> to be able to demonstrate fails vs. works by adding an
>>>> appropriate instruction to FreeBSD in a few specific places
>>>> (more than were necessary, as it turned out). Someone else
>>>> determined where the actual mishandling was, with a fix covering
>>>> all the required places. My generating that much information in
>>>> this context seems unlikely.]
>>> 
>>> 
>>> I started a retry of root-on-ZFS with the Optane-in-PCIe-slot media
>>> and it got its first corruption (in a different place, 2nd bulk
>>> build this time). The use of the corrupted file reports:
>>> 
>>> configure:13269: cc -o conftest -Wall -Wextra -fsigned-char -Wdeclaration-after-statement -O2 -pipe -mcpu=cortex-a53  -g -fstack-protector-strong -fno-strict-aliasing  -DUSE_MEMORY_H -I/usr/local/incl
>>> ude -mcpu=cortex-a53  -fstack-protector-strong  conftest.c  -L/usr/local/lib -logg >&5
>>> In file included from conftest.c:27:
>>> In file included from /usr/local/include/ogg/ogg.h:24:
>>> In file included from /usr/local/include/ogg/os_types.h:154:
>>> /usr/local/include/ogg/config_types.h:1:1: warning: null character ignored [-Wnull-character]
>>> <U+0000>
>>> ^
>>> /usr/local/include/ogg/config_types.h:1:2: warning: null character ignored [-Wnull-character]
>>> <U+0000><U+0000>
>>>      ^
>>> /usr/local/include/ogg/config_types.h:1:3: warning: null character ignored [-Wnull-character]
>>> <U+0000><U+0000><U+0000>
>>>              ^
>>> . . .
>>> /usr/local/include/ogg/config_types.h:1:538: warning: null character ignored [-Wnull-character]
>>> . . . (nulls) . . .
>>> 
>>> So: 538 such null bytes.
>>> 
>>> Thus, another example of something like a page of nulls being
>>> written out when ZFS is in use.
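>>> 
>>> (For a quick check that a file's content is nothing but null bytes,
>>> something like the following works, using the path as the compiler
>>> reported it, i.e. as seen inside the build jail:
>>> 
>>> # tr -d '\0' < /usr/local/include/ogg/config_types.h | wc -c
>>> 
>>> which reports 0 bytes remaining when every byte is a NUL.)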
>>> 
>>> audio/gstreamer1-plugins-ogg also failed via referencing the file
>>> during its build.
>>> 
>>> (The bulk run is still going and there is one more bulk run to go.)
>>> 
>> 
>> Well, 538 happened to be the size of config_types.h -- both of the
>> corrupted copy and of config_types.h from a build that did not get
>> the corruption there.
>> 
>> So, looking at the other (later) corruption, which was in a bigger
>> file (via bulk -i, installing the package that contains the file,
>> but inspecting from outside the jail):
>> 
>> # find /usr/local/ -name "libtextstyle.so*" -exec ls -Tld {} \;
>> -rwxr-xr-x  1 root  wheel  2339104 Nov 20 01:05:05 2021 /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so.0.1.1
>> lrwxr-xr-x  1 root  wheel  21 Nov 20 01:05:05 2021 /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so.0 -> libtextstyle.so.0.1.1
>> lrwxr-xr-x  1 root  wheel  21 Nov 20 01:05:05 2021 /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so -> libtextstyle.so.0.1.1
>> 
>> # hd /usr/local/poudriere/data/.m/13_0R-CA7-default/ref/usr/local/lib/libtextstyle.so.0.1.1 | more
>> 00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
>> *
>> 0023b120
>> 
>> So the whole file, a little over 2 MiBytes (the hd end offset 0x23b120
>> is 2339104, matching the ls size), ended up as just null bytes.
>> 
>> To cross check on live system caching vs. on disk, I rebooted and redid the
>> bulk -i based install of libtextstyle and looked at libtextstyle.so.0.1.1 :
>> still all zeros.
>> 
>> For reference, zpool scrub afterward resulted in:
>> 
>> # zpool status
>> pool: zopt0
>> state: ONLINE
>> scan: scrub repaired 0B in 00:01:49 with 0 errors on Sat Nov 20 11:47:31 2021
>> config:
>> 
>>       NAME        STATE     READ WRITE CKSUM
>>       zopt0       ONLINE       0     0     0
>>         nda1p3    ONLINE       0     0     0
>> 
>> But it is not a ZFS redundancy context (ZFS is used just to have
>> bectl), so the scrub could not have repaired anything anyway; the 0
>> errors only indicate that the on-disk data matches the checksums of
>> whatever was written.
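>> 
>> (The bectl use here is just ordinary boot-environment switching,
>> roughly along the lines of:
>> 
>> # bectl list
>> # bectl activate <BE-name-for-the-build-to-boot>   # name is illustrative
>> # shutdown -r now
>> 
>> with one boot environment per installed system build.)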
> 
> Using bectl (on the root-on-ZFS Optane in PCIe slot),
> I booted stable/13 :
> 
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 13.0-STABLE FreeBSD 13.0-STABLE #13 stable/13-n248062-109330155000-dirty: Sat Nov 13 23:55:14 PST 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/13S-CA72-nodbg-clang/usr/13S-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1300520 1300520
> 
> and tried the sequence of 3 bulk runs:
> 
> There was no evidence of corruptions, suggesting that
> the Optane in the PCIe slot is not the source of the
> problem of having some file(s) end up with all bytes
> being null bytes.
> 
> So, overall, ending up with evidence of corruptions
> generated during bulk builds seems to be tied to main's
> [so: 14's] ZFS implementation in:
> 
> # uname -apKU
> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #18 main-n250455-890cae197737-dirty: Thu Nov  4 13:43:17 PDT 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1400040 1400040
> 
> because that is the only thing unique to the runs that
> showed evidence of corruptions.
> 
> Since there have been ZFS updates in main since then, it
> seems that the next experiment would be to update main
> and try again under main.


Given that the issue seems to be a ZFS issue, I updated to:

# uname -apKU
FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #21 main-n250903-06bd74e1e39c-dirty: Mon Nov 22 04:15:08 PST 2021     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72  arm64 aarch64 1400042 1400042

(which involved updating some ZFS material).
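 
(For reference, one way to see which ZFS bits changed between the two
kernels is something along these lines, run in the source tree implied
by the uname lines, using the hashes from the #19 and #21 kernels:

# cd /usr/main-src
# git log --oneline 20aa359773be..06bd74e1e39c -- sys/contrib/openzfs

This is just how I would spot-check what the update pulled in; I am not
listing the individual commits here.)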

I ran the sequence of 3 bulk's again: no evidence of
corruptions.

For reference:

The bulks targeting Cortex-A72 and Cortex-A53 each took
somewhat under 10 minutes longer than the earlier stable/13
and main [so: 14] runs that otherwise matched (including
the Optane used), out of bulks that each took somewhat over
6 hr either way.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)