Re: aarch64(?) poudiere-devel based builds seem to get fairly-rare corrupted files after recent system update(s?)
Date: Sat, 20 Nov 2021 06:20:40 UTC
On 2021-Nov-18, at 12:15, Mark Millard <marklmi@yahoo.com> wrote:

> On 2021-Nov-17, at 11:17, Mark Millard <marklmi@yahoo.com> wrote:
>
>> On 2021-Nov-15, at 15:43, Mark Millard <marklmi@yahoo.com> wrote:
>>
>>> On 2021-Nov-15, at 13:13, Mark Millard <marklmi@yahoo.com> wrote:
>>>
>>>> On 2021-Nov-15, at 12:51, Mark Millard <marklmi@yahoo.com> wrote:
>>>>
>>>>> On 2021-Nov-15, at 11:31, Mark Millard <marklmi@yahoo.com> wrote:
>>>>>
>>>>>> I updated from (shown a system that I've not updated yet):
>>>>>>
>>>>>> # uname -apKU
>>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #18 main-n250455-890cae197737-dirty: Thu Nov 4 13:43:17 PDT 2021 root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64
>>>>>> 1400040 1400040
>>>>>>
>>>>>> to:
>>>>>>
>>>>>> # uname -apKU
>>>>>> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #19 main-n250667-20aa359773be-dirty: Sun Nov 14 02:57:32 PST 2021 root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400042 1400042
>>>>>>
>>>>>> and then updated /usr/ports/ and started poudriere-devel based builds of
>>>>>> the ports I'd set up to use. However my last round of port builds from
>>>>>> a general update of /usr/ports/ was on 2021-10-23 before either of the
>>>>>> above.
>>>>>>
>>>>>> I've had at least two files that seem to be corrupted, where a later part
>>>>>> of the build hits problematical file(s) from earlier build activity.
>>>>>> For example:
>>>>>>
>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:1: warning: null character ignored [-Wnull-character]
>>>>>> <U+0000>
>>>>>> ^
>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:2: warning: null character ignored [-Wnull-character]
>>>>>> <U+0000><U+0000>
>>>>>> ^
>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:3: warning: null character ignored [-Wnull-character]
>>>>>> <U+0000><U+0000><U+0000>
>>>>>> ^
>>>>>> /usr/local/include/X11/extensions/XvMC.h:1:4: warning: null character ignored [-Wnull-character]
>>>>>> <U+0000><U+0000><U+0000><U+0000>
>>>>>> ^
>>>>>> . . .
>>>>>>
>>>>>> Removing the xorgproto-2021.4 package and rebuilding via
>>>>>> poudriere-devel did not get a failure of any ports dependent
>>>>>> on it.
>>>>>>
>>>>>> This was from a use of:
>>>>>>
>>>>>> # poudriere jail -j13_0R-CA7 -i
>>>>>> Jail name: 13_0R-CA7
>>>>>> Jail version: 13.0-RELEASE-p5
>>>>>> Jail arch: arm.armv7
>>>>>> Jail method: null
>>>>>> Jail mount: /usr/obj/DESTDIRs/13_0R-CA7-poud
>>>>>> Jail fs:
>>>>>> Jail updated: 2021-11-04 01:48:49
>>>>>> Jail pkgbase: disabled
>>>>>>
>>>>>> but another not-investigated example was from:
>>>>>>
>>>>>> # poudriere jail -j13_0R-CA72 -i
>>>>>> Jail name: 13_0R-CA72
>>>>>> Jail version: 13.0-RELEASE-p5
>>>>>> Jail arch: arm64.aarch64
>>>>>> Jail method: null
>>>>>> Jail mount: /usr/obj/DESTDIRs/13_0R-CA72-poud
>>>>>> Jail fs:
>>>>>> Jail updated: 2021-11-04 01:48:01
>>>>>> Jail pkgbase: disabled
>>>>>>
>>>>>> (so no 32-bit COMPAT involved). The apparent corruption
>>>>>> was in a different port (autoconf, noticed by the
>>>>>> build of automake failing via config reporting
>>>>>> /usr/local/share/autoconf-2.69/autoconf/autoconf.m4f
>>>>>> being rejected).
>>>>>>
>>>>>> /usr/obj/DESTDIRs/13_0R-CA7-poud/ and
>>>>>> /usr/obj/DESTDIRs/13_0R-CA72-poud/ and the like track the
>>>>>> system versions.
>>>>>>
>>>>>> The media is an Optane 960 in the PCIe slot of a HoneyComb
>>>>>> (16 Cortex-A72's).
>>>>>> The context is a root on ZFS one, ZFS
>>>>>> used in order to have bectl, not redundancy.
>>>>>>
>>>>>> The ThreadRipper 1950X (so amd64) port builds did not give
>>>>>> evidence of such problems based on the updated system. (Also
>>>>>> Optane media in a PCIe slot, also root on ZFS.) But the
>>>>>> errors seem rare enough to not be able to conclude much.
>>>>>
>>>>> For aarch64 targeting aarch64 there was also this
>>>>> explicit corruption notice during the poudriere(-devel)
>>>>> bulk build:
>>>>>
>>>>> . . .
>>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3: .........
>>>>> pkg-static: Fail to extract /usr/local/libexec/gcc/arm-none-eabi/8.4.0/lto1 from package: Lzma library error: Corrupted input data
>>>>> [CA72_ZFS] Extracting arm-none-eabi-gcc-8.4.0_3... done
>>>>>
>>>>> Failed to install the following 1 package(s): /packages/All/arm-none-eabi-gcc-8.4.0_3.pkg
>>>>> *** Error code 1
>>>>> Stop.
>>>>> make: stopped in /usr/ports/sysutils/u-boot-orangepi-plus-2e
>>>>>
>>>>> I'm not yet to the point of retrying after removing
>>>>> arm-none-eabi-gcc-8.4.0_3 : other things are being built.
>>>>
>>>>
>>>> Another context with my prior general update of /usr/ports/
>>>> and the matching port builds: Back then I used USE_TMPFS=all
>>>> but the failure is based on USE_TMPFS="data" instead. So:
>>>> lots more I/O.
>>>>
>>>
>>> None of the 3 corruptions repeated during bulk builds that
>>> retried the builds that generated the files. All of the
>>> ports that failed by hitting the corruptions in what they
>>> depended on, built fine in the retries.
>>>
>>> For reference:
>>>
>>> I'll note that, back when I was using USE_TMPFS=all , I also
>>> did some separate bulk -a test runs, both aarch64 (Cortex-A72)
>>> native and Cortex-A72 targeting Cortex-A7 (armv7). None of
>>> those showed evidence of file corruptions. In general I've
>>> not had previous file corruptions with this system.
>>> (There was a little more than 245 GiBytes swap, which covered the
>>> tmpfs needs when they were large.)
>>
>>
>> I set up a contrasting test context and got no evidence of
>> corruptions in that context. (Note: the 3 bulk builds
>> total to around 24 hrs of activity for the 3 examples
>> of 460+ ports building.) So, for the Cortex-A72 system,
>
> I set up a UFS on Optane (U.2 via M.2 adapter) context and
> also got no evidence of corruptions in that context (same
> hardware and a copy of the USB3 SSD based system). The
> sequence of 3 bulks took somewhat over 18 hrs using the
> Optane.
>
>> root on UFS on portable USB3 SSD: no evidence of corruptions
> Also:
> root on UFS on Optane U.2 via M.2: no evidence of corruptions
>> vs.:
>> root on ZFS on Optane in PCIe slot: solid evidence of 3 known corruptions
>>
>> Both had USE_TMPFS="data" in use. The same system build
>> had been installed and booted for both tests.
>>
>> The evidence of corruptions is rare enough for this not to
>> be determinative, but it is suggestive.
>>
>> Unfortunately, ZFS vs. UFS and Optane-in-PCIe vs. USB3 are
>> not differentiated by this test result.
>>
>> There is also the result that I've not seen evidence of
>> corruptions on the ThreadRipper 1950X (amd64) system.
>> Again, not determinative, but suggestive, given how rare
>> the corruptions seem to be.
>
> So far the only things unique to the observed corruptions are:
>
> root on ZFS context (vs. root on UFS)
> and:
> Optane in a PCIe slot (but no contrasting ZFS case tested)
>
> The PCIe slot does not seem to me to be likely to be contributing.
> So this seems to be suggestive of a ZFS problem.
>
> A contributing point might be that the main [so: 14] system was
> built via -mcpu=cortex-a72 for execution on a Cortex-A72 system.
>
> [I previously ran into a USB subsystem mishandling of keeping
> things coherent for the weak memory ordering in this sort of
> context. That issue was fixed.
> But back then I was lucky enough
> to be able to demonstrate fails vs. works by adding an
> appropriate instruction to FreeBSD in a few specific places
> (more than necessary as it turned out). Someone else determined
> where the actual mishandling was that covered all required
> places. My generating that much information in this context
> seems unlikely.]

I started a retry of root-on-ZFS with the Optane-in-PCIe-slot
media and it got its first corruption (in a different place,
2nd bulk build this time). The use of the corrupted file
reports:

configure:13269: cc -o conftest -Wall -Wextra -fsigned-char -Wdeclaration-after-statement -O2 -pipe -mcpu=cortex-a53 -g -fstack-protector-strong -fno-strict-aliasing -DUSE_MEMORY_H -I/usr/local/include -mcpu=cortex-a53 -fstack-protector-strong conftest.c -L/usr/local/lib -logg >&5
In file included from conftest.c:27:
In file included from /usr/local/include/ogg/ogg.h:24:
In file included from /usr/local/include/ogg/os_types.h:154:
/usr/local/include/ogg/config_types.h:1:1: warning: null character ignored [-Wnull-character]
<U+0000>
^
/usr/local/include/ogg/config_types.h:1:2: warning: null character ignored [-Wnull-character]
<U+0000><U+0000>
^
/usr/local/include/ogg/config_types.h:1:3: warning: null character ignored [-Wnull-character]
<U+0000><U+0000><U+0000>
^
. . .
/usr/local/include/ogg/config_types.h:1:538: warning: null character ignored [-Wnull-character]
. . . (nulls) . . .

So: 538 such null bytes. Thus, another example of something
like a page of nulls being written out when ZFS is in use.

audio/gstreamer1-plugins-ogg also failed via referencing the
file during its build.

(The bulk run is still going and there is one more bulk run
to go.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)
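
For reference, the two header corruptions reported above (XvMC.h and
config_types.h) both showed up as a run of null bytes at the start of
the file. Below is a minimal Python sketch of one way to scan an
installed tree for that signature before a dependent build trips over
it. It assumes the corruption always appears as leading NULs, which the
thread only shows for the header cases (the autoconf.m4f and lto1
failures were not examined at that level). The function names and the
16-byte threshold are hypothetical choices for illustration, not part
of poudriere or pkg.

```python
import os


def leading_nul_run(path, chunk_size=65536):
    """Return the number of consecutive NUL bytes at the start of a file."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # End of file: the file was empty or all NULs so far.
                return count
            stripped = chunk.lstrip(b"\x00")
            count += len(chunk) - len(stripped)
            if stripped:
                # Hit the first non-NUL byte; the run ends here.
                return count


def find_suspect_files(root, min_run=16):
    """Yield (path, run_length) for regular files under root that begin
    with at least min_run NUL bytes.

    min_run is an arbitrary illustrative threshold (an assumption).
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                run = leading_nul_run(path)
            except OSError:
                # Unreadable file: ignore it and keep scanning.
                continue
            if run >= min_run:
                yield path, run
```

Something like `for path, run in find_suspect_files("/usr/local/include"): print(run, path)`
would list candidates. Text files such as headers should never begin
with NULs, but some binary formats legitimately can, so hits outside
header or script directories would need a manual look.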