Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75

From: Cy Schubert <Cy.Schubert_at_cschubert.com>
Date: Thu, 13 Apr 2023 05:47:33 UTC

On Wed, 12 Apr 2023 22:28:13 -0700
Mark Millard <marklmi@yahoo.com> wrote:

> From: Charlie Li <vishwin_at_freebsd.org> wrote on
> Date: Wed, 12 Apr 2023 20:11:16 UTC :
> 
> > Charlie Li wrote:  
> > > Mateusz Guzik wrote:  
> > >> can you please test poudriere with
> > >> https://github.com/openzfs/zfs/pull/14739/files
> > >>  
> > > After applying, on the md(4)-backed pool regardless of block_cloning, 
> > > the cy@ `cp -R` test reports no differing (ie corrupted) files. Will 
> > > report back on poudriere results (no block_cloning).
> > >   
> > As for poudriere, build failures are still rolling in. These are (and 
> > have been) entirely random on every run. Some examples from this run:
> > 
> > lang/php81:
> > - post-install: @${INSTALL_DATA} ${WRKSRC}/php.ini-development 
> > ${WRKSRC}/php.ini-production ${WRKDIR}/php.conf ${STAGEDIR}/${PREFIX}/etc
> > - consumers fail to build due to corrupted php.conf packaged
> > 
> > devel/ninja:
> > - phase: stage
> > - install -s -m 555 
> > /wrkdirs/usr/ports/devel/ninja/work/ninja-1.11.1/ninja 
> > /wrkdirs/usr/ports/devel/ninja/work/stage/usr/local/bin
> > - consumers fail to build due to corrupted bin/ninja packaged
> > 
> > devel/netsurf-buildsystem:
> > - phase: stage
> > - mkdir -p 
> > /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles 
> > /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/testtools
> > for M in Makefile.top Makefile.tools Makefile.subdir Makefile.pkgconfig 
> > Makefile.clang Makefile.gcc Makefile.norcroft Makefile.open64; do \
> > cp makefiles/$M 
> > /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles/; 
> > \
> > done
> > - graphics/libnsgif fails to build due to NUL characters in 
> > Makefile.{clang,subdir}, causing nothing to link  
> 
> Summary: I have problems building ports into packages
> via poudriere-devel use despite being fully updated/patched
> (as of when I started the experiment), never having enabled
> block_cloning ( still using openzfs-2.1-freebsd ).
> 
> In other words, I can confirm other reports that have
> been made.
> 
> The details follow.
> 
> 
> [Written as I was working on setting up for the experiments
> and then executing those experiments, adjusting as I went
> along.]
> 
> I've run my own tests in a context that has never had a
> zpool upgrade and that jumped from before the openzfs import to
> after the existing commits for trying to fix openzfs on
> FreeBSD. I report on the sequence of activities getting to
> the point of testing as well.
> 
> By personal policy I keep my (non-temporary) pools compatible
> with what the most recent ??.?-RELEASE supports, using
> openzfs-2.1-freebsd for now. The pools involved below have
> never had a zpool upgrade from where they started. (I've no
> pools that have ever had a zpool upgrade.)
> 
> (Temporary pools are rare for me, such as this investigation.
> But I'm not testing block_cloning or anything new this time.)
> 
> I'll note that I use zfs for bectl, not for redundancy. So
> my evidence is more limited in that respect.
> 
> The activities were done on a HoneyComb (16 Cortex-A72 cores).
> The system has and supports ECC RAM, 64 GiBytes of RAM are
> present.
> 
> I started by duplicating my normal zfs environment to an
> external USB3 NVMe drive and adjusting the host name and such
> to produce the below. (Non-debug, although I do not strip
> symbols.) :
> 
> # uname -apKU
> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #90 main-n261544-cee09bda03c8-dirty: Wed Mar 15 20:25:49 PDT 2023     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400082 1400082
> 
> I then did: git fetch, stash push ., merge --ff-only, stash apply . :
> my normal procedure. I then also applied the patch from:
> 
> https://github.com/openzfs/zfs/pull/14739/files
> 
> Then I did: buildworld buildkernel, install them, and rebooted.
> 
> The result was:
> 
> # uname -apKU
> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #91 main-n262122-2ef2c26f3f13-dirty: Wed Apr 12 19:23:35 PDT 2023     root@CA72_4c8G_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400086 1400086
> 
> The later poudriere-devel based build of packages from ports is
> based on:
> 
> # ~/fbsd-based-on-what-commit.sh -C /usr/ports
> 4e94ac9eb97f (HEAD -> main, freebsd/main, freebsd/HEAD) devel/freebsd-gcc12: Bump to 12.2.0.
> Author:     John Baldwin <jhb@FreeBSD.org>
> Commit:     John Baldwin <jhb@FreeBSD.org>
> CommitDate: 2023-03-25 00:06:40 +0000
> branch: main
> merge-base: 4e94ac9eb97fab16510b74ebcaa9316613182a72
> merge-base: CommitDate: 2023-03-25 00:06:40 +0000
> n613214 (--first-parent --count for merge-base)
> 
> poudriere attempted to build 476 packages, starting
> with pkg (in order to build the 56 that I explicitly
> indicate that I want). It is my normal set of ports.
> The form of building is biased to allowing a high
> load average compared to the number of hardware
> threads (same as cores here): each builder is allowed
> to use the full count of hardware threads. The build
> used USE_TMPFS="data" instead of the USE_TMPFS=all I
> normally use on the build machine involved.
> 
> And it produced some random errors during the attempted
> builds. A type of example that is easy to interpret
> without further exploration is:
> 
> pkg_resources.extern.packaging.requirements.InvalidRequirement: Parse error at "'\x00\x00\x00\x00\x00\x00\x00\x00'": Expected W:(0-9A-Za-z)
> 
> A fair number of errors are of the form: the build
> installs a previously built package for use in the
> builder, but later the builder cannot find some file
> from the package's installation.
> 
> Another error reported was:
> 
> ld: error: /usr/local/lib/libblkid.a: unknown file type
> 
> For reference:
> 
> [main-CA72-bulk_a-default] [2023-04-12_20h45m32s] [committing:] Queued: 476 Built: 252 Failed: 11  Skipped: 213 Ignored: 0   Fetched: 0   Tobuild: 0    Time: 00:37:52
> 
> I started another build that tried to build 224 packages:
> the 11 failed and 213 skipped.
> 
> Just 1 package built that failed before:
> 
> [00:04:58] [09] [00:04:15] Finished databases/sqlite3@default | sqlite3-3.41.0_1,1: Success
> 
> It seems to be the only one where the original failure was not
> an example of complaining about the missing/corrupted content
> of a package install used for building. So it is an example
> of randomly varying behavior.
> 
> That, in turn, allowed:
> 
> [00:04:58] [01] [00:00:00] Building security/nss | nss-3.89
> 
> to build but everything else failed or was skipped.
> 
> The sqlite3 vs. other failure difference suggests that writes
> have random problems but later reads reliably see the problem
> that resulted (before the content is deleted).
> 
> 
> After the above:
> 
> # zpool status
>   pool: zroot
>  state: ONLINE
> config:
> 
>         NAME        STATE     READ WRITE CKSUM
>         zroot       ONLINE       0     0     0
>           da0p8     ONLINE       0     0     0
> 
> errors: No known data errors
> 
> # zpool scrub zroot
> # zpool status
>   pool: zroot
>  state: ONLINE
>   scan: scrub repaired 0B in 00:16:25 with 0 errors on Wed Apr 12 22:15:39 2023
> config:
> 
>         NAME        STATE     READ WRITE CKSUM
>         zroot       ONLINE       0     0     0
>           da0p8     ONLINE       0     0     0
> 
> errors: No known data errors
> 
> 
> ===
> Mark Millard
> marklmi at yahoo.com

Did your pools suffer the EXDEV problem? The EXDEV bug also corrupted files.

I think that, without sufficient investigation, we risk jumping to
conclusions. I've taken an extremely cautious approach, rolling back
snapshots (as much as possible, i.e. poudriere datasets) when EXDEV
corruption was encountered.

I did not roll back any snapshots in my MH mail directory. Rolling back
snapshots of my MH maildir would result in loss of email. I have to
live with that corruption. Corrupted files in my outgoing sent email
directory remain:

slippy$ ugrep -cPa '\x00' ~/.Mail/note | grep -c :1 
53
slippy$ 

There are 53 corrupted files in my note log of 9913 emails. Those files
will never be fixed. They were corrupted by the EXDEV bug. No new ZFS
version or ZFS patch can retroactively remove the corruption from those
files.
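
For anyone who wants to run the same kind of check without ugrep, here is
a minimal sketch in Python. It is only an illustration: the function names
contains_nul and files_with_nul are made up for this example, and it
assumes that a NUL byte in a file is a sign of corruption, which holds for
plain-text mail but not for legitimately binary files. It reports every
regular file under a directory that contains at least one NUL byte.

```python
import os


def contains_nul(path, chunk_size=1 << 20):
    """Return True if the file at path contains at least one NUL byte."""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return False
            if b"\x00" in chunk:
                return True


def files_with_nul(root):
    """Return a sorted list of regular files under root that contain a NUL byte.

    Assumption (not from the original mail): a NUL byte marks a damaged file.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip sockets, FIFOs, dangling symlinks
            try:
                if contains_nul(path):
                    hits.append(path)
            except OSError:
                continue  # unreadable file: leave it out rather than guess
    return sorted(hits)
```

Calling len(files_with_nul(os.path.expanduser("~/.Mail/note"))) would give
a count along the lines of the one above, though it counts a file once no
matter how many of its lines contain NULs.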

But my poudriere files, because the snapshots were rolled back, were
"repaired" by the rolled back snapshots.

I'm not convinced that there is presently active corruption since
the problem has been fixed. I am convinced that whatever corruption
was written at the time will remain forever or until those files
are deleted or replaced -- just like my email files written to disk at
the time.
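
One way to tell old damage from active corruption is to separate the
NUL-containing files by when they were last written. The following Python
sketch does that. It is only a sketch under stated assumptions: the
function name nul_files_split_by_mtime and the cutoff parameter are
hypothetical, the cutoff would be whatever time the fixed kernel was first
booted, and modification times are assumed not to have been changed by a
later touch or restore. Files in the "newer" list would deserve a closer
look; files in the "older" list are consistent with corruption written
before the fix.

```python
import os


def nul_files_split_by_mtime(root, cutoff):
    """Split NUL-containing files under root by modification time.

    cutoff is a time in seconds since the epoch (hypothetical parameter,
    e.g. when the patched kernel was booted). Returns (older, newer):
    sorted lists of files last modified before the cutoff and at or
    after it.
    """
    older, newer = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.isfile(path):
                    continue
                # Reads the whole file; fine for mail-sized files.
                with open(path, "rb") as fh:
                    data = fh.read()
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # unreadable file: skip it
            if b"\x00" not in data:
                continue
            if mtime >= cutoff:
                newer.append(path)
            else:
                older.append(path)
    return sorted(older), sorted(newer)
```

An empty "newer" list does not prove that nothing is wrong, since it only
covers files that happen to contain NULs, but a non-empty one would be
evidence worth reporting.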

-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0