Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75

From: Cy Schubert <Cy.Schubert_at_cschubert.com>
Date: Thu, 13 Apr 2023 13:33:21 UTC
On Thu, 13 Apr 2023 19:54:42 +0900
Paweł Jakub Dawidek <pawel@dawidek.net> wrote:

> On Apr 13, 2023, at 16:10, Cy Schubert <Cy.Schubert@cschubert.com> wrote:
> > 
> > In message <20230413070426.8A54F25A@slippy.cwsent.com>, Cy Schubert writes:
> > In message <20230413064252.1E5C1318@slippy.cwsent.com>, Cy Schubert writes:
> >> In message <A291C24C-9D7C-4E79-AD03-68ED910FC2DE@yahoo.com>, Mark Millard writes:
> >>> [This just puts my prior reply's material into Cy's
> >>>> adjusted resend of the original. The To/Cc should
> >>>> be complete this time.]
> >>>> 
> >>>> On Apr 12, 2023, at 22:52, Cy Schubert <Cy.Schubert@cschubert.com> wrote:
> >>>> 
> >>>>> In message <C8E4A43B-9FC8-456E-ADB3-13E7F40B2B04@yahoo.com>, Mark Millard writes:
> >>>>>> From: Charlie Li <vishwin_at_freebsd.org> wrote on
> >>>>>> Date: Wed, 12 Apr 2023 20:11:16 UTC :
> >>>>>>
> >>>>>> Charlie Li wrote:
> >>>>>>> Mateusz Guzik wrote:
> >>>>>>>> can you please test poudriere with
> >>>>>>>>> https://github.com/openzfs/zfs/pull/14739/files
> >>>>>>>>
> >>>>>>> After applying, on the md(4)-backed pool regardless of block_cloning,
> >>>>>>> the cy@ `cp -R` test reports no differing (ie corrupted) files. Will
> >>>>>>> report back on poudriere results (no block_cloning).
> >>>>>>>
> >>>>>>> As for poudriere, build failures are still rolling in. These are (and
> >>>>>>> have been) entirely random on every run. Some examples from this run:
> >>>>>>>
> >>>>>>> lang/php81:
> >>>>>>> - post-install: @${INSTALL_DATA} ${WRKSRC}/php.ini-development ${WRKSRC}/php.ini-production ${WRKDIR}/php.conf ${STAGEDIR}/${PREFIX}/etc
> >>>>>>> - consumers fail to build due to corrupted php.conf packaged
> >>>>>>>
> >>>>>>> devel/ninja:
> >>>>>>> - phase: stage
> >>>>>>> - install -s -m 555 /wrkdirs/usr/ports/devel/ninja/work/ninja-1.11.1/ninja /wrkdirs/usr/ports/devel/ninja/work/stage/usr/local/bin
> >>>>>>> - consumers fail to build due to corrupted bin/ninja packaged
> >>>>>>>
> >>>>>>> devel/netsurf-buildsystem:
> >>>>>>> - phase: stage
> >>>>>>> - mkdir -p /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/testtools
> >>>>>>> for M in Makefile.top Makefile.tools Makefile.subdir Makefile.pkgconfig Makefile.clang Makefile.gcc Makefile.norcroft Makefile.open64; do \
> >>>>>>> cp makefiles/$M /wrkdirs/usr/ports/devel/netsurf-buildsystem/work/stage/usr/local/share/netsurf-buildsystem/makefiles/; \
> >>>>>>> done
> >>>>>>> - graphics/libnsgif fails to build due to NUL characters in
> >>>>>>> Makefile.{clang,subdir}, causing nothing to link
> >>>>>>
> >>>>>> Summary: I have problems building ports into packages
> >>>>>> via poudriere-devel use despite being fully updated/patched
> >>>>>> (as of when I started the experiment), never having enabled
> >>>>>> block_cloning ( still using openzfs-2.1-freebsd ).
> >>>>>>
> >>>>>> In other words, I can confirm other reports that have
> >>>>>> been made.
> >>>>>>
> >>>>>> The details follow.
> >>>>>>
> >>>>>>
> >>>>>> [Written as I was working on setting up for the experiments
> >>>>>> and then executing those experiments, adjusting as I went
> >>>>>> along.]
> >>>>>>
> >>>>>> I've run my own tests in a context that has never had the
> >>>>>> zpool upgrade and that jump from before the openzfs import to
> >>>>>> after the existing commits for trying to fix openzfs on
> >>>>>> FreeBSD. I report on the sequence of activities getting to
> >>>>>> the point of testing as well.
> >>>>>>
> >>>>>> By personal policy I keep my (non-temporary) pools compatible
> >>>>>> with what the most recent ??.?-RELEASE supports, using
> >>>>>> openzfs-2.1-freebsd for now. The pools involved below have
> >>>>>> never had a zpool upgrade from where they started. (I've no
> >>>>>> pools that have ever had a zpool upgrade.)
> >>>>>>
> >>>>>> (Temporary pools are rare for me, such as this investigation.
> >>>>>> But I'm not testing block_cloning or anything new this time.)
> >>>>>>
> >>>>>> I'll note that I use zfs for bectl, not for redundancy. So
> >>>>>> my evidence is more limited in that respect.
> >>>>>>
> >>>>>> The activities were done on a HoneyComb (16 Cortex-A72 cores).
> >>>>>> The system has and supports ECC RAM; 64 GiBytes of RAM are
> >>>>>> present.
> >>>>>>
> >>>>>> I started by duplicating my normal zfs environment to an
> >>>>>> external USB3 NVMe drive and adjusting the host name and such
> >>>>>> to produce the below. (Non-debug, although I do not strip
> >>>>>> symbols.) :
> >>>>>>
> >>>>>> # uname -apKU
> >>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #90 main-n261544-cee09bda03c8-dirty: Wed Mar 15 20:25:49 PDT 2023     root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400082 1400082
> >>>>>>
> >>>>>> I then did: git fetch, stash push ., merge --ff-only, stash apply . :
> >>>>>> my normal procedure. I then also applied the patch from:
> >>>>>>
> >>>>>> https://github.com/openzfs/zfs/pull/14739/files
> >>>>>>
> >>>>>> Then I did: buildworld buildkernel, install them, and rebooted.
> >>>>>>
> >>>>>> The result was:
> >>>>>>
> >>>>>> # uname -apKU
> >>>>>> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #91 main-n262122-2ef2c26f3f13-dirty: Wed Apr 12 19:23:35 PDT 2023     root@CA72_4c8G_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72 arm64 aarch64 1400086 1400086
> >>>>>>
> >>>>>> The later poudriere-devel based build of packages from ports is
> >>>>>> based on:
> >>>>>>
> >>>>>> # ~/fbsd-based-on-what-commit.sh -C /usr/ports
> >>>>>> 4e94ac9eb97f (HEAD -> main, freebsd/main, freebsd/HEAD) devel/freebsd-gcc12: Bump to 12.2.0.
> >>>>>> Author:     John Baldwin <jhb@FreeBSD.org>
> >>>>>> Commit:     John Baldwin <jhb@FreeBSD.org>
> >>>>>> CommitDate: 2023-03-25 00:06:40 +0000
> >>>>>> branch: main
> >>>>>> merge-base: 4e94ac9eb97fab16510b74ebcaa9316613182a72
> >>>>>> merge-base: CommitDate: 2023-03-25 00:06:40 +0000
> >>>>>> n613214 (--first-parent --count for merge-base)
> >>>>>>
> >>>>>> poudriere attempted to build 476 packages, starting
> >>>>>> with pkg (in order to build the 56 that I explicitly
> >>>>>> indicate that I want). It is my normal set of ports.
> >>>>>> The form of building is biased to allowing a high
> >>>>>> load average compared to the number of hardware
> >>>>>> threads (same as cores here): each builder is allowed
> >>>>>> to use the full count of hardware threads. The build
> >>>>>> used USE_TMPFS="data" instead of the USE_TMPFS=all I
> >>>>>> normally use on the build machine involved.
> >>>>>>
> >>>>>> And it produced some random errors during the attempted
> >>>>>> builds. A type of example that is easy to interpret
> >>>>>> without further exploration is:
> >>>>>>
> >>>>>> pkg_resources.extern.packaging.requirements.InvalidRequirement: Parse error at "'\x00\x00\x00\x00\x00\x00\x00\x00'": Expected W:(0-9A-Za-z)
> >>>>>>     0
> >>         da0p8     ONLINE       0     0     0
> >>>>>>
> >>>>>> errors: No known data errors
> >>>>>>
> >>>>>>
> >>>>>> ===
> >>>>>> Mark Millard
> >>>>>> marklmi at yahoo.com
> >>>>>>
> >>>>>
> >>>>> Let's try this again. Claws-mail didn't include the list address in the
> >>>>> header. Trying to reply, again, using exmh instead.
> >>>>>
> >>>>>
> >>>>> Did your pools suffer the EXDEV problem? The EXDEV also corrupted files.
> >>>> 
> >>>> As I reported, this was a jump from before the import
> >>>> to as things are tonight (here). So: NO, unless the
> >>>> existing code as of tonight still has the EXDEV problem!
> >>>> 
> >>>> Prior to this experiment I'd not progressed any media
> >>>> beyond: main-n261544-cee09bda03c8-dirty Wed Mar 15 20:25:49.
> >>>> 
> >>>> I think, without sufficient investigation we risk jumping to
> >>>>> conclusions. I've taken an extremely cautious approach, rolling back
> >>>>> snapshots (as much as possible, i.e. poudriere datasets) when EXDEV
> >>>>> corruption was encountered.
> >>>>> 
> >>>> Again: nothing between main-n261544-cee09bda03c8-dirty and
> >>>> main-n262122-2ef2c26f3f13-dirty was involved at any stage.
> >>>> 
> >>>>>
> >>>>> I did not rollback any snapshots in my MH mail directory. Rolling back
> >>>>> snapshots of my MH maildir would result in loss of email. I have to
> >>>>> live with that corruption. Corrupted files in my outgoing sent email
> >>>>> directory remain:
> >>>>>
> >>>>> slippy$ ugrep -cPa '\x00' ~/.Mail/note | grep -c :1
> >>>>> 53
> >>>>> slippy$
> >>>>>
> >>>>> There are 53 corrupted files in my note log of 9913 emails. Those files
> >>>>> will never be fixed. They were corrupted by the EXDEV bug. Any new ZFS
> >>>>> or ZFS patches cannot retroactively remove the corruption from those
> >>>>> files.
> >>>>>
> >>>>> But my poudriere files, because the snapshots were rolled back, were
> >>>>> "repaired" by the rolled back snapshots.
> >>>>>
> >>>>> I'm not convinced that there is presently active corruption since
> >>>>> the problem has been fixed. I am convinced that whatever corruption
> >>>>> that was written at the time will remain forever or until those files
> >>>>> are deleted or replaced -- just like my email files written to disk at
> >>>>> the time.
> >>>>> 
> >>>> My test results and procedure just do not fit your conclusion
> >>>> that things are okay now if block_cloning is completely avoided.
> >>>> 
> >>> Admitting I'm wrong: sending copies of my last reply to you back to myself,
> >>> 
> >> again and again, three times, I've managed to reproduce the corruption you
> >>> are talking about.
> >>> 
> >> This email itself was also corrupted. Below is what was sent. Good thing
> >> multiple copies are saved by exmh.
> >> 
> >> Admitting I'm wrong: sending copies of my last reply to you back to myself,
> >> again and again, three times, I've managed to reproduce the corruption you
> >> are talking about.
> >> 
> > This email itself was also corrupted. Below is what was sent. Good thing
> > multiple copies are saved by exmh.
> > 
> > Admitting I'm wrong: sending copies of my last reply to you back to myself,
> > again and again, three times, I've managed to reproduce the corruption you
> > are talking about.
> > 
> > From my previous email to you.
> > 
> > header. Trying to reply:::::::::, again, using exmh instead.
> >                       ^^^^^^^^^
> > Here it is, nine additional bytes of garbage. I've replaced the garbage
> > with colons because nulls mess up a lot of things, including cut&paste.
> > 
> > In another instance about 500 bytes were removed. I can reproduce the
> > corruption at will now.
> > 
> > The EXDEV patch is applied. Block_cloning is disabled.
> > 
> > Somehow nulls and other garbage are inserted in the middle of emails after
> > the ZFS upgrade.
> > 
> Can you please try this patch:
> 
> github.com

The patch was applied yesterday at noon (PDT).

> 
> 
> 
> Unfortunately I don’t see how this can happen with block cloning disabled.

It does and it's reproducible.

> 
> -- 
> Paweł Jakub Dawidek
> 



-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0