Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sat, 15 Apr 2023 17:44:27 UTC
On Apr 15, 2023, at 07:36, Cy Schubert <Cy.Schubert@cschubert.com> wrote:

> In message <20230415115452.08911bb7@thor.intern.walstatt.dynvpn.de>,
> FreeBSD User writes:
>> Am Thu, 13 Apr 2023 22:18:04 -0700
>> Mark Millard <marklmi@yahoo.com> schrieb:
>> 
>>> On Apr 13, 2023, at 21:44, Charlie Li <vishwin@freebsd.org> wrote:
>>> 
>>>> Mark Millard wrote:  
>>>>> FYI: in my original report for a context that has never had
>>>>> block_cloning enabled, I reported BOTH missing files and
>>>>> file content corruption in the poudriere-devel bulk build
>>>>> testing. This predates:
>>>>> https://people.freebsd.org/~pjd/patches/brt_revert.patch
>>>>> but had the changes from:
>>>>> https://github.com/openzfs/zfs/pull/14739/files
>>>>> The files were missing from packages installed to be used
>>>>> during a port's build. No other types of examples of missing
>>>>> files happened. (But only 11 ports failed.)  
>>>> I also don't have block_cloning enabled. "Missing files" prior to brt_revert may actually
>>>> be present, but as the corruption also messes with the file(1) signature, some tools like
>>>> ldconfig report them as missing.
>>> 
>>> For reference, the specific messages that were not explicit
>>> null-byte complaints were (some shown with a little context):
>>> 
>>> 
>>> ===>   py39-lxml-4.9.2 depends on shared library: libxml2.so - not found
>>> ===>   Installing existing package /packages/All/libxml2-2.10.3_1.pkg  
>>> [CA72_ZFS] Installing libxml2-2.10.3_1...
>>> [CA72_ZFS] Extracting libxml2-2.10.3_1: .......... done
>>> ===>   py39-lxml-4.9.2 depends on shared library: libxml2.so - found
>>> (/usr/local/lib/libxml2.so) . . .
>>> [CA72_ZFS] Extracting libxslt-1.1.37: .......... done
>>> ===>   py39-lxml-4.9.2 depends on shared library: libxslt.so - found
>>> (/usr/local/lib/libxslt.so) ===>   Returning to build of py39-lxml-4.9.2  
>>> . . .
>>> ===>  Configuring for py39-lxml-4.9.2  
>>> Building lxml version 4.9.2.
>>> Building with Cython 0.29.33.
>>> Error: Please make sure the libxml2 and libxslt development packages are installed.
>>> 
>>> 
>>> [CA72_ZFS] Extracting libunistring-1.1: .......... done
>>> ===>   libidn2-2.3.4 depends on shared library: libunistring.so - not found
>> 
>>> 
>>> 
>>> [CA72_ZFS] Extracting gmp-6.2.1: .......... done
>>> ===>   mpfr-4.2.0,1 depends on shared library: libgmp.so - not found  
>>> 
>>> 
>>> ===>   nettle-3.8.1 depends on shared library: libgmp.so - not found
>>> ===>   Installing existing package /packages/All/gmp-6.2.1.pkg  
>>> [CA72_ZFS] Installing gmp-6.2.1...
>>> the most recent version of gmp-6.2.1 is already installed
>>> ===>   nettle-3.8.1 depends on shared library: libgmp.so - not found  
>>> *** Error code 1
>>> 
>>> 
>>> autom4te: error: need GNU m4 1.4 or later: /usr/local/bin/gm4
>>> 
>>> 
>>> checking for GNU M4 that supports accurate traces... configure: error: no acceptable m4 could be found in
>>> $PATH. GNU M4 1.4.6 or later is required; 1.4.16 or newer is recommended.
>>> GNU M4 1.4.15 uses a buggy replacement strstr on some systems.
>>> Glibc 2.9 - 2.12 and GNU M4 1.4.11 - 1.4.15 have another strstr bug.
>>> 
>>> 
>>> ld: error: /usr/local/lib/libblkid.a: unknown file type
>>> 
>>> 
>>> ===
>>> Mark Millard
>>> marklmi at yahoo.com
>>> 
>>> 
>> 
>> Hello 
>> 
>> What is the current status of fixing/mitigating this disastrous bug? Especially for those with the
>> new option enabled on ZFS pools. Any advice?
>> 
>> As a precaution (or call it panic) I shut down several servers to prevent irreversible
>> damage to databases and data storage. On one host with /usr/ports residing on ZFS we
>> always see errors on the same files created while staging (using portmaster, which leaves the
>> system with software not installed, i.e. www/apache24 in our case). Deleting the work folder
>> doesn't seem to change anything, even after starting a scrub of the entire pool (RAIDZ1 pool).
>> The cause is unknown, as is why it is always the same files that get corrupted. Same with
>> devel/ruby-gems.
>> 
>> Poudriere has been shut down for the time being to avoid further issues.
>> 
>> Is there any advice on how to proceed apart from conserving the boxes via shutdown?
>> 
>> Thank you ;-)
>> oh
>> 
>> 
>> 
>> -- 
>> O. Hartmann
> 
> With an up-to-date tree + pjd@'s "Fix data corruption when cloning embedded 
> blocks. #14739" patch I didn't have any issues, except for email messages 
> with corruption in my sent directory, nowhere else. I'm still investigating 
> the email messages issue. IMO one is generally safe to run poudriere on the 
> latest ZFS with the additional patch.

My poudriere testing failed when I tested that combination (14739
included) and, per what I reported, block_cloning had never been
enabled. Others have also reported poudriere bulk build failures
with block_cloning not involved and 14739 in place. My tests
do predate:

https://people.freebsd.org/~pjd/patches/brt_revert.patch

and I'm not sure whether Cy's activity had brt_revert.patch in
place or not.

Others' notes include Mateusz Guzik's:

https://lists.freebsd.org/archives/dev-commits-src-main/2023-April/014534.html

which said:

QUOTE
There is corruption with the recent import, with the
https://github.com/openzfs/zfs/pull/14739/files patch applied and
block cloning disabled on the pool.

There is no corruption with top of main with zfs merge reverted altogether.

Which commit results in said corruption remains to be seen, a variant
of the tree with just block cloning support reverted just for testing
purposes is about to be evaluated.
END QUOTE

Charlie Li's later related notes, which help interpret that, were in:

https://lists.freebsd.org/archives/dev-commits-src-main/2023-April/014545.html

QUOTE
Testing with mjg@ earlier today revealed that block_cloning was not the 
cause of poudriere bulk build (and similar cp(1)/install(1)-based) 
corruption, although may have exacerbated it.
END QUOTE

Mateusz later indicated that he hoped to have the cause(s)
sorted out sometime Friday:

https://lists.freebsd.org/archives/dev-commits-src-main/2023-April/014551.html

QUOTE
I'm going to narrow down the non-blockcopy corruption after my testjig
gets off the ground.

Basically I expect to have it sorted out on Friday.
END QUOTE

But the lack of later related messages suggests that did not happen.

> My tests of the additional patch

(I'm guessing that is a reference to 14739, not to brt_revert.patch .)

> concluded that it resolved my last 
> problems, except for the sent email problem I'm still investigating. I'm 
> sure there's a simple explanation for it, i.e. the email thread was 
> corrupted by the EXDEV regression which cannot be fixed by anything, even 
> reverting to the previous ZFS -- the data in those files will remain 
> damaged regardless.

Again: my test jumped from prior to the import to after the EXDEV
changes, including having 14739. I still had poudriere bulk
produce file corruptions.

> I cannot speak to the others who have had poudriere and other issues. I 
> never had any problems with poudriere on top of the new ZFS.

Part of the mess is the variability. As I remember, I had 252
ports build fine in my test before the 11th failure meant that
the rest (213) had all been classified as skipped.

It is not like most of the port builds failed: failures were
relatively uncommon.

Also, one port built on a retry, indicating random/racy behavior
is involved. (The original failure was not from a file from
installing build dependencies but something that the builder
generated during the build. The 2nd try did not fail there or
anywhere.)

> WRT reverting block_cloning pools to without, your only option is to backup 
> your pool and recreate it without block_cloning. Then restore your data.
> 

Given what has been reported by multiple people and
Cy's own example of unexplained corruptions in email
handling, I'd be cautious about risking important data
until reports from test environments consistently
show no corruptions.
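
For anyone wanting to check a tree for the null-byte style of damage
mentioned earlier in the thread, the following is a minimal sketch, not
a definitive detector. It assumes that damaged files show up as whole
aligned blocks of NUL bytes; the 4096-byte block size and the function
names zero_blocks and scan_tree are illustrative choices, not anything
taken from ZFS or from the reports. Legitimate files (sparse files,
some binaries, disk images) can also contain zero-filled blocks, so a
hit is only a candidate for closer inspection.

```python
import os

BLOCK = 4096  # assumed check granularity; illustrative, not a ZFS value


def zero_blocks(path, block=BLOCK):
    """Return offsets of full, aligned blocks in `path` that are all NUL bytes."""
    zero = bytes(block)
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(block)
            if not chunk:
                break
            # Only full-size blocks are considered; a short tail is ignored.
            if len(chunk) == block and chunk == zero:
                offsets.append(offset)
            offset += len(chunk)
    return offsets


def scan_tree(root, block=BLOCK):
    """Map each regular file under `root` with an all-zero block to its offsets."""
    hits = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                offsets = zero_blocks(path, block)
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
            if offsets:
                hits[path] = offsets
    return hits
```

Calling scan_tree() on something like a builder's /usr/local/lib and
comparing the hits against the same files from a known-good package
set would separate expected zero blocks from suspicious ones.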

Another thing: my activity does not include any testing
of the suggestion in:

https://lists.freebsd.org/archives/dev-commits-src-main/2023-April/014607.html

to use "-o sync=disabled" in a clone, reporting:

QUOTE
With this workaround I was able to build thousands of packages without 
panics or failures due to data corruption.
END QUOTE

If reliable, that consequence of the change might help
folks who are trying to isolate the problem(s) figure
out what is involved.
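
As a rough illustration of what creating a clone with that property
looks like, here is a small sketch that only builds and prints the
command line, without running it. The dataset and snapshot names are
hypothetical, and the helper name clone_command is made up for this
example; how the referenced message actually applied the option is not
shown here. Note that sync=disabled gives up synchronous write
guarantees, so this is for test environments, not a fix.

```python
import shlex


def clone_command(snapshot, target, sync="disabled"):
    """Build the argv for `zfs clone` with the sync property set on the clone."""
    if "@" not in snapshot:
        raise ValueError("expected a snapshot name of the form dataset@snap")
    if sync not in ("standard", "always", "disabled"):
        raise ValueError("sync must be standard, always, or disabled")
    return ["zfs", "clone", "-o", f"sync={sync}", snapshot, target]


# Hypothetical names, for illustration only.
cmd = clone_command("zroot/poudriere/jail@clean", "zroot/poudriere/work01")
print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere;
the resulting line can be reviewed and then run by hand as root on a
test box.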

===
Mark Millard
marklmi at yahoo.com