Re: main [and, likely, stable/14]: do not set vfs.zfs.bclone_enabled=1 with that zpool feature enabled because it still leads to panics

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sat, 09 Sep 2023 16:32:33 UTC
On Sep 8, 2023, at 21:54, Mark Millard <marklmi@yahoo.com> wrote:

> On Sep 8, 2023, at 18:19, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> On Sep 8, 2023, at 17:03, Mark Millard <marklmi@yahoo.com> wrote:
>> 
>>> On Sep 8, 2023, at 15:30, Martin Matuska <mm@FreeBSD.org> wrote:
>>> 
>>>> I can confirm that the patch fixes the panic caused by the provided script on my test systems.
>>>> Mark, would it be possible to try poudriere on your system with a patched kernel?
>>> 
>>> . . .
>>> 
>>> On 9. 9. 2023 0:09, Alexander Motin wrote:
>>>> On 08.09.2023 09:52, Martin Matuska wrote:
>>>>> . . .
>>>> 
>>>> Thank you, Martin.  I was able to reproduce the issue with your script and found the cause.
>>>> 
>>>> I first thought the issue was triggered by `cp`, but it turned out to be triggered by `cat`.  `cat` also got copy_file_range() support, but later than `cp`.  That is probably why it slipped through testing.  This patch fixes it for me: https://github.com/openzfs/zfs/pull/15251 .
>>>> 
>>>> Mark, could you please try the patch?
>>> 
>>> If all goes well, this will end up reporting that the
>>> poudriere bulk -a is still running but has gotten past,
>>> say, 320+ port->package builds finished (so: more than
>>> double what was observed so far in the panic context). Later
>>> would be a report with a larger figure. A normal run
>>> I might let go for 6000+ ports and 10 hr or so.
>>> 
>>> Notes as I go . . .
>>> 
>>> Patch applied, built, and installed to the test media.
>>> Also, booted:
>>> 
>>> # uname -apKU
>>> FreeBSD amd64-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT amd64 1500000 #75 main-n265228-c9315099f69e-dirty: Thu Sep  7 13:28:47 PDT 2023     root@amd64-ZFS:/usr/obj/BUILDs/main-amd64-dbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-DBG amd64 amd64 1500000 1500000
>>> 
>>> Note that this is with a debug kernel (-dbg- in path and -DBG in
>>> the GENERIC* name). Also, the vintage of what it is based on has:
>>> 
>>> git: 969071be938c - main - vfs: copy_file_range() between multiple mountpoints of the same fs type
>>> 
>>> The usual sort of sequencing, as previously reported, was used
>>> to get to this point. The media update starts with the rewind to
>>> the checkpoint, in hopes of avoiding oddities from the later failure.
>>> 
>>> . . . :
>>> 
>>> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 414   Failed: 0     Skipped: 39    Ignored: 335   Fetched: 0     Tobuild: 33800  Time: 00:30:41
>>> 
>>> 
>>> So 414 and still building.
>>> 
>>> More later. (It may be a while.)
>>> 
>> 
>> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [parallel_build:] Queued: 34588 Built: 2013  Failed: 2     Skipped: 179   Ignored: 335   Fetched: 0     Tobuild: 32059  Time: 01:42:47
>> 
>> and still going. (FYI: The failures are expected.)
>> 
>> After a while I might stop it and start over with a non-debug
>> kernel installed instead.
> 
> I did ^C after 2.5 hr (with 2447 built):
> 
> ^C[02:30:05] Error: Signal SIGINT caught, cleaning up and exiting
> [main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [sigint:] Queued: 34588 Built: 2447  Failed: 5     Skipped: 226   Ignored: 335   Fetched: 0     Tobuild: 31575  Time: 02:29:59
> [02:30:05] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-08_16h31m51s
> [02:30:05] Cleaning up
> [02:38:04] Unmounting file systems
> Exiting with status 1
> 
> I'll switch it over to a non-debug kernel and, probably, world
> and set up and run another test.
> 
> . . . (time goes by) . . .
> 
> Hmm. This did not get sent when I wrote the above. FYI, non-debug
> test status:
> 
> [main-amd64-bulk_a-default] [2023-09-08_19h51m52s] [parallel_build:] Queued: 34588 Built: 2547  Failed: 5     Skipped: 239   Ignored: 335   Fetched: 0     Tobuild: 31462  Time: 01:59:58
> 
> I may let it run overnight.

I finally stopped it at 7473 built (a little over 13 hrs elapsed):

^C[13:08:30] Error: Signal SIGINT caught, cleaning up and exiting
[main-amd64-bulk_a-default] [2023-09-08_19h51m52s] [sigint:] Queued: 34588 Built: 7473  Failed: 23    Skipped: 798   Ignored: 335   Fetched: 0     Tobuild: 25959  Time: 13:08:26
[13:08:30] Logs: /usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-08_19h51m52s
[13:08:31] Cleaning up
[13:17:10] Unmounting file systems
Exiting with status 1

In part, letting it run that long was to gather more evidence that
deadlocks are at least fairly rare as well.

None of the failed ones looked odd. (A fair portion failed because the
bulk -a was mostly doing WITH_DEBUG= builds. Many upstreams change
library names, some other file names, or paths for debug builds,
and ports generally do not handle building such debug variants
well. I've used these runs to extend my list of exceptions that
avoid using WITH_DEBUG .) So there is no evidence of corruption.
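
For comparing the two runs reported in this message, the final status
lines can be reduced to counters and a rough build rate. What follows is
only a sketch in Python, not anything that poudriere provides: the names
parse_status and built_per_hour are made up for illustration, and the
regular expressions assume the status-line layout quoted in this message.
The rates are not directly comparable between runs, since the set of
ports built differs.

```python
import re

# Counter names as they appear in the status lines quoted in this message.
COUNTER_RE = re.compile(
    r"(Queued|Built|Failed|Skipped|Ignored|Fetched|Tobuild):\s+(\d+)"
)
# Elapsed time at the end of a status line, as HH:MM:SS.
TIME_RE = re.compile(r"Time:\s+(\d+):(\d+):(\d+)")


def parse_status(line):
    """Return the counters of one status line as a dict of ints.

    Hypothetical helper: adds a 'seconds' key when a Time: field is found.
    """
    fields = {name: int(value) for name, value in COUNTER_RE.findall(line)}
    match = TIME_RE.search(line)
    if match:
        hours, minutes, seconds = (int(part) for part in match.groups())
        fields["seconds"] = hours * 3600 + minutes * 60 + seconds
    return fields


def built_per_hour(fields):
    """Return packages built per hour, or None if it cannot be computed."""
    if not fields.get("seconds") or "Built" not in fields:
        return None
    return fields["Built"] * 3600 / fields["seconds"]


# Final status lines of the debug and non-debug runs, copied from above.
DEBUG_RUN = (
    "[main-amd64-bulk_a-default] [2023-09-08_16h31m51s] [sigint:] "
    "Queued: 34588 Built: 2447  Failed: 5     Skipped: 226   "
    "Ignored: 335   Fetched: 0     Tobuild: 31575  Time: 02:29:59"
)
NON_DEBUG_RUN = (
    "[main-amd64-bulk_a-default] [2023-09-08_19h51m52s] [sigint:] "
    "Queued: 34588 Built: 7473  Failed: 23    Skipped: 798   "
    "Ignored: 335   Fetched: 0     Tobuild: 25959  Time: 13:08:26"
)

if __name__ == "__main__":
    for label, line in (("debug", DEBUG_RUN), ("non-debug", NON_DEBUG_RUN)):
        fields = parse_status(line)
        rate = built_per_hour(fields)
        print(f"{label}: built={fields['Built']} failed={fields['Failed']} "
              f"rate={rate:.1f}/hr")
```

Saved to a file, this can be run with python3 to print one summary line
per run. Feeding it other status lines only requires that they use the
same field names.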

(I do not normally do bulk -a builds. The rare bulk -a runs are
normally to check that my configuration of a builder machine still
works reasonably, beyond building just the few hundred ports that
I normally build. So I should be able to build most any combination
that I decide to try.)

===
Mark Millard
marklmi at yahoo.com