Re: another crash and going forward with zfs

From: Mateusz Guzik <mjguzik_at_gmail.com>
Date: Mon, 24 Apr 2023 17:15:10 UTC
On 4/18/23, Pawel Jakub Dawidek <pjd@freebsd.org> wrote:
> On 4/18/23 05:14, Mateusz Guzik wrote:
>> On 4/17/23, Pawel Jakub Dawidek <pjd@freebsd.org> wrote:
>>> Correct me if I'm wrong, but from my understanding there were zero
>>> problems with block cloning when it wasn't in use or now disabled.
>>>
>>> The reason I've introduced vfs.zfs.bclone_enabled sysctl, was to exactly
>>> avoid mess like this and give us more time to sort all the problems out
>>> while making it easy for people to try it.
>>>
>>> If there is no plan to revert the whole import, I don't see what value
>>> removing just block cloning will bring if it is now disabled by default
>>> and didn't cause any problems when disabled.
>>>
>>
>> The feature definitely was not properly stress tested and what not and
>> trying to do it keeps running into panics. Given the complexity of the
>> feature I would expect there are many bugs lurking, some of which
>> possibly related to the on-disk format. Not having to deal with any of
>> this can be arranged as described above and is imo the most
>> sensible route given the timeline for 14.0
>
> Block cloning doesn't create, remove or modify any on-disk data until it
> is in use.
>
> Again, if we are not going to revert the whole merge, I see no point in
> reverting block cloning as until it is enabled, its code is not
> executed. This allows people who upgraded the pools to do nothing special
> and it will allow people to test it easily.
>

Some people will zpool upgrade out of habit or whatever after moving
to 14.0, which will then make them unable to go back to 13.x if woes
show up.

Woes don't even have to be zfs-related. This is a major release, one
has to suspect there will be some breakage, and the best way forward
for some users may be to downgrade (e.g., with boot environments).
As is, they won't be able to do that if they zpool upgrade.

If someone *does* zpool upgrade and there is further data corruption
due to block cloning (which you really can't rule out given that the
feature so far did not survive under load), the telephone game is going
to turn this into "14.0 corrupts data" and no amount of clarifying that
it is an optional feature is going to help the press.

If anything, the real question is how the feature got merged upstream when:
1. FreeBSD CI for the project is offline
2. There is no Linux support
3. ... basic usage showed numerous bugs

Should the feature get whipped into shape, it can be a 14.1 candidate.

-- 
Mateusz Guzik <mjguzik gmail.com>