Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75

From: Cy Schubert <Cy.Schubert_at_cschubert.com>
Date: Mon, 10 Apr 2023 23:54:13 UTC
On Mon, 10 Apr 2023 01:58:00 -0400
Charlie Li <vishwin@freebsd.org> wrote:

> Cy Schubert wrote:
> > Hmm, interesting. I'm experiencing no such panics nor corruption since
> > the commit.
> > 
> > Reading a previous email of yours from today, block_cloning is not
> > enabled. Is it possible that before the regression was fixed, while it
> > was wreaking havoc in your zpool, that your zpool became irreversibly
> > corrupted resulting in panics, even with the fixed code?
> >   
> This is probably now the case.
> > One way, probably the best way, to test would be to revert back to the
> > commit prior to the import. If you still experience panics and
> > corruption, your zpool is damaged.
> >   
> Fails to mount with error 45 on a boot environment only a few commits 
> before the import.
> > At the moment we don't know if the code is still broken or if it has
> > been fixed but residual damage is still causing creeping rot and panics.
> > 
> > I don't know if zpool scrub can fix this -- reading one comment on
> > FreeBSD-current, zpool scrub fails to complete.
> >   
> It doesn't. All scrubs on my end complete fully with nothing to repair.
> > I'm not convinced, yet, that the problem code has not been fixed. We
> > don't know if the panics are a result of corruption as a result of the
> > regression.
> > 
> > Would it be best if we reverted the seven commits to main? I don't
> > know; I could argue it either way. My problems, on four machines, have
> > been fixed by the subsequent commits. We don't know if there are other
> > regressions or if the current problems are due to corruption caused by
> > writes prior to the patches addressing the regression. Maybe we should
> > revert the seven commits and watch for further fallout, seeing whether
> > the panics and problems persist post-revert. If the problems persist
> > post-revert, we know for sure the regression caused some permanent
> > corruption. This is a radical option; I'm torn on whether a revert
> > would be the best approach. It has its merits but significant
> > limitations too.
> >   
> Going to try recreating the pool on current tip, making sure that 
> block_cloning is disabled.
> 

You'll need to do this at pool creation time.
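As a sketch (the pool name "t" and device are illustrative, and this needs root on a system with ZFS), a feature can be left disabled at creation time like so:

```shell
# Create the pool with block_cloning explicitly disabled at creation
# time; "t" and /dev/da0 are placeholders for your pool and vdev.
zpool create -o feature@block_cloning=disabled t /dev/da0

# Verify the feature state; it should report "disabled".
zpool get feature@block_cloning t
```

Note that a later zpool upgrade will enable all supported features, including this one, so the feature stays disabled only as long as the pool is not upgraded.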

I have a "sandbox" pool, called t, on my laptop, used for /usr/obj, ports wrkdirs, and other data I can easily recreate. Here are the results of my tests.

Method:

Initially I copied my /usr/obj from my two build machines (one amd64.amd64 and one i386.i386) to my "sandbox" zpool.

Next, with block_cloning disabled, I did a cp -R of the /usr/obj test files, then a diff -qr. The source and target directories were identical.
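The copy-and-verify step above can be sketched as follows (using throwaway temp directories in place of the real /usr/obj tree and the t pool, so it runs anywhere):

```shell
#!/bin/sh
# Minimal sketch of the copy-and-compare test. The directory names
# are illustrative stand-ins for the real source and target trees.
src=$(mktemp -d)
dst=$(mktemp -d)

# Populate a small source tree standing in for /usr/obj.
mkdir -p "$src/subdir"
echo "object file contents" > "$src/file1"
echo "more contents" > "$src/subdir/file2"

# Copy recursively, then compare; diff -qr prints nothing and exits 0
# when the trees match, and reports each differing file otherwise.
cp -R "$src/." "$dst/"
if diff -qr "$src" "$dst" >/dev/null; then
    echo "trees identical"
else
    echo "trees differ"
fi

rm -rf "$src" "$dst"
```

On a healthy pool the trees should always come back identical; differing files after a plain cp -R are the corruption signature described below.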

Next, I cleaned up (rm -rf) the target directory to prepare for the 
block_clone enabled test.

Next, I did zpool checkpoint t, followed by zpool upgrade t. Pool t now has block_cloning enabled.
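The checkpoint-then-upgrade sequence looks like this (pool name "t" as above; needs root, and is a sketch rather than something to run blindly on a production pool):

```shell
# Take a checkpoint first, so the pre-upgrade pool state can be
# rewound to later with zpool import --rewind-to-checkpoint.
zpool checkpoint t

# Upgrade the pool; this enables all supported features,
# including block_cloning.
zpool upgrade t

# Confirm the feature is now enabled (it becomes "active" once used).
zpool get feature@block_cloning t
```

The checkpoint is what makes this experiment reversible; without it, an upgrade cannot be undone.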

I repeated the cp -R test from above, followed by a diff -qr. Almost
every file was different: the pool was corrupted.

I restored the pool, removing the corruption, as follows:


slippy# zpool export t
slippy# zpool import --rewind-to-checkpoint t
slippy#

It is recommended that people avoid upgrading their zpools until the 
problem is fixed.


-- 
Cheers,
Cy Schubert <Cy.Schubert@cschubert.com>
FreeBSD UNIX:  <cy@FreeBSD.org>   Web:  https://FreeBSD.org
NTP:           <cy@nwtime.org>    Web:  https://nwtime.org

			e^(i*pi)+1=0