[Bug 275308] EN tracking issue for potential ZFS data corruption

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 29 Nov 2023 10:12:02 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275308

--- Comment #16 from Rob Norris <robn@despairlabs.com> ---
(In reply to Dave Cottlehuber from comment #14)

> Generally:

> - testing of a new feature uncovered a rare and long-standing bug in OpenZFS

Correct. A reproduction for what was first thought to be a block cloning bug
also showed the same fault in 2.1, which does not have the block cloning
feature.

> - the block cloning feature increased the likelihood of encountering this bug

Unclear. So far no one has been able to show that cloning definitely changes
the probability of the bug showing up, and it is not at all clear why it should
be in play at all. Neither have I been able to show it in my own testing.

[tech detail and speculation follows]

The actual bug is that the object dnode is briefly, incorrectly, considered
"clean". There's no particular reason that the change being a clone rather than
a write would affect this window; at the time the dnode appears to be in the
wrong state, we're in syncing context, so the actual change must be two txgs
ago.

The only reason I've been able to think of that it might make a difference is
simply the amount of work in the system. Reading and writing real bytes is
slow; copying a block pointer that's already in memory is negligible by
comparison. It could simply be that cloning enables more operations to be
completed in the same amount of time, which would naturally increase the odds.

It's also worth noting that cloning was an initial guess, based on:

- a copy-related bug was presented on Linux against 2.2
- with a version of cp from GNU coreutils known to use copy_file_range(), thus
permitting cloning
- an initial analysis found a locking bug in part of block cloning involving
dirty buffers

However, this happened just after 2.2.1 was released, which disabled block
cloning at the POSIX entry points (a direct port of the FreeBSD 14 sysctl to
Linux), due to another bug that had shown up in 2.2.0 (unencrypted blocks could
be cloned into encrypted datasets). 2.2.1 disabling a headline feature had
generated some interest, as had FreeBSD 14 before it, and there was already
some uncertainty around the block cloning feature before 2.2 was released.

Once the problem was confirmed on 2.1, we quickly understood that
lseek(SEEK_DATA/SEEK_HOLE) was the real culprit, which cp coincidentally also
started using for the first time in coreutils 9.x.

All this is to say: there was already an idea in the air that block cloning was
dangerous, and that's something I'm still arguing against days later. That may
be part of the reason that there is so strong a feeling that block cloning
raises the probability.

> - but it is still very very rare, and relies on a combination of high i/o, cpu, and unusual write/read patterns

At a high level, yes, all true.

I believe the specific sequence is:

- program [A] executes an operation that converts a hole to data, or appends
data to a file (which can be thought of as creating a hole at the end of the
file, then putting data underneath it). `cp` obviously fulfils this
- the change makes it to syncing context, and the dnode starts its sync process
- the dnode appears to be "clean"
- program [B] calls lseek(SEEK_DATA) at the start of the hole (usually offset 0
in a new file). The dnode appears clean, so its block pointers are examined.
The data isn't written yet, but the size in the dnode is past the offset, so
ENXIO ("data not found") is returned. This is an in-memory operation.
- the window passes, the dnode is "dirty"
- program [B], believing that there's no data there, does whatever it does (in
the case of `cp`, writes all-zeroes or punches a hole in the destination)

(still working on it, so I'm not totally sure this is all of it. Once I'm sure,
I'll be sending a patch to OpenZFS)
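
To make the shape of that concrete, here's a minimal userland sketch of the
two roles (squeezed into one process for brevity). It's an illustration only,
not a reliable reproducer, since the lseek() has to land in the brief window
where the dnode looks "clean"; the file name and sizes are made up.

/*
 * Illustration of the sequence above. Not a reliable reproducer.
 */
#include <sys/types.h>
#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char buf[65536];
	off_t data;
	int wfd, rfd;

	memset(buf, 0xab, sizeof(buf));

	/* Role of program [A]: create a new file and append data to it. */
	wfd = open("testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (wfd < 0)
		err(1, "open for write");
	if (write(wfd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		err(1, "write");
	close(wfd);

	/* Role of program [B]: look for the first data region, as cp does. */
	rfd = open("testfile", O_RDONLY);
	if (rfd < 0)
		err(1, "open for read");
	data = lseek(rfd, 0, SEEK_DATA);
	if (data == -1 && errno == ENXIO) {
		/*
		 * "No data past offset 0". If the dnode was wrongly seen
		 * as clean while the write was still pending, a copier
		 * taking this branch writes nothing (or punches a hole)
		 * in the destination.
		 */
		printf("file reported as hole-only\n");
	} else {
		printf("first data at offset %jd\n", (intmax_t)data);
	}
	close(rfd);
	return (0);
}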

So as you see, the timing is enormously difficult to hit: a particular kind of
write operation has to happen roughly in parallel with a hole-detection
operation, and hit at just the right moment. It's not really clear that high
I/O, CPU, etc. make much difference, apart from perhaps changing the timing.
The main reason the reproduction does so much of it is just to improve the
odds of seeing it.

> - this error will not be picked up by scrub or other zfs consistency checks

Correct. It's entirely in-memory; the problem isn't that anything read or
wrote bad data, but rather that it wrote _nothing_ because it thought there
was nothing to write!

> - setting vfs.zfs.dmu_offset_next_sync=0 should mitigate the likelihood of
>   encountering this error going forwards until a patch is made available

Yes (though, patches are already available upstream and I believe at least on
FreeBSD head).

The reason this mitigation is effective is that =0 causes the hole check to
return "data" if the dnode is dirty, without checking for holes, while the
default =1 causes it to wait for sync if the dnode is dirty, then check dirty
_again_, before either doing the real hole block scan or returning "data" if
it's still dirty. The retry occurs just after the txg sync has finished,
perilously close to the "clean" window as the dataset starts its next sync.
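
In stubbed pseudocode, that decision looks roughly like this. It's a
paraphrase of the behaviour described above, not the actual OpenZFS code, and
every name in it is invented for illustration.

/*
 * Stubbed sketch of the hole-check decision. All names are made up.
 */
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for "is the dnode dirty?" and "wait for the txg to sync". */
static bool dnode_is_dirty_stub(void) { return (true); }
static void wait_for_txg_sync_stub(void) { }

static const char *
hole_check_sketch(int dmu_offset_next_sync)
{
	if (dnode_is_dirty_stub()) {
		if (dmu_offset_next_sync == 0)
			return ("report data, skip the hole scan"); /* =0 */
		wait_for_txg_sync_stub();          /* =1: wait for sync... */
		if (dnode_is_dirty_stub())         /* ...then check again */
			return ("report data, skip the hole scan");
	}
	/* Dnode looks clean: do the real block pointer hole scan. */
	return ("scan the block pointers for holes");
}

int
main(void)
{
	printf("=0: %s\n", hole_check_sketch(0));
	printf("=1: %s\n", hole_check_sketch(1));
	return (0);
}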

But of course, there's still a minuscule chance that the initial dirty check
could hit the gap, which is why it's not a 100% solution.

> 14.0-RELEASE new zpools:

> - block cloning feature is disabled by default via vfs.zfs.bclone_enabled=0

Correct. (fwiw, upstream 2.2.1 has it disabled by default on Linux as well)

> - a newly created zpool will have the zpool feature@block_cloning enabled

Correct.

> - but will not actively use that because of the sysctl

Correct. OpenZFS' copy_file_range() handler will check this flag, and
immediately fall back to the "generic" handler. There are no other block
cloning entry points on FreeBSD.

> - thus actual impact is unlikely

Possibly not true, for another reason. Even if we take block cloning out of the
picture, FreeBSD's generic copy_file_range() looks for holes and tries to
replicate them on the target. That's through
VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE). I haven't read that code closely, but I see
no reason why it would be less risky than on Linux with coreutils 9.x (which
isn't much, but it's there).
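
For reference, this is roughly what the userland side of that entry point
looks like; whether the kernel clones blocks or falls back to the generic
path (with its FIOSEEKDATA/FIOSEEKHOLE hole scan) is decided below it, not
here. Sketch only, with made-up file names.

/*
 * Userland shape of copy_file_range(2), present in FreeBSD since 13.0.
 */
#include <err.h>
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>

int
main(void)
{
	int in, out;
	ssize_t n;

	in = open("source", O_RDONLY);
	if (in < 0)
		err(1, "open source");
	out = open("dest", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (out < 0)
		err(1, "open dest");

	/* Copy until EOF; the kernel advances both file offsets. */
	do {
		n = copy_file_range(in, NULL, out, NULL, SSIZE_MAX, 0);
	} while (n > 0);
	if (n < 0)
		err(1, "copy_file_range");

	close(in);
	close(out);
	return (0);
}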

> 13.x existing zpools:

> - you may have encountered this bug but there is currently no definitive way
>   to scan a zpool to check for the occurrence

Correct.

Note that copy_file_range() appeared in 13.0 (bbbbeca3e9a3) and had hole checks
from the beginning.

> 12.x existing zpools:

> - should not be impacted

Unclear. The bug has been affirmatively reproduced back to ZoL 0.6.5, which at
that time was still pulling code from Illumos as FreeBSD was. There's some
evidence from code reading that this might have been introduced in OpenSolaris
in 2006 (https://github.com/illumos/illumos-gate/commit/c543ec060d, which first
changed the dirty check logic to its incomplete form).

However, as best I can tell, copy_file_range() did not exist then, and `bin/cp`
does not now, and did not then, appear to have any facility to search for holes
on the source and replicate them (no lseek(SEEK_DATA/SEEK_HOLE), nor
ioctl(FIOSEEKDATA/FIOSEEKHOLE)).

So, if the bug can still be hit there, it's highly likely that it never was hit
through the inbuilt tools.

> Open Questions:

> - I don't understand whether FreeBSD uses a similar sparse-by-default functionality (like linux coreutils 9.2/9.3) that alters the user risk here

Answered above.

> I'm happy to help craft/review an EN if needed.

I hope I haven't made it harder!

The point being: it's vanishingly unlikely that anyone has hit this during
normal operation, and even less likely that they've hit it _unknowingly_: how
often are people writing files, checking for holes and then taking action on
that within a fraction of a moment, at a large enough scale for a long enough
time that they could actually hit it, and then never actually looking at the
target file? The only real-world cases we've seen hit this are parallel build
systems, and while it's a problem, it's not _silent_ corruption, because the
output plain doesn't work! So of course it's not _never_, but it's really hard
to hit by accident.

My advice to users would be (and has been):

- leave vfs.zfs.bclone_enabled=0
- if you can get the patch, take it
- if you can't, set vfs.zfs.dmu_offset_next_sync=0. You lose hole reporting for
dirty files, and vastly reduce the already small odds of hitting this
- play the probabilities: if you're not sure, and you're not creating new files
and copying them away in the same instant hundreds of times per second, you're
fine
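
If it helps, the dmu_offset_next_sync setting can also be applied
programmatically; a minimal sketch via sysctlbyname(3), in case the usual
sysctl(8)/sysctl.conf route isn't convenient in a test harness. Needs root,
and assumes the sysctl takes a 4-byte integer.

/*
 * Programmatic equivalent of: sysctl vfs.zfs.dmu_offset_next_sync=0
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>

int
main(void)
{
	int zero = 0;

	/* Skip the wait-and-retry hole check whenever the dnode is dirty. */
	if (sysctlbyname("vfs.zfs.dmu_offset_next_sync", NULL, NULL,
	    &zero, sizeof(zero)) != 0)
		err(1, "sysctlbyname");
	return (0);
}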

-- 
You are receiving this mail because:
You are the assignee for the bug.