From: bugzilla-noreply@freebsd.org
To: fs@FreeBSD.org
Subject: [Bug 275308] EN tracking issue for potential ZFS data corruption
Date: Wed, 29 Nov 2023 10:12:02 +0000
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275308

--- Comment #16 from Rob Norris ---

(In reply to Dave Cottlehuber from comment #14)

> Generally:
> - testing of a new feature uncovered a rare and long-standing bug in OpenZFS

Correct. A reproduction for what was first thought to be a block cloning bug
also showed the same fault in 2.1, which does not have the block cloning
feature.

> - the block cloning feature increased the likelihood of encountering this bug

Unclear. So far no one has been able to show that cloning definitely changes
the probability of the bug showing up, and it is not at all clear why it
should be in play at all. Neither have I been able to show it in my own
testing.

[tech detail and speculation follows]

The actual bug is that the object dnode is briefly, incorrectly, considered
"clean". There's no particular reason that the change being a clone rather
than a write would affect this window; at the time the dnode appears to be in
the wrong state, we're in syncing context, so the actual change must be two
txgs ago.

The only reason I've been able to think of that it might make a difference is
simply the amount of work in the system. Reading and writing real bytes is
slow; copying a block pointer that's already in memory is negligible by
comparison. It could simply be that cloning allows more operations to be
completed in the same time, obviously increasing the chance.

It's also worth noting that cloning was an initial guess, based on:

- a copy-related bug was presented on Linux against 2.2
- with a version of cp from GNU coreutils known to use copy_file_range(),
  thus permitting cloning
- an initial analysis found a locking bug in a part of block cloning
  involving dirty buffers

However, this happened just after 2.2.1 was released, which disabled block
cloning at the POSIX entry points (a direct port of the FreeBSD 14 sysctl to
Linux), due to another bug that had shown up in 2.2.0 (unencrypted blocks
could be cloned into encrypted datasets). 2.2.1 disabling a headline feature
had generated some interest, as had FreeBSD 14 before it, and there was
already some uncertainty around the block cloning feature before 2.2 was
released. Once the problem was confirmed on 2.1, we quickly understood that
lseek(SEEK_DATA/SEEK_HOLE) was the real culprit, coincidentally also available
for the first time in coreutils 9.x.

All this is to say: there was already an idea in the air that block cloning
was dangerous, and that's something I'm still arguing against days later. That
may be part of the reason that there is so strong a feeling that block cloning
raises the probability.
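For the curious, this is roughly the shape of the probe a sparse-aware copier
does with lseek(SEEK_DATA/SEEK_HOLE). It's only an illustrative sketch (not
the coreutils source; file handling and output are placeholders), but it's the
pattern that matters for the sequence described below:

    /* Sketch of a sparse-aware reader probing for data segments, in the
     * style of a coreutils 9.x cp. Illustrative only. */
    #include <err.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(int argc, char *argv[])
    {
        if (argc != 2)
            errx(1, "usage: %s file", argv[0]);

        int fd = open(argv[1], O_RDONLY);
        if (fd == -1)
            err(1, "open");

        off_t end = lseek(fd, 0, SEEK_END);
        if (end == -1)
            err(1, "SEEK_END");
        off_t pos = 0;

        while (pos < end) {
            /* Find the next region containing data. ENXIO means "no data
             * between this offset and EOF" -- in the buggy window the dnode
             * looks clean, so freshly written data can be reported as a
             * hole, and a copier will then skip it on the destination. */
            off_t data = lseek(fd, pos, SEEK_DATA);
            if (data == -1) {
                if (errno == ENXIO)
                    break;      /* rest of the file is a hole */
                err(1, "SEEK_DATA");
            }
            off_t hole = lseek(fd, data, SEEK_HOLE);
            if (hole == -1)
                err(1, "SEEK_HOLE");
            printf("data: %jd-%jd\n", (intmax_t)data, (intmax_t)hole);
            pos = hole;
        }
        close(fd);
        return (0);
    }

In the sequence below, program [B] is doing essentially this probe; if the
ENXIO case fires for data that is actually there but not yet synced, the
copier concludes the file is all hole.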
> - but it is still very very rare, and relies on a combination of high i/o,
> cpu, and unusual write/read patterns

At a high level, yes, all true. I believe the specific sequence is:

- program [A] executes an operation that converts a hole to data, or appends
  data to a file (which can be thought of as creating a hole at the end of the
  file, then putting data underneath it). `cp` obviously fulfils this
- the change makes it to syncing context, and the dnode starts its sync
  process
- the dnode appears to be "clean"
- program [B] calls lseek(SEEK_DATA) at the start of the hole (usually offset
  0 in a new file). The dnode appears clean, so its block pointers are
  examined. The data isn't written yet, but the size in the dnode is past the
  offset, so ENXIO ("data not found") is returned. This is an in-memory
  operation.
- the window passes, the dnode is "dirty"
- program [B], believing that there's no data there, does whatever it does (in
  the case of `cp`, writes all-zeroes or punches a hole in the destination)

(I'm still working on it, so I'm not totally sure this is all of it. Once I'm
sure, I'll be sending a patch to OpenZFS.)

So as you can see, the timing is enormously difficult: a particular kind of
write operation has to happen roughly in parallel with a hole-detection
operation, and hit at just the right moment. It's not really clear that high
I/O, CPU, etc. make much difference, apart from perhaps changing the timing.
The main reason we do it a lot in the reproduction is just to better the odds
of seeing it.

> - this error will not be picked up by scrub or other zfs consistency checks

Correct. It's entirely in-memory; the problem isn't that anything read or
wrote bad data, but rather that it wrote _nothing_, because it thought there
was nothing to write!

> - setting vfs.zfs.dmu_offset_next_sync=0 should mitigate the likelihood of
> encountering this error going forwards until a patch is made available

Yes (though patches are already available upstream and, I believe, at least
on FreeBSD head).

The reason this mitigation is effective is that =0 causes the hole check to
return "data" if the dnode is dirty, without checking for holes, while the
default =1 causes it to wait for sync if the dnode is dirty, then check dirty
_again_, before either doing the real hole block scan or returning "data" if
it's still dirty. The retry occurs just after the txg sync has finished,
perilously close to the "clean" window as the dataset starts its next sync.
But of course, there's still a minuscule chance that the initial dirty check
could hit the gap, which is why it's not a 100% solution.

> 14.0-RELEASE new zpools:
> - block cloning feature is disabled by default via vfs.zfs.bclone_enabled=0

Correct. (fwiw, upstream 2.2.1 has it disabled by default on Linux as well)

> - a newly created zpool will have the zpool feature@block_cloning enabled

Correct.

> - but will not actively use that because of the sysctl

Correct. OpenZFS' copy_file_range() handler will check this flag, and
immediately fall back to the "generic" handler. There are no other block
cloning entry points on FreeBSD.

> - thus actual impact is unlikely

Possibly not true, for another reason. Even if we take block cloning out of
the picture, FreeBSD's generic copy_file_range() looks for holes and tries to
replicate them on the target. That's through
VOP_IOCTL(FIOSEEKDATA/FIOSEEKHOLE). I haven't read that code closely, but I
see no reason why it would be less risky than on Linux with coreutils 9.x
(which isn't much, but it's there).

> 13.x existing zpools:
> - you may have encountered this bug but there is currently no definitive way
> to scan a zpool to check for the occurrence

Correct.

Note that copy_file_range() appeared in 13.0 (bbbbeca3e9a3) and had hole
checks from the beginning.
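To make that concrete, here's a minimal sketch of a copy done through
copy_file_range(2) (illustrative only; chunking and error handling are
arbitrary). Any program copying this way goes through the kernel path above,
whether or not block cloning is in the picture; with vfs.zfs.bclone_enabled=0
it simply falls through to the generic handler, and any hole detection happens
inside the kernel on the application's behalf:

    /* Copy a file with copy_file_range(2), available on FreeBSD since 13.0.
     * Sketch only: the kernel's generic path may probe the source for holes
     * (FIOSEEKDATA/FIOSEEKHOLE); userland never sees that. */
    #include <err.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <unistd.h>

    int
    main(int argc, char *argv[])
    {
        if (argc != 3)
            errx(1, "usage: %s src dst", argv[0]);

        int in = open(argv[1], O_RDONLY);
        if (in == -1)
            err(1, "open %s", argv[1]);
        int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out == -1)
            err(1, "open %s", argv[2]);

        for (;;) {
            /* Let the kernel pick the chunk size; file offsets advance
             * implicitly because both offset pointers are NULL. */
            ssize_t n = copy_file_range(in, NULL, out, NULL, SSIZE_MAX, 0);
            if (n == -1)
                err(1, "copy_file_range");
            if (n == 0)
                break;          /* EOF reached */
        }
        close(in);
        close(out);
        return (0);
    }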
> 12.x existing zpools:
> - should not be impacted

Unclear. The bug has been affirmatively reproduced back to ZoL 0.6.5, which at
that time was still pulling code from Illumos, as FreeBSD was. There's some
evidence from code reading that this might have been introduced in OpenSolaris
in 2006 (https://github.com/illumos/illumos-gate/commit/c543ec060d, which
first changed the dirty check logic to its incomplete form).

However, as best I can tell, copy_file_range() did not exist then, and
`bin/cp` does not now, and did not then, appear to have any facility to search
for holes on the source and replicate them (no lseek(SEEK_DATA/SEEK_HOLE), nor
ioctl(FIOSEEKDATA/FIOSEEKHOLE)). So, if the bug can still be hit there, it's
highly likely that it never was through the inbuilt tools.

> Open Questions:
> - I don't understand whether FreeBSD uses a similar sparse-by-default
> functionality (like linux coreutils 9.2/9.3) that alters the user risk here

Answered above.

> I'm happy to help craft/review an EN if needed.

I hope I haven't made it harder!

The point being: it's vanishingly unlikely that anyone has hit this during
normal operation, and even less likely that they've hit it _unknowingly_: how
often are people writing files, checking for holes and then taking action on
that within a fraction of a moment, at a large enough scale for a long enough
time that they could actually hit it, and then never actually looked at the
target file? The only real-world cases we've seen hit this are parallel build
systems, and while it's a problem, it's not _silent_ corruption, because the
output plain doesn't work! So of course it's not _never_, but it's really hard
to hit by accident.

My advice to users would be (and has been):

- leave vfs.zfs.bclone_enabled=0
- if you can get the patch, take it
- if you can't, set vfs.zfs.dmu_offset_next_sync=0. You lose hole reporting
  for dirty files, and vastly reduce the already small odds of hitting this
- play the probabilities: if you're not sure, and you're not creating new
  files and copying them away in the same instant hundreds of times per
  second, you're fine

-- 
You are receiving this mail because:
You are the assignee for the bug.