From: Rick Macklem <rick.macklem@gmail.com>
Date: Thu, 23 Nov 2023 08:26:00 -0800
Subject: Re: NFS exports of ZFS snapshots broken
To: Mike Karels
Cc: FreeBSD CURRENT, Alexander Motin, Martin Matuska, Garrett Wollman
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
Oh, and thanks to everyone who put up with my lame ZFS snapshot
questions.

rick

On Thu, Nov 23, 2023 at 8:13 AM Rick Macklem wrote:
>
> On Sat, Nov 18, 2023 at 8:10 PM Mike Karels wrote:
> >
> > On 18 Nov 2023, at 21:19, Rick Macklem wrote:
> >
> > > On Sat, Nov 18, 2023 at 4:43 PM Mike Karels wrote:
> > >>
> > >> On 18 Nov 2023, at 17:23, Rick Macklem wrote:
> > >>
> > >>> On Sat, Nov 18, 2023 at 2:27 PM Mike Karels wrote:
> > >>>>
> > >>>> On 18 Nov 2023, at 15:58, Rick Macklem wrote:
> > >>>>
> > >>>>> On Sat, Nov 18, 2023 at 8:09 AM Rick Macklem wrote:
> > >>>>>>
> > >>>>>> On Fri, Nov 17, 2023 at 8:19 PM Mike Karels wrote:
> > >>>>>>>
> > >>>>>>> On 17 Nov 2023, at 22:14, Mike Karels wrote:
> > >>>>>>>
> > >>>>>>>> On 17 Nov 2023, at 21:24, Rick Macklem wrote:
> > >>>>>>>>
> > >>>>>>>>> Most of the changes in stable/13 that are not in releng/13.2
> > >>>>>>>>> are the "make it work in a jail" stuff. Unfortunately, they are
> > >>>>>>>>> a large # of changes (mostly trivial edits adding vnet macros),
> > >>>>>>>>> but it also includes export check changes.
> > >>>>>>>>>
> > >>>>>>>>> I have attached a trivial patch that I think disables the export
> > >>>>>>>>> checks for jails.
> > >>>>>>>>> If either of you can try it and see if it fixes
> > >>>>>>>>> the problem, that would be great.
> > >>>>>>>>> (Note that this is only for testing, although it probably does not
> > >>>>>>>>> matter unless you are running nfsd(8) in vnet jails.)
> > >>>>>>>>
> > >>>>>>>> Yes, I can see snapshots with the patch. This system is just a test
> > >>>>>>>> system that doesn't normally run ZFS or NFS, so no problem messing
> > >>>>>>>> with permissions. It's a bhyve VM, so I just added a small disk and
> > >>>>>>>> enabled ZFS for testing.
> > >>>>>>>
> > >>>>>>> btw, you might try to get mm@ or maybe mav@ to help out from the ZFS
> > >>>>>>> side. It must be doing something differently inside a snapshot than
> > >>>>>>> outside, maybe with file handles or something like that.
> > >>>>>> Yes. I've added freebsd-current@ (although Garrett is not on it, he is
> > >>>>>> cc'd) and these guys specifically...
> > >>>>>>
> > >>>>>> So, here's what appears to be the problem...
> > >>>>>> Commit 88175af (in main and stable/13, but not 13.2) added checks for
> > >>>>>> nfsd(8) running in jails by filling in mnt_exjail with a reference to the
> > >>>>>> cred used when the file system is exported.
> > >>>>>> When mnt_exjail is found NULL, the current nfsd code assumes that there
> > >>>>>> is no access allowed for the mount.
> > >>>>>>
> > >>>>>> My vague understanding is that when a ZFS snapshot is accessed, it is
> > >>>>>> "pseudo-mounted" by zfsctl_snapdir_lookup() and I am guessing that
> > >>>>>> mnt_exjail is NULL as a result.
> > >>>>>> Since I do not know the ZFS code and don't even have an easy way to
> > >>>>>> test this (thankfully Mike can test easily), I do not know what to do
> > >>>>>> from here?
> > >>>>>>
> > >>>>>> Is there a "struct mount" constructed for this pseudo mount
> > >>>>>> (or it actually appears to be the lookup of ".." that fails, so it
> > >>>>>> might be the parent of the snapshot subdir?)?
> > >>>>>>
> > >>>>>> One thought is that I can check to see if the mount pointer is in the
> > >>>>>> mountlist (I don't think the snapshot's mount is in the mountlist) and
> > >>>>>> avoid the jail test for this case. This would assume that snapshots are
> > >>>>>> always within the file system(s) exported via that jail (which includes
> > >>>>>> the case of prison0, of course), so that they do not need a separate
> > >>>>>> jail check.
> > >>>>>>
> > >>>>>> If this doesn't work, there will need to be some sort of messing about
> > >>>>>> in ZFS to set mnt_exjail for these.
> > >>>>> Ok, so now onto the hard part...
> > >>>>> Thanks to Mike and others, I did create a snapshot under .zfs and I can
> > >>>>> see the problem. It is that mnt_exjail == NULL.
> > >>>>> Now, is there a way that this "struct mount" can be recognized as "special"
> > >>>>> for snapshots, so I can avoid the mnt_exjail == NULL test?
> > >>>>> (I had hoped that "mp->mnt_list.tqe_prev" would be NULL, but that is not
> > >>>>> the case.)
> > >>>>
> > >>>> Dumb question, is the mount point (mp presumably) different between the
> > >>>> snapshot and the main file system?
> > >>> Not a dumb question and the answer is rather interesting...
> > >>> It is "sometimes" or "usually" according to my printf().
> > >>> It seems that when you first "cd" into the snapshot, you get the mp
> > >>> where mnt_exjail == NULL. Then when you look at directories within the
> > >>> snapshot, you get the mp of the file system that .zfs exists in, which does
> > >>> have mnt_exjail set non-NULL.
> > >>>
> > >>> There is this snippet of code in zfsctl_snapdir_lookup():
> > >>>     /*
> > >>>      * Fix up the root vnode mounted on .zfs/snapshot/<snapname>.
> > >>>      *
> > >>>      * This is where we lie about our v_vfsp in order to
> > >>>      * make .zfs/snapshot/<snapname> accessible over NFS
> > >>>      * without requiring manual mounts of <snapname>.
> > >>>      */
> > >>>     ASSERT3P(VTOZ(*vpp)->z_zfsvfs, !=, zfsvfs);
> > >>>     VTOZ(*vpp)->z_zfsvfs->z_parent = zfsvfs;
> > >>>
> > >>>     /* Clear the root flag (set via VFS_ROOT) as well. */
> > >>>     (*vpp)->v_vflag &= ~VV_ROOT;
> > >>> which seems to set the mp to that of the parent, but it
> > >>> seems this does not happen for the initial lookup of
> > >>> the <snapname>?
> > >>>
> > >>> I'll note that there is code before this in
> > >>> zfsctl_snapdir_lookup() for handling cases
> > >>> like "." and ".." that return without doing this.
> > >>>
> > >>> Now, why does this work without the mnt_exjail
> > >>> check (as in 13.2)?
> > >>> I am not quite sure, but there is this "cheat" in the
> > >>> NFS server (it has been there for years, maybe decades):
> > >>>     /*
> > >>>      * Allow a Lookup, Getattr, GetFH, Secinfo on a
> > >>>      * non-exported directory if
> > >>>      * nfs_rootfhset. Do I need to allow any other Ops?
> > >>>      * (You can only have a non-exported vpnes if
> > >>>      * nfs_rootfhset is true. See nfsd_fhtovp())
> > >>>      * Allow AUTH_SYS to be used for file systems
> > >>>      * exported GSS only for certain Ops, to allow
> > >>>      * clients to do mounts more easily.
> > >>>      */
> > >>>     if (nfsv4_opflag[op].needscfh && vp) {
> > >>>         if (!NFSVNO_EXPORTED(&vpnes) &&
> > >>>             op != NFSV4OP_LOOKUP &&
> > >>>             op != NFSV4OP_GETATTR &&
> > >>>             op != NFSV4OP_GETFH &&
> > >>>             op != NFSV4OP_ACCESS &&
> > >>>             op != NFSV4OP_READLINK &&
> > >>>             op != NFSV4OP_SECINFO &&
> > >>>             op != NFSV4OP_SECINFONONAME)
> > >>>             nd->nd_repstat = NFSERR_NOFILEHANDLE;
> > >>> This allows certain operations to be done on
> > >>> non-exported file systems and I think that is enough
> > >>> to allow this to work when mnt_exjail is not checked.
> > >>> (Note that NFSV4OP_LOOKUPP is not in the list,
> > >>> which might explain why it is the one that fails for
> > >>> Garrett.
> > >>> I don't think it can be added to this list
> > >>> safely, since that would allow a client to move above
> > >>> the exported file system into "uncharted territory".)
> > >>>
> > >>>> Just curious. Also, what is mnt_exjail
> > >>>> normally set to for file systems not in a jail?
> > >>> mnt_exjail is set to the credentials of the thread/process
> > >>> that exported the file system (usually mountd(8)).
> > >>> When not in a jail, cr_prison for these credentials
> > >>> points to prison0.
> > >>>
> > >>> Btw, I checked and the other mp that has mnt_exjail == NULL
> > >>> is in the mountlist, so the idea of checking "not in mountlist"
> > >>> is a dead end.
> > >>>
> > >>> I am looking for something "unique" about this other mp,
> > >>> but haven't found anything yet.
> > >>> Alternately, it might be necessary to add code to
> > >>> zfsctl_snapdir_lookup() to "cheat and change the mp"
> > >>> in more cases, such as "." and ".." lookups?
> > >>
> > >> It seems to me that if ZFS is creating an additional mount structure,
> > >> it should be responsible for setting it up correctly. That could
> > >> involve a vfs-level routine to do some of the cloning. In any case,
> > >> it seems to me that mnt_exjail should be set properly, e.g. by duping
> > >> the one in the original mount structure. Probably ZFS is the only
> > >> file system type that would need this added.
> > > I've created a patch that I think does this. It seemed to test ok for my case.
> > > It's in D42672 on reviews.freebsd.org.
> > > I have also attached it here (this diff doesn't use -U999999, so it might be
> > > easier to apply).
> >
> > It works for me in 13-stable, and looks plausible.
>
> The patch has now been fixed so that there is no race between vfs_exjail_clone()
> and a jail dying. I believe the current patch fixes the problem.
> The patch can be found as an attachment to PR#275200 (that is a FreeBSD bugzilla
> problem report, not an OpenZFS pull request).
> I will not close the PR until
> the above has all happened.
>
> This patch will be needed for systems running stable/13, stable/14 and 14.0.
>
> The vfs_exjail_clone() part of the patch is now committed to main and will be
> MFC'd in 3 days. The ZFS part of the patch is a pull request on OpenZFS.
> I cannot say how long the ZFS part will take or if/when an errata for 14.0 will
> happen.
>
> Sorry about the breakage and thanks for reporting it (and persevering when
> I said I knew nothing about ZFS, which is true). Thanks go to Mike Karels,
> who was able to determine that the problem was not in 13.2, but was in
> stable/13, which allowed me to conclude it was an "nfsd in jail" related
> breakage.
>
> rick
>
> >
> > Mike
> >
> > > Hopefully this fixes the problem. Sorry for the breakage.
> > >
> > > rick
> > >
> > >>
> > >> Mike
> > >>
> > >>> rick
> > >>> ps: I added all the cc's back in because I want the
> > >>> ZFS folk to hopefully chime in.
> > >>>
> > >>>>
> > >>>> Mike
> > >>>>
> > >>>>> Do I need to search mountlist for it?
> > >>>>>
> > >>>>> rick
> > >>>>> ps: The hack patch attached should fix the problem, but can only be
> > >>>>> safely used if mountd/nfsd are not run in any jails.
> > >>>>>
> > >>>>>>
> > >>>>>> I will try and get a test setup going here, which leads me to...
> > >>>>>> how do I create a ZFS snapshot? (I do have a simple ZFS pool running
> > >>>>>> on a test machine, but I've never done a snapshot.)
> > >>>>>>
> > >>>>>> Although this problem is not in 13.2, it will have shipped in 14.0.
> > >>>>>>
> > >>>>>> Any help will be appreciated, rick
> > >>>>>>
> > >>>>>>>
> > >>>>>>> Mike
> > >>>>>>>>
> > >>>>>>>>> rick
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Nov 17, 2023 at 6:14 PM Mike Karels wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Rick, have you been following this thread on freebsd-stable? I have been able
> > >>>>>>>>>> to reproduce this using a 13-stable server from Oct 7 and a 15-current system
> > >>>>>>>>>> that is up to date, using NFSv3. I did not reproduce it with a 13.2 server.
> > >>>>>>>>>> The client was running 13.2. Any ideas? A full bisect seems fairly painful,
> > >>>>>>>>>> but maybe you have an idea of points to try. Fortunately, these are all test
> > >>>>>>>>>> systems that I can reboot at will.
> > >>>>>>>>>>
> > >>>>>>>>>> Mike
> > >>>>>>>>>>
> > >>>>>>>>>> Forwarded message:
> > >>>>>>>>>>
> > >>>>>>>>>>> From: Garrett Wollman
> > >>>>>>>>>>> To: Mike Karels
> > >>>>>>>>>>> Cc: freebsd-stable@freebsd.org
> > >>>>>>>>>>> Subject: Re: NFS exports of ZFS snapshots broken
> > >>>>>>>>>>> Date: Fri, 17 Nov 2023 17:35:04 -0500
> > >>>>>>>>>>>
> > >>>>>>>>>>> Mike Karels said:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> I have not run into this, so I tried it just now. I had no problem.
> > >>>>>>>>>>>> The server is 13.2, fully patched, the client is up-to-date -current,
> > >>>>>>>>>>>> and the mount is v4.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On my 13.2 client and 13-stable server, I see:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 25034 ls       CALL  open(0x237d32f9a000,0x120004)
> > >>>>>>>>>>> 25034 ls       NAMI  "/mnt/tools/.zfs/snapshot/weekly-2023-45"
> > >>>>>>>>>>> 25034 ls       RET   open 4
> > >>>>>>>>>>> 25034 ls       CALL  fcntl(0x4,F_ISUNIONSTACK,0x0)
> > >>>>>>>>>>> 25034 ls       RET   fcntl 0
> > >>>>>>>>>>> 25034 ls       CALL  getdirentries(0x4,0x237d32faa000,0x1000,0x237d32fa7028)
> > >>>>>>>>>>> 25034 ls       RET   getdirentries -1 errno 5 Input/output error
> > >>>>>>>>>>> 25034 ls       CALL  close(0x4)
> > >>>>>>>>>>> 25034 ls       RET   close 0
> > >>>>>>>>>>> 25034 ls       CALL  exit(0)
> > >>>>>>>>>>>
> > >>>>>>>>>>> Certainly a libc bug here that getdirentries(2) returning [EIO]
> > >>>>>>>>>>> results in ls(1) returning EXIT_SUCCESS, but the [EIO] error is
> > >>>>>>>>>>> consistent across both FreeBSD and Linux clients.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Looking at this from the RPC side:
> > >>>>>>>>>>>
> > >>>>>>>>>>>     (PUTFH, GETATTR, LOOKUP(snapshotname), GETFH, GETATTR)
> > >>>>>>>>>>>     [NFS4_OK for all ops]
> > >>>>>>>>>>>     (PUTFH, GETATTR)
> > >>>>>>>>>>>     [NFS4_OK, NFS4_OK]
> > >>>>>>>>>>>     (PUTFH, ACCESS(0x3f), GETATTR)
> > >>>>>>>>>>>     [NFS4_OK, NFS4_OK, rights = 0x03, NFS4_OK]
> > >>>>>>>>>>>     (PUTFH, GETATTR, LOOKUPP, GETFH, GETATTR)
> > >>>>>>>>>>>     [NFS4_OK, NFS4_OK, NFS4ERR_NOFILEHANDLE]
> > >>>>>>>>>>>
> > >>>>>>>>>>> and at this point the [EIO] is returned.
> > >>>>>>>>>>>
> > >>>>>>>>>>> It seems that clients always do a LOOKUPP before calling READDIR, and
> > >>>>>>>>>>> this is failing when the subject file handle is the snapshot. The
> > >>>>>>>>>>> client is perfectly able to *traverse into* the snapshot: if I try to
> > >>>>>>>>>>> list a subdirectory I know exists in the snapshot, the client is able to
> > >>>>>>>>>>> LOOKUP(dirname) just fine, but LOOKUPP still fails with
> > >>>>>>>>>>> NFS4ERR_NOFILEHANDLE *on the subdirectory*.
> > >>>>>>>>>>>
> > >>>>>>>>>>> -GAWollman
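[For readers who, like Rick above, have a pool but have never taken a snapshot: a minimal command sketch. The pool/dataset name `tank/tools` and the client mount path are hypothetical; only `zfs snapshot`, `zfs list`, and the `.zfs/snapshot` control directory are standard.]

```shell
# Create a snapshot of an existing dataset (names are hypothetical):
zfs snapshot tank/tools@weekly-2023-45

# Confirm it exists:
zfs list -t snapshot tank/tools

# On an NFS client that has the exported dataset mounted, the snapshot
# appears under the .zfs control directory -- the listing that was
# failing with EIO in the trace above:
ls /mnt/tools/.zfs/snapshot/weekly-2023-45
```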