From nobody Sun Nov 19 00:42:57 2023 X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4SXsLv23PTz50vjf for ; Sun, 19 Nov 2023 00:43:19 +0000 (UTC) (envelope-from mike@karels.net) Received: from mail2.karels.net (mail2.karels.net [3.19.118.201]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "freebsd", Issuer "freebsd" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 4SXsLt2K7Kz3NF5; Sun, 19 Nov 2023 00:43:18 +0000 (UTC) (envelope-from mike@karels.net) Authentication-Results: mx1.freebsd.org; dkim=pass header.d=karels.net header.s=mail2 header.b=MKgphzQw; spf=pass (mx1.freebsd.org: domain of mike@karels.net designates 3.19.118.201 as permitted sender) smtp.mailfrom=mike@karels.net; dmarc=none Received: from mail2.karels.net (localhost [IPv6:0:0:0:0:0:0:0:1]) by mail2.karels.net (8.17.1/8.17.1) with ESMTP id 3AJ0gvpm061541; Sat, 18 Nov 2023 18:42:58 -0600 (CST) (envelope-from mike@karels.net) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=karels.net; s=mail2; t=1700354578; bh=axqWBZR23vPddu93kuypa8NtJH9uTiHJlQdxOqlepqk=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=MKgphzQwzdllWx71qvkrBQ8IwIgUrhJ6TDO7VYMGjAzxApolQNR7eC9eQD/bVAHco RDAs2O23/wl9fHgwXx2Ibn6RRNvGTeOCCMAxJjgzEQuItuO31G2TJWoiaXnwvbzlIu +qJZPB3TD4mO7NfpwapNj+NFbKcTVD9sh2JjoqbGk/NyiCWBLB0EdLSuCPE9Q50K/9 jCLAEfcFtaWI9Dsz/fikg++9R/k1ggjWrzXanH8MuWq3xFaSD3k9NOy6HE95TVJp4w ZQ9sl8uL8eeLYWCM3NrJtARNCyG+xxkmLtB+izvpcYCR0E2pPrYJZPhSiKWGmWGu0d 6K+y4CpO38NzA== Received: from [10.0.2.130] ([73.62.165.147]) by mail2.karels.net with ESMTPSA id COcbNBFaWWVj8AAAs/W3XQ (envelope-from ); Sat, 18 Nov 2023 18:42:57 -0600 From: Mike Karels To: Rick Macklem Cc: FreeBSD CURRENT , Alexander Motin , Martin Matuska , Garrett Wollman Subject: Re: NFS exports of ZFS snapshots broken Date: Sat, 18 Nov 2023 18:42:57 -0600 X-Mailer: MailMate (1.14r5964) Message-ID: In-Reply-To: References: <25943.60056.880614.452966@hergotha.csail.mit.edu> <91988E23-ED50-4379-AA5F-4B069E08D80F@karels.net> List-Id: Discussions about the use of FreeBSD-current List-Archive: https://lists.freebsd.org/archives/freebsd-current List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-current@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Spamd-Result: default: False [-2.00 / 15.00]; SUSPICIOUS_RECIPS(1.50)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-1.000]; R_DKIM_ALLOW(-0.20)[karels.net:s=mail2]; R_SPF_ALLOW(-0.20)[+ip4:3.19.118.201]; MIME_GOOD(-0.10)[text/plain]; MLMMJ_DEST(0.00)[freebsd-current@freebsd.org]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; FREEMAIL_TO(0.00)[gmail.com]; ASN(0.00)[asn:16509, ipnet:3.16.0.0/14, country:US]; RCVD_TLS_LAST(0.00)[]; DKIM_TRACE(0.00)[karels.net:+]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCVD_COUNT_TWO(0.00)[2]; FROM_HAS_DN(0.00)[]; FREEFALL_USER(0.00)[mike]; ARC_NA(0.00)[]; TO_DN_ALL(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; DMARC_NA(0.00)[karels.net]; TAGGED_RCPT(0.00)[]; RCPT_COUNT_FIVE(0.00)[5]; RCVD_VIA_SMTP_AUTH(0.00)[] X-Rspamd-Queue-Id: 4SXsLt2K7Kz3NF5 X-Spamd-Bar: - On 18 Nov 2023, at 17:23, Rick Macklem wrote: > On Sat, Nov 18, 2023 at 2:27=E2=80=AFPM Mike Karels w= rote: >> >> On 18 Nov 2023, at 15:58, Rick Macklem wrote: >> >>> On Sat, Nov 18, 2023 at 8:09=E2=80=AFAM Rick Macklem wrote: >>>> >>>> On Fri, Nov 17, 2023 at 8:19=E2=80=AFPM Mike Karels wrote: >>>>> >>>>> On 17 Nov 2023, at 22:14, Mike Karels wrote: >>>>> >>>>>> On 17 Nov 2023, at 21:24, Rick Macklem wrote: >>>>>> >>>>>>> Most of the changes in stable/13 that are not in releng/13.2 >>>>>>> are the "make it work in a jail" stuff. Unfortunately, they are >>>>>>> a large # of changes (mostly trivial edits adding vnet macros), >>>>>>> but it also includes export check changes. >>>>>>> >>>>>>> I have attached a trivial patch that I think disables the export >>>>>>> checks for jails. If either of you can try it and see if it fixes= >>>>>>> the problem, that would be great. >>>>>>> (Note that this is only for testing, although it probably does no= t >>>>>>> matter unless you are running nfsd(8) in vnet jails.) >>>>>> >>>>>> Yes, I can see snapshots with the patch. This system is just a te= st >>>>>> system that doesn't normally run ZFS or NFS, so no problem messing= >>>>>> with permissions. It's a bhyve VM, so I just added a small disk a= nd >>>>>> enabled ZFS for testing. >>>>> >>>>> btw, you might try to get mm@ or maybe mav@ to help out from the ZF= S >>>>> side. It must be doing something differently inside a snapshot tha= n >>>>> outside, maybe with file handles or something like that. >>>> Yes. I've added freebsd-current@ (although Garrett is not on it, he = is >>>> cc'd) and these guys specifically... >>>> >>>> So, here's what appears to be the problem... >>>> Commit 88175af (in main and stable/13, but not 13.2) added checks fo= r >>>> nfsd(8) running in jails by filling in mnt_exjail with a reference t= o the cred >>>> used when the file system is exported. >>>> When mnt_exjail is found NULL, the current nfsd code assumes that th= ere >>>> is no access allowed for the mount. >>>> >>>> My vague understanding is that when a ZFS snapshot is accessed, it i= s >>>> "pseudo-mounted" by zfsctl_snapdir_lookup() and I am guessing that >>>> mnt_exjail is NULL as a result. >>>> Since I do not know the ZFS code and don't even have an easy way to >>>> test this (thankfully Mike can test easily), I do not know what to d= o from >>>> here? >>>> >>>> Is there a "struct mount" constructed for this pseudo mount >>>> (or it actually appears to be the lookup of ".." that fails, so it >>>> might be the parent of the snapshot subdir?)? >>>> >>>> One thought is that I can check to see if the mount pointer is in th= e >>>> mountlist (I don't think the snapshot's mount is in the mountlist) a= nd >>>> avoid the jail test for this case. This would assume that snapshots= are >>>> always within the file system(s) exported via that jail (which inclu= des >>>> the case of prison0, of course), so that they do not need a separate= >>>> jail check. >>>> >>>> If this doesn't work, there will need to be some sort of messing abo= ut >>>> in ZFS to set mnt_exjail for these. >>> Ok, so now onto the hard part... >>> Thanks to Mike and others, I did create a snapshot under .zfs and I c= an >>> see the problem. It is that mnt_exjail =3D=3D NULL. >>> Now, is there a way that this "struct mount" can be recognized as "sp= ecial" >>> for snapshots, so I can avoid the mnt_exjail =3D=3D NULL test? >>> (I had hoped that "mp->mnt_list.tqe_prev" would be NULL, but that is = not >>> the case.) >> >> Dumb question, is the mount point (mp presumably) different between th= e >> snapshot and the main file system? > Not a dump question and the answer is rather interesting... > It is "sometimes" or "usually" according to my printf(). > It seems that when you first "cd where mnt_exjail =3D=3D NULL.. Then when you look at directories within= the > snapshot, you get the mp of the file system that .zfs exists in, which = does > have mnt_exjail set non-NULL. > > There is this snippet of code in zfsctl_snapdir_lookup(): > /* > * Fix up the root vnode mounted on .zfs/snapshot/. > * > * This is where we lie about our v_vfsp in order to > * make .zfs/snapshot/ accessible over NFS > * without requiring manual mounts of . > */ > ASSERT3P(VTOZ(*vpp)->z_zfsvfs, !=3D, zfsvfs); > VTOZ(*vpp)->z_zfsvfs->z_parent =3D zfsvfs; > > /* Clear the root flag (set via VFS_ROOT) as well. */ > (*vpp)->v_vflag &=3D ~VV_ROOT; > which seems to set the mp to that of the parent, but it > seems this does not happen for the initial lookup of > the ? > > I'll note that there is code before this in > zfsctl_snapdir_lookup() for handling cases > like "." and ".." that return without doing this. > > Now, why does this work without the mnt_exjail > check (as in 13.2)? > I am not quite sure, but there is this "cheat" in the > NFS server (it has been there for years, maybe decades): > /* > * Allow a Lookup, Getattr, GetFH, Secinfo on an > * non-exported directory if > * nfs_rootfhset. Do I need to allow any other Ops? > * (You can only have a non-exported vpnes if > * nfs_rootfhset is true. See nfsd_fhtovp()) > * Allow AUTH_SYS to be used for file systems > * exported GSS only for certain Ops, to allow > * clients to do mounts more easily. > */ > if (nfsv4_opflag[op].needscfh && vp) { > if (!NFSVNO_EXPORTED(&vpnes) && > op !=3D NFSV4OP_LOOKUP && > op !=3D NFSV4OP_GETATTR && > op !=3D NFSV4OP_GETFH && > op !=3D NFSV4OP_ACCESS && > op !=3D NFSV4OP_READLINK && > op !=3D NFSV4OP_SECINFO && > op !=3D NFSV4OP_SECINFONONAME) > nd->nd_repstat =3D NFSERR_NOFILEHANDLE; > This allows certain operations to be done on > non-exported file systems and I think that is enough > to allow this to work when mnt_exjail is not checked. > (Note that NFSV4OP_LOOKUPP is not in the list, > which might explain why it is the one that fails for > Garrett. I don't think it can be added to this list > safely, since that would allow a client to move above > the exported file system into "uncharted territory".) > >> Just curious. Also, what is mnt_exjail >> normally set to for file systems not in a jail? > mnt_exjail is set to the credentials of the thread/process > that exported the file system (usually mountd(8)). > When not in a jail, cr_prison for these credentials > points to prison0. > > Btw, I checked and the "other mp that has mnt_exjail =3D=3D NULL > is in the mountlist, so the idea of checking "not in mountlist" > is a dead end. > > I am looking for something "unique" about this other mp, > but haven't found anything yet. > Alternately, it might be necessary to add code to > zfsctl_snapdir_lookup() to "cheat and change the mp" > in more cases, such as "." and ".." lookups? It seems to me that if ZFS is creating an additional mount structure, it should be responsible for setting it up correctly. That could involve a vfs-level routine to do some of the cloning. In any case, it seems to me that mnt_exjail should be set properly, e.g. by duping the one in the original mount structure. Probably ZFS is the only file system type that would need this added. Mike > rick > ps: I added all the cc's back in because I want the > ZFS folk to hopefully chime in. > >> >> Mike >> >>> Do I need to search mountlist for it? >>> >>> rick >>> ps: The hack patch attached should fix the problem, but can only be >>> safely used if mountd/nfsd are not run in any jails. >>> >>>> >>>> I will try and get a test setup going here, which leads me to.. >>>> how do I create a ZFS snapshot? (I do have a simple ZFS pool running= >>>> on a test machine, but I've never done a snapshot.) >>>> >>>> Although this problem is not in 13.2, it will have shipped in 14.0. >>>> >>>> Any help with be appreciated, rick >>>> >>>>> >>>>> Mike >>>>>> >>>>>>> rick >>>>>>> >>>>>>> On Fri, Nov 17, 2023 at 6:14=E2=80=AFPM Mike Karels wrote: >>>>>>>> >>>>>>>> CAUTION: This email originated from outside of the University of= Guelph. Do not click links or open attachments unless you recognize the = sender and know the content is safe. If in doubt, forward suspicious emai= ls to IThelp@uoguelph.ca. >>>>>>>> >>>>>>>> >>>>>>>> Rick, have you been following this thread on freebsd-stable? I = have been able >>>>>>>> to reproduce this using a 13-stable server from Oct 7 and a 15-c= urrent system >>>>>>>> that is up to date using NFSv3. I did not reproduce with a 13.2= server. The >>>>>>>> client was running 13.2. Any ideas? A full bisect seems fairly= painful, but >>>>>>>> maybe you have an idea of points to try. Fortunately, these are= all test >>>>>>>> systems that I can reboot at will. >>>>>>>> >>>>>>>> Mike >>>>>>>> >>>>>>>> Forwarded message: >>>>>>>> >>>>>>>>> From: Garrett Wollman >>>>>>>>> To: Mike Karels >>>>>>>>> Cc: freebsd-stable@freebsd.org >>>>>>>>> Subject: Re: NFS exports of ZFS snapshots broken >>>>>>>>> Date: Fri, 17 Nov 2023 17:35:04 -0500 >>>>>>>>> >>>>>>>>> < said: >>>>>>>>> >>>>>>>>>> I have not run into this, so I tried it just now. I had no pr= oblem. >>>>>>>>>> The server is 13.2, fully patched, the client is up-to-date -c= urrent, >>>>>>>>>> and the mount is v4. >>>>>>>>> >>>>>>>>> On my 13.2 client and 13-stable server, I see: >>>>>>>>> >>>>>>>>> 25034 ls CALL open(0x237d32f9a000,0x120004) >>>>>>>>> 25034 ls NAMI "/mnt/tools/.zfs/snapshot/weekly-2023-45"= >>>>>>>>> 25034 ls RET open 4 >>>>>>>>> 25034 ls CALL fcntl(0x4,F_ISUNIONSTACK,0x0) >>>>>>>>> 25034 ls RET fcntl 0 >>>>>>>>> 25034 ls CALL getdirentries(0x4,0x237d32faa000,0x1000,0= x237d32fa7028) >>>>>>>>> 25034 ls RET getdirentries -1 errno 5 Input/output err= or >>>>>>>>> 25034 ls CALL close(0x4) >>>>>>>>> 25034 ls RET close 0 >>>>>>>>> 25034 ls CALL exit(0) >>>>>>>>> >>>>>>>>> Certainly a libc bug here that getdirentries(2) returning [EIO]= >>>>>>>>> results in ls(1) returning EXIT_SUCCESS, but the [EIO] error is= >>>>>>>>> consistent across both FreeBSD and Linux clients. >>>>>>>>> >>>>>>>>> Looking at this from the RPC side: >>>>>>>>> >>>>>>>>> (PUTFH, GETATTR, LOOKUP(snapshotname), GETFH, GETATTR) >>>>>>>>> [NFS4_OK for all ops] >>>>>>>>> (PUTFH, GETATTR) >>>>>>>>> [NFS4_OK, NFS4_OK] >>>>>>>>> (PUTFH, ACCESS(0x3f), GETATTR) >>>>>>>>> [NFS4_OK, NFS4_OK, rights =3D 0x03, NFS4_OK] >>>>>>>>> (PUTFH, GETATTR, LOOKUPP, GETFH, GETATTR) >>>>>>>>> [NFS4_OK, NFS4_OK, NFS4ERR_NOFILEHANDLE] >>>>>>>>> >>>>>>>>> and at this point the [EIO] is returned. >>>>>>>>> >>>>>>>>> It seems that clients always do a LOOKUPP before calling READDI= R, and >>>>>>>>> this is failing when the subject file handle is the snapshot. = The >>>>>>>>> client is perfectly able to *traverse into* the snapshot: if I = try to >>>>>>>>> list a subdirectory I know exists in the snapshot, the client i= s able to >>>>>>>>> LOOKUP(dirname) just fine, but LOOKUPP still fails with >>>>>>>>> NFS4ERR_NOFILEHANDLE *on the subndirectory*. >>>>>>>>> >>>>>>>>> -GAWollman >>>>>>>>