Date: Tue, 19 Apr 2022 08:58:55 -0700
From: Doug Ambrisko <ambrisko@ambrisko.com>
To: Mateusz Guzik
Cc: freebsd-current@freebsd.org
Subject: Re: nullfs and ZFS issues
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current

On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
| Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
|
| this is not committable but should validate whether it works fine

As a POC it's working.  I see the vnode count for nullfs and ZFS go
up.  The ARC cache also goes up until it exceeds the ARC max. size,
then the vnodes for nullfs and ZFS go down.  The ARC cache goes down
as well.  This all repeats over and over.  The system seems healthy;
no excessive running of arc_prune or arc_evict.

My only comment is that the vnode freeing seems a bit aggressive,
going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
The ARC drops from 70M to 7M (max is set at 64M) for this unit test.

Thanks,

Doug A.

| On 4/19/22, Mateusz Guzik wrote:
| > On 4/19/22, Mateusz Guzik wrote:
| >> On 4/19/22, Doug Ambrisko wrote:
| >>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
| >>> localhost NFS mounts instead of nullfs when nullfs would complain
| >>> that it couldn't mount.  Since that check has been removed, I've
| >>> switched to nullfs only.  However, every so often my laptop would
| >>> get slow and the ARC evict and prune threads would consume two
| >>> cores at 100% until I rebooted.  I had a 1G max. ARC and have
| >>> increased it to 2G now.  Looking into this has uncovered some issues:
| >>>   - nullfs would prevent vnlru_free_vfsops from doing anything
| >>>     when called from ZFS arc_prune_task
| >>>   - nullfs would hang onto a bunch of vnodes unless mounted with
| >>>     nocache
| >>>   - nullfs and nocache would break untar.  This has been fixed now.
| >>>
| >>> With nullfs, nocache and setting max vnodes to a low number I can
| >>> keep the ARC around the max. without evict and prune consuming
| >>> 100% of 2 cores.  This doesn't seem like the best solution, but it's
| >>> better than when the ARC starts spinning.
| >>>
| >>> Looking into this issue with bhyve and an md drive for testing, I
| >>> create a brand new zpool mounted as /test and then nullfs mount
| >>> /test to /mnt.  I loop through untarring the Linux kernel into the
| >>> nullfs mount, rm -rf it and repeat.  I set the ARC to the smallest
| >>> value I can.  Untarring the Linux kernel was enough to get the ARC
| >>> evict and prune to spin since they couldn't evict/prune anything.
| >>>
| >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task, I see:
| >>>
| >>> static int
| >>> vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
| >>> {
| >>> ...
| >>>
| >>>         for (;;) {
| >>>         ...
| >>>                 vp = TAILQ_NEXT(vp, v_vnodelist);
| >>>         ...
| >>>
| >>>                 /*
| >>>                  * Don't recycle if our vnode is from different type
| >>>                  * of mount point.  Note that mp is type-safe, the
| >>>                  * check does not reach unmapped address even if
| >>>                  * vnode is reclaimed.
| >>>                  */
| >>>                 if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
| >>>                     mp->mnt_op != mnt_op) {
| >>>                         continue;
| >>>                 }
| >>> ...
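
For what it's worth, the "fall back when the filtered pass frees
nothing" idea that comes up further down in the thread could be
sketched roughly as below.  This is purely illustrative and not taken
from vnlru_free_pick.diff; the wrapper name is made up, and
vnlru_free_impl() is assumed to return the number of vnodes it
actually freed (its signature is the one quoted above):

static int
vnlru_free_filtered_or_any(int count, struct vfsops *mnt_op, struct vnode *mvp)
{
        int freed;

        /* First pass: only recycle vnodes matching the requested vfsops. */
        freed = vnlru_free_impl(count, mnt_op, mvp);

        /*
         * Nothing matched, e.g. the free list was dominated by nullfs
         * vnodes stacked on top of the requested filesystem.  Fall back
         * to an unfiltered pass so those can be recycled instead, which
         * in turn releases their references on the lower vnodes.
         */
        if (freed == 0 && mnt_op != NULL)
                freed = vnlru_free_impl(count, NULL, mvp);

        return (freed);
}
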
| >>>
| >>> The vp ends up being the nullfs mount and then hits the continue
| >>> even though the passed-in mvp is on ZFS.  If I do a hack to
| >>> comment out the continue, then I see the ARC, nullfs vnodes and
| >>> ZFS vnodes grow.  When the ARC calls arc_prune_task, which calls
| >>> vnlru_free_vfsops, the vnodes now go down for nullfs and ZFS.
| >>> The ARC cache usage also goes down.  Then they increase again until
| >>> the ARC gets full and then they go down again.  So with this hack
| >>> I don't need nocache passed to nullfs and I don't need to limit
| >>> the max vnodes.  Doing multiple untars in parallel over and over
| >>> doesn't seem to cause any issues for this test.  I'm not saying
| >>> commenting out the continue is the fix; it's just a simple POC test.
| >>>
| >>
| >> I don't see an easy way to say "this is a nullfs vnode holding onto a
| >> zfs vnode".  Perhaps the routine can be extended with issuing a nullfs
| >> callback, if the module is loaded.
| >>
| >> In the meantime I think a good enough(tm) fix would be to check that
| >> nothing was freed and fall back to good old regular cleanup without
| >> filtering by vfsops.  This would be very similar to what you are doing
| >> with your hack.
| >>
| >
| > Now that I wrote this, perhaps an acceptable hack would be to extend
| > struct mount with a pointer to the "lower layer" mount (if any) and
| > patch the vfsops check to also look there.
| >
| >>
| >>> It appears that when ZFS is asking for cached vnodes to be
| >>> free'd, nullfs also needs to free some up as well so that
| >>> they are free'd at the VFS level.  It seems that vnlru_free_impl
| >>> should allow some of the related nullfs vnodes to be free'd so
| >>> the ZFS ones can be free'd and reduce the size of the ARC.
| >>>
| >>> BTW, I also hacked the kernel and mount to show the vnodes used
| >>> per mount, i.e. mount -v:
| >>>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid
| >>>     2b23b2a1de21ed66, vnodes: count 13846 lazy 0)
| >>>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
| >>>     11ff002929000000, vnodes: count 13846 lazy 0)
| >>>
| >>> Now I can easily see how the vnodes are used without going into ddb.
| >>> On my laptop I have various vnet jails and nullfs mount my homedir
| >>> into them, so pretty much everything goes through nullfs to ZFS.
| >>> I'm limping along with the nullfs nocache and a small number of
| >>> vnodes, but it would be nice to not need that.
| >>>
| >>> Thanks,
| >>>
| >>> Doug A.
| >>>
| >>
| >> --
| >> Mateusz Guzik
| >>
| >
| > --
| > Mateusz Guzik
| >
|
| --
| Mateusz Guzik
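
As a footnote on the "lower layer" mount idea quoted above: a rough
sketch, assuming a hypothetical mnt_lower member in struct mount that
a stacked filesystem such as nullfs would point at the mount it covers
(no such field exists today), might look like this:

        /* In struct mount (hypothetical member): */
        struct mount    *mnt_lower;     /* mount this one is stacked over, or NULL */

        /* In vnlru_free_impl(), in place of the check quoted earlier: */
        if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
            mp->mnt_op != mnt_op &&
            (mp->mnt_lower == NULL || mp->mnt_lower->mnt_op != mnt_op)) {
                continue;
        }

With something like that, a nullfs vnode sitting on top of a ZFS mount
would count as recyclable when ZFS asks for frees, and recycling it
drops the use count on the underlying ZFS vnode so that one can be
freed as well.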