Did something change with ZFS and vnode caching?

From: Garrett Wollman <wollman_at_bimajority.org>
Date: Mon, 21 Aug 2023 15:31:12 UTC

Hi, all,

As I've mentioned before, we have been upgrading our servers from 12.4
to 13.2.  Over the past week I've noticed on a number of our NFS
servers that our backups are running very slowly, taking much longer
than normal, with the `vnlru` process taking a whole CPU and the load
average ballooning to 40 or more.  At the same time, NFS service becomes
extremely slow.  A look at the vnode cache shows that it's at the
limit, and increasing `kern.maxvnodes` helps only for a few seconds,
until the vnode population reaches the new limit.  This never happened
under 12.4.  Things return to normal when the backup clients are
killed.  (Usually as many as four run in parallel with multiple
threads scanning the filesystems.)
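
For anyone who wants to watch the same counters, here is a minimal
sketch of comparing the vnode population against the limit.  It assumes
a FreeBSD host where `sysctl -n vfs.numvnodes kern.maxvnodes` prints one
value per line; the function names are hypothetical, and this is not the
tooling used on the servers described above.

```python
import subprocess


def parse_counters(text):
    """Parse two whitespace-separated integers: numvnodes, then maxvnodes."""
    numvnodes, maxvnodes = (int(v) for v in text.split())
    return numvnodes, maxvnodes


def read_vnode_counters():
    """Ask sysctl for the current vnode count and the configured limit.

    Assumption: runs on FreeBSD, where both sysctl names exist.
    """
    result = subprocess.run(
        ["sysctl", "-n", "vfs.numvnodes", "kern.maxvnodes"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_counters(result.stdout)


def vnode_pressure(numvnodes, maxvnodes):
    """Fraction of the vnode limit currently in use (1.0 means at the limit)."""
    return numvnodes / maxvnodes
```

Polling `read_vnode_counters()` every few seconds while a backup runs
would show whether the population climbs straight back to the limit
after `kern.maxvnodes` is raised.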

These machines have hundreds of terabytes of filesystem data, billions
of files, and typically between 128 and 256 GiB of RAM.
In the normal case of an incremental backup, the backup client will
scan the filesystem and stat(2) every file in sequence but won't
actually open the files that haven't been modified.  Perhaps these
vnodes aren't getting discarded soon enough?
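
To make the access pattern concrete, here is a rough sketch of an
incremental scan of the kind described above: every file is stat'ed,
but only files modified since the last backup are opened.  This is an
illustration only, not the actual backup client; `incremental_scan` and
`last_backup_time` are hypothetical names.

```python
import os
import stat


def incremental_scan(root, last_backup_time):
    """Walk root, lstat() every file, and open only the modified ones.

    Returns (number of files stat'ed, list of paths that were opened).
    """
    statted = 0
    opened = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # touches the file's metadata only
            except OSError:
                continue  # file vanished or is unreadable; skip it
            statted += 1
            # Unmodified files are never opened, only stat'ed.
            if stat.S_ISREG(st.st_mode) and st.st_mtime > last_backup_time:
                try:
                    with open(path, "rb") as f:
                        f.read(1)  # stand-in for actually backing up the data
                except OSError:
                    continue
                opened.append(path)
    return statted, opened
```

On a tree with billions of mostly unchanged files, nearly all of the
work in a loop like this is the `lstat()` calls, so the kernel sees a
long stream of files that are looked up once and never opened.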

-GAWollman