[Bug 275594] High CPU usage by arc_prune; analysis and fix
Date: Mon, 11 Dec 2023 08:45:26 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #10 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
(In reply to Mark Johnston from comment #9)

> vnodes live on a global list, chained by v_vnodelist, and this list
> appears to be used purely for reclamation.

The free vnodes are indeed chained to vnode_list in sys/kern/vfs_subr.c, but
"free" here means "not opened by any user process," i.e. vp->v_usecount == 0.

Besides the user processes, the kernel may use a "free" vnode for its own
purposes. In that case, the kernel "holds" the vnode via vhold(9), making
vp->v_holdcnt > 0. A vnode held by the kernel in this way cannot be recycled
even though no user process has it open. vnlru_free_impl() checks whether the
vnode in question is held and skips recycling if so.

In the tests so far, I have seen vnlru_free_impl() skip many vnodes this way,
especially during the late phase of "poudriere bulk". The results and
findings are shown at the end of this comment.

-----

> If arc_prune() is spending most of its time reclaiming tmpfs vnodes, then
> it does nothing to address its targets; it may as well do nothing.

Again, the mixed use of tmpfs and ZFS has actually turned out to be a rather
minor problem. Please refer to my findings at the end of this comment.

Also, there are some easier workarounds that can be tried first, if this
really is the issue:

- Perform the test of vp->v_mount->mnt_op before vp->v_holdcnt. This should
  work for now because ZFS is the only filesystem that calls
  vnlru_free_vfsops() with a valid mnt_op. (A rough sketch of this idea
  follows right after this list.)
- After a preconfigured number of consecutive skips, move the marker vnode to
  the restart point, release vnode_list_mtx and yield the CPU. This already
  happens when a vnode is actually recycled, which may block.
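To make the first workaround concrete, below is a heavily simplified sketch
of the vnlru_free_impl() iteration with the filesystem test performed before
the hold-count test. This is not a patch against sys/kern/vfs_subr.c:
vnlru_try_recycle_sketch() is a hypothetical stand-in for the real recycling
path (vhold(9) followed by vtryrecycle()), and the marker repositioning,
locking and counter updates are all omitted.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/queue.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mount.h>
#include <sys/vnode.h>

/*
 * Hypothetical stand-in for the real recycling path, which holds the
 * vnode (vhold(9)) and then calls vtryrecycle().
 */
int	vnlru_try_recycle_sketch(struct vnode *vp);

/*
 * Simplified sketch only: walk vnode_list from the marker vnode mvp and
 * recycle up to "count" free vnodes of the filesystem given by mnt_op.
 * The caller is assumed to hold vnode_list_mtx around the whole loop.
 */
int
vnlru_free_sketch(int count, struct vfsops *mnt_op, struct vnode *mvp)
{
	struct vnode *vp;
	struct mount *mp;
	int freed;

	freed = 0;
	vp = mvp;
	while (count > 0 && (vp = TAILQ_NEXT(vp, v_vnodelist)) != NULL) {
		if (__predict_false(vp->v_type == VMARKER))
			continue;
		/*
		 * Workaround: reject vnodes of unrelated filesystems first
		 * ("phase 3"), before ever looking at the hold count.  This
		 * is only safe while ZFS is the sole caller passing a
		 * non-NULL mnt_op down from vnlru_free_vfsops().
		 */
		if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
		    mp->mnt_op != mnt_op)
			continue;
		/* A held vnode cannot be recycled; skip it ("phase 2"). */
		if (vp->v_holdcnt > 0)
			continue;
		if (vnlru_try_recycle_sketch(vp) == 0) {
			freed++;
			count--;
		}
	}
	return (freed);
}

The only point of the sketch is the ordering: a vnode of an unrelated
filesystem is rejected by the cheap mnt_op comparison and never contributes
to the phase 2 skips.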
> Suppose that arc_prune is disabled outright. How does your test fare?

Difficult to tell. I am sure the ARC size would keep increasing at first, but
I cannot tell whether it eventually reaches an equilibrium point because of
the builder cleanup, or keeps rising.

-----

In order to investigate the details of the held vnodes found in
vnlru_free_impl(), I have conducted another test with some additional
counters.

Source on GitHub:
- Repo: https://github.com/altimeter-130ft/freebsd-freebsd-src/tree/release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
- Branch: release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters

Test setup: the same as "Ongoing test" in bug #275594, comment #6.
- vfs.vnode.vnlru.max_free_per_call: 4000000 (== vfs.vnode.vnlru.max_free_per_call)
- vfs.zfs.arc.prune_interval: 1000 (my fix enabled)

Build time: 06:32:57 (325 pkgs / hr)

Counters after completing the build, with some remarks:

# The iteration attempts in vnlru_free_impl().
# This includes the retries from the head of vnode_list.
vfs.vnode.free.free_attempt: 29695926809
# The number of vnodes recycled successfully, including by vtryrecycle().
vfs.vnode.free.free_success: 30841748
# The number of iteration skips due to a held vnode. ("phase 2" hereafter)
vfs.vnode.free.free_phase2_retry: 11909948307
# The number of phase 2 skips upon VREG (regular file) vnodes.
vfs.vnode.free.free_phase2_retry_reg: 7877197761
# The number of phase 2 skips upon VBAD (being recycled) vnodes.
vfs.vnode.free.free_phase2_retry_bad: 3101137010
# The number of phase 2 skips upon VDIR (directory) vnodes.
vfs.vnode.free.free_phase2_retry_dir: 899106296
# The number of phase 2 skips upon VNON (being created) vnodes.
vfs.vnode.free.free_phase2_retry_non: 2046379
# The number of phase 2 skips upon doomed (being destroyed) vnodes.
vfs.vnode.free.free_phase2_retry_doomed: 3101137196
# The number of iteration skips due to a filesystem mismatch. ("phase 3" hereafter)
vfs.vnode.free.free_phase3_retry: 17755077891

Analysis and Findings:

Out of ~30G iteration attempts in vnlru_free_impl(), ~12G failed in phase 2.
(The phase 3 failures amount to ~18G, but there are some workaround ideas for
those, shown above.)

Among the phase 2 failures, the most dominant vnode type is VREG. For this
type, I suspect the resident VM pages alive in the kernel; a VM object holds
its backing vnode as long as the object has at least one resident page.
Please refer to vm_page_insert_after() and vm_page_insert_radixdone() for the
implementation; a minimal sketch of this hold is also shown at the end of
this comment.

Technically, such vnodes can be recycled as long as the prerequisites checked
in vtryrecycle() are met under the sufficient locks, and those prerequisites
do not include the resident VM pages. vnode_destroy_vobject(), called from
vgonel(), takes care of those pages. I suppose we would have to do this if
more work is required in vnlru_free_impl(), maybe during the retry after
reaching the end of vnode_list.

This further fix assumes that ZFS does the appropriate work to reduce the ARC
size upon reclaiming a ZFS vnode.

The rest of the cases are either difficult or impossible to improve further.

A VDIR vnode is held by the name cache to improve the path resolution
performance, both forward and backward. While vnodes of this kind can be
reclaimed somehow, a significant performance penalty upon path resolution is
expected.

VBAD and VNON are actually states rather than types of vnodes. Neither state
is eligible for recycling, by design.
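Finally, the minimal sketch of the VREG hold mentioned above. It is modelled
on the behaviour described around vm_page_insert_after() and
vm_page_insert_radixdone() (hold the backing vnode when the first page goes
resident, drop the hold when the last page goes away); it is not copied from
the actual sys/vm code, and both function names below are made up for
illustration.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/vnode.h>
#include <vm/vm.h>
#include <vm/vm_object.h>

/*
 * Illustrative sketch, not the actual sys/vm code: a vnode-backed VM object
 * keeps its backing vnode held while it has at least one resident page,
 * which is what makes such a VREG vnode show up as "held" in
 * vnlru_free_impl() even though no process has it open.
 */
void
vnode_object_page_inserted_sketch(vm_object_t object)
{
	/* The first page became resident: pin the backing vnode. */
	if (object->type == OBJT_VNODE && object->resident_page_count == 1)
		vhold((struct vnode *)object->handle);
}

void
vnode_object_page_removed_sketch(vm_object_t object)
{
	/* The last resident page is gone: the vnode may be recycled again. */
	if (object->type == OBJT_VNODE && object->resident_page_count == 0)
		vdrop((struct vnode *)object->handle);
}

Until the last resident page goes away (or vnode_destroy_vobject() flushes
the pages during a reclaim, as vgonel() does), the hold count stays above
zero and vnlru_free_impl() keeps skipping the vnode.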