[Bug 275594] High CPU usage by arc_prune; analysis and fix
Date: Mon, 25 Dec 2023 10:21:27 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #14 from Seigo Tanimura <seigo.tanimura@gmail.com> ---

(In reply to Seigo Tanimura from comment #12)

Thanks for your patience. I have spent quite some days setting up a more
precise and convenient measurement pipeline with Fluent Bit, Elasticsearch
and Kibana. It is now working, and I am ready to share the test results
and my analysis of them.

-----

Sources on GitHub:

- FreeBSD
  - Repo
    - https://github.com/altimeter-130ft/freebsd-freebsd-src
  - Branches
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-fix
      - No changes since the last test.
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
      - Adds the vnode tag counting. (disabled by default)
      - Refactors the ZFS ARC stats for the precise investigation.
      - These changes may be squashed or removed in the future.
- Fluent Bit
  - Repo
    - https://github.com/altimeter-130ft/fluent-fluent-bit
  - Branch
    - topic-plugin-in-sysctl
      - The input plugin for sysctl(3).
      - Other fixes included.
      - No documentation yet.
  - Note
    - "-DFLB_CORO_STACK_SIZE=1048576" is recommended on cmake(1); the
      default coroutine stack size is only a few kilobytes on FreeBSD.

-----

Test Summary:

- Date: 21 Dec 2023 17:50Z - 22 Dec 2023 00:50Z
- Build time: 07:00:59 (304 pkgs / hr)
- Failed ports: 3
- Setup
  - vfs.vnode.vnlru.max_free_per_call: 4000000
  - vfs.zfs.arc.prune_interval: 1000 (my fix for the arc_prune interval enabled)
  - vfs.vnode.vnlru.extra_recycle: 1 (extra vnode recycle fix enabled)
  - vfs.zfs.arc.dnode_limit: 2684354560 (2.5G, larger than the max actual
    value observed so far)

-----

Findings Summary:

- Vnode behavior
  - The vnodes circulate between creation and pruning.
  - Almost all of the unrecyclable vnodes have one use count, which also
    accounts for one extra hold count.
- ZFS behavior
  - The direct trigger of the ARC pruning is a low ARC meta value.
  - There is a massive amount of ghost data hits.
-----

Finding Details:

The discussion below refers to the charts in the attached archive
poudriere-bulk-2023-12-22_02h55m25s. The times on the chart horizontal
axes are in UTC.

- Vnode behavior
  - The vnodes circulate between creation and pruning.
    * Charts: number-of-vnodes.png, zfs-vnode-free-calls.png,
      zfs-freed-vnodes.png.
    Although zfs-freed-vnodes.png shows steady vnode recycling in some time
    sections, the number of vnodes stays almost flat. (Typical case:
    23:20Z - 00:40Z) This means that the vnodes are recycled and created at
    almost the same rate.
  - Almost all of the unrecyclable vnodes have one use count, which also
    accounts for one extra hold count.
    * Charts: zfs-vnode-recycle-phase2-reg-retries.png,
      zfs-vnode-recycle-phase2-dir-retries.png.
    The non-zero traces on the charts are the overlaps of:
    - free_phase2_retry_*: the skipped vnodes,
    - free_phase2_retry_*_inuse: the vnodes with a non-zero v_usecount, and
    - free_phase2_retry_*_unknown1: the vnodes with one unknown extra
      v_holdcnt.
    A non-zero v_usecount is sufficient to explain the "unknown" extra
    v_holdcnt: the current vnode recycling code does not work on vnodes
    with a non-zero v_usecount, and one v_usecount implies one v_holdcnt
    on FreeBSD.
- ZFS behavior
  - The direct trigger of the ARC pruning is a low ARC meta value.
    * Charts: zfs-arc-meta.png, zfs-arc-dnode-size.png.
    arc_evict() triggers the async ARC pruning when either of the following
    conditions holds:
    A) More than 3/4 of the wanted metadata size target in the ARC is not
       evictable, or
    B) The size allocated for the dnodes exceeds vfs.zfs.arc.dnode_limit.
    B) is not the case in this test; the max ZFS ARC dnode size was ~1.6GB,
    less than vfs.zfs.arc.dnode_limit. (2.5GB)
    The ZFS ARC meta value is the ratio of the wanted metadata size to the
    whole ARC size, stored as a 32-bit fixed-point decimal scaled to 1G and
    initialized to 25%. A higher (lower) value means more (less) room for
    the ARC metadata.
    The reason for the sharp drop of this value is discussed next.
  - There is a massive amount of ghost data hits.
    * Charts: zfs-arc-mru-ghost-data-hits.png, zfs-arc-mfu-ghost-data-hits.png,
      zfs-arc-mru-ghost-metadata-hits.png, zfs-arc-mfu-ghost-metadata-hits.png.
    Please watch out for the vertical axes of the charts:
    - The "hits" are sizes, not counts, in ZFS.
    - The scales of the data hits are about 10 times larger than those of
      the metadata hits.
    The ZFS ARC meta value is automatically tuned by the sizes of the ghost
    hits passed to arc_evict_adj(), called from arc_evict(). The idea is to
    favour whichever of the data and metadata is experiencing more ghost
    hits. The ghost hits on the MRU and MFU data were about 10 times larger
    than their counterparts on the metadata, which explains the sharp drop
    of the ZFS ARC meta starting at about 18:20Z. Also, large ghost data
    hits kept occurring occasionally, while the ghost metadata hits seldom
    exceeded the data; hence the ZFS ARC meta stayed low until the end of
    the build test.

-----

Next Work:

- Check if the vnodes with a non-zero v_usecount are actually open(2)ed.
  - Compare the count to kern.openfiles.
  - Estimation out of the current results:
    - Max free_phase2_retry_reg_inuse per 10 minutes: ~270K @ 00:20Z.
    - The ZFS vnode free call period at that time: ~1.57 [s].
      (<= 1 / (382.5 / (10 * 60)))
    - The hits on the vnodes in use per ZFS vnode free call: ~707.
      (<= 270K / (10 * 60) * 1.57)
  - If this figure is reasonably close to kern.openfiles, the phase2
    retries are indeed caused by the open(2)ed vnodes, and such vnodes
    cannot be evicted.

-- 
You are receiving this mail because:
You are the assignee for the bug.