From: bugzilla-noreply@freebsd.org
To: fs@FreeBSD.org
Subject: [Bug 275594] High CPU usage by arc_prune; analysis and fix
Date: Fri, 05 Jan 2024 17:27:20 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #22 from Seigo Tanimura ---
(In reply to Seigo Tanimura from comment #18)
> It may be good to account for the number of ZFS vnodes not in use.  Before
> starting an ARC pruning pass, we can check that count and defer pruning if
> it is too low.
> (snip)
> Figure out the requirement and design of the accounting above.

Done.

* Sources on GitHub:
  - Repo
    - https://github.com/altimeter-130ft/freebsd-freebsd-src
  - Branches
    - Fix only
      - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-fix
    - Fix and counters
      - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters

* New ZFS vnode / znode counters for ARC pruning
  - Total ZFS vnodes
    - sysctl(3): vfs.zfs.znode.count
    - Counter variable: zfs_znode_count
  - ZFS vnodes in use (v_usecount > 0)
    - sysctl(3): vfs.zfs.znode.inuse
    - Counter variable: zfs_znode_inuse_count
  - ARC-prunable ZFS vnodes
    - sysctl(3): vfs.zfs.znode.prunable
    - Formula: zfs_znode_count - zfs_znode_inuse_count
  - ARC pruning requests
    - sysctl(3): vfs.zfs.znode.pruning_requested
    - Counter variable: zfs_znode_pruning_requested
  - Skipped ARC pruning requests
    - sysctl(3): vfs.zfs.znode.pruning_skipped
    - Counter variable: zfs_znode_pruning_skipped

* Design of counter operations
  - Total ZFS vnodes (zfs_znode_count)
    - Increment upon creating a new ZFS vnode.
      - zfs_znode_alloc()
    - Decrement upon reclaiming a ZFS vnode.
      - zfs_freebsd_reclaim()
  - ZFS vnodes in use (zfs_znode_inuse_count, "v_usecount > 0")
    - Exported to the VFS via mnt_fsvninusep of struct mount.
      - Both the VFS and ZFS have to operate on the counter.
      - struct vnode cannot be expanded any further.
    - Increment upon inserting a new ZFS vnode into a ZFS mountpoint.
      - zfs_mknode()
      - zfs_zget()
    - Increment upon vget() and the like.
      - vget_finish_ref()
    - Decrement upon vput() and the like.
      - vput_final()

* Design of ARC pruning regulation
  - Required condition, checked in zfs_prune_task(uint64_t nr_to_scan, void *arg __unused):
    - Condition: (zfs_znode_count - zfs_znode_inuse_count) * dn / zfs_znode_inuse_count >= nr_to_scan
      - dn: the number of dnodes.
      - The prunable znodes are scaled to the dnodes linearly because a znode may span multiple dnodes.
    - Call vnlru_free_vfsops() only when the condition above is satisfied.
    - A sketch of this check is shown right after this list.
  - Other changes to ARC pruning
    - Refactor the extra vnode recycling into 2 togglable features:
      - sysctl(3): vfs.vnode.vnlru.recycle_bufs_pages
        - Recycle the vnodes with clean buffers and clean/dirty VM pages.
      - sysctl(3): vfs.vnode.vnlru.recycle_nc_src
        - Recycle the vnodes working as namecache sources.
      - Both enabled by default.
    - Retire the interval between ARC pruning passes introduced by the initial fix.
      - The ARC pruning regulation above is more precise.
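
For illustration, here is a minimal userland-style sketch of the regulation condition above.  The counter names and the condition come from this change; the helper name, the sample values and the way the dnode count is passed in are assumptions made for the example, not the actual patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Stand-ins for the new global counters; in the kernel they are updated
 * by zfs_znode_alloc(), zfs_freebsd_reclaim(), vget_finish_ref() and
 * vput_final() as listed above.
 */
static uint64_t zfs_znode_count = 1000000;      /* total ZFS vnodes */
static uint64_t zfs_znode_inuse_count = 995000; /* vnodes with v_usecount > 0 */

/*
 * The regulation condition: run the pruning only when the prunable
 * znodes, scaled linearly to dnodes, cover the requested scan count.
 */
static bool
arc_prune_worthwhile(uint64_t nr_to_scan, uint64_t dnode_count)
{
	uint64_t inuse = zfs_znode_inuse_count;
	uint64_t prunable;

	if (inuse == 0)
		return (true);          /* nothing in use; prune freely */

	prunable = zfs_znode_count - inuse;

	/*
	 * A znode may span multiple dnodes; dnode_count / inuse estimates
	 * the dnodes-per-znode ratio, hence the linear scaling.
	 */
	return (prunable * dnode_count / inuse >= nr_to_scan);
}

int
main(void)
{
	/*
	 * With ~5000 prunable znodes and ~2 dnodes per znode, a request to
	 * scan 100000 dnodes is fruitless and gets skipped.
	 */
	printf("prune? %s\n",
	    arc_prune_worthwhile(100000, 2000000) ? "yes" : "skip");
	return (0);
}

In zfs_prune_task(), a request bumps zfs_znode_pruning_requested, and a failed check bumps zfs_znode_pruning_skipped and returns before vnlru_free_vfsops() is called.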

* Test results

Test Summary:
- Date: 03 Jan 2024 01:30Z - 03 Jan 2024 08:13Z
- Build time: 06:43:25 (317 pkgs / hr)
- Failed port(s): 1
- Setup
  - sysctl(3)
    - vfs.vnode.vnlru.max_free_per_call: 4000000
      - == vfs.vnode.param.limit.
    - vfs.zfs.arc_max: 4294967296
      - 4GB.
    - vfs.zfs.arc.dnode_limit=8080000000
      - 2.5 * (vfs.vnode.param.limit) * sizeof(dnode_t)
        - With vfs.vnode.param.limit = 4,000,000, this works out to 2.5 * 4,000,000 * 808 bytes = 8,080,000,000.
      - 2.5: experimental average dnodes per znode (2.0) + margin (0.5)
  - poudriere-bulk(8)
    - USE_TMPFS="wrkdir data localbase"

Result Chart Archive (1 / 2): (poudriere-bulk-2024-01-03_10h30m00s.7z)
- zfs-znodes-and-dnodes.png
  - The counts of the ZFS znodes and dnodes.
- zfs-dnodes-and-freeing-activity.png
  - The freeing activity of the ZFS znodes and dnodes.
- vnode-free-calls.png
  - The calls to the ZFS vnode freeing functions.

Result Chart Archive (2 / 2): (poudriere-bulk-2024-01-03_10h30m00s-zfs-arc.7z)
- zfs-arc/zfs-arc-meta.png
  - The balancing of the ZFS ARC metadata and data.
- zfs-arc/zfs-arc-(A)-(B)-(C).png
  - The ZFS ARC stats.
    (A): MRU (mru) or MFU (mfu).
    (B): Metadata (metadata) or data (data); the "ghost-" prefix denotes the evicted cache.
    (C): Size (size) or hits (hits); the hits count the hit sizes, not the hit counts.

Finding Summary:
- The ZFS ARC meta target was driven down strongly, contradicting the high metadata demand in the ZFS ARC.
  - Both behaviours are as designed.
- The low ZFS ARC meta value triggered the aggressive ARC pruning.
  - Again, this is as designed.
- The ARC pruning regulation throttled the load as expected.
  - Virtually no load occurred when only one or two builders were running.
  - The fruitless pruning was eliminated.

Analysis in Detail:

- ZFS znode and dnode counts (zfs-znodes-and-dnodes.png)

The green and blue traces show the counts of the total and in-use ZFS znodes, respectively.  The gap between these lines denotes the prunable ZFS znodes, also shown as the red trace.

Those traces show that there are almost no prunable znodes, so it is useless to perform the ARC pruning too often.

- ZFS znode and dnode freeing activity (zfs-dnodes-and-freeing-activity.png)

The red trace is the count of the znodes freed by the ARC pruning.  It worked in the first hour because the build covered many small ports, where the builder cleanup released many znodes.  After that, the build moved to the big, long-running ones (lang/rust, lang/gcc12, ...) and the znode release ceased.  A couple of occasional bumps happened upon the builder cleanups after finishing the builds of such ports.

- Vnode free calls (vnode-free-calls.png)

The non-zero traces are vfs.zfs.znode.pruning_requested and vfs.zfs.znode.pruning_skipped, almost completely overlapping.  After 02:45Z, there were no counts on vfs.vnode.free.* shown by the median.  This means the ARC pruning was either not performed at all or only exceptionally.

The magnitude of vfs.zfs.znode.pruning_requested shows the high pressure of the ARC pruning requests from ZFS.  The top peak at 02:20Z is ~1.8M / 5 mins == 6K / sec.  The ARC pruning requests definitely need solid throttling because a typical ARC pruning pass takes up to ~0.2 seconds even when there are actually no prunable vnodes. [1]  Even under a steady light load in 06:25Z - 08:05Z (working on emulators/mame, where ccache does not work somehow), vfs.zfs.znode.pruning_requested recorded ~50K / 5 mins =~ 167 / sec.

[1] Observed under my first fix, where an interval of 1 second was enforced between ARC pruning passes.  The max ARC pruning rate was ~0.8 / sec.
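
The traces above come from periodic samples of the new counters.  For reference, a minimal userland sketch of such sampling with sysctl(3) follows; the sysctl names are the ones added by the counters branch, while the 64-bit value type, the output format and the interval are assumptions of this example.

#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* The sysctl names added by the counters branch. */
static const char *znode_sysctls[] = {
	"vfs.zfs.znode.count",
	"vfs.zfs.znode.inuse",
	"vfs.zfs.znode.prunable",
	"vfs.zfs.znode.pruning_requested",
	"vfs.zfs.znode.pruning_skipped",
};

int
main(void)
{
	/* Print one line of samples every 5 seconds (interval is arbitrary). */
	for (;;) {
		for (size_t i = 0;
		    i < sizeof(znode_sysctls) / sizeof(znode_sysctls[0]); i++) {
			uint64_t val;
			size_t len = sizeof(val);

			if (sysctlbyname(znode_sysctls[i], &val, &len,
			    NULL, 0) == -1)
				err(1, "sysctlbyname(%s)", znode_sysctls[i]);
			printf("%s=%ju ", znode_sysctls[i], (uintmax_t)val);
		}
		printf("\n");
		sleep(5);
	}
}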

- The ZFS ARC stats (zfs-arc/zfs-arc-*.png)

The ZFS ARC stats show how the high pressure upon the ARC pruning arose.

The ZFS ARC stats of the sizes (zfs-arc/zfs-arc-*-size.png) show the following properties:
- Except for the first hour, there were almost no evictable sizes.
- The metadata stayed solidly in place while the data was driven away.
- The trace of the ZFS ARC MRU metadata size (zfs-arc-mru-metadata-size.png) is similar to that of the znode and dnode counts.

Out of these properties, I suspect that the znodes and dnodes in use dominated the ARC.  Although not confirmed by a code walk, it makes sense to keep such metadata in memory because it is likely to be updated often.

Another parameter affecting the ZFS ARC is the balancing of the metadata and data.  The ZFS ARC meta (zfs-arc-meta.png) is the auto-tuned target ratio of the metadata size, held as a 32-bit fixed-point decimal.  Since vfs.zfs.arc_max is 4GB in my setup, this value can be read directly as the metadata size target in bytes.

The ZFS ARC meta is tuned by the ghost-hit sizes (zfs-arc-*-ghost-*-hits.png).  It is designed to favour whichever of the metadata or data has the larger ghost-hit sizes, so that further caching lessens those ghost hits.  As the data was so dominant in the ghost-hit sizes, the ZFS ARC meta was pushed very low; the minimum was ~197M at 05:20Z, and it stayed mostly below 1G, the default (1/4 of vfs.zfs.arc_max), except for the first hour.  The low target of the metadata size then caused the aggressive ARC pruning as implemented in arc_evict(), in conjunction with the high demand for unevictable metadata.
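
To make the fixed-point reading above concrete, here is a tiny sketch of the conversion.  The helper name is mine, and the 2^32 scale follows the description above (a 32-bit fixed-point fraction of arc_max) rather than the OpenZFS source, so treat it as an illustration of the arithmetic only.

#include <stdint.h>
#include <stdio.h>

/*
 * Interpretation used above: the ARC "meta" value is a 32-bit fixed-point
 * fraction of arc_max, so the byte target is arc_max * meta / 2^32.
 */
static uint64_t
arc_meta_target_bytes(uint64_t arc_max, uint64_t meta)
{
	return ((unsigned __int128)arc_max * meta >> 32);
}

int
main(void)
{
	uint64_t arc_max = 4294967296ULL;	/* vfs.zfs.arc_max: 4GB == 2^32 */
	uint64_t meta_min = 197000000ULL;	/* observed minimum, ~197M at 05:20Z */

	/*
	 * Because arc_max is exactly 2^32 bytes, the target equals the raw
	 * meta value, which is why it can be read directly as bytes here.
	 */
	printf("metadata size target: ~%ju bytes\n",
	    (uintmax_t)arc_meta_target_bytes(arc_max, meta_min));
	return (0);
}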