From: bugzilla-noreply@freebsd.org
To: fs@FreeBSD.org
Subject: [Bug 275594] High CPU usage by arc_prune; analysis and fix
Date: Mon, 25 Dec 2023 10:21:27 +0000
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #14 from Seigo Tanimura ---

(In reply to Seigo Tanimura from comment #12)

Thanks for your patience. I have spent quite a few days setting up a more
precise and convenient measurement pipeline with Fluent Bit, Elasticsearch and
Kibana. It is now working, and I am ready to share the test results and my
analysis of them.

-----

Sources on GitHub:

- FreeBSD
  - Repo
    - https://github.com/altimeter-130ft/freebsd-freebsd-src
  - Branches
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-fix
      - No changes since the last test.
    - release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
      - Adds the vnode tag counting (disabled by default).
      - Refactors the ZFS ARC stats for the precise investigation.
      - These changes may be squashed or removed in the future.
- Fluent Bit
  - Repo
    - https://github.com/altimeter-130ft/fluent-fluent-bit
  - Branch
    - topic-plugin-in-sysctl
      - The input plugin for sysctl(3).
      - Other fixes included.
      - No documentation yet.
  - Note
    - "-DFLB_CORO_STACK_SIZE=1048576" is recommended on cmake(1); the default
      coroutine stack size is only a few kilobytes on FreeBSD.

-----

Test Summary:

- Date: 21 Dec 2023 17:50Z - 22 Dec 2023 00:50Z
- Build time: 07:00:59 (304 pkgs / hr)
- Failed ports: 3
- Setup
  - vfs.vnode.vnlru.max_free_per_call: 4000000 (== vfs.vnode.vnlru.max_free_per_call)
  - vfs.zfs.arc.prune_interval: 1000 (my fix for the arc_prune interval enabled)
  - vfs.vnode.vnlru.extra_recycle: 1 (extra vnode recycle fix enabled)
  - vfs.zfs.arc.dnode_limit=2684354560 (2.5G, larger than the max actual value
    observed so far)

-----

Findings Summary:

- Vnode Behavior
  - The vnodes circulate between creation and pruning.
  - Almost all of the unrecyclable vnodes have one use count, which also
    accounts for one extra hold count.
- ZFS Behaviour
  - The direct trigger of the ARC pruning is a low ARC meta value.
  - There is a massive amount of ghost data hits.

-----

Finding Detail:

The discussion below refers to the charts in the attached archive
poudriere-bulk-2023-12-22_02h55m25s. The times on the chart horizontal axes
are in UTC.

- Vnode Behavior

  - The vnodes circulate between creation and pruning.

    * Charts: number-of-vnodes.png, zfs-vnode-free-calls.png,
      zfs-freed-vnodes.png.

    Although zfs-freed-vnodes.png shows steady vnode recycling in some time
    sections, the number of vnodes stays almost flat (typical case: 23:20Z -
    00:40Z). This means the vnodes are recycled and created at almost the
    same rate.

  - Almost all of the unrecyclable vnodes have one use count, which also
    accounts for one extra hold count.

    * Charts: zfs-vnode-recycle-phase2-reg-retries.png,
      zfs-vnode-recycle-phase2-dir-retries.png.

    The non-zero traces on the charts are the overlaps of:
    - free_phase2_retry_*: the skipped vnodes,
    - free_phase2_retry_*_inuse: the vnodes with a non-zero v_usecount, and
    - free_phase2_retry_*_unknown1: the vnodes with one unknown extra
      v_holdcnt.

    The non-zero v_usecount is sufficient to explain the "unknown" extra
    v_holdcnt: the current vnode recycling code does not work on vnodes with
    a non-zero v_usecount, and one v_usecount makes one v_holdcnt on FreeBSD,
    as sketched below.
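    A minimal sketch of that relation, using a simplified stand-in structure
    rather than the real struct vnode; the struct vnode_counts and the helper
    can_recycle() are hypothetical and only illustrate the skip condition
    described above, not the actual recycling code.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the two vnode reference counters. */
    struct vnode_counts {
            int v_usecount; /* active references, e.g. open(2) file descriptors */
            int v_holdcnt;  /* holds; one of them backs a non-zero use count */
    };

    /* Hypothetical helper mirroring the skip condition described above. */
    static bool
    can_recycle(const struct vnode_counts *vc)
    {
            if (vc->v_usecount > 0)         /* in use: skipped, shows up as *_inuse */
                    return (false);
            if (vc->v_holdcnt > 0)          /* any remaining hold also blocks recycling */
                    return (false);
            return (true);
    }

    int
    main(void)
    {
            /* An open(2)ed vnode: one use count plus the hold it implies. */
            struct vnode_counts open_vnode = { .v_usecount = 1, .v_holdcnt = 1 };

            printf("recyclable: %s\n", can_recycle(&open_vnode) ? "yes" : "no");
            return (0);
    }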
- ZFS Behaviour

  - The direct trigger of the ARC pruning is a low ARC meta value.

    * Charts: zfs-arc-meta.png, zfs-arc-dnode-size.png.

    arc_evict() triggers the async ARC pruning when either of the following
    conditions holds:

    A) More than 3/4 of the wanted metadata size target in the ARC is not
       evictable, or
    B) The size allocated for the dnodes exceeds vfs.zfs.arc.dnode_limit.

    B) is not the case in this test; the max ZFS ARC dnode size was ~1.6GB,
    which was less than vfs.zfs.arc.dnode_limit (2.5GB).

    The ZFS ARC meta is the ratio of the wanted metadata size with respect to
    the whole ARC size, initialized to 25% and represented as a 32-bit
    fixed-point value scaled by 1G. A higher (lower) value means more (less)
    size for the ARC metadata. The reason for the sharp drop of this value is
    discussed next.

  - There is a massive amount of ghost data hits.

    * Charts: zfs-arc-mru-ghost-data-hits.png, zfs-arc-mfu-ghost-data-hits.png,
      zfs-arc-mru-ghost-metadata-hits.png, zfs-arc-mfu-ghost-metadata-hits.png.

    Please watch out for the vertical axes of the charts:
    - The "hits" are sizes, not counts, in ZFS.
    - The scales of the data hits are about 10 times larger than those of the
      metadata hits.

    The ZFS ARC meta value is automatically tuned by the sizes of the ghost
    hits passed to arc_evict_adj(), called in arc_evict(). The idea is to
    favour either the data or the metadata, whichever experiences the more
    ghost hits.

    The ghost hits on the MRU and MFU data are about 10 times larger than
    their counterparts on the metadata. This explains the sharp drop of the
    ZFS ARC meta starting at about 18:20Z. Also, the large ghost data hits
    happened occasionally, while the ghost metadata hits seldom exceeded the
    data, which is why the ZFS ARC meta stayed low until the end of the build
    test.

-----

Next Work:

- Check if the vnodes with the non-zero v_usecount are actually open(2)ed.
  - Compare the count to kern.openfiles; a sketch of this check follows after
    this list.
  - Estimation from the current results:
    - Max free_phase2_retry_reg_inuse per 10 minutes: ~270K @ 00:20Z.
    - The ZFS vnode free call period at that time: ~1.57 [s].
      (<= 1 / (382.5 / (10 * 60)))
    - The hits on the vnodes in use per ZFS vnode free call: ~707.
      (<= 270K / (10 * 60) * 1.57)
    - If this figure is reasonably close to kern.openfiles, the phase2 retries
      are indeed caused by the open(2)ed vnodes, and such vnodes cannot be
      evicted.
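A minimal sketch of that comparison, assuming the figures quoted above (~270K
in-use retries per 10 minutes and a ~1.57 s free-call period, both hard-coded
here purely for illustration); only kern.openfiles is read from the live
system:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
        int openfiles;
        size_t len = sizeof(openfiles);

        /* Number of open files system-wide, for comparison. */
        if (sysctlbyname("kern.openfiles", &openfiles, &len, NULL, 0) != 0) {
                perror("sysctlbyname(kern.openfiles)");
                return (1);
        }

        /* Figures observed around 00:20Z in this test run (see above). */
        const double retries_per_10min = 270000.0; /* free_phase2_retry_reg_inuse */
        const double free_call_period = 1.57;      /* seconds per ZFS vnode free call */
        const double inuse_hits_per_call =
            retries_per_10min / (10.0 * 60.0) * free_call_period; /* ~707 */

        printf("kern.openfiles                 : %d\n", openfiles);
        printf("est. in-use hits per free call : %.0f\n", inuse_hits_per_call);
        return (0);
}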