From nobody Tue Oct 03 13:41:50 2023
Date: Tue, 3 Oct 2023 13:41:50 GMT
Message-Id: <202310031341.393DfoNJ015997@gitrepo.freebsd.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
From: Mateusz Guzik
Subject: git: 4862e8ac0223 - main - vfs cache: describe various optimization ideas
List-Id: Commit messages for all branches of the src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all
Sender: owner-dev-commits-src-all@freebsd.org
MIME-Version: 1.0
Content-Type:
 text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: mjg
X-Git-Repository: src
X-Git-Refname: refs/heads/main
X-Git-Reftype: branch
X-Git-Commit: 4862e8ac0223d7b19c8b3e070af1e2b38b18f333
Auto-Submitted: auto-generated

The branch main has been updated by mjg:

URL: https://cgit.FreeBSD.org/src/commit/?id=4862e8ac0223d7b19c8b3e070af1e2b38b18f333

commit 4862e8ac0223d7b19c8b3e070af1e2b38b18f333
Author:     Mateusz Guzik
AuthorDate: 2023-10-03 13:36:50 +0000
Commit:     Mateusz Guzik
CommitDate: 2023-10-03 13:36:50 +0000

    vfs cache: describe various optimization ideas

    While here report a sample result from running on Sapphire Rapids:

    An access(2) loop slapped into will-it-scale, like so:
            while (1) {
                    int error = access(tmpfile, R_OK);
                    assert(error == 0);

                    (*iterations)++;
            }

    .. operating on /usr/obj/usr/src/amd64.amd64/sys/GENERIC/vnode_if.c

    In operations per second:
    lockless:       3462164
    locked:         1362376

    While the over 3.4 mln may seem like a big number, a critical look shows
    it should be significantly higher.

    A poor man's profiler, counting how many times given routine was sampled:
    dtrace -w -n 'profile:::profile-4999 /execname == "a.out"/
    {
        @[sym(arg0)] = count();
    }
    tick-5s
    {
        system("clear");
        trunc(@, 40);
        printa("%40a %@16d\n", @);
        clear(@);
    }'

    [snip]
    kernel`kern_accessat                  231
    kernel`cpu_fetch_syscall_args         324
    kernel`cache_fplookup_cross_mount     340
    kernel`namei                          346
    kernel`amd64_syscall                  352
    kernel`tmpfs_fplookup_vexec           388
    kernel`vput                           467
    kernel`vget_finish                    499
    kernel`lockmgr_unlock                 529
    kernel`lockmgr_slock                  558
    kernel`vget_prep_smr                  571
    kernel`vput_final                     578
    kernel`vdropl                        1070
    kernel`memcmp                        1174
    kernel`0xffffffff80                  2080
    0x0                                  2231
    kernel`copyinstr_smap                2492
    kernel`cache_fplookup                9246
---
 sys/kern/vfs_cache.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 2 deletions(-)

diff --git a/sys/kern/vfs_cache.c b/sys/kern/vfs_cache.c
index 7e059d374c31..6ae4239cc11d 100644
--- a/sys/kern/vfs_cache.c
+++ b/sys/kern/vfs_cache.c
@@ -197,10 +197,85 @@
  * - vnodes are subject to being recycled even if target inode is left in memory,
  *   which loses the name cache entries when it perhaps should not. in case of tmpfs
  *   names get duplicated -- kept by filesystem itself and namecache separately
- * - struct namecache has a fixed size and comes in 2 variants, often wasting space.
- *   now hard to replace with malloc due to dependence on SMR.
+ * - struct namecache has a fixed size and comes in 2 variants, often wasting
+ *   space. now hard to replace with malloc due to dependence on SMR, which
+ *   requires UMA zones to opt in
  * - lack of better integration with the kernel also turns nullfs into a layered
  *   filesystem instead of something which can take advantage of caching
+ *
+ * Appendix A: where is the time lost, expanding on paragraph III
+ *
+ * While some care went into optimizing lookups, there is still plenty of
+ * performance left on the table, most notably from single-threaded standpoint.
+ * Below is a woefully incomplete list of changes which can help. Ideas are
+ * mostly sketched out, no claim is made all kinks or prerequisites are laid
+ * out.
+ *
+ * Note there is performance lost all over VFS.
+ *
+ * === SMR-only lookup
+ *
+ * For commonly used ops like stat(2), when the terminal vnode *is* cached,
+ * lockless lookup could refrain from refing/locking the found vnode and
+ * instead return while within the SMR section. Then a call to, say,
+ * vop_stat_smr could do the work (or fail with EAGAIN), finally the result
+ * would be validated with seqc not changing. This would be faster
+ * single-threaded as it dodges atomics and would provide full scalability for
+ * multicore uses. This would *not* work for open(2) or other calls which need
+ * the vnode to hang around for the long haul, but would work for aforementioned
+ * stat(2) but also access(2), readlink(2), realpathat(2) and probably more.
+ *
+ * === hotpatching for sdt probes
+ *
+ * They result in *tons* of branches all over with rather regrettable codegen
+ * at times. Removing sdt probes altogether gives over 2% boost in lookup rate.
+ * Reworking the code to patch itself at runtime with asm goto would solve it.
+ * asm goto is fully supported by gcc and clang.
+ *
+ * === copyinstr
+ *
+ * On all architectures it operates one byte at a time, while it could be
+ * word-sized instead thanks to the Mycroft trick.
+ *
+ * API itself is rather pessimal for path lookup, accepting arbitrary sizes and
+ * *optionally* filling in the length parameter.
+ *
+ * Instead a new routine (copyinpath?) could be introduced, demanding a buffer
+ * size which is a multiple of the word (and never zero), with the length
+ * always returned. On top of it the routine could be allowed to transform the
+ * buffer in arbitrary ways, most notably writing past the found length (not to
+ * be confused with writing past buffer size) -- this would allow word-sized
+ * movs while checking for '\0' later.
+ *
+ * === detour through namei
+ *
+ * Currently one suffers being called from namei, which then has to check if
+ * things worked out locklessly. Instead the lockless lookup could be the
+ * actual entry point which calls what is currently namei as a fallback.
+ *
+ * === avoidable branches in cache_can_fplookup
+ *
+ * The cache_fast_lookup_enabled flag check could be hotpatchable (in fact if
+ * this is off, none of fplookup code should execute).
+ *
+ * Both audit and capsicum branches can be combined into one, but it requires
+ * paying off a lot of tech debt first.
+ *
+ * ni_startdir could be indicated with a flag in cn_flags, eliminating the
+ * branch.
+ *
+ * === mount stacks
+ *
+ * Crossing a mount requires checking if perhaps something is mounted on top.
+ * Instead, an additional entry could be added to struct mount with a pointer
+ * to the final mount on the stack. This would be recalculated on each
+ * mount/unmount.
+ *
+ * === root vnodes
+ *
+ * It could become part of the API contract to *always* have a rootvnode set in
+ * mnt_rootvnode. Such vnodes are annotated with VV_ROOT and vnlru would have
+ * to be modified to always skip them.
  */
 
 static SYSCTL_NODE(_vfs, OID_AUTO, cache, CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
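
The word-at-a-time idea from the copyinstr section of the patch can be
illustrated in userspace C. This is only a rough sketch, not the kernel
routine the comment proposes: the names has_zero_byte and copypath_words are
hypothetical, a 64-bit word is assumed, and the source buffer is assumed to be
readable for the full buffer size (a real copyin-style routine would also have
to handle faults and user address limits). It shows the Mycroft zero-byte test
and a copy which demands a nonzero, word-multiple buffer size, always returns
the length, and may write past the found length but never past the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mycroft's test: the result is nonzero iff at least one byte of w is zero. */
static inline int
has_zero_byte(uint64_t w)
{
	const uint64_t ones = 0x0101010101010101ULL;
	const uint64_t highs = 0x8080808080808080ULL;

	return (((w - ones) & ~w & highs) != 0);
}

/*
 * Hypothetical userspace stand-in for the proposed routine.
 *
 * Copies a NUL-terminated string from src to dst one 64-bit word at a time.
 * bufsize must be a nonzero multiple of the word size and both buffers must
 * be at least bufsize bytes long (assumption of this sketch).
 *
 * Returns the string length (excluding the terminator), or -1 if no
 * terminator was found within bufsize bytes. Bytes of dst located after the
 * terminator but within the same word are overwritten.
 */
static ptrdiff_t
copypath_words(char *dst, const char *src, size_t bufsize)
{
	uint64_t w;
	size_t off, i;

	assert(bufsize != 0 && bufsize % sizeof(w) == 0);

	for (off = 0; off < bufsize; off += sizeof(w)) {
		/* memcpy keeps the word accesses free of alignment issues. */
		memcpy(&w, src + off, sizeof(w));
		memcpy(dst + off, &w, sizeof(w));
		if (has_zero_byte(w)) {
			/* The terminator is in this word, locate it bytewise. */
			for (i = 0; dst[off + i] != '\0'; i++)
				continue;
			return ((ptrdiff_t)(off + i));
		}
	}
	return (-1);
}
```

A caller would pass a buffer whose size is a multiple of 8, for example
copypath_words(dst, src, 1024), and use the returned length directly instead
of optionally asking for it as with copyinstr.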