From nobody Fri Sep 01 21:22:23 2023
From: Rick Macklem <rick.macklem@gmail.com>
Date: Fri, 1 Sep 2023 14:22:23 -0700
Subject: Re: Did something change with ZFS and vnode caching?
To: Garrett Wollman
Cc: freebsd-stable@freebsd.org, Mateusz Guzik
List-Id: Production branch of FreeBSD source code
List-Archive: https://lists.freebsd.org/archives/freebsd-stable
In-Reply-To: <25840.58487.468791.344785@hergotha.csail.mit.edu>
References: <25827.33600.611577.665054@hergotha.csail.mit.edu>
 <25831.30103.446606.733311@hergotha.csail.mit.edu>
 <25840.58487.468791.344785@hergotha.csail.mit.edu>

On Thu, Aug 31, 2023 at 12:05 PM Garrett Wollman wrote:
>
> < said:
>
> > Any suggestions on what we should monitor or try to adjust?

I remember you mentioning that you tried increasing kern.maxvnodes,
but I was wondering if you've tried bumping it way up (like 10X what
it currently is)?

You could also try decreasing the maximum number of nfsd threads
(the --maxthreads command line option for nfsd).  That would at least
limit the number of vnodes used by the nfsd.
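Just to make that concrete, here's a rough little program (an untested
sketch; the size probe is only there because I'm not sure whether these
sysctls are 32- or 64-bit on your branch) that prints how close
vfs.numvnodes is sitting to the kern.maxvnodes limit, so you can watch
it while the scans run:

/*
 * Untested sketch: report the current vnode limit vs. usage.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t
read_sysctl(const char *name)
{
	unsigned char buf[8] = { 0 };
	size_t len = sizeof(buf);
	uint32_t v32;
	uint64_t v64;

	if (sysctlbyname(name, buf, &len, NULL, 0) == -1)
		err(1, "sysctlbyname(%s)", name);
	if (len == sizeof(v32)) {
		memcpy(&v32, buf, sizeof(v32));
		return (v32);
	}
	memcpy(&v64, buf, sizeof(v64));
	return (v64);
}

int
main(void)
{
	uint64_t maxvn = read_sysctl("kern.maxvnodes");
	uint64_t numvn = read_sysctl("vfs.numvnodes");

	printf("kern.maxvnodes=%ju vfs.numvnodes=%ju (%.1f%% in use)\n",
	    (uintmax_t)maxvn, (uintmax_t)numvn,
	    maxvn != 0 ? 100.0 * (double)numvn / (double)maxvn : 0.0);
	return (0);
}

Bumping the limit itself should just be a runtime sysctl(8) setting, as
far as I know, so it's easy to back out if it doesn't help.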
rick

> To bring everyone up to speed: earlier this month we upgraded our NFS
> servers from 12.4 to 13.2 and found that our backup system was
> absolutely destroying NFS performance, which had not happened before.
>
> With some pointers from mjg@ and the thread relating to ZFS
> performance on current@ I built a stable/13 kernel
> (b5a5a06fc012d27c6937776bff8469ea465c3873) and installed it on one of
> our NFS servers for testing, then removed the band-aid on our backup
> system and allowed it to go as parallel as it wanted.
>
> Unfortunately, we do not control the scheduling of backup jobs, so
> it's difficult to tell whether the changes made any difference.  Each
> backup job does a parallel breadth-first traversal of a given
> filesystem, using as many as 150 threads per job (the backup client
> auto-scales itself), and we sometimes see as many as eight jobs
> running in parallel on one file server.  (There are 17, soon to be 18,
> file servers.)
>
> When the performance of NFS's backing store goes to hell, the NFS
> server is not able to put back-pressure on the clients hard enough to
> stop them from writing, and eventually the server runs out of 4k jumbo
> mbufs and crashes.  This at least is a known failure mode, going back
> a decade.  Before it gets to this point, the NFS server also
> auto-scales itself, so it's in competition with the backup client over
> who can create the most threads and ultimately allocate the most
> vnodes.
>
> Last night, while I was watching, the first dozen or so backups went
> fine, with no impact to NFS performance, until the backup server
> decided to schedule two, and then three, parallel scans of
> filesystems containing about 35 million files each.  These tend to
> take an hour or four, depending on how much changed data is identified
> during the scan, but most of the time it's just sitting in a
> readdir()/fstatat() loop with a shared work queue for parallelism.
> (That's my interpretation based on its activity; we do not have source
> code.)
>
> Once these scans were underway, I observed the same symptoms as on
> releng/13.2, with lots of lock contention and the vnlru process
> running almost constantly (95% CPU, so most of a core on this
> 20-core/40-thread server).  From our monitoring, the server was
> recycling about 35k vnodes per second during this period.  I wasn't
> monitoring these statistics before, so I don't have historical
> comparisons.  My working assumption, such as it is, is that the switch
> from OpenSolaris ZFS to OpenZFS in 13.x moved some bottlenecks around
> so that the backup client previously got tangled higher up in the ZFS
> code and now can put real pressure on the vnode allocator.
>
> During the hour that the three backup clients were running, I was able
> to run mjg@'s dtrace script and generate a flame graph, which is
> viewable at .  This just shows what the backup clients themselves are
> doing, and not what's going on in the vnlru or nfsd processes.  You
> can ignore all the umtx stacks, since that's just coordination between
> the threads in the backup client.
>
> On the "oncpu" side, the trace captures a lot of time spent spinning
> in lock_delay(), although I don't see where the alleged call site
> acquires any locks, so there must have been some inlining.  On the
> "offcpu" side, it's clear that there's still a lot of time spent
> sleeping on vnode_list_mtx in the vnode allocation pathway, both
> directly from vn_alloc_hard() and also from vnlru_free_impl() after
> the mutex is dropped and then needs to be reacquired.
>
> In ZFS, there's also a substantial number of waits (shown as
> sx_xlock_hard stack frames), in both the easy case (a free vnode was
> readily available) and the hard case where vn_alloc_hard() calls
> vnlru_free_impl() and eventually zfs_inactive() to reclaim a vnode.
> Looking into the implementation, I noted that ZFS uses a 64-entry hash
> lock for this, and I'm wondering if there's an issue with false
> sharing.  Can anyone with ZFS experience speak to that?  If I
> increased ZFS_OBJ_MTX_SZ to 128 or 256, would it be likely to hurt
> something else (other than memory usage)?  Do we even know that the
> low-order 6 bits of ZFS object IDs are actually uniformly distributed?
>
> -GAWollman
>
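In case it's useful for reproducing the load without the proprietary
client, here's a stripped-down, single-threaded sketch of the kind of
breadth-first readdir()/fstatat() scan described above (this is only my
guess at the access pattern, and all the names are made up; the real
client reportedly spreads the work queue across as many as ~150
threads):

/*
 * Untested sketch: breadth-first readdir()/fstatat() walk driven by
 * a work queue, roughly the syscall pattern of the backup scans.
 */
#include <sys/types.h>
#include <sys/queue.h>
#include <sys/stat.h>

#include <dirent.h>
#include <err.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct dirjob {
	char			*path;
	STAILQ_ENTRY(dirjob)	 link;
};
static STAILQ_HEAD(, dirjob) workq = STAILQ_HEAD_INITIALIZER(workq);

static void
enqueue(const char *path)
{
	struct dirjob *j;

	if ((j = malloc(sizeof(*j))) == NULL)
		err(1, "malloc");
	if ((j->path = strdup(path)) == NULL)
		err(1, "strdup");
	STAILQ_INSERT_TAIL(&workq, j, link);
}

static void
scan_dir(const char *path)
{
	char sub[PATH_MAX];
	struct dirent *de;
	struct stat sb;
	DIR *d;
	int dfd;

	if ((dfd = open(path, O_RDONLY | O_DIRECTORY)) == -1) {
		warn("%s", path);
		return;
	}
	if ((d = fdopendir(dfd)) == NULL) {
		warn("fdopendir %s", path);
		close(dfd);
		return;
	}
	while ((de = readdir(d)) != NULL) {
		if (strcmp(de->d_name, ".") == 0 ||
		    strcmp(de->d_name, "..") == 0)
			continue;
		/* One fstatat() per entry: this is what chews through vnodes. */
		if (fstatat(dfd, de->d_name, &sb, AT_SYMLINK_NOFOLLOW) == -1)
			continue;
		if (S_ISDIR(sb.st_mode)) {
			snprintf(sub, sizeof(sub), "%s/%s", path, de->d_name);
			enqueue(sub);		/* breadth-first: defer it */
		}
	}
	closedir(d);				/* also closes dfd */
}

int
main(int argc, char **argv)
{
	struct dirjob *j;

	if (argc != 2)
		errx(1, "usage: scan <directory>");
	enqueue(argv[1]);
	while (!STAILQ_EMPTY(&workq)) {
		j = STAILQ_FIRST(&workq);
		STAILQ_REMOVE_HEAD(&workq, link);
		scan_dir(j->path);
		free(j->path);
		free(j);
	}
	return (0);
}

A handful of these run in parallel against a large dataset should, I'd
expect, put similar pressure on the vnode allocator.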
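And on the last question, about whether the low-order 6 bits of object
numbers are uniformly distributed: assuming the bucket index really is
just (object number & (ZFS_OBJ_MTX_SZ - 1)), which is what the
"low-order 6 bits" reading implies, a throwaway histogram like the one
below, fed a list of object numbers from one of the busy datasets (for
example scraped from zdb output), would show how lopsided the 64
buckets actually get and what doubling or quadrupling the table would
buy:

/*
 * Untested sketch: read object numbers, one per line on stdin, and
 * count how they would land in a 64-entry hash-lock table, assuming
 * the index is simply (object number & (table size - 1)).
 */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

#define	TABLE_SZ	64	/* current ZFS_OBJ_MTX_SZ; try 128 or 256 too */

int
main(void)
{
	uint64_t counts[TABLE_SZ] = { 0 };
	uint64_t obj, total = 0, max = 0;
	char line[64];

	while (fgets(line, sizeof(line), stdin) != NULL) {
		obj = strtoull(line, NULL, 0);
		counts[obj & (TABLE_SZ - 1)]++;
		total++;
	}
	for (int i = 0; i < TABLE_SZ; i++)
		if (counts[i] > max)
			max = counts[i];
	if (total == 0)
		return (0);
	printf("%" PRIu64 " objects, ideal %.1f per bucket, worst bucket %" PRIu64 "\n",
	    total, (double)total / TABLE_SZ, max);
	return (0);
}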