From nobody Fri Sep 01 21:22:23 2023
From: Rick Macklem <rick.macklem@gmail.com>
Date: Fri, 1 Sep 2023 14:22:23 -0700
Subject: Re: Did something change with ZFS and vnode caching?
To: Garrett Wollman
Cc: freebsd-stable@freebsd.org, Mateusz Guzik
List-Id: Production branch of FreeBSD source code
List-Archive: https://lists.freebsd.org/archives/freebsd-stable
In-Reply-To: <25840.58487.468791.344785@hergotha.csail.mit.edu>
References: <25827.33600.611577.665054@hergotha.csail.mit.edu>
 <25831.30103.446606.733311@hergotha.csail.mit.edu>
 <25840.58487.468791.344785@hergotha.csail.mit.edu>

On Thu, Aug 31, 2023 at 12:05 PM Garrett Wollman wrote:
>
> < said:
>
> > Any suggestions on what we should monitor or try to adjust?

I remember you mentioning that you tried increasing kern.maxvnodes,
but I was wondering if you've tried bumping it way up (like 10X what
it currently is)?

You could also try decreasing the maximum number of nfsd threads
(the --maxthreads command line option for nfsd).  That would at least
limit the number of vnodes used by the nfsd.
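Just to make that concrete, here's a rough little program (an untested
sketch; the size probe is only there because I'm not sure whether these
sysctls are 32- or 64-bit on your branch) that prints how close
vfs.numvnodes is sitting to the kern.maxvnodes limit, so you can watch
it while the scans run:

/*
 * Untested sketch: report the current vnode limit vs. usage.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t
read_sysctl(const char *name)
{
	unsigned char buf[8] = { 0 };
	size_t len = sizeof(buf);
	uint32_t v32;
	uint64_t v64;

	if (sysctlbyname(name, buf, &len, NULL, 0) == -1)
		err(1, "sysctlbyname(%s)", name);
	if (len == sizeof(v32)) {
		memcpy(&v32, buf, sizeof(v32));
		return (v32);
	}
	memcpy(&v64, buf, sizeof(v64));
	return (v64);
}

int
main(void)
{
	uint64_t maxvn = read_sysctl("kern.maxvnodes");
	uint64_t numvn = read_sysctl("vfs.numvnodes");

	printf("kern.maxvnodes=%ju vfs.numvnodes=%ju (%.1f%% in use)\n",
	    (uintmax_t)maxvn, (uintmax_t)numvn,
	    maxvn != 0 ? 100.0 * (double)numvn / (double)maxvn : 0.0);
	return (0);
}

Bumping the limit itself should just be a runtime sysctl(8) setting, as
far as I know, so it's easy to back out if it doesn't help.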
rick

> To bring everyone up to speed: earlier this month we upgraded our NFS
> servers from 12.4 to 13.2 and found that our backup system was
> absolutely destroying NFS performance, which had not happened before.
>
> With some pointers from mjg@ and the thread relating to ZFS
> performance on current@ I built a stable/13 kernel
> (b5a5a06fc012d27c6937776bff8469ea465c3873) and installed it on one of
> our NFS servers for testing, then removed the band-aid on our backup
> system and allowed it to go as parallel as it wanted.
>
> Unfortunately, we do not control the scheduling of backup jobs, so
> it's difficult to tell whether the changes made any difference.  Each
> backup job does a parallel breadth-first traversal of a given
> filesystem, using as many as 150 threads per job (the backup client
> auto-scales itself), and we sometimes see as many as eight jobs
> running in parallel on one file server.  (There are 17, soon to be 18,
> file servers.)
>
> When the performance of NFS's backing store goes to hell, the NFS
> server is not able to put back-pressure on the clients hard enough to
> stop them from writing, and eventually the server runs out of 4k jumbo
> mbufs and crashes.  This at least is a known failure mode, going back
> a decade.  Before it gets to this point, the NFS server also
> auto-scales itself, so it's in competition with the backup client over
> who can create the most threads and ultimately allocate the most
> vnodes.
>
> Last night, while I was watching, the first dozen or so backups went
> fine, with no impact to NFS performance, until the backup server
> decided to schedule two, and then three, parallel scans of
> filesystems containing about 35 million files each.  These tend to
> take an hour or four, depending on how much changed data is identified
> during the scan, but most of the time it's just sitting in a
> readdir()/fstatat() loop with a shared work queue for parallelism.
> (That's my interpretation based on its activity; we do not have source
> code.)
>
> Once these scans were underway, I observed the same symptoms as on
> releng/13.2, with lots of lock contention and the vnlru process
> running almost constantly (95% CPU, so most of a core on this
> 20-core/40-thread server).  From our monitoring, the server was
> recycling about 35k vnodes per second during this period.  I wasn't
> monitoring these statistics before, so I don't have historical
> comparisons.  My working assumption, such as it is, is that the switch
> from OpenSolaris ZFS to OpenZFS in 13.x moved some bottlenecks around
> so that the backup client previously got tangled higher up in the ZFS
> code and now can put real pressure on the vnode allocator.
>
> During the hour that the three backup clients were running, I was able
> to run mjg@'s dtrace script and generate a flame graph, which is
> viewable at .  This just shows what the backup clients themselves are
> doing, and not what's going on in the vnlru or nfsd processes.  You
> can ignore all the umtx stacks, since that's just coordination between
> the threads in the backup client.
>
> On the "oncpu" side, the trace captures a lot of time spent spinning
> in lock_delay(), although I don't see where the alleged call site
> acquires any locks, so there must have been some inlining.  On the
> "offcpu" side, it's clear that there's still a lot of time spent
> sleeping on vnode_list_mtx in the vnode allocation pathway, both
> directly from vn_alloc_hard() and also from vnlru_free_impl() after
> the mutex is dropped and then needs to be reacquired.
>
> In ZFS, there's also a substantial number of waits (shown as
> sx_xlock_hard stack frames), in both the easy case (a free vnode was
> readily available) and the hard case where vn_alloc_hard() calls
> vnlru_free_impl() and eventually zfs_inactive() to reclaim a vnode.
> Looking into the implementation, I noted that ZFS uses a 64-entry hash
> lock for this, and I'm wondering if there's an issue with false
> sharing.  Can anyone with ZFS experience speak to that?  If I
> increased ZFS_OBJ_MTX_SZ to 128 or 256, would it be likely to hurt
> something else (other than memory usage)?  Do we even know that the
> low-order 6 bits of ZFS object IDs are actually uniformly distributed?
>
> -GAWollman
>
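In case it's useful for reproducing the load without the proprietary
client, here's a stripped-down, single-threaded sketch of the kind of
breadth-first readdir()/fstatat() scan described above (this is only my
guess at the access pattern, and all the names are made up; the real
client reportedly spreads the work queue across as many as ~150
threads):

/*
 * Untested sketch: breadth-first readdir()/fstatat() walk driven by
 * a work queue, roughly the syscall pattern of the backup scans.
 */
#include <sys/types.h>
#include <sys/queue.h>
#include <sys/stat.h>

#include <dirent.h>
#include <err.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct dirjob {
	char			*path;
	STAILQ_ENTRY(dirjob)	 link;
};
static STAILQ_HEAD(, dirjob) workq = STAILQ_HEAD_INITIALIZER(workq);

static void
enqueue(const char *path)
{
	struct dirjob *j;

	if ((j = malloc(sizeof(*j))) == NULL)
		err(1, "malloc");
	if ((j->path = strdup(path)) == NULL)
		err(1, "strdup");
	STAILQ_INSERT_TAIL(&workq, j, link);
}

static void
scan_dir(const char *path)
{
	char sub[PATH_MAX];
	struct dirent *de;
	struct stat sb;
	DIR *d;
	int dfd;

	if ((dfd = open(path, O_RDONLY | O_DIRECTORY)) == -1) {
		warn("%s", path);
		return;
	}
	if ((d = fdopendir(dfd)) == NULL) {
		warn("fdopendir %s", path);
		close(dfd);
		return;
	}
	while ((de = readdir(d)) != NULL) {
		if (strcmp(de->d_name, ".") == 0 ||
		    strcmp(de->d_name, "..") == 0)
			continue;
		/* One fstatat() per entry: this is what chews through vnodes. */
		if (fstatat(dfd, de->d_name, &sb, AT_SYMLINK_NOFOLLOW) == -1)
			continue;
		if (S_ISDIR(sb.st_mode)) {
			snprintf(sub, sizeof(sub), "%s/%s", path, de->d_name);
			enqueue(sub);		/* breadth-first: defer it */
		}
	}
	closedir(d);				/* also closes dfd */
}

int
main(int argc, char **argv)
{
	struct dirjob *j;

	if (argc != 2)
		errx(1, "usage: scan <directory>");
	enqueue(argv[1]);
	while (!STAILQ_EMPTY(&workq)) {
		j = STAILQ_FIRST(&workq);
		STAILQ_REMOVE_HEAD(&workq, link);
		scan_dir(j->path);
		free(j->path);
		free(j);
	}
	return (0);
}

A handful of these run in parallel against a large dataset should, I'd
expect, put similar pressure on the vnode allocator.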
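And on the last question, about whether the low-order 6 bits of object
numbers are uniformly distributed: assuming the bucket index really is
just (object number & (ZFS_OBJ_MTX_SZ - 1)), which is what the
"low-order 6 bits" reading implies, a throwaway histogram like the one
below, fed a list of object numbers from one of the busy datasets (for
example scraped from zdb output), would show how lopsided the 64
buckets actually get and what doubling or quadrupling the table would
buy:

/*
 * Untested sketch: read object numbers, one per line on stdin, and
 * count how they would land in a 64-entry hash-lock table, assuming
 * the index is simply (object number & (table size - 1)).
 */
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

#define	TABLE_SZ	64	/* current ZFS_OBJ_MTX_SZ; try 128 or 256 too */

int
main(void)
{
	uint64_t counts[TABLE_SZ] = { 0 };
	uint64_t obj, total = 0, max = 0;
	char line[64];

	while (fgets(line, sizeof(line), stdin) != NULL) {
		obj = strtoull(line, NULL, 0);
		counts[obj & (TABLE_SZ - 1)]++;
		total++;
	}
	for (int i = 0; i < TABLE_SZ; i++)
		if (counts[i] > max)
			max = counts[i];
	if (total == 0)
		return (0);
	printf("%" PRIu64 " objects, ideal %.1f per bucket, worst bucket %" PRIu64 "\n",
	    total, (double)total / TABLE_SZ, max);
	return (0);
}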