From: Rick Macklem <rick.macklem@gmail.com>
Date: Fri, 8 Mar 2024 06:41:09 -0800
Subject: Re: 13-stable NFS server hang
To: Ronald Klop
Cc: rmacklem@freebsd.org, Garrett Wollman, stable@freebsd.org

On Wed, Mar 6, 2024 at 3:46 AM Ronald Klop wrote:
>
> From: Rick Macklem
> Date: Tuesday, 5 March 2024 15:43
> To: Ronald Klop
> CC: rmacklem@freebsd.org, Garrett Wollman, stable@freebsd.org
> Subject: Re: 13-stable NFS server hang
>
> On Tue, Mar 5, 2024 at 6:34 AM Rick Macklem wrote:
> >
> > On Tue, Mar 5, 2024 at 2:13 AM Ronald Klop wrote:
> > >
> > > From: Rick Macklem
> > > Date: Friday, 1 March 2024 15:23
> > > To: Ronald Klop
> > > CC: Garrett Wollman, stable@freebsd.org, rmacklem@freebsd.org
> > > Subject: Re: 13-stable NFS server hang
> > >
> > > On Fri, Mar 1, 2024 at 12:00 AM Ronald Klop wrote:
> > > >
> > > > Interesting read.
> > > >
> > > > Would it be possible to separate locking for admin actions, like a client mounting an fs, from traffic flowing for file operations?
> > > Well, the NFS server does not really have any concept of a mount.
> > > What I am referring to is the ClientID maintained for NFSv4 mounts, which all the open/lock/session/layout state hangs off of.
> > >
> > > For most cases, this state information can safely be accessed/modified via a mutex, but there are three exceptions:
> > > - creating a new ClientID (which is done by the ExchangeID operation), which typically happens when an NFS client does a mount;
> > > - delegation Recall (which only happens when delegations are enabled; one of the reasons delegations are not enabled by default on the FreeBSD server);
> > > - DestroyClientID, which is typically done by an NFS client during dismount.
> > > For these cases, it is just too difficult to do them without sleeping. As such, there is a sleep lock which the nfsd threads normally acquire shared when doing NFSv4 operations, but for the above cases the lock is acquired exclusive.
> > > - I had to give the exclusive lock priority over shared lock acquisition (it is a custom locking mechanism with assorted weirdnesses), because without that someone reported that new mounts took up to 1/2hr to occur. (The exclusive locker waited for 30min before all the other nfsd threads were not busy.)
> > > Because of this priority, once an nfsd thread requests the exclusive lock, all other nfsd threads executing NFSv4 RPCs block after releasing their shared lock, until the exclusive locker releases the exclusive lock.
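
As a rough illustration of the priority rule described above: a shared/exclusive sleep lock in which a single pending exclusive request stops all new shared acquisitions can be sketched in userland C. This is only a sketch built on pthreads; the names are invented and it is not the kernel's nfsv4_lock() implementation.

    #include <pthread.h>

    /*
     * Illustrative shared/exclusive lock where a waiting exclusive
     * locker has priority over new shared lockers.
     */
    struct shex_lock {
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
        int             shared_cnt;   /* active shared holders */
        int             excl_held;    /* exclusive lock is held */
        int             excl_waiting; /* exclusive requests pending */
    };

    void
    shex_lock_shared(struct shex_lock *lp)
    {
        pthread_mutex_lock(&lp->mtx);
        /*
         * New shared lockers sleep while an exclusive request is
         * pending or held.  This keeps a new mount (ExchangeID) from
         * waiting for a very long time, but it is also why every other
         * NFSv4 RPC stalls until the exclusive locker is done.
         */
        while (lp->excl_held || lp->excl_waiting > 0)
            pthread_cond_wait(&lp->cv, &lp->mtx);
        lp->shared_cnt++;
        pthread_mutex_unlock(&lp->mtx);
    }

    void
    shex_unlock_shared(struct shex_lock *lp)
    {
        pthread_mutex_lock(&lp->mtx);
        if (--lp->shared_cnt == 0)
            pthread_cond_broadcast(&lp->cv);
        pthread_mutex_unlock(&lp->mtx);
    }

    void
    shex_lock_excl(struct shex_lock *lp)
    {
        pthread_mutex_lock(&lp->mtx);
        lp->excl_waiting++;
        /* Wait for the current shared holders to drain. */
        while (lp->excl_held || lp->shared_cnt > 0)
            pthread_cond_wait(&lp->cv, &lp->mtx);
        lp->excl_waiting--;
        lp->excl_held = 1;
        pthread_mutex_unlock(&lp->mtx);
    }

    void
    shex_unlock_excl(struct shex_lock *lp)
    {
        pthread_mutex_lock(&lp->mtx);
        lp->excl_held = 0;
        pthread_cond_broadcast(&lp->cv);
        pthread_mutex_unlock(&lp->mtx);
    }

With this priority the half-hour wait for new mounts goes away, but the trade-off is exactly the stall reported below: one thread asking for the exclusive lock parks every other nfsd thread at its next shared acquisition.
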
> > > In summary, NFSv4 has certain advantages over NFSv3, but it comes with a lot of state complexity. It just is not feasible to manipulate all that state with only mutex locking.
> > >
> > > rick
> > >
> > > > Like ongoing file operations could have a read only view/copy of the mount table. Only new operations will have to wait.
> > > > But the mount never needs to wait for ongoing operations before locking the structure.
> > > >
> > > > Just a thought in the morning
> > > >
> > > > Regards,
> > > > Ronald.
> > > >
> > > > From: Rick Macklem
> > > > Date: 1 March 2024 00:31
> > > > To: Garrett Wollman
> > > > CC: stable@freebsd.org, rmacklem@freebsd.org
> > > > Subject: Re: 13-stable NFS server hang
> > > >
> > > > On Wed, Feb 28, 2024 at 4:04 PM Rick Macklem wrote:
> > > > >
> > > > > On Tue, Feb 27, 2024 at 9:30 PM Garrett Wollman wrote:
> > > > > >
> > > > > > Hi, all,
> > > > > >
> > > > > > We've had some complaints of NFS hanging at unpredictable intervals. Our NFS servers are running a 13-stable from last December, and tonight I sat in front of the monitor watching `nfsstat -dW`. I was able to clearly see that there were periods when NFS activity would drop *instantly* from 30,000 ops/s to flat zero, which would last for about 25 seconds before resuming exactly as it was before.
> > > > > >
> > > > > > I wrote a little awk script to watch for this happening and run `procstat -k` on the nfsd process, and I saw that all but two of the service threads were idle. The three nfsd threads that had non-idle kstacks were:
> > > > > >
> > > > > > PID  TID     COMM  TDNAME         KSTACK
> > > > > > 997  108481  nfsd  nfsd: master   mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrvd_dorpc nfssvc_program svc_run_internal svc_run nfsrvd_nfsd nfssvc_nfsd sys_nfssvc amd64_syscall fast_syscall_common
> > > > > > 997  960918  nfsd  nfsd: service  mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrv_setclient nfsrvd_exchangeid nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
> > > > > > 997  962232  nfsd  nfsd: service  mi_switch _cv_wait txg_wait_synced_impl txg_wait_synced dmu_offset_next zfs_holey zfs_freebsd_ioctl vn_generic_copy_file_range vop_stdcopy_file_range VOP_COPY_FILE_RANGE vn_copy_file_range nfsrvd_copy_file_range nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
> > > > > >
> > > > > > I'm suspicious of two things: first, the copy_file_range RPC; second, the "master" nfsd thread is actually servicing an RPC which requires obtaining a lock. The "master" getting stuck while performing client RPCs is, I believe, the reason NFS service grinds to a halt when a client tries to write into a near-full filesystem, so this problem would be more evidence that the dispatching function should not be mixed with actual operations. I don't know what the clients are doing, but is it possible that nfsrvd_copy_file_range is holding a lock that is needed by one or both of the other two threads?
> > > > > >
> > > > > > Near-term I could change nfsrvd_copy_file_range to just unconditionally return NFSERR_NOTSUP and force the clients to fall back, but I figured I would ask if anyone else has seen this.
> > > > > I have attached a little patch that should limit the server's Copy size to vfs.nfsd.maxcopyrange (default of 10Mbytes).
> > > > > Hopefully this makes sure that the Copy does not take too long.
> > > > >
> > > > > You could try this instead of disabling Copy. It would be nice to know if this is sufficient? (If not, I'll probably add a sysctl to disable Copy.)
> > > > I did a quick test without/with this patch, where I copied a 1Gbyte file.
> > > >
> > > > Without this patch, the Copy RPCs mostly replied in just under 1sec (which is what the flag requests), but took over 4sec for one of the Copy operations. This implies that one Read/Write of 1Mbyte on the server took over 3 seconds.
> > > > I noticed the first Copy did over 600Mbytes, but the rest did about 100Mbytes each, and it was one of these 100Mbyte Copy operations that took over 4sec.
> > > >
> > > > With the patch, there were a lot more Copy RPCs (as expected) of 10Mbytes each, and they took a consistent 0.25-0.3sec to reply. (This is a test of a local mount on an old laptop, so nowhere near a server hardware config.)
> > > >
> > > > So, the patch might be sufficient?
> > > >
> > > > It would be nice to avoid disabling Copy, since it avoids reading the data into the client and then writing it back to the server.
> > > >
> > > > I will probably commit both patches (10Mbyte clip of Copy size and disabling Copy) to main soon, since I cannot say if clipping the size of the Copy will always be sufficient.
> > > >
> > > > Please let us know how trying these patches goes, rick
> > > >
> > > > > rick
> > > > >
> > > > > > -GAWollman
> > > > > >
> > > >
> > > > ________________________________
> > >
> > > ________________________________
> > >
> > > Hi Rick,
> > >
> > > You are much more into the NFS code than I am, so excuse me if what I'm speaking about does not make sense.
> > >
> > > I was reading nfsrvd_compound(), which calls nfsrvd_copy_file_range() via the nfsrv4_ops2 structure.
> > > nfsrvd_compound() holds a lock or refcount on nfsv4rootfs_lock during the whole operation, which is why nfsrv_setclient() is waiting in this specific case of "NFS server hang".
> > >
> > > But I don't see what is being modified in the nfsdstate after the IO operation ends, or why the IO operation itself needs the lock on the nfsdstate. IMHO the in-progress IOs will happen anyway regardless of the nfsdstate, and changes to the nfsdstate during an IO operation would not affect the ongoing IO operation.
> > > Wouldn't it be possible to lock the nfsv4rootfs_lock, do checks on or modify the nfsdstate as needed, unlock, and then do the IO operation? That would remove a lot of the possible lock contention during (u)mount.
> > > Otherwise, if we do modify the nfsdstate after the IO operation, isn't it possible to relock nfsv4rootfs_lock after the IO operation finishes?
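
In rough, hypothetical C, the ordering Ronald is suggesting here would look something like the sketch below. None of these names exist in the FreeBSD sources; the stubs only stand in for "check/modify the nfsdstate under the state lock" and "do the slow file I/O with the lock dropped".

    /* Stand-ins only; the real nfsdstate and VOP interfaces are far richer. */
    static void state_lock_shared(void)     { }             /* nfsv4rootfs_lock stand-in */
    static void state_unlock_shared(void)   { }
    static int  nfsdstate_check(void)       { return (0); } /* validate ClientID/session */
    static int  slow_copy_io(void)          { return (0); } /* VOP_COPY_FILE_RANGE()-like work */
    static int  nfsdstate_post_update(void) { return (0); } /* state changes for the reply */

    static int
    serve_copy_like_op(void)
    {
        int error;

        state_lock_shared();
        error = nfsdstate_check();
        state_unlock_shared();
        if (error != 0)
            return (error);

        /*
         * The long-running I/O runs with no state lock held, so an
         * ExchangeID (new mount) would not have to wait behind it.
         */
        error = slow_copy_io();
        if (error != 0)
            return (error);

        /* Relock only if the reply needs nfsdstate updates. */
        state_lock_shared();
        error = nfsdstate_post_update();
        state_unlock_shared();
        return (error);
    }

Rick's answer below explains why the compound structure of NFSv4 RPCs and the session lifetime make this unlock/relock ordering hard to do safely.
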
> > Well, there are a couple of reasons. Every implementation has design tradeoffs:
> > 1 - A NFSv4 RPC is a compound, which can be a pretty arbitrary list of operations.
> > As such, the NFSv4 server does not know if an open/byte range lock is coming after the operation it is currently performing, since the implementation does not pre-parse the entire compound. (I had a discussion w.r.t. pre-parsing with one of the main Linux knfsd server maintainers and he noted that he was not aware of any extant server that did pre-parse the compound. Although it would be useful for the server to have the ordered list of operations before commencing the RPC, we both agreed it was too hard to implement.)
> > --> It could possibly unlock/relock later, but see #2. Also, if relocking took a long time, it would result in the compound RPC taking too long (see below).
> > 2 - For NFSv4.1/4.2, almost all RPCs are handled by a session. One non-arbitrary part of almost all NFSv4.1/4.2 RPCs is that the Sequence operation (the one that handles the session) must come first.
> > Sessions are associated with the ClientID, which means the ClientID and the session must not go away while the compound RPC is in progress.
> > - This is ensured by the shared lock on the ClientID (that nfsv4rootfs_lock).
> > Since 99.99% of operations can be done with the shared lock, I do not think there is a lot of contention.
> >
> > Although there is nothing wired down in the RFCs, there is an understanding in the NFSv4 community that a server should reply to an RPC in a reasonable time, typically assumed to be 1-2sec. If the server does this, then a delay for the rare case of a new ClientID shouldn't be a big problem.
> > (There is also delegation recall, which is one reason why delegations are not enabled by default.)
> >
> > Btw, the RFC does define an asynchronous Copy, where the operation replies as soon as the copy is started and the server notifies the client of completion later. I have not implemented this, because it introduces complexities that I do not want to deal with.
> > For example, what happens when the server crashes/reboots while the copy is in progress? The file is left in a non-deterministic state, depending on what the client does when it does not receive the completion notify.
>
> Oh, I should also note that the "shared lock" is actually called a reference count in the code and is there to ensure that the ClientID/Session does not go away during execution of the compound.
>
> The problem in this case (which I should revisit) was that I could not figure out how to safely add a new ClientID while other nfsd threads were in progress performing other RPCs. Due to retries etc., there might be another RPC in progress using the ClientID.
>
> One thing to note here is that the identity of the ClientID is not known until the Sequence operation has been performed. (And there is cruft for NFSv4.0, since it does not have a Sequence operation.)
> As such, the RPC must be in progress before it is known.
>
> rick
>
> > > I hope this makes any sense, and thanks for all your work on the NFS code.
> > >
> > > Regards,
> > > Ronald.
>
> ________________________________
>
> Hi Rick,
>
> Thanks for the elaborate answer.
>
> Would it make sense to have the current RPC/compound have a lock on its ClientID/session, but not on the whole nfsdstate (nfsv4rootfs_lock)?

Nope.
It is the structure of the linked lists (an open is in three of them) that defines the state relationship for open_owners/opens/lock_owners/locks.
The sessions are the exception. Since the code mostly updates the contents of them, each session structure has its own mutex and a refcnt to avoid use after free. Then there is a mutex for each hash list that is used to find the session.
The code for the ClientID was first written over 20 years ago (NFSv4.0 calls the operation SetClientID, but it does the same thing). There is a confirmation step done by a CreateSession with a correct seq#.

As I've said, I'll look and see if I can figure out how to do it without the exclusive lock.

> So concurrent requests like a new mount creating a new ClientID can go on in parallel, but removing or modifying the locked ClientID will wait for the lock.
>
> Or am I thinking too simple still?
>
> Regards,
> Ronald.
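
The per-session arrangement described above (each session with its own mutex and reference count, found via a hash list that has its own mutex) can be sketched roughly as follows. This is a userland illustration using pthreads and invented names, not the actual FreeBSD nfsdstate structures.

    #include <pthread.h>
    #include <stdint.h>
    #include <sys/queue.h>

    struct xsession {
        LIST_ENTRY(xsession) hash_link;
        uint64_t             sessionid;
        pthread_mutex_t      mtx;    /* protects the session contents */
        int                  refcnt; /* prevents use-after-free */
        int                  dying;  /* destruction in progress */
    };

    struct xsession_bucket {
        pthread_mutex_t       mtx;   /* protects this hash list */
        LIST_HEAD(, xsession) head;
    };

    /* Find a session and hold a reference so it cannot be freed. */
    static struct xsession *
    xsession_find_ref(struct xsession_bucket *bp, uint64_t id)
    {
        struct xsession *sp;

        pthread_mutex_lock(&bp->mtx);
        LIST_FOREACH(sp, &bp->head, hash_link) {
            if (sp->sessionid == id && !sp->dying) {
                sp->refcnt++;
                break;
            }
        }
        pthread_mutex_unlock(&bp->mtx);
        return (sp);    /* NULL if not found */
    }

    static void
    xsession_rele(struct xsession_bucket *bp, struct xsession *sp)
    {
        pthread_mutex_lock(&bp->mtx);
        sp->refcnt--;
        /*
         * A destroyer would unlink the session and free it only once
         * the reference count has dropped to zero.
         */
        pthread_mutex_unlock(&bp->mtx);
    }

A compound would do something like xsession_find_ref() while processing its Sequence operation, update the session contents under sp->mtx, and drop the reference when the RPC completes; only the rarer ExchangeID/DestroyClientID-style operations discussed earlier need the exclusive lock over the whole nfsdstate.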