From: Rick Macklem <rick.macklem@gmail.com>
Date: Fri, 1 Mar 2024 06:23:56 -0800
Subject: Re: 13-stable NFS server hang
To: Ronald Klop
Cc: Garrett Wollman, stable@freebsd.org, rmacklem@freebsd.org

On Fri, Mar 1, 2024 at 12:00 AM Ronald Klop wrote:
>
> Interesting read.
>
> Would it be possible to separate locking for admin actions like a
> client mounting an fs from traffic flowing for file operations?

Well, the NFS server does not really have any concept of a mount.
What I am referring to is the ClientID maintained for NFSv4 mounts,
which all the open/lock/session/layout state hangs off of.

For most cases, this state information can safely be accessed/modified
via a mutex, but there are three exceptions:
- creating a new ClientID (which is done by the ExchangeID operation),
  which typically happens when an NFS client does a mount;
- delegation Recall (which only happens when delegations are enabled;
  this is one of the reasons delegations are not enabled by default on
  the FreeBSD server);
- DestroyClientID, which is typically done by an NFS client during
  dismount.
For these cases, it is just too difficult to do them without sleeping.
As such, there is a sleep lock which the nfsd threads normally acquire
shared when doing NFSv4 operations, but for the above cases the lock is
acquired exclusive.

I had to give the exclusive lock priority over shared lock acquisition
(it is a custom locking mechanism with assorted weirdnesses) because,
without that, someone reported that new mounts took up to half an hour
to occur. (The exclusive locker waited for 30 minutes before all the
other nfsd threads were not busy.) Because of this priority, once an
nfsd thread requests the exclusive lock, all other nfsd threads
executing NFSv4 RPCs block after releasing their shared lock, until
the exclusive locker releases the exclusive lock.

In summary, NFSv4 has certain advantages over NFSv3, but it comes with
a lot of state complexity. It just is not feasible to manipulate all
that state with only mutex locking.
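To make the shape of that custom lock concrete, here is a minimal
userspace sketch (pthreads rather than the kernel sleep primitives,
and every name in it is hypothetical; this is not the actual
nfsv4_lock code) of a shared/exclusive sleep lock that gives an
exclusive waiter priority over new shared requests:

#include <pthread.h>

struct clientid_lock {
	pthread_mutex_t	mtx;
	pthread_cond_t	cv;
	int		shared_cnt;	/* active shared holders */
	int		excl;		/* 1 if exclusively held */
	int		excl_waiters;	/* threads waiting for exclusive */
};

#define	CLIENTID_LOCK_INITIALIZER \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 }

void
lock_shared(struct clientid_lock *lp)
{
	pthread_mutex_lock(&lp->mtx);
	/*
	 * New shared requests block while there is an exclusive holder
	 * *or* an exclusive waiter. The excl_waiters test is the
	 * priority rule: it keeps a mount (ExchangeID) from waiting
	 * behind an endless stream of ordinary RPCs.
	 */
	while (lp->excl || lp->excl_waiters > 0)
		pthread_cond_wait(&lp->cv, &lp->mtx);
	lp->shared_cnt++;
	pthread_mutex_unlock(&lp->mtx);
}

void
unlock_shared(struct clientid_lock *lp)
{
	pthread_mutex_lock(&lp->mtx);
	/* Wake the exclusive waiter once the last shared holder leaves. */
	if (--lp->shared_cnt == 0)
		pthread_cond_broadcast(&lp->cv);
	pthread_mutex_unlock(&lp->mtx);
}

void
lock_excl(struct clientid_lock *lp)
{
	pthread_mutex_lock(&lp->mtx);
	lp->excl_waiters++;
	/* Wait for current holders to drain; new shared ones now block. */
	while (lp->excl || lp->shared_cnt > 0)
		pthread_cond_wait(&lp->cv, &lp->mtx);
	lp->excl_waiters--;
	lp->excl = 1;
	pthread_mutex_unlock(&lp->mtx);
}

void
unlock_excl(struct clientid_lock *lp)
{
	pthread_mutex_lock(&lp->mtx);
	lp->excl = 0;
	pthread_cond_broadcast(&lp->cv);
	pthread_mutex_unlock(&lp->mtx);
}

The one line that matters for the behaviour described above is the
excl_waiters test in lock_shared(): it is what makes all new shared
acquisitions queue up behind a pending exclusive request, which is
also why every nfsd thread blocks once one of them wants the lock
exclusively.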
rick

> Like ongoing file operations could have a read-only view/copy of the
> mount table. Only new operations will have to wait.
> But the mount never needs to wait for ongoing operations before
> locking the structure.
>
> Just a thought in the morning.
>
> Regards,
> Ronald.
>
> From: Rick Macklem
> Date: 1 March 2024 00:31
> To: Garrett Wollman
> CC: stable@freebsd.org, rmacklem@freebsd.org
> Subject: Re: 13-stable NFS server hang
>
> On Wed, Feb 28, 2024 at 4:04 PM Rick Macklem wrote:
> >
> > On Tue, Feb 27, 2024 at 9:30 PM Garrett Wollman wrote:
> > >
> > > Hi, all,
> > >
> > > We've had some complaints of NFS hanging at unpredictable intervals.
> > > Our NFS servers are running a 13-stable from last December, and
> > > tonight I sat in front of the monitor watching `nfsstat -dW`. I was
> > > able to clearly see that there were periods when NFS activity would
> > > drop *instantly* from 30,000 ops/s to flat zero, which would last
> > > for about 25 seconds before resuming exactly as it was before.
> > >
> > > I wrote a little awk script to watch for this happening and run
> > > `procstat -k` on the nfsd process, and I saw that all but two of the
> > > service threads were idle. The three nfsd threads that had non-idle
> > > kstacks were:
> > >
> > >  PID    TID COMM  TDNAME         KSTACK
> > >  997 108481 nfsd  nfsd: master   mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrvd_dorpc nfssvc_program svc_run_internal svc_run nfsrvd_nfsd nfssvc_nfsd sys_nfssvc amd64_syscall fast_syscall_common
> > >  997 960918 nfsd  nfsd: service  mi_switch sleepq_timedwait _sleep nfsv4_lock nfsrv_setclient nfsrvd_exchangeid nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
> > >  997 962232 nfsd  nfsd: service  mi_switch _cv_wait txg_wait_synced_impl txg_wait_synced dmu_offset_next zfs_holey zfs_freebsd_ioctl vn_generic_copy_file_range vop_stdcopy_file_range VOP_COPY_FILE_RANGE vn_copy_file_range nfsrvd_copy_file_range nfsrvd_dorpc nfssvc_program svc_run_internal svc_thread_start fork_exit fork_trampoline
> > >
> > > I'm suspicious of two things: first, the copy_file_range RPC; second,
> > > the "master" nfsd thread is actually servicing an RPC which requires
> > > obtaining a lock. The "master" getting stuck while performing client
> > > RPCs is, I believe, the reason NFS service grinds to a halt when a
> > > client tries to write into a near-full filesystem, so this problem
> > > would be more evidence that the dispatching function should not be
> > > mixed with actual operations. I don't know what the clients are
> > > doing, but is it possible that nfsrvd_copy_file_range is holding a
> > > lock that is needed by one or both of the other two threads?
> > >
> > > Near-term I could change nfsrvd_copy_file_range to just
> > > unconditionally return NFSERR_NOTSUP and force the clients to fall
> > > back, but I figured I would ask if anyone else has seen this.
> >
> > I have attached a little patch that should limit the server's Copy
> > size to vfs.nfsd.maxcopyrange (default of 10Mbytes).
> > Hopefully this makes sure that the Copy does not take too long.
> >
> > You could try this instead of disabling Copy. It would be nice to
> > know if this is sufficient. (If not, I'll probably add a sysctl to
> > disable Copy.)
>
> I did a quick test without/with this patch, where I copied a 1Gbyte
> file.
>
> Without this patch, the Copy RPCs mostly replied in just under 1sec
> (which is what the flag requests), but took over 4sec for one of the
> Copy operations. This implies that one Read/Write of 1Mbyte on the
> server took over 3 seconds.
> I noticed the first Copy did over 600Mbytes, but the rest did about
> 100Mbytes each, and it was one of these 100Mbyte Copy operations that
> took over 4sec.
>
> With the patch, there were a lot more Copy RPCs (as expected) of
> 10Mbytes each, and they took a consistent 0.25-0.3sec to reply. (This
> is a test of a local mount on an old laptop, so nowhere near a server
> hardware config.)
>
> So, the patch might be sufficient?
>
> It would be nice to avoid disabling Copy, since Copy avoids reading
> the data into the client and then writing it back to the server.
>
> I will probably commit both patches (10Mbyte clip of Copy size and
> disabling Copy) to main soon, since I cannot say if clipping the size
> of the Copy will always be sufficient.
>
> Please let us know how trying these patches goes, rick
>
> >
> > rick
> > >
> > > -GAWollman
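
P.S. For anyone reading along without the patch in hand, the "clip"
described above amounts to something like the following. This is an
illustrative sketch only; apart from the sysctl name
vfs.nfsd.maxcopyrange, every name in it is made up, and it is not the
actual patch.

#include <stdint.h>

/* Mirrors the sysctl vfs.nfsd.maxcopyrange; default 10Mbytes. */
static uint64_t nfsd_maxcopyrange = 10 * 1024 * 1024;

/*
 * Bound the length handed to the underlying VOP_COPY_FILE_RANGE()
 * so that a single Copy RPC cannot keep an nfsd thread busy for
 * many seconds. An NFSv4.2 Copy may legally write fewer bytes than
 * requested; the client then issues another Copy for the remainder,
 * which is why the patched server shows more, smaller Copy RPCs.
 */
static uint64_t
clip_copy_len(uint64_t requested)
{
	return (requested > nfsd_maxcopyrange ? nfsd_maxcopyrange : requested);
}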