From nobody Tue Jun 04 15:23:22 2024
From: Cy Schubert
Reply-to: Cy Schubert
To: "Rodney W. Grimes"
cc: Mark Johnston, Konstantin Belousov, freebsd-arch@FreeBSD.org
Subject: Re: removing support for kernel stack swapping
In-reply-to: <202406041418.454EI1la011801@gndrsh.dnsmgr.net>
References: <202406041418.454EI1la011801@gndrsh.dnsmgr.net>
List-Id: Discussion related to FreeBSD architecture
List-Archive: https://lists.freebsd.org/archives/freebsd-arch
Date: Tue, 04 Jun 2024 08:23:22 -0700
Message-Id: <20240604152322.EFC0F5D@slippy.cwsent.com>

In message <202406041418.454EI1la011801@gndrsh.dnsmgr.net>, "Rodney W.
Grimes" writes:
> > On Tue, Jun 04, 2024 at 12:11:25AM +0300, Konstantin Belousov wrote:
> > > On Sun, Jun 02, 2024 at 07:57:04PM -0400, Mark Johnston wrote:
> > > > FreeBSD will, when free pages are scarce, try to swap out the kernel
> > > > stacks (typically 16KB per thread) of sleeping user threads.  I'm told
> > > > that this mechanism was first implemented in BSD for the VAX port and
> > > > that stabilizing it was quite an endeavour.
> > > >
> > > > This feature has wide-ranging implications for code in the kernel.  For
> > > > instance, if a thread allocates a structure on its stack, links it into
> > > > some data structure visible to other threads, and goes to sleep, it
> > > > must use PHOLD to ensure that the stack doesn't get swapped out while
> > > > sleeping.  A missing PHOLD can thus result in a kernel panic, but this
> > > > kind of mistake is very easy to make and hard to catch without thorough
> > > > stress testing.  The kernel stack allocator also requires a fair bit of
> > > > code to implement this feature, and we've had multiple bugs in that
> > > > area, especially in relation to NUMA support.  Moreover, this feature
> > > > will leave threads swapped out after the system has recovered,
> > > > resulting in high scheduling latency once they're ready to run again.
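
To make the hazard concrete, the pattern Mark describes looks roughly
like this.  This is a sketch only, not code from the tree; the event
queue, the waiter record, and all the names here are invented for
illustration.  PHOLD/PRELE, msleep(9), and the queue(3) macros are the
real interfaces.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/queue.h>

/*
 * Hypothetical event queue; waiters are linked through records that
 * live on each waiting thread's kernel stack.
 */
struct waiter {
        LIST_ENTRY(waiter)       w_link;
        struct thread           *w_td;
};

struct eventq {
        struct mtx               eq_lock;
        LIST_HEAD(, waiter)      eq_waiters;
};

static void
wait_for_event(struct eventq *eq)
{
        struct waiter w;        /* lives on this thread's kernel stack */

        w.w_td = curthread;

        /*
         * Other threads will follow eq_waiters to &w, a pointer into
         * our stack, so the stack must stay resident while we sleep.
         * Forgetting this PHOLD leaves a window where a stack swap-out
         * turns &w into a dangling pointer and a later panic.
         */
        PHOLD(curproc);

        mtx_lock(&eq->eq_lock);
        LIST_INSERT_HEAD(&eq->eq_waiters, &w, w_link);
        /*
         * Sleep until a waker calls wakeup(&w); eq_lock is dropped
         * while asleep and reacquired before msleep() returns.
         */
        msleep(&w, &eq->eq_lock, PWAIT, "evwait", 0);
        LIST_REMOVE(&w, w_link);
        mtx_unlock(&eq->eq_lock);

        PRELE(curproc);
}

It is exactly this easy-to-forget PHOLD/PRELE bracket, far away from the
code that actually dereferences the stack pointer, that makes the bug
class so hard to catch in review.
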
> > > >
> > > > In a very stressed system, it's possible that we can free up something
> > > > like 1MB of RAM using this mechanism.  I argue that this mechanism is
> > > > not worth it on modern systems: it isn't going to make the difference
> > > > between a graceful recovery from memory pressure and a catatonic state
> > > > which forces a reboot.  The complexity and the resulting bugs it
> > > > induces are not worth it.
> > > On amd64, 1MB of physical memory for stacks is consumed by just 64
> > > threads, which is not a very stressed system.  I remember that a very
> > > long time ago Peter ran tests with several hundred thousand threads,
> > > which is a more realistic high load, e.g. from typical java code (at
> > > least it was so several years ago).
> >
> > Are those threads completely idle?
> >
> > > For a kernel stack to be swapped, the thread normally must sleep for
> > > at least 10 secs, so the latency until the thread next runs should
> > > not be too important.
> >
> > This isn't true in general.  A daemon which responds to requests should
> > do so with low latency even if it's been idle for a long time.  If
> > syslogd sleeps for 10 seconds and then receives a burst of messages, it
> > should be scheduled as quickly as possible.
> >
> > > Having 1MB of essentially free memory is nice for system survival.
> > > Being able to swap out the pcb as well could be useful, IMO.
> >
> > There are many things we could do to shrink the kernel when under memory
> > pressure.  There is no pressure to shrink the buffer cache, or the vnode
> > or name caches, for instance.  If we wanted to optimize the system in
> > this direction, there is a lot of lower-hanging fruit to pick.
>
> Yes please, better pressure on some much larger memory consumers
> would be greatly appreciated.

When using NFS (and UFS) alongside ZFS, the buffer cache will grow but
rarely shrink.  This is especially noticeable when the ZFS ARC has been
reduced to make room for the buffer cache.  Even when a burst of NFS (or
UFS) I/O completed minutes or hours ago, the buffer cache is retained
until the NFS share is unmounted or, in the case of UFS, until reboot
(unless the UFS filesystem, say on a USB mass storage device, is
unmounted).  Even absent memory pressure, a mostly idle buffer cache
giving back some of its RAM to a more actively used ZFS ARC would
certainly help.
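
The imbalance is easy to watch from userland.  For instance, a small
program like the following (a sketch; the sysctl names are those found
on a stock FreeBSD system with zfs.ko loaded, and vfs.bufspace is
assumed to be 8 bytes wide as on amd64/LP64) prints the two pool sizes
side by side:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdint.h>
#include <stdio.h>

/*
 * Read a 64-bit sysctl by name; returns 0 if the node is missing,
 * e.g. when zfs.ko is not loaded.
 */
static uint64_t
get64(const char *name)
{
        uint64_t val = 0;
        size_t len = sizeof(val);

        if (sysctlbyname(name, &val, &len, NULL, 0) != 0)
                return (0);
        return (val);
}

int
main(void)
{
        /*
         * vfs.bufspace: bytes held by the buffer cache (a long, 8
         * bytes on LP64); kstat.zfs.misc.arcstats.size: ARC bytes.
         */
        printf("bufspace: %6ju MB\n",
            (uintmax_t)(get64("vfs.bufspace") >> 20));
        printf("ARC size: %6ju MB\n",
            (uintmax_t)(get64("kstat.zfs.misc.arcstats.size") >> 20));
        return (0);
}

Built with a plain cc(1) and run before, during, and long after a burst
of NFS I/O, it makes the behaviour described above visible: bufspace
stays put while the ARC gives way.
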

> >
> > I'm sure there are special cases where stack swapping might help in
> > principle, but in practice it is far more common to see a small number
> > of threads get swapped out, quickly followed by OOM kills.
>
> Exactly my experience too.

Yes.  Agreed.


-- 
Cheers,
Cy Schubert
FreeBSD UNIX:    Web: https://FreeBSD.org
NTP:             Web: https://nwtime.org

e^(i*pi)+1=0