Date: Tue, 4 Jun 2024 22:33:30 +0300
From: Konstantin Belousov
To: John Baldwin
Cc: Mark Johnston, freebsd-arch@freebsd.org
Subject: Re: removing support for kernel stack swapping
In-Reply-To: <6ddedba5-fc2f-4caa-aab5-bd29ca4fdf0b@FreeBSD.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-arch

On Tue, Jun 04, 2024 at 09:59:24AM -0700, John Baldwin wrote:
> On 6/2/24 7:57 PM, Mark Johnston wrote:
> > FreeBSD will, when free pages are scarce, try to swap out the kernel
> > stacks (typically 16KB per thread) of sleeping user threads. I'm told
> > that this mechanism was first implemented in BSD for the VAX port and
> > that stabilizing it was quite an endeavour.
> >
> > This feature has wide-ranging implications for code in the kernel. For
> > instance, if a thread allocates a structure on its stack, links it into
> > some data structure visible to other threads, and goes to sleep, it must
> > use PHOLD to ensure that the stack doesn't get swapped out while
> > sleeping. A missing PHOLD can thus result in a kernel panic, but this
> > kind of mistake is very easy to make and hard to catch without thorough
> > stress testing. The kernel stack allocator also requires a fair bit of
> > code to implement this feature, and we've had multiple bugs in that
> > area, especially in relation to NUMA support. Moreover, this feature
> > will leave threads swapped out after the system has recovered, resulting
> > in high scheduling latency once they're ready to run again.
> >
> > In a very stressed system, it's possible that we can free up something
> > like 1MB of RAM using this mechanism. I argue that this mechanism is
> > not worth it on modern systems: it isn't going to make the difference
> > between a graceful recovery from memory pressure and a catatonic state
> > which forces a reboot. The complexity and resulting bugs it induces is
> > not worth it.
> >
> > At the BSDCan devsummit I proposed removing support for kernel stack
> > swapping and got only positive feedback. Does anyone here have any
> > comments or objections?
>
> +1
>
> Things like epoch and rm(9) locks follow the pattern of storing on-stack
> items in linked lists FWIW.
>
> In terms of the memory savings, I don't really think 1MB (or even a few
> MB's) is really worth the complexity.
>
> I agree that if we want to find ways to free up RAM while under memory
> pressure, there are probably other caches we can prune with less
> complexity. (And in fact, just keeping the kstacks around might
> lead to some of this "naturally" since we would just invoke vm_lowmem
> a bit sooner to drain caches hooked up to it.)
>
> In terms of swapping out PCB's, that would have a negative impact on
> debugging (e.g. if the PCB is swapped out that means you can't look
> at the kthread in question in a crash dump, or remotely over the
> remote GDB connection). Similar for if we were to swap out other
> parts of the PCB like the XSAVE area on x86. For XSAVE in particular
> we should probably look at using the XSAVE compact format if we are
> worried about RAM consumption.

Debuggers would need to swap the PCB in, of course.

The compact XSAVE format would not really help in situations where the
large states still need to be saved, and I do not believe it is feasible
to dynamically change the XSAVE area allocation according to the
thread's usage.

Similarly, it is not easy to shrink the vnode cache in case of memory
shortage, due to the need to free the owned pages and, most likely
before that, to free the owned buffers, which otherwise wire the pages.
Also, we need to flush the namecache before a vnode can be freed, which
is somewhat unfortunate.
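
For readers unfamiliar with the PHOLD requirement described in the quoted
text, here is a minimal userland sketch of the pattern: an on-stack
structure is linked into a list visible to others, and a hold keeps the
stack resident while the thread "sleeps". This is only a model under
stated assumptions. Every name in it (model_proc, model_hold,
model_release, model_try_swapout, sleep_with_stack_waiter) is
hypothetical and is not a FreeBSD kernel interface; the real kernel uses
PHOLD/PRELE on a struct proc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland model; none of these names are kernel APIs. */
struct model_proc {
	int hold_count;     /* stands in for the hold taken by PHOLD */
	int stack_resident; /* 1 while the modeled kernel stack is in RAM */
};

struct waiter {
	struct waiter *next;
	int value;
};

/* List that "other threads" could walk while we sleep. */
static struct waiter *waiters;

static void model_hold(struct model_proc *p)    { p->hold_count++; }
static void model_release(struct model_proc *p) { p->hold_count--; }

/* Modeled swapper: evicts the stack only if nobody holds the process. */
static int
model_try_swapout(struct model_proc *p)
{
	if (p->hold_count > 0)
		return (0);
	p->stack_resident = 0;
	return (1);
}

/* Returns 1 if the stack was swapped out while the waiter was linked. */
static int
sleep_with_stack_waiter(struct model_proc *p)
{
	struct waiter w = { .next = NULL, .value = 42 };
	int swapped;

	model_hold(p);                  /* pin the stack before publishing */
	w.next = waiters;
	waiters = &w;                   /* on-stack object is now visible */

	swapped = model_try_swapout(p); /* memory pressure during "sleep" */

	assert(waiters == &w);
	waiters = w.next;               /* unlink before the frame dies */
	model_release(p);               /* swapping is allowed again */
	return (swapped);
}
```

With the hold in place, sleep_with_stack_waiter() returns 0 and the
modeled stack stays resident. Dropping the model_hold() call would let
model_try_swapout() succeed while &w is still on the list, which is the
failure mode a missing PHOLD produces in the kernel.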