Re: removing support for kernel stack swapping

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Tue, 04 Jun 2024 19:33:30 UTC
On Tue, Jun 04, 2024 at 09:59:24AM -0700, John Baldwin wrote:
> On 6/2/24 7:57 PM, Mark Johnston wrote:
> > FreeBSD will, when free pages are scarce, try to swap out the kernel
> > stacks (typically 16KB per thread) of sleeping user threads.  I'm told
> > that this mechanism was first implemented in BSD for the VAX port and
> > that stabilizing it was quite an endeavour.
> > 
> > This feature has wide-ranging implications for code in the kernel.  For
> > instance, if a thread allocates a structure on its stack, links it into
> > some data structure visible to other threads, and goes to sleep, it must
> > use PHOLD to ensure that the stack doesn't get swapped out while
> > sleeping.  A missing PHOLD can thus result in a kernel panic, but this
> > kind of mistake is very easy to make and hard to catch without thorough
> > stress testing.  The kernel stack allocator also requires a fair bit of
> > code to implement this feature, and we've had multiple bugs in that
> > area, especially in relation to NUMA support.  Moreover, this feature
> > will leave threads swapped out after the system has recovered, resulting
> > in high scheduling latency once they're ready to run again.
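
For reference, the pattern being described looks roughly like this (a
minimal sketch: the event queue, the waiter structure, and the function
are all made up for illustration):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/queue.h>

struct waiter {
        LIST_ENTRY(waiter) w_link;
        struct thread     *w_td;
};

/* "struct eventq" is hypothetical; only the pattern matters. */
void
wait_for_event(struct eventq *eq)
{
        struct waiter w;        /* lives on this thread's kernel stack */

        w.w_td = curthread;
        PHOLD(curproc);         /* pin the kstack: &w is about to become
                                 * visible to other threads */
        mtx_lock(&eq->eq_lock);
        LIST_INSERT_HEAD(&eq->eq_waiters, &w, w_link);
        msleep(&w, &eq->eq_lock, PVM, "evwait", 0);
        LIST_REMOVE(&w, w_link);
        mtx_unlock(&eq->eq_lock);
        PRELE(curproc);         /* omitting the PHOLD/PRELE pair is the
                                 * easy-to-make mistake described above */
}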
> > 
> > In a very stressed system, it's possible that we can free up something
> > like 1MB of RAM using this mechanism.  I argue that this mechanism is
> > not worth it on modern systems: it isn't going to make the difference
> > between a graceful recovery from memory pressure and a catatonic state
> > which forces a reboot.  The complexity and the resulting bugs it
> > induces are not worth it.
> > 
> > At the BSDCan devsummit I proposed removing support for kernel stack
> > swapping and got only positive feedback.  Does anyone here have any
> > comments or objections?
> 
> +1
> 
> Things like epoch(9) and rm(9) locks follow the pattern of storing
> on-stack items in linked lists, FWIW.
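
For instance, rm(9) read sections queue an on-stack tracker (a minimal
sketch; cfg_lock and read_config() are invented names):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

static struct rmlock cfg_lock;  /* assume rm_init() ran at attach time */

void
read_config(void)
{
        struct rm_priotracker tracker;  /* on-stack; rm_rlock() links it
                                         * into a per-CPU queue */

        rm_rlock(&cfg_lock, &tracker);
        /* ... read the protected data ... */
        rm_runlock(&cfg_lock, &tracker);
}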
> 
> In terms of the memory savings, I don't think 1MB (or even a few
> MBs) is really worth the complexity.
> 
> I agree that if we want to find ways to free up RAM while under memory
> pressure, there are probably other caches we can prune with less
> complexity.  (And in fact, just keeping the kstacks around might
> lead to some of this "naturally", since we would invoke vm_lowmem
> a bit sooner to drain the caches hooked up to it.)
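
(For reference, a cache hooks into that path roughly as below; the
mycache names are invented for the sketch.)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/eventhandler.h>
#include <sys/kernel.h>

/* Invoked by the VM when free memory runs low. */
static void
mycache_lowmem(void *arg __unused, int flags __unused)
{
        /* ... release whatever this cache can spare ... */
}

static void
mycache_init(void *arg __unused)
{
        EVENTHANDLER_REGISTER(vm_lowmem, mycache_lowmem, NULL,
            EVENTHANDLER_PRI_ANY);
}
SYSINIT(mycache_lowmem_reg, SI_SUB_VM_CONF, SI_ORDER_ANY, mycache_init,
    NULL);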
> 
> In terms of swapping out PCBs, that would have a negative impact on
> debugging (e.g., if the PCB is swapped out, you can't look at the
> kthread in question in a crash dump or over a remote GDB
> connection).  The same applies if we were to swap out other parts of
> the PCB, like the XSAVE area on x86.  For XSAVE in particular we
> should probably look at using the XSAVE compact format if we are
> worried about RAM consumption.

Debuggers would need to swap the PCB in, of course.
The compact XSAVE format would not really help in situations where
the large states still need to be saved, and I do not believe it is
feasible to dynamically resize the XSAVE area allocation according to
the thread's usage.
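
(The two sizes can be compared from userland via CPUID leaf 0xD; this
small program is only an illustration of the point:)

#include <cpuid.h>
#include <stdio.h>

int
main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* Subleaf 0, EBX: size of the standard-format XSAVE area for
         * the features currently enabled in XCR0. */
        if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
                return (1);
        printf("standard XSAVE area: %u bytes\n", ebx);

        /* Subleaf 1, EBX: size of the compacted-format area for the
         * features enabled in XCR0 | IA32_XSS.  Space is reserved for
         * every enabled component, used or not, so with e.g. AVX-512
         * enabled the compacted area stays large. */
        if (!__get_cpuid_count(0xd, 1, &eax, &ebx, &ecx, &edx))
                return (1);
        printf("compacted XSAVE area: %u bytes\n", ebx);
        return (0);
}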

Similarly, it is not easy to shrink the vnode cache under memory
shortage, due to the need to free the owned pages and, most likely
before that, the owned buffers, which otherwise wire the pages.
We also need to flush the namecache before a vnode can be freed,
which is somewhat unfortunate.
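
(A simplified sketch of those ordering constraints; it mirrors the
steps involved in reclaiming a vnode, not the actual reclamation code:)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/buf.h>
#include <sys/vnode.h>

static void
reclaim_one_vnode(struct vnode *vp)
{
        /* First flush and release the buffers, since they keep the
         * backing pages wired ... */
        (void)vinvalbuf(vp, V_SAVE, 0, 0);
        /* ... only then can the pages of the vnode's VM object be
         * freed.  Finally, the namecache entries that still reference
         * the vnode must be purged before it can be recycled. */
        cache_purge(vp);
}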