Re: removing support for kernel stack swapping

From: John Baldwin <jhb_at_FreeBSD.org>
Date: Tue, 04 Jun 2024 22:43:51 UTC
On 6/4/24 3:33 PM, Konstantin Belousov wrote:
> On Tue, Jun 04, 2024 at 09:59:24AM -0700, John Baldwin wrote:
>> On 6/2/24 7:57 PM, Mark Johnston wrote:
>>> FreeBSD will, when free pages are scarce, try to swap out the kernel
>>> stacks (typically 16KB per thread) of sleeping user threads.  I'm told
>>> that this mechanism was first implemented in BSD for the VAX port and
>>> that stabilizing it was quite an endeavour.
>>>
>>> This feature has wide-ranging implications for code in the kernel.  For
>>> instance, if a thread allocates a structure on its stack, links it into
>>> some data structure visible to other threads, and goes to sleep, it must
>>> use PHOLD to ensure that the stack doesn't get swapped out while
>>> sleeping.  A missing PHOLD can thus result in a kernel panic; this
>>> kind of mistake is very easy to make and hard to catch without thorough
>>> stress testing.  The kernel stack allocator also requires a fair bit of
>>> code to implement this feature, and we've had multiple bugs in that
>>> area, especially in relation to NUMA support.  Moreover, this feature
>>> will leave threads swapped out after the system has recovered, resulting
>>> in high scheduling latency once they're ready to run again.
>>>
>>> In a very stressed system, it's possible that we can free up something
>>> like 1MB of RAM using this mechanism.  I argue that this mechanism is
>>> not worth it on modern systems: it isn't going to make the difference
>>> between a graceful recovery from memory pressure and a catatonic state
>>> which forces a reboot.  The complexity and resulting bugs it induces are
>>> not worth it.
>>>
>>> At the BSDCan devsummit I proposed removing support for kernel stack
>>> swapping and got only positive feedback.  Does anyone here have any
>>> comments or objections?
>>
>> +1
>>
>> Things like epoch and rm(9) locks follow the pattern of storing on-stack
>> items in linked lists FWIW.
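
(To make that pattern concrete, here is a rough sketch of the idiom in
question; all the names are made up, the locking is condensed, and this
is not code from the tree:)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>
#include <sys/queue.h>

/* An item that lives on the sleeping thread's kernel stack. */
struct waiter {
        TAILQ_ENTRY(waiter) w_link;
        struct thread *w_td;
};

static TAILQ_HEAD(, waiter) waiters = TAILQ_HEAD_INITIALIZER(waiters);
static struct mtx waiters_mtx;
MTX_SYSINIT(waiters_mtx, &waiters_mtx, "waiters", MTX_DEF);

/* A waker locks waiters_mtx, walks the list, and calls wakeup(&w). */
static void
wait_for_event(void)
{
        struct waiter w;        /* on the kstack, visible to other threads */

        w.w_td = curthread;
        PHOLD(curproc);         /* pin the kstack in memory; this is the
                                   call that is easy to forget */
        mtx_lock(&waiters_mtx);
        TAILQ_INSERT_TAIL(&waiters, &w, w_link);
        msleep(&w, &waiters_mtx, PVM, "wtevt", 0);
        TAILQ_REMOVE(&waiters, &w, w_link);
        mtx_unlock(&waiters_mtx);
        PRELE(curproc);
}
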
>>
>> In terms of the memory savings, I don't think 1MB (or even a few
>> MBs) is really worth the complexity.
>>
>> I agree that if we want to find ways to free up RAM while under memory
>> pressure, there are probably other caches we can prune with less
>> complexity.  (And in fact, just keeping the kstacks around might
>> lead to some of this "naturally" since we would just invoke vm_lowmem
>> a bit sooner to drain caches hooked up to it.)
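
(Hooking a cache up to vm_lowmem is just an eventhandler registration;
a minimal sketch, with a made-up cache and trim function:)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/eventhandler.h>
#include <sys/kernel.h>

/* Made-up cache trim function: frees whatever entries can be dropped. */
static void mycache_trim(void);

static void
mycache_lowmem(void *arg __unused, int flags __unused)
{
        /*
         * The page daemon fires this handler when free pages run short,
         * so the cache shrinks "naturally" under memory pressure.
         */
        mycache_trim();
}

static void
mycache_init(void *arg __unused)
{
        EVENTHANDLER_REGISTER(vm_lowmem, mycache_lowmem, NULL,
            EVENTHANDLER_PRI_ANY);
}
SYSINIT(mycache_lowmem, SI_SUB_DRIVERS, SI_ORDER_ANY, mycache_init, NULL);
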
>>
>> In terms of swapping out PCBs, that would have a negative impact on
>> debugging (e.g. if the PCB is swapped out, you can't examine the
>> kthread in question in a crash dump or over a remote GDB connection).
>> The same goes for swapping out other parts of the PCB such as the
>> XSAVE area on x86.  For XSAVE in particular we should probably look
>> at using the XSAVE compact format if we are worried about RAM
>> consumption.
> 
> Debuggers would need to swap the pcb in, of course.
> Compact XSAVE format would not really help in the situations where
> the large states still need to be saved, and I do not believe it is
> feasible to dynamically change the xsave area allocation according to
> the thread's usage.

kgdb cannot reach back in time to swap the pcb in for a crash dump after
the fact.  This is another argument in favor of removing kstack
swapping: it inhibits post-mortem debugging, since you cannot examine
the stack (local variables, etc.) of threads that were swapped out when
the crash occurred.

Compact XSAVE can matter on some CPUs with a large hole.  My desktop's
AMD CPUs report PKRU but not AVX-512, so they have a rather giant empty
hole for all of the AVX-512 state just to tack the tiny PKRU state onto
the end.  OTOH, this is probably still on the order of a couple of KB
per thread (probably less than the kstacks).
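
You can eyeball the layout from userland; here is a quick sketch of
mine (not from the base system) that walks CPUID leaf 0xD and prints
the non-compacted offset and size of each supported XSAVE component:

#include <stdio.h>
#include <cpuid.h>

int
main(void)
{
        unsigned int eax, ebx, ecx, edx, i;

        /*
         * Subleaves 2..n of CPUID leaf 0xD describe the extended XSAVE
         * components; EAX is the size and EBX the offset of each one
         * in the non-compacted layout.
         */
        for (i = 2; i < 32; i++) {
                if (__get_cpuid_count(0xd, i, &eax, &ebx, &ecx, &edx) == 0)
                        break;          /* leaf 0xD not supported at all */
                if (eax == 0)
                        continue;       /* component not supported here */
                printf("component %2u: size %4u, offset %4u\n", i, eax, ebx);
        }
        return (0);
}

On a CPU like the one above, the gap between the end of the AVX
component and the PKRU offset is the hole that the compacted format
would squeeze out.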

> Similarly, it is not easy to shrink the vnode cache in case of memory
> shortage, due to the need to free the owned pages and, most likely
> before that, to free the owned buffers which otherwise wire the pages.
> Also, we need to flush the namecache before a vnode can be freed, which
> is somewhat unfortunate.

While the vnode cache may not be easy to shrink, Mark did list some
others such as the buffer cache and namecache.  I'm still not convinced,
though, that kstack swapping on its own is enough of a gain to justify
its costs.  That is, even if we don't find another place to save the
1MB of RAM, removing the complexity might be worth it on its own.
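
(For the curious, the ordering Konstantin refers to looks roughly like
this; a condensed sketch, not the actual vnlru code, with locking and
error handling omitted:)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/vnode.h>

static void
reclaim_one_vnode(struct vnode *vp)
{
        vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
        cache_purge(vp);                /* namecache entries must go first */
        vinvalbuf(vp, V_SAVE, 0, 0);    /* flush buffers that wire pages */
        vgone(vp);                      /* reclaim; now pages can be freed */
        VOP_UNLOCK(vp);
}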

-- 
John Baldwin