Re: Building kernels with FPU support?

From: John Baldwin <jhb_at_FreeBSD.org>
Date: Fri, 25 Oct 2024 19:54:32 UTC

On 10/23/24 10:38, gnn wrote:
> Howdy,
> 
> I am wondering if anyone has tried, lately, to see what effect building with FPU support has on overall system performance.  I've been working with a kernel module that needs this (for reasons I'll not go into now) and it occurred to me that the perceived performance overhead that caused us to only do fixed point in the kernel may no longer be significant.  I note that Linux has an option to build their kernel with FPU support.
> 
> And yes, I know that we have the ability to selectively deal with the FPU, from the calls outlined in Section 9 for fpu, but I'm asking the more general question of "does it matter?" and "if so, how much?"

To enable vector instructions "in general" in the kernel means that every trap would
need to save the floating point state.  Basically, struct trapframe would need to save
all the vector/FP register state in addition to GPRs.  You would also need to save/restore
it when switching threads in the kernel.  In essence, the current per-pcb state we have
now would stay, but would hold in-kernel state, and userspace state would end up in the
trapframe from userspace.
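
As a rough illustration of the layout change described above, here is a userspace sketch.
It is not FreeBSD's actual struct trapframe: every sketch_ name is hypothetical, the GPR
frame is a simplified stand-in, and the FP area is assumed to be a 512-byte FXSAVE image.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a legacy 512-byte FXSAVE image, which must be 16-byte aligned. */
#define SKETCH_FXSAVE_BYTES 512

/* Simplified GPR-only frame, standing in for what a trap saves today. */
struct sketch_trapframe_gpr {
	uint64_t tf_gpr[15];	/* general-purpose registers other than %rsp */
	uint64_t tf_misc[4];	/* trap number, fault address, flags, error code */
	uint64_t tf_hw[5];	/* rip, cs, rflags, rsp, ss pushed by hardware */
};

/* Hypothetical frame if every trap also saved the interrupted FP state. */
struct sketch_trapframe_fp {
	struct sketch_trapframe_gpr tf_regs;
	_Alignas(16) uint8_t tf_fpstate[SKETCH_FXSAVE_BYTES];
};

/*
 * Hypothetical pcb: the existing per-thread save area would now hold
 * in-kernel FP state, saved and restored on kernel thread switches.
 */
struct sketch_pcb {
	_Alignas(16) uint8_t pcb_kern_fpstate[SKETCH_FXSAVE_BYTES];
};

/* Extra bytes each trap would push compared with the GPR-only frame. */
size_t
sketch_trap_extra_bytes(void)
{
	return (sizeof(struct sketch_trapframe_fp) -
	    sizeof(struct sketch_trapframe_gpr));
}
```

The point of the sketch is only that the FP area dominates the frame: the GPR portion
is a couple of hundred bytes, while even the smallest FP save area is 512.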

This would probably be quite expensive.  Saving and restoring FPU state is not cheap
and we would now be doing that on every entry to and exit from the kernel (so extra
overhead on system calls, faults, and interrupts).  It would also probably blow out kernel
stack usage quite a bit.  The XSAVE region on modern x86 processors is already close
to 2 KB and is only growing.  That would be a substantially larger trapframe and require
larger kstacks as a result.
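
A back-of-the-envelope version of the stack concern, as a sketch only.  The numbers
are assumptions for illustration: a 192-byte GPR frame, a 2048-byte XSAVE area (taken
from the "close to 2 KB" figure above), 64-byte alignment for the save area, and a
nesting depth picked by the caller.  The sketch_ helpers are hypothetical.

```c
#include <stddef.h>

/* Round x up to a multiple of align (align must be non-zero). */
static size_t
sketch_roundup(size_t x, size_t align)
{
	return ((x + align - 1) / align * align);
}

/*
 * Bytes one trap would push if the frame carried an FP save area.
 * Both parts are padded to 64 bytes, the alignment XSAVE requires.
 */
size_t
sketch_frame_bytes(size_t gpr_bytes, size_t fp_bytes)
{
	return (sketch_roundup(gpr_bytes, 64) + sketch_roundup(fp_bytes, 64));
}

/* Stack consumed by frames alone when traps nest 'depth' levels deep. */
size_t
sketch_nested_bytes(size_t frame_bytes, unsigned int depth)
{
	return (frame_bytes * depth);
}
```

With those assumed numbers, three nested traps cost
sketch_nested_bytes(sketch_frame_bytes(192, 2048), 3) = 6720 bytes of frames, versus
576 bytes for GPR-only frames.  Against a 16 KiB kernel stack (four 4 KiB pages, also
an assumption) that is roughly 40% of the stack before any real work is done.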

To mitigate the stack usage you could perhaps try to only use FP in the kernel "top-half"
and not use it in bottom-half interrupt code.  I worry a bit about clearly demarcating
bottom-half code so that it still compiles without FP, but as long as you disable FP
access for nested faults you'd find any inconsistencies there rather quickly in the form
of panics.

Certainly it would be a fair bit of work to prototype to see what happens.  Another
thing you could try is to save only a subset of register state for traps (e.g.
just FXSAVE on x86 would mean you can use SSE and FP but not AVX, which might be
enough for many use cases in the kernel while not blowing out quite as much stack
space).
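
To put a number on the FXSAVE-versus-XSAVE difference for a given machine, a small
userspace sketch can ask CPUID how large the full XSAVE area is and compare it with the
fixed 512-byte FXSAVE image.  This assumes GCC or Clang (for <cpuid.h>) on x86; the
sketch_ names are hypothetical, and the function returns 0 wherever the size cannot
be determined.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Size of the legacy FXSAVE image (x87 + SSE state). */
#define SKETCH_FXSAVE_BYTES 512u

/*
 * Size in bytes of an XSAVE area covering every state component the CPU
 * supports (CPUID leaf 0xD, sub-leaf 0, ECX), or 0 if it cannot be queried.
 */
uint32_t
sketch_xsave_max_bytes(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 1, ECX bit 26 reports XSAVE support. */
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) ||
	    (ecx & (1u << 26)) == 0)
		return (0);
	if (!__get_cpuid_count(0xd, 0, &eax, &ebx, &ecx, &edx))
		return (0);
	return (ecx);
#else
	return (0);
#endif
}

/* Bytes saved per trap by using FXSAVE only; 0 if the XSAVE size is unknown. */
uint32_t
sketch_fxsave_savings(void)
{
	uint32_t full = sketch_xsave_max_bytes();

	return (full > SKETCH_FXSAVE_BYTES ? full - SKETCH_FXSAVE_BYTES : 0);
}
```

Any XSAVE area is at least 576 bytes (the 512-byte legacy region plus a 64-byte
header), so the FXSAVE-only frame is always the smaller of the two; how much smaller
depends on which extensions the CPU implements.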
  
-- 
John Baldwin