cvs commit: src/sys/amd64/amd64 cpu_switch.S
Attilio Rao
attilio at freebsd.org
Thu Jun 19 15:42:04 UTC 2008
2008/3/24, Peter Wemm <peter at freebsd.org>:
> peter 2008-03-23 23:09:06 UTC
>
> FreeBSD src repository
>
> Modified files:
> sys/amd64/amd64 cpu_switch.S
> Log:
> First pass at (possibly futile) microoptimizing of cpu_switch. Results
> are mixed. Some pure context switch microbenchmarks show up to 29%
> improvement. Pipe based context switch microbenchmarks show up to 7%
> improvement. Real world tests are far less impressive as they are
> dominated more by actual work than switch overheads, but depending on
> the machine in question, workload, kernel options, phase of moon, etc, a
> few percent gain might be seen.
>
> Summary of changes:
> - don't reload MSR_[FG]SBASE registers when context switching between
> non-threaded userland apps. These typically cost 120 clock cycles each
> on an AMD cpu (less on Barcelona/Phenom). Intel cores are probably no
> faster on this.
> - The above change only helps unthreaded userland apps that tend to use
> the same value for gsbase. Threaded apps will get no benefit from this.
> - reorder things like accessing the pcb to be in memory order, to give
> prefetching a better chance of working. Operations are now in increasing
> memory address order, rather than reverse or random.
> - Push some lesser used code out of the main code paths. Hopefully
> allowing better code density in cache lines. This is probably futile.
> - (part 2 of previous item) Reorder code so that branches have a more
> realistic static branch prediction hint. Both Intel and AMD cpus
> default to predicting branches to lower memory addresses as being
> taken, and to higher memory addresses as not being taken. This is
> overridden by the limited dynamic branch prediction subsystem. A trip
> through userland might overflow this.
> - Futile attempt at spreading the use of the results of previous operations
> in new operations. Hopefully this will allow the cpus to execute in
> parallel better.
> - stop wasting 16 bytes at the top of kernel stack, below the PCB.
> - Never load the userland fs/gsbase registers for kthreads, but preserve
> curpcb->pcb_[fg]sbase as caches for the cpu. (Thanks Jeff!)
>
> Microbenchmarking this code seems to be really sensitive to things like
> scheduling luck, timing, cache behavior, tlb behavior, kernel options,
> other random code changes, etc.
>
> While it doesn't help heavy userland workloads much, it does help high
> context switch loads a little, and should help those that involve
> switching via kthreads a bit more.
>
> A special thanks to Kris for the testing and reality checks, and Jeff for
> tormenting me into doing this. :)
>
> This is still work-in-progress.
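
As a rough illustration of the first summary item (not reloading
MSR_[FG]SBASE when the value would not change), the idea is roughly the
C sketch below. The names cpu_state, write_fsbase and switch_fsbase are
made up for this example; the real code is assembly in cpu_switch.S, and
the expensive step there would be a wrmsr rather than a counter bump.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-CPU record of the last base value written. */
struct cpu_state {
	uint64_t cached_fsbase;   /* value currently loaded in the register */
	unsigned long msr_writes; /* counts simulated expensive writes */
};

/* Simulated expensive write; a real kernel would issue wrmsr here. */
static void
write_fsbase(struct cpu_state *cpu, uint64_t value)
{
	cpu->cached_fsbase = value;
	cpu->msr_writes++;
}

/* Pay for the write only when the incoming thread needs a new value. */
static void
switch_fsbase(struct cpu_state *cpu, uint64_t new_fsbase)
{
	if (cpu->cached_fsbase != new_fsbase)
		write_fsbase(cpu, new_fsbase);
}
```

Switching between two threads that share the same base value costs a
compare and a branch here; only a change of value triggers the write.
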
It looks like this patch introduces a regression.
In particular, this chunk:
@@ -181,82 +166,138 @@ sw1:
cmpq %rcx, %rdx
pause
je 1b
- lfence
#endif
is not totally right as we want to enforce an acq
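
For illustration only, here is a generic C11 sketch of a spin-wait that
is followed by an acquire fence before the protected data is read. It is
not the cpu_switch.S code: released, saved_context, publish and
wait_and_read are hypothetical names, and the sketch says nothing about
which barrier the assembly should use on amd64.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: a flag set by one CPU, data read by another. */
static atomic_bool released = false;
static int saved_context = 0;

/* Publisher: write the data, then set the flag with release ordering. */
static void
publish(int value)
{
	saved_context = value;
	atomic_store_explicit(&released, true, memory_order_release);
}

/* Consumer: spin with relaxed loads, then fence before reading the data. */
static int
wait_and_read(void)
{
	while (!atomic_load_explicit(&released, memory_order_relaxed))
		; /* a real spin loop would issue a pause hint here */
	atomic_thread_fence(memory_order_acquire);
	return (saved_context);
}
```

The acquire fence after the loop is what orders the later read of
saved_context after the load that observed the flag.
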
--
Peace can only be achieved by understanding - A. Einstein