cvs commit: src/sys/i386/i386 pmap.c
Robert Watson
rwatson at FreeBSD.org
Tue Nov 9 11:31:29 GMT 2004
On Mon, 8 Nov 2004, John Baldwin wrote:
> > > A discussion with John Baldwin and Scott Long yesterday revealed that the
> > > UP spin mutex is currently pessimized from a critical section to a
> > > critical section plus mutex internals due to a need for mtx_owned() on
> > > spin locks. I'm not convinced that explains the entire performance
> > > irregularity I see for P4 spin mutexes on UP, however. Note that 39 (P4
> > > UP sleep mutex) + 120 (P4 UP critical section) is not 274 (P4 UP spin
> > > mutex) by a fair amount. Figuring out what's going on there would be a
> > > good idea, although it could well be a property of my measurement
> > > environment. I'm currently using this to do measurements:
...
> > >          sleep mutex    crit section      spin mutex
> > >            UP    SMP       UP    SMP       UP    SMP
> > > PIII       21     90       83     81      112    141
> > > P4         39    260      120    119      274    342
> >
> > Nice catch!
> > On a UP releasing a spin mutex involves a xchgl operation while
> > releasing an uncontested sleep mutex uses cmpxchgl.
> > Since the xchgl does an implicit LOCK (and cmpxchgl does NOT) this could
> > explain why the spin mutex needs a lot more cycles.
> > This should be easy to fix since the xchgl is not needed on a UP system.
> > Right now I am sick and don't trust my own code so I won't write a patch
> > for the next few days ... hopefully someone else can get to it first.
>
> I've tried changing the store_rel() to just do a simple store since writes are
> ordered on x86, but benchmarks on SMP showed that it actually hurt. However,
> it would probably be good to at least do that for UP. The current patch to
> do it for all kernels is:
This change makes a large difference and eliminates the unexplained costs.
Here's a revised table as compared to the above:
         sleep mutex    crit section      spin mutex  new spin mutex
           UP    SMP       UP    SMP       UP    SMP       UP    SMP
PIII       21     81       83     81      112    141       95    141
P4         39    260      120    119      274    342      132    231
So it cut about 142 cycles off the P4 UP spin lock (274 to 132), 17 off the
PIII UP spin lock (112 to 95), and 111 cycles off the P4 SMP spin lock (342
to 231). The PIII SMP spin lock looks the same. Keep in mind that all of
these measurements have a standard deviation of between 0 and 3 cycles, most
around 1 cycle. Also keep in mind that these are entirely uncontended
measurements.
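For readers who want to see the shape of the change being measured, here is a
rough sketch of the two release strategies discussed above: an exchange-based
release versus a plain store with release ordering. This is not the patch
itself and not the kernel's mutex code; it uses C11 <stdatomic.h> rather than
the kernel's atomic primitives, and struct sketch_lock, LOCK_UNOWNED and the
sketch_* functions are hypothetical names made up for illustration. On x86 a
compiler will typically emit xchg (implicitly locked) for the first release
and an ordinary mov for the second, but that depends on the compiler and
target.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical lock word: 0 means unowned, anything else identifies the owner. */
#define LOCK_UNOWNED ((uintptr_t)0)

struct sketch_lock {
	_Atomic uintptr_t owner;
};

/* Try to take the lock with a compare-and-swap; returns nonzero on success. */
static int
sketch_try_acquire(struct sketch_lock *l, uintptr_t self)
{
	uintptr_t expected = LOCK_UNOWNED;

	return (atomic_compare_exchange_strong_explicit(&l->owner, &expected,
	    self, memory_order_acquire, memory_order_relaxed));
}

/*
 * Release by exchanging the owner field with the unowned value.
 * Assumption: on x86 this usually becomes xchg, which is always locked.
 */
static void
sketch_release_xchg(struct sketch_lock *l)
{
	(void)atomic_exchange_explicit(&l->owner, LOCK_UNOWNED,
	    memory_order_release);
}

/*
 * Release with a store that has release ordering.
 * Assumption: on x86 this usually becomes a plain mov, since stores are
 * not reordered with earlier stores on that architecture.
 */
static void
sketch_release_store(struct sketch_lock *l)
{
	atomic_store_explicit(&l->owner, LOCK_UNOWNED, memory_order_release);
}
```

Both release functions leave the lock word in the same state; the only
difference is the instruction the compiler is likely to pick, which is where
the cycle counts in the table diverge.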
Assuming that these changes are correct, and pass whatever tests people
have in mind, this would be a very strong merge candidate for performance
reasons. The difference is visible in packet send tests from user space
as an improvement of a percent or two on UP on my P4, although it's a little
hard to tell due to the noise.
(Note: I corrected a number in the original table: the PIII SMP sleep mutex
measured at 81 cycles, not 90 cycles as shown in the original.)
Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org Principal Research Scientist, McAfee Research