cvs commit: src/sys/i386/i386 pmap.c

Robert Watson rwatson at FreeBSD.org
Tue Nov 9 11:31:29 GMT 2004


On Mon, 8 Nov 2004, John Baldwin wrote:

> > > A discussion with John Baldwin and Scott Long yesterday revealed that the
> > > UP spin mutex is currently pessimized from a critical section to a
> > > critical section plus mutex internals due to a need for mtx_owned() on
> > > spin locks.  I'm not convinced that explains the entire performance
> > > irregularity I see for P4 spin mutexes on UP, however.  Note that 39 (P4
> > > UP sleep mutex) + 120 (P4 UP critical section) is not 274 (P4 UP spin
> > > mutex) by a fair amount.  Figuring out what's going on there would be a
> > > good idea, although it could well be a property of my measurement
> > > environment.  I'm currently using this to do measurements:
...
> > >         sleep mutex     crit section    spin mutex
> > >         UP      SMP     UP      SMP     UP      SMP
> > > PIII    21      90      83      81      112     141
> > > P4      39      260     120     119     274     342
> >
> > Nice catch!
> > On a UP releasing a spin mutex involves a xchgl operation while
> > releasing an uncontested sleep mutex uses cmpxchgl.
> > Since the xchgl does an implicit LOCK (and cmpxchgl does NOT) this could
> > explain why the spin mutex needs a lot more cycles.
> > This should be easy to fix since the xchgl is not needed on a UP system.
> > Right now I am sick and don't trust my own code so I won't write a patch
> > for the next few days ... hopefully someone else can get to it first.
> 
> I've tried changing the store_rel() to just do a simple store since writes are 
> ordered on x86, but benchmarks on SMP showed that it actually hurt.  However, 
> it would probably be good to at least do that for UP.  The current patch to 
> do it for all kernels is:

This change made a large difference and eliminated the unexplained costs.
Here's a revised table as compared to the above:

	sleep mutex	crit section	spin mutex	new spin mutex
	UP	SMP	UP	SMP	UP	SMP	UP	SMP
PIII	21	81	83	81	112	141	95	141
P4	39	260	120	119	274	342	132	231

So it cut roughly 140 cycles off the P4 UP spin lock (274 to 132), about
15 off the PIII UP spin lock (112 to 95), and about 110 cycles off the P4
SMP spin lock (342 to 231).  The PIII SMP spin lock is unchanged at 141.
Keep in mind that all of these measurements have a standard deviation of
between 0 and 3 cycles, most around 1.  Also keep in mind that these are
entirely uncontended measurements.
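
For anyone who wants to see the difference in isolation, here is a minimal
userland sketch of the two release styles discussed above: an exchange-based
release (on x86 an xchg with a memory operand, which is implicitly locked)
versus a plain store with release ordering (on x86 typically an ordinary
mov).  This is not the FreeBSD mutex code and not John's patch; toy_spin_t,
the toy_* functions, and the use of C11 <stdatomic.h> are all my own
assumptions for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical lock word: 0 means unowned.  Not FreeBSD's struct mtx. */
typedef struct {
	_Atomic uintptr_t owner;
} toy_spin_t;

#define TOY_UNOWNED ((uintptr_t)0)

/* Uncontended acquire: compare-and-swap from unowned to our id. */
static inline int
toy_try_acquire(toy_spin_t *m, uintptr_t tid)
{
	uintptr_t expected = TOY_UNOWNED;

	return (atomic_compare_exchange_strong_explicit(&m->owner, &expected,
	    tid, memory_order_acquire, memory_order_relaxed));
}

/* Release with an atomic exchange; the old value is discarded. */
static inline void
toy_release_xchg(toy_spin_t *m)
{
	(void)atomic_exchange_explicit(&m->owner, TOY_UNOWNED,
	    memory_order_release);
}

/* Release with a plain store that carries release ordering. */
static inline void
toy_release_store(toy_spin_t *m)
{
	atomic_store_explicit(&m->owner, TOY_UNOWNED, memory_order_release);
}

/* Both release styles leave the lock word in the same state. */
static inline void
toy_demo(void)
{
	toy_spin_t m;

	atomic_init(&m.owner, TOY_UNOWNED);
	assert(toy_try_acquire(&m, 1));
	toy_release_xchg(&m);
	assert(atomic_load(&m.owner) == TOY_UNOWNED);
	assert(toy_try_acquire(&m, 2));
	toy_release_store(&m);
	assert(atomic_load(&m.owner) == TOY_UNOWNED);
}
```

Both functions are functionally equivalent for an uncontended release; what
differs is the instruction the compiler is likely to emit, which is where
the cycle counts in the table would come from.  Checking the generated
assembly for a given compiler and flags is the only way to be sure.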

Assuming that these changes are correct, and pass whatever tests people
have in mind, this would be a very strong merge candidate for performance
reasons.  The difference is visible in packet send tests from user space
as an improvement of a percent or two on UP on my P4, although it's a
little hard to tell due to the noise.

(Note: I corrected a number in the original table: the PIII SMP sleep mutex
measured at 81 cycles, not 90 cycles as shown in the original.)

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org      Principal Research Scientist, McAfee Research


