cvs commit: src/sys/i386/i386 pmap.c
Stephan Uphoff
ups at tree.com
Tue Nov 9 19:34:31 GMT 2004
On Tue, 2004-11-09 at 13:57, Peter Wemm wrote:
> On Tuesday 09 November 2004 10:21 am, Stephan Uphoff wrote:
> > On Tue, 2004-11-09 at 13:02, Julian Elischer wrote:
> > > Robert Watson wrote:
> > > >This change made a large difference, and eliminates the
> > > > unexplained costs. Here's a revised table as compared to the
> > > > above:
> > > >
> > > > >            sleep mutex    crit section      spin mutex  new spin mutex
> > > > >             UP     SMP      UP     SMP      UP     SMP      UP     SMP
> > > > > PIII        21      81      83      81     112     141      95     141
> > > > > P4          39     260     120     119     274     342     132     231
> > > >
> > > >So it basically cut 140 cycles off the P4 UP spin lock, 15 off the
> > > > PIII UP spin lock, and 110 cycles off the P4 SMP spin lock. The
> > > > PIII SMP spin lock looks the same. Keep in mind that all of
> > > > these measurements have a standard deviation of between 0 and 3
> > > > cycles, most in the 1 range. Also keep in mind that these are
> > > > entirely uncontended measurements.
> > > >
> > > >Assuming that these changes are correct, and pass whatever tests
> > > > people have in mind, this would be a very strong merge candidate
> > > > for performance reasons. The difference is visible in packet
> > > > send tests from user space as a percentage or two improvement on
> > > > > UP on my P4, although it's a little hard to tell due to the noise.
> > >
> > > > Can you explain why a spin mutex is more expensive than a sleep
> > > > mutex (I assume this is uncontended)?
> >
> > cli() and sti() used for the critical section are expensive.
>
> ... on INTEL cpus! Don't make the mistake of assuming that all x86 cpus
> are as slow as Intel's P4 family on this stuff. Other cpus don't have
> the same massive microcode penalty. My recollection is that athlon
> (and athlon64 cpus in 32 bit mode) take about 8-12 clocks to do a cli
> or sti, compared to 300+ for a P4 cpu. And things like 50-90 clocks
> for an invlpg vs 1200-1600 clocks for a P4.
>
> Please don't accidentally penalize those of us with cpus that were
> designed for good all-round performance. The P4 family was designed
> for games and 3d graphics, not all-round performance.
>
> (This isn't aimed at anybody in particular.. I just wanted to remind
> people that the P4 code is a particularly pathological case (and the
> writing is on the wall for that core). Other cpus, including Intel's
> newer non-P4 cores, don't have the same pathological problems.)
Good points.
This seems to lead to the same choices as in my last email
(non-optimal code, lots of compile options, or self-modifying code).
Is there any reason not to implement self-modifying code, as Linux does,
for example, for memory barriers? (Andi Kleen, "[PATCH] Runtime memory
barrier patching", http://lkml.org/lkml/2003/4/21/168)
Maybe this would even allow shipping SMP-capable kernels by default
again.
Stephan
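
For readers unfamiliar with the idea, here is a rough sketch of what
boot-time patching could look like. It is not taken from the Linux patch
or from any FreeBSD code: the struct, the function name and the feature
bit are all made up for illustration, and it patches an ordinary byte
buffer so that it can be run in user space. The example bytes are the
usual x86 encodings of "lock; addl $0,0(%esp)" (the generic barrier) and
"mfence" (the SSE2 replacement), padded with one-byte NOPs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define X86_NOP 0x90u            /* one-byte x86 NOP, used as padding */

/* Hypothetical CPU feature bit, for illustration only. */
#define FEAT_SSE2 (1u << 0)

/*
 * One patch site: where it is, how long the default sequence is, and
 * what to write there if the CPU has the given feature.
 */
struct alt_site {
    size_t         offset;    /* byte offset of the site in the buffer */
    uint8_t        orig_len;  /* length of the default sequence */
    uint8_t        repl_len;  /* length of the replacement (<= orig_len) */
    const uint8_t *repl;      /* replacement bytes */
    unsigned       feature;   /* feature bit that enables the replacement */
};

/* Patch every site whose feature is present; return how many were patched. */
static size_t
apply_alternatives(uint8_t *text, size_t text_len,
                   const struct alt_site *sites, size_t nsites,
                   unsigned cpu_features)
{
    size_t patched = 0;

    for (size_t i = 0; i < nsites; i++) {
        const struct alt_site *s = &sites[i];

        if ((cpu_features & s->feature) == 0)
            continue;                     /* keep the default code */
        assert(s->repl_len <= s->orig_len);
        assert(s->offset + s->orig_len <= text_len);

        /* Write the replacement, then pad the leftover bytes with NOPs. */
        memcpy(text + s->offset, s->repl, s->repl_len);
        memset(text + s->offset + s->repl_len, X86_NOP,
               (size_t)(s->orig_len - s->repl_len));
        patched++;
    }
    return patched;
}
```

The same table-driven approach could in principle replace a "lock" prefix
(0xF0) with a NOP on UP machines. A real kernel would presumably also
have to do the patching early, before other CPUs can execute the code
being rewritten, which this sketch does not attempt to show.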