cvs commit: src/sys/i386/i386 pmap.c
Stephan Uphoff
ups at tree.com
Tue Nov 9 02:28:22 GMT 2004
On Mon, 2004-11-08 at 13:17, John Baldwin wrote:
> On Saturday 06 November 2004 05:24 pm, Stephan Uphoff wrote:
> > On Fri, 2004-11-05 at 07:01, Robert Watson wrote:
> > > On Fri, 29 Oct 2004, Mike Silbersack wrote:
> > > > I think we really need some sort of light-weight critical_enter that
> > > > simply assures you that you won't get rescheduled to another CPU, but
> > > > gives no guarantees beyond that.
> > >
> > > <snip>
> > >
> > > > Er, wait - I guess I'm forgetting something, there exists the potential
> > > > for the interrupt that preempted whatever was calling arc4random to
> > > > also call arc4random, thereby breaking things...
> > >
> > > I've been looking at related issues for the last couple of days and must
> > > have missed this thread while at EuroBSDCon. Alan Cox pointed me at it,
> > > so here I am. :-)
> > >
> > > Right now, the cost of acquiring and dropping an uncontended sleep
> > > mutex on a UP kernel is very low -- about 21 cycles on my PIII and 40 on
> > > my P4, including some efficiency problems in my measurement which
> > > probably add a non-trivial overhead. Compare this with the SMP versions
> > > on the PIII (90 cycles) and P4 (260 cycles!). Critical sections on the
> > > SMP PIII are about the same cost as the SMP mutex, but on the P4 a
> > > critical section is less than half the cost. Getting to a model where
> > > critical sections were as cheap as UP sleep mutexes, or where we could
> > > use a similar combination of primitives (such as UP mutexes with pinning)
> > > would be very useful. Otherwise, optimizing through use of critical
> > > sections will improve SMP but potentially damage performance on UP.
> > > There's been a fair amount of discussion of such approaches, including
> > > the implementation briefly present in the FreeBSD tree. I know John Baldwin
> > > and Justin Gibbs both have theories and plans in this area.
> > >
> > > If we do create a UP mutex primitive for use on SMP, I would suggest we
> > > actually expand the contents of the UP mutex structure slightly to
> > > include a cpu number that can be asserted, along with pinning, when an
> > > operation is attempted and INVARIANTS is present. One of the great
> > > strengths of the mutex/lock model is a strong assertion capability, both
> > > for the purposes of documentation and testing, so we should make sure
> > > that carries into any new synchronization primitives.
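
(Just to make the assertion idea concrete - a rough sketch only, all names
here are made up, contention and error handling are left out, and curthread,
PCPU_GET() and sched_pin() are assumed to be available in kernel context:)

/* Hypothetical UP-style mutex: plain stores, no atomic operations. */
struct upmtx {
	volatile uintptr_t	um_owner;	/* owning thread, or 0 */
#ifdef INVARIANTS
	u_int			um_cpu;		/* CPU it was acquired on */
#endif
};

static __inline void
upmtx_enter(struct upmtx *um)
{
	sched_pin();				/* stay on this CPU */
	um->um_owner = (uintptr_t)curthread;
#ifdef INVARIANTS
	um->um_cpu = PCPU_GET(cpuid);		/* remember where we took it */
#endif
}

static __inline void
upmtx_exit(struct upmtx *um)
{
	KASSERT(um->um_owner == (uintptr_t)curthread,
	    ("upmtx_exit: not owner"));
#ifdef INVARIANTS
	KASSERT(um->um_cpu == PCPU_GET(cpuid),
	    ("upmtx_exit: released on a different CPU"));
#endif
	um->um_owner = 0;
	sched_unpin();
}
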
> > >
> > > Small table of synchronization primitives below; in each case, the count
> > > is in cycles and reflects the cost of acquiring and dropping the
> > > primitive (lock+unlock, enter+exit). The P4 is a 3 GHz box, and the PIII
> > > is an 800 MHz box. Note that the synchronization primitives requiring
> > > atomic operations are substantially pessimized on the P4 vs the PIII.
> > >
> > > A discussion with John Baldwin and Scott Long yesterday revealed that the
> > > UP spin mutex is currently pessimized from a critical section to a
> > > critical section plus mutex internals due to a need for mtx_owned() on
> > > spin locks. I'm not convinced that explains the entire performance
> > > irregularity I see for P4 spin mutexes on UP, however. Note that 39 (P4
> > > UP sleep mutex) + 120 (P4 UP critical section) is not 274 (P4 UP spin
> > > mutex) by a fair amount. Figuring out what's going on there would be a
> > > good idea, although it could well be a property of my measurement
> > > environment. I'm currently using this to do measurements:
> > >
> > > //depot/user/rwatson/percpu/sys/test/test_synch_timing.c
> > >
> > > In all of the below, the standard deviation is very small if you're
> > > careful about not bumping into hard clock or other interrupts during
> > > testing, especially when it comes to spin mutexes and critical sections.
> > >
> > > Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
> > > robert at fledge.watson.org    Principal Research Scientist, McAfee Research
> > >
> > >         sleep mutex     crit section    spin mutex
> > >         UP      SMP     UP      SMP     UP      SMP
> > > PIII    21      90      83      81      112     141
> > > P4      39      260     120     119     274     342
> >
> > Nice catch!
> > On a UP system, releasing a spin mutex involves an xchgl operation, while
> > releasing an uncontended sleep mutex uses cmpxchgl.
> > Since xchgl does an implicit LOCK (and cmpxchgl does NOT), this could
> > explain why the spin mutex needs a lot more cycles.
> > This should be easy to fix since the xchgl is not needed on a UP system.
> > Right now I am sick and don't trust my own code so I won't write a patch
> > for the next few days ... hopefully someone else can get to it first.
>
> I've tried changing the store_rel() to just do a simple store since writes are
> ordered on x86, but benchmarks on SMP showed that it actually hurt.
Strange - any idea why?
Can you recommend a specific benchmark for this?
I would like to try some variations.
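Unless you have something handy, I will probably just hack up a loop like the
one below and call it from a temporary sysctl (a rough sketch only - not
Robert's test_synch_timing.c - and i386/amd64 only because of rdtsc()):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <machine/cpufunc.h>		/* rdtsc() */

#define	ITERATIONS	100000

static struct mtx test_mtx;

static void
measure_mtx(void)
{
	uint64_t start, end;
	int i;

	mtx_init(&test_mtx, "test mtx", NULL, MTX_DEF);

	start = rdtsc();
	for (i = 0; i < ITERATIONS; i++) {
		mtx_lock(&test_mtx);
		mtx_unlock(&test_mtx);
	}
	end = rdtsc();

	/* Noisy if a clock interrupt lands inside the loop. */
	printf("mtx_lock/mtx_unlock: %u cycles per iteration\n",
	    (u_int)((end - start) / ITERATIONS));

	mtx_destroy(&test_mtx);
}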
> However,
> it would probably be good to at least do that for UP. The current patch to
> do it for all kernels is:
>
> --- //depot/vendor/freebsd/src/sys/i386/include/atomic.h 2004/03/12 21:50:47
> +++ //depot/projects/smpng/sys/i386/include/atomic.h 2004/08/02 15:16:35
> @@ -69,7 +69,7 @@
>
> int atomic_cmpset_int(volatile u_int *dst, u_int exp, u_int src);
>
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP) \
> +#define ATOMIC_STORE_LOAD(TYPE, LOP) \
> u_##TYPE atomic_load_acq_##TYPE(volatile u_##TYPE *p); \
> void atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)
>
> @@ -175,12 +175,12 @@
> #if defined(I386_CPU)
>
> /*
> - * We assume that a = b will do atomic loads and stores.
> - *
> - * XXX: This is _NOT_ safe on a P6 or higher because it does not guarantee
> - * memory ordering. These should only be used on a 386.
> + * We assume that a = b will do atomic loads and stores. However, on a
> + * PentiumPro or higher, reads may pass writes, so for that case we have
> + * to use a serializing instruction (i.e. with LOCK) to do the load. For
> + * the 386 case we can use a simple read since 386s don't support SMP.
> */
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP) \
> +#define ATOMIC_STORE_LOAD(TYPE, LOP) \
> static __inline u_##TYPE \
> atomic_load_acq_##TYPE(volatile u_##TYPE *p) \
> { \
> @@ -190,14 +190,14 @@
> static __inline void \
> atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)\
> { \
> + __asm __volatile("" : : : "memory"); \
> *p = v; \
> - __asm __volatile("" : : : "memory"); \
> } \
> struct __hack
>
> #else /* !defined(I386_CPU) */
>
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP) \
> +#define ATOMIC_STORE_LOAD(TYPE, LOP) \
> static __inline u_##TYPE \
> atomic_load_acq_##TYPE(volatile u_##TYPE *p) \
> { \
> @@ -211,16 +211,11 @@
> return (res); \
> } \
> \
> -/* \
> - * The XCHG instruction asserts LOCK automagically. \
> - */ \
> static __inline void \
> atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)\
> { \
> - __asm __volatile(SOP \
> - : "+m" (*p), /* 0 */ \
> - "+r" (v) /* 1 */ \
> - : : "memory"); \
> + __asm __volatile("" : : : "memory"); \
> + *p = v; \
> } \
> struct __hack
>
> @@ -230,7 +225,7 @@
>
> extern int atomic_cmpset_int(volatile u_int *, u_int, u_int);
>
> -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP) \
> +#define ATOMIC_STORE_LOAD(TYPE, LOP) \
> extern u_##TYPE atomic_load_acq_##TYPE(volatile u_##TYPE *p); \
> extern void atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)
>
> @@ -258,10 +253,10 @@
> ATOMIC_ASM(add, long, "addl %1,%0", "ir", v);
> ATOMIC_ASM(subtract, long, "subl %1,%0", "ir", v);
>
> -ATOMIC_STORE_LOAD(char, "cmpxchgb %b0,%1", "xchgb %b1,%0");
> -ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1", "xchgw %w1,%0");
> -ATOMIC_STORE_LOAD(int, "cmpxchgl %0,%1", "xchgl %1,%0");
> -ATOMIC_STORE_LOAD(long, "cmpxchgl %0,%1", "xchgl %1,%0");
> +ATOMIC_STORE_LOAD(char, "cmpxchgb %b0,%1");
> +ATOMIC_STORE_LOAD(short,"cmpxchgw %w0,%1");
> +ATOMIC_STORE_LOAD(int, "cmpxchgl %0,%1");
> +ATOMIC_STORE_LOAD(long, "cmpxchgl %0,%1");
>
> #undef ATOMIC_ASM
> #undef ATOMIC_STORE_LOAD
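
If the plain store really does make things worse on SMP, maybe only the UP
case should change. Roughly like this for the int case (a sketch only - it
ignores the ATOMIC_STORE_LOAD macro plumbing and the I386_CPU and module
variants):

static __inline void
atomic_store_rel_int(volatile u_int *p, u_int v)
{
#ifdef SMP
	/* Keep the xchgl; its implicit LOCK gives us the ordering on SMP. */
	__asm __volatile("xchgl %1,%0" : "+m" (*p), "+r" (v) : : "memory");
#else
	/*
	 * On UP a compiler barrier followed by a plain store is enough,
	 * and we avoid the locked bus cycle.
	 */
	__asm __volatile("" : : : "memory");
	*p = v;
#endif
}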