[PATCH 1/2] Implement simple sequence counters with memory barriers.

Konstantin Belousov kostikbel at gmail.com
Sat Oct 4 07:11:45 UTC 2014


On Sat, Oct 04, 2014 at 07:28:51AM +0200, Mateusz Guzik wrote:
> Reviving. Sorry everyone for such a big delay, $life.
> 
> On Tue, Aug 19, 2014 at 02:24:16PM -0500, Alan Cox wrote:
> > On Sat, Aug 16, 2014 at 8:26 PM, Mateusz Guzik <mjguzik at gmail.com> wrote:
> > > Well, my memory-barrier-and-so-on-fu is rather weak.
> > >
> > > I had another look at the issue. At least on amd64, it looks like only
> > > a compiler barrier is required for both reads and writes.
> > >
> > > According to the AMD64 Architecture Programmer's Manual Volume 2: System
> > > Programming, section 7.2 "Multiprocessor Memory Access Ordering" states:
> > >
> > > "Loads do not pass previous loads (loads are not reordered). Stores do
> > > not pass previous stores (stores are not reordered)"
> > >
> > > Since the modifying code only performs a series of writes and we
> > > expect exclusive writers, I find this guarantee applicable to our scenario.
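
As a concrete illustration of that guarantee, here is a minimal
writer-side sketch of a sequence counter that relies only on x86-TSO
store ordering plus compiler barriers. The names and the
compiler_barrier() macro are illustrative, not a committed API; the
writer is assumed to be exclusive (e.g. protected by a lock):

#define	compiler_barrier()	__asm __volatile("" ::: "memory")

typedef volatile u_int seq_t;

static inline void
seq_write_begin(seq_t *sp)
{
	(*sp)++;		/* counter becomes odd: write in progress */
	compiler_barrier();	/* keep data stores after the counter bump */
}

static inline void
seq_write_end(seq_t *sp)
{
	compiler_barrier();	/* keep data stores before the counter bump */
	(*sp)++;		/* counter becomes even: write complete */
}

The reader side similarly needs only a compiler barrier on amd64,
since loads are not reordered with other loads.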
> > >
> > > I checked the Linux sources and the generated assembly; they indeed issue
> > > only a compiler barrier on amd64 (for Intel processors as well).
> > >
> > > atomic_store_rel_int on amd64 seems fine in this regard, but the only
> > > function for loads issues lock cmpxchg, which kills performance
> > > (median 55693659 -> 12789232 ops in a microbenchmark) for no gain.
> > >
> > > Additionally, release and acquire semantics seem to be a stronger
> > > guarantee than needed.
> > >
> > >
> > 
> > This statement left me puzzled and got me to look at our x86 atomic.h for
> > the first time in years.  It appears that our implementation of
> > atomic_load_acq_int() on x86 is, umm ..., unconventional.  That is, it is
> > enforcing a constraint that simple acquire loads don't normally enforce.
> > For example, the C11 stdatomic.h simple acquire load doesn't enforce this
> > constraint.  Moreover, our own implementation of atomic_load_acq_int() on
> > ia64, where the mapping from atomic_load_acq_int() to machine instructions
> > is straightforward, doesn't enforce this constraint either.
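
For comparison, a conventional C11 acquire load (as referenced above)
compiles to a plain MOV on x86, since every x86 load already has
acquire semantics; no LOCKed instruction or fence is emitted. A
minimal userland example:

#include <stdatomic.h>

unsigned int
load_acq(const _Atomic unsigned int *p)
{

	/* Plain MOV on x86; orders this load only against later
	   memory operations, unlike a full barrier. */
	return (atomic_load_explicit(p, memory_order_acquire));
}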
> > 
> 
> By 'this constraint' I presume you mean a full memory barrier.
> 
> It is unclear to me whether one can just get rid of it currently. It
> would definitely be beneficial.
> 
> In the meantime, if for some reason a full barrier is still needed, we can
> speed up concurrent load_acq of the same variable considerably. There is no
> need to lock cmpxchg on the shared address. We should be able to replace
> it with, roughly:
> lock add $0,(%rsp);
> movl ...;
> 
> I believe it is possible that the CPU will perform some writes before
> doing the read listed here, but this should be fine.
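
A sketch of what that replacement could look like, assuming amd64 (the
function name is made up for illustration; this is not a committed
implementation). The LOCKed no-op on the stack acts as a full barrier,
like MFENCE, but without touching the shared variable's cache line:

static inline u_int
load_with_full_barrier(volatile u_int *p)
{

	/* LOCKed RMW on a private stack slot: full barrier with no
	   contention on *p's cache line. */
	__asm __volatile("lock; addl $0,(%%rsp)" : : : "memory", "cc");
	return (*p);	/* the plain MOV performs the actual load */
}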
> 
> If this is considered too risky to hit 10.1, I would like to implement
> it within seq as a temporary hack to be fixed up later.
> 
> something along these lines:
> static inline u_int
> atomic_load_acq_rmb(volatile u_int *p)
> {
> 	volatile u_int v;
> 
> 	v = *p;				/* plain load of the shared counter */
> 	atomic_load_acq_int(&v);	/* full barrier on a private stack slot */
> 	return (v);
> }
Do you need it as a designated primitive?  I think you could write this
inline for the purpose of getting the fix into 10.1.

With the inline quirk, I think that the fix should go into HEAD
now, with some reasonable MFC timer.
> 
> This hack fixes the aforementioned performance degradation and covers all
> architectures.
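
For context, a sketch of how a reader could sit on top of such a
primitive; seq_read and seq_consistent are illustrative names patterned
on the patch under discussion, odd counter values are assumed to mark a
write in progress, and cpu_spinwait() is the kernel's spin-loop hint:

static inline u_int
seq_read(volatile u_int *seqp)
{
	u_int v;

	for (;;) {
		v = atomic_load_acq_rmb(seqp);
		if ((v & 1) == 0)	/* even: no write in progress */
			break;
		cpu_spinwait();
	}
	return (v);
}

static inline int
seq_consistent(volatile u_int *seqp, u_int oldv)
{

	/* Re-read the counter; any change means a writer intervened
	   and the data read in between must be discarded. */
	return (atomic_load_acq_rmb(seqp) == oldv);
}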
> 
> > Give us a chance to sort this out before you do anything further.  As
> > Kostik said, but in different words, we've always written our
> > machine-independent layer code using acquires and releases to express the
> > required ordering constraints and not {r,w}mb() primitives.
> > 
> 
> -- 
> Mateusz Guzik <mjguzik gmail.com>

