[PATCH 1/2] Implement simple sequence counters with memory barriers.

Attilio Rao attilio at freebsd.org
Sat Oct 4 09:37:19 UTC 2014


On Sat, Oct 4, 2014 at 7:28 AM, Mateusz Guzik <mjguzik at gmail.com> wrote:
> Reviving. Sorry everyone for such big delay, $life.
>
> On Tue, Aug 19, 2014 at 02:24:16PM -0500, Alan Cox wrote:
>> On Sat, Aug 16, 2014 at 8:26 PM, Mateusz Guzik <mjguzik at gmail.com> wrote:
>> > Well, my memory-barrier-and-so-on-fu is rather weak.
>> >
>> > I had another look at the issue. At least on amd64, it looks like
>> > only a compiler barrier is required for both reads and writes.
>> >
>> > According to AMD64 Architecture Programmer’s Manual Volume 2: System
>> > Programming, 7.2 Multiprocessor Memory Access Ordering states:
>> >
>> > "Loads do not pass previous loads (loads are not reordered). Stores do
>> > not pass previous stores (stores are not reordered)"
>> >
>> > Since the modifying code only performs a series of writes and we
>> > expect exclusive writers, I find this applicable to our scenario.
>> >
>> > I checked the Linux sources and the generated assembly; they indeed
>> > issue only a compiler barrier on amd64 (for Intel processors as well).
>> >
>> > atomic_store_rel_int on amd64 seems fine in this regard, but the only
>> > function for loads issues a lock cmpxchg, which kills performance
>> > (median 55693659 -> 12789232 ops in a microbenchmark) for no gain.
>> >
>> > Additionally, release and acquire semantics seem to be a stronger
>> > guarantee than needed.
>> >
>> >
>>
>> This statement left me puzzled and got me to look at our x86 atomic.h for
>> the first time in years.  It appears that our implementation of
>> atomic_load_acq_int() on x86 is, umm ..., unconventional.  That is, it is
>> enforcing a constraint that simple acquire loads don't normally enforce.
>> For example, the C11 stdatomic.h simple acquire load doesn't enforce this
>> constraint.  Moreover, our own implementation of atomic_load_acq_int() on
>> ia64, where the mapping from atomic_load_acq_int() to machine instructions
>> is straightforward, doesn't enforce this constraint either.
>>
>
> By 'this constraint' I presume you mean a full memory barrier.
>
> It is unclear to me if one can just get rid of it currently. It
> definitely would be beneficial.
>
> In the meantime, if for some reason a full barrier is still needed, we
> can speed up concurrent load_acq of the same variable considerably.
> There is no need to lock cmpxchg on the same address. We should be able
> to replace it with, more or less:
> lock add $0,(%rsp);
> movl ...;
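
For concreteness, here is a rough sketch of the two variants being
compared. This is not the actual atomic(9) code; the function names
below are made up for illustration only:

#include <stdint.h>

/*
 * Proposed variant: a locked no-op RMW on the local stack acts as a
 * full barrier, after which a plain movl of the target suffices.  The
 * locked instruction never touches the target cache line, and the
 * target itself is only read, so this also works on read-only mappings.
 */
static inline uint32_t
load_acq_lockadd(volatile uint32_t *p)
{

	__asm __volatile("lock; addl $0,(%%rsp)" : : : "memory", "cc");
	return (*p);			/* plain movl */
}

/*
 * Roughly what the current amd64 atomic_load_acq_int() does: a locked
 * cmpxchg on the target itself, which is a full barrier but also has to
 * own the target cache line exclusively (and may write to it).
 */
static inline uint32_t
load_acq_cmpxchg(volatile uint32_t *p)
{
	uint32_t res = 0;

	__asm __volatile("lock; cmpxchgl %0,%1"
	    : "+a" (res), "+m" (*p)
	    : : "memory", "cc");
	return (res);
}

The difference under discussion is only whether the locked instruction
targets the loaded address itself or a dummy location on the stack.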

When I looked into some AMD manual (I think the same one that suggests
the lock add $0,(%rsp) trick), I recall that the combined reported
latency of "lock add" + "movl" is higher than that of the single
"cmpxchg".
Moreover, I think that the simple movl is going to lock the cache line
anyway, so I doubt the "lock add" is going to provide any benefit. The
only benefit I can think of is that this trick would let us use _acq()
barriers on read-only memory (which is not possible today, as the
timecounters code can testify).

Whether the latencies for "lock add" + "movl" have changed in the latest
Intel processors I can't say for sure; it may be worth looking into.
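
Coming back to the seqc case the patch is about: with a single exclusive
writer and the amd64 ordering quoted earlier (loads not passing loads,
stores not passing stores), compiler barriers alone should be enough on
that architecture. A minimal sketch, for illustration only (the names
are made up here, and weaker architectures would need real
acquire/release fences in place of the compiler barriers):

#include <stdbool.h>
#include <stdint.h>

/* Compiler-only barrier; no fence instruction is emitted. */
#define	compiler_barrier()	__asm __volatile("" ::: "memory")

typedef uint32_t seqc_t;

static inline void
seqc_write_begin(volatile seqc_t *seqcp)
{
	(*seqcp)++;		/* odd: modification in progress */
	compiler_barrier();	/* keep the data stores after the bump */
}

static inline void
seqc_write_end(volatile seqc_t *seqcp)
{
	compiler_barrier();	/* keep the data stores before the bump */
	(*seqcp)++;		/* even again: consistent */
}

static inline seqc_t
seqc_read(volatile const seqc_t *seqcp)
{
	seqc_t ret;

	while (((ret = *seqcp) & 1) != 0)
		;		/* writer active, spin */
	compiler_barrier();	/* keep the data loads after this read */
	return (ret);
}

static inline bool
seqc_consistent(volatile const seqc_t *seqcp, seqc_t oldseqc)
{
	compiler_barrier();	/* keep the data loads before the re-read */
	return (*seqcp == oldseqc);
}

A reader does ret = seqc_read(&sc), copies the protected data out and
retries unless seqc_consistent(&sc, ret) still holds; the writer
brackets its updates with seqc_write_begin()/seqc_write_end().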

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein

