atomic ops
Mateusz Guzik
mjguzik at gmail.com
Wed Oct 29 19:05:07 UTC 2014
On Tue, Oct 28, 2014 at 02:18:41PM +0100, Attilio Rao wrote:
> On Tue, Oct 28, 2014 at 3:52 AM, Mateusz Guzik <mjguzik at gmail.com> wrote:
> > As was mentioned sometime ago, our situation related to atomic ops is
> > not ideal.
> >
> > atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
> > full memory barriers, which is stronger than needed.
> >
> > Moreover, load is implemented as lock cmpxchg on the variable's
> > address, so it is additionally slower, especially when CPUs compete.
>
> I already explained this once privately: full memory barriers are not
> stronger than needed.
> FreeBSD has different semantics than Linux. We historically enforce a
> full barrier on _acq() and _rel() rather than just a read or write
> barrier, hence we need a different implementation than Linux.
> There is code that relies on this property, like the locking
> primitives (release a mutex, for instance).
>
I mean stronger than needed in some cases; a popular one is fget_unlocked,
where we provide no "lightest sufficient" barrier (which would also be
cheaper).
Another case that benefits greatly is sys/sys/seq.h. As noted in another
thread, using load_acq as it is destroys performance.
I don't dispute the need for full barriers, although it is unclear which
current consumers of load_acq actually need one.
> In short: optimizing the implementation for performance is fine and
> due. Changing the semantics is not fine, unless you have reviewed and
> fixed all the uses of _rel() and _acq().
>
> > On amd64 it is sufficient to place a compiler barrier in such cases.
> >
> > Next, we lack some atomic ops in the first place.
> >
> > Let's define some useful terms:
> > smp_wmb - no writes can be reordered past this point
> > smp_rmb - no reads can be reordered past this point
> >
> > With this in mind, we lack ops which would guarantee only the following:
> >
> > 1. var = tmp; smp_wmb();
> > 2. tmp = var; smp_rmb();
> > 3. smp_rmb(); tmp = var;
> >
> > This matters since what we can use already to emulate this is way
> > heavier than needed on aforementioned amd64 and most likely other archs.
>
> I can see the value of such barriers in case you want to synchronize
> operations only with regard to reads or writes.
> I also believe that on the newest Intel processors (for which we should
> optimize) rmb() and wmb() got significantly faster than mb(). However
> the most interesting cases would be arm and mips, I assume. That's
> where you would see a bigger perf difference if you optimize the
> membar paths.
>
> Last time I looked into it, in the FreeBSD kernel the Linux-ish
> rmb()/wmb()/etc. were used primarily in 3 places: Linux-derived code,
> handling of 16-bit operands and implementation of "faster" bus
> barriers.
> Initially I had thought about just confining the smp_*() ops to a Linux
> compat layer and fixing the other 2 this way: for 16-bit operands
> just pad to 32 bits, as the C11 standard also does. For the bus
> barriers, just grow more versions to actually include the rmb()/wmb()
> scheme within.
>
> At this point, I understand we may want to instead support the
> concept of write-only or read-only barrier. This means that if we want
> to keep the concept tied to the current _acq()/_rel() scheme we will
> end up with a KPI explosion.
>
> I'm not the one making the call here, but for a faster and more
> granular approach, possibly we can end up using smp_rmb() and
> smp_wmb() directly. As I said I'm not the one making the call.
>
Well, I don't know the original motivation for expressing stuff with
_load_acq and _store_rel.
Anyway, maybe we could do something along these lines (expressing intent,
not actual code):
mb_producer_start(p, v) { *p = v; smp_wmb(); }
mb_producer(p, v) { smp_wmb(); *p = v; }
mb_producer_end(p, v) { mb_producer(p, v); }
type mb_consumer(p) { var = *p; smp_rmb(); return (var); }
type mb_consumer_start(p) { return (mb_consumer(p)); }
type mb_consumer_end(p) { smp_rmb(); return (*p); }
--
Mateusz Guzik <mjguzik gmail.com>