[PATCH] microoptimize locking primitives by introducing randomized delay between atomic ops

Sat Jul 16 15:17:45 UTC 2016

On Sun, Jul 10, 2016 at 08:32:01AM -0600, Ian Lepore wrote:
> On Sun, 2016-07-10 at 13:13 +0200, Mateusz Guzik wrote:
> > If the lock is contended, primitives like __mtx_lock_sleep will spin
> > checking if the owner is running or the lock was freed. The problem is
> > that once it is discovered that the lock is free, multiple CPUs are
> > likely to try to do the atomic op which will make it more costly for
> > everyone and throughput suffers.
> > 
> > The standard thing to do is to have some sort of a randomized delay so
> > that this kind of behaviour is reduced.
> > 
> > As such, below is a trivial hack which takes cpu_ticks() into account
> > and performs % 2048, which in my testing gives reasonably good results.
> > 
> > Please note there is definitely way more room for improvement in
> > general.
> > 
> > In terms of results, there was no statistically significant change in
> > -j 40 buildworld nor buildkernel.
> > 
> > However, a 40-way find on a ports tree placed on tmpfs yielded the
> > following:
> > 
> > x vanilla            
> > + patched
> > +----------------------------------------------------------------------------------------+
> > |     ++++                +                                         x         x x x      |
> > |+    ++++ +++    +  +  + ++       +       +     x               x  x  xxxxxxxx x x     x|
> > |   |_____M____A__________|                                      |________AM______|      |
> > +----------------------------------------------------------------------------------------+
> >     N           Min           Max        Median           Avg        Stddev
> > x  20        12.431        15.952        14.897       14.7444    0.74241657
> > +  20         8.103        11.863        9.0135       9.44565     1.0059484
> > Difference at 95.0% confidence
> > 	-5.29875 +/- 0.565836
> > 	-35.9374% +/- 3.83764%
> > 	(Student's t, pooled s = 0.884057)
> > 
> > The patch:
> [...]
> 
> What about platforms that don't have a useful implementation of
> cpu_ticks()?
> 

Do we have such platforms, and do they have SMP?

> What about platforms that don't suffer the large expense for atomic ops
> that x86 apparently does?
> 

The locking primitives already seem to be x86-centric as they stand.
Some of them already postpone atomic ops, and this patch only extends
that (in a different form).

That said, if we have platforms where this kind of stuff is detrimental
to performance, machine-specific primitives should be introduced.
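
For anyone who wants to poke at the idea outside the kernel, here is a
minimal userspace sketch of the approach. It is not the patch. The names
ticks_now(), backoff_iterations(), struct spinlock, spin_lock() and
spin_unlock() are made up for illustration, clock_gettime() stands in for
cpu_ticks(), and the real primitives do a lot more (owner-running checks,
sleeping, and so on).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical userspace stand-in for the kernel's cpu_ticks(). */
static uint64_t
ticks_now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
}

/* Hypothetical helper: pseudo-random delay length in [0, 2047]. */
static unsigned
backoff_iterations(void)
{
	return ((unsigned)(ticks_now() % 2048));
}

struct spinlock {
	atomic_uintptr_t owner;		/* 0 means unowned */
};

static void
spin_lock(struct spinlock *l, uintptr_t tid)
{
	uintptr_t expected;
	volatile unsigned i;		/* volatile keeps the delay loop alive */

	for (;;) {
		/* Spin with plain loads for as long as the lock is held. */
		while (atomic_load_explicit(&l->owner,
		    memory_order_relaxed) != 0)
			;
		/* Lock looks free: wait a varying time before the atomic op. */
		for (i = backoff_iterations(); i > 0; i--)
			;
		expected = 0;
		if (atomic_compare_exchange_weak_explicit(&l->owner,
		    &expected, tid, memory_order_acquire,
		    memory_order_relaxed))
			return;
	}
}

static void
spin_unlock(struct spinlock *l)
{
	atomic_store_explicit(&l->owner, 0, memory_order_release);
}
```

The point is only the ordering: waiters that see the lock go free do not
all issue the compare-and-swap at the same moment, so fewer CPUs fight over
the cache line at once.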

Meanwhile, courtesy of andrew@, I tested the patch on a Cavium machine
(48-way arm64) and saw a large improvement (roughly 62% by the numbers
below).

x vanilla
+ patched
+----------------------------------------------------------------------------------------+
|+                                                                                       |
|+                                                                                       |
|+                                                                                       |
|+                                                                                       |
|+                                                                                       |
|+                                                                                   x   |
|++                                                                                 xxx  |
|++                                                                                xxxxxx|
|A|                                                                                 |A_| |
+----------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10         17.25        17.849         17.48       17.4968    0.19581556
+  10          6.56         6.679         6.586        6.6011   0.038013009
Difference at 95.0% confidence
	-10.8957 +/- 0.132528
	-62.2725% +/- 0.757439%
	(Student's t, pooled s = 0.141047)

Note: find does a lot of open+close. close results in exclusive vnode
locking if the fs does not set the MNTK_EXTENDED_SHARED flag, and tmpfs
does not. On this machine that contributed to a major slowdown, so I set
the flag locally. I'm not sure yet how safe the change is in terms of
general use, but it is definitely fine enough for the benchmark.

That said, I would like to commit this next week unless there are objections.

-- 
Mateusz Guzik <mjguzik gmail.com>