Patch to optimize "bare" critical sections

Thu Nov 25 07:08:38 PST 2004

On Tue, 23 Nov 2004, John Baldwin wrote:

> On Tuesday 23 November 2004 03:00 pm, John Baldwin wrote:
> > Basically, I have a patch to divorce the interrupt disable/deferring to
> > only happen inside of spinlocks using a new spinlock_enter/exit() API
> > (where a spinlock_enter/exit includes a critical section as well) but that
> > plain critical sections won't have to do such a thing.  I've tested it on
> > i386, alpha, and sparc64 already, and it has also been tested on arm.  I'm
> > unable to get a cross-built powerpc kernel to link (linker dies with a
> > signal 6), but the compile did finish.  I have cross-compiled ia64 and
> > amd64 successfully, but have not run tested due to ENOHARDWARE.  So, I would
> > appreciate it if a few folks could try the patch out on ppc, ia64, and
> > amd64 to make sure it works ok.  Thanks.
> >
> > http://www.FreeBSD.org/~jhb/spinlock.patch
> 
> *cough* Ahem, http://www.FreeBSD.org/~jhb/patches/spinlock.patch

FYI, I'm seeing a fairly solid wedge occurring under stress with the i386
patch in place on a dual Xeon test box in the Netperf cluster.  I thought
at first it was a property of the UMA optimizations I have that use the
critical sections, but it also happens with just the critical section
changes, so... :-) 

To reproduce it, I run the syscall_timing tool repeatedly on the box
over a serial console:

    http://www.watson.org/~robert/freebsd/syscall_timing.c

In particular, I'm running 10,000 iterations of the socket create/free
test.  Under normal circumstances it looks like this:

tiger-2# while (1)
while? ./syscall_timing 10000 socket | grep per
while? end
0.000006708 per/iteration
0.000006642 per/iteration
0.000006658 per/iteration
0.000006660 per/iteration
...
^C

When I get the wedge it does this:

tiger-2# while (1)
while? ./syscall_timing 10000 socket | grep per
while? end
0.000006735 per/iteration
0.000006772 per/iteration
0.000006721 per/iteration
0.000006744 per/iteration
...
0.000006716 per/iteration
0.000006710 per/iteration
0.000006745 per/			<-- hung

It could well be associated with poor timing involving a clock or serial
interrupt.

I haven't made much headway investigating it yet, and it looks like
serial break is of no help, but I will see what I can do this
afternoon.  I suspect that without NMI on the box in question it will be
difficult.  I have only tried an SMP kernel so far, not UP.  That
said, with the critical section optimization in place and UMA moved to
critical sections rather than mutexes for the per-CPU cache on SMP,
I see a small but healthy performance improvement in the socket
create/destroy micro-benchmark:

x netperf-socket-smp
+ percpu-socket-smp
+--------------------------------------------------------------------------+
|           +                                                            x |
|           ++                                                        x xxx|
|+     +   ++++     +                                                 xxxxx|
|      |____A____|                                                     |AM||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10      6.64e-06     6.676e-06     6.666e-06    6.6601e-06 1.2359881e-08
+  10     6.078e-06     6.236e-06     6.172e-06     6.165e-06 4.0734915e-08
Difference at 95.0% confidence
        -4.951e-07 +/- 2.82825e-08
        -7.43382% +/- 0.424655%
        (Student's t, pooled s = 3.01007e-08)
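
The summary lines can be checked by hand from the table.  The working
below is a sketch that assumes the standard two-sample pooled-variance
formulas and a two-sided 95% critical value of about 2.101 for 18 degrees
of freedom; the critical value is an assumption, since it is not printed
in the output.  Subscript a denotes the "x" data set (netperf-socket-smp)
and b the "+" data set (percpu-socket-smp); align* needs amsmath.

```latex
\begin{align*}
s_p &= \sqrt{\frac{(n_a - 1)\,s_a^2 + (n_b - 1)\,s_b^2}{n_a + n_b - 2}}
     = \sqrt{\frac{9\,(1.2359881\times10^{-8})^2
                 + 9\,(4.0734915\times10^{-8})^2}{18}}
     \approx 3.01007\times10^{-8} \\
\Delta &= \bar{x}_b - \bar{x}_a
        = 6.165\times10^{-6} - 6.6601\times10^{-6}
        = -4.951\times10^{-7} \\
h &= t_{0.975,\,18}\; s_p \sqrt{\frac{1}{n_a} + \frac{1}{n_b}}
   \approx 2.101 \times 3.01007\times10^{-8} \times \sqrt{0.2}
   \approx 2.828\times10^{-8} \\
\frac{\Delta}{\bar{x}_a} &\approx -7.434\%, \qquad
\frac{h}{\bar{x}_a} \approx 0.425\%
\end{align*}
```

These agree with the reported pooled s, difference, and percentage
figures to the precision shown.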

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org      Principal Research Scientist, McAfee Research