1.5 times slower performance with SCHED_ULE than SCHED_4BSD

Tue Aug 23 07:36:08 GMT 2005

Hello list,

I noticed last year that SCHED_ULE is slower than SCHED_4BSD, and
filed a PR. At that time it was not convincing because
5.3-RELEASE/amd64 was not stable enough with a large amount of memory,
etc. My recent 5.4-RELEASE/amd64 is extremely stable even with a large
amount of memory and high load, so I can now say something definite.
If you are interested in this e-mail, please test it.
I have also prepared statically linked binaries so that my test is easy
to reproduce, because it uses ATLAS (math/atlas-devel), which is a
real pain to build.

My conclusion in short:
SCHED_ULE is about 1.5 times slower than SCHED_4BSD on FreeBSD
5.4-RELEASE/amd64. Both SCHED_4BSD and SCHED_ULE are definitely SMP
aware, but SCHED_ULE does not schedule very large jobs efficiently,
whereas 4BSD is almost optimal.

My Opteron box:
o Tyan S2885 Tiger K8W
o Opteron 242x2 (1.6GHzx2)
o Transcend 2Gx8 (total 16G) 

How to repeat:
fetch http://people.freebsd.org/~maho/scheduler_amd64.tar.gz
tar xvfz scheduler_amd64.tar.gz
cd scheduler_amd64/
sysctl kern.sched.name ; /usr/bin/time ./xdinvtst_pt -N 7000 9000 200

My results:
o 4BSD
sysctl kern.sched.name ; /usr/bin/time ./xdinvtst_pt -N 7000 9000 200
kern.sched.name: 4BSD

NREPS  ORDER   UPLO      N    LDA      TIME     MFLOP         RESID
=====  =====  =====  =====  =====  ========  ========  ============

    0    Col     GE   7000   7000   157.028   4368.50  2.110725e-02
    0    Col     GE   7200   7200   168.527   4429.39  2.106386e-02
    0    Col     GE   7400   7400   185.014   4380.32  2.099199e-02
    0    Col     GE   7600   7600   198.622   4420.07  2.073756e-02
    0    Col     GE   7800   7800   214.284   4429.04  2.089531e-02
    0    Col     GE   8000   8000   232.126   4411.27  2.142018e-02
    0    Col     GE   8200   8200   255.809   4310.65  2.041516e-02
    0    Col     GE   8400   8400   265.088   4471.62  2.092699e-02
    0    Col     GE   8600   8600   285.403   4457.11  2.119786e-02
    0    Col     GE   8800   8800   306.969   4439.88  2.257722e-02
    0    Col     GE   9000   9000   324.456   4493.54  2.347010e-02

11 cases: 11 passed, 0 skipped, 0 failed
     4707.78 real      9019.12 user        38.34 sys

o ULE
sysctl kern.sched.name ; /usr/bin/time ./xdinvtst_pt -N 7000 9000 200
kern.sched.name: ule

NREPS  ORDER   UPLO      N    LDA      TIME     MFLOP         RESID
=====  =====  =====  =====  =====  ========  ========  ============

    0    Col     GE   7000   7000   284.579   2410.49  2.110725e-02
    0    Col     GE   7200   7200   176.769   4222.87  2.106386e-02
    0    Col     GE   7400   7400   183.035   4427.67  2.099199e-02
    0    Col     GE   7600   7600   195.830   4483.10  2.073756e-02
    0    Col     GE   7800   7800   228.077   4161.20  2.089531e-02
    0    Col     GE   8000   8000   267.382   3829.61  2.142018e-02
    0    Col     GE   8200   8200   247.578   4453.95  2.041516e-02
    0    Col     GE   8400   8400   261.590   4531.42  2.092699e-02
    0    Col     GE   8600   8600   308.443   4124.18  2.119786e-02
    0    Col     GE   8800   8800   331.672   4109.20  2.257722e-02
    0    Col     GE   9000   9000   320.790   4544.91  2.347010e-02

11 cases: 11 passed, 0 skipped, 0 failed
     6964.19 real      8720.26 user        34.31 sys

o What is my test doing?
What is xdinvtst_pt? This program calculates the inverse of randomly
generated matrices (double precision). _pt means pthread: it creates two
threads at a time to calculate the inversion of the matrix. I performed
the calculation from a 7000x7000 matrix up to a 9000x9000 matrix,
increasing the number of rows and columns by 200 each time.
How to make xdinvtst_pt? Build math/atlas-devel on an SMP machine; this
port detects the number of processors installed.
The build takes a very long time: 1.5 days or so, and you have to type
make many times (10-20 times!) since this port is fragile.
Then go down into the work directory, manually fix some makefiles to
point to the Fortran BLAS/LAPACK (via math/lapack), and build the test
program yourself. This is why I included statically linked binaries in
the archive.
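
As a rough cross-check of the tables above, the sweep and the MFLOP
column can be sketched as follows. This is only a sketch: the function
name approx_mflops is made up for illustration, and the operation count
of about 2*N^3 per inversion is an assumption, not something taken from
the ATLAS sources.

```python
# Problem sizes described above: 7000 to 9000 in steps of 200.
sizes = list(range(7000, 9001, 200))

def approx_mflops(n, seconds):
    """Rate in MFLOP/s, ASSUMING about 2*n**3 operations per inversion."""
    return 2.0 * n**3 / seconds / 1e6

# (N, TIME) pairs copied from the 4BSD table above.
timings_4bsd = {7000: 157.028, 8000: 232.126, 9000: 324.456}

print(len(sizes), "cases")
for n, t in timings_4bsd.items():
    print(n, round(approx_mflops(n, t), 1))
```

Under that assumption the computed rates agree with the MFLOP column to
well within 0.1%, and the sweep has the same 11 cases that the test
reports.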

o Performance of Opteron and effect of SMP
The theoretical peak in double precision using SSE2 for a 1.6GHz
Opteron is 3.2GFlops, so 6.4GFlops for dual processors. The performance
of the largest test (inversion of a 9000x9000 matrix in double
precision) is about 4.5GFlops, namely 70% of the theoretical peak.
This is very good. In my experience, the best measured performance on a
single processor is ~80%, and such results are usually found only for
much more primitive calculations.
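
The arithmetic behind these figures can be written out as a small
sketch. The value of 2 double-precision operations per cycle is an
assumption inferred from 3.2GFlops / 1.6GHz, and the variable names are
only illustrative.

```python
clock_ghz = 1.6        # Opteron 242 clock, from the hardware list
flops_per_cycle = 2    # ASSUMPTION: implied by 3.2 GFlops at 1.6 GHz
n_cpus = 2             # dual-processor box

peak_single = clock_ghz * flops_per_cycle   # GFlops for one processor
peak_dual = peak_single * n_cpus            # GFlops for both processors

measured = 4.5                              # GFlops, 9000x9000 case
fraction = measured / peak_dual             # fraction of theoretical peak

print(f"single: {peak_single:.1f} GFlops, dual: {peak_dual:.1f} GFlops")
print(f"fraction of peak: {fraction:.0%}")
```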

o ULE vs 4BSD
Please see these rows:
4BSD
    0    Col     GE   9000   9000   324.456   4493.54  2.347010e-02
ule
    0    Col     GE   9000   9000   320.790   4544.91  2.347010e-02
These lines show ~4.5GFlops performance for inverting the matrix: 324
seconds with 4BSD and 320 seconds with ule. This doesn't mean what I
said was wrong; the ~320 seconds is the total over both processors.
In the best case that is ~160 seconds on one processor and ~160 seconds
on the other, which ATLAS reports as ~320 seconds in total. We
definitely need at least 320 seconds of processor time to invert the
matrix, and how much time actually elapsed is not measured in this
context. With ULE, for example, ~240 seconds may pass on one processor
and ~80 seconds on the other, so we *must* wait for 240 seconds, while
with 4BSD we only wait for 160 seconds.
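
The argument above can be sketched with the example numbers from the
text. The 160/160 and 240/80 splits are illustrative values, not
measurements, and the function name summarize is made up for this
sketch.

```python
def summarize(per_cpu_seconds):
    """Return (summed processor time, time we actually wait)."""
    total_cpu = sum(per_cpu_seconds)   # what a per-thread CPU timer adds up
    wall = max(per_cpu_seconds)        # we wait for the slowest processor
    return total_cpu, wall

balanced = summarize([160, 160])       # even split, as described for 4BSD
unbalanced = summarize([240, 80])      # uneven split, as described for ULE

print("balanced:  ", balanced)
print("unbalanced:", unbalanced)
print("wait ratio:", unbalanced[1] / balanced[1])
```

Both splits add up to the same 320 seconds of processor time, but the
uneven one makes us wait 1.5 times longer.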

We can see the actual difference between ULE and 4BSD from
/usr/bin/time:
4BSD
     4707.78 real      9019.12 user        38.34 sys
ULE
     6964.19 real      8720.26 user        34.31 sys
and 6964.19/4707.78=1.479.
User time is ~9000 seconds with 4BSD and ~8700 seconds with ULE,
while real time is ~4700 seconds with 4BSD and ~7000 seconds with ULE.
So the time consumed by actual work is about the same in both cases
(~9000s and ~8700s), but scheduling is not efficient for this
calculation, and so ULE needs more time.
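
The same comparison can be written as a short sketch using only the
/usr/bin/time figures above. The "processors kept busy" measure,
(user + sys) / real, is my own illustrative metric and the names are
hypothetical.

```python
# (real, user, sys) in seconds, copied from the /usr/bin/time output.
runs = {
    "4BSD": (4707.78, 9019.12, 38.34),
    "ULE": (6964.19, 8720.26, 34.31),
}

def cpus_busy(real, user, sys):
    """Average number of processors kept busy during the run."""
    return (user + sys) / real

slowdown = runs["ULE"][0] / runs["4BSD"][0]   # ratio of real times
busy = {name: cpus_busy(*t) for name, t in runs.items()}

print(f"ULE / 4BSD real time: {slowdown:.3f}")
for name, value in busy.items():
    print(f"{name}: {value:.2f} processors busy on average")
```

With these numbers 4BSD keeps about 1.9 of the 2 processors busy on
average, while ULE keeps only about 1.26 busy.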

o Scheduling threads / processes?
Scheduling threads and scheduling processes can behave differently, but
other experiments show that if we run the same program as multiple
processes at the same time, ULE is also ~1.5 times slower than 4BSD.

o Conclusion
4BSD is near optimal for large calculations and ULE is ~1.5 times
slower than 4BSD. Both scheduling algorithms are SMP aware. I don't
think ULE is a good choice as the default.

All the best,
-- NAKATA, Maho (maho at FreeBSD.org)