1.5 times slower performance with SCHED_ULE than SCHED_4BSD
NAKATA Maho
chat95 at mac.com
Tue Aug 23 07:36:08 GMT 2005
Hello list,
I noticed last year that SCHED_ULE is slower than SCHED_4BSD, and
raised a PR. At that time the result was not convincing because
5.3-RELEASE/amd64 was not stable enough with a large amount of memory,
etc. My recent 5.4-RELEASE/amd64 system is extremely stable even with a
large amount of memory and under high load, so I can now say something
definite. If you are interested, please test it.
I have also prepared statically linked binaries so that my test is easy
to reproduce, because it uses ATLAS (math/atlas-devel), which is a real
pain to build.
My conclusion in short:
SCHED_ULE is 1.5 times slower than SCHED_4BSD on FreeBSD 5.4-RELEASE/amd64.
Both SCHED_4BSD and SCHED_ULE are definitely SMP aware, but SCHED_ULE
schedules very large jobs inefficiently, whereas 4BSD is almost optimal.
My Opteron box:
o Tyan S2885 Tiger K8W
o Opteron 242x2 (1.6GHzx2)
o Transcend 2Gx8 (total 16G)
How to repeat:
fetch http://people.freebsd.org/~maho/scheduler_amd64.tar.gz
tar xvfz scheduler_amd64.tar.gz
cd scheduler_amd64/
sysctl kern.sched.name ; /usr/bin/time ./xdinvtst_pt -N 7000 9000 200
My results:
o 4BSD
sysctl kern.sched.name ; /usr/bin/time ./xdinvtst_pt -N 7000 9000 200
kern.sched.name: 4BSD
NREPS  ORDER   UPLO      N    LDA      TIME     MFLOP         RESID
=====  =====  =====  =====  =====  ========  ========  ============
    0    Col     GE   7000   7000   157.028   4368.50  2.110725e-02
    0    Col     GE   7200   7200   168.527   4429.39  2.106386e-02
    0    Col     GE   7400   7400   185.014   4380.32  2.099199e-02
    0    Col     GE   7600   7600   198.622   4420.07  2.073756e-02
    0    Col     GE   7800   7800   214.284   4429.04  2.089531e-02
    0    Col     GE   8000   8000   232.126   4411.27  2.142018e-02
    0    Col     GE   8200   8200   255.809   4310.65  2.041516e-02
    0    Col     GE   8400   8400   265.088   4471.62  2.092699e-02
    0    Col     GE   8600   8600   285.403   4457.11  2.119786e-02
    0    Col     GE   8800   8800   306.969   4439.88  2.257722e-02
    0    Col     GE   9000   9000   324.456   4493.54  2.347010e-02
11 cases: 11 passed, 0 skipped, 0 failed
4707.78 real 9019.12 user 38.34 sys
o ULE
sysctl kern.sched.name ; /usr/bin/time ./xdinvtst_pt -N 7000 9000 200
kern.sched.name: ule
NREPS  ORDER   UPLO      N    LDA      TIME     MFLOP         RESID
=====  =====  =====  =====  =====  ========  ========  ============
    0    Col     GE   7000   7000   284.579   2410.49  2.110725e-02
    0    Col     GE   7200   7200   176.769   4222.87  2.106386e-02
    0    Col     GE   7400   7400   183.035   4427.67  2.099199e-02
    0    Col     GE   7600   7600   195.830   4483.10  2.073756e-02
    0    Col     GE   7800   7800   228.077   4161.20  2.089531e-02
    0    Col     GE   8000   8000   267.382   3829.61  2.142018e-02
    0    Col     GE   8200   8200   247.578   4453.95  2.041516e-02
    0    Col     GE   8400   8400   261.590   4531.42  2.092699e-02
    0    Col     GE   8600   8600   308.443   4124.18  2.119786e-02
    0    Col     GE   8800   8800   331.672   4109.20  2.257722e-02
    0    Col     GE   9000   9000   320.790   4544.91  2.347010e-02
11 cases: 11 passed, 0 skipped, 0 failed
6964.19 real 8720.26 user 34.31 sys
o What is my test doing?
What is xdinvtst_pt? This program computes the inverse of randomly
generated double-precision matrices. _pt means pthread: it creates two
threads at a time to compute the inverse of the matrix. I ran it on
matrices from 7000x7000 to 9000x9000, increasing the number of rows and
columns by 200 at each step.
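As a rough cross-check of the tables, the sketch below lists the matrix orders used by the run and estimates MFLOPS from the TIME column. The 2*N^3 operation count for an LU-based inverse is an assumption made here for illustration; the ATLAS output does not state which count it uses, and `inversion_mflops` is a hypothetical helper name.

```python
# Matrix orders used by the run: 7000 to 9000 in steps of 200.
sizes = list(range(7000, 9000 + 1, 200))

def inversion_mflops(n, seconds):
    """Estimate MFLOPS assuming about 2*n**3 floating-point operations
    for an LU-based inverse (an assumption, not taken from the ATLAS output)."""
    return 2.0 * n**3 / seconds / 1e6

print(len(sizes))  # 11, matching "11 cases" in the output

# TIME values copied from the result tables (4BSD, N=9000 and ULE, N=7000).
print(round(inversion_mflops(9000, 324.456), 1))
print(round(inversion_mflops(7000, 284.579), 1))
```

Under this assumption the estimates land within a fraction of a percent of the reported MFLOP values (4493.54 and 2410.49), which suggests the MFLOP column is simply the operation count divided by TIME.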
How to make xdinvtst_pt? Build math/atlas-devel on an SMP machine; the
port detects the number of processors installed. The build takes a very
long time: about 1.5 days, and you have to type make many times (10-20
times!) because the port is fragile. Then go down into the work
directory, manually fix some makefiles to point to the Fortran
BLAS/LAPACK (via math/lapack), and build the test yourself. This is why
I prepared statically linked binaries and included them in the archive.
o Performance of Opteron and effect of SMP
The theoretical double-precision peak using SSE2 for a 1.6GHz Opteron
is 3.2GFlops, so 6.4GFlops for dual processors. The largest test
(inversion of a 9000x9000 matrix in double precision) runs at about
4.5GFlops, namely 70% of the theoretical peak. This is very good. In my
experience, the best measured performance on a single processor is ~80%
of peak, and that is reached only in much more primitive calculations.
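The peak and efficiency figures can be reproduced with a few lines of arithmetic. This is only a sketch: the value of 2 double-precision flops per cycle is inferred here from the 3.2GFlops figure at 1.6GHz, and the variable names are illustrative.

```python
clock_ghz = 1.6        # Opteron 242 clock, from the hardware list
flops_per_cycle = 2    # assumed: inferred from 3.2 GFlops at 1.6 GHz
cpus = 2

peak_per_cpu = clock_ghz * flops_per_cycle   # GFlops per processor
peak_total = peak_per_cpu * cpus             # GFlops for the dual box

measured = 4.5   # approximate GFlops for the 9000x9000 inversion
efficiency = measured / peak_total

print(f"{peak_per_cpu:.1f} {peak_total:.1f} {efficiency:.0%}")  # 3.2 6.4 70%
```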
o ULE vs 4BSD
Please look at these rows:
4BSD
0 Col GE 9000 9000 324.456 4493.54 2.347010e-02
ule
0 Col GE 9000 9000 320.790 4544.91 2.347010e-02
These lines show 4.5GFlops performance for inverting the matrix: 324
seconds with 4BSD and 320 seconds with ULE. This does not contradict
what I said. The ~320 seconds is the time summed over both processors.
In the best case, ~160 seconds pass on one processor and ~160 seconds on
the other, and ATLAS reports ~320 seconds in total. We definitely need
at least 320 seconds of CPU time to invert the matrix; how much actual
(elapsed) time has passed is not measured in this context. With ULE, for
example, ~240 seconds may pass on one processor and ~80 seconds on the
other, so we *must* wait for 240 seconds, while with 4BSD we only wait
for 160 seconds.
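The argument above can be sketched as a tiny model in which the elapsed time of a two-thread job is set by the busier processor. The 160/160 and 240/80 splits are the illustrative numbers from the text, not measurements, and `wall_time` is a hypothetical helper.

```python
def wall_time(per_cpu_seconds):
    """Elapsed time of a parallel job is set by the busiest processor."""
    return max(per_cpu_seconds)

balanced = [160, 160]    # best case, as described for 4BSD
unbalanced = [240, 80]   # example split given for ULE

for name, split in (("balanced", balanced), ("unbalanced", unbalanced)):
    # Total CPU time is the same (320 s); only the elapsed time differs.
    print(name, "cpu total:", sum(split), "wall:", wall_time(split))

print(wall_time(unbalanced) / wall_time(balanced))  # 1.5
```

Both splits add up to the same 320 CPU seconds, so a timer that reports summed CPU time cannot distinguish them, while the elapsed time differs by a factor of 1.5 in this example.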
The actual difference between ULE and 4BSD shows up in the output of
/usr/bin/time:
4BSD
4707.78 real 9019.12 user 38.34 sys
ULE
6964.19 real 8720.26 user 34.31 sys
and 6964.19/4707.78=1.479.
User time is ~9000 seconds with 4BSD and ~8700 seconds with ULE, while
real time is ~4700 seconds with 4BSD and ~7000 seconds with ULE.
So the time consumed by actual work is about the same in both cases
(~9000s and ~8700s), but ULE's scheduling is not efficient for this
calculation, and so ULE needs more elapsed time.
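The same comparison can be computed directly from the two /usr/bin/time summary lines. This is a minimal sketch; `parse_time` and `busy_cpus` are hypothetical helper names, and "busy CPUs" here just means (user + sys) / real.

```python
def parse_time(line):
    """Parse a /usr/bin/time summary like '4707.78 real 9019.12 user 38.34 sys'."""
    fields = line.split()
    # Fields alternate value, label; pair them up as {label: value}.
    return {label: float(value) for value, label in zip(fields[0::2], fields[1::2])}

def busy_cpus(t):
    """Average number of processors kept busy: CPU time / elapsed time."""
    return (t["user"] + t["sys"]) / t["real"]

bsd = parse_time("4707.78 real 9019.12 user 38.34 sys")
ule = parse_time("6964.19 real 8720.26 user 34.31 sys")

print(f"slowdown of ULE vs 4BSD: {ule['real'] / bsd['real']:.3f}")
print(f"4BSD busy CPUs: {busy_cpus(bsd):.2f}")
print(f"ULE busy CPUs:  {busy_cpus(ule):.2f}")
```

With these numbers 4BSD keeps close to two processors busy on average, while ULE keeps noticeably fewer busy, which is consistent with similar CPU time but a longer elapsed time.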
o Scheduling threads / processes?
Scheduling threads and scheduling processes can differ, but other
experiments show that if we run copies of the same process at the same
time, ULE is also ~1.5 times slower than 4BSD.
o conclusion
4BSD is near optimal for large calculations, and ULE is ~1.5 times
slower than 4BSD. Both scheduling algorithms are SMP aware. I don't
think ULE is a good choice as the default.
All the best,
-- NAKATA, Maho (maho at FreeBSD.org)