KSE and SMP problem in FreeBSD/amd64 5.3-BETA3: KSE doesn't
make use of SMP.
NAKATA Maho
chat95 at mac.com
Sat Sep 11 20:08:07 PDT 2004
Dear amd64 freaks, I noticed that there seems to be a bug
in KSE in an SMP configuration.
Here I describe my problem in detail.
The math/atlas port utilizes SMP through threading: with 2 processors
you can gain nearly double the performance, so KSE is the key
technology for SMP here. On amd64, however, KSE doesn't utilize the
second CPU at all.
My machine is:
Tyan S2885
Opteron 1.6GHz x 2
2G bytes of memory
I confirmed that:
o FreeBSD/amd64 5.2.1-RELEASE with KSE doesn't work at all (it dumps
core or gets a memory fault), while without KSE it works well but
without any performance gain (using libmap.conf; this is not shown here).
o FreeBSD/amd64 5.3-BETA3 with KSE at least works, but doesn't
utilize SMP.
o FreeBSD/i386 5.2.1-RELEASE and 5.3-BETA3 work well.
How to repeat:
(building math/atlas takes many hours, so I have put the work
directories online; see the URLs at the end of this message)
CVSup your ports tree; please use:
# $FreeBSD: ports/math/atlas/Makefile,v 1.27 2004/09/02 00:25:45 maho Exp $
0a. prepare an Opteron SMP machine, and install FreeBSD/amd64 5.3-BETA3.
1a. cd /usr/ports/math/atlas
2a. make
3a. wait for a long time
4a. cd /usr/ports/math/atlas/work/ATLAS/bin/THREADED
5a. make xdlutst (this takes only seconds)
6a. make xdlutst_pt (this takes only seconds)
7a. type ./xdlutst -N 1000 2000 200 (this doesn't utilize SMP or KSE)
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP      RESID
=====  =====  =====  =====  =====  =====  ========  ========   ========
    0    Col   1000   1000   1000    995     0.301  2210.755  3.821e-02
    0    Col   1200   1200   1200   1194     0.504  2282.569  3.793e-02
    0    Col   1400   1400   1400   1395     0.794  2303.707  2.843e-02
    0    Col   1600   1600   1600   1595     1.156  2360.557  2.893e-02
    0    Col   1800   1800   1800   1793     1.637  2374.130  2.803e-02
    0    Col   2000   2000   2000   1990     2.192  2431.838  2.744e-02
6 cases ran, 6 cases passed
8a. type ./xdlutst_pt -N 2000 3000 200
./xdlutst_pt -N 2000 3000 200
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP      RESID
=====  =====  =====  =====  =====  =====  ========  ========   ========
    0    Col   2000   2000   2000   1990     2.286  2332.527  2.744e-02
    0    Col   2200   2200   2200   2194     2.764  2567.795  2.639e-02
    0    Col   2400   2400   2400   2394     3.766  2446.449  2.721e-02
    0    Col   2600   2600   2600   2593     4.722  2480.761  2.472e-02
    0    Col   2800   2800   2800   2795     5.855  2499.038  2.441e-02
    0    Col   3000   3000   3000   2992     7.302  2464.553  2.442e-02
6 cases ran, 6 cases passed
Please see the MFLOP column, which gives the floating-point rate
(MFLOPS) of the calculation.
A single 1.6GHz Opteron reaches about 2.4 GFLOPS for LU decomposition,
and as you can see there is no performance gain with threads :(
Typical output of top looks like this:
  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
  716 root     134    0   185M   179M CPU0   0   1:05 21.09% 21.09% xdlutst_pt
  716 root     134    0   185M   179M RUN    0   1:05 19.53% 19.53% xdlutst_pt
  716 root      20    0   185M   179M kserel 1   1:05  0.00%  0.00% xdlutst_pt
  716 root      20    0   185M   179M ksesig 1   1:05  0.00%  0.00% xdlutst_pt
  716 root      20    0   185M   179M kserel 0   1:05  0.00%  0.00% xdlutst_pt
The two running threads of xdlutst_pt are always both on *ONLY ONE CPU*
(either CPU0 or CPU1).
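The same check can be done mechanically on a top snapshot. The following is only a minimal sketch in Python: the function name cpus_of_busy_threads and the 1% WCPU threshold are my own assumptions for illustration, and it assumes the 12-column layout shown in the top output above. It collects the C column for every thread that is actually consuming CPU.

```python
# Snapshot copied from the amd64 top output quoted in this message.
AMD64_TOP = """\
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
716 root 134 0 185M 179M CPU0 0 1:05 21.09% 21.09% xdlutst_pt
716 root 134 0 185M 179M RUN 0 1:05 19.53% 19.53% xdlutst_pt
716 root 20 0 185M 179M kserel 1 1:05 0.00% 0.00% xdlutst_pt
716 root 20 0 185M 179M ksesig 1 1:05 0.00% 0.00% xdlutst_pt
716 root 20 0 185M 179M kserel 0 1:05 0.00% 0.00% xdlutst_pt
"""


def cpus_of_busy_threads(top_text, min_wcpu=1.0):
    """Return the set of CPU numbers (column C) of threads whose WCPU is
    at least min_wcpu percent. The threshold is an arbitrary assumption
    used to ignore the idle KSE threads."""
    cpus = set()
    for line in top_text.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a 12-column thread row.
        if len(fields) != 12 or not fields[0].isdigit():
            continue
        wcpu = float(fields[9].rstrip("%"))
        if wcpu >= min_wcpu:
            cpus.add(int(fields[7]))
    return cpus


print(sorted(cpus_of_busy_threads(AMD64_TOP)))  # prints [0]
```

A single snapshot proves little by itself; the point is only that both busy threads report the same CPU number here.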
--------------------------------------------------------------------
Next, I tried the i386 version.
0i. prepare the same Opteron SMP machine as above, and install
FreeBSD/i386 5.3-BETA3.
CVSup your ports tree.
1i. cd /usr/ports/math/atlas
2i. make
3i. wait for a long time
4i. cd /usr/ports/math/atlas/work/ATLAS/bin/THREADED
5i. make xdlutst (this takes only seconds)
6i. make xdlutst_pt (this takes only seconds)
7i. type ./xdlutst -N 1000 2000 200 (this doesn't utilize SMP or KSE)
./xdlutst -N 1000 2000 200
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP      RESID
=====  =====  =====  =====  =====  =====  ========  ========   ========
    0    Col   1000   1000   1000    995     0.307  2170.617  3.437e-02
    0    Col   1200   1200   1200   1194     0.522  2204.335  3.482e-02
    0    Col   1400   1400   1400   1395     0.799  2286.888  4.150e-02
    0    Col   1600   1600   1600   1595     1.164  2345.104  3.598e-02
    0    Col   1800   1800   1800   1793     1.616  2405.542  3.601e-02
    0    Col   2000   2000   2000   1990     2.218  2403.157  3.436e-02
6 cases ran, 6 cases passed
8i. type ./xdlutst_pt -N 3000 4000 200 (this utilizes KSE and so makes
full use of SMP)
./xdlutst_pt -N 3000 4000 200
NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP      RESID
=====  =====  =====  =====  =====  =====  ========  ========   ========
    0    Col   3000   3000   3000   2992     7.157  2514.351  3.650e-02
    0    Col   3200   3200   3200   3186     5.127  4259.986  3.207e-02
    0    Col   3400   3400   3400   3392     5.867  4465.006  3.528e-02
    0    Col   3600   3600   3600   3589     6.791  4579.468  3.519e-02
    0    Col   3800   3800   3800   3791     8.510  4297.730  3.285e-02
    0    Col   4000   4000   4000   3995     9.207  4633.234  3.218e-02
6 cases ran, 6 cases passed
Yes, there is a performance gain from utilizing SMP.
Typical output of top looks like this:
  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
  714 root     139    0   301M   300M CPU1   1   2:16 66.41% 66.41% xdlutst_pt
  714 root     139    0   301M   300M RUN    0   2:16 66.41% 66.41% xdlutst_pt
  714 root      20    0   301M   300M kserel 1   2:16  0.00%  0.00% xdlutst_pt
  714 root      20    0   301M   300M kserel 0   2:16  0.00%  0.00% xdlutst_pt
  714 root      20    0   301M   300M ksesig 0   2:16  0.00%  0.00% xdlutst_pt
Summary:
The differences between 8a and 8i are:
o there is no performance gain in 8a, whereas 8i gains nearly double.
o the output of top indicates that with KSE on amd64 the two threads are
created correctly, but the scheduling is somewhat odd: the two running
threads end up on the same processor, even though the threads as a whole
appear to be spread over different processors.
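To put a number on "nearly double", here is a rough sketch in Python that compares the best MFLOP figure of each threaded run with the best figure of the corresponding single-threaded run. The figures are copied from the MFLOP columns in this message; comparing peak against peak is my own simplification, since the threaded and single-threaded runs used different matrix sizes, and the helper name best_speedup is made up for illustration.

```python
# MFLOP columns copied from the four runs in this message.
MFLOPS = {
    "amd64": {
        "xdlutst": [2210.755, 2282.569, 2303.707, 2360.557, 2374.130, 2431.838],
        "xdlutst_pt": [2332.527, 2567.795, 2446.449, 2480.761, 2499.038, 2464.553],
    },
    "i386": {
        "xdlutst": [2170.617, 2204.335, 2286.888, 2345.104, 2405.542, 2403.157],
        "xdlutst_pt": [2514.351, 4259.986, 4465.006, 4579.468, 4297.730, 4633.234],
    },
}


def best_speedup(serial, threaded):
    """Ratio of the best threaded rate to the best single-threaded rate.
    This is a peak-to-peak comparison, not a same-size comparison."""
    return max(threaded) / max(serial)


for arch, runs in MFLOPS.items():
    ratio = best_speedup(runs["xdlutst"], runs["xdlutst_pt"])
    print(f"{arch}: {ratio:.2f}x")
# prints:
# amd64: 1.06x
# i386: 1.93x
```

With two CPUs the ideal ratio would be 2.0, so i386 is close to it while amd64 stays at roughly single-CPU speed.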
You can try this easily; the work directories of these two builds are available:
http://people.freebsd.org/~maho/atlas/atlas-work-opteron_dual-amd64.tar.bz
http://people.freebsd.org/~maho/atlas/atlas-work-opteron_dual-i386.tar.bz
MD5 (atlas-work-opteron_dual-amd64.tar.bz) = 9d9d7e8b00b34a783b7d2172bc404e23
MD5 (atlas-work-opteron_dual-i386.tar.bz) = 8076a753c7b3edaea7bd446c6473f120
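If you download the tarballs, the checksums above can be verified with a short script. This is just a sketch in Python using the standard hashlib module; the function names md5_of_file and check_downloads are my own, and it assumes the files were saved under their original names in the current directory.

```python
import hashlib
import os

# Checksums copied from the MD5 lines in this message.
EXPECTED_MD5 = {
    "atlas-work-opteron_dual-amd64.tar.bz": "9d9d7e8b00b34a783b7d2172bc404e23",
    "atlas-work-opteron_dual-i386.tar.bz": "8076a753c7b3edaea7bd446c6473f120",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks so that
    a large tarball does not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory="."):
    """Map each expected file name to 'ok', 'MISMATCH' or 'missing'."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "MISMATCH"
    return results


for name, status in check_downloads().items():
    print(f"{name}: {status}")
```

On FreeBSD itself the md5 command prints lines in the same "MD5 (file) = digest" form as above, so those can also be compared by eye.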
Can anybody fix it?
Best regards,
--nakata maho