KSE and SMP problem in FreeBSD/amd64 5.3-BETA3, namely KSE doesn't
make use of SMP.
Julian Elischer
julian at elischer.org
Sat Sep 11 23:13:21 PDT 2004
Firstly, I am very happy to see your mail.
We need all bug reports, even bad ones :-)
I have been working on fixing problems of this sort for 5.3 over
the last few weeks, and will be able to examine your work
more closely in a few days. I just want you to know that your email will be
worked on, even if you do not hear anything immediately.
More notes below.
NAKATA Maho wrote:
> Dear amd64 freaks, I noticed that there seems to be a bug
> in KSE with SMP configuration.
>
> Here, I describe my problem in detail.
>
> The math/atlas port utilizes SMP by threading. Namely,
> if you have 2 processors you can gain nearly double the performance,
> so KSE is the key technology for SMP. However, on amd64, KSE doesn't
> utilize the second CPU at all.
>
> My machine is:
> Tyan S2885
> Opteron 1.6GHz x 2
> 2G bytes of memory
>
> I confirmed that:
> o FreeBSD/amd64 5.2.1-RELEASE with KSE doesn't work at all
> (it dumps core or gets a memory fault), while without KSE it works well but
> without a performance gain (using libmap.conf; this is not shown here).
This is expected.
>
> o FreeBSD/amd64 5.3-BETA3 with KSE works at least, however, it
> doesn't utilize SMP.
I will try to examine this together with Peter and Dan over the next few days.
Please show me the output of sysctl kern.threads and kern.sched on 5.3.
Also, I hope there will be improvements in BETA4.
Which scheduler are you using?
Please also show the ldd output for your program.
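
The information requested above could be gathered in one pass with a small script. This is only a sketch: the sysctl names kern.threads and kern.sched and the use of ldd come from the request itself, while the binary path ./xdlutst_pt as a default, the helper names, and the choice of Python are assumptions made for illustration.

```python
import shutil
import subprocess

# Hypothetical default: the threaded ATLAS test binary built in step 6a.
DEFAULT_BINARY = "./xdlutst_pt"


def diagnostic_commands(binary=DEFAULT_BINARY):
    """Return the commands whose output is requested in this message."""
    return [
        ["sysctl", "kern.threads"],  # thread-related sysctl tree named above
        ["sysctl", "kern.sched"],    # scheduler sysctl tree named above
        ["ldd", binary],             # shared libraries the program links to
    ]


def collect(binary=DEFAULT_BINARY):
    """Run each command and return a list of (command line, output) pairs."""
    results = []
    for cmd in diagnostic_commands(binary):
        if shutil.which(cmd[0]) is None:
            # Tool not installed on this system; record that and move on.
            results.append((" ".join(cmd), "<%s not found>" % cmd[0]))
            continue
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True)
        results.append((" ".join(cmd), proc.stdout))
    return results
```

Calling collect() and printing each pair would give one block of text that can be pasted into a reply.
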
> o FreeBSD/i386 5.2.1-RELEASE and 5.3-BETA3 work well.
>
> How to repeat:
> (it took many hours to build math/atlas, so I put the work dir at)
at?
>
> CVSup your ports tree, please use:
> # $FreeBSD: ports/math/atlas/Makefile,v 1.27 2004/09/02 00:25:45 maho Exp $
>
> 0a. prepare opteron SMP machine, and install FreeBSD/amd64 5.3-BETA3.
> 1a. cd /usr/ports/math/atlas
> 2a. make
> 3a. wait for long time
> 4a. cd /usr/ports/math/atlas/work/ATLAS/bin/THREADED
> 5a. make xdlutst (it took only seconds)
> 6a. make xdlutst_pt (it took only seconds)
> 7a. type ./xdlutst -N 1000 2000 200 (this doesn't utilize SMP and KSE)
> NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP      RESID
> =====  =====  =====  =====  =====  =====  ========  ========   ========
>     0    Col   1000   1000   1000    995     0.301  2210.755  3.821e-02
>     0    Col   1200   1200   1200   1194     0.504  2282.569  3.793e-02
>     0    Col   1400   1400   1400   1395     0.794  2303.707  2.843e-02
>     0    Col   1600   1600   1600   1595     1.156  2360.557  2.893e-02
>     0    Col   1800   1800   1800   1793     1.637  2374.130  2.803e-02
>     0    Col   2000   2000   2000   1990     2.192  2431.838  2.744e-02
>
> 6 cases ran, 6 cases passed
>
>
> 8a. type ./xdlutst_pt -N 2000 3000 200
> ./xdlutst_pt -N 2000 3000 200
> NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP      RESID
> =====  =====  =====  =====  =====  =====  ========  ========   ========
>     0    Col   2000   2000   2000   1990     2.286  2332.527  2.744e-02
>     0    Col   2200   2200   2200   2194     2.764  2567.795  2.639e-02
>     0    Col   2400   2400   2400   2394     3.766  2446.449  2.721e-02
>     0    Col   2600   2600   2600   2593     4.722  2480.761  2.472e-02
>     0    Col   2800   2800   2800   2795     5.855  2499.038  2.441e-02
>     0    Col   3000   3000   3000   2992     7.302  2464.553  2.442e-02
>
> 6 cases ran, 6 cases passed
>
> Please see the MFLOP column. This indicates the FLOPS of the calculation.
> The Opteron 1.6GHz's performance is about 2.4 GFLOPS for LU decomposition,
> and as you can see there is no performance gain :(
>
> typical output of top is like that:
>
>   PID USERNAME PRI NICE  SIZE   RES STATE  C   TIME   WCPU    CPU COMMAND
>   716 root     134    0  185M  179M CPU0   0   1:05 21.09% 21.09% xdlutst_pt
>   716 root     134    0  185M  179M RUN    0   1:05 19.53% 19.53% xdlutst_pt
>   716 root      20    0  185M  179M kserel 1   1:05  0.00%  0.00% xdlutst_pt
>   716 root      20    0  185M  179M ksesig 1   1:05  0.00%  0.00% xdlutst_pt
>   716 root      20    0  185M  179M kserel 0   1:05  0.00%  0.00% xdlutst_pt
>
> two threads of xdlutst_pt are always running on *ONLY CPU0 or CPU1*
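
The observation above can be checked mechanically from the top output. The following is a minimal sketch, assuming the twelve-column layout shown in the quoted top output (STATE in the seventh column, C in the eighth); the function name is hypothetical and the rows are copied from the amd64 run.

```python
# Rows copied from the amd64 top output quoted above.
TOP_ROWS = """\
716 root 134 0 185M 179M CPU0 0 1:05 21.09% 21.09% xdlutst_pt
716 root 134 0 185M 179M RUN 0 1:05 19.53% 19.53% xdlutst_pt
716 root 20 0 185M 179M kserel 1 1:05 0.00% 0.00% xdlutst_pt
716 root 20 0 185M 179M ksesig 1 1:05 0.00% 0.00% xdlutst_pt
716 root 20 0 185M 179M kserel 0 1:05 0.00% 0.00% xdlutst_pt
"""


def cpus_of_runnable_threads(top_text):
    """Return the set of CPU numbers (column C) for threads that are
    on a CPU or waiting to run, i.e. whose STATE is CPUn or RUN."""
    cpus = set()
    for line in top_text.splitlines():
        fields = line.split()
        if len(fields) < 12:
            continue  # not a thread row
        state, cpu = fields[6], fields[7]
        if state == "RUN" or state.startswith("CPU"):
            cpus.add(int(cpu))
    return cpus


print(sorted(cpus_of_runnable_threads(TOP_ROWS)))  # prints [0]
```

A result containing a single CPU number means both compute threads are competing for one processor, which matches the missing speedup.
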
> --------------------------------------------------------------------
> Next, I have tried i386 version
>
> 0i. prepare opteron SMP machine same as above, and install FreeBSD/i386
> 5.3-BETA3.
> CVSup your ports tree.
>
> 1i. cd /usr/ports/math/atlas
> 2i. make
> 3i. wait for long time
> 4i. cd /usr/ports/math/atlas/work/ATLAS/bin/THREADED
> 5i. make xdlutst (it took only seconds)
> 6i. make xdlutst_pt (it took only seconds)
> 7i. type ./xdlutst -N 1000 2000 200 (this doesn't utilize SMP and KSE)
> ./xdlutst -N 1000 2000 200
> NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP      RESID
> =====  =====  =====  =====  =====  =====  ========  ========   ========
>     0    Col   1000   1000   1000    995     0.307  2170.617  3.437e-02
>     0    Col   1200   1200   1200   1194     0.522  2204.335  3.482e-02
>     0    Col   1400   1400   1400   1395     0.799  2286.888  4.150e-02
>     0    Col   1600   1600   1600   1595     1.164  2345.104  3.598e-02
>     0    Col   1800   1800   1800   1793     1.616  2405.542  3.601e-02
>     0    Col   2000   2000   2000   1990     2.218  2403.157  3.436e-02
>
> 6 cases ran, 6 cases passed
>
> 8i. type ./xdlutst_pt -N 3000 4000 200 (this utilizes KSE so that it makes
> full use of SMP)
> ./xdlutst_pt -N 3000 4000 200
> NREPS  Major      M      N    lda  NPVTS      TIME     MFLOP      RESID
> =====  =====  =====  =====  =====  =====  ========  ========   ========
>     0    Col   3000   3000   3000   2992     7.157  2514.351  3.650e-02
>     0    Col   3200   3200   3200   3186     5.127  4259.986  3.207e-02
>     0    Col   3400   3400   3400   3392     5.867  4465.006  3.528e-02
>     0    Col   3600   3600   3600   3589     6.791  4579.468  3.519e-02
>     0    Col   3800   3800   3800   3791     8.510  4297.730  3.285e-02
>     0    Col   4000   4000   4000   3995     9.207  4633.234  3.218e-02
>
> 6 cases ran, 6 cases passed
>
> Yes, there is a performance gain from utilizing SMP.
>
> typical output of top seems like
>
>   PID USERNAME PRI NICE  SIZE   RES STATE  C   TIME   WCPU    CPU COMMAND
>   714 root     139    0  301M  300M CPU1   1   2:16 66.41% 66.41% xdlutst_pt
>   714 root     139    0  301M  300M RUN    0   2:16 66.41% 66.41% xdlutst_pt
>   714 root      20    0  301M  300M kserel 1   2:16  0.00%  0.00% xdlutst_pt
>   714 root      20    0  301M  300M kserel 0   2:16  0.00%  0.00% xdlutst_pt
>   714 root      20    0  301M  300M ksesig 0   2:16  0.00%  0.00% xdlutst_pt
>
> Summary:
> The differences between 8a and 8i are:
> o there is no performance gain in 8a, whereas 8i gains nearly double.
> o the result of top indicates that with KSE on amd64 two threads are created
> correctly, however scheduling is somewhat odd, so that the two threads run
> on the same processor, although the threads appear to be spread over
> different processors.
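
The "nearly double" claim can be put into numbers with a short calculation. This is a rough sketch only: it compares the best MFLOP value of each run as copied from the four benchmark tables in this message, and the single-threaded and threaded runs use different problem sizes, so the ratios are indicative rather than exact. The dictionary layout and function name are assumptions for illustration.

```python
# Peak MFLOP values copied from the four benchmark tables in this message.
PEAK_MFLOP = {
    "amd64": {"single": 2431.838, "threaded": 2567.795},
    "i386": {"single": 2405.542, "threaded": 4633.234},
}


def speedup(single, threaded):
    """Ratio of threaded to single-threaded throughput."""
    return threaded / single


for arch in sorted(PEAK_MFLOP):
    m = PEAK_MFLOP[arch]
    print("%s: %.2fx" % (arch, speedup(m["single"], m["threaded"])))
```

With these inputs the amd64 ratio comes out at about 1.06 and the i386 ratio at about 1.93.
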
>
> You can try easily, work directory of these two ports are available:
> http://people.freebsd.org/~maho/atlas/atlas-work-opteron_dual-amd64.tar.bz
> http://people.freebsd.org/~maho/atlas/atlas-work-opteron_dual-i386.tar.bz
>
> MD5 (atlas-work-opteron_dual-amd64.tar.bz) = 9d9d7e8b00b34a783b7d2172bc404e23
> MD5 (atlas-work-opteron_dual-i386.tar.bz) = 8076a753c7b3edaea7bd446c6473f120
>
> Can anybody fix it?
Yes, we will try.
>
> Best regards,
> --nakata maho
>
More information about the freebsd-amd64
mailing list