FreeBSD 5.2 v/s FreeBSD 4.9 MFLOPS performance (gcc3.3.3 v/s gcc2.9.5)

Mon Feb 16 15:58:49 PST 2004

:
:Thanks Matt for picking up on the linker problem.  Patching the kernel
:would, to me, be masking the real problem.
:
:What other "improvements" does gcc333 have over gcc295 that might
:explain why its linked products run in a half-fast mode (take twice+
:as long)?
:
:JT

    I do not see a 50% loss in performance in my tests, but the GCC3 on
    DragonFly is a later snapshot (gcc-3.3-20040126).  Generally speaking
    GCC3 does a better job at -O2 than GCC2 when I optimize for my Athlon64.
    (-O2 and -O3 have the same results on GCC3 in my tests).

    These tests were run on an Athlon 64 3200+, on a DragonFly system of
    course (which has both gcc2 and gcc3 in the base system):

                GCC2    GCC2    GCC2    GCC3    GCC3    GCC3    GCC3
                -O      -O2     -O2/k6  -O      -O2     -O2     -O2
                                                        athlon  athlon
                                                                stackbndry=5

MFLOPS(1)       1111    1071    1068     794     926     862     861
MFLOPS(2)        832     818     810     789     825     855     857
MFLOPS(3)       1131    1121    1105    1021    1134    1208    1208
MFLOPS(4)       1306    1356    1350    1156    1346    1460    1456

    With -O2 and Athlon tuning, GCC3 only loses in MFLOPS(1).
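
    As a rough check on the table, the relative change between two columns
    can be computed directly.  This is only an illustrative sketch: the
    choice of columns (GCC2 -O2 as the baseline, GCC3 -O2 athlon as the
    candidate) and the helper name percent_change are assumptions made for
    the example, and the numbers are copied from the table as-is.

```python
# MFLOPS figures copied from the Athlon 64 table.
# Baseline: GCC2 -O2.  Candidate: GCC3 -O2 with Athlon tuning.
gcc2_o2 = {1: 1071, 2: 818, 3: 1121, 4: 1356}
gcc3_o2_athlon = {1: 862, 2: 855, 3: 1208, 4: 1460}


def percent_change(old, new):
    """Relative change from old to new, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


changes = {k: percent_change(gcc2_o2[k], gcc3_o2_athlon[k]) for k in gcc2_o2}

for k, pct in sorted(changes.items()):
    print(f"MFLOPS({k}): {pct:+.1f}%")
```

    On these figures MFLOPS(1) drops by roughly 20% while the other three
    rows gain between about 4% and 8%, which is well short of a 50% loss.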

    When I looked at the assembly generated for MFLOPS(1) between GCC2 and
    GCC3 two things stand out:

	* GCC2 does a few extra stack-relative memory ops and they are
	  spread out more.  GCC3 does fewer memory ops and they are 
	  concentrated at the beginning and the end of the loop code.

	* GCC2 uses fld %st(x) to shift the FP stack around, while 
 	  GCC3 uses fxch %st(x) to shift the FP stack around.

    Since we know FP operations are stack-alignment-sensitive I can see
    how a stack misalignment can result in terrible performance.  What is
    less certain is whether (FP aligned) accesses to *different* data-cache
    lines affect performance, and that is something that GCC does not seem
    to optimize.
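
    The alignment concern can be illustrated with plain address arithmetic.
    The sketch below makes several assumptions that are not stated in the
    text: that "stackbndry=5" in the table refers to GCC's
    -mpreferred-stack-boundary=5 (a 2^5 = 32 byte boundary), that the data
    cache line is 64 bytes, and that the values are 8-byte doubles.  The
    helper names and sample addresses are hypothetical.

```python
CACHE_LINE = 64  # bytes; assumed data-cache line size


def boundary_bytes(n):
    """Alignment in bytes for a power-of-two boundary exponent n."""
    return 1 << n


def is_aligned(addr, size):
    """True if addr is a multiple of size."""
    return addr % size == 0


def straddles_cache_line(addr, size=8, line=CACHE_LINE):
    """True if the byte range [addr, addr + size) touches two cache lines."""
    return addr // line != (addr + size - 1) // line


print("boundary exponent 5 ->", boundary_bytes(5), "bytes")

# Made-up stack addresses for an 8-byte double.
for addr in (0x1000, 0x1004, 0x103C):
    print(hex(addr),
          "aligned" if is_aligned(addr, 8) else "misaligned",
          "straddles" if straddles_cache_line(addr) else "one line")
```

    A misaligned double does not always cross a cache line (0x1004 stays in
    one line), but an 8-byte-aligned double never can, which is one reason
    a higher stack boundary might help.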

    My guess at least in regards to MFLOPS(1), for which GCC3 generates 
    consistently worse results on my machine, is that FXCH (exchange fp
    reg with top of fp stack) performance is not hardware optimized as well
    as FLD (load to top of FP stack) performance, at least on my Athlon 64.
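
    For readers less familiar with the x87 register stack, the difference
    between the two instructions can be modelled with a list.  This is a
    toy model of the architectural effect only (the function names and the
    sample values are made up), and it says nothing about how fast either
    instruction is on a given cpu.

```python
def fld(stack, i):
    """Model of `fld %st(i)`: push a copy of st(i); depth grows by one."""
    if len(stack) >= 8:
        raise OverflowError("x87 register stack holds at most 8 values")
    return [stack[i]] + stack


def fxch(stack, i):
    """Model of `fxch %st(i)`: swap st(0) with st(i); depth is unchanged."""
    out = list(stack)
    out[0], out[i] = out[i], out[0]
    return out


regs = ["a", "b", "c"]   # index 0 is st(0), the top of the stack

print(fld(regs, 2))      # ['c', 'a', 'b', 'c']
print(fxch(regs, 2))     # ['c', 'b', 'a']
```

    Both bring "c" to the top of the stack, but fld consumes an extra
    register slot while fxch only reorders the existing ones.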

    This also points to the fact that both Intel and AMD have done major
    reoptimizations of their floating point instruction set in nearly
    every processor release they've ever done.  The performance loss you are
    seeing on your machine could very well turn into a performance gain on
    a different cpu.  On a DELL-2550 I get this:

                DELL2550 2xPentiumIII @ 1.1GHz

                GCC2    GCC3    GCC3    GCC3
                -O3     -O3     -O3     -O3
-march=         (nil)   (nil)   p3      ppro

MFLOPS(1)       380     290     283     283
MFLOPS(2)       302     293     291     291
MFLOPS(3)       454     459     462     463
MFLOPS(4)       563     581     593     593

    My guess is that GCC3 introduced a bit of pessimization when they
    started over-using FXCH and that the MFLOPS(1) code just happens to
    hit the case due to the huge number of FXCH's it uses.  It's probably
    stalling the instruction pipeline in a few more places.

						-Matt