FreeBSD 5.2 v/s FreeBSD 4.9 MFLOPS performance
(gcc3.3.3 v/s gcc2.9.5)
Matthew Dillon
dillon at apollo.backplane.com
Mon Feb 16 15:58:49 PST 2004
:
:Thanks Matt for picking up on the linker problem. Patching the kernel
:would, to me, be masking the real problem.
:
:What other "improvements" does gcc333 have over gcc295 that might
:explain why it's linked products run in a half-fast mode (take twice+
:as long)?
:
:JT
I do not see a 50% loss in performance in my tests, but the GCC3 on
DragonFly is a later snapshot (gcc-3.3-20040126). Generally speaking
GCC3 does a better job at -O2 than GCC2 when I optimize for my Athlon 64.
(-O2 and -O3 have the same results on GCC3 in my tests).
These tests were run on an Athlon 64 3200+, on a DragonFly system of course
(which has both gcc2 and gcc3 in the base system):
            GCC2    GCC2    GCC2    GCC3    GCC3    GCC3    GCC3
            -O      -O2     -O2/k6  -O      -O2     -O2     -O2
                                                    athlon  athlon
                                                            stackbndry=5
MFLOPS(1)   1111    1071    1068    794     926     862     861
MFLOPS(2)   832     818     810     789     825     855     857
MFLOPS(3)   1131    1121    1105    1021    1134    1208    1208
MFLOPS(4)   1306    1356    1350    1156    1346    1460    1456
GCC3 only loses in MFLOPS(1).
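To put a number on that, here is a small sketch that compares the best GCC2
result against the best GCC3 result for each row of the Athlon 64 table.
Taking the per-compiler maximum is an assumption made for illustration, and
the names `rows` and `best_case_change` are hypothetical; only the MFLOPS
figures come from the table.

```python
# MFLOPS figures copied from the Athlon 64 table: (GCC2 columns, GCC3 columns).
rows = {
    "MFLOPS(1)": ([1111, 1071, 1068], [794, 926, 862, 861]),
    "MFLOPS(2)": ([832, 818, 810], [789, 825, 855, 857]),
    "MFLOPS(3)": ([1131, 1121, 1105], [1021, 1134, 1208, 1208]),
    "MFLOPS(4)": ([1306, 1356, 1350], [1156, 1346, 1460, 1456]),
}


def best_case_change(gcc2, gcc3):
    """Percent change from the best GCC2 figure to the best GCC3 figure.

    Comparing best against best is an assumption of this sketch, not
    something the measurements themselves prescribe.
    """
    best2, best3 = max(gcc2), max(gcc3)
    return (best3 - best2) / best2 * 100.0


changes = {name: best_case_change(g2, g3) for name, (g2, g3) in rows.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Under that comparison MFLOPS(1) drops by roughly 17% with GCC3 while the
other three rows gain a few percent, which is a long way from a 50% loss.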
When I looked at the assembly generated for MFLOPS(1) between GCC2 and
GCC3 two things stand out:
* GCC2 does a few extra stack-relative memory ops and they are
  spread out more. GCC3 does fewer memory ops and they are
  concentrated at the beginning and the end of the loop code.
* GCC2 uses fld %st(x) to shift the FP stack around, while
  GCC3 uses fxch %st(x) to shift the FP stack around.
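One way to repeat the second comparison on another machine is to generate
assembly with `gcc -S` under each compiler and count the register-form
fld and fxch instructions. The sketch below assumes AT&T-syntax output; the
function name `count_fp_stack_shuffles` and the sample listing are
hypothetical and are not taken from the actual MFLOPS build.

```python
import re
from collections import Counter

# Match only the register forms, e.g. "fld %st(1)" or "fxch %st(2)".
# Memory loads such as "fldl -8(%ebp)" are deliberately not counted.
X87_REG_OP = re.compile(r"^\s*(fld|fxch)\s+%st(\(\d\))?", re.IGNORECASE)


def count_fp_stack_shuffles(asm_text):
    """Count register-form fld and fxch instructions in an assembly listing."""
    counts = Counter(fld=0, fxch=0)
    for line in asm_text.splitlines():
        match = X87_REG_OP.match(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Hypothetical fragment for illustration only, not real compiler output.
sample = """
.L3:
    fldl    -8(%ebp)
    fld     %st(0)
    fmul    %st(2), %st
    fxch    %st(1)
    faddp   %st, %st(3)
    fxch    %st(2)
    fstpl   -16(%ebp)
"""

counts = count_fp_stack_shuffles(sample)
print(counts["fld"], counts["fxch"])
```

Running it over the GCC2 and GCC3 listings for the same source file gives a
rough idea of how heavily each compiler leans on either instruction, though
a raw count says nothing about where in the loop they fall.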
Since we know FP operations are stack-alignment-sensitive I can see
how a stack misalignment can result in terrible performance. What is
less certain is whether (FP aligned) accesses to *different* data-cache
lines affect performance, and that is something that GCC does not seem
to optimize.
My guess, at least with regard to MFLOPS(1), for which GCC3 generates
consistently worse results on my machine, is that FXCH (exchange fp
reg with top of fp stack) is not as well optimized in hardware
as FLD (load to top of FP stack), at least on my Athlon 64.
This also points to the fact that both Intel and AMD have done major
reoptimizations of their floating point instruction set in nearly
every processor release they've ever done. The performance loss you are
seeing on your machine could very well turn into a performance gain on
a different CPU. On a DELL-2550 I get this:
DELL2550 2xPentiumIII @ 1.1GHz

            GCC2    GCC3    GCC3    GCC3
            -O3     -O3     -O3     -O3
-march=     (nil)   (nil)   p3      ppro
MFLOPS(1)   380     290     283     283
MFLOPS(2)   302     293     291     291
MFLOPS(3)   454     459     462     463
MFLOPS(4)   563     581     593     593
My guess is that GCC3 introduced a bit of pessimization when they
started over-using FXCH and that the MFLOPS(1) code just happens to
hit the case due to the huge number of FXCH's it uses. It's probably
stalling the instruction pipeline in a few more places.
-Matt