cvs commit: src/lib/msun/i387 Makefile.inc e_atan2.S e_atan2f.S s_atan.S

Tue Feb 22 13:52:57 GMT 2005

On Tue, Feb 22, 2005, Maxim Sobolev wrote:
> David Schultz wrote:
> >das         2005-02-21 16:04:23 UTC
> >
> >  FreeBSD src repository
> >
> >  Modified files:
> >    lib/msun/i387        Makefile.inc 
> >  Removed files:
> >    lib/msun/i387        e_atan2.S e_atan2f.S s_atan.S 
> >  Log:
> >  Remove the i387 versions of atan(), atan2(), and atan2f().
> >  They are slower than the MI routines on modern hardware,
> >  except for degenerate cases such as the Pentium 4.
> 
> Well, it is probably worth noting that 70-80% of machines running
> FreeBSD today fall into that degenerate case. How much slower is MI vs. MD
> on p4?

Here are the timings for inputs -8, -7, ..., 7, from bde's test
program (see PR 67469).  The results for !Pentium4 are from bde,
although I tested on a Pentium 3 as well and didn't transcribe the
results.  `asmatan' is the assembly routine, and `fdlatan' is the
MI routine:

to.486dx2-66
asmatan: nsec per call:  5518 5522 5527 5530 5474 5473 5674 5440 5433 5703 5625 5628 5554 5545 5554 5557
fdlatan: nsec per call:  8128 8126 8127 8132 7990 8352 8910 7667 7557 8723 8272 7929 7913 7926 7915 7921

to.cel366
asmatan: nsec per call:  444 444 444 444 444 444 444 424 424 444 444 444 444 444 444 444
fdlatan: nsec per call:  370 370 370 370 370 382 397 323 323 397 382 370 370 370 370 370

to.k6-233
asmatan: nsec per call:  827 827 827 827 827 827 857 838 833 853 823 823 823 823 823 823
fdlatan: nsec per call:  771 771 771 771 772 801 834 712 707 826 793 763 763 763 763 763

to.p3-800
asmatan: nsec per call:  209 209 205 209 209 209 209 200 200 209 209 209 209 209 209 209
fdlatan: nsec per call:  175 175 175 176 176 181 179 150 149 178 174 172 171 171 172 172

to.axpb-2223
asmatan: nsec per call:  87 87 87 87 87 87 87 78 78 87 87 87 87 87 87 87
fdlatan: nsec per call:  65 65 65 65 65 66 68 51 51 68 66 65 65 65 65 65

asmatan: nsec per call:  68 68 68 68 68 68 69 69 69 68 68 68 68 68 68 68
fdlatan: nsec per call:  71 66 66 66 66 66 65 51 51 65 66 66 66 66 66 66

The results show that the FPATAN instruction (as with most x87
ops) is pretty slow for anything more modern than a 486.  The
Pentium 4 was an exception in my original tests, but upon fixing a
bug, I found that the software version of atan() is faster than
the FPATAN instruction there, too.  ;-)

The bug was that bde's test was telling the compiler to schedule
instructions for an Athlon.
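
For anyone who wants to produce numbers in the same shape, here is a
rough sketch of a per-input timing loop.  This is not bde's program
from PR 67469; the names nsec_per_call, print_timings, and baseline
are made up for illustration, and the use of clock_gettime() with
CLOCK_MONOTONIC is my assumption about a reasonable timer:

```c
#define _POSIX_C_SOURCE 199309L	/* for clock_gettime() */
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical stand-in routine: does no real work, so timing it
 * estimates the loop and call overhead included in every result.
 */
double
baseline(double x)
{
	return (x);
}

/* Average wall-clock nanoseconds per call of fn(x) over iters calls. */
double
nsec_per_call(double (*fn)(double), double x, long iters)
{
	struct timespec t0, t1;
	volatile double sink = 0.0;	/* keeps the calls from being discarded */
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		sink += fn(x);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	(void)sink;
	return (((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	    (double)(t1.tv_nsec - t0.tv_nsec)) / (double)iters);
}

/*
 * Print one result line for the integer inputs -8, -7, ..., 7 and
 * return the number of inputs timed.
 */
int
print_timings(const char *name, double (*fn)(double), long iters)
{
	int x, n = 0;

	printf("%s: nsec per call: ", name);
	for (x = -8; x <= 7; x++, n++)
		printf(" %.0f", nsec_per_call(fn, (double)x, iters));
	printf("\n");
	return (n);
}
```

Calling print_timings("fdlatan", atan, 1000000) with atan() from
<math.h> (and linking with -lm) gives one line per routine; timing
baseline the same way shows how much of each figure is overhead.  As
the Athlon mixup shows, the compiler flags used to build the routine
under test matter as much as the loop itself.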

Note that Intel has a continuing trend of making the x87 slower in
favor of higher clock speeds and better SSE performance, so in the
future, the x87 transcendental instructions are likely to only get
worse relative to the software functions.

By the way, here are some other results for the Pentium 4, all
without SSE.  SSE makes things a bit worse, probably because the
x87 and SSE registers are shared, and the Pentium 4 imposes a
large penalty for switching between the two sets.

icc:
asmatan: nsec per call:  77 77 77 77 79 77 77 78 78 77 77 77 77 77 77 77
fdlatan: nsec per call:  62 62 62 62 62 63 65 54 55 66 64 62 62 62 62 62

gcc -march=i486:
asmatan: nsec per call:  69 69 69 69 69 69 69 70 70 70 72 69 69 69 69 69
fdlatan: nsec per call:  54 54 54 54 54 56 59 49 48 57 55 52 52 52 52 52

gcc -march=pentium4:
asmatan: nsec per call:  68 68 68 68 68 68 69 69 69 68 68 68 68 68 68 68
fdlatan: nsec per call:  71 66 66 66 66 66 65 51 51 65 66 66 66 66 66 66

gcc -march=athlon-xp:
asmatan: nsec per call:  68 68 68 68 68 68 68 69 69 68 68 68 68 68 68 68
fdlatan: nsec per call:  92 92 93 94 92 95 97 71 71 97 95 92 92 92 92 93

It's funny that gcc generates worse code for a Pentium 4 when told
to schedule instructions for a Pentium 4 than when told to
schedule for a 486, and in the latter case, it beats icc.  I ran
some general purpose tests with gcc 3.0 or 3.1 a while ago, and I
seem to recall that telling gcc that I had a 486 worked best for
my Pentium 3, and telling it I had a Pentium worked best for my
Pentium 4.