Implementation of half-cycle trigonometric functions
Steve Kargl
sgk at troutmask.apl.washington.edu
Wed Apr 26 03:23:58 UTC 2017
On Tue, Apr 25, 2017 at 06:27:52PM +1000, Bruce Evans wrote:
> On Mon, 24 Apr 2017, Steve Kargl wrote:
>
> > On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
>
> >> [for expl]
> >> This is arranged to look a bit like Horner's method at first (operations
> >> not written in separate statements), but it is basically Estrin's method
> >> with no extensions to more complicated sub-expressions than Estrin's,
> >> but complications for accuracy:
> >> - I think accuracy for the B3 term is unimportant, but for some reason
> >> (perhaps just the one in the comment) it is grouped separately. Estrin
> >> might use x3 * (B3.e + x * B4.e).
> >
> > Yes, it is most likely an accuracy issue. I found something
> > similar with sinpi/cospi. My k_sinpi.c has
> >
> > #if 0
> > double x2;
> > /* Horner's method. */
> > x2 = x * x;
> > sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
> > + s2) + s1;
>
> Horner's of course has about as many dependencies as operations (13), so
> on modern x86 this is limited by the latency of 3-5 cycles/instructions
> or 39-65 cycles. Multiplications take an extra cycle in more cases, so
> it is better to have more additions, but unfortunately Horner gives 1
> more multiplication.
>
> With no dependencies, the limit would be throughput -- 2 instructions/cycle
> of 6.5 cycles rounded up to 7; then add latency for the last operation to
> get about 11.
>
> With practical dependencies, it is hard to get more than a 2-fold speedup.
>
> > #else
> > double x2, x4;
> > /* Sort of Estrin's method. */
> > x2 = x * x;
> > x4 = x2 * x2;
> > sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
> > + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
> > #endif
>
> This looks OK. The only obvious improvement is to try grouping x4*x4
> so that it can be evaluated separately. This reduces the length of
> dependency chains by 1, but increases early pressure on multiplication.
> Don't write this as x8 = x4*x4. Compilers will do the equivalent.
> You can also replace x4 by (x2*x2) but there are many x4's so this
> might be less readable.
The grouping of (x4 * x4) and (x2 * x4) in other polynomials indeed
gives small improvements. I've reworked the approximations in
sinpi[l], cospi[l], and tanpi[l]. I did not see any improvement with
the float versions, as the order of the polynomials is fairly small
(order 4).
> > The addition of s1 must occur last, or I get greater than 1 ULP
> > for a number of values.
>
> Indeed. This prevents using the best order, which is the opposite.
> You used the best order for the other inner terms, but not so obviously
> for the first outer addition. However, the compiler is free to reorder
> that: consider:
>
> T2 + T1 + T0
>
> T2 is more complicated and everything depends on it, so should be started
> first, but possibly T1 can be started first because it is simpler. Here
> it is not significantly simpler. However, for:
>
> T3 + T2 + T1 + T0
>
> the natural left-to-right grouping is usually pessimal. This should be
> written as:
>
> T3 + (T2 + T1) + T0
>
> Unfortunately, we can't write this naturally and usually optimally as:
>
> T0 + T1 + T2 + T3
>
> due to the accuracy problem with T0. Maybe write this as:
>
> T1 + T2 + T3 + T0 // only T0 in unnatural order
>
> For expl, I tried all orders.
I tried a few different orders, particularly with the long double
functions. I also managed to break the code a few times. Thankfully,
I have everything in a Subversion repository.
>
> > My timing show on Intel core2 duo
> >
> > /* Horner's methods. */
> > laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
> > sinpi time: 0.066 usec per call
> > cospi time: 0.063 usec per call
> >
> > /* Estrin's methods. */
> > laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
> > sinpi time: 0.059 usec per call
> > cospi time: 0.055 usec per call
> >
> > Here, -n 22 means to test 2**22 - 1 values in the interval
> > [0,0.25].
>
> The improvement is marginal. Only worth doing if it doesn't complicate
> the code much. Older CPUs can't exploit the parallelism so well, so I
> would expect a relatively larger improvement. Don't measure in usec,
> since they are hard to scale. 0.66 usec and 1.86 GHz is more than 100
> cycles. This is quite slow. I get similar times for cos and sin on an
> older Athlon-XP (~40 in float prec, ~160 in double and ~270 in long
> double). Newer CPUs are about 4 times faster in double prec (only 2
> times faster in float prec) in cycles despite having similar nominal
> timing, by unclogging their pipes.
I'm assuming your 0.66 in the above is a typo, as I wrote 0.066. I also
got some improvement by changing how a few values are split into
high and low parts, using a simple cast. Instead of doing
    EXTRACT_WORDS(hx, lx, x);
    INSERT_WORDS(xhi, hx, 0);
    xlo = x - xhi;
I do
    xhi = (float)x;
    xlo = x - xhi;
but this seems to only help in some situations. My new timing for
x in [0,0.25] on my 1995.05 MHz Core2 duo gives
Horner
sinpi time: 0.073 usec per call (~146 cycles)
cospi time: 0.070 usec per call (~140)
tanpi time: 0.112 usec per call (~224)
Estrin
sinpi time: 0.064 usec per call (~128)
cospi time: 0.060 usec per call (~120)
tanpi time: 0.088 usec per call (~176)
tanpi benefits the most as it has an order 16 polynomial, which
allows some freedom in grouping terms.
I don't think I can do much better. For example, the approximation
for sinpi(x) is

sinpi(x) = x * (s0 + x2 * S(x2))

where S(x2) is the polynomial, x2 must be split into high and low
parts (via a split of x), and s0 is split as well. The kernel with
the Estrin method and the polynomial coefficients is
static inline double
__kernel_sinpi(double x)
{
    double hi, lo, sm, xhi, xlo;
    uint32_t hx, lx;
    double x2, x4;

    /* Sort of Estrin's method. */
    x2 = x * x;
    x4 = x2 * x2;
    sm = (s5 + s6 * x2 + s7 * x4) * (x4 * x4)
        + (s2 + s3 * x2 + s4 * x4) * x2 + s1;

    EXTRACT_WORDS(hx, lx, x);
    INSERT_WORDS(xhi, hx, 0);
    xlo = x - xhi;
    lo = xlo * (x + xhi) * sm;
    hi = xhi * xhi * sm;
    lo = x * lo + xlo * (s0hi + s0lo) + x * hi;

    return (lo + xhi * s0lo + xhi * s0hi);
}
The order of terms in the last 5 lines matters. Testing yields
a = 2.4220192246482908e-01, /* 0x3fcf0078, 0xfc01e3f0 */
sinpi(a) = 6.8957335930938313e-01, /* 0x3fe610fc, 0x264da72e */
sin(pi*a) = 6.8957335930938324e-01, /* 0x3fe610fc, 0x264da72f */
ULP: 0.79950446
MAX ULP: 0.79950446
Total tested: 8388606
1.0 < ULP : 0
0.9 < ULP <= 1.0: 0
0.8 < ULP <= 0.9: 0
0.7 < ULP <= 0.8: 554
0.6 < ULP <= 0.7: 11275
> We should also try to use the expression A*x + B more often so that
> things will be faster when this is done by fma, but this happens
> naturally anyway.
Using A*x+B or B+A*x did not seem to matter.
--
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow
More information about the freebsd-numerics mailing list