Implementation of half-cycle trignometric functions
Steve Kargl
sgk at troutmask.apl.washington.edu
Tue Apr 25 06:36:45 UTC 2017
On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
> On Mon, 24 Apr 2017, Steve Kargl wrote:
>
> > I have what appears to be working versions of asinpi[fl]. It
> > was suggested elsewhere that using an Estrin's method to
> > sum the polynomial approximations instead of Horner's method
> > may allow modern CPUs to better schedule generated code.
> > I have implemented an Estrin's-like method for sinpi[l]
> > and cospi[l], and indeed the generated code is faster on
> > my Intel core2 duo with only a slight degradation in the
> > observed max ULP. I'll post new versions to bugzilla in
> > the near future.
>
> I didn't remember Estrin, but checked Knuth and see that his
> method is a subset of what I use routinely. I might have learned
> it from Knuth. I tried all methods in Knuth, but they didn't seem
> to be any better at first. Later I learned more about scheduling.
> Estrin's method now seems routine as a subset of scheduling.
I think I came across this in Knuth as well, but I may have
picked of the name Estrin from Mueller's book (or google :).
>
> You already used it in expl: for ld80:
>
I recall that that is your handy work.
> X x2 = x * x;
> X x4 = x2 * x2;
>
> Unlike in Knuth's description of Estrin, we don't try to order the
> operations to suit the CPU(s) in the code, but just try to keep them
> independent, as required for them to execute independently. Compilers
> will rearrange them anyway, sometimes even correctly, but it makes
> little difference (on OOE CPUs) to throw all of the operations into
> the pipeline and see how fast something comes out).
>
> X q = x4 * (x2 * (x4 *
> X /*
> X * XXX the number of terms is no longer good for
> X * pairwise grouping of all except B3, and the
> X * grouping is no longer from highest down.
> X */
> X (x2 * B12 + (x * B11 + B10)) +
> X (x2 * (x * B9 + B8) + (x * B7 + B6))) +
> X (x * B5 + B4.e)) + x2 * x * B3.e;
>
> This is arranged to look a bit like Horner's method at first (operations
> not written in separate statements), but it is basically Estrin's method
> with no extensions to more complicated sub-expressions than Estrin's,
> but complications for accuracy:
> - I think accuracy for the B3 term is unimportant, but for some reason
> (perhaps just the one in the comment) it is grouped separately. Estrin
> might use x3 * (B3.e + x * B4.e).
Yes, it is most likely an accuracy issue. I found something
similar with sinpi/cospi. My k_sinpi.c has
#if 0
double x2;
/* Horner's method. */
x2 = x * x;
sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
+ s2) + s1;
#else
double x2, x4;
/* Sort of Estrin's method. */
x2 = x * x;
x4 = x2 * x2;
sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
+ (s2 + s3 * x2 + s4 * x4) * x2 + s1;
#endif
The addition of s1 must occur last, or I get greater 1 ULP
for a number of values.
My timing show on Intel core2 duo
/* Horner's methods. */
laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
sinpi time: 0.066 usec per call
cospi time: 0.063 usec per call
/* Estrin's methods. */
laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
sinpi time: 0.059 usec per call
cospi time: 0.055 usec per call
Here, -n 22 means to test 2**22 - 1 values in the interval
[0,0.25].
For the older functions that were committed a few years
ago, I should probably go back to recheck timings and accuracy
with and without Estrin's method.
--
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow
More information about the freebsd-standards
mailing list