Implementation of half-cycle trignometric functions

Tue Apr 25 06:36:45 UTC 2017

On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
> On Mon, 24 Apr 2017, Steve Kargl wrote:
> 
> > I have what appears to be working versions of asinpi[fl].  It
> > was suggested elsewhere that using an Estrin's method to
> > sum the polynomial approximations instead of Horner's method
> > may allow modern CPUs to better schedule generated code.
> > I have implemented an Estrin's-like method for sinpi[l]
> > and cospi[l], and indeed the generated code is faster on
> > my Intel core2 duo with only a slight degradation in the
> > observed max ULP.  I'll post new versions to bugzilla in
> > the near future.
> 
> I didn't remember Estrin, but checked Knuth and see that his
> method is a subset of what I use routinely.  I might have learned
> it from Knuth.  I tried all methods in Knuth, but they didn't seem
> to be any better at first.  Later I learned more about scheduling.
> Estrin's method now seems routine as a subset of scheduling.

I think I came across this in Knuth as well, but I may have
picked of the name Estrin from Mueller's book (or google :).

> 
> You already used it in expl: for ld80:
> 

I recall that that is your handy work.

> X 		x2 = x * x;
> X 		x4 = x2 * x2;
> 
> Unlike in Knuth's description of Estrin, we don't try to order the
> operations to suit the CPU(s) in the code, but just try to keep them
> independent, as required for them to execute independently.  Compilers
> will rearrange them anyway, sometimes even correctly, but it makes
> little difference (on OOE CPUs) to throw all of the operations into
> the pipeline and see how fast something comes out).
> 
> X 		q = x4 * (x2 * (x4 *
> X 		    /*
> X 		     * XXX the number of terms is no longer good for
> X 		     * pairwise grouping of all except B3, and the
> X 		     * grouping is no longer from highest down.
> X 		     */
> X 		    (x2 *            B12  + (x * B11 + B10)) +
> X 		    (x2 * (x * B9 +  B8) +  (x * B7 +  B6))) +
> X 			  (x * B5 +  B4.e)) + x2 * x * B3.e;
> 
> This is arranged to look a bit like Horner's method at first (operations
> not written in separate statements), but it is basically Estrin's method
> with no extensions to more complicated sub-expressions than Estrin's,
> but complications for accuracy:
> - I think accuracy for the B3 term is unimportant, but for some reason
>    (perhaps just the one in the comment) it is grouped separately.  Estrin
>    might use x3 * (B3.e + x * B4.e).

Yes, it is most likely an accuracy issue.  I found something 
similar with sinpi/cospi.  My k_sinpi.c has

#if 0
        double x2;
        /* Horner's method. */
        x2 = x * x;
        sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
            + s2) + s1;
#else
        double x2, x4;
        /* Sort of Estrin's method. */
        x2 = x * x;
        x4 = x2 * x2;
        sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
            + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
#endif

The addition of s1 must occur last, or I get greater 1 ULP
for a number of values.

My timing show on Intel core2 duo

/* Horner's methods. */
laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
sinpi time: 0.066 usec per call
cospi time: 0.063 usec per call

/* Estrin's methods. */
laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
sinpi time: 0.059 usec per call
cospi time: 0.055 usec per call

Here, -n 22 means to test 2**22 - 1 values in the interval
[0,0.25].

For the older functions that were committed a few years
ago, I should probably go back to recheck timings and accuracy
with and without Estrin's method.

-- 
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow