Implementation of half-cycle trignometric functions
Steve Kargl
sgk at troutmask.apl.washington.edu
Fri Apr 28 23:36:00 UTC 2017
On Sat, Apr 29, 2017 at 08:09:39AM +1000, Bruce Evans wrote:
> On Fri, 28 Apr 2017, Steve Kargl wrote:
>
> > On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote:
> >> On Fri, 28 Apr 2017, Steve Kargl wrote:
> >>> ...
> >>> The slowness comes from handling extra precision arithmetic. For
> >>> sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial
> >>> approximation and x2 = x * x. One needs to take care in the ordering
> >>> on the evaluation where x and x2 are split into high and low parts.
> >>> See the source code posted in response to your other email.
> >> ...
> >> Calculating the polynomial as x * p(x) seems wrong.
> >
> > It starts life as the Remes approximation for p(x) = sin(pi*x) / x.
> > Yes, I also looked into using p(x) = sin(pi*x) / (pi*x). At some
> > point, one must do sinpi(x) = sin(pi*x) = x * p(x).
>
> I forgot that the linear term needs to be multiplied by Pi.
>
> Actually, the problem is only in the notation and the above description.
> s0 looks like a constant, but it is actually Pi*x in extra precision,
> and no extra precision is needed or done for x2.
I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61.
static const float
s0hi = 3.14062500e+00f, /* 0x40490000 */
s0lo = 9.67653585e-04f, /* 0x3a7daa22 */
s1 = -5.16771269e+00f, /* 0xc0a55de7 */
s2 = 2.55016255e+00f, /* 0x402335dd */
s3 = -5.99202096e-01f, /* 0xbf19654f */
s4 = 8.10018554e-02f; /* 0x3da5e44d */
static inline float
__kernel_sinpif(float x)
{
double s;
float p, x2;
x2 = x * x;
p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
s = x * (x2 * (double)p + s0hi + s0lo);
return ((float)s);
}
x2 and p are computed in float but is accumulates the final result
in double. Using a higher precision works for float, but double
and long double becomes problematic.
> __kernel_cospi() seems to be buggy. It splits x2, but doesn't use
> extra precision for calculating x2. Is extra precision really needed
> for the x2 term?
Yes. The last part of __kernel_cospif() is a fused-multiple and
add. For small |x2|, we can do a normal summation. That's this
block of code
GET_FLOAT_WORD(ix, x2);
if (ix < 0x3c800000) /* |x2| < 0x1p-6 */
return (c0 + c * x2);
If |x2| > 0x1p-6, then c0 + c * x2 is computed with a FMA, but
its a little trickier. This splits c in clo and chi, and adds
in a correction for c1
GET_FLOAT_WORD(ix, c);
SET_FLOAT_WORD(chi, (ix >> 14) << 14);
clo = c - chi + c1lo;
In the final expression,
c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0);
chi * c2hi is exact, and the parenthesis are required to achieve
max ULP < ~0.88. The grouping of the other terms does not seem
to matter.
> It is like the x3 term for sin*(), with the critical
> difference that the coeffient is much larger (-1/2 instead of -1/6
> before multiplication by pi). This requires extra accuracy for cos()
> too, but not for the calculating the term. I optimized this by changing
> a slow fdlibm method to essentially 2sum. The x2/2 term is inaccurate
> but no significant further inaccuracies occur for adding terms. You
> have Pi*Pi*x2/2 which is more inaccurate. I guess the extra half an
> ulp accuracy here is too much. You avoid this half an ulp using extra
> precision, and seem to use lots of extra precision instead of the final
> 2sum step. Probably much slower.
>
> Here is my modified method:
>
> x2 = x*x; /* hopefully really don't need extra prec */
> (hi, lo) = split(x2);
> resid = S(x2);
> hi = C2_hi * hi;
> lo = C2_lo * hi + C1_hi * lo + C1_lo * lo;
> lo = lo + resid;
> #ifdef hopefully_not_needed
> Re-split hi+lo if lo is too large.
> #endif
> return 3sum(1, hi, lo)
>
> 3sum is in math_private.h and is supposed to be used instead of magic-
> looking open code.
I completely missed 3sum in math_private.h. I did try using your
2sum and 2sumF macros, but did not recall seeing any difference
from my hand-rolled code. I probably didn't spend enough time
looking into if I was using 2sum[F] properly. I'll see if I can
adapt my code to use these macros.
> Here it would reduce to the same expressions as in
> __kernel_sin().
>
> 1+hi = hi_1 + lo_1; /* by hi-lo decomp. */
> return (lo + lo_1 + hi_1); /* safely add the lo's and "carry" */
>
> > One could do, as is done with sin(x),
> >
> > f(x) = (sin(pi*x) - pi*x)/(pi*x)**3
> >
> > then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds
> > the complication that pi*x must be computed in extra precision.
>
> Isn't that what is done now? Of course pi*x must be evaluated in
> extra precision, but hopefully no more is needed. The x3 term
> would be much harder to do all in extra precision than the x2
> term for cospi().
Well, yes, I supposed that's one way to look at it. In the approximation
of sinpif(x) = x * (s0 + x2 * S(x2)), the s0 term is nearly pi. The
s1 coefficient, as determined by Remes, is s1 = -5.16771269, which
happens to be close to -pi**3/6 = -5.16771278 (i.e, next term in
McClauren series of sin(pi*x) = pi*x - (pi**3/6)*x**3 + .... So, the
Remes coefficients have absorbed the factors of pi.
> >> That is too inaccurate even for sin(x), and probably slower too.
> >
> > I'm not sure why you're claiming that it is too inaccurate.
>
> Writing sin(x) as x * (1 + C2*x*x + ...) without extra precision.
> This gives at best an error of half an ulp for the part in parentheses
> and another half an ulp for the multiplication. The final step
> must be an addition of a large term with a term at least 2 times
> smaller so that the error can be controlled (this depends on 1
> of the terms being exact).
Ah, okay, I understand what you're getting at. I played a
number of games with grouping intermediate results and trying
to accumulate high and low parts. You're seeing the best
that I can do.
--
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow
More information about the freebsd-numerics
mailing list