Implementation of half-cycle trignometric functions

Thu May 18 06:35:26 UTC 2017

On Wed, 17 May 2017, Steve Kargl wrote:

> On Thu, May 18, 2017 at 09:25:58AM +1000, Bruce Evans wrote:
>> On Wed, 17 May 2017, Steve Kargl wrote:
> ...
>>> As such, I've added a comment
>>>
>>> /*
>>> * pi_hi contains the leading 56 bits of a 169 bit approximation for pi.
>>> */
>>
>> Why 56?  53 + 56 = 109.
>
> This is ld128.

ld128 has precision 113.

> static const long double
> pi_hi = 3.14159265358979322702026593105983920e+00L,
> pi_lo = 1.14423774522196636802434264184180742e-17L;
>
> pi_hi has 113/2 = 56 bits.
> pi_lo has 113 bit.
>
> 56 + 113 = 169

But hi is set intentionally sloppily by converting to double, so it has
53 bits instead of the 57 that 56 would go with (57 = 113/2 rounded up).

>> My improvements to lgamma()'s sin_pi*() are buggy.  I forgot the original
>> target of avoiding inexact better:
>
> I'll need to wait until the weekend to study your improved sin_pi.

I now have closer to production version.  sinpi() runs more than 2
times faster that your version, without doing much different except
not having so much EXTRACT/INSERT or special cases.  (66 cycles instead
of 150 on Haswell-i386, and the 66 can be reduced to 58 by removing
the STRICT_ASSIGN() which breaks only i386 with non-default extra
precision.  The difference would be smaller on amd64 or better compilers
, but Haswell-i386 has much the same slownesses as old Athlon from the
EXTRACTs and INSERTs.

However, both versions fail tests.  Min has a max error of 2048 ulps
and an everage error of 928 ulps and yours has a maximum error of many
giga-ups for the sign bit wrong and an average error of 561 ulps.  The
enormous errors are probably for denormals or NaNs.

XX double
XX ysinpi(double x)
XX {
XX 	double s,y,z;
XX 	int32_t hx,ix;
XX 	int n;
XX 
XX 	GET_HIGH_WORD(hx,x);
XX 	ix = hx & 0x7fffffff;
XX 
XX 	if (ix < 0x3fd00000) {		/* |x| < 0.25 */
XX 	    if (ix < 0x3e200000)	/* |x| < 2**-29 */
XX 		/* XXX: extra precision would generate spurious denormals. */
XX 		if(((int)x)==0) return M_PI*x; /* pi*x with inexact if x != 0 */

Note a problem with denormals.  The hi+lo decomposition on values up
to about 2**53 times the largest denormal creates denormals and thus
spurious denormal and underflow exceptions and possibly slowness.  I'm
not sure if it gives correct results either.  Your version has a special
case for calculating Pi*x, but doesn't handle this.  I think it would
work to scale up into non-denormal range for the calculation.  I just
multiply by M_PI.  I suspects this gives an error of nearly 1 ulp and
will test to see if it is strictly below 1 ulp (the details depend on
how closely M_PI approximates pi).

XX 	    return __kernel_sinpi(x);
XX 	}
XX 
XX 	if (ix>=0x7ff00000) return x-x;	/* Inf or NaN -> NaN */
XX

This classification was cloned from sin().  Then adjust constants as in
your version.  Not doing so much here probably gives half of the 150:66
speed improvement.

Don't bother with special cases for |x| <= 2 or 2.25 as done by sinf().
We don't really want the special case for |x| <= 0.25, but need it to
avoid denormals and underflow.  BTW, I just noticed that I broke this
case in lgamma*().  I moved the check for tiny x out of the kernels into
the sin*() and cos*() callers (or kept the duplicate of it there), but
didn't add it for the lgamma*() callers.  I broke it for sinf() or sin()
too.  You broke it for sinl().

XX 	if (ix >= 0x43400000) return 0;	/* even integer -> 0 */
XX 
XX 	y = fabs(x);
XX 
XX 	STRICT_ASSIGN(double,z,y+0x1p50);
XX 	GET_LOW_WORD(n,z);		/* bits for rounded y (units 0.25) */
XX 	z -= 0x1p50;			/* y rounded to a multiple of 0.25 */
XX 	if (z > y) {
XX 	    z -= 0.25;			/* adjust to round down */
XX 	    n--;
XX 	}
XX 	n &= 7;				/* octant of y mod 2 */
XX 	y -= z;				/* y mod 0.25 */

This is from old optimizations in lgamma() with minor improvements:
- remove extra reduction step to classify integers
- use STRICT_ASSIGN() instead of hard-coded volatile to avoid pessimizing
   arches without extra precision

XX 
XX 	if (n >= 4) {
XX 	    s = -1;
XX 	    n -= 4;
XX 	} else
XX 	    s = 1;
XX 	if (n == 0)
XX 	    return s*__kernel_sinpi(y);	/* this handles poles right */

Oops, the poles are specific to gamma().

This handles integers right.  y is 0 and the kernel returns +=0 without
setting inexact (depend on knowing its internals for the latter).  Don't
be careful about the sign.  I think it is always +0 with the default rounding
mode.

XX 	else if (n == 1)
XX 	    return s*__kernel_cospi(y-0.25);
XX 	else if (n == 2)
XX 	    return s*__kernel_cospi(y);

This handles half-integers right.  No problem with the sign since the kernel
always returns 1.

XX 	else
XX 	    return s*__kernel_sinpi(y-0.25);
XX }

This is from old optimizations in lgamma with minor improvements (but
larger code changes):
- don't use a case statement
- don't add n * 0.25 to y only to have to a multiple of 0.25 from y in more
   cases later.

Bruce