Implementation of half-cycle trignometric functions
Bruce Evans
brde at optusnet.com.au
Thu May 18 06:35:26 UTC 2017
On Wed, 17 May 2017, Steve Kargl wrote:
> On Thu, May 18, 2017 at 09:25:58AM +1000, Bruce Evans wrote:
>> On Wed, 17 May 2017, Steve Kargl wrote:
> ...
>>> As such, I've added a comment
>>>
>>> /*
>>> * pi_hi contains the leading 56 bits of a 169 bit approximation for pi.
>>> */
>>
>> Why 56? 53 + 56 = 109.
>
> This is ld128.
ld128 has precision 113.
> static const long double
> pi_hi = 3.14159265358979322702026593105983920e+00L,
> pi_lo = 1.14423774522196636802434264184180742e-17L;
>
> pi_hi has 113/2 = 56 bits.
> pi_lo has 113 bit.
>
> 56 + 113 = 169
But hi is set intentionally sloppily by converting to double, so it has
53 bits instead of the 57 that 56 would go with (57 = 113/2 rounded up).
>> My improvements to lgamma()'s sin_pi*() are buggy. I forgot the original
>> target of avoiding inexact better:
>
> I'll need to wait until the weekend to study your improved sin_pi.
I now have closer to production version. sinpi() runs more than 2
times faster that your version, without doing much different except
not having so much EXTRACT/INSERT or special cases. (66 cycles instead
of 150 on Haswell-i386, and the 66 can be reduced to 58 by removing
the STRICT_ASSIGN() which breaks only i386 with non-default extra
precision. The difference would be smaller on amd64 or better compilers
, but Haswell-i386 has much the same slownesses as old Athlon from the
EXTRACTs and INSERTs.
However, both versions fail tests. Min has a max error of 2048 ulps
and an everage error of 928 ulps and yours has a maximum error of many
giga-ups for the sign bit wrong and an average error of 561 ulps. The
enormous errors are probably for denormals or NaNs.
XX double
XX ysinpi(double x)
XX {
XX double s,y,z;
XX int32_t hx,ix;
XX int n;
XX
XX GET_HIGH_WORD(hx,x);
XX ix = hx & 0x7fffffff;
XX
XX if (ix < 0x3fd00000) { /* |x| < 0.25 */
XX if (ix < 0x3e200000) /* |x| < 2**-29 */
XX /* XXX: extra precision would generate spurious denormals. */
XX if(((int)x)==0) return M_PI*x; /* pi*x with inexact if x != 0 */
Note a problem with denormals. The hi+lo decomposition on values up
to about 2**53 times the largest denormal creates denormals and thus
spurious denormal and underflow exceptions and possibly slowness. I'm
not sure if it gives correct results either. Your version has a special
case for calculating Pi*x, but doesn't handle this. I think it would
work to scale up into non-denormal range for the calculation. I just
multiply by M_PI. I suspects this gives an error of nearly 1 ulp and
will test to see if it is strictly below 1 ulp (the details depend on
how closely M_PI approximates pi).
XX return __kernel_sinpi(x);
XX }
XX
XX if (ix>=0x7ff00000) return x-x; /* Inf or NaN -> NaN */
XX
This classification was cloned from sin(). Then adjust constants as in
your version. Not doing so much here probably gives half of the 150:66
speed improvement.
Don't bother with special cases for |x| <= 2 or 2.25 as done by sinf().
We don't really want the special case for |x| <= 0.25, but need it to
avoid denormals and underflow. BTW, I just noticed that I broke this
case in lgamma*(). I moved the check for tiny x out of the kernels into
the sin*() and cos*() callers (or kept the duplicate of it there), but
didn't add it for the lgamma*() callers. I broke it for sinf() or sin()
too. You broke it for sinl().
XX if (ix >= 0x43400000) return 0; /* even integer -> 0 */
XX
XX y = fabs(x);
XX
XX STRICT_ASSIGN(double,z,y+0x1p50);
XX GET_LOW_WORD(n,z); /* bits for rounded y (units 0.25) */
XX z -= 0x1p50; /* y rounded to a multiple of 0.25 */
XX if (z > y) {
XX z -= 0.25; /* adjust to round down */
XX n--;
XX }
XX n &= 7; /* octant of y mod 2 */
XX y -= z; /* y mod 0.25 */
This is from old optimizations in lgamma() with minor improvements:
- remove extra reduction step to classify integers
- use STRICT_ASSIGN() instead of hard-coded volatile to avoid pessimizing
arches without extra precision
XX
XX if (n >= 4) {
XX s = -1;
XX n -= 4;
XX } else
XX s = 1;
XX if (n == 0)
XX return s*__kernel_sinpi(y); /* this handles poles right */
Oops, the poles are specific to gamma().
This handles integers right. y is 0 and the kernel returns +=0 without
setting inexact (depend on knowing its internals for the latter). Don't
be careful about the sign. I think it is always +0 with the default rounding
mode.
XX else if (n == 1)
XX return s*__kernel_cospi(y-0.25);
XX else if (n == 2)
XX return s*__kernel_cospi(y);
This handles half-integers right. No problem with the sign since the kernel
always returns 1.
XX else
XX return s*__kernel_sinpi(y-0.25);
XX }
This is from old optimizations in lgamma with minor improvements (but
larger code changes):
- don't use a case statement
- don't add n * 0.25 to y only to have to a multiple of 0.25 from y in more
cases later.
Bruce
More information about the freebsd-numerics
mailing list