Bit twiddling question
Steve Kargl
sgk at troutmask.apl.washington.edu
Thu Mar 9 19:51:35 UTC 2017
On Fri, Mar 10, 2017 at 05:02:13AM +1100, Bruce Evans wrote:
> On Thu, 9 Mar 2017, Steve Kargl wrote:
>
> > On Wed, Mar 08, 2017 at 11:52:36PM -0800, Steve Kargl wrote:
> >> To give a hint at what I have been working on, I have implementations
> >> for sinpif(x), sinpi(x), cospif(x), and cospi(x). For sinpif(x) and
> >> cospif(x) I have kernels that give correctly rounded results for
> >> FLT_MIN <= x < 0x1p23 (at least on x86_64). I also have slightly
>
> I didn't see the original mail.
Odd. Maybe munched by a spam filter.
> I'd prefer you to finish the long double versions of cosl(), sinl() and
> tanl() (maybe das is responsible for the latter). These are still too
> slow and have minor bugs, at least for sinl() on small x, partly from
> not copying the lower-precision versions closely enough.
sinl(), cosl(), and tanl() are good enough for my purposes at the moment.
tgammal() and powl() are higher on the list of things to fix. But
sinpi[fl]() and company are high on the list because: (1) I can use
them in my actual job; and (2) they have the desirable property that
sinpi(x) = 0 exactly for integer x, whereas sin(pi*x) = 0.mumbleE-17
(or even E-16).
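To make (2) concrete, here is a minimal standalone illustration (a
sketch added for clarity, not code from either mail): M_PI is only the
53-bit rounding of pi, so sin(M_PI*x) for integer x returns a residue
of roughly x*(pi - M_PI) instead of the exact zero sinpi(x) can return.

        #include <math.h>
        #include <stdio.h>

        int
        main(void)
        {
                /* sin(M_PI*n) leaves a residue of about n*(pi - M_PI). */
                for (int n = 1; n <= 4; n++)
                        printf("n = %d  sin(M_PI*n) = % .17e\n",
                            n, sin(M_PI * (double)n));
                return (0);
        }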
> >> faster kernels that give max ULPs of about 0.5008. For sinpi(x) and
>
> Similar to the situation for sinf() and cosf(). The maximum for those is 0.5009, using
> double precision calculations which I now know to be a mistake, and
> a polynomial of degree higher than necessary for good enough rounding in
> float precision and lower than necessary for perfect rounding in float
> precision.
I have three polynomials for __kernel_sinpif() over x in [0,1/4].
All are accumulated in double precision. With a 6th order polynomial
(12 double FLOPs), I get a max ULP of 0.5. With a 5th order polynomial
(10 double FLOPs), I get a max ULP of 0.50000006. With the 4th order
polynomial (8 double FLOPs), the max ULP is 0.50007051. Note that in
my Remez fit I used f(x) = sin(pi*x) / x. I still haven't found a
way to round the 20*53-bit coefficients to a desired precision, which
here is 53 bits. (Well, I have thought about a recursive Remez
algorithm, but it requires much more work/time than I can commit.)
I also have three polynomials for __kernel_cospif() over the
same interval. The 6th order gives a max ULP of 0.5, the 5th order
gives 0.50000089, and the 4th order gives 0.5008257.
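For reference, the sinpif kernel has roughly the following shape (a
sketch only, not the committed code): since f(x) = sin(pi*x)/x is even
in x, the polynomial is evaluated in z = x*x with double accumulation.
The coefficients below are truncated Taylor coefficients of
sin(pi*x)/x, standing in as placeholders for the actual Remez minimax
values, which are not reproduced here:

        /*
         * Sketch of a __kernel_sinpif()-style kernel on [0, 1/4].
         * The s* values are Taylor placeholders, not the Remez
         * coefficients discussed above.
         */
        static const double
        s0 =  3.14159265358979e+00,     /*  pi */
        s1 = -5.16771278004997e+00,     /* -pi^3/6 */
        s2 =  2.55016403987735e+00,     /*  pi^5/120 */
        s3 = -5.99264529320792e-01;     /* -pi^7/5040 */

        static float
        kernel_sinpif_sketch(float x)
        {
                double z = (double)x * x;

                return ((float)(x * (s0 + z * (s1 + z * (s2 + z * s3)))));
        }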
> >> cospi(x), my kernels currently give about 1.25 ULP for the worst-case
> >> error if I accumulate the polynomials in double precision. If I
>
> Not too good. I think the problem is that the coeffs for the lowest
> terms are no longer exactly representable. The linear function f(x) = a*x
> is harder to evaluate accurately than exp(x) = 1 + x + ..., unless a
> is exactly representable and also has few digits (like 1 or -0.5 for
> early coeffs of sin() and cos()).
I agree. I'm still working on the double routines and starting to
contemplate the ld80 and ld128 ones.
> >> accumulate the polynomials in long double precision, then I get
> >> correctly rounded results. To complete the set, I was hoping to
>
> Don't do that. It is much slower, and not available starting with
> long double precision, or even starting with double precision on arches
> without real long doubles. It is also a mistake to use it for float
> precision like I did for trig functions and an uncommitted version of
> logf(). I managed to make the specialized logf() run 50% faster than
> the less specialized log() (also uncommitted), by using double precision
> and manually scheduling for Athlon-XP, but modern CPUs and compilers handle
> the less specialized log() so well that the specialized logf() is only
> a couple of cycles faster. Similarly but less pronounced for trig functions.
Don't worry. I have no intention of using long double to accumulate the
polynomials. I was just curious to see whether the extra 11 bits would
reduce the ULP from 1.25 to something below 1, and I was surprised at
how well those 11 bits worked. I also tried fma() in my Horner's
rule evaluation of the polynomial, but that did not help.
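For the record, the fma() variant looked essentially like this (a
sketch; poly_fma() and the coefficient array are illustrative names,
not the actual code):

        #include <math.h>

        /*
         * Horner's rule with fma(): each step incurs a single rounding
         * instead of two.  c[] holds the coefficients, lowest degree
         * first; n is the polynomial degree.
         */
        static double
        poly_fma(double z, const double c[], int n)
        {
                double p = c[n];

                while (n-- > 0)
                        p = fma(p, z, c[n]);
                return (p);
        }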
> >> work out ld80 and ld128 versions. ld128 is going to be problematic
> >> due to the absence of int128_t.
> >>
> >> I'll send you what I have in a few days.
> >
> > It is sometimes amazing what happens when I sleep. For sinpi[fl](x)
> > and the method I came up with, one needs to do argument reduction
> > where x = n + r. Here n is an integer and 0 <= r < 1. If n is even
> > then one is in the positive half-cycle of sin(x), and if n is odd then
> > one is in the negative half-cycle. For sinpif(x) this looks like
> >         if (ix < 0x4b000000) {          /* 1 <= |x| < 0x1p23 */
> > /* 1 */         ix = (uint32_t)ax;
> > /* 2 */         ax = ax == ix ? zero : __compute_sinpif(ax - ix);
> > /* 3 */         if (ix & 1) ax = -ax;
> > /* 4 */         return ((hx & 0x80000000) ? -ax : ax);
> >         }
> >
> > Line 1 determines n. Line 2 either returns sinpi(n) = 0 exactly or
> > computes sinpi(r). Line 3 uses the parity of n to set the sign for the
> > half-cycle. Line 4 applies the sign from sin(-x) = -sin(x).
> >
> > For double and ld80 one can use int64_t instead of uint32_t. Last
> > night I realized that I don't need to use int64_t, and more importantly
> > I don't need int128_t. The argument reduction can use uint32_t
> > everywhere. For double, I currently have
> >
> > if (ix < 0x43300000) {  /* 1 <= |x| < 0x1p52 */
> >         int64_t n;
> >
> >         n = (int64_t)ax;
> >         ax = ax == n ? zero : __compute_sinpi(ax - n);
> >         if (n & 1) ax = -ax;
> >         return ((hx & 0x80000000) ? -ax : ax);
> > }
> >
> > which can be written as (off the top of my head and not tested)
> > if (ix < 0x43300000) {  /* 1 <= |x| < 0x1p52 */
> >         double volatile vx;
> >         uint32_t n;
> >
> >         vx = ax + 0x1p52;
> >         vx = vx - 0x1p52;
> >         if (vx == ax) return (0);
> >         if (vx > UINT32_MAX) {          /* ax = m + n + r with     */
> >                 vx -= UINT32_MAX;       /* m = UINT32_MAX; n is in */
> >                 ax -= UINT32_MAX;       /* range after reduction   */
> >         }
> >         n = (uint32_t)vx;
> >         ax = __compute_sinpi(ax - n);
> >         if (n & 1) ax = -ax;
> >         return ((hx & 0x80000000) ? -ax : ax);
> > }
> >
> > Something similar can be applied to ld128, but reduction may take
> > two rounds (i.e., a comparison with UINT64_MAX and then UINT32_MAX).
>
> I don't like that much. The 0x1p52 magic is tricky and probably buggy.
> s_rint.c uses the same magic but needs many complications to avoid
> double rounding.
Yep. It seems to have some issues. In my working copy of s_sinpif.c,
I tried
#if 0
        ix = (uint32_t)ax;
        ax = ax == ix ? zero : __compute_sinpif(ax - ix);
        if (ix & 1) ax = -ax;
        return ((hx & 0x80000000) ? -ax : ax);
#else
        volatile float vx;
        float y;

        vx = ax + 0x1p23F;
        y = vx - 0x1p23F;
        ix = (uint32_t)y;
        ax = ax == y ? zero : __compute_sinpif(ax - ix);
        if (ix & 1) ax = -ax;
        return ((hx & 0x80000000) ? -ax : ax);
#endif
My max ULP went from 0.5 to 0.50398505 in exhaustive testing
over the interval [1,100]. If I grab the worst-case x and switch to
the '#if 1' code (the first block above), I see
./testf -S -a 1.50121641e+00f
a = 1.50121641e+00f, /* 0x3fc027dc */
sinpif(a) = -9.99992669e-01f, /* 0xbf7fff85 */
sin(pi*a) = -9.99992669e-01f, /* 0xbf7fff85 */ MPFR, 96 bits rounded to 24 bits.
ULP: 0.49601495
The '#if 0' code (i.e., the #else branch as listed) gives
./testf -S -a 1.50121641e+00f
a = 1.50121641e+00f, /* 0x3fc027dc */
sinpif(a) = -9.99992728e-01f, /* 0xbf7fff86 */
sin(pi*a) = -9.99992669e-01f, /* 0xbf7fff85 */ MPFR, 96 bits rounded to 24 bits.
ULP: 0.50398505
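The difference between the two branches is that the cast in the '#if 1'
code truncates toward zero, while the 0x1p23F magic rounds to nearest,
so for a fractional part >= 1/2 the '#else' branch hands the kernel a
negative reduced argument. A minimal standalone illustration of that
mismatch (added here for clarity, not from the mail):

        #include <stdint.h>
        #include <stdio.h>

        int
        main(void)
        {
                float ax = 2.75F;               /* fraction >= 1/2 */
                volatile float vx = ax + 0x1p23F;
                float y = vx - 0x1p23F;         /* round to nearest: 3 */
                uint32_t ix = (uint32_t)y;

                /* Truncation gives ix = 2; the magic gives ix = 3, so
                   the reduced argument is -0.25 instead of 0.75. */
                printf("y = %g, ix = %u, ax - ix = %g\n",
                    y, ix, ax - (float)ix);
                return (0);
        }

That different reduced argument is presumably why the two branches
disagree in the last bit at the worst-case x above.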
> Similar-looking magic for trig reduction in radians
> uses the magic value 0x1.8p52 to bias rounding errors in
> a safe way, and arranges so that rounding errors don't matter then.
> Perhaps you can do the same here -- do a fixup of +-UINT32_MAX if
> necessary. This seems to be no harder than using rint(), round() or
> floor(). Except for floor(), it is hard to control whether the
> rounded result is too high or too low. Using floor():
>
> prec_t n = floor(x);
> r = x - n;
>
> but we want to classify quadrants, and still don't have an n that will
> fit in an integer, so first scale by 2 (4 quadrants of size 1/2):
>
>         y = x * 0.5 - floor(x * 0.5);   /* 4 quads in [2n, 2n+2) -> [0, 1) */
>         y = 4 * y;                      /* -> [0, 4) */
>         int n = y;                      /* quadrant */
>         r = (y - n) / 4;                /* offset in quadrant */
>
> This is a bit slow, and too much like lgamma's method.
I did look over what you did in lgamma[f], but I forgot the logic
behind the reduction into octants. I suppose that I'm more
concerned with 'make it work' and 'make it correct'; I will/might
worry about 'make it fast' later. But I don't want it to be
too slow, so I'm trying to avoid function calls and reclassification
of x as +-Inf or NaN. Heck, r = remquof(ax, vx, &n) should work,
but there's that speed issue.
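For completeness, a remquof()-based reduction would look something
like the sketch below (illustrative only; it assumes ax >= 1 and that
the __compute_sinpif() kernel dispatcher accepts arguments in
[-1/2, 1/2], which the versions above were not written for):

        #include <math.h>

        float __compute_sinpif(float);

        /*
         * remquof(ax, 1, &n) returns r = ax - n with n the nearest
         * integer, |r| <= 1/2, and stores the low quotient bits in n,
         * so n & 1 gives the half-cycle parity directly.
         */
        static float
        sinpif_by_remquo(float ax)
        {
                int n;
                float r = remquof(ax, 1.0F, &n);
                float s = r == 0.0F ? 0.0F : __compute_sinpif(r);

                return ((n & 1) ? -s : s);
        }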
--
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow