Bit twiddling question
Steve Kargl
sgk at troutmask.apl.washington.edu
Thu Mar 9 19:51:35 UTC 2017
On Fri, Mar 10, 2017 at 05:02:13AM +1100, Bruce Evans wrote:
> On Thu, 9 Mar 2017, Steve Kargl wrote:
>
> > On Wed, Mar 08, 2017 at 11:52:36PM -0800, Steve Kargl wrote:
> >> To give a hint at what I have been working on, I have implementations
> >> for sinpif(x), sinpi(x), cospif(x), and cospi(x). For sinpif(x) and
> >> cospif(x) I have kernels that give correctly rounded results for
> >> FLT_MIN <= x < 0x1p23 (at least on x86_64). I also have slightly
>
> I didn't see the original mail.
Odd. Maybe munched by a spam filter.
> I'd prefer you to finish the long double versions of cosl(), sinl() and
> tanl() (maybe das is responsible for the latter). These are still too
> slow and have minor bugs, at least for sinl() on small x, partly from
> not copying the lower-precision versions closely enough.
sinl(), cosl(), and tanl() are good enough for my purposes at the moment.
tgammal() and powl() are higher on the list of things to fix. But
sinpi[fl]() and company are high on the list because: (1) I can use
them in my actual job; and (2) they have the desirable property that
sinpi(x) = 0 exactly for integer x, whereas sin(pi*x) = 0.mumbleE-17
(or even E-16).
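To make (2) concrete, here is a minimal standalone illustration (a
sketch added for clarity, not code from either mail): M_PI is only the
53-bit rounding of pi, so sin(M_PI*x) for integer x returns a residue
of roughly x*(pi - M_PI) instead of the exact zero sinpi(x) can return.

        #include <math.h>
        #include <stdio.h>

        int
        main(void)
        {
                /* sin(M_PI*n) leaves a residue of about n*(pi - M_PI). */
                for (int n = 1; n <= 4; n++)
                        printf("n = %d  sin(M_PI*n) = % .17e\n",
                            n, sin(M_PI * (double)n));
                return (0);
        }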
> >> faster kernels that give max ULPs of about 0.5008. For sinpi(x) and
>
> Similar to the situation for sinf() and cosf(). The maximum for those is 0.5009, using
> double precision calculations which I now know to be a mistake, and
> a polynomial of degree higher than necessary for good enough rounding in
> float precision and lower than necessary for perfect rounding in float
> precision.
I have three polynomials for __kernel_sinpif() over x in [0,1/4].
All are accumulated in double precision. With a 6th order polynomial
(12 double FLOPs), I get a max ULP of 0.5. With a 5th order polynomial
(10 double FLOPs), I get a max ULP of 0.50000006. With the 4th order
polynomial (8 double FLOPs), the max ULP is 0.50007051. Note that in
my Remez fit I used f(x) = sin(pi*x) / x. I still haven't found a
way to round the 20*53-bit coefficients to a desired precision, which
here is 53 bits. (Well, I have thought about a recursive Remez
algorithm, but it requires much more work/time than I can commit.)
I also have three polynomials for __kernel_cospif() over the
same interval. The 6th order gives a max ULP of 0.5, the 5th order
gives 0.50000089, and the 4th order gives 0.5008257.
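For reference, the sinpif kernel has roughly the following shape (a
sketch only, not the committed code): since f(x) = sin(pi*x)/x is even
in x, the polynomial is evaluated in z = x*x with double accumulation.
The coefficients below are truncated Taylor coefficients of
sin(pi*x)/x, standing in as placeholders for the actual Remez minimax
values, which are not reproduced here:

        /*
         * Sketch of a __kernel_sinpif()-style kernel on [0, 1/4].
         * The s* values are Taylor placeholders, not the Remez
         * coefficients discussed above.
         */
        static const double
        s0 =  3.14159265358979e+00,     /*  pi */
        s1 = -5.16771278004997e+00,     /* -pi^3/6 */
        s2 =  2.55016403987735e+00,     /*  pi^5/120 */
        s3 = -5.99264529320792e-01;     /* -pi^7/5040 */

        static float
        kernel_sinpif_sketch(float x)
        {
                double z = (double)x * x;

                return ((float)(x * (s0 + z * (s1 + z * (s2 + z * s3)))));
        }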
> >> cospi(x), my kernels currently give about 1.25 ULP for the worst-case
> >> error if I accumulate the polynomials in double precision. If I
>
> Not too good. I think the problem is that the coeffs for the lowest
> terms are no longer exactly representable. The linear function f(x) = a*x
> is harder to evaluate accurately than exp(x) = 1 + x + ..., unless a
> is exactly representable and also has few digits (like 1 or -0.5 for
> early coeffs of sin() and cos()).
I agree. I'm still working on the double routines and starting to
contemplate the ld80 and ld128 ones.
> >> accumulate the polynomials in long double precision, then I get
> >> correctly rounded results. To complete the set, I was hoping to
>
> Don't do that. It is much slower, and not available starting with
> long double precision, or even starting with double precision on arches
> without real long doubles. It is also a mistake to use it for float
> precision like I did for trig functions and an uncommitted version of
> logf(). I managed to make the specialized logf() run 50% faster than
> the less specialized log() (also uncommitted), by using double precision
> and manually scheduling for Athlon-XP, but modern CPUs and compilers handle
> the less specialized log() so well that the specialized logf() is only
> a couple of cycles faster. Similarly but less pronounced for trig functions.
Don't worry. I have no intention of using long double to accumulate the
polynomials. I was just curious to see whether the extra 11 bits would
reduce the ULP from 1.25 to something below 1, and I was surprised at
how well those 11 bits worked. I also tried fma() in my Horner's
rule evaluation of the polynomial, but that did not help.
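For the record, the fma() variant looked essentially like this (a
sketch; poly_fma() and the coefficient array are illustrative names,
not the actual code):

        #include <math.h>

        /*
         * Horner's rule with fma(): each step incurs a single rounding
         * instead of two.  c[] holds the coefficients, lowest degree
         * first; n is the polynomial degree.
         */
        static double
        poly_fma(double z, const double c[], int n)
        {
                double p = c[n];

                while (n-- > 0)
                        p = fma(p, z, c[n]);
                return (p);
        }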
> >> work out ld80 and ld128 versions. ld128 is going to be problematic
> >> due to the absence of int128_t.
> >>
> >> I'll send you what I have in a few days.
> >
> > It is sometimes amazing what happens when I sleep. For sinpi[fl](x)
> > and the method I came up with, one needs to do argument reduction
> > where x = n + r. Here n is an integer and 0 <= r < 1. If n is even
> > then one is in the positive half-cycle of sin(x), and if n is odd then
> > one is in the negative half-cycle. For sinpif(x) this looks like
> >         if (ix < 0x4b000000) {          /* 1 <= |x| < 0x1p23 */
> > /* 1 */         ix = (uint32_t)ax;
> > /* 2 */         ax = ax == ix ? zero : __compute_sinpif(ax - ix);
> > /* 3 */         if (ix & 1) ax = -ax;
> > /* 4 */         return ((hx & 0x80000000) ? -ax : ax);
> >         }
> >
> > Line 1 determines n. Line 2 either returns sinpi(n) = 0 exactly or
> > computes sinpi(r). Line 3 uses the parity of n to set the sign for the
> > half-cycle. Line 4 applies the sign from sin(-x) = -sin(x).
> >
> > For double and ld80 one can use int64_t instead of uint32_t. Last
> > night I realized that I don't need to use int64_t, and more importantly
> > I don't need int128_t. The argument reduction can use uint32_t
> > everywhere. For double, I currently have
> >
> > if (ix < 0x43300000) {  /* 1 <= |x| < 0x1p52 */
> >         int64_t n;
> >
> >         n = (int64_t)ax;
> >         ax = ax == n ? zero : __compute_sinpi(ax - n);
> >         if (n & 1) ax = -ax;
> >         return ((hx & 0x80000000) ? -ax : ax);
> > }
> >
> > which can be written as (off the top of my head and not tested)
> > if (ix < 0x43300000) {  /* 1 <= |x| < 0x1p52 */
> >         double volatile vx;
> >         uint32_t n;
> >
> >         vx = ax + 0x1p52;
> >         vx = vx - 0x1p52;
> >         if (vx == ax) return (0);
> >         if (vx > UINT32_MAX) {          /* ax = m + n + r with     */
> >                 vx -= UINT32_MAX;       /* m = UINT32_MAX; n is in */
> >                 ax -= UINT32_MAX;       /* range after reduction   */
> >         }
> >         n = (uint32_t)vx;
> >         ax = __compute_sinpi(ax - n);
> >         if (n & 1) ax = -ax;
> >         return ((hx & 0x80000000) ? -ax : ax);
> > }
> >
> > Something similar can be applied to ld128, but reduction may take
> > two rounds (i.e., a comparison with UINT64_MAX and then UINT32_MAX).
>
> I don't like that much. The 0x1p52 magic is tricky and probably buggy.
> s_rint.c uses the same magic but needs many complications to avoid
> double rounding.
Yep. It seems to have some issues. In my working copy of s_sinpif.c,
I tried
#if 0
        ix = (uint32_t)ax;
        ax = ax == ix ? zero : __compute_sinpif(ax - ix);
        if (ix & 1) ax = -ax;
        return ((hx & 0x80000000) ? -ax : ax);
#else
        volatile float vx;
        float y;

        vx = ax + 0x1p23F;
        y = vx - 0x1p23F;
        ix = (uint32_t)y;
        ax = ax == y ? zero : __compute_sinpif(ax - ix);
        if (ix & 1) ax = -ax;
        return ((hx & 0x80000000) ? -ax : ax);
#endif
My max ULP went from 0.5 to 0.50398505 in exhaustive testing
over the interval [1,100]. If I grab the worst-case x and switch to
the '#if 1' code (the first block above), I see
./testf -S -a 1.50121641e+00f
a = 1.50121641e+00f, /* 0x3fc027dc */
sinpif(a) = -9.99992669e-01f, /* 0xbf7fff85 */
sin(pi*a) = -9.99992669e-01f, /* 0xbf7fff85 */ MPFR, 96 bits rounded to 24 bits.
ULP: 0.49601495
The '#if 0' code (i.e., the #else branch as listed) gives
./testf -S -a 1.50121641e+00f
a = 1.50121641e+00f, /* 0x3fc027dc */
sinpif(a) = -9.99992728e-01f, /* 0xbf7fff86 */
sin(pi*a) = -9.99992669e-01f, /* 0xbf7fff85 */ MPFR, 96 bits rounded to 24 bits.
ULP: 0.50398505
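The difference between the two branches is that the cast in the '#if 1'
code truncates toward zero, while the 0x1p23F magic rounds to nearest,
so for a fractional part >= 1/2 the '#else' branch hands the kernel a
negative reduced argument. A minimal standalone illustration of that
mismatch (added here for clarity, not from the mail):

        #include <stdint.h>
        #include <stdio.h>

        int
        main(void)
        {
                float ax = 2.75F;               /* fraction >= 1/2 */
                volatile float vx = ax + 0x1p23F;
                float y = vx - 0x1p23F;         /* round to nearest: 3 */
                uint32_t ix = (uint32_t)y;

                /* Truncation gives ix = 2; the magic gives ix = 3, so
                   the reduced argument is -0.25 instead of 0.75. */
                printf("y = %g, ix = %u, ax - ix = %g\n",
                    y, ix, ax - (float)ix);
                return (0);
        }

That different reduced argument is presumably why the two branches
disagree in the last bit at the worst-case x above.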
> Similar-looking magic for trig reduction in radians
> uses the magic value 0x1.8p52 to bias rounding errors in
> a safe way, and arranges so that rounding errors don't matter then.
> Perhaps you can do the same here -- do a fixup of +-UINT32_MAX if
> necessary. This seems to be no harder than using rint(), round() or
> floor(). Except for floor(), it is hard to control whether the
> rounded result is too high or too low. Using floor():
>
> prec_t n = floor(x);
> r = x - n;
>
> but we want to classify quadrants, and still don't have an n that will
> fit in an integer, so first scale by 2 (4 quadrants of size 1/2):
>
>         y = x * 0.5 - floor(x * 0.5);   /* 4 quads in [2n, 2n+2) -> [0, 1) */
>         y = 4 * y;                      /* -> [0, 4) */
>         int n = y;                      /* quadrant */
>         r = (y - n) / 4;                /* offset in quadrant */
>
> This is a bit slow, and too much like lgamma's method.
I did look over what you did in lgamma[f], but I forgot the logic
behind the reduction into octants. I suppose that I'm more
concerned with 'make it work' and 'make it correct'; I will/might
worry about 'make it fast' later. But I don't want it to be
too slow, so I'm trying to avoid function calls and reclassification
of x as +-Inf or NaN. Heck, r = remquof(ax, vx, &n) should work,
but there's that speed issue.
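For completeness, a remquof()-based reduction would look something
like the sketch below (illustrative only; it assumes ax >= 1 and that
the __compute_sinpif() kernel dispatcher accepts arguments in
[-1/2, 1/2], which the versions above were not written for):

        #include <math.h>

        float __compute_sinpif(float);

        /*
         * remquof(ax, 1, &n) returns r = ax - n with n the nearest
         * integer, |r| <= 1/2, and stores the low quotient bits in n,
         * so n & 1 gives the half-cycle parity directly.
         */
        static float
        sinpif_by_remquo(float ax)
        {
                int n;
                float r = remquof(ax, 1.0F, &n);
                float s = r == 0.0F ? 0.0F : __compute_sinpif(r);

                return ((n & 1) ? -s : s);
        }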
--
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow