Implementation of half-cycle trigonometric functions
Bruce Evans
brde at optusnet.com.au
Sat Apr 29 10:49:26 UTC 2017
On Sat, 29 Apr 2017, Bruce Evans wrote:
> On Fri, 28 Apr 2017, Steve Kargl wrote:
>
>> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
>>>
>>> I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61.
>
> Comments on this below.
>
> This is all rather over-engineered. Optimizing these functions is
> unimportant compared with finishing cosl() and sinl() and optimizing
> all of the standard trig functions better, but we need correctness.
> But I now see many simplifications and improvements:
>
> (1) There is no need for new kernels. The standard kernels already handle
> extra precision using approximations like:
>
> sin(x+y) ~= sin(x) + (1-x*x/2)*y.
>
> Simply reduce x and write Pi*x = hi+lo. Then
>
> sin(Pi*x) = __kernel_sin(hi, lo, 1).
>
> I now see how to do the extra-precision calculations without any
> multiplications.
But that is over-engineered too.
Using the standard kernels is easy and works well:
XX #include <float.h>
XX #include <math.h>
XX
XX #include "math_private.h"
XX
XX static const double
XX pi_hi = 3.1415926218032837e+00, /* 0x400921fb 0x50000000 */
XX pi_lo = 3.1786509547050787e-08; /* 0x3e6110b4 0x611a5f14 */
XX
XX /* Only for |x| <= ~0.25 (missing range reduction). */
XX
XX double
XX cospi(double x)
XX {
XX double_t hi, lo;
XX
XX hi = (float)x;
XX lo = x - hi;
XX lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX hi = pi_hi * hi;
XX _2sumF(hi, lo);
XX return __kernel_cos(hi, lo);
XX }
XX
XX double
XX sinpi(double x)
XX {
XX double_t hi, lo;
XX
XX hi = (float)x;
XX lo = x - hi;
XX lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX hi = pi_hi * hi;
XX _2sumF(hi, lo);
XX return __kernel_sin(hi, lo, 1);
XX }
XX
XX double
XX tanpi(double x)
XX {
XX double_t hi, lo;
XX
XX hi = (float)x;
XX lo = x - hi;
XX lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX hi = pi_hi * hi;
XX _2sumF(hi, lo);
XX return __kernel_tan(hi, lo, 1);
XX }
I only did a sloppy accuracy test for sinpi(). It was 0.03 ulps less
accurate than sin(), on the range [0, 0.25] for sinpi() and [0, Pi/4]
for sin().
Efficiency is very good in some cases but anomalous in others. All
times are in cycles, on i386, on the range [0, 0.25]:
            athlon-xp, gcc-3.3     Haswell, gcc-3.3   Haswell, gcc-4.2.1
cos:        61-62                  44                 43
cospi:      69-71 (8-9 extra)      78 (anomalous...)  42 (faster to do more!)
sin:        59-60                  51                 37
sinpi:      67-68 (8 extra)        80                 42
tan:        136-172                93-195             67-94
tanpi:      144-187 (8-15 extra)   145-176            61-189
That was a throughput test. Latency is not so good. My latency test
doesn't use serializing instructions, but uses random args and the
partial serialization of making each result depend on the previous
one.
            athlon-xp, gcc-3.3      Haswell, gcc-3.3   Haswell, gcc-4.2.1
cos:        84-85                   69                 79
cospi:      103-104 (19-21 extra)   117                94
sin:        75-76                   89                 77
sinpi:      105-106 (30 extra)      116                90
tan:        168-170                 167-168            147
tanpi:      191-194 (23-24 extra)   191                154
This also indicates that the longest times for tan() in the throughput
test occur when the function doesn't run in parallel with itself. The
high-degree polynomial and other complications in tan() are too
complicated to allow much cross-function parallelism.
Anyway, it looks like the cost of using the kernel is at most 8-9
cycles in the parallel case and at most 30 in the serial case. The
extra-precision code has about 10 dependent instructions, so it is
doing OK to take 30.
Bruce
More information about the freebsd-numerics mailing list