powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Konstantin Belousov
kostikbel at gmail.com
Sun Mar 3 16:16:47 UTC 2019
On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>
> > On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> >> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>
> >>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> > ...
> >>>> I don't like changing this at all.  binuptime() was carefully written
> >>>> to not need so much 64-bit arithmetic.
> >>>>
> >>>> If this pessimization is allowed, then it can also handle 64-bit
> >>>> deltas. Using the better kernel method:
> >>>>
> >>>> if (__predict_false(delta >= th->th_large_delta)) {
> >>>> bt->sec += (scale >> 32) * (delta >> 32);
> >>>> x = (scale >> 32) * (delta & 0xffffffff);
> >>>> bt->sec += x >> 32;
> >>>> bintime_addx(bt, x << 32);
> >>>> x = (scale & 0xffffffff) * (delta >> 32);
> >>>> bt->sec += x >> 32;
> >>>> bintime_addx(bt, x << 32);
> >>>> bintime_addx(bt, (scale & 0xffffffff) *
> >>>> (delta & 0xffffffff));
> >>>> } else
> >>>> bintime_addx(bt, scale * (delta & 0xffffffff));
> >>> This only makes sense if delta is extended to uint64_t, which requires
> >>> the pass over timecounters.
> >>
> >> Yes, that was its point. It is a bit annoying to have a hardware
> >> timecounter like the TSC that doesn't wrap naturally, but then make it
> >> wrap by masking high bits.
> >>
> >> The masking step is also a bit wasteful. For the TSC, it is 1 step to
> >> discard high bits at the register level, then another step to apply the
> >> mask to discard the high bits again.
> > rdtsc-low is implemented in the natural way: after RDTSC, no register
> > combining into a 64-bit value is done; instead shrd operates on %edx:%eax
> > to get the final result into %eax.  I am not sure what you refer to.
>
> I was referring mostly to the masking step '& tc->tc_counter_mask' and
> the lack of register combining in rdtsc().
>
> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
> step. i386 used to be faster here -- the first masking step of discarding
> %edx doesn't take any code. amd64 has to mask out the top bits in %rax.
> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
> has to do a not so slow shr.
i386 cannot discard %edx after RDTSC since some bits from %edx come into
the timecounter value.
amd64 cannot either, but amd64 does not need to mask out the top bits in
%rax, since the whole shrdl calculation occurs in 32-bit registers and the
result is left in %eax; the top 32 bits of %rax are cleared automatically
because shrdl writes a 32-bit register.  But the clearing is not required
anyway, since the result is an unsigned int.
Disassembly of tsc_get_timecount_low() is very clear:
0xffffffff806767e4 <+4>: mov 0x30(%rdi),%ecx
0xffffffff806767e7 <+7>: rdtsc
0xffffffff806767e9 <+9>: shrd %cl,%edx,%eax
...
0xffffffff806767ed <+13>: retq
(I removed frame manipulations).
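For comparison, here is a minimal userland sketch of the same rdtsc-low
read (an illustration only, not the kernel source; the 'shift' argument
stands in for the value loaded from the timecounter private data):

#include <stdint.h>

/*
 * Sketch: shrd shifts the 64-bit %edx:%eax pair right by 'shift' and
 * leaves the low 32 bits of the result in %eax, so no 64-bit register
 * combining or extra masking is needed.
 */
static inline uint32_t
rdtsc_low_sketch(int shift)
{
	uint32_t rv;

	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
	    : "=a" (rv) : "c" (shift) : "edx");
	return (rv);
}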
>
> Then the '& tc->tc_counter_mask' step has no effect.
This is true.
>
> All this is wrapped in many layers of function calls which are quite slow
> but this lets the other operations run in parallel on some CPUs.
>
> >>>> /* 32-bit arches did the next multiplication implicitly. */
> >>>> x = (scale >> 32) * delta;
> >>>> /*
> >>>> * And they did the following shifts and most of the adds
> >>>> * implicitly too. Except shifting x left by 32 lost the
> >>>> * seconds part that the next line handles. The next line
> >>>> * is the only extra cost for them.
> >>>> */
> >>>> bt->sec += x >> 32;
> >>>> bintime_addx(bt, (x << 32) + (scale & 0xffffffff) * delta);
> >>>
> >>> Ok, what about the following.
> >>
> >> I'm not sure that I really want this, even if the pessimization is done.
> >> But it avoids using fls*(), so is especially good for 32-bit systems and
> >> OK for 64-bit systems too, especially in userland where fls*() is in the
> >> fast path.
> > For userland I looked at the generated code, and BSR usage seems to be
> > good enough for the default compilation settings with clang.
>
> I use gcc-4.2.1, and it doesn't do this optimization.
>
> I already reported this in connection with fixing calcru1(). calcru1()
> is unnecessarily several times slower on i386 than on amd64 even after
> avoiding using flsll() on it. The main slowness is in converting 'usec'
> to tv_sec and tv_usec, due to the bad design and implementation of the
> __udivdi3 and __umoddi3 libcalls. The bad design is having to make 2
> libcalls to get the quotient and remainder. The bad implementation is
> the portable C version in libkern. libgcc provides a better implementation,
> but this is not available in the kernel.
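A minimal sketch of the conversion being discussed (illustrative only,
not the calcru1() source):

#include <stdint.h>
#include <sys/time.h>

/*
 * On i386 each 64-bit '/' and '%' below becomes a separate libcall
 * (__udivdi3 and __umoddi3), so essentially the same division is
 * performed twice.
 */
static void
usec_to_timeval_sketch(uint64_t usec, struct timeval *tv)
{

	tv->tv_sec = usec / 1000000;
	tv->tv_usec = usec % 1000000;
}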
>
> >>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> >>> index 2656fb4d22f..2e28f872229 100644
> >>> --- a/sys/kern/kern_tc.c
> >>> +++ b/sys/kern/kern_tc.c
> >>> ...
> >>> @@ -351,17 +352,44 @@ fbclock_getmicrotime(struct timeval *tvp)
> >>> } while (gen == 0 || gen != th->th_generation);
> >>> }
> >>> #else /* !FFCLOCK */
> >>> +
> >>> +static void
> >>> +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> >>> +{
> >>> + uint64_t x;
> >>> +
> >>> + x = (*scale >> 32) * delta;
> >>> + *scale &= 0xffffffff;
> >>> + bt->sec += x >> 32;
> >>> + bintime_addx(bt, x << 32);
> >>> +}
> >>
> >> It is probably best to not inline the slow path, but clang tends to
> >> inline everything anyway.
> > It does not matter if it inlines it, as long as it is moved out of the
> > linear sequence for the fast path.
> >>
> >> I prefer my way of writing this in 3 lines. Modifying 'scale' for
> >> the next step is especially ugly and pessimal when the next step is
> >> in the caller and this function is not inlined.
> > Can you show exactly what you want?
>
> Just write 'scale & 0xffffffff' for the low bits of 'scale' in callers,
> and don't pass 'scale' indirectly to bintime_helper() and don't modify
> it there.
>
> Oops, there is a problem. 'scale' must be reduced iff bintime_helper()
> was used. Duplicate some source code so as to not need a fall-through
> to the fast path. See below.
Yes, this is the reason why it is passed by pointer (C has no references).
>
> >>> void
> >>> binuptime(struct bintime *bt)
> >>> {
> >>> struct timehands *th;
> >>> - u_int gen;
> >>> + uint64_t scale;
> >>> + u_int delta, gen;
> >>>
> >>> do {
> >>> th = timehands;
> >>> gen = atomic_load_acq_int(&th->th_generation);
> >>> *bt = th->th_offset;
> >>> - bintime_addx(bt, th->th_scale * tc_delta(th));
> >>> + scale = th->th_scale;
> >>> + delta = tc_delta(th);
> >>> +#ifdef _LP64
> >>> + /* Avoid overflow for scale * delta. */
> >>> + if (__predict_false(th->th_large_delta <= delta))
> >>> + bintime_helper(bt, &scale, delta);
> >>> + bintime_addx(bt, scale * delta);
> >>> +#else
> >>> + /*
> >>> + * Also avoid (uint64_t, uint32_t) -> uint64_t
> >>> + * multiplication on 32bit arches.
> >>> + */
> >>
> >> "Also avoid overflow for ..."
> >>
> >>> + bintime_helper(bt, &scale, delta);
> >>> + bintime_addx(bt, (u_int)scale * delta);
> >>
> >> The cast should be to uint32_t, but better write it as & 0xffffffff as
> >> elsewhere.
>
> This is actually very broken. The cast gives a 32 x 32 -> 32 bit
> multiplication, but all 64 bits of the result are needed.
Yes, fixed in the updated version.
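A tiny standalone example of the truncation in question (illustrative
values only):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t scale = 0x80000001ULL;		/* arbitrary example */
	uint32_t delta = 0x10;

	/* 32 x 32 -> 32 bits: the high bits of the product are lost. */
	printf("%#x\n", (uint32_t)scale * delta);
	/* 32 x 32 -> 64 bits: all 64 bits are kept. */
	printf("%#jx\n", (uintmax_t)((uint64_t)(uint32_t)scale * delta));
	return (0);
}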
>
> >>
> >> bintime_helper() already reduced 'scale' to 32 bits. The cast might be
> >> needed to tell the compiler this, especially when the function is not
> >> inlined. Better not do it in the function. The function doesn't even
> >> use the reduced value.
> > I used the cast to get a 32x32 multiplication.  I am not sure that all
> > (or any) compilers are smart enough to deduce that they can use a 32-bit mul.
>
> Writing the reduction to 32 bits using a mask instead of a cast automatically
> avoids the bug, but might not give the optimization.
>
> They do do this optimization, but might need the cast as well as the mask.
> At worst, '(uint64_t)(uint32_t)(scale & 0xffffffff)', where the mask is
> now redundant but the cast back to 64 bits is needed if the cast to 32
> bits is used.
>
> You already depended on them not needing the cast for the expression
> '(*scale >> 32) * delta'. Here delta is 32 bits and the other operand
> must remain 64 bits so that after default promotions the multiplication
> is 64 x 64 -> 64 bits, but the compiler should optimize this to
> 32 x 32 -> 64 bits. (*scale >> 32) would need to be cast to 32 bits
> and then back to 64 bits if the compiler can't do this automatically.
>
> I checked what some compilers do. Both gcc-3.3.3 and gcc-4.2.1
> optimize only (uint64_t)x * y (where x and y have type uint32_t), so they
> > need to be helped by casts if x and y have a larger type even if
> their values obviously fit in 32 bits. So the expressions should be
> written as:
>
> (uint64_t)(uint32_t)(scale >> 32) * delta;
>
> and
>
> (uint64_t)(uint32_t)scale * delta;
>
> The 2 casts are always needed, but the '& 0xffffffff' operation doesn't
> need to be explicit because the cast does.
This is what I do now.
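For concreteness, a standalone sketch of the split multiplication with
those casts (illustrative only; mybintime, my_addx and scale_delta_sketch
are stand-ins for struct bintime, bintime_addx() and the kernel code):

#include <stdint.h>

struct mybintime {
	uint64_t sec;
	uint64_t frac;
};

/* Add a 64-bit fraction, carrying into seconds on wrap. */
static void
my_addx(struct mybintime *bt, uint64_t x)
{
	uint64_t old;

	old = bt->frac;
	bt->frac += x;
	if (bt->frac < old)
		bt->sec++;
}

/*
 * Both halves of 'scale' are multiplied by the 32-bit delta using
 * 32 x 32 -> 64 bit multiplies, so no 64 x 64 (or 64 x 32) multiply
 * is needed on 32-bit arches.
 */
static void
scale_delta_sketch(struct mybintime *bt, uint64_t scale, uint32_t delta)
{
	uint64_t x;

	x = (uint64_t)(uint32_t)(scale >> 32) * delta;
	bt->sec += x >> 32;			/* whole seconds */
	my_addx(bt, x << 32);			/* high partial product */
	my_addx(bt, (uint64_t)(uint32_t)scale * delta);	/* low partial product */
}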
>
> >> This needs lots of testing of course.
> >
> > Current kernel-only part of the change is below, see the question about
> > your preference for binuptime_helper().
> >
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 2656fb4d22f..6c41ab22288 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> > @@ -72,6 +71,7 @@ struct timehands {
> > struct timecounter *th_counter;
> > int64_t th_adjustment;
> > uint64_t th_scale;
> > + uint64_t th_large_delta;
> > u_int th_offset_count;
> > struct bintime th_offset;
> > struct bintime th_bintime;
> > @@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
> > } while (gen == 0 || gen != th->th_generation);
> > }
> > #else /* !FFCLOCK */
> > +
> > +static void
>
> Add __inline. This is in the fast path for 32-bit systems.
Compilers do not need this hand-holding, and I prefer to avoid __inline
unless really necessary. I checked with both clang 7.0 and gcc 8.3
that autoinlining did occur.
>
> > +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> > +{
> > + uint64_t x;
> > +
> > + x = (*scale >> 32) * delta;
> > + *scale &= 0xffffffff;
>
> Remove the '*' on scale, cast (scale >> 32) to
> (uint64_t)(uint32_t)(scale >> 32), and remove the change to *scale.
>
> > + bt->sec += x >> 32;
> > + bintime_addx(bt, x << 32);
> > +}
> > +
> > void
> > binuptime(struct bintime *bt)
> > {
> > struct timehands *th;
> > - u_int gen;
> > + uint64_t scale;
> > + u_int delta, gen;
> >
> > do {
> > th = timehands;
> > gen = atomic_load_acq_int(&th->th_generation);
> > *bt = th->th_offset;
> > - bintime_addx(bt, th->th_scale * tc_delta(th));
> > + scale = th->th_scale;
> > + delta = tc_delta(th);
> > +#ifdef _LP64
> > + /* Avoid overflow for scale * delta. */
> > + if (__predict_false(th->th_large_delta <= delta))
> > + bintime_helper(bt, &scale, delta);
> > + bintime_addx(bt, scale * delta);
>
> Change to:
>
> if (__predict_false(th->th_large_delta <= delta)) {
> bintime_helper(bt, scale, delta);
> bintime_addx(bt, (scale & 0xffffffff) * delta);
> } else
> bintime_addx(bt, scale * delta);
I do not like it, but ok.
>
> > +#else
> > + /*
> > + * Avoid both overflow as above and
> > + * (uint64_t, uint32_t) -> uint64_t
> > + * multiplication on 32bit arches.
> > + */
>
> This is a bit unclear. Better emphasize avoidance of the 64 x 32 -> 64 bit
> multiplication. Something like:
>
> /*
> * Use bintime_helper() unconditionally, since the fast
> * path in the above method is not so fast here, since
> * the 64 x 32 -> 64 bit multiplication is usually not
> * available in hardware and emulating it using 2
> * 32 x 32 -> 64 bit multiplications uses code much
> * like that in bintime_helper().
> */
>
> > + bintime_helper(bt, &scale, delta);
> > + bintime_addx(bt, (uint32_t)scale * delta);
> > +#endif
>
> Remove '&' as usual, and fix this by casting the reduced scale back to
> 64 bits.
>
> Similarly in bintime().
I merged the two functions, finally.  Having to copy the same code is too
annoying for this change.
So I verified that:
- there is no 64-bit multiplication in the generated code for i386, with
  both clang 7.0 and gcc 8.3;
- everything is inlined; the only call from bintime()/binuptime() is the
  indirect call to get the timecounter value.
>
> Similarly in libc -- don't use the slow flsll() method in the 32-bit
> case where it is especially slow. Don't use it in the 64-bit case either,
> since this would need to be changed when th_large_delta is added to the
> API.
>
> Now I don't like my method in the kernel.  It is unnecessarily
> complicated to have a special case, and not faster either.
diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 2656fb4d22f..0fd39e25058 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -72,6 +72,7 @@ struct timehands {
struct timecounter *th_counter;
int64_t th_adjustment;
uint64_t th_scale;
+ uint64_t th_large_delta;
u_int th_offset_count;
struct bintime th_offset;
struct bintime th_bintime;
@@ -351,21 +352,63 @@ fbclock_getmicrotime(struct timeval *tvp)
} while (gen == 0 || gen != th->th_generation);
}
#else /* !FFCLOCK */
-void
-binuptime(struct bintime *bt)
+
+static void
+bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
+{
+ uint64_t x;
+
+ x = (scale >> 32) * delta;
+ bt->sec += x >> 32;
+ bintime_addx(bt, x << 32);
+}
+
+static void
+binnouptime(struct bintime *bt, u_int off)
{
struct timehands *th;
- u_int gen;
+ struct bintime *bts;
+ uint64_t scale;
+ u_int delta, gen;
do {
th = timehands;
gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_offset;
- bintime_addx(bt, th->th_scale * tc_delta(th));
+ bts = (struct bintime *)((vm_offset_t)th + off);
+ *bt = *bts;
+ scale = th->th_scale;
+ delta = tc_delta(th);
+#ifdef _LP64
+ if (__predict_false(th->th_large_delta <= delta)) {
+ /* Avoid overflow for scale * delta. */
+ bintime_helper(bt, scale, delta);
+ bintime_addx(bt, (scale & 0xffffffff) * delta);
+ } else {
+ bintime_addx(bt, scale * delta);
+ }
+#else
+ /*
+ * Use bintime_helper() unconditionally, since the fast
+ * path in the above method is not so fast here, since
+ * the 64 x 32 -> 64 bit multiplication is usually not
+ * available in hardware and emulating it using 2
+ * 32 x 32 -> 64 bit multiplications uses code much
+ * like that in bintime_helper().
+ */
+ bintime_helper(bt, scale, delta);
+ bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
+#endif
atomic_thread_fence_acq();
} while (gen == 0 || gen != th->th_generation);
}
+void
+binuptime(struct bintime *bt)
+{
+
+ binnouptime(bt, __offsetof(struct timehands, th_offset));
+}
+
void
nanouptime(struct timespec *tsp)
{
@@ -387,16 +430,8 @@ microuptime(struct timeval *tvp)
void
bintime(struct bintime *bt)
{
- struct timehands *th;
- u_int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_bintime;
- bintime_addx(bt, th->th_scale * tc_delta(th));
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ binnouptime(bt, __offsetof(struct timehands, th_bintime));
}
void
@@ -1464,6 +1499,7 @@ tc_windup(struct bintime *new_boottimebin)
scale += (th->th_adjustment / 1024) * 2199;
scale /= th->th_counter->tc_frequency;
th->th_scale = scale * 2;
+ th->th_large_delta = ((uint64_t)1 << 63) / scale;
/*
* Now that the struct timehands is again consistent, set the new
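As a sanity check of the threshold above (a standalone sketch, not part of
the patch): with th_scale == scale * 2, th_scale * delta still fits in 64
bits for every delta below 2^63 / scale, which is exactly the value stored
in th_large_delta.

#include <assert.h>
#include <stdint.h>

/*
 * Sketch only: 'scale' is the intermediate value computed in tc_windup()
 * before the final doubling, assumed nonzero and below 2^63.
 */
static void
check_large_delta_bound(uint64_t scale)
{
	uint64_t th_scale, large_delta;

	th_scale = scale * 2;
	large_delta = ((uint64_t)1 << 63) / scale;

	/* The largest delta still taking the fast path must not overflow. */
	assert(large_delta - 1 <= UINT64_MAX / th_scale);
}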