powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Konstantin Belousov
kostikbel at gmail.com
Sun Mar 3 16:16:47 UTC 2019
On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>
> > On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> >> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>
> >>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> > ...
> >>>> I don't like changing this at all.  binuptime() was carefully written
> >>>> to not need so much 64-bit arithmetic.
> >>>>
> >>>> If this pessimization is allowed, then it can also handle 64-bit
> >>>> deltas. Using the better kernel method:
> >>>>
> >>>> if (__predict_false(delta >= th->th_large_delta)) {
> >>>> bt->sec += (scale >> 32) * (delta >> 32);
> >>>> x = (scale >> 32) * (delta & 0xffffffff);
> >>>> bt->sec += x >> 32;
> >>>> bintime_addx(bt, x << 32);
> >>>> x = (scale & 0xffffffff) * (delta >> 32);
> >>>> bt->sec += x >> 32;
> >>>> bintime_addx(bt, x << 32);
> >>>> bintime_addx(bt, (scale & 0xffffffff) *
> >>>> (delta & 0xffffffff));
> >>>> } else
> >>>> bintime_addx(bt, scale * (delta & 0xffffffff));
> >>> This only makes sense if delta is extended to uint64_t, which requires
> >>> the pass over timecounters.
> >>
> >> Yes, that was its point. It is a bit annoying to have a hardware
> >> timecounter like the TSC that doesn't wrap naturally, but then make it
> >> wrap by masking high bits.
> >>
> >> The masking step is also a bit wasteful. For the TSC, it is 1 step to
> >> discard high bits at the register level, then another step to apply the
> >> mask to discard the high bits again.
> > rdtsc-low is implemented in the natural way: after RDTSC, no register
> > combining into a 64-bit value is done; instead shrd operates on %edx:%eax
> > to get the final result into %eax.  I am not sure what you refer to.
>
> I was referring mostly to the masking step '& tc->tc_counter_mask' and
> the lack of register combining in rdtsc().
>
> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
> step. i386 used to be faster here -- the first masking step of discarding
> %edx doesn't take any code. amd64 has to mask out the top bits in %rax.
> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
> has to do a not so slow shr.
i386 cannot discard %edx after RDTSC since some bits from %edx come into
the timecounter value.
amd64 cannot either, but amd64 does not need to mask out the top bits in
%rax, since the whole shrdl calculation occurs in 32-bit registers and the
result is left in %eax; the top 32 bits of %rax are cleared automatically
because shrdl writes a 32-bit register.  But the clearing is not required
anyway, since the result is an unsigned int.
Disassembly of tsc_get_timecount_low() is very clear:
0xffffffff806767e4 <+4>: mov 0x30(%rdi),%ecx
0xffffffff806767e7 <+7>: rdtsc
0xffffffff806767e9 <+9>: shrd %cl,%edx,%eax
...
0xffffffff806767ed <+13>: retq
(I removed frame manipulations).
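For comparison, here is a minimal userland sketch of the same rdtsc-low
read (an illustration only, not the kernel source; the 'shift' argument
stands in for the value loaded from the timecounter private data):

#include <stdint.h>

/*
 * Sketch: shrd shifts the 64-bit %edx:%eax pair right by 'shift' and
 * leaves the low 32 bits of the result in %eax, so no 64-bit register
 * combining or extra masking is needed.
 */
static inline uint32_t
rdtsc_low_sketch(int shift)
{
	uint32_t rv;

	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
	    : "=a" (rv) : "c" (shift) : "edx");
	return (rv);
}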
>
> Then the '& tc->tc_counter_mask' step has no effect.
This is true.
>
> All this is wrapped in many layers of function calls which are quite slow
> but this lets the other operations run in parallel on some CPUs.
>
> >>>> /* 32-bit arches did the next multiplication implicitly. */
> >>>> x = (scale >> 32) * delta;
> >>>> /*
> >>>> * And they did the following shifts and most of the adds
> >>>> * implicitly too. Except shifting x left by 32 lost the
> >>>> * seconds part that the next line handles. The next line
> >>>> * is the only extra cost for them.
> >>>> */
> >>>> bt->sec += x >> 32;
> >>>> bintime_addx(bt, (x << 32) + (scale & 0xffffffff) * delta);
> >>>
> >>> Ok, what about the following.
> >>
> >> I'm not sure that I really want this, even if the pessimization is done.
> >> But it avoids using fls*(), so is especially good for 32-bit systems and
> >> OK for 64-bit systems too, especially in userland where fls*() is in the
> >> fast path.
> > For userland I looked at the generated code, and BSR usage seems to be
> > good enough for the default compilation settings with clang.
>
> I use gcc-4.2.1, and it doesn't do this optimization.
>
> I already reported this in connection with fixing calcru1(). calcru1()
> is unnecessarily several times slower on i386 than on amd64 even after
> avoiding using flsll() on it. The main slowness is in converting 'usec'
> to tv_sec and tv_usec, due to the bad design and implementation of the
> __udivdi3 and __umoddi3 libcalls. The bad design is having to make 2
> libcalls to get the quotient and remainder. The bad implementation is
> the portable C version in libkern. libgcc provides a better implementation,
> but this is not available in the kernel.
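A minimal sketch of the conversion being discussed (illustrative only,
not the calcru1() source):

#include <stdint.h>
#include <sys/time.h>

/*
 * On i386 each 64-bit '/' and '%' below becomes a separate libcall
 * (__udivdi3 and __umoddi3), so essentially the same division is
 * performed twice.
 */
static void
usec_to_timeval_sketch(uint64_t usec, struct timeval *tv)
{

	tv->tv_sec = usec / 1000000;
	tv->tv_usec = usec % 1000000;
}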
>
> >>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> >>> index 2656fb4d22f..2e28f872229 100644
> >>> --- a/sys/kern/kern_tc.c
> >>> +++ b/sys/kern/kern_tc.c
> >>> ...
> >>> @@ -351,17 +352,44 @@ fbclock_getmicrotime(struct timeval *tvp)
> >>> } while (gen == 0 || gen != th->th_generation);
> >>> }
> >>> #else /* !FFCLOCK */
> >>> +
> >>> +static void
> >>> +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> >>> +{
> >>> + uint64_t x;
> >>> +
> >>> + x = (*scale >> 32) * delta;
> >>> + *scale &= 0xffffffff;
> >>> + bt->sec += x >> 32;
> >>> + bintime_addx(bt, x << 32);
> >>> +}
> >>
> >> It is probably best to not inline the slow path, but clang tends to
> >> inline everything anyway.
> > It does not matter if it inlines it, as long as it is moved out of the
> > linear sequence for the fast path.
> >>
> >> I prefer my way of writing this in 3 lines. Modifying 'scale' for
> >> the next step is especially ugly and pessimal when the next step is
> >> in the caller and this function is not inlined.
> > Can you show exactly what you want?
>
> Just write 'scale & 0xffffffff' for the low bits of 'scale' in callers,
> and don't pass 'scale' indirectly to bintime_helper() and don't modify
> it there.
>
> Oops, there is a problem. 'scale' must be reduced iff bintime_helper()
> was used. Duplicate some source code so as to not need a fall-through
> to the fast path. See below.
Yes, this is the reason why it is passed by pointer (C has no references).
>
> >>> void
> >>> binuptime(struct bintime *bt)
> >>> {
> >>> struct timehands *th;
> >>> - u_int gen;
> >>> + uint64_t scale;
> >>> + u_int delta, gen;
> >>>
> >>> do {
> >>> th = timehands;
> >>> gen = atomic_load_acq_int(&th->th_generation);
> >>> *bt = th->th_offset;
> >>> - bintime_addx(bt, th->th_scale * tc_delta(th));
> >>> + scale = th->th_scale;
> >>> + delta = tc_delta(th);
> >>> +#ifdef _LP64
> >>> + /* Avoid overflow for scale * delta. */
> >>> + if (__predict_false(th->th_large_delta <= delta))
> >>> + bintime_helper(bt, &scale, delta);
> >>> + bintime_addx(bt, scale * delta);
> >>> +#else
> >>> + /*
> >>> + * Also avoid (uint64_t, uint32_t) -> uint64_t
> >>> + * multiplication on 32bit arches.
> >>> + */
> >>
> >> "Also avoid overflow for ..."
> >>
> >>> + bintime_helper(bt, &scale, delta);
> >>> + bintime_addx(bt, (u_int)scale * delta);
> >>
> >> The cast should be to uint32_t, but better write it as & 0xffffffff as
> >> elsewhere.
>
> This is actually very broken. The cast gives a 32 x 32 -> 32 bit
> multiplication, but all 64 bits of the result are needed.
Yes, fixed in the updated version.
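A tiny standalone example of the truncation in question (illustrative
values only):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t scale = 0x80000001ULL;		/* arbitrary example */
	uint32_t delta = 0x10;

	/* 32 x 32 -> 32 bits: the high bits of the product are lost. */
	printf("%#x\n", (uint32_t)scale * delta);
	/* 32 x 32 -> 64 bits: all 64 bits are kept. */
	printf("%#jx\n", (uintmax_t)((uint64_t)(uint32_t)scale * delta));
	return (0);
}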
>
> >>
> >> bintime_helper() already reduced 'scale' to 32 bits. The cast might be
> >> needed to tell the compiler this, especially when the function is not
> >> inlined. Better not do it in the function. The function doesn't even
> >> use the reduced value.
> > I used the cast to get a 32x32 multiplication.  I am not sure that all
> > (or any) compilers are smart enough to deduce that they can use a 32-bit mul.
>
> Writing the reduction to 32 bits using a mask instead of a cast automatically
> avoids the bug, but might not give the optimization.
>
> They do do this optimization, but might need the cast as well as the mask.
> At worst, '(uint64_t)(uint32_t)(scale & 0xffffffff)', where the mask is
> now redundant but the cast back to 64 bits is needed if the cast to 32
> bits is used.
>
> You already depended on them not needing the cast for the expression
> '(*scale >> 32) * delta'. Here delta is 32 bits and the other operand
> must remain 64 bits so that after default promotions the multiplication
> is 64 x 64 -> 64 bits, but the compiler should optimize this to
> 32 x 32 -> 64 bits. (*scale >> 32) would need to be cast to 32 bits
> and then back to 64 bits if the compiler can't do this automatically.
>
> I checked what some compilers do. Both gcc-3.3.3 and gcc-4.2.1
> optimize only (uint64_t)x * y (where x and y have type uint32_t), so they
> > need to be helped by casts if x and y have a larger type even if
> their values obviously fit in 32 bits. So the expressions should be
> written as:
>
> (uint64_t)(uint32_t)(scale >> 32) * delta;
>
> and
>
> (uint64_t)(uint32_t)scale * delta;
>
> The 2 casts are always needed, but the '& 0xffffffff' operation doesn't
> need to be explicit because the cast does.
This is what I do now.
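For concreteness, a standalone sketch of the split multiplication with
those casts (illustrative only; mybintime, my_addx and scale_delta_sketch
are stand-ins for struct bintime, bintime_addx() and the kernel code):

#include <stdint.h>

struct mybintime {
	uint64_t sec;
	uint64_t frac;
};

/* Add a 64-bit fraction, carrying into seconds on wrap. */
static void
my_addx(struct mybintime *bt, uint64_t x)
{
	uint64_t old;

	old = bt->frac;
	bt->frac += x;
	if (bt->frac < old)
		bt->sec++;
}

/*
 * Both halves of 'scale' are multiplied by the 32-bit delta using
 * 32 x 32 -> 64 bit multiplies, so no 64 x 64 (or 64 x 32) multiply
 * is needed on 32-bit arches.
 */
static void
scale_delta_sketch(struct mybintime *bt, uint64_t scale, uint32_t delta)
{
	uint64_t x;

	x = (uint64_t)(uint32_t)(scale >> 32) * delta;
	bt->sec += x >> 32;			/* whole seconds */
	my_addx(bt, x << 32);			/* high partial product */
	my_addx(bt, (uint64_t)(uint32_t)scale * delta);	/* low partial product */
}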
>
> >> This needs lots of testing of course.
> >
> > Current kernel-only part of the change is below, see the question about
> > your preference for binuptime_helper().
> >
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 2656fb4d22f..6c41ab22288 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> > @@ -72,6 +71,7 @@ struct timehands {
> > struct timecounter *th_counter;
> > int64_t th_adjustment;
> > uint64_t th_scale;
> > + uint64_t th_large_delta;
> > u_int th_offset_count;
> > struct bintime th_offset;
> > struct bintime th_bintime;
> > @@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
> > } while (gen == 0 || gen != th->th_generation);
> > }
> > #else /* !FFCLOCK */
> > +
> > +static void
>
> Add __inline. This is in the fast path for 32-bit systems.
Compilers do not need this hand-holding, and I prefer to avoid __inline
unless really necessary. I checked with both clang 7.0 and gcc 8.3
that autoinlining did occur.
>
> > +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> > +{
> > + uint64_t x;
> > +
> > + x = (*scale >> 32) * delta;
> > + *scale &= 0xffffffff;
>
> Remove the '*' on scale, cast (scale >> 32) to
> (uint64_t)(uint32_t)(scale >> 32), and remove the change to *scale.
>
> > + bt->sec += x >> 32;
> > + bintime_addx(bt, x << 32);
> > +}
> > +
> > void
> > binuptime(struct bintime *bt)
> > {
> > struct timehands *th;
> > - u_int gen;
> > + uint64_t scale;
> > + u_int delta, gen;
> >
> > do {
> > th = timehands;
> > gen = atomic_load_acq_int(&th->th_generation);
> > *bt = th->th_offset;
> > - bintime_addx(bt, th->th_scale * tc_delta(th));
> > + scale = th->th_scale;
> > + delta = tc_delta(th);
> > +#ifdef _LP64
> > + /* Avoid overflow for scale * delta. */
> > + if (__predict_false(th->th_large_delta <= delta))
> > + bintime_helper(bt, &scale, delta);
> > + bintime_addx(bt, scale * delta);
>
> Change to:
>
> if (__predict_false(th->th_large_delta <= delta)) {
> bintime_helper(bt, scale, delta);
> bintime_addx(bt, (scale & 0xffffffff) * delta);
> } else
> bintime_addx(bt, scale * delta);
I do not like it, but ok.
>
> > +#else
> > + /*
> > + * Avoid both overflow as above and
> > + * (uint64_t, uint32_t) -> uint64_t
> > + * multiplication on 32bit arches.
> > + */
>
> This is a bit unclear. Better emphasize avoidance of the 64 x 32 -> 64 bit
> multiplication. Something like:
>
> /*
> * Use bintime_helper() unconditionally, since the fast
> * path in the above method is not so fast here, since
> * the 64 x 32 -> 64 bit multiplication is usually not
> * available in hardware and emulating it using 2
> * 32 x 32 -> 64 bit multiplications uses code much
> * like that in bintime_helper().
> */
>
> > + bintime_helper(bt, &scale, delta);
> > + bintime_addx(bt, (uint32_t)scale * delta);
> > +#endif
>
> Remove '&' as usual, and fix this by casting the reduced scale back to
> 64 bits.
>
> Similarly in bintime().
I merged the two functions, finally.  Having to copy the same code is too
annoying for this change.
So I verified that:
- there is no 64-bit multiplication in the generated code for i386, with
  both clang 7.0 and gcc 8.3;
- everything is inlined; the only call from bintime()/binuptime() is the
  indirect call to get the timecounter value.
>
> Similarly in libc -- don't use the slow flsll() method in the 32-bit
> case where it is especially slow. Don't use it in the 64-bit case either,
> since this would need to be changed when th_large_delta is added to the
> API.
>
> Now I don't like my method in the kernel.  It is unnecessarily
> complicated to have a special case, and not faster either.
diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 2656fb4d22f..0fd39e25058 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -72,6 +72,7 @@ struct timehands {
struct timecounter *th_counter;
int64_t th_adjustment;
uint64_t th_scale;
+ uint64_t th_large_delta;
u_int th_offset_count;
struct bintime th_offset;
struct bintime th_bintime;
@@ -351,21 +352,63 @@ fbclock_getmicrotime(struct timeval *tvp)
} while (gen == 0 || gen != th->th_generation);
}
#else /* !FFCLOCK */
-void
-binuptime(struct bintime *bt)
+
+static void
+bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
+{
+ uint64_t x;
+
+ x = (scale >> 32) * delta;
+ bt->sec += x >> 32;
+ bintime_addx(bt, x << 32);
+}
+
+static void
+binnouptime(struct bintime *bt, u_int off)
{
struct timehands *th;
- u_int gen;
+ struct bintime *bts;
+ uint64_t scale;
+ u_int delta, gen;
do {
th = timehands;
gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_offset;
- bintime_addx(bt, th->th_scale * tc_delta(th));
+ bts = (struct bintime *)((vm_offset_t)th + off);
+ *bt = *bts;
+ scale = th->th_scale;
+ delta = tc_delta(th);
+#ifdef _LP64
+ if (__predict_false(th->th_large_delta <= delta)) {
+ /* Avoid overflow for scale * delta. */
+ bintime_helper(bt, scale, delta);
+ bintime_addx(bt, (scale & 0xffffffff) * delta);
+ } else {
+ bintime_addx(bt, scale * delta);
+ }
+#else
+ /*
+ * Use bintime_helper() unconditionally, since the fast
+ * path in the above method is not so fast here, since
+ * the 64 x 32 -> 64 bit multiplication is usually not
+ * available in hardware and emulating it using 2
+ * 32 x 32 -> 64 bit multiplications uses code much
+ * like that in bintime_helper().
+ */
+ bintime_helper(bt, scale, delta);
+ bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
+#endif
atomic_thread_fence_acq();
} while (gen == 0 || gen != th->th_generation);
}
+void
+binuptime(struct bintime *bt)
+{
+
+ binnouptime(bt, __offsetof(struct timehands, th_offset));
+}
+
void
nanouptime(struct timespec *tsp)
{
@@ -387,16 +430,8 @@ microuptime(struct timeval *tvp)
void
bintime(struct bintime *bt)
{
- struct timehands *th;
- u_int gen;
- do {
- th = timehands;
- gen = atomic_load_acq_int(&th->th_generation);
- *bt = th->th_bintime;
- bintime_addx(bt, th->th_scale * tc_delta(th));
- atomic_thread_fence_acq();
- } while (gen == 0 || gen != th->th_generation);
+ binnouptime(bt, __offsetof(struct timehands, th_bintime));
}
void
@@ -1464,6 +1499,7 @@ tc_windup(struct bintime *new_boottimebin)
scale += (th->th_adjustment / 1024) * 2199;
scale /= th->th_counter->tc_frequency;
th->th_scale = scale * 2;
+ th->th_large_delta = ((uint64_t)1 << 63) / scale;
/*
* Now that the struct timehands is again consistent, set the new
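As a sanity check of the threshold above (a standalone sketch, not part of
the patch): with th_scale == scale * 2, th_scale * delta still fits in 64
bits for every delta below 2^63 / scale, which is exactly the value stored
in th_large_delta.

#include <assert.h>
#include <stdint.h>

/*
 * Sketch only: 'scale' is the intermediate value computed in tc_windup()
 * before the final doubling, assumed nonzero and below 2^63.
 */
static void
check_large_delta_bound(uint64_t scale)
{
	uint64_t th_scale, large_delta;

	th_scale = scale * 2;
	large_delta = ((uint64_t)1 << 63) / scale;

	/* The largest delta still taking the fast path must not overflow. */
	assert(large_delta - 1 <= UINT64_MAX / th_scale);
}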