powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patch failed]
Bruce Evans
brde at optusnet.com.au
Fri Mar 1 20:38:33 UTC 2019
On Fri, 1 Mar 2019, Konstantin Belousov wrote:
> On Sat, Mar 02, 2019 at 05:40:58AM +1100, Bruce Evans wrote:
>> On Fri, 1 Mar 2019, Konstantin Belousov wrote:
>>
>>> On Thu, Feb 28, 2019 at 11:39:06PM -0800, Mark Millard wrote:
>>>> [The new, trial code also has truncation occurring.]
>>> In fact no, I do not think it does.
>>>
>>>> An example observation of diff_scaled having an overflowed
>>>> value was:
>>>>
>>>> scale_factor == 0x80da2067ac
>>>> scale_factor*freq overflows the unsigned 64-bit representation.
>>>> tim_offset == 0x3da0eaeb
>>>> tim_cnt == 0x42dea3c4
>>>> tim_diff == 0x53db8d9
>>>> For reference: 0x1fc9d43 == 0xffffffffffffffffull/scale_factor
>>>> scaled_diff == 0xA353A5BF3FF780CC (truncated to 64 bits)
>>>>
>>>> But for the new, trial code:
>>>>
>>>> 0x80da2067ac is 40 bits
>>>> 0x53db8d9 is 27 bits
>>>> So 67 bits, more than 63. Then:
>>>>
>>>> x
>>>> == (0x80da2067ac>>32) * 0x53db8d9
>>>> == 0x80 * 0x53db8d9
>>>> == 0x29EDC6C80
>>>>
>>>> x>>32
>>>> == 0x2
>>>>
>>>> x<<32
>>>> == 0x9EDC6C8000000000 (limited to 64 bits)
>>>> Note the truncation of: 0x29EDC6C8000000000.
>>> Right, this is how the patch is supposed to work. Note that the overflow
>>> bits 'lost' due to overflow of the left shift are the same bits that are
>>> used to increment bt->sec:
>>> bt->sec += x >> 32;
>>> So the 2 seconds are accounted for.
A 67-bit product is 4-8 seconds (the units are 2**-64 seconds), and it is
an error for the adjustment to be >= 1 second.
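For concreteness, the arithmetic above can be reproduced standalone.  A
minimal sketch using the values from Mark's report (not the libc code;
th_scale is in units of 2**-64 seconds per counter tick):

	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		uint64_t scale = 0x80da2067acULL;	/* th_scale, 40 bits */
		uint32_t delta = 0x53db8d9;		/* tim_diff, 27 bits */
		uint64_t x;

		x = (scale >> 32) * delta;		/* 0x80 * 0x53db8d9 */
		printf("x       = %#jx\n", (uintmax_t)x);	   /* 0x29edc6c80 */
		printf("x >> 32 = %#jx\n", (uintmax_t)(x >> 32)); /* 0x2 */
		printf("x << 32 = %#jx\n", (uintmax_t)(x << 32)); /* truncated */
		return (0);
	}

The 0x2 shifted out at the top is exactly what the patch adds to bt->sec;
the problem is that a delta worth multiple seconds occurred at all.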
>>>
>>>>
>>>> Thus the "bintime_addx(bt, x << 32)" is still
>>>> based on a truncated value.
>>>
>>> I must admit that 2 seconds of interval where the timehands were
>>> not updated is too much. This might be the real cause of all ppc
>>> troubles. I tried to see if the overflow case is possible on amd64,
>>> and did not get a single case of the '> 63' branch executed during the
>>> /usr/tests/lib/libc run.
>>
>> The algorithm requires the update interval to be less than 1 second.
>> th_scale is 2**64 / tc_frequency, so whatever tc_frequency is, after
>> 1 second the value of the multiplication is approximately 2**64 so
>> it overflows about then (depending on rounding).
>>
>> The most useful timecounters are TSC's, and these give another overflow
>> in tc_delta() after 1 second when their frequency is 4 GHz (except the
>> bogus TSC-low timecounter reduces the frequency to below 2 binary GHz,
>> so the usual case is overflow after 2 seconds).
> As I said, I was unable to trigger the overflow on amd64.
Yes, amd64 doesn't have the bug. It gets tested more with fast TSCs that
overflow after about 1 second for other reasons.
Try amd64 with an ACPI or HPET timecounter. I sometimes do this to avoid
the overflow in tc_delta() while stopped in ddb. This seemed to work.
Actually, it doesn't work since the overflow under discussion still occurs.
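To put numbers on both overflow bounds (a sketch; the ~33.3 MHz timebase
follows from Mark's scale factor, and the 4 GHz TSC is a nominal figure):

	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		uint64_t scale = 0x80da2067acULL;	/* 2**64 / tc_frequency */

		/* th_scale * delta reaches 2**64 at delta ~= tc_frequency: */
		printf("multiplication overflows at delta = %#jx ticks (~1 s)\n",
		    (uintmax_t)(UINT64_MAX / scale));	/* 0x1fc9d43 ~= 33.3e6 */

		/* tc_delta()'s 32-bit counter wraps much later at 33 MHz: */
		printf("tc_delta() wraps after %.2f s at 4 GHz\n",
		    (double)UINT32_MAX / 4e9);		/* ~1.07 */
		printf("tc_delta() wraps after %.2f s at 2 GHz (TSC-low)\n",
		    (double)UINT32_MAX / 2e9);		/* ~2.15 */
		return (0);
	}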
> ...
>>> @@ -78,7 +81,14 @@ binuptime(struct bintime *bt, struct vdso_timekeep *tk, int abs)
>>> continue;
>>> if (error != 0)
>>> return (error);
>>> - bintime_addx(bt, th->th_scale * delta);
>>> + scale = th->th_scale;
>>> + if (__predict_false(fls(scale) + fls(delta) > 63)) {
>>
>> This is unnecessarily pessimal. Updates must be frequent enough to prevent
>> tc_delta() overflowing, and it is even easier to arrange that this
>> multiplication doesn't overflow (since the necessary update interval for
>> the latter is constant).
>>
>> `scale' is 64 bits, so fls(scale) is broken on 32-bit arches, and
>> flsll(scale) is an especially large pessimization. I saw this with my
>> calcru1() fixes -- the flsll()s take almost as long as long long
>> divisions when they use the pessimal C versions.
> Ok, fixed.
It is still very slow.
> Updated patch.
>
> diff --git a/lib/libc/sys/__vdso_gettimeofday.c b/lib/libc/sys/__vdso_gettimeofday.c
> index 3749e0473af..fdefda08e39 100644
> --- a/lib/libc/sys/__vdso_gettimeofday.c
> +++ b/lib/libc/sys/__vdso_gettimeofday.c
> ...
> @@ -78,7 +81,19 @@ binuptime(struct bintime *bt, struct vdso_timekeep *tk, int abs)
> continue;
> if (error != 0)
> return (error);
> - bintime_addx(bt, th->th_scale * delta);
> + scale = th->th_scale;
> +#ifdef _LP64
> + scale_bits = ffsl(scale);
> +#else
> + scale_bits = ffsll(scale);
> +#endif
> + if (__predict_false(scale_bits + fls(delta) > 63)) {
Userland is still fully pessimized, except when the compiler auto-inlines
ffs*().
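An aside on the test itself: the bound needs the highest set bit (the
fls family), not the lowest (ffs).  A hypothetical helper, assuming
FreeBSD's <strings.h> fls()/flsll() (this is not the patch's code):

	#include <stdint.h>
	#include <strings.h>	/* fls(), flsll() */

	/*
	 * True when scale * delta might not fit in 64 bits: a product
	 * of an a-bit and a b-bit value needs at most a + b bits, so
	 * a + b <= 64 can never overflow.
	 */
	static inline int
	scaled_delta_may_overflow(uint64_t scale, uint32_t delta)
	{
		return (flsll(scale) + fls(delta) > 64);
	}

Even inlined, this costs two bit scans per call, which motivates the
precomputed threshold below.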
> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 2656fb4d22f..eedea5183c0 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
> @@ -355,13 +356,22 @@ void
> binuptime(struct bintime *bt)
> {
> struct timehands *th;
> - u_int gen;
> + uint64_t scale, x;
> + u_int delta, gen;
>
> do {
> th = timehands;
> gen = atomic_load_acq_int(&th->th_generation);
> *bt = th->th_offset;
> - bintime_addx(bt, th->th_scale * tc_delta(th));
> + scale = th->th_scale;
> + delta = tc_delta(th);
> + if (__predict_false(th->th_scale_bits + fls(delta) > 63)) {
Better, but this shouldn't be changed (the bug that causes the large
intervals remains unlocated); if it is changed, then it should use:
if (delta >= th->th_large_delta)
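Concretely, the suggested shape would be something like this (a sketch
slotting into binuptime() above; th_large_delta is the hypothetical new
timehands field, and the slow path is the split multiplication discussed
above):

	scale = th->th_scale;
	delta = tc_delta(th);
	if (__predict_false(delta >= th->th_large_delta)) {
		/* 96-bit product: high 32 bits of scale first. */
		x = (scale >> 32) * delta;
		bt->sec += x >> 32;		/* whole seconds */
		bintime_addx(bt, x << 32);	/* their fraction */
		/* Low 32 bits of scale cannot overflow 64 bits: */
		bintime_addx(bt, (scale & 0xffffffff) * delta);
	} else
		bintime_addx(bt, scale * delta);

The fast path then costs one comparison against a precomputed constant
instead of two bit-length computations on every call.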
> @@ -1464,6 +1483,11 @@ tc_windup(struct bintime *new_boottimebin)
> scale += (th->th_adjustment / 1024) * 2199;
> scale /= th->th_counter->tc_frequency;
> th->th_scale = scale * 2;
> +#ifdef _LP64
> + th->th_scale_bits = ffsl(th->th_scale);
> +#else
> + th->th_scale_bits = ffsll(th->th_scale);
> +#endif
th->th_large_delta = ((uint64_t)1 << 63) / scale;
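Here `scale' is the local variable, i.e. th_scale / 2, so the division
is effectively 2**64 / th_scale (the point where th_scale * delta can
first wrap), computed without needing an unrepresentable 1 << 64.  In
context (a sketch of the tc_windup() side):

	scale /= th->th_counter->tc_frequency;
	th->th_scale = scale * 2;
	/* Largest safe delta: th_scale * delta < 2**64 below this. */
	th->th_large_delta = ((uint64_t)1 << 63) / scale;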
Bruce