bug in calcru() in kernel: integer overflow computing user time

Tue Jan 25 07:58:40 PST 2005

hihi, all -

well, i have an "almost" fix for the problem - read on, ...
(this is for discussin before send-pr submission)

description

	in FreeBSD 5.3-RELEASE (and 5.2.1R, and 5.1R),
	in file /usr/src/sys/kern/kern_resource.c,
	lines 657-750 define a function calcru(),
	and there is a bug in that function -

	line 713 in that file is

		 uu = (tu * ut) / tt,

	which is trying to compute a proportion (since ut<=tt) - the problem
	is that for my application, the values of tt and ut are very large and
	the value of the intermediate product overflows, leading to incorrect
	values (i reported the symptoms to freebsd-hackers and a very helpful
	description and localization of the problem to calcru() was provided
	by peter jeremy, who also wrote that it is a very old, but only partly
	known, problem)

repetition

	use time in csh or /usr/bin/time or getrusage() to time any program
	that takes more than a week to run

fix (almost)

	it turns out that this problem can be (partly) fixed by replacing that
	one line above with the following lines:

	/* we know 0 <= ut <= tt and 1 <= tt */
	if (tu >= tt)
		{
		**whatever type they need to be** q, r;
		q = tu / tt;
		r = tu % tt;
		uu = (q * ut) + (r * ut) / tt;
		}
	else	uu = (tu * ut) / tt

	this change does not solve all the arithmetic problems (especially
	when ut is very large), and a similar change is needed for system
	time, computing su in line 714 of the file, but it suffices for me -

	analysis (yup, proof that it should work 8-)) -

	i expect that all of these counters are increasing throughout the life
	of the process -
		tu is total time in microseconds
		ut is user 128hz ticks
		tt is total 128hz ticks
	i assume therefore that
		tu ~ tt * 10^6/128
	strictly because of the clock rates
	(machines with other kinds of clock rate ratios will need a different
	analysis, but the same idea should work there, too) -

	in my case ut ~ tt, so we see overflow in the old computation when
		tu * ut >= 2^64
		tt^2 * 10^6/128 >= 2^64
		tt * 10^3/8*sqrt(2) >= 2^32 ~ 4 * 10^9
		tt >= 32*sqrt(2)*10^6,
	which is about
		sqrt(2)*10^6 / 4 ~ 3.54*10^5 seconds,
	or
			~ 98 hours
	(this is the phenomenon i observed)

	in the new computation offered above, since we know that
		ut <= tt,
	we also know that
		uu <= tu,
	and
		(q * ut) <= uu,
	so there can be no overflow in the (q * ut) term -
	therefore, this changed code will overflow only when r is large
		r ~ tt,
	and then only when
		r * ut >= 2^64
		tt^2 >= 2^64
		tt >= 2^32,
	which is about
			~ 2^25 seconds
			~ 9300 hours
			~ 388 days
	(i can live with that for now)

	for everyone else, it adds the one test in the if statement to every
	call to calcru() (or two, if system time is similarly fixed) -
	<philosophy> instrumentation is costly, and correct instrumentation is
	even more costly, but it's worth it, every time, to get the right
	answer </philosophy>

	i'm about to try it out this week - i will report the results when i
	get some data (which will be a few weeks)

	i'm thinking about how to solve the problem correctly for really
	long-running programs, but it's all pretty special case ugly so far

more later,
cal

Dr. Christopher Landauer
Aerospace Integration Science Center
The Aerospace Corporation, Mail Stop M6/214
P.O.Box 92957
Los Angeles, California 90009-2957, USA
e-mail: cal at aero.org
Phone: +1 (310) 336-1361