bug in calcru() in kernel: integer overflow computing user time
Chris Landauer
cal at rush.aero.org
Tue Jan 25 07:58:40 PST 2005
hihi, all -
well, i have an "almost" fix for the problem - read on, ...
(this is for discussin before send-pr submission)
description
in FreeBSD 5.3-RELEASE (and 5.2.1R, and 5.1R),
in file /usr/src/sys/kern/kern_resource.c,
lines 657-750 define a function calcru(),
and there is a bug in that function -
line 713 in that file is
uu = (tu * ut) / tt,
which is trying to compute a proportion (since ut<=tt) - the problem
is that for my application, the values of tt and ut are very large and
the value of the intermediate product overflows, leading to incorrect
values (i reported the symptoms to freebsd-hackers and a very helpful
description and localization of the problem to calcru() was provided
by peter jeremy, who also wrote that it is a very old, but only partly
known, problem)
repetition
use time in csh or /usr/bin/time or getrusage() to time any program
that takes more than a week to run
fix (almost)
it turns out that this problem can be (partly) fixed by replacing that
one line above with the following lines:
/* we know 0 <= ut <= tt and 1 <= tt */
if (tu >= tt)
{
**whatever type they need to be** q, r;
q = tu / tt;
r = tu % tt;
uu = (q * ut) + (r * ut) / tt;
}
else uu = (tu * ut) / tt
this change does not solve all the arithmetic problems (especially
when ut is very large), and a similar change is needed for system
time, computing su in line 714 of the file, but it suffices for me -
analysis (yup, proof that it should work 8-)) -
i expect that all of these counters are increasing throughout the life
of the process -
tu is total time in microseconds
ut is user 128hz ticks
tt is total 128hz ticks
i assume therefore that
tu ~ tt * 10^6/128
strictly because of the clock rates
(machines with other kinds of clock rate ratios will need a different
analysis, but the same idea should work there, too) -
in my case ut ~ tt, so we see overflow in the old computation when
tu * ut >= 2^64
tt^2 * 10^6/128 >= 2^64
tt * 10^3/8*sqrt(2) >= 2^32 ~ 4 * 10^9
tt >= 32*sqrt(2)*10^6,
which is about
sqrt(2)*10^6 / 4 ~ 3.54*10^5 seconds,
or
~ 98 hours
(this is the phenomenon i observed)
in the new computation offered above, since we know that
ut <= tt,
we also know that
uu <= tu,
and
(q * ut) <= uu,
so there can be no overflow in the (q * ut) term -
therefore, this changed code will overflow only when r is large
r ~ tt,
and then only when
r * ut >= 2^64
tt^2 >= 2^64
tt >= 2^32,
which is about
~ 2^25 seconds
~ 9300 hours
~ 388 days
(i can live with that for now)
for everyone else, it adds the one test in the if statement to every
call to calcru() (or two, if system time is similarly fixed) -
<philosophy> instrumentation is costly, and correct instrumentation is
even more costly, but it's worth it, every time, to get the right
answer </philosophy>
i'm about to try it out this week - i will report the results when i
get some data (which will be a few weeks)
i'm thinking about how to solve the problem correctly for really
long-running programs, but it's all pretty special case ugly so far
more later,
cal
Dr. Christopher Landauer
Aerospace Integration Science Center
The Aerospace Corporation, Mail Stop M6/214
P.O.Box 92957
Los Angeles, California 90009-2957, USA
e-mail: cal at aero.org
Phone: +1 (310) 336-1361
More information about the freebsd-hackers
mailing list