Timekeeping [Was: Re: cvs commit: src/usr.bin/vmstat vmstat.c src/usr.bin/w w.c]

Fri Oct 21 07:23:08 PDT 2005

On Thu, 20 Oct 2005, Poul-Henning Kamp wrote:

> ...
> If people do stupid things like use hard steps (*settime*()) to
> correct rate problems, then they get what they deserve, including
> potentially backwards jumps in time, but the integral over time of
> all steps apart from the first one amounts to a rate correction.

Using *settime*() isn't stupid.  It is always done by ntpdate -b and
sometimes done by ntpd.  (I use ntpd -x to prevent stepping, but -x
shouldn't be used except for debugging since stepping is the best way
to correct large errors, and at least old versions of ntpd are broken
if they would prefer to step but are prevented from doing so by -x.)
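
To illustrate why stepping is the practical way to correct a large
error, here is a rough calculation of how long a pure slew would take.
The 500 ppm maximum slew rate is an assumption (a commonly documented
limit for ntpd and typical kernels, not a figure from this thread), and
the function name is made up for the example.

```python
def slew_duration(offset_seconds, max_slew_ppm=500):
    """Seconds needed to remove offset_seconds by slewing alone.

    max_slew_ppm is an assumed limit, in parts per million.
    """
    return abs(offset_seconds) * 1_000_000 / max_slew_ppm


# One second of error, then a 500 second error expressed in days.
print(slew_duration(1.0))
print(slew_duration(500.0) / 86400)
```

Under that assumption, one second of error takes about 2000 seconds to
slew away, and a 500 second error takes more than 11 days.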

> In summary:  CLOCK_MONOTONIC is our best estimate of how many SI
> seconds the system has been running [3].

Actual testing shows that CLOCK_MONOTONIC, or possibly CLOCK_REALTIME
less the boot time, gives a very bad estimate of how long the system has
been running.  The difference between these clocks was several hundred
seconds (about 400 to 1000) on all systems tested:

% sledge:
%  1:03PM  up 22:45, 1 user, load averages: 0.23, 0.08, 0.02
% uptime 1 81900
% uptime 2 82887
% 
% pluto1:
%  1:05PM  up 15 days, 10:18, 1 user, load averages: 1.28, 1.15, 1.26
% uptime 1 1333090
% uptime 2 1333540
% 
% pluto2:
%  1:06PM  up 10 days,  7:19, 1 user, load averages: 1.95, 1.83, 1.80
% uptime 1 890323
% uptime 2 890721

These are FreeBSD machines.  "uptime 1" is gettimeofday() less boottime.
"uptime 2" is CLOCK_MONOTONIC.  I don't know what root has been doing
to mess up the clocks on these machines.
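
A minimal sketch of how the two values can be obtained, for anyone who
wants to repeat the test.  It is not the program that produced the
output above.  It assumes the caller supplies the recorded boot time in
seconds since the Epoch (on FreeBSD this can be read with
"sysctl kern.boottime"); time.time() stands in for gettimeofday(), and
the function name is hypothetical.

```python
import time


def uptime_estimates(boottime):
    """Return (uptime1, uptime2) in seconds.

    boottime: recorded boot time in seconds since the Epoch,
              supplied by the caller (assumption of this sketch).
    """
    # "uptime 1": current real time less the recorded boot time.
    uptime1 = time.time() - boottime
    # "uptime 2": the monotonic clock, which cannot be stepped.
    uptime2 = time.clock_gettime(time.CLOCK_MONOTONIC)
    return uptime1, uptime2
```

If the two clocks agreed, uptime2 - uptime1 would stay close to zero
however long the system had been up.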

% delplex:
% 11:00PM  up 31 days,  4:37, 2 users, load averages: 0.06, 0.02, 0.00
% uptime 1 2695028
% uptime 2 2695926
% 
% epsplex:
% 11:00PM  up  3:34, 4 users, load averages: 0.00, 0.00, 0.00
% uptime 1 12856
% uptime 2 13390
% 
% besplex:
% 11:01PM  up 26 days,  1:09, 1 user, load averages: 0.00, 0.00, 0.00
% uptime 1 2250584
% uptime 2 2251311

These are my local machines.  Root did a lot of ntpdate -b's on delplex
and besplex when they rebooted after a power failure 26 days ago, but
the steps were much smaller than 500 seconds and there haven't been any
since.  epsplex has the ~500 second difference after not doing any steps
except:

% Oct 21 19:26:59 epsplex kernel: tc_windup: large step 1129922814

Usual step from 0 to year 2005 on startup.

% Oct 21 19:26:59 epsplex kernel: tc_windup: negative step 36000

Usual step by adjkerntz to fix up hardware clock being on local time.
Doesn't affect deltas.

% Oct 21 19:27:01 epsplex kernel: tc_windup: negative step 2

By ntpdate to sync with delplex.

A large, fairly machine-independent difference is hard to explain.  I
will reboot after sending this to see if one of the values is much
larger than the uptime when the uptime is < 60 seconds.
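
For reference, the per-machine differences ("uptime 2" less "uptime 1")
can be worked out from the figures quoted above.  The numbers are
copied from that output; only the subtraction is new.

```python
# (uptime 1, uptime 2) in seconds, copied from the output quoted above.
samples = {
    "sledge": (81900, 82887),
    "pluto1": (1333090, 1333540),
    "pluto2": (890323, 890721),
    "delplex": (2695028, 2695926),
    "epsplex": (12856, 13390),
    "besplex": (2250584, 2251311),
}


def differences(samples):
    """Return CLOCK_MONOTONIC uptime less realtime-based uptime per host."""
    return {host: u2 - u1 for host, (u1, u2) in samples.items()}


for host, diff in differences(samples).items():
    print(f"{host:8s} {diff:5d}")
```

The differences do not grow with uptime: epsplex, up for under 4 hours,
shows a larger difference than pluto1, up for more than 15 days.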

> Given that CLOCK_MONOTONIC is our best guess how long the kernel
> has been running, it follows that CLOCK_REALTIME - CLOCK_MONOTONIC
> must be our best estimate of what time the kernel booted.

Not given, and not true.  After syncing with an accurate external clock
by a step, we know the real time very accurately.  Normally we sync
soon after booting.  Then we know the boot time very accurately (it
is the current real time less CLOCK_MONOTONIC).  Then if we resync
with the external clock later using a step, we again know the real
time very accurately, and our best guess at the uptime is the current
real time less the previously determined boot time (with a non-broken
time_t or difftime() restoring leap seconds).  CLOCK_MONOTONIC cannot
track this because it cannot jump.  You might say that the uptime
cannot jump either.  This is OK, but then it (like CLOCK_MONOTONIC)
should be slewed to catch up with the jump.
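
The argument above can be put into a toy model.  This is only a sketch
under stated assumptions: the unstepped clock runs with a constant
fractional rate error, real time is exact immediately after each step,
and the boot epoch, rate error and sync times are invented for
illustration.

```python
def uptime_errors(rate_error, first_sync, second_sync,
                  boot_epoch=1_129_900_000.0):
    """Compare two uptime estimates at the second sync (toy model).

    rate_error:   assumed fractional rate error of the unstepped clock
    first_sync:   true seconds after boot of the first step
    second_sync:  true seconds after boot of the second step
    boot_epoch:   assumed true boot time, seconds since the Epoch

    Returns (error of real time less boot time,
             error of the monotonic clock), both in seconds.
    """
    # CLOCK_MONOTONIC cannot jump, so it accumulates the rate error.
    mono_first = first_sync * (1 + rate_error)
    mono_second = second_sync * (1 + rate_error)

    # After each step the real time is taken to be exact.
    real_first = boot_epoch + first_sync
    real_second = boot_epoch + second_sync

    # Boot time as determined just after the first sync.
    boottime = real_first - mono_first

    uptime_from_realtime = real_second - boottime
    return uptime_from_realtime - second_sync, mono_second - second_sync


# 100 ppm fast, first sync 60 s after boot, resync one day after boot.
print(uptime_errors(100e-6, 60.0, 86400.0))
```

With these assumed numbers the estimate from real time less boot time
is off by only the error accumulated before the first sync (about 6 ms),
while the unstepped clock is off by the error accumulated over the whole
day (about 8.6 s).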

Bruce