Timekeeping [Was: Re: cvs commit: src/usr.bin/vmstat vmstat.c src/usr.bin/w w.c]

Poul-Henning Kamp phk at phk.freebsd.dk
Thu Oct 20 14:04:21 PDT 2005


I can see that Warner has already handled some of the necessary
rebuttals so I will not repeat his arguments apart from noting my
agreement that leapseconds are evil and should be abandoned as
soon as possible.


But let me step back a bit and explain the rationale for the way
we keep time in FreeBSD, as a means for clearing up some of the
confusion which the discussion between Bruce and me has caused.


The first thing to remember is that a clock consists of a frequency
source and a counter.  The counter is trivial [1]: you can do it
with any technology and get it right.  It's the frequency source
which is the tricky bit.

So our hardest task is to decide how long we think seconds are.

Initially we trust the timecount hardware to know this (some of
them autocalibrate) but we take corrections from NTPD and other
programs via a specialized group of syscalls, because unless the
computer has timecounting hardware driven by a primary frequency
standard (Cesium or a steered oscillator) corrections are necessary
to get the length of seconds right.

But we also need to get the counter synchronized with UTC.  If the
length of our seconds is perfect, we need to do this only once.

If the length of our seconds is not perfect, the phase error will
become non-zero, and we can either fix this with a correction to
the phase, a time step, or we do it by overcorrection of the frequency
(the length of our seconds) for a period of time until we have
regained or lost the phase synchronization.

If we are able to estimate the frequency error, we can of course
apply the correction predictively.
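
To put numbers on this, here is a minimal sketch in Python.  The
function names and the example figures are made up for illustration
and are not taken from the kernel; the slew calculation also assumes
the underlying frequency error has already been fixed, so that the
overcorrection only has to absorb the accumulated phase error.

```python
def phase_error(freq_error_ppm, elapsed_s):
    """Phase error (seconds) built up by a clock whose seconds are
    off by freq_error_ppm parts per million for elapsed_s seconds."""
    return freq_error_ppm * 1e-6 * elapsed_s


def slew_duration(phase_err_s, overcorrection_ppm):
    """How long (seconds) we must overcorrect the frequency by
    overcorrection_ppm to absorb phase_err_s without a time step."""
    return abs(phase_err_s) / (overcorrection_ppm * 1e-6)


# Hypothetical example: a clock 50 PPM fast, left alone for one day.
err = phase_error(50.0, 86400)       # about 4.32 seconds
fix = slew_duration(err, 500.0)      # about 8640 seconds at 500 PPM
print(err, fix)
```

The alternative, a time step, removes the same 4.32 seconds at once
but leaves a discontinuity in the timescale.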

Hardware or software, like NTPD, which does all three of the above
is called a second order Phase Locked Loop ("a PLL"), and there is a
lot of mathematical theory about it hidden in dusty textbooks.

If people do stupid things like use hard steps (*settime*()) to
correct rate problems, then they get what they deserve, including
potentially backwards jumps in time, but the integral over time of
all steps apart from the first one amounts to a rate correction.
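
As a rough illustration of that last point, the sketch below (Python,
hypothetical names and numbers) adds up a series of hard steps and
divides by the time they cover.  It assumes the steps are the ones
applied after the initial synchronization.

```python
def implied_rate_ppm(steps_s, elapsed_s):
    """Average rate correction, in PPM, implied by a list of hard
    steps (seconds, positive = clock moved forward) spread over
    elapsed_s seconds.  The initial phase step must be left out."""
    return sum(steps_s) / elapsed_s * 1e6


# Hypothetical example: the clock is stepped forward 1 ms every
# 100 seconds, ten times in a row.
steps = [0.001] * 10
print(implied_rate_ppm(steps, 1000.0))   # close to 10 PPM
```

Stepping 1 ms every 100 seconds is, on average, the same as saying
the clock runs 10 PPM slow; it is just a much uglier way of applying
the correction.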

When NTPD is running, the kernel gets a rate correction
which is really a mix of a corrective phase adjustment, a corrective
rate adjustment and a predictive rate adjustment.  The math works
out the same however: leaving out the first phase adjustment (which
is usually handled by a step anyway) the integral over time of the
sum of the phase and rate adjustments is the true rate correction [2].

Adjtime() is a middle case, it implements a phase step but spreads
it out over time (by doing frequency corrections) to avoid large
gaps or backwards steps in the CLOCK_REALTIME timescale.
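
The idea of a "soft step" can be sketched like this in Python.  This
is a toy model, not the kernel code: the function name is invented,
and the 500 microseconds per second slew limit is an illustrative
assumption, not a figure from this mail.

```python
def soft_step(offset_us, slew_us_per_s=500):
    """Split a requested phase step (microseconds, may be negative)
    into per-second corrections that never exceed the slew limit.
    Returns the list of corrections, one per second."""
    remaining = offset_us
    corrections = []
    while remaining != 0:
        # Clamp this second's correction to +/- the slew limit.
        delta = max(-slew_us_per_s, min(slew_us_per_s, remaining))
        corrections.append(delta)
        remaining -= delta
    return corrections


# A +10 ms adjustment is spread over 20 seconds of 500 us each.
print(len(soft_step(10_000)))
# A -1.2 ms adjustment takes three seconds.
print(soft_step(-1_200))
```

Because every correction is much smaller than the second it is
applied to, the timescale keeps moving forward throughout.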

Adjtime() is used by various time synchronization tools which don't
do rate estimation at all but rather implement occasional phase
synchronization using these "soft steps".

Again repeated phase synchronization amounts to crude frequency
steering, and therefore again, the integral over time is our best
estimate of SI second duration.

But as I said: timekeeping in all forms consists of getting the
phase right the first time, and keeping the frequency right (on
average) afterwards and there is no escaping this basic mathematical
fact because you can't go back and remeasure the past.

FreeBSD incorporates everything but the hard steps into the
CLOCK_MONOTONIC timescale, because over time, the integral of those
corrections is our best estimate of the correct length of SI
seconds.

It can be argued that any hard steps after the second should be
factored in as well, but in practice subsequent hard steps are
either to correct mistakes in the initial hard step or so infrequent
that averaging out the corrections doesn't make sense, so we
treat all hard steps as phase only corrections.

In summary:  CLOCK_MONOTONIC is our best estimate of how many SI
seconds the system has been running [3].

Given that CLOCK_MONOTONIC is our best guess how long the kernel
has been running, it follows that CLOCK_REALTIME - CLOCK_MONOTONIC
must be our best estimate of what time the kernel booted.
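
From userland the subtraction can be tried directly.  The sketch
below uses Python's time.clock_gettime(), which is available on
Unix-like systems; the function name boottime_estimate is made up.
It assumes CLOCK_MONOTONIC counts from boot as described here, which
other systems do not necessarily promise, and the two clock reads are
not atomic, so the result jitters by the time between them.

```python
import time


def boottime_estimate():
    """CLOCK_REALTIME - CLOCK_MONOTONIC, in seconds since the epoch.
    On a system that counts CLOCK_MONOTONIC from boot this is the
    estimated boot time."""
    realtime = time.clock_gettime(time.CLOCK_REALTIME)
    monotonic = time.clock_gettime(time.CLOCK_MONOTONIC)
    return realtime - monotonic


print(time.ctime(boottime_estimate()))
```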

CLOCK_REALTIME aka. UTC is therefore maintained in FreeBSD by keeping
around our best estimate of when the system booted in UTC time and
adding CLOCK_MONOTONIC to it.  Hard phase steps are implemented by
changing our boottime estimate according to the desired step.
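
A toy model of that bookkeeping, in Python, might look like this.
The class and method names are invented for the illustration and do
not correspond to kernel symbols.

```python
class Timescales:
    """Toy model: CLOCK_REALTIME is derived, never stored."""

    def __init__(self, boottime_utc):
        self.boottime = boottime_utc  # best estimate of boot, UTC seconds
        self.monotonic = 0.0          # seconds the system has been running

    def advance(self, seconds):
        """Time passes; only the monotonic count moves."""
        self.monotonic += seconds

    def realtime(self):
        """CLOCK_REALTIME = boottime estimate + CLOCK_MONOTONIC."""
        return self.boottime + self.monotonic

    def hard_step(self, new_utc):
        """A hard step moves the boottime estimate, nothing else."""
        self.boottime += new_utc - self.realtime()


ts = Timescales(1_000_000.0)
ts.advance(100.0)
ts.hard_step(1_000_130.0)   # root decides it is 30 seconds later
print(ts.realtime(), ts.monotonic, ts.boottime)
```

After the step CLOCK_REALTIME reads the new value, CLOCK_MONOTONIC is
untouched, and the 30 seconds have ended up in the boottime estimate.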

The only snag in this is that leapseconds do not exist in
CLOCK_REALTIME, but they very much exist in the real world.

We deal with (ie: ignore) leap seconds by either replaying or
skipping a second on the CLOCK_REALTIME timescale [5], and in order
to make the math come out right, we do that by adjusting boottime
one second either way.

This is technically wrong, and will mean that the boottime estimate
is wrong by the number of leapseconds the system has experienced
while running.
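
In the simplest possible terms (a Python sketch with a made-up
function name, assuming an inserted leapsecond replays a second and a
deleted one skips a second on CLOCK_REALTIME):

```python
def boottime_after_leapseconds(boottime, inserted=0, deleted=0):
    """Boottime estimate after the system has lived through some
    leapseconds.  Replaying a second moves the estimate one second
    back, skipping a second moves it one second forward."""
    return boottime - inserted + deleted


original = 1_000_000
now = boottime_after_leapseconds(original, inserted=2)
print(original - now)   # error in the estimate, in seconds
```

CLOCK_MONOTONIC is not touched by any of this, so the uptime stays
correct; only the derived boottime drifts by one second per
leapsecond.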

Considering that leapseconds happen once every 500 days or so and
that POSIX found them so insignificant that they just defined them
out of existence as far as computers go, I have no problem with
this approximation.

Conclusion:

Provided root doesn't go out of his way to muck it up, timekeeping
in FreeBSD will Do The Right Thing, and do it a fair bit better and
with higher precision than any other operating system.

If you want to know how long the system has been running,
CLOCK_MONOTONIC is the best number you will get.


Footnotes:

[1] Actually, as leapseconds have proven, it is possible for a highly
skilled group of scientists to get the counting part wrong also.

[2] Because NTPD implements a 2nd order PLL, the integral over time
of the phase adjustment alone is the frequency drift divided by the
PLL timeconstant, a number which is lost in the noise unless you
have a high-quality OCXO or better timebase.

[3] As Bruce has correctly pointed out, if the root plays silly
buggers with time management systemcalls, he can muck it up [4].
One way would be to apply a 500PPM frequency correction and step
one second in the other direction every 2000 seconds.  On average
the clock would be right, but the CLOCK_MONOTONIC would be 500PPM
wrong.
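
The arithmetic behind that example: 500 PPM over 2000 seconds is
exactly one second.  A small Python simulation (invented names,
integer microseconds to keep the sums exact, and assuming the hard
step only touches CLOCK_REALTIME as described above) shows the two
timescales coming apart:

```python
def silly_buggers(cycles, freq_ppm=500, period_s=2000, step_us=1_000_000):
    """Run the clock freq_ppm fast and step CLOCK_REALTIME back by
    step_us every period_s seconds.  Returns (true, monotonic,
    realtime) elapsed time in microseconds."""
    true_us = monotonic_us = realtime_us = 0
    for _ in range(cycles):
        gained_us = period_s * freq_ppm      # PPM * seconds = microseconds
        true_us += period_s * 1_000_000
        monotonic_us += period_s * 1_000_000 + gained_us
        realtime_us += period_s * 1_000_000 + gained_us
        realtime_us -= step_us               # the hard step hits realtime only
    return true_us, monotonic_us, realtime_us


true_us, mono_us, real_us = silly_buggers(10)
print(real_us - true_us)                         # realtime error
print((mono_us - true_us) * 1_000_000 // true_us)  # monotonic error in PPM
```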

[4] Root could also do "killall -9 sh" or "rm -rf /", either of
which would be both faster and more spectacular.

[5] Warner is right: I got the actual sequence wrong in my
previous email.


References:

http://phk.freebsd.dk/pubs/timecounter.pdf

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

