[PATCH 0/2] plug capability races
Bruce Evans
brde at optusnet.com.au
Thu Aug 21 03:34:59 UTC 2014
On Thu, 21 Aug 2014, Bruce Evans wrote:
> ...
> I now remember a bit more about the algorithm. There are several
> generations of timehands. Each generation remains stable for several
> clock ticks. That should be several clock ticks at 100 Hz. Normally
> there is no problem with just using the old pointer read from timehands
> (except there is no serialization for updating timehands itself (*)).
> ...
> (*):
>
> % binuptime(struct bintime *bt)
> % {
> % struct timehands *th;
> % u_int gen;
> % % do {
> % th = timehands;
>
> Since tc_windup() also doesn't dream of memory ordering, timehands here
> may be in the future of what it points to. That is much worse than it
> being in the past. Barriers would be cheap in tc_windup() but useless
> if they require barriers in binuptime() to work.
>
> tc_windup() is normally called from the clock interrupt handler. There
> are several mutexes (or at least atomic ops that give synchronization on
> at least x86 SMP) before and after it. These gives serialization very
> soon after the changes.
>
> The fix (without adding any barrier instructions) is easy. Simply
> run the timehands update 1 or 2 generations behind the update of what
> it points to. This gives even more than time-domain locking, since
> the accidental synchronization from the interrupt handler gives ordering
> between the update of the pointed-to data and the timehands pointer.
> ...
More details:
- lock tc_windup() and tc_ticktock() using a spinlock
- add hard real-time rate limiting and error recovery so that the timehands
are not cycled through too fast or too slow.
tc_ticktock() already does this for calls from the clock interrupt
handler except when clock interrupts are non-hard. tc_ticktock()
can use mtx_trylock() and do nothing if the mutex is contested.
tc_setclock() and possibly inittimecounter() should wait to synchronize
with the next clock interrupt that would call tc_windup(), and advance
the time that they set by the wait delay plus previous delays, and even
more, since its changes shouldn't go live for several generations. It
sort of does this now, in a broken way. It corrupts the boot time
using racy accesses. This limits problems from large adjustments to
realtime clock ids (the ones that add the boot time). There are no
further delays, just races accessing the boot time in critical places
like boottime(). Delays are now also limited by calling tc_windup()
and tc_windup() going live with updated timehands almost immediately
(as soon as it complete). The immediate tc_windup() call is commented
on as being to fiddle with all the crinkly bits aroudn the fiords, but
the only criticial thing it does is update the generation count in
a fiarly non-racy way -- this tells bintime() to loop, so it has a
chance of picking up the changed boot time with a coherent value.
sysctl_kern_timecounter_hardware() should call tc_windup() to do
a staged update way much like for tc_setclock(). It refrains from
doing this because of the races, but it hacks on the timehands
pointer in a different and even more fragile racy way. It now calls
timekeep_push_vdso() to do the userland part of tc_windup().
The timehands may be recycled too slowly. This happens mainly on
suspension. The system depends on frequent windups to work, so
it can't run really tickless. After suspension, all old generations
are garbage but their generation counts might not have been updated
to indicate this. The system should at least try to detect this.
I don't understand what happens for timecounters on resume now.
- in tc_windup(), bump the generation count for the second-oldest generation
instead of setting it to 0 for the current generation, and update the
timehands for the oldest generation instead of changing them for the
current generation. This also fixes busy-waiting and contention on the
timehands for the current generation during the windup.
Using the special generation count of 0 essentially reduces the
"queue" of timehands from length 10 to length 0 during the
windup, at a cost of complications and bugs.
It also makes the other 9 generations of the timehands almost
never used, and not very useful. 1 generation together with a
generation count that is set to 0 during windups suffices, at
the cost of spinning while the generation count is 0 and
complications and bugs in accesses to the generation count.
But the current version already has all these costs in the usual
case where the generation changes. tc_windup() is supposed to
run with interrupts disabled, so that it cannot be preempted
and the length of the spinning is bounded. (Having only Giant
locking for the call in settime() is even worse than first appeared.
It doesn't prevent preemption at all, so the length of the spinning
is unbounded.)
In unusal cases, binuptime() is preempted and the generation count
changes many times before the original timehands is used. Then
the pointer to it is invalid. But the generation count in it has
increased by more than usual, so the change is detected and the
pointer is updated. So old generations are not used for storing
anything important except for the generation count, and having 10
generations just reduces the rate of increase of generation counts
by a factor of 10, so it takes preemption by 10 ** 2^32 windups
instead of only 2**32 for the algorithm to by broken by wraparound
of the generation count (with HZ = 1000, that is 490 days of
preemption instead of only 49).
The delayed updates might cause different complications. I think
ntp seconds updates strictly should to be done in advance so as
to go live on seconds rollover. The details can't be too
critical, since with HZ = 100 tc_windup calls are out of sync with
seconds rollovers by an average of 5 milliseconds (+-5) and no one
seemed to notice problems from that. Isn't there an error of 1
second for the duration of the sync time around leap seconds
adjustments? With HZ = 1000 the update "queue" with intentionally
delayed updates could have length 5 and give much the same
behaviour except for missing races (the average delay would still
be 5 milliseconds but now +-0.5).
Bruce
More information about the freebsd-arch
mailing list