Re: git: 3d9d64aa1846 - main - kern_tc: unify timecounter to bintime delta conversion
Date: Tue, 30 Nov 2021 16:14:22 UTC
On Tue, Nov 30, 2021 at 7:34 AM Andriy Gapon <avg@freebsd.org> wrote:
>
> The branch main has been updated by avg:
>
> URL: https://cgit.FreeBSD.org/src/commit/?id=3d9d64aa1846217eac9229f8cba5cb6646a688b7
>
> commit 3d9d64aa1846217eac9229f8cba5cb6646a688b7
> Author: Andriy Gapon <avg@FreeBSD.org>
> AuthorDate: 2021-11-30 13:23:23 +0000
> Commit: Andriy Gapon <avg@FreeBSD.org>
> CommitDate: 2021-11-30 13:23:23 +0000
>
> kern_tc: unify timecounter to bintime delta conversion
>
> There are two places where we convert from a timecounter delta to
> a bintime delta: tc_windup and bintime_off.
> Both functions use the same calculations when the timecounter delta is
> small. But for a large delta (greater than approximately an equivalent
> of 1 second) the calculations were different. Both functions use
> approximate calculations based on th_scale that avoid division. Both
> produce values slightly greater than a true value, calculated with
> division by tc_frequency, would be. tc_windup is slightly more
> accurate, so its result is closer to the true value and, thus, smaller
> than bintime_off result.
>
> As a consequence there can be a jump back in time when time hands are
> switched after a long period of time (a large delta). Just before the
> switch the time would be calculated with a large delta from
> th_offset_count in bintime_off. tc_windup does the switch using its own
> calculations of a new th_offset using the large delta. As explained
> earlier, the new th_offset may end up being less than the previously
> produced binuptime. So, for a period of time new binuptime values may
> be "back in time" comparing to values just before the switch.
>
> Such a jump must never happen. All the code assumes that the uptime is
> monotonically nondecreasing and some code works incorrectly when that
> assumption is broken.
> For example, we have observed sleepq_timeout()
> ignoring a timeout when the sbinuptime value obtained by the callout
> code was greater than the expiration value, but the sbinuptime obtained
> in sleepq_timeout() was less than it. In that case the target thread
> would never get woken up.
>
> The unified calculations should ensure the monotonic property of the
> uptime.
>
> The problem is quite rare as normally tc_windup should be called HZ
> times per second (typically 1000 or 100). But it may happen in VMs on
> very busy hypervisors where a VM's virtual CPU may not get an execution
> time slot for a second or more.
>

I wonder if this helps explain the behavior we saw when enabling TSC on
VirtualBox guests. Threads doing short sleeps (about 1 second or less)
would start to miss their wakeups, so we'd consistently see, e.g.,
shutdown issues after applying a high load, while we were waiting for
bufdaemon threads.
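
To make the failure mode from the commit message concrete, here is a
minimal user-space sketch. It is not the kern_tc.c code: the names
(TOY_FREQ, toy_scale, delta_to_time_scaled, delta_to_time_split,
goes_backwards), the frequency, and the 2^-32 second time unit are all
made up for illustration, and the two conversions are only assumed
stand-ins for "a reader-side formula" and "a slightly more accurate
windup-side formula". It shows how a reading taken just after the hand
switch can be smaller than one taken just before it when the two sides
convert a large delta differently.

```c
#include <stdint.h>

/* Toy fixed-point time in units of 2^-32 seconds (assumption; not the
 * kernel's struct bintime). */
#define TOY_ONE_SEC ((uint64_t)1 << 32)

/* Hypothetical counter frequency in Hz, picked so that 2^32 / freq is
 * not an integer and rounding is visible. */
#define TOY_FREQ ((uint64_t)1000003)

/* Per-tick scale, rounded up, so ticks * scale slightly overestimates
 * the true elapsed time. */
static uint64_t
toy_scale(void)
{
	return ((TOY_ONE_SEC + TOY_FREQ - 1) / TOY_FREQ);
}

/* Reader-style conversion: multiply the whole delta by the scale. */
static uint64_t
delta_to_time_scaled(uint64_t delta)
{
	return (delta * toy_scale());
}

/* Windup-style conversion: account for whole seconds exactly and scale
 * only the remainder, which accumulates less rounding error. */
static uint64_t
delta_to_time_split(uint64_t delta)
{
	uint64_t t = 0;

	while (delta >= TOY_FREQ) {
		delta -= TOY_FREQ;
		t += TOY_ONE_SEC;
	}
	return (t + delta * toy_scale());
}

/*
 * Returns 1 if a reading taken `after` ticks past the hand switch is
 * earlier than the reading taken just before the switch, where the
 * switch happens `delta` ticks after the old offset.  With `unified`
 * set, the new offset is computed with the same formula the reader
 * used.
 */
static int
goes_backwards(uint64_t delta, uint64_t after, int unified)
{
	uint64_t before = delta_to_time_scaled(delta);
	uint64_t new_offset = unified ?
	    delta_to_time_scaled(delta) : delta_to_time_split(delta);
	uint64_t later = new_offset + delta_to_time_scaled(after);

	return (later < before);
}
```

With a switch delayed by a bit over three seconds,
goes_backwards(3 * TOY_FREQ + 10, 5, 0) returns 1 (time steps back),
while the same call with unified set to 1 returns 0. For a small delta
such as 10 ticks the two conversions agree and nothing goes backwards,
which would fit the commit's remark that the problem only shows up when
the windup is delayed by about a second or more.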