CPU time accounting broken on 8-STABLE machine after a few hours of uptime

Tue Sep 28 17:15:45 UTC 2010

On 28 Sep, Chip Camden wrote:
> Quoth Don Lewis on Monday, 27 September 2010:
>> CPU time accounting is broken on one of my machines running 8-STABLE.  I
>> ran a test with a simple program that just loops and consumes CPU time:
>> 
>> % time ./a.out
>> 94.544u 0.000s 19:14.10 8.1%	62+2054k 0+0io 0pf+0w
>> 
>> The display in top shows the process with WCPU at 100%, but TIME
>> increments very slowly.
>> 
>> Several hours after booting, I got a bunch of "calcru: runtime went
>> backwards" messages, but they stopped right away and never appeared
>> again.
>> 
>> Aug 23 13:40:07 scratch ntpd[1159]: ntpd 4.2.4p5-a (1)
>> Aug 23 13:43:18 scratch ntpd[1160]: kernel time sync status change 2001
>> Aug 23 18:05:57 scratch dbus-daemon: [system] Reloaded configuration
>> Aug 23 18:06:16 scratch dbus-daemon: [system] Reloaded configuration
>> Aug 23 18:12:40 scratch ntpd[1160]: time reset +18.059948 s
>> [snip]
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6836685136 usec to 5425839798 usec for pid 1526 (csh)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 4747 usec to 2403 usec for pid 1519 (csh)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 5265 usec to 2594 usec for pid 1494 (hald-addon-mouse-sy)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 7818 usec to 3734 usec for pid 1488 (console-kit-daemon)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 977 usec to 459 usec for pid 1480 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 958 usec to 450 usec for pid 1479 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 957 usec to 449 usec for pid 1478 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 952 usec to 447 usec for pid 1477 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 959 usec to 450 usec for pid 1476 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 975 usec to 458 usec for pid 1475 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1026 usec to 482 usec for pid 1474 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1333 usec to 626 usec for pid 1473 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 2469 usec to 1160 usec for pid 1440 (inetd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 719 usec to 690 usec for pid 1402 (sshd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 120486 usec to 56770 usec for pid 1360 (cupsd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6204 usec to 2914 usec for pid 1289 (dbus-daemon)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 179 usec to 84 usec for pid 1265 (moused)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 22156 usec to 10407 usec for pid 1041 (nfsd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1292 usec to 607 usec for pid 1032 (mountd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 8801 usec to 4134 usec for pid 664 (devd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 19 usec to 9 usec for pid 9 (sctp_iterator)
>> 
>> 
>> If I reboot and run the test again, the CPU time accounting seems to be
>> working correctly.
>> % time ./a.out
>> 1144.226u 0.000s 19:06.62 99.7%	5+168k 0+0io 0pf+0w
>> 
> <snip>
> 
> I notice that before the calcru messages, ntpd reset the clock by
> 18 seconds -- that probably accounts for that.

Interesting observation.  Since this happened so early in the log, I
thought that this time change was the initial time change after boot,
but taking a closer look, the time change occurred about 4 1/2 hours
after boot.  The calcru messages occured another 5 1/2 hours after that.

I also just noticed that this log info was from the August 23rd kernel,
before I noticed the CPU time accounting problem, and not the latest
occurance.  Here's the latest log info:

Sep 23 16:33:50 scratch ntpd[1144]: ntpd 4.2.4p5-a (1)
Sep 23 16:37:03 scratch ntpd[1145]: kernel time sync status change 2001
Sep 23 17:43:47 scratch ntpd[1145]: time reset +276.133928 s
Sep 23 17:43:47 scratch ntpd[1145]: kernel time sync status change 6001
Sep 23 17:47:15 scratch ntpd[1145]: kernel time sync status change 2001
Sep 23 19:02:48 scratch ntpd[1145]: time reset +291.507262 s
Sep 23 19:02:48 scratch ntpd[1145]: kernel time sync status change 6001
Sep 23 19:06:37 scratch ntpd[1145]: kernel time sync status change 2001
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 1120690857 u
sec to 367348485 usec for pid 1518 (csh)
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 5403 usec to
 466 usec for pid 1477 (hald-addon-mouse-sy)
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 7511 usec to
 1502 usec for pid 1472 (hald-runner)
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 17323 usec t
o 12470 usec for pid 1472 (hald-runner)
[snip]

The time jumps are even larger.  There is still the large interval
between the last ntp message and the first calcru message.

My time source is another FreeBSD box with a GPS receiver on my LAN.  My
other client machine isn't seeing these time jumps.  The only messages
from ntp in its log from this period are these:

Sep 23 04:12:23 mousie ntpd[1111]: kernel time sync status change 6001
Sep 23 04:29:29 mousie ntpd[1111]: kernel time sync status change 2001
Sep 24 03:55:24 mousie ntpd[1111]: kernel time sync status change 6001
Sep 24 04:12:28 mousie ntpd[1111]: kernel time sync status change 2001

> I don't know if that has any connection to time(1) running slower -- but
> perhaps ntpd is aggressively adjusting your clock?

It seems to be pretty stable when the machine is idle:

% ntpq -c pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*gw.catspoiler.o .GPS.            1 u    8   64  377    0.168   -0.081   0.007

Not too much degradation under CPU load:

% ntpq -c pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*gw.catspoiler.o .GPS.            1 u   40   64  377    0.166   -0.156   0.026

I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it
much, either.

% ntpq -c pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*gw.catspoiler.o .GPS.            1 u   35   64  377    0.169   -0.106   0.009