CPU time accounting broken on 8-STABLE machine after a few
hours of uptime
Don Lewis
truckman at FreeBSD.org
Tue Sep 28 17:15:45 UTC 2010
On 28 Sep, Chip Camden wrote:
> Quoth Don Lewis on Monday, 27 September 2010:
>> CPU time accounting is broken on one of my machines running 8-STABLE. I
>> ran a test with a simple program that just loops and consumes CPU time:
>>
>> % time ./a.out
>> 94.544u 0.000s 19:14.10 8.1% 62+2054k 0+0io 0pf+0w
>>
>> The display in top shows the process with WCPU at 100%, but TIME
>> increments very slowly.
>>
>> Several hours after booting, I got a bunch of "calcru: runtime went
>> backwards" messages, but they stopped right away and never appeared
>> again.
>>
>> Aug 23 13:40:07 scratch ntpd[1159]: ntpd 4.2.4p5-a (1)
>> Aug 23 13:43:18 scratch ntpd[1160]: kernel time sync status change 2001
>> Aug 23 18:05:57 scratch dbus-daemon: [system] Reloaded configuration
>> Aug 23 18:06:16 scratch dbus-daemon: [system] Reloaded configuration
>> Aug 23 18:12:40 scratch ntpd[1160]: time reset +18.059948 s
>> [snip]
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6836685136 usec to 5425839798 usec for pid 1526 (csh)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 4747 usec to 2403 usec for pid 1519 (csh)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 5265 usec to 2594 usec for pid 1494 (hald-addon-mouse-sy)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 7818 usec to 3734 usec for pid 1488 (console-kit-daemon)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 977 usec to 459 usec for pid 1480 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 958 usec to 450 usec for pid 1479 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 957 usec to 449 usec for pid 1478 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 952 usec to 447 usec for pid 1477 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 959 usec to 450 usec for pid 1476 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 975 usec to 458 usec for pid 1475 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1026 usec to 482 usec for pid 1474 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1333 usec to 626 usec for pid 1473 (getty)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 2469 usec to 1160 usec for pid 1440 (inetd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 719 usec to 690 usec for pid 1402 (sshd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 120486 usec to 56770 usec for pid 1360 (cupsd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6204 usec to 2914 usec for pid 1289 (dbus-daemon)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 179 usec to 84 usec for pid 1265 (moused)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 22156 usec to 10407 usec for pid 1041 (nfsd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1292 usec to 607 usec for pid 1032 (mountd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 8801 usec to 4134 usec for pid 664 (devd)
>> Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 19 usec to 9 usec for pid 9 (sctp_iterator)
>>
>>
>> If I reboot and run the test again, the CPU time accounting seems to be
>> working correctly.
>> % time ./a.out
>> 1144.226u 0.000s 19:06.62 99.7% 5+168k 0+0io 0pf+0w
>>
> <snip>
>
> I notice that before the calcru messages, ntpd reset the clock by
> 18 seconds -- that probably accounts for that.
Interesting observation. Since this happened so early in the log, I
thought that this time change was the initial time change after boot,
but taking a closer look, the time change occurred about 4 1/2 hours
after boot. The calcru messages occured another 5 1/2 hours after that.
I also just noticed that this log info was from the August 23rd kernel,
before I noticed the CPU time accounting problem, and not the latest
occurance. Here's the latest log info:
Sep 23 16:33:50 scratch ntpd[1144]: ntpd 4.2.4p5-a (1)
Sep 23 16:37:03 scratch ntpd[1145]: kernel time sync status change 2001
Sep 23 17:43:47 scratch ntpd[1145]: time reset +276.133928 s
Sep 23 17:43:47 scratch ntpd[1145]: kernel time sync status change 6001
Sep 23 17:47:15 scratch ntpd[1145]: kernel time sync status change 2001
Sep 23 19:02:48 scratch ntpd[1145]: time reset +291.507262 s
Sep 23 19:02:48 scratch ntpd[1145]: kernel time sync status change 6001
Sep 23 19:06:37 scratch ntpd[1145]: kernel time sync status change 2001
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 1120690857 u
sec to 367348485 usec for pid 1518 (csh)
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 5403 usec to
466 usec for pid 1477 (hald-addon-mouse-sy)
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 7511 usec to
1502 usec for pid 1472 (hald-runner)
Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 17323 usec t
o 12470 usec for pid 1472 (hald-runner)
[snip]
The time jumps are even larger. There is still the large interval
between the last ntp message and the first calcru message.
My time source is another FreeBSD box with a GPS receiver on my LAN. My
other client machine isn't seeing these time jumps. The only messages
from ntp in its log from this period are these:
Sep 23 04:12:23 mousie ntpd[1111]: kernel time sync status change 6001
Sep 23 04:29:29 mousie ntpd[1111]: kernel time sync status change 2001
Sep 24 03:55:24 mousie ntpd[1111]: kernel time sync status change 6001
Sep 24 04:12:28 mousie ntpd[1111]: kernel time sync status change 2001
> I don't know if that has any connection to time(1) running slower -- but
> perhaps ntpd is aggressively adjusting your clock?
It seems to be pretty stable when the machine is idle:
% ntpq -c pe
remote refid st t when poll reach delay offset jitter
==============================================================================
*gw.catspoiler.o .GPS. 1 u 8 64 377 0.168 -0.081 0.007
Not too much degradation under CPU load:
% ntpq -c pe
remote refid st t when poll reach delay offset jitter
==============================================================================
*gw.catspoiler.o .GPS. 1 u 40 64 377 0.166 -0.156 0.026
I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it
much, either.
% ntpq -c pe
remote refid st t when poll reach delay offset jitter
==============================================================================
*gw.catspoiler.o .GPS. 1 u 35 64 377 0.169 -0.106 0.009
More information about the freebsd-stable
mailing list