cvs commit: src/sys/conf files.amd64

Sun May 23 22:36:00 PDT 2004

On Sun, 23 May 2004, Nate Lawson wrote:

> On Sun, 23 May 2004, Bruce Evans wrote:
> >   The SMP case hasn't been tested.  The high resolution subcase of this uses
> >   the i8254, and as on i386's, the locking for this is deficient and the
> >   i8254 is too inefficient.  The acpi timer is also too inefficient.
>
> The ACPI timer is significantly better than it used to be.  It is
> currently just a bus_space_read, which maps directly to inl.  The fact
> that IO ports are slow is inescapable.

Actually, its speed hasn't changed, since it is the hardare speed that I
mean.  It is about 200 times as slow as rdtsc() on my amd64 test system.

Sample profiling output which shows the problem indirectly:

output produced by:
%%%
sync
kgmon -rB
time ping -fq -c100000 localhost
kgmon -hp
gprof -u `sysctl -n kern.bootfile` gmon.out >zp
%%%

%%%
...
granularity: each sample hit covers 16 byte(s) for 0.00% of 5.40 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 40.4      2.181    2.181                             mcount [1]
 15.3      3.009    0.828   814149     1017     1017  acpi_timer_read [7]
 14.7      3.802    0.793                             mexitcount [9]
  5.1      4.076    0.275                             cputime [22]
  2.3      4.202    0.125                             user [42]
  1.9      4.307    0.105   408446      258      258  cpu_switch [44]
  0.9      4.358    0.051       76   671400   671400  acpi_cpu_c1 [51]
  0.9      4.406    0.048   200007      242     1569  ip_output [21]
  0.8      4.450    0.043   600161       72     1831  syscall [4]
  0.7      4.490    0.040  1958847       20       20  critical_enter [52]
  0.6      4.522    0.033  1300188       25       25  bzero [56]
  0.5      4.552    0.030   600108       49       49  copyin [58]
  0.5      4.580    0.029   800106       36       36  copyout [59]
  0.5      4.608    0.027  1958847       14       14  critical_exit [61]
  0.5      4.632    0.025   408453       61     1495  mi_switch [11]
  0.4      4.656    0.024   200007      119     2043  ip_input [17]
...
%%%

This was made using the TSC clock for high resolution profiling (sysctl
machdep.cputime_clock=2).  It shows that the acpi_timer_read() takes
about 40% of the of the time for "ping -fq localhost" (mcounting
overheads = 40.4 + 14.7 + 5.1 = 60.2%, acpi_timer_read = 15.3%, other
= 24.6%).  It also shows that there are 8 timecounter calls per packet;
4 are for 4 context switches and 4 are for timestamps made by ping(8)
(one to put in the packet and 3 to leep track of the time).

The system is configured with acpi, so it chooses an ACPI timecounter
(ACPI-fast) as the highest quality, but in fact the ACPI timecounter
has much lower quality than the TSC (it is about 100 times slower and
100 times less precise, and no more stable since this is a desktop
system which never suspends or throttles the CPU if I can help it).
Switching to the TSC timecounter gives for the above:

%%%
...
granularity: each sample hit covers 16 byte(s) for 0.00% of 4.49 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 47.3      2.123    2.123                             mcount [1]
 17.2      2.895    0.772                             mexitcount [2]
  6.0      3.162    0.267                             cputime [13]
  2.8      3.289    0.126                             user [23]
  2.4      3.394    0.106   406425      260      260  cpu_switch [27]
  1.1      3.445    0.051   200006      255     1129  ip_output [15]
  0.9      3.487    0.041   600153       69     1155  syscall [5]
  0.9      3.525    0.039  1301899       30       30  bzero [41]
  0.8      3.561    0.036  1950822       18       18  critical_enter [44]
  0.7      3.594    0.033  1950822       17       17  critical_exit [46]
  0.7      3.626    0.032   200003      160      160  fgetsock [47]
  0.7      3.658    0.032   800100       39       39  copyout [48]
  0.6      3.687    0.029   600099       48       48  copyin [51]
  0.6      3.714    0.027   406429       67      474  mi_switch [16]
  0.5      3.738    0.024   300048       80       80  mb_free [58]
  0.5      3.761    0.023   200006      114     1353  ip_input [12]
  0.5      3.783    0.022   100004      221      743  soreceive [36]
  0.5      3.804    0.021   100018      214      678  kern_select [38]
  0.4      3.823    0.018   700032       26       26  in_cksumdata [65]
  0.4      3.841    0.018   200000       91     1196  icmp_input [14]
  ...
  0.2      4.184    0.009   811079       11       11  tsc_get_timecount [93]
...
%%%

This says among other things that tsc_get_timecounter() takes 11nS.
It really does take about 11nS when not profiled (5.5nS for rdtsc at
2GHz + function call overheads) so I'm confident that the adjustments
for mcounting overheads are accurate to within 10-100nS.  So even with
8 tsc_get_timecount() calls per packet, the timecounter part of the
overhead is small.

Also, cputime() is basically just rdtsc(), so using it doesn't take too
long (6%).  If cputime() were based on the ACPI timer, then it would
want to take 600% of the CPU.  It couldn't do that of course, so running
the profile would just take about 50 times as long as when high resulution
profiling is not running instead of only about 4-5 times as long.  Low
resolution profiling takes about 3 times as long.

Bruce