non-temporal copyin/copyout?
Andrew Gallatin
gallatin at cs.duke.edu
Fri Feb 17 08:14:41 PST 2006
Joseph Koshy writes:
> > I'm bringing this up because I've noticed that FreeBSD 10GbE
> > performance is far below Solaris/amd64 and linux/x86_64 when
> > using the PCI-e 10GbE adaptor that I'm doing drivers for.
> > For example, Solaris can receive a netperf TCP stream at
>
> There was a bug in my port of netperf; I had left the
> `HISTOGRAM' option turned on, which causes it to slow
> down significantly.
>
> v2.3.1,1 is the latest & bugfixed version of the port.
I don't use the port specifically because of the HISTOGRAM
(mis)feature :). I have my own copy of netperf that I use on all
platforms I support (linux, solaris, macosx, freebsd, aix) with
various bugs fixed (sendfile support for solaris, cpu time for
macosx & aix, etc).
> > 9.75Gb/sec while using only 47% CPU as measured by vmstat.
> > (i.e., it is using a little less than a single core). In
> > contrast, FreeBSD is limited to 7.7Gb/sec, and uses nearly
> > 90% CPU. When profiling with hwpmc, I see a profile which
> > shows up to 70% of the time is spent in copyout.
>
> You could use the following events to probe the system:
OK. I ran these probes while netperf was streaming at ~7.7Gb/s,
sampling each event for roughly 10-20 seconds, not very scientifically :)
Here is everything above 1% for all of them:
> "k8-dc-miss" : data cache misses
91.5 6466.00 6466.00 0 100.00% copyout [1]
2.8 6666.00 200.00 0 100.00% soreceive [2]
1.5 6774.00 108.00 0 100.00% uiomoveco [3]
1.0 6846.00 72.00 0 100.00% mb_free_ext [4]
> "k8-bu-fill-request-l2-miss,mask=dc-fill" : L2 fills for the
> data cache
88.2 3866.00 3866.00 0 100.00% copyout [1]
4.0 4041.00 175.00 0 100.00% soreceive [2]
1.9 4125.00 84.00 0 100.00% uiomoveco [3]
1.9 4207.00 82.00 0 100.00% mb_free_ext [4]
1.5 4273.00 66.00 0 100.00% mb_dtor_clust [5]
> "k8-dc-misaligned-data-reference": in case there are any
99.5 66763.00 66763.00 0 100.00% copyout [1]
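The profiles above are dominated by copyout, which is what the subject
line's non-temporal copy is meant to address: movnti/movntdq stores
write the destination without first pulling its cache lines in, so a
bulk copy stops evicting the rest of the working set. Below is a
minimal user-level sketch of the technique using SSE2 intrinsics;
copy_nt is a hypothetical name for illustration, not the kernel's
copyout(9) routine, and the alignment head loop is there because
movntdq requires a 16-byte-aligned destination (relevant given the
misaligned-reference counts above).

```c
/* Sketch of a non-temporal bulk copy (SSE2).  Hypothetical user-level
 * demo of the idea, not the actual FreeBSD copyout(9) implementation. */
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128 */
#include <stdint.h>
#include <stddef.h>

static void
copy_nt(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* Head: byte-copy until the destination is 16-byte aligned,
	 * since movntdq faults on unaligned addresses. */
	while (len > 0 && ((uintptr_t)d & 15) != 0) {
		*d++ = *s++;
		len--;
	}

	/* Body: stream 16-byte chunks past the cache. */
	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);  /* non-temporal store */
		d += 16;
		s += 16;
		len -= 16;
	}

	/* Non-temporal stores are weakly ordered; fence before the
	 * data is considered globally visible. */
	_mm_sfence();

	/* Tail: remaining bytes. */
	while (len-- > 0)
		*d++ = *s++;
}
```

Whether this wins depends on the workload: if the application touches
the received data immediately, the bypassed cache lines must be
refetched, so a non-temporal copyout helps most when the copy is much
larger than the cache.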
> "k8-fr-interrupts-masked-while-pending-cycles": for
> finding spots in the code where spin-locks are being
> held for long.
I had to tweak the sample rate to 512 for this one.
52.5 330.00 330.00 0 100.00% acpi_cpu_idle [1]
10.4 395.00 65.00 0 100.00% spinlock_exit [2]
9.1 452.00 57.00 0 100.00% acpi_cpu_c1 [3]
6.1 490.00 38.00 0 100.00% _mtx_lock_sleep [4]
4.0 515.00 25.00 0 100.00% runq_remove [5]
2.4 530.00 15.00 0 100.00% ast [6]
2.2 544.00 14.00 0 100.00% _mtx_unlock_sleep [7]
2.1 557.00 13.00 0 100.00% turnstile_lock [8]
1.9 569.00 12.00 0 100.00% choosethread [9]
1.6 579.00 10.00 0 100.00% cpu_switch [10]
1.3 587.00 8.00 0 100.00% turnstile_release [11]
1.1 594.00 7.00 0 100.00% sched_switch [12]
1.0 600.00 6.00 0 100.00% sched_add [13]
Drew