non-temporal copyin/copyout?
Andrew Gallatin
gallatin at cs.duke.edu
Fri Feb 17 08:14:41 PST 2006
Joseph Koshy writes:
> > I'm bringing this up because I've noticed that FreeBSD 10GbE
> > performance is far below Solaris/amd64 and linux/x86_64 when
> > using the PCI-e 10GbE adaptor that I'm doing drivers for.
> > For example, Solaris can receive a netperf TCP stream at
>
> There was a bug in my port of netperf; I had left the
> `HISTOGRAM' option turned on, which causes it to slow
> down significantly.
>
> v2.3.1,1 is the latest & bugfixed version of the port.
I don't use the port specifically because of the HISTOGRAM
(mis)feature :). I have my own copy of netperf that I use on all
platforms I support (linux, solaris, macosx, freebsd, aix) with
various bugs fixed (sendfile support for solaris, cpu time for
macosx & aix, etc).
> > 9.75Gb/sec while using only 47% CPU as measured by vmstat.
> > (i.e., it is using a little less than a single core). In
> > contrast, FreeBSD is limited to 7.7Gb/sec, and uses nearly
> > 90% CPU. When profiling with hwpmc, I see a profile which
> > shows up to 70% of the time is spent in copyout.
>
> You could use the following events to probe the system:
OK. I ran these probes while netperf was streaming at ~7.7Gb/s,
sampling each event for roughly 10-20 seconds, not very scientifically :)
Here is everything above 1% for all of them:
> "k8-dc-miss" : data cache misses
91.5 6466.00 6466.00 0 100.00% copyout [1]
2.8 6666.00 200.00 0 100.00% soreceive [2]
1.5 6774.00 108.00 0 100.00% uiomoveco [3]
1.0 6846.00 72.00 0 100.00% mb_free_ext [4]
> "k8-bu-fill-request-l2-miss,mask=dc-fill" : L2 fills for the
> data cache
88.2 3866.00 3866.00 0 100.00% copyout [1]
4.0 4041.00 175.00 0 100.00% soreceive [2]
1.9 4125.00 84.00 0 100.00% uiomoveco [3]
1.9 4207.00 82.00 0 100.00% mb_free_ext [4]
1.5 4273.00 66.00 0 100.00% mb_dtor_clust [5]
> "k8-dc-misaligned-data-reference": in case there are any
99.5 66763.00 66763.00 0 100.00% copyout [1]
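The profiles above are dominated by copyout, which is what the subject
line's non-temporal copy is meant to address: movnti/movntdq stores
write the destination without first pulling its cache lines in, so a
bulk copy stops evicting the rest of the working set. Below is a
minimal user-level sketch of the technique using SSE2 intrinsics;
copy_nt is a hypothetical name for illustration, not the kernel's
copyout(9) routine, and the alignment head loop is there because
movntdq requires a 16-byte-aligned destination (relevant given the
misaligned-reference counts above).

```c
/* Sketch of a non-temporal bulk copy (SSE2).  Hypothetical user-level
 * demo of the idea, not the actual FreeBSD copyout(9) implementation. */
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128 */
#include <stdint.h>
#include <stddef.h>

static void
copy_nt(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* Head: byte-copy until the destination is 16-byte aligned,
	 * since movntdq faults on unaligned addresses. */
	while (len > 0 && ((uintptr_t)d & 15) != 0) {
		*d++ = *s++;
		len--;
	}

	/* Body: stream 16-byte chunks past the cache. */
	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);
		_mm_stream_si128((__m128i *)d, v);  /* non-temporal store */
		d += 16;
		s += 16;
		len -= 16;
	}

	/* Non-temporal stores are weakly ordered; fence before the
	 * data is considered globally visible. */
	_mm_sfence();

	/* Tail: remaining bytes. */
	while (len-- > 0)
		*d++ = *s++;
}
```

Whether this wins depends on the workload: if the application touches
the received data immediately, the bypassed cache lines must be
refetched, so a non-temporal copyout helps most when the copy is much
larger than the cache.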
> "k8-fr-interrupts-masked-while-pending-cycles": for
> finding spots in the code where spin-locks are being
> held for long.
I had to tweak the sample rate to 512 for this one.
52.5 330.00 330.00 0 100.00% acpi_cpu_idle [1]
10.4 395.00 65.00 0 100.00% spinlock_exit [2]
9.1 452.00 57.00 0 100.00% acpi_cpu_c1 [3]
6.1 490.00 38.00 0 100.00% _mtx_lock_sleep [4]
4.0 515.00 25.00 0 100.00% runq_remove [5]
2.4 530.00 15.00 0 100.00% ast [6]
2.2 544.00 14.00 0 100.00% _mtx_unlock_sleep [7]
2.1 557.00 13.00 0 100.00% turnstile_lock [8]
1.9 569.00 12.00 0 100.00% choosethread [9]
1.6 579.00 10.00 0 100.00% cpu_switch [10]
1.3 587.00 8.00 0 100.00% turnstile_release [11]
1.1 594.00 7.00 0 100.00% sched_switch [12]
1.0 600.00 6.00 0 100.00% sched_add [13]
Drew