Some performance measurements on the FreeBSD network stack
Luigi Rizzo
rizzo at iet.unipi.it
Thu Apr 19 13:10:53 UTC 2012
I have been running some performance tests on UDP sockets,
using the netsend program in tools/tools/netrate/netsend
and instrumenting the source code and the kernel to return at
various points of the path. Here are some results which
I hope you find interesting.
Test conditions:
- Intel i7-870 CPU running at 2.93 GHz + TurboBoost,
all 4 cores enabled, no hyperthreading
- FreeBSD HEAD as of 15 April 2012, no ipfw, no other
pfil clients, no IPv6 or IPsec.
- userspace running 'netsend 10.0.0.2 5555 18 0 5'
(output to a physical interface, udp port 5555, small
frame, no rate limitations, 5sec experiments)
- the 'ns' column reports
the total time divided by the number of successful
transmissions; we report the min and the max over 5 tests
(a worked example follows this list)
- 1 to 4 parallel tasks, variable packet sizes
- there are variations in the numbers, which become
larger as we move down the stack
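As a worked example of the 'ns' column: the first sys_sendto() row in the
table below reports about 103 ns/pkt with a single core, which over a
5-second run corresponds to roughly 5*10^9 / 103, i.e. about 48 million
successful sendto() calls.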
Caveats:
- in the table below, clock and pktlen are constant.
I am including the info here so it is easier to compare
the results with future experiments.
- I have a small number of samples, so I am only reporting
the min and the max over a handful of experiments.
- I am only measuring average values over millions of
cycles; I have no information on the variance across
the various executions.
- from what I have seen, numbers vary significantly on
different systems, depending on memory speed, caches
and other things. The big jumps are significant and present
on all systems, but the small deltas (say < 5%) are
not even statistically significant.
- if someone is interested in replicating the experiments,
email me and I will post a link to a suitable PicoBSD image.
- I have not yet instrumented the bottom layers (if_output
and below).
The results show a few interesting things:
- the packet-sending application is reasonably fast
and certainly not a bottleneck (over 100 Mpps before
issuing the system call); a sketch of a comparable
measurement loop appears after this list;
- the system call is somewhat expensive, about 100 ns.
I am not sure where the time is spent (the amd64 code
does a few pushes on the stack and then runs "syscall",
followed by a "sysret"), nor how much
room for improvement there is in this area.
The relevant code is in lib/libc/i386/SYS.h and
lib/libc/i386/sys/syscall.S (KERNCALL translates
to "syscall" on amd64, and "int 0x80" on the i386)
- the next expensive operation, consuming another 100 ns,
is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
seems to scale decently, at least with 4 cores. The copyin() is
relatively inexpensive (not reported in the data below, but
disabling it saves only 15-20 ns for a short packet).
I have not followed the details, but the allocator calls the zone
allocator, there is at least one critical_enter()/critical_exit()
pair, and the highly modular architecture invokes long chains of
indirect function calls both on allocation and release.
It might make sense to keep a small pool of mbufs attached to the
socket buffer instead of going to the zone allocator (a rough
sketch of this idea appears after this list). Or defer the actual
encapsulation to the (*so->so_proto->pr_usrreqs->pru_send)(),
which is called inline anyway.
- another big bottleneck is the route lookup in ip_output()
(between entries 51 and 56). Not only does it eat another
100+ ns on an empty routing table, but it also
causes huge contention when multiple cores
are involved (a small userspace demo of this effect
appears after this list).
There is other bad stuff occurring in if_output() and
below (on this system it takes about 1300 ns to send one
packet even with one core, and only 500-550 ns are consumed
before the call to if_output()), but I don't have
detailed information yet.
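For reference, here is a minimal, self-contained sketch of the kind of
measurement loop mentioned in the first two points. It is NOT the actual
tools/tools/netrate/netsend source (and the 8 ns / 100 ns breakdown above
comes from instrumenting the kernel, not from a loop like this); it only
measures the combined userspace + syscall + stack cost per packet, using
the same destination, port and 18-byte payload as the tests:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int
main(void)
{
    char payload[18];                   /* 18-byte UDP payload, as in the tests */
    struct sockaddr_in dst;
    struct timespec t0, t1;
    long long sent = 0;
    double elapsed;
    int i, s;

    memset(payload, 'x', sizeof(payload));
    s = socket(AF_INET, SOCK_DGRAM, 0);
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(5555);
    inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        /* check the clock only every 8192 packets so its cost
         * stays out of the per-packet figure */
        for (i = 0; i < 8192; i++)
            if (sendto(s, payload, sizeof(payload), 0,
                (struct sockaddr *)&dst, sizeof(dst)) > 0)
                sent++;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        elapsed = (t1.tv_sec - t0.tv_sec) +
            (t1.tv_nsec - t0.tv_nsec) / 1e9;
    } while (elapsed < 5.0);            /* 5-second run, as above */

    printf("%lld packets in %.2f s -> %.1f ns/pkt\n",
        sent, elapsed, elapsed * 1e9 / sent);
    return (0);
}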
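To make the "small pool attached to the socket buffer" suggestion a bit
more concrete, here is a rough userspace sketch of the idea. Everything in
it (struct pktbuf, the field names, the pool size) is invented for
illustration; a real version would live in the kernel, cache struct mbuf's
in the socket buffer and deal with the locking that is completely ignored
here:

#include <stdio.h>
#include <stdlib.h>

#define SB_POOL_SIZE    8               /* cache at most this many buffers */

struct pktbuf {                         /* stand-in for struct mbuf */
    struct pktbuf  *pb_next;
    char            pb_data[256];
};

struct sockbuf_pool {                   /* imagined per-socket-buffer cache */
    struct pktbuf  *sp_free;
    int             sp_count;
};

static struct pktbuf *
pool_get(struct sockbuf_pool *sp)
{
    struct pktbuf *p = sp->sp_free;

    if (p != NULL) {                    /* fast path: no allocator involved */
        sp->sp_free = p->pb_next;
        sp->sp_count--;
        return (p);
    }
    return (malloc(sizeof(*p)));        /* slow path: general allocator */
}

static void
pool_put(struct sockbuf_pool *sp, struct pktbuf *p)
{
    if (sp->sp_count < SB_POOL_SIZE) {  /* keep it around for the next send */
        p->pb_next = sp->sp_free;
        sp->sp_free = p;
        sp->sp_count++;
    } else
        free(p);
}

int
main(void)
{
    struct sockbuf_pool sp = { NULL, 0 };
    struct pktbuf *p;
    int i;

    for (i = 0; i < 1000000; i++) {     /* alloc/free cycle, mostly hitting the cache */
        p = pool_get(&sp);
        pool_put(&sp, p);
    }
    printf("done, %d buffer(s) cached\n", sp.sp_count);
    return (0);
}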
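Finally, the jump from ~470 to ~1300 ns with 4 cores in the rtalloc()
section can be reproduced in spirit with a tiny userspace demo (nothing
FreeBSD-specific in it, and the absolute numbers will differ): a few
threads hammer a single mutex-protected "lookup", the same way the cores
serialize on the route lookup, and the per-lookup cost seen by each thread
climbs steeply as threads are added:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define OPS_PER_THREAD  1000000

static pthread_mutex_t lookup_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long table;             /* stand-in for the shared routing table */

static void *
worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < OPS_PER_THREAD; i++) {
        pthread_mutex_lock(&lookup_lock);   /* serialized "route lookup" */
        table++;                            /* dummy work under the lock */
        pthread_mutex_unlock(&lookup_lock);
    }
    return (NULL);
}

int
main(int argc, char **argv)
{
    int i, nthreads = argc > 1 ? atoi(argv[1]) : 4;
    pthread_t tid[16];
    struct timespec t0, t1;
    double ns;

    if (nthreads < 1 || nthreads > 16)
        nthreads = 4;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* wall-clock time per lookup as seen by each thread */
    printf("%d threads: %.1f ns per locked lookup\n",
        nthreads, ns / OPS_PER_THREAD);
    return (0);
}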
POS CPU clock  pktlen   ns/pkt      --- EXIT POINT ----
        (MHz)  (bytes)  min   max
-----------------------------------------------------
U 1 2934 18 8 8 userspace, before the send() call
[ syscall ]
20 1 2934 18 103 107 sys_sendto(): begin
20 4 2934 18 104 107
21 1 2934 18 110 113 sendit(): begin
21 4 2934 18 111 116
22 1 2934 18 110 114 sendit() after getsockaddr(&to, ...)
22 4 2934 18 111 124
23 1 2934 18 111 115 sendit() before kern_sendit
23 4 2934 18 112 120
24 1 2934 18 117 120 kern_sendit() after AUDIT_ARG_FD
24 4 2934 18 117 121
25 1 2934 18 134 140 kern_sendit() before sosend()
25 4 2934 18 134 146
40 1 2934 18 144 149 sosend_dgram(): start
40 4 2934 18 144 151
41 1 2934 18 157 166 sosend_dgram() before m_uiotombuf()
41 4 2934 18 157 168
[ mbuf allocation and copy. The copy is relatively cheap ]
42 1 2934 18 264 268 sosend_dgram() after m_uiotombuf()
42 4 2934 18 265 269
30 1 2934 18 273 276 udp_send() begin
30 4 2934 18 274 278
[ here we start seeing some contention with multiple threads ]
31 1 2934 18 323 324 udp_output() before ip_output()
31 4 2934 18 344 348
50 1 2934 18 326 331 ip_output() beginning
50 4 2934 18 356 367
51 1 2934 18 343 349 ip_output() before "if (opt) { ..."
51 4 2934 18 366 373
[ rtalloc() is sequential so multiple clients contend heavily ]
56 1 2934 18 470 480 ip_output() after rtalloc*()
56 4 2934 18 1310 1378
52 1 2934 18 472 488 ip_output() at sendit:
52 4 2934 18 1252 1286
53 1 2934 18 ip_output() before pfil_run_hooks()
53 4 2934 18
54 1 2934 18 476 477 ip_output() at passout:
54 4 2934 18 1249 1286
55 1 2934 18 509 526 ip_output() before if_output
55 4 2934 18 1268 1278
----------------------------------------------------------------------
cheers
luigi