Some performance measurements on the FreeBSD network stack
Luigi Rizzo
rizzo at iet.unipi.it
Thu Apr 19 13:10:53 UTC 2012
I have been running some performance tests on UDP sockets,
using the netsend program in tools/tools/netrate/netsend
and instrumenting the source code and the kernel to return at
various points of the path. Here are some results which
I hope you find interesting.
Test conditions:
- Intel i7-870 CPU running at 2.93 GHz + TurboBoost,
all 4 cores enabled, no hyperthreading
- FreeBSD HEAD as of 15 April 2012, no ipfw, no other
pfil clients, no IPv6 or IPsec.
- userspace running 'netsend 10.0.0.2 5555 18 0 5'
(output to a physical interface, udp port 5555, small
frame, no rate limitations, 5sec experiments)
- the 'ns' column reports
the total time divided by the number of successful
transmissions; we report the min and the max over 5 tests
(a worked example follows this list)
- 1 to 4 parallel tasks, variable packet sizes
- there are variations in the numbers, which become
larger as we move down the stack
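As a worked example of the 'ns' column: the first sys_sendto() row in the
table below reports about 103 ns/pkt with a single core, which over a
5-second run corresponds to roughly 5*10^9 / 103, i.e. about 48 million
successful sendto() calls.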
Caveats:
- in the table below, clock and pktlen are constant.
I am including the info here so it is easier to compare
the results with future experiments.
- I have a small number of samples, so I am only reporting
the min and the max over a handful of experiments.
- I am only measuring average values over millions of
cycles; I have no information on the variance across
the various executions.
- from what I have seen, numbers vary significantly on
different systems, depending on memory speed, caches
and other things. The big jumps are significant and present
on all systems, but the small deltas (say < 5%) are
not even statistically significant.
- if someone is interested in replicating the experiments,
email me and I will post a link to a suitable PicoBSD image.
- I have not yet instrumented the bottom layers (if_output
and below).
The results show a few interesting things:
- the packet-sending application is reasonably fast
and certainly not a bottleneck (over 100 Mpps before
issuing the system call); a sketch of a comparable
measurement loop appears after this list;
- the system call is somewhat expensive, about 100 ns.
I am not sure where the time is spent (the amd64 code
does a few pushes on the stack and then runs "syscall",
followed by a "sysret"), nor how much
room for improvement there is in this area.
The relevant code is in lib/libc/i386/SYS.h and
lib/libc/i386/sys/syscall.S (KERNCALL translates
to "syscall" on amd64, and "int 0x80" on the i386)
- the next expensive operation, consuming another 100 ns,
is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
seems to scale decently, at least with 4 cores. The copyin() is
relatively inexpensive (not reported in the data below, but
disabling it saves only 15-20 ns for a short packet).
I have not followed the details, but the allocator calls the zone
allocator, there is at least one critical_enter()/critical_exit()
pair, and the highly modular architecture invokes long chains of
indirect function calls both on allocation and release.
It might make sense to keep a small pool of mbufs attached to the
socket buffer instead of going to the zone allocator (a rough
sketch of this idea appears after this list). Or defer the actual
encapsulation to the (*so->so_proto->pr_usrreqs->pru_send)(),
which is called inline anyway.
- another big bottleneck is the route lookup in ip_output()
(between entries 51 and 56). Not only does it eat another
100+ ns on an empty routing table, but it also
causes huge contention when multiple cores
are involved (a small userspace demo of this effect
appears after this list).
There is other bad stuff occurring in if_output() and
below (on this system it takes about 1300 ns to send one
packet even with one core, and only 500-550 ns are consumed
before the call to if_output()), but I don't have
detailed information yet.
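For reference, here is a minimal, self-contained sketch of the kind of
measurement loop mentioned in the first two points. It is NOT the actual
tools/tools/netrate/netsend source (and the 8 ns / 100 ns breakdown above
comes from instrumenting the kernel, not from a loop like this); it only
measures the combined userspace + syscall + stack cost per packet, using
the same destination, port and 18-byte payload as the tests:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int
main(void)
{
    char payload[18];                   /* 18-byte UDP payload, as in the tests */
    struct sockaddr_in dst;
    struct timespec t0, t1;
    long long sent = 0;
    double elapsed;
    int i, s;

    memset(payload, 'x', sizeof(payload));
    s = socket(AF_INET, SOCK_DGRAM, 0);
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(5555);
    inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        /* check the clock only every 8192 packets so its cost
         * stays out of the per-packet figure */
        for (i = 0; i < 8192; i++)
            if (sendto(s, payload, sizeof(payload), 0,
                (struct sockaddr *)&dst, sizeof(dst)) > 0)
                sent++;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        elapsed = (t1.tv_sec - t0.tv_sec) +
            (t1.tv_nsec - t0.tv_nsec) / 1e9;
    } while (elapsed < 5.0);            /* 5-second run, as above */

    printf("%lld packets in %.2f s -> %.1f ns/pkt\n",
        sent, elapsed, elapsed * 1e9 / sent);
    return (0);
}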
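To make the "small pool attached to the socket buffer" suggestion a bit
more concrete, here is a rough userspace sketch of the idea. Everything in
it (struct pktbuf, the field names, the pool size) is invented for
illustration; a real version would live in the kernel, cache struct mbuf's
in the socket buffer and deal with the locking that is completely ignored
here:

#include <stdio.h>
#include <stdlib.h>

#define SB_POOL_SIZE    8               /* cache at most this many buffers */

struct pktbuf {                         /* stand-in for struct mbuf */
    struct pktbuf  *pb_next;
    char            pb_data[256];
};

struct sockbuf_pool {                   /* imagined per-socket-buffer cache */
    struct pktbuf  *sp_free;
    int             sp_count;
};

static struct pktbuf *
pool_get(struct sockbuf_pool *sp)
{
    struct pktbuf *p = sp->sp_free;

    if (p != NULL) {                    /* fast path: no allocator involved */
        sp->sp_free = p->pb_next;
        sp->sp_count--;
        return (p);
    }
    return (malloc(sizeof(*p)));        /* slow path: general allocator */
}

static void
pool_put(struct sockbuf_pool *sp, struct pktbuf *p)
{
    if (sp->sp_count < SB_POOL_SIZE) {  /* keep it around for the next send */
        p->pb_next = sp->sp_free;
        sp->sp_free = p;
        sp->sp_count++;
    } else
        free(p);
}

int
main(void)
{
    struct sockbuf_pool sp = { NULL, 0 };
    struct pktbuf *p;
    int i;

    for (i = 0; i < 1000000; i++) {     /* alloc/free cycle, mostly hitting the cache */
        p = pool_get(&sp);
        pool_put(&sp, p);
    }
    printf("done, %d buffer(s) cached\n", sp.sp_count);
    return (0);
}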
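Finally, the jump from ~470 to ~1300 ns with 4 cores in the rtalloc()
section can be reproduced in spirit with a tiny userspace demo (nothing
FreeBSD-specific in it, and the absolute numbers will differ): a few
threads hammer a single mutex-protected "lookup", the same way the cores
serialize on the route lookup, and the per-lookup cost seen by each thread
climbs steeply as threads are added:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define OPS_PER_THREAD  1000000

static pthread_mutex_t lookup_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long table;             /* stand-in for the shared routing table */

static void *
worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < OPS_PER_THREAD; i++) {
        pthread_mutex_lock(&lookup_lock);   /* serialized "route lookup" */
        table++;                            /* dummy work under the lock */
        pthread_mutex_unlock(&lookup_lock);
    }
    return (NULL);
}

int
main(int argc, char **argv)
{
    int i, nthreads = argc > 1 ? atoi(argv[1]) : 4;
    pthread_t tid[16];
    struct timespec t0, t1;
    double ns;

    if (nthreads < 1 || nthreads > 16)
        nthreads = 4;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* wall-clock time per lookup as seen by each thread */
    printf("%d threads: %.1f ns per locked lookup\n",
        nthreads, ns / OPS_PER_THREAD);
    return (0);
}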
POS CPU clock  pktlen   ns/pkt      --- EXIT POINT ----
        (MHz)  (bytes)  min   max
-----------------------------------------------------
U 1 2934 18 8 8 userspace, before the send() call
[ syscall ]
20 1 2934 18 103 107 sys_sendto(): begin
20 4 2934 18 104 107
21 1 2934 18 110 113 sendit(): begin
21 4 2934 18 111 116
22 1 2934 18 110 114 sendit() after getsockaddr(&to, ...)
22 4 2934 18 111 124
23 1 2934 18 111 115 sendit() before kern_sendit
23 4 2934 18 112 120
24 1 2934 18 117 120 kern_sendit() after AUDIT_ARG_FD
24 4 2934 18 117 121
25 1 2934 18 134 140 kern_sendit() before sosend()
25 4 2934 18 134 146
40 1 2934 18 144 149 sosend_dgram(): start
40 4 2934 18 144 151
41 1 2934 18 157 166 sosend_dgram() before m_uiotombuf()
41 4 2934 18 157 168
[ mbuf allocation and copy. The copy is relatively cheap ]
42 1 2934 18 264 268 sosend_dgram() after m_uiotombuf()
42 4 2934 18 265 269
30 1 2934 18 273 276 udp_send() begin
30 4 2934 18 274 278
[ here we start seeing some contention with multiple threads ]
31 1 2934 18 323 324 udp_output() before ip_output()
31 4 2934 18 344 348
50 1 2934 18 326 331 ip_output() beginning
50 4 2934 18 356 367
51 1 2934 18 343 349 ip_output() before "if (opt) { ..."
51 4 2934 18 366 373
[ rtalloc() is sequential so multiple clients contend heavily ]
56 1 2934 18 470 480 ip_output() after rtalloc*()
56 4 2934 18 1310 1378
52 1 2934 18 472 488 ip_output() at sendit:
52 4 2934 18 1252 1286
53 1 2934 18 ip_output() before pfil_run_hooks()
53 4 2934 18
54 1 2934 18 476 477 ip_output() at passout:
54 4 2934 18 1249 1286
55 1 2934 18 509 526 ip_output() before if_output
55 4 2934 18 1268 1278
----------------------------------------------------------------------
cheers
luigi