Some performance measurements on the FreeBSD network stack
Luigi Rizzo
rizzo at iet.unipi.it
Tue Apr 24 13:42:57 UTC 2012
On Tue, Apr 24, 2012 at 03:16:48PM +0200, Andre Oppermann wrote:
> On 19.04.2012 22:46, Luigi Rizzo wrote:
> >On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
> >>On 19.04.2012 15:30, Luigi Rizzo wrote:
> >>>I have been running some performance tests on UDP sockets,
> >>>using the netsend program in tools/tools/netrate/netsend
> >>>and instrumenting the source code and the kernel to return at
> >>>various points of the path. Here are some results which
> >>>I hope you find interesting.
> >>>- another big bottleneck is the route lookup in ip_output()
> >>>  (between entries 51 and 56). Not only does it eat another
> >>>  100 ns+ on an empty routing table, but it also
> >>>  causes huge contention when multiple cores
> >>>  are involved.
> >>
> >>This is indeed a big problem. I'm working (rough edges remain) on
> >>changing the routing table locking to an rmlock (read-mostly) which
> >
> >I was wondering: is there a way (and/or any advantage) to use the
> >fastforward code to look up the route for locally sourced packets?
>
> I've completed the updating of the routing table rmlock patch. There
> are two steps. Step one is just changing the rwlock to an rmlock.
> Step two streamlines the route lookup in ip_output and ip_fastfwd by
> copying out the relevant data while only holding the rmlock instead
> of obtaining a reference to the route.
>
> Would be very interesting to see how your benchmark/profiling changes
> with these patches applied.
If you want to give it a try yourself, the high-level benchmark is
just the 'netsend' program from tools/tools/netrate/netsend -- I
am running something like

    for i in $X ; do
        # args: dst-ip dst-port payload-bytes rate duration
        netsend 10.0.0.2 5555 18 0 5 &
    done

and the cardinality of $X (one sender process per word in it) can be
used to test contention on the lower layers (routing tables and
interface/queues).
From previous tests, the difference between the flowtable and the
routing table was small with a single process (about 5%, or 50 ns
of the total packet processing time, if I remember correctly),
but there was a large gain with multiple concurrent processes.
Probably the change in throughput between HEAD and your
branch is all you need. The info below shows that your
gain is somewhere around 100-200 ns, depending on how good
the info you return back is (see below).
My profiling changes were mostly aimed at charging the costs to the
various layers. With my current setup (single process, i7-870 @ 2933
MHz + Turbo Boost, ixgbe, FreeBSD HEAD, FLOWTABLE enabled, UDP) I see
the following:
    File              Function/description                 Total   Delta
                                                            (nanoseconds)
    ----------------  -----------------------------------  -----   -----
    user program      sendto()                                  8      96
                        system call
    uipc_syscalls.c   sys_sendto                              104
    uipc_syscalls.c   sendit                                  111
    uipc_syscalls.c   kern_sendit                             118
    uipc_socket.c     sosend
    uipc_socket.c     sosend_dgram                            146     137
                        sockbuf locking, mbuf alloc, copyin
    udp_usrreq.c      udp_send                                273
    udp_usrreq.c      udp_output                              273      57
    ip_output.c       ip_output                               330     198
                        route lookup, ip header setup
    if_ethersubr.c    ether_output                            528     162
                        MAC header lookup and construction,
                        loopback checks
    if_ethersubr.c    ether_output_frame                      690
    ixgbe.c           ixgbe_mq_start                          698
    ixgbe.c           ixgbe_mq_start_locked                   720
    ixgbe.c           ixgbe_xmit                              730     220
                        mbuf mangling, device programming
    --                packet on the wire                      950
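
(For the curious: numbers like the above can be collected by taking
TSC timestamps at each point of interest and subtracting. A minimal
sketch of this kind of probe -- hypothetical names, not my exact
instrumentation:)

    #include <sys/types.h>
    #include <machine/cpufunc.h>            /* rdtsc() */

    #define NPROBES 64
    static uint64_t probe_tsc[NPROBES];     /* raw TSC per probe point */

    /* Record the timestamp counter at probe point 'n' on the tx path. */
    static __inline void
    probe(int n)
    {
            probe_tsc[n] = rdtsc();
    }

    /*
     * Convert a TSC delta to nanoseconds given the TSC frequency.
     * Fine for short intervals; the multiply overflows past a few
     * seconds' worth of cycles.
     */
    static __inline uint64_t
    tsc_to_ns(uint64_t delta, uint64_t freq_hz)
    {
            return (delta * 1000000000ULL / freq_hz);
    }
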
Removing the flowtable increases the cost in ip_output()
(obviously), but also in ether_output() (because the
route does not have an lle entry, so you need to call
arpresolve() on each packet). It also causes trouble
in the device driver: because the mbuf does not have a
flowid set, the ixgbe driver puts the packet on the
queue corresponding to the current CPU. If the process
(as in my case) floats between CPUs, one flow might end
up on multiple queues.
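
The driver-side queue pick is roughly the following (paraphrased
from the FreeBSD 9-era ixgbe_mq_start(), not a verbatim copy):

    /* Paraphrase of the tx queue selection in ixgbe_mq_start(). */
    int i;

    if ((m->m_flags & M_FLOWID) != 0)
            /* The mbuf carries a flow id: same flow, same queue. */
            i = m->m_pkthdr.flowid % adapter->num_queues;
    else
            /*
             * No flow id: fall back to the current CPU, so a
             * migrating sender spreads one flow over several queues.
             */
            i = curcpu % adapter->num_queues;
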
So in revising the route lookup I believe it would be good
if we could also fetch, in one go, most of the info that
ether_output() is computing again and again.
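
Something along these lines (hypothetical names and fields, just to
illustrate the idea) would let one lookup serve the whole path:

    #include <sys/types.h>
    #include <netinet/in.h>                 /* struct in_addr */

    struct ifnet;                           /* from net/if_var.h */

    /* Hypothetical combined result of a single route + L2 lookup. */
    struct tx_lookup_info {
            struct ifnet    *ifp;           /* outgoing interface */
            struct in_addr   gw;            /* next-hop address */
            u_char           dst_mac[6];    /* resolved MAC; no per-packet
                                               arpresolve() */
            uint32_t         flowid;        /* stable tx queue selection */
            uint32_t         mtu;           /* interface/route MTU */
    };
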
cheers
luigi
> http://svn.freebsd.org/changeset/base/234649
> Log:
> Change the radix head lock to an rmlock (read mostly lock).
>
> There is some header pollution going on because rmlocks are
> not entirely abstracted and need per-CPU structures.
>
> A comment in _rmlock.h says this can be hidden if there were
> per-cpu linker magic/support. I don't know if we have that
> already.
>
> http://svn.freebsd.org/changeset/base/234650
> Log:
> Add a function rtlookup() that copies out the relevant information
> from an rtentry instead of returning the rtentry. This avoids the
> need to lock the rtentry and to increase the refcount on it.
>
> Convert ip_output() to use rtlookup() in a simplistic way. Certain
> seldom-used functionality may not work anymore, and the flowtable
> isn't available at the moment.
>
> Convert ip_fastfwd() to use rtlookup().
>
> This code is meant to be used for profiling and to be experimented
> with further to determine which locking strategy returns the best
> results.
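
(For reference, the rmlock read side that makes this cheap looks
roughly like the following -- a sketch with a hypothetical lock
name, not the actual rtlookup() code:)

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/rmlock.h>

    /* Per-thread tracker required by the rmlock read path. */
    struct rm_priotracker tracker;

    rm_rlock(&rt_rmlock, &tracker);  /* read lock: no writes to shared
                                        cache lines on the fast path */
    /* ... copy the fields we need out of the rtentry ... */
    rm_runlock(&rt_rmlock, &tracker);
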
>
> Make sure to apply this one as well:
> http://svn.freebsd.org/changeset/base/234648
> Log:
> Add INVARIANT and WITNESS support to rm_lock locks and optimize the
> synchronization path by replacing a LIST of active readers with a
> TAILQ.
>
> Obtained from: Isilon
> Submitted by: mlaier
>
> --
> Andre