Some performance measurements on the FreeBSD network stack

Thu Apr 19 20:26:52 UTC 2012

On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
> On 19.04.2012 15:30, Luigi Rizzo wrote:
> >I have been running some performance tests on UDP sockets,
> >using the netsend program in tools/tools/netrate/netsend
> >and instrumenting the source code and the kernel to return at
> >various points of the path. Here are some results which
> >I hope you find interesting.
> 
> Jumping over very interesting analysis...
> 
> >- the next expensive operation, consuming another 100ns,
> >   is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
> >   seems to scale decently at least with 4 cores.  The copyin() is
> >   relatively inexpensive (not reported in the data below, but
> >   disabling it saves only 15-20ns for a short packet).
> >
> >   I have not followed the details, but the allocator calls the zone
> >   allocator and there is at least one critical_enter()/critical_exit()
> >   pair, and the highly modular architecture invokes long chains of
> >   indirect function calls both on allocation and release.
> >
> >   It might make sense to keep a small pool of mbufs attached to the
> >   socket buffer instead of going to the zone allocator.
> >   Or defer the actual encapsulation to the
> >   (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.
> 
> The UMA mbuf allocator is certainly not perfect but rather good.
> It has a per-CPU cache of mbuf's that are very fast to allocate
> from.  Once it has used them it needs to refill from the global
> pool which may happen from time to time and show up in the averages.

Indeed, I was pleased to see no difference between 1 and 4 threads.
This also suggests that the global pool is accessed very seldom,
and only briefly, otherwise you'd see the effect with 4 threads.

What might be moderately expensive are the critical_enter()/critical_exit()
calls around individual allocations.
The allocation happens while the code already holds an exclusive
lock on so->snd_buf, so a pool of fresh buffers could be attached
there.
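
A minimal userspace sketch of the pool idea above, not kernel code.
It assumes the caller already holds the exclusive lock on the socket
buffer, so the pool itself needs no locking. All names (sockbuf_pool,
pool_get, pool_put, POOL_CAP, BUF_SIZE) are hypothetical, and malloc()
merely stands in for the zone allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAP 8    /* hypothetical pool size */
#define BUF_SIZE 256  /* stand-in for a small mbuf */

struct buf {
    struct buf *next;
    char data[BUF_SIZE];
};

/* Per-socket-buffer pool; protected by the lock the caller already holds. */
struct sockbuf_pool {
    struct buf *free;       /* singly linked free list */
    int nfree;              /* buffers currently in the pool */
    unsigned long refills;  /* trips to the global allocator */
};

/* Refill in one batch so the global allocator is visited rarely. */
static int pool_refill(struct sockbuf_pool *p)
{
    while (p->nfree < POOL_CAP) {
        struct buf *b = malloc(sizeof(*b));
        if (b == NULL)
            break;
        b->next = p->free;
        p->free = b;
        p->nfree++;
    }
    p->refills++;
    return p->nfree;
}

/* Fast path: pop from the local list, no global allocator involved. */
static struct buf *pool_get(struct sockbuf_pool *p)
{
    if (p->free == NULL && pool_refill(p) == 0)
        return NULL;
    assert(p->nfree > 0);
    struct buf *b = p->free;
    p->free = b->next;
    p->nfree--;
    b->next = NULL;
    return b;
}

/* Return a buffer; anything beyond POOL_CAP goes back to the system. */
static void pool_put(struct sockbuf_pool *p, struct buf *b)
{
    if (p->nfree >= POOL_CAP) {
        free(b);
        return;
    }
    b->next = p->free;
    p->free = b;
    p->nfree++;
}
```

With a zero-initialized pool, the first pool_get() triggers one batch
refill and the following POOL_CAP - 1 calls are served from the local
list. Whether this actually beats the per-CPU cache would have to be
measured.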

But the other consideration is that one could defer the mbuf allocation
to a later time, when the packet is actually built (or in any case
right before the thread returns).
What I envision (and this would fit nicely with netmap) is the following:
- have a (possibly readonly) template for the headers (MAC+IP+UDP)
  attached to the socket, built on demand, and cached and managed
  with similar invalidation rules as used by fastforward;
- possibly extend the pru_send interface so one can pass down the uio
  instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
  where the code already has an x-lock on some resource (could be
  the snd_buf, the interface, ...) so the allocation comes for free.
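
The template part of the list above can be sketched in userspace as
follows. This is only an illustration under stated assumptions: the
names (hdr_template, template_get, build_packet) and the generation
counter used for invalidation are hypothetical, addresses and ports
are left zeroed, and checksums are not computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN 14
#define IP_HLEN  20
#define UDP_HLEN 8
#define HDR_LEN  (ETH_HLEN + IP_HLEN + UDP_HLEN)

/* Cached MAC+IP+UDP header, tagged with the state it was built from. */
struct hdr_template {
    uint8_t bytes[HDR_LEN];
    unsigned gen;   /* hypothetical route/ARP generation at build time */
    int valid;
};

static void put_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)(v & 0xff);
}

/* Build on demand; returns 1 if the template had to be rebuilt. */
static int template_get(struct hdr_template *t, unsigned cur_gen)
{
    if (t->valid && t->gen == cur_gen)
        return 0;
    memset(t->bytes, 0, HDR_LEN);
    t->bytes[12] = 0x08;            /* ethertype: IPv4 (0x0800) */
    t->bytes[13] = 0x00;
    t->bytes[ETH_HLEN] = 0x45;      /* IPv4, header length 5 words */
    t->bytes[ETH_HLEN + 8] = 64;    /* TTL */
    t->bytes[ETH_HLEN + 9] = 17;    /* protocol: UDP */
    t->gen = cur_gen;
    t->valid = 1;
    return 1;
}

/*
 * Late packet construction: copy the template, patch the two length
 * fields, append the payload. Returns the packet length, or 0 if the
 * payload does not fit.
 */
static size_t build_packet(struct hdr_template *t, unsigned cur_gen,
                           const void *payload, size_t len,
                           uint8_t *out, size_t outlen)
{
    assert(out != NULL);
    if (len > 0xffff - IP_HLEN - UDP_HLEN || HDR_LEN + len > outlen)
        return 0;
    template_get(t, cur_gen);
    memcpy(out, t->bytes, HDR_LEN);
    put_be16(out + ETH_HLEN + 2, (uint16_t)(IP_HLEN + UDP_HLEN + len));
    put_be16(out + ETH_HLEN + IP_HLEN + 4, (uint16_t)(UDP_HLEN + len));
    memcpy(out + HDR_LEN, payload, len);
    return HDR_LEN + len;
}
```

The per-packet work is then two memcpy() calls and two length patches.
Bumping the generation counter models the invalidation step: the next
build_packet() call rebuilds the template before using it.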

> >- another big bottleneck is the route lookup in ip_output()
> >   (between entries 51 and 56). Not only does it eat another
> >   100ns+ on an empty routing table, but it also
> >   causes huge contention when multiple cores
> >   are involved.
> 
> This is indeed a big problem.  I'm working (rough edges remain) on
> changing the routing table locking to an rmlock (read-mostly) which

I was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets?

cheers
luigi