m_freem()

Mon Dec 31 16:43:26 PST 2007

On Mon, 31 Dec 2007, Julian Elischer wrote:

> m_freem() would be a perfect candidate for an inline function.
> (or even macro).
> in the case where m is null, no function call would even be made...
> The whole function is only 2 lines, and it gets called once for every packet 
> :-)

On the contrary, m_freem() is a large function that is fairly unsuitable
for inlining.  I just happened to count that it usually takes 180
instructions in -current with no INVARIANTS etc. (down from 245
instructions in ~5.2).  Further counting gave 112 and 132 instructions
for it (180 was for ttcp udp input packets and 112 and 132 are for
ping packets in 2 directions).

m_freem() is only one statement, but that statement consists mainly
of a function call to a function that is inline (m_free()).  m_free()
sometimes calls m_free_ext(), which is not inline, and usually calls
uma_zfree(), which is inline, but which is just a wrapper for
uma_zfree_arg(), which is not inline.  uma_zfree_arg() is very large
and thus very unsuitable for inlining.  I didn't check for [nested]
inlining of its internals at the source level.  At runtime it usually
calls the non-inline-function m_dtor_mbuf() which calls the non-inline
function m_tag_delete_chain(); then it calls critical_enter() and
critical_exit().  critical_exit() is fairly large and sometimes calls
thread_lock(), mi_switch() and thread_unlock(), but usually doesn't.
So the non-inline part of the call chain is usually:

     m_freem()
       uma_zfree_arg()     # the following is just 1 short path through this
 	m_dtor_mbuf()
 	  m_tag_delete_chain()
 	critical_enter()
 	critical_exit()

[Pause to recover from a double fault panic in critical*().  critical*()
or kdb is non-reeantrant somehwere, so tracing through critical_*() or
one of its callers in order to count instructions tends to cause panics.]

All this is too large to inline.  Inlining only the top level of it would
only make a tiny difference.  It might make a positive or negative
difference, depending on whether the reduced instruction count has a larger
effect than the increased cache pressure.  Generally I think it is bogus
to inline at the top level.  Here inlining at the top level may win in 2
ways:
- by avoiding the function call to the next level (and thus all function
   calls) in the usual case.  I think this doesn't happen here.  I think it
   is the usual case for the m_free_ext() call in m_free(), so inlining
   m_free() is a clear win.
- by improving branch prediction.  With a branch in a non-inline function,
   it may be mispredicted often because different classes of callers
   make it go in different ways.  With branch the distributed in callers
   by inlining, it can be predicted perfectly in individual callers
   that don't change its direction often and/or change its direction
   in predictable ways.  On Athlon CPUs, mispredicting a single branch
   costs the same several function calls provided the implicit branches
   for all the function calls are not mispredicted.  Too much inlining
   is still bad.  Apart from busting icaches, it can bust branch
   prediction caches -- with enough distribution of branches, all
   branches will be mispredicted.

The m_freem() wrapper currently limits the icache bloat from the
m_free() inline.  In RELENG_4, both m_free() and m_freem() are non-inline
and non-macro.  That may be why networking in RELENG_4 is so much more
efficient than in -current ;-).  (Actually it makes little difference.)

Bruce