more network performance info: ether_output()

Tue May 1 17:34:12 UTC 2012

On May 1, 2012, at 11:40 , Luigi Rizzo wrote:

> On Tue, May 01, 2012 at 10:27:42AM -0400, George Neville-Neil wrote:
>> 
>> On Apr 20, 2012, at 15:03 , Luigi Rizzo wrote:
>> 
>>> Continuing my profiling on network performance, another place
>>> where we waste a lot of time is if_ethersubr.c::ether_output().
>>> 
>>> In particular, from the beginning of ether_output() to the
>>> final call to ether_output_frame() the code takes slightly
>>> more than 210ns on my i7-870 CPU running at 2.93 GHz + TurboBoost.
>>> The main costs are:
>>> 
>>> - the route does not have a MAC address (lle) attached, which causes
>>> arpresolve() to be called every time. This consumes about 100ns.
>>> It also happens with locally sourced TCP.
>>> Using the flowtable cuts this time down to about 30-40ns.
>>> 
>>> - another 100ns is spent copying the MAC header into the mbuf,
>>> and then checking whether a local copy should be looped back.
>>> Unfortunately the code here is a bit convoluted, so the
>>> header fields are copied twice, using memcpy on the
>>> individual pieces.
>>> 
>>> Note that all the above happens not just with my udp flooding
>>> tests, but also with regular TCP traffic.
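As a rough userland illustration of the header-copy point above: the sketch below contrasts filling the 14-byte Ethernet header field by field with copying a header prepared once in a single memcpy. It is not the code in if_ethersubr.c; the names eth_hdr, fill_piecewise and fill_prebuilt are made up for this example, and whether the single copy actually saves time would have to be measured.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6      /* bytes in a MAC address */
#define ETH_HLEN 14     /* dst + src + ethertype */

/* Hypothetical header layout; byte arrays only, so there is no padding. */
struct eth_hdr {
	uint8_t dst[ETH_ALEN];
	uint8_t src[ETH_ALEN];
	uint8_t type[2];    /* ethertype, network byte order */
};

/* Field-by-field fill: three small memcpy calls per packet. */
static void
fill_piecewise(uint8_t *buf, const uint8_t *dst, const uint8_t *src,
    const uint8_t *type)
{
	assert(buf != NULL);
	memcpy(buf, dst, ETH_ALEN);
	memcpy(buf + ETH_ALEN, src, ETH_ALEN);
	memcpy(buf + 2 * ETH_ALEN, type, 2);
}

/* Single copy from a header template prepared once per destination. */
static void
fill_prebuilt(uint8_t *buf, const struct eth_hdr *tmpl)
{
	assert(buf != NULL && sizeof(*tmpl) == ETH_HLEN);
	memcpy(buf, tmpl, ETH_HLEN);
}
```

Both functions leave the same 14 bytes in buf; the second one only moves the per-field work out of the per-packet path.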
>> 
>> Hi Luigi,
>> 
>> I'm really glad you're working on this.  I may have missed this in a thread
>> but are you tracking these somewhere so we can pick them up and fix them?
>> 
>> Also, how are you doing the measurements?
> 
> The measurements are done with tools/tools/netrate/netsend and
> kernel patches to return from sendto() at various places in the
> stack (from the syscall entry point down to the device driver).
> A patch is attached. You don't really need netmap to run it,
> it was just a convenient place to put the variables.
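The early-return technique described above can be mimicked in userland to show the idea. This is only a sketch under assumptions: the layer functions, the stop_at variable and the stop_level names are invented here and do not correspond to the variables in the attached patch. Timing the call with successive stop levels and subtracting adjacent results gives a per-layer cost estimate.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical stop points, from the syscall entry down to the driver. */
enum stop_level { STOP_SYSCALL = 1, STOP_SOCKET, STOP_PROTO, STOP_ETHER,
    STOP_DRIVER };

static int stop_at = STOP_DRIVER;       /* where the call returns early */
static volatile unsigned long work;     /* keeps the layers from being elided */

static int driver_tx(void) { work++; return (STOP_DRIVER); }

static int
ether_out(void)
{
	work++;
	return (stop_at == STOP_ETHER ? STOP_ETHER : driver_tx());
}

static int
proto_out(void)
{
	work++;
	return (stop_at == STOP_PROTO ? STOP_PROTO : ether_out());
}

static int
sock_out(void)
{
	work++;
	return (stop_at == STOP_SOCKET ? STOP_SOCKET : proto_out());
}

static int
fake_sendto(void)
{
	work++;
	return (stop_at == STOP_SYSCALL ? STOP_SYSCALL : sock_out());
}

/* Average nanoseconds per call when returning at the given level. */
static double
ns_per_call(int level, long iters)
{
	struct timespec t0, t1;

	assert(iters > 0);
	stop_at = level;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++)
		(void)fake_sendto();
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (((t1.tv_sec - t0.tv_sec) * 1e9 +
	    (t1.tv_nsec - t0.tv_nsec)) / iters);
}
```

For example, ns_per_call(STOP_ETHER, n) minus ns_per_call(STOP_PROTO, n) approximates the cost of the layer between those two stop points.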
> 
> I am not sure how much we can "fix", there are multiple expensive
> functions on the tx path, and probably also on the rx path.
> 
> My hope, at least for the tx path, is that we can find a way to install a
> "fastpath" handler in the socket.
> When there is no handler installed (e.g. on the first packet or
> unsupported protocols/interfaces) everything works as usual. Then
> when the packet reaches the bottom of the stack, we try to update
> the socket with a copy of the headers generated in the process, and
> the name of the fastpath function to be called.  Subsequent transmissions
> will then be able to shortcut the stack and go straight to the
> device output routine.
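To make the fastpath idea above concrete, here is a minimal userland sketch. Everything in it is hypothetical: fake_sock, slow_send, fast_send and dev_output are stand-ins, not kernel interfaces, and the handler is stored as a function pointer rather than a name. The first send takes the slow path, which caches the headers it built and installs the handler; later sends go straight to the output routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_MAX     64
#define PAYLOAD_MAX 1500

struct fake_sock;
typedef int (*fastpath_fn)(struct fake_sock *, const void *, size_t);

/* Hypothetical socket state; not the kernel's struct socket. */
struct fake_sock {
	fastpath_fn fastpath;           /* NULL until the slow path installs it */
	uint8_t     hdr[HDR_MAX];       /* cached copy of the generated headers */
	size_t      hdr_len;
	unsigned    slow_calls, fast_calls;
	uint8_t     frame[HDR_MAX + PAYLOAD_MAX];   /* last frame "sent" */
	size_t      frame_len;
};

/* Stand-in for the device output routine: headers followed by payload. */
static int
dev_output(struct fake_sock *so, const uint8_t *hdr, size_t hdr_len,
    const void *payload, size_t len)
{
	if (hdr_len > HDR_MAX || len > PAYLOAD_MAX)
		return (-1);
	memcpy(so->frame, hdr, hdr_len);
	memcpy(so->frame + hdr_len, payload, len);
	so->frame_len = hdr_len + len;
	return (0);
}

/* Fast path: reuse the cached headers and skip the rest of the stack. */
static int
fast_send(struct fake_sock *so, const void *payload, size_t len)
{
	so->fast_calls++;
	return (dev_output(so, so->hdr, so->hdr_len, payload, len));
}

/* Slow path: pretend the full stack built these 14 header bytes. */
static int
slow_send(struct fake_sock *so, const void *payload, size_t len)
{
	static const uint8_t built[14] = { 2, 0, 0, 0, 0, 1,
	    2, 0, 0, 0, 0, 2, 0x08, 0x00 };
	int error;

	so->slow_calls++;
	error = dev_output(so, built, sizeof(built), payload, len);
	if (error == 0) {
		/* At the bottom of the stack: cache headers, install handler. */
		memcpy(so->hdr, built, sizeof(built));
		so->hdr_len = sizeof(built);
		so->fastpath = fast_send;
	}
	return (error);
}

static int
sock_send(struct fake_sock *so, const void *payload, size_t len)
{
	assert(so != NULL);
	if (so->fastpath != NULL)
		return (so->fastpath(so, payload, len));
	return (slow_send(so, payload, len));
}
```

A real version would also have to patch per-packet header fields (lengths, checksums) and drop the cached state when the route or interface changes; the sketch ignores both.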
> 
> I don't have data on the receive path or good ideas on how to proceed -- the
> advantage of the tx path is that traffic is implicitly classified,
> whereas it might not be the case for incoming traffic, and classification
> might be the expensive step.
> 
> Hopefully we'll have time to discuss this next week in Ottawa.

Yes, I think we should.

Best,
George