Network stack changes

Sun Sep 22 20:13:11 UTC 2013

On 29.08.2013 15:49, Adrian Chadd wrote:
> Hi,
Hello Adrian!
I'm very sorry for the very late reply.

>
> There's a lot of good stuff to review here, thanks!
>
> Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to 
> keep locking things like that on a per-packet basis. We should be able 
> to do this in a cleaner way - we can defer RX into a CPU pinned 
> taskqueue and convert the interrupt handler to a fast handler that 
> just schedules that taskqueue. We can ignore the ithread entirely here.
>
> What do you think?
Well, it sounds good :) But performance numbers and Jack's opinion are
more important :)

Are you going to Malta?
>
> Totally pie in the sky handwaving at this point:
>
> * create an array of mbuf pointers for completed mbufs;
> * populate the mbuf array;
> * pass the array up to ether_demux().
>
> For vlan handling, it may end up populating its own list of mbufs to 
> push up to ether_demux(). So maybe we should extend the API to have a 
> bitmap of packets to actually handle from the array, so we can pass up 
> a larger array of mbufs, note which ones are for the destination and 
> then the upcall can mark which frames it's consumed.
>
> I specifically wonder how much work/benefit we may see by doing:
>
> * batching packets into lists so various steps can batch process 
> things rather than run to completion;
> * batching the processing of a list of frames under a single lock 
> instance - eg, if the forwarding code could do the forwarding lookup 
> for 'n' packets under a single lock, then pass that list of frames up 
> to inet_pfil_hook() to do the work under one lock, etc, etc.
I'm thinking along the same lines, but we're stuck on the 'forwarding
lookup' because of the problem with the egress interface pointer, as I
mentioned earlier. However, it would be interesting to see how much
batching helps, regardless of locking.
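
To make the batching idea above concrete, here is a minimal userland
sketch, not kernel code. Everything in it is an assumption made for
illustration: struct pkt stands in for an mbuf, struct pkt_batch and the
function names are hypothetical, toy_route_lookup() is a placeholder for
a real forwarding lookup, and a pthread mutex stands in for the routing
lock. It shows an array of packet pointers with a bitmap of frames still
to be handled, a vlan layer that marks the frames it consumed, and a
forwarding lookup done for the whole batch under one lock acquisition.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 64            /* one bit per frame in a 64-bit bitmap */

/* Hypothetical stand-in for an mbuf; only what the sketch needs. */
struct pkt {
    uint16_t vlan;              /* 0 means untagged */
    uint32_t dst;               /* destination IPv4 address, host order */
    int      out_if;            /* set by the batched lookup, -1 = unset */
};

/* Hypothetical batch: packet pointers plus a "still pending" bitmap. */
struct pkt_batch {
    struct pkt *pkts[BATCH_MAX];
    size_t      n;
    uint64_t    pending;        /* bit i set: pkts[i] not yet consumed */
};

pthread_mutex_t route_lock = PTHREAD_MUTEX_INITIALIZER;

/* Toy lookup (assumption): the top address byte picks one of 4 ports. */
int toy_route_lookup(uint32_t dst)
{
    return (int)((dst >> 24) % 4);
}

void batch_init(struct pkt_batch *b, struct pkt **pkts, size_t n)
{
    assert(n <= BATCH_MAX);
    b->n = n;
    b->pending = (n == BATCH_MAX) ? UINT64_MAX : ((UINT64_C(1) << n) - 1);
    for (size_t i = 0; i < n; i++)
        b->pkts[i] = pkts[i];
}

/* A vlan layer takes only frames carrying its tag and clears their bits. */
size_t vlan_input_batch(struct pkt_batch *b, uint16_t vlan)
{
    size_t taken = 0;

    for (size_t i = 0; i < b->n; i++) {
        uint64_t bit = UINT64_C(1) << i;
        if ((b->pending & bit) != 0 && b->pkts[i]->vlan == vlan) {
            b->pending &= ~bit;         /* mark frame as consumed */
            taken++;
        }
    }
    return taken;
}

/* Resolve every pending frame under a single lock acquisition. */
size_t forward_lookup_batch(struct pkt_batch *b)
{
    size_t done = 0;

    pthread_mutex_lock(&route_lock);
    for (size_t i = 0; i < b->n; i++) {
        if ((b->pending & (UINT64_C(1) << i)) != 0) {
            b->pkts[i]->out_if = toy_route_lookup(b->pkts[i]->dst);
            done++;
        }
    }
    pthread_mutex_unlock(&route_lock);
    return done;
}
```

With three frames where only the second carries vlan 100, calling
vlan_input_batch(&b, 100) clears one bit, and forward_lookup_batch(&b)
then takes the lock once for the remaining two frames instead of once
per packet. The sketch says nothing about the egress interface pointer
problem; it only shows the shape of the API.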

Currently I'm thinking that we should try replacing radix with something
different (it seems that this can be checked quickly) and see what
happens. Luigi's performance numbers for our radix are awful, and there
is a patch implementing an alternative trie:
http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
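
For readers unfamiliar with range-based lookup, the following is a
minimal sketch of the general idea only. It is not taken from the patch
above and makes no claim about how that patch is structured. The
assumption is that the prefixes have already been expanded into sorted,
non-overlapping address ranges, each mapped to a next hop; lookup is
then a binary search for the last range starting at or below the
address. struct range_entry and range_lookup() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One range: covers [start, next entry's start - 1]. */
struct range_entry {
    uint32_t start;     /* first address of the range, host byte order */
    int      nexthop;   /* hypothetical next-hop index */
};

/*
 * Find the next hop for addr.
 * The table must be sorted by start and begin at address 0, so that
 * every address falls into exactly one range.
 */
int range_lookup(const struct range_entry *table, size_t n, uint32_t addr)
{
    size_t lo = 0, hi = n;

    assert(n > 0 && table[0].start == 0);
    /* Invariant: table[lo].start <= addr, and the answer is in [lo, hi). */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (table[mid].start <= addr)
            lo = mid;
        else
            hi = mid;
    }
    return table[lo].nexthop;
}
```

As an example, the routes 0.0.0.0/0 -> 0, 10.0.0.0/8 -> 1 and
10.1.0.0/16 -> 2 expand to five ranges starting at 0.0.0.0 (0),
10.0.0.0 (1), 10.1.0.0 (2), 10.2.0.0 (1) and 11.0.0.0 (0). Longest
prefix match is resolved once, when the ranges are built, rather than
on every packet.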

>
> Here, the processing would look less like "grab lock and process to 
> completion" and more like "mark and sweep" - ie, we have a list of 
> frames that we mark as needing processing and mark as having been 
> processed at each layer, so we know where to next dispatch them.
>
> I still have some tool coding to do with PMC before I even think about 
> tinkering with this as I'd like to measure stuff like per-packet 
> latency as well as top-level processing overhead (ie, 
> CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC 
> interrupts on that core, etc.)
That will be great to see!
>
> Thanks,
>
>
>
> -adrian
>