FreeBSD 10G forwarding performance @Intel

Alexander V. Chernikov melifaro at FreeBSD.org
Thu Jul 5 13:41:52 UTC 2012


On 04.07.2012 19:48, Luigi Rizzo wrote:
> On Wed, Jul 04, 2012 at 01:54:01PM +0400, Alexander V. Chernikov wrote:
>> On 04.07.2012 13:12, Luigi Rizzo wrote:
>>> Alex,
>>> i am sure you are aware that in FreeBSD we have netmap too
>> Yes, I'm aware of that :)
>>
>>> which is probably a lot more usable than packetshader
>>> (hw independent, included in the OS, also works on linux...)
>> I'm actually not talking about usability or comparison here :). They
>> have a nice idea and nice performance graphs. And packetshader is
>> actually a _platform_, with fast packet delivery being one (and the
>> only open) part of that platform.
>
> i am not sure if i should read the above as a feature or a limitation :)
I'm not trying to compare their I/O code with the netmap implementation :)
>
>>
>> Their graphs show 40MPPS (27G/64byte) CPU-only IPv4 packet forwarding
>> on "two four-core Intel Nehalem CPUs (2.66GHz)" which illustrates
>> software routing possibilities quite clearly.
>
> i suggest to be cautious about graphs in papers (including mine) and
> rely on numbers you can reproduce yourself.
Yup. Of course. However, even if we divide their number by 4, there 
is still a huge gap.
> As your nice experiments showed (i especially liked when you moved
> from one /24 to four /28 routes), at these speeds a factor
> of 2 or more in throughput can easily arise from tiny changes
> in configurations, bus, memory and CPU speeds, and so on.

Traffic stats with as many counters as possible eliminated (there is a 
possibility in the ixgbe code to update the rx/tx packet counters once 
per rx_process_limit packets, which is 100 by default):

             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       2.8M     0     0       186M       2.8M     0       186M     0
       2.8M     0     0       187M       2.8M     0       186M     0
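
The batching idea looks roughly like this (a simplified userland sketch, 
not the actual diff; struct and field names are made up):

#include <stdint.h>

struct rx_ring_sketch {
        uint64_t rx_packets;    /* shared counters: dirtying them per  */
        uint64_t rx_bytes;      /* frame costs a cache-line write each */
};

static void
rx_batch_update(struct rx_ring_sketch *rxr, int rx_process_limit)
{
        uint64_t pkts = 0, bytes = 0;
        int i;

        for (i = 0; i < rx_process_limit; i++) {
                int len = 64;   /* stand-in for one received frame */

                /* Accumulate in cheap local variables... */
                pkts++;
                bytes += len;
        }
        /* ...and touch the shared counters once per batch
         * (100 frames by default) instead of once per frame. */
        rxr->rx_packets += pkts;
        rxr->rx_bytes += bytes;
}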

And it seems that netstat uses 1024 as the divisor (no HN_DIVISOR_1000 
is passed to show_stat() in if.c), so the real frame count on the Ixia 
side is much closer to 3 MPPS (~2.9616 MPPS).

This is wrong from my point of view, and we should change it, at least 
for the packet counts.
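
For illustration, assuming show_stat() boils down to a humanize_number(3) 
call (compile with -lutil), the divisor makes exactly this difference:

#include <stdio.h>
#include <stdint.h>
#include <libutil.h>

int
main(void)
{
        char buf[8];
        int64_t pps = 2961600;  /* approx. frames/s on the Ixia side */

        /* Default binary divisor (1024): prints "2.8M". */
        humanize_number(buf, sizeof(buf), pps, "", HN_AUTOSCALE,
            HN_DECIMAL);
        printf("divisor 1024: %s\n", buf);

        /* HN_DIVISOR_1000: prints roughly "3.0M", matching the
         * real packet rate. */
        humanize_number(buf, sizeof(buf), pps, "", HN_AUTOSCALE,
            HN_DECIMAL | HN_DIVISOR_1000);
        printf("divisor 1000: %s\n", buf);
        return (0);
}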

Here is the patch itself:
http://static.ipfw.ru/files/fbsd10g/no_ifcounters.diff


IPFW contention:
Same setup as shown above, same traffic level:

17:48 [0] test15# ipfw show
00100 0 0 allow ip from any to any
65535 0 0 deny ip from any to any

net.inet.ip.fw.enable: 0 -> 1
             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       2.1M  734k     0       187M       2.1M     0       139M     0
       2.1M  736k     0       187M       2.1M     0       139M     0
       2.1M  737k     0       187M       2.1M     0        89M     0
       2.1M  735k     0       187M       2.1M     0       189M     0
net.inet.ip.fw.update_counters: 1 -> 0
       2.3M  636k     0       187M       2.3M     0       148M     0
       2.5M  343k     0       187M       2.5M     0       164M     0
       2.5M  351k     0       187M       2.5M     0       164M     0
       2.5M  345k     0       187M       2.5M     0       164M     0


Patch here: http://static.ipfw.ru/files/fbsd10g/no_ipfw_counters.diff
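
For reference, the change presumably amounts to something like this (a 
compilable stand-in, not the real diff; the struct and variable names 
below are made up):

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real ipfw rule and sysctl. */
struct ip_fw_sketch {
        uint64_t pcnt;          /* packet counter */
        uint64_t bcnt;          /* byte counter */
        uint32_t timestamp;
};

static bool fw_update_counters = true;  /* net.inet.ip.fw.update_counters */

static void
rule_matched(struct ip_fw_sketch *f, unsigned int pktlen, uint32_t now)
{
        /* The whole point: make the per-match read-modify-write of
         * shared rule memory optional, so contended cache lines are
         * not dirtied on every packet. */
        if (fw_update_counters) {
                f->pcnt++;
                f->bcnt += pktlen;
                f->timestamp = now;
        }
}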

It seems that the ipfw counters are suffering from this problem, too.
Unfortunately, there is no DPCPU allocator in our kernel.
I'm planning to make a very simple per-CPU counters patch: allocate 
65k * (u64 bytes + u64 packets) of memory for each CPU at vnet instance 
init, and make ipfw use it as the counter backend.

There is a problem with several rules residing in a single entry. This 
can (probably) be worked around by using fast counters only for the 
first such rule (or by not using fast counters for such rules at all).
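
In sketch form (a userland stand-in; all names are hypothetical, and the 
kernel version would allocate with malloc(9) at vnet init and index by 
curcpu):

#include <stdint.h>
#include <stdlib.h>

#define NCPU            16      /* stand-in for mp_ncpus */
#define IPFW_MAX_RULES  65536   /* 65k rule slots per vnet */

struct ipfw_cnt {
        uint64_t packets;
        uint64_t bytes;
};

/* One flat counter array per CPU, per vnet instance. */
static struct ipfw_cnt *pcpu_cnt[NCPU];

static int
cnt_init(void)
{
        for (int i = 0; i < NCPU; i++) {
                pcpu_cnt[i] = calloc(IPFW_MAX_RULES,
                    sizeof(struct ipfw_cnt));
                if (pcpu_cnt[i] == NULL)
                        return (-1);
        }
        return (0);
}

/* Fast path: each CPU increments only its own slot, so there are
 * no locks and no cross-CPU cache-line bouncing. */
static void
cnt_update(int cpu, uint32_t rulenum, unsigned int pktlen)
{
        pcpu_cnt[cpu][rulenum].packets++;
        pcpu_cnt[cpu][rulenum].bytes += pktlen;
}

/* Slow path (e.g. "ipfw show"): sum across all CPUs. */
static struct ipfw_cnt
cnt_fetch(uint32_t rulenum)
{
        struct ipfw_cnt sum = { 0, 0 };

        for (int i = 0; i < NCPU; i++) {
                sum.packets += pcpu_cnt[i][rulenum].packets;
                sum.bytes += pcpu_cnt[i][rulenum].bytes;
        }
        return (sum);
}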

What do you think about this?




>
> cheers
> luigi
>


-- 
WBR, Alexander

