FreeBSD IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Paul
paul at gtcomm.net
Thu Jul 3 12:48:52 UTC 2008
Bruce Evans wrote:
> On Thu, 3 Jul 2008, Paul wrote:
>
>> Bruce Evans wrote:
>>>> No polling:
>>>>  input (em0) output
>>>>  packets errs bytes packets errs bytes colls
>>>> 843762 25337 52313248 1 0 178 0
>>>> 763555 0 47340414 1 0 178 0
>>>> 830189 0 51471722 1 0 178 0
>>>> 838724 0 52000892 1 0 178 0
>>>> 813594 939 50442832 1 0 178 0
>>>> 807303 763 50052790 1 0 178 0
>>>> 791024 0 49043492 1 0 178 0
>>>> 768316 1106 47635596 1 0 178 0
>>>> Machine is maxed and is unresponsive..
>>>
>>> That's the most interesting one. Even 1% packet loss would probably
>>> destroy performance, so the benchmarks that give 10-50% packet loss
>>> are uninteresting.
>>>
>> But you realize that it's outputting all of these packets on em3; I'm
>> watching them come out, and they are consistent with the packets
>> received on em0 that netstat counts as 'good' packets.
>
> Well, output is easier. I don't remember seeing the load on a taskq for
> em3. If there is a memory bottleneck, it might or might not be more
> related to running only 1 taskq per interrupt, depending on how
> independent the memory system is for different CPUs. I think Opterons
> have more independence here than most x86's.
>
Opterons have an on-CPU memory controller.. that should help a little. :P
But I must be getting more than 1 packet per descriptor, because I can set
HZ=100 and still get it without polling..
Idle polling helps in all the polling cases I have tested; it seems to
help more on 32-bit.
>> I'm using a server Opteron, which supposedly has the best memory
>> performance of any CPU right now.
>> Plus Opterons have the biggest L1 cache, but a small L2 cache. Do you
>> think the larger L2 cache on the Xeon (6MB for 2 cores) would be better?
>> I have a 2222 Opteron coming which is 1GHz faster, so we will see what
>> happens.
>
> I suspect lower-latency memory would help more. Big memory systems
> have inherently higher latency. My little old A64 workstation and
> laptop have main memory latencies about a third of those of freebsd.org's
> new Core2 servers according to lmbench2 (42 nsec for the overclocked
> DDR PC3200 one and 55 for the DDR2 PC5400 (?) one, vs 145-155 nsec).
> If there are a lot of cache misses, then the extra 100 nsec can be
> important. Profiling of sendto() using hwpmc or perfmon shows a
> significant number of cache misses per packet (2 or 10?).
>
The Opterons use 667MHz registered DDR2; I have a Xeon that uses DDR3,
but I think the latency is higher than DDR2's.
I'll look up those programs you mentioned and see if I can run some tests.
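
Since you mention lmbench2: the core of its lat_mem_rd test is just a
dependent-load pointer chase over a buffer much larger than L2. Here is a
minimal stand-alone sketch of that idea (the 16 MB buffer, the shuffle and
the iteration count are arbitrary choices of mine, not lmbench's defaults);
it should print something in the same ballpark as the numbers above:

%%%
/*
 * Rough pointer-chasing latency test, in the spirit of lmbench's
 * lat_mem_rd.  This is only a sketch, not lmbench itself.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NPTRS   (16 * 1024 * 1024 / sizeof(void *))     /* ~16 MB, >> L2 */
#define ITERS   (32 * 1024 * 1024UL)

int
main(void)
{
        void **buf = malloc(NPTRS * sizeof(void *));
        size_t *order = malloc(NPTRS * sizeof(size_t));
        struct timespec t0, t1;
        void **p;
        size_t i, j, tmp;
        double ns;

        /*
         * Link the buffer into one big cycle in shuffled order so that
         * hardware prefetching cannot hide the miss latency.
         */
        for (i = 0; i < NPTRS; i++)
                order[i] = i;
        for (i = NPTRS - 1; i > 0; i--) {
                j = (size_t)random() % (i + 1);
                tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (i = 0; i < NPTRS - 1; i++)
                buf[order[i]] = (void *)&buf[order[i + 1]];
        buf[order[NPTRS - 1]] = (void *)&buf[order[0]];

        clock_gettime(CLOCK_MONOTONIC, &t0);
        p = buf;
        for (i = 0; i < ITERS; i++)
                p = (void **)*p;        /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Print p so the compiler cannot throw the loop away. */
        printf("%p: %.1f ns per dependent load\n", (void *)p, ns / ITERS);
        return (0);
}
%%%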
>>>> Polling ON:
>>>> input (em0) output
>>>> packets errs bytes packets errs bytes colls
>>>> 784138 179079 48616564 1 0 226 0
>>>> 788815 129608 48906530 2 0 356 0
>>>> Machine is responsive and has 40% idle CPU.. Why ALWAYS 40%? I'm
>>>> really mystified by this..
>>>
>>> Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy
>>> to explain (perhaps incorrectly). Polling can then read at most 256
>>> descriptors every 1/2000 second, giving a max throughput of 512 kpps.
>>> Packets < descriptors in general but might be equal here (for small
>>> packets). You seem to actually get 784 kpps, which is too high even
>>> in descriptors, unless the errors are counted twice: 784 - 179 = 605,
>>> which is much closer to 512. CPU is getting short too, but 40%
>>> still happens to be left over after giving up at 512 kpps. Most of
>>> the errors are probably handled by the hardware at low cost in CPU by
>>> dropping packets. There are other types of errors but none except
>>> dropped packets is likely.
>>>
>> Read above: it's actually transmitting 770 kpps out of em3, so it
>> can't just be limited to 512 kpps.
>
> Transmitting is easier, but with polling it is even harder to send
> faster than hz * queue_length than it is to receive. This is without
> polling in idle.
>
What I'm saying, though, is that it's not giving up at 512 kpps, because
784 kpps is coming in on em0 and going out on em3, so obviously it's
reading more than 256 packets every 1/2000th of a second.
What would be the best (theoretical) settings for 1 Mpps processing?
I actually don't have a problem 'receiving' more than 800 kpps with much
lower CPU usage if the traffic is blackholed, so obviously it can receive
a lot more, maybe even line-rate pps, but I can't generate that much.
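
Just to put numbers on that ceiling argument, and on what 1 Mpps would need
in theory: assuming one packet per RX descriptor and that the 256 is the
per-poll limit (the smaller of the RX ring size and the polling burst
limit; that's my reading of the 256/256 above and may not match the exact
knobs on this box), the arithmetic is simply hz times burst:

%%%
/*
 * Back-of-the-envelope polling ceiling: at most one RX burst per poll,
 * hz polls per second, and (optimistically) one packet per descriptor.
 * "burst" stands for whichever limit bites first on a given setup.
 */
#include <stdio.h>

static long
max_kpps(long hz, long burst)
{
        return (hz * burst / 1000);
}

int
main(void)
{
        printf("hz=2000, burst=256: %ld kpps\n", max_kpps(2000, 256)); /* 512 */
        printf("hz=4000, burst=256: %ld kpps\n", max_kpps(4000, 256)); /* 1024 */
        printf("hz=2000, burst=512: %ld kpps\n", max_kpps(2000, 512)); /* 1024 */
        return (0);
}
%%%

So on paper, either a higher hz or a bigger per-poll burst (or idle polling
filling the gaps) is needed before 1 Mpps is even theoretically reachable,
never mind whether the CPU and memory system can drain that many
descriptors per poll.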
>> I was thinking of trying FreeBSD 4 or 5.. but how would that work with
>> this new hardware?
>
> Poorly, except possibly with polling in FreeBSD-4. FreeBSD-4 generally
> has lower overheads and latency, but is missing important improvements
> (mainly tcp optimizations in upper layers, better DMA and/or mbuf
> handling, and support for newer NICs). FreeBSD-5 is also missing the
> overhead+latency advantage.
>
> Here are some benchmarks. (ttcp mainly tests sendto(). 4.10 em needed a
> 2-line change to support a not-so-new PCI em NIC.) Summary:
> - my bge NIC can handle about 600 kpps on my faster machine, but only
> achieves 300 in 4.10 unpatched.
> - my em NIC can handle about 400 kpps on my slower machine, except in
> later versions it can receive at about 600 kpps.
> - only 6.x and later can achieve near wire throughput for 1500-MTU
> packets (81 kpps vs 76 kpps). This depends on better DMA or mbuf
> handling... I now remember the details -- it is mainly better mbuf
> handling: old versions split the 1500-MTU packets into 2 mbufs and
> this causes 2 descriptors per packet, which causes extra software
> overheads and even larger overheads for the hardware.
>
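
[For reference, the "-l5 -u -t" runs below boil down to a tight sendto()
loop of 5-byte UDP datagrams. A minimal stand-in looks something like the
following; the sink address and port are placeholders, and this is only a
sketch of the send path being benchmarked, not ttcp itself.]

%%%
/*
 * Minimal UDP blast in the spirit of "ttcp -l5 -u -t": 5-byte datagrams
 * sent as fast as sendto() allows.  Address and port are placeholders.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
        char buf[5] = "ttcp";                   /* 5-byte payload, like -l5 */
        struct sockaddr_in sin;
        unsigned long n, sent = 0;
        int s;

        s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0)
                return (1);
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(5001);                     /* placeholder port */
        sin.sin_addr.s_addr = inet_addr("10.0.0.2");    /* placeholder sink */

        for (n = 0; n < 10000000UL; n++) {
                /* ENOBUFS is normal when the interface queue fills; keep going. */
                if (sendto(s, buf, sizeof(buf), 0,
                    (struct sockaddr *)&sin, sizeof(sin)) == (ssize_t)sizeof(buf))
                        sent++;
        }
        printf("sent %lu of %lu datagrams\n", sent, n);
        return (0);
}
%%%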
> %%%
> Results of benchmarks run on 23 Feb 2007:
>
> my~5.2 bge --> ~4.10 em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 639 98 1660 398* 77 8k
> ttcp -l5 -t 6.0 100 3960 6.0 6 5900
> ttcp -l1472 -u -t 76 27 395 76 40 8k
> ttcp -l1472 -t 51 40 11k 51 26 8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
> almost half aren't delivered to upper layers.
>
> my~5.2 bge --> 4.11 em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 635 98 1650 399* 74 8k
> ttcp -l5 -t 5.8 100 3900 5.8 6 5800
> ttcp -l1472 -u -t 76 27 395 76 32 8k
> ttcp -l1472 -t 51 40 11k 51 25 8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
> almost half aren't delivered to upper layers.
>
> my~5.2 bge --> my~5.2 em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 638 98 1660 394* 100- 8k
> ttcp -l5 -t 5.8 100 3900 5.8 9 6000
> ttcp -l1472 -u -t 76 27 395 76 46 8k
> ttcp -l1472 -t 51 40 11k 51 35 8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
> almost half aren't delivered to upper layers. With the em rate
> limit on ips changed from 8k to 80k, about 95% are delivered up.
>
> my~5.2 bge --> 6.2 em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 637 98 1660 637 100- 15k
> ttcp -l5 -t 5.8 100 3900 5.8 8 12k
> ttcp -l1472 -u -t 76 27 395 76 36 16k
> ttcp -l1472 -t 51 40 11k 51 37 16k
>
> my~5.2 bge --> ~current em-fastintr
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 641 98 1670 641 99 8k
> ttcp -l5 -t 5.9 100 2670 5.9 7 6k
> ttcp -l1472 -u -t 76 27 395 76 35 8k
> ttcp -l1472 -t 52 43 11k 52 30 8k
>
> ~6.2 bge --> ~current em-fastintr
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 309 62 1600 309 64 8k
> ttcp -l5 -t 4.9 100 3000 4.9 6 7k
> ttcp -l1472 -u -t 76 27 395 76 34 8k
> ttcp -l1472 -t 54 28 6800 54 30 8k
>
> ~current bge --> ~current em-fastintr
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 602 100 1570 602 99 8k
> ttcp -l5 -t 5.3 100 2660 5.3 5 5300
> ttcp -l1472 -u -t 81# 19 212 81# 38 8k
> ttcp -l1472 -t 53 34 11k 53 30 8k
>
> (#) Wire speed to within 0.5%. This is the only kpps in this set of
> benchmarks that is close to wire speed. Older kernels apparently
> lose relative to -current because mbufs for mtu-sized packets are
> not contiguous in older kernels.
>
> Old results:
>
> ~4.10 bge --> my~5.2 em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t n/a n/a n/a 346 79 8k
> ttcp -l5 -t n/a n/a n/a 5.4 10 6800
> ttcp -l1472 -u -t n/a n/a n/a 67 40 8k
> ttcp -l1472 -t n/a n/a n/a 51 36 8k
>
> ~4.10 kernel, =4 bge --> ~current em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t n/a n/a n/a 347 96 14k
> ttcp -l5 -t n/a n/a n/a 5.8 10 14k
> ttcp -l1472 -u -t n/a n/a n/a 67 62 14K
> ttcp -l1472 -t n/a n/a n/a 52 40 16k
>
> ~4.10 kernel, =4+ bge --> ~current em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t n/a n/a n/a 627 100 9k
> ttcp -l5 -t n/a n/a n/a 5.6 9 13k
> ttcp -l1472 -u -t n/a n/a n/a 68 63 14k
> ttcp -l1472 -t n/a n/a n/a 54 44 16k
> %%%
>
> %%%
> Results of benchmarks run on 28 Dec 2007:
>
> ~5.2 epsplex (em) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 825k 3 206k 229 412k 52.1 45.1 2.8
> local with sink: 659k 3 263k 231 131k 66.5 27.3 6.2
> tx remote no sink: 35k 3 273k 8237 266k 42.0 52.1 2.3 3.6
> tx remote with sink: 26k 3 394k 8224 100 60.0 5.41 3.4 11.2
> rx remote no sink: 25k 4 26 8237 373k 20.6 79.4 0.0 0.0
> rx remote with sink: 30k 3 203k 8237 398k 36.5 60.7 2.8 0.0
>
> 6.3-PR besplex (em) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 417k 1 208k 418k 2 49.5 48.5 2.0
> local with sink: 420k 1 276k 145k 2 70.0 23.6 6.4
> tx remote no sink: 19k 2 250k 8144 2 58.5 38.7 2.8 0.0
> tx remote with sink: 16k 2 361k 8336 2 72.9 24.0 3.1 4.4
> rx remote no sink: 429 3 49 888 2 0.3 99.33 0.0 0.4
> rx remote with sink: 13k 2 316k 5385 2 31.7 63.8 3.6 0.8
>
> 8.0-C epsplex (em-fast) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 442k 3 221k 230 442k 47.2 49.6 2.7
> local with sink: 394k 3 262k 228 131k 72.1 22.6 5.3
> tx remote no sink: 17k 3 226k 7832 100 94.1 0.2 3.0 0.0
> tx remote with sink: 17k 3 360k 7962 100 91.7 0.2 3.7 4.4
> rx remote no sink: saturated -- cannot update systat display
> rx remote with sink: 15k 6 358k 8224 100 97.0 0.0 2.5 0.5
>
> ~4.10 besplex (bge) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 15 0 425k 228 11 96.3 0.0 3.7
> local with sink: ** 0 622k 229 ** 94.7 0.3 5.0
> tx remote no sink: 29 1 490k 7024 11 47.9 29.8 4.4 17.9
> tx remote with sink: 26 1 635k 1883 11 65.7 11.4 5.6 17.3
> rx remote no sink: 5 1 68 7025 1 0.0 47.3 0.0 52.7
> rx remote with sink: 6679 2 365k 6899 12 19.7 29.2 2.5 48.7
>
> ~5.2-C besplex (bge) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 1M 3 271k 229 543k 50.7 46.8 2.5
> local with sink: 1M 3 406k 229 203k 67.4 28.2 4.4
> tx remote no sink: 49k 3 474k 11k 167k 52.3 42.7 5.0 0.0
> tx remote with sink: 6371 3 641k 1900 100 76.0 16.8 6.2 0.9
> rx remote no sink: 34k 3 25 11k 270k 0.8 65.4 0.0 33.8
> rx remote with sink: 41k 3 365k 10k 370k 31.5 47.1 2.3 19.0
>
> 6.3-PR besplex (bge) ttcp (hz = 1000 else stathz broken):
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 540k 0 270k 540k 0 50.5 46.0 3.5
> local with sink: 628k 0 417k 210k 0 68.8 27.9 3.3
> tx remote no sink: 15k 1 222k 7190 1 28.4 29.3 1.7 40.6
> tx remote with sink: 5947 1 315k 2825 1 39.9 14.7 2.6 42.8
> rx remote no sink: 13k 1 23 6943 0 0.3 49.5 0.2 50.0
> rx remote with sink: 20k 1 371k 6819 0 29.5 30.1 3.9 36.5
>
> 8.0-C besplex (bge) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 649k 3 324k 100 649k 53.9 42.9 3.2
> local with sink: 649k 3 433k 100 216k 75.2 18.8 6.0
> tx remote no sink: 24k 3 432k 10k 100 49.7 41.3 2.4 6.6
> tx remote with sink: 3199 3 568k 1580 100 64.3 19.6 4.0 12.2
> rx remote no sink: 20k 3 27 10k 100 0.0 46.1 0.0 53.9
> rx remote with sink: 31k 3 370k 10k 100 30.7 30.9 4.8 33.5
> %%%
>
> Bruce
>