FreeBSD IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Paul
paul at gtcomm.net
Thu Jul 3 12:48:52 UTC 2008
Bruce Evans wrote:
> On Thu, 3 Jul 2008, Paul wrote:
>
>> Bruce Evans wrote:
>>>> No polling:
>>>>  input (em0) output
>>>>  packets errs bytes packets errs bytes colls
>>>> 843762 25337 52313248 1 0 178 0
>>>> 763555 0 47340414 1 0 178 0
>>>> 830189 0 51471722 1 0 178 0
>>>> 838724 0 52000892 1 0 178 0
>>>> 813594 939 50442832 1 0 178 0
>>>> 807303 763 50052790 1 0 178 0
>>>> 791024 0 49043492 1 0 178 0
>>>> 768316 1106 47635596 1 0 178 0
>>>> Machine is maxed and is unresponsive..
>>>
>>> That's the most interesting one. Even 1% packet loss would probably
>>> destroy performance, so the benchmarks that give 10-50% packet loss
>>> are uninteresting.
>>>
>> But you realize that it's outputting all of these packets on em3; I'm
>> watching them come out, and they are consistent with the packets
>> received on em0 that netstat counts as 'good' packets.
>
> Well, output is easier. I don't remember seeing the load on a taskq for
> em3. If there is a memory bottleneck, it might or might not be more
> related to running only 1 taskq per interrupt, depending on how
> independent the memory system is for different CPUs. I think Opterons
> have more independence here than most x86's.
>
Opterons have an on-CPU memory controller.. that should help a little. :P
But I must be getting more than 1 packet per descriptor, because I can set
HZ=100 and still get it without polling..
Idle polling helps in all the polling cases I have tested; it seems to
help more on 32-bit.
>> I'm using a server Opteron, which supposedly has the best memory
>> performance of any CPU right now.
>> Plus Opterons have the biggest L1 cache, but a small L2 cache. Do you
>> think the larger L2 cache on the Xeon (6MB for 2 cores) would be better?
>> I have a 2222 Opteron coming which is 1GHz faster, so we will see what
>> happens.
>
> I suspect lower-latency memory would help more. Big memory systems
> have inherently higher latency. My little old A64 workstation and
> laptop have main memory latencies about a third of those of freebsd.org's
> new Core2 servers according to lmbench2 (42 nsec for the overclocked
> DDR PC3200 one and 55 for the DDR2 PC5400 (?) one, vs 145-155 nsec).
> If there are a lot of cache misses, then the extra 100 nsec can be
> important. Profiling of sendto() using hwpmc or perfmon shows a
> significant number of cache misses per packet (2 or 10?).
>
The Opterons use 667MHz registered DDR2; I have a Xeon that uses DDR3,
but I think the latency is higher than DDR2's.
I'll look up those programs you mentioned and see if I can run some tests.
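
Since you mention lmbench2: the core of its lat_mem_rd test is just a
dependent-load pointer chase over a buffer much larger than L2. Here is a
minimal stand-alone sketch of that idea (the 16 MB buffer, the shuffle and
the iteration count are arbitrary choices of mine, not lmbench's defaults);
it should print something in the same ballpark as the numbers above:

%%%
/*
 * Rough pointer-chasing latency test, in the spirit of lmbench's
 * lat_mem_rd.  This is only a sketch, not lmbench itself.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NPTRS   (16 * 1024 * 1024 / sizeof(void *))     /* ~16 MB, >> L2 */
#define ITERS   (32 * 1024 * 1024UL)

int
main(void)
{
        void **buf = malloc(NPTRS * sizeof(void *));
        size_t *order = malloc(NPTRS * sizeof(size_t));
        struct timespec t0, t1;
        void **p;
        size_t i, j, tmp;
        double ns;

        /*
         * Link the buffer into one big cycle in shuffled order so that
         * hardware prefetching cannot hide the miss latency.
         */
        for (i = 0; i < NPTRS; i++)
                order[i] = i;
        for (i = NPTRS - 1; i > 0; i--) {
                j = (size_t)random() % (i + 1);
                tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (i = 0; i < NPTRS - 1; i++)
                buf[order[i]] = (void *)&buf[order[i + 1]];
        buf[order[NPTRS - 1]] = (void *)&buf[order[0]];

        clock_gettime(CLOCK_MONOTONIC, &t0);
        p = buf;
        for (i = 0; i < ITERS; i++)
                p = (void **)*p;        /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Print p so the compiler cannot throw the loop away. */
        printf("%p: %.1f ns per dependent load\n", (void *)p, ns / ITERS);
        return (0);
}
%%%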
>>>> Polling ON:
>>>> input (em0) output
>>>> packets errs bytes packets errs bytes colls
>>>> 784138 179079 48616564 1 0 226 0
>>>> 788815 129608 48906530 2 0 356 0
>>>> Machine is responsive and has 40% idle CPU.. Why ALWAYS 40%? I'm
>>>> really mystified by this..
>>>
>>> Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy
>>> to explain (perhaps incorrectly). Polling can then read at most 256
>>> descriptors every 1/2000 second, giving a max throughput of 512 kpps.
>>> Packets < descriptors in general but might be equal here (for small
>>> packets). You seem to actually get 784 kpps, which is too high even
>>> in descriptors, unless the errors are counted twice: 784 - 179 = 605,
>>> which is much closer to 512. CPU is getting short too, but 40%
>>> still happens to be left over after giving up at 512 kpps. Most of
>>> the errors are probably handled by the hardware at low cost in CPU by
>>> dropping packets. There are other types of errors but none except
>>> dropped packets is likely.
>>>
>> Read above: it's actually transmitting 770 kpps out of em3, so it
>> can't just be limited to 512 kpps.
>
> Transmitting is easier, but with polling it is even harder to send
> faster than hz * queue_length than it is to receive. This is without
> polling in idle.
>
What I'm saying, though, is that it's not giving up at 512 kpps, because
784 kpps is coming in on em0 and going out on em3, so obviously it's
reading more than 256 packets every 1/2000th of a second.
What would be the best (theoretical) settings for 1 Mpps processing?
I actually don't have a problem 'receiving' more than 800 kpps with much
lower CPU usage if the traffic is blackholed, so obviously it can receive
a lot more, maybe even line-rate pps, but I can't generate that much.
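
Just to put numbers on that ceiling argument, and on what 1 Mpps would need
in theory: assuming one packet per RX descriptor and that the 256 is the
per-poll limit (the smaller of the RX ring size and the polling burst
limit; that's my reading of the 256/256 above and may not match the exact
knobs on this box), the arithmetic is simply hz times burst:

%%%
/*
 * Back-of-the-envelope polling ceiling: at most one RX burst per poll,
 * hz polls per second, and (optimistically) one packet per descriptor.
 * "burst" stands for whichever limit bites first on a given setup.
 */
#include <stdio.h>

static long
max_kpps(long hz, long burst)
{
        return (hz * burst / 1000);
}

int
main(void)
{
        printf("hz=2000, burst=256: %ld kpps\n", max_kpps(2000, 256)); /* 512 */
        printf("hz=4000, burst=256: %ld kpps\n", max_kpps(4000, 256)); /* 1024 */
        printf("hz=2000, burst=512: %ld kpps\n", max_kpps(2000, 512)); /* 1024 */
        return (0);
}
%%%

So on paper, either a higher hz or a bigger per-poll burst (or idle polling
filling the gaps) is needed before 1 Mpps is even theoretically reachable,
never mind whether the CPU and memory system can drain that many
descriptors per poll.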
>> I was thinking of trying FreeBSD 4 or 5.. but how would that work with
>> this new hardware?
>
> Poorly, except possibly with polling in FreeBSD-4. FreeBSD-4 generally
> has lower overheads and latency, but is missing important improvements
> (mainly tcp optimizations in upper layers, better DMA and/or mbuf
> handling, and support for newer NICs). FreeBSD-5 is also missing the
> overhead+latency advantage.
>
> Here are some benchmarks. (ttcp mainly tests sendto(). 4.10 em needed a
> 2-line change to support a not-so-new PCI em NIC.) Summary:
> - my bge NIC can handle about 600 kpps on my faster machine, but only
> achieves 300 in 4.10 unpatched.
> - my em NIC can handle about 400 kpps on my slower machine, except in
> later versions it can receive at about 600 kpps.
> - only 6.x and later can achieve near wire throughput for 1500-MTU
> packets (81 kpps vs 76 kpps). This depends on better DMA or mbuf
> handling... I now remember the details -- it is mainly better mbuf
> handling: old versions split the 1500-MTU packets into 2 mbufs and
> this causes 2 descriptors per packet, which causes extra software
> overheads and even larger overheads for the hardware.
>
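
[For reference, the "-l5 -u -t" runs below boil down to a tight sendto()
loop of 5-byte UDP datagrams. A minimal stand-in looks something like the
following; the sink address and port are placeholders, and this is only a
sketch of the send path being benchmarked, not ttcp itself.]

%%%
/*
 * Minimal UDP blast in the spirit of "ttcp -l5 -u -t": 5-byte datagrams
 * sent as fast as sendto() allows.  Address and port are placeholders.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
        char buf[5] = "ttcp";                   /* 5-byte payload, like -l5 */
        struct sockaddr_in sin;
        unsigned long n, sent = 0;
        int s;

        s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0)
                return (1);
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(5001);                     /* placeholder port */
        sin.sin_addr.s_addr = inet_addr("10.0.0.2");    /* placeholder sink */

        for (n = 0; n < 10000000UL; n++) {
                /* ENOBUFS is normal when the interface queue fills; keep going. */
                if (sendto(s, buf, sizeof(buf), 0,
                    (struct sockaddr *)&sin, sizeof(sin)) == (ssize_t)sizeof(buf))
                        sent++;
        }
        printf("sent %lu of %lu datagrams\n", sent, n);
        return (0);
}
%%%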
> %%%
> Results of benchmarks run on 23 Feb 2007:
>
> my~5.2 bge --> ~4.10 em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 639 98 1660 398* 77 8k
> ttcp -l5 -t 6.0 100 3960 6.0 6 5900
> ttcp -l1472 -u -t 76 27 395 76 40 8k
> ttcp -l1472 -t 51 40 11k 51 26 8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
> almost half aren't delivered to upper layers.
>
> my~5.2 bge --> 4.11 em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 635 98 1650 399* 74 8k
> ttcp -l5 -t 5.8 100 3900 5.8 6 5800
> ttcp -l1472 -u -t 76 27 395 76 32 8k
> ttcp -l1472 -t 51 40 11k 51 25 8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
> almost half aren't delivered to upper layers.
>
> my~5.2 bge --> my~5.2 em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 638 98 1660 394* 100- 8k
> ttcp -l5 -t 5.8 100 3900 5.8 9 6000
> ttcp -l1472 -u -t 76 27 395 76 46 8k
> ttcp -l1472 -t 51 40 11k 51 35 8k
>
> (*) Same as sender according to netstat -I, but systat -ip shows that
> almost half aren't delivered to upper layers. With the em rate
> limit on ips changed from 8k to 80k, about 95% are delivered up.
>
> my~5.2 bge --> 6.2 em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 637 98 1660 637 100- 15k
> ttcp -l5 -t 5.8 100 3900 5.8 8 12k
> ttcp -l1472 -u -t 76 27 395 76 36 16k
> ttcp -l1472 -t 51 40 11k 51 37 16k
>
> my~5.2 bge --> ~current em-fastintr
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 641 98 1670 641 99 8k
> ttcp -l5 -t 5.9 100 2670 5.9 7 6k
> ttcp -l1472 -u -t 76 27 395 76 35 8k
> ttcp -l1472 -t 52 43 11k 52 30 8k
>
> ~6.2 bge --> ~current em-fastintr
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 309 62 1600 309 64 8k
> ttcp -l5 -t 4.9 100 3000 4.9 6 7k
> ttcp -l1472 -u -t 76 27 395 76 34 8k
> ttcp -l1472 -t 54 28 6800 54 30 8k
>
> ~current bge --> ~current em-fastintr
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t 602 100 1570 602 99 8k
> ttcp -l5 -t 5.3 100 2660 5.3 5 5300
> ttcp -l1472 -u -t 81# 19 212 81# 38 8k
> ttcp -l1472 -t 53 34 11k 53 30 8k
>
> (#) Wire speed to within 0.5%. This is the only kpps in this set of
> benchmarks that is close to wire speed. Older kernels apparently
> lose relative to -current because mbufs for mtu-sized packets are
> not contiguous in older kernels.
>
> Old results:
>
> ~4.10 bge --> my~5.2 em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t n/a n/a n/a 346 79 8k
> ttcp -l5 -t n/a n/a n/a 5.4 10 6800
> ttcp -l1472 -u -t n/a n/a n/a 67 40 8k
> ttcp -l1472 -t n/a n/a n/a 51 36 8k
>
> ~4.10 kernel, =4 bge --> ~current em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t n/a n/a n/a 347 96 14k
> ttcp -l5 -t n/a n/a n/a 5.8 10 14k
> ttcp -l1472 -u -t n/a n/a n/a 67 62 14K
> ttcp -l1472 -t n/a n/a n/a 52 40 16k
>
> ~4.10 kernel, =4+ bge --> ~current em
> tx rx
> kpps load% ips kpps load% ips
> ttcp -l5 -u -t n/a n/a n/a 627 100 9k
> ttcp -l5 -t n/a n/a n/a 5.6 9 13k
> ttcp -l1472 -u -t n/a n/a n/a 68 63 14k
> ttcp -l1472 -t n/a n/a n/a 54 44 16k
> %%%
>
> %%%
> Results of benchmarks run on 28 Dec 2007:
>
> ~5.2 epsplex (em) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 825k 3 206k 229 412k 52.1 45.1 2.8
> local with sink: 659k 3 263k 231 131k 66.5 27.3 6.2
> tx remote no sink: 35k 3 273k 8237 266k 42.0 52.1 2.3 3.6
> tx remote with sink: 26k 3 394k 8224 100 60.0 5.41 3.4 11.2
> rx remote no sink: 25k 4 26 8237 373k 20.6 79.4 0.0 0.0
> rx remote with sink: 30k 3 203k 8237 398k 36.5 60.7 2.8 0.0
>
> 6.3-PR besplex (em) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 417k 1 208k 418k 2 49.5 48.5 2.0
> local with sink: 420k 1 276k 145k 2 70.0 23.6 6.4
> tx remote no sink: 19k 2 250k 8144 2 58.5 38.7 2.8 0.0
> tx remote with sink: 16k 2 361k 8336 2 72.9 24.0 3.1 4.4
> rx remote no sink: 429 3 49 888 2 0.3 99.33 0.0 0.4
> rx remote with sink: 13k 2 316k 5385 2 31.7 63.8 3.6 0.8
>
> 8.0-C epsplex (em-fast) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 442k 3 221k 230 442k 47.2 49.6 2.7
> local with sink: 394k 3 262k 228 131k 72.1 22.6 5.3
> tx remote no sink: 17k 3 226k 7832 100 94.1 0.2 3.0 0.0
> tx remote with sink: 17k 3 360k 7962 100 91.7 0.2 3.7 4.4
> rx remote no sink: saturated -- cannot update systat display
> rx remote with sink: 15k 6 358k 8224 100 97.0 0.0 2.5 0.5
>
> ~4.10 besplex (bge) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 15 0 425k 228 11 96.3 0.0 3.7
> local with sink: ** 0 622k 229 ** 94.7 0.3 5.0
> tx remote no sink: 29 1 490k 7024 11 47.9 29.8 4.4 17.9
> tx remote with sink: 26 1 635k 1883 11 65.7 11.4 5.6 17.3
> rx remote no sink: 5 1 68 7025 1 0.0 47.3 0.0 52.7
> rx remote with sink: 6679 2 365k 6899 12 19.7 29.2 2.5 48.7
>
> ~5.2-C besplex (bge) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 1M 3 271k 229 543k 50.7 46.8 2.5
> local with sink: 1M 3 406k 229 203k 67.4 28.2 4.4
> tx remote no sink: 49k 3 474k 11k 167k 52.3 42.7 5.0 0.0
> tx remote with sink: 6371 3 641k 1900 100 76.0 16.8 6.2 0.9
> rx remote no sink: 34k 3 25 11k 270k 0.8 65.4 0.0 33.8
> rx remote with sink: 41k 3 365k 10k 370k 31.5 47.1 2.3 19.0
>
> 6.3-PR besplex (bge) ttcp (hz = 1000 else stathz broken):
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 540k 0 270k 540k 0 50.5 46.0 3.5
> local with sink: 628k 0 417k 210k 0 68.8 27.9 3.3
> tx remote no sink: 15k 1 222k 7190 1 28.4 29.3 1.7 40.6
> tx remote with sink: 5947 1 315k 2825 1 39.9 14.7 2.6 42.8
> rx remote no sink: 13k 1 23 6943 0 0.3 49.5 0.2 50.0
> rx remote with sink: 20k 1 371k 6819 0 29.5 30.1 3.9 36.5
>
> 8.0-C besplex (bge) ttcp:
> Csw Trp Sys Int Sof Sys Intr User Idle
> local no sink: 649k 3 324k 100 649k 53.9 42.9 3.2
> local with sink: 649k 3 433k 100 216k 75.2 18.8 6.0
> tx remote no sink: 24k 3 432k 10k 100 49.7 41.3 2.4 6.6
> tx remote with sink: 3199 3 568k 1580 100 64.3 19.6 4.0 12.2
> rx remote no sink: 20k 3 27 10k 100 0.0 46.1 0.0 53.9
> rx remote with sink: 31k 3 370k 10k 100 30.7 30.9 4.8 33.5
> %%%
>
> Bruce
>