FreeBSD IP Forwarding performance (question, and some info)
[7-stable, current, em, smp]
Bruce Evans
brde at optusnet.com.au
Thu Jul 3 07:07:32 UTC 2008
On Wed, 2 Jul 2008, Paul wrote:
>...
> -----------Reboot with 4096/4096........(my guess is that it will be a lot
> worse, more errors..)
> ........
> Without polling, 4096 is horrible, about 200kpps less ... :/
> Turning on polling..
> polling on, 4096 is bad,
> input (em0) output
> packets errs bytes packets errs bytes colls
> 622379 307753 38587506 1 0 178 0
> 635689 277303 39412718 1 0 178 0
> ...
> ------Rebooting with 256/256 descriptors..........
> ..........
> No polling:
> 843762 25337 52313248 1 0 178 0
> 763555 0 47340414 1 0 178 0
> 830189 0 51471722 1 0 178 0
> 838724 0 52000892 1 0 178 0
> 813594 939 50442832 1 0 178 0
> 807303 763 50052790 1 0 178 0
> 791024 0 49043492 1 0 178 0
> 768316 1106 47635596 1 0 178 0
> Machine is maxed and is unresponsive..
That's the most interesting one. Even 1% packet loss would probably
destroy performance, so the benchmarks that give 10-50% packet loss
are uninteresting.
All indications are that you are running out of both CPU and memory
throughput (DMA and/or cache fills). The above apparently hits both limits
at the same time, while with more descriptors memory throughput runs
out first. 1 CPU is apparently barely enough for 800 kpps (is this
all with UP now?), and I think more CPUs could only be slower, as you
saw with SMP, especially using multiple em taskqs, since memory traffic
would be higher. I wouldn't expect this to be fixed soon (except by
throwing better/different hardware at it).
The CPU/DMA balance can probably be investigated by slowing down the CPU/
memory system.
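One crude knob for the CPU side (a sketch only: it assumes cpufreq(4) is
attached so that dev.cpu.0.freq exists, needs root to write, and any value
written should be one listed in dev.cpu.0.freq_levels) is to step the clock
down and rerun the test:

    /* cpufreq-step.c: read dev.cpu.0.freq and optionally lower it (MHz). */
    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
            int freq;
            size_t len = sizeof(freq);

            /* Current frequency in MHz, as reported by cpufreq(4). */
            if (sysctlbyname("dev.cpu.0.freq", &freq, &len, NULL, 0) == -1) {
                    perror("dev.cpu.0.freq");
                    return (1);
            }
            printf("current: %d MHz\n", freq);

            /* Optionally write a lower level, e.g. ./cpufreq-step 1000. */
            if (argc > 1) {
                    freq = atoi(argv[1]);
                    if (sysctlbyname("dev.cpu.0.freq", NULL, NULL,
                        &freq, sizeof(freq)) == -1) {
                            perror("set dev.cpu.0.freq");
                            return (1);
                    }
            }
            return (0);
    }

If the pps scales with the clock, the CPU is the limit; if it barely moves,
the memory/DMA side is.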
You may remember my previous mail about getting higher pps on bge.
Again, all indications are that I'm running out of CPU, memory, and
bus throughput too, since the bus is only 33 MHz PCI. These interact
in a complicated way which I haven't been able to untangle. -current
is fairly consistently slower than my ~5.2 by about 10%, apparently
due to code bloat (extra CPU and related extra cache misses). OTOH,
like you I've seen huge variations for changes that should be null
(e.g., disturbing the alignment of the text section without changing
anything else). My ~5.2 is very consistent since I rarely change it,
while -current changes a lot and shows more variation, but with no
sign of getting near the ~5.2 plateau or even its old peaks.
> Polling ON:
> input (em0) output
> packets errs bytes packets errs bytes colls
> 784138 179079 48616564 1 0 226 0
> 788815 129608 48906530 2 0 356 0
> 755555 142997 46844426 2 0 468 0
> 803670 144459 49827544 1 0 178 0
> 777649 147120 48214242 1 0 178 0
> 779539 146820 48331422 1 0 178 0
> 786201 148215 48744478 2 0 356 0
> 776013 101660 48112810 1 0 178 0
> 774239 145041 48002834 2 0 356 0
> 771774 102969 47850004 1 0 178 0
>
> Machine is responsive and has 40% idle cpu.. Why ALWAYS 40% ? I'm really
> mistified by this..
Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy
to explain (perhaps incorrectly). Polling can then read at most 256
descriptors every 1/2000 second, giving a max throughput of 512 kpps.
Packets < descriptors in general but might be equal here (for small
packets). You seem to actually get 784 kpps, which is too high even
counted in descriptors; it only adds up if the errors are counted
twice (784 is about the 512 limit plus twice the errs rate shown).
CPU is getting short too, but 40%
still happens to be left over after giving up at 512 kpps. Most of
the errors are probably handled by the hardware at low cost in CPU by
dropping packets. There are other types of errors but none except
dropped packets is likely.
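To spell the arithmetic out (assuming 1 poll per tick and 1 packet per rx
descriptor):

    2000 ticks/s * 256 descriptors/poll = 512000 descriptors/s ~= 512 kpps

and the reported 784 kpps exceeds that ceiling by about 272 kpps, which is
in the right range for the errs column being counted twice.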
> Every time it maxes out and gets errors, top reports:
> CPU: 0.0% user, 0.0% nice, 10.1% system, 45.3% interrupt, 44.6% idle
> pretty much the same line every time
>
> 256/256 blows away 4096 , probably fits the descriptors into the cache lines
> on the cpu and 4096 has too many cache misses and causes worse performance.
Quite likely. Maybe your systems have memory systems that are weak relative
to other resources, so that they hit this limit sooner than expected.
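For scale (assuming the usual 16-byte legacy em descriptors): 256 * 16 bytes
= 4 KB per ring, while 4096 * 16 bytes = 64 KB per ring, which no longer fits
comfortably in an L1 data cache next to everything else the forwarding path
touches.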
I should look at my "fixes" for bge, one that changes rxd from 256 to 512,
and one that increases the ifq tx length from txd = 512 to about 20000.
Both of these might thrash caches. The former makes little difference
except for polling at < 4000 Hz, but I don't believe in or use polling.
The latter works around select() for write descriptors not working on
sockets, so that high-frequency polling from userland is not needed to
determine a good time to retry after ENOBUFS errors. This is probably
only important in pps benchmarks. txd = 512 gives good efficiency in
my version of bge, but might be too high for good throughput and is mostly
wasted in distribution versions of FreeBSD.
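The userland side of that pattern looks something like the sketch below
(illustrative only, not the actual benchmark tool; the target address,
port, payload size and backoff are placeholders): since select() won't
say when the interface queue has drained, the sender just backs off
briefly on ENOBUFS and retries.

    /*
     * Minimal sketch of a pps-style UDP flooder that has to cope with
     * ENOBUFS itself.
     */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            char payload[18];                       /* small packets */
            struct sockaddr_in dst;
            int s;

            memset(payload, 0, sizeof(payload));
            memset(&dst, 0, sizeof(dst));
            dst.sin_family = AF_INET;
            dst.sin_port = htons(9);                /* discard port */
            dst.sin_addr.s_addr = inet_addr("10.0.0.2");    /* example */

            if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1) {
                    perror("socket");
                    return (1);
            }
            for (;;) {
                    if (sendto(s, payload, sizeof(payload), 0,
                        (struct sockaddr *)&dst, sizeof(dst)) == -1) {
                            if (errno == ENOBUFS) {
                                    /*
                                     * Interface queue full.  Nothing to
                                     * wait for, so back off briefly and
                                     * retry; the backoff is a guess and
                                     * mostly determines how much
                                     * throughput is wasted.
                                     */
                                    usleep(100);
                                    continue;
                            }
                            perror("sendto");
                            return (1);
                    }
            }
    }

With a ~20000-entry ifq the loop almost never sees ENOBUFS, which is the
point of the second "fix".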
Bruce