FreeBSD IP Forwarding performance (question, and some info)
[7-stable, current, em, smp]
Bruce Evans
brde at optusnet.com.au
Thu Jul 3 07:07:32 UTC 2008
On Wed, 2 Jul 2008, Paul wrote:
>...
> -----------Reboot with 4096/4096........(my guess is that it will be a lot
> worse, more errors..)
> ........
> Without polling, 4096 is horrible, about 200kpps less ... :/
> Turning on polling..
> polling on, 4096 is bad,
> input (em0) output
> packets errs bytes packets errs bytes colls
> 622379 307753 38587506 1 0 178 0
> 635689 277303 39412718 1 0 178 0
> ...
> ------Rebooting with 256/256 descriptors..........
> ..........
> No polling:
> 843762 25337 52313248 1 0 178 0
> 763555 0 47340414 1 0 178 0
> 830189 0 51471722 1 0 178 0
> 838724 0 52000892 1 0 178 0
> 813594 939 50442832 1 0 178 0
> 807303 763 50052790 1 0 178 0
> 791024 0 49043492 1 0 178 0
> 768316 1106 47635596 1 0 178 0
> Machine is maxed and is unresponsive..
That's the most interesting one. Even 1% packet loss would probably
destroy performance, so the benchmarks that give 10-50% packet loss
are uninteresting.
All indications are that you are running out of both CPU and memory
throughput (DMA and/or cache fills). The above apparently hits both limits
at the same time, while with more descriptors memory throughput runs
out first. 1 CPU is apparently barely enough for 800 kpps (is this
all with UP now?), and I think more CPUs could only be slower, as you
saw with SMP, especially using multiple em taskqs, since memory traffic
would be higher. I wouldn't expect this to be fixed soon (except by
throwing better/different hardware at it).
The CPU/DMA balance can probably be investigated by slowing down the CPU/
memory system.
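One crude knob for the CPU side (a sketch only: it assumes cpufreq(4) is
attached so that dev.cpu.0.freq exists, needs root to write, and any value
written should be one listed in dev.cpu.0.freq_levels) is to step the clock
down and rerun the test:

    /* cpufreq-step.c: read dev.cpu.0.freq and optionally lower it (MHz). */
    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(int argc, char **argv)
    {
            int freq;
            size_t len = sizeof(freq);

            /* Current frequency in MHz, as reported by cpufreq(4). */
            if (sysctlbyname("dev.cpu.0.freq", &freq, &len, NULL, 0) == -1) {
                    perror("dev.cpu.0.freq");
                    return (1);
            }
            printf("current: %d MHz\n", freq);

            /* Optionally write a lower level, e.g. ./cpufreq-step 1000. */
            if (argc > 1) {
                    freq = atoi(argv[1]);
                    if (sysctlbyname("dev.cpu.0.freq", NULL, NULL,
                        &freq, sizeof(freq)) == -1) {
                            perror("set dev.cpu.0.freq");
                            return (1);
                    }
            }
            return (0);
    }

If the pps scales with the clock, the CPU is the limit; if it barely moves,
the memory/DMA side is.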
You may remember my previous mail about getting higher pps on bge.
Again, all indications are that I'm running out of CPU, memory, and
bus throughput too, since the bus is only 33 MHz PCI. These interact
in a complicated way which I haven't been able to untangle. -current
is fairly consistently slower than my ~5.2 by about 10%, apparently
due to code bloat (extra CPU and related extra cache misses). OTOH,
like you I've seen huge variations for changes that should be null
(e.g., disturbing the alignment of the text section without changing
anything else). My ~5.2 is very consistent since I rarely change it,
while -current changes a lot and shows more variation, but with no
sign of getting near the ~5.2 plateau or even its old peaks.
> Polling ON:
> input (em0) output
> packets errs bytes packets errs bytes colls
> 784138 179079 48616564 1 0 226 0
> 788815 129608 48906530 2 0 356 0
> 755555 142997 46844426 2 0 468 0
> 803670 144459 49827544 1 0 178 0
> 777649 147120 48214242 1 0 178 0
> 779539 146820 48331422 1 0 178 0
> 786201 148215 48744478 2 0 356 0
> 776013 101660 48112810 1 0 178 0
> 774239 145041 48002834 2 0 356 0
> 771774 102969 47850004 1 0 178 0
>
> Machine is responsive and has 40% idle cpu.. Why ALWAYS 40% ? I'm really
> mistified by this..
Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy
to explain (perhaps incorrectly). Polling can then read at most 256
descriptors every 1/2000 second, giving a max throughput of 512 kpps.
Packets < descriptors in general but might be equal here (for small
packets). You seem to actually get 784 kpps, which is too high even
counted in descriptors; it only adds up if the errors are counted
twice (784 is about the 512 limit plus twice the errs rate shown).
CPU is getting short too, but 40%
still happens to be left over after giving up at 512 kpps. Most of
the errors are probably handled by the hardware at low cost in CPU by
dropping packets. There are other types of errors but none except
dropped packets is likely.
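To spell the arithmetic out (assuming 1 poll per tick and 1 packet per rx
descriptor):

    2000 ticks/s * 256 descriptors/poll = 512000 descriptors/s ~= 512 kpps

and the reported 784 kpps exceeds that ceiling by about 272 kpps, which is
in the right range for the errs column being counted twice.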
> Every time it maxes out and gets errors, top reports:
> CPU: 0.0% user, 0.0% nice, 10.1% system, 45.3% interrupt, 44.6% idle
> pretty much the same line every time
>
> 256/256 blows away 4096 , probably fits the descriptors into the cache lines
> on the cpu and 4096 has too many cache misses and causes worse performance.
Quite likely. Maybe your systems have memory systems that are weak relative
to other resources, so that they hit this limit sooner than expected.
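For scale (assuming the usual 16-byte legacy em descriptors): 256 * 16 bytes
= 4 KB per ring, while 4096 * 16 bytes = 64 KB per ring, which no longer fits
comfortably in an L1 data cache next to everything else the forwarding path
touches.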
I should look at my "fixes" for bge, one that changes rxd from 256 to 512,
and one that increases the ifq tx length from txd = 512 to about 20000.
Both of these might thrash caches. The former makes little difference
except for polling at < 4000 Hz, but I don't believe in or use polling.
The latter works around select() for write descriptors not working on
sockets, so that high-frequency polling from userland is not needed to
determine a good time to retry after ENOBUFS errors. This is probably
only important in pps benchmarks. txd = 512 gives good efficiency in
my version of bge, but might be too high for good throughput and is mostly
wasted in distribution versions of FreeBSD.
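The userland side of that pattern looks something like the sketch below
(illustrative only, not the actual benchmark tool; the target address,
port, payload size and backoff are placeholders): since select() won't
say when the interface queue has drained, the sender just backs off
briefly on ENOBUFS and retries.

    /*
     * Minimal sketch of a pps-style UDP flooder that has to cope with
     * ENOBUFS itself.
     */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            char payload[18];                       /* small packets */
            struct sockaddr_in dst;
            int s;

            memset(payload, 0, sizeof(payload));
            memset(&dst, 0, sizeof(dst));
            dst.sin_family = AF_INET;
            dst.sin_port = htons(9);                /* discard port */
            dst.sin_addr.s_addr = inet_addr("10.0.0.2");    /* example */

            if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1) {
                    perror("socket");
                    return (1);
            }
            for (;;) {
                    if (sendto(s, payload, sizeof(payload), 0,
                        (struct sockaddr *)&dst, sizeof(dst)) == -1) {
                            if (errno == ENOBUFS) {
                                    /*
                                     * Interface queue full.  Nothing to
                                     * wait for, so back off briefly and
                                     * retry; the backoff is a guess and
                                     * mostly determines how much
                                     * throughput is wasted.
                                     */
                                    usleep(100);
                                    continue;
                            }
                            perror("sendto");
                            return (1);
                    }
            }
    }

With a ~20000-entry ifq the loop almost never sees ENOBUFS, which is the
point of the second "fix".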
Bruce