FreeBSD 10G forwarding performance @Intel
Alexander V. Chernikov
melifaro at FreeBSD.org
Tue Jul 3 16:12:24 UTC 2012
Hello list!
I'm quite stuck with bad forwarding performance on many FreeBSD boxes
doing firewalling.
Typical configuration is E5645 / E5675 @ Intel 82599 NIC.
HT is turned off.
(Configs and tunables below).
I'm mostly concerned with unidirectional traffic flowing to a single
interface (e.g. using a single route entry).
In most cases the system can forward no more than 700 (or 1400) kpps, which
is quite a bad number (Linux does, say, 5 Mpps on nearly the same hardware).
Test scenario:
Ixia XM2 (traffic generator) <> ix0 (FreeBSD).
Ixia sends 64byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to
destinations in vlan11 (10.100.1.128 - 10.100.1.192).
Static arps are configured for all destination addresses.
Traffic level is slightly above or slightly below system performance.
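For scale, the theoretical ceiling for this test can be worked out from
Ethernet framing overhead. The sketch below assumes that "64byte" means a
64-byte Ethernet frame including FCS (the usual traffic generator
convention, not stated above) and adds the standard 8-byte preamble/SFD and
12-byte inter-frame gap. The function name is made up for illustration.

```python
def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second on an Ethernet link for a given frame size."""
    preamble_sfd = 8      # bytes: preamble + start-of-frame delimiter
    interframe_gap = 12   # bytes: minimum idle time between frames
    wire_bits = (frame_bytes + preamble_sfd + interframe_gap) * 8
    return link_bps / wire_bits


rate = line_rate_pps(10e9, 64)
print(f"10GbE line rate at 64-byte frames: {rate / 1e6:.2f} Mpps")
```

Under these assumptions the ceiling is about 14.88 Mpps, so 700 to 1400 kpps
is roughly 5 to 10% of line rate.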
================= Test 1 =======================
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
no firewall
Traffic: 1-1 flow (1 src, 1 dst)
(This is actually a bit different from described above)
Result:
input (ix0) output
packets errs idrops bytes packets errs bytes colls
878k 48k 0 59M 878k 0 56M 0
874k 48k 0 59M 874k 0 56M 0
875k 48k 0 59M 875k 0 56M 0
16:41 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf " %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
STATE C TIME CPU COMMAND
CPU6 6 17:28 100.00% kernel{ix0 que}
CPU9 9 20:42 60.06% intr{irq265: ix0:que
16:41 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0 500796 167
irq257: ix0:que 1 6693573 2245
irq258: ix0:que 2 2572380 862
irq259: ix0:que 3 3166273 1062
irq260: ix0:que 4 9691706 3251
irq261: ix0:que 5 10766434 3611
irq262: ix0:que 6 8933774 2996
irq263: ix0:que 7 5246879 1760
irq264: ix0:que 8 3548930 1190
irq265: ix0:que 9 11817986 3964
irq266: ix0:que 10 227561 76
irq267: ix0:link 1 0
Note that system is using 2 cores to forward, so 12 cores should be able
to forward 4+ mpps which is more or less consistent with Linux results.
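The extrapolation above is plain linear scaling. A minimal sketch of that
arithmetic, using the ~878 kpps from the netstat output and the two busy
threads from top; it assumes throughput scales perfectly with cores (no lock
contention, no shared bottleneck), which is an upper bound rather than a
prediction. The function name is hypothetical.

```python
def extrapolate_pps(measured_pps: float, busy_cores: float, total_cores: int) -> float:
    """Naive linear scaling of forwarding rate with the number of cores."""
    return measured_pps / busy_cores * total_cores


# 878 kpps measured with 2 busy cores, scaled to all 12 cores
estimate = extrapolate_pps(878_000, 2, 12)
print(f"linear estimate for 12 cores: {estimate / 1e6:.2f} Mpps")
```

That gives a bit over 5 Mpps as the optimistic bound, which is consistent
with the "4+ mpps" figure once some scaling loss is allowed for.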
Note that interrupts on all queues are (as far as I understand from the
fact that AIM is turned off and interrupt rates are the same from
previous test). Additionally, despite hw.intr_storm_threshold = 200k,
I'm constantly getting the following message:
interrupt storm detected on "irq265:"; throttling interrupt source
================= Test 2 =======================
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE,
no firewall
Traffic: Unidirectional many-2-many
16:20 [0] test15# netstat -I ix0 -hw 1
input (ix0) output
packets errs idrops bytes packets errs bytes colls
507k 651k 0 74M 508k 0 32M 0
506k 652k 0 74M 507k 0 28M 0
509k 652k 0 74M 508k 0 37M 0
16:28 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf " %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
STATE C TIME CPU COMMAND
CPU10 6 0:40 100.00% kernel{ix0 que}
CPU2 2 11:47 84.86% intr{irq258: ix0:que
CPU3 3 11:50 81.88% intr{irq259: ix0:que
CPU8 8 11:38 77.69% intr{irq264: ix0:que
CPU7 7 11:24 77.10% intr{irq263: ix0:que
WAIT 1 10:10 74.76% intr{irq257: ix0:que
CPU4 4 8:57 63.48% intr{irq260: ix0:que
CPU6 6 8:35 61.96% intr{irq262: ix0:que
CPU9 9 14:01 60.79% intr{irq265: ix0:que
RUN 0 9:07 59.67% intr{irq256: ix0:que
WAIT 5 6:13 43.26% intr{irq261: ix0:que
CPU11 11 5:19 35.89% kernel{ix0 que}
- 4 3:41 25.49% kernel{ix0 que}
- 1 3:22 21.78% kernel{ix0 que}
- 1 2:55 17.68% kernel{ix0 que}
- 4 2:24 16.55% kernel{ix0 que}
- 1 9:54 14.99% kernel{ix0 que}
CPU0 11 2:13 14.26% kernel{ix0 que}
16:07 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0 13654 15
irq257: ix0:que 1 87043 96
irq258: ix0:que 2 39604 44
irq259: ix0:que 3 48308 53
irq260: ix0:que 4 138002 153
irq261: ix0:que 5 169596 188
irq262: ix0:que 6 107679 119
irq263: ix0:que 7 72769 81
irq264: ix0:que 8 30878 34
irq265: ix0:que 9 1002032 1115
irq266: ix0:que 10 10967 12
irq267: ix0:link 1 0
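To see how interrupts are spread over the queues, the vmstat -i output above
can be summarized with a few lines of Python. This is only a sketch: the
input is the text pasted above, the helper name is made up, and the totals
are cumulative counters, so they likely include interrupts from earlier runs
and not just this test.

```python
VMSTAT_OUTPUT = """\
irq256: ix0:que 0 13654 15
irq257: ix0:que 1 87043 96
irq258: ix0:que 2 39604 44
irq259: ix0:que 3 48308 53
irq260: ix0:que 4 138002 153
irq261: ix0:que 5 169596 188
irq262: ix0:que 6 107679 119
irq263: ix0:que 7 72769 81
irq264: ix0:que 8 30878 34
irq265: ix0:que 9 1002032 1115
irq266: ix0:que 10 10967 12
irq267: ix0:link 1 0
"""


def queue_interrupt_totals(text: str) -> dict:
    """Map queue number -> cumulative interrupt count for the 'que' lines."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        # Queue lines look like: irqNNN: ix0:que <queue> <total> <rate>
        if len(fields) == 5 and fields[1].endswith(":que"):
            totals[int(fields[2])] = int(fields[3])
    return totals


totals = queue_interrupt_totals(VMSTAT_OUTPUT)
grand_total = sum(totals.values())
busiest = max(totals, key=totals.get)
share = totals[busiest] / grand_total
print(f"queue {busiest}: {share:.0%} of {grand_total} queue interrupts")
```

In this sample queue 9 accounts for more than half of all counted queue
interrupts.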
Note that all cores are loaded more or less evenly, but the result is
_worse_. The first reason for this is mtx_lock, which is acquired twice
on every lookup (once in in_matroute(), where it can possibly be
removed, and once again in rtalloc1_fib()). The latter is addressed by
andre@ in r234650.
Additionally, although the ithreads are bound to a single CPU each, the
kernel que threads are not in the stock setup. However, a configuration
with 5 queues and 5 kernel threads bound to different CPUs gives the same
bad results.
================= Test 3 =======================
Kernel: FreeBSD-8-S June 4 SVN, +merged ifaddrlock, stock drivers, stock
routing, no FLOWTABLE, no firewall
packets errs idrops bytes packets errs bytes colls
580k 18k 0 38M 579k 0 37M 0
581k 26k 0 39M 580k 0 37M 0
580k 24k 0 39M 580k 0 37M 0
................
Enabling ipfw _increases_ performance a bit:
604k 0 0 39M 604k 0 39M 0
604k 0 0 39M 604k 0 39M 0
582k 19k 0 38M 568k 0 37M 0
527k 81k 0 39M 530k 0 34M 0
605k 28 0 39M 605k 0 39M 0
================= Test 3.1 =======================
Same as test 3, the only difference is the following:
route add -net 10.100.1.160/27 -iface vlan11.
input (ix0) output
packets errs idrops bytes packets errs bytes colls
543k 879k 0 91M 544k 0 35M 0
547k 870k 0 91M 545k 0 35M 0
541k 870k 0 91M 539k 0 30M 0
952k 565k 0 97M 962k 0 48M 0
1.2M 228k 0 91M 1.2M 0 92M 0
1.2M 226k 0 90M 1.1M 0 76M 0
1.1M 228k 0 91M 1.2M 0 76M 0
1.2M 233k 0 90M 1.2M 0 76M 0
================= Test 3.2 =======================
Same as test 3, splitting the destination into 4 smaller routes:
route add -net 10.100.1.128/28 -iface vlan11
route add -net 10.100.1.144/28 -iface vlan11
route add -net 10.100.1.160/28 -iface vlan11
route add -net 10.100.1.176/28 -iface vlan11
input (ix0) output
packets errs idrops bytes packets errs bytes colls
1.4M 0 0 106M 1.6M 0 106M 0
1.8M 0 0 106M 1.6M 0 71M 0
1.6M 0 0 106M 1.6M 0 71M 0
1.6M 0 0 87M 1.6M 0 71M 0
1.6M 0 0 126M 1.6M 0 212M 0
================= Test 3.3 =======================
Same as test 3, splitting the destination into 16 smaller routes:
input (ix0) output
packets errs idrops bytes packets errs bytes colls
1.6M 0 0 118M 1.8M 0 118M 0
2.0M 0 0 118M 1.8M 0 119M 0
1.8M 0 0 119M 1.8M 0 79M 0
1.8M 0 0 117M 1.8M 0 157M 0
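The route splits used in tests 3.2 and 3.3 can be generated rather than typed
by hand. The sketch below uses Python's ipaddress module; it assumes the
destinations are covered by 10.100.1.128/26 (which matches the four /28
routes of test 3.2) and that the 16-way split of test 3.3 used /30 prefixes.
The commands for test 3.3 are not shown above, so the /30 choice is a guess,
and the helper name is made up.

```python
import ipaddress


def split_routes(prefix: str, new_prefixlen: int, iface: str) -> list:
    """Return 'route add' commands covering `prefix` with smaller subnets."""
    net = ipaddress.ip_network(prefix)
    return [
        f"route add -net {sub} -iface {iface}"
        for sub in net.subnets(new_prefix=new_prefixlen)
    ]


four = split_routes("10.100.1.128/26", 28, "vlan11")      # as in test 3.2
sixteen = split_routes("10.100.1.128/26", 30, "vlan11")   # assumed for test 3.3
print("\n".join(four))
print(len(sixteen), "routes in the 16-way split")
```

Each route entry carries its own mutex, so spreading the same traffic over
more entries presumably reduces contention on any single lock, which would
explain the improvement from test 3 to 3.3.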
================= Test 4 =======================
Kernel: FreeBSD-8-S June 4 SVN, stock drivers, routing patch 1, no
FLOWTABLE, no firewall
input (ix0) output
packets errs idrops bytes packets errs bytes colls
1.8M 0 0 114M 1.9M 0 114M 0
1.7M 0 0 114M 1.7M 0 114M 0
1.8M 0 0 114M 1.8M 0 114M 0
1.7M 0 0 114M 1.7M 0 114M 0
1.8M 0 0 114M 1.8M 0 74M 0
1.5M 0 0 114M 1.8M 0 74M 0
2M 0 0 114M 1.8M 0 194M 0
Patch 1 totally eliminates mtx_lock in the fastforwarding path, to get an
idea of how much performance we can achieve. The result is nearly the same
as in 3.3.
================= Test 4.1 =======================
Same as the test 4, same traffic level, enabling firewall with single
allow rule (evaluating RLOCK performance)
22:35 [0] test15# netstat -I ix0 -hw 1
input (ix0) output
packets errs idrops bytes packets errs bytes colls
1.8M 149k 0 114M 1.6M 0 142M 0
1.4M 148k 0 85M 1.6M 0 104M 0
1.8M 149k 0 143M 1.6M 0 104M 0
1.6M 151k 0 114M 1.6M 0 104M 0
1.6M 151k 0 114M 1.6M 0 104M 0
1.4M 152k 0 114M 1.6M 0 104M 0
I.e. something like a 10% performance loss.
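The estimate can be checked against the netstat samples. A rough sketch,
using the output packet columns of tests 4 and 4.1 as printed above; netstat
-h rounds to two digits, so the result is only approximate.

```python
from statistics import mean

# Output packet rates (Mpps) copied from the netstat samples above.
test4_out = [1.9, 1.7, 1.8, 1.7, 1.8, 1.8, 1.8]   # test 4: no firewall
test41_out = [1.6, 1.6, 1.6, 1.6, 1.6, 1.6]       # test 4.1: single allow rule

loss = 1 - mean(test41_out) / mean(test4_out)
print(f"relative loss with the firewall enabled: {loss:.1%}")
```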
================= Test 4.2 =======================
Same as test 4, playing with the number of queues.
5 queues, same traffic level:
1.5M 225k 0 114M 1.5M 0 99M 0
================= Test 4.3 =======================
Same as test 4, HT on, number of queues = 16
input (ix0) output
packets errs idrops bytes packets errs bytes colls
2.4M 0 0 157M 2.4M 0 156M 0
2.4M 0 0 156M 2.4M 0 157M 0
However, enabling the firewall immediately drops the rate to 1.9 Mpps, which
is nearly the same as 4.1 (and a complicated fw ruleset would possibly kill
an HT core much faster).
================= Test 4.4 =======================
Same as test 4, kernel ix0 que Tx threads bound to specific CPUs
(corresponding to RX):
18:02 [0] test15# procstat -ak | grep ix0 | sort -nk 2
12 100045 intr irq256: ix0:que <running>
0 100046 kernel ix0 que <running>
12 100047 intr irq257: ix0:que <running>
0 100048 kernel ix0 que mi_switch sleepq_wait msleep_spin taskqueue_thread_loop fork_exit fork_trampoline
12 100049 intr irq258: ix0:que <running>
..
test15# for i in `jot 12 0`; do cpuset -l $i -t $((100046+2*$i)); done
Result:
input (ix0) output
packets errs idrops bytes packets errs bytes colls
2.1M 0 0 139M 2M 0 193M 0
2.1M 0 0 139M 2.3M 0 139M 0
2.1M 0 0 139M 2.1M 0 85M 0
2.1M 0 0 139M 2.1M 0 193M 0
Quite a considerable increase; however, this works better for a uniform
traffic distribution only.
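The cpuset loop above relies on the thread IDs being exactly 2 apart starting
at 100046. A slightly more defensive sketch is to take the IDs from the
procstat -ak output itself. The sample text is copied from the listing
above; the helper name is made up, and the sketch still assumes that
ascending TID order matches queue order, as the original loop does.

```python
PROCSTAT_SAMPLE = """\
12 100045 intr irq256: ix0:que <running>
0 100046 kernel ix0 que <running>
12 100047 intr irq257: ix0:que <running>
0 100048 kernel ix0 que mi_switch sleepq_wait
12 100049 intr irq258: ix0:que <running>
"""


def cpuset_commands(procstat_text: str, ifname: str = "ix0") -> list:
    """Build one 'cpuset -l CPU -t TID' command per kernel que thread."""
    tids = []
    for line in procstat_text.splitlines():
        fields = line.split()
        # Taskqueue threads appear as: <pid> <tid> kernel <ifname> que ...
        if len(fields) >= 5 and fields[2] == "kernel" and fields[3:5] == [ifname, "que"]:
            tids.append(int(fields[1]))
    # Assumption: the N-th que thread (by TID) serves the N-th queue.
    return [f"cpuset -l {cpu} -t {tid}" for cpu, tid in enumerate(sorted(tids))]


commands = cpuset_commands(PROCSTAT_SAMPLE)
for cmd in commands:
    print(cmd)
```

The printed commands can be reviewed before being run on the box.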
================= Test 5 =======================
Same as test 4, make radix use rmlock (r234648, r234649).
Result: 1.7 MPPS.
================= Test 6 =======================
Same as test 4 + FLOWTABLE
Result: 1.7 MPPS.
================= Test 7 =======================
Same as test 4, built with GCC 4.7
Result: No performance gain
Further investigations:
================= Test 8 =======================
Test 4 setup with a kernel built with LOCK_PROFILING.
17:46 [0] test15# sysctl debug.lock.prof.enable=1 ; sleep 2 ; sysctl debug.lock.prof.enable=0
920k 0 0 59M 920k 0 59M 0
875k 0 0 59M 920k 0 59M 0
628k 0 0 39M 566k 0 45M 0
79k 2.7M 0 186M 57k 0 6.5M 0
71k 878k 0 61M 73k 0 4.0M 0
891k 254k 0 72M 917k 0 54M 0
920k 0 0 59M 920k 0 59M 0
When enabled, forwarding performance goes down to 60 kpps.
Profiling was enabled for 2 seconds (so about 130k packets were actually
forwarded); results are attached as a separate file. Several hundred lock
contentions in ixgbe, that's all.
================= Test 9 =======================
Same as test 4 setup with hwpmc.
Results attached.
================= Test 10 =======================
Kernel: FreeBSD-9-S.
No major difference.
Some (my) preliminary conclusions:
1) rte mtx_lock should (and can) be eliminated from stock kernel. (And
it can be done more or less easily for in_matroute).
2) rmlock vs rwlock performance difference is insignificant (maybe
because of 3) )
3) there is lock contention between ixgbe taskq threads and ithreads.
I'm not sure if taskq threads are necessary in the case of packet
forwarding (as opposed to traffic generation).
Maybe I'm missing something else? (L2 cache misses or other things.)
What else can I do to debug this further?
Relevant files:
http://static.ipfw.ru/files/fbsd10g/0001-no-rt-mutex.patch
http://static.ipfw.ru/files/fbsd10g/kernel.gprof.txt
http://static.ipfw.ru/files/fbsd10g/prof_stats.txt
============= CONFIGS ====================
sysctl.conf:
kern.ipc.maxsockbuf=33554432
net.inet.udp.maxdgram=65535
net.inet.udp.recvspace=16777216
net.inet.tcp.sendbuf_auto=0
net.inet.tcp.recvbuf_auto=0
net.inet.tcp.sendspace=16777216
net.inet.tcp.recvspace=16777216
net.inet.ip.maxfragsperpacket=64
kern.random.sys.harvest.ethernet=0
kern.random.sys.harvest.point_to_point=0
kern.random.sys.harvest.interrupt=0
net.inet.ip.forwarding=1
net.inet.ip.fastforwarding=1
net.inet.ip.redirect=0
hw.intr_storm_threshold=20000
loader.conf:
kern.ipc.nmbclusters="512000"
ixgbe_load="YES"
hw.ixgbe.rx_process_limit="300"
hw.ixgbe.nojumbobuf="1"
hw.ixgbe.max_loop="100"
hw.ixgbe.max_interrupt_rate="20000"
hw.ixgbe.num_queues="11"
hw.ixgbe.txd=4096
hw.ixgbe.rxd=4096
kern.hwpmc.nbuffers=2048
debug.debugger_on_panic=1
net.inet.ip.fw.default_to_accept=1
kernel:
cpu HAMMER
ident CORE_RELENG_7
options COMPAT_IA32
makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols
options SCHED_ULE # ULE scheduler
options PREEMPTION # Enable kernel thread preemption
options INET # InterNETworking
options INET6 # IPv6 communications protocols
options SCTP # Stream Control Transmission Protocol
options FFS # Berkeley Fast Filesystem
options SOFTUPDATES # Enable FFS soft updates support
options UFS_ACL # Support for access control lists
options UFS_DIRHASH # Improve performance on big directories
options UFS_GJOURNAL # Enable gjournal-based UFS journaling
options MD_ROOT # MD is a potential root device
options PROCFS # Process filesystem (requires PSEUDOFS)
options PSEUDOFS # Pseudo-filesystem framework
options GEOM_PART_GPT # GUID Partition Tables.
options GEOM_LABEL # Provides labelization
options COMPAT_43TTY # BSD 4.3 TTY compat [KEEP THIS!]
options COMPAT_FREEBSD4 # Compatible with FreeBSD4
options COMPAT_FREEBSD5 # Compatible with FreeBSD5
options COMPAT_FREEBSD6 # Compatible with FreeBSD6
options COMPAT_FREEBSD7 # Compatible with FreeBSD7
options COMPAT_FREEBSD32
options SCSI_DELAY=4000 # Delay (in ms) before probing SCSI
options KTRACE # ktrace(1) support
options STACK # stack(9) support
options SYSVSHM # SYSV-style shared memory
options SYSVMSG # SYSV-style message queues
options SYSVSEM # SYSV-style semaphores
options _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
options KBD_INSTALL_CDEV # install a CDEV entry in /dev
options AUDIT # Security event auditing
options HWPMC_HOOKS
options GEOM_MIRROR
options MROUTING
options PRINTF_BUFR_SIZE=100
# To make an SMP kernel, the next two lines are needed
options SMP # Symmetric MultiProcessor Kernel
# CPU frequency control
device cpufreq
# Bus support.
device acpi
device pci
device ada
device ahci
# SCSI Controllers
device ahd # AHA39320/29320 and onboard AIC79xx devices
options AHD_REG_PRETTY_PRINT # Print register bitfields in debug
# output. Adds ~215k to driver.
device mpt # LSI-Logic MPT-Fusion
# SCSI peripherals
device scbus # SCSI bus (required for SCSI)
device da # Direct Access (disks)
device pass # Passthrough device (direct SCSI access)
device ses # SCSI Environmental Services (and SAF-TE)
# RAID controllers
device mfi # LSI MegaRAID SAS
# atkbdc0 controls both the keyboard and the PS/2 mouse
device atkbdc # AT keyboard controller
device atkbd # AT keyboard
device psm # PS/2 mouse
device kbdmux # keyboard multiplexer
device vga # VGA video card driver
device splash # Splash screen and screen saver support
# syscons is the default console driver, resembling an SCO console
device sc
device agp # support several AGP chipsets
## Power management support (see NOTES for more options)
#device apm
## Add suspend/resume support for the i8254.
#device pmtimer
# Serial (COM) ports
#device sio # 8250, 16[45]50 based serial ports
device uart # Generic UART driver
# If you've got a "dumb" serial or parallel PCI card that is
# supported by the puc(4) glue driver, uncomment the following
# line to enable it (connects to sio, uart and/or ppc drivers):
#device puc
# PCI Ethernet NICs.
device em # Intel PRO/1000 adapter Gigabit Ethernet Card
device bce
#device ixgb # Intel PRO/10GbE Ethernet Card
#device ixgbe
# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device miibus # MII bus support
# Pseudo devices.
device loop # Network loopback
device random # Entropy device
device ether # Ethernet support
device pty # Pseudo-ttys (telnet etc)
device md # Memory "disks"
device firmware # firmware assist module
device lagg
# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device bpf # Berkeley packet filter
# USB support
device uhci # UHCI PCI->USB interface
device ohci # OHCI PCI->USB interface
device ehci # EHCI PCI->USB interface (USB 2.0)
device usb # USB Bus (required)
#device udbp # USB Double Bulk Pipe devices
device uhid # "Human Interface Devices"
device ukbd # Keyboard
device umass # Disks/Mass storage - Requires scbus and da
device ums # Mouse
# USB Serial devices
device ucom # Generic com ttys
options INCLUDE_CONFIG_FILE
options KDB
options KDB_UNATTENDED
options DDB
options ALT_BREAK_TO_DEBUGGER
options IPFIREWALL #firewall
options IPFIREWALL_FORWARD #packet destination changes
options IPFIREWALL_VERBOSE #print information about
# dropped packets
options IPFIREWALL_VERBOSE_LIMIT=10000 #limit verbosity
# MRT support
options ROUTETABLES=16
device vlan #VLAN support
# Size of the kernel message buffer. Should be N * pagesize.
options MSGBUF_SIZE=4096000
options SW_WATCHDOG
options PANIC_REBOOT_WAIT_TIME=4
#
# Hardware watchdog timers:
#
# ichwd: Intel ICH watchdog timer
#
#device ichwd
device smbus
device ichsmb
device ipmi
--
WBR, Alexander