Poor performance with natd/ipfw and TSO enabled on bce(4) card and 8.1-PRERELEASE

Fri Jul 2 05:30:41 UTC 2010

On Thu, Jul 1, 2010 at 9:19 PM, Ian Smith <smithi at nimnet.asn.au> wrote:
> On Thu, 1 Jul 2010, Garrett Cooper wrote:
>  > On Thu, Jul 1, 2010 at 4:54 PM, Pyun YongHyeon <pyunyh at gmail.com> wrote:
>  > > On Wed, Jun 30, 2010 at 07:00:53PM -0700, Garrett Cooper wrote:
>  > >> Hi,
>  > >>     Just an observation I made while transferring a file:
>  > >>
>  > >> # time scp floppy.img somehost:
>  > >> Password:
>  > >> floppy.img                                    100% 1440KB  13.7KB/s   01:45
>  > >>
>  > >> real  1m59.400s
>  > >> user  0m0.031s
>  > >> sys   0m0.028s
>  > >> # sysctl net.inet.tcp.tso=0
>  > >> net.inet.tcp.tso: 1 -> 0
>  > >> # time scp floppy.img somehost:
>  > >> floppy.img                                    100% 1440KB   1.4MB/s   00:00
>  > >>
>  > >> real  0m0.712s
>  > >> user  0m0.018s
>  > >> sys   0m0.018s
>  > >>
>  > >>     Going ISDN speeds transferring a 1.44MB file is sad when you have
>  > >> a gigabit uplink :(... natd seems to be doing a LOT of spinning when
>  > >> TSO is enabled (it's going up to 73% CPU on a dual-proc quad-core
>  > >> machine).
>  > >
>  > > I would use pf(4) if I have to handle lots of NAT rules.
>
> There's only one NAT rule here, though it's not clear how many active NAT
> sessions are involved.  I tend to doubt this is really a natd issue; as far
> as I know, natd has no interaction with interface features like TSO.
> Hopefully someone will correct me if I'm wrong about that.
>
>  > >>     Here are some other details:
>  > >>
>  > >> # ipfw list
>  > >> 00050 divert 8668 ip4 from any to any via bce1
>  > >> 00100 allow ip from any to any via lo0
>  > >> 00200 deny ip from any to 127.0.0.0/8
>  > >> 00300 deny ip from 127.0.0.0/8 to any
>  > >> 00400 deny ip from any to ::1
>  > >> 00500 deny ip from ::1 to any
>  > >> 00600 allow ipv6-icmp from :: to ff02::/16
>  > >> 00700 allow ipv6-icmp from fe80::/10 to fe80::/10
>  > >> 00800 allow ipv6-icmp from fe80::/10 to ff02::/16
>  > >> 00900 allow ipv6-icmp from any to any ip6 icmp6types 1
>  > >> 01000 allow ipv6-icmp from any to any ip6 icmp6types 2,135,136
>  > >> 65000 allow ip from any to any
>  > >> 65535 deny ip from any to any
>  > >> # ls /etc/natd*
>  > >> ls: /etc/natd*: No such file or directory
>
> I assume that's the 'open' rc.firewall ruleset?

Yes.

$ grep ^firewall /etc/rc.conf
firewall_type="open"

> So you have no
> natd.conf, and are taking all defaults?  Just to check the config:

Correct.

$ ls /etc/natd.conf
ls: /etc/natd.conf: No such file or directory

> # grep natd_ /etc/rc.conf

$ grep ^natd_ /etc/rc.conf
natd_enable="YES"
natd_interface="bce1"

> # ps axw | grep "[n]atd"
>
> Do you have options IPFIREWALL and IPDIVERT in kernel, or are you
> loading these as modules?

Modules.

$ egrep 'IPDIVERT|IPFIREWALL' /root/TAMESHI_STABLE
$ make -VMODULES_OVERRIDE -f /etc/src.conf foo
bce bge em bridgestp if_bridge ipdivert ipfw ipfw_nat libalias
i2c/smbus ipmi ipmi/ipmi_linux linprocfs linsysfs linux

>  > >> # uname -a
>  > >> FreeBSD tameshi.cisco.com 8.1-PRERELEASE FreeBSD 8.1-PRERELEASE #0
>  > >> r209169: Mon Jun 14 12:41:49 PDT 2010
>  > >> root@:/usr/obj/data/scratch/src/stable/8/sys/TAMESHI_STABLE  amd64
>  > >> # pciconf -lv | grep -A 4 bce
>  > >> bce1@pci0:7:0:0:      class=0x020000 card=0x01b21028 chip=0x164c14e4 rev=0x12 hdr=0x00
>  > >>     vendor     = 'Broadcom Corporation'
>  > >>     device     = 'Broadcom NetXtreme II Gigabit Ethernet Adapter (BCM5708)'
>  > >>     class      = network
>  > >>     subclass   = ethernet
>  > >> --
>  > >> bce0@pci0:3:0:0:      class=0x020000 card=0x01b21028 chip=0x164c14e4 rev=0x12 hdr=0x00
>  > >>     vendor     = 'Broadcom Corporation'
>  > >>     device     = 'Broadcom NetXtreme II Gigabit Ethernet Adapter (BCM5708)'
>  > >>     class      = network
>  > >>     subclass   = ethernet
>  > >>
>  > >>     Let me know what other info is required.
>  > >
>  > > Can you reproduce this issue on other TSO capable drivers?
>  > > I'm not aware of any TSO issues on bce(4).
>  >
>  > Hi Pyun!
>  >
>  > I'll have to pop in a copper Intel card that we have lying around in
>  > the lab. I think it's em(4) compatible, but I forget. I have a few
>  > network things to test this weekend, so I'll try to reproduce this
>  > then (say, Sunday?).
>  >
>  > I also have my msk(4) enabled machine in the lab I can test with, but
>  > I'll have to install the machine to spec with the Poweredge 2950 I
>  > have in the lab.
>  >
>  > I'm using ipfw because it was easy to set up according to the handbook,
>  > but in reality if ipfw is this bad at dealing with NAT rules, then I need
>  > to work with someone to improve how it scales.
>
> Unless there's something weird with tagging or something going on with
> divert sockets, this looks like something else;

Ok.

> natd usually works fine
> at much higher rates, but I can't talk about gigabit.  Though in-kernel
> NAT should be better at the higher throughput end,

But in-kernel NAT panics deterministically on 8-STABLE, as I've shown in
another thread, so unfortunately I can't use it.

> your 'ISDN' rate and the high CPU usage for natd is certainly not typical.

That I wouldn't doubt.

> Does this box have a public IP address on bce1?

Nope.

> It's not clear whether you're doing this transfer from this box, or from another, through it, ie what address translation is expected?

I'm doing the transfer from tameshi.cisco.com to ironport1.cisco.com
via what I would hope is the public interface, bce1, because my
routes are set up that way:

$ netstat -nr
Routing tables

Internet:
Destination        Gateway            Flags    Refs      Use  Netif Expire
default            173.37.10.1        UGS         0 42504770   bce1
127.0.0.1          link#1             UH          0     2052    lo0
173.37.10.0/24     link#4             U          38  2752472   bce1
173.37.10.6        link#4             UHS         0 20258228    lo0
192.168.20.0/22    link#3             U           3  5570413   bce0
192.168.20.1       link#3             UHS         0     2572    lo0
192.168.21.1       link#3             UHS         0        0    lo0
192.168.22.1       link#3             UHS         0        0    lo0
192.168.23.1       link#3             UHS         0        0    lo0
192.168.24.0/22    link#3             U           0        0   bce0
192.168.24.1       link#3             UHS         0        0    lo0

Internet6:
Destination                       Gateway                       Flags       Netif Expire
::1                               ::1                           UH          lo0
fe80::%lo0/64                     link#1                        U           lo0
fe80::1%lo0                       link#1                        UHS         lo0
ff01:1::/32                       fe80::1%lo0                   U           lo0
ff02::%lo0/32                     fe80::1%lo0                   U           lo0
$ ifconfig
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
	options=3<RXCSUM,TXCSUM>
	inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
	inet6 ::1 prefixlen 128
	inet 127.0.0.1 netmask 0xff000000
	nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
ipfw0: flags=8800<SIMPLEX,MULTICAST> metric 0 mtu 65536
bce0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
	ether 00:1e:4f:38:65:ab
	inet 192.168.20.1 netmask 0xfffffc00 broadcast 192.168.23.255
	inet 192.168.21.1 netmask 0xfffffc00 broadcast 192.168.23.255
	inet 192.168.22.1 netmask 0xfffffc00 broadcast 192.168.23.255
	inet 192.168.23.1 netmask 0xfffffc00 broadcast 192.168.23.255
	inet 192.168.24.1 netmask 0xfffffc00 broadcast 192.168.27.255
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active
bce1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
	ether 00:1e:4f:38:65:ad
	inet 173.37.10.6 netmask 0xffffff00 broadcast 173.37.10.255
	media: Ethernet autoselect (1000baseT <full-duplex>)
	status: active

I would expect bce1 -> bce0 to hop a VLAN, but apart from that,
transfer speeds should be reasonably fast. ironport1 is a
semi-ancient SPARC machine, so I don't expect the speeds to be blazing
fast, but I've gotten up to 15 MBps on a good day.

> Where is 'somehost'?

Ok, bleh... turns out that someone internally used somehost as a
CNAME, so rather than obfuscating things I'll just divulge the real
hostname because it's needed:

$ host ironport1.cisco.com ; host tameshi.cisco.com
ironport1.cisco.com has address 173.37.5.41
tameshi.cisco.com has address 173.37.10.6

> Hence, knowing natd's config options and net topology might be helpful.

Fair enough; security by obscurity isn't going to make any difference
because all of this crud is behind the corporate firewall anyhow :).

Another weird thing I noticed when I looked further is that dhcpd's
CPU usage spikes to 33% instead of remaining near idle, and I had no
idea why. So I truss'ed the process and saw a lot of recvfrom chatter
on 127.0.0.1:0. It looks like the traffic is being delivered to all
ports instead of just ports 67/68, and something looks horribly broken
in the networking stack with TSO on. With TSO turned off, _no_ traffic
is intercepted via lo0 by dhcpd when I scp the file, which is the
behavior I would expect.
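
For anyone who wants to narrow this down per interface rather than
flipping the global sysctl, here is a rough sketch. has_tso4 is a
hypothetical helper name, and it assumes the capability is reported as
TSO4 inside the options=...<...> line, as in the ifconfig output above.
The commented ifconfig commands assume the driver honors the
per-interface tso flag described in ifconfig(8); I have not verified
that on this box.

```shell
#!/bin/sh
# Sketch only: succeed if ifconfig-style text on stdin lists TSO4 as an
# enabled capability. Reading stdin means saved output can be checked too.
has_tso4() {
    # Match TSO4 as a whole capability name inside options=<hex><A,B,C>,
    # so that VLAN_HWTSO alone does not count as a match.
    grep -Eq 'options=[0-9a-f]+<([A-Z0-9_]+,)*TSO4(,[A-Z0-9_]+)*>'
}

# Intended use on the FreeBSD box (left commented so nothing is changed
# just by sourcing this file):
#   ifconfig bce1 | has_tso4 && echo "TSO4 enabled on bce1"
#   ifconfig bce1 -tso      # assumption: disables TSO on bce1 only
#   ifconfig bce1 | has_tso4 || echo "TSO4 disabled on bce1"
```

Checking a single interface this way would show whether the slowdown
follows bce1 specifically or appears with TSO enabled anywhere.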

I'll see whether or not there are any firmware upgrades for the NIC on
this machine because there might be some sort of hardware errata that
I need to take into consideration. I'll get back to you guys after I
do that because I'm concerned that that might be an issue.

Thanks,
-Garrett