amd64/156408: Routing failure when using VLANs vs. Physical ethernet interfaces.
Thomas Johnson
tom at claimlynx.com
Thu Apr 14 21:00:25 UTC 2011
>Number: 156408
>Category: amd64
>Synopsis: Routing failure when using VLANs vs. Physical ethernet interfaces.
>Confidential: no
>Severity: serious
>Priority: medium
>Responsible: freebsd-amd64
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Thu Apr 14 21:00:19 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator: Thomas Johnson
>Release: FreeBSD 8.2-RELEASE amd64
>Organization:
ClaimLynx, Inc.
>Environment:
System: FreeBSD jaguar-2.claimlynx.com 8.2-RELEASE FreeBSD 8.2-RELEASE #8: Sat Feb 26 21:23:00 CST 2011 root at jaguar-2.claimlynx.com:/usr/obj/usr/src/sys/GENERIC-CARP amd64
>Description:
I have discovered some odd routing behavior that seems to occur when VLANs are used as members of a bridge. Specifically, it seems that static routes do not function correctly.
Here is some background on the situation I have. I am building a new host to replace our aging (running 8.0) firewall. The new machine I am building has a single ethernet interface (re driver, but over the course of troubleshooting I've used sk and igb ethernet adapters), so I am using VLANs to segment traffic. The 'LAN' VLAN on my setup uses interface vlan500, with the 'WAN' on vlan200. The firewall also has an OpenVPN tunnel to our data center, operating in bridged mode on interface tap0. vlan500 and tap0 are both members of bridge0, allowing the LANs at our office and data center to talk on the same subnet, 172.31.0.0/16.
In this configuration, I am able to connect from the office lan to hosts on the data center lan. The openvpn server at the datacenter (separate host from the firewall) pushes out a route for the dc production subnet upon connect. The logical configuration looks something like this:
(office lan)<->[vlan500|bridge0|tap0]<-vpn->(dc lan)<->[dc firewall]<->(dc production subnet)
               [      firewall      ]
[     common 172.31.0.0/16 subnet throughout      ]                    [ 100.100.100.128/26 ]
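The interface layout above could be expressed in /etc/rc.conf roughly as follows. This is only a sketch for reference: the parent NIC name (re0), the VLAN tags (500/200), and the WAN address method are assumptions not stated in the report; only the vlan500 address (172.31.0.252/16) and the bridge membership come from it.

```sh
# /etc/rc.conf fragment (sketch; re0, VLAN tags, and WAN addressing are assumed)
cloned_interfaces="vlan500 vlan200 bridge0"
ifconfig_vlan500="inet 172.31.0.252/16 vlan 500 vlandev re0"  # 'LAN' VLAN
ifconfig_vlan200="DHCP vlan 200 vlandev re0"                  # 'WAN' VLAN (addressing assumed)
# tap0 is created by OpenVPN after boot, so in practice it is usually
# added to the bridge from an OpenVPN up script rather than here.
ifconfig_bridge0="addm vlan500 up"
```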
For the sake of reference, here are the relevant IP addresses:
172.31.0.252 - local firewall vlan500
172.31.0.254 - local firewall lan carp
172.31.5.1 - data center firewall
The problem seems to exist with the route to the production subnet at the data center. When the openvpn connection comes up, the route is installed in the routing table as expected. However, attempts to connect to hosts on this network fail instantly; there is not even a 'host unreachable' error. For example:
~-> ping hostfoo
PING hostfoo.claimlynx.com (100.100.100.149): 56 data bytes
ping: sendto: Invalid argument
Here is the output of 'netstat -rn' on this host:
root at shawshank-1:~-> netstat -rn
Routing tables
Internet:
Destination          Gateway       Flags  Refs  Use     Netif   Expire
default              10.8.20.1     UGS    4     124778  vlan20
172.31.0.0/16        link#12       U      3     56103   vlan50
172.31.0.252         link#12       UHS    0     0       lo0
172.31.0.254         link#13       UH     0     0       carp10
172.31.3.5           link#8        UHS    0     0       lo0
10.8.20.0/24         link#9        U      0     33      vlan20
10.8.20.252          link#9        UHS    0     0       lo0
10.8.20.254          link#14       UH     0     0       carp20
10.8.30.0/24         link#10       U      0     0       vlan30
10.8.30.252          link#10       UHS    0     0       lo0
10.8.30.254          link#15       UH     0     0       carp30
10.8.40.0/24         link#11       U      0     0       vlan40
10.8.40.252          link#11       UHS    0     0       lo0
127.0.0.1            link#7        UH     0     0       lo0
100.100.100.128/26   172.31.5.1    UGS    0     21466   tap0
Internet6:
Destination       Gateway      Flags  Netif  Expire
::1               ::1          UH     lo0
fe80::%lo0/64     link#7       U      lo0
fe80::1%lo0       link#7       UHS    lo0
ff01:7::/32       fe80::1%lo0  U      lo0
ff02::%lo0/32     fe80::1%lo0  U      lo0
As you can see, the routing table puts the 172.31.0.0/16 subnet route on the vlan500 interface and the 100.100.100.128/26 production subnet route on the tap0 interface. While troubleshooting this, my hunch was that the system was choking because the next-hop for the production route sits on a network (172.31.0.0/16) whose route does not point at tap0, so the kernel concludes the next-hop is unreachable via tap0, even though in actuality it is reachable there. To test this, I inserted a host route for the next hop:
route add 172.31.5.1 -interface tap0
Adding this route resolves the condition, but it seems like a hacky fix. In comparison, the firewall that I am replacing uses the same lan/bridge/tap setup, but the machine has physical ethernet interfaces for all segments, rather than the vlans that my new setup uses. The existing setup works fine, without the need to add a host route. Here is the routing table for the existing firewall:
tom at shawshank:~-> netstat -rn
Routing tables
Internet:
Destination          Gateway       Flags  Refs  Use       Netif  Expire
default              74.95.66.26   UGS    7     5043426   fxp2
172.31.0.0/16        link#2        U      4     70728235  fxp1
172.31.0.1           link#2        UHS    0     3870772   lo0
172.31.3.4           link#8        UHS    0     0         lo0
74.95.66.24/30       link#3        U      0     1243      fxp2
74.95.66.25          link#3        UHS    0     9         lo0
127.0.0.1            link#6        UH     0     1140570   lo0
192.168.50.0/24      link#1        U      0     0         fxp0
192.168.50.4         link#1        UHS    0     0         lo0
100.100.100.128/26   172.31.5.1    UGS    0     19877     fxp1
Internet6:
Destination       Gateway      Flags  Netif  Expire
::1               ::1          UH     lo0
fe80::%lo0/64     link#6       U      lo0
fe80::1%lo0       link#6       UHS    lo0
ff01:6::/32       fe80::1%lo0  U      lo0
ff02::%lo0/32     fe80::1%lo0  U      lo0
The noteworthy difference between the two routing tables is that the production route on the old firewall is put on the LAN interface (fxp1).
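One way to compare the two hosts directly is to ask each kernel which route and interface it actually selects, before and after the workaround route is added. A sketch of the diagnostic, using the example host and next-hop addresses from the report:

```sh
# Which route does the kernel pick for a production host?
# (100.100.100.149 is the host from the ping test above)
route -n get 100.100.100.149

# And which route resolves the next-hop itself?
# On the broken host this should show 172.31.5.1 resolving via the
# 172.31.0.0/16 route on vlan500 rather than via tap0.
route -n get 172.31.5.1
```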
>How-To-Repeat:
This situation occurs every time this host is booted.
>Fix:
The workaround I have found is to add a host route for the next-hop to the tap0 interface. This seems to work alright, but I want to make sure that this isn't a symptom of a bug in the vlan driver or elsewhere.
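If the workaround is kept, it could be made persistent across reboots with the static_routes mechanism in rc.conf. A sketch, with one caveat: the route itself is the one from the report, but the label name is arbitrary, and since tap0 only exists once OpenVPN is up, the route may be better added from an OpenVPN up/route-up script than at boot.

```sh
# /etc/rc.conf fragment (sketch): persist the next-hop host route
# Note: tap0 must exist when the routing script runs, which is not
# guaranteed at boot if OpenVPN creates the interface later.
static_routes="vpngw"
route_vpngw="172.31.5.1 -interface tap0"
```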
>Release-Note:
>Audit-Trail:
>Unformatted: