ix driver vlanhwfilter issue - how to catch a lion
Date: Sun, 03 Mar 2024 23:29:38 UTC
Hi everyone!

I'd like to report that there is some issue with the ix driver's vlanhwfilter feature. To be more specific, I'm not sure whether this is a driver issue, a hardware issue, or a firmware issue. I'm just happy that I could catch it. If someone more experienced is interested, I'm happy to help dig deeper to find the root cause. For the rest, I'd rather tell the story of how I caught this lion.

Maybe my recollection is not 100% correct, but whoever learned some more advanced mathematics in high school learned how to catch a lion in the desert:

Step 1: Cut the desert (or what remains of it, see Step 3) in half.
Step 2: Check which half the lion is in.
Step 3: If your half is no bigger than a cage, you have caught the lion. If it is bigger, repeat from Step 1 with that half.

Now I had a similar problem in my homelab network. I recently bought one of those cheap mini PCs from Aliexpress. This one has 4 i226-V ports, 2 X520-DA2 style SFP+ ports, 2 DDR5 slots, an NVMe slot, 2 SATA ports, and a Pentium Gold 8505 CPU.

First I started to experiment with the network configuration. I run several VMs on this box. I mean, the VMs already existed: they were running on the predecessor, which only had an i3 CPU and was maxed out with an 8 GiB RAM module. What I wanted to achieve was a redundant connection to two switches, to "speak RSTP", and to let the VMs talk to their respective VLANs.

I remembered that on FreeBSD, if you bridge two ports, VLANs won't work on them, since even the tagged frames are handled by the bridge code. Then I thought I could add another VM, and that VM could create the VLAN interfaces on a port that is also connected to the bridge interface. Then I thought a little further: I don't even need a VM if I just create an epair interface, put the "a" half of the epair into the bridge, and create the VLAN interfaces on the "b" half of the epair.

The config worked:

    ifconfig_igc0="mtu 9004"
    ifconfig_igc1="mtu 1504"
    ifconfig_ix0="mtu 9004"
    ifconfig_epair0a="mtu 9004"
    ifconfig_epair0b="mtu 9004"
    cloned_interfaces="bridge0 epair0 vlan1 vlan7 vxlan30 bridge1"
    ifconfig_bridge0="addm ix0 stp ix0 addm igc0 stp igc0 addm epair0a"
    create_args_vlan1="vlan 1 vlandev epair0b mtu 1500"
    ifconfig_vlan1="up"
    create_args_vlan7="vlan 7 vlandev epair0b mtu 9000"
    ifconfig_vlan7="inet 172.16.7.5 netmask 255.255.255.0 up"
    ifconfig_bridge1="inet 172.16.33.5 netmask 255.255.255.0 addm vlan1"
    create_args_vxlan30="vxlanid 30 vxlanlocal 172.16.7.5 vxlangroup 225.0.0.1 vxlandev vlan7 mtu 1500"
    ifconfig_vxlan30="up"

The rest of the VLAN interfaces and their respective switches were created during boot by vm-bhyve's init script and managed there. The bridge for vlan1 was only necessary because my default route needed it to be configured earlier in the boot process.

As I said, this was working just fine, until I started to use it. As soon as I started all 8 VMs, the system crashed within 2 minutes. It did not even react on the serial port, nothing. Only a long press of the power button to turn it off, and then turning it back on a second later, gave me back control over the mini-PC router host.

At this point I spent several hours trying to catch the lion, assuming that one of the VMs was the culprit. Long story short: they weren't. Starting more and bigger VMs only made the issue happen sooner, but the culprit was none of my VMs. Though I wasn't 100% sure at this point. There is a really lightweight VM, running only nsd and having only one interface. For that one alone to crash the system would have taken much more time. There was another one with 2 interfaces, a CARP IP, and bird running OSPF. The one which has the master CARP interface advertises the stub network from the 2nd interface to the OSPF routers on the 1st interface.
If the roles switch from master to backup, bird is restarted automatically to use the appropriate config file. It is not as big and complex a VM as the ones running pfSense. This VM only needs 256 MiB of RAM, but could almost predictably crash the host within ~2 hours. If I started all 8 VMs, the host crashed within 2 minutes.

Anyway... As I said, my educated guess at this point was that the culprit was not the VMs but something in the network. I consulted Zahy, my old friend, who is a more seasoned BSD user than I am and has a few decades more experience. He asked why I was building this complex scenario with the bridge. Well, I had my answer: if the connection between the two switches this host is connected to fails, the BSD box can still connect the two of them. A well designed loop in the network, with the proper configuration, just makes it more redundant. This way the connection between the two switches won't be a single point of failure.

Anyway, I listened to him, tried to simplify the config, and replaced bridge0 with a lagg interface:

    ####### Fallback network config with failover
    #cloned_interfaces="lagg0 vlan1 vlan7 vxlan30 bridge0"
    #create_args_lagg0="laggproto failover laggport ix0 laggport igc0 mtu 9004"
    #ifconfig_lagg0="up"
    #create_args_vlan1="vlan 1 vlandev lagg0 mtu 1500"
    #ifconfig_vlan1="up"
    #create_args_vlan7="vlan 7 vlandev lagg0 mtu 9000"
    #ifconfig_vlan7="inet 172.28.7.5 netmask 255.255.255.0"
    #ifconfig_bridge0="inet 172.28.33.5 netmask 255.255.255.0 addm vlan1"
    #create_args_vxlan30="vxlanid 30 vxlanlocal 172.28.7.5 vxlangroup 225.0.0.1 vxlandev vlan7 mtu 1500"
    #ifconfig_vxlan30="up"

Using lagg instead of bridge didn't solve my problems. I even tried not using bridge0 at all, configuring that IP address directly on vlan1, and not running the VMs that want to interact with vlan1, but this did not solve my problems either. The pfSense VMs could crash the system quite fast.

Then I thought of simplifying the system further: throwing out the redundancy and configuring everything directly on the ix0 interface. Long story short: the issue remained. That was the point where I was 100% sure that neither the bridge, nor the lagg, nor the VMs were the culprit. The system froze seemingly at random, one time even during the boot process. The last message I saw on the serial console was the initialization of vlan1.

I even took the mini PC apart. I reseated the 10G card and put the DDR5 module in the other slot. None of that helped.

So, let's check the other half of the desert: I configured everything on the igc0 interface instead of ix0. Surprise, surprise: everything worked. After starting all 8 VMs, the system was still up and running 40 minutes later. To me that was proof that the issue is related to the ix0 interface.

I even considered configuring everything on ix1. Maybe the port is the culprit. But what are the odds that one port has a hardware failure and the other does not? Pretty slim, so instead I went with my educated guess (one might just call it a gut feeling) and compared the features of an ix interface with those of the igc0 interface:

    root@vjun:~ # ifconfig ix1 | grep options | head -1 ; ifconfig igc0 | grep options | head -1
    options=4e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
    options=4a420b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWTSO,RXCSUM_IPV6,HWSTATS,MEXTPG>

And now a definite gut feeling moment: let's turn off VLAN_HWFILTER. No crash. I restored the "big" redundant, RSTP-speaking configuration, but with one tiny difference:

    ifconfig_ix0="-vlanhwfilter mtu 9004"

I even put the vm_list setting back into rc.conf, so the VMs start automatically on boot.
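
For anyone who wants to repeat the comparison without eyeballing the two options lines, here is a minimal sketch in plain sh. It prints the capability flags that appear on the first line but not on the second. The function names flags_of and only_in_first are made up for illustration, and the two strings are simply the ifconfig output quoted above, pasted in as test data.

```shell
#!/bin/sh
# Sketch: list capability flags present in one "options=" line but not another.
# flags_of and only_in_first are hypothetical helper names, not existing tools.

# flags_of LINE: print the names between "<" and ">", one per line.
flags_of() {
    printf '%s\n' "$1" | sed -n 's/.*<\(.*\)>.*/\1/p' | tr ',' '\n'
}

# only_in_first LINE_A LINE_B: print the flags of LINE_A that LINE_B lacks.
only_in_first() {
    second=$(flags_of "$2")
    flags_of "$1" | while read -r flag; do
        # -x matches the whole line, so TXCSUM does not match TXCSUM_IPV6.
        printf '%s\n' "$second" | grep -qxF "$flag" || printf '%s\n' "$flag"
    done
}

# Sample data: the two options lines from the ifconfig output in this mail.
ix_opts='options=4e53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>'
igc_opts='options=4a420b9<RXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWTSO,RXCSUM_IPV6,HWSTATS,MEXTPG>'

only_in_first "$ix_opts" "$igc_opts"
```

On a live system the two strings could instead come from `ifconfig ix0 | grep options | head -1` and the same for igc0. Each printed flag is then a candidate to switch off on its own; VLAN_HWFILTER is just the one I happened to try first.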

    root@vjun:~ # uptime
    11:20PM up 7:06, 5 users, load averages: 0.08, 0.19, 0.20

So far so good! As I said, the lion is now in a cage-sized part of the desert. The only thing I don't know is whether this is a driver issue, a firmware issue, or something with the hardware. Can someone help me find out? Since I have a workaround, I can sleep well now. But if we could solve the root cause of the problem, it would be even better.

PS: A picture and a few seconds of video of the frozen system on my monitor:
https://drive.google.com/drive/folders/1b0TRcd-W0XnHG_Uia5oXgXxk8XH_Jpv5?usp=sharing

PS2: Zahy already told me I should use the vlanmtu config parameter instead of configuring the MTU 4 bytes bigger on the main interface. But this works, and I don't want to mess with it further. Also, I presume that would only work if the VLAN interfaces were created directly on the ix0 and igc0 devices, and not on the epair0b device. Anyway: this IS working.

TYA!
gyu