Re: epair and vnet jail loose connection.

From: Michael Gmelin <grembo_at_freebsd.org>
Date: Sat, 12 Mar 2022 00:55:44 UTC

> On 12. Mar 2022, at 01:21, Kristof Provost <kp@freebsd.org> wrote:
> 
> On 11 Mar 2022, at 17:44, Johan Hendriks wrote:
>>> On 09/03/2022 20:55, Johan Hendriks wrote:
>>> The problem:
>>> I have a FreeBSD 14 machine and a FreeBSD 13-stable machine, both running the same jails just to test the workings.
>>> 
>>> The jails that are running are a salt master, a haproxy jail, 2 webservers, 2 varnish servers, and 2 php jails, one with php 8.0 and one with php 8.1. All the jails use vnet and are connected to bridge0.
>>> 
>>> I believe this worked on an older 14-HEAD machine, but I did not do much with it back then. When I started testing again after updating the OS, I noticed that one of the varnish jails lost its network connection after running for a few hours. I thought it was just something on HEAD, so I never really looked into it. Later, when I started using the jails again and testing a WordPress test site, I noticed that under a simple load test my haproxy jail loses its network connection within one minute. I see nothing in the logs, either on the host or in the jail.
>>> From the jail I cannot ping the other jails or the IP address of the bridge. I can, however, ping the jail's own IP address. From the host I also cannot ping the haproxy jail's IP address. If I start a tcpdump on the epaira interface of the haproxy jail, I do see the packets arrive there, but they do not show up in the jail.
>>> 
>>> I used ZFS to send all the jails to a 13-STABLE machine and copied over the jail.conf file as well as the pf.conf file, and I saw the same behavior.
>>> 
>>> Then I tried 13.0-RELEASE-p7, and on that machine I do not see this happening. There I can stress test the machine for 10 minutes without a problem. On 14-HEAD and 13-STABLE, however, the jail's network connection fails within a minute, and only a restart of the jail brings it back online. It then exhibits the same behavior again as soon as I start a simple load test, which it should handle easily.
>>> 
>>> One of the jail hosts is running under VMware and the other is running under Ubuntu with KVM. The 13.0-RELEASE-p7 jail host is running under Ubuntu with KVM.
>>> 
>>> Thank you for your time.
>>> regards
>>> Johan
>>> 
>> I did some bisecting, and the latest commit that works on FreeBSD 13-STABLE is 009a56b2e.
>> From commit 2e0bee4c7 ("if_epair: implement fanout") onward, the symptoms described above appear.
>> 
> Interestingly I cannot reproduce stalls in simple epair setups.
> It would be useful if you could reduce the setup with the problem into a minimal configuration so we can figure out what other factors are involved.

If there are clear instructions on how to reproduce, I’m happy to help experiment (I’m relying heavily on epair at this point).
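
As a starting point for a reduced setup, something along the lines of the jail.conf fragment below might do. It is only a sketch, not Johan's actual configuration: the jail name, path, epair unit number and 192.0.2.0/24 address are placeholders, and it assumes bridge0 already exists on the host.

```
# Hypothetical minimal vnet jail attached to an existing bridge0.
# Name, path, epair unit and address are placeholders.
repro {
    path = "/jails/repro";
    host.hostname = "repro";
    vnet;
    vnet.interface = "epair0b";

    # Host side: create the epair and add its "a" end to the bridge.
    exec.prestart  = "ifconfig epair0 create";
    exec.prestart += "ifconfig epair0a up";
    exec.prestart += "ifconfig bridge0 addm epair0a";

    # Jail side: configure the "b" end, then boot the jail as usual.
    exec.start  = "/sbin/ifconfig epair0b inet 192.0.2.2/24 up";
    exec.start += "/bin/sh /etc/rc";
    exec.stop   = "/bin/sh /etc/rc.shutdown";

    # Host side: destroying one end removes the whole pair.
    exec.poststop = "ifconfig epair0a destroy";
}
```

Running the same load test against a single jail like this, without pf and without the other jails, would at least show whether the rest of the setup matters.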

@Kristof: Did you try on bare metal or on VMs?

-m