Re: epair and vnet jail loose connection.

Reply: Michael Gmelin : "Re: epair and vnet jail loose connection."
In reply to: Michael Gmelin : "Re: epair and vnet jail loose connection."
Go to: [ bottom of page ] [ top of archives ] [ this month ]
From: Kristof Provost <kp_at_FreeBSD.org>
Date: Tue, 15 Mar 2022 16:30:41 UTC
On 14 Mar 2022, at 18:02, Michael Gmelin wrote:
> On Mon, 14 Mar 2022 09:09:49 -0600
> Kristof Provost <kp@FreeBSD.org> wrote:
>
>> On 14 Mar 2022, at 7:44, Michael Gmelin wrote:
>>> On Sun, 13 Mar 2022 17:53:44 +0000
>>> "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net> wrote:
>>>
>>>> On 13 Mar 2022, at 17:45, Michael Gmelin wrote:
>>>>
>>>>>> On 13. Mar 2022, at 18:16, Bjoern A. Zeeb
>>>>>> <bzeeb-lists@lists.zabbadoz.net> wrote:
>>>>>>
>>>>>> On 13 Mar 2022, at 16:33, Michael Gmelin wrote:
>>>>>>> It's important to point out that this only happens with
>>>>>>> kern.ncpu>1. With kern.ncpu==1 nothing gets stuck.
>>>>>>>
>>>>>>> This perfectly fits into the picture, since, as pointed out by
>>>>>>> Johan,
>>>>>>> the first commit that is affected[0] is about multicore
>>>>>>> support.
>>>>>>
>>>>>> Ignore my ignorance, what is the default of net.isr.maxthreads
>>>>>> and net.isr.bindthreads (in stable/13) these days?
>>>>>>
>>>>>
>>>>> My tests were on CURRENT and I’m afk, but according to cgit[0][1],
>>>>> max is 1 and bind is 0.
>>>>>
>>>>> Would it make sense to repeat the test with max=-1?
>>>>
>>>> I’d say yes, I’d also bind, but that’s just me.
>>>>
>>>> I would almost assume Kristof running with -1 by default (but he
>>>> can chime in on that).
>>>
>>> I tried various configuration permutations, all with ncpu=2:
>>>
>>> - 14.0-CURRENT #0 main-n253697-f1d450ddee6
>>> - 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb
>>> - net.isr.maxthreads: -1 (which results in 2 threads), 1, 2
>>> - net.isr.bindthreads: -1, 0, 1, 2
>>> - net.isr.dispatch: direct, deferred
>>>
>>> All resulting in the same behavior (hang after a few seconds). They
>>> all
>>> work ok when running on a single core instance (threads=1 in this
>>> case).
>>>
>>> I also ran the same test on 13.0-RELEASE-p7 for
>>> comparison (unsurprisingly, it's ok).
>>>
>>> I placed the script to reproduce the issue on freefall for your
>>> convenience, so running it is as simple as:
>>>
>>>     fetch https://people.freebsd.org/~grembo/hang_epair.sh
>>>     # inspect content
>>>     sh hang_epair.sh
>>>
>>> or, if you feel lucky
>>>
>>>     fetch -o - https://people.freebsd.org/~grembo/hang_epair.sh | sh
>>>
>> With that script I can also reproduce the problem.
>>
>> I’ve experimented with this hack:
>>
>> 	diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c
>> 	index c39434b31b9f..1e6bb07ccc4e 100644
>> 	--- a/sys/net/if_epair.c
>> 	+++ b/sys/net/if_epair.c
>> 	@@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long
>> cmd, caddr_t data)
>>
>> 	        case SIOCSIFMEDIA:
>> 	        case SIOCGIFMEDIA:
>> 	+               printf("KP: %s() SIOCGIFMEDIA\n", __func__);
>> 	                sc = ifp->if_softc;
>> 	+               taskqueue_enqueue(epair_tasks.tq[0],
>> &sc->queues[0].tx_task);
>> 	+
>> 	                error = ifmedia_ioctl(ifp, ifr, &sc->media,
>> cmd); break;
>>
>> That kicks the receive code whenever I `ifconfig epair0a`, and I see
>> a little more traffic every time I do so.
>> That suggests pretty strongly that there’s an issue with how we
>> dispatch work to the handler thread. So presumably there’s a race
>> between epair_menq() and epair_tx_start_deferred().
>>
>> epair_menq() tries to only enqueue the receive work if there’s
>> nothing in the buf_ring, on the grounds that if there is the previous
>> packet scheduled the work. Clearly there’s an issue there.
>>
>> I’ll try to dig into that in the next few days.
>>
>
> Hi Kristof,
>
> This sounds plausible. I spent a few hours getting familiar with the
> epair code and came up with a patch that seems to fix the issue at hand
> (both with and without RSS). I'm not certain that it is a good
> solution, especially in terms of performance, but I wanted to share it
> with you anyway, maybe it helps:
> https://people.freebsd.org/~grembo/epair.patch
>
That seems to be working, and at first glance doesn’t look like it’d hurt performance too badly.

Can you write up a commit message and post it on phabricator?

Kristof