Re: epair and vnet jail loose connection.
- Reply: Michael Gmelin : "Re: epair and vnet jail loose connection."
- In reply to: Michael Gmelin : "Re: epair and vnet jail loose connection."
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Tue, 15 Mar 2022 16:30:41 UTC
On 14 Mar 2022, at 18:02, Michael Gmelin wrote: > On Mon, 14 Mar 2022 09:09:49 -0600 > Kristof Provost <kp@FreeBSD.org> wrote: > >> On 14 Mar 2022, at 7:44, Michael Gmelin wrote: >>> On Sun, 13 Mar 2022 17:53:44 +0000 >>> "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net> wrote: >>> >>>> On 13 Mar 2022, at 17:45, Michael Gmelin wrote: >>>> >>>>>> On 13. Mar 2022, at 18:16, Bjoern A. Zeeb >>>>>> <bzeeb-lists@lists.zabbadoz.net> wrote: >>>>>> >>>>>> On 13 Mar 2022, at 16:33, Michael Gmelin wrote: >>>>>>> It's important to point out that this only happens with >>>>>>> kern.ncpu>1. With kern.ncpu==1 nothing gets stuck. >>>>>>> >>>>>>> This perfectly fits into the picture, since, as pointed out by >>>>>>> Johan, >>>>>>> the first commit that is affected[0] is about multicore >>>>>>> support. >>>>>> >>>>>> Ignore my ignorance, what is the default of net.isr.maxthreads >>>>>> and net.isr.bindthreads (in stable/13) these days? >>>>>> >>>>> >>>>> My tests were on CURRENT and I’m afk, but according to cgit[0][1], >>>>> max is 1 and bind is 0. >>>>> >>>>> Would it make sense to repeat the test with max=-1? >>>> >>>> I’d say yes, I’d also bind, but that’s just me. >>>> >>>> I would almost assume Kristof running with -1 by default (but he >>>> can chime in on that). >>> >>> I tried various configuration permutations, all with ncpu=2: >>> >>> - 14.0-CURRENT #0 main-n253697-f1d450ddee6 >>> - 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb >>> - net.isr.maxthreads: -1 (which results in 2 threads), 1, 2 >>> - net.isr.bindthreads: -1, 0, 1, 2 >>> - net.isr.dispatch: direct, deferred >>> >>> All resulting in the same behavior (hang after a few seconds). They >>> all >>> work ok when running on a single core instance (threads=1 in this >>> case). >>> >>> I also ran the same test on 13.0-RELEASE-p7 for >>> comparison (unsurprisingly, it's ok). >>> >>> I placed the script to reproduce the issue on freefall for your >>> convenience, so running it is as simple as: >>> >>> fetch https://people.freebsd.org/~grembo/hang_epair.sh >>> # inspect content >>> sh hang_epair.sh >>> >>> or, if you feel lucky >>> >>> fetch -o - https://people.freebsd.org/~grembo/hang_epair.sh | sh >>> >> With that script I can also reproduce the problem. >> >> I’ve experimented with this hack: >> >> diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c >> index c39434b31b9f..1e6bb07ccc4e 100644 >> --- a/sys/net/if_epair.c >> +++ b/sys/net/if_epair.c >> @@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long >> cmd, caddr_t data) >> >> case SIOCSIFMEDIA: >> case SIOCGIFMEDIA: >> + printf("KP: %s() SIOCGIFMEDIA\n", __func__); >> sc = ifp->if_softc; >> + taskqueue_enqueue(epair_tasks.tq[0], >> &sc->queues[0].tx_task); >> + >> error = ifmedia_ioctl(ifp, ifr, &sc->media, >> cmd); break; >> >> That kicks the receive code whenever I `ifconfig epair0a`, and I see >> a little more traffic every time I do so. >> That suggests pretty strongly that there’s an issue with how we >> dispatch work to the handler thread. So presumably there’s a race >> between epair_menq() and epair_tx_start_deferred(). >> >> epair_menq() tries to only enqueue the receive work if there’s >> nothing in the buf_ring, on the grounds that if there is the previous >> packet scheduled the work. Clearly there’s an issue there. >> >> I’ll try to dig into that in the next few days. >> > > Hi Kristof, > > This sounds plausible. I spent a few hours getting familiar with the > epair code and came up with a patch that seems to fix the issue at hand > (both with and without RSS). I'm not certain that it is a good > solution, especially in terms of performance, but I wanted to share it > with you anyway, maybe it helps: > https://people.freebsd.org/~grembo/epair.patch > That seems to be working, and at first glance doesn’t look like it’d hurt performance too badly. Can you write up a commit message and post it on phabricator? Kristof