Issue with epoch_drain_callbacks and unloading iavf(4) [using iflib]
Eric Joyner
erj at freebsd.org
Fri Jan 31 19:45:10 UTC 2020
On Wed, Jan 29, 2020 at 7:09 PM Mark Johnston <markj at freebsd.org> wrote:
> On Thu, Jan 30, 2020 at 02:12:05AM +0100, Hans Petter Selasky wrote:
> > On 2020-01-29 22:44, Eric Joyner wrote:
> > > On Wed, Jan 29, 2020 at 1:41 PM Hans Petter Selasky <hps at selasky.org>
> > > wrote:
> > >
> > > > On 2020-01-29 22:30, Eric Joyner wrote:
> > > > > Hi freebsd-net,
> > > > >
> > > > > We've encountered an issue with unloading the iavf(4) driver on
> > > > > FreeBSD 12.1 (and stable). On a VM with two iavf(4) interfaces, if
> > > > > we send heavy traffic to iavf1 and try to kldunload the driver, the
> > > > > kldunload process hangs on iavf0 until iavf1 stops receiving
> > > > > traffic.
> > > > >
> > > > > After some debugging, it looks like epoch_drain_callbacks() [via
> > > > > if_detach_internal()] tries to switch CPUs to run on one that iavf1
> > > > > is using for RX processing, but since iavf1 is busy, it can't make
> > > > > the switch, so cpu_switch() just hangs and nothing happens until
> > > > > iavf1's RX thread stops being busy.
> > > > >
> > > > > I can work around this by inserting a kern_yield(PRI_USER) somewhere
> > > > > in one of the iavf txrx functions that iflib calls into (e.g.
> > > > > iavf_isc_rxd_available), but that's not a proper fix. Does anyone
> > > > > know what to do to prevent this from happening?
> > > > >
> > > > > Wildly guessing: does epoch_drain_callbacks() maybe need a higher
> > > > > priority than the PI_SOFT used by the group taskqueues in iflib's
> > > > > RX processing?
> > > > >
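
(For reference, the workaround mentioned above looks roughly like the sketch
below. The signature follows iflib's ift_rxd_available callback as I recall,
and the body is just a placeholder; only the kern_yield() call is the actual
change. This is an illustration of the workaround, not a proposed fix.)

/*
 * Sketch of the workaround: yield from one of the iflib txrx callbacks so
 * that lower-priority threads are not starved by continuous RX processing.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>

#include <net/iflib.h>

static int
iavf_isc_rxd_available(void *arg, uint16_t rxqid, qidx_t idx, qidx_t budget)
{
	int avail;

	/* ... existing code that counts available RX descriptors ... */
	avail = 0;

	/*
	 * Workaround: briefly drop to user priority so that lower-priority
	 * threads (e.g. the kldunload thread waiting in
	 * epoch_drain_callbacks()) get a chance to run while this queue is
	 * under heavy load.
	 */
	kern_yield(PRI_USER);

	return (avail);
}
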
> > > >
> > > > Hi,
> > > >
> > > > Which scheduler is this? ULE or 4BSD?
> > > >
> > > > EPOCH(9) expects some level of round-robin scheduling on the same
> > > > priority level. Setting a higher priority on EPOCH(9) might cause
> > > > epoch to start spinning without letting the lower-priority thread
> > > > that holds the EPOCH() section finish.
> > > >
> > > > --HPS
> > > >
> > > >
> > > Hi Hans,
> > >
> > > kern.sched.name gives me "ULE"
> > >
> >
> > Hi Eric,
> >
> > epoch_drain_callbacks() depends on epoch_call_task() getting execution
> > time; that task is executed from a GTASKQUEUE at PI_SOFT. Also,
> > epoch_drain_callbacks() runs at the priority of the calling thread, and
> > if that priority is lower than PI_SOFT while a gtaskqueue is spinning
> > heavily, then that won't work.
>
> I think we can fix this so that the epoch_drain_callbacks() caller's
> priority does not matter. Eric, can you try the patch below? It is a
> bit hacky, but the idea is to ensure that all pending callbacks can be
> invoked, and then wait for each callback task to run through at least
> once. In your case I think the callback task threads should still be
> able to run.
>
> > For a single-CPU system you will be toast in this situation regardless,
> > if there is no free time on a CPU for EPOCH().
> >
> > In general, if epoch_call_task() doesn't get execution time, you will
> > have a problem.
>
> That's not the issue here, though there is at least one related problem:
> if a thread in an epoch read section is preempted and the CPU is
> consumed by a high-priority receive processing thread, for example due
> to a DoS, we have no mechanism to boost the priority of the preempted
> thread except by calling epoch_synchronize_wait(). In other words,
> nothing guarantees that deferred callbacks will eventually get executed
> in such a scenario.
>
> diff --git a/sys/kern/subr_epoch.c b/sys/kern/subr_epoch.c
> index 0a477e1d6f7b..4ed7e7e11a3e 100644
> --- a/sys/kern/subr_epoch.c
> +++ b/sys/kern/subr_epoch.c
> @@ -74,15 +74,19 @@ typedef struct epoch_record {
> volatile struct epoch_tdlist er_tdlist;
> volatile uint32_t er_gen;
> uint32_t er_cpuid;
> + int er_drain_state;
> } __aligned(EPOCH_ALIGN) *epoch_record_t;
>
> +#define EPOCH_DRAIN_START 2
> +#define EPOCH_DRAIN_RUNNING 1
> +#define EPOCH_DRAIN_DONE 0
> +
> struct epoch {
> struct ck_epoch e_epoch __aligned(EPOCH_ALIGN);
> epoch_record_t e_pcpu_record;
> int e_idx;
> int e_flags;
> struct sx e_drain_sx;
> - struct mtx e_drain_mtx;
> volatile int e_drain_count;
> const char *e_name;
> };
> @@ -335,7 +339,6 @@ epoch_alloc(const char *name, int flags)
> epoch->e_idx = epoch_count;
> epoch->e_name = name;
> sx_init(&epoch->e_drain_sx, "epoch-drain-sx");
> - mtx_init(&epoch->e_drain_mtx, "epoch-drain-mtx", NULL, MTX_DEF);
> allepochs[epoch_count++] = epoch;
> return (epoch);
> }
> @@ -348,7 +351,6 @@ epoch_free(epoch_t epoch)
> allepochs[epoch->e_idx] = NULL;
> epoch_wait(global_epoch);
> uma_zfree_pcpu(pcpu_zone_record, epoch->e_pcpu_record);
> - mtx_destroy(&epoch->e_drain_mtx);
> sx_destroy(&epoch->e_drain_sx);
> free(epoch, M_EPOCH);
> }
> @@ -699,14 +701,24 @@ epoch_call_task(void *arg __unused)
> epoch_t epoch;
> ck_stack_t cb_stack;
> int i, npending, total;
> + bool draining;
> +
> + KASSERT(curthread->td_pinned > 0,
> + ("%s: callback task thread is not pinned", __func__));
>
> ck_stack_init(&cb_stack);
> critical_enter();
> epoch_enter(global_epoch);
> - for (total = i = 0; i < epoch_count; i++) {
> + for (total = i = 0, draining = false; i < epoch_count; i++) {
> if (__predict_false((epoch = allepochs[i]) == NULL))
> continue;
> er = epoch_currecord(epoch);
> + if (atomic_load_int(&er->er_drain_state) == EPOCH_DRAIN_START) {
> + atomic_store_int(&er->er_drain_state,
> + EPOCH_DRAIN_RUNNING);
> + draining = true;
> + }
> +
> record = &er->er_record;
> if ((npending = record->n_pending) == 0)
> continue;
> @@ -728,6 +740,20 @@ epoch_call_task(void *arg __unused)
> next = CK_STACK_NEXT(cursor);
> entry->function(entry);
> }
> +
> + if (__predict_false(draining)) {
> + epoch_enter(global_epoch);
> + for (i = 0; i < epoch_count; i++) {
> + if (__predict_false((epoch = allepochs[i]) == NULL))
> + continue;
> + er = epoch_currecord(epoch);
> + if (atomic_load_int(&er->er_drain_state) ==
> + EPOCH_DRAIN_RUNNING)
> + atomic_store_int(&er->er_drain_state,
> + EPOCH_DRAIN_DONE);
> + }
> + epoch_exit(global_epoch);
> + }
> }
>
> int
> @@ -769,27 +795,18 @@ in_epoch(epoch_t epoch)
> }
>
> static void
> -epoch_drain_cb(struct epoch_context *ctx)
> +epoch_drain_handler(struct ck_epoch *global __unused,
> + ck_epoch_record_t *cr __unused, void *arg __unused)
> {
> - struct epoch *epoch =
> - __containerof(ctx, struct epoch_record, er_drain_ctx)->er_parent;
> -
> - if (atomic_fetchadd_int(&epoch->e_drain_count, -1) == 1) {
> - mtx_lock(&epoch->e_drain_mtx);
> - wakeup(epoch);
> - mtx_unlock(&epoch->e_drain_mtx);
> - }
> + maybe_yield();
> }
>
> void
> epoch_drain_callbacks(epoch_t epoch)
> {
> epoch_record_t er;
> - struct thread *td;
> - int was_bound;
> - int old_pinned;
> - int old_cpu;
> - int cpu;
> + int cpu, state;
> + bool pending;
>
> WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
> "epoch_drain_callbacks() may sleep!");
> @@ -802,45 +819,28 @@ epoch_drain_callbacks(epoch_t epoch)
> return;
> #endif
> DROP_GIANT();
> -
> sx_xlock(&epoch->e_drain_sx);
> - mtx_lock(&epoch->e_drain_mtx);
>
> - td = curthread;
> - thread_lock(td);
> - old_cpu = PCPU_GET(cpuid);
> - old_pinned = td->td_pinned;
> - was_bound = sched_is_bound(td);
> - sched_unbind(td);
> - td->td_pinned = 0;
> + /* Make sure that all pending callbacks are available. */
> + ck_epoch_synchronize_wait(&epoch->e_epoch, epoch_drain_handler, NULL);
>
> - CPU_FOREACH(cpu)
> - epoch->e_drain_count++;
> CPU_FOREACH(cpu) {
> er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
> - sched_bind(td, cpu);
> - epoch_call(epoch, &epoch_drain_cb, &er->er_drain_ctx);
> + atomic_store_int(&er->er_drain_state, EPOCH_DRAIN_START);
> + GROUPTASK_ENQUEUE(DPCPU_ID_PTR(cpu, epoch_cb_task));
> }
>
> - /* restore CPU binding, if any */
> - if (was_bound != 0) {
> - sched_bind(td, old_cpu);
> - } else {
> - /* get thread back to initial CPU, if any */
> - if (old_pinned != 0)
> - sched_bind(td, old_cpu);
> - sched_unbind(td);
> - }
> - /* restore pinned after bind */
> - td->td_pinned = old_pinned;
> -
> - thread_unlock(td);
> -
> - while (epoch->e_drain_count != 0)
> - msleep(epoch, &epoch->e_drain_mtx, PZERO, "EDRAIN", 0);
> + do {
> + pending = false;
> + CPU_FOREACH(cpu) {
> + er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
> + state = atomic_load_int(&er->er_drain_state);
> + if (state != EPOCH_DRAIN_DONE)
> + pending = true;
> + }
> + pause("edrain", 1);
> + } while (pending);
>
> - mtx_unlock(&epoch->e_drain_mtx);
> sx_xunlock(&epoch->e_drain_sx);
> -
> PICKUP_GIANT();
> }
>
Hi Mark,

I applied your patch to a FreeBSD-current VM and reran the test I was
running originally, and I still see the same behavior. Your patch doesn't
appear to fix the problem.

- Eric