Issue with epoch_drain_callbacks and unloading iavf(4) [using iflib]
Eric Joyner
erj at freebsd.org
Mon Feb 10 21:31:27 UTC 2020
On Fri, Jan 31, 2020 at 11:44 AM Eric Joyner <erj at freebsd.org> wrote:
> On Wed, Jan 29, 2020 at 7:09 PM Mark Johnston <markj at freebsd.org> wrote:
>
>> On Thu, Jan 30, 2020 at 02:12:05AM +0100, Hans Petter Selasky wrote:
>> > On 2020-01-29 22:44, Eric Joyner wrote:
>> > > On Wed, Jan 29, 2020 at 1:41 PM Hans Petter Selasky <hps at selasky.org>
>> > > wrote:
>> > >
>> > > > On 2020-01-29 22:30, Eric Joyner wrote:
>> > > > > Hi freebsd-net,
>> > > > >
>> > > > > We've encountered an issue with unloading the iavf(4) driver on
>> > > > > FreeBSD 12.1 (and stable). On a VM with two iavf(4) interfaces, if
>> > > > > we send heavy traffic to iavf1 and try to kldunload the driver, the
>> > > > > kldunload process hangs on iavf0 until iavf1 stops receiving
>> > > > > traffic.
>> > > > >
>> > > > > After some debugging, it looks like epoch_drain_callbacks() [via
>> > > > > if_detach_internal()] tries to switch CPUs to run on one that iavf1
>> > > > > is using for RX processing, but since iavf1 is busy, it can't make
>> > > > > the switch, so cpu_switch() just hangs and nothing happens until
>> > > > > iavf1's RX thread stops being busy.
>> > > > >
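>> > > > > (For reference, this is the loop it appears to get stuck in -- a
>> > > > > trimmed sketch of epoch_drain_callbacks() from sys/kern/subr_epoch.c
>> > > > > on stable/12, so don't take it as line-for-line exact:)
>> > > > >
>> > > > > /*
>> > > > >  * The calling thread binds itself to each CPU in turn to queue a
>> > > > >  * per-CPU drain callback, then sleeps until all of them have run.
>> > > > >  * The sched_bind() is where we sit while the target CPU is busy
>> > > > >  * with RX processing.
>> > > > >  */
>> > > > > CPU_FOREACH(cpu)
>> > > > >         epoch->e_drain_count++;
>> > > > > CPU_FOREACH(cpu) {
>> > > > >         er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
>> > > > >         sched_bind(td, cpu);    /* migrate to 'cpu'; waits if it's busy */
>> > > > >         epoch_call(epoch, &epoch_drain_cb, &er->er_drain_ctx);
>> > > > > }
>> > > > > ...
>> > > > > while (epoch->e_drain_count != 0)
>> > > > >         msleep(epoch, &epoch->e_drain_mtx, PZERO, "EDRAIN", 0);
>> > > > >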
>> > > > > I can work around this by inserting a kern_yield(PRI_USER) somewhere
>> > > > > in one of the iavf txrx functions that iflib calls into (e.g.
>> > > > > iavf_isc_rxd_available), but that's not a proper fix. Does anyone
>> > > > > know what to do to prevent this from happening?
>> > > > >
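>> > > > > (Roughly what the workaround looks like -- the exact placement and
>> > > > > the iavf_isc_rxd_available() signature are from memory, so treat
>> > > > > this as illustrative only:)
>> > > > >
>> > > > > static int
>> > > > > iavf_isc_rxd_available(void *arg, uint16_t rxqid, qidx_t idx, qidx_t budget)
>> > > > > {
>> > > > >         /*
>> > > > >          * XXX workaround, not a fix: give lower-priority threads
>> > > > >          * (like the one stuck in epoch_drain_callbacks()) a chance
>> > > > >          * to run while this queue is busy.
>> > > > >          */
>> > > > >         kern_yield(PRI_USER);
>> > > > >
>> > > > >         /* ... unchanged: count and return available RX descriptors ... */
>> > > > > }
>> > > > >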
>> > > > > Wildly guessing, does epoch_drain_callbacks() maybe need a higher
>> > > > > priority than the PI_SOFT used by the group taskqueues that do
>> > > > > iflib's RX processing?
>> > > > >
>> > > >
>> > > > Hi,
>> > > >
>> > > > Which scheduler is this? ULE or BSD?
>> > > >
>> > > > EPOCH(9) expects some level of round-robin scheduling on the same
>> > > > priority level. Setting a higher priority on EPOCH(9) might cause epoch
>> > > > to start spinning without letting the lower-priority thread which holds
>> > > > the EPOCH() section finish.
>> > > >
>> > > > --HPS
>> > > >
>> > > >
>> > > Hi Hans,
>> > >
>> > > kern.sched.name gives me "ULE"
>> > >
>> >
>> > Hi Eric,
>> >
>> > epoch_drain_callbacks() depends on epoch_call_task() getting execution
>> > time; that task is run from a GTASKQUEUE at PI_SOFT. Also,
>> > epoch_drain_callbacks() itself runs at the priority of the calling thread,
>> > and if that is lower than PI_SOFT and a gtaskqueue is spinning heavily,
>> > then it won't work.
>>
>> I think we can fix this so that the epoch_drain_callbacks() caller's
>> priority does not matter. Eric, can you try the patch below? It is a
>> bit hacky, but the idea is to ensure that all pending callbacks can be
>> invoked, and then wait for each cb task to run through at least once. In
>> your case I think the callback task threads should still be able to run.
>>
>> > For a single-CPU system you will be toast in this situation regardless,
>> > if there is no free time on a CPU for EPOCH().
>> >
>> > In general, if epoch_call_task() doesn't get execution time, you will
>> > have a problem.
>>
>> That's not the issue here, though there is at least one related problem:
>> if a thread in an epoch read section is preempted and the CPU is
>> consumed by a high priority receive processing thread, for example due
>> to a DoS, we have no mechanism to boost the priority of the preempted
>> thread except by calling epoch_synchronize_wait(). In other words,
>> nothing guarantees that deferred callbacks will eventually get executed
>> in such a scenario.
>>
>> diff --git a/sys/kern/subr_epoch.c b/sys/kern/subr_epoch.c
>> index 0a477e1d6f7b..4ed7e7e11a3e 100644
>> --- a/sys/kern/subr_epoch.c
>> +++ b/sys/kern/subr_epoch.c
>> @@ -74,15 +74,19 @@ typedef struct epoch_record {
>> volatile struct epoch_tdlist er_tdlist;
>> volatile uint32_t er_gen;
>> uint32_t er_cpuid;
>> + int er_drain_state;
>> } __aligned(EPOCH_ALIGN) *epoch_record_t;
>>
>> +#define EPOCH_DRAIN_START 2
>> +#define EPOCH_DRAIN_RUNNING 1
>> +#define EPOCH_DRAIN_DONE 0
>> +
>> struct epoch {
>> struct ck_epoch e_epoch __aligned(EPOCH_ALIGN);
>> epoch_record_t e_pcpu_record;
>> int e_idx;
>> int e_flags;
>> struct sx e_drain_sx;
>> - struct mtx e_drain_mtx;
>> volatile int e_drain_count;
>> const char *e_name;
>> };
>> @@ -335,7 +339,6 @@ epoch_alloc(const char *name, int flags)
>> epoch->e_idx = epoch_count;
>> epoch->e_name = name;
>> sx_init(&epoch->e_drain_sx, "epoch-drain-sx");
>> - mtx_init(&epoch->e_drain_mtx, "epoch-drain-mtx", NULL, MTX_DEF);
>> allepochs[epoch_count++] = epoch;
>> return (epoch);
>> }
>> @@ -348,7 +351,6 @@ epoch_free(epoch_t epoch)
>> allepochs[epoch->e_idx] = NULL;
>> epoch_wait(global_epoch);
>> uma_zfree_pcpu(pcpu_zone_record, epoch->e_pcpu_record);
>> - mtx_destroy(&epoch->e_drain_mtx);
>> sx_destroy(&epoch->e_drain_sx);
>> free(epoch, M_EPOCH);
>> }
>> @@ -699,14 +701,24 @@ epoch_call_task(void *arg __unused)
>> epoch_t epoch;
>> ck_stack_t cb_stack;
>> int i, npending, total;
>> + bool draining;
>> +
>> + KASSERT(curthread->td_pinned > 0,
>> + ("%s: callback task thread is not pinned", __func__));
>>
>> ck_stack_init(&cb_stack);
>> critical_enter();
>> epoch_enter(global_epoch);
>> - for (total = i = 0; i < epoch_count; i++) {
>> + for (total = i = 0, draining = false; i < epoch_count; i++) {
>> if (__predict_false((epoch = allepochs[i]) == NULL))
>> continue;
>> er = epoch_currecord(epoch);
>> + if (atomic_load_int(&er->er_drain_state) == EPOCH_DRAIN_START) {
>> + atomic_store_int(&er->er_drain_state,
>> + EPOCH_DRAIN_RUNNING);
>> + draining = true;
>> + }
>> +
>> record = &er->er_record;
>> if ((npending = record->n_pending) == 0)
>> continue;
>> @@ -728,6 +740,20 @@ epoch_call_task(void *arg __unused)
>> next = CK_STACK_NEXT(cursor);
>> entry->function(entry);
>> }
>> +
>> + if (__predict_false(draining)) {
>> + epoch_enter(global_epoch);
>> + for (i = 0; i < epoch_count; i++) {
>> + if (__predict_false((epoch = allepochs[i]) == NULL))
>> + continue;
>> + er = epoch_currecord(epoch);
>> + if (atomic_load_int(&er->er_drain_state) ==
>> + EPOCH_DRAIN_RUNNING)
>> + atomic_store_int(&er->er_drain_state,
>> + EPOCH_DRAIN_DONE);
>> + }
>> + epoch_exit(global_epoch);
>> + }
>> }
>>
>> int
>> @@ -769,27 +795,18 @@ in_epoch(epoch_t epoch)
>> }
>>
>> static void
>> -epoch_drain_cb(struct epoch_context *ctx)
>> +epoch_drain_handler(struct ck_epoch *global __unused,
>> + ck_epoch_record_t *cr __unused, void *arg __unused)
>> {
>> - struct epoch *epoch =
>> - __containerof(ctx, struct epoch_record, er_drain_ctx)->er_parent;
>> -
>> - if (atomic_fetchadd_int(&epoch->e_drain_count, -1) == 1) {
>> - mtx_lock(&epoch->e_drain_mtx);
>> - wakeup(epoch);
>> - mtx_unlock(&epoch->e_drain_mtx);
>> - }
>> + maybe_yield();
>> }
>>
>> void
>> epoch_drain_callbacks(epoch_t epoch)
>> {
>> epoch_record_t er;
>> - struct thread *td;
>> - int was_bound;
>> - int old_pinned;
>> - int old_cpu;
>> - int cpu;
>> + int cpu, state;
>> + bool pending;
>>
>> WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
>> "epoch_drain_callbacks() may sleep!");
>> @@ -802,45 +819,28 @@ epoch_drain_callbacks(epoch_t epoch)
>> return;
>> #endif
>> DROP_GIANT();
>> -
>> sx_xlock(&epoch->e_drain_sx);
>> - mtx_lock(&epoch->e_drain_mtx);
>>
>> - td = curthread;
>> - thread_lock(td);
>> - old_cpu = PCPU_GET(cpuid);
>> - old_pinned = td->td_pinned;
>> - was_bound = sched_is_bound(td);
>> - sched_unbind(td);
>> - td->td_pinned = 0;
>> + /* Make sure that all pending callbacks are available. */
>> + ck_epoch_synchronize_wait(&epoch->e_epoch, epoch_drain_handler, NULL);
>>
>> - CPU_FOREACH(cpu)
>> - epoch->e_drain_count++;
>> CPU_FOREACH(cpu) {
>> er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
>> - sched_bind(td, cpu);
>> - epoch_call(epoch, &epoch_drain_cb, &er->er_drain_ctx);
>> + atomic_store_int(&er->er_drain_state, EPOCH_DRAIN_START);
>> + GROUPTASK_ENQUEUE(DPCPU_ID_PTR(cpu, epoch_cb_task));
>> }
>>
>> - /* restore CPU binding, if any */
>> - if (was_bound != 0) {
>> - sched_bind(td, old_cpu);
>> - } else {
>> - /* get thread back to initial CPU, if any */
>> - if (old_pinned != 0)
>> - sched_bind(td, old_cpu);
>> - sched_unbind(td);
>> - }
>> - /* restore pinned after bind */
>> - td->td_pinned = old_pinned;
>> -
>> - thread_unlock(td);
>> -
>> - while (epoch->e_drain_count != 0)
>> - msleep(epoch, &epoch->e_drain_mtx, PZERO, "EDRAIN", 0);
>> + do {
>> + pending = false;
>> + CPU_FOREACH(cpu) {
>> + er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
>> + state = atomic_load_int(&er->er_drain_state);
>> + if (state != EPOCH_DRAIN_DONE)
>> + pending = true;
>> + }
>> + pause("edrain", 1);
>> + } while (pending);
>>
>> - mtx_unlock(&epoch->e_drain_mtx);
>> sx_xunlock(&epoch->e_drain_sx);
>> -
>> PICKUP_GIANT();
>> }
>
> Hi Mark,
>
> I applied your patch to a FreeBSD-current VM and reran the test I was
> running originally, and I still see the same behavior.
>
> Your patch doesn't appear to fix the problem.
>
> - Eric
>
*bump*
Also adding more iflib WG members and others who might know more.
- Eric