Issue with epoch_drain_callbacks and unloading iavf(4) [using iflib]
Eric Joyner
erj at freebsd.org
Mon Feb 10 21:31:27 UTC 2020
On Fri, Jan 31, 2020 at 11:44 AM Eric Joyner <erj at freebsd.org> wrote:
> On Wed, Jan 29, 2020 at 7:09 PM Mark Johnston <markj at freebsd.org> wrote:
>
>> On Thu, Jan 30, 2020 at 02:12:05AM +0100, Hans Petter Selasky wrote:
>> > On 2020-01-29 22:44, Eric Joyner wrote:
>> > > On Wed, Jan 29, 2020 at 1:41 PM Hans Petter Selasky <hps at selasky.org>
>> > > wrote:
>> > >
>> > > > On 2020-01-29 22:30, Eric Joyner wrote:
>> > > > > Hi freebsd-net,
>> > > > >
>> > > > > We've encountered an issue with unloading the iavf(4) driver on
>> > > > > FreeBSD 12.1 (and stable). On a VM with two iavf(4) interfaces, if
>> > > > > we send heavy traffic to iavf1 and try to kldunload the driver, the
>> > > > > kldunload process hangs on iavf0 until iavf1 stops receiving
>> > > > > traffic.
>> > > > >
>> > > > > After some debugging, it looks like epoch_drain_callbacks() [via
>> > > > > if_detach_internal()] tries to switch CPUs to run on one that iavf1
>> > > > > is using for RX processing, but since iavf1 is busy, it can't make
>> > > > > the switch, so cpu_switch() just hangs and nothing happens until
>> > > > > iavf1's RX thread stops being busy.
>> > > > >
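>> > > > > (For reference, this is the loop it appears to get stuck in -- a
>> > > > > trimmed sketch of epoch_drain_callbacks() from sys/kern/subr_epoch.c
>> > > > > on stable/12, so don't take it as line-for-line exact:)
>> > > > >
>> > > > > /*
>> > > > >  * The calling thread binds itself to each CPU in turn to queue a
>> > > > >  * per-CPU drain callback, then sleeps until all of them have run.
>> > > > >  * The sched_bind() is where we sit while the target CPU is busy
>> > > > >  * with RX processing.
>> > > > >  */
>> > > > > CPU_FOREACH(cpu)
>> > > > >         epoch->e_drain_count++;
>> > > > > CPU_FOREACH(cpu) {
>> > > > >         er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
>> > > > >         sched_bind(td, cpu);    /* migrate to 'cpu'; waits if it's busy */
>> > > > >         epoch_call(epoch, &epoch_drain_cb, &er->er_drain_ctx);
>> > > > > }
>> > > > > ...
>> > > > > while (epoch->e_drain_count != 0)
>> > > > >         msleep(epoch, &epoch->e_drain_mtx, PZERO, "EDRAIN", 0);
>> > > > >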
>> > > > > I can work around this by inserting a kern_yield(PRI_USER) somewhere
>> > > > > in one of the iavf txrx functions that iflib calls into (e.g.
>> > > > > iavf_isc_rxd_available), but that's not a proper fix. Does anyone
>> > > > > know what to do to prevent this from happening?
>> > > > >
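>> > > > > (Roughly what the workaround looks like -- the exact placement and
>> > > > > the iavf_isc_rxd_available() signature are from memory, so treat
>> > > > > this as illustrative only:)
>> > > > >
>> > > > > static int
>> > > > > iavf_isc_rxd_available(void *arg, uint16_t rxqid, qidx_t idx, qidx_t budget)
>> > > > > {
>> > > > >         /*
>> > > > >          * XXX workaround, not a fix: give lower-priority threads
>> > > > >          * (like the one stuck in epoch_drain_callbacks()) a chance
>> > > > >          * to run while this queue is busy.
>> > > > >          */
>> > > > >         kern_yield(PRI_USER);
>> > > > >
>> > > > >         /* ... unchanged: count and return available RX descriptors ... */
>> > > > > }
>> > > > >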
>> > > > > Wildly guessing, does epoch_drain_callbacks() maybe need a higher
>> > > > > priority than the PI_SOFT used by the group taskqueues that do
>> > > > > iflib's RX processing?
>> > > > >
>> > > >
>> > > > Hi,
>> > > >
>> > > > Which scheduler is this? ULE or BSD?
>> > > >
>> > > > EPOCH(9) expects some level of round-robin scheduling on the same
>> > > > priority level. Setting a higher priority on EPOCH(9) might cause epoch
>> > > > to start spinning without letting the lower-priority thread which holds
>> > > > the EPOCH() section finish.
>> > > >
>> > > > --HPS
>> > > >
>> > > >
>> > > Hi Hans,
>> > >
>> > > kern.sched.name gives me "ULE"
>> > >
>> >
>> > Hi Eric,
>> >
>> > epoch_drain_callbacks() depends on epoch_call_task() getting execution
>> > time; that task is run from a GTASKQUEUE at PI_SOFT. Also,
>> > epoch_drain_callbacks() itself runs at the priority of the calling thread,
>> > and if that is lower than PI_SOFT and a gtaskqueue is spinning heavily,
>> > then it won't work.
>>
>> I think we can fix this so that the epoch_drain_callbacks() caller's
>> priority does not matter. Eric, can you try the patch below? It is a
>> bit hacky, but the idea is to ensure that all pending callbacks can be
>> invoked, and then wait for each cb task to run through at least once. In
>> your case I think the callback task threads should still be able to run.
>>
>> > For a single-CPU system you will be toast in this situation regardless,
>> > if there is no free time on a CPU for EPOCH().
>> >
>> > In general, if epoch_call_task() doesn't get execution time, you will
>> > have a problem.
>>
>> That's not the issue here, though there is at least one related problem:
>> if a thread in an epoch read section is preempted and the CPU is
>> consumed by a high priority receive processing thread, for example due
>> to a DoS, we have no mechanism to boost the priority of the preempted
>> thread except by calling epoch_synchronize_wait(). In other words,
>> nothing guarantees that deferred callbacks will eventually get executed
>> in such a scenario.
>>
>> diff --git a/sys/kern/subr_epoch.c b/sys/kern/subr_epoch.c
>> index 0a477e1d6f7b..4ed7e7e11a3e 100644
>> --- a/sys/kern/subr_epoch.c
>> +++ b/sys/kern/subr_epoch.c
>> @@ -74,15 +74,19 @@ typedef struct epoch_record {
>> volatile struct epoch_tdlist er_tdlist;
>> volatile uint32_t er_gen;
>> uint32_t er_cpuid;
>> + int er_drain_state;
>> } __aligned(EPOCH_ALIGN) *epoch_record_t;
>>
>> +#define EPOCH_DRAIN_START 2
>> +#define EPOCH_DRAIN_RUNNING 1
>> +#define EPOCH_DRAIN_DONE 0
>> +
>> struct epoch {
>> struct ck_epoch e_epoch __aligned(EPOCH_ALIGN);
>> epoch_record_t e_pcpu_record;
>> int e_idx;
>> int e_flags;
>> struct sx e_drain_sx;
>> - struct mtx e_drain_mtx;
>> volatile int e_drain_count;
>> const char *e_name;
>> };
>> @@ -335,7 +339,6 @@ epoch_alloc(const char *name, int flags)
>> epoch->e_idx = epoch_count;
>> epoch->e_name = name;
>> sx_init(&epoch->e_drain_sx, "epoch-drain-sx");
>> - mtx_init(&epoch->e_drain_mtx, "epoch-drain-mtx", NULL, MTX_DEF);
>> allepochs[epoch_count++] = epoch;
>> return (epoch);
>> }
>> @@ -348,7 +351,6 @@ epoch_free(epoch_t epoch)
>> allepochs[epoch->e_idx] = NULL;
>> epoch_wait(global_epoch);
>> uma_zfree_pcpu(pcpu_zone_record, epoch->e_pcpu_record);
>> - mtx_destroy(&epoch->e_drain_mtx);
>> sx_destroy(&epoch->e_drain_sx);
>> free(epoch, M_EPOCH);
>> }
>> @@ -699,14 +701,24 @@ epoch_call_task(void *arg __unused)
>> epoch_t epoch;
>> ck_stack_t cb_stack;
>> int i, npending, total;
>> + bool draining;
>> +
>> + KASSERT(curthread->td_pinned > 0,
>> + ("%s: callback task thread is not pinned", __func__));
>>
>> ck_stack_init(&cb_stack);
>> critical_enter();
>> epoch_enter(global_epoch);
>> - for (total = i = 0; i < epoch_count; i++) {
>> + for (total = i = 0, draining = false; i < epoch_count; i++) {
>> if (__predict_false((epoch = allepochs[i]) == NULL))
>> continue;
>> er = epoch_currecord(epoch);
>> + if (atomic_load_int(&er->er_drain_state) == EPOCH_DRAIN_START) {
>> + atomic_store_int(&er->er_drain_state,
>> + EPOCH_DRAIN_RUNNING);
>> + draining = true;
>> + }
>> +
>> record = &er->er_record;
>> if ((npending = record->n_pending) == 0)
>> continue;
>> @@ -728,6 +740,20 @@ epoch_call_task(void *arg __unused)
>> next = CK_STACK_NEXT(cursor);
>> entry->function(entry);
>> }
>> +
>> + if (__predict_false(draining)) {
>> + epoch_enter(global_epoch);
>> + for (i = 0; i < epoch_count; i++) {
>> + if (__predict_false((epoch = allepochs[i]) == NULL))
>> + continue;
>> + er = epoch_currecord(epoch);
>> + if (atomic_load_int(&er->er_drain_state) ==
>> + EPOCH_DRAIN_RUNNING)
>> + atomic_store_int(&er->er_drain_state,
>> + EPOCH_DRAIN_DONE);
>> + }
>> + epoch_exit(global_epoch);
>> + }
>> }
>>
>> int
>> @@ -769,27 +795,18 @@ in_epoch(epoch_t epoch)
>> }
>>
>> static void
>> -epoch_drain_cb(struct epoch_context *ctx)
>> +epoch_drain_handler(struct ck_epoch *global __unused,
>> + ck_epoch_record_t *cr __unused, void *arg __unused)
>> {
>> - struct epoch *epoch =
>> - __containerof(ctx, struct epoch_record, er_drain_ctx)->er_parent;
>> -
>> - if (atomic_fetchadd_int(&epoch->e_drain_count, -1) == 1) {
>> - mtx_lock(&epoch->e_drain_mtx);
>> - wakeup(epoch);
>> - mtx_unlock(&epoch->e_drain_mtx);
>> - }
>> + maybe_yield();
>> }
>>
>> void
>> epoch_drain_callbacks(epoch_t epoch)
>> {
>> epoch_record_t er;
>> - struct thread *td;
>> - int was_bound;
>> - int old_pinned;
>> - int old_cpu;
>> - int cpu;
>> + int cpu, state;
>> + bool pending;
>>
>> WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
>> "epoch_drain_callbacks() may sleep!");
>> @@ -802,45 +819,28 @@ epoch_drain_callbacks(epoch_t epoch)
>> return;
>> #endif
>> DROP_GIANT();
>> -
>> sx_xlock(&epoch->e_drain_sx);
>> - mtx_lock(&epoch->e_drain_mtx);
>>
>> - td = curthread;
>> - thread_lock(td);
>> - old_cpu = PCPU_GET(cpuid);
>> - old_pinned = td->td_pinned;
>> - was_bound = sched_is_bound(td);
>> - sched_unbind(td);
>> - td->td_pinned = 0;
>> + /* Make sure that all pending callbacks are available. */
>> + ck_epoch_synchronize_wait(&epoch->e_epoch, epoch_drain_handler, NULL);
>>
>> - CPU_FOREACH(cpu)
>> - epoch->e_drain_count++;
>> CPU_FOREACH(cpu) {
>> er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
>> - sched_bind(td, cpu);
>> - epoch_call(epoch, &epoch_drain_cb, &er->er_drain_ctx);
>> + atomic_store_int(&er->er_drain_state, EPOCH_DRAIN_START);
>> + GROUPTASK_ENQUEUE(DPCPU_ID_PTR(cpu, epoch_cb_task));
>> }
>>
>> - /* restore CPU binding, if any */
>> - if (was_bound != 0) {
>> - sched_bind(td, old_cpu);
>> - } else {
>> - /* get thread back to initial CPU, if any */
>> - if (old_pinned != 0)
>> - sched_bind(td, old_cpu);
>> - sched_unbind(td);
>> - }
>> - /* restore pinned after bind */
>> - td->td_pinned = old_pinned;
>> -
>> - thread_unlock(td);
>> -
>> - while (epoch->e_drain_count != 0)
>> - msleep(epoch, &epoch->e_drain_mtx, PZERO, "EDRAIN", 0);
>> + do {
>> + pending = false;
>> + CPU_FOREACH(cpu) {
>> + er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
>> + state = atomic_load_int(&er->er_drain_state);
>> + if (state != EPOCH_DRAIN_DONE)
>> + pending = true;
>> + }
>> + pause("edrain", 1);
>> + } while (pending);
>>
>> - mtx_unlock(&epoch->e_drain_mtx);
>> sx_xunlock(&epoch->e_drain_sx);
>> -
>> PICKUP_GIANT();
>> }
>
> Hi Mark,
>
> I applied your patch to a FreeBSD-current VM and reran the test I was
> running originally, and I still see the same behavior.
>
> Your patch doesn't appear to fix the problem.
>
> - Eric
>
*bump*
Also adding more iflib WG members and others who might know more.
- Eric