[Bug 262571] epair(4) interfaces stop forwarding traffic on moderate load

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 15 Mar 2022 14:08:19 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262571

            Bug ID: 262571
           Summary: epair(4) interfaces stop forwarding traffic on
                    moderate load
           Product: Base System
           Version: 13.1-RELEASE
          Hardware: Any
               URL: https://lists.freebsd.org/archives/freebsd-net/2022-March/001449.html
                OS: Any
            Status: New
          Severity: Affects Many People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: grembo@FreeBSD.org
                CC: bz@FreeBSD.org, kp@freebsd.org
             Flags: maintainer-feedback?(kp@freebsd.org)

Created attachment 232471
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=232471&action=edit
Patch that works around the problem

This was discussed on the freebsd-net mailing list[0]. It also affects CURRENT.

When running on multi-core systems, epair interfaces stop forwarding traffic
even under moderate load and don't recover until they are destroyed and
recreated. This is a critical problem, as it breaks vnet jails running
non-trivial workloads. The problem can be reproduced easily using a shell
script[1].

The regression was introduced by the commit that added multi-core improvements
to epair[2].

It happens because work is scheduled on the taskqueue(s) based on a check of
whether the mbuf ring buffers are empty, and that check is racy on multi-core
systems. The race is between epair_menq() and epair_tx_start_deferred().
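
To illustrate the kind of race described above, here is a minimal user-space
sketch. It is not the epair code: struct toy_queue and all toy_* functions are
hypothetical names, the ring is reduced to a packet counter, and the
interleaving of two CPUs is replayed by hand in a single thread so the outcome
is deterministic. It assumes a producer that schedules the drain task only
when the ring looked empty before its enqueue.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one ring (reduced to a packet count) and one deferred task. */
struct toy_queue {
	int  ring_len;      /* packets waiting in the ring */
	bool task_pending;  /* is the drain task scheduled? */
};

/* Producer, step 1: sample "is the ring empty?" before enqueuing. */
static bool
toy_check_empty(const struct toy_queue *q)
{
	return (q->ring_len == 0);
}

/* Producer, step 2: enqueue, then schedule only if the ring looked empty. */
static void
toy_enqueue(struct toy_queue *q, bool was_empty)
{
	q->ring_len++;
	if (was_empty)
		q->task_pending = true;
}

/* Consumer: the scheduled task drains the whole ring and then goes idle. */
static void
toy_drain_task(struct toy_queue *q)
{
	assert(q->task_pending);
	q->task_pending = false;
	q->ring_len = 0;
}

/*
 * Replay one bad interleaving. Returns true if packets are left in the
 * ring with no task scheduled to drain them.
 */
static bool
toy_replay_race(struct toy_queue *q)
{
	bool seen;

	q->ring_len = 0;
	q->task_pending = false;

	/* CPU 0: first packet; the ring was empty, so the task is scheduled. */
	toy_enqueue(q, toy_check_empty(q));

	/* CPU 1: samples the ring and sees it non-empty... */
	seen = toy_check_empty(q);

	/* CPU 0: ...but the task runs and drains the ring in between. */
	toy_drain_task(q);

	/* CPU 1: enqueues on the stale answer and skips scheduling. */
	toy_enqueue(q, seen);

	/* Every later producer now sees a non-empty ring and also skips. */
	toy_enqueue(q, toy_check_empty(q));

	return (q->ring_len > 0 && !q->task_pending);
}
```

In this model, once a packet lands in the ring without the task being
scheduled, no later enqueue ever sees an empty ring, so nothing drains it
again. That matches the reported symptom of a permanent stall, although the
real code path may differ in detail.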

The patch attached to this PR addresses the problem, but it needs to be looked
at, profiled, and most likely improved by somebody who has a better
understanding of both the code in question and writing lock-free code in
general.

[0]https://lists.freebsd.org/archives/freebsd-net/2022-March/001449.html
[1]https://people.freebsd.org/~grembo/hang_epair.sh
[2]https://cgit.freebsd.org/src/commit/?id=24f0bfbad57b9c3cb9b543a60b2ba00e4812c286
