[Bug 264191] debugnet panics with mbuf cache with multiple instances of the same driver

From: <bugzilla-noreply_at_freebsd.org>
Date: Mon, 23 May 2022 20:40:01 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=264191

            Bug ID: 264191
           Summary: debugnet panics with mbuf cache with multiple
                    instances of the same driver
           Product: Base System
           Version: CURRENT
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: bdrewery@FreeBSD.org

1. debugnet_mbuf_reinit() is racy.

With netdump we only populated the mbuf cache when a device was
*configured*. Now we populate the cache whenever a device that *supports*
debugnet comes up. Thus if a driver has multiple devices, each device coming
up calls debugnet_mbuf_reinit(), and those calls race between multiple
threads while touching the mbufqs. This is easily fixed but leaves more issues.

Populating the cache at driver link-up still makes sense, because we may not
configure the device until after a panic, from ddb with .netdump.
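
A minimal userland sketch of the "easily fixed" part of item 1, assuming the
fix is simply to serialize the whole drain-and-refill sequence behind one lock.
The sk_* names, the linked-list queue, and the pthread mutex are hypothetical
stand-ins for illustration; this is not the kernel's mbufq code or the actual
patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for an mbuf and an mbufq; not the kernel types. */
struct sk_buf {
	struct sk_buf *next;
};

struct sk_queue {
	pthread_mutex_t lock;	/* serializes drain + refill */
	struct sk_buf *head;
	int len;
};

static struct sk_queue sk_cache = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Free every queued buffer. The caller must hold q->lock. */
static void
sk_drain(struct sk_queue *q)
{
	struct sk_buf *b;

	while ((b = q->head) != NULL) {
		q->head = b->next;
		free(b);
	}
	q->len = 0;
}

/*
 * Rebuild the cache with up to nbufs entries. The whole sequence runs
 * under one lock, so two interfaces coming up at the same time cannot
 * interleave their queue updates.
 */
static void
sk_cache_reinit(struct sk_queue *q, int nbufs)
{
	struct sk_buf *b;

	assert(nbufs >= 0);
	pthread_mutex_lock(&q->lock);
	sk_drain(q);
	while (q->len < nbufs) {
		b = malloc(sizeof(*b));
		if (b == NULL)
			break;		/* keep whatever was allocated */
		b->next = q->head;
		q->head = b;
		q->len++;
	}
	pthread_mutex_unlock(&q->lock);
}
```

Calling sk_cache_reinit(&sk_cache, n) once per interface mimics each device
coming up; without the lock, one caller's drain could run in the middle of
another caller's refill.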

2. dn_buf_import() may overflow an mbuf from the queue with trash_init() on
<without INVARIANTS>.

If one device uses jumbo frames (MTU 9000) and the other the normal MTU of
1500, the hwm/dn_clsize can become MJUM9BYTES (9216).
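
To illustrate how a shared high-water mark ends up at 9216, here is a rough
sketch. The hwm_* names and the MTU-to-cluster-size rule are simplifying
assumptions, and the constant values are the usual FreeBSD ones (2048 for
MCLBYTES, 9216 for MJUM9BYTES); real drivers choose their own cluster sizes.

```c
#include <assert.h>

/* Assumed values, mirroring the usual FreeBSD constants. */
#define HWM_MCLBYTES	2048		/* standard cluster */
#define HWM_MJUM9BYTES	(9 * 1024)	/* 9k jumbo cluster, 9216 bytes */

/* Hypothetical high-water mark shared by every interface. */
static int hwm_clsize;

/* Cluster size a driver might request for an MTU (simplified rule). */
static int
hwm_clsize_for_mtu(int mtu)
{
	assert(mtu > 0);
	return (mtu > HWM_MCLBYTES ? HWM_MJUM9BYTES : HWM_MCLBYTES);
}

/* An interface coming up can only raise the shared cluster size. */
static int
hwm_update(int mtu)
{
	int clsize = hwm_clsize_for_mtu(mtu);

	if (clsize > hwm_clsize)
		hwm_clsize = clsize;
	return (hwm_clsize);
}
```

Under these assumptions, once the MTU 9000 interface has come up the shared
value stays at 9216, including for the MTU 1500 interface.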

[This next part may only be a problem for something like mlx4 which has some
cached mbufs of its own. This can be seen in mlx4_en_alloc_buf() where it
appears to always keep 1 extra mbuf around for each ring. It appears it may use
that mbuf at panic time if mlx4_en_alloc_mbuf() fails. The issue I ran into
downstream was a very different allocation scenario but the FreeBSD version
appears to have a similar issue.]

If the device that is used at dump time has an MTU of 1500, it is possible for
the device to return a smaller mbuf to the dn_clustq than expected for that
zone (vs the high water mark of 9216). When that mbuf is later removed in
dn_buf_import(), it has trash_init(9216) run over it rather than a fill of the
expected MCLBYTES size.
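
The size mismatch can be pictured with a small userland model. This is only a
sketch: the cl_* names are hypothetical, the fill byte is arbitrary, and the
clamped fill is there to make the arithmetic visible, not as a proposed fix
for dn_buf_import().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CL_MCLBYTES	2048	/* assumed standard cluster size */
#define CL_MJUM9BYTES	9216	/* high-water mark from the report */

/* Hypothetical cached cluster that remembers its real allocation size. */
struct cl_buf {
	size_t size;
	unsigned char *data;
};

static unsigned char cl_storage[CL_MCLBYTES];
static struct cl_buf cl_small = { sizeof(cl_storage), cl_storage };

/* Bytes that a fill of fill_len would write past the end of the buffer. */
static size_t
cl_overrun(const struct cl_buf *c, size_t fill_len)
{
	return (fill_len > c->size ? fill_len - c->size : 0);
}

/* Pattern fill clamped to the buffer's own size; returns bytes written. */
static size_t
cl_fill(struct cl_buf *c, size_t fill_len)
{
	size_t n = fill_len < c->size ? fill_len : c->size;

	memset(c->data, 0xde, n);
	return (n);
}
```

With a 2048-byte cluster and a 9216-byte fill, cl_overrun() reports 7168
bytes past the end of the buffer, which is the region an unclamped fill would
overwrite.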
