[Bug 267028] kernel panics when booting with both (zfs,ko or vboxnetflt,ko or acpi_wmi.ko) and amdgpu.ko

From: <bugzilla-noreply_at_freebsd.org>
Date: Sat, 21 Dec 2024 16:28:27 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267028

--- Comment #260 from Mark Millard <marklmi26-fbsd@yahoo.com> ---
(In reply to satanist+freebsd from comment #259)

In:

        mod = malloc(sizeof(struct modlist), M_LINKER, M_NOWAIT | M_ZERO);
        if (mod == NULL)
                panic("no memory for module list");
        mod->container = container;

if something like mod == 0xfffff80000000007 had resulted here,
it appears to me that the dereference in mod->container
(or the like) would have gotten a general protection fault
right away, given that the later failures that sometimes
happen are caused by the 0xfffff80000000007 value that
sometimes shows up.
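
As a side note, here is a small userland sketch (not kernel code) of
one property of that value that can be checked mechanically: its
three low bits are all set, so it is not 8-byte aligned. The helper
names are made up for illustration, and treating 8 bytes as the
alignment a pointer-bearing structure would normally have is my
assumption, not something taken from the vmcore files.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value reported in the crash dumps, copied from the discussion above. */
#define SUSPECT_VALUE   UINT64_C(0xfffff80000000007)

/*
 * Hypothetical helper: remainder of addr modulo align.
 * align must be a power of two.
 */
static inline uint64_t
misalignment(uint64_t addr, uint64_t align)
{
        return (addr & (align - 1));
}

/* Hypothetical helper: true if addr is a multiple of align. */
static inline bool
is_aligned(uint64_t addr, uint64_t align)
{
        return (misalignment(addr, align) == 0);
}
```

Calling misalignment(SUSPECT_VALUE, 8) gives 7, which is part of why
the value looks to me more like a clobbered pointer than like
something an allocator handed back.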

I'll note also that, for example, one of the historical crashes
involving 0xfffff80000000007 was in handling a different list:

/*
 * Remove the references to the thread from all of the objects we were
 * polling.
 */
static void
seltdclear(struct thread *td)
{
        struct seltd *stp;
        struct selfd *sfp;
        struct selfd *sfn;

        stp = td->td_sel;
        STAILQ_FOREACH_SAFE(sfp, &stp->st_selq, sf_link, sfn)
                selfdfree(stp, sfp);
        stp->st_flags = 0;
}

so the issue does not appear to be list-specific, even
if one list fails more often than the others for
some reason.
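
For anyone not familiar with the macro, the traversal pattern used
in seltdclear() can be sketched in userland roughly as follows. This
is only an illustration under assumptions: struct node, fill_list()
and drain_list() are hypothetical stand-ins (not the kernel's
struct selfd or selfdfree()), and the fallback macro definition is
there because some libcs ship a <sys/queue.h> without the _SAFE
variants. The point is that each step reads the element's link
field to find the next element, so a bogus pointer stored in any
such list gets dereferenced during the walk.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Fallback for <sys/queue.h> versions that lack the _SAFE variants. */
#ifndef STAILQ_FOREACH_SAFE
#define STAILQ_FOREACH_SAFE(var, head, field, tvar)                     \
        for ((var) = STAILQ_FIRST((head));                              \
            (var) && ((tvar) = STAILQ_NEXT((var), field), 1);           \
            (var) = (tvar))
#endif

/* Hypothetical stand-in for struct selfd: a payload plus a link. */
struct node {
        int                     value;
        STAILQ_ENTRY(node)      link;
};
STAILQ_HEAD(nodelist, node);

/* Append n freshly allocated nodes to the list. */
static void
fill_list(struct nodelist *head, int n)
{
        for (int i = 0; i < n; i++) {
                struct node *np = calloc(1, sizeof(*np));

                assert(np != NULL);
                np->value = i;
                STAILQ_INSERT_TAIL(head, np, link);
        }
}

/*
 * Free every node and return how many were freed.  The next pointer
 * is fetched before the loop body runs, so freeing the current node
 * inside the body is safe.  The head is reset afterwards because the
 * nodes are freed here without being unlinked one by one.
 */
static int
drain_list(struct nodelist *head)
{
        struct node *np, *next;
        int freed = 0;

        STAILQ_FOREACH_SAFE(np, head, link, next) {
                free(np);
                freed++;
        }
        STAILQ_INIT(head);
        return (freed);
}
```

To use it: declare a struct nodelist, STAILQ_INIT() it, call
fill_list() with a count, then drain_list() should return that same
count and leave the list empty.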

I do not know if there is some relevant relationship with
the likes of code from:

drm-kmod/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c

for alternate failure points.

No simple reproduction test has ever been discovered.


Whether MALLOC_DEBUG is defined in the kernel is controlled by
this code in sys/kern/kern_malloc.c:

#if defined(INVARIANTS) || defined(MALLOC_MAKE_FAILURES) ||             \
    defined(DEBUG_MEMGUARD) || defined(DEBUG_REDZONE)
#define MALLOC_DEBUG    1
#endif
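
The effect of that guard can be tried outside the kernel with a
sketch like the one below. The guard itself is copied from the quote
above; malloc_debug_enabled() is a hypothetical helper that exists
only for this illustration. In a real kernel build the macros come
from the kernel configuration rather than from a -D flag on the
compiler command line, so take this as a demonstration of the
preprocessor logic only.

```c
/* Same guard as quoted from sys/kern/kern_malloc.c above. */
#if defined(INVARIANTS) || defined(MALLOC_MAKE_FAILURES) ||             \
    defined(DEBUG_MEMGUARD) || defined(DEBUG_REDZONE)
#define MALLOC_DEBUG    1
#endif

/*
 * Hypothetical helper: report at run time whether the guard above
 * ended up defining MALLOC_DEBUG for this compilation unit.
 */
static inline int
malloc_debug_enabled(void)
{
#ifdef MALLOC_DEBUG
        return (1);
#else
        return (0);
#endif
}
```

Compiled with none of the four macros defined, the helper returns 0.
Compiling the same file with, for example, -DINVARIANTS makes it
return 1.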

It, in turn, leads to the definition and use of the kernel's
malloc_dbg() and free_dbg(). I certainly have no objection
to such testing, say by using an INVARIANTS-based kernel
build. But I'm not the one testing, since I have no context
in which to reproduce the problem. I'm just looking at the
vmcore.* file(s) via kgdb.

I'll also note that we appear to have learned recently
that some of the software in use was rather old and was not
being updated, so it was not tracking kernel updates. Testing
whether modern software, built to match the kernel in use,
also produces the problem seems appropriate, as that is what
would be changed if there is still a bug to be fixed. As I
understand it, that testing is what is going on now.
