[Bug 267028] kernel panics when booting with both (zfs,ko or vboxnetflt,ko or acpi_wmi.ko) and amdgpu.ko

From: <bugzilla-noreply_at_freebsd.org>
Date: Sun, 22 Dec 2024 18:35:35 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=267028

--- Comment #271 from Mark Millard <marklmi26-fbsd@yahoo.com> ---
(In reply to Mark Millard from comment #270)

[I now have a boot/modules/vboxnetflt.ko , so linker
symbols are now available (but not debugging information).]

For the 3 example vmcore.* files that we have so far,
the first *.ko to load after:

boot/modules/amdgpu_raven_vcn_bin.ko

has been the one whose load activity detects the
corruption. That holds even when the detecting file is
from boot/modules/ instead of boot/kernel/ . For
example:

(kgdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0xffffffff82545000  0xffffffff82552000  Yes         ./boot/kernel/fusefs.ko
0xffffffff8256c000  0xffffffff8256e000  Yes         ./boot/kernel/sem.ko
0xffffffff82574000  0xffffffff825fb000  Yes (*)     ./boot/modules/if_re.ko
0xffffffff82a00000  0xffffffff82cf4000  Yes (*)     ./boot/modules/amdgpu.ko
0xffffffff82918000  0xffffffff8296c000  Yes (*)     ./boot/modules/drm.ko
0xffffffff8298a000  0xffffffff8298b000  Yes         ./boot/kernel/iic.ko
0xffffffff8298d000  0xffffffff8298f000  Yes (*)     ./boot/modules/linuxkpi_gplv2.ko
0xffffffff82991000  0xffffffff82996000  Yes (*)     ./boot/modules/dmabuf.ko
0xffffffff82998000  0xffffffff829a2000  Yes (*)     ./boot/modules/ttm.ko
0xffffffff829a5000  0xffffffff829a6000  Yes (*)     ./boot/modules/amdgpu_raven_gpu_info_bin.ko
0xffffffff829a8000  0xffffffff829a9000  Yes (*)     ./boot/modules/amdgpu_raven_sdma_bin.ko
0xffffffff829af000  0xffffffff829b0000  Yes (*)     ./boot/modules/amdgpu_raven_asd_bin.ko
0xffffffff829db000  0xffffffff829dc000  Yes (*)     ./boot/modules/amdgpu_raven_ta_bin.ko
0xffffffff829e6000  0xffffffff829e7000  Yes (*)     ./boot/modules/amdgpu_raven_pfp_bin.ko
0xffffffff829ee000  0xffffffff829ef000  Yes (*)     ./boot/modules/amdgpu_raven_me_bin.ko
0xffffffff829f5000  0xffffffff829f6000  Yes (*)     ./boot/modules/amdgpu_raven_ce_bin.ko
0xffffffff82e11000  0xffffffff82e12000  Yes (*)     ./boot/modules/amdgpu_raven_rlc_bin.ko
0xffffffff82e1d000  0xffffffff82e1e000  Yes (*)     ./boot/modules/amdgpu_raven_mec_bin.ko
0xffffffff82e61000  0xffffffff82e62000  Yes (*)     ./boot/modules/amdgpu_raven_mec2_bin.ko
0xffffffff82ea5000  0xffffffff82ea6000  Yes (*)     ./boot/modules/amdgpu_raven_vcn_bin.ko
0xffffffff829fa000  0xffffffff82a00000  Yes (*)     ./boot/modules/vboxnetflt.ko
(*): Shared library is missing debugging information.

Reminder of the names associated with the odd tqe_next values:
"amdgpu_raven_mec2_bin_fw" (vmcore.8 but older gpu-firmware-amd-kmod-raven-* )
"amdgpu_raven_mec_bin_fw"  (vmcore.9)
"amdgpu_raven_me_bin_fw"   (vmcore.0)


It may be that there is no corruption before
boot/modules/amdgpu_raven_vcn_bin.ko loads. I'll note that,
so far, the corruption has always been earlier in the list
than the boot/modules/amdgpu_raven_vcn_bin.ko related
material, even though its position in the list varies.


Another possibly interesting point is that the address
range listed for vboxnetflt.ko , when it is present, fits
between the amdgpu_raven_ce_bin.ko range and the start of
the amdgpu.ko range.
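
One way to double-check where a module's range falls relative
to the others is to sort the listed ranges by start address
and look at the neighbours. The following is only a minimal
sketch: ROWS is a hand-copied subset of the kgdb listing above,
and address_neighbours is a hypothetical helper name, not
something kgdb or the kernel provides.

```python
# Sketch only: ROWS is a subset of the "info sharedlibrary" rows shown
# above, copied by hand as (from_address, to_address, name) tuples.
ROWS = [
    (0xffffffff82a00000, 0xffffffff82cf4000, "amdgpu.ko"),
    (0xffffffff829ee000, 0xffffffff829ef000, "amdgpu_raven_me_bin.ko"),
    (0xffffffff829f5000, 0xffffffff829f6000, "amdgpu_raven_ce_bin.ko"),
    (0xffffffff82e11000, 0xffffffff82e12000, "amdgpu_raven_rlc_bin.ko"),
    (0xffffffff82ea5000, 0xffffffff82ea6000, "amdgpu_raven_vcn_bin.ko"),
    (0xffffffff829fa000, 0xffffffff82a00000, "vboxnetflt.ko"),
]


def address_neighbours(rows, name):
    """Return the (previous, next) module names around `name` when the
    rows are ordered by start address; None marks a missing neighbour."""
    ordered = sorted(rows)  # tuples sort by from_address first
    names = [row[2] for row in ordered]
    i = names.index(name)   # raises ValueError if `name` is not listed
    prev_name = names[i - 1] if i > 0 else None
    next_name = names[i + 1] if i + 1 < len(names) else None
    return prev_name, next_name


if __name__ == "__main__":
    print(address_neighbours(ROWS, "vboxnetflt.ko"))
```

Extending ROWS with the remaining rows of the listing would
give the same kind of check for any other module.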
