Should the page identified by DMAP_MIN_ADDRESS (or DMAP_BASE_ADDRESS) allow reads and modifications?

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 24 Dec 2024 22:59:04 UTC
I've been helping someone gather evidence about amd64 crashes
that they have had for over 2 years. [Their context was recently
updated to 13.4-RELEASE-p2 (so kernel 13.4-RELEASE-p1 in official
distributions). They do their own kernel builds and -kmod port
builds.] The boot crashes are intermittent but frequent enough
to be an issue. Most of the boot crashes have been tied to the
found_modules list, but there has also been rare evidence of
problems elsewhere. (Another of the lists, for example.)

For reference: DMAP_MIN_ADDRESS == 0xfffff80000000000 in the
context, as seen via vmcore.* (kernel, *.ko, *.debug) and
kgdb use:

(kgdb) print &dmapbase
$6 = (<variable (not text or data), no debug info> *) 0xfffff80000000000

(kgdb) print dmaplimit
$12 = 0x240000000

My understanding is that 0x0 (a.k.a. NULL) and DMAP_MIN_ADDRESS
both identify a low byte address for the same page but via
distinct address ranges.
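
As a minimal userland sketch of that relationship, assuming the
amd64 direct map simply ORs a physical address into
DMAP_MIN_ADDRESS, so that physical address 0 lands at
DMAP_MIN_ADDRESS. The *_SKETCH and *_sketch names are
hypothetical stand-ins, not the kernel's macros; the two
constants are the values kgdb reported above.

```c
#include <assert.h>
#include <stdint.h>

/* DMAP_MIN_ADDRESS as seen in the vmcore context above. */
#define DMAP_MIN_ADDRESS_SKETCH	0xfffff80000000000ULL
/* dmaplimit as printed by kgdb above. */
#define DMAPLIMIT_SKETCH	0x240000000ULL

/*
 * Hypothetical stand-in for PHYS_TO_DMAP: place a physical address
 * into the direct-map range. Assumes the mapping is a plain OR.
 */
static uint64_t
phys_to_dmap_sketch(uint64_t pa)
{
	/* Only physical addresses below dmaplimit are covered. */
	assert(pa < DMAPLIMIT_SKETCH);
	return (pa | DMAP_MIN_ADDRESS_SKETCH);
}

/* Hypothetical inverse: recover the physical address. */
static uint64_t
dmap_to_phys_sketch(uint64_t va)
{
	return (va & ~DMAP_MIN_ADDRESS_SKETCH);
}
```

Under those assumptions, phys_to_dmap_sketch(0) gives
0xfffff80000000000 and dmap_to_phys_sketch(0xfffff80000000007)
gives 0x7, a small offset into the first physical page.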

Should access into that page via the DMAP_MIN_ADDRESS based
address range refuse to allow modifications to the bytes in
the page? Refuse to allow reads of bytes in the page?

Refusal would be a cross check on having avoided use of
PHYS_TO_DMAP translations of small non-negative offsets from
NULL, for example. [But, for all I know, the kernel may
sometimes need such access. This is a learn-as-I-go
investigation.] The refusal would also give an earlier
rather than later indication of a problem if the page
really should not be in use (small offsets from
DMAP_MIN_ADDRESS).


For reference . . .

There is evidence of 0xfffff80000000007 showing up as
an address that is then offset by 0x18 to produce
0xfffff8000000001F and that being used to get a
char* from RAM. When that junk-value (0xe987f000fea5f0,
as an example) is then in turn passed to strcmp, the
strcmp operation gets a general protection fault from
the junk-value's attempted dereference.
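
The arithmetic can be sketched in userland C. Assumptions,
labeled loudly: struct modlist_sketch is a hypothetical mirror
of a node with a two-pointer tail-queue link, a container
pointer, and then the name pointer, on an LP64 target; and
virtual addresses are taken to be 48-bit canonical (4-level
paging), which is an assumption about the machine, not
something the vmcore evidence above states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The sketch only makes sense with 8-byte pointers (LP64). */
static_assert(sizeof(void *) == 8, "sketch assumes LP64");

/* Hypothetical mirror of a found_modules node layout. */
struct modlist_sketch {
	struct {
		struct modlist_sketch *tqe_next;	/* offset 0x00 */
		struct modlist_sketch **tqe_prev;	/* offset 0x08 */
	} link;
	void *container;				/* offset 0x10 */
	const char *name;				/* offset 0x18 */
	int version;
};

/* Address from which the name pointer would be loaded. */
static uint64_t
name_field_address(uint64_t mod)
{
	return (mod + offsetof(struct modlist_sketch, name));
}

/*
 * 48-bit canonical check: bits 63..47 must be all zeros or
 * all ones. Returns 1 if canonical, 0 otherwise.
 */
static int
is_canonical_48(uint64_t va)
{
	uint64_t top = va >> 47;	/* the upper 17 bits */

	return (top == 0 || top == 0x1ffff);
}
```

Under those assumptions, name_field_address(0xfffff80000000007)
is 0xfffff8000000001f, and is_canonical_48(0xe987f000fea5f0)
is 0. A non-canonical address would be consistent with the
dereference raising a general protection fault rather than a
page fault.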

I'll note that the 0xfffff80000000007 value has been
stable in the vmcore.* files that I've looked at, despite
other variations across those files. Which node in
the found_modules list is affected varies, but the value
always shows up in that node's link.tqe_next field.

The likes of "info sharedlibrary" and "info files"
indicate that more was loaded after the name indicated for
the node with the link.tqe_next 0xfffff80000000007
value. So far, in the found_modules related crash
examples, the first .ko to load after
amdgpu_raven_vcn_bin.ko (the last of the sequence
of amdgpu*.ko files) ends up detecting the problem
during strcmp, as shown in:

#6  <signal handler called>
No locals.
#7  strcmp (s1=<optimized out>, s2=<optimized out>)
    at /usr/src/sys/libkern/strcmp.c:44
No locals.
#8  0xffffffff80bc0ab4 in modlist_lookup (
    name=0xffffffff829fd0c4 "vboxnetflt", ver=1)
    at /usr/src/sys/kern/kern_linker.c:1488
        mod = 0xfffff80000000007

(mod is from a node's link.tqe_next value that was
0xfffff80000000007. The value of mod->name (a const char*)
is passed to strcmp.)
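
As a minimal sketch of why the walk fails where it does, and
of one cheap property that separates 0xfffff80000000007 from
a plausible node address: a pointer to a structure that
contains pointers would normally be 8-byte aligned on amd64.
All of the names below (node_sketch, plausible_node_address,
lookup_sketch) are hypothetical and are not the kernel's
code; the check is an illustration only, not a proposed fix.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified list node: next link plus a name. */
struct node_sketch {
	struct node_sketch *next;
	const char *name;
};

/* Non-NULL and aligned as a node would have to be. */
static int
plausible_node_address(uint64_t addr)
{
	return (addr != 0 && addr % alignof(struct node_sketch) == 0);
}

/*
 * Walk the list comparing names. Each node pointer comes from
 * the previous node's next field, so one corrupted link makes
 * the following iteration load name from a bogus address.
 */
static const struct node_sketch *
lookup_sketch(const struct node_sketch *head, const char *name)
{
	const struct node_sketch *n;

	for (n = head; n != NULL; n = n->next) {
		/* Illustrative cross check before the dereference. */
		assert(plausible_node_address((uint64_t)(uintptr_t)n));
		if (strcmp(n->name, name) == 0)
			return (n);
	}
	return (NULL);
}
```

With this sketch, plausible_node_address(0xfffff80000000007)
is 0 because the low three bits are not zero, so a check of
this kind would trip on the bad link value itself, before the
junk name pointer ever reaches strcmp.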

"vboxnetflt" is just an example. Changing the ordering
has had acpi_wmi.ko and zfs.ko listed there instead,
whatever was next at that point.


===
Mark Millard
marklmi at yahoo.com