Should the page identified by DMAP_MIN_ADDRESS (or DMAP_BASE_ADDRESS) allow reads and modifications?
Date: Tue, 24 Dec 2024 22:59:04 UTC
I've been helping someone gather evidence for amd64 crashes that they have had for over 2 years. [Their context was recently updated to 13.4-RELEASE-p2 (so kernel 13.4-RELEASE-p1 in official distributions). They do their own kernel builds and -kmod port builds.] The boot crash is intermittent but happens often enough to be an issue. Most of the boot crashes have been tied to the found_modules list. But there has been rare evidence of a problem elsewhere as well. (Another of the lists, for example.)

For reference: DMAP_MIN_ADDRESS == 0xfffff80000000000 in the context, as seen via vmcore.* (kernel, *.ko, *.debug) and kgdb use:

(kgdb) print &dmapbase
$6 = (<variable (not text or data), no debug info> *) 0xfffff80000000000
(kgdb) print dmaplimit
$12 = 0x240000000

My understanding is that 0x0 (a.k.a. NULL) and DMAP_MIN_ADDRESS both identify the lowest byte address of the same page, but via distinct address ranges.

Should access into that page via the DMAP_MIN_ADDRESS based address range refuse to allow modifications to the bytes in the page? Refuse to allow reads of bytes in the page?

Refusal would be a cross check on having avoided use of PHYS_TO_DMAP translations of small non-negative offsets from NULL, for example. [But, for all I know, the kernel may sometimes need such access. This is a learn-as-I-go investigation.] The refusal would also give an earlier rather than later indication of a problem, if the page really should not be in use (small offsets from DMAP_MIN_ADDRESS).

For reference . . .

There is evidence of 0xfffff80000000007 showing up as an address that is then offset by 0x18 to produce 0xfffff8000000001F, with that address then being used to read a char* from RAM. When the resulting junk value (0xe987f000fea5f0, as an example) is in turn passed to strcmp, the strcmp operation gets a general protection fault from the attempted dereference of the junk value.

I'll note that the 0xfffff80000000007 value has been stable in the vmcore.*'s that I've looked at, despite other variations across the vmcore.*'s.
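
To make the address arithmetic concrete, here is a minimal userland C sketch. It is not kernel code: it models the direct map as DMAP_MIN_ADDRESS plus the physical address, using the values reported by kgdb in this context. The helper names (phys_to_dmap, dmap_to_phys, in_dmap, in_dmap_page_zero), the 4 KiB page size, and the NAME_OFFSET constant are assumptions introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values taken from the kgdb output in this report. */
#define DMAP_MIN_ADDRESS 0xfffff80000000000ULL
#define DMAPLIMIT        0x240000000ULL  /* the reported dmaplimit value */

/* Assumptions for illustration only. */
#define PAGE_SIZE_4K     0x1000ULL       /* base page size assumed here */
#define NAME_OFFSET      0x18ULL         /* offset applied before the char* read */

/* Model: direct-map address = DMAP_MIN_ADDRESS + physical address. */
static inline uint64_t phys_to_dmap(uint64_t pa)
{
    return DMAP_MIN_ADDRESS + pa;
}

/* Inverse of the model above; only meaningful for direct-map addresses. */
static inline uint64_t dmap_to_phys(uint64_t va)
{
    assert(va >= DMAP_MIN_ADDRESS);
    return va - DMAP_MIN_ADDRESS;
}

/* Does va fall inside the direct-map range implied by DMAPLIMIT? */
static inline int in_dmap(uint64_t va)
{
    return va >= DMAP_MIN_ADDRESS && (va - DMAP_MIN_ADDRESS) < DMAPLIMIT;
}

/* Hypothetical cross check: is va in the direct-map alias of physical page 0? */
static inline int in_dmap_page_zero(uint64_t va)
{
    return in_dmap(va) && dmap_to_phys(va) < PAGE_SIZE_4K;
}
```

Under this model, phys_to_dmap(0x7) is 0xfffff80000000007, adding NAME_OFFSET gives 0xfffff8000000001f, and in_dmap_page_zero() reports that address as lying in the direct-map alias of physical page 0. That is the kind of check I have in mind when asking whether the page should refuse access.
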

The node in the found_modules list varies, but the value always shows up in the link.tqe_next field of the node that ends up with the value. The likes of "info sharedlibrary" and "info files" indicate that more was loaded after the name indicated for the node with the link.tqe_next 0xfffff80000000007 value.

So far, in the found_modules related crash examples, the first .ko to load after amdgpu_raven_vcn_bin.ko (the last of the sequence of amdgpu*.ko files) ends up detecting the problem during strcmp, as shown in:

#6  <signal handler called>
No locals.
#7  strcmp (s1=<optimized out>, s2=<optimized out>)
    at /usr/src/sys/libkern/strcmp.c:44
No locals.
#8  0xffffffff80bc0ab4 in modlist_lookup (
    name=0xffffffff829fd0c4 "vboxnetflt", ver=1)
    at /usr/src/sys/kern/kern_linker.c:1488
        mod = 0xfffff80000000007

(mod is from a node's link.tqe_next value that was 0xfffff80000000007. mod->name's value (a const char*) is passed to strcmp.)

"vboxnetflt" is just an example. Changing the ordering has had acpi_wmi.ko and zfs.ko listed there instead: whatever was next at that point.

===
Mark Millard
marklmi at yahoo.com