[Bug 277476] graphics/drm-515-kmod: amdgpu periodic hangs due to phys contig allocations

From: <bugzilla-noreply_at_freebsd.org>
Date: Fri, 08 Nov 2024 09:04:51 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277476

--- Comment #5 from sigsys@gmail.com ---
Yeah so this problem was super annoying. But thanks to the information already
posted here, seems like it wasn't too hard to fix.

IIUC the drm code (ttm_pool_alloc()) that asks for contiguous pages doesn't
actually need them to be contiguous. It's just an opportunistic optimization.
When an allocation fails, it falls back to asking for fewer and fewer
contiguous pages (eventually only one page at a time). When
ttm_pool_alloc_page() asks for more than one page, it passes alloc_pages()
some extra flags
(__GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM).
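
To make the fallback concrete, here is a rough standalone sketch of that
pattern. It is not the actual ttm_pool code: try_alloc_order(),
alloc_opportunistic() and max_contig_order are hypothetical names, and the
"allocator" is just a stub that pretends memory is fragmented.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the real allocator: it succeeds only when the
 * requested order does not exceed the largest contiguous run we pretend
 * to have available.
 */
static unsigned int max_contig_order = 0;	/* simulated fragmentation */

int
try_alloc_order(unsigned int order)
{
	return (order <= max_contig_order);
}

/*
 * Allocate npages pages, trying the largest power-of-two chunk first and
 * dropping to a smaller order whenever an attempt fails.  Returns the
 * number of successful allocation calls, or -1 if even order 0 fails
 * (which this stub never does).
 */
int
alloc_opportunistic(size_t npages, unsigned int start_order)
{
	unsigned int order = start_order;
	int chunks = 0;

	assert(start_order < 8 * sizeof(size_t));

	while (npages > 0) {
		/* Never ask for more pages than are still needed. */
		while (order > 0 && ((size_t)1 << order) > npages)
			order--;
		if (try_alloc_order(order)) {
			npages -= (size_t)1 << order;
			chunks++;
		} else if (order > 0) {
			order--;	/* fall back to a smaller contiguous run */
		} else {
			return (-1);	/* single pages unavailable too */
		}
	}
	return (chunks);
}
```

The point is that a failed high-order attempt costs the caller nothing but a
retry at a lower order, so the allocator has no reason to work hard on its
behalf.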

What's expensive is the vm_page_reclaim_contig() in linux_alloc_pages(). The
function tries too hard to find contiguous memory (that the drm code doesn't
even require) and as physical memory gets too fragmented it becomes very slow.

So, very simple fix: make linux_alloc_pages() react to one of the flags passed
by the drm code:

diff --git a/sys/compat/linuxkpi/common/include/linux/gfp.h b/sys/compat/linuxkpi/common/include/linux/gfp.h
index 2fcc0dc05f29..58a021086c98 100644
--- a/sys/compat/linuxkpi/common/include/linux/gfp.h
+++ b/sys/compat/linuxkpi/common/include/linux/gfp.h
@@ -44,7 +44,6 @@
 #define        __GFP_NOWARN    0
 #define        __GFP_HIGHMEM   0
 #define        __GFP_ZERO      M_ZERO
-#define        __GFP_NORETRY   0
 #define        __GFP_NOMEMALLOC 0
 #define        __GFP_RECLAIM   0
 #define        __GFP_RECLAIMABLE   0
@@ -58,7 +57,8 @@
 #define        __GFP_KSWAPD_RECLAIM    0
 #define        __GFP_WAIT      M_WAITOK
 #define        __GFP_DMA32     (1U << 24) /* LinuxKPI only */
-#define        __GFP_BITS_SHIFT 25
+#define        __GFP_NORETRY   (1U << 25) /* LinuxKPI only */
+#define        __GFP_BITS_SHIFT 26
 #define        __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
 #define        __GFP_NOFAIL    M_WAITOK

diff --git a/sys/compat/linuxkpi/common/src/linux_page.c b/sys/compat/linuxkpi/common/src/linux_page.c
index 18b90b5e3d73..71a6890a3795 100644
--- a/sys/compat/linuxkpi/common/src/linux_page.c
+++ b/sys/compat/linuxkpi/common/src/linux_page.c
@@ -118,7 +118,7 @@ linux_alloc_pages(gfp_t flags, unsigned int order)
                        page = vm_page_alloc_noobj_contig(req, npages, 0, pmax,
                            PAGE_SIZE, 0, VM_MEMATTR_DEFAULT);
                        if (page == NULL) {
-                               if (flags & M_WAITOK) {
+                               if ((flags & (M_WAITOK | __GFP_NORETRY)) == M_WAITOK) {
                                        int err = vm_page_reclaim_contig(req,
                                            npages, 0, pmax, PAGE_SIZE, 0);
                                        if (err == ENOMEM)

Been working fine here with amdgpu for about 3 weeks.

(The drm modules need to be recompiled with the modified kernel header.)
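
For anyone reading the new condition: it only takes the reclaim path when the
caller may sleep and did not ask for no-retry. A minimal standalone sketch of
just that test follows. SK_WAITOK, SK_NORETRY and should_reclaim_contig() are
made-up names for illustration, and the SK_WAITOK value is an assumption
standing in for M_WAITOK.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_WAITOK	0x0002u		/* illustrative stand-in for M_WAITOK */
#define SK_NORETRY	(1u << 25)	/* bit the patch gives __GFP_NORETRY */

/* The test below only works if the two flags occupy different bits. */
static_assert((SK_WAITOK & SK_NORETRY) == 0, "flag bits must not overlap");

/*
 * Mirrors the patched condition: mask out both bits and require that
 * only the "may sleep" bit is set.
 */
bool
should_reclaim_contig(unsigned int flags)
{
	return ((flags & (SK_WAITOK | SK_NORETRY)) == SK_WAITOK);
}
```

This is also why __GFP_NORETRY had to stop being defined as 0: with a zero
value the mask could never distinguish the two kinds of callers.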
