[Bug 284480] graphics/drm-61-kmod: 14.2 amdgpu rx580 stability problem on high zfs load / scrub.

From: <bugzilla-noreply_at_freebsd.org>
Date: Fri, 31 Jan 2025 12:44:19 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=284480

            Bug ID: 284480
           Summary: graphics/drm-61-kmod: 14.2 amdgpu rx580 stability
                    problem on high zfs load / scrub.
           Product: Ports & Packages
           Version: Latest
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: Individual Port(s)
          Assignee: x11@FreeBSD.org
          Reporter: tomek@cedro.info
             Flags: maintainer-feedback?(x11@FreeBSD.org)

Hello world :-)

I noticed a stability problem when the amdgpu driver from drm-61-kmod is loaded
and a zpool scrub or heavy use of raidz2 takes place.

I have three ZFS pools. Two are simple zraid0, the third is zraid2. After
switching from 13.3 to 14.2 I noticed a strange stability problem that makes
the whole machine unusable and unreliable. It happened after some hours of use
or daily in the morning (probably some daily tasks triggered the problem). The
setup is 14.2-RELEASE GENERIC amd64 with drm-61-kmod and a Radeon RX580 8GB
(amdgpu).


Symptoms:
* no kernel panic.
* long screen freezes.
* cannot switch to VT.
* keyboard and mouse stop working, keystrokes repeat or are skipped, mouse
moves just a bit.
* watchdog is triggered on the on-board Realtek ethernet, making the network
interface go down.
* if there are multimedia applications working in the background, sound plays
with no problem.
* machine is unusable.

I have recompiled drm-61-kmod and gpu-firmware-amd. I even updated the CPU
microcode on boot and late in rc. I have recompiled EFL, Enlightenment,
Terminology, etc. I also changed mobo settings like RAM/CPU/NB/HT clocking.
Nothing helped.

So I was poking around and found out that triggering `zpool scrub` on pools
attached to the onboard SATA controller reproduces the symptoms. Scrubbing
zraid0 slows the machine down a bit, like what I noticed every day. Scrubbing
zraid2 makes the machine unusable.
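
A minimal reproduction sketch for anyone who wants to check this on similar
hardware. It is not taken from the affected machine: the pool name `tank`, the
`RUN_SCRUB` switch, the 60 second wait, and the grep patterns are all
assumptions made for illustration.

```shell
#!/bin/sh
# Reproduction sketch; POOL is a placeholder (the report does not name the pools).
POOL="${POOL:-tank}"

# Keep only kernel log lines that look related: GPU driver or NIC watchdog.
# The patterns are guesses, not messages quoted from the report.
filter_suspects() {
    grep -Ei 'amdgpu|drm|watchdog' || true
}

# Only start a scrub when explicitly asked, since it loads the disks for hours.
if [ "${RUN_SCRUB:-no}" = "yes" ] && command -v zpool >/dev/null 2>&1; then
    zpool scrub "$POOL"      # start the scrub (needs root)
    zpool status "$POOL"     # confirm that the scrub is in progress
    sleep 60                 # give the symptoms time to appear
    dmesg | filter_suspects  # anything from amdgpu/drm or a NIC watchdog?
fi
```

Run it as root with `RUN_SCRUB=yes POOL=<pool name>`. If the machine becomes
unusable, `zpool scrub -s <pool name>` stops the scrub, provided a shell still
responds.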

I have replaced the RX580 GPU with an NVIDIA 1060, installed nvidia-driver, and
the problem is gone. I am writing this while a raidz2 scrub is running in the
background. That would not be possible with the RX580 and amdgpu on board.

There may be something in the 61 amdgpu driver that interferes strongly with
the chipset.

I think I will switch to nvidia for good. But someone may have a similar problem.

Thanks :-)
Tomek


# uname -a
FreeBSD octagon 14.2-RELEASE FreeBSD 14.2-RELEASE
releng/14.2-n269506-c8918d6c7412 GENERIC amd64

# pkg info drm-61-kmod
drm-61-kmod-6.1.92.1402000_3

# pkg info gpu-firmware-kmod
gpu-firmware-kmod-20241114,1
