[Bug 253461] [AMD/ATI] RV730 PRO [Radeon HD 4650] panic kernel
Date: Tue, 04 Jan 2022 22:52:02 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253461

Bill Paul <noisetube@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |noisetube@gmail.com

--- Comment #3 from Bill Paul <noisetube@gmail.com> ---
I believe I have a fix for this bug. It is a problem with the linuxkpi code
in the FreeBSDDesktop-kms-drm-4.16.g20201016-8843e1fc5_GH0.tar.gz
distribution.

Notes:

- This problem has been there for some time. I've had it happen in FreeBSD
  12.2-RELEASE and FreeBSD 12.3-RELEASE.

- It's not confined to a single Radeon card. I've observed the problem with
  the following hardware on different machines:

  vgapci0@pci0:1:0:0: class=0x030000 card=0x21261028 chip=0x68f91002 rev=0x00 hdr=0x00
      vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
      device     = 'Cedar [Radeon HD 5000/6000/7350/8350 Series]'
      class      = display
      subclass   = VGA

  vgapci0@pci0:0:1:0: class=0x030000 card=0x168b103c chip=0x96481002 rev=0x00 hdr=0x00
      vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
      device     = 'Sumo [Radeon HD 6480G]'
      class      = display
      subclass   = VGA

  vgapci1@pci0:131:0:0: class=0x030000 card=0x90b8103c chip=0x67711002 rev=0x00 hdr=0x00
      vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
      device     = 'Caicos XTX [Radeon HD 8490 / R5 235X OEM]'
      class      = display
      subclass   = VGA

  (Note that the Sumo device is built into a laptop, an HP ProBook 4535S.)

- This problem has been reported by others. PR 237544 is a duplicate. The
  panics I experienced had the same stack traces as shown in both PRs.

- PR 237544 provides an important hint that this crash did _not_ happen with
  the drm-fbsd11.2-kmod port/package. Although it has been deprecated, I was
  able to build and install the drm-fbsd11.2-kmod code on my FreeBSD
  12.3-RELEASE system (the laptop) and the crashes went away.

- In my case, the panics were more likely to occur when the system was under
  load. The laptop seemed to trigger it more frequently (which actually made
  it easier to track it down).

I tried to track the problem down by comparing the drm-fbsd11.2-kmod and
drm-fbsd12.0-kmod code and swapping bits of the 11.2 code into the 12.0 tree
to see what effect that would have. Eventually I traced the problem to the
linuxkpi code, then to the dma-fence code, and finally to this function in
linuxkpi/gplv2/include/linux/dma-fence.h:

static inline void
dma_fence_signal_locked_sub(struct dma_fence *fence)
{
        struct dma_fence_cb *cur;

        while ((cur = list_first_entry_or_null(&fence->cb_list,
            struct dma_fence_cb, node)) != NULL) {
                list_del_init(&cur->node);
                spin_unlock(fence->lock);       /* <-- No! */
                cur->func(fence, cur);
                spin_lock(fence->lock);         /* <-- No! */
        }
}

Note the two lines highlighted above. The dma_fence_signal_locked_sub()
routine is shared by both dma_fence_signal() and dma_fence_signal_locked().
The latter function is intended to be used when the caller is already holding
the fence spinlock. The former takes the spinlock itself.

The problem is that the above code causes the spinlock to be dropped in the
case where dma_fence_signal() is called. This is not the same behavior as the
older 11.2 code: in that case, the lock is held while the callbacks are
invoked. (I *think* this is also the case in the later code in FreeBSD 13.)

I believe that dropping the lock before calling the callbacks opens a race
window, and that this is what leads to the crash. It's difficult to confirm
from the crash stack traces that this is what's happening, but in my analysis
I found that at least sometimes the problem was that something was trying to
dereference a NULL DMA fence pointer.

I patched my copy of the code to remove the spin_unlock() and spin_lock()
calls shown above, and that seemed to fix the problem. The laptop has not
crashed since I did this. I also made the same change to the 12.2-RELEASE
system with the "Cedar" card and exercised it a bit, and that one seemed to
run ok too.
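
For illustration only, the locking behavior after the change can be modeled
in user space. This is a rough sketch, not the linuxkpi code: every toy_*
name is hypothetical, the "lock" is just a flag, and the list is a simple
singly linked one. It is single-threaded, so it does not reproduce the race;
it only checks the invariant that each callback runs with the fence lock
still held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; NOT the linuxkpi dma_fence types. */
struct toy_fence;

struct toy_cb {
	struct toy_cb *next;
	void (*func)(struct toy_fence *, struct toy_cb *);
};

struct toy_fence {
	bool locked;                /* models fence->lock being held */
	struct toy_cb *cb_list;     /* pending callbacks */
	int callbacks_run;          /* total callbacks invoked */
	int callbacks_run_unlocked; /* callbacks that saw the lock dropped */
};

static void toy_lock(struct toy_fence *f)
{
	assert(!f->locked);
	f->locked = true;
}

static void toy_unlock(struct toy_fence *f)
{
	assert(f->locked);
	f->locked = false;
}

/*
 * Models the patched loop: the caller already holds the lock, and the
 * lock is NOT dropped around the callback invocation.
 */
static void toy_signal_locked_sub(struct toy_fence *f)
{
	struct toy_cb *cur;

	while ((cur = f->cb_list) != NULL) {
		f->cb_list = cur->next; /* unlink, like list_del_init() */
		cur->next = NULL;
		cur->func(f, cur);      /* lock stays held here */
	}
}

/* Models dma_fence_signal(): takes the lock itself, then runs the loop. */
static void toy_signal(struct toy_fence *f)
{
	toy_lock(f);
	toy_signal_locked_sub(f);
	toy_unlock(f);
}

/* Callback that records whether the lock was held when it ran. */
static void count_cb(struct toy_fence *f, struct toy_cb *cb)
{
	(void)cb;
	f->callbacks_run++;
	if (!f->locked)
		f->callbacks_run_unlocked++;
}

/*
 * Queues two callbacks, signals the fence, stores the number of callbacks
 * run in *ran, and returns how many ran without the lock held.
 */
int toy_demo(int *ran)
{
	struct toy_cb b = { NULL, count_cb };
	struct toy_cb a = { &b, count_cb };
	struct toy_fence f = { false, &a, 0, 0 };

	toy_signal(&f);
	assert(!f.locked);
	assert(f.cb_list == NULL);
	if (ran != NULL)
		*ran = f.callbacks_run;
	return f.callbacks_run_unlocked;
}
```

Calling toy_demo() from a small main() should report that no callback ran
with the lock dropped. Re-adding an unlock/lock pair around cur->func() in
toy_signal_locked_sub() makes that count nonzero, which corresponds to the
window the real patch is meant to close.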

I have just patched the "Caicos" machine today and so far it's running stable
as well (this is my work machine and this is my first day back at the office
for the new year).

I created a version of the drm-fbsd12.0-kmod port with this change included
as a patch, which can be downloaded from here:

http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz

I will also attach the patch to this PR.

Can someone please test this to see if it fixes the problem for them too?

Note: I happen to have about 3 or 4 extra Radeon cards as spares (I rescued
these from the e-waste bin) and would be happy to send one to a developer if
that would help (assuming they have a machine with a slot that can
accommodate it).

-- 
You are receiving this mail because:
You are on the CC list for the bug.