[Bug 237544] graphics/drm-fbsd12.0-kmod: panic on 12-STABLE with Radeon HD 7450 (but not with drm-fbsd11.2-kmod)
Date: Thu, 30 Dec 2021 20:55:13 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=237544

--- Comment #11 from Bill Paul <noisetube@gmail.com> ---
So, since I'm off work this week and don't have much else to do, I decided to try isolating the actual problem here. Now that I have a known working set of code (drm-fbsd11.2-kmod), I thought I could compare it to the non-working code (drm-fbsd12.0-kmod) and gradually bisect things to narrow down the fault.

After much hair-pulling and gnashing of teeth, I finally narrowed things down to the dma-fence module in the linuxkpi code. Here's what I tried:

- Replaced the contents of the drivers/gpu/drm/radeon directory in drm-fbsd12.0-kmod with the contents of the radeon directory from drm-fbsd11.2-kmod
  - Result: no change, panic still occurred
- Replaced the contents of the drivers/gpu/drm/ttm directory in drm-fbsd12.0-kmod with the contents of the ttm directory from drm-fbsd11.2-kmod (as well as the associated header files)
  - Result: no change, panic still occurred
- Replaced the contents of the linuxkpi and drivers/gpu/drm/ttm directories in drm-fbsd12.0-kmod with the contents of the linuxkpi and ttm directories from drm-fbsd11.2-kmod (as well as the associated header files)
  - Result: no panic
- Replaced _just_ the contents of the linuxkpi directory in drm-fbsd12.0-kmod with the contents of the linuxkpi directory from drm-fbsd11.2-kmod (this time taking care to preserve the ttm module; they are somewhat tightly coupled, so this took a bit more effort)
  - Result: no panic
- Replaced _just_ the dma-fence.h and linux_dmafence.c modules in the linuxkpi directory in drm-fbsd12.0-kmod with the ones from drm-fbsd11.2-kmod, and also tweaked linux_sync_file.c a little (it uses an API from the 12.0 code which isn't in the 11.2 code)
  - Result: no panic

I'm still not exactly sure what's wrong here, but there seems to be a problem in the dma-fence module with locking and/or reference counting that causes fence structures to be deleted unexpectedly.
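To make the suspected failure mode a little more concrete, here is a minimal userspace sketch in C. It is a hypothetical mock, not the linuxkpi code: struct dma_fence here holds only a plain int reference count and a signaled flag (the real structure uses a kref, a spinlock and RCU), and the two compat shims only mirror the wrapper and macro described as items 2) and 3) of the tarball description further down. The point it illustrates is that dma_fence_get_rcu() refuses to hand out a reference once the count has reached zero, so an unbalanced dma_fence_put() "frees" a fence that someone may still be pointing at.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace mock of a dma_fence. The real linuxkpi
 * structure uses a struct kref, a spinlock and RCU; a plain int
 * stands in for the reference count here.
 */
struct dma_fence {
	int refcount;	/* 0 means the mock treats the fence as freed */
	bool signaled;
};

/* Like dma_fence_get_rcu(): take a reference unless the count is already 0. */
static struct dma_fence *
dma_fence_get_rcu(struct dma_fence *fence)
{
	if (fence == NULL || fence->refcount == 0)
		return NULL;
	fence->refcount++;
	return fence;
}

/* Drop a reference; putting a fence that has none left is a bug. */
static void
dma_fence_put(struct dma_fence *fence)
{
	assert(fence->refcount > 0);
	fence->refcount--;
	/* Real code would release the fence when the count reaches 0. */
}

static bool
dma_fence_is_signaled(struct dma_fence *fence)
{
	return fence->signaled;
}

/* Compat wrapper: just calls dma_fence_get_rcu() on the current pointer. */
static struct dma_fence *
dma_fence_get_rcu_safe(struct dma_fence **fencep)
{
	return dma_fence_get_rcu(*fencep);
}

/* Compat macro: the "locked" variant just calls dma_fence_is_signaled(). */
#define dma_fence_is_signaled_locked(f) dma_fence_is_signaled(f)
```

With two holders (refcount 2), a buggy path that calls dma_fence_put() twice on behalf of the same holder brings the count to 0, and the mock then treats the fence as freed even though the second holder still has a pointer to it. The mock ignores concurrency entirely; note also that the Linux dma_fence_get_rcu_safe() re-reads *fencep after taking the reference and retries if it changed, a re-check the one-line compat wrapper does not perform.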
Those unexpectedly freed fences are what lead to the traps on bad pointers.

I created a custom tarball of the drm-fbsd12.0-kmod port which includes patches to the FreeBSDDesktop 4.16 code to revert the dma-fence code as described above. You can download it from here:

http://people.freebsd.org/~wpaul/radeon/drm-fbsd12.0-kmod.tar.gz

The specific things I did are:

1) Replaced dma-fence.h and linux_dmafence.c in the drm-fbsd12.0-kmod port with the versions from drm-fbsd11.2-kmod.
2) Added a compat wrapper function in dma-fence.h for dma_fence_get_rcu_safe() which just calls dma_fence_get_rcu().
3) Added a compat macro in dma-fence.h for dma_fence_is_signaled_locked() which just calls dma_fence_is_signaled().
4) In linux_sync_file.c, changed the sync_fill_fence_info() function back to how it looked in the 11.2 codebase, because it uses dma_fence_get_status() and DMA_FENCE_FLAG_TIMESTAMP_BIT, which were not available in the older 11.2 dma-fence code.

Just unpack the tarball under /usr/ports/graphics in place of the old one, then run "make", followed by "make deinstall" and "make reinstall".

It occurred to me that instead of taking the older 11.2 dma-fence module and porting it forward, it might make more sense to take the 13.0 module and port it back. But this assumes that the drm-fbsd13.0-kmod code doesn't have the same stability problem as drm-fbsd12.0-kmod, and I don't know if that's true. (So far nobody has said whether they're using a Radeon card with 13.0, or whether they've encountered the same problems.) I may still try this anyway if I'm still sufficiently bored.

So far I've tested this on two devices:

vgapci0@pci0:1:0:0:  class=0x030000 card=0x21261028 chip=0x68f91002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Cedar [Radeon HD 5000/6000/7350/8350 Series]'
    class      = display
    subclass   = VGA

vgapci0@pci0:0:1:0:  class=0x030000 card=0x168b103c chip=0x96481002 rev=0x00 hdr=0x00
    vendor     = 'Advanced Micro Devices, Inc. [AMD/ATI]'
    device     = 'Sumo [Radeon HD 6480G]'
    class      = display
    subclass   = VGA

I'm using the machine with the CEDAR device right now. The laptop with the SUMO device is much more prone to crashing. Usually what I do to provoke it is:

- Boot and load the driver
- Plug in my phone and set up tethering over USB
- Start KDE5
- Start Firefox
- Browse Facebook or Reddit for a while

It usually panics within a few minutes.

Lastly, I have a question: I followed up on this particular PR because it seemed to most closely match the problems I was having, but it's been closed. Should I open a new PR? This bug is still present in 12.3, and I'm clearly not the only one affected by it. (I also still can't explain why it doesn't seem to affect the i915kms driver.)

-- 
You are receiving this mail because:
You are the assignee for the bug.