[Bug 253461] LinuxKPI: [AMD/ATI] RV730 PRO [Radeon HD 4650] crashes kernel
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Sun, 09 Jan 2022 23:30:52 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253461 --- Comment #9 from Bill Paul <noisetube@gmail.com> --- (In reply to Vladimir Kondratyev from comment #7) It looked to me that originally Linux had the dma_fence_signal() API and later a new API dma_fence_signal_locked() was added. According to what I've read, the idea is that dma_fence_signal_locked() can be used if the caller is already holding the DMA fence object spinlock, while the older dma_fence_signal() function takes the lock for you. The question here is: when you signal a dma fence object, and you invoke its attached callout routines, do you hold the spinlock or do you drop it? The older linuxkpi code in drm-fbsd11.2-kmod was based on Linux 4.11 and only had the dma_fence_signal() API, and that code always held the fence spinlock when invoking the callouts. In drm-fbsd12.0-kmod, based on Linux 4.16, both dma_fence_signal() and dma_fence_signal_locked() are present. HOWEVER, the logic is now such that both functions drop the dma fence spinlock when calling the callouts. This changes the behavior of dma_fence_signal(), and I think the change was wrong (though likely unintentional). Now, dma_fence_signal() drops the spinlock when invoking the callouts. This does not seem to harm the Intel i915kms.ko driver, but it seems to cause the radeonkms.ko driver driver to panic when the system is under load. I must assume that dropping the lock leads to a race condition when two different threads try to access the same dma fence object. If you browse the most recent Linux kernel code, you can also see that this behavior is inconsistent with the native Linux implementations of dma_fence_signal() and dma_fence_signal_locked(): https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-fence.c#L376 The dma_fence_signal_timestamp_locked() function shown here is used by both dma_fence_signal() and dma_fence_signal_locked(). dma_fence_signal() takes the fence spinlock before calling it. Note that the fence spinlock is _not_ released when invoking the callbacks. From this I am forced to conclude: - When calling dma_fence_signal(), the fence spinlock is supposed to be held until the function returns, including when the callbacks are called. - When calling dma_fence_signal_locked(), the same is true, except it is the caller that's expected to take the fence spinlock. - The current behavior in drm-fbsd12.0-kmod where the lock is dropped when invoking the callouts is therefore wrong on two counts: it deviates from the Linux behavior, which breaks synchronization in the Radeon driver. I think my fix preserves the expected behavior of both routines, because dma_fence_signal_unlocked() does not call dma_fence_signal_unlocked_sub() with the spinlock held, while dma_fence_signal() does. My office machine with the CAICOS chipset has been running with this fix for a week now and has been stable. I've also been using the same fix on my laptop with the SUMO chipset with the same fix for a bit longer and it also hasn't crashed. Before the laptop would not last more than 5 minutes before it would panic. -Bill -- You are receiving this mail because: You are on the CC list for the bug.