nvidia_drv.so/Xorg crashes

From: Craig Leres <leres_at_freebsd.org>
Date: Fri, 25 Jun 2021 02:30:52 UTC
I have four (12.2-RELEASE) systems between the office and home that are 
full or part time FreeBSD desktops. All have PNY NVIDIA Quadro 410s. 
These have been mostly working well for about 6 years.

For months I've been seeing screen corruption when using chrome or 
kicad; firefox and thunderbird are always ok. Just starting eeschema 
always damages the root window a little, and it's common when running 
chrome/kicad to see lines in the console xterm window jump up and down 
by two lines. But for the last week or two Xorg has been crashing:

     [ 74574.029] (EE) Backtrace:
     [ 74574.032] (EE) 0: /usr/local/bin/Xorg (?+0x0) [0x41c98a]
     [ 74574.033] (EE) unw_get_proc_name failed: no unwind info found [-10]
     [ 74574.033] (EE) 1: /lib/libthr.so.3 (?+0x0) [0x800929b7e]
     [ 74574.035] (EE) unw_get_proc_name failed: no unwind info found [-10]
     [ 74574.035] (EE) 2: /lib/libthr.so.3 (?+0x0) [0x80092913f]
     [ 74574.037] (EE) 3: ? (?+0x0) [0x7ffffffff003]
     [ 74574.038] (EE) 4: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x801cc8c20]
     [ 74574.038] (EE)
     [ 74574.038] (EE) Segmentation fault at address 0x50
     [ 74574.038] (EE)
     Fatal server error:
     [ 74574.038] (EE) Caught signal 11 (Segmentation fault). Server aborting

The crashes are always preceded by at least one nvidia "Xid" kernel message:

     Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data fffffffb, ErrorCode 00000004
     Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data fffffffb, ErrorCode 00000004
     Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data ffffffb9, ErrorCode 00000004
     Jun 23 ... kernel: : pid 6327 (Xorg), jid 0, uid 0: exited on signal 6

It's worth noting that it was not unusual to see many Xid ErrorCode 4 
kernel messages without crashes. (And it's the only ErrorCode I've ever 
seen.)
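
To check how often these messages show up relative to the crashes, the 
Xid lines can be tallied from the system log. This is only a rough 
sketch: the regular expression is modeled on the three lines quoted 
above, the function names are made up for illustration, and treating 
the ErrorCode field as hexadecimal is an assumption. Other driver 
versions may format these messages differently.

```python
import re
from collections import Counter

# Pattern modeled on the log lines quoted in this message (assumption:
# other nvidia-driver versions may word the Xid message differently).
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+), pid=(?P<pid>\d+), "
    r".*?Data (?P<data>[0-9a-fA-F]+), ErrorCode (?P<code>[0-9a-fA-F]+)"
)


def parse_xid_lines(lines):
    """Return one dict per line that looks like an NVRM Xid message."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if m is None:
            continue  # not an Xid line, skip it
        events.append({
            "pci": m.group("pci"),
            "xid": int(m.group("xid")),
            "pid": int(m.group("pid")),
            "data": m.group("data").lower(),
            # Assumption: the ErrorCode field is hexadecimal.
            "error_code": int(m.group("code"), 16),
        })
    return events


def summarize(events):
    """Count events per (Xid number, ErrorCode) pair."""
    return Counter((e["xid"], e["error_code"]) for e in events)
```

Something like summarize(parse_xid_lines(open("/var/log/messages"))) 
would then give the counts, assuming kernel messages are logged to 
/var/log/messages on the machine in question.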

My first thought was a bad nvidia-driver version. But after working my 
way, one by one, down to 460.39 (circa February 2021 -- months before 
the first crashes) I gave up on that theory.

My next guess was bad hardware, but I swapped Quadros between two 
systems and the crashes persisted.

Yesterday Xorg crashed often enough for me to zero in on the trigger: 
it's the use of tvtwm's f.forcemove action (which is like f.move but 
allows moving a window off the screen) when I move a window slightly 
off the bottom of the screen. Here's the .twmrc binding I use:

     Button2 = m s   : window        : f.forcemove

The crash doesn't happen 100% of the time but it's pretty easy to 
trigger with half a dozen windows open. Just grab a window and randomly 
dip part of it past the bottom of the screen. So my new theory is that 
a frame buffer operation in one of the libraries in the path between 
Xorg and the nvidia driver has regressed and is now asking the nvidia 
driver to do something it handles badly (hence the Xid errors and the 
segfault in nvidia_drv.so).

I run a custom version of tvtwm but was able to easily crash Xorg using 
x11-wm/twm on a spare quadro 410 workstation; the key is f.forcemove.

Does anybody know what this issue is? What are likely candidates among 
recently changed port libraries that I could try downgrading? Should I 
try opening a ticket with nvidia? Should I try even older 460.XX 
drivers? What else can I try? (Thanks for reading this far!)
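
For narrowing down the recently changed libraries, one option is to 
sort installed packages by install time. The sketch below is untested 
against a real system and assumes the input is the output of 
pkg query '%t %n-%v', i.e. an install timestamp in epoch seconds 
followed by name-version on each line. The function name and the 
cutoff value are hypothetical.

```python
def recent_packages(lines, since_epoch):
    """Return (timestamp, package) pairs installed at or after since_epoch.

    Assumes each input line looks like "<epoch seconds> <name>-<version>",
    as produced by: pkg query '%t %n-%v'
    """
    hits = []
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip anything that doesn't match the assumed format
        ts = int(parts[0])
        if ts >= since_epoch:
            hits.append((ts, parts[1].strip()))
    # Newest first, so the most recent changes are at the top.
    return sorted(hits, reverse=True)
```

Passing a cutoff from shortly before the first crashes would give a 
short list of candidates to try downgrading.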

		Craig