Re: nvidia_drv.so/Xorg crashes

From: Fernando_Apesteguía <fernando.apesteguia_at_gmail.com>
Date: Fri, 25 Jun 2021 05:54:25 UTC
On Fri, Jun 25, 2021 at 4:31 AM Craig Leres <leres@freebsd.org> wrote:
>
> I have four (12.2-RELEASE) systems between the office and home that are
> full or part time FreeBSD desktops. All have pny nvidia quadro 410's.
> These have been mostly working well for about 6 years.
>
> For months I've started seeing screen corruption when using chrome or
> kicad; firefox and thunderbird are always ok. But just starting eeschema
> always damages the root window a little. And it's common when running
> chrome/kicad to see lines in the console xterm window jump up and down
> two lines. But for the last week or two Xorg has been crashing:
>
>      [ 74574.029] (EE) Backtrace:
>      [ 74574.032] (EE) 0: /usr/local/bin/Xorg (?+0x0) [0x41c98a]
>      [ 74574.033] (EE) unw_get_proc_name failed: no unwind info found [-10]
>      [ 74574.033] (EE) 1: /lib/libthr.so.3 (?+0x0) [0x800929b7e]
>      [ 74574.035] (EE) unw_get_proc_name failed: no unwind info found [-10]
>      [ 74574.035] (EE) 2: /lib/libthr.so.3 (?+0x0) [0x80092913f]
>      [ 74574.037] (EE) 3: ? (?+0x0) [0x7ffffffff003]
>      [ 74574.038] (EE) 4: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x801cc8c20]
>      [ 74574.038] (EE)
>      [ 74574.038] (EE) Segmentation fault at address 0x50
>      [ 74574.038] (EE)
>      Fatal server error:
>      [ 74574.038] (EE) Caught signal 11 (Segmentation fault). Server aborting
>
> The crashes are always preceded by at least one nvidia "Xid" kernel message:
>
>      Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data fffffffb, ErrorCode 00000004
>      Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data fffffffb, ErrorCode 00000004
>      Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data ffffffb9, ErrorCode 00000004
>      Jun 23 ... kernel: : pid 6327 (Xorg), jid 0, uid 0: exited on signal 6
>
> Worth noting is that it was not unusual to see many Xid ErrorCode 4
> kernel messages without crashes. (And it's the only ErrorCode I've ever
> seen.)
>
> My first thought was bad nvidia-driver version. But after working my
> way, one by one, down to 460.39 (circa February 2021 -- months before
> the first crashes) I gave up on that theory.
>
> My next guess was bad hardware, but I swapped Quadros between two
> systems and the crashes persisted.
>
> Yesterday Xorg crashed often enough for me to zero in on the trigger:
> it's the use of tvtwm's f.forcemove action (which is like f.move but
> allows moving a window off the screen) when I move a window slightly off
> the bottom of the screen. Here's the .twmrc binding I use:
>
>      Button2 = m s   : window        : f.forcemove
>
> The crash doesn't happen 100% of the time but it's pretty easy to
> trigger with half a dozen windows open. Just grab a window and randomly
> dip part of it past the bottom of the screen. So my new theory is a
> frame buffer operation in one of the libraries in the path between Xorg and
> the nvidia driver has regressed and is asking the nvidia driver to do
> something that causes it to do something bad.
>
> I run a custom version of tvtwm but was able to easily crash Xorg using
> x11-wm/twm on a spare quadro 410 workstation; the key is f.forcemove.
>
> Does anybody know what this issue is? What are likely candidates of
> recently changed port libraries that I could try downgrading? Should I
> try opening a ticket with nvidia? Should I try even older 460.XX
> drivers? What else can I try? (Thanks for reading this far!)

Long shot, but the libglvnd update affected x11/nvidia-driver. Have a
look at the UPDATING entry dated 20210617.
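
If it saves some scrolling, here is a minimal sketch for pulling that entry out of the file. It assumes each UPDATING entry starts with a date line such as `20210617:` in column 0 and runs until the next such line; the function name `updating_entries` is made up for illustration and is not part of any ports tool.

```python
import re

# Assumption: an entry header is a line consisting of an 8-digit date and a colon.
HEADER = re.compile(r"^(\d{8}):\s*$")


def updating_entries(text, date):
    """Return every entry in UPDATING-style text whose header is '<date>:'."""
    entries = []
    current = None  # lines of the entry being collected, or None if skipping
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            # A new header closes whatever entry was being collected.
            if current is not None:
                entries.append("\n".join(current).rstrip())
            current = [line] if match.group(1) == date else None
        elif current is not None:
            current.append(line)
    if current is not None:
        entries.append("\n".join(current).rstrip())
    return entries
```

To use it, read the UPDATING file from your ports tree (commonly /usr/ports/UPDATING, but that path is an assumption about your setup), pass its contents with the date "20210617", and print whatever comes back. More than one entry can share a date, so the function returns a list.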

HTH

>
>                 Craig
>