[Bug 277627] Passthru NVIDIA GeForce GTX 1080 Ti not work

From: <bugzilla-noreply_at_freebsd.org>
Date: Mon, 11 Mar 2024 11:20:05 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277627

            Bug ID: 277627
           Summary: Passthru NVIDIA GeForce GTX 1080 Ti not work
           Product: Base System
           Version: 13.3-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: bhyve
          Assignee: virtualization@FreeBSD.org
          Reporter: bubnovky@gmail.com

I have a Rocky Linux 9.3 guest with an NVIDIA GeForce GTX 1080 Ti passed
through to it.

When the host OS was FreeBSD 13.2, everything worked fine.

On the host:
# pciconf -lv | grep -A 4 ppt
ppt0@pci0:2:0:0:        class=0x030000 rev=0xa1 hdr=0x00 vendor=0x10de device=0x1b06 subvendor=0x10de subdevice=0x120f
    vendor     = 'NVIDIA Corporation'
    device     = 'GP102 [GeForce GTX 1080 Ti]'
    class      = display
    subclass   = VGA
# uname -mbir
13.2-RELEASE amd64 GENERIC 0e4b27630a1402da754c6eb42a22aadad5f80545
# cat /srv/vm/gpu-02/gpu-02.conf 
loader="grub"
cpu=1
memory=8G
wired_memory=yes
network0_type="virtio-net"
network0_switch="public"
disk0_type="virtio-blk"
disk0_name="disk0.img"
passthru0="2/0/0=2:0"
grub_install0="linux /isolinux/vmlinuz inst.vnc inst.graphical inst.gpt"
grub_install1="initrd /isolinux/initrd.img"
grub_run_dir="/grub2"
grub_run_partition=2
uuid="502f749f-dc87-11ee-b5e0-38d54702d5b9"
network0_mac="58:9c:fc:0f:b3:0e"

On guest OS:
# nvidia-smi
Mon Mar 11 13:39:18 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     Off |   00000000:00:02.0 Off |                  N/A |
| 25%   44C    P0             60W /  250W |       0MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
# lspci -vv
00:02.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 120f
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 40
        Region 0: Memory at c2000000 (32-bit, non-prefetchable)
        Region 1: Memory at 800000000 (64-bit, prefetchable)
        Region 3: Memory at c0000000 (64-bit, prefetchable)
        Region 5: I/O ports at 2000
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Kernel driver in use: nvidia
        Kernel modules: nouveau, nvidia_drm, nvidia
# cat /etc/system-release
Rocky Linux release 9.3 (Blue Onyx)
# uname -r
5.14.0-362.18.1.el9_3.0.1.x86_64

After updating the host to FreeBSD 13.3 or 14.0, the NVIDIA driver in the
guest stopped working.

On the host:
# uname -mbir
13.3-RELEASE amd64 GENERIC 35d870d853f0e09b6659ddec3206ae3d975c5a32

On guest OS:
# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
Make sure that the latest NVIDIA driver is installed and running.
# lspci -vv
00:02.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 120f
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Region 0: Memory at c2000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 800000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at c0000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at 2000 [size=128]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x16 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Kernel modules: nouveau, nvidia_drm, nvidia
# journalctl -xe
Mar 11 14:17:43 cuda-12-3.itspr.ru kernel: nvidia-nvlink: Nvlink Core is being initialized, major devi>
Mar 11 14:17:43 cuda-12-3.itspr.ru kernel: NVRM: Can't find an IRQ for your NVIDIA card!
Mar 11 14:17:43 cuda-12-3.itspr.ru kernel: NVRM: Please check your BIOS settings.
Mar 11 14:17:43 cuda-12-3.itspr.ru kernel: NVRM: [Plug & Play OS] should be set to NO
Mar 11 14:17:43 cuda-12-3.itspr.ru kernel: NVRM: [Assign IRQ to VGA] should be set to YES
Mar 11 14:17:43 cuda-12-3.itspr.ru kernel: nvidia: probe of 0000:00:02.0 failed with error -1
Mar 11 14:17:43 cuda-12-3.itspr.ru kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
Mar 11 14:17:43 cuda-12-3.itspr.ru kernel: NVRM: None of the NVIDIA devices were initialized.
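
The NVRM message matches a difference between the two lspci outputs above: the
working run has "Interrupt: pin A routed to IRQ 40" and "Kernel driver in use:
nvidia", while the failing run has neither line. A minimal sketch for pulling
just those two lines in the guest, so the working and failing states can be
compared quickly. The function name irq_summary is hypothetical (it is not part
of pciutils), and 00:02.0 is the guest address of the GPU from this report:

```shell
# Hypothetical helper: read `lspci -vv` output on stdin and print only the
# interrupt routing line and the bound kernel driver line. If neither line
# is present, print a short notice instead of nothing.
irq_summary() {
    grep -E 'Interrupt:|Kernel driver in use:' \
        || echo "no Interrupt or driver line found"
}

# Usage inside the guest (run as root so lspci can read the full config space):
#   lspci -vv -s 00:02.0 | irq_summary
```

On the 13.2 host this should print the two lines quoted above; on 13.3 the
output shown in this report would produce only the notice.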

I hope you can help me.
