[Bug 270966] PCI passthru stops working after ~30 guest reboots (ivhd, ILLEGAL CMD, IO_PAGE_FAULT)

From: <bugzilla-noreply_at_freebsd.org>
Date: Sun, 27 Aug 2023 15:48:50 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270966

Santiago Martinez <sm@codenetworks.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sm@codenetworks.net

--- Comment #16 from Santiago Martinez <sm@codenetworks.net> ---
Hi Raul, 

I'm seeing the same issue on an AMD EPYC processor. Looking at kernel.org
(Linux), it seems they also had issues with AMD-Vi. In the Linux world, many
people use iommu=pt to work around this. It is also a known bug in the Red Hat
KB.

I'm running a script similar to yours, and the server behaves quite erratically.

My script does the following:

- It starts and stops a VM with a PCI passthrough device 200 times (in this
  case an SR-IOV VF, but the same happens without SR-IOV, or with any other
  device, including non-network ones).
- After those 200 cycles, it reboots the server.
- When the server comes back up, it runs the script again.
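
For illustration only, the cycle above could be sketched in Python roughly as
follows. This is not the script actually used: run_cycles, command, and the
three callables are hypothetical names, and the real start, stop, and reboot
commands are left for the caller to supply. Counting failed cycles instead of
aborting is also an assumption.

```python
import subprocess
from typing import Callable, Sequence


def command(argv: Sequence[str]) -> Callable[[], None]:
    """Wrap an external command so it can be passed to run_cycles().

    The argv contents are supplied by the caller; nothing here assumes
    particular hypervisor tools or flags.
    """
    def run() -> None:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(list(argv), check=True)
    return run


def run_cycles(start_vm: Callable[[], None],
               stop_vm: Callable[[], None],
               reboot_host: Callable[[], None],
               cycles: int = 200) -> int:
    """Start and stop the guest `cycles` times, then reboot the host.

    Returns the number of cycles in which both start and stop succeeded.
    A failed cycle is reported and skipped (assumption: keep going so the
    full run is always attempted before the reboot).
    """
    completed = 0
    for i in range(1, cycles + 1):
        try:
            start_vm()
            stop_vm()
        except (subprocess.CalledProcessError, OSError) as exc:
            print(f"cycle {i}/{cycles} failed: {exc}")
            continue
        completed += 1
    reboot_host()
    return completed
```

Re-running the loop after each boot would be handled outside this sketch, by
whatever boot-time hook the host provides.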

Sometimes the script can start and stop the VM 200 times, even though I see
IVHD errors (command not completed or cmd error). Other times it can only
start and stop the VM once, and the server reboots after a few IO_PAGE_FAULT
events (something gets corrupted, the NVMe stops responding, and the machine
reboots after the command retry timeout).

The server showing the issue is a Supermicro H12SSW-NT.
- AMD EPYC 7552 48-Core Processor

I have updated the BIOS to the latest release, since people on the Linux forum
mentioned issues with SP3.

Michael Dexter and I also tried to replicate it on other AMD processors,
without success:
- AMD EPYC 7702P 64-Core Processor
- AMD Ryzen 7 3700X 8-Core Processor 
- Ryzen 6800H
