Re: Xen-4.16.0 + FreeBSD-13.1 dom0 fails on large AMD64 system

From: Roger Pau Monné <roger.pau_at_citrix.com>
Date: Sat, 13 Apr 2024 10:40:15 UTC
Hello,

Thanks for the report.

On Fri, Apr 12, 2024 at 10:57:32PM -0700, Brian Buhrow wrote:
> 	Hello.  I'm trying to load xen on a large AMD64 server, with 600G of RAM and 56 CPUs.  I'm
> using a xen-4.16 image with FreeBSD-13.1, in fact a duplicate of the image on a couple of other
> machines I have running in production, which have been running without issues for a couple
> of years.  I think I'm running into some defined limits, in the sense that the machine
> has more of something than xen is configured for.  Unfortunately, poring through the source
> doesn't clue me into exactly what's wrong, except I think it has something to do with mmio
> allocations.  Can someone take a look at the boot message below and give a clue as to what
> might be wrong?  Do I need to recompile xen with some new limits, turn something on or off in
> BIOS or change some parameter on the xen command line?

Would it be possible for you to test with the latest version of Xen in
the ports tree (currently 4.18.2.20240411)?

It would also be helpful if you could boot a debug Xen kernel instead
of the release one. To do so, set the following in
/boot/loader.conf:

xen_kernel="/boot/xen-debug"

> Or, is this something that requires a
> newer version of Xen?  Any thoughts would be greatly appreciated.
> 
> The error is: Failed to identity map [ffffffffc7ffb, ffffffffc7ffb] for d0: -22
> 
> that's EINVAL, from sys/errno.h, suggesting something is out of range.

As a first guess, I think there's a PCI device with a BAR positioned at
0xffffffffc7ffb, which is outside the physical address range supported
by Xen's EPT. That address needs 52 bits, while the EPT implementation
in Xen only supports physical addresses up to 48 bits wide.
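
To make the width argument concrete, here is a minimal Python sketch of
the arithmetic. It takes the value from the error message at face value,
as the paragraph above does, and compares it against the 48-bit limit
stated there. The helper name fits_in_phys_bits is made up for
illustration and is not part of Xen or FreeBSD.

```python
# Hypothetical helper (not a Xen API): True if addr can be expressed
# with the given number of physical address bits.
def fits_in_phys_bits(addr: int, bits: int) -> bool:
    return addr < (1 << bits)

# Value copied from "Failed to identity map [ffffffffc7ffb, ffffffffc7ffb]".
addr = 0xffffffffc7ffb

print(addr.bit_length())            # 52: thirteen hex digits, top one 0xf
print(fits_in_phys_bits(addr, 48))  # False: above the 48-bit boundary
print(fits_in_phys_bits(addr, 52))  # True
```

Any value with more than twelve significant hex digits fails the 48-bit
check, which is why this BAR cannot be identity mapped.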

Is there any option in the BIOS/firmware that you can set to prevent
it from positioning BARs above the 48-bit boundary?

The error itself is somewhat expected; however, the page fault (and
Xen crash) that you hit afterwards is definitely not. AFAICT the page
fault is likely fixed by Xen commit:

465217b0f872 vPCI: account for hidden devices

That commit is present in 4.18. Please test with that version and
report back.

Thanks, Roger.