Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: <pcib1 memory window> request" leads to panic

From: Mark Millard <marklmi_at_yahoo.com>
Date: Tue, 13 Feb 2024 01:57:54 UTC
On Feb 12, 2024, at 16:36, Mark Millard <marklmi@yahoo.com> wrote:

> On Feb 12, 2024, at 16:10, Mark Millard <marklmi@yahoo.com> wrote:
> 
>> On Feb 12, 2024, at 12:00, Mark Millard <marklmi@yahoo.com> wrote:
>> 
>>> [Gack: I was looking at the wrong vintage of source code, predating
>>> your changes: wrong system used.]
>>> 
>>> 
>>> On Feb 12, 2024, at 10:41, Mark Millard <marklmi@yahoo.com> wrote:
>>> 
>>>> On Feb 12, 2024, at 09:32, John Baldwin <jhb@freebsd.org> wrote:
>>>> 
>>>>> On 2/9/24 8:13 PM, Mark Millard wrote:
>>>>>> Summary:
>>>>>> pcib0: <BCM2838-compatible PCI-express controller> mem 0x7d500000-0x7d50930f irq 80,81 on simplebus2
>>>>>> pcib0: parsing FDT for ECAM0:
>>>>>> pcib0:  PCI addr: 0xc0000000, CPU addr: 0x600000000, Size: 0x40000000
>>>>>> . . .
>>>>>> rman_manage_region: <pcib1 memory window> request: start 0x600000000, end 0x6000fffff
>>>>>> panic: Failed to add resource to rman
>>>>> 
>>>>> Hmmm, I suspect this is due to the way that bus_translate_resource works which is
>>>>> fundamentally broken.  It rewrites the start address of a resource in-situ instead
>>>>> of keeping downstream resources separate from the upstream resources.   For example,
>>>>> I don't see how you could ever release a resource in this design without completely
>>>>> screwing up your rman.  That is, I expect trying to detach a PCI device behind a
>>>>> translating bridge that uses the current approach should corrupt the allocated
>>>>> resource ranges in an rman long before my changes.
>>>>> 
>>>>> That said, that doesn't really explain the panic.  Hmm, the panic might be because
>>>>> for PCI bridge windows the driver now passes RF_ACTIVE and the bus_translate_resource
>>>>> hack only kicks in the activate_resource method of pci_host_generic.c.
>>>>> 
>>>>>> Detail:
>>>>>> . . .
>>>>>> pcib0: <BCM2838-compatible PCI-express controller> mem 0x7d500000-0x7d50930f irq 80,81 on simplebus2
>>>>>> pcib0: parsing FDT for ECAM0:
>>>>>> pcib0: PCI addr: 0xc0000000, CPU addr: 0x600000000, Size: 0x40000000
>>>>> 
>>>>> This indicates this is a translating bus.
>>>>> 
>>>>>> pcib1: <PCI-PCI bridge> irq 91 at device 0.0 on pci0
>>>>>> rman_manage_region: <pcib1 bus numbers> request: start 0x1, end 0x1
>>>>>> pcib0: rman_reserve_resource: start=0xc0000000, end=0xc00fffff, count=0x100000
>>>>>> rman_reserve_resource_bound: <PCIe Memory> request: [0xc0000000, 0xc00fffff], length 0x100000, flags 102, device pcib1
>>>>>> rman_reserve_resource_bound: trying 0xffffffff <0xc0000000,0xfffff>
>>>>>> considering [0xc0000000, 0xffffffff]
>>>>>> truncated region: [0xc0000000, 0xc00fffff]; size 0x100000 (requested 0x100000)
>>>>>> candidate region: [0xc0000000, 0xc00fffff], size 0x100000
>>>>>> allocating from the beginning
>>>>>> rman_manage_region: <pcib1 memory window> request: start 0x600000000, end 0x6000fffff
>> 
>> What you later typed does not match:
>> 
>> 0x600000000
>> 0x6000fffff
>> 
>> You later typed:
>> 
>> 0x60000000
>> 0x600fffffff
>> 
>> This seems to have lead to some confusion from using the
>> wrong figure(s).
>> 
>>>>> The fact that we are trying to reserve the CPU addresses in the rman is because
>>>>> bus_translate_resource rewrote the start address in the resource after it was allocated.
>>>>> 
>>>>> That said, I can't see why rman_manage_region would actually fail.  At this point the
>>>>> rman is empty (this is the first call to rman_manage_region for "pcib1 memory window"),
>>>>> so only the check that should be failing are the checks against rm_start and
>>>>> rm_end.  For the memory window, rm_start is always 0, and rm_end is always
>>>>> 0xffffffff, so both the old (0xc00000000 - 0xc00fffff) and new (0x60000000 - 0x600fffffff)
>>>>> ranges are within those bounds.
>> 
>> No:
>> 
>> 0xffffffff
>> 
>> .vs (actual):
>> 
>> 0x600000000
>> 0x6000fffff
>> 
>> Both the actuals are larger then the 0xffffffff figure you list above.
>> 
>> I've no clue if the 0xffffffff might have its own typos.
>> 
>>>>> I would instead expect to see some other issue later
>>>>> on where we fail to allocate a resource for a child BAR, but I wouldn't expect
>>>>> rman_manage_region to fail.
>>>>> 
>>>>> Logging the return value from rman_manage_region would be the first step I think
>>>>> to see which error value it is returning.
>>>> 
>>>> Looking at the code in sys/kern/subr_rman.c for rman_manage_region I see
>>>> the (mostly) return rv related logic only has ENONMEM (explicit return) and
>>>> EBUSY as non-0 possibilities (no other returns).
>>> 
>>> The modern code also has EINVAL via an explicit return.
>>> 
>>>> All the rv references and
>>>> all the returns are shown below:
>>>> 
>>>>     int rv = 0;
>>>> . . .
>>> 
>>> The modern code also has here:
>>> 
>>>      DPRINTF(("rman_manage_region: <%s> request: start %#jx, end %#jx\n",
>>>          rm->rm_descr, start, end));
>>>      if (start < rm->rm_start || end > rm->rm_end)
>>>              return EINVAL;
>>> 
>>> Adding rm->rm_start and rm->rm_end to the message might be
>>> appropriate --and would also be enough additional information
>>> to know if EINVAL would be returned.
>> 
>> I added such code and built, installed, and tried booting
>> a separate kernel. The result was:
>> 
>> rman_manage_region: <pcib1 memory window> range 0..0xffffffff : request: start 0x600000000, end 0x6000fffff
>> panic: Failed to add resource to rman
>> 
>> So:
>> 
>> 0
>> vs. start:
>> 0x600000000
>> 
>> and:
>> 
>> 0xffffffff
>> vs. end:
>> 0x6000fffff
>> 
>> The 0x600000000..0x6000fffff range is not a range of RAM addresses
>> [too large for that for the 8 GiByte RPi4B] and, so, is well above
>> the 32 bit boundary too.
>> 
>> That is the only line referencing "memory window". A 32 bit initial range
>> for some reason. For reference, the source change in question was:
>> 
>> -       DPRINTF(("rman_manage_region: <%s> request: start %#jx, end %#jx\n",
>> -           rm->rm_descr, start, end));
>> +       DPRINTF(("rman_manage_region: <%s> range %#jx..%#jx : request: start %#jx, end %#jx\n",
>> +           rm->rm_descr, rm->rm_start, rm->rm_end, start, end));
>>       if (start < rm->rm_start || end > rm->rm_end)
>>               return EINVAL;
>> 
>> I'll show a larger boot output context at the bottom of the message
>> for reference.
>> 
>>>>     r = int_alloc_resource(M_NOWAIT);
>>>>     if (r == NULL)
>>>>             return ENOMEM;
>>>> . . .
>>>>             /* Check for any overlap with the current region. */
>>>>             if (r->r_start <= s->r_end && r->r_end >= s->r_start) {
>>>>                     rv = EBUSY;
>>>>                     goto out;
>>>>             }
>>>> 
>>>>             /* Check for any overlap with the next region. */
>>>>             t = TAILQ_NEXT(s, r_link);
>>>>             if (t && r->r_start <= t->r_end && r->r_end >= t->r_start) {
>>>>                     rv = EBUSY;
>>>>                     goto out;
>>>>             }
>>>> . . .
>>>> out:
>>>>     mtx_unlock(rm->rm_mtx);
>>>>     return rv;
>>>> 
>>>> int_alloc_resource failure would be failure (NULL) from the:
>>>> 
>>>>     struct resource_i *r;
>>>> 
>>>>     r = malloc(sizeof *r, M_RMAN, malloc_flag | M_ZERO);
>>>> 
>>>> (associated with the M_NOWAIT argument).
>>>> 
>>>> The malloc failure would likely go in a very different direction.
>>>> 
>>>> 
>>>> 
>>>> Side note: looks like the EBUSY cases leak what r references.
>>> 
>>> Still true in the newer code.
>>> 
>>>>> Probably I should fix pci_host_generic.c to handle translation properly however.
>>>>> I can work on a patch for that.
>>>> 
>> 
>> For reference:
>> 
>> . . .
>> pcib0: <BCM2838-compatible PCI-express controller> mem 0x7d500000-0x7d50930f irq 80,81 on simplebus2
>> pcib0: parsing FDT for ECAM0:
>> pcib0:  PCI addr: 0xc0000000, CPU addr: 0x600000000, Size: 0x40000000
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0:  PCI addr: 0x0, CPU addr: 0x0, Size: 0x0
>> pcib0: Bus is not cache-coherent
>> rman_reserve_resource_bound: <I/O memory addresses> request: [0xfd500000, 0xfd50930f], length 0x9310, flags 100, device pcib0
>> rman_reserve_resource_bound: trying 0x1fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x1fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x31bfffff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x33298fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x39bf1fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x39c02fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x39c06fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x39c07fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x39c08fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x39c2afff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x39c36fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x39c37fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x3b03ffff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x3b04ffff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x3b2fffff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x3ee61fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x3ee63fff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0x3fffffff <0xfd500000,0x930f>
>> rman_reserve_resource_bound: tried 0xfbffffff <0xfd500000,0x930f>
>> considering [0xfc000000, 0xfd5d1fff]
>> truncated region: [0xfd500000, 0xfd50930f]; size 0x9310 (requested 0x9310)
>> candidate region: [0xfd500000, 0xfd50930f], size 0x9310
>> splitting region in three parts: [0xfc000000, 0xfd4fffff]; [0xfd500000, 0xfd50930f]; [0xfd509310, 0xfd5d1fff]
>> rman_manage_region: <PCIe Memory> range 0..0xffffffffffffffff : request: start 0xc0000000, end 0xffffffff
>> pcib0: hardware identifies as revision 0x304.
>> pcib0: note: reported link speed is 5.0 GT/s.
>> rman_reserve_resource_bound: <Interrupts> request: [0x51, 0x51], length 0x1, flags 0, device pcib0
>> rman_reserve_resource_bound: trying 0 <0x51,0>
>> rman_reserve_resource_bound: tried 0 <0x51,0>
>> rman_reserve_resource_bound: tried 0x1 <0x51,0>
>> rman_reserve_resource_bound: tried 0x2 <0x51,0>
>> rman_reserve_resource_bound: tried 0x3 <0x51,0>
>> rman_reserve_resource_bound: tried 0x4 <0x51,0>
>> rman_reserve_resource_bound: tried 0x5 <0x51,0>
>> rman_reserve_resource_bound: tried 0x6 <0x51,0>
>> rman_reserve_resource_bound: tried 0x7 <0x51,0>
>> rman_reserve_resource_bound: tried 0xc <0x51,0>
>> rman_reserve_resource_bound: tried 0xd <0x51,0>
>> rman_reserve_resource_bound: tried 0xe <0x51,0>
>> rman_reserve_resource_bound: tried 0xf <0x51,0>
>> rman_reserve_resource_bound: tried 0x10 <0x51,0>
>> rman_reserve_resource_bound: tried 0x11 <0x51,0>
>> rman_reserve_resource_bound: tried 0x12 <0x51,0>
>> rman_reserve_resource_bound: tried 0x17 <0x51,0>
>> rman_reserve_resource_bound: tried 0x18 <0x51,0>
>> rman_reserve_resource_bound: tried 0x1a <0x51,0>
>> rman_reserve_resource_bound: tried 0x1b <0x51,0>
>> rman_reserve_resource_bound: tried 0x22 <0x51,0>
>> rman_reserve_resource_bound: tried 0x23 <0x51,0>
>> rman_reserve_resource_bound: tried 0x24 <0x51,0>
>> rman_reserve_resource_bound: tried 0x25 <0x51,0>
>> rman_reserve_resource_bound: tried 0x26 <0x51,0>
>> rman_reserve_resource_bound: tried 0x27 <0x51,0>
>> rman_reserve_resource_bound: tried 0x28 <0x51,0>
>> rman_reserve_resource_bound: tried 0x29 <0x51,0>
>> rman_reserve_resource_bound: tried 0x4e <0x51,0>
>> rman_reserve_resource_bound: tried 0x4f <0x51,0>
>> considering [0x50, 0xffffffffffffffff]
>> truncated region: [0x51, 0x51]; size 0x1 (requested 0x1)
>> candidate region: [0x51, 0x51], size 0x1
>> splitting region in three parts: [0x50, 0x50]; [0x51, 0x51]; [0x52, 0xffffffffffffffff]
>> pci0: <OFW PCI bus> on pcib0
>> rman_manage_region: <PCI domain 0 bus numbers> range 0..0xff : request: start 0, end 0xff
>> rman_reserve_resource_bound: <PCI domain 0 bus numbers> request: [0, 0], length 0x1, flags 0, device pci0
>> rman_reserve_resource_bound: trying 0xff <0,0>
>> considering [0, 0xff]
>> truncated region: [0, 0]; size 0x1 (requested 0x1)
>> candidate region: [0, 0], size 0x1
>> allocating from the beginning
>> pci0: domain=0, physical bus=0
>> found-> vendor=0x14e4, dev=0x2711, revid=0x00
>>       domain=0, bus=0, slot=0, func=0
>>       class=06-04-00, hdrtype=0x01, mfdev=0
>>       cmdreg=0x0000, statreg=0x0010, cachelnsz=0 (dwords)
>>       lattimer=0x00 (0 ns), mingnt=0x00 (0 ns), maxlat=0x00 (0 ns)
>>       intpin=a, irq=0
>>       powerspec 3  supports D0 D3  current D0
>>       secbus=1, subbus=1
>> rman_reserve_resource_bound: <PCI domain 0 bus numbers> request: [0x1, 0x1], length 0x1, flags 0, device (null)
>> rman_reserve_resource_bound: trying 0 <0x1,0>
>> rman_reserve_resource_bound: tried 0 <0x1,0>
>> considering [0x1, 0xff]
>> truncated region: [0x1, 0x1]; size 0x1 (requested 0x1)
>> candidate region: [0x1, 0x1], size 0x1
>> allocating from the beginning
>> pcib1: <PCI-PCI bridge> irq 91 at device 0.0 on pci0
>> rman_manage_region: <pcib1 bus numbers> range 0..0xff : request: start 0x1, end 0x1
>> pcib0: rman_reserve_resource: start=0xc0000000, end=0xc00fffff, count=0x100000
>> rman_reserve_resource_bound: <PCIe Memory> request: [0xc0000000, 0xc00fffff], length 0x100000, flags 102, device pcib1
>> rman_reserve_resource_bound: trying 0xffffffff <0xc0000000,0xfffff>
>> considering [0xc0000000, 0xffffffff]
>> truncated region: [0xc0000000, 0xc00fffff]; size 0x100000 (requested 0x100000)
>> candidate region: [0xc0000000, 0xc00fffff], size 0x100000
>> allocating from the beginning
>> rman_manage_region: <pcib1 memory window> range 0..0xffffffff : request: start 0x600000000, end 0x6000fffff
>> panic: Failed to add resource to rman
>> cpuid = 0
>> time = 1
>> KDB: stack backtrace:
>> db_trace_self() at db_trace_self
>> db_trace_self_wrapper() at db_trace_self_wrapper+0x38
>> vpanic() at vpanic+0x1a4
>> panic() at panic+0x48
>> pcib_add_window_resources() at pcib_add_window_resources+0xf4
>> pcib_alloc_window() at pcib_alloc_window+0xfc
>> pcib_attach_common() at pcib_attach_common+0xa18
>> pcib_attach() at pcib_attach+0x14
>> device_attach() at device_attach+0x3fc
>> device_probe_and_attach() at device_probe_and_attach+0x80
>> bus_generic_attach() at bus_generic_attach+0x1c
>> pci_attach() at pci_attach+0xec
>> device_attach() at device_attach+0x3fc
>> device_probe_and_attach() at device_probe_and_attach+0x80
>> bus_generic_attach() at bus_generic_attach+0x1c
>> device_attach() at device_attach+0x3fc
>> device_probe_and_attach() at device_probe_and_attach+0x80
>> bus_generic_new_pass() at bus_generic_new_pass+0x100
>> bus_generic_new_pass() at bus_generic_new_pass+0xb0
>> bus_generic_new_pass() at bus_generic_new_pass+0xb0
>> bus_generic_new_pass() at bus_generic_new_pass+0xb0
>> bus_set_pass() at bus_set_pass+0x50
>> mi_startup() at mi_startup+0x1e0
>> virtdone() at virtdone+0x68
>> KDB: enter: panic
>> [ thread pid 0 tid 100000 ]
>> Stopped at      kdb_enter+0x4c: str     xzr, [x19, #1280]
>> db> 
>> 
> 
> FYI:
> 
> bcm2711-peripherals.pdf "Figure 1. BCM2711 Address Maps"
> documents the PCIe address space range as (their "_"
> notation):
> 
> 0x6_0000_0000..0x7_FFFF_FFFF
> 
> for both of the address maps:
> 
> 'Full 35-bit Address Map'
> 'ARM view of the Address Map in "Low Peripheral" mode'
> 
> The Low Peripheral mode is what is in use.
> 
> So the 0x6_0000_0000..0x6_000f_ffff range looks
> to be a valid ARM system address space for the
> PCIe to map to.

It looks to me like in sys/dev/pci/pci_pci.c the:

static void
pcib_probe_windows(struct pcib_softc *sc)
{
. . .
        pcib_alloc_window(sc, &sc->mem, SYS_RES_MEMORY, 0, 0xffffffff);
. . .

is just inappropriately restrictive about where in the system
address space a PCIe can validly be mapped to on the high end.
That, in turn, leads to the rejection on the RPi4B now that
the range use is checked.

More: Even the size (0xffffffff-0)+1 is smaller than the size of
the range that the BCM2711 documentation indicates as reserved
for PCIe (not that the fdt indicates to use the upper half of
that range at all):

(0x7_FFFF_FFFF - 0x6_0000_0000)+1 == 0x1_FFFF_FFFF + 1

(I'm not so sure that this matters for the RPi4B. It
is not a great example of the generality of PCIe
uses. But might there be contexts where it does matter?)

The lower bound being 0 allows for the actual low
bound. But I wonder if use of the fdt (or ACPI)
information should be possible to then indicate a
actual low bound to check against instead. (A
smaller high bound may be reasonable, at least
for a RPi4B like context.)

===
Mark Millard
marklmi at yahoo.com