[Bug 284743] System reproducibly livelocks after a couple of hours in poudriere bulk -a

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 04 Mar 2025 00:11:49 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=284743

--- Comment #8 from Mitchell Horne <mhorne@freebsd.org> ---
I am going to take an informed guess that this might be a bug in OpenSBI.

The version provided by sysutils/opensbi sat at v1.4 for some time. A quick
look through the commit log for the IPI code since that version yields one
interesting candidate:

commit be9752a071475ae1d9e58a2dfcb8e83185fb7ae5
Author: Samuel Holland <samuel.holland@sifive.com>
Date:   Fri Oct 25 11:59:46 2024 -0700

    lib: sbi_ipi: Make .ipi_clear always target the current hart

    All existing users of this operation target the current hart, and it
    seems unlikely that a future user will need to clear the pending IPI
    status of a remote hart. Simplify the logic by changing .ipi_clear (and
    its wrapper sbi_ipi_raw_clear()) to always operate on the current hart.

    This incidentally fixes a bug introduced in commit 78c667b6fc07 ("lib:
    sbi: Prefer hartindex over hartid in IPI framework"), which changed the
    .ipi_clear parameter from a hartid to a hart index, but failed to update
    the warm_init functions to match.

    Fixes: 78c667b6fc07 ("lib: sbi: Prefer hartindex over hartid in IPI framework")
    Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
    Reviewed-by: Anup Patel <anup@brainfault.org>

A bug in clearing the IPI status, when multiple harts are attempting an IPI
broadcast concurrently, might explain the livelock we are seeing. I did not
inspect the implementation to verify this.
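
To illustrate the class of bug the commit message describes, here is a toy
model of a clear routine that expects a hart index but is handed a hart ID.
This is a sketch only, not OpenSBI code: the names (ipi_pending, ipi_clear,
warm_init_buggy, warm_init_fixed) and the assumed layout where hart IDs 1..4
map to indexes 0..3 are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_HARTS 4

/* Assumed layout: hart IDs and hart indexes differ by one. */
static const unsigned hartid_of_index[NUM_HARTS] = {1, 2, 3, 4};

/* Pending-IPI flags, indexed by hart INDEX (not hart ID). */
static bool ipi_pending[NUM_HARTS];

static void set_all_pending(void)
{
    for (unsigned i = 0; i < NUM_HARTS; i++)
        ipi_pending[i] = true;
}

/* Translate a hart ID to its index in this toy table. */
static unsigned index_of_hartid(unsigned hartid)
{
    for (unsigned i = 0; i < NUM_HARTS; i++)
        if (hartid_of_index[i] == hartid)
            return i;
    assert(0 && "unknown hart ID");
    return 0;
}

/* Clears the pending flag of the hart with the given INDEX. */
static void ipi_clear(unsigned hartindex)
{
    if (hartindex < NUM_HARTS)
        ipi_pending[hartindex] = false;
}

/* Mismatch: passes a hart ID where an index is expected. */
static void warm_init_buggy(unsigned hartid)
{
    ipi_clear(hartid);
}

/* Matching call: converts the hart ID to an index first. */
static void warm_init_fixed(unsigned hartid)
{
    ipi_clear(index_of_hartid(hartid));
}
```

In this model, calling warm_init_buggy(1) with every flag set leaves the
caller's own flag (index 0) pending and clears the flag belonging to hart ID 2
instead. Whether the real defect behaves this way on the affected hardware is
not something I have checked.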

Notably, the buggy commit (78c667b6fc07) was present in the v1.4 release, but
this fix was not.

I updated the sysutils/opensbi port to v1.6 last week, and the dependent
u-boot ports were bumped. So, I suggest you update your firmware, keep running
things the usual way, and if the livelocks continue to manifest, report back
here.
