Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))
Date: Mon, 07 Mar 2022 19:04:09 UTC
On Mon, Mar 07, 2022 at 10:03:51AM -0800, Mark Millard wrote:
> 
> 
> On 2022-Mar-7, at 08:45, Mark Johnston <markj@FreeBSD.org> wrote:
> 
> > On Mon, Mar 07, 2022 at 04:25:22PM +0000, Andrew Turner wrote:
> >> 
> >>> On 7 Mar 2022, at 15:13, Mark Johnston <markj@freebsd.org> wrote:
> >>> ...
> >>> A (the?) problem is that the compiler is treating "pc" as an alias
> >>> for x18, but the rmlock code assumes that the pcpu pointer is loaded
> >>> once, as it dereferences "pc" outside of the critical section.  On
> >>> arm64, if a context switch occurs between the store at _rm_rlock+144
> >>> and the load at +152, and the thread is migrated to another CPU, then
> >>> we'll end up using the wrong CPU ID in the rm->rm_writecpus test.
> >>> 
> >>> I suspect the problem is unique to arm64 as its get_pcpu()
> >>> implementation is different from the others in that it doesn't use
> >>> volatile-qualified inline assembly.  This has been the case since
> >>> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
> >>> .
> >>> 
> >>> I haven't been able to reproduce any crashes running poudriere in an
> >>> arm64 AWS instance, though.  Could you please try the patch below and
> >>> confirm whether it fixes your panics?  I verified that the apparent
> >>> problem described above is gone with the patch.
> >> 
> >> Alternatively (or additionally) we could do something like the
> >> following.  There are only a few MI users of get_pcpu, with the main
> >> place being in rm locks.
> >> 
> >> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
> >> index 09f6361c651c..59b890e5c2ea 100644
> >> --- a/sys/arm64/include/pcpu.h
> >> +++ b/sys/arm64/include/pcpu.h
> >> @@ -58,7 +58,14 @@ struct pcpu;
> >>  
> >>  register struct pcpu *pcpup __asm ("x18");
> >>  
> >> -#define	get_pcpu()	pcpup
> >> +static inline struct pcpu *
> >> +get_pcpu(void)
> >> +{
> >> +	struct pcpu *pcpu;
> >> +
> >> +	__asm __volatile("mov %0, x18" : "=&r"(pcpu));
> >> +	return (pcpu);
> >> +}
> >>  
> >>  static inline struct thread *
> >>  get_curthread(void)
> > 
> > Indeed, I think this is probably the best solution.

Thinking a bit more, even with that patch, code like this may not behave
the same on arm64 as on other platforms:

	critical_enter();
	ptr = &PCPU_GET(foo);
	critical_exit();
	bar = *ptr;

since as far as I can see the compiler may translate it to

	critical_enter();
	critical_exit();
	bar = PCPU_GET(foo);

> Is this just partially reverting:
> 
> https://cgit.freebsd.org/src/commit/?id=63c858a04d56
> 
> If so, there might need to be comments about why the updated
> code is as it will be.
> 
> Looks like stable/13 picked up sensitivity to the get_pcpu
> details in rmlock in:
> 
> https://cgit.freebsd.org/src/commit/?h=stable/13&id=543157870da5
> 
> (a 2022-03-04 commit) and stable/13 also has the get_pcpu
> misdefinition in:
> 
> https://cgit.freebsd.org/src/commit/sys/arm64/include/pcpu.h?h=stable/13&id=63c858a04d56
> 
> .  So an MFC would be appropriate in order for aarch64
> to be reliable for any variations in get_pcpu in stable/13
> (and for 13.1 to be so as well).

I reverted the rmlock commit in stable/13 already.  Either get_pcpu()
will be fixed shortly or 13.1 will ship without the rmlock commit.