Re: git: 80e4ac2964a1 - main - Work around VNET and DPCPU related panics on aarch64

From: Mark Millard <marklmi_at_yahoo.com>
Date: Mon, 24 Jul 2023 13:12:59 UTC
Dimitry Andric <dim_at_FreeBSD.org> wrote on
Date: Mon, 24 Jul 2023 11:06:54 UTC :

> On 24 Jul 2023, at 00:48, Jessica Clarke <jrtc27@freebsd.org> wrote:
> > 
> > On 23 Jul 2023, at 23:38, Dimitry Andric <dim@FreeBSD.org> wrote:
> >> 
> >> The branch main has been updated by dim:
> >> 
> >> URL: https://cgit.FreeBSD.org/src/commit/?id=80e4ac2964a11edef456a15b77e43aadeaf273a2
> >> 
> >> commit 80e4ac2964a11edef456a15b77e43aadeaf273a2
> >> Author: Dimitry Andric <dim@FreeBSD.org>
> >> AuthorDate: 2023-07-23 13:48:36 +0000
> >> Commit: Dimitry Andric <dim@FreeBSD.org>
> >> CommitDate: 2023-07-23 22:35:04 +0000
> >> 
> >> Work around VNET and DPCPU related panics on aarch64
> >> 
> >> lld >= 14 and recent GNU ld can relax adrp+add and adrp+ldr
> >> instructions, which breaks VNET and DPCPU when used in modules.
> > 
> > Thanks for committing the workaround.
> > 
> > This will need some kind of EN for 13.2 given LLVM 14 was merged in
> > time for that and arm64 is a Tier 1 architecture in 13.
> > 
> > There perhaps also needs to be some serious thought into our testing
> > and release procedures given we allowed a Tier 1 architecture to have
> > VNET and DPCPU be totally broken in a point release for any kernel
> > module. Especially when the bug was known, open against -CURRENT and
> > triaged all before the MFC to stable/13; there needs to be better
> > tracking of toolchain release blockers.
> 
> I agree, but how many users does this affect? I am only a sporadic user
> of aarch64 builds, but I never saw any panics, and apparently our CI
> builders also do not see these.
> 
> So how often do these panics actually occur? Must you specifically use a
> VNET feature to encounter them?

Prior to this change, if I build with -mcpu=cortex-a72, a linker
relaxation happens such that:

# kldload if_epair.ko

panics. At the stage of .o generation, the difference is a minor
variation in instruction ordering: -mcpu=cortex-a72 puts 2
relevant instructions next to each other, where other builds
have some other code between them. The back-to-back pair ends
up being relaxed by the linker.

Use of -mcpu=cortex-a53 does not put that instruction pair
together: so no panic. Note that both builds are compatible
with both processor types here. The -mcpu=cortex-a72 code
panics on both; the -mcpu=cortex-a53 code panics on neither.
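
As a rough illustration of the adjacency condition described above,
the following Python sketch scans disassembly-style text for an adrp
that is immediately followed by an add or ldr using the same base
register. The input format, the name find_adjacent_pairs, and the two
sample listings are hypothetical (they are not taken from the actual
if_epair.ko builds), and a real linker checks more than adjacency
before relaxing, so treat this only as a sketch:

```python
import re

# Hypothetical, simplified disassembly format: "<mnemonic> <operands>".
ADRP_RE = re.compile(r"^\s*adrp\s+(x\d+)\s*,")
# add/ldr whose first source (base) register is followed by another operand.
USE_RE = re.compile(r"^\s*(add|ldr)\s+[wx]\d+\s*,\s*\[?\s*(x\d+)\s*,")

def find_adjacent_pairs(lines):
    """Return indices i where line i is an adrp and line i+1 is an
    add/ldr whose base register is the adrp destination register."""
    hits = []
    for i in range(len(lines) - 1):
        m = ADRP_RE.match(lines[i])
        if not m:
            continue
        u = USE_RE.match(lines[i + 1])
        if u and u.group(2) == m.group(1):
            hits.append(i)
    return hits

# Made-up listings: the same 3 instructions in two different orders.
back_to_back = [
    "adrp x8, sym",
    "add  x8, x8, :lo12:sym",
    "ldr  x9, [x19]",
]
interleaved = [
    "adrp x8, sym",
    "ldr  x9, [x19]",
    "add  x8, x8, :lo12:sym",
]

print(find_adjacent_pairs(back_to_back))  # [0]
print(find_adjacent_pairs(interleaved))   # []
```

Only the first ordering presents a back-to-back pair; the second
computes the same address but has unrelated code between the 2
instructions.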

-mcpu=cortex-x1c allows newer instruction set additions. But
its code generation did not put the 2 instructions together
either: no panics. The Windows Dev Kit system that has the
A78C/X1C cores does, however, panic with the -mcpu=cortex-a72
code.

So: Minor code generation variations lead to the panics.

I'll note that I discovered the panics via kyua runs with
-mcpu=cortex-a72 based builds on the various systems. I'd
expect that the kyua runs should not lead to panics.

I do not see "how many users does this affect?" as the major
point of the reliability problem here. Future code generation
changes for the same options could run into the same problem.
Future changes in which compiler options are used could have
the same sort of result.

The known VNET and DPCPU usage contexts are sufficient to have
the problem. Even if there turns out to be a 3rd affected usage
context as well, the VNET/DPCPU uses would still be a problem.

In my view, depending on accidentally avoiding some minor
variations in the code generation is not appropriate. Use of
--no-relax (for now) at least appears to
systematically avoid the dependency on the minor variations
that happen to matter here, whatever its other consequences.
(More linker activity than necessary may well be disabled.) I
have re-enabled my -mcpu=cortex-a72 use and tested the result:
no panics.

===
Mark Millard
marklmi at yahoo.com