Re: -mcpu= selections and the Windows Dev Kit 2023: example from-scratch buildkernel times (after kernel-toolchain)

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sat, 13 May 2023 08:50:15 UTC
On May 13, 2023, at 01:28, Mark Millard <marklmi@yahoo.com> wrote:

> While the selections were guided by some benchmark-like
> explorations, the results for the Windows Dev Kit 2023
> (abbreviated WDK23 below) were as follows:
> 
> 
> -mcpu=cortex-a72 code generation produced a (non-debug)
> kernel/world that, in turn, got (from scratch buildkernel after
> kernel-toolchain):
> 
> Kernel(s)  GENERIC-NODBG-CA72 built in 597 seconds, ncpu: 8, make -j8
> 
> (The rest of the aarch64 hardware that I have access to is nearly all
> cortex-a72 based, the remainder being cortex-a53 these days. So I was
> seeing how code tailored for the cortex-a72 performed on the WDK23.
> cortex-a72 was my starting point with the WDK23.)
> 
> 
> -mcpu=cortex-x1c+flagm code generation produced a (non-debug)
> kernel/world that, in turn, got (from scratch buildkernel after
> kernel-toolchain):
> 
> Kernel(s)  GENERIC-NODBG-CA78C built in 584 seconds, ncpu: 8, make -j8
> 
> NOTE: "+flagm" is there because various clang/gcc versions have an
> inaccurate default feature set that omits flagm, so I am making sure
> that it is enabled. -mcpu=cortex-a78c is even worse: some toolchains
> enable +fp16fml for it by default, but neither of the 2 types of core
> supports that. (The cortex-x1c and cortex-a78c actually have matching
> features for code generation purposes, at least for all that I looked
> at. Toolchains disagreeing about the default features is sufficient
> evidence of an error in at least one of them, as far as I can tell.)
> 
> This context is implicitly +lse+rcpc . At the time I was not being
> explicit when defaults matched.
> 
> Notes:
> "lse" is the Large System Extensions atomics, disabled below.
> "rcpc" is the extension adding load-acquire instructions with RCpc
> (release consistent, processor consistent) semantics, the LDAPR
> family. (I was explicit about rcpc below, despite the default
> matching.)
> 
> 
> -mcpu=cortex-x1c+flagm+nolse+rcpc code generation produced a
> (non-debug) kernel/world that, in turn, got (from scratch buildkernel
> after kernel-toolchain):
> 
> Kernel(s)  GENERIC-NODBG-CA78CnoLSE built in 415 seconds, ncpu: 8, make -j
> 
> Note: My explorations so far have tried world builds with each
> combination of lse and rcpc status, but with a kernel that was based
> on -mcpu=cortex-x1c+flagm . I then updated the kernel to match
> -mcpu=cortex-x1c+flagm+nolse+rcpc and used it to produce the above.
> So there is more exploring that I have not done yet. But I am not
> expecting times notably below the 415 sec.
> 
> The benchmark-like activity had shown that +lse+rcpc for the
> world/benchmark builds led to notable negative consequences for
> cpus 0..3 compared to the other 3 combinations of status. For
> cpus 4..7, it showed that +nolse+rcpc for the world/benchmark
> builds had a noticeable gain compared to the other 3 combinations.
> This guided the buildkernel testing selections done so far. The
> buildkernel tests were, in part, to make sure that the apparent
> consequences were not just artifacts of the time measurements
> that would make the benchmark result comparisons useless.
> 
> 
> For comparison to a standard FreeBSD non-debug build, I used a
> snapshot download of:
> 
> http://ftp3.freebsd.org/pub/FreeBSD/snapshots/ISO-IMAGES/13.2/FreeBSD-13.2-STABLE-arm64-aarch64-ROCK64-20230504-7dea7445ba44-255298.img.xz
> 
> and dd'd it to media, replaced the EFI/*/* files with ones that
> work for the Windows Dev Kit 2023, booted the WDK23 from the media,
> copied my /usr/*-src/ over to the media, did a "make -j8 kernel-toolchain"
> from the /usr/main-src/ copy, and finally did a "make -j8 buildkernel"
> (so, from scratch, given that the toolchain materials were already in place):
> 
> Kernel(s)  GENERIC built in 505 seconds, ncpu: 8, make -j8
> 
> ( /usr/main-src/ has the source that the other buildkernel timings
> were based on. )
> 
> 
> It looks like -mcpu=cortex-a72 and -mcpu=cortex-x1c+flagm are far from
> a good fit for buildkernel workloads on the WDK23: 597 and 584 seconds
> versus 505 seconds for the FreeBSD defaults. The FreeBSD defaults and
> -mcpu=cortex-x1c+flagm+nolse+rcpc (415 seconds) seem to be better fits
> for such use.
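
To put the four timings on a common scale, here is a minimal sh sketch that
expresses each custom build relative to the 505 second GENERIC run. The
numbers are the ones reported in this message; rel_pct is a hypothetical
helper name used only for illustration, not part of any FreeBSD tool.

```shell
#!/bin/sh
# Relative buildkernel times, using the figures reported in this thread.
# rel_pct is a hypothetical helper name, not an existing utility.
rel_pct() {
    # $1 = measured seconds, $2 = baseline seconds
    # Prints the signed percentage difference from the baseline.
    awk -v t="$1" -v b="$2" 'BEGIN { printf "%+.1f", (t - b) * 100 / b }'
}

baseline=505    # GENERIC, built under the 13.2-STABLE snapshot

for entry in CA72:597 CA78C:584 CA78CnoLSE:415; do
    name=${entry%%:*}    # text before the ':'
    secs=${entry##*:}    # text after the ':'
    printf '%-12s %4s s  %s%% vs GENERIC\n' \
        "$name" "$secs" "$(rel_pct "$secs" "$baseline")"
done
```

By this arithmetic the cortex-a72 and cortex-x1c+flagm builds take roughly
18% and 16% longer than GENERIC, while the +nolse+rcpc build takes roughly
18% less time. These are single runs, so the percentages are only indicative.
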
> 
> 
> Note: This testing was in a ZFS context, using bectl to advantage, in
> case that somehow matters.
> 
> 
> For reference:
> 
> # grep mcpu= /usr/main-src/sys/arm64/conf/GENERIC-NODBG-CA78C
> makeoptions CONF_CFLAGS="-mcpu=cortex-x1c+flagm+nolse+rcpc"
> 
> # grep mcpu= ~/src.configs/*CA78C-nodbg*
> XCFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
> XCXXFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
> ACFLAGS.arm64cpuid.S+=  -mcpu=cortex-x1c
> ACFLAGS.aesv8-armx.S+=  -mcpu=cortex-x1c
> ACFLAGS.ghashv8-armx.S+=        -mcpu=cortex-x1c
> 
> # more /usr/local/etc/poudriere.d/main-CA78C-make.conf
> CFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
> CXXFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
> CPPFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
> RUSTFLAGS_CPU_FEATURES= -C target-cpu=cortex-x1c -C target-feature=+x1c,+flagm,-lse,+rcpc

Note: RUSTFLAGS_CPU_FEATURES is something that I added to my
environment to allow the experiment:

# git -C /usr/ports/ diff Mk/Uses/cargo.mk
diff --git a/Mk/Uses/cargo.mk b/Mk/Uses/cargo.mk
index 50146372fee1..2f21453fd02b 100644
--- a/Mk/Uses/cargo.mk
+++ b/Mk/Uses/cargo.mk
@@ -145,7 +145,9 @@ WITH_LTO=   yes
 .  endif
   # Adjust -C target-cpu if -march/-mcpu is set by bsd.cpu.mk
-.  if ${ARCH} == amd64 || ${ARCH} == i386
+.  if defined(RUSTFLAGS_CPU_FEATURES)
+RUSTFLAGS+=    ${RUSTFLAGS_CPU_FEATURES}
+.  elif ${ARCH} == amd64 || ${ARCH} == i386
 RUSTFLAGS+=    ${CFLAGS:M-march=*:S/-march=/-C target-cpu=/}
 .  elif ${ARCH:Mpowerpc*}
 RUSTFLAGS+=    ${CFLAGS:M-mcpu=*:S/-mcpu=/-C target-cpu=/:S/power/pwr/}
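
Because the patch above passes RUSTFLAGS_CPU_FEATURES through verbatim, its
value has to be kept in step with the -mcpu= setting by hand. A minimal sh
sketch of deriving one from the other follows. mcpu_to_rustflags is a
hypothetical helper, not something in the ports tree, and it assumes that
each clang "+feat" or "+nofeat" modifier corresponds to a rustc target
feature of the same name. That appears to hold for flagm, lse and rcpc, but
it is not guaranteed in general. The value shown earlier also lists +x1c,
which this sketch does not generate.

```shell
#!/bin/sh
# mcpu_to_rustflags is a hypothetical helper, not part of the ports tree.
# Assumption: clang "+feat" / "+nofeat" modifiers map to rustc
# target-feature entries "+feat" / "-feat" with the same feature name.
mcpu_to_rustflags() {
    spec=${1#-mcpu=}        # e.g. cortex-x1c+flagm+nolse+rcpc
    cpu=${spec%%+*}         # text before the first '+'
    mods=
    case $spec in
        *+*) mods=${spec#*+} ;;   # text after the first '+'
    esac

    feats=
    old_ifs=$IFS
    IFS=+                   # split the modifier list on '+'
    for m in $mods; do
        case $m in
            no*) f="-${m#no}" ;;  # "nolse" becomes "-lse"
            *)   f="+$m" ;;       # "rcpc" becomes "+rcpc"
        esac
        feats=${feats:+$feats,}$f
    done
    IFS=$old_ifs

    if [ -n "$feats" ]; then
        printf '%s %s\n' "-C target-cpu=$cpu" "-C target-feature=$feats"
    else
        printf '%s\n' "-C target-cpu=$cpu"
    fi
}

mcpu_to_rustflags -mcpu=cortex-x1c+flagm+nolse+rcpc
```

The output could then be assigned to RUSTFLAGS_CPU_FEATURES in the
poudriere make.conf, after checking the feature names against what the
installed rustc accepts.
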

> diff --git a/secure/lib/libcrypto/Makefile b/secure/lib/libcrypto/Makefile
> index 8fde4f19d046..e13227d6450b 100644
> --- a/secure/lib/libcrypto/Makefile
> +++ b/secure/lib/libcrypto/Makefile
> @@ -22,7 +22,7 @@ SRCS+=        mem.c mem_dbg.c mem_sec.c o_dir.c o_fips.c o_fopen.c o_init.c
> SRCS+= o_str.c o_time.c threads_pthread.c uid.c
> .if defined(ASM_aarch64)
> SRCS+= arm64cpuid.S armcap.c
> -ACFLAGS.arm64cpuid.S=  -march=armv8-a+crypto
> +ACFLAGS.arm64cpuid.S+= -march=armv8-a+crypto
> .elif defined(ASM_amd64)
> SRCS+= x86_64cpuid.S
> .elif defined(ASM_arm)
> @@ -43,7 +43,7 @@ SRCS+=        mem_clr.c
> SRCS+= aes_cbc.c aes_cfb.c aes_ecb.c aes_ige.c aes_misc.c aes_ofb.c aes_wrap.c
> .if defined(ASM_aarch64)
> SRCS+= aes_core.c aesv8-armx.S vpaes-armv8.S
> -ACFLAGS.aesv8-armx.S=  -march=armv8-a+crypto
> +ACFLAGS.aesv8-armx.S+= -march=armv8-a+crypto
> .elif defined(ASM_amd64)
> SRCS+= aes_core.c aesni-mb-x86_64.S aesni-sha1-x86_64.S aesni-sha256-x86_64.S
> SRCS+= aesni-x86_64.S vpaes-x86_64.S
> @@ -278,7 +278,7 @@ SRCS+=      cbc128.c ccm128.c cfb128.c ctr128.c cts128.c gcm128.c ocb128.c
> SRCS+= ofb128.c wrap128.c xts128.c
> .if defined(ASM_aarch64)
> SRCS+= ghashv8-armx.S
> -ACFLAGS.ghashv8-armx.S=        -march=armv8-a+crypto
> +ACFLAGS.ghashv8-armx.S+=       -march=armv8-a+crypto


===
Mark Millard
marklmi at yahoo.com