-mcpu= selections and the Windows Dev Kit 2023: example from-scratch buildkernel times (after kernel-toolchain)

From: Mark Millard <marklmi_at_yahoo.com>
Date: Sat, 13 May 2023 08:28:18 UTC
While the -mcpu= selections were guided by some benchmark-like
explorations, the from-scratch buildkernel results for the
Windows Dev Kit 2023 (abbreviated WDK23 below) go like:


-mcpu=cortex-a72 code generation produced a (non-debug)
kernel/world that, in turn, got (from scratch buildkernel after
kernel-toolchain):

Kernel(s)  GENERIC-NODBG-CA72 built in 597 seconds, ncpu: 8, make -j8

(The rest of the aarch64 hardware that I have access to is nearly
all cortex-a72 based these days, the others being cortex-a53. So I
was seeing how code tailored for the cortex-a72 performed on the
WDK23. cortex-a72 was my starting point with the WDK23.)


-mcpu=cortex-x1c+flagm code generation produced a (non-debug)
kernel/world that, in turn, got (from scratch buildkernel after
kernel-toolchain):

Kernel(s)  GENERIC-NODBG-CA78C built in 584 seconds, ncpu: 8, make -j8

NOTE: "+flagm" is there because various clang/gcc versions have an
inaccurate set of default features that omits flagm, so I'm making
sure that I've got it enabled. -mcpu=cortex-a78c is even worse: some
toolchains enable +fp16fml for it by default, but neither of the 2
types of core supports that. (The cortex-x1c and cortex-a78c actually
have matching features for code generation purposes, at least for all
that I looked at. Toolchains disagreeing about the default features is
sufficient evidence of an error in at least one of them, as far as I
can tell.)

This -mcpu= value is implicitly +lse+rcpc . At the time I was not
being explicit where the defaults already matched.

Notes:
"lse" is the Large System Extensions atomics, disabled below.
"rcpc" is the extension adding load-acquire instructions with the
weaker RCpc (release consistent, processor consistent) ordering,
such as LDAPR. (rcpc I was explicit about below, despite the
default matching.)
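
One way to check what a given toolchain enables for a -mcpu= value
is to look at its predefined macros. The following is a minimal
sketch, not something from my testing: lse_status is a hypothetical
helper name, and it assumes the compiler follows the ACLE convention
of defining __ARM_FEATURE_ATOMICS when LSE atomics are enabled.

```shell
#!/bin/sh
# Hypothetical helper: reads a compiler's predefined macros on stdin
# and reports whether LSE atomics appear to be enabled.
# Assumption: the compiler defines __ARM_FEATURE_ATOMICS for +lse.
lse_status() {
    if grep -q '^#define __ARM_FEATURE_ATOMICS '; then
        echo lse
    else
        echo nolse
    fi
}
```

On a system with an aarch64-targeting cc, it might be used as
"cc -mcpu=cortex-x1c+flagm -dM -E -x c /dev/null | lse_status",
repeating with +nolse added to compare.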


-mcpu=cortex-x1c+flagm+nolse+rcpc code generation produced a
(non-debug) kernel/world that, in turn, got (from scratch buildkernel
after kernel-toolchain):

Kernel(s)  GENERIC-NODBG-CA78CnoLSE built in 415 seconds, ncpu: 8, make -j

Note: My explorations so far tried the 4 combinations of lse and
rcpc status for the world, but with a kernel that was based on
-mcpu=cortex-x1c+flagm . I then updated the kernel to match
-mcpu=cortex-x1c+flagm+nolse+rcpc and used that to produce the above.
So there is more exploring that I've not done yet. But I'm not
expecting times notably below the 415 sec.

The benchmark-like activity had shown that +lse+rcpc for the
world/benchmark builds led to notable negative consequences for
cpus 0..3 compared to the other 3 combinations of status. For
cpus 4..7, it showed that +nolse+rcpc for the world/benchmark
builds had a noticeable gain compared to the other 3 combinations.
This guided the buildkernel testing selections done so far. The
buildkernel tests were, in part, to check that the apparent
consequences were real and not just oddities of the time
measurements that would make the benchmark comparisons useless.


For comparison to a standard FreeBSD non-debug build, I used a
snapshot download of:

http://ftp3.freebsd.org/pub/FreeBSD/snapshots/ISO-IMAGES/13.2/FreeBSD-13.2-STABLE-arm64-aarch64-ROCK64-20230504-7dea7445ba44-255298.img.xz

and dd'd it to media, replaced the EFI/*/* with ones that
work for the Windows Dev Kit 2023, booted the WDK23 with the media,
copied over my /usr/*-src/ to the media, did a "make -j8 kernel-toolchain"
from the /usr/main-src/ copy, and finally did a "make -j8 buildkernel"
(so, from-scratch, given that the toolchain materials were already in place):

Kernel(s)  GENERIC built in 505 seconds, ncpu: 8, make -j8

( /usr/main-src/ has the source that the other buildkernel timings
were based on. )
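
The "built in N seconds" summary lines quoted in this message can be
pulled out of saved build output for side-by-side comparison. A
minimal sketch, assuming the output was captured to a file;
kernel_build_times and the log file name are hypothetical, not part
of the FreeBSD build system:

```shell
#!/bin/sh
# Hypothetical helper: for each buildkernel summary line on stdin,
# print "<kernel config name> <seconds>". Other lines are ignored.
kernel_build_times() {
    sed -n 's/^Kernel(s)  *\([^ ]*\) built in \([0-9][0-9]*\) seconds.*/\1 \2/p'
}
```

For example, after "make -j8 buildkernel 2>&1 | tee buildkernel.log",
running "kernel_build_times < buildkernel.log" would list each kernel
config with its time.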


Looks like -mcpu=cortex-a72 (597 sec) and -mcpu=cortex-x1c+flagm
(584 sec) are far from a good fit for buildkernel workloads to run
under on the WDK23. FreeBSD defaults (505 sec) and
-mcpu=cortex-x1c+flagm+nolse+rcpc (415 sec) seem to be better fits
for such use.
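
To put the times on a common scale, the relative change against the
stock GENERIC build (505 sec) can be computed. This is only a sketch
of the arithmetic using the figures reported above; rel_pct is a
hypothetical helper name.

```shell
#!/bin/sh
# Relative change of a build time versus a baseline, in percent.
# Usage: rel_pct SECONDS BASELINE_SECONDS
rel_pct() {
    awk -v t="$1" -v b="$2" 'BEGIN { printf "%+.1f\n", (t - b) / b * 100 }'
}

# Times reported in this message, compared to GENERIC at 505 sec.
for entry in "cortex-a72:597" \
             "cortex-x1c+flagm:584" \
             "cortex-x1c+flagm+nolse+rcpc:415"; do
    name=${entry%:*}     # text before the last ':'
    secs=${entry##*:}    # text after the last ':'
    printf '%-30s %4s s  %s%%\n' "$name" "$secs" "$(rel_pct "$secs" 505)"
done
```

This works out to roughly +18.2% and +15.6% for the first two and
-17.8% for the +nolse+rcpc build. These are single measurements, so
small differences should not be over-read.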


Note: This testing was in a ZFS context, using bectl to advantage, in
case that somehow matters.


For reference:

# grep mcpu= /usr/main-src/sys/arm64/conf/GENERIC-NODBG-CA78C
makeoptions CONF_CFLAGS="-mcpu=cortex-x1c+flagm+nolse+rcpc"

# grep mcpu= ~/src.configs/*CA78C-nodbg*
XCFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
XCXXFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
ACFLAGS.arm64cpuid.S+=  -mcpu=cortex-x1c
ACFLAGS.aesv8-armx.S+=  -mcpu=cortex-x1c
ACFLAGS.ghashv8-armx.S+=        -mcpu=cortex-x1c

# more /usr/local/etc/poudriere.d/main-CA78C-make.conf
CFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
CXXFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
CPPFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
RUSTFLAGS_CPU_FEATURES= -C target-cpu=cortex-x1c -C target-feature=+x1c,+flagm,-lse,+rcpc

The libcrypto Makefile change below switches the ACFLAGS.*.S
assignments from = to += so that the ACFLAGS.*.S+= -mcpu= lines
shown above are appended to instead of being overwritten:

diff --git a/secure/lib/libcrypto/Makefile b/secure/lib/libcrypto/Makefile
index 8fde4f19d046..e13227d6450b 100644
--- a/secure/lib/libcrypto/Makefile
+++ b/secure/lib/libcrypto/Makefile
@@ -22,7 +22,7 @@ SRCS+=        mem.c mem_dbg.c mem_sec.c o_dir.c o_fips.c o_fopen.c o_init.c
 SRCS+= o_str.c o_time.c threads_pthread.c uid.c
 .if defined(ASM_aarch64)
 SRCS+= arm64cpuid.S armcap.c
-ACFLAGS.arm64cpuid.S=  -march=armv8-a+crypto
+ACFLAGS.arm64cpuid.S+= -march=armv8-a+crypto
 .elif defined(ASM_amd64)
 SRCS+= x86_64cpuid.S
 .elif defined(ASM_arm)
@@ -43,7 +43,7 @@ SRCS+=        mem_clr.c
 SRCS+= aes_cbc.c aes_cfb.c aes_ecb.c aes_ige.c aes_misc.c aes_ofb.c aes_wrap.c
 .if defined(ASM_aarch64)
 SRCS+= aes_core.c aesv8-armx.S vpaes-armv8.S
-ACFLAGS.aesv8-armx.S=  -march=armv8-a+crypto
+ACFLAGS.aesv8-armx.S+= -march=armv8-a+crypto
 .elif defined(ASM_amd64)
 SRCS+= aes_core.c aesni-mb-x86_64.S aesni-sha1-x86_64.S aesni-sha256-x86_64.S
 SRCS+= aesni-x86_64.S vpaes-x86_64.S
@@ -278,7 +278,7 @@ SRCS+=      cbc128.c ccm128.c cfb128.c ctr128.c cts128.c gcm128.c ocb128.c
 SRCS+= ofb128.c wrap128.c xts128.c
 .if defined(ASM_aarch64)
 SRCS+= ghashv8-armx.S
-ACFLAGS.ghashv8-armx.S=        -march=armv8-a+crypto
+ACFLAGS.ghashv8-armx.S+=       -march=armv8-a+crypto

===
Mark Millard
marklmi at yahoo.com