Re: -mcpu= selections and the Windows Dev Kit 2023: example from-scratch buildkernel times (after kernel-toolchain)
Date: Sat, 13 May 2023 19:49:46 UTC
On May 13, 2023, at 01:50, Mark Millard <marklmi@yahoo.com> wrote:

> On May 13, 2023, at 01:28, Mark Millard <marklmi@yahoo.com> wrote:
>
>> While the selections were guided by some benchmark-like
>> explorations, the results for the Windows Dev Kit 2023
>> (WDK23 abbreviation) go like:
>>
>>
>> -mcpu=cortex-a72 code generation produced a (non-debug)
>> kernel/world that, in turn, got (from-scratch buildkernel after
>> kernel-toolchain):
>>
>> Kernel(s) GENERIC-NODBG-CA72 built in 597 seconds, ncpu: 8, make -j8
>>
>> (The rest of the aarch64 hardware that I have access to is nearly all
>> cortex-a72 based, the others being cortex-a53 these days. So I was
>> seeing how code tailored for the cortex-a72 context performed on the
>> WDK23. cortex-a72 was my starting point with the WDK23.)
>>
>>
>> -mcpu=cortex-x1c+flagm code generation produced a (non-debug)
>> kernel/world that, in turn, got (from-scratch buildkernel after
>> kernel-toolchain):
>>
>> Kernel(s) GENERIC-NODBG-CA78C built in 584 seconds, ncpu: 8, make -j8
>>
>> NOTE: "+flagm" is there because various clang/gcc versions have an
>> inaccurate default feature set that omits flagm --and I'm making sure
>> I've got it enabled. -mcpu=cortex-a78c is even worse: some toolchains
>> enable +fp16fml by default for it --but neither of the 2 types of core
>> has support for such. (The cortex-x1c and cortex-a78c actually have
>> matching features for code generation purposes, at least for all that
>> I looked at. Toolchain mismatches for default features are sufficient
>> evidence of an error in at least one case, as far as I can tell.)
>>
>> This context is implicitly +lse+rcpc . At the time I was not being
>> explicit when defaults matched.
>>
>> Notes:
>> "lse" is the Large System Extension atomics, disabled below.
>> "rcpc" is the extension adding the (weaker-ordering) RCpc
>> load-acquire instructions. (rcpc I was explicit about below, despite
>> the default matching.)
>>
>>
>> -mcpu=cortex-x1c+flagm+nolse+rcpc code generation produced a
>> (non-debug) kernel/world that, in turn, got (from-scratch buildkernel
>> after kernel-toolchain):
>>
>> Kernel(s) GENERIC-NODBG-CA78CnoLSE built in 415 seconds, ncpu: 8, make -j8
>>
>> Note: My explorations so far have tried the world combinations of
>> lse and rcpc status, but with a kernel that was based on
>> -mcpu=cortex-x1c+flagm . I then updated the kernel to match
>> -mcpu=cortex-x1c+flagm+nolse+rcpc and used it to produce the above.
>> So there is more exploring that I've not done yet. But I'm not
>> expecting decreases to notably below the 415 sec.
>>
>> The benchmark-like activity had shown that +lse+rcpc for the
>> world/benchmark builds leads to notable negative consequences for
>> cpus 0..3 compared to the other 3 combinations of status. For
>> cpus 4..7, it showed that +nolse+rcpc for the world/benchmark
>> builds had a noticeable gain compared to the other 3 combinations.
>> This guided the buildkernel testing selections done so far. The
>> buildkernel tests were, in part, to be sure that the apparent
>> consequences were not just odd artifacts of the time measurements
>> that could mess up the usefulness of benchmark result comparisons.
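
[Aside, in case it helps with interpreting the above: the code
generation difference behind +nolse and +rcpc is easy to see with a
tiny test. This is only an illustrative sketch, not part of the
measurements; the file path is arbitrary and instruction selection
details vary by clang version.]

cat > /tmp/atomics_demo.c <<'EOF'
#include <stdatomic.h>

int bump(_Atomic int *p) {
    /* relaxed read-modify-write */
    return atomic_fetch_add_explicit(p, 1, memory_order_relaxed);
}

int load_acq(_Atomic int *p) {
    /* acquire load */
    return atomic_load_explicit(p, memory_order_acquire);
}
EOF

# With lse available, the fetch_add should come out as a single ldadd:
cc -O2 -S -o - -mcpu=cortex-x1c+flagm /tmp/atomics_demo.c | grep -E 'ldadd|ldxr|stxr'

# With +nolse (and inline atomics forced), it becomes an ldxr/stxr retry loop:
cc -O2 -S -o - -mcpu=cortex-x1c+flagm+nolse -mno-outline-atomics \
    /tmp/atomics_demo.c | grep -E 'ldadd|ldxr|stxr'

# With +rcpc, the acquire load may use ldapr instead of the stronger ldar
# (whether it actually does depends on the clang version):
cc -O2 -S -o - -mcpu=cortex-x1c+flagm+nolse+rcpc /tmp/atomics_demo.c | grep -E 'ldapr|ldar'
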
>>
>>
>> For comparison to a standard FreeBSD non-debug build, I used a
>> snapshot download of:
>>
>> http://ftp3.freebsd.org/pub/FreeBSD/snapshots/ISO-IMAGES/13.2/FreeBSD-13.2-STABLE-arm64-aarch64-ROCK64-20230504-7dea7445ba44-255298.img.xz
>>
>> and dd'd it to media, replaced the EFI/*/* with ones that
>> work for the Windows Dev Kit 2023, booted the WDK23 with the media,
>> copied over my /usr/*-src/ to the media, did a "make -j8 kernel-toolchain"
>> from the /usr/main-src/ copy, and finally did a "make -j8 buildkernel"
>> (so, from scratch, given that the toolchain materials are already in place):
>>
>> Kernel(s) GENERIC built in 505 seconds, ncpu: 8, make -j8
>>
>> ( /usr/main-src/ has the source that the other buildkernel timings
>> were based on. )
>>
>>
>> Looks like -mcpu=cortex-a72 and -mcpu=cortex-x1c+flagm are far from
>> a good fit for buildkernel workloads run under them on the WDK23. The
>> FreeBSD defaults and -mcpu=cortex-x1c+flagm+nolse+rcpc seem to be
>> better fits for such use.
>>
>>
>> Note: This testing was in a ZFS context, using bectl to advantage, in
>> case that somehow matters.
>>
>>
>> For reference:
>>
>> # grep mcpu= /usr/main-src/sys/arm64/conf/GENERIC-NODBG-CA78C
>> makeoptions     CONF_CFLAGS="-mcpu=cortex-x1c+flagm+nolse+rcpc"
>>
>> # grep mcpu= ~/src.configs/*CA78C-nodbg*
>> XCFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
>> XCXXFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
>> ACFLAGS.arm64cpuid.S+= -mcpu=cortex-x1c
>> ACFLAGS.aesv8-armx.S+= -mcpu=cortex-x1c
>> ACFLAGS.ghashv8-armx.S+= -mcpu=cortex-x1c
>>
>> # more /usr/local/etc/poudriere.d/main-CA78C-make.conf
>> CFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
>> CXXFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
>> CPPFLAGS+= -mcpu=cortex-x1c+flagm+nolse+rcpc
>> RUSTFLAGS_CPU_FEATURES= -C target-cpu=cortex-x1c -C target-feature=+x1c,+flagm,-lse,+rcpc
>
> Note: RUSTFLAGS_CPU_FEATURES is something that I added to my
> environment to allow the experiment:
>
> # git -C /usr/ports/ diff Mk/Uses/cargo.mk
> diff --git a/Mk/Uses/cargo.mk b/Mk/Uses/cargo.mk
> index 50146372fee1..2f21453fd02b 100644
> --- a/Mk/Uses/cargo.mk
> +++ b/Mk/Uses/cargo.mk
> @@ -145,7 +145,9 @@ WITH_LTO= yes
>  . endif
>
>  # Adjust -C target-cpu if -march/-mcpu is set by bsd.cpu.mk
> -. if ${ARCH} == amd64 || ${ARCH} == i386
> +. if defined(RUSTFLAGS_CPU_FEATURES)
> +RUSTFLAGS+= ${RUSTFLAGS_CPU_FEATURES}
> +. elif ${ARCH} == amd64 || ${ARCH} == i386
>  RUSTFLAGS+= ${CFLAGS:M-march=*:S/-march=/-C target-cpu=/}
>  . elif ${ARCH:Mpowerpc*}
>  RUSTFLAGS+= ${CFLAGS:M-mcpu=*:S/-mcpu=/-C target-cpu=/:S/power/pwr/}
>
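
[Aside: for picking values like the RUSTFLAGS_CPU_FEATURES line above,
rustc itself can list the CPU names and target-feature names it
accepts, which is a way to cross-check the -C target-cpu and
-C target-feature spellings (run on the aarch64 host so the default
target applies, or add an explicit --target):]

rustc --print target-cpus
rustc --print target-features
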

>> diff --git a/secure/lib/libcrypto/Makefile b/secure/lib/libcrypto/Makefile
>> index 8fde4f19d046..e13227d6450b 100644
>> --- a/secure/lib/libcrypto/Makefile
>> +++ b/secure/lib/libcrypto/Makefile
>> @@ -22,7 +22,7 @@ SRCS+= mem.c mem_dbg.c mem_sec.c o_dir.c o_fips.c o_fopen.c o_init.c
>>  SRCS+= o_str.c o_time.c threads_pthread.c uid.c
>>  .if defined(ASM_aarch64)
>>  SRCS+= arm64cpuid.S armcap.c
>> -ACFLAGS.arm64cpuid.S= -march=armv8-a+crypto
>> +ACFLAGS.arm64cpuid.S+= -march=armv8-a+crypto
>>  .elif defined(ASM_amd64)
>>  SRCS+= x86_64cpuid.S
>>  .elif defined(ASM_arm)
>> @@ -43,7 +43,7 @@ SRCS+= mem_clr.c
>>  SRCS+= aes_cbc.c aes_cfb.c aes_ecb.c aes_ige.c aes_misc.c aes_ofb.c aes_wrap.c
>>  .if defined(ASM_aarch64)
>>  SRCS+= aes_core.c aesv8-armx.S vpaes-armv8.S
>> -ACFLAGS.aesv8-armx.S= -march=armv8-a+crypto
>> +ACFLAGS.aesv8-armx.S+= -march=armv8-a+crypto
>>  .elif defined(ASM_amd64)
>>  SRCS+= aes_core.c aesni-mb-x86_64.S aesni-sha1-x86_64.S aesni-sha256-x86_64.S
>>  SRCS+= aesni-x86_64.S vpaes-x86_64.S
>> @@ -278,7 +278,7 @@ SRCS+= cbc128.c ccm128.c cfb128.c ctr128.c cts128.c gcm128.c ocb128.c
>>  SRCS+= ofb128.c wrap128.c xts128.c
>>  .if defined(ASM_aarch64)
>>  SRCS+= ghashv8-armx.S
>> -ACFLAGS.ghashv8-armx.S= -march=armv8-a+crypto
>> +ACFLAGS.ghashv8-armx.S+= -march=armv8-a+crypto

I'll probably not do any more exploring of kernel vs. world combinations
of cortex-x1c/cortex-a78c feature use vs. non-use.

My -mcpu=cortex-x1c+flagm based from-scratch build of my ports took
somewhat over 15 hrs on the WDK23:

[main-CA78C-default] [2023-05-10_01h26m04s] [committing:] Queued: 480 Built: 480 Failed: 0 Skipped: 0 Ignored: 0 Fetched: 0 Tobuild: 0 Time: 15:08:47

Beyond using a -mcpu=cortex-x1c+flagm+nolse+rcpc based context now, I've
also recently changed the build sequence to use 2 stages, to help avoid
the long tail of the build being largely one process (single thread) at
a time:

poudriere bulk -jmain-CA78C -w -f ~/origins/build-first.txt
poudriere bulk -jmain-CA78C -w -f ~/origins/CA78C-origins.txt

# more ~/origins/build-first.txt
devel/binutils
devel/boost-jam
devel/llvm16
devel/llvm15
lang/rust

(Actually, my test was without boost-jam being listed. I added that
after the test. I also later added PRIORITY_BOOST="boost-libs" to
etc/poudriere.conf . CA78C-origins.txt also lists those port origins,
along with the rest of the things I explicitly want built.)

The above, in my context, happens to lead to devel/boost-libs building
in parallel with other activity.

I use a high-load-average-allowed style of building ports into
packages: ALLOW_MAKE_JOBS=yes and the default number of builders, so up
to 8 builders on the WDK23. Also: USE_TMPFS=all (based on about 118
GiBytes of swap, so RAM+SWAP approx.= 150 GiBytes. Observed swap use got
up to a little under 13 GiBytes but was not thrashing.)

(This style would not scale well at some point, but it works for what I
have access to, even the ThreadRipper 1950X with its 128 GiBytes of RAM
and 32 FreeBSD "cpus". It has more swap configured.)
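
(For anyone replicating the overall setup: as a rough sketch of where
those knobs live, rather than a verbatim copy from my systems,
USE_TMPFS and PRIORITY_BOOST are poudriere.conf settings, while
ALLOW_MAKE_JOBS is a ports make.conf setting for the build jail:)

# Sketch only; placement assumed, values as described above.
# /usr/local/etc/poudriere.conf :
USE_TMPFS=all
PRIORITY_BOOST="boost-libs"
# /usr/local/etc/poudriere.d/main-CA78C-make.conf (alongside the
# CFLAGS/CXXFLAGS/CPPFLAGS lines shown earlier):
ALLOW_MAKE_JOBS=yes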

Those settings, combined with the -mcpu=cortex-x1c+flagm+nolse+rcpc use,
have the from-scratch port builds down to slightly over 10 hours on the
WDK23:

[main-CA78C-default] [2023-05-13_01h31m02s] [committing:] Queued: 99 Built: 99 Failed: 0 Skipped: 0 Ignored: 0 Fetched: 0 Tobuild: 0 Time: 05:53:58
[main-CA78C-default] [2023-05-13_07h25m03s] [committing:] Queued: 381 Built: 381 Failed: 0 Skipped: 0 Ignored: 0 Fetched: 0 Tobuild: 0 Time: 04:07:07

This context was ZFS. I've not done a UFS-context test yet.

===
Mark Millard
marklmi at yahoo.com