Re: Expected native build times on an RPI4?

From: Mark Millard <marklmi_at_yahoo.com>
Date: Thu, 06 Apr 2023 20:53:04 UTC
On Apr 5, 2023, at 19:04, Mark Millard <marklmi@yahoo.com> wrote:

> On Apr 3, 2023, at 13:42, Joseph Koshy <jkoshy@freebsd.org> wrote:
> 
>> A 'make -j3 buildworld' of a freshly checked out -current tree
>> took 15+ hours on an RPI4 before eventually running out
>> of space (/usr/obj had reached 7G by then).
>> 
>> The CPU(s) ran at a top speed of 1500 MHz during the build,
>> per 'sysctl dev.cpu.0.freq'.
>> 
>> Even so, the build hadn't managed to cross the 'building
>> libraries' step.
>> 
>> I'm wondering how best to provision for building -current:
>> how long does 'buildworld' take on this device usually, and
>> how much disk space does a build of -current usually need?
> 
> I looked and found I'd not recorded any buildworld buildkernel
> timing notes since back in very late 2021, so what I
> had need not be representative now. I've finally got
> around to starting a from-scratch build, on an 8 GiByte
> RAM "C0T" RPi4B. (My normal builds are done on a different
> type of aarch64 system.) This is a from-scratch build,
> but of note are:
> 
> make[1]: "/usr/main-src/Makefile.inc1" line 327: SYSTEM_COMPILER: Determined that CC=cc matches the source tree.  Not bootstrapping a cross-compiler.
> make[1]: "/usr/main-src/Makefile.inc1" line 332: SYSTEM_LINKER: Determined that LD=ld matches the source tree.  Not bootstrapping a cross-linker.
> 
> Sometimes bootstrapping build activity is required and
> that would mean more time (and space) than for what I'm
> timing.
> 
> (I've no clue if the build attempt that you mentioned
> involved building a bootstrap compiler or bootstrap
> linker or both.)
> 
> [Timings added after much of the other text had been
> typed in already.]
> 
> World build completed on Wed Apr  5 17:52:47 PDT 2023
> World built in 26009 seconds, ncpu: 4, make -j4
> 
> So, for world, 26009sec*(1min/60sec)*(1hr/60min) == 7.2247_2222... hr < 7.3 hr.
> 
> Kernel build for GENERIC-NODBG-CA72 completed on Wed Apr  5 18:27:29 PDT 2023
> Kernel(s)  GENERIC-NODBG-CA72 built in 2082 seconds, ncpu: 4, make -j4
> 
> So, for kernel, 2082sec*(1min/60sec)*(1hr/60min) == 0.578_3333... hr < 0.6 hr.
> 
> So, for total, somewhat under 8 hr.
> 
> (An example of needing bootstrapping would be jumping
> from main being 14.0 to being 15.0 . Another example
> could be jumping from system clang 15 to system clang
> 16 . The additional time would not be trivial.)
> 
> 
> Notes . . .
> 
> The RPi4B has heatsinks and a case with a fan. The
> config.txt has the following added, among other
> things:
> 
> [pi4]
> over_voltage=6
> arm_freq=2000
> sdram_freq_min=3200
> force_turbo=1
> 
> (I do not use FreeBSD facilities to manage arm_freq .)
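> 
> A quick check of what FreeBSD sees for the effective
> clock is the same sysctl that you used, e.g.:
> 
> # sysctl -n dev.cpu.0.freq
> 
> With the above config.txt settings in place I'd expect
> it to report the arm_freq figure rather than the stock
> default.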
> 
> The result has no temperature problems during such
> builds. I picked arm_freq=2000 based on it working
> across 7 example RPi4B's (mix of 8 GiByte "B0T"
> and "C0T" and older 4 GiByte "B0T" variants). 2100 did
> not prove to always work, given the other 3 settings.
> I avoid system-specific tailoring in normal operation,
> so I standardized on a value that they all handle.
> 
> The media is a USB3 NVMe drive, not spinning rust,
> nor a microsd card. The drive is powered from
> just the RPi4B. The media has a UFS file system.
> I avoid tmpfs use that competes for RAM. (I've also
> got access to ZFS media around but that is not what
> I'm testing with in this example.)
> 
> The power supply used for the RPi4B has more margin
> than is typical: 5.1V, 3.5A.
> 
> A serial console is set up.
> 
> For an 8 GiByte RAM system I normally have 30 GiBytes
> or so of swap space active (not specific to buildworld
> buildkernel activity, but I'll not get into the details
> of why-so-much here). However, for this timing I'm
> running without swap since I've not tested that on an
> 8 GiByte RPi4B in a long time. (Most of the potential
> swap usage is tied to how I build ports into
> packages, not buildworld buildkernel .)
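> 
> (For reference, when I do have swap active it is enabled
> the usual way; a minimal sketch, with the device name
> being only a placeholder rather than my actual layout:
> 
> # swapon /dev/da0p3
> 
> or, persistently, via an /etc/fstab line like:
> 
> /dev/da0p3  none  swap  sw  0  0
> )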
> 
> The FreeBSD context is (output line split for
> better readability):
> 
> # uname -apKU
> FreeBSD CA72_UFS 14.0-CURRENT FreeBSD 14.0-CURRENT #90
> main-n261544-cee09bda03c8-dirty: Wed Mar 15 20:25:49 PDT 2023
> root@CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-nodbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-NODBG-CA72
> arm64 aarch64 1400082 1400082
> 
> The build is building that same version from scratch
> (after a "rm -fr" of the build-tree area). I do not
> use ccache or the like. So: an example of a possible
> upper bound on the required build time for a specific
> configuration that is built, but no bootstrap compiler
> or linker build involved.
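> 
> For concreteness, the from-scratch sequence amounts to
> roughly the following (the non-default object-tree path
> reflects my local layout; this is illustrative, not my
> exact command lines):
> 
> # rm -fr /usr/obj/BUILDs/main-CA72-nodbg-clang/usr/
> # cd /usr/main-src
> # make -j4 buildworld buildkernel KERNCONF=GENERIC-NODBG-CA72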
> 
> I do this because comparing timings of incremental builds
> that need not be doing the same increments is problematic.
> (However, configuring to allow incremental builds instead
> of only full builds is important to spending less total
> time building across builds.)
> 
> I'll list various settings that I use. There are non-
> obvious contributions too. For example, I use an Ethernet
> ssh session instead of the serial console: the serial
> console can lead to waiting for fast scrolling output to
> finish. (This matters more consistently for installworld
> and installkernel scrolling output.) I run headless,
> avoiding some competition for RAM and such. I do not load
> the RPi4B with additional activities, not even nice'd ones.
> 
> Note that it is a non-debug system that is running and
> it is building a matching non-debug world and kernel.
> 
> In /boot/loader.conf I have:
> 
> # Delay when persistent low free RAM leads to
> # Out Of Memory killing of processes:
> vm.pageout_oom_seq=120
> #
> # For plenty of swap/paging space (will not
> # run out), avoid pageout delays leading to
> # Out Of Memory killing of processes:
> vm.pfault_oom_attempts=-1
> #
> # For possibly insufficient swap/paging space
> # (might run out), increase the pageout delay
> # that leads to Out Of Memory killing of
> # processes (showing defaults at the time):
> #vm.pfault_oom_attempts= 3
> #vm.pfault_oom_wait= 10
> # (The product of the two is the total delay
> # (3*10 == 30 with the defaults shown), but
> # there are other potential tradeoffs in the
> # individual factors, even for nearly the
> # same total.)
> 
> (I'd not expected the 8 GiByte build to need
> to page out to swap space so I left in place
> my normal setting for vm.pfault_oom_attempts .)
> 
> In /etc/sysctl.conf I have:
> 
> # Together this pair avoids swapping out process kernel stacks.
> # This keeps processes used for interacting with the system
> # from being hung up by such swap-outs.
> vm.swap_enabled=0
> vm.swap_idle_enabled=0
> 
> (But, absent any active swap space, such would not
> happen. However, the lack of active swap space is
> not my normal context and the above is what I have
> in place for normal use as well.)
> 
> Part of the below indicates that I avoid building
> MIPS, POWERPC, RISCV, and X86 targeting materials
> because I do not intend to target anything but
> aarch64 and armv7 from aarch64 systems. This is not
> the default. Going in the other direction, I build
> CLANG_EXTRAS that builds more than what is default.
> This combination makes my build timings ball-park
> figures relative to your context.
> 
> An oddity is that I avoid much of the stripping, so
> my builds are somewhat bigger than normal for the
> materials produced. (I like the somewhat better
> backtraces from leaving symbols in place, even if
> the build is optimized and avoids full debug
> information.)
> 
> I use:
> 
> TO_TYPE=aarch64
> #
> KERNCONF=GENERIC-NODBG-CA72
> TARGET=arm64
> .if ${.MAKE.LEVEL} == 0
> TARGET_ARCH=${TO_TYPE}
> .export TARGET_ARCH
> .endif
> #
> WITH_SYSTEM_COMPILER=
> WITH_SYSTEM_LINKER=
> #
> WITH_ELFTOOLCHAIN_BOOTSTRAP=
> #Disables avoiding bootstrap: WITHOUT_LLVM_TARGET_ALL=
> WITH_LLVM_TARGET_AARCH64=
> WITH_LLVM_TARGET_ARM=
> WITHOUT_LLVM_TARGET_MIPS=
> WITHOUT_LLVM_TARGET_POWERPC=
> WITHOUT_LLVM_TARGET_RISCV=
> WITHOUT_LLVM_TARGET_X86=
> WITH_CLANG=
> WITH_CLANG_IS_CC=
> WITH_CLANG_FULL=
> WITH_CLANG_EXTRAS=
> WITH_LLD=
> WITH_LLD_IS_LD=
> WITH_LLDB=
> #
> WITH_BOOT=
> #
> #
> WITHOUT_WERROR=
> #WERROR=
> MALLOC_PRODUCTION=
> WITH_MALLOC_PRODUCTION=
> WITHOUT_ASSERT_DEBUG=
> WITHOUT_LLVM_ASSERTIONS=
> #
> # Avoid stripping but do not control host -g status as well:
> DEBUG_FLAGS+=
> #
> WITH_REPRODUCIBLE_BUILD=
> WITH_DEBUG_FILES=
> #
> # Use of the .clang 's here avoids
> # interfering with other C<?>FLAGS
> # usage, such as ?= usage.
> CFLAGS.clang+= -mcpu=cortex-a72
> CXXFLAGS.clang+= -mcpu=cortex-a72
> CPPFLAGS.clang+= -mcpu=cortex-a72
> ACFLAGS.arm64cpuid.S+=  -mcpu=cortex-a72+crypto
> ACFLAGS.aesv8-armx.S+=  -mcpu=cortex-a72+crypto
> ACFLAGS.ghashv8-armx.S+=        -mcpu=cortex-a72+crypto
> 
> Those last 6 lines lead to the code generation being
> tuned for Cortex-A72's. (The code still works on
> Cortex-A53's.) I expect such lines are rarely used,
> but I happen to use them.
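> 
> (If you want to confirm what -mcpu resolves to without
> compiling anything, clang's -### driver output shows the
> internal target-cpu; a quick sketch:
> 
> # clang -mcpu=cortex-a72 -### -c -x c /dev/null 2>&1 | \
>     grep -o '"-target-cpu" "[^"]*"'
> )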
> 
> I'll note that avoiding WITHOUT_LLVM_TARGET_ALL is
> tied to old observed behavior that I've not
> revalidated.
> 
> In the past, I've had examples where RPi4B -j3 built
> in less time than -j4 for such full-build timing tests.
> On a RPi4B, I've never had -j5 or higher build in less
> time. (Some of this is down to the RPi4B RAM/RAM-cache
> subsystem properties: it is easier than normal to
> saturate RAM access and the caching is small.
> Another contribution may be the USB3 NVMe media
> latency being small. Spinning rust might have
> different tradeoffs, for example.) I've also never
> had -j2 or less take less time for full builds.
> 
> (Folks who do not use vm.pageout_oom_seq to avoid
> kills from happening may use -j2 or such to better
> avoid sometimes having parts of some build attempts
> killed.)
> 
> Unfortunately, I forgot to set up monitoring of
> MaxObsActive, MaxObsWired, and MaxObs(Act+Wir+Lndry).
> ("MaxObs" is short for "Maximum Observed".) So I
> do not have such figures to report. (I use a
> modified top to get such figures.)
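> 
> (For anyone without a modified top: a rough stand-in is to
> sample the stock vm.stats sysctls and track the running
> maximum. A sketch, converting page counts to MiBytes:
> 
> #!/bin/sh
> # Sample Active+Wired+Laundry every 10 seconds and report
> # each new maximum, in MiBytes.
> pgsz=$(sysctl -n vm.stats.vm.v_page_size)
> max=0
> while sleep 10; do
>     act=$(sysctl -n vm.stats.vm.v_active_count)
>     wir=$(sysctl -n vm.stats.vm.v_wire_count)
>     lnd=$(sysctl -n vm.stats.vm.v_laundry_count)
>     tot=$(( (act + wir + lnd) * pgsz / 1048576 ))
>     if [ "$tot" -gt "$max" ]; then
>         max=$tot
>         echo "MaxObs(Act+Wir+Lndry): ${max}Mi"
>     fi
> done
> 
> This only approximates what my modified top tracks.)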
> 
> The build-tree size:
> 
> # du -xsm /usr/obj/BUILDs/main-CA72-nodbg-clang/usr/
> 13122 /usr/obj/BUILDs/main-CA72-nodbg-clang/usr/
> 
> But such is based on details of what I build vs.
> what I do not, as well as the lack of stripping. So, in
> very round numbers, 20 GiBytes would be able to hold
> a build. You might want notable margin, in part because
> as FreeBSD and the toolchain progress, things have
> tended to get bigger over time. Plus, the figure is a
> final size; whether the peak size during the build was
> larger, I do not know.
> A debug build would take more space than my non-debug
> build. Also, the 13122 does not include the build
> materials for a bootstrap compiler or a bootstrap
> linker (or both). Thus my rounding to 20 GiBytes as a
> possibility for illustration.
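> 
> (For provisioning, checking the file system that will hold
> the build tree before starting is cheap, e.g.:
> 
> # df -h /usr/obj
> 
> then compare against the 20 GiByte ball-park figure, plus
> margin.)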
> 
> Again: no ccache-like use. Otherwise there would be
> more space someplace to consider overall.
> 

I repeated the "rm -fr", rebooted, and did a
-j3 buildworld buildkernel . The result was:

World build completed on Thu Apr  6 03:31:43 PDT 2023
World built in 28858 seconds, ncpu: 4, make -j3

So, for world, 28858sec*(1min/60sec)*(1hr/60min) == 8.016_1111... hr < 8.1 hr.

Kernel build for GENERIC-NODBG-CA72 completed on Thu Apr  6 04:10:26 PDT 2023
Kernel(s)  GENERIC-NODBG-CA72 built in 2323 seconds, ncpu: 4, make -j3

So, for kernel, 2323sec*(1min/60sec)*(1hr/60min) == 0.6452_7777... hr < 0.7 hr.

So, for total, somewhat under 8.8 hr.

So, for the total, 31181sec/28091sec ~= 1.11 times what -j4
took (28858sec+2323sec == 31181sec vs. 26009sec+2082sec ==
28091sec).

I did remember to get MaxObs figures for this:

load averages:  . . . MaxObs:   3.59,   3.21,   3.09
1404Mi MaxObsActive, 1155Mi MaxObsWired, 2383Mi MaxObs(Act+Wir+Lndry)

(Note: Laundry did end up non-zero, despite the lack of swap space.)

So this combination looks like it would not need swap space
for a 4 GiByte RPi4B but likely would need such for a
2 GiByte RPi4B. It looks like the same could be true of a
-j4 build.



After that I repeated the "rm -fr", rebooted,
and did a -j5 buildworld buildkernel . The
result was:

World build completed on Thu Apr  6 12:42:04 PDT 2023
World built in 25940 seconds, ncpu: 4, make -j5

So, for world, 25940sec*(1min/60sec)*(1hr/60min) == 7.20_5555... hr < 7.3 hr.

Kernel build for GENERIC-NODBG-CA72 completed on Thu Apr  6 13:16:50 PDT 2023
Kernel(s)  GENERIC-NODBG-CA72 built in 2086 seconds, ncpu: 4, make -j5

So, for kernel, 2086sec*(1min/60sec)*(1hr/60min) == 0.579_4444... hr < 0.6 hr.

So, for total, somewhat under 8 hr.

So around 28026sec/28091sec ~= 0.998 times what -j4 took
(25940sec+2086sec == 28026sec).

Note a small-scale example of a tradeoff that can occur
based on the details of what is being built: buildworld
took less time but buildkernel took more.

I did remember to get MaxObs figures for this:

load averages:  . . . MaxObs:   5.57,   5.29,   5.17
1790Mi MaxObsActive, 1157Mi MaxObsWired, 2775Mi MaxObs(Act+Wir+Lndry)

(Note: Laundry did end up non-zero, despite the lack of swap space.)

So this combination looks like it would not need swap space
for a 4 GiByte RPi4B but would need such for a 2 GiByte
RPi4B.


Variability in incremental (META_MODE) builds, variability in
ccache's avoidance of compiles, and media access timing
properties could all lead to other tradeoffs in specific
builds for which -jN works better. ZFS would be a significant
change of context because of its Wired memory handling. (The
ARC leads to a far more widely variable Wired-memory usage
pattern, for example.) I'm not claiming the above indicates
some universal answer to what is optimal across a range of
contexts.
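
For reference, the stock minimal enablement of META_MODE, as
I understand it, is loading filemon(4) and setting
WITH_META_MODE in /etc/src-env.conf; a sketch:

# kldload filemon
# echo WITH_META_MODE=yes >> /etc/src-env.conf

(My actual configuration has more to it than just that.)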

One thing that was different for a while, relative to my older
timings, was that a certain Google test build used to take
large amounts of RAM and time compared to the figures I report
above. If I remember right, this stopped when FreeBSD adjusted
that specific test's build to generate unoptimized code,
avoiding a bad case in the LLVM toolchain's optimization
handling when generating the test involved.


Note: The ZFS ARC's Wired memory usage makes any "MaxObs" that
includes a Wired memory contribution not readily comparable to
the same "MaxObs" for a UFS context.

===
Mark Millard
marklmi at yahoo.com