Re: HoneyComb first-boot notes [an L3/L2/L1/RAM performance oddity]

From: Mark Millard via freebsd-arm <freebsd-arm_at_freebsd.org>
Date: Sun, 11 Jul 2021 11:03:12 UTC
On 2021-Jul-10, at 22:09, Mark Millard <marklmi at yahoo.com> wrote:

> On 2021-Jun-24, at 16:25, Mark Millard <marklmi at yahoo.com> wrote:
> 
>> On 2021-Jun-24, at 16:00, Mark Millard <marklmi at yahoo.com> wrote:
>> 
>>> On 2021-Jun-24, at 13:39, Mark Millard <marklmi at yahoo.com> wrote:
>>> 
>>>> Repeating here what I've reported on the SolidRun Discord:
>>>> 
>>>> I decided to experiment with monitoring the temperatures reported
>>>> as things are. With the default heat-sink/fan and the 2 other fans
>>>> in the case, buildworld at a load average of about 16 has for some
>>>> time stayed with tz0 through tz6 reporting between 61.0degC and
>>>> 66.0degC, with ambient at roughly 20degC. (tz7 and tz8 report
>>>> 0.1degC.) During stages with lower load averages, the tz0..tz6
>>>> temperatures back off some. So it looks like my default setup keeps
>>>> the system sufficiently cool for such use.
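>>>>
>>>> (For reference, the per-zone readings here come from the
>>>> acpi_thermal sysctl tree; roughly something like the
>>>> following, though the zone count and names can differ per
>>>> system:
>>>>
>>>>   # report every ACPI thermal zone's current temperature
>>>>   sysctl hw.acpi.thermal | grep temperature
>>>> )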
>>>> 
>>>> I'll note that the default heat-sink's fan is not running at speeds
>>>> loud enough for me to hear it from upstairs. I have heard its noisy
>>>> mode from there during the early parts of booting Fedora 34 Server,
>>>> for example.
>>> 
>>> So I updated my stable/13 source, built and installed
>>> the update, then did a rm -fr of the build directory
>>> tree and started a from-scratch build. The build
>>> reported:
>>> 
>>> SYSTEM_COMPILER: Determined that CC=cc matches the source tree.  Not bootstrapping a cross-compiler.
>>> and:
>>> SYSTEM_LINKER: Determined that LD=ld matches the source tree.  Not bootstrapping a cross-linker.
>>> 
>>> as is my standard context for doing such "how long does
>>> it take" buildworld buildkernel testing.
>>> 
>>> On aarch64 I do not build for targeting non-arm architectures.
>>> This does save some time on the builds.
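>>>
>>> (That roughly corresponds to a src.conf knob along the lines
>>> of the following; the exact settings I use may differ, see
>>> src.conf(5) for the knob names on a given branch:
>>>
>>>   WITHOUT_LLVM_TARGET_ALL=    # keep only the native LLVM target
>>> )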
>> 
>> I should have mentioned that my builds are tuned for the
>> cortex-a72 via use of -mcpu=cortex-a72. This was also true
>> of the live system doing the building, both kernel and
>> world.
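>>
>> (One way to arrange that, roughly, is via /etc/make.conf
>> additions such as the following; my actual configuration
>> differs in its details:
>>
>>   CFLAGS+=    -mcpu=cortex-a72    # world
>>   COPTFLAGS+= -mcpu=cortex-a72    # kernel
>> )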
>> 
>>> The results for the HoneyComb configuration I'm using:
>>> 
>>> World build completed on Thu Jun 24 15:30:11 PDT 2021
>>> World built in 3173 seconds, ncpu: 16, make -j16
>>> Kernel build for GENERIC-NODBG-CA72 completed on Thu Jun 24 15:34:45 PDT 2021
>>> Kernel(s)  GENERIC-NODBG-CA72 built in 274 seconds, ncpu: 16, make -j16
>>> 
>>> So World+Kernel took a little under 1 hr to build (-j16).
>>> 
>>> 
>>> 
>>> Comparison/contrast to prior aarch64 systems that I've used
>>> for buildworld buildkernel . . .
>>> 
>>> 
>>> By contrast, the (now failed) OverDrive 1000's last timing
>>> was (building releng/13 instead of stable/13):
>>> 
>>> World build completed on Tue Apr 27 02:50:52 PDT 2021
>>> World built in 12402 seconds, ncpu: 4, make -j4
>>> Kernel build for GENERIC-NODBG-CA72 completed on Tue Apr 27 03:08:04 PDT 2021
>>> Kernel(s)  GENERIC-NODBG-CA72 built in 1033 seconds, ncpu: 4, make -j4
>>> 
>>> So World+Kernel took a little under 3.75 hrs to build (-j4).
>>> 
>>> 
>>> The MACCHIATObin Double Shot's last timing was
>>> (building a 13-CURRENT):
>>> 
>>> World build completed on Tue Jan 19 03:44:59 PST 2021
>>> World built in 14902 seconds, ncpu: 4, make -j4
>>> Kernel build for GENERIC-NODBG completed on Tue Jan 19 04:04:25 PST 2021
>>> Kernel(s)  GENERIC-NODBG built in 1166 seconds, ncpu: 4, make -j4
>>> 
>>> So World+Kernel took a little under 4.5 hrs to build (-j4).
>>> 
>>> 
>>> The RPi4B 8GiByte's last timing was
>>> ( arm_freq=2000, sdram_freq_min=3200, force_turbo=1, USB3 SSD
>>> building releng/13 ):
>>> 
>>> World build completed on Tue Apr 20 14:34:38 PDT 2021
>>> World built in 22104 seconds, ncpu: 4, make -j4
>>> Kernel build for GENERIC-NODBG completed on Tue Apr 20 15:03:24 PDT 2021
>>> Kernel(s)  GENERIC-NODBG built in 1726 seconds, ncpu: 4, make -j4
>>> 
>>> So World+Kernel took somewhat under 6 hrs 40 min to build (-j4).
>> 
>> The -mcpu=cortex-a72 use note also applies to the OverDrive 1000,
>> MACCHIATObin Double Shot, and RPi4B 8 GiByte contexts.
>> 
> 
> I've run into an issue where what FreeBSD calls cpu 0 has
> significantly different L3/L2/L1/RAM subsystem performance
> than all the other cores (cpu 0 being worse). The same holds
> when compared/contrasted with all 4 MACCHIATObin Double Shot
> cores.
> 
> A plot with curves showing the issue is at:
> 
> https://github.com/markmi/acpphint/blob/master/acpphint_example_data/HoneyCombFreeBSDcpu0RAMAccessPerformanceIsOdd.png
> 
> The dark red curves in the plot are for cpu 0 and show the
> odd behavior. The lighter colored curves are the MACCHIATObin
> curves. The darker ones are the HoneyComb curves; for those
> the L3/L2/L1 shows as relatively effective (other than for
> cpu 0).
> 
> My notes on Discord (so far) are . . .
> 
> The curves are from my C++ variant of the old Hierarchical
> INTegration benchmark (historically abbreviated HINT). You
> can read the approximate size of each level of cache from
> the x-axis position where the curve drops faster. So, right
> (most obvious) to left (least obvious): L3 8 MiByte, L2 1
> MiByte (per core pair, as it turns out), L1 32 KiByte.
> 
> The curves here are for single-threaded benchmark
> configurations, with cpuset used to control which CPU is
> used. I first noticed the issue via odd performance
> variations when multithreading with more cores allowed than
> in use (so threads migrated across a variety of cpus over
> time).
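>
> (The pinning is just FreeBSD's cpuset, roughly along these
> lines, with the binary name standing in for whatever the
> acpphint build actually produces:
>
>   # run the single-threaded benchmark pinned to CPU 5 only
>   cpuset -l 5 ./acpphint
> )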
> 
> I explored all the CPUs (cores), not just the ones I plotted.
> Only that one CPU (cpu 0) shows the odd-performing memory
> access structure in its curve.
> 
> FYI: The FreeBSD boot is UEFI/ACPI based for both systems,
> not U-Boot based.
> 

Jon Nettleton has replicated the memory access performance
issue for the one odd cpu on a different HoneyComb, running
a Linux kernel and using tinymembench as the benchmark.
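
(For anyone wanting to do a similar check under Linux, the usual
pattern is to pin tinymembench to one core at a time, for example:

  taskset -c 0 ./tinymembench   # then repeat with -c 1, -c 2, ...

I do not know the exact invocation Jon used.)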


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)