cpuset(1) and affinity vs. masking

From: Harry Schmalzbauer <freebsd_at_omnilan.de>
Date: Sun, 23 Jul 2023 13:39:15 UTC
Hello,

I hope it is OK to ask some very basic questions here on -arm, where
the more arch/dev-related arm topics are probably discussed.  I found
many posts where people (askers and responders alike) confused
NUMA/malloc issues with scheduler issues, and aarch64 with amd64...
That's why I'm asking here first.

I'm on rk3399 (Pine64 RockPro64) and new to aarch64 and big.LITTLE with
FreeBSD, and I'm just looking for a way to teach the scheduler (default
ULE) to prefer the fast cores.

As far as I understand cpuset(1), it can only mask cores to be excluded.
What I can observe is that sched_ule distributes threads across all 6 
cores without any notable affinity - often single-thread tasks are 
spread over cpu 0-3 (slow A53) while the fast cores 4+5 (more 
power-hungry A72) are idle.
As long as power consumption isn't crucial, this default behaviour can't
be intentional.  A simple real-world test utilizing xz/pkg(1) shows close
to 100% performance penalty with the out-of-box sched_ule behaviour:
(
time pkg -o ABI_FILE=/usr/src/worldstage/usr/bin/uname \
    -o ALLOW_BASE_SHLIBS=yes create -f txz \
    -M /usr/src/worldstage/openssl-dev.ucl \
    -p /usr/src/worldstage/openssl-dev.plist \
    -r /usr/src/worldstage \
    -o /usr/obj/usr/src/repo/FreeBSD:13:aarch64/13.snap20230723104317
)
Invoked from 'cpuset -c -l 4,5 /bin/sh', this results in
     63.38 real        63.09 user         0.28 sys
while invoked from 'cpuset -c -l 0-3 /bin/sh', it results in
    118.58 real       118.05 user         0.52 sys
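
To put a number on "close to 100%", the two wall-clock ("real") times
above can be divided.  This is only a minimal sketch; the helper name
slowdown is made up for illustration, and it uses nothing but the two
figures quoted above:

```shell
#!/bin/sh
# Hypothetical helper: ratio of the slow-core run to the fast-core run.
# $1 = "real" seconds on the fast cores, $2 = "real" seconds on the slow cores
slowdown() {
    awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# The two measurements from the pkg create test above.
slowdown 63.38 118.58    # prints 1.87
```

So the run on the A53 cores takes roughly 1.87 times as long, i.e. about
87% more wall-clock time than the run on the A72 cores.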

   (BTW, regarding power consumption: I can hardly imagine that running 2
minutes on A53 cores saves power compared to running half the time on
the A72 cores, but that's a totally different story for me for now)

I'm looking for real affinity, meaning that every core is allowed but the
fast ones are preferred, i.e. always used until all of them are
overloaded, and threads are re-assigned immediately: fast-core cycles
mustn't ever go idle as long as slow cores are utilized.

What I found so far regarding rk3399 tuning: 
https://lists.freebsd.org/pipermail/freebsd-arm/2020-July/022105.html,
which is more about cpufreq(4) - still an issue in my opinion, but since
     sysctl dev.cpu.4.freq=1800
     sysctl dev.cpu.3.freq=1416
works these days, it doesn't bother much.

In another discussion, a reference to the FDT cpu-map binding was posted:
https://mjmwired.net/kernel/Documentation/devicetree/bindings/cpu

On my RockPro64, kern.sched.topology_spec doesn't seem to define two
groups/clusters:
<groups>
  <group level="1" cache-level="3">
   <cpu count="6" mask="3f,0,0,0">0, 1, 2, 3, 4, 5</cpu>
  </group>
</groups>
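
A quick way to check how many groups the scheduler reports is to count
the <group ...> elements in that output.  This is only a rough sketch:
the helper name count_groups is made up here, and it assumes each group
element starts on its own line, as in the output above:

```shell
#!/bin/sh
# Hypothetical helper: count <group ...> elements in topology XML on stdin.
# The trailing space in the pattern keeps the outer <groups> tag from matching.
count_groups() {
    grep -c '<group '
}

# Fed with the topology_spec output quoted above.
count_groups <<'EOF'
<groups>
  <group level="1" cache-level="3">
   <cpu count="6" mask="3f,0,0,0">0, 1, 2, 3, 4, 5</cpu>
  </group>
</groups>
EOF
# prints 1
```

On the running system the input would come from
'sysctl -n kern.sched.topology_spec | count_groups'.  A result of 1
matches the observation that all six cores sit in one flat group.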

But dmesg shows traces of affinity groups, since there's something like 
[ 0 0, 0 1, 0 2, 0 3] and [ 1 0, 1 1] printed next to CPU affinity...
I simply don't understand how sched_ule is supposed to make use of this 
information.  I guess it doesn't (yet).

CPU  0: ARM Cortex-A53 r0p4 affinity:  0  0
                    Cache Type = <64 byte D-cacheline,64 byte I-cacheline,VIPT ICache,64 byte ERG,64 byte CWG>
  Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
  Instruction Set Attributes 1 = <>
  Instruction Set Attributes 2 = <>
          Processor Features 0 = <GIC,AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
          Processor Features 1 = <>
       Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,1TB PA>
       Memory Model Features 1 = <8bit VMID>
       Memory Model Features 2 = <32bit CCIDX,48bit VA>
              Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
              Debug Features 1 = <>
          Auxiliary Features 0 = <>
          Auxiliary Features 1 = <>
AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
CPU  1: ARM Cortex-A53 r0p4 affinity:  0  1
CPU  2: ARM Cortex-A53 r0p4 affinity:  0  2
CPU  3: ARM Cortex-A53 r0p4 affinity:  0  3
CPU  4: ARM Cortex-A72 r0p2 affinity:  1  0
                    Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
       Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
CPU  5: ARM Cortex-A72 r0p2 affinity:  1  1
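
The "CPU  N: ARM Cortex-..." lines above are enough to build a CPU list
for 'cpuset -l' without hard-coding core numbers.  The following is only
a sketch under assumptions: the function name cores_of is made up, and
it relies on the dmesg line format shown above ("CPU", then the id
followed by a colon, then the core name):

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated list of CPU ids whose
# dmesg line mentions the given core name.
# $1 = core name as printed in dmesg (e.g. Cortex-A72); dmesg text on stdin
cores_of() {
    awk -v model="$1" '
        $1 == "CPU" && index($0, model) {
            id = $2
            sub(/:$/, "", id)          # "4:" -> "4"
            if (list == "")
                list = id
            else
                list = list "," id
        }
        END { print list }'
}

# Fed with the per-CPU lines quoted above.
cores_of Cortex-A72 <<'EOF'
CPU  0: ARM Cortex-A53 r0p4 affinity:  0  0
CPU  1: ARM Cortex-A53 r0p4 affinity:  0  1
CPU  2: ARM Cortex-A53 r0p4 affinity:  0  2
CPU  3: ARM Cortex-A53 r0p4 affinity:  0  3
CPU  4: ARM Cortex-A72 r0p2 affinity:  1  0
CPU  5: ARM Cortex-A72 r0p2 affinity:  1  1
EOF
# prints 4,5
```

On the board itself this could feed cpuset(1), for example
'cpuset -l "$(cores_of Cortex-A72 < /var/run/dmesg.boot)" some-command',
assuming the boot messages are still available in that file.  Note that
this is still masking, not the preference-based affinity asked about.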

The cpuset(1) policy:domain-list option (-n) affects memory allocation
(malloc) only, as far as I understand...
Any hints for more resources (besides /usr/src) are highly appreciated!

Thanks in advance,

-harry