cpuset(1) and affinity vs. masking
Date: Sun, 23 Jul 2023 13:39:15 UTC
Hello,

I hope it is OK to ask some very basic questions here on -arm, where usually the more arch/dev-related ARM topics are discussed. I found many posts where people (askers and responders alike) were confusing NUMA/malloc and scheduler-related questions, and aarch64 with amd64 likewise. That's why I'm asking here first.

I'm on an rk3399 (Pine64 RockPro64) and new to aarch64 and big.LITTLE with FreeBSD, and I'm just looking for a way to teach the scheduler (default ULE) to prefer the fast cores. As far as I understand cpuset(1), it can only mask cores to be excluded.

What I can observe is that sched_ule distributes threads across all 6 cores without any notable affinity: single-threaded tasks are often spread over CPUs 0-3 (slow A53) while the fast cores 4 and 5 (more power-hungry A72) are idle. As long as power consumption isn't crucial, this default behaviour can't be intentional.

A simple real-world test using xz via pkg(1) shows a penalty of close to 100% in run time when the work lands on the slow cores, which is what the out-of-the-box sched_ule behaviour often produces:

    ( time pkg -o ABI_FILE=/usr/src/worldstage/usr/bin/uname \
        -o ALLOW_BASE_SHLIBS=yes create -f txz \
        -M /usr/src/worldstage/openssl-dev.ucl \
        -p /usr/src/worldstage/openssl-dev.plist \
        -r /usr/src/worldstage \
        -o /usr/obj/usr/src/repo/FreeBSD:13:aarch64/13.snap20230723104317 )

Invoked from 'cpuset -c -l 4,5 /bin/sh' this results in

    63.38 real 63.09 user 0.28 sys

while invoked from 'cpuset -c -l 0-3 /bin/sh' it results in

    118.58 real 118.05 user 0.52 sys

(BTW, regarding power consumption: I can hardly imagine that running 2 minutes on the A53 cores saves power compared to running half the time on the A72 cores, but that's a totally different story for me for now.)

I'm looking for real affinity. Meaning: every core is allowed, but the fat ones are preferred, i.e. always used until all of them are overloaded, and threads are re-assigned immediately. Fat-core cycles must never go idle as long as slow cores are utilized.
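A minimal sketch of the masking workaround and of the arithmetic behind the two timings above. This is masking only, not the preference-based affinity asked for. The names on_big, slowdown, BIG_CORES and CPUSET_CMD are hypothetical helpers and knobs introduced here for illustration; the core list 4,5 is the one from this board.

```shell
#!/bin/sh
# Sketch only: restricts a command to the big cores, it does not make the
# scheduler "prefer" them. BIG_CORES and CPUSET_CMD are hypothetical knobs.
BIG_CORES="${BIG_CORES:-4,5}"          # A72 cores on the rk3399
CPUSET_CMD="${CPUSET_CMD:-cpuset}"     # override for a dry run

# Run a command masked to the big cores, like 'cpuset -c -l 4,5 cmd ...'.
on_big() {
    "$CPUSET_CMD" -c -l "$BIG_CORES" "$@"
}

# Ratio of two wall-clock times (slow / fast), two decimals.
slowdown() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# The "real" times measured above: A53 run vs. A72 run.
slowdown 118.58 63.38    # prints 1.87

# Usage on the board (not executed here):
#   on_big pkg create ...
```

So the run masked to the A53 cores takes about 1.87 times as long as the run masked to the A72 cores, i.e. roughly 87% more wall-clock time.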

What I found so far regarding rk3399 tuning:

    https://lists.freebsd.org/pipermail/freebsd-arm/2020-July/022105.html

which is more about cpufreq(4). That is still an issue in my opinion, but since

    sysctl dev.cpu.4.freq=1800
    sysctl dev.cpu.3.freq=1416

works these days, it doesn't bother me much.

In another discussion, a reference to the FDT cpu-map binding was posted:

    https://mjmwired.net/kernel/Documentation/devicetree/bindings/cpu

On my RockPro64, kern.sched.topology_spec doesn't seem to define two groups/clusters:

    <groups>
     <group level="1" cache-level="3">
      <cpu count="6" mask="3f,0,0,0">0, 1, 2, 3, 4, 5</cpu>
     </group>
    </groups>

But dmesg shows traces of affinity groups, since something like [ 0 0, 0 1, 0 2, 0 3] and [ 1 0, 1 1] is printed next to the CPU affinity. I simply don't understand how sched_ule is supposed to make use of this information. I guess it doesn't (yet).

    CPU 0: ARM Cortex-A53 r0p4 affinity: 0 0
        Cache Type = <64 byte D-cacheline,64 byte I-cacheline,VIPT ICache,64 byte ERG,64 byte CWG>
        Instruction Set Attributes 0 = <CRC32,SHA2,SHA1,AES+PMULL>
        Instruction Set Attributes 1 = <>
        Instruction Set Attributes 2 = <>
        Processor Features 0 = <GIC,AdvSIMD,FP,EL3 32,EL2 32,EL1 32,EL0 32>
        Processor Features 1 = <>
        Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,1TB PA>
        Memory Model Features 1 = <8bit VMID>
        Memory Model Features 2 = <32bit CCIDX,48bit VA>
        Debug Features 0 = <DoubleLock,2 CTX BKPTs,4 Watchpoints,6 Breakpoints,PMUv3,Debugv8>
        Debug Features 1 = <>
        Auxiliary Features 0 = <>
        Auxiliary Features 1 = <>
        AArch32 Instruction Set Attributes 5 = <CRC32,SHA2,SHA1,AES+VMULL,SEVL>
        AArch32 Media and VFP Features 0 = <FPRound,FPSqrt,FPDivide,DP VFPv3+v4,SP VFPv3+v4,AdvSIMD>
        AArch32 Media and VFP Features 1 = <SIMDFMAC,FPHP DP Conv,SIMDHP SP Conv,SIMDSP,SIMDInt,SIMDLS,FPDNaN,FPFtZ>
    CPU 1: ARM Cortex-A53 r0p4 affinity: 0 1
    CPU 2: ARM Cortex-A53 r0p4 affinity: 0 2
    CPU 3: ARM Cortex-A53 r0p4 affinity: 0 3
    CPU 4: ARM Cortex-A72 r0p2 affinity: 1 0
        Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
        Memory Model Features 0 = <TGran4,TGran64,SNSMem,BigEnd,16bit ASID,16TB PA>
    CPU 5: ARM Cortex-A72 r0p2 affinity: 1 1

The cpuset(1) policy:domain-list setting affects malloc only, as far as I understand.

Any hints for more resources (besides /usr/src) are highly appreciated!

Thanks in advance,
-harry
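
P.S. A minimal sketch of how the cpuset(1) core list could be derived from the "CPU N: ... affinity:" boot messages quoted above instead of hard-coding 4,5. It assumes the line layout shown in this post; cores_of is a hypothetical helper, not an existing FreeBSD tool, and it only groups CPUs by the model name printed on each line.

```shell
#!/bin/sh
# cores_of MODEL: read dmesg-style text on stdin and print a comma-separated
# list of the CPU ids whose "CPU N: ... affinity: ..." line mentions MODEL.
# Hypothetical helper; assumes the line layout quoted in this post.
cores_of() {
    awk -v model="$1" '
        $1 == "CPU" && /affinity:/ && index($0, model) > 0 {
            id = $2
            sub(/:$/, "", id)                       # "4:" -> "4"
            list = (list == "" ? id : list "," id)  # join ids with commas
        }
        END { print list }
    '
}

# Sample input: the per-CPU header lines from the boot messages above.
cores_of Cortex-A72 <<'EOF'
CPU 0: ARM Cortex-A53 r0p4 affinity: 0 0
CPU 1: ARM Cortex-A53 r0p4 affinity: 0 1
CPU 2: ARM Cortex-A53 r0p4 affinity: 0 2
CPU 3: ARM Cortex-A53 r0p4 affinity: 0 3
CPU 4: ARM Cortex-A72 r0p2 affinity: 1 0
CPU 5: ARM Cortex-A72 r0p2 affinity: 1 1
EOF
# prints 4,5

# Possible use on the board (not executed here, assuming the boot messages
# are still available in /var/run/dmesg.boot):
#   cpuset -c -l "$(cores_of Cortex-A72 < /var/run/dmesg.boot)" /bin/sh
```

This still only builds a mask for cpuset(1); it does not make sched_ule prefer one cluster over the other.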