Re: Cores of different performance vs. time spent creating threads: Windows Dev Kit 2023 example [Oddity is back!]
Date: Tue, 16 May 2023 09:03:50 UTC
On May 15, 2023, at 12:14, Mark Millard <marklmi@yahoo.com> wrote:

> On May 9, 2023, at 19:19, Mark Millard <marklmi@yahoo.com> wrote:
>
>> First some context that reaches an oddity that seems to
>> be involved in the time to create threads . . .
>>
>> The Windows Dev Kit 2023 (WDK23 abbreviation here) boot reports:
>>
>> CPUs (cores) 0..3: cortex-a78c (the slower cores)
>> CPUs (cores) 4..7: cortex-x1c (the faster cores)
>>
>> Building a kernel with explicit -mcpu= use gets the
>> following oddity relative to cpu numbering when the
>> kernel is used:
>>
>> -mcpu=cortex-x1c or -mcpu=cortex-a78c:
>> Benchmarking tracks that number/performance pairing.
>>
>> -mcpu=cortex-a72:
>> The slower vs. faster cores get swapped number blocks.
>>
>> So, for -mcpu=cortex-a72, 0..3 are the faster cores.
>>
>> This sets up for the following . . .
>>
>> But I also observe (a relative comparison of contexts
>> via some benchmark-like activity):
>>
>> -mcpu=cortex-x1c or -mcpu=cortex-a78c based kernel:
>> threads take more time to create
>>
>> -mcpu=cortex-a72 based kernel:
>> threads take less time to create
>>
>> The difference is not trivial for the activity involved
>> in this WDK23 context.
>>
>> If there is a bias as to which core(s) are involved in
>> part of thread creation generally, it would appear to be
>> important that the bias be toward the more performant
>> cores (for what the activity involves). The above suggests
>> that such is not necessarily the case for FreeBSD as is.
>> big.LITTLE (and analogous designs?) cause this to become
>> more relevant.
>>
>> Does this hypothesis about what type of thing is going on
>> fit with how FreeBSD actually works?
>>
>> As it stands, I'm going to experiment with the WDK23 using
>> a cortex-a72 targeted kernel but a cortex-x1c/cortex-a78c
>> targeted world for my general operation of the WDK23.
>>
>>
>> Note: While the benchmark results allow seeing in plots
>> what traces back to thread creation time contributions,
>> the benchmark itself does not directly measure that time.
>> It is more like: the average work rate for a time changes
>> based on the fraction of the time involved in the thread
>> creations for each given problem size. The actual definition
>> of work here involves a mathematical quantity for a
>> mathematical problem (one that need not be limited to
>> computers doing the work).
>>
>> The benchmark results are more useful for discovering that
>> there is something to potentially investigate than for
>> actually doing an investigation.
>
> Never mind: I was wrong about that . . . it's back. (See
> later below.)
> Starting over did not reproduce the oddity. So:
> operator oddity/error, though I've no clue of how
> to reproduce the odd swap of which cpu number ranges
> took more vs. less time for each given size problem.
> (Or any other aspect that might be considered also
> odd, such as specific performance figures.)
>
> Retry details:
>
> I booted the WDK23 via UFS media set up for
> cortex-a72, media that I use for UFS activities on
> the HoneyComb (for example). I built the benchmark
> and ran it.
>
> As it stands, I've only done the "cpu lock down" case.
> It produces less messy data by avoiding cpu
> migration once the lockdown completes (singleton
> cpuset for the thread). I'll also run the variant
> that does not have the cpu lockdowns (standard
> C++ code without FreeBSD specifics added).
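[Editor's illustration: a minimal sketch of the kind of per-thread
"cpu lock down" described above, i.e., each thread restricting
itself to a singleton cpuset before doing its work. It uses
FreeBSD's cpuset_setaffinity(2) from otherwise standard C++
thread code. The 8-cpu loop, the placeholder workload, and the
construction-time printout are illustrative assumptions here,
not the actual benchmark:]

// Sketch only: pin each std::thread to one cpu (singleton
// cpuset) so that no cpu migration can occur once the
// lockdown completes. Not the benchmark discussed in this
// thread.
#include <sys/param.h>   // FreeBSD: include before <sys/cpuset.h>
#include <sys/cpuset.h>  // cpuset_t, CPU_ZERO, CPU_SET, cpuset_setaffinity
#include <chrono>
#include <cstdio>
#include <thread>

static void lock_to_cpu(int cpu)
{
    cpuset_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    // CPU_WHICH_TID with id -1 applies the mask to the
    // calling thread.
    cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                       sizeof(mask), &mask);
}

int main()
{
    for (int cpu = 0; cpu < 8; cpu++) {        // WDK23: cpus 0..7
        auto t0 = std::chrono::steady_clock::now();
        std::thread worker([cpu] {
            lock_to_cpu(cpu);   // no migration after this point
            // . . . per-thread benchmark work would go here . . .
        });
        auto t1 = std::chrono::steady_clock::now();
        // Rough proxy for thread creation cost: the time for the
        // std::thread constructor to return. (The benchmark above
        // does not measure this directly; see the quoted note.)
        std::printf("cpu %d: construction took %lld ns\n", cpu,
                    (long long)std::chrono::duration_cast<
                        std::chrono::nanoseconds>(t1 - t0).count());
        worker.join();
    }
    return 0;
}

[The "standard C++ code without FreeBSD specifics" variant would be
the same code with the lock_to_cpu() call omitted, leaving the
scheduler free to migrate the threads. On FreeBSD the sketch builds
with, e.g., c++ -O2 sketch.cpp -lpthread .]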
I got the swapped number blocks vs. performance again, but
not for cortex-a72 tailored FreeBSD: this time it was for
cortex-x1c/cortex-a78c +nolse tailored FreeBSD. Not rebooting
for now, the oddity exists for the benchmark built with each
of:

clang 16 plus libc++
g++ 13 plus libc++
g++ 13 plus libstdc++

As before, for STATE top shows the CPU<n> names matching the
cpuset based cpu ids (bit numbering) that the benchmark sets
up. As before, the measured performance for "faster" is also
higher than normal.

As a cross check, avoiding use of my benchmark program . . .

# cpuset -l0-3 openssl speed
Doing mdc2 for 3s on 16 size blocks: 1705580 mdc2's in 3.10s
. . .

vs.

# cpuset -l4-7 openssl speed
Doing mdc2 for 3s on 16 size blocks: 1079870 mdc2's in 3.03s
. . .

So, openssl speed also shows the oddity: 0-3 usage being
faster than 4-7 usage. The 1705580 is also somewhat large
compared to a normal "4-7 is faster" context:
1705580/3.10 approx= 550187/sec. Compare to the similar
calculation results below.

For example, after shutting down, powering off, powering on,
booting, and doing the same kind of openssl speed examples:

# cpuset -l0-3 openssl speed
Doing mdc2 for 3s on 16 size blocks: 997679 mdc2's in 3.09s
. . .

# cpuset -l4-7 openssl speed
Doing mdc2 for 3s on 16 size blocks: 1360400 mdc2's in 3.02s
. . .

# cpuset -l0-3 openssl speed
Doing mdc2 for 3s on 16 size blocks: 967253 mdc2's in 3.00s
. . .

# cpuset -l4-7 openssl speed
Doing mdc2 for 3s on 16 size blocks: 1406978 mdc2's in 3.08s
. . .

So (two calculations similar to the one above):

about 550187/sec vs. about 450463/sec and 456811/sec

That is about 1.2 times faster (e.g., 550187/456811 approx= 1.20).

I've no clue about the cause or what stage(s) lead to the odd
context happening.

===
Mark Millard
marklmi at yahoo.com