Re: Periodic rant about SCHED_ULE
- Reply: David Chisnall : "Re: Periodic rant about SCHED_ULE"
- In reply to: Mateusz Guzik : "Re: Periodic rant about SCHED_ULE"
Date: Wed, 22 Mar 2023 20:15:57 UTC
On 22.03.23 at 20:23, Mateusz Guzik wrote:
> I wrote a cpu burning program (memset 1 MB in a loop, with enough
> iterations to take ~20 seconds on its own).
>
> I booted an 8 core bhyve vm, where I made sure to cpuset it to 8
> distinct cores.
>
> The test runs *9* workers, here is a sample run:
>
> 4bsd:
>        23.18 real        20.81 user         0.00 sys
>        23.26 real        20.81 user         0.00 sys
>        23.30 real        20.81 user         0.00 sys
>        23.34 real        20.82 user         0.00 sys
>        23.41 real        20.81 user         0.00 sys
>        23.41 real        20.80 user         0.00 sys
>        23.42 real        20.80 user         0.00 sys
>        23.53 real        20.81 user         0.00 sys
>        23.60 real        20.80 user         0.00 sys
> 187.31s user 0.02s system 793% cpu 23.606 total
>
> ule:
>        20.67 real        20.04 user         0.00 sys
>        20.97 real        20.00 user         0.00 sys
>        21.45 real        20.29 user         0.00 sys
>        21.51 real        20.22 user         0.00 sys
>        22.77 real        20.04 user         0.00 sys
>        22.78 real        20.26 user         0.00 sys
>        23.42 real        20.04 user         0.00 sys
>        24.07 real        20.30 user         0.00 sys
>        24.46 real        20.16 user         0.00 sys
> 181.41s user 0.07s system 741% cpu 24.465 total
>
> It reliably uses 187s user time on 4BSD and 181s on ULE. At the same
> time it also reliably has massive time imbalance between
> fastest/slowest in terms of total real time between workers *and* ULE
> reliably uses more total real time to finish the entire thing.

The problem is that user time depends on the number of CPU cycles
consumed, whereas real time depends on *when* the CPU executes the last
few cycles of the respective process.

In a parallel program like this test, each thread needs the same number
of cycles, but the thread with the highest real time determines the
total real time taken. With ULE, the cores whose workers have already
finished sit idle for the last 0.4 to 3.8 seconds (about 2 seconds on
average), while the real times with 4BSD are quite similar.

> iow this is a tradeoff, but most likely a bad one

Better balancing of the load would probably make ULE take less real time.
The example of 9 identical tasks on 8 cores, with 7 tasks each getting 100% of a core and the other 2 sharing a core at 50% each, could be resolved by moving a CPU-bound process from the CPU with the highest load to a random CPU. The target should probably not be the CPU with the lowest load, nor be limited to the same cluster or NUMA domain, since the process would then stay within a subset of the available cores. Such re-balancing could be performed at a relatively low rate (e.g. once per second) and only if all cores are running CPU-bound tasks. This would probably not lead to an optimal distribution of threads to cores, but it should at least lead to a real time similar to that of 4BSD.
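
To make the proposal concrete, here is a toy simulation in Python. It is only a sketch and has nothing to do with the actual ULE code. Everything in it is an assumption made for illustration: the 1 ms tick, equal sharing of a core among its tasks, the "idle pull" step (an idle CPU takes one task from the busiest CPU), the random choice of which task to migrate, and all names (`simulate`, `rebalance_ms`, and so on).

```python
import random

NCPU = 8
NTASK = 9
WORK_MS = 20_000   # CPU time each worker needs (about 20 s, as in the test)
TICK_MS = 1        # simulation step (assumed)


def simulate(rebalance_ms=0, seed=1):
    """Return the finish time of each task in ms.

    rebalance_ms == 0 disables the periodic migration step.
    """
    rng = random.Random(seed)
    cpu_of = [i % NCPU for i in range(NTASK)]   # tasks 0 and 8 start on CPU 0
    remaining = [float(WORK_MS)] * NTASK
    finish = [None] * NTASK
    now = 0
    while any(f is None for f in finish):
        runnable = [i for i in range(NTASK) if finish[i] is None]
        on_cpu = [[i for i in runnable if cpu_of[i] == c] for c in range(NCPU)]

        # Periodic re-balancing at a low rate, only while every CPU is busy:
        # move one task from the busiest CPU to a random other CPU.
        if rebalance_ms and now > 0 and now % rebalance_ms == 0 and all(on_cpu):
            src = max(range(NCPU), key=lambda c: len(on_cpu[c]))
            if len(on_cpu[src]) > 1:
                task = rng.choice(on_cpu[src])
                dst = rng.choice([c for c in range(NCPU) if c != src])
                on_cpu[src].remove(task)
                on_cpu[dst].append(task)
                cpu_of[task] = dst

        # Idle pull (assumed): an idle CPU takes one task from the busiest CPU.
        for c in range(NCPU):
            if not on_cpu[c]:
                src = max(range(NCPU), key=lambda k: len(on_cpu[k]))
                if len(on_cpu[src]) > 1:
                    task = on_cpu[src].pop()
                    on_cpu[c].append(task)
                    cpu_of[task] = c

        # Run one tick: tasks on the same CPU share it equally.
        now += TICK_MS
        for tasks in on_cpu:
            for i in tasks:
                remaining[i] -= TICK_MS / len(tasks)
                if remaining[i] <= 0:
                    finish[i] = now
    return finish


if __name__ == "__main__":
    for label, period in (("no periodic migration", 0),
                          ("migrate every 1 s", 1000)):
        finish = simulate(period)
        print(f"{label}: fastest {min(finish) / 1000:.2f} s, "
              f"slowest {max(finish) / 1000:.2f} s")
```

Without periodic migration, this model has seven workers finish at 20 s and the two that shared a core finish at 30 s. The lower bound for any schedule is 9 * 20 s / 8 = 22.5 s. With migration enabled, the slowdown is spread over changing pairs of workers, so the finish times should move closer together, but the result depends on the random choices and says nothing about how a real scheduler would behave.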