Re: Periodic rant about SCHED_ULE

From: Stefan Esser <se_at_FreeBSD.org>
Date: Wed, 22 Mar 2023 20:15:57 UTC
On 22.03.23 at 20:23, Mateusz Guzik wrote:
> I wrote a cpu burning program (memset 1 MB in a loop, with enough
> iterations to take ~20 seconds on its own).
> 
> I booted an 8-core bhyve VM and made sure to cpuset it to 8 distinct cores.
> 
> The test runs *9* workers, here is a sample run:
> 
> 4bsd:
>         23.18 real        20.81 user         0.00 sys
>         23.26 real        20.81 user         0.00 sys
>         23.30 real        20.81 user         0.00 sys
>         23.34 real        20.82 user         0.00 sys
>         23.41 real        20.81 user         0.00 sys
>         23.41 real        20.80 user         0.00 sys
>         23.42 real        20.80 user         0.00 sys
>         23.53 real        20.81 user         0.00 sys
>         23.60 real        20.80 user         0.00 sys
> 187.31s user 0.02s system 793% cpu 23.606 total
> 
> ule:
>         20.67 real        20.04 user         0.00 sys
>         20.97 real        20.00 user         0.00 sys
>         21.45 real        20.29 user         0.00 sys
>         21.51 real        20.22 user         0.00 sys
>         22.77 real        20.04 user         0.00 sys
>         22.78 real        20.26 user         0.00 sys
>         23.42 real        20.04 user         0.00 sys
>         24.07 real        20.30 user         0.00 sys
>         24.46 real        20.16 user         0.00 sys
> 181.41s user 0.07s system 741% cpu 24.465 total
> 
> It reliably uses 187s user time on 4BSD and 181s on ULE. At the same
> time, ULE also reliably shows a massive imbalance in real time between
> the fastest and slowest workers *and* reliably needs more total real
> time to finish the entire thing.

The problem is that user time depends on the number of CPU cycles,
but real time on "when" the CPU is executing the last few cycles of
the respective process.

In a parallel program like the test program, every worker needs the
same number of cycles, but the worker with the highest real time
determines the total real time taken. With ULE, the other cores sit
idle for roughly the last 0.4 to 3.8 seconds of the run (about 2
seconds on average), while the real times with 4BSD are quite similar.
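
To make the idle tail concrete, here is a small sketch that takes the
per-worker real times quoted above and computes how long before the
slowest worker each of the others finished. The helper names
(tail_gaps, summarize) are made up for this illustration, and the gap
per worker is only a proxy for per-core idle time, since 9 workers
share 8 cores.

```python
# Per-worker "real" times copied from the quoted test run.
BSD_REAL = [23.18, 23.26, 23.30, 23.34, 23.41, 23.41, 23.42, 23.53, 23.60]
ULE_REAL = [20.67, 20.97, 21.45, 21.51, 22.77, 22.78, 23.42, 24.07, 24.46]


def tail_gaps(real_times):
    """Seconds between each worker's exit and the exit of the slowest worker."""
    ordered = sorted(real_times)
    slowest = ordered[-1]
    return [slowest - t for t in ordered[:-1]]


def summarize(name, real_times):
    """Print the smallest, largest and mean gap for one scheduler."""
    gaps = tail_gaps(real_times)
    mean = sum(gaps) / len(gaps)
    print(f"{name}: min {min(gaps):.2f}s  max {max(gaps):.2f}s  mean {mean:.2f}s")


summarize("4bsd", BSD_REAL)
summarize("ule", ULE_REAL)
```

The total real time is simply the maximum of each list, so every
second of gap is a second in which that worker's share of a core was
no longer needed while the run as a whole was still going.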

> iow this is a tradeoff, but most likely a bad one

Better balancing of the load would probably make ULE take less real
time. In the example, 9 identical tasks run on 8 cores: 7 tasks get
100% of a core each, and the other 2 share a core and get 50% each.
This could be resolved by moving a CPU-bound process from the CPU with
the highest load to a random CPU. The target should probably not be
the CPU with the lowest load, nor be limited to the same cluster or
NUMA domain, since the process would then stay within a subset of the
available cores.

Such a re-balancing could be performed at a relatively low rate (e.g.
once per second) and only if all cores are running CPU-bound tasks.
This would probably not lead to an optimal distribution of threads
to cores, but it should at least bring the real time close to 4BSD.
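
The idea can be tried out in a toy model. The sketch below is not ULE
and not kernel code. It assumes 8 run queues, 9 tasks that each need
20 s of CPU, a 0.1 s time slice, simple idle stealing, and an optional
once-per-second step that moves the oldest task from the most loaded
queue to a randomly chosen other queue. All names and parameters
(simulate, SLICES_PER_SECOND, the seed) are invented for the
illustration.

```python
import random

SLICES_PER_SECOND = 10                  # toy time slice of 0.1 s (assumption)
WORK_SLICES = 20 * SLICES_PER_SECOND    # each task needs 20 s of CPU time


def simulate(ncores=8, ntasks=9, rebalance=False, seed=0):
    """Toy run-queue model; returns (makespan_seconds, finish_time_per_task)."""
    rng = random.Random(seed)
    remaining = [WORK_SLICES] * ntasks
    # Initial placement: one task per core, surplus tasks land on the last core.
    queues = [[] for _ in range(ncores)]
    for task in range(ntasks):
        queues[min(task, ncores - 1)].append(task)
    finish = {}
    step = 0
    while len(finish) < ntasks:
        # Idle stealing: an empty core pulls a task from the busiest core.
        for q in queues:
            busiest = max(queues, key=len)
            if not q and len(busiest) > 1:
                q.append(busiest.pop())
        # Once per second, and only if every core is busy, move the oldest
        # task of the most loaded core to a random other core.
        if rebalance and step > 0 and step % SLICES_PER_SECOND == 0 and all(queues):
            src = max(range(ncores), key=lambda c: len(queues[c]))
            if len(queues[src]) > 1:
                dst = rng.choice([c for c in range(ncores) if c != src])
                queues[dst].append(queues[src].pop(0))
        # Each core runs one slice, round-robin among the tasks queued on it.
        for q in queues:
            if q:
                task = q[step % len(q)]
                remaining[task] -= 1
                if remaining[task] == 0:
                    q.remove(task)
                    finish[task] = (step + 1) / SLICES_PER_SECOND
        step += 1
    return max(finish.values()), finish


static_span, _ = simulate(rebalance=False)
rebal_span, rebal_finish = simulate(rebalance=True, seed=1)
print("no periodic rebalancing:", static_span, "s")
print("rebalance once per second:", rebal_span, "s")
```

In this model the lower bound for the total real time is 9 * 20 / 8 =
22.5 s. Without the periodic step, the two tasks that share a core
only separate once another core goes idle, so the run takes 30 s.
With the periodic step the slowdown is spread over several tasks and
the total lands somewhere between those two values, depending on the
random choices. How close a real implementation would get to 4BSD is
something only a measurement on an actual kernel can answer.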