Re: Periodic rant about SCHED_ULE

From: Mateusz Guzik <>
Date: Sat, 25 Mar 2023 22:09:56 UTC
On 3/25/23, Peter <> wrote:
> On Sat, Mar 25, 2023 at 01:41:16PM -0700, Mark Millard wrote:
> ! On Mar 25, 2023, at 11:58, Peter <> wrote:
> ! > !
> ! > ! At which point I get the likes of:
> ! > !
> ! > ! 17129 root          1  68    0  14192Ki    3628Ki RUN     13   0:20   3.95% gzip -9
> ! > ! 17128 root          1  20    0  58300Ki   13880Ki pipdwt  18   0:00   0.27% tar cvf - / (bsdtar)
> ! > ! 17097 root          1 133    0  13364Ki    3060Ki CPU13   13   8:05  95.93% sh -c while true; do :; done
> ! > !
> ! > ! up front.
> ! >
> ! > Ah. So? To me this doesn't look good. If both jobs are runnable, they
> ! > should each get ~50%.
> ! >
> ! > ! For reference, I also see the likes of the following from
> ! > ! "gstat -spod" (it is a root on ZFS context with PCIe Optane media):
> ! >
> ! > So we might assume that indeed both jobs are runnable, and the only
> ! > significant difference is that one does system calls while the other
> ! > doesn't.
> ! >
> ! > The point of this all is: identify the malfunction with the most
> ! > simple use case. (And for me here is a malfunction.)
> ! > And then, obviously, fix it.
> !
> ! I tried the following that still involves pipe-io but avoids
> ! file system I/O (so: simplifying even more):
> !
> ! cat /dev/random | cpuset -l 13 gzip -9 >/dev/null 2>&1
> !
> ! mixed with:
> !
> ! cpuset -l 13 sh -c "while true; do :; done" &
> !
> ! So far what I've observed is just the likes of:
> !
> ! 17736 root          1 112    0  13364Ki    3048Ki RUN     13   2:03  53.15% sh -c while true; do :; done
> ! 17735 root          1 111    0  14192Ki    3676Ki CPU13   13   2:20  46.84% gzip -9
> ! 17734 root          1  23    0  12704Ki    2364Ki pipewr  24   0:14   4.81% cat /dev/random
> !
> ! Simplifying this much seems to get a different result.
>
> Okay, then you have simplified too much and the malfunction is not
> visible anymore.
>
> ! Pipe I/O of itself does not appear to lead to the
> ! behavior you are worried about.
>
> How many bytes does /dev/random deliver in a single read() ?
>
> ! Trying cat /dev/zero instead ends up similar:
> !
> ! 17778 root          1 111    0  14192Ki    3672Ki CPU13   13   0:20  51.11% gzip -9
> ! 17777 root          1  24    0  12704Ki    2364Ki pipewr  30   0:02   5.77% cat /dev/zero
> ! 17736 root          1 112    0  13364Ki    3048Ki RUN     13   6:36  48.89% sh -c while true; do :; done
> !
> ! It seems that, compared to using tar and a file system, there
> ! is some significant difference in context that leads to the
> ! behavioral difference. It would probably be of interest to know
> ! what the distinction(s) are in order to have a clue how to
> ! interpret the results.
>
> I can tell you:
> With tar, tar can likely not output data from more than one input
> file in a single output write(). So, when reading big files, we
> get probably 16k or more per system call over the pipe. But if the
> files are significantly smaller than that (e.g. in /usr/include),
> then we get gzip doing more system calls per time unit. And that
> makes a difference, because a system call goes into the scheduler
> and reschedules the thread.
>
> This 95% vs. 5% imbalance is the actual problem that has to be
> addressed, because this is not suitable for me, I cannot wait for my
> tasks starving along at a tenth of the expected compute only because
> some number crunching does also run on the core.
>
> Now, reading from /dev/random cannot reproduce it. Reading from
> tar can reproduce it under certain conditions - and that is all that
> is needed.

So far it looks like the problem is not syscall usage per se, but
"voluntary" time spent off cpu.

Any time a thread goes off cpu in a manner other than preemption, it
adds to counters which ULE interprets to mean that the thread is "less
interactive" and that threads which stay on cpu on their own should be
preferred -- this is how the syscall-less hog comes to be considered
important.

The mechanism makes no distinction between going off cpu on purpose
(sleeps, waitpid, yield and similar) and going off cpu due to lock
contention or initiating i/o. Even then, I don't know if the mechanism
works at all.
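
A rough way to see the two kinds of off-cpu event from userland is
getrusage(2), which reports voluntary and involuntary context switches
as ru_nvcsw and ru_nivcsw. The sketch below is only an illustration:
the helper names (switch_counts, spin, nap) are made up, and it says
nothing about how ULE actually weighs these events.

```python
import resource
import time


def switch_counts(workload):
    """Run workload() and return (voluntary, involuntary) context-switch deltas."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    workload()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_nvcsw - before.ru_nvcsw,
            after.ru_nivcsw - before.ru_nivcsw)


def spin(iterations=2_000_000):
    # Pure CPU hog (hypothetical workload): it never blocks, so it only
    # leaves the CPU when the scheduler preempts it.
    n = 0
    for _ in range(iterations):
        n += 1
    return n


def nap(times=50, seconds=0.001):
    # Blocks on purpose (hypothetical workload): each sleep gives up
    # the CPU voluntarily.
    for _ in range(times):
        time.sleep(seconds)


if __name__ == "__main__":
    for name, work in (("spin", spin), ("nap", nap)):
        vol, invol = switch_counts(work)
        print(f"{name}: voluntary={vol} involuntary={invol}")
```

The exact numbers depend on the system and on what else is running,
so treat the output as a qualitative comparison between the two
workloads rather than a measurement.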

That said, it may be that just whacking all that "interactivity
scoring" will happen to resolve this bug. Not to be confused with making things

Mateusz Guzik <mjguzik>