Re: Periodic rant about SCHED_ULE
- Reply: Mateusz Guzik : "Re: Periodic rant about SCHED_ULE"
- In reply to: Tomoaki AOKI : "Re: Periodic rant about SCHED_ULE"
- Go to: [ bottom of page ] [ top of archives ] [ this month ]
Date: Sun, 30 Apr 2023 23:35:07 UTC
On 3/31/23, Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote: > On Mon, 27 Mar 2023 16:47:04 +0200 > Mateusz Guzik <mjguzik@gmail.com> wrote: > >> On 3/27/23, Mateusz Guzik <mjguzik@gmail.com> wrote: >> > On 3/25/23, Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> On 3/23/23, Mateusz Guzik <mjguzik@gmail.com> wrote: >> >>> On 3/22/23, Mateusz Guzik <mjguzik@gmail.com> wrote: >> >>>> On 3/22/23, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote: >> >>>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree wrote: >> >>>>>> > > (snip) > >> > >> > I repeat the setup: 8 cores, 8 processes doing cpu-bound stuff while >> > niced to 20 vs make -j buildkernel >> > >> > I had a little more look here, slapped in some hacks as a POC and got >> > an improvement from 67 minutes above to 21. >> > >> > Hacks are: >> > 1. limit hog timeslice to 1 tick so that is more eager to bail >> > 2. always preempt if pri < cpri >> > >> > So far I can confidently state the general problem: ULE penalizes >> > non-cpu hogs for blocking (even if it is not their fault, so to speak) >> > and that bumps their prio past preemption threshold, at which point >> > they can't preempt said hogs (despite hogs having a higher priority). >> > At the same time hogs use their full time slices, while non-hogs get >> > off cpu very early and have to wait a long time to get back on, at >> > least in part due to inability to preempt said hogs. >> > >> > As I mentioned elsewhere in the thread, interactivity scoring takes >> > "voluntary off cpu time" into account. As literally anything but >> > getting preempted counts as "voluntary sleep", workers get shafted for >> > going off cpu while waiting on locks in the kernel. >> > >> > If I/O needs to happen and the thread waits for the result, it most >> > likely does it early in its timeslice and once it's all ready it waits >> > for background hogs to get off cpu -- it can't preempt them. >> > >> > All that said: >> > 1. "interactivity scoring" (see sched_interact_score) >> > >> > I don't know if it makes any sense to begin with. Even if it does, it >> > counts stuff it should not by not differentiating between deliberately >> > going off cpu (e.g., actual sleep) vs just waiting for a file being >> > read. Imagine firefox reading a file from disk and being considered >> > less interactive for it. >> > >> > I don't have a solution for this problem. I *suspect* the way to go >> > would be to explicitly mark xorg/wayland/whatever as "interactive" and >> > have it inherited by its offspring. At the same time it should not >> > follow to stuff spawned in terminals. Not claiming this is perfect, >> > but it does eliminate the guessing game. >> > >> > Even so, 4BSD does not have any mechanism of the sort and reportedly >> > remains usable on a desktop just by providing some degree of fairness. >> > >> > Given that, I suspect the short term solution would whack said scoring >> > altogether and focus on fairness (see below). >> > >> > 2. fairness >> > >> > As explained above doing any offcpu-time inducing work instantly >> > shafts threads versus cpu hogs, even if said hogs are niced way above >> > them. >> > >> > Here I *suspect* position to add in the runqueue should be related to >> > how much slice was left when the thread went off cpu, while making >> > sure that hogs get to run eventually. Not that I have a nice way of >> > implementing this -- maybe a separate queue for known hogs and picking >> > them every n turns or similar. >> > >> >> Aight, now that I had a sober look at the code I think I cracked the >> case. >> >> The runq mechanism used by both 4BSD and ULE provides 64(!) queues, >> where the priority is divided by said number and that's how you know >> in which queue to land the thread. >> >> When deciding what to run, 4BSD uses runq_choose which iterates all >> queues from the beginning. This means threads of lower priority keep >> executing before the rest. In particular cpu hog lands with a high >> priority, looking worse than make -j 8 buildkernel and only running >> when there is nothing else ready to get the cpu. While this may sound >> decent, it is bad -- in principle a steady stream of lower priority >> threads can starve the hogs indefinitely. >> >> The problem was recognized when writing ULE, but improperly fixed -- >> it ends up distributing all threads within given priority range across >> the queues and then performing a lookup in a given queue. Here the >> problem is that while technically everyone does get a chance to run, >> the threads not using full slices are hosed for the time period as >> they wait for the hog *a lot*. >> >> A hack patch to induce the bogus-but-better 4BSD behavior of draining >> all runqs before running higher prio threads drops down build time to >> ~9 minutes, which is shorter than 4BSD. >> >> However, the right fix would achieve that *without* introducing >> starvation potential. >> >> I also note the runqs are a massive waste of memory and computing >> power. I'm going to have to sleep on what to do here. >> >> For interested here is the hackery: >> https://people.freebsd.org/~mjg/.junk/ule-poc-hacks-dont-use.diff >> >> sysctl kern.sched.slice_nice=0 >> sysctl kern.sched.preempt_thresh=400 # arbitrary number higher than any >> prio >> >> -- >> Mateusz Guzik <mjguzik gmail.com> > > Thanks for the patch. > Applied on top of main, amd64 at commit > 9d33a9d96f5a2cd88d0955b5b56ef5058b1706c1, setup 2 sysctls as you > mentioned and tested as below > > *Play flac files by multimedia/audacious via audio/virtual_oss > *Running www/firefox (not touched while testing) > *Forcibly build lang/rust > *Play games/aisleriot > > at the same time. > games/aisleriot runs slower than the situation lang/rust is not in > build, but didn't "freeze" and audacious normally played next music on > playlist, even on lang/rust is building codes written in rust. > > This is GREAT advance! > Without the patch, compiling rust codes eats up almost 100% of ALL > cores, and games/aisleriot often FREEZES SEVERAL MINUTES, and > multimedia/audacious needs to wait for, at worst, next music for FEW > MINUTES. (Once playback starts, the music is played normally until it > ends.) > > But unfortunately, the patch cannot be applied to stable/13, as some > prerequisite commits are not MFC'ed. > Missing commits are at least as below. There should be more, as I > gave up further tracking and haven't actually merged them to test. > > commit 954cffe95de1b9d70ed804daa45b7921f0f5c9da [1] > ule: Simplistic time-sharing for interrupt threads. > > commit fea89a2804ad89f5342268a8546a3f9b515b5e6c [2] > Add sched_ithread_prio to set the base priority of an interrupt > thread. > > commit 85b46073242d4666e1c9037d52220422449f9584 [3] > Deduplicate bus_dma bounce code. > > > [1] > https://cgit.freebsd.org/src/commit/?id=954cffe95de1b9d70ed804daa45b7921f0f5c9da > > [2] > https://cgit.freebsd.org/src/commit/?id=fea89a2804ad89f5342268a8546a3f9b515b5e6c > > [3] > https://cgit.freebsd.org/src/commit/?id=85b46073242d4666e1c9037d52220422449f9584 > > -- > Tomoaki AOKI <junchoon@dec.sakura.ne.jp> > > Hello everyone. I sorted out a patch I consider comittable for the time being. IT IS NOT A PANACEA by any means, but it does sort out the most acute problem and should be a win for most people. It also comes with a knob to turn it off. That said, can you test this please: https://people.freebsd.org/~mjg/ule_pickshort.diff works against fresh main. if you are worried about recent zfs woes, just make sure you don't zpool upgrade and will be fine. -- Mateusz Guzik <mjguzik gmail.com>